Chapter 11. Beyond Conventional Data Types

PostgreSQL can handle rich data types thanks to its powerful extensions. Data that does not fit the relational model inherently, such as semistructured data, can be stored and manipulated using either out-of-the-box data types or extensions. The PostgreSQL community focuses not only on enhancing relational database features, but also on supporting rich data types, such as arrays, XML, hash stores, and JSON documents. This shift in focus is a result of embracing changes in the software development life cycle, such as agile development methods, and of supporting unknown and rapidly changing software requirements.

Nonconventional data types allow PostgreSQL to store geographical, binary, and schema-less data, such as JSON documents and hash stores. PostgreSQL supports several of these data types out of the box, including JSON, JSONB, XML, array, bytea, and large objects (BLOBs). More data types are available via extensions such as hstore and PostGIS.

JSON, JSONB, and hstore allow PostgreSQL to handle schema-less models, which in turn allow developers to make real-time adjustments to data in an automatic and transparent way. Using JSON, JSONB, and hstore allows developers to change the data structure without altering the table structure with the ALTER command. It also allows them to have a flexible data model without resorting to the entity-attribute-value (EAV) model, which is difficult to handle in the relational model. However, developers should handle data integrity in the business logic to ensure that the data is clean and error-free.
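
As a quick sketch of this flexibility (the product table and its document keys are made up for illustration), rows in the same JSONB column can carry different attributes without any ALTER TABLE:

CREATE TABLE product (
  product_id SERIAL PRIMARY KEY,
  attributes JSONB
);
--- two rows with different attribute sets, no schema change required
INSERT INTO product (attributes) VALUES ('{"color": "red", "size": "M"}');
INSERT INTO product (attributes) VALUES ('{"weight_kg": 2.5, "material": "wood"}');
--- the ->> operator extracts a JSON field as text
SELECT attributes ->> 'color' AS color FROM product;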

In this chapter, the array, hstore, XML, and JSON data types will be introduced. It is also worth having a look at the PostGIS extension, which supports raster and vector formats and provides a very rich set of functions to manipulate spatial data.

PostgreSQL arrays

PostgreSQL supports multidimensional arrays; the element type can be a base, enum, or composite type. All elements of an array must have the same data type. PostgreSQL arrays allow duplicate values as well as null values. The following example shows how to initialize a one-dimensional array and access its first element:

SELECT ('{red, green, blue}'::text[])[1] AS red;
red
-----
 red
(1 row)

The array length is not bound to a certain value by default, but a length can be specified when using arrays to define a relation. Also, as shown in the preceding example, an array index starts from one by default; this behavior can be changed by specifying the subscript range when initializing the array, as follows:

car_portal=# SELECT '[0:1]={1,2}'::INT[];
    int4
-------------
 [0:1]={1,2}
(1 row)

car_portal=# SELECT ('[0:1]={1,2}'::INT[])[0];
 int4
------
    1
(1 row)

car_portal=# SELECT ('[0:1]={1,2}'::INT[])[1];
 int4
------
    2
(1 row)
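
As mentioned above, an array size can also be declared when defining a relation; however, PostgreSQL treats the declared size as documentation only and does not enforce it. The following is a minimal sketch (the color_palette table is made up for illustration):

CREATE TABLE color_palette (
  palette_id SERIAL PRIMARY KEY,
  colors text[3]
);
--- the declared size of three elements is not enforced; this statement succeeds
INSERT INTO color_palette (colors) VALUES ('{red,green,blue,black}');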

Arrays can be initialized using the {} literal syntax, as shown in the preceding examples. Another way to initialize an array is to use the ARRAY constructor, as follows:

SELECT array['red','green','blue'] AS primary_colors;

PostgreSQL provides many functions to manipulate arrays, such as array_remove to remove all occurrences of a certain element. The following example shows some of the array functions:

SELECT  
  array_ndims(two_dim_array) AS "Number of dimensions",
  array_dims(two_dim_array) AS "Dimensions index range",
  array_length(two_dim_array, 1) AS "The array length of 1st dimension",  
  cardinality(two_dim_array) AS "Number of elements",
  two_dim_array[1][1] AS "The first element"
FROM
  (VALUES ('{{red,green,blue}, {red,green,blue}}'::text[][])) AS foo(two_dim_array);

-[ RECORD 1 ]---------------------+-----------
Number of dimensions              | 2
Dimensions index range            | [1:2][1:3]
The array length of 1st dimension | 2
Number of elements                | 6
The first element                 | red

A very common use case of arrays is to model multivalued attributes. For example, a dress can have more than one color, and a newspaper article can have several tags. Another use case is to model a hash store. This is achieved by having two arrays, one with the keys and another with the values, where the array index is used to associate each key with its value. For example, pg_stats uses this approach to store information about the common values histogram. The most_common_vals and most_common_freqs columns list the most common values in a certain column and the frequencies of these most common values, respectively, as shown in the following example:

car_portal=# SELECT * FROM pg_stats WHERE array_length(most_common_vals,1) < 10 AND schemaname NOT IN ('pg_catalog','information_schema') LIMIT 1;
-[ RECORD 1 ]----------+----------------------------------------------------------------------------------------
schemaname             | public
tablename              | duplicate
attname                | f2
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -0.263158
most_common_vals       | {4,3,9,8,2,5,6,7,10}
most_common_freqs      | {0.184211,0.157895,0.131579,0.105263,0.0789474,0.0789474,0.0789474,0.0789474,0.0789474}
histogram_bounds       |
correlation            | -0.0405953
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
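
Returning to the multivalued attribute use case mentioned above, the following is a minimal sketch (the article table and column names are made up for illustration) that stores several tags per article in a single array column:

CREATE TABLE article (
  article_id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  tags TEXT[]
);
INSERT INTO article (title, tags) VALUES ('PostgreSQL arrays', '{database, postgresql, arrays}');
--- find articles having a certain tag; && tests whether two arrays overlap
SELECT title FROM article WHERE tags && '{postgresql}'::TEXT[];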

Another use case of arrays is to store scientific information. Note that there are dedicated databases designed specifically to store and manipulate arrays, such as SciDB.

Also, arrays can be used to facilitate coding and to perform some SQL tricks, such as passing several arguments to a function using the VARIADIC array option or performing loops using the generate_series function. This allows developers to perform complex tasks without using the PL/pgSQL language. For example, let's assume that we want exactly one column out of several columns to be not null. This, in reality, can be used to check for disjoint attributes or to model inheritance. For example, let's assume we have a table called vehicle that contains a vehicle's common attributes, and that we have several types of vehicles, such as trucks, cars, sports cars, and so on. One could model this by having several columns in the vehicle table referencing the car, truck, and sports car tables. To understand how one can use the VARIADIC option, let's model the vehicle inheritance example, as follows:

CREATE OR REPLACE FUNCTION null_count (VARIADIC arr int[]) RETURNS INT AS
$$
  --- despite its name, the function counts the array elements that are not null
  SELECT count(CASE WHEN n IS NOT NULL THEN 1 ELSE NULL END)::int FROM unnest($1) AS m(n);
$$ LANGUAGE SQL;

To use the preceding function, one needs to add a check to the table, as follows:

CREATE TABLE car (
  car_id SERIAL PRIMARY KEY,
  car_number_of_doors INT DEFAULT 5
);

CREATE TABLE bus (
  bus_id SERIAL PRIMARY KEY,
  bus_number_of_passengers INT DEFAULT 50
);

CREATE TABLE vehicle (
  vehicle_id SERIAL PRIMARY KEY,
  registration_number TEXT,
  car_id INT REFERENCES car(car_id),
  bus_id INT REFERENCES bus(bus_id),
  CHECK (null_count(car_id, bus_id) = 1)
);

To test the function, let's execute some insert commands, as follows:

postgres=# INSERT INTO CAR VALUES (1, 5);
INSERT 0 1
postgres=# INSERT INTO BUS VALUES (1, 25);
INSERT 0 1
postgres=#
postgres=# INSERT INTO vehicle VALUES (default, 'a234', null, null);
ERROR:  new row for relation "vehicle" violates check constraint "vehicle_check"
DETAIL:  Failing row contains (5, a234, null, null).
postgres=# INSERT INTO vehicle VALUES (default, 'a234', 1, 1);
ERROR:  new row for relation "vehicle" violates check constraint "vehicle_check"
DETAIL:  Failing row contains (6, a234, 1, 1).
postgres=# INSERT INTO vehicle VALUES (default, 'a234', null, 1);
INSERT 0 1
postgres=# INSERT INTO vehicle VALUES (default, 'a234', 1, null);
INSERT 0 1

Note that to call the null_count function with an array instead of a list of arguments, we need to add the VARIADIC keyword to the function's argument, as follows:

postgres=# SELECT * FROM null_count(VARIADIC ARRAY [null, 1]);
 null_count
------------
          1
(1 row)
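
Without the VARIADIC keyword, the function is simply called with a comma-separated list of arguments; a quick sketch of the expected result:

postgres=# SELECT null_count(null, 1, 2);
 null_count
------------
          2
(1 row)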

Another trick is to generate all the prefixes (substrings) of a text value; this comes in handy when one would like to get the longest prefix match. Longest prefix matching is very important in areas such as telecommunications, as some mobile operators or telephone companies rent ranges of numbers to other operators, and the longest matching prefix is used to determine the network. The following example shows how we can achieve this:

CREATE TABLE prefix (
  network TEXT,
  prefix_code TEXT NOT NULL
);
CREATE INDEX ON prefix(prefix_code);

INSERT INTO prefix VALUES ('Palestine Jawwal', 97059), ('Palestine Jawwal',970599), ('Palestine watania',970597);

CREATE OR REPLACE FUNCTION prefixes(TEXT) RETURNS TEXT[] AS
$$
  SELECT ARRAY(SELECT substring($1,1,i) FROM generate_series(1, length($1)) g(i))::TEXT[];
$$ LANGUAGE SQL IMMUTABLE;

The index on the prefix_code column allows an index scan on the prefix table. The prefixes function returns an array containing all the prefix substrings of its argument. To test whether longest prefix matching works, let's get the longest prefix for the number 97059973456789 using the following code:

postgres=# SELECT * FROM prefix WHERE prefix_code = any (prefixes('97059973456789')) ORDER BY length(prefix_code) DESC limit 1;
     network      | prefix_code
------------------+-------------
 Palestine Jawwal | 970599
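
To verify that the index on prefix_code is used, one can inspect the execution plan; this is only a sketch, as the actual plan depends on the table size, statistics, and planner settings:

EXPLAIN SELECT * FROM prefix WHERE prefix_code = ANY (prefixes('97059973456789')) ORDER BY length(prefix_code) DESC LIMIT 1;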

One can use arrays for several purposes; for example, prior to the introduction of window functions in PostgreSQL version 8.4, arrays and the generate_series function were used to mimic the row_number function.
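
A minimal sketch of the idea, written using current syntax and made-up values, numbers array elements by joining them against generate_series:

--- each generated index i selects the i-th element, mimicking row_number
SELECT i AS row_number, (ARRAY['a','b','c'])[i] AS val
FROM generate_series(1, 3) AS s(i);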

Common functions of arrays and their operators

Array operators are similar to the operators of other data types. For example, the = sign is used for equality comparison, and the || operator is used for concatenation. Also, in the previous chapters, we saw some of these operators, such as &&, which returns true if the arrays overlap. Finally, the @> and <@ operators test whether an array contains or is contained by another array, respectively.
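
The following quick sketch shows these operators side by side:

SELECT ARRAY[1,2] = ARRAY[1,2]    AS equals,        --- true
       ARRAY[1,2] || ARRAY[3]     AS concatenated,  --- {1,2,3}
       ARRAY[1,2,3] && ARRAY[3,4] AS overlaps,      --- true
       ARRAY[1,2,3] @> ARRAY[2]   AS contains,      --- true
       ARRAY[2] <@ ARRAY[1,2,3]   AS contained_by;  --- true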

The unnest function is used to return a set of elements from an array. This is quite useful when one would like to use set operations on arrays, such as distinct, order by, intersect, union, and so on. The following example removes duplicates and sorts the array elements in ascending order:

car_portal=# SELECT array(SELECT DISTINCT unnest (array [1,1,1,2,3,3]) ORDER BY 1);
  array
---------
 {1,2,3}
(1 row)

In the preceding example, the result of the unnest function is sorted and duplicates are removed using ORDER BY and DISTINCT, respectively. The array() constructor is used to build an array from a set, as follows:

car_portal=# SELECT array (VALUES (1),(2));
 array
-------
 {1,2}
(1 row)

The ANY construct applied to an array is similar to the SQL IN () construct and is used to test membership, as shown in the following example:

car_portal=# SELECT 1 in (1,2,3), 1 = ANY ('{1,2,3}'::INT[]);
 ?column? | ?column?
----------+----------
 t        | t
(1 row)

Modifying and accessing arrays

An array element can be accessed via an index; if the array does not contain an element for this index, the NULL value is returned, as shown in the following example:

CREATE TABLE car(
  id SERIAL PRIMARY KEY,
  color text []
);

INSERT INTO car(color) VALUES ('{red, green}'::text[]);
INSERT INTO car(color) VALUES ('{red}'::text[]);
car_portal=# SELECT color[3] IS NOT DISTINCT FROM null FROM car;
 ?column?
----------
 t
 t
(2 rows)

Also, an array can be sliced by providing a lower and upper bound, as follows:

car_portal=# SELECT color[1:2] FROM car;
    color
-------------
 {red,green}
 {red}
(2 rows)

When updating an array, one could completely replace the array, amend a slice, amend a single element, or append to the array using the || concatenation operator, as follows:

--- append array using concatenation
car_portal=# SELECT ARRAY ['red', 'green'] || '{blue}'::text[] AS concat_array;
   concat_array
------------------
 {red,green,blue}
(1 row)
--- update slice
car_portal=# UPDATE car set color[1:2] = '{black, white}';
UPDATE 2
car_portal=# SELECT * FROM car;
 id |     color
----+---------------
  4 | {black,white}
  3 | {black,white}
(2 rows)
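
Amending a single element works in the same way, using an element subscript instead of a slice; a quick sketch against the same table:

--- update a single element
UPDATE car SET color[2] = 'yellow' WHERE id = 3;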

The array_remove function can be used to remove all the elements that are equal to a certain value, as follows:

car_portal=# SELECT array_remove ('{Hello, Hello, World}'::TEXT[], 'Hello');
 array_remove
--------------
 {World}
(1 row)

To remove an element at a certain position, one can use the WITH ORDINALITY clause. So, let's assume that we want to remove the first element of an array; this can be achieved as follows:

car_portal=# SELECT ARRAY(SELECT unnest FROM unnest ('{Hello, Hello, World}'::TEXT[]) WITH ordinality WHERE ordinality <> 1);
     array
---------------
 {Hello,World}
(1 row)

Indexing arrays in PostgreSQL

The GIN index can be used to index arrays; the standard PostgreSQL distribution includes a GIN operator class for one-dimensional arrays. The GIN index supports the following operators: contains (@>), is contained by (<@), overlaps (&&), and equals (=). Take a look at the following code:

CREATE INDEX ON car USING GIN (color);
--- to force index, sequential scan is disabled
SET enable_seqscan TO off;

EXPLAIN SELECT * FROM car WHERE '{red}'::text[] && color;
--- Result
Bitmap Heap Scan on car  (cost=8.00..12.01 rows=1 width=36)
  Recheck Cond: ('{red}'::text[] && color)
  ->  Bitmap Index Scan on car_color_idx  (cost=0.00..8.00 rows=1 width=0)
        Index Cond: ('{red}'::text[] && color)
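
Since sequential scans were disabled above only to force the use of the index, they should be re-enabled in the session after the experiment:

SET enable_seqscan TO on;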