Extract, transform, and load

Let's consider the task of loading HTTP logs into the database and preparing them for analysis. Such tasks in data warehouse solutions are called extract, transform, and load (ETL).

Suppose the HTTP server that runs the car portal application writes access logs to files that are recreated every day. Assuming that the server runs nginx, a popular HTTP server, lines in such log files by default look similar to this:

94.31.90.168 - - [01/Jul/2017:01:00:22 +0000] "GET / HTTP/1.1" 200 227 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"

It has the following fields, separated by spaces: remote address, remote user, timestamp, request (access method and resource URL), status code, body size in bytes, HTTP referer, and HTTP user agent.

The setup and configuration of an HTTP server are out of the scope of this book. However, it is worth mentioning that to make the logs easier to load into a PostgreSQL database, it makes sense to change the log file format. To make it produce semicolon-separated values (a CSV-like format), one should change the log_format directive in the configuration file, nginx.conf, so that it looks like this:

log_format main '$time_local;$remote_addr;$remote_user;'
                '"$request";$status;$body_bytes_sent;'
                '"$http_referer";"$http_user_agent"';

Now the logs are produced as CSV files containing lines like this:

01/Jul/2017:01:00:22 +0000;94.31.90.168;-;"GET / HTTP/1.1";200;227;"-";"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"

The file has the same fields as before, except that the timestamp field is moved to the first position. All the fields are separated by a semicolon (;). In the database, there is a table named dwh.access_log with the following structure:

CREATE TABLE dwh.access_log
(
    ts timestamp with time zone,
    remote_address text,
    remote_user text,
    url text,
    status_code int,
    body_size int,
    http_referer text,
    http_user_agent text
);

PostgreSQL provides a way to load data into tables quickly, many rows at a time, instead of executing INSERT commands row by row. The command that does this is named COPY. The data can be taken from a file located on the database server in a place accessible to the PostgreSQL server process, or from a stream. By default, the data should look like a tab-separated list of fields with records separated by newlines; however, this is customizable. The command is not included in the SQL standard. Here is a simplified syntax diagram for it:

COPY <table name> [(column [, ...])] FROM { <file name> | STDIN } [[WITH] (<options>)]

The options clause is used to specify the data format, delimiters, quoting, escape characters, and some other parameters. STDIN can be specified instead of a filename to copy the data from the standard input. The column list can be provided when the file does not have data for all the columns, or when they appear in a different order.
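For illustration, here is a minimal sketch combining a column list with some of these options. It assumes a hypothetical file, /tmp/access_short.log, that contains only the first six fields of the log and uses a dash to denote missing values; the file name and the options are examples, not part of the log format described earlier:

-- load only the listed columns; a dash in the file is treated as NULL
COPY dwh.access_log (ts, remote_address, remote_user, url, status_code, body_size)
FROM '/tmp/access_short.log'
WITH (format csv, delimiter ';', null '-');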

To load the data into the database, one should somehow copy the file to the server and then execute a command such as this:

COPY dwh.access_log FROM '/tmp/access.log';

This would load the contents of the file located at /tmp/access.log to the dwh.access_log table. However, this is not a good idea for several reasons:

  • Only a superuser can execute the COPY command to read from or write to files.
  • As the PostgreSQL server process needs to access the file, it would be necessary to either place the file in the location where other database files are stored, or allow the PostgreSQL process to access other locations. Both may be considered issues from a security point of view.
  • Copying the file to the server can be an issue in itself: it would be necessary to allow the user (or the application) to access the filesystem on the server, or to mount shared network locations on the server.

To avoid these issues, one should use the COPY command to load the data from a stream, namely the standard input. To do so, one would normally need to write an application that connects to the database, executes the COPY command, and then streams the data to the server. Luckily, the psql console can do this.

Let's try to load the sample file access.log that is provided in the attached media.

Suppose the file is located in the current directory. The database is running on localhost and the database user car_portal_app is allowed to connect. The following command will load the data from the file to the dwh.access_log table on a Linux system:

user@host:~$ cat access.log | psql -h localhost -U car_portal_app -d car_portal -c "COPY dwh.access_log FROM STDIN WITH (format csv, delimiter ';')"
COPY 15456

The cat command prints the file to the console. Its output is piped to psql, which executes the COPY command and loads the data from the standard input. The output of the command is the word COPY followed by the number of rows copied.

On Windows, the command is similar:

c:\Users\user> psql -h localhost -U car_portal_app -d car_portal -c "COPY dwh.access_log FROM STDIN WITH (format csv, delimiter ';')" < access.log
COPY 15456

It is also possible to perform this operation interactively from the psql console. There is a \copy meta-command provided by psql, which has a syntax similar to the SQL command COPY: \copy dwh.access_log FROM 'access.log' WITH csv delimiter ';'. In the background, it is still the same: the COPY ... FROM STDIN command is executed on the server, and psql reads the file and sends its contents to the server.
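For example, an interactive session loading the same file could look like this (assuming access.log is located in the directory from which psql was started):

car_portal=> \copy dwh.access_log FROM 'access.log' WITH csv delimiter ';'

Note that the file is read by psql on the client side, so it does not need to be accessible to the server process, and no superuser privileges are required.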

Using this approach, it is very easy to organize a simple ETL process. It may be executed once per day and would just execute the preceding command. The prerequisite for this would be setting up log rotation for the HTTP server to make sure that the same data is not loaded twice.

The COPY command can not only load data into tables, but it can also copy the data the other way round: from tables to files or to the standard output. This feature can again be used in ETL processes, which can then be organized in the form of simple bash scripts instead of complex software that accesses the database through JDBC or other APIs.

Here is an example of how to copy the data of a table from one server to another (assuming the structure of the tables is identical):

$ psql -h server1 -d database1 -c "COPY table1 TO STDOUT" | psql -h server2 -d database2 -c "COPY table2 FROM STDIN"

Here, the first psql command reads the table and writes it to the standard output. This stream is piped to the second psql command, which writes the data from the standard input to the other table.

ETL can also include enriching the raw input data with additional attributes, or preprocessing it to make it easier to query. For example, the log entries could be mapped to records in the car_portal database: say, an API call to /api/cars/30 could be mapped to the record in the car table with car_id = 30. Such processing can be done in the database, as follows:

car_portal=> ALTER TABLE dwh.access_log ADD car_id int;
ALTER TABLE

car_portal=> UPDATE dwh.access_log
SET car_id = (SELECT regexp_matches(url, '/api/cars/(\d+)\W'))[1]::int
WHERE url like '%/api/cars/%';
UPDATE 2297

In a similar way, the access_log table can be enriched with other attributes, and it can be joined to other tables.
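For example, as a hedged sketch of such a join, the number of requests per car could be counted as follows; the query relies only on the car table and its car_id column mentioned earlier, and the report itself is just an illustration:

-- count how many API requests refer to each car
SELECT c.car_id, count(*) AS requests
FROM dwh.access_log l
JOIN car c ON c.car_id = l.car_id
GROUP BY c.car_id
ORDER BY requests DESC;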

In real life, ETL is usually more complex. It would normally include checking whether the data being loaded already exists in the destination table and cleaning it up before loading, which makes the process idempotent. Additionally, the process could identify which data to load and locate the input data. For example, consider an ETL job that loads some dataset every day. If it has failed for some reason, then when started the next day, it should recognize that the failure occurred and load two sets of data instead of only one. Notifying the right systems or people in case of failure can also be done by the ETL process.
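As a minimal sketch of such a cleanup step, assuming the job loads one day of data at a time, the rows for the day being processed could be deleted before the file is loaded again (the date literals are only an example):

-- remove anything that a previous, possibly failed, run has loaded for this day
DELETE FROM dwh.access_log
WHERE ts >= '2017-07-01'
  AND ts < '2017-07-02';

Running this DELETE and the subsequent COPY ... FROM STDIN in the same transaction ensures that a failed load does not leave partial data behind.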

There are a lot of different ETL tools available, both commercial and open source.
