How Qlik Sense handles large volumes of data 

The usual Qlik in-memory approach does a really good job of handling typical datasets (millions of records); getting an answer in under a second after a user selection is the normal experience.

The Qlik Indexing Engine (QIX) is Qlik's patented in-memory data-indexing technology. It loads the user's dataset (extracts from databases, files, and data lakes) and keeps it in server memory, using a compression scheme that can shrink the data to around 10% of its original size. Let's understand how this works. While the data is being loaded by your script or the data manager, the Qlik engine internally creates a set of tables to hold it. First, for each field it builds a symbol table with two columns (value and pointer): the engine stores only the distinct values of the field and assigns each one a pointer using the smallest number of bits needed to address all the distinct values that were loaded. The symbol table therefore has one row per distinct value, each paired with the smallest possible binary representation of its index. Let's see some examples of the symbol table.

Let's suppose we are loading a small table such as this:

Gender Age
Male 0-1 years
Other 16-24 years
Female More than 40 years
Male 2-5 years
... (more rows) ...
Male 2-5 years
The symbol table for the Gender field (one record for each distinct value, paired with the smallest binary representation of its index) would be:

Value Pointer
Female 00
Male 01
Other 10


The symbol table for the Age field (one record for each distinct value, paired with the smallest binary representation of its index) would be:

Value Pointer
0-1 years 000
2-5 years 001
6-10 years 010
11-15 years 011
16-24 years 100
25-40 years 101
More than 40 years 110


The Qlik engine now rebuilds the original data table using only the pointers (the binary representations stored in the symbol tables):

Gender Age
01 000
10 100
00 110
01 001
... (more rows) ...
01 001


As you can see, this approach leads to a much more compact representation of the loaded data, resulting in a highly optimized in-memory dataset. See https://community.qlik.com/t5/Qlik-Design-Blog/Symbol-Tables-and-Bit-Stuffed-Pointers/ba-p/1475369 for a more detailed description of this model.
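To make the mechanics concrete, here is a minimal Python sketch of the same idea. It is a toy illustration only, not Qlik's actual implementation: it builds a symbol table per field and rewrites the rows as fixed-width binary pointers. The pointer values may differ from the tables above, because this sketch simply assigns pointers in load order, while the exact assignment is an engine detail.

import math

def build_symbol_table(values):
    """Return (symbol_table, pointers) for one field: one entry per distinct
    value, addressed by the smallest fixed-width binary pointer."""
    distinct = list(dict.fromkeys(values))              # keep one copy of each value
    bits = max(1, math.ceil(math.log2(len(distinct))))  # smallest usable pointer width
    index = {v: i for i, v in enumerate(distinct)}
    symbol_table = [(format(i, f"0{bits}b"), v) for v, i in index.items()]
    pointers = [format(index[v], f"0{bits}b") for v in values]
    return symbol_table, pointers

gender = ["Male", "Other", "Female", "Male", "Female", "Male", "Other", "Male"]
age = ["0-1 years", "16-24 years", "More than 40 years", "2-5 years",
       "6-10 years", "11-15 years", "25-40 years", "2-5 years"]

gender_symbols, gender_ptrs = build_symbol_table(gender)  # 3 distinct values -> 2-bit pointers
age_symbols, age_ptrs = build_symbol_table(age)           # 7 distinct values -> 3-bit pointers

print(gender_symbols)                    # [('00', 'Male'), ('01', 'Other'), ('10', 'Female')]
print(list(zip(gender_ptrs, age_ptrs)))  # the rebuilt data table, pointers only

Because each long or repeated value is stored only once and every row carries just a few bits per field, the memory footprint drops sharply for fields with low cardinality.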

But how can we handle a really big dataset containing billions of rows? Adding more RAM is an option, but it is not always feasible. For these situations, Qlik technology offers four approaches to handling big datasets, which can be used together or separately:

  • Segmentation: Segmentation is the process of splitting one huge Qlik application into several smaller ones. For example, instead of having one huge application covering the entire country, we can split it into one app per state.
  • Chaining: Chaining is an approach that usually goes together with segmentation. It refers to linking multiple segmented Qlik applications while preserving the user's selections. When a user selects something that belongs to another application, they are redirected to that application with the same selections applied, so the experience feels smooth and natural. As observed in the following diagram, when a user makes a selection in SegApp1, they are redirected to the respective application with their current selections kept, to the point that they may not even realize they have switched applications:

  • Direct Discovery: With this approach, a SQL statement is executed against the original data source for every uncached selection a user makes. The dimension values are usually pre-populated, and the SQL is re-evaluated for every calculation. This approach is being deprecated in favor of ODAG.
  • On-Demand App Generation (ODAG): In the June 2017 release, Qlik Sense Enterprise introduced a new way of handling really big datasets, called On-Demand App Generation (ODAG). In this scenario, we have a summarized application (as shown in the following diagram) that is refreshed on a schedule and holds a consolidated view of the data (rolled up by date and product, for example). This application is loaded in memory and, when the user wants more detailed information, the selections they have made are passed to the script of a detail application template, which retrieves only the required subset from the data lake:

This approach lets you analyze billions of records through a summarized view (something that Apache Hive, Impala, Redshift, BigQuery, and similar tools do very well) and, when the user wants to drill into the details of a scenario, a freshly generated application with a custom query retrieves only the records that were selected.
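The on-demand step can be sketched conceptually in Python as follows. This is an illustration only: the sales_detail table, the field names, and the query builder are made up for the example, whereas in an actual Qlik Sense ODAG setup the user's selections are bound into the load script of the detail template app.

def build_detail_query(selections, table="sales_detail"):
    """Turn {field: [selected values]} into a filtered SQL statement
    so that only the selected slice of the detail data is fetched."""
    where = " AND ".join(
        f"{field} IN ({', '.join(repr(v) for v in values)})"
        for field, values in selections.items() if values
    )
    return f"SELECT * FROM {table}" + (f" WHERE {where}" if where else "")

# The user selected one state and two months in the summarized app:
selections = {"state": ["CA"], "month": ["2018-01", "2018-02"]}
print(build_detail_query(selections))
# SELECT * FROM sales_detail WHERE state IN ('CA') AND month IN ('2018-01', '2018-02')

The key design point is that the detail query is generated per request, so the big table never has to fit in memory in full; only the slice matching the user's current selections is loaded into the generated app.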
