Chapter 12. Inside VertiPaq

In Chapter 1, “Introducing the tabular model,” you saw that the tabular model can execute a query by using an in-memory analytics engine or by translating the MDX or DAX query into one or more SQL queries. The former uses a storage engine called VertiPaq, whereas the latter uses DirectQuery. This chapter is dedicated to the internal architecture of the VertiPaq engine, which is the in-memory columnar database that stores and hosts tabular models.

Before reading this chapter, you need to be aware of the following two facts:

Implementation details change often. We did our best to present information at a level that is not likely to change soon, carefully balancing the level of detail and its usefulness with consistency over time. The most up-to-date information will always be available in blog posts and articles on the web.

All the considerations about the engine are useful if you rely on the VertiPaq engine, but they are not relevant if you are using DirectQuery. However, we suggest you read and understand this chapter anyway, because it covers many details that will help you choose the best engine for your analytical scenario.


What’s new in SSAS 2016

The VertiPaq engine has been improved in SSAS 2016, but this does not affect the general concepts behind how it works. However, this chapter has been reviewed and updated to include sections about internal metadata, memory usage, and tools to analyze VertiPaq structures.


Understanding VertiPaq structures

In this first part of the chapter, you learn how VertiPaq stores data in memory. This knowledge is fundamental in understanding how to improve compression and performance of both processing and querying a tabular model.

Understanding column storage

VertiPaq is an in-memory columnar database. Being in-memory means that all the data handled by a model resides in RAM, which is an easy concept to grasp. Being columnar requires a bit more explanation. We usually think of a table as a list of rows, where each row is divided into columns—for example, the Product table shown in Figure 12-1.


Figure 12-1 The Product table, with four columns and nine rows.

If you think of a table as a set of rows, then you are using the most natural visualization of a table structure, also known as a row store. In a row store, data is organized in rows. Using the values in the table shown in Figure 12-1 as an example, when data is stored in memory, the value in the Name column in the first row is adjacent to the values in the ID and Color columns in the same row. That means the value in the Name column in the second row is not adjacent to the value of the Name column in the first row. Between them are values in the Color and Unit Price columns in the first row and in the ID column in the second row.

Suppose you need to compute the sum of the values in the Unit Price column. If you were to read the data in the table sequentially in the way we just described, you would have to scan the entire table, reading many values that you are not interested in seeing. In other words, when scanning the memory of the database sequentially, to read the first value of Unit Price, you must read (and skip) ID, Name, and Color from the first row. Only then will you find an interesting value. The same process repeats for all the rows. You need to read and ignore many columns to find the interesting values to sum.

Reading and ignoring values takes time. In fact, if you were asked to compute the sum of the values in the Unit Price column, you would not follow that algorithm. Instead, as a human being, you would probably scan the first row and search for the position of Unit Price. You would then move your eyes vertically, reading only the values in the Unit Price column one at a time and mentally accumulating their values to produce the sum. The reason for this very natural behavior is that you save time by reading vertically instead of on a row-by-row basis.

In a columnar database, data is organized in such a way as to optimize vertical scanning. To obtain this result, you need a way to make the different values of a column adjacent to one another. Figure 12-2 shows the same Product table organized as a columnar database.


Figure 12-2 The Product table organized on a column-by-column basis.

When stored in a columnar database, each column has its own data structure and is physically separated from the others. Thus, the different values of Unit Price are adjacent to one another and are distant from the values in the Color, Name, and ID columns.

With this data structure, computing the sum of the values in the Unit Price column is much easier because you immediately go to the structure that contains the Unit Price column, where you find all the values you need to perform the computation. In other words, you do not need to read and ignore other column values. In a single scan, you obtain only the useful numbers, and you can quickly aggregate them.
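
To make this concrete, the following Python sketch (our own illustration with made-up values, not VertiPaq code) models a small Product table twice: once as a row store and once as a column store. Summing Unit Price in the row store touches every column of every row, whereas the column store scan reads a single contiguous structure.

# Row store: each element is a complete row, so a scan touches every column.
row_store = [
    (1, "Camcorder", "Red", 112),
    (2, "Camera", "Red", 97),
    (3, "Smartphone", "White", 100),
]
# Column store: each column is a separate, contiguous structure.
column_store = {
    "ID": [1, 2, 3],
    "Name": ["Camcorder", "Camera", "Smartphone"],
    "Color": ["Red", "Red", "White"],
    "Unit Price": [112, 97, 100],
}
# Row store scan: read (and skip) ID, Name, and Color to reach Unit Price in every row.
total_from_rows = sum(row[3] for row in row_store)
# Column store scan: read only the Unit Price structure.
total_from_columns = sum(column_store["Unit Price"])
assert total_from_rows == total_from_columns == 309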

Now imagine that instead of asking you for the sum of the values in the Unit Price column, we asked you to compute the sum of the values in the Unit Price column for only those products that contain a value of Red in the Color column. Try this before you continue reading. It will help you better understand the algorithm.

Computing the sum is more difficult now because you cannot obtain the sum by simply scanning the Unit Price column. Instead, you might scan the Color column and, whenever it contains the value Red, note the corresponding value in the Unit Price column. Then you would sum up all those values to compute the result. This algorithm, although very natural, would require you to constantly move your eyes from one column to another, possibly guiding your finger to keep the last scanned position. It is definitely not an optimized way of computing the value! A better way—one that only computers use—is to first scan the Color column, find the row numbers where the value is Red, and then, once you know the row numbers, scan the Unit Price column and sum only the rows you identified in the previous step.

This last algorithm is much better because it lets you perform one scan of the first column and one scan of the second column, always accessing the memory locations that are adjacent to each other (apart from the jump between the scan of the first and second column).
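
The two-pass algorithm just described can be sketched in the same way: one scan of the Color column to collect the matching row numbers, then one scan of the Unit Price column restricted to those positions. This is a simplified illustration with invented data, not the engine's actual implementation.

colors = ["Red", "Red", "White", "Blue", "Red"]
unit_prices = [112, 97, 100, 125, 88]
# Pass 1: scan the Color column and note the row numbers where the value is Red.
red_rows = [i for i, color in enumerate(colors) if color == "Red"]
# Pass 2: scan the Unit Price column only at the row numbers found in pass 1.
total_red = sum(unit_prices[i] for i in red_rows)
print(red_rows)   # [0, 1, 4]
print(total_red)  # 297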

Things are even worse for a more complex expression, such as the sum of the unit prices of all the products that contain either Blue or Black in the Color column and a value higher than USD 50 in the Unit Price column. This time, there is no way to scan one column at a time because the condition depends on many columns. (As usual, trying it on paper helps you understand it better.) The simplest algorithm to produce such a result is to scan the table not on a column basis, but on a row basis. If you tried the exercise, you probably scanned the table row by row, even though the storage organization is column by column. Although this is a very simple operation when executed on paper by a human, the same operation is extremely expensive if executed by a computer in RAM, because it requires many random reads of memory, leading to worse performance than a sequential scan.


Note

You might know that RAM is short for random-access memory, which in principle allows data to be read with the same speed and latency irrespective of the physical location of the data inside the memory. While true in theory, this statement is no longer valid on modern hardware. Because RAM’s access time is high compared to the CPU clock speed, there are different levels of CPU cache to improve performance. Data is transferred to the cache in pages, so reading a contiguous area of memory is faster than accessing the same amount of data scattered across non-contiguous memory addresses. You can find more information about the role of the CPU cache at https://en.wikipedia.org/wiki/CPU_cache.


Columnar databases provide very quick access to a single column. However, if a calculation involves many columns, the engine needs to spend some time reorganizing the information in such a way that the final expression can be computed. Even though this example was very simple, it is enough to highlight the most important characteristics of column stores:

Single-column access is very fast because it reads a single block of memory and then computes whatever aggregation you need on that memory block.

If an expression uses many columns, the algorithm requires the engine to access different memory areas at different times, while keeping track of the progress in a temporary area.

The more columns you need to compute an expression, the harder it becomes to produce a final value. At some point, it is easier to rebuild the row storage out of the column store to compute the expression.

Column stores aim to reduce the read time. However, they spend more CPU cycles to rearrange the data when many columns from the same table are used. Row stores, on the other hand, have a more linear algorithm to scan data, but they result in many useless reads. As a rule, reducing reads and increasing CPU usage is a good exchange because with modern computers it is always easier (and cheaper) to increase the CPU speed than to reduce the I/O (or memory access) time. Moreover, columnar databases can reduce the amount of time spent scanning data, via compression. VertiPaq compression algorithms aim to reduce the memory footprint of your data model. This is very important for two reasons:

A smaller model makes better use of your hardware. Why spend money on 1 TB of RAM when the same model, once compressed, can be hosted in 256 GB? Saving RAM is always a good option, if feasible.

A smaller model is faster to scan. As simple as this rule is, it is very important when speaking about performance. If a column is compressed, the engine scans less RAM to read its content, resulting in better performance.


Important

The actual details of VertiPaq’s compression algorithm are proprietary. Naturally, we cannot publish them in a book. What we explain in this chapter is simply a good approximation of what happens in the engine. You can use this information to understand how the VertiPaq engine stores data.


Value encoding versus hash encoding

The first compression technique concerns how VertiPaq stores column values in memory. Each column can be encoded using either one of these techniques: value encoding or hash encoding.

Value encoding

Value encoding stores data by applying both an offset and a reduction of the bits required to store the value, based on the range of values that are available in the column. Suppose you have a column that contains the price of products, stored as integer values. There are many different values, and to represent them all, you need a defined number of bits.

Let’s use the values in Figure 12-3 as an example. The maximum value in the Unit Price column is 216. Therefore, you need at least 8 bits to store each value. However, by using a simple mathematical operation, you can reduce the storage to 5 bits. In this case, by subtracting the minimum value (194) from all the values in the Unit Price column, VertiPaq reduces the range of the column to a range from 0 to 22. Storing numbers up to 22 requires fewer bits than storing numbers up to 216. While 3 bits might seem like a very small saving, when you multiply this over a few billion rows, it is easy to see that the difference can be important.


Figure 12-3 Using VertiPaq to reduce the number of bits needed for a column.
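
The arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes, as in our simplified description, that the transformation is just an offset by the column minimum; the intermediate values are made up, and the real engine applies more sophisticated, proprietary transformations.

unit_prices = [216, 194, 210, 199, 205]
# Bits required for the raw values: the maximum value (216) needs 8 bits.
raw_bits = max(value.bit_length() for value in unit_prices)
# Value encoding: store each value as an offset from the column minimum (194).
base = min(unit_prices)
encoded = [value - base for value in unit_prices]   # range becomes 0 to 22
encoded_bits = max(max(encoded).bit_length(), 1)    # 22 needs only 5 bits
# Decoding re-applies the transformation in the opposite direction.
assert [value + base for value in encoded] == unit_prices
print(raw_bits, encoded_bits)  # 8 5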

The VertiPaq engine is even more sophisticated than that. It can discover mathematical relationships between the values of a column and, when it finds them, it can use them to modify the storage, reducing its memory footprint. Obviously, when using the column, it must re-apply the transformation in the opposite direction to obtain the original value. Depending on the transformation, this can happen before or after aggregating the values. Again, this increases the CPU usage and reduces the number of reads, which, as discussed, is a very good exchange.

Value encoding happens only for numeric columns. Obviously, it cannot be applied to strings, as there are no numeric values to encode. Note that VertiPaq stores the DAX currency data type as an integer value. Value encoding can also be applied to floating-point values when the values used can be stored as a series of sequential integers by means of simple arithmetical transformations. For example, sequences compressed with value encoding might include 1, 2, 3, 4; or 0.01, 0.02, 0.03, 0.04; or 10, 20, 30, 40; or 120, 130, 140.

Hash encoding

Hash encoding (also known as dictionary encoding) is another technique used by VertiPaq to reduce the number of bits required to store a column. Hash encoding builds a dictionary of the distinct values of a column and then it replaces the column values with indexes to the dictionary. Let’s see this with an example. In Figure 12-4, you can see the Color column, which uses strings and, thus, cannot be value-encoded.


Figure 12-4 Creating the dictionary and replacing the values with indexes.

When VertiPaq encodes a column with hash encoding, it does the following:

It builds a dictionary, containing the distinct values of the column.

It replaces the column values with integer numbers, where each number is the dictionary index of the original value.
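
The following Python sketch (an illustration only, not the engine's internal format) shows these two steps on the Color column, plus the resulting number of bits per value, which depends only on the number of distinct values.

import math
colors = ["Red", "Red", "Blue", "Black", "Blue", "Azure", "Red"]
# Step 1: build a dictionary with the distinct values of the column.
dictionary = []
index_of = {}
for value in colors:
    if value not in index_of:
        index_of[value] = len(dictionary)
        dictionary.append(value)
# Step 2: replace the column values with their dictionary indexes.
encoded = [index_of[value] for value in colors]
# With four distinct values, 2 bits per value are sufficient.
bits_per_value = max(math.ceil(math.log2(len(dictionary))), 1)
print(dictionary)      # ['Red', 'Blue', 'Black', 'Azure']
print(encoded)         # [0, 0, 1, 2, 1, 3, 0]
print(bits_per_value)  # 2
# Decoding is a simple lookup, independent of the original data type.
assert [dictionary[i] for i in encoded] == colors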

There are some advantages to using hash encoding:

Columns contain only integer values, making it simpler to optimize the internal code of the engine. Moreover, it basically means that VertiPaq is data-type independent.

The number of bits used to store a single value is the minimum number of bits necessary to store an index entry. In the example provided, having only four different values, 2 bits are sufficient.

These two aspects are of paramount importance for VertiPaq. When you leverage hash encoding, it does not matter whether you use a string, a 64-bit integer, or a floating point to represent a value. All these data types can be hash-encoded, providing the same performance in terms of both speed of scanning and storage space. The only difference might be in the size of the dictionary, which is typically very small when compared to the size of the column itself.

The primary factor determining the column size is not the data type, but the number of distinct values of the column. We refer to this number as the column’s cardinality. Of all the various factors of an individual column, the most important one when designing a data model is its cardinality. The lower the cardinality, the smaller the number of bits required to store a single value and, consequently, the smaller the memory footprint of the column. If a column is smaller, not only will it be possible to store more data in the same amount of RAM, it will also be much faster to scan it whenever you need to aggregate its values in a DAX expression.

Run-length encoding

Hash encoding and value encoding are two mutually exclusive compression techniques. However, there is a third complementary compression technique used by VertiPaq: run-length encoding (RLE). This technique aims to reduce the size of a dataset by avoiding repeated values. For example, consider a column that contains the calendar quarter of a sale, which is stored in the Sales table. This column might have the string Q1 repeated many times in contiguous rows, for all the sales in the same quarter. In such a case, VertiPaq avoids storing repeated values and replaces them with a slightly more complex structure. The structure contains the value only once, with the number of contiguous rows having the same value, as shown in Figure 12-5.


Figure 12-5 Using RLE to replace repeated values with the number of contiguous rows that contain the same value.


Note

The table on the right side of Figure 12-5 contains the Quarter, Start, and Count columns. In reality, Start is not required because VertiPaq can compute it by summing all the previous values of Count, which again saves precious bytes of RAM.


RLE’s efficiency strongly depends on the repetition pattern of the column. Some columns will have the same value repeated for many rows, which results in a higher compression ratio. Others, with quickly changing values, will produce a lower compression ratio. Data sorting is extremely important to improve the compression ratio of RLE, as you will see later in this chapter.

You might have a column in which the content changes so often that, if you try to compress it using RLE, you end up using more space. Think, for example, of a table’s primary key. It has a different value for each row, resulting in an RLE version that is larger than the original column itself. In such a case, VertiPaq skips the RLE compression and stores the column as it is. Thus, the VertiPaq column storage will never exceed the original column size.
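
The following sketch captures both ideas: runs of repeated values collapse to value-and-count pairs, and a column with no repetitions (such as a primary key) falls back to plain storage so that it never grows. The run structure and the fallback test are simplifications for illustration, not the proprietary format.

def rle_encode(values):
    """Return (value, count) runs, or None when RLE would not save space."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    # Rough fallback: keep the original column if the runs are not clearly smaller.
    return runs if len(runs) * 2 < len(values) else None

quarters = ["Q1"] * 310 + ["Q2"] * 290 + ["Q3"] * 305 + ["Q4"] * 295
print(rle_encode(quarters))     # [('Q1', 310), ('Q2', 290), ('Q3', 305), ('Q4', 295)]
primary_key = list(range(1, 1001))
print(rle_encode(primary_key))  # None: every run has length 1, so RLE is skipped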

The previous example showed RLE applied to the Quarter column’s strings. In this case, RLE processed the already hash-encoded version of the column. In fact, each column can use RLE with either hash or value encoding. VertiPaq’s column storage, compressed with hash encoding, consists of two distinct entities: the dictionary and the data rows. The latter is the RLE-encoded result of the hash-encoded version of the original column, as shown in Figure 12-6.


Figure 12-6 RLE applied to the hash-encoded version of a column.

VertiPaq also applies RLE to value-encoded columns. In this case, the dictionary is missing because the column already contains value-encoded integers.
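
Combining the previous two sketches gives an idea of the storage shown in Figure 12-6: the dictionary is kept once, and RLE runs over the dictionary indexes rather than over the original strings. Again, this is only a conceptual sketch.

quarters = ["Q1", "Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q3", "Q3"]
# Step 1: hash (dictionary) encoding of the strings.
dictionary = sorted(set(quarters), key=quarters.index)
indexes = [dictionary.index(q) for q in quarters]
# Step 2: RLE over the hash-encoded indexes.
runs = []
for index in indexes:
    if runs and runs[-1][0] == index:
        runs[-1] = (index, runs[-1][1] + 1)
    else:
        runs.append((index, 1))
print(dictionary)  # ['Q1', 'Q2', 'Q3']
print(runs)        # [(0, 3), (1, 2), (2, 4)]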

When working with a tabular model, the factors to consider regarding its compression ratio are as follows, in order of importance:

The cardinality of the column. This defines the number of bits used to store values.

The distribution of data in a column. A column with repeated values will be compressed more than a column with frequently changing values.

The number of rows in the table. Note that this is less important than cardinality and the distribution of data in columns.

The data type of the column. This affects only the dictionary size.

Given these considerations, it is nearly impossible to predict the compression ratio of a table. Moreover, you can limit the number of rows and change the data types, but these are the least important aspects of the table.

Finally, it is worth noting that if you reduce the cardinality of a column, you are also increasing the chances of repetition. For example, if you store a column containing time values with a granularity of seconds, then you have up to 86,400 distinct values in the column. If, on the other hand, you store the same column with time values at a granularity of hours, then you have not only reduced the cardinality to at most 24 values, but you have also introduced repeating values (3,600 different seconds convert to the same hour), resulting in a much better compression ratio. Also consider that changing the data type from date to integer or string has a negligible impact on the column size, because it affects only the dictionary size, not the number of unique values and the consequent compression rate.
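
The effect of reducing granularity is easy to verify with a quick sketch. The data here is randomly generated just for the example; the point is that truncating seconds to hours collapses the cardinality to at most 24 values and creates long runs that RLE compresses well.

import random
random.seed(1)
# A time-of-day column at second granularity: up to 86,400 distinct values.
seconds_of_day = sorted(random.randrange(86_400) for _ in range(100_000))
# The same column at hour granularity: at most 24 distinct values.
hours_of_day = [second // 3600 for second in seconds_of_day]
print(len(set(seconds_of_day)))  # tens of thousands of distinct values
print(len(set(hours_of_day)))    # 24 (with enough rows, every hour appears)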

Controlling column encoding

SQL Server Analysis Services (SSAS) must decide which algorithm to use to encode each column. In general, it needs to decide whether to use a value or hash encoding. However, the patented algorithms used by SSAS are much more complex. This description is a simplification of them, yet it is enough to give you a solid understanding.

To determine the best algorithm to use, SSAS reads a sampling of rows during the first scan of the source. It then selects the algorithm based on the following conditions:

If the data type of the column is a string, hash encoding is used.

For numeric values, SSAS uses the following additional heuristics:

• If the column numbers increase linearly, it is probably a primary key, and value encoding is the best option.

• If the numbers are in a defined range of values, then value encoding is the way to go.

• If the numbers have a wide range of values, then dictionary encoding is the best choice.

Once the decision is made, SSAS starts to compress the column using the chosen algorithm. Occasionally, values that were not in the original sampling can cause the process to require reevaluation. For example, SSAS might read a few million rows in which the values are in the range 100–201, making value encoding the best choice. However, after those millions of rows, an outlier might suddenly appear—for example, a large number like 60,000,000. Clearly, value encoding cannot continue because the number of bits needed to store such a large number is larger than the bits available. Instead of continuing with the wrong choice, SSAS can decide to re-encode the column. This means the whole column is re-encoded using, in this case, hash encoding. This process might take a long time because it needs to reprocess the entire column.
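
The following Python sketch caricatures these heuristics and the re-encoding scenario. The thresholds and the sample are invented for the example; the real selection rules are proprietary and far more sophisticated.

def choose_encoding(sample, max_value_bits=16):
    """Toy heuristic that picks an encoding from a sample of column values."""
    if any(isinstance(value, str) for value in sample):
        return "HASH"
    value_range = max(sample) - min(sample)
    # A narrow range of numbers needs few bits per value: use value encoding.
    return "VALUE" if value_range.bit_length() <= max_value_bits else "HASH"

sample = list(range(100, 202))       # the first rows contain values 100-201
encoding = choose_encoding(sample)   # 'VALUE'
# Later an outlier appears that does not fit the bits chosen from the sample:
# the column must be re-encoded, in this case with hash encoding.
outlier = 60_000_000
if encoding == "VALUE" and (outlier - min(sample)).bit_length() > 16:
    encoding = "HASH"
print(encoding)  # 'HASH'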

For very large datasets, where processing time is important, a best practice is to provide SSAS with a good sample of data distribution in the first set of rows it reads. This reduces re-encoding risks to a minimum. This technique can also be useful to “suggest” the type of encoding you want to use. For example, if the first partition read in a table has a small number of rows, and the interval of values is wide, then dictionary encoding might seem the best option for VertiPaq. However, you cannot apply a real constraint to the encoding choice made by the engine, and re-encoding is always possible later.

Hierarchies and relationships

As you saw in Chapter 11, “Processing and partitioning tabular models,” SSAS builds two additional data structures after table processing:

Hierarchies. There are two types of hierarchies: attribute hierarchies and user hierarchies. Both are data structures used to improve the performance of MDX queries. Because DAX does not have the concept of a hierarchy in the language, user hierarchies are not used in DAX calculations. They are required only if you want to make a column visible to MDX. Attribute hierarchies are used internally by the engine for certain optimizations, such as looking for a certain value in an attribute. If a column is hidden from the client and used only in particular DAX calculations (such as aggregation) and not to filter or group values, then these structures are of limited use in your data model, but you still must pay the price of storing them in memory.

Relationships. A relationship is a data structure that maps IDs in one table to row numbers in another table.


Note

Relationships play an important role in the VertiPaq engine, and, for some extreme optimizations, it is important to understand how they work.


With regard to relationships, consider two related tables—Sales and Products—both containing a ProductKey column. Products[ProductKey] is a primary key. You know that VertiPaq used value encoding and no compression at all on Products[ProductKey] because RLE could not reduce the size of a column without duplicated values. However, Sales[ProductKey] is likely hash-encoded and compressed because it probably contains many repetitions. In other words, the data structures of the two columns are completely different.

Moreover, because you created the relationship, VertiPaq knows that you are likely to use it often, thus placing a filter on Products and expecting to filter Sales, too. If every time it needed to move a filter from Products to Sales, VertiPaq had to retrieve values from Products[ProductKey], search them in the Sales[ProductKey] dictionary, and retrieve the Sales[ProductKey] IDs to place the filter, then it would result in slow queries.

To improve query performance, VertiPaq stores relationships as pairs of IDs and row numbers. Given the ID of a Sales[ProductKey], it can immediately find the corresponding rows in the Products table that match the relationship. Relationships are stored in memory, as any other data structure of VertiPaq. Figure 12-7 shows you how the relationship between Sales and Products is stored.


Figure 12-7 The Sales and Products relationship.
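
A rough way to picture this structure is a map from each value of Products[ProductKey] to its row number in Products, which VertiPaq can use to transfer a filter between the two tables without going through the dictionaries every time. The keys below are made up and the real internal format is different; this is only a conceptual sketch.

# Products: ProductKey is a primary key, one row per product.
products_product_key = [101, 102, 103, 104]
# Sales: ProductKey repeats for every sale of the same product.
sales_product_key = [101, 103, 103, 101, 104, 102, 101]
# Relationship structure: ProductKey -> row number in Products.
relationship = {key: row for row, key in enumerate(products_product_key)}
# Given any Sales row, jump straight to the corresponding Products row.
products_rows_for_sales = [relationship[key] for key in sales_product_key]
print(products_rows_for_sales)  # [0, 2, 2, 0, 3, 1, 0]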

Segmentation and partitioning

As you might imagine, compressing a table of several billion rows in a single step would be extremely memory-intensive and time-consuming. However, the table is not processed as a whole. Instead, during processing, SSAS reads it in segments that, by default, contain 8,000,000 rows each. When a segment is completely read, the engine starts to compress it while reading the next segment.

You can configure the segment size by using the DefaultSegmentRowCount entry in the Management Studio server properties of an SSAS instance (as shown in Figure 12-8). You can change this property without restarting the service, and the new value affects all subsequent process operations. A value of 0 uses the default of 8,000,000 rows in Analysis Services. If you import a model from a Power Pivot workbook, consider that the segments in Power Pivot are limited to 1,000,000 rows each. Thus, reprocessing such a database in Analysis Services usually reduces the number of segments.


Note

A future release of Analysis Services might introduce a setting for this property that is local to a single process operation, without requiring you to change the server settings to affect a single process.



Figure 12-8 DefaultSegmentRowCount setting in the Analysis Services Properties dialog box.

Segmentation is important for the following reasons:

When querying a table, VertiPaq uses the segments as the basis for parallelism, using one core per segment when scanning a column. By default, SSAS always uses a single thread to scan a table with 8,000,000 rows or less. You start seeing parallelism in action only on much larger tables.

The larger the segment, the better the compression. VertiPaq can achieve better compression levels by analyzing more rows in a single compression step. On very large tables, it is important to test different segment sizes and measure the memory usage to achieve optimal compression. Keep in mind that increasing the segment size can negatively affect processing time; the larger the segment, the slower the processing.

Although the dictionary is global to the table, bit-sizing can be further reduced at the segment level. Thus, if a column has 1,000 distinct values but, in a specific segment, only two of them are used, then that column might be compressed down to a single bit for that segment. The actual number of bits used in a segment depends on the range of internal indexes that reference the dictionary. For this reason, the sort order of a partition can be important in large tables to reduce the number of distinct values per segment. For optimal compression, the values used in a partition must be adjacent in the dictionary if the column has hash encoding. Parallel processing of multiple partitions might affect this optimal result.

If segments are too small, then the parallelism at query time is increased. This is not always a good thing. In fact, while scanning the column is faster, VertiPaq needs more time at the end of the scan to aggregate the partial results computed by the different threads. If a segment is too small, then the time required to manage task switching and final aggregation is more than the time needed to scan the data, with a negative impact on the overall query performance.

During processing, if the table has only one partition, the first segment receives special treatment: it can be larger than DefaultSegmentRowCount. VertiPaq reads up to twice the size of DefaultSegmentRowCount into the first segment and starts segmenting the table only if it contains more rows than that. (This does not apply to a table with more than one partition.) Therefore, a table with 10,000,000 rows will be stored as a single segment, whereas a table with 20,000,000 rows will use three segments: two containing 8,000,000 rows and one with only 4,000,000 rows.
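
The rule for a single-partition table can be sketched as follows. The special treatment of the first segment is modeled exactly as described above; the actual engine behavior in edge cases may differ.

def segment_layout(row_count, segment_rows=8_000_000):
    """Rough sketch of how a single-partition table is split into segments."""
    # The whole table stays in one segment if it fits in twice the default size.
    if row_count <= 2 * segment_rows:
        return [row_count]
    segments = []
    remaining = row_count
    while remaining > 0:
        segments.append(min(segment_rows, remaining))
        remaining -= segment_rows
    return segments

print(segment_layout(10_000_000))  # [10000000]: a single segment
print(segment_layout(20_000_000))  # [8000000, 8000000, 4000000]: three segments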

Segments cannot exceed the partition size. If you have a partitioning schema on your model that creates partitions of only 1,000,000 rows, then all your segments will be smaller than 1,000,000 rows (namely, they will be the same as the partition size). The over-partitioning of tables is a very common mistake for new VertiPaq users. Remember that creating too many small partitions can lower the performance.

Reading VertiPaq internal metadata

SSAS lets you discover information about the data model by using dynamic management views (DMVs). DMVs are extremely useful to explore how your model is compressed, the space used by different columns and tables, the number of segments in a table, or the number of bits used by columns in different segments.

You can run DMVs from inside SQL Server Management Studio (SSMS) or, even better, by using DAX Studio. DAX Studio offers you the list of all DMVs in a simpler way, without requiring you to remember them. This chapter describes how to interpret the information coming from DMVs using another free tool, VertiPaq Analyzer (http://www.sqlbi.com/tools/vertipaq-analyzer/), which automatically retrieves data from DMVs and shows them in useful reports.

Using DMVs for VertiPaq memory usage

Although DMVs use a SQL-like syntax, you cannot use the full SQL syntax to query them because they do not run inside SQL Server. They are a convenient way to discover the status of SSAS and to gather information about data models. Moreover, even though a new set of DMVs specific to tabular models was introduced in SSAS 2016, certain data (such as the memory used by columns) is available only through other DMVs that were created when SSAS supported only Multidimensional. The information provided there is not optimized for Tabular. Therefore, certain views expose column information as a CUBE_NAME, MEASURE_GROUP_NAME, or DIMENSION_NAME, although in VertiPaq there is no concept of a cube, measure group, or dimension. For this reason, a tool like VertiPaq Analyzer is useful to read high-level information without having to manage these details.

There are different DMVs, divided into the following two main categories:

SCHEMA views. These views return information about SSAS metadata, such as database names, tables, and individual columns. They do not provide statistical information. Instead, they are used to gather information about data types, names, and similar data. There are two groups of DMVs with schema information, with the following prefixes:

TMSCHEMA. The views with this prefix are new in SSAS 2016 for Tabular databases with 1200 compatibility level and higher, and are not available in earlier compatibility levels. They describe the metadata structure of a tabular model. The structure of each view is described in the corresponding “Messages in the Protocol Details” section of the SQL Server Analysis Services Tabular protocol documentation, which is available at https://msdn.microsoft.com/en-us/library/mt719260.aspx.

MDSCHEMA. The views with this prefix describe entities seen from a multidimensional perspective, even if the model is tabular. These views are available in any version of Analysis Services Tabular, and the documentation is available at https://msdn.microsoft.com/en-us/library/ms126079.aspx.

DISCOVER views. These views are intended to gather information about the SSAS engine and/or discover statistics about objects in a database. For example, you can use views in the discover area to enumerate the DAX keywords, the number of connections and sessions that are currently open, or the traces running. There are views containing information about the memory used to store segment data, for both multidimensional and tabular databases. (A specific view for Tabular with the same information is not available yet.) For example, DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS provides data using the multidimensional nomenclature (cube, measure group, dimension, and so on) even for tabular databases. The documentation about these views is available at https://technet.microsoft.com/en-us/library/ms126221.aspx.

This book does not describe the details of these views. Instead, it analyzes the meaning of the information for a tabular model, using the reports provided by VertiPaq Analyzer. We focus on the meaning of the data and not on how to retrieve it from DMVs. If you need more information about DMV views, follow the preceding links to find the Microsoft documentation on the web.

Interpreting VertiPaq Analyzer reports

VertiPaq Analyzer is a Power Pivot model that you can open in Excel 2013 or Excel 2016. It contains numerous tables and gets data from several DMVs. To use VertiPaq Analyzer, open the workbook available at http://www.sqlbi.com/tools/vertipaq-analyzer/, click the Download link on the right, and then follow the instructions at http://www.sqlbi.com/articles/data-model-size-with-vertipaq-analyzer/ to collect the data from the specific tabular database that you want to analyze. After you have connected VertiPaq Analyzer to the SSAS database to analyze, you can refresh the data model and analyze the content of the PivotTables, as shown in Figure 12-9.


Figure 12-9 The VertiPaq Analyzer PivotTable, showing the table and columns size.

VertiPaq Analyzer shows data in one PivotTable for each of the following worksheets:

Tables. This displays the size of columns and tables. It also provides a summary of segments and partitions used by the columns of the table. This is the first and most important summary to analyze when you want to get a complete overview of the data volume and distribution in a tabular model.

Columns. This provides the same information as the Tables worksheet but without grouping the columns by table. This report is useful to look for the most expensive columns in the entire data model. It can also be used to determine the columns with the highest memory cost or cardinality.

User Hierarchies. This shows the structure size used for the user hierarchies defined in the tabular model. This cost is usually a fraction of the cost of the attribute hierarchies (one for each column), which is available in the Tables and Columns worksheets.

Relationships. This displays the cost of the relationships defined in the data model and the granularity of the columns involved. The memory cost of each relationship is usually a fraction of the data model.

Compression. This offers a complete overview of the bits used in compressing segments of the data. You can add table and column names to this PivotTable to analyze the compression applied to specific objects. This information can be useful to check whether there is space for improving compression in very large data models.

Columns Data Types. This groups the columns by their data type so you can quickly verify whether the data type of columns is what you expected. For example, using this report, you can quickly identify floating-point columns that could or should be stored as currency, as you will learn later in this section.

Encoding. This groups the columns by their encoding type, which is useful when you want to verify whether columns that are critical for certain DAX expressions use the encoding (hash or value) that works best for that specific task.

Measures. This displays a list of all the measures defined in the model, including their DAX expressions.

Calc. Columns. This displays a list of all the calculated columns defined in the model, including their DAX expressions.

Calc. Tables. This displays a list of all the calculated tables defined in the model, including their DAX expressions.

The data model in VertiPaq Analyzer provides a limited number of Columns entities, which contain the table and column names and type, and Columns Segments entities, which contain attributes related to compression, partitions, and segments. The measures are defined in the Columns table, so they are easily reachable in the PivotTable fields list. These measures include the following:

Data Size. This includes the bytes for all the compressed data in segments and partitions. It does not include dictionary and attribute hierarchies. This number is the amount of memory read by VertiPaq when doing a complete column scan. Different columns in the same table have different data sizes, resulting in varying performance for aggregation functions.

Cardinality. This is the object’s number of rows in a table or number of unique values within a column. The cardinality of a column is the number of entries in the dictionary. The number of unique values used in the column could be smaller if the dictionary has unused entries. We consider a cardinality of 1,000,000 or more to be large. However, a table can have 100,000,000 rows without experiencing performance issues. The cardinality of a column is more important than the cardinality of a table from a performance point of view.

Rows. This is the number of rows in a table, partition, or segment.

Columns Hierarchies Size. This is the number of bytes of automatically generated hierarchies for columns (used by MDX), also called attribute hierarchies.

User Hierarchies Size. This is the number of bytes of user-defined hierarchies.

Relationship Size. This is the number of bytes of relationships between tables.

Columns Total Size. This is the number of bytes of all the structures related to a column (that is, the sum of Data Size, Dictionary Size, and Columns Hierarchies Size).

Dictionary Size. This is the number of bytes of dictionary structures. The dictionary size is related strictly to the number of unique values in the column. The dictionary might include values no longer used in the column, in case single partitions are removed or reprocessed. (You can use Process Defrag to shrink the dictionary size in these cases.) The dictionary size is minimal (less than 100 bytes) for columns stored using value encoding, whereas a larger size usually indicates a hash encoding of the column.

Table Size. This is the number of bytes of a table (in other words, the sum of Columns Total Size, User Hierarchies Size, and Relationships Size).

Table Size %. This is the ratio of Columns Total Size versus Table Size.

Database Size %. This is the ratio of Table Size versus Database Size (that is, the sum of Table Size of all the tables). You can use this measure to quickly identify columns that have a big impact on the overall database size.

Segments #. This is the number of segments. There is always at least one segment for each partition, and all the columns in a table have the same number of segments.

Partitions #. This is the number of partitions. All the columns in a table have the same number of partitions.

Columns #. This is the number of columns.

Encoding. This describes the encoding of the columns selected, which could be HASH or VALUE. If you apply this measure to a group of columns with different encoding types, it displays Many.

You can group measures using a variety of attributes, which are already used in the PivotTables included in the VertiPaq Analyzer workbook. The most important attributes are included in the Columns and Columns Segments tables. The following Columns attributes are available:

Tables-Columns. This is a hierarchy with two levels, Table and Column, corresponding to the table and column names in the tabular model.

COLUMN_ENCODING. This corresponds to the encoding type of a column, which can be HASH or VALUE, and can be used as a filter. This differs from the Encoding measure, which displays the encoding of the selected columns.

Data Type. This corresponds to the data type of a column. The values available in VertiPaq are:

DBTYPE_CY. This corresponds to the Currency type in DAX, which is a fixed decimal number with four decimal digits of fixed precision, internally stored as an integer.

DBTYPE_DATE. This corresponds to the Date data type in DAX, which is internally stored as a floating point.

DBTYPE_I4. This does not have a corresponding data type in DAX. It is a 32-bit integer that is used for internal structures only. (You might find an internal column called RowNumber for each table here, but in certain models, these columns are stored as DBTYPE_I8.)

DBTYPE_I8. This corresponds to the Whole Number data type in DAX, which is internally stored as a 64-bit integer.

DBTYPE_R8. This corresponds to the Decimal Number data type in DAX, which is internally stored as a 64-bit floating point.

DBTYPE_WSTR. This corresponds to the Text data type in DAX, which is internally stored as a Unicode string (16 bits per character). A string column is always hash-encoded in VertiPaq.

The following attributes are available in Columns Segments, which are useful to analyze the compression applied to columns of a tabular model:

BITS_COUNT. This is the number of bits used to store the data IDs in a column’s segment. For value encoding, this is related to the range of values in the column for the segment. For hash encoding, the index pointing to the dictionary position is itself stored using value encoding. Thus, a segment can use fewer bits if the values it references were created in a contiguous area of the dictionary. Parallel processing of multiple partitions can affect this order, so in very large tables (billions of rows), you might want to prepopulate the dictionary of the columns used to partition the table (usually date-related columns).

BOOKMARK_BITS_COUNT. This is an estimate of the number of RLE bits in the segment.

COMPRESSION_TYPE. This is always NOSPLIT for regular columns and C123 for internal columns (row number). It is reserved for internal use by VertiPaq.

PARTITION_NAME. This is the name of the partition, as defined in the tabular model.

SEGMENT_NUMBER. This is an integer number identifying the segment in the table.

TABLE_PARTITION_NUMBER. This is an integer number identifying the partition in the table. This number is generated by SSAS.

VERTIPAQ_STATE. This is the state of the VertiPaq compression for a segment. It can be set to the following states:

COMPLETED. This states that the VertiPaq compression completed successfully.

TIMEBOXED. This states that the VertiPaq compression was timeboxed. If you find many columns with this state in the data model, you might consider reprocessing the table with an increased value for the ProcessingTimeboxSecPerMRow setting in the SSAS properties. This will provide more time to complete the compression process and produce a better compression.

SKIPPED. This states that the column was not used by the VertiPaq clustering algorithm, which determines the degree of compression for each column. This always happens for internal columns, such as RowNumber, and could also happen for other columns, such as a high-cardinality column that contains a floating-point number. A column with hash encoding can have a SKIPPED state.

Memory usage in VertiPaq

All the database columns are stored in memory, making this the most relevant factor of memory consumption. In addition, VertiPaq needs other memory during a table refresh (we call this the processing phase) and when querying the tables.

Data memory usage

A common question when designing a database solution is how to forecast the database size, from which you derive disk and memory sizing. Knowing the database size is important when assessing the amount of memory required to run Analysis Services, which uses VertiPaq to store the database in memory. However, with this engine, answering this question is not easy.

Estimating database size is a relatively easy exercise for a relational database in SQL Server. This is because almost all the data is usually stored in a small number of tables. You can estimate the size of each table by calculating the product of the number of rows multiplied by the size of each row. The number of rows is usually estimated (for example, the number of customers, the number of transactions in five years, and so on), whereas the size of the row depends on the number of columns and their types. Having a sample set of data is often sufficient to make a good estimation of the average row size (you might have variable-length columns), including indexes. Detailed techniques for making such estimations are well known and documented. A good starting point is http://msdn.microsoft.com/en-us/library/ms187445.aspx.

With VertiPaq, estimating database size is a much more difficult exercise because it is very difficult to know in advance what the most important drivers are that will affect the database size. In fact, the number of rows is no longer the most important factor for determining the size of a table! The table size in VertiPaq is determined by a combination of the following factors:

The number of columns

The cardinality of each column

The data type of each column (for strings, the average size is relevant)

The number of rows

Not only do you have many variables to consider, but there is also no linear formula that produces an estimate, when starting from these values. If you are only interested in the size of the bigger tables within a database, you can get a rough estimate by using the following formula:

TableSize ≈ RowCount × Σ (AverageDictionaryCost of each column)

This formula is not easy to apply. The average column cost can be quite different among columns, and it largely depends on the size of the dictionary, which is based on the number of distinct values in the column. You can see that adding rows to a table does not necessarily mean that you have a linear growth of the table size. In fact, if you add rows that use existing values in column dictionaries, you use only the first part of the multiplication (RowCount). If you add values that also increase the dictionary size, the AverageDictionaryCost for affected columns will increase, which results in a product that grows faster. Finally, the effect of adding a column depends on the size of the dictionary, so adding a column with low cardinality costs less than adding a column with high cardinality.

This is a general principle that helps you to estimate. However, it is much harder to translate these general concepts into concrete numbers because the dictionary cost depends on many factors, such as different data types, dictionary strategies, string length, and so on. VertiPaq automatically uses different types of dictionaries, depending on the type and data distribution of each column.

For these reasons, we suggest basing any estimation on a heuristic approach. Use a significant amount of real data and measure the size of a processed table. Then double the number of rows and measure the increment in size. Double it again, and then measure again. You will obtain a more accurate estimate in this way than by using a theoretical approach that is difficult to apply if you do not know data distribution. VertiPaq Analyzer helps you get these metrics after each test.
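
The heuristic can be turned into a small calculation: process increasing samples of real data, measure the resulting size (for example, with VertiPaq Analyzer), and extrapolate from the growth. The measurements below are invented placeholders, and the linear fit is only a rough approximation, because dictionary growth can make the real curve steeper.

# Hypothetical measurements: (rows processed, model size in MB).
measurements = [(10_000_000, 260.0), (20_000_000, 470.0), (40_000_000, 905.0)]
# Fit size = fixed_cost + cost_per_row * rows using the first and last points.
(rows_0, size_0), (rows_1, size_1) = measurements[0], measurements[-1]
cost_per_row = (size_1 - size_0) / (rows_1 - rows_0)
fixed_cost = size_0 - cost_per_row * rows_0   # roughly the dictionary/metadata share

def estimate_size_mb(rows):
    return fixed_cost + cost_per_row * rows

print(round(estimate_size_mb(100_000_000)))  # extrapolated size in MB (about 2195)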

Processing memory usage

During processing, every table is read from the data source and loaded in memory to create the dictionary of unique values and the related index for each column. If you already have the table in memory and you do not clear the table from the VertiPaq database before proceeding, you will have two copies of the table until the process transaction commits. If you enable memory paging in Analysis Services and have enough virtual memory available, the process might succeed even if you do not have enough memory to store two copies of tables that are part of the processing batch. But if Analysis Services starts paging, query and processing performance might suffer. You should measure memory consumption during processing to avoid paging, if possible.


Note

VertiPaq is designed and optimized to have the whole database loaded into memory. To store more data and improve performance, data is also kept compressed while in memory, and dynamically uncompressed during each query. This is why fast CPUs with high memory bandwidth are required. Analysis Services can handle the paging of data to disk, but this should be limited to scenarios in which the paging activity is temporary. You can disable paging by setting the MemoryVertiPaqPagingPolicy advanced property to 0. (The default is 1, which enables this behavior.) For a more detailed discussion of VertiPaq memory settings, see http://www.sqlbi.com/articles/memory-settings-in-tabular-instances-of-analysis-services.


If multiple tables or partitions are processed in the same processing batch, they are processed in parallel by default in SSAS 2016. Previous versions of SSAS Tabular processed partitions of a table serially, allowing only multiple tables to be processed in parallel.

Every partition that is processed is divided into segments, each with up to 8,000,000 rows by default. After a segment is read, each column is processed and compressed. This part of the processing can scale on multiple cores and requires more memory, depending on the number of distinct values that are present in the segment. For this reason, as you saw in Chapter 3, “Loading data inside Tabular,” sorting a table might reduce the memory pressure during processing and queries, requiring less memory to store data. Reading a smaller number of distinct values per segment improves the compression rates and memory used. Ideally, you would obtain the best results by first sorting the table by the column with the smallest number of distinct values and then adding other columns to the sort order until you arrive at the column with the maximum granularity. However, this sorting might be too expensive for the data source, so you should find the right tradeoff for tables that require more than one segment. This consideration is less important for partitions smaller than 8,000,000 rows because they will always produce a single segment and will not have the issue of distributing values across different segments of the same partition.


Important

You can optimize compression for tables with more than 8,000,000 rows by providing sorted data to Analysis Services. In Tabular, you can specify for each partition a SQL statement that contains an ORDER BY condition. Optimizing such a query in the relational database is not discussed here, but it is something to consider to keep the processing time at a reasonable level.


Querying memory usage

Every query performs one or more scans of the columns involved in the query. Some query constructs create memory pressure because they must store both temporary and intermediate results. When running very complex queries on large tables, you might observe very high memory pressure caused by the internal materialization of data when expressions cannot be managed by keeping the data in a compressed form. This could be critical if you do not have enough memory available. You should monitor memory pressure during queries to correctly size the memory available to Analysis Services. Concurrent users can increase the memory pressure if complex queries run simultaneously.


Note

High memory pressure is caused by particularly complex DAX queries. Many queries do not require much memory even when they operate on very large tables. This warning is applicable to potentially critical conditions that might be raised by a single query that exhausts server resources. A complete description of DAX queries that can increase materialization, and how to control this effect, is included in The Definitive Guide to DAX, published by Microsoft Press. In general, materialization is an issue related to specific DAX queries, not just to the tabular model.


Understanding processing options

You have learned how the VertiPaq engine uses memory. In this section, you will see the actions performed during table processing, what resources are required during this operation, and the differences between the processing options.

Every table in a tabular model is copied to an internal structure for the VertiPaq engine unless DirectQuery is being used. This operation is called data processing, and it can occur at different levels of granularity: database, table, and partition. There are no constraints on the process order, and you can process lookup tables after related tables. The storage is by column, and every column has its own data. The real dependencies are calculated columns, which must be refreshed after processing any table that is directly or indirectly referenced by the DAX expression that defines the calculated column.


Note

In a multidimensional model, you must process the dimensions before the measure groups, and you might have to process a measure group after you process a dimension, depending on the type of processing. In a tabular model, this is not required, and processing a table does not affect other processed tables. It is your responsibility to invalidate a table containing data that is no longer valid. Integrity issues are the responsibility of the source system, and these errors will not be picked up by processing a tabular model as they would be in a multidimensional model.


In this chapter, the processing options are discussed at a functional level. You can perform many (but not all) of these operations through the SQL Server Management Studio (SSMS) user interface. In Chapter 11, “Processing and partitioning tabular models,” you learned how to use these features correctly, depending on the requirements of specific tabular models. In Chapter 13, “Interfacing with Tabular,” you will learn how to control process operations in a programmatic way by accessing all the available features.

What happens during processing

When you execute a full process on a tabular database, every partition of every table is processed, and all the calculated columns are refreshed. However, you might want to control this operation to optimize the processing time and memory required, especially for larger models. For example, you can process only those tables that have been updated on the data source, and by using partitioning, you can limit the table-refresh operation to only the part of the table that contains changed data. Because each table is processed independently of other tables, you can examine the process operation at a table level. The same operation is repeated for every table involved in a process batch.

A table or partition process operation is divided into the following two steps:

1. Process data The rows are sequentially read from the source table. If the process is on a table, a single dictionary is created for each column, and data files are created for each partition and column. If the process is on a partition, and a dictionary already exists, new values are added to the dictionary. In addition, existing values are kept in the dictionary even if they no longer exist in any source-table rows. This update of the dictionary can cause fragmentation, which you can eliminate with a specific operation that you will learn about in the next section. The data files of the processed partition are always rebuilt.

2. Process other structures Many objects can be created or refreshed only after data processing has completed. Some of these structures can belong to tables other than those that have been processed. These structures are as follows:

Calculated columns and tables. All the calculated columns in the processed tables must be refreshed. Moreover, any calculated column or calculated table that contains an expression with a direct or indirect dependency on a processed table must also be refreshed. The calculated columns must be calculated for all the partitions of a table, regardless of the partitions that have been processed. Calculated columns are split into the same number of segments as the table they belong to. However, calculated tables are split into segments according to the setting of the DefaultSegmentRowCount VertiPaq property.

Relationships. If the table contains references to other lookup tables, the internal relationship structures must be rebuilt whenever either that table or the lookup table is processed. However, the cost of rebuilding relationship structures depends on the size of the dictionary of the involved column—not on the table size or the number of partitions.

Hierarchies. Attribute and user hierarchies in a table have structures that must be rebuilt after the data has been processed.

There are specific instructions to control these two steps of a process operation. In fact, the processing of other structures is usually done after all the data processing has been completed. Processing operations are memory-intensive and processor-intensive. During data processing, partitions are loaded serially, but other parts of the processing can use additional cores if available. Moreover, different tables can be processed in parallel, which might increase the memory requirements. You can control the process operation to avoid a high peak level of memory usage.

Because of this condition, Analysis Services might use paging during processing to obtain more memory than is physically available to complete the operation. However, if Analysis Services receives a query on the existing tabular model that has been paged out during a process operation, you might observe a severe degradation in query performance that also affects processing time. You must measure the peak level of memory usage and the paging activity during the processing operation. If the processing operation exhausts server memory resources, you must adopt countermeasures.

Available processing options

Several processing options can be applied granularly to one or more objects. A more detailed operational guide is available in Chapter 11, and further details can be found in the blog post at https://blogs.msdn.microsoft.com/cathyk/2011/09/26/processing-tabular-models-101/.

In general terms, for each table, you can perform one of the following operations:

Process Data. This operation reads from the data source and then populates a partition or a table. After that, the related structures of the table involved (even if you have loaded only a single partition) are no longer valid and require a Process Recalc operation.

Process Recalc. This refreshes all the table-related structures (calculated columns, calculated tables, indexes, relationships, and hierarchies) when the underlying data in the partition or tables is changed. This operation can be directly executed only at the database level, affecting all the structures of all the tables that require refreshing. If you want to refresh only the objects affected by a change in a single table (leaving other invalidated structures to be updated later), you must use the Process Full or Process Default processing options. These internally perform a recalculation of only those objects that have been invalidated by other operations.

Process Full. This performs both a Process Data and a Process Recalc of the related structures. When it is requested at a database level, it rebuilds all the tables. When it is executed on a table or partition, it performs a Process Data operation on the requested object. It is followed by a Process Recalc operation that is limited to the structures that are related to the table involved in the first operation. Process Add is a special case of Process Full, because it runs an incremental process that adds rows to an existing partition. If you execute Process Full or Process Add operations on two or more tables in separate transactions, these operations might generate several Process Recalc steps for the same structures in other tables. An example is when you have a calculated column in another table that depends on the data from the table that is being processed by one of these commands.

Process Defrag. This optimizes the table dictionary and rebuilds table data for all the partitions. If you process data at a partition level, you might end up with a fragmented dictionary that contains entries that are no longer used in the table. For example, suppose you remove a partition from a table. All the values that were used only in that partition will still be part of the dictionaries of hash-encoded columns even though they are no longer referenced by the column’s data structures. This does not affect the correctness of the query result, but it does consume more memory and might indirectly affect performance. Process Defrag might be an expensive operation for a large table. This operation can be executed at the table or database level. (The latter is not exposed in the SSMS user interface.)

As a rule, to minimize processing time when you do not have to fully process the whole database, you should use Process Data for the tables or partitions that must be refreshed. After all the data has been loaded, you can perform a single Process Recalc at the database level, processing the affected structures only once. In this way, if a calculated column depends on two or more modified tables, it will be refreshed just once. However, some structures (indexes and calculated columns of the processed and related tables) will not be available from the first Process Data operation until the final Process Recalc completes at the database level, unless all the operations are executed in a single transaction. A single transaction requires more memory because two copies of data will be stored in memory during processing. Choosing other strategies, such as performing Process Full on single partitions or tables in separate batches, might require a longer time because the related structures will be processed more than once. However, it will reduce or eliminate the period of time during which the tabular model is unavailable.


Note

Compared to a multidimensional model, the processing options in tabular models are simpler and easier to manage. You do not have the strong dependencies between dimensions and measure groups that affect which structures require refreshing, even though the relationships between the tables and the formulas in calculated columns do define dependencies between those tables. However, because these operations have a column granularity, the actual cost is limited to the parts of the table that require refreshing. Moreover, the unavailability of data in a tabular model can be limited to the calculated columns that require refreshing rather than affecting the whole table, as you might expect if you come from a multidimensional background.


Summary

In this chapter, you saw that VertiPaq is an in-memory, column-oriented database, and you learned the internal structures used to store data. Because the VertiPaq engine stores data in memory, it is critical to understand how data is compressed and which columns cause additional memory pressure (usually because of their data-dictionary size). Finally, you learned how VertiPaq processes data and how to control the process phases to minimize the latency and optimize the data structures.
