Chapter 7. Understand table features

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 7

Understand table features

A key aspect of relational databases is how data is stored in tables. It is important to understand both how tables should look to the user and how they are structured internally.

This chapter covers fundamentals including data types, keys, and constraints. It also covers special table types, including temporal tables and graph tables. At times, organizations choose to store binary large objects (BLOBs) within relational tables. This chapter covers the implications of storing this type of data within SQL Server, as well as other important table-related concepts including vertical and horizontal partitioning and change-tracking methods. Finally, it covers how PolyBase can help you use connectors to interact with data sources outside of SQL Server.

A proper relational database design requires considerations beyond the SQL Server features included in this chapter. Mapping application requirements, normalization, and organization-specific requirements are not covered in this book. There are many texts available to teach you those elements of relational database design, starting perhaps with the theory writings of relational model innovators C. J. Date and E. F. Codd.

All scripts for this book are available at https://www.microsoftpressstore.com/SQLServer2022InsideOut/downloads.

Review table structures

This section reviews information that is relevant when creating tables. First, it looks at system data types, emphasizing the data design decisions surrounding their use. Next, it briefly discusses primary and foreign key concepts. Then, it covers constraints, their impact on table design, and how they can help meet data integrity requirements. The section ends with user-defined data types and computed columns.

Note

Beyond coverage of primary keys and unique constraints, indexing is not covered in this chapter, although table design is not complete without considering it. For guidance on indexing, read Chapter 14, “Performance tune SQL Server,” and Chapter 15, “Understand and design indexes.”

General-purpose data types

Selecting the appropriate data type when designing relational databases is crucial. You can change a column’s data type after the fact, but doing so can be an expensive operation. A poorly chosen data type can result in suboptimal performance or might allow for unexpected values to be stored in the column. Therefore, proper data type choice becomes a decision point about performance, data integrity, and even application security.

The intent of this section is not to provide exhaustive coverage of each system data type available in SQL Server; rather, it’s to provide the information and guidance necessary to make solid table design decisions.

Alphanumeric types

Alphanumeric types (also known as strings) are commonly discussed in terms of fixed versus variable length, and with Unicode versus without Unicode support. The char and nchar data types are fixed length, and varchar and nvarchar are variable length. The difference is how each is encoded. The nchar and nvarchar data types are always encoded as 16-bit Unicode, using UTF-16. In contrast, char and varchar use 8-bit data types that store data in ASCII or, starting in SQL Server 2019, UTF-8. More information about these new collations and their purpose is included later in this section.

As a database designer, you must understand that the (n) in a [var]char(n) column definition indicates the number of bytes allocated for the column, not the number of characters that can be stored. The same is true for n[var]char(n) columns, though the size indicates the number of byte-pairs that can be stored. This is important because:

[var]char columns can store strings from double-byte character sets, and can use UTF-8 collations, which may require 2 or 4 bytes to store one character. The following subsection includes full coverage of UTF-8 collations.
n[var]char columns can store characters in the Unicode supplementary character range, which may require 4 bytes.

Note

You might be tempted to use an ASCII varchar data type to save space. However, you may need Unicode support more often than you think, and it often starts with people’s names. Additionally, users expect to store emojis and other Unicode character data in columns. Finally, increasing internationalization of applications is also best supported by using Unicode string data types.

Collation

With string data, collation becomes an important consideration. This is determined using the code page, which is one element of the collation. Collation also determines how data is compared and sorted, such as whether casing and accented letters are considered to be different.

Note

If the full range of Unicode characters must be supported in a column, the collation should be set to a supplementary characters collation. These collations’ names end in _SC and have been available since SQL Server 2012. The most frequently used characters have Unicode point values between 0x20 and 0xFFFF (point values below 0x20 are control characters). Thus, without using supplementary characters, 65,515 characters can be represented. Those include accented letters for most languages, many symbols, characters for Asian and Cyrillic languages, and many more.

For more information about Unicode supplementary characters, see https://learn.microsoft.com/sql/relational-databases/collations/collation-and-unicode-support#Supplementary_Characters.

SQL Server 2019 introduced a new family of collations that support UTF-8. These collations apply only to the char and varchar data types and store the string data using UTF-8 encoding. They effectively turn these two data types into Unicode data types, including support for supplementary characters. When you define a column or conversion to use a UTF-8 collation, the encoding is automatically updated.

Among other things, collation determines how the high-order bits in each character’s byte(s) are interpreted. Collation supports internationalization by allowing different character representations for characters whose integer values are greater than 127, up to 255.

Inside OUT

Should you expect space savings when using UTF-8 collations?

Answering this question requires at least a cursory understanding of how character data is encoded in UTF-8. The number of bytes required to encode characters varies from 1 to 4. The characters at the lowest code points (0–127) require only 1 byte, just like other collations. This is designed to maintain compatibility with ASCII. These 128 characters are the most common characters in Latin script, including uppercase and lowercase letters, digits, and many punctuation marks. However, code points 128–2,047 already require 2 bytes in UTF-8, while only requiring 1 byte in the non-UTF-8 collations. That code point range includes Latin accented characters, some additional punctuation marks, such as the inverted exclamation point (¡), and symbols (such as currency symbols). UTF-8 requires more storage space than UTF-16 for characters in the code point range 2,048–65,535. An extra script file in the book’s downloads illustrates some of these caveats.

Thus, whether you should expect space savings is quite nuanced and depends on how many characters in the average varchar column will require more than 1 byte to be encoded (when compared to varchar using collations other than UTF-8) or more than 2 bytes (when compared to nchar or nvarchar). For an internationalized application, this might be difficult to forecast.

We do not suggest switching collations to UTF-8 for the purpose of saving storage space. UTF-8 collations are designed to support internationalization of existing applications and databases without incurring massive changes and associated test requirements. For new application or database development where internationalization is expected, using UTF-16 encoding (with the nchar and nvarchar data types) is highly recommended.

For complete details about collation and Unicode support, visit https://learn.microsoft.com/sql/relational-databases/collations/collation-and-unicode-support.

Caution

You define the column width by the number of bytes, never by the number of characters that can be stored. If you decide that UTF-8 is the right encoding to use for an existing database, you need to ensure that the column width in bytes can accommodate the potentially larger size of the existing column values once converted to UTF-8. During the collation conversion, SQL Server will silently truncate any values that do not fit.

Before converting, you can determine if any strings will require more bytes than the column width supports using a Transact-SQL (T-SQL) statement like the one that follows, where val is a varchar(8) column in a table called CollationTest:

Click here to view code image

-- If COUNT > 0, then there are rows whose data size will be larger than the
-- current column width supports
SELECT COUNT(*)
FROM dbo.CollationTest
WHERE DATALENGTH(
    CAST(CAST(val AS VARCHAR(32))
        COLLATE Latin1_General_100_CS_AS_SC_UTF8 AS VARCHAR(32))) > 8;

This WHERE clause is used to determine which values will no longer fit in the width of the column (8 bytes), after altering the column to use a collation from the UTF-8 family of collations, where a single character might take up four times as much space (8 * 4 = 32). Quadrupling the byte count of the source column is not necessary; any value larger than the source column width will do.

Two CASTs are required:

The inner CAST converts val from varchar(8) to varchar(32) to simulate the UTF-8 column width.
The outer CAST converts val at the UTF-8 collation. This needs to be varchar(32) to analyze the converted UTF-8 records in the column val.

Large value data

No discussion of alphanumeric types would be complete without an examination of varchar(max) and nvarchar(max). By specifying max instead of a value between 1 and 8,000 bytes (for varchar) or between 1 and 4,000 byte-pairs (for nvarchar), the storage limit increases to 2 GB. If the column’s value exceeds 8,000 bytes, the data is not stored in the table’s storage structure. Large value data is stored out of row, though for each such column, 24 bytes of overhead is stored in the row. Of those 24 bytes, the first 16 bytes are used to store metadata, and the last 8 bytes contain the pointer to the row in the row-overflow page.

Note

The details of storing large value data also apply to the varbinary(max) and xml data types, both of which are discussed later in this chapter.

SQL Server has a row size limit of 8,060 bytes. Even if you do not use [n]varchar(max) columns, some data may be stored out of row or “off-row.” Any data that is stored off-row will incur some overhead when it is read. The flip side is that when a T-SQL statement does not reference a column whose data is stored off-row, there is a performance benefit. If your table’s usage patterns indicate that large value type columns are not frequently included in statements, you can optimize performance by storing the data off-row, even if the row size is less than 8,000 bytes. The following T-SQL statement enables the large_value_types_out_of_row option for the PurchaseOrders table in the WideWorldImporters sample database:

Click here to view code image

DECLARE @TableName NVARCHAR(776) = N'Purchasing.PurchaseOrders';
-- Turn the option on
EXEC sp_tableoption @TableNamePattern = @TableName
    , @OptionName = 'large value types out of row'
    , @OptionValue = 1;
GO
-- Verify the option setting
SELECT [name], large_value_types_out_of_row
FROM sys.tables
WHERE object_id = OBJECT_ID(@TableName);

After this is run, the values are not immediately migrated to out of row storage. This is true for any table, regardless of whether there is data populated. For example, the preceding table has no values in any of the varchar(max) columns; only when an existing data row is updated will the values be stored out of row. You could force such updates to happen by executing an UPDATE statement that sets the column value to any value or even itself, although this operation will be quite expensive on large tables. We don’t recommend this unless you have determined that the immediate benefit of forcing those values to be stored off-row exceeds the cost of the update operation.

In the T-SQL CREATE statement (but not in an ALTER statement), you can opt to store the data for large value type columns in a separate filegroup. In the CREATE TABLE statement, use the TEXTIMAGE_ON clause to specify the name of the filegroup where large object (LOB) data should be stored. If you want to change the TEXTIMAGE_ON setting, you will need to create a new table and copy the data in the table.

If you need to store more than 2 GB in a single column, consider using the FILESTREAM feature, discussed in the “Store large binary objects” section later in this chapter.

Numeric types

When considering numeric types in computer systems, it is important to understand the nature of your data. One of the most important concepts to understand is the difference between exact and approximate numeric types.

Approximate numeric types store values using a floating-point structure. In SQL Server, the number of bits in the mantissa is limited to 24 or 53, resulting in a respective precision of 7 or 15 digits. Due to the nature of the structure and the limited number of bits, these types cannot accurately store all numbers in the supported range. On the other hand, although exact types store numbers without losing precision, this comes at a loss of range. For approximate floating-point types, the range is very large and useful for scientific-like numbers and operations, where a small loss of precision might not matter. Math with these values is implemented in hardware, so performance is far better than anything other than integers. For exact types, the range is limited, but sufficient for operations requiring precision, such as those involving monetary values.

SQL Server provides real and float as approximate data types, although their implementation is closely related. The real data type is lower precision than the float data type. It is possible to specify the number of bits for the mantissa when defining float, but SQL Server will always use either 24 bits or 53 bits; any other value you specify is rounded up to either 24 or 53. The real data type is the same as specifying float(24), or in effect any number of mantissa bits between 1 and 24.

For more information about floating-point values, see Randolph West’s blog post on how these values are implemented, at https://bornsql.ca/blog/how-sql-server-stores-data-types-floating-points.

Note

The sample scripts for this chapter include an extra file that illustrates important caveats when converting from approximate floating-point numbers to exact types.

Exact numeric types include tinyint, smallint, int, and bigint, which are all whole numbers of varying byte sizes and therefore range. SQL Server does not support unsigned integers.

Some exact numeric types support decimal-point numbers. Foremost among these is the decimal data type. In SQL Server, another name for decimal is numeric. The decimal data type supports a precision of up to 38 digits, before or after the decimal point. The number of digits determines the storage size. In addition, you can specify the scale, which determines the number of digits to the right of the decimal point.

Other exact numeric types that support decimal point numbers are money and smallmoney. They have the same range as int and bigint, but with the decimal point shifted four places to the left. Because of this, any math calculations will be treated like integers and can be done in registers in the CPU. This enables them to perform calculations faster than non-integer numbers. These data types can store monetary data with a precision of up to four digits to the right of the decimal point—in other words, to the ten-thousandth.

Choosing between decimal and money or smallmoney is primarily determined by your need for range and precision. For monetary values, and if your multiplications and divisions will always return the desired result when using only four significant digits to the right of the decimal point, smallmoney and money may be good choices because they are more efficient in terms of storage space. For higher precision and larger scale, decimal is the right choice. In addition, decimal may be a better choice if the operations performed on the data create precision issues due to intermediate steps in math using only four digits.

Inside OUT

When should you use a numeric data type instead of a character data type?

You can store any numeric value, such as an amount or an identifier consisting only of digits, in an alphanumeric column or in a numeric data type column. Generally, you should choose a numeric data type if the values will be used in some type of calculation or when magnitude matters—for example, to calculate a discount on a monetary value. Another example for which numeric data types are used is in quantities because you might need to adjust the quantity by adding or subtracting additional units. On the other hand, a US ZIP code is best stored as an alphanumeric value because leading zeros must be preserved. The same can be true in an employee ID number or any other values that are not used in mathematical functions.

In addition to considering whether you need to use the value in calculations, you also need to consider the differences in how values are sorted. In a numeric column sorted in ascending order, the value 12 will come before 100. But in an alphanumeric column, 100 will come before 12. Either one can produce the desired answer based on your use case.

Date and time types

Date and time data types available in SQL Server 2022 include the aged datetime and smalldatetime types. Although these are not technically deprecated, we strongly caution against using them for new development due to issues surrounding precision, available date range, and lack of control over the precision and storage size. Additionally, these data types are not aligned with the SQL standard, lowering portability of the data between platforms. Their immediate replacement is datetime2, which in no case consumes more than 8 bytes of storage space (the same as datetime), but addresses precision, increases the date range, and can store dates in less than 8 bytes in return for lower precision. As a matter of detail, specifying datetime2(3) provides the same precision as datetime, but does so while requiring 1 fewer byte.

Note

All date and time data types discussed here are available in all currently supported versions of SQL Server. They are by no means new data types, but, unfortunately, are too frequently left unused for fear of backward-compatibility problems.

This does not mean, however, that all date and time-of-day values should be stored in datetime2. There are three additional data types that you should consider for storing date or time values:

date. If you need to store only a date without time or time zone information, this is your best choice. The date data type stores only a date and supports the same date range as datetime2. It stores the date in only 3 bytes, making it much more efficient than datetime (fixed at 8 bytes) and datetime2 (minimally 6 bytes). An example of such a case is a date of birth. A date of birth is commonly stored to calculate someone’s age, which is not generally treated as dependent on the time zone or on the time. Assume a person is born at 11 p.m. Central European Summer Time. If they moved to Southeast Asia, they would not celebrate their birthday a day later, even though the actual point in time of their birth was the next day in Southeast Asia. (That being said, some applications, such as one used in a neonatal facility, might need to store a more precise “time of birth,” so make sure you understand the requirements before choosing your data type.)
datetimeoffset. This data type provides the same precision and range as datetime2 but includes an offset value in hours and minutes to indicate the difference from UTC. This data type neither tracks nor understands actual time zones or daylight saving time (DST). It would be up to the application to track the time zone where the value originated to allow the application or recent versions of SQL Server to perform correct date arithmetic. (See the following note for more information.)

Note

Before SQL Server 2016, SQL Server had no understanding of time zones or DST. SQL Server 2016 and Azure SQL Database introduced the AT TIME ZONE function, which converts between time zones and applies or reverts a DST offset. The rules SQL Server applies are based on the Windows functionality for time zones and DST. These rules are explained and illustrated with examples at https://learn.microsoft.com/sql/t-sql/queries/at-time-zone-transact-sql.

With SQL Server on Linux, AT TIME ZONE returns the same results as executing the function on a Windows host.
time. This data type stores a time-of-day value consisting of hours, minutes, seconds, and fractional seconds, with a precision up to 100 nanoseconds. The exact fractional second precision and storage size is user defined by specifying a precision between 0 and 7. The time data type is a good choice when storing only a time-of-day value that is not time-zone sensitive, such as for a reminder. A reminder set for 11 a.m. might need to be activated at 11 a.m. regardless of time zone and date.

Note

The time data type can store no more than 23 hours, 59 minutes, 59 seconds, and 0.9999999 fractions of a second. This can make this data type unsuitable for storing elapsed time if there is a chance that elapsed time will be 24 hours or more. We typically suggest storing elapsed time in an integer value that holds the number of time units that have passed, based on the minimum precision you desire. Any fields that could exceed 2.1 billion in the lifetime of the application should use the bigint data type.

Inside OUT

How do you correctly retrieve the current system date and time?

In addition to continued use of the datetime data type despite the availability of the better datetime2 type, we also observe the common use of the lower-precision functions CURRENT_TIMESTAMP, GETDATE(), and GETUTCDATE(). Although these functions continue to work, they return values of the datetime type.

There are replacement functions available in SYSDATETIME(), SYSDATETIMEOFFSET(), and SYSUTCDATETIME(). Although their names don’t make it immediately clear, SYSDATETIME() and SYSUTCDATETIME() return the improved datetime2(7) type. SYSDATETIMEOFFSET() also returns a value of type datetimeoffset(7). SYSDATETIME() and SYSUTCDATETIME() are functionally equivalent to GETDATE() and GETUTCDATE(), respectively. SYSDATETIMEOFFSET() does not have a functional equivalent and is thus the only option if you need to include the time zone offset on the server other than using the AT TIME ZONE function.

Even if you cannot modify the schema of your database to change all datetime columns to datetime2 (or datetimeoffset), you might benefit in the long term from adopting the improved functions now. Even though the range of valid dates for datetime is much smaller on the lower end than for datetime2 and datetimeoffset, the upper end is the same (December 31, 9999). Discarding the future possibility of time-travel to before the year 1753, none of the improved functions will return a datetime2 or datetimeoffset value that cannot be cast to datetime.

For detailed technical comparisons between the available date and time data types, visit https://learn.microsoft.com/sql/t-sql/functions/date-and-time-data-types-and-functions-transact-sql.

Binary types

Some data cannot be efficiently represented as an alphanumeric string. For example, data that has been encrypted by the application should be stored as a binary value. The same might also apply to storing contents of binary file formats, such as PDF files.

SQL Server provides the binary data type to store fixed-length binary values, and varbinary to store variable-length binary values. (The image data type has been deprecated for almost two decades, and you should not use it.) For both data types, you specify the number of bytes that will be stored, up to 8,000. If you need to store more than 8,000 bytes, you can specify varbinary(max). This will allow up to 2 GB to be stored, although if the value exceeds 8,000 bytes, those bytes are not stored in the data row. As with [n]varchar, varbinary values may be stored out of row if the total row size would exceed 8,000 bytes.

Note

When storing binary values that are on average larger than 1 MB, you should review whether using FILESTREAM is not a better choice. FILESTREAM is discussed in the “Understand FILESTREAM” section later in this chapter.

Refer to the section “Large value data” earlier in this chapter for details on how varbinary(max) data is stored and access can be optimized.

Specialized data types

In addition to the data types that are designed to store traditional numeric, alphanumeric, and date and time values, SQL Server provides more specialized data types. These data types are more specific to certain use cases than the general-purpose data types.

Some specialized data types have SQL common language runtime (CLR) functions that make working with them significantly easier. For example, the hierarchyid data type has a ToString() function that converts the stored binary value into a human-readable format. These SQL CLR function names are case-sensitive, regardless of the case sensitivity of the instance or database.

Spatial data types: geometry and geography

The spatial data types provide a way to work with flat (planar) or ellipsoidal (round-earth) coordinates. The geometry data type is for a flat coordinate system, whereas the geography data type is for round-earth coordinates. In addition, both data types also support elevation, or Z, values. Both data types are CLR types that are available in every database, regardless of whether the SQL CLR feature is enabled.

SQL Server provides several methods to work with the values of these data types, including finding intersections, calculating surface area and distance, and many more. SQL Server supports methods defined by the Open Geospatial Consortium (OGC) as well as extended methods designed by Microsoft. The methods defined by the OGC are identified by their ST prefix.

Generally, you create a geometry or geography value by using the static STGeomFromText method. You can use this method to define points, lines, and polygons (closed shapes). The code example that follows creates two geometric points, one with coordinates (0, 0) and the second with coordinates (10, 10). Then, it calculates and outputs the distance between both points:

Click here to view code image

-- Define the variables
DECLARE @point1 GEOMETRY, @point2 GEOMETRY, @distance FLOAT;
-- Initialize the geometric points
SET @point1 = geometry::STGeomFromText('POINT(0  0)', 0);
SET @point2 = geometry::STGeomFromText('POINT(10 -10)', 0);
-- Calculate the distance
SET @distance = @point1.STDistance(@point2);
SELECT @distance;

The result in the output is approximately 14.14. (See Figure 7-1; note that no units are defined here.) The second argument in the STGeomFromText method is the spatial reference ID (SRID), which is relevant only for the geography data type. Still, it is a required parameter for the function, and you should specify 0 for geometry data.

A diagram showing point 1, shown as A, as the coordinates (0,0) in the upper left corner. Going out 10 units to the right and down 10 units, there is point 2, shown as B, as (10, 10). A line is shown from point A to point B. The line starts in the upper left corner and goes down at a 45-degree angle to point B. The line is indicated to be approximately 14.14 units long. — **Figure 7-1** The geometry defined in the sample script.

Using spatial data types in a database is valuable when you use the Database Engine to perform spatial queries. You have probably experienced the results of spatial queries in many applications—for example, when searching for nearby pizza restaurants on Bing Maps. Application code can certainly also perform those spatial queries; however, it would require the database to return all pizza restaurants along with their coordinates. By performing the spatial query in the database, the size of the data returned to the application is significantly reduced. SQL Server supports indexing spatial data such that spatial queries can perform optimally.

For a complete reference on the geometry and geography data types, the methods they support, and spatial reference identifiers, visit https://learn.microsoft.com/sql/t-sql/spatial-geometry/spatial-types-geometry-transact-sql and https://learn.microsoft.com/sql/t-sql/spatial-geography/spatial-types-geography.

Note

For an example of the geography data type, refer to the WideWorldImporters sample database. The Application.StateProvinces table includes a Border column of type geography. To visually see the geography data type at work, run a SELECT statement on all rows in the table using SQL Server Management Studio (SSMS). In addition to the row results, SSMS will display a Spatial results tab on which a map of the globe will be drawn.

The XML data type

The xml data type is designed to store XML documents or snippets. But support for XML goes beyond just storing XML data. The XML data type can enforce an XML schema, in which case the column is referred to as typed. XML data can also be queried using XQuery syntax. SQL Server further supports XML by formatting relational data output as XML or retrieving XML data in a relational structure.

A relational database is generally used to store highly structured data, by which we mean data that has a known schema. And even though schemas can change, at any given time every row in a table will have the same columns. Yet, for some scenarios, this strict schema is not appropriate. It might be necessary to accommodate storing data where different rows have different attributes. Sometimes, you can meet this requirement by adding additional nullable sparse columns.

A column set is a feature by which you can manage a group of sparse columns as XML data. Column sets come with significant limitations. Defining many sparse columns becomes onerous because a substantial number of columns can introduce challenges in working with the table. There, just storing the data as plain XML in an xml data type can alleviate the column sprawl. Additionally, if data is frequently used in XML format, it might be more efficient to store the data in that format in the database.

You can read more about sparse columns in the “Sparse columns” section later in this chapter. For detailed guidance on the use of column sets, see https://learn.microsoft.com/sql/relational-databases/tables/use-column-sets.

Although XML data could be stored in (n)varchar columns, using the specialized data type allows SQL Server to provide functionality for validating, querying, indexing, and modifying the XML data.

Note

SQL Server 2022 introduces XML compression, which can dramatically reduce the amount of storage required for XML data and XML indexes.

Refer to the section “Large value data” earlier in this chapter for details on how xml data is stored and access can be optimized.

Inside OUT

How can you work with JSON in SQL Server just like XML?

SQL Server 2016 introduced support for JSON, though it is not a distinct data type like XML. JSON support includes parsing, querying, modifying, and transforming JSON stored in varchar columns using functions. The brief sample that follows illustrates how to check if a value is valid JSON, using the ISJSON() function, and extracting a scalar value, using the JSON_VALUE() function.

Click here to view code image

DECLARE @SomeJSON nvarchar(50) = '{ "test": "passed" }';
SELECT ISJSON(@SomeJSON) IsValid, JSON_VALUE(@SomeJSON, '$.test') [Status];

The output from this code has two columns. The first column has the value 1 because the variable holds valid JSON data, and the second contains the value passed.

For an additional example, examine the columns CustomFields and OtherLanguages in the Application.People table in the WideWorldImporters sample database. OtherLanguages is a computed column, which extracts some JSON data from the CustomFields column using the JSON_QUERY() function.

For complete information on handling JSON-formatted data in SQL Server and Azure SQL Database, refer to https://learn.microsoft.com/sql/relational-databases/json/json-data-sql-server.

The rowversion data type

This data type generates a database-wide unique binary value upon each modification of row data. This binary value increments with each INSERT or UPDATE statement that affects the row, even if no other row data is modified. A common function of this data type is as a row change indicator for use with applications that use optimistic concurrency or as a database-wide change indicator.

Note

The rowversion data type was previously known as timestamp. rowversion is the recommended name to use; timestamp is deprecated. Unfortunately, SSMS does not support the use of rowversion in the table designer or when scripting a table; it continues to use timestamp.

The name timestamp is the same as the SQL ISO standard timestamp, but it does not work according to the ISO standard. Contrary to what the timestamp name might imply, the data in a rowversion column does not map to a moment in time.

When designing tables with rowversion, keep the following restrictions in mind:

A table can have only a single rowversion column. Considering the context of rowversion, this restriction is perfectly sensible, and we’ve included it here only for completeness.
You cannot specify a value for the rowversion column in INSERT or UPDATE statements. However, unlike with identity or computed columns, you must specify a column list in INSERT statements for tables with a rowversion column. Note that specifying the column list is recommended anyway.
Although the Database Engine will not generate duplicate rowversion values within a database, rowversion values are not unique across databases or instances.

Duplicate rowversion values can exist in a single database if a new table is created by using the SELECT INTO syntax. The new table’s rowversion values will be the same as those of the source table. This behavior might be desired when, for example, modifying a table’s schema by creating a new table and copying all the data into it. In other instances, this behavior might not be desired. In those cases, you should not include the rowversion column in the SELECT INTO statement. You should then alter the new table to add a rowversion column. This behavior and workaround are illustrated in an extra sample script file in the accompanying downloads for this book.

Implement optimistic concurrency

Including a rowversion column in a table is an excellent way to implement a row change indicator to achieve optimistic concurrency. With optimistic concurrency, a client reads data with the intent of updating it. Unlike with pessimistic concurrency, however, a lock is not maintained. Instead, in the same transaction as the UPDATE statement, the client will verify that the rowversion was not changed by another process. If it wasn’t, the update proceeds. But if the rowversion no longer matches what the client originally read, the update will fail. The client application can then retrieve the current values and present the user with a notification and suitable options, depending on the application needs. Many object-relational mappers (ORMs), including Entity Framework, support using a rowversion column type to implement optimistic concurrency.

Inside OUT

Can you implement optimistic concurrency without rowversion?

There are other ways to implement optimistic concurrency. A client application can track the value of each individual column in the updated row, then verify that only the columns affected by its own UPDATE statement have not been modified. Specifically, client A reads a row of data and intends to change only the Name column. Client B reads the same row of data and updates the Address column. When client A attempts to update the Name, it finds that the Name column’s value is unchanged and will proceed with the update.

This approach is suitable in some scenarios, but it has some drawbacks. First, each client needs to maintain additional state information—namely, the original value of each column. In a web application, the amount of state information to maintain can grow very large and consume a lot of memory. In a web farm scenario, maintaining such state information might require shared state configuration because the web client might not communicate with the same web server on the POST that it did on the GET.

Perhaps more importantly, the data row can be inconsistent after the second update. If each client updates a column in the same row, the row’s data might not reflect a valid business scenario. Certainly, the row’s values would not reflect what each client believes it would be.

Alternatively, you can use read committed snapshot isolation (RCSI) to enable optimistic concurrency at the database level, which is discussed in more detail in Chapter 6, “Provision and configure SQL Server databases.”

The uniqueidentifier data type

The uniqueidentifier data type stores a 16-byte value known as a globally unique identifier (GUID). SQL Server can generate GUIDs using one of two functions: NEWID() and NEWSEQUENTIALID(). NEWSEQUENTIALID() generates a GUID that is greater than a previously generated GUID by this function since the last restart of the server. You can use NEWSEQUENTIALID() only in a default constraint for a column; it is more suitable for use as a clustered primary key than NEWID(). Unlike NEWID(), which generates random values, the increasing nature of the GUIDs generated by NEWSEQUENTIALID() means that data and index pages will fill completely.

Note

Although GUIDs generated using NEWID() were originally generated by incorporating a system’s network interface card (NIC) MAC address, this has not been the case for many years, because it might have been possible to identify the system on which it was created. All GUIDs in SQL Server use a pseudorandom value, according to the UUID version 4 standard. The chance of collision is extremely low.

However, the NEWSEQUENTIALID() function is dependent on the MAC address of the machine’s network interface. This means that the starting point of the sequence generated by NEWSEQUENTIALID() could change when the machine’s network interface changes. A NIC change can occur with regularity on virtualized and PaaS platforms. With NEWSEQUENTIALID(), you will eventually experience fragmentation because the sequential GUIDs will have a smaller value than the previous sequence after a restart.

The uniqueidentifier data type plays an important role in some replication techniques. For more information, see Chapter 11, “Implement high availability and disaster recovery.”

The hierarchyid data type

The hierarchyid data type enables an application to store and query hierarchical data in a tree structure. A tree structure means that a row will have zero or one parent and zero or more children. There is a single root element denoted by a single forward slash (/). hierarchyid values are stored as a binary format but are commonly represented in their string format. Each element at the same level in the hierarchy (referred to as a sibling) has a unique numeric value (which might include a decimal point). In the string representation of a hierarchyid value, each level is separated by a forward slash. The string representation always begins with a slash (to denote the root element) and ends with a slash.

For example, as illustrated in Figure 7-2, a hierarchyid whose string representation is /1/10/ is a descendant of the /1/ element, which itself is a descendant of the implicit root element /. It must be noted, however, that SQL Server does not enforce the existence of a row with the ancestor element. This means it is possible to create an element /3/1/ without its ancestor /3/ being a value in a row. Implicitly, it is a child of /3/, even if no row with hierarchyid value /3/ exists. Similarly, the row with hierarchyid element /1/ can be deleted if another row has hierarchyid value /1/10/. If you don’t want this, the application or database will need to include logic to enforce the existence of an ancestor when inserting and to prevent the deletion of an ancestor.

A diagram showing hierarchical values starting at the root element /. The root element, /, has two descendants /1/ and /2/. These descendants are shown one level below the root element and have lines from / to the descendants /1/ and /2/. On the same level as /1/ and /2/, there is also a line going to the descendant /3/, which is in gray. This is to indicate that the element /3/ does not exist. Below the /1/ element, there is a line going down one level to the descendant /1/10/. Going back up a level to /2/, the diagram shows two different lines going to the descendants /2/1/ and /2/20/ one level below /2/. Going next to the /3/, there is a line showing a relationship to the descendant /3/1/ below the /3/ element. This descendant is not grayed out, indicating that this value does exist in the hierarchy. — **Figure 7-2** `hierarchyid` values. The value /3/ is in gray to indicate it is implicit.

Perhaps surprisingly, SQL Server does not enforce uniqueness of the hierarchyid values unless you define a unique index or constraint on the hierarchyid column. It is, therefore, possible for the /3/1/ element to be defined twice. This is likely not the desired situation, so we recommend that you ensure uniqueness of the hierarchyid values.

Using the hierarchyid data type is appropriate if the tree is most commonly queried to find descendants, such as children, children-of-children, and more. This is because hierarchyid processes rows depth-first if it is indexed. You can create a breadth-first index by adding a computed column to the table (which uses the .GetLevel() method on the hierarchyid column) and then creating an index on the computed column followed by the hierarchyid column. You cannot, however, use a computed column in a clustered index, so this solution will still be less efficient compared to creating a clustered index on the hierarchyid value alone.

A hierarchyid method worth mentioning is .GetAncestor(). This method returns the hierarchyid value of the current node’s parent. Conversely, .IsDescendantOf() determines whether a node is a descendant, direct or otherwise, of the hierarchyid provided as the function’s parameter.

For a complete overview of the hierarchyid data type, refer to https://learn.microsoft.com/sql/relational-databases/hierarchical-data-sql-server.

The sql_variant data type

The sql_variant data type enables a single column to store data of diverse types. You can also use this type as a parameter or a variable. In addition to storing the actual value, each sql_variant instance also stores metadata about the value, including its system data type, maximum size, scale and precision, and collation. Using sql_variant can be indicative of a poor database design, and you should use it judiciously. Client libraries that do not know how to handle that data might convert it to nvarchar(4000), with potential consequences for data that doesn’t convert well to character data.

In queries, you can retrieve the base type of the stored value and the base types’ properties using the SQL_VARIANT_PROPERTY() function. For example, using SQL_VARIANT_PROPERTY(<columnname>, 'BaseType'), you can retrieve the sysname of the underlying type. Other values that can be provided as the property parameter value of the function are Precision, Scale, TotalBytes, Collation, and MaxLength. If a particular property doesn’t apply to the underlying data type, the way precision doesn’t apply to varchar, the function returns NULL.

Data type precedence

When using a T-SQL operator that combines values that might be of different data types, how does the Database Engine handle the difference? The answer is that the data type with the lower precedence is converted to the data type with the higher precedence, assuming the conversion is possible. If it’s not, then an error is returned.

In the following code sample, the string variable @TheirString (a varchar) is first converted to datetime2 because datetime2 takes precedence over varchar. Then the comparison is executed.

Click here to view code image

DECLARE @MyDate datetime2(0) = '2022-12-22T20:05:00';
DECLARE @TheirString varchar(10) = '2022-12-20';
SELECT DATEDIFF(MINUTE, @TheirString, @MyDate);

For a complete list of data types in precedence order, visit https://learn.microsoft.com/sql/t-sql/data-types/data-type-precedence-transact-sql.

Constraints

Constraints define rules to which your data must adhere, and those rules are enforced by the Database Engine. This makes constraints a very powerful mechanism for guaranteeing data integrity. This section provides details on primary and foreign keys, which are used to establish relationships, and unique, check, and default constraints.

Primary keys, foreign keys, and relationships

Proper relational database design calls for a process called normalization. Through normalization, logical entities are broken into multiple related tables. Primary keys are created for entities, and foreign keys establish relationships between entities. A detailed discussion of normalization is beyond the scope of this book.

Without intending to wax poetic, keys are the nervous system of a relational database. They establish the relationships between the multiple tables created through normalization. A relational database system uses both primary keys and foreign keys. In a single table, the primary key values must be unique because those values can be used as foreign key values in the related table. The foreign key values can also be unique in the related table, in which case the established relationship is a one-to-one relationship. This is discussed in the “Vertical partitions” section later in the chapter.

Note

This chapter does not include coverage of indexes, even though a primary key and a unique constraint are always associated with an index. Frequently, foreign key columns are also indexed, as their values are often used to query the table. For information on indexing, see Chapter 15.

A table can have exactly one primary key. This primary key can consist of multiple columns, in which case it’s referred to as a compound primary key. A simple primary key only has one column. In no case can a nullable column be (part of) the primary key. If additional columns’ values should be unique, you can apply a unique index or constraint. (See the next sections for coverage on additional constraint types.)

Note

In most cases, SQL Server does not require that tables have a primary key. Some features, such as FILESTREAM and certain types of replication, might require tables to have a primary key. In general, though, you should default to designing tables with primary keys unless there is an overriding reason not to.

Foreign keys are intended to establish referential integrity. Referential integrity allows values found in the foreign key column(s) to exist in either the primary key, unique constraint, or unique index column(s). By default, foreign keys in SQL Server have referential integrity enforced. It is possible to establish a foreign key without referential integrity enforced, or to alter the foreign key to turn referential integrity off and on. This functionality is useful during import operations or certain types of database maintenance.

This applies to check constraints, too, which are discussed later in this section. The same is also true for primary keys if they are not enabled. However, during normal operations, foreign keys should have referential integrity enabled to protect the integrity of your data. Otherwise, establishing the relationship is useful only for documentary purposes, which can be helpful, but less so than knowing the foreign key references are always correct.

Note

If a foreign key is a composite key in which one or more columns allow NULL, a row with a NULL value in just one of the foreign key columns will pass the integrity check, even if the other columns contain values that do not exist in the parent table. This is because foreign key constraints fail only on a false result, not NULL or true.

To provide referential integrity, we recommend prohibiting NULL values in some columns of a composite foreign key and allowing them in others. It’s certainly acceptable to have NULL values in all columns of the composite foreign keys; this indicates that the relationship is optional. In that case, create a check constraint (covered later in this section) to ensure either all foreign key columns are NULL or none are.

One table can have multiple foreign keys.

When defining a foreign key, you can specify how to handle an operation in the parent row that would invalidate the relationship. Cascading specifically means that the same operation will be run on the child row(s) as was run on the parent. Thus, if the primary key value is updated, the foreign key values will be updated, and if the parent row is deleted, the foreign key values will be deleted. Alternatively, on updates or deletes in the parent table, no action can be taken (the default, which would cause the update or delete statement to fail if referential integrity is enforced), the foreign key value can be set to NULL (effectively creating an orphaned row), or the foreign key value can be set to its default constraint’s specification (effectively mapping the child row to another parent).

Unique constraints

A unique constraint enforces unique values in one column or selected columns. Unlike a primary key, the unique constraint allows the column(s) to be nullable, though NULL values in a constraint are treated as one value in SQL Server, not an unknown value. This means it is possible to have many rows with a NULL value if you are using a compound key. Otherwise, there will only be one NULL value.

Refer to Chapter 15 for guidance on unique filtered indexes, which can be used to work around this limitation.

Like primary and foreign keys, referential integrity is enabled by default for unique constraints. It is possible to disable the unique index, which will also disable the unique constraint.

SQL Server 2022 enables you to allow resumable add table constraints. The following sample adds a unique constraint to the column CountryName in the Application.Countries table in the WideWorldImporters sample database with a resumable add table constraint.

Click here to view code image

ALTER TABLE [Application].Countries WITH CHECK
    ADD CONSTRAINT UC_CountryName_Resume UNIQUE (CountryName)
    WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60);

When creating any type of unique constraint or index, the name must be distinct. The following could also be added to the table as well (which would create the exact same duplicate physical structure, which is not of any value):

Click here to view code image

ALTER TABLE [Application].Countries WITH CHECK
    ADD CONSTRAINT UC_CountryName_Resume2 UNIQUE (CountryName)
    WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60);

Check constraints

A check constraint enforces rules that can be expressed by using a Boolean expression. For example, in the Sales.Invoices table in the sample WideWorldImporters database, there is a check constraint defined that requires the ReturnedDeliveryData column to either be NULL or contain valid JSON, as shown below.

Click here to view code image

ALTER TABLE Sales.Invoices WITH CHECK
     ADD CONSTRAINT CK_Sales_Invoices_ReturnedDeliveryData_Must_Be_Valid_JSON
     CHECK ((ISJSON(ReturnedDeliveryData)<>(0)));

Check constraints can reference more than one column. A frequently encountered requirement is that when one column contains a specific value, another column cannot be NULL.

Using constraints with compound conditions also provides an opportunity to provide check constraints in the face of changing requirements. If a new business rule requires that a nullable column must now contain a value, but no suitable default can be provided for the existing rows, you should consider creating a check constraint that verifies whether an incrementing ID column or date column is larger than the value it held when the rule took effect. For example, consider the table Sales.Invoices in the previous sample, which has a nullable column Comments. If effective September 1, 2022, every new and modified invoice must have a value in the Comments column, the table could be altered using the following script:

Click here to view code image

ALTER TABLE Sales.Invoices WITH CHECK
   ADD CONSTRAINT CH_Comments CHECK (LastEditedWhen < '2022-09-01'
   OR Comments IS NOT NULL);

A problem that you cannot solve by using a constraint is when a column must contain unique values if a value is provided. In other words, the column should allow multiple rows with NULL, but otherwise should be unique. The solution then is to use a filtered unique index.

Read about filtered unique indexes in Chapter 15.

Default constraints

The fourth and final constraint type is the default constraint. A default constraint specifies the value that will be used as the default value when an INSERT statement does not specify a value for the column.

Default constraints are useful in a number of scenarios, most notably when adding a new non-nullable column to a table with existing data. This scenario is demonstrated in the following code sample, which adds a PrimaryLanguage column to the Application.People table in WideWorldImporters.

Click here to view code image

ALTER TABLE [Application].People
    ADD PrimaryLanguage nvarchar(50) NOT NULL
        CONSTRAINT DF_Application_People_PrimaryLanguage DEFAULT 'English';

Sequence objects

A sequence is a database object that generates numeric values in a specified order. Unlike the new SQL Server 2022 function GENERATE_SERIES() (which returns a table of numeric values), sequences are used when you want to retrieve the next available value.

How the sequence is generated depends on its start value, increment value, and minimum and maximum values. A sequence can be ascending, which is the case when the increment value is positive. When the increment value is negative, the values provided by the sequence are descending. A sequence object has some similarities to a column with an identity specification, but there are important distinctions:

You can define a sequence to cycle, meaning when the numbers in the sequence are exhausted, the next use will return a previously generated number. Which number will be returned when the sequence cycles is determined by the increment. If it is an ascending sequence, the minimum value is returned; if it is a descending sequence, the maximum value is returned.
A sequence is not bound to just one table. You can use numbers generated by the sequence in any table in the database or outside of a table.
Sequence numbers can be generated without inserting a new row in a table.
Values generated from a sequence can be updated or overridden without extra work.

Sequences are used when the application wants to have a numeric sequence generated one at a time—for example, before inserting one or more rows. Consider the common case of a parent-child relationship. Even though most developer tools expect to work with identity columns, knowing the value of a new parent row’s primary key value and using it as the foreign key value in the child rows can have benefits for the application—for example, making the value modifiable—because the identity column values cannot be updated.

A sequence is especially useful when a single incrementing range is desired across multiple tables. More creative uses for a sequence include using it with a small range—say, 5—to automatically place new rows in one of five buckets.

To create a sequence, use the CREATE SEQUENCE command. When creating the sequence, you specify a data type that can hold an integer value; the start, increment, minimum, and maximum values; and whether the numbers should cycle when the minimum or maximum value is reached. However, all these are optional. If no data type is specified, the type will be bigint.

Ideally, you should match the data type of the sequence to the data type of the column that will be holding the value. If no increment is specified, it will be 1. If no minimum or maximum value is specified, the minimum and maximum value of the underlying data type will be used. By default, a sequence does not cycle.

The sample script that follows creates a sequence called MySequence of type int. The values start at 1001 and increment by 1 until 1003 is reached, after which 1001 will be generated again. The script demonstrates the cycling of the values using a WHILE loop.

Click here to view code image

-- Define the sequence
CREATE SEQUENCE dbo.MySequence AS int
    START WITH 1001
    INCREMENT BY 1
    MINVALUE 1001
    MAXVALUE 1003
    CYCLE;
-- Declare a loop counter
DECLARE @i int = 1;
-- Execute 4 times
WHILE (@i <= 4)
BEGIN
    -- Retrieve the next value from the sequence
    SELECT NEXT VALUE FOR dbo.MySequence AS NextValue;
    -- Increment the loop counter
    SET @i = @i + 1;
END;

The output of the script will be 1001, 1002, 1003, and 1001. The sequence is used by calling NEXT VALUE FOR. You can use NEXT VALUE FOR as a default constraint or as a function parameter unless it’s a table-valued function. There are quite a few more places where NEXT VALUE FOR cannot be used, including subqueries, views, user-defined functions, and conditional expressions.

For a full listing of limitations, visit https://learn.microsoft.com/sql/t-sql/functions/next-value-for-transact-sql#limitations-and-restrictions.

Note

Sequences are cached by default. When using caching, the current value is saved in memory; this might cause values to be skipped from the sequence on a server restart. You can turn off caching by specifying the NO CACHE clause in the CREATE or ALTER SEQUENCE statement. Doing this will allow the sequence values to be persisted at the cost of additional calls to disk. You can control the size of the cache by using the CACHE clause and specifying an integer constant.

Inside OUT

Considerations with NEXT VALUE FOR

Using NEXT VALUE FOR multiple times for the same sequence in the same statement will result in only one value per row being used. For example, in the following T-SQL snippet, the first and the second column of the new row will have the same value:

Click here to view code image

INSERT INTO dbo.SomeTable VALUES (NEXT VALUE FOR dbo.MySequence,
    NEXT VALUE FOR dbo.MySequence, 'More data...');

This might not be the desired scenario. One solution is to use two separate sequences, although there is no guarantee they will return different numbers, even if their start values are different. Another solution is to define the sequence object’s increment as 2 and, for one of the columns, add 1 to the sequence value, as in the following snippet:

Click here to view code image

INSERT INTO dbo.SomeTable VALUES (NEXT VALUE FOR dbo.MySequence,
    NEXT VALUE FOR dbo.MySequence + 1, 'More data...');

If you must guarantee that the values are different, you should place a CHECK CONSTRAINT on the table. However, we recommend evaluating the need for having two columns in the same table with autogenerated numeric values.

NEXT VALUE FOR generates and returns a single value at a time. If multiple values should be generated at once, the application can use the sp_sequence_get_range stored procedure. This procedure allocates as many numbers from the sequence as specified and returns metadata about the generated numbers. The actual values that are generated are not returned. The sample script that follows uses the MySequence sequence to generate five numbers. The metadata is captured in variables and later output. Note that the data type of most output parameters is sql_variant. The underlying type of those parameters is the data type of the sequence.

Click here to view code image

-- Declare variables to hold the metadata
DECLARE @FirstVal sql_variant, @LastVal sql_variant,
    @Increment sql_variant, @CycleCount int,
    @MinVal sql_variant, @MaxVal sql_variant;
-- Generate 5 numbers and capture all metadata
EXEC sp_sequence_get_range dbo.MySequence
    , @range_size = 5
    , @range_first_value = @FirstVal OUTPUT
    , @range_last_value = @LastVal OUTPUT
    , @range_cycle_count = @CycleCount OUTPUT
    , @sequence_increment = @Increment OUTPUT
    , @sequence_min_value = @MinVal OUTPUT
    , @sequence_max_value = @MaxVal OUTPUT;
-- Output the values of the output parameters
SELECT @FirstVal AS FirstVal, @LastVal AS LastVal
    , @CycleCount AS CycleCount, @Increment AS Increment
    , @MinVal AS MinVal, @MaxVal AS MaxVal;

The output of this sample script will vary with each run. Because of the specific way in which the MySequence object was defined, however, every three cycles, the output will repeat.

Note

Although the only required output parameter is @range_first_value, if the application intends to use any value but the first, the application should consume all the metadata that is returned as part of the optional output parameters. Without it, the application might infer the incorrect value. By fetching the values, you know you have the value you expect. It is up to the application to calculate the actual numbers generated by using the first value, last value, increment, minimum and maximum value, and cycle count output parameters.

Caution

You might receive error 11732 when using sequences. This error indicates that the limit of the sequence has been reached and the sequence does not cycle. If this error occurs when using the sp_sequence_get_range stored procedure, no values are returned; that is, the sequence is not affected at all.

User-defined data types and user-defined types

SQL Server supports defining new data types. Two variations exist:

User-defined data types (UDTs). These alias existing data types.
User-defined types. These are .NET Framework types.

An effective and common use for UDT is creating table types. As the name implies, a table type defines a table structure as a type that can then be used as a function or stored procedure parameter. Such a parameter enables the easy passing of multiple values or rows to the function or procedure.

We should warn against the liberal use of either variant of custom data types. They can make a database schema significantly more difficult to understand and troubleshoot. Alias types add little value because they do not create new behavior, but on the other hand, some architects find the “self-documenting” aspect attractive. SQL CLR user-defined types enable SQL Server to expose new behavior, but they might come with a significant security risk if they are used improperly.

More details about the security of CLR user-defined types can be found at https://learn.microsoft.com/sql/relational-databases/clr-integration/common-language-runtime-integration-overview.

User-defined data types

Alias data types are merely a new name for an existing system data type including the same length and precision as the original data type, optionally a default nullability specification. Specifying a default value or a validation rule for the alias is deprecated functionality.

For example, if you want to ensure that a customer name was always defined as an nvarchar column with a maximum length of 100 characters, you might use the CREATE TYPE statement as shown here:

Click here to view code image

CREATE TYPE CustomerNameType FROM nvarchar(100);
GO

After creating this UDT, in any place where you would ordinarily specify nvarchar(100), you can use CustomerNameType instead. This can be in a table’s column definition, as the return type of a scalar function, or as a parameter to a stored procedure.

The following abbreviated CREATE TABLE statement, which is based on the WideWorldImporters sample Customers table, illustrates how CustomerNameType replaces nvarchar(100):

Click here to view code image

CREATE TABLE Sales.Customers (
    CustomerID INT NOT NULL,
    CustomerName CustomerNameType, -- can override nullability of the type here
…

UDTs can adversely affect data quality, as there are no additional methods of providing data protection.

CLR user-defined types

You develop user-defined types in a .NET language such as C#, and you must compile them into a .NET assembly. This .NET assembly is then registered in the database where the type will be used. A database can use these types only if SQL CLR is enabled.

Use caution when enabling CLR and granting permissions to assemblies, especially for trustworthy databases. Depending on the permissions granted, an assembly may be able to acquire sysadmin privileges, gain access to system resources, access user data, or any combination of the above.

Sparse columns

Sparse columns store NULL values in an optimized manner, reducing space requirements for storing NULL values at the expense of overhead to retrieve non-NULL values. As discussed in the earlier section “The XML data type,” a potential workaround for saving storage space for tables with many columns that allow NULL and have many NULL values is using sparse columns. Tables with sparse columns can have up to 30,000 columns. Sparse columns exist at the storage layer and are not counted as part of the maximum number of columns a table can have. Microsoft suggests that a space savings of at least 20 percent should be achieved before the overhead is worth it.

The Microsoft Docs at https://learn.microsoft.com/sql/relational-databases/tables/use-sparse-columns define the space savings by data type when using sparse columns.

Note

Not all data types can be defined as sparse columns. Specifically, you cannot define geography and geometry, image, text and ntext, rowversion, and UDTs as sparse columns.

Sparse columns are defined in CREATE or ALTER TABLE statements by using the SPARSE keyword. The sample script that follows creates a table, OrderDetails, with two sparse columns, ReturnedDate and ReturnedReason. Sparse columns are useful here because we might expect most products to not be returned and for the ReturnedDate and ReturnedReason columns to be retrieved only occasionally.

Click here to view code image

CREATE TABLE dbo.OrderDetails (
    OrderId int NOT NULL,
    OrderDetailId int NOT NULL,
    ProductId int NOT NULL,
    Quantity int NOT NULL,
    ReturnedDate date SPARSE NULL,
    ReturnedReason varchar(50) SPARSE NULL);

Note

For brevity, the CREATE TABLE script in the preceding example does not define primary keys, foreign keys, or columns that you might typically expect in an order details table.

Computed columns

Typically, columns store persisted data. Derived data—that is, data that is the result of a calculation—is not ordinarily stored. Instead, the application derives it every time it’s needed. In some circumstances, storing derived data in the database can be beneficial. SQL Server supports storing derived data using computed columns and indexed views.

Computed columns are defined in a table as the result of an expression of other columns in the table, function calls, and perhaps constants. Computed column values can either be calculated when accessed (the default) or persisted, depending on the need.

Using computed columns is always a trade-off. You use a computed column when you determine that there is some benefit to the Database Engine being aware of the derived data. You might find this beneficial because the database could be the central source of the computation instead of having to spread it out across multiple systems. Another trade-off is found when you persist computed columns; you trade storage space for compute efficiency.

If the expression that calculates the computed column’s value is deterministic, that column can be persisted and indexed. An expression is deterministic if the expression will always return the same result for the same inputs. An example of a deterministic expression is OrderQuantity + 1. Given the same value for OrderQuantity, the result will always be the same. An example of a nondeterministic expression is one that uses the SYSDATETIME() function; the expression returns a different result each time it is executed.

For a complete discussion of indexing computed columns, visit https://learn.microsoft.com/sql/relational-databases/indexes/indexes-on-computed-columns.

The WideWorldImporters sample database contains two computed columns in the Sales.Invoices table. One of these is ConfirmedDeliveryTime. It is derived by examining the contents of the JSON value stored in the ReturnedDeliveryData column and converting it to a datetime2 value. The datetime2 value is not persisted in this case. This means each time ConfirmedDeliveryTime is queried, the expression is evaluated. If the column was persisted, the expression would be evaluated only when the row is created or updated.

When defining a computed column, instead of specifying a data type, you specify the AS clause followed by an expression. Using the Sales.OrderLines table in the same sample database, you can create a computed column to calculate the order line’s extended price. The following sample SQL statement illustrates how:

Click here to view code image

ALTER TABLE Sales.OrderLines
    ADD ExtendedPrice AS (Quantity * UnitPrice) PERSISTED;

This statement creates a new column in the table called ExtendedPrice. Its value is computed using the expression Quantity * UnitPrice. The column is persisted because we expect to be querying this value frequently. The type of the computed column is determined by SQL Server based on the result of the expression. In this case, the data type is set to decimal(29,2). If the determined data type is not suitable for your needs, you can apply a cast in the expression to a more appropriate data type.

Special table types

As data storage needs have become more specialized, SQL Server has gained extended functionality to support these scenarios in the form of special table types. These table types support scenarios that would otherwise require significant effort by the database developer to implement. This section discusses temporal tables, memory-optimized tables, external tables, and graph tables. We discuss another special table type, FileTable, in the next section.

External tables and PolyBase are discussed later in this chapter, in the section “Benefits of PolyBase for external data sources and external tables.”

System-versioned temporal tables

System-versioned temporal tables, or temporal tables for short, are designed to keep not only current values of rows, but also historic values. In addition to the current table, there is a companion history table with the same schema structure. The history table stores the historic rows. SQL Server can create the history table at the time the current table is created, and you can opt to specify the history table’s name or let SQL Server create an anonymous history table. Alternatively, you might use an existing table as the history table, in which case the Database Engine will validate that the schema matches that of the current table. Creating a history table by hand can be complex, but specifying a history table that was previously used and was disconnected for some reason is a valuable option.

When a table is designed to be a temporal table, it must have two explicitly defined columns of type datetime2, which are used to indicate the validity period of the row. The datetime2 columns can have any precision between 0 and 7. You define the name of these columns and add the GENERATED ALWAYS AS ROW START|END clause. In addition to that clause, two more clauses are required in the table declaration: PERIOD FOR SYSTEM_TIME (start_time_col, end_time_col) and WITH (SYSTEM_VERSIONING = ON).

Note

Temporal tables should be used when users need to query the data and see changes over time. Later sections in this chapter cover change tracking and change data capture and compare these features to temporal tables and how they can be used for ETL type processes.

Create a system-versioned temporal table

The simple CREATE TABLE statement that follows illustrates the use of these clauses to create a system-versioned temporal table with an anonymous history table:

Click here to view code image

CREATE TABLE dbo.Products (
    -- Clustered primary key is required
    ProductId int NOT NULL PRIMARY KEY CLUSTERED
  , ProductName varchar(50) NOT NULL
  , CategoryId int NOT NULL
  , SalesPrice money NOT NULL
  , SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL
  , SysEndTime datetime2 GENERATED ALWAYS AS ROW END NOT NULL
  -- PERIOD FOR SYSTEM_TIME to indicate columns storing validity start and end
  , PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime))
-- SYSTEM_VERSIONING clause without HISTORY_TABLE option creates
-- an anonymous history table, meaning the name will be auto-generated
WITH (SYSTEM_VERSIONING = ON);

Note

The row’s start and end validity column values are managed by SQL Server. The values in those columns are in the UTC time zone. Neither validity period column will ever be NULL and the end time will always be 9999-12-31 in the base table.

An existing table can also be altered to become a temporal table. This is a two-stage process, in which you first alter the table to include the two required datetime2 columns and then alter the table to turn on system versioning while optionally specifying a history table name. The history of all columns in the table will be captured, so if the table is very volatile, and if it includes columns that you don’t care to track, this feature may not be useful.

When creating a new table or altering an existing one, you can apply the optional HIDDEN property to the columns for the validity period to exclude the columns from a standard SELECT statement. This might be useful to ensure backward compatibility with existing applications that query the table.

Understand data movement in temporal tables

The Database Engine manages the movement of data from the current table to the history table. The following list details the data movements that take place with each Data Manipulation Language (DML) operation:

INSERT and BULK INSERT. A new row is added to the current table. The row’s validity start time is set to the transaction’s start time. The validity end time is set to the datetime2 type’s maximum value—December 31, 9999—at a fractional second, or a whole second when using datetime2(0). There is no change in the history table.
UPDATE. A new row is added to the history table with the old values. The validity end time of the history row is set to the transaction’s start time. In the current table, the row is updated with the new values and the validity start time is updated to the transaction’s start time. If the same row is updated multiple times in the same transaction, multiple history rows with the same validity start and end time will be inserted. Those rows will not be retrieved using typical queries; only the current version of non-deleted rows will be returned. For instance, if only one column is changed, the entire row will be duplicated in the history structure.
DELETE. A new row is added to the history table containing the values from the current table. The validity end period of the history row is set to the transaction’s start time. The row is removed from the current table.

MERGE statements need no special consideration. A MERGE operation behaves as if separate INSERT, UPDATE, and DELETE statements are executed, as determined necessary by the MATCH clauses. Those statements add rows to the history table, as just described.

Query temporal tables

Querying a temporal table is no different from querying another table if your query only needs to return current data. This makes it possible to modify an existing database and alter tables into temporal tables without requiring application modifications.

Note

Recall that when using the HIDDEN property on the period columns, existing applications won’t be exposed to those columns. You must explicitly include hidden columns if you want to query them.

When designing queries that need to return historical data or even a mix of current and historical data, you use the FOR SYSTEM_TIME clause in the FROM clause of the SELECT statement. There are five subclauses used with FOR SYSTEM_TIME that help you define the time frame for which you want to retrieve rows. The following list describes these subclauses. It also provides a sample T-SQL statement for each one that you can run on the WideWorldImporters sample database to see its effects.

Note

The IsCurrent column in the output indicates whether the retrieved row is the current row or a history row. This is accomplished by checking whether the ValidTo column contains the maximum datetime2 value. Due to the nature of the WideWorldImporters sample data, you might need to scroll through several hundred rows before encountering a value of 0 for IsCurrent, which indicates that it is a history row.

ALL. The result set is essentially the union between the current and the history tables. Multiple rows can be returned for the same primary key in the current table. This will be the case for any row that has one or more history entries, as shown here:

Click here to view code image
```
SELECT PersonID, FullName,
      CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
         ELSE 0 END AS IsCurrent
FROM Application.People FOR SYSTEM_TIME ALL
WHERE PeriodId = 11
ORDER BY ValidFrom;
```

AS OF. This returns rows that were valid at the single point in time in the UTC time zone. Rows that have been deleted from the current table or that didn’t exist yet will not be included:

Click here to view code image

/* AS OF sub-clause returns all rows that were valid at one point in time.
 * Recall the SYSTEM_TIME is UTC.
 * Showing an example here of how to convert a local time to UTC:
 * Local time is March 13, 2022 12:00 AM (midnight) US Pacific Time
 * (the start of the day).
 * March 13 is not in daylight saving time, so the offset is -8 hours.
 * Thus, records we're looking for were active on March 13, 2022 8 AM UTC.
 * Calling the AT TIME ZONE function twice gives the desired time in UTC.
 */
DECLARE @AsOfTime datetime2(7) = CONVERT(datetime2(7), '2022-03-13T00:00:00', 126)
    AT TIME ZONE 'Pacific Standard Time' AT TIME ZONE 'UTC';
SELECT PersonID, FullName
    , CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
           ELSE 0 END 'IsCurrent'
FROM [Application].People FOR SYSTEM_TIME AS OF @AsOfTime
ORDER BY ValidFrom;

FROM … TO. This returns all rows that were active between the specified lower bound and upper bound. In other words, if the row’s validity start time is before the upper bound or its validity end time is after the lower bound, the row will be included in the result set. Rows that became active exactly on the upper bound or that closed exactly on the lower bound are not included. This clause might return multiple rows for the same primary key value:

Click here to view code image
```
SELECT PersonID, FullName,
     CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
         ELSE 0 END AS IsCurrent
-- SYSTEM_TIME uses UTC so provide date range in UTC as well
FROM Application.People FOR SYSTEM_TIME FROM '2022-03-13' TO '2022-04-23'
ORDER BY ValidFrom;
```

BETWEEN … AND. This is like FROM … TO, but rows that opened exactly on the upper bound are included:

Click here to view code image

SELECT PersonID, FullName,
     CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
         ELSE 0 END AS IsCurrent
FROM Application.People FOR SYSTEM_TIME BETWEEN '2022-03-13' AND '2022-04-23'
ORDER BY ValidFrom;

CONTAINED IN (,). This returns rows that were active exclusively between the lower and the upper bound. If a row was valid earlier than the lower bound or valid past the upper bound, it is not included. A row that was opened exactly on the lower bound or closed exactly on the upper bound will be included. If the upper bound is earlier than the maximum value for datetime2, only history rows will be included:

Click here to view code image
```
DECLARE @now datetime2(7) = SYSUTCDATETIME();
SELECT PersonID, FullName,
     CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
         ELSE 0 END AS IsCurrent
FROM Application.People FOR SYSTEM_TIME CONTAINED IN ('2022-03-13', @now)
ORDER BY ValidFrom;
```

Note

In the sample statement for the CONTAINED IN subclause, the variable @now is declared and initialized with the current UTC time. This is necessary because the FOR SYSTEM_TIME clause does not support functions as arguments.

Manage temporal tables

Altering a temporal table will generally cause its associated history table to be altered in the same way. This applies when adding, altering, or removing columns. However, there are a few operations that require you to disable system versioning. These include adding an identity, rowguidcol, or computed column; adding a sparse column in most cases; and adding a column set.

Caution

If you add a new column that does not allow NULL to a temporal table, the default value you’re required to specify will be used to fill the new column in both the current table and the history table. While it’s hard to imagine another solution that Microsoft could have implemented, one might argue that the history table doesn’t reflect the truth at the time its records were active.

To drop a temporal table, you must disable system versioning and then drop both the current table and the history table using two separate DROP TABLE statements.

Inside OUT

How do you design temporal tables to use the least amount of space?

Introduced in SQL Server 2017, you can use HISTORY_RETENTION_PERIOD with SYSTEM_VERSIONING when defining or altering system-versioned temporal tables. By default, HISTORY_RETENTION_PERIOD is set to INFINITE, meaning SQL Server will not automatically purge history data. By defining a finite period, such as 6 MONTHS, SQL Server will automatically purge history records with a valid end time older than the finite period. Two conditions must hold true, however. First, the temporal history retention flag must be enabled for the database (which it is by default). And second, the history table must have a clustered or columnstore index. For a rowstore clustered index, the first column in the index must be the column corresponding to the end of the validity period. If these conditions hold true, a background task is created to clean up the aged data.

In addition to using automatic retention, which controls the growth of the history table, you can also consider vertically partitioning the temporal table. Vertical partitioning is discussed in more detail in the “Vertical partitions” section later in this chapter. Splitting the table vertically into two tables, and only making one table system versioned, results in significant space savings, because the history is kept only for the columns in the system-versioned table. This does come at the expense of potentially frequent JOINs between both tables. This approach is also not suitable if you are system-versioning tables for compliance requirements for which all row data must be available in exactly the form it was at any given point.

Besides reducing the history kept by setting a retention period and using vertical partitioning to avoid keeping history for columns that do not require it, you might also consider horizontal partitioning or a custom cleanup script to manage the history data. These options are described in detail at https://learn.microsoft.com/sql/relational-databases/tables/manage-retention-of-historical-data-in-system-versioned-temporal-tables.

Memory-optimized tables

A traditional disk-based table’s data is loaded in memory as needed. The Database Engine handles loading data from durable storage to memory and removing the data from memory again. Many factors play a role in when data is loaded or released from memory. Data in memory-optimized tables is kept in memory at all times. This data is durable by default because it is persisted to disk (though the format of the data is different from that of disk-based tables). A schema-only, non-durable option is available, which does not retain data between service restarts and certain other operations.

The benefits of keeping all data from specific tables in memory is blazing-fast performance, which often can be improved by another order of magnitude by applying a columnstore index to the memory-optimized table. (Columnstore indexes are covered in Chapter 15.) This of course requires the server to have sufficient memory to hold the memory-optimized tables’ data in memory while still leaving enough room for other operations. If the system runs out of memory (OOM), errors will occur, and you’ll need to take specific steps to recover.

To avoid OOM issues, monitor memory usage. The Microsoft Docs at https://learn.microsoft.com/sql/relational-databases/in-memory-oltp/monitor-and-troubleshoot-memory-usage#bkmk_Monitoring describe how to monitor and troubleshoot memory usage for memory-optimized tables.
Guidance for resolving various OOM issues is available at https://learn.microsoft.com/sql/relational-databases/in-memory-oltp/resolve-out-of-memory-issues.

Memory-optimized tables are available in all editions of SQL Server and in Azure SQL Database’s Premium and Business Critical tiers. However, memory limitations present in the Express and Standard editions of SQL Server do apply to memory-optimized tables.

Note

Over time, many limitations of memory-optimized tables that were present in earlier versions of SQL Server have been eliminated.

This chapter discusses only the setup and configuration of memory-optimized tables, along with caveats. You can find a complete discussion of the purpose and use of memory-optimized tables in Chapter 15.

Database preparation for memory-optimized tables

Before creating memory-optimized tables, you must prepare the database. The database compatibility level must be at least 130. For SQL Server, you need to create a memory-optimized filegroup. There is no such requirement for Azure SQL Database—or, more accurately, the filegroup is intrinsically present.

Caution

You cannot remove a memory-optimized filegroup without dropping the database.

Microsoft provides a T-SQL script to ensure that these settings are correct and that a memory-optimized filegroup is created. You can even run the script in Azure SQL Database to ensure that the database supports memory-optimized tables.

Rather than reprinting this script here, we refer you to the SQL Server samples GitHub page at https://raw.githubusercontent.com/microsoft/sql-server-samples/main/samples/features/in-memory-database/in-memory-oltp/t-sql-scripts/enable-in-memory-oltp.sql.

The script first checks to ensure that the instance or database supports memory-optimized tables, using the SERVERPROPERTY(N'IsXTPSupported') function call. On SQL Server, the script will create a memory-optimized filegroup and container if none already exist. The script also checks and sets the database-compatibility level.

After these actions are complete, you are ready to create one or more memory-optimized tables. The WITH (MEMORY_OPTIMIZED = ON) is the key clause of the CREATE TABLE statement that makes your table a memory-optimized table. Memory-optimized tables support indexing, but you must create and delete them using an ALTER TABLE … ADD/DROP INDEX statement instead of a CREATE/DROP INDEX statement.

Natively compiled stored procedures and user-defined functions

You can access memory-optimized tables via interpreted T-SQL statements and stored procedures. However, you can achieve significant additional performance gains if you use natively compiled stored procedures. These stored procedures are compiled to machine code the first time they are run rather than evaluated every time they run.

Note

Natively compiled stored procedures can access only memory-optimized tables. Traditional interpreted stored procedures and ad hoc queries can reference both disk-based tables and memory-optimized tables in the same statement—for example, to join a memory-optimized table with a disk-based table.

To create a natively compiled stored procedure, use the WITH NATIVE_COMPILATION clause of the CREATE PROCEDURE statement. Natively compiled stored procedure objects require the use of the SCHEMABINDING option. The BEGIN ATOMIC statement is also required; it replaces BEGIN TRANSACTION for natively compiled procedures and functions. This statement either begins a new transaction or creates a save point in an existing transaction on the session. When creating a save point in an existing transaction, only the changes made by the stored procedure would be rolled back if the stored procedure were to fail.

The BEGIN ATOMIC statement has two required options:

TRANSACTION_ISOLATION. You must set this value to one of the three supported isolation levels: snapshot, repeatable read, or serializable.
LANGUAGE. This is a name value from the sys.syslanguages system compatibility view. For example, for United States English, it is us_english, and for Dutch it is Nederlands.

The BEGIN ATOMIC statement is also where delayed durability can be specified (DELAYED_DURABILITY = ON). With delayed durability, the Database Engine reports to the client that the transaction committed before the log record has been committed to a drive. This creates a risk of data loss should the service or server shut down before the asynchronous log write is completed. You should take the same care to use delayed durability with BEGIN ATOMIC as with BEGIN TRANSACTION. To use delayed durability, it must not be disabled at the database level. Schema-only memory-optimized tables do not use transaction logging, so when modifying data in those tables, there is no benefit in specifying delayed durability.

For more information on delayed durability, see Chapter 14.

The following short sample script creates a memory-optimized table and natively compiled stored procedure in a database that has been prepared previously.

Click here to view code image

CREATE TABLE dbo.UserDetails (
    UserId    int NOT NULL,
    DetailId  int NOT NULL,
    Detail    nvarchar(50) NOT NULL,
    CONSTRAINT PK_UserDetails PRIMARY KEY NONCLUSTERED (UserId, DetailId)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
CREATE PROCEDURE dbo.GetUserName
    @userId int
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC
    WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT,
          LANGUAGE = N'us_english')
    SELECT Detail
    FROM dbo.UserDetails
    WHERE UserId = @userId
        -- Assume this refers to the name
        AND DetailId = 1;
END;
GO

Note

Several T-SQL statements and constructs are not supported with memory-optimized tables and natively compiled stored procedures. A full list of these unsupported constructs is available in Microsoft Docs at https://learn.microsoft.com/sql/relational-databases/in-memory-oltp/transact-sql-constructs-not-supported-by-in-memory-oltp.

Caveats to memory-optimized tables

To put it plainly, you should probably not convert all of your tables to memory-optimized tables. There are several caveats you must consider before adopting memory-optimized tables and when deciding which tables to turn into memory-optimized tables. This section discusses these caveats.

Memory-optimized tables support only three transaction isolation levels: snapshot, repeatable read, and serializable. If your application needs other isolation levels, you will not be able to implement memory-optimized tables. Refer to Chapter 14 for complete details about transaction isolation levels.

Caution

Changing the database’s read-commit snapshot property will cause schema-only memory-optimized tables to be truncated. Although database designers are aware that schema-only memory-optimized tables are not persisted and might load initial data into such tables when the SQL Server service starts, they might not know to reload data after a database property change.

Because all memory-optimized table data is kept in memory, you would correctly expect additional memory requirements. When planning for memory size, however, you should consider that the memory requirement of a memory-optimized table can be more than twice the size of the data in the table. This is due to processing overhead requirements, including the row versions that are kept.

To review specific guidance on planning for memory size, refer to https://learn.microsoft.com/sql/relational-databases/in-memory-oltp/estimate-memory-requirements-for-memory-optimized-tables.

When persisted memory-optimized tables are used, upon service start, the Database Engine will load all data from the drive to memory. Indexes of memory-optimized tables are not persisted, and they are rebuilt entirely upon service start as the index operations are not logged. The service is not available while these operations take place. With large tables, this can lead to significantly longer service start times. Even though you might carefully plan your service or server restarts for a maintenance window, an unplanned failover on a failover cluster instance (FCI) will also take that amount of time. This might be detrimental to meeting your Service-Level Agreement (SLA), which might have been the entire reason to configure an FCI in the first place. If the performance of memory-optimized tables is needed in combination with a high-availability configuration, you might consider availability groups instead. Because the Database Engine service is running on the secondary, there is no delay caused by having to read the data from a drive and rebuilding indexes.

Read about FCI and availability groups in Chapter 11.

One way to reduce database startup time due to memory-optimized tables is to ensure that checkpoints are taken frequently. This is because checkpoints cause the updated rows in the memory-optimized table to be committed to the data file. Any data that is not committed to the drive must be read from the transaction log. For large tables, this benefit is likely small.

Another contributor to delays, though after service start, is when natively compiled stored procedures are run for the first time. This can take about as long as running a traditional stored procedure because the compiled version of the stored procedure is not saved. Any time a natively compiled stored procedure is run subsequently, the compiled version will be faster.

Memory-optimized tables use an optimistic concurrency model. While memory-optimized tables don’t use locks, latches, or spinlocks, they are subject to isolation level errors and update conflicts. This means a client application might experience unexpected conflicts (which are shown as errors that are truly just messages to the client that their data is being used in an incompatible manner). You should design the application to handle those. Ironically, one of the greatest benefits of memory-optimized tables is optimistic concurrency, but it also causes one of the largest drawbacks, in that applications must be designed to correctly handle the errors generated.

Not unlike when faster drive storage is used for SQL Server, when adopting memory-optimized tables, you might find that the CPU usage is much higher. This is because much less time is spent waiting for I/O operations to complete. This is exactly why you implemented memory-optimized tables: CPU utilization is higher because data is being processed faster! However, you might inadvertently reduce the number of concurrent requests that can be served, especially if one instance runs multiple databases. If this is a concern, consider using Resource Governor to manage the relative CPU usage for specific workloads.

Graph tables

Introduced in SQL Server 2017 and extended in SQL Server 2019, graph functionality provides schema extensions to store directed graph data—that is, nodes and edges—in the relational database. Fitting graph data in a relational database is challenging, and this feature attempts to resolve these challenges. The graph features in SQL Server are useful in some common scenarios, particularly when trying to integrate graph structures with relational ones, but are currently not a complete replacement for dedicated graph databases that support advanced scenarios.

Graph data is often associated with networks, such as social networks, and hierarchies. More generally, graphs are data structures that consist of nodes and edges. The nodes represent entities and the edges represent the connections between those entities. Nodes are also referred to as vertices, and edges as relationships.

Some use cases lend themselves particularly well to being stored in a graph model. For example:

Highly interconnected data. A commonly used example of highly interconnected data is that of social networks. Social network data expresses relationships among people, organizations, posts, pictures, events, and more. In such a data model, each entity can be connected to any other entity, creating lots of many-to-many relationships. In a relational database, this requires the creation of a table for each many-to-many relationship. Querying such relationships requires two or more JOIN clauses, which can quickly create lengthy SELECT statements. Such statements can be difficult to digest and are potentially error prone. Graph databases offer support for flexible definitions of relationships and query syntax that is less verbose.
Hierarchical data. A single node in a graph can have many parents in addition to many children. You may also find it conceptually easier to build even a simple tree, where a node has only one parent, using nodes and edges.
Many-to-many relationships that can be extended at any time during the data life cycle. Relational databases have strict requirements for the definition of tables and relationships. For a data model that is required to evolve quickly to support new relationships, adding new relationships is easy enough in a relational database, but a graph lets you use the same table/edge for multiple relationships. This strict schema requirement can get in the way of meeting evolving requirements in a timely fashion.

You can effectively implement these use cases by employing a graph database. While these features are evolving and regularly being improved, SQL Server’s graph features do not (yet) provide a solution that is on par with dedicated graph databases. The “Graph table shortcomings” section later in the chapter discusses some of the limitations of the current implementation.

Define graph tables

In SQL Server, you store graph data in two table types: node and edge tables. These table types are still stored internally as relational structures, but the Database Engine has additional capabilities to manage and query the data that is stored within them.

The T-SQL CREATE TABLE syntax has two clauses: AS NODE and AS EDGE. The following T-SQL script creates a People node table and a Relationships edge table. You can run this script in any existing or new database; there are no specific requirements of the database:

Click here to view code image

CREATE TABLE dbo.People (
     PersonId int NOT NULL PRIMARY KEY CLUSTERED,
     FirstName nvarchar(50) NOT NULL,
     LastName nvarchar(50) NOT NULL
) AS NODE;
CREATE TABLE dbo.Relationships (
    RelationshipType nvarchar(50) NOT NULL,
    -- Two people can only be related once
    CONSTRAINT UX_Relationship UNIQUE ($from_id, $to_id),
    CONSTRAINT EC_People_ConnectsTo_People CONNECTION (dbo.People TO dbo.People)
) AS EDGE;

In the sample script, both the node and the edge table contain user-defined columns. Edge tables are not required to have user-defined columns, but node tables must have at least one. In the case of edge tables, which model relationships, simply modeling the relationship without additional attributes can be all you need. In the case of node tables, which model entities, there is no value in a node without properties, as the nodes exist to compare properties to one another. Designing a node table is comparable to designing a relational table; you would still consider normalization and other concepts.

The sample script also defines an edge constraint, a feature introduced in SQL Server 2019. Edge constraints restrict which node types can be associated using a particular edge. In this case, a Relationships edge is defined between two nodes of type People.

The edge constraint may be repeated multiple times in a single-edge table definition, but the entries in the table must comply with all constraints. Alternatively, multiple edge constraint clauses may be defined within the same edge constraint, in which case any of the constraint clauses must be satisfied.

Note

Edge constraints and edge constraint clauses are not the same thing. An edge constraint is used to enforce integrity on an edge table. An edge constraint can have one or more edge constraint clauses, and an edge constraint clause is used to define relationships between nodes.

The next script defines a second node table, Animals; removes the existing edge constraint; and finally creates a new edge constraint with two clauses, allowing a relationship to exist between two People rows or between a People row and an Animals row.

Click here to view code image

CREATE TABLE dbo.Animals (
   AnimalId int NOT NULL PRIMARY KEY CLUSTERED,
   AnimalName nvarchar(50) NOT NULL
) AS NODE;
-- Drop and re-create the constraint, because an edge constraint cannot be altered
ALTER TABLE Relationships
    DROP CONSTRAINT EC_Relationship;
ALTER TABLE Relationships
    ADD CONSTRAINT EC_Relationship CONNECTION (dbo.People TO dbo.People,
        dbo.People TO dbo.Animals);

Note

Edge constraints not only enforce the type(s) of node that can be connected using the edge, they also enforce referential integrity between the nodes and the edge. SQL Server supports cascading deletes of edges when a node is deleted. Specify ON DELETE CASCADE in the CONSTRAINT clause to enable cascading of the delete operation, or ON DELETE NO ACTION (the default) if deleting the node should fail.

In addition to user-defined columns, both table types also have one or more columns that are added to the table implicitly to support graph operations. Node tables have two system-generated implicit (also called pseudo) columns, $graph_id and $node_id:

graph_id_<hex_string_1>. This is a bigint column, which stores the internally generated graph ID for the row. This column is internal and cannot be explicitly queried. However, information about this column will be accessible in sys.columns.
$node_id_<hex_string_2>. This returns a computed nvarchar value that includes the internally generated bigint value and schema information. This column can be queried but you should avoid explicitly querying this column. Instead, you should use the NODE_ID_FROM_PARTS() query to access the JSON.

In addition to optional user-defined columns, edge tables have three implicit columns:

$edge_id_<hex_string_3>. This is a system-managed value, comparable to the $node_id column in a node table.
$from_id_<hex_string_4>. This references a node ID from any node table in the graph that meets the edge constraints. This is the source node in the directed graph.
$to_id_<hex_string_5>. This references a node ID from any node table in the graph that meets the edge constraints. This is the target node in the directed graph.

Inside OUT

When should you choose graph tables over relational tables?

There is nothing inherent to a graph database that makes it possible for you to solve a problem that you cannot also solve using a relational database. The relational database concept has been around for nearly five decades, and relational database management systems are as popular as ever.

However, the use cases described earlier, and queries described momentarily, are examples of data models and operations that might be better addressed by a graph structure. This is because the Database Engine has specific optimizations to address some of the types of queries that are often run against such models.

In addition, a graph table can still contain foreign keys referring to relational tables, and a relational table can contain a foreign key referring to a graph table.

Work with graph data

DML statements generally work the same in graph tables as they do in relational tables. Some operations are not supported. An edge table does not support updating either the $frHi, om_id, or $to_id column value. Thus, to update a relationship, the existing edge row must be deleted and a new one inserted. User-defined columns of edge tables do support update operations.

When querying graph data, you can write your own table joins to join nodes to edges to nodes, though this approach offers none of the benefits of graph tables. Instead, we recommend using the MATCH comparison operator in the WHERE clause. The MATCH operator uses a style of expression referred to as ASCII art to indicate how nodes and edges should be traversed. You might be surprised to find that the node and edge tables are specified using old-style join syntax first. The MATCH subclause then performs the actual equi-joins necessary to traverse the graph.

The brief examples that follow are intended to provide an introduction only. They build on the creation of the People and Relationship tables shown in the previous example. First, a few rows of sample data are inserted. Then, the sample data is queried using the MATCH subclause:

Click here to view code image

-- Insert a few sample people
-- $node_id is implicit and skipped
INSERT INTO dbo.People VALUES
    (1, 'Karina', 'Jakobsen'),
    (2, 'David', 'Hamilton'),
    (3, 'James', 'Hamilton'),
    (4, 'Stella', 'Rosenhain');
-- Insert a few sample relationships
-- The first sub-select retrieves the $node_id of the from_node
-- The second sub-select retrieves the $node_id of the to_node
INSERT INTO dbo.Relationships VALUES
    ((SELECT $node_id FROM People WHERE PersonId = 1),
     (SELECT $node_id FROM People WHERE PersonId = 2),
     'spouse'),
     ((SELECT $node_id FROM People WHERE PersonId = 2),
     (SELECT $node_id FROM People WHERE PersonId = 3),
     'father'),
     ((SELECT $node_id FROM People WHERE PersonId = 4),
     (SELECT $node_id FROM People WHERE PersonId = 2),
     'mother');
-- Simple graph query
SELECT P1.FirstName + ' is the ' + R.RelationshipType +
    ' of ' + P2.FirstName + '.'
FROM dbo.People P1, dbo.People P2, dbo.Relationships R
WHERE MATCH(P1-I->P2);

The arrow used in the MATCH subclause means that a node in the People table should be related to another node in the People table using the Relations edge. As with self-referencing many-to-many relationships, the People table needs to be present in the FROM clause twice to allow the second People node to be different from the first. Otherwise, the query would retrieve only edges in which people are related to themselves. (There are no such relationships in our sample.)

The true power of the MATCH subclause is evident when traversing the graph between three or more nodes. One such example would be finding restaurants your friends have liked in the city where your friends live and where you intend to travel.

For a more comprehensive sample graph database, refer to https://learn.microsoft.com/sql/relational-databases/graphs/sql-graph-sample.

SQL Server 2019 introduced support for the shortest path algorithm. With this support, it is now possible for the MATCH clause to traverse an arbitrary number of nodes to find a related node. Several T-SQL syntax elements are required, including a SHORTEST_PATH function as well as FOR PATH and WITHIN GROUP (GRAPH PATH) clauses. The following sample retrieves all the direct descendants of one of the people in the same table:

Click here to view code image

-- Construct Stella Rosenhain's direct descendants' family tree
-- In our example data, two rows will be returned
SELECT P1.FirstName
        , STRING_AGG(P2.FirstName, '->') WITHIN GROUP (GRAPH PATH) AS Descendants
FROM dbo.People P1
    , dbo.People FOR PATH P2
    , dbo.Relationships FOR PATH related_to1
WHERE (MATCH(SHORTEST_PATH(P1(-(related_to1)->P2)+))
    -- Stella Rosenhain
    AND P1.PersonId = 4);

Running the preceding query shows that David is the direct descendant of Stella. The query also returns a record showing that James is David’s descendant. However, Karina is not returned in the query because she is not a descendant of Stella. A more complex example is included in the accompanying files for this book.

Graph table shortcomings

Since the first graph features were released with SQL Server 2017, additional investments have been made to overcome the limitations found in that release. Most notably, the SHORTEST_PATH function enables both the shortest path graph analytic function and transitive closures (the ability to recursively traverse edges). Still, many limitations compared to native graph databases remain.

The following list contains two notable syntax limitations starting in SQL Server 2019 and a brief description of their significance for implementing a graph. Hopefully, this will provide the information you need to make an informed decision about using SQL Server for graph data.

Need to explicitly define edges as tables. Graphs model pairwise relations between entities (the nodes). Flexibility can be key in maximizing the benefits of graph models. Even though the nodes and their properties are often well understood, new relationships can be modeled as new needs arise or additional possibilities emerge. The need to make schema modifications to support new types of edges reduces flexibility. Some of this can be addressed by defining one or a few edge tables and storing the edge properties as XML or JSON. This approach, too, has drawbacks in terms of performance and ease of writing queries against the data.
Limited polymorphism. Polymorphism is the ability to find a node of any type connected to a specified starting node. In SQL Server, a workaround for graph models with few node and edge types is to query all known node and edge types and combine the result sets by using a UNION clause. For large graph models, this solution becomes impractical.

Store large binary objects

Large objects (LOBs), including XML and binary large objects (BLOBs), can be used to store files. However, storing LOBs—and more specifically BLOBs—in a relational database has been known to cause debate. Before SQL Server offered the FILESTREAM feature as a specialized way for the Database Engine to manage BLOBs, database designers had two options, neither of which would likely meet all your requirements:

Store the BLOB, such as an image, video, or document file, in a varbinary column. Downsides of this approach include rapid growth of the data file, frequent page splits, and pollution of the buffer pool. Benefits include transactional integrity and integrated backup and recovery of the BLOB data.
Have the application store the BLOB in the file system and use an nvarchar column to store a local server or UNC path to the file. Downsides of this approach include requiring the application to manage data integrity (avoiding missing files or files without associated database records) and lack of integrated security (the mechanism to secure the BLOBs is an entirely different model than that for protecting the database). There are some benefits, though, primarily around performance and ease of programming for the client to work with the BLOBs (using traditional file I/O APIs provided by the OS).

The FILESTREAM feature is designed to provide the best of both worlds. FILESTREAM is not a data type, but an extension to varbinary(max), which changes how data is stored as well as providing additional capabilities.

This section discusses FILESTREAM and an extension of FILESTREAM called FileTable. FileTable lets you access varbinary data via the built-in OS file manager.

Understand FILESTREAM

There are three requirements to take advantage of FILESTREAM:

The instance must be configured to allow at least one of several levels of FILESTREAM.
Your database needs at least one FILESTREAM filegroup.
Any table containing a FILESTREAM column requires a unique, non-NULL rowguid.

A FILESTREAM filegroup refers to a location on an NT File System (NTFS) or Resilient File System (ReFS) volume under the control of the Database Engine. The Database Engine uses this location to store binary data and log files for the binary data.

When a FILESTREAM filegroup is available in the database, FILESTREAM can be used as a modifier on varbinary(max) column declarations. When creating a table with a FILESTREAM column, you can specify on which filegroup the FILESTREAM data will be stored. When multiple FILESTREAM database files are added to a single filegroup, they will be used in round-robin fashion, as long as they don’t exceed their maximum size.

In general, FILESTREAM’s performance benefits kick in when the average BLOB size is 1 MB or larger. For smaller BLOB sizes, storing the BLOBs in the database file using a varbinary(max) column is better for performance. You might determine, however, that the ease of programming against file I/O APIs in the client application is an overriding factor and decide to use FILESTREAM even with smaller BLOBs.

If any of your BLOBs exceed 2 GB in size, you will need to use FILESTREAM. The varbinary(max) data type supports a maximum BLOB size of 2 GB. Another reason for choosing FILESTREAM is the ability to integrate BLOBs with SQL Server Semantic Search. To be clear, varbinary(max) columns can also be integrated with Semantic Search, but BLOBs stored in traditional file systems cannot. Semantic Search in SQL Server supports extracting and indexing statistically relevant keywords or phrases, which in turn enables the identification of similar or related documents. Among other things, Semantic Search can be used to suggest tags for an article or identify résumés based on a job description.

More information about Semantic Search is available at https://learn.microsoft.com/sql/relational-databases/search/semantic-search-sql-server.

Inside OUT

How do you move data from a varbinary(max) column to FILESTREAM?

Unfortunately, moving from varbinary(max) columns to FILESTREAM is not as easy as modifying the column to add the FILESTREAM modifier. Attempting to modify the column in that way will result in an error. Instead, you can use the following three-step process after creating a FILESTREAM filegroup:

Create a new varbinary(max) FILESTREAM column in the table, or in another table if you want to use vertical partitioning.
Copy the data from the existing varbinary(max) column to the new FILESTREAM column. This operation can cause a significant amount of database activity, depending on the number of rows and the size of the BLOBs.
Drop the varbinary(max) column. Optionally, you can then rename the FILESTREAM column to the name of the dropped column. Until you have (optionally) deployed a modified application that uses the I/O APIs with FILESTREAM to achieve better performance, the existing application’s T-SQL statements will continue to work on the FILESTREAM column.

In addition to being aware that this operation can potentially cause significant database activity, you also need to ensure that sufficient storage space is available to hold both copies of the data. (If you are moving the BLOBs to different storage hardware, this might be less of a concern.) After completing the process, you should consider whether shrinking the data file is appropriate.

Finally, while these operations take place, consider placing a lock on the table. Otherwise, transactions may modify the data in the varbinary(max) or add new rows. In step 3, that data could be lost.

Even though FILESTREAM BLOBs are stored in the file system, they are managed by the Database Engine. This includes transactional consistency and point-in-time restores. Thus, when a BLOB is deleted, the file on the drive backing that BLOB is not immediately deleted. Similarly, when a BLOB is updated, an entirely new file is written, and the previous version is kept on the drive. When the deleted file or previous file version is no longer needed, the Database Engine will eventually delete the file using a garbage-collection process. You are already aware of the importance of generating transaction log backups with databases in the full recovery model. This way, the transaction log can be truncated and stop growing. When using FILESTREAM, this mantra applies doubly: The number of files will keep growing until they are deleted by the garbage collector.

Caution

You should never modify the contents of a FILESTREAM folder (a data container) manually. Doing so can lead to FILESTREAM data corruption.

FileTable

FileTable makes it possible to access BLOBs managed by the Database Engine using traditional file share semantics. Applications that can read and write from a file share can access BLOBs managed by the SQL Server Database Engine as if they were on a regular Server Message Block (SMB) file share.

Although clients can use file I/O APIs to work with FILESTREAM, obtaining a handle to the BLOB requires the use of specific client libraries and application modifications. There might be applications that cannot be modified to work with FILESTREAM but for which having BLOBs managed by the relational engine would have significant advantages. It was for this purpose that FileTable, which is a special table type, was developed.

Note

FileTable is not currently available on SQL Server on Linux due to its reliance on Windows APIs.

A FileTable has a fixed schema, so you can neither add user-defined columns nor remove columns. The only control provided is the ability to define indexes, even unique ones, on some FileTable columns.

The fixed schema has a FILESTREAM column that stores the actual file data in addition to many metadata columns, and the non-NULL unique rowguid column required of any table containing FILESTREAM data. FileTable can organize data hierarchically, meaning folders and subfolders are supported concepts.

For a detailed discussion of the FileTable schema, visit https://learn.microsoft.com/sql/relational-databases/blob/filetable-schema.

Table partitions

Table partitioning occurs when you design a table that stores data from a single logical entity in physically separate structures. In other words, rather than storing all the entity’s data in a single physical data structure, the data is split into multiple physical data structures, but the user continues to treat it as a single unit, as normal.

Table partitioning has multiple purposes, some of which relate to performance, either when querying or when loading data. (We discuss this later in detail.) As shown in Figure 7-3, there are two types of partitioning: horizontal partitioning and vertical partitioning. This section discusses each of these separately, including common use cases and recommendations.

A diagram showing both horizontal and vertical partitioning. The image on the left is marked (a) and shows an example of horizontal partitioning. This is represented with a table having columns EntityId, A, B, and C. The table header and first two rows are grouped together. There is a wavy line separating this section of the table from the data below. This is to show that horizontal partitioning keeps all the columns in a data record together and partitions data by row. The image on the right (b), shows vertical partitioning. This image has two tables. The first table is Table1_1 with columns EntityId, A, and B. The second table is Table1_2 with columns EntityId and C. There is a wavy line separating the two tables. This is to show that columns are split between multiple tables in vertical portioning. However, if there is no horizontal partitioning on these tables, the vertical partitions will have the same number of records as the original table. — **Figure 7-3** (a) Horizontal partitioning splits a table’s data rows. (b) Vertical partitioning splits a table’s columns.

As illustrated in Figure 7-3, horizontal and vertical partitioning are distinctly different. Horizontal partitioning splits the data rows, and each partition has the same schema. Vertical partitioning splits the entity’s columns across multiple tables.

Figure 7-3 shows a table partitioned in only two partitions, but you can partition tables into many partitions. You can also mix horizontal and vertical partitioning.

Note

In SQL Server, partitioning usually refers to horizontal partitioning, only because it is the name of a feature. This book discusses both horizontal and vertical partitioning; as such, we always explicitly declare which partitioning type is being discussed.

Horizontally partitioned tables and indexes

In a large-scale database, in which a single table can grow to hundreds of gigabytes or more, some operations become more difficult. For example, adding new rows can take an excessive amount of time and might also cause SELECT queries on the table to fail due to lock escalation. Similar concerns exist with respect to removing data and index maintenance.

Horizontal partitioning can address these concerns. However, it is not a silver bullet that will make all performance problems in large tables disappear. On the contrary, when applied incorrectly, horizontal partitioning can have a negative effect on your database workload. This section builds on the brief discussion of partitioning found in Chapter 3, “Design and implement an on-premises database infrastructure.”

Note

Support for horizontal partitioning in SQL Server was limited to SQL Server Enterprise edition until the release of SQL Server 2016 with Service Pack 1. Since then, all editions support horizontal table and index partitioning.

About horizontal partitions

SQL Server’s partitioning feature supports horizontal partitioning with a partition function, which determines in which partition of the table a given row will be stored. Each partition can be stored in its own filegroup in the same database.

When partitioning a table, the rows of the table are not all stored in the same physical place. So, when designing partitions, you must decide on a partition key, which is the column that will be used to assign a row to exactly one partition. From a logical viewpoint, however, all rows belong to the same table.

A query without a WHERE clause returns all rows, regardless of which partition they are stored in. This means the Database Engine must do more work to retrieve rows from different partitions. Your goal when partitioning for query performance should be to write queries that eliminate partitions. You can accomplish this by including the partition key in the WHERE clause.

Additional benefits of horizontal partitioning include the ability to set specific filegroups to read-only. By mapping partitions containing older data to read-only filegroups, you can be assured that this data is unchangeable without affecting your ability to insert new rows. In addition, you can exclude read-only filegroups from regular backups. Finally, during a restore, filegroups containing the most recent data could be restored first, allowing new transactions to be recorded faster than if the entire database needed to be restored.

Note

Restoring selected files or filegroups while keeping the database available is called an online restore, which is still supported only in the Enterprise edition.

Index partitioning

In addition to horizontal table partitioning, SQL Server also supports index partitioning. A partitioned index is said to be aligned with the table if the table and the index are partitioned in the same number of partitions using the same column and boundary values.

When a partitioned index is aligned, you can direct index maintenance operations to a specific partition. This can significantly speed up the maintenance operation because you can rebuild or reorganize a partition rather than the entire index. On the other hand, if the entire index needs to be rebuilt, SQL Server will attempt to do so in a parallel fashion. Rebuilding multiple indexes simultaneously creates memory pressure. Because of this, we recommend that you not use partitioning on a system with less than 16 GB of RAM.

Note

Achieving an aligned index is typically done by using the same partition function and scheme as the table. However, it is not strictly necessary to create an aligned partitioned index. If you choose to use a different function or scheme, you must remember to modify the function for the index simultaneously with the table’s partition function. Therefore, we recommend that you use one partition function for both the partition schema and the aligned partitioned index.

You might benefit from creating a partitioned index without partitioning the table. You can still use this nonaligned index to improve query efficiency if only one or a few of the index partitions need to be used. In this case, you must also use the index’s partition key in the WHERE clause to gain the performance benefit of eliminating partitions.

Define partitions and partition a table

We now demonstrate how to create a horizontally partitioned table using the SQL Server feature. Three database objects are involved in defining partitions and partitioning a table:

A partition function, which defines the number of partitions and the boundary values
A partition scheme, which defines on which filegroup each partition is placed
The table to be partitioned

Note

For brevity, the following script does not show the creation of the database with the filegroups and files necessary to support the partition scheme. The sample script included with the book downloads does include the CREATE DATABASE statement, as well as a CREATE TABLE statement followed by an INSERT statement and SELECT statements from DMVs to review the table’s partition statistics.

Click here to view code image

-- Create a partition function for February 1, 2019, through January 1, 2020
CREATE PARTITION FUNCTION MonthPartitioningFx (datetime2)
    -- Store the boundary values in the right partition
    AS RANGE RIGHT
   -- Each month is defined by its first day (the boundary value)
    FOR VALUES ('20190201', '20190301', '20190401',
      '20190501', '20190601', '20190701', '20190801',
      '20190901', '20191001', '20191101', '20191201', '20200101');
-- Create a partition scheme using the partition function
-- Place each trimester on its own partition
-- The most recent of the 13 months goes in the latest partition
CREATE PARTITION SCHEME MonthPartitioningScheme
    AS PARTITION MonthPartitioningFx
    TO (FILEGROUP2, FILEGROUP2, FILEGROUP2, FILEGROUP2,
        FILEGROUP3, FILEGROUP3, FILEGROUP3, FILEGROUP3,
        FILEGROUP4, FILEGROUP4, FILEGROUP4, FILEGROUP4, FILEGROUP4);

If you visualize the table data as being sorted by the partition key in ascending order, the left partition is the partition on top. When defining a partition function, you indicate whether the boundary value—in this example, the first day of each month—will be stored in the partition on the left (the default) or the partition on the right (as specified in the sample).

Figure 7-4 shows the relationship between the partition function and the partition scheme. In the sample, the partition function created 13 partitions using 12 boundary values. The partition scheme then directed these 13 partitions to three filegroups by specifying each filegroup four times and the last filegroup five times (because it will hold the last partition).

A diagram showing the relationship between the partition function, partition scheme, and the filegroups on which the partitions will be stored. A sample of code used to create the partition function is shown on the bottom. The code shown is CREATE PARTITON FUNCTION MonthPartitioningFx (datetime2) AS RANGERIGHT FOR VALUES '20190201', '20190301', '20190401', '20190501', '20190601', '20190701', '20190801', '20190901', '20191001', '20191101', '20191201', '20200101';. Directly above this code, there is another example of code to create the partition scheme. The code displayed is as follows: CREATE PARTITION SCHEME MonthPartitioningScheme AS PARTITION MonthPartitioningFx TO FILEGROUP2, FILEGROUP2, FILEGROUP2, FILEGROUP2, FILEGROUP3, FILEGROUP3, FILGROUP3, FILEGROUP3, FILEGROUP3, FILEGROUP4, FILGROUP4, FILEGROUP4, FILEGROUP4, FILEGROUP4;. At the top of the diagram, there are three database shaped images representing the three different filegroups, FILEGROUP2, FILEGROUP3, and FILEGROUP4. Directly below the filegroup images, there are several boxes, one next to another, representing the date ranges. For FILEGROUP2, there are four boxes < Feb 2019, Feb 2019, Mar 2019, and Apr 2019. For FILEGROUP 3, there are also four boxes, May 2019, Jun 2019, Jul 2019, and Aug 2019. FILEGROUP 4 has five boxes, Sep 2019, Oct 2019, Nov 2019, Dec 2019 and >= Jan 2020. The first occurrence of FILEGROUP2, in the partition scheme, goes to the beginning of the < Feb 2019 box. There is no value shown for this data in the partition function. The diagram also shows how the date range '20190201' from the partition function is related to the second occurrence of FILEGROUP2 in the partition scheme. This partition scheme is shown to relate to the beginning of the Feb 2019 box below FILEGROUP2. Similarly, the data range '20190301' in the partition function is displayed to the third occurrence of FILEGROUP2 in the partition scheme. The FILEGROUP2 from the partition scheme points to the beginning of the Mar 2019 box above. The date '20200101' in the partition function is related to the last FILEGROUP4 entry in the partition scheme. This partition scheme goes to the beginning of the right most box, for data greater than or equal to January 1, 2020. — **Figure 7-4** The relationship between the partition function, the partition scheme, and the filegroups on which the partitions will be stored.

Horizontal partition design guidelines

When designing horizontal partitions, keep these guidelines in mind, with the understanding that your mileage may vary:

The number of parallel operations that can be run per query depends on the number of processor cores in the system. Using more partitions than processor cores limits the number of partitions that will be processed in parallel. So, even though SQL Server now supports up to 15,000 partitions, on a system with 12 processor cores, at most 12 partitions will be processed in parallel. You may choose to use fewer partitions than the number of available processor cores to set aside capacity for other queries. There is also the option to disable parallel partition processing or to change the number of processors that a single query can use. Be advised that changing the number of processors per query is a server configuration.
Choose the partition key to accommodate growing column values. This could be a date value or an incrementing identity column. Usually, you will want to have new rows added to the rightmost partition.
The selected partition key should be immutable, meaning there should be no business reason for this key value to change. If the value of a partition key changes, SQL Server will execute the UPDATE statement as a DELETE and INSERT statement; there is no provision to “move” a row to another filegroup. This approach is similar to when the value of a clustered index changes.
For the partition key, a narrow data type is preferable over a wide data type.
To achieve most benefits of partitioning, specifically those related to performance, you will need to put each partition into its own filegroup. This is not a requirement, and some or all partitions can share a single filegroup. For example, the next section discusses a sliding window partition strategy, in which partitioning is beneficial even if all are in the same filegroup.
Consider the storage backing the filegroups. For example, your storage system might not provide higher performance if all filegroups have been placed on the same physical drives. Be aware that even if the drive letters are different, they may still all be on the same physical drive.
Tables that are good candidates for partitioning are tables with many—as in millions or billions—rows for which data is mostly added as opposed to updated, and on which queries are frequently run that would return data from one or a few partitions.

Implement a sliding window partition strategy

Horizontal partitioning is often applied to relational data warehouses. A common data warehouse operation is loading a significant amount of data to a fact table while simultaneously purging old data. The sliding window partition strategy is particularly well-suited for tables for which data is regularly added and removed. For example, data in a fact table can be purged after 13 months. Perhaps each time data is loaded into the data warehouse, rows older than 13 months are removed while new rows are added. This is a sliding window in that the fact table always contains the most recent 13 months of data.

To set up a sliding window, you need a partition function and scheme as well as the fact table. You should also set up a stored procedure that modifies the partition function to accommodate the new boundary values. Finally, you will need a staging table with the same columns and clustered index as the partitioned table.

Note

The next example assumes that data is loaded in the data warehouse only once every month. This is not particularly realistic, but the example still works when data is loaded more frequently, even in real time. Only the first load operation for a new month will need to modify the partition function.

Figure 7-5 illustrates what happens on March 1, 2020, when data is loaded for the month of February 2020. The fact table is partitioned into 13 partitions, one for each month. An automated process, which is not depicted here, modifies the partition function to accommodate the new date range by splitting the rightmost partition, holding the most recent data, in two. Then, the partition holding the oldest month’s data is switched out to a staging table and the new data is switched in from a staging table. Finally, the leftmost partition, which held the oldest data but is now empty, is merged with the second leftmost partition.

A diagram showing the process of partition switching. The partition begins with 13 boxes representing the time periods before February 2019, February 2019, March 2019, April 2019, May 2019, June 2019, July 2019, August 2019, September 2019, October 2019, November 2019, December 2019, and on or after January 2020. The box representing the on or after January 2020 is twice as wide as the other boxes. The next step is to split the partition. The partition on or after January 2020 has been split into January 2020 and on or after February 2020. There are now 14 partitions. Staging data for February 2020 is transferred and represented with a process indicating Staging In. The partition for on or after February 2020 now has data populated. There is also a process to move the partition for data before February 2019 out, indicated by a process named Staging In. Now that the partition for data before February 2019 has been moved out, the partitions before February 2019 and February 2019 are merged into the new partition before March 2019. — **Figure 7-5** An overview of the sliding window partition strategy.

Note

Implementing a sliding window partition strategy is not without pitfalls. To fully automate it, job auditing is required to ensure that the process that modifies the partition function operates successfully. Additional complexity is introduced if the switched-out, old data is to be archived rather than purged. Archiving a partition usually involves moving the specific partition to a new table.

You can optimize the process of switching the old partition out and the new partition in by using a memory-optimized table as the staging table.

Note

In addition, you can use the SQL Server partition feature for the history table, as described here: https://learn.microsoft.com/sql/relational-databases/tables/manage-retention-of-historical-data-in-system-versioned-temporal-tables#using-table-partitioning-approach.

Vertical partitions

Vertical partitioning makes sense when a single table would ordinarily contain many columns, some of which might contain large values that are infrequently queried. In some cases, indexes with included columns will not do enough to alleviate the issues. You may be able to improve performance by storing the infrequently accessed columns in another table. Another problem that you can solve by vertical partitioning is when you run into a maximum row size limit or maximum column count limit.

Note

We encourage you to first review your database design to ensure that one logical entity really needs such wide columns or that many attributes. Perhaps the table could be normalized into multiple related tables. If all those columns are needed, splitting the entity vertically into two or more tables can be a reasonable tactic.

An entity that is vertically partitioned into multiple tables can usually be identified by the fact that the tables have a common name prefix or suffix and share the same primary key values. A conceptual one-to-one relationship exists between the tables, which you can enforce by using a foreign key constraint.

Unlike with horizontal partitioning, SQL Server does not have a feature that directly supports vertical partitioning. As the database designer, you will need to create the necessary schema to vertically partition tables yourself.

Be careful not to abuse vertical partitioning as a strategy. Every time data from two tables is needed in a single result set, a join operation will be required. These joins could be expensive operations—or are at least more expensive than reading data from a single page—and might nullify other performance benefits if you run them frequently.

There are a few special cases for using vertical partitioning. One relates to the FileTable feature. FileTables, as noted earlier in this chapter, have a fixed schema. You might, however, need to store additional metadata about the files. Because you are unable to extend the schema, you will need to create a new table that uses the same primary key as the FileTable. Using INSERT and DELETE triggers, you can guarantee data integrity by ensuring that for every row in the FileTable, there is a matching row in your extended metadata table.

A second special case is related to temporal tables, also discussed earlier in this chapter. If there is no requirement to capture the history of all columns in the temporal table, splitting the logical entity into two vertical partitions (or perhaps more if additional considerations apply) can reduce the amount of space consumed by unneeded historical data.

Capture modifications to data

SQL Server supports several methods for capturing row data that has been modified, including temporal tables (discussed earlier in this chapter). This section discusses change tracking and change data capture, and provides recommendations on their use. Although these features allow applications to detect when data has changed, temporal tables operate very differently and serve different purposes.

Note

The SQL Server auditing feature is not covered in this section. Auditing may also meet your needs, especially if you need to track the execution of SELECT and DDL statements or include the principal that executed the statements.

Auditing is a security feature, and is discussed in Chapter 13, “Protect data through classification, encryption, and auditing.”

Use change tracking

Change tracking only tracks whether a row has changed. If you need to also track the data that has changed, see the next section, “Use change data capture.” Change tracking is mostly useful for synchronizing copies of data with occasionally offline clients or for extract, transform, load (ETL) operations. For example, an application that facilitates offline editing of data will need to perform a two-way synchronization when reconnected. One approach to implementing this requirement is to copy the data to the client. If there is concern about loading all the data on the client, however, you can also load a subset of the data. When the client goes offline, the application reads and updates data using the offline copy. When the client re-establishes connectivity to the server, changes can be merged efficiently. The application is responsible for detecting and managing conflicting changes where the client has changed rows that it needs to integrate into the live data.

Configuring change tracking is a two-step process: First, you enable change tracking for the database. Then, you enable change tracking for the table(s) you want to track.

Note

Change tracking and memory-optimized tables are mutually exclusive. A database cannot have both at the same time.

Before performing these steps, you can set up snapshot isolation for the database. Snapshot isolation is not required for proper operation of change tracking, but it is very helpful for accurately querying the changes. Another very useful strategy is to copy the changes either using a database snapshot or in a snapshot isolation level transaction. This technique ensures that the data your copy process sees is not changing. Because data can change as you are querying it, using the snapshot isolation level inside an explicit transaction enables you to see consistent results until you commit the transaction.

For an end-to-end example of how an application can use change tracking to accomplish two-way data synchronization with an occasionally offline data store, see https://learn.microsoft.com/sql/relational-databases/track-changes/work-with-change-tracking-sql-server.

The following sample script enables snapshot isolation on the WideWorldImporters sample database. Then, it enables change tracking on the WideWorldImporters sample database and on two tables: Sales.Orders and Sales.OrderLines. Only on the Sales.Orders table is column tracking activated, so change tracking will include which columns were included in UPDATE statements (whether or not the values were actually changed). Next, change tracking is disabled for Sales.OrderLines. Finally, the sys.change_tracking_tables catalog view is queried to retrieve a list of tables with change tracking enabled.

Click here to view code image

USE master;
GO
-- Enable snapshot isolation for the database, if desired
ALTER DATABASE WideWorldImporters
    SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Enable change tracking for the database
ALTER DATABASE WideWorldImporters
    SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 5 DAYS, AUTO_CLEANUP = ON);
USE WideWorldImporters;
GO
-- Enable change tracking for Orders
ALTER TABLE Sales.Orders
    ENABLE CHANGE_TRACKING
    -- and track which columns were included in the statements
WITH (TRACK_COLUMNS_UPDATED = ON);
-- Enable change tracking for OrderLines
ALTER TABLE Sales.OrderLines
    ENABLE CHANGE_TRACKING;
-- Disable change tracking for OrderLines
ALTER TABLE Sales.OrderLines
    DISABLE CHANGE_TRACKING;
-- Query the current state of change tracking in the database
SELECT *
FROM sys.change_tracking_tables;

A major benefit compared to implementing a custom solution is that change tracking does not make any schema changes to the user tables such as triggers, or require additional user tables to store the captured changes. In addition, change tracking is available in all editions of SQL Server and in Azure SQL Database. Autocleanup ensures that the database does not grow unchecked.

Note

We recommend enabling autocleanup and setting a retention period of sufficiently long duration to ensure that data synchronization has taken place. Applications can check whether they have waited too long to synchronize; that is, applications can find out whether cleanup has already removed tracking information since the application last synchronized.

Although change tracking can track which rows have changed and optionally which columns were included in an UPDATE statement’s SET clause, it cannot indicate what the old values were or how often the row has been changed. If your use case does not require this, change tracking provides a lightweight option for tracking. If your use case does require one or both, consider using change data capture instead, discussed next.

Use change data capture

Change data capture varies in some important ways from change tracking. First and foremost, change data capture captures the historical values of the data. This requires a significantly higher amount of storage than change tracking. Unlike change tracking, change data capture uses an asynchronous process to write the change data. This means the client need not wait for the change data to be committed before the database returns the result of the DML operation, as it uses the log to do this asynchronously.

Note

Because change data capture relies on the SQL Server Agent, it is not available in Azure SQL Database, although it is available in Azure SQL Managed Instance. Starting with SQL Server 2016 with Service Pack 1, change data capture is available in Standard edition.

Note

Since SQL Server 2017 CU 15, change data capture and memory-optimized tables can be used in the same database. Previously, a database could not have both at the same time.

The following script enables change data capture on a database using the sys.sp_cdc_enable_db stored procedure. Then, the script enables change data capture for the Sales.Invoices table. The script assumes that a user-defined database role cdc_reader has been created. Members of this role will be able to query the captured data changes.

Click here to view code image

USE WideWorldImporters;
GO
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = 'Sales',
    @source_name = 'Invoices',
    @role_name = 'cdc_reader';

Note

If you run the preceding script while SQL Server Agent is stopped, the cdc_reader database role will be created as well as the SQL Server Agent jobs. If you try to run it again, you will get an error that the specific capture instance already exists.

After executing these statements, data changes to the Sales.Invoices table will be tracked, complete with before and after data.

Additional optional settings for change data capture include which columns to track and on which filegroup to store the change table. For more information, visit https://learn.microsoft.com/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server.

Inside OUT

How can application developers best use change tracking and change data capture?

There are two important considerations to effectively use these SQL Server features in client applications:

An application should identify itself by using a source ID when synchronizing data. By providing a source ID, the client can avoid obtaining the same data again. A client specifies its own source ID by using the WITH CHANGE_TRACKING_CONTEXT clause at the start of statements.
An application should perform the request for changed data in a snapshot isolation-level transaction. This will prevent another application from updating data between the check for updated data and when data updates are sent. The snapshot isolation level must be enabled at the database level, as described in the previous section.

Query change tracking and change data capture

When change tracking or change data capture is enabled, SQL Server offers functions for querying the tracking and capture information. These functions are demonstrated here using the WideWorldImporters sample as modified in the preceding two sections.

Query change tracking

The following code sample updates a row in the Sales.Orders table, which has change tracking enabled. The next statement in the sample then demonstrates how to query the information gathered by change tracking.

Click here to view code image

-- Modify a row in the Orders table,
-- which has change tracking enabled
UPDATE Sales.Orders
    SET Comments = 'I am a new comment!'
    WHERE OrderID = 1;
DECLARE @OrderCommentsColumnId int =
    COLUMNPROPERTY(OBJECT_ID('Sales.Orders'), N'Comments', 'ColumnId'),
    @DeliveryInstructionsColumnId int =
    COLUMNPROPERTY(OBJECT_ID('Sales.Orders'), N'DeliveryInstructions', 'ColumnId');
-- Query all changes to Sales.Orders
SELECT *
    -- Determine if the Comments column was included in the UPDATE
    , CHANGE_TRACKING_IS_COLUMN_IN_MASK(@OrderCommentsColumnId,
        CT.SYS_CHANGE_COLUMNS) CommentsChanged
    -- Determine if the DeliveryInstructions column was included
    , CHANGE_TRACKING_IS_COLUMN_IN_MASK(@DeliveryInstructionsColumnId,
        CT.SYS_CHANGE_COLUMNS) DeliveryInstructionsChanged
FROM CHANGETABLE(CHANGES Sales.Orders, 0) as CT
ORDER BY SYS_CHANGE_VERSION;

The output includes the values of the primary key—in this case the values of the single column OrderID—and a SYS_CHANGE_OPERATION column indicating that an UPDATE statement was executed using the value U. Because you enabled column tracking for the Orders table, there is a non-NULL value for the SYS_CHANGE_COLUMNS column in the output.

For each column for which you’d like to determine whether it was included in the UPDATE statement, use the CHANGE_TRACKING_IS_COLUMN_IN_MASK function. It uses the column as the position in the SYS_CHANGE_COLUMNS bitmask to indicate if the value is changed. In the sample, two output columns are added that determine if the Comments and DeliveryInstructions columns were included in the UPDATE statement. A value of 1 indicates yes, and 0 indicates no.

Query change data capture

The following code sample updates a row in the Sales.Invoices table that has change data capture enabled. Then the change data capture table is queried with the option all update old, which causes the single UPDATE statement to have two rows in the output, one containing the old image and the other row containing the updated data.

Click here to view code image

-- Modify a row in the Invoices table,
-- which has change data capture enabled
UPDATE Sales.Invoices
    SET Comments = 'I am a new invoice comment again'
    WHERE InvoiceID = 1;
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('Sales_Invoices'),
    @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();
-- Each capture instance will have unique function names
-- By default, the capture instance name is schema_table
-- Note: there may be a slight delay before output is returned
SELECT *
FROM cdc.fn_cdc_get_all_changes_Sales_Invoices(@from_lsn, @to_lsn,
    N'all update old');

Note

There are many more change data capture functions and stored procedures available. These are covered at https://learn.microsoft.com/sql/relational-databases/system-functions/change-data-capture-functions-transact-sql and https://learn.microsoft.com/sql/relational-databases/system-stored-procedures/change-data-capture-stored-procedures-transact-sql.

Compare change tracking, change data capture, and temporal tables

This section compares three features that have common use cases. Table 7-1 should prove helpful when you’re deciding which change tracking feature is appropriate for your needs.

Table 7-1 A comparison of features and uses of change tracking, change data capture, and temporal tables.

	Change tracking	Change data capture	Temporal tables
Requires user schema modification	No	No	Yes
Available in Azure SQL Database	Yes	No	Yes
Edition support	Any	Enterprise and Standard	Any
Provides historical data	No	Yes	Yes
Tracks DML type	Yes	Yes	No
Has autocleanup	Yes	Yes	Yes
Time of change indicator	LSN	LSN	`datetime2`

Benefits of PolyBase for external data sources and external tables

PolyBase is a feature that allows SQL Server to query and interact with data sources outside of SQL Server. This concept is called data virtualization. In data virtualization, data that is external to the SQL Server is made to behave like internal SQL Server data. When working with data virtualization, you’ll become intimately familiar with the CREATE DATABASE SCOPED CREDENTIAL and CREATE EXTERNAL DATA SOURCE T-SQL syntax.

While PolyBase is the SQL Server feature for data virtualization, the feature name isn’t necessarily uniform across SQL platforms. For example, Azure Synapse Analytics and Azure SQL Manage Instance have data virtualization (the latter introduced in September 2022), but do not use the PolyBase feature name. There are four key scenarios that best use PolyBase in SQL Server:

Parallelized import of data into SQL Server
Joining multiple data sources in a single query
Eliminating data latency and reducing the need for multiple copies of data
Archiving data to alternate storage

Unified data platform features

PolyBase enables database administrators to create a unified data platform for data analytics and applications. Through the use of PolyBase, SQL Server can interact with different file types and databases using T-SQL. Supported external data sources include Oracle, MongoDB, Teradata, Generic ODBC, Azure Storage, and SQL Server.

Some recent version-specific changes to PolyBase include:

Starting with SQL Server 2019, PolyBase is available on Linux as well as Windows.
Starting in SQL Server 2022, PolyBase offers functionality to allow connections to S3-compatible object storage, including the ability to interact with Apache Parquet file and Delta table formats. Hadoop external data sources are retired in SQL Server 2022.
Comma-separated values (CSV) files are also supported in Azure Storage in SQL Server 2022.
PolyBase scale-out groups are deprecated in SQL Server 2019 and retired in SQL Server 2022.
While it isn’t called PolyBase, Azure SQL Managed Instance supports some data virtualization features. Azure SQL Managed Instance even has some features that PolyBase doesn’t, like the ability to authenticate to Azure Blob Storage with a managed identity.

Note

The external provider capabilities of PolyBase are tricky. The details of CREATE EXTERNAL DATA SOURCE change with every major version of SQL Server. Azure SQL platforms have different capabilities as well. For example, Cloudera CDH and Hortonworks HDP support in SQL Server 2019 was removed from SQL Server 2022. In Microsoft Learn Docs, pay special attention to the version selector (top-left) for the CREATE EXTERNAL DATA SOURCE reference article at https://learn.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql. Each version of SQL Server contains different LOCATION argument options, for example.

Microsoft provides how-to guides for each of the many different types of external data sources—for example, how to configure PolyBase to access external data in Oracle.

Pushdown computation

Pushdown computation is supported for numerous data sources, including generic ODBC, Oracle, SQL Server, Teradata, MongoDB, and Azure Blob Storage.

The purpose of pushdown computation is to allow specific computations to be performed on the external data source. This distributes some of the workload of the query to the remote data source, reducing the amount of data that needs to be transmitted and therefore improving query performance. The computations include joins, aggregations, expressions, and operators.

For more information about what computations work with pushdown, visit https://learn.microsoft.com/sql/relational-databases/polybase/polybase-pushdown-computation#enable-pushdown-computation.

INSIDE OUT

How can you tell if pushdown computation occurred in a PolyBase query?

There is only certain T-SQL syntax supported for translation to the external data source’s ODBC commands, and there are some limitations that would prevent pushdown from otherwise occurring.

The methods for determining whether pushdown has occurred may vary by SQL Server version. Before SQL Server 2019, Trace Flag 6408 was required to surface certain key indicators of pushdown. Starting with SQL Server 2019, the read_command column of sys.dm_exec_external_work will contain the remote query, with the limitation that it only contains the first 4,000 characters of the remote query.

Microsoft provides a detailed walk-through for detecting PolyBase pushdown for joins, aggregations, and filters here: https://learn.microsoft.com/sql/relational-databases/polybase/polybase-how-to-tell-pushdown-computation.

Install and configure PolyBase

This section looks at the installation of PolyBase and the application of PolyBase concepts.

Installation

When installing PolyBase on a Windows machine, keep these points top of mind:

PolyBase can be installed on only one instance per server.
To use PolyBase, you need to be assigned the sysadmin role or Control Server permission.
In SQL Server 2019 and after, you must enable PolyBase using sp_configure after installation. This is an important step to remember in both Windows and Linux.
If the Windows firewall service is running during installation, the necessary firewall rules will be set up automatically. Other firewalls external to your SQL Server instance may still require changes.

For step-by-step instructions on how to install PolyBase, see the Microsoft documentation for your specific version of SQL Server:

On Windows, see https://learn.microsoft.com/sql/relational-databases/polybase/polybase-installation.
On Linux (Red Hat, Ubuntu, or SUSE), see https://learn.microsoft.com/sql/relational-databases/polybase/polybase-linux-setup.

Note

Decide on the service account for PolyBase in advance. The only way to change the service account for the PolyBase Engine Service and Data Movement Service is to uninstall and reinstall PolyBase.

With the installation complete, you accomplish the next steps via T-SQL syntax:

Create the database master key, if it doesn’t already exist, with CREATE MASTER KEY.
Create a database scoped credential for the data source with CREATE DATABASE SCOPED CREDENTIAL.
Configure the external data source with CREATE EXTERNAL DATA SOURCE.
Create external tables, with CREATE EXTERNAL TABLE.

For additional details, follow the steps at https://learn.microsoft.com/sql/relational-databases/polybase/polybase-configure-sql-server. These steps are also covered in this chapter.

Configure and enable

A master key is a symmetric key used to protect other keys and certificates in the database. It is encrypted with both an algorithm and password. A master key is necessary to protect the credentials of the external tables. Details on how to create and update a master key, and the instructions to set it up, can be found in this chapter in the section “External tables.”

Inside OUT

How many master keys can a database have?

Each database can only have one master key. Before creating a new master key, query the sys.symmetric_keys table in the desired database to confirm there is not a master key already present.

If you try to create a master key on a database that already has a master key, you will receive the following error:

Click here to view code image

There is already a master key in the database. Please drop it before
performing this statement.

However, you can have multiple passwords for the same master key. If you already have a master key created, and you want a specific password for use with PolyBase, you can issue the following command:

Click here to view code image

ALTER MASTER KEY ADD ENCRYPTION BY PASSWORD  = '<strong password here>';

Starting in SQL Server 2019, you need to enable PolyBase globally using sp_configure. Once complete, you must run RECONFIGURE, and then restart the SQL Server service. Both PolyBase services will have to be started manually as they are turned off during this process and do not restart automatically.

This can be done with these commands:

Click here to view code image

exec sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE [ WITH OVERRIDE ] ;

For more details on this command, visit https://learn.microsoft.com/sql/database-engine/configure-windows/polybase-connectivity-configuration-transact-sql.

Note

If you are unsure if PolyBase is installed, you can check its SERVERPROPERTY with this command:

Click here to view code image

SELECT SERVERPROPERTY ('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

It returns 1 if it is installed and 0 if it is not.

Shared access signatures

As the name suggests, a shared access signature (SAS) is a way to share an object in your storage account with others without exposing your account key. This gives you granular control over the access you grant, at the account, service, or user level.

You can set a start time and expiry time.
Azure Blob Storage containers, file shares, queues, and tables are all resources that accept SAS policies.
You can set an optional IP address or range from which access will be accepted.
You can restrict access from HTTPS clients by specifying the accepted protocol.

An SAS is used to allow clients to read, write, or delete data in your Azure Storage account without access to your account key. This is typically necessary when a client wants to upload large amounts of data or high-volume transactions to your storage account, and creating a service to scale and match demand is too difficult or expensive.

The SAS is a signed Uniform Resource Identifier (URI) that points to resources and includes a token containing a set of query parameters that indicate how the resource can be accessed. Azure Storage checks both the storage piece and the provided token piece of the SAS URI to verify the request. If for any reason the SAS is not valid, it receives an error code 403 (Forbidden), and access is denied.

There are three different types of SAS:

User delegation shared access signatures rely on Azure AD credentials.
The service SAS delegates access via the storage account key to a resource in a single storage service (blob, queue, table, or file service).
The account-level SAS delegates access via the storage account key to resources in one or more storage services within a storage account. This also includes the option to apply access to services such as Get/Set Service Properties and Get Service Stats.
- For details on all the parameters that can be set with shared access signatures, see https://learn.microsoft.com/azure/storage/common/storage-dotnet-shared-access-signature-part-1#shared-access-signature-parameters.

You can create an ad hoc SAS either as an account SAS or a service SAS, and the start time, expiry time, and permissions are all specified in the SAS URI, the SECRET that is used to eventually create the CREDENTIAL in SQL Server.

You can also create policies to use for many shared access signatures. You may find that using the free Azure Storage Explorer application makes creating stored access policies and shared access signatures easy. You can download it here: https://aka.ms/storageexplorer. You can also create policies and shared access signatures in PowerShell.

Inside OUT

Are there any risks to using or storing a SAS URI?

Yes. Anyone who has the SAS URI can access your storage account. It belongs in the SECRET of a CREATE CREDENTIAL T-SQL statement and shouldn’t be stored anywhere it could be discovered—including in source control or unsecured documentation.

If a SAS expires (an expiration date must be provided), applications relying on the SAS could break. In SQL Server, you should generate a new SAS and ALTER the credential to use it.

To mitigate against SAS expiration, we recommended that you:

Always use HTTPS to create or distribute a SAS.
Use storage access policies.
Set the expiration on an account SAS far in the future and regularly monitor and rotate the SAS as necessary.
Set the expiration on an ad hoc SAS shorter to reduce the risk if compromised.
Ensure clients renew with enough time to avoid disruption.
Ensure you use the principle of least privilege when creating SAS URIs to reduce risk. Consider multiple shared access signatures with narrowly defined permissions.

Data uploaded with a SAS is not validated in any way, so a middle tier that performs rule validation, authentication, and auditing may still be the better option. Regardless of what you choose, monitor your storage for spikes in authentication failures.

Examples of how to create account and service SAS can be found at https://learn.microsoft.com/azure/storage/common/storage-dotnet-shared-access-signature-part-1#sas-examples.

Port requirements for Hadoop

If you plan to access Hadoop with PolyBase in SQL Server 2016 through SQL Server 2019, there are some additional requirements. These relate to which cluster components and ports need to be open for PolyBase to interact correctly with a Hadoop external data source. The ports for the following cluster components must be open:

HDFS ports
- NameNode
- DataNode
Resource manager
- Job submission
- Job history

Table 7-2 shows the default ports for these components. These ports are dependent on the version of Hadoop. In addition, the ports may be different if Hadoop is using a custom configuration.

Table 7-2 Port requirements for Hadoop access with PolyBase.

Hadoop cluster component	Default port
NameNode	8020
DataNode (Data transfer, non-privilege IPC port)	50010
DataNode (Data transfer, privilege IPC port)	1019
Resource Manager Job Submission (Hortonworks 1.3)	50300
Resource Manager Job Submission (Cloudera 4.3)	8021
Resource Manager Job Submission (Hortonworks 2.0 on Windows, Cloudera 5.x on Linux)	8032
Resource Manager Job Submission (Hortonworks 2.x, 3.0 on Linux, Hortonworks 2.1–3 on Windows)	8050
Resource Manager Job History	10020

Database scoped credential

Database scoped credentials are used to access non-public blob storage accounts from SQL Server or Azure Synapse with PolyBase. SQL Server with PolyBase also requires database scoped credentials for several types of connectors for PolyBase, including S3-compliant object storage, ODBC generic types, and Azure Blob Storage.

Find the syntax to create scoped credentials at https://learn.microsoft.com/sql/t-sql/statements/create-database-scoped-credential-transact-sql.

Note

If your SAS key value begins with a question mark (?), be sure to remove the leading ? because it will not be recognized. Ensure you have already set up a master key. The master key will be used to protect these credentials. If you do not yet have a master key, the instructions to set it up can be found later in this chapter in the section, “External tables.”

External data sources

External data sources are used to establish connectivity to systems outside of SQL Server for data virtualization or loading data using PolyBase. The most common use cases are loading data with bulk INSERT or OPENROWSET activities, or accessing data that would otherwise not be available in SQL Server because it resides on another system—even a non-relational storage system.

The details of the CREATE EXTERNAL DATA SOURCE T-SQL syntax have changed with every major version of SQL Server. In Microsoft Docs, pay special attention to the version selector (top-left) in the CREATE EXTERNAL DATA SOURCE reference article: https://learn.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql.

The following example shows how to load data from a CSV file in an Azure Blob Storage location that has been configured as an external data source. This requires a database scoped credential using a shared access signature.

Click here to view code image

-- Create the External Data Source
-- Remove the ? from the beginning of the SAS token
-- Do not put a trailing /, file name, or shared access signature parameters at the end
of the LOCATION URL when configuring an external data source for bulk operations.
CREATE DATABASE SCOPED CREDENTIAL AccessPurchaseOrder
WITH
     IDENTITY = 'SHARED ACCESS SIGNATURE'
, SECRET = '******srt=sco&sp=rwac&se=2022-02-01T00:55:34Z&st=2023-12-
29T16:55:34Z***************'
;
CREATE EXTERNAL DATA SOURCE ExternalPurchaseOrder
WITH
(LOCATION   = 'https://newinvoices.blob.core.windows.net/week3'
,CREDENTIAL = AccessPurchaseOrder, TYPE = BLOB_STORAGE)
;
--Insert into
BULK INSERT Sales.Orders
FROM 'order-2022-11-04.csv'
WITH (DATA_SOURCE = ' ExternalPurchaseOrder');

Note

Ensure you have at least read permission on the object that is being loaded, and that the expiration period of the SAS is valid (all dates are in UTC time). A credential is not needed if the Azure Blob Storage account has public access.

Creating the external data source can be tricky. Your external data source may require some specifics of the connection string to be provided via the CONNECTION_OPTIONS parameter of the CREATE EXTERNAL DATABASE statement.

Microsoft recently documented the details of potentially required third-party provider parameters in an article for CREATE EXTERNAL DATA SOURCE CONNECTION_OPTIONS, at https://learn.microsoft.com/sql/t-sql/statements/create-external-data-source-connection-options.

External file format

The external file format is required before you can create the external table, but only for some external data source types. As suggested, the file format specifies the layout of the data to be referenced by the external table. Hadoop, Azure Blob Storage, and Azure Data Lake Storage all need an external file format object defined for PolyBase. Delimited text files, Parquet, and both RCFile and ORC Hive files are supported.

Here’s an example:

Click here to view code image

CREATE EXTERNAL FILE FORMAT skipHeader_CSV
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS(
          FIELD_TERMINATOR = ',',
          STRING_DELIMITER = '"',
          FIRST_ROW = 2,
          USE_TYPE_DEFAULT = True)
);

Syntax for this command can be found at https://learn.microsoft.com/sql/t-sql/statements/create-external-file-format-transact-sql.

Starting with SQL Server 2022, you can also query data directly from the folder of Delta tables in Azure Blob Storage. Creating an external file format is required, but simpler:

Click here to view code image

CREATE EXTERNAL FILE FORMAT deltaTable1 WITH (FORMAT_TYPE = DELTA);

External tables

External tables are used to read specific external data and to import data into SQL Server. You don’t have to create an external table. You can use OPENROWSET to query data on some external data sources, but external tables provide strong data typing and easy joins to traditional SQL Server tables using T-SQL queries. External tables are an important part of the promise PolyBase brings as a single plane of data, inside SQL Server, for your entire data estate.

No actual data is moved to SQL Server during the creation of an external table; only the metadata and basic statistics about the folder and file are stored. The intent of the external table is to be the link connecting the external data to SQL Server to create the data virtualization. The external table looks much like a regular SQL Server table in format and has a similar syntax, as you see here with this external Oracle table.

Click here to view code image

-- Create a Master Key
   CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'password';
   CREATE DATABASE SCOPED CREDENTIAL credential_name
   WITH IDENTITY = 'username', Secret = 'password';
-- LOCATION: Location string for data
   CREATE EXTERNAL DATA SOURCE external_data_source_name
   WITH ( LOCATION = 'oracle://<server address>[:<port>]',
   CREDENTIAL = credential_name)
--Create table
   CREATE EXTERNAL TABLE customers(
   [O_ORDERKEY] DECIMAL(38) NOT NULL,
   [O_CUSTKEY] DECIMAL(38) NOT NULL,
   [O_ORDERSTATUS] CHAR COLLATE Latin1_General_BIN NOT NULL,
   [O_TOTALPRICE] DECIMAL(15,2) NOT NULL,
   [O_ORDERDATE] DATETIME2(0) NOT NULL,
   [O_ORDERPRIORITY] CHAR(15) COLLATE Latin1_General_BIN NOT NULL,
   [O_CLERK] CHAR(15) COLLATE Latin1_General_BIN NOT NULL,
   [O_SHIPPRIORITY] DECIMAL(38) NOT NULL,
   [O_COMMENT] VARCHAR(79) COLLATE Latin1_General_BIN NOT NULL
   )
   WITH ( LOCATION='customer', DATA_SOURCE= external_data_source_name   );

Note

Schema drift can affect external tables. If the external source changes, it is not automagically changed in the external table definition. Any change to the external source will need to be reflected in the external table definition.

In ad hoc query scenarios, such as querying Hadoop data, PolyBase stores the rows retrieved from the external data source in a temporary table. After the query completes, PolyBase removes and deletes the temporary table. No permanent data is stored in SQL tables.

In an import scenario, such as SELECT INTO from an external table, PolyBase stores rows returned as permanent data in the SQL table. The new table is created during query execution when PolyBase retrieves the external data.

PolyBase can push some of the query computation to improve query performance. This action is called predicate pushdown. It is used by specifying the Hadoop resource manager location option when creating the external data source and enabling pushdown using these parameters:

Click here to view code image

PUSHDOWN                  = [ON | OFF]
, RESOURCE_MANAGER_LOCATION = '<resource_manager>[:<port>]'

With some third-party data providers, instead of using the PUSHDOWN keyword, you may need to specify pushdown-related keywords in the CONNECTION_OPTIONS argument of CREATE EXTERNAL DATA SOURCE.

You can create many external tables that reference the same or different external data sources.

Note

Elastic query also uses external tables, but the same table cannot be used for both elastic queries and PolyBase. Although they have the same name, they are not the same.

Statistics

Statistics on external tables are created the same as other tables.

Syntax for CREATE STATISTICS can be found at https://learn.microsoft.com/sql/t-sql/statements/create-statistics-transact-sql.

Catalog views

There are catalog views for the installation and running of PolyBase.

sys.external_data_sources. Used to identify external data sources and to give visibility into related metadata. This includes the source, type, and name of the remote database, and in the case of Hadoop, the resource manager’s IP and port. This can be very helpful.
sys.external_file_formats. Used to obtain details on the external file format for the sources. Along with the file format type, it includes details about the delimiters, general format, encoding, and compression.
sys.external_tables. Contains a row for each external table in the current database. It details information needed about the tables, such as ID links to preceding tables. Many of the columns provide details about external tables over the shard map manager; this is a special database that maintains global mapping information about all shards (databases) in a shard set. The metadata allows an application to connect to the correct database based upon the value of the sharding key. Every shard in the set contains maps that track the local shard data (known as shardlets).

Dynamic management views

Several dynamic management views can be used with PolyBase to troubleshoot issues, such as finding long-running queries, and to monitor nodes in a PolyBase group.

A list of relevant views can be found at https://learn.microsoft.com/sql/relational-databases/polybase/polybase-troubleshooting#dynamic-management-views.

More PolyBase examples, architectures including S3 and URL queries

S3-compatible object storage is a new feature of SQL Server 2022, providing both backup and restores to S3-compatible storage and for use with PolyBase. SQL Server 2022 enables you to connect to external data stored in any S3-compatible object storage. These functions allow you to connect to various files stored in S3 storage, including Parquet files. You can read from these files and, if you grant the correct permissions to the S3 user, write to them.

Many developers like to use S3-compatible object storage, and the relative low cost makes it a desirable solution to store files. However, challenges can arise when trying to access these files within a relational database. If you would like to write files directly to S3-compatible storage without using SSIS, PolyBase may be a good solution. To set this connection up, you need to create several objects, including a database scoped credential, external data source, external file format, and external table.

For more information and details on the permissions and limitations of this new feature of SQL Server 2022, see https://learn.microsoft.com/sql/relational-databases/polybase/polybase-configure-s3-compatible.

The following code walks you through how to set this up.

Click here to view code image

-- Create database scoped credential
IF NOT EXISTS(SELECT * FROM sys.credentials WHERE name = 'sqlserver2022parquets3')
BEGIN
 CREATE DATABASE SCOPED CREDENTIAL sqlserver2022parquets3 --PolyBaseS3
 WITH IDENTITY = 'S3 Access Key',
 SECRET = '######';
END
-- Create external source
-- Can use URL not just IP address
CREATE EXTERNAL DATA SOURCE sqlserver2022parquetdc
WITH
(
    LOCATION = 's3://sqlserver2022parquet.s3.us-east-1.amazonaws.com/',
    CREDENTIAL = sqlserver2022parquets3
);
-- Create external file format
CREATE EXTERNAL FILE FORMAT ParquetFileFormat WITH (FORMAT_TYPE = PARQUET);
GO
-- Create external table
-- Location below specifies folder and filename
CREATE EXTERNAL TABLE Warehouse.ColdRoomTemperaturesParquet (
    [ColdRoomTemperatureID] [bigint] ,
    [ColdRoomSensorNumber] [int] ,
    [RecordedWhen] [datetime2](7) ,
    [Temperature] [decimal](10, 2) ,
    [ValidFrom] [datetime2](7) ,
    [ValidTo] [datetime2](7)  )
WITH (LOCATION = '/output/ColdRoomTemperatures.parquet',
DATA_SOURCE = sqlserver2022parquetdc,
FILE_FORMAT = ParquetFileFormat);
GO
-- Query data directly from S3 storage with OPENROWSET
SELECT  TOP 1 *
FROM    OPENROWSET
        (   BULK 'output/ColdRoomTemperatures.parquet',
            FORMAT       = 'PARQUET',
            DATA_SOURCE  = 'sqlserver2022parquetdc'
        ) AS [cc];
-- Can query the external table directly as well
SELECT  TOP 1 *
FROM Warehouse.ColdRoomTemperaturesParquet;

Inside OUT

How do you set up S3-compatible object storage?

Over the past several years, the use of S3-compatible object storage has become increasingly popular among developers. However, one of the challenges has been how to integrate S3-compatible object storage with SQL Server. For those of us accustomed to working with SQL Server, this becomes even trickier, as we are not always familiar with how to set up S3-compatible object storage. This can make it increasingly difficult to get a proof-of-concept set up and tested.

If you want to set up your own S3-compatible object storage with the new capabilities of SQL Server 2022, you can with MinIO using their free GNU AGPL v3 edition. Anthony Nocentino has written detailed instructions on how you can get MinIO set up correctly. You can find more details at https://www.nocentino.com/posts/2022-06-10-setting-up-minio-for-sqlserver-object-storage.

PolyBase examples with a generic ODBC driver

While PolyBase allowed connections to Oracle, Teradata, and MongoDB in prior versions of SQL Server, SQL Server 2022 introduces the ability to connect using ODBC generic types. By installing the appropriate driver and setting up the external database objects, you can connect to other relational databases, including MySQL or SAP as examples. (This feature is supported only by SQL Server on Windows at the time of this book’s writing.)

The following sample code contains an example of a generic ODBC connector for MySQL:

Click here to view code image

-- Create database scoped credential
IF NOT EXISTS(SELECT * FROM sys.credentials WHERE name = 'sqlserver2022mysql')
BEGIN
 CREATE DATABASE SCOPED CREDENTIAL sqlserver2022mysql
 WITH IDENTITY = 'sqlzelda',
 SECRET = '<Strong Password>';
END
-- Create external source
CREATE EXTERNAL DATA SOURCE sqlserver2022mysqldc
WITH ( LOCATION = 'odbc://localhost:3306',
CONNECTION_OPTIONS = 'Driver={MySQL ODBC 8.0 ANSI Driver};
ServerNode = localhost:3306',
--PUSHDOWN = ON,
CREDENTIAL = sqlserver2022mysql );
-- Create external table
-- Location below specifies folder and filename
CREATE EXTERNAL TABLE Warehouse.ColdRoomTemperatureMySQL
(
    ColdRoomTemperatureID INT NOT NULL,
    ColdRoomSensorNumber INT NOT NULL,
    Temperature DECIMAL(10, 2) NOT NULL--,
)
 WITH
 (
    LOCATION='coldroom.coldroomtemperatures',
    DATA_SOURCE = sqlserver2022mysqldc
 );
GO
-- Add index to external table
CREATE STATISTICS stx_coldroomsensornumber
ON Warehouse.ColdRoomTemperatureMySQL (ColdRoomSensorNumber)
WITH FULLSCAN;
-- Can query the external table directly as well
SELECT TOP 1 ColdRoomTemperatureID

Here, FROM [Warehouse].[ColdRoomTemperatureMySQL] is the external table virtualizing data through the generic ODBC connector. To re-create this example, you need to download the MySQL ODBC 8.0 ANSI driver, though you can use any data provider’s ODBC driver to use this feature.

Azure bulk operations examples

In addition to being able to access Azure Blob Storage, PolyBase will also allow you to access CSV files and Delta tables. When connecting to Azure Blob Storage, you will be able to query, import, and export data. You can access these files by querying external tables or opening the files directly using OPENROWSET. For example:

Click here to view code image

CREATE DATABASE SCOPED CREDENTIAL AzureBlob
WITH  IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup
&se=2099-08-19T23:56:04Z&st=2022-08-19T15:56:04Z&spr=https&sig=ZWHPwhateverD';
CREATE EXTERNAL DATA SOURCE AzureBlob_ForBulk
WITH  ( LOCATION = 'https://container.blob.core.windows.net/subfolder' , --No trailing /
CREDENTIAL = AzureBlob, TYPE = BLOB_STORAGE);

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 7. Understand table features

Create new playlist

Sign In

Sign Up

Chapter 7

Review table structures

General-purpose data types

Alphanumeric types

Collation

Large value data

Numeric types

Date and time types

Binary types

Specialized data types

Spatial data types: geometry and geography

The XML data type

The rowversion data type

Implement optimistic concurrency

The uniqueidentifier data type

The hierarchyid data type

The sql_variant data type

Data type precedence

Constraints

Primary keys, foreign keys, and relationships

Unique constraints

Check constraints

Default constraints

Sequence objects

User-defined data types and user-defined types

User-defined data types

CLR user-defined types

Sparse columns

Computed columns

Special table types

System-versioned temporal tables

Create a system-versioned temporal table

Understand data movement in temporal tables

Query temporal tables

Manage temporal tables

Memory-optimized tables

Database preparation for memory-optimized tables

Natively compiled stored procedures and user-defined functions

Caveats to memory-optimized tables

Graph tables

Define graph tables

Work with graph data

Graph table shortcomings

Store large binary objects

Understand FILESTREAM

FileTable

Table partitions

Horizontally partitioned tables and indexes

About horizontal partitions

Index partitioning

Define partitions and partition a table

Horizontal partition design guidelines

Implement a sliding window partition strategy

Vertical partitions

Capture modifications to data

Use change tracking

Use change data capture

Query change tracking and change data capture

Query change tracking

Query change data capture

Compare change tracking, change data capture, and temporal tables

Benefits of PolyBase for external data sources and external tables

Unified data platform features

Pushdown computation

Install and configure PolyBase

Installation

Configure and enable

Shared access signatures

Port requirements for Hadoop

Database scoped credential

External data sources

External file format

External tables

Statistics

Catalog views

Dynamic management views

Table of Contents for
Chapter 7. Understand table features