Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

21 Data masking

Data protection is heavily regulated in many countries, and failing to protect certain types of data may break national and international laws, such as the GDPR in the European Union and HIPAA in the United States. For these laws and many additional reasons, one of the most important duties of a DBA is preventing data leaks.

SQL Server implements many security principles, like authentication and authorization, to help protect data from unauthorized access, but these measures can be bypassed when databases are moved from production to other environments, such as development and testing, or when databases are given to vendors for troubleshooting.

To reduce the potential for data breaches when sharing databases that contain sensitive data, we must consider protecting data privacy by replacing any personally identifiable information (PII) with fabricated data, while also keeping the resulting data meaningful to the consuming applications or test suites. PII includes, but is not limited to, name, birth date, passport number, home address, and phone number.

In this chapter, we will focus on static data masking. This is the process of permanently replacing sensitive data at rest with new values by updating the data in our database.

Note SQL Server has a feature called Dynamic Data Masking, which replaces data in transit, so the actual data is not modified. This is not what dbatools offers, so it won’t be covered here. To reiterate, data masking in the dbatools context is about the anonymization of the data, also known as static data masking, which is the process of replacing sensitive data in its entirety.

First, we’ll share the options that dbatools offers to help identify and mask our data. Then, we’ll show how we would generate masking configurations and apply them against a database that contains PII. Finally, we’ll demonstrate how these configurations can be saved to make the process easily repeatable. Easily repeatable processes not only increase the chances that you’ll protect your data each time it’s exported, but they also make it possible to automate within a CI/CD pipeline.

In this chapter, we will address a few key dbatools commands, as shown in table 21.1.

Table 21.1 Key data masking commands

Command	Description
`Get-DbaRandomizedValue`	Generates random values for various data types
`Get-DbaRandomizedType`	Displays a dictionary of randomized types and subtypes (e.g., internet and email) available within our masking suite
`Invoke-DbaDbPiiScan`	Helps figure out which columns may potentially contain PII
`New-DbaDbMaskingConfig`	Simplifies the generation of data masking configuration files
`Invoke-DbaDbDataMasking`	Masks data by using randomized values determined by a configuration file and a randomizer framework
`Test-DbaDbDataMaskingConfig`	Tests masking configuration files to ensure they are valid and consumable

21.1 Getting started

You can find the commands dbatools offers for dealing with this topic by executing the command shown next.

Listing 21.1 Finding command names that contain DataMasking

Find-DbaCommand DataMasking

21.2 A common approach

Sometimes, if we’re lucky, the development team that designs the database also creates scripts to run some update statements to scramble the data. However, other times, the dev teams don’t create these scripts at all, or we are required to work with third-party software that also lacks this functionality. In these cases, we may not be familiar with the database schema, but we still want to mask the data before handing over a copy of the database to others.

One common approach to this problem is to update data based on a simple pattern or several patterns. So, for instance, you might use a T-SQL script to update the values of two columns, such as the following:

All “T” chars will become “D,” so if we had a “Tom” on the FirstName column, these will become “Dom.”
Phone numbers will be doubled, and then the first nine digits of that number will be used in place of the original. The phone number “987654321,” for example, would become “197530864.”

You can see a simple T-SQL based example of this in the next listing.

Listing 21.2 Simple masking of phone number and first name

UPDATE dbo.Person
   SET FirstName = REPLACE(FirstName, 'T', 'D').
       PhoneNumber = LEFT(PhoneNumber*2, 9)

The data will be easy to reverse if you know the pattern, of course. Although this may be convenient for our developers, it’s also convenient for bad actors who are eager to steal our data. Further, if the data changes over time, these patterns may fail or not apply to new data. If the database is upgraded and gets new tables with new sensitive data, you would have to validate the new entries one by one to determine whether they are PII data that should be masked.

21.3 The better approach

Instead of the previously mentioned, simple update approach, we should aim to replace the real data with entirely fabricated data. dbatools makes it easy to create randomly generated data and use that new data to update any records containing PII. This is an easier and more sustainable approach to masking the data. Also, because we don’t use fixed patterns, it’s much harder (or even impossible) to reverse engineer the data.

21.3.1 Generating random data

Within dbatools, we use a mix of two approaches to generate random data, depending on the pattern of the data that you want to mask. For simple SQL Server data types, like the types shown in table 21.2, we just generate a new random value.

Table 21.2 Sample data types

Type	Description
Number	`int`, `tinyint`, `smallint`, `bigint`
Dates	`time`, `smalldatetime`, `datetime`, `datetime2`
Text	`char`, `varchar`, `nvarchar`

For more complicated or specific types of data, such as data formatted as emails, first names, IP addresses, and phone numbers, dbatools relies on the Bogus framework (sqlps.io/bogus), which is a well-known “fake data” generator for .NET. We’ve even brought the functionality of the Bogus framework to PowerShell within the Get-DbaRandomizedValue command, which allows you to easily generate random data ad hoc, as shown next.

Listing 21.3 Generating random data ad hoc

PS> # Generate a random datetime value for the year 2021
PS> $splatGetRandValueDT = @{
  Datatype = "datetime"
  Min = "2021-01-01"
  Max = "2021-12-31 23:59:59"
}
PS> Get-DbaRandomizedValue @splatGetRandValueDT
 
PS> # Generate a random IP value
PS> $splatGetRandValueIP = @{
  RandomizerType = "Internet"
  RandomizerSubType = "IP"
}
PS> Get-DbaRandomizedValue @splatGetRandValueIP

In the first command in listing 21.3, we passed a valid SQL Server data type to the -Datatype parameter, which then returned a random value for that type. In the second, more advanced example, we specified the -RandomizerType and -RandomizerSubType parameters, which then generated an IP address. This was accomplished by specifying Internet as the primary type and IP as the subtype.

In addition to “faking” IP addresses, our masking commands can generate numerous randomized data combinations that help create fabricated data. To see the extensive list of options, you can use both Get-DbaRandomizedType and Get-DbaRandomizedValue to see what’s available.

21.4 The process

Now that you know a little more about the way dbatools generates randomized data, let’s look how we use those techniques to help simplify the masking of data.

21.4.1 Finding potential PII data

When attempting to mask our own data, we realized that finding PII data using T-SQL or SSMS was a tedious, manual process. In response, we created the command Invoke-DbaDbPiiScan, which makes it easy to find most PII data. The approach used in this command was to both examine column name hints and perform pattern recognition within the data. You can see just how easy it is to find PII data with Invoke-DbaDbPiiScan in the next code listing.

Listing 21.4 Example output of Invoke-DbaDbPiiScan

PS> Invoke-DbaDbPiiScan -SqlInstance mssql1 -Database AdventureWorks
 
ComputerName   : mssql1
InstanceName   : MSSQLSERVER
SqlInstance    : mssql1
Database       : AdventureWorks
Schema         : SalesLT
Table          : Customer
Column         : EmailAddress
PII-Category   : Communication                        ①
PII-Name       : E-mail                               ②
FoundWith      : KnownName                            ③
MaskingType    : Internet                             ④
MaskingSubType : Email                                ⑤
Pattern        : {(w*)(?i)(email|e-mail|mail)(w*),
  ^(?:(?!invalid).)*email(?!w*ID)}                   ⑥

① The category in which this table data belongs

② The name of the PII data type

③ The way it was found—KnownName file in this case; can also be from the Pattern file

④ The masking type

⑤ The masking subtype

⑥ The pattern used to infer that, given the column name, the data will likely contain email addresses

You can see that, within the example in listing 21.4, the PII data of Email subtype was discovered by the column name.

In the event that Invoke-DbaDbPiiScan can’t determine likely PII data by the column name, it uses pattern matching instead. We can observe pattern matching in action when we change the column name EmailAddress to a less-obvious name, such as EAddress. In the following code snippet, we see that Invoke-DbaDbPiiScan no longer finds the PII by using the column name but rather by matching a preset regular expression pattern.

Listing 21.5 Example output of finding results using patterns

PS> Invoke-DbaDbPiiScan -SqlInstance mssql1 -Database AdventureWorks
 
ComputerName   : mssql1
InstanceName   : MSSQLSERVER
SqlInstance    : mssql1
Database       : AdventureWorks
Schema         : SalesLT
Table          : Customer
Column         : EAddress                                           ①
PII-Category   : Communication
PII-Name       : E-mail
FoundWith      : Pattern                                            ②
MaskingType    : Internet                                           ③
MaskingSubType : Email                                              ④
Country        : All
CountryCode    : All
Pattern        : [A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,6}         ⑤
Description    :

Within the results, we can see that the data within a column named EAddress ① matched a pattern ② of an email that was checked by using the regular expression ⑤. The regular expression pattern helped to infer that the masking type was Internet ③ and the masking subtype was Email ④.

By default, Invoke-DbaDbPiiScan samples the first 100 records to determine the type of data it contains. If you want to include more or fewer rows to try to find the pattern recognition, you can leverage the -SampleCount parameter to specify a different amount of records to be analyzed, as shown next.

Listing 21.6 Modifying the sampling count Invoke-DbaDbPiiScan

PS> Invoke-DbaDbPiiScan -SqlInstance mssql1 -Database AdventureWorks
➥ -SampleCount 200

We’ve found that sampling as few as 10 rows works well enough, but higher sampling rates return more accurate results. The downside to sampling a higher number of rows is that the analysis will take more time to complete. That amount of time is dependent on the type of data and the resources available to process the sampling.

21.4.2 Generating a configuration file for masking

Now that you know how to identify which tables and columns have PII data, you’ll likely want to save the results of this test so that you can repeat the masking process on a regular basis. You can do this using the New-DbaDbMaskingConfig command, which generates a human-readable file that is easy to modify. But, before we check the JSON file structure generated by New-DbaDbMaskingConfig, let’s explore some of the parameters available to run the command and understand how it works.

Warning As of the time of this writing, not all column and data types are supported (e.g., user-defined data types). Use Get-Help New-DbaDbMaskingConfig for the most up-to-date documentation and limitations.

When using New-DbaDbMaskingConfig, you may want to generate a file based on all columns of all tables, or you can decide to be more granular and specify which table(s) and/or column(s) you want to include on your file. For this, you will use the -Table and -Column parameters, as demonstrated in the following listing, which will generate a file named mssql1.AdventureWorks.DataMaskingConfig.json under the D: emp folder.

Listing 21.7 Generating a new masking config file

PS> New-DbaDbMaskingConfig -SqlInstance mssql1 -Database AdventureWorks
➥ -Table Address -Column City, PostalCode -Path D:	emp

The -Table parameter on each of the masking commands is expecting only the table name, without a schema, meaning that if the fully qualified name of your table is Person .Address, you only need to specify Address. The contents are, again, easy to both read and modify as shown here.

Listing 21.8 Generating a new masking config file

{
  "Name":  "AdventureWorks",
  "Type":  "DataMaskingConfiguration",
  "Tables":  [
        {
            "Name":  "Address",
            "Schema":  "Person",
            "Columns":  [
                    {
                    "Name":  "City",
                    "ColumnType":  "nvarchar",
                    "CharacterString":  null,
                    "MinValue":  15,
                    "MaxValue":  30,
                    "MaskingType":  "Address",
                    "SubType":  "City",
                    "Format":  null,
                    "Separator":  null,
                    "Deterministic":  false,
                    "Nullable":  false,
                    "KeepNull":  true,
                    "Composite":  null,
                    "Action":  null,
                    "StaticValue":  null
                    },
                    {
                    "Name":  "PostalCode",
                    "ColumnType":  "nvarchar",
                    "CharacterString":  null,
                    "MinValue":  8,
                    "MaxValue":  15,
                    "MaskingType":  "Address",
                    "SubType":  "Zipcode",
                    "Format":  null,
                    "Separator":  null,
                    "Deterministic":  false,
                    "Nullable":  false,
                    "KeepNull":  true,
                    "Composite":  null,
                    "Action":  null,
                    "StaticValue":  null
                    }
                ],
            "HasUniqueIndex":  true,
            "FilterQuery":  null
        }
    ]
}

You can use this file on an ongoing basis to mask data within the City and PostalCode columns within the Address table.

Defining deterministic columns

dbatools also supports setting columns as deterministic. This means that if a raw value appears more than once, the same masked value can also be used. To make a column deterministic, you need to edit your configuration file.

First, search for the column name you want to change, then update the value of the Deterministic value from false to true. Using a portion of the example in listing 21.8, we’ll update the City column to be deterministic, shown next.

Listing 21.9 Making the City column deterministic

{
                    "Name":  "City",
                    "ColumnType":  "nvarchar",
                    "CharacterString":  null,
                    "MinValue":  15,
                    "MaxValue":  30,
                    "MaskingType":  "Address",
                    "SubType":  "City",
                    "Format":  null,
                    "Separator":  null,
                    "Deterministic":  true,    ①
                    "Nullable":  false,
                    "KeepNull":  true,
                    "Composite":  null,
                    "Action":  null,
                    "StaticValue":  null
                    },

As you can see in listing 21.9 ①, the Deterministic property has been changed to true. This means that all instances of a specific city name will be changed to the same masked value. As an example, all instances of Los Angeles will be changed to Provincetown, and all instances of San Diego will be changed to Seattle.

Ultimately, the decision to use deterministic masking comes down to your requirements. We often find that very normalized databases likely won’t need deterministic masking because everything revolves around a primary key, essentially making the data naturally deterministic. Denormalized databases, on the other hand, may benefit from deterministic masking. If your report developer requires the same proportions of data, such as X amount of Y comes from Z city, this would be a good use case for deterministic masking.

Another good use case for deterministic masking would be keeping masked, but known, CI/CD values for testing purposes. Although deterministic replacements are tracked only per each run, you can export the resulting mask dictionary for later use. This operation is useful but out of scope for this book.

21.4.3 Applying static data masking

Now that you’ve learned how to generate a configuration file and potentially identify and set deterministic columns, it’s time to use the configuration file to mask data within your database using the Invoke-DbaDbDataMasking command. Before we dive into the PowerShell code, let’s look at some raw AdventureWorks data that has not yet been masked, shown in figure 21.1.

Figure 21.1 Table data before static data masking

And now, it’s finally time to mask our data using Invoke-DbaDbDataMasking, as shown in the next listing.

Listing 21.10 Applying static data masking to your table

PS> Invoke-DbaDbDataMasking -SqlInstance mssql1 -Database dbatools
➥ -FilePath "D:	empmssql1.AdventureWorks.DataMaskingConfig.json"

An output example of the execution of the command is shown in the next code sample.

Listing 21.11 Example output of Invoke-DbaDbDataMasking

ComputerName : mssql1
InstanceName : MSSQLSERVER
SqlInstance  : mssql1
Database     : AdventureWorks
Schema       : SalesLT
Table        : Address
Columns      : {City, PostalCode}
Rows         : 450
Elapsed      : 00:00:10
Status       : Successful

If we check our table again, as depicted in figure 21.2, now we will see that the data for the City and PostalCode columns is different.

Figure 21.2 Table data after static data masking

It’s so rewarding to actually see these changed results each time we mask our own data. We encourage you to use this against your own databases to see just how easy it is to introduce a whole new layer of security without any additional cost.

21.4.4 Validating a data masking configuration file

As we’ve seen earlier in the chapter, editing JSON files is easy. But JSON formatting is sensitive, so if you make changes often, the possibility of typos and mistypes increases, which can cause problems. We’ve encountered this scenario ourselves and ended up creating a command that makes it easier to validate a data masking configuration file’s JSON structure.

Using this command to validate your JSON files along the way can help save frustration from unexpected parsing issues.

21.5 Hands-on lab

Try the following tasks:

Check whether a specific database table has any column that contains potential PII data.
Generate a data masking configuration file for a selection of tables and columns.
Mask the data using the data-masking configuration file generated.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 21 Data masking

Create new playlist

Sign In

Sign Up