Data protection is heavily regulated in many countries, and failing to protect certain types of data may break national and international laws, such as the GDPR in the European Union and HIPAA in the United States. For these laws and many additional reasons, one of the most important duties of a DBA is preventing data leaks.
SQL Server implements many security principles, like authentication and authorization, to help protect data from unauthorized access, but these measures can be bypassed when databases are moved from production to other environments, such as development and testing, or when databases are given to vendors for troubleshooting.
To reduce the potential for data breaches when sharing databases that contain sensitive data, we must consider protecting data privacy by replacing any personally identifiable information (PII) with fabricated data, while also keeping the resulting data meaningful to the consuming applications or test suites. PII includes, but is not limited to, name, birth date, passport number, home address, and phone number.
In this chapter, we will focus on static data masking. This is the process of permanently replacing sensitive data at rest with new values by updating the data in our database.
Note SQL Server has a feature called Dynamic Data Masking, which replaces data in transit, so the actual data is not modified. This is not what dbatools offers, so it won’t be covered here. To reiterate, data masking in the dbatools context is about the anonymization of the data, also known as static data masking, which is the process of replacing sensitive data in its entirety.
First, we’ll share the options that dbatools offers to help identify and mask our data. Then, we’ll show how we would generate masking configurations and apply them against a database that contains PII. Finally, we’ll demonstrate how these configurations can be saved to make the process easily repeatable. Easily repeatable processes not only increase the chances that you’ll protect your data each time it’s exported, but they also make it possible to automate within a CI/CD pipeline.
In this chapter, we will address a few key dbatools commands, as shown in table 21.1.
You can find the commands dbatools offers for dealing with this topic by executing the command shown next.
Sometimes, if we’re lucky, the development team that designs the database also creates scripts to run some update statements to scramble the data. However, other times, the dev teams don’t create these scripts at all, or we are required to work with third-party software that also lacks this functionality. In these cases, we may not be familiar with the database schema, but we still want to mask the data before handing over a copy of the database to others.
One common approach to this problem is to update data based on a simple pattern or several patterns. So, for instance, you might use a T-SQL script to update the values of two columns, such as the following:
All “T” chars will become “D,” so if we had a “Tom” on the FirstName column, these will become “Dom.”
Phone numbers will be doubled, and then the first nine digits of that number will be used in place of the original. The phone number “987654321,” for example, would become “197530864.”
You can see a simple T-SQL based example of this in the next listing.
UPDATE dbo.Person SET FirstName = REPLACE(FirstName, 'T', 'D'). PhoneNumber = LEFT(PhoneNumber*2, 9)
The data will be easy to reverse if you know the pattern, of course. Although this may be convenient for our developers, it’s also convenient for bad actors who are eager to steal our data. Further, if the data changes over time, these patterns may fail or not apply to new data. If the database is upgraded and gets new tables with new sensitive data, you would have to validate the new entries one by one to determine whether they are PII data that should be masked.
Instead of the previously mentioned, simple update approach, we should aim to replace the real data with entirely fabricated data. dbatools makes it easy to create randomly generated data and use that new data to update any records containing PII. This is an easier and more sustainable approach to masking the data. Also, because we don’t use fixed patterns, it’s much harder (or even impossible) to reverse engineer the data.
Within dbatools, we use a mix of two approaches to generate random data, depending on the pattern of the data that you want to mask. For simple SQL Server data types, like the types shown in table 21.2, we just generate a new random value.
For more complicated or specific types of data, such as data formatted as emails, first names, IP addresses, and phone numbers, dbatools relies on the Bogus framework (sqlps.io/bogus), which is a well-known “fake data” generator for .NET. We’ve even brought the functionality of the Bogus framework to PowerShell within the Get-DbaRandomizedValue
command, which allows you to easily generate random data ad hoc, as shown next.
PS> # Generate a random datetime value for the year 2021 PS> $splatGetRandValueDT = @{ Datatype = "datetime" Min = "2021-01-01" Max = "2021-12-31 23:59:59" } PS> Get-DbaRandomizedValue @splatGetRandValueDT PS> # Generate a random IP value PS> $splatGetRandValueIP = @{ RandomizerType = "Internet" RandomizerSubType = "IP" } PS> Get-DbaRandomizedValue @splatGetRandValueIP
In the first command in listing 21.3, we passed a valid SQL Server data type to the -Datatype
parameter, which then returned a random value for that type. In the second, more advanced example, we specified the -RandomizerType
and -RandomizerSubType
parameters, which then generated an IP address. This was accomplished by specifying Internet
as the primary type and IP
as the subtype.
In addition to “faking” IP addresses, our masking commands can generate numerous randomized data combinations that help create fabricated data. To see the extensive list of options, you can use both Get-DbaRandomizedType
and Get-DbaRandomizedValue
to see what’s available.
Now that you know a little more about the way dbatools generates randomized data, let’s look how we use those techniques to help simplify the masking of data.
When attempting to mask our own data, we realized that finding PII data using T-SQL or SSMS was a tedious, manual process. In response, we created the command Invoke-DbaDbPiiScan
, which makes it easy to find most PII data. The approach used in this command was to both examine column name hints and perform pattern recognition within the data. You can see just how easy it is to find PII data with Invoke-DbaDbPiiScan
in the next code listing.
PS> Invoke-DbaDbPiiScan -SqlInstance mssql1 -Database AdventureWorks ComputerName : mssql1 InstanceName : MSSQLSERVER SqlInstance : mssql1 Database : AdventureWorks Schema : SalesLT Table : Customer Column : EmailAddress PII-Category : Communication ① PII-Name : E-mail ② FoundWith : KnownName ③ MaskingType : Internet ④ MaskingSubType : Email ⑤ Pattern : {(w*)(?i)(email|e-mail|mail)(w*), ^(?:(?!invalid).)*email(?!w*ID)} ⑥
① The category in which this table data belongs
② The name of the PII data type
③ The way it was found—KnownName file in this case; can also be from the Pattern file
⑥ The pattern used to infer that, given the column name, the data will likely contain email addresses
You can see that, within the example in listing 21.4, the PII data of Email
subtype was discovered by the column name.
In the event that Invoke-DbaDbPiiScan
can’t determine likely PII data by the column name, it uses pattern matching instead. We can observe pattern matching in action when we change the column name EmailAddress to a less-obvious name, such as EAddress. In the following code snippet, we see that Invoke-DbaDbPiiScan
no longer finds the PII by using the column name but rather by matching a preset regular expression pattern.
PS> Invoke-DbaDbPiiScan -SqlInstance mssql1 -Database AdventureWorks ComputerName : mssql1 InstanceName : MSSQLSERVER SqlInstance : mssql1 Database : AdventureWorks Schema : SalesLT Table : Customer Column : EAddress ① PII-Category : Communication PII-Name : E-mail FoundWith : Pattern ② MaskingType : Internet ③ MaskingSubType : Email ④ Country : All CountryCode : All Pattern : [A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,6} ⑤ Description :
Within the results, we can see that the data within a column named EAddress ① matched a pattern ② of an email that was checked by using the regular expression ⑤. The regular expression pattern helped to infer that the masking type was Internet
③ and the masking subtype was Email
④.
By default, Invoke-DbaDbPiiScan
samples the first 100 records to determine the type of data it contains. If you want to include more or fewer rows to try to find the pattern recognition, you can leverage the -SampleCount
parameter to specify a different amount of records to be analyzed, as shown next.
We’ve found that sampling as few as 10 rows works well enough, but higher sampling rates return more accurate results. The downside to sampling a higher number of rows is that the analysis will take more time to complete. That amount of time is dependent on the type of data and the resources available to process the sampling.
Now that you know how to identify which tables and columns have PII data, you’ll likely want to save the results of this test so that you can repeat the masking process on a regular basis. You can do this using the New-DbaDbMaskingConfig
command, which generates a human-readable file that is easy to modify. But, before we check the JSON file structure generated by New-DbaDbMaskingConfig
, let’s explore some of the parameters available to run the command and understand how it works.
Warning As of the time of this writing, not all column and data types are supported (e.g., user-defined data types). Use Get-Help New-DbaDbMaskingConfig
for the most up-to-date documentation and limitations.
When using New-DbaDbMaskingConfig
, you may want to generate a file based on all columns of all tables, or you can decide to be more granular and specify which table(s) and/or column(s) you want to include on your file. For this, you will use the -Table
and -Column
parameters, as demonstrated in the following listing, which will generate a file named mssql1.AdventureWorks.DataMaskingConfig.json under the D: emp folder.
PS> New-DbaDbMaskingConfig -SqlInstance mssql1 -Database AdventureWorks ➥ -Table Address -Column City, PostalCode -Path D: emp
The -Table
parameter on each of the masking commands is expecting only the table name, without a schema, meaning that if the fully qualified name of your table is Person .Address, you only need to specify Address. The contents are, again, easy to both read and modify as shown here.
{ "Name": "AdventureWorks", "Type": "DataMaskingConfiguration", "Tables": [ { "Name": "Address", "Schema": "Person", "Columns": [ { "Name": "City", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": 15, "MaxValue": 30, "MaskingType": "Address", "SubType": "City", "Format": null, "Separator": null, "Deterministic": false, "Nullable": false, "KeepNull": true, "Composite": null, "Action": null, "StaticValue": null }, { "Name": "PostalCode", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": 8, "MaxValue": 15, "MaskingType": "Address", "SubType": "Zipcode", "Format": null, "Separator": null, "Deterministic": false, "Nullable": false, "KeepNull": true, "Composite": null, "Action": null, "StaticValue": null } ], "HasUniqueIndex": true, "FilterQuery": null } ] }
You can use this file on an ongoing basis to mask data within the City and PostalCode columns within the Address table.
Defining deterministic columns
dbatools also supports setting columns as deterministic. This means that if a raw value appears more than once, the same masked value can also be used. To make a column deterministic, you need to edit your configuration file.
First, search for the column name you want to change, then update the value of the Deterministic
value from false
to true
. Using a portion of the example in listing 21.8, we’ll update the City column to be deterministic, shown next.
{ "Name": "City", "ColumnType": "nvarchar", "CharacterString": null, "MinValue": 15, "MaxValue": 30, "MaskingType": "Address", "SubType": "City", "Format": null, "Separator": null, "Deterministic": true, ① "Nullable": false, "KeepNull": true, "Composite": null, "Action": null, "StaticValue": null },
As you can see in listing 21.9 ①, the Deterministic
property has been changed to true
. This means that all instances of a specific city name will be changed to the same masked value. As an example, all instances of Los Angeles
will be changed to Provincetown
, and all instances of San Diego
will be changed to Seattle
.
Ultimately, the decision to use deterministic masking comes down to your requirements. We often find that very normalized databases likely won’t need deterministic masking because everything revolves around a primary key, essentially making the data naturally deterministic. Denormalized databases, on the other hand, may benefit from deterministic masking. If your report developer requires the same proportions of data, such as X amount of Y comes from Z city, this would be a good use case for deterministic masking.
Another good use case for deterministic masking would be keeping masked, but known, CI/CD values for testing purposes. Although deterministic replacements are tracked only per each run, you can export the resulting mask dictionary for later use. This operation is useful but out of scope for this book.
Now that you’ve learned how to generate a configuration file and potentially identify and set deterministic columns, it’s time to use the configuration file to mask data within your database using the Invoke-DbaDbDataMasking
command. Before we dive into the PowerShell code, let’s look at some raw AdventureWorks data that has not yet been masked, shown in figure 21.1.
And now, it’s finally time to mask our data using Invoke-DbaDbDataMasking
, as shown in the next listing.
PS> Invoke-DbaDbDataMasking -SqlInstance mssql1 -Database dbatools ➥ -FilePath "D: empmssql1.AdventureWorks.DataMaskingConfig.json"
An output example of the execution of the command is shown in the next code sample.
ComputerName : mssql1 InstanceName : MSSQLSERVER SqlInstance : mssql1 Database : AdventureWorks Schema : SalesLT Table : Address Columns : {City, PostalCode} Rows : 450 Elapsed : 00:00:10 Status : Successful
If we check our table again, as depicted in figure 21.2, now we will see that the data for the City and PostalCode columns is different.
It’s so rewarding to actually see these changed results each time we mask our own data. We encourage you to use this against your own databases to see just how easy it is to introduce a whole new layer of security without any additional cost.
As we’ve seen earlier in the chapter, editing JSON files is easy. But JSON formatting is sensitive, so if you make changes often, the possibility of typos and mistypes increases, which can cause problems. We’ve encountered this scenario ourselves and ended up creating a command that makes it easier to validate a data masking configuration file’s JSON structure.
Using this command to validate your JSON files along the way can help save frustration from unexpected parsing issues.
Check whether a specific database table has any column that contains potential PII data.
Generate a data masking configuration file for a selection of tables and columns.
Mask the data using the data-masking configuration file generated.