© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Morton, Mastering Snowflake Solutions, https://doi.org/10.1007/978-1-4842-8029-4_5

5. Protecting Data in Snowflake

Adam Morton
Sydney, NSW, Australia

Not so long ago, protecting your data from accidental deletion or modification, recovering from hardware or software failure, or restoring a database after it had been polluted by a load of corrupt data was not a trivial affair.

Just to give you a flavor of the pain involved in protecting data, let’s open a window into the past. We’ll look at one key element, backing up data in a database.

To ensure you could roll back to a known position, you had to make database backups. These backups were cumbersome to deal with due to their large size, and making the actual backups (or restoring one!) took an incredibly long time by today's standards. The process placed a heavy load on the database, and slowing down user queries was commonplace. Not only that, but you also needed to schedule these backups around critical processes, such as loading data and releasing code into the live environment, so that if anything did go wrong, you had a known backup point to restore to.

So naturally, if you were a DBA, you looked for ways to mitigate these pain points. Making backups out of hours, backing up just the changed data during the working week, and taking a full backup on the weekend, when you had more time, was very typical.

Matters got even more complicated and time consuming when you had to maintain a standby database for disaster recovery, something I’ll touch upon in the next chapter on business continuity and disaster recovery.

Even if you didn't experience this pain yourself (and I sincerely hope you didn't!), hopefully you're getting an idea of the amount of effort and complexity involved.

Thankfully, Snowflake has a range of features to help us recover gracefully from the unexpected events life sometimes throws our way. Collectively, the features in this category are referred to as the Continuous Data Protection (CDP) lifecycle.

The purpose of CDP is to support your data recovery requirements for the duration of the value it provides to your organization and consumers. It looks something like Figure 5-1. By the time you finish this chapter, you’ll have a very clear understanding of these components and how they fit together.
Figure 5-1

The continuous data protection lifecycle

Data Encryption

In the previous chapter, I touched on the fact that data in transit over the network is encrypted within Snowflake. Snowflake uses TLS encryption for all customer communication, ensuring data isn’t sent over the network in clear text.

Data at rest within Snowflake is always encrypted. This is the default configuration and cannot be switched off. If you opt to have an external stage (a storage area on a cloud storage platform), you can choose whether to encrypt the data within this location. Encrypting the data in this way is known as client-side encryption.

Snowflake always encrypts data immediately when it stores it, regardless of whether the data arrives in an encrypted or unencrypted format.

If you choose to encrypt the data in the external stage, which is the recommended approach, then you need to ensure Snowflake can read this data when it arrives. To do this, you create a named stage object (using the CREATE STAGE command, as discussed in Chapter 2), supplying the MASTER_KEY parameter, and then load data from the stage into your Snowflake tables.

When working with data in the external stage, users reference the named stage without needing access to the client-side encryption keys.
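A minimal sketch of this setup, assuming an AWS S3 external stage (the bucket URL, credentials, Base64-encoded AES key, and the DIM_CUSTOMER target table are placeholders, and the MASTER_KEY parameter is supplied inside the stage's ENCRYPTION clause):
CREATE OR REPLACE STAGE SALES_EXT_STAGE
URL = 's3://my-bucket/encrypted/'
CREDENTIALS = (AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>')
ENCRYPTION = (TYPE = 'AWS_CSE' MASTER_KEY = '<base64_encoded_aes_key>');
//LOAD THE CLIENT-SIDE ENCRYPTED FILES; SNOWFLAKE USES THE MASTER KEY TO DECRYPT THEM
COPY INTO DIM_CUSTOMER
FROM @SALES_EXT_STAGE;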

Note

Data within internal stages (within Snowflake) is encrypted by default.

Encryption Key Management

Snowflake uses AES 256-bit encryption and a hierarchy of keys specific to each Snowflake account. Since Snowflake is a multi-tenant cloud service storing data across many customer accounts, this approach is important because it provides an additional layer of isolation, with a separate set of account master keys.

When a key encrypts another key, it is called wrapping. When the key is decrypted again, it is called unwrapping. At the very top of the hierarchy is the root key, which is wrapped in a hardware security module (HSM), a device designed specifically for securely storing digital keys.

This hierarchical key model narrows the scope of each layer of keys. For example, there are keys at the account, table, and file level, as shown in Figure 5-2. Each table key is responsible for encrypting the data within its table. If a key is compromised, this approach attempts to ensure that any breach is limited.
Figure 5-2

Hierarchical approach to encryption keys

Account and table master keys are automatically rotated by Snowflake when they are more than 30 days old. New keys are introduced and previously active keys are retired. Retired keys remain available so the data they protect can still be decrypted; Snowflake determines when a retired key is no longer required and then automatically destroys it. Only active keys are used when wrapping child keys. Rotating keys frequently limits the amount of data each key is responsible for protecting.

Customer Managed Keys

But what if you don’t want to rely just on Snowflake to manage your keys? Say you’re storing sensitive information, such as military personnel and positions of key infrastructure. If this data fell into the wrong hands, lives could be in danger. As a result, your organization has stipulated the highest security requirements around data.

In this case, you can generate and store your own customer key and make it available to Snowflake. Snowflake combines your key with its own key, creating a composite key. If either key is revoked, the data in the Snowflake account can no longer be decrypted.

Your own customer key will typically be held in a key management service (KMS) such as AWS KMS, Google Cloud KMS, or Azure Key Vault. These services are made available by the cloud providers to securely store, maintain, and administer keys. Using a customer key in combination with Snowflake's key changes the behavior of the encryption key hierarchy, as Figure 5-3 shows.
Figure 5-3

Encryption key hierarchy including a customer key

This added complexity results in greater control and flexibility for you as the customer. In the event of a data breach, you can choose to revoke or disable your key, effectively pressing pause on any data operations within your Snowflake account. Additionally, if your organization has strict requirements relating to key management, such as an aggressive key rotation policy, then maintaining a client-side key allows you to implement these policies, and in turn, tightly control the end-to-end data lifecycle.

Using a customer key and a Snowflake-maintained key, together with Snowflake's built-in user authentication, creates three layers of data protection, which Snowflake calls Tri-Secret Secure.

Note

To use customer-managed keys with Snowflake, you must have the Business Critical edition of Snowflake.

Time Travel

Time Travel allows you to use SQL to query data as it existed at a point in time, within a defined window. It's a little like winding back the clock on your database, allowing you to view the exact state the data was in at a specific point in time.

As with many features Snowflake provides, there’s nothing you need to do in the background to maintain these copies of data. As the user, you just need to define how long to keep the version of the data, which I’ll cover in the next section on data retention.

By using some of the SQL extensions provided for Time Travel operations, you can restore tables, schemas, and databases in their entirety. This includes objects that have been dropped. Yes, there's actually an UNDROP command!

When I discussed cloning in a previous chapter, I mentioned that cloning uses Time Travel behind the scenes. You can create clones of objects at a point in time.

You can also query data from the past, regardless of how many times the data has been modified. You can imagine how helpful this can be when investigating data issues raised by your consumers; it allows you to understand how a record has been modified over time by executing basic SQL commands. A common example is to leverage Time Travel to roll back the data to a previous state before erroneous data was loaded by an ETL process.
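As a hedged illustration of that rollback pattern (the SALES table and the one-hour offset are assumptions, not objects from this chapter), you could clone the table as it looked before the bad load and then swap the clone into place:
//CLONE THE TABLE AS IT LOOKED ONE HOUR AGO, BEFORE THE ERRONEOUS LOAD
CREATE OR REPLACE TABLE SALES_ROLLBACK CLONE SALES
AT(OFFSET => -60*60);
//SWAP THE RESTORED VERSION IN PLACE OF THE POLLUTED TABLE
ALTER TABLE SALES_ROLLBACK SWAP WITH SALES;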

Data Retention Periods

To cater for the flexibility Time Travel offers, Snowflake maintains versions of the data before the data was updated or deleted. It will keep these versions for as long as the data retention period is set.

In the Standard edition of Snowflake, the default retention period is 1 day (24 hours). As you can see in Table 5-1, it is possible to set Time Travel to 0. This is equivalent to disabling Time Travel, meaning historical data is no longer available to be queried. However, it is important you make the right choice from the outset. Extending the retention period from a lower to a higher number, for example from 0 to 1, doesn't mean you'll have access to that data immediately. In this instance, you'll have to wait for a day to pass until you have a full day's worth of data to access via Time Travel.
Table 5-1

Retention Periods (in Days)

                                 Standard Edition          Enterprise Edition (and higher)
                                 Min   Default   Max       Min   Default   Max
Temporary or transient objects   0     1         1         0     1         1
Permanent objects                0     1         1         0     1         90
To change the data retention period, you can use the ACCOUNTADMIN role to set the value of the DATA_RETENTION_TIME_IN_DAYS parameter. You can also use this parameter when creating a database, schema, or table to override the global default. This means if you have a small amount of business-critical data in your database (and you're running the Enterprise edition or above), you could set the retention period to 90 days while leaving all other objects at the default of 1 day. It is important to remember that increasing the data retention period will also increase storage costs, which is why I recommend you set it at the schema or table level depending on your specific requirements.
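For example (the SALES_DB.FINANCE schema below is hypothetical), the parameter can be set account-wide and then overridden for an individual schema:
//SET THE ACCOUNT-WIDE DEFAULT (REQUIRES THE ACCOUNTADMIN ROLE)
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 1;
//OVERRIDE THE DEFAULT FOR A BUSINESS-CRITICAL SCHEMA
ALTER SCHEMA SALES_DB.FINANCE SET DATA_RETENTION_TIME_IN_DAYS = 90;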

The following code snippet shows how to set the data retention period when you create a table and then amend it at a later point in time:
CREATE TABLE DIM_CUSTOMER(CUSTOMER_SK INT, CUSTOMER_BK INT, CUSTOMER_FIRST_NAME VARCHAR(100))
DATA_RETENTION_TIME_IN_DAYS = 90;
ALTER TABLE DIM_CUSTOMER SET DATA_RETENTION_TIME_IN_DAYS=30;

Querying Historical Data

To query previous versions of objects, you can use the AT or BEFORE clauses. The specified point can be a timestamp, a time offset (from the current point in time), or a previously executed statement, as the following examples demonstrate.

Querying table data at a specific point in time using a timestamp:
SELECT *
FROM DIM_CUSTOMER AT(TIMESTAMP => '2021-06-07 02:21:10.00 -0700'::timestamp_tz);
Querying table data 15 minutes ago using a time offset:
SELECT *
FROM DIM_CUSTOMER AT(OFFSET => -60*15);
Querying table data up to but not including any changes made by the specified statement:
SELECT *
FROM DIM_CUSTOMER BEFORE(STATEMENT => '019db306-3200-7542-0000-00006bb5d821');

Dropping and Undropping Historical Data

When an object is dropped and Time Travel is enabled, the data is not immediately removed from the account. Instead, a version of the object is retained in the background for the data retention period.

You can list any dropped objects using the SHOW command along with the HISTORY keyword:
SHOW TABLES HISTORY LIKE '%DO_NOT_DROP';
SHOW SCHEMAS HISTORY IN SALES_DB;
SHOW DATABASES HISTORY;

The result set includes all dropped objects. You'll see multiple records if an object has been dropped more than once. The DROPPED_ON column shows the date and time when the object was dropped.
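To bring a dropped object back, you use the UNDROP command. A short sketch, reusing the hypothetical DO_NOT_DROP table from the SHOW example above:
//RESTORE THE MOST RECENTLY DROPPED VERSION OF THE TABLE
UNDROP TABLE DO_NOT_DROP;
//UNDROP FAILS IF AN OBJECT WITH THE SAME NAME ALREADY EXISTS;
//IN THAT CASE, RENAME THE CURRENT OBJECT FIRST
//ALTER TABLE DO_NOT_DROP RENAME TO DO_NOT_DROP_CURRENT;
//UNDROP TABLE DO_NOT_DROP;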

Fail-safe

After the data retention period associated with Time Travel ends, the data can no longer be queried within your account. However, that's not the end of the story! For a further, non-configurable 7 days, data from permanent objects is held in an area called Fail-safe.

Fail-safe acts as a last-chance saloon in the event of a system failure or other operational disaster. Only Snowflake employees can access Fail-safe, and it may take several hours to recover data from this area. Snowflake documentation states that recovery is provided on a best-effort basis, meaning you should not rely on it as part of a disaster recovery plan.

Underlying Storage Concepts

With both Time Travel and Fail-safe, you are charged for the data stored. The charge is calculated for each 24-hour period from the time the data changed, based on the number of days since the data was modified along with your data retention setting (see Table 5-1).

When you drop or truncate a table, entire copies of the data are held in Time Travel and Fail-safe; in all other cases, Snowflake is intelligent enough to work out what data it needs to store to recover individual rows if required. Storage usage is therefore based on the proportion of the table that has been modified, rather than the full table size.

The total amount of space taken up by the Fail-safe area can be viewed in the Billing & Usage tab of the Account section in Snowflake, as shown in Figure 5-4.
Figure 5-4

Storage used by Fail-safe
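If you prefer SQL, a hedged alternative (assuming your role has access to the shared SNOWFLAKE database; the ACCOUNT_USAGE views can lag behind real time) is to query the TABLE_STORAGE_METRICS view:
SELECT TABLE_CATALOG,
       TABLE_SCHEMA,
       TABLE_NAME,
       ACTIVE_BYTES,
       TIME_TRAVEL_BYTES,
       FAILSAFE_BYTES
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
ORDER BY FAILSAFE_BYTES DESC;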

Temporary and Transient Tables

You now understand the data retention periods for the different Snowflake editions and object types. You are also aware that you can override the global data retention setting on a per-object basis. It's therefore worth developing an approach that ensures critical data can be recovered, while short-lived data, which will never need to be restored, doesn't incur unnecessary storage costs.

You should always use permanent tables for data that is regularly consumed by end users. Typically, this would be a presentation layer, perhaps containing facts and dimensions, but in other cases your users may need access to pre-transformed, granular data.

Temporary tables exist only for the duration of the session that created them, unless explicitly dropped sooner. They aren't visible to any session other than the one that created them. They are useful for ad hoc exploratory work, or occasionally as a working area to prepare data for a specific request.

Transient tables are like permanent tables in that they must be explicitly dropped and can be accessed by multiple sessions. However, they don't take up any space in Fail-safe and therefore won't incur Fail-safe storage costs. Additionally, they are available in Time Travel for a maximum of 1 day.

These transient tables are ideal to use as working tables, which form part of your data ingestion or ETL process, especially if each day you’re staging millions of records from a data source. You can allow these records to pass through transient tables on the way to the warehouse without having to be concerned about storage costs related to Time Travel or Fail-safe.
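A minimal sketch of both table types (the staging table names and columns are hypothetical; setting DATA_RETENTION_TIME_IN_DAYS to 0 also disables Time Travel on the transient table):
//SESSION-SCOPED WORKING TABLE; DROPPED AUTOMATICALLY WHEN THE SESSION ENDS
CREATE TEMPORARY TABLE STG_CUSTOMER_SCRATCH
(CUSTOMER_BK INT,
LOAD_TS TIMESTAMP_NTZ);
//STAGING TABLE WITH NO FAIL-SAFE AND TIME TRAVEL DISABLED
CREATE TRANSIENT TABLE STG_CUSTOMER
(CUSTOMER_BK INT,
LOAD_TS TIMESTAMP_NTZ)
DATA_RETENTION_TIME_IN_DAYS = 0;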

Bringing It All Together

Here, you take what you’ve learned in this chapter and put it together in a brief practical example.
//CREATE DATABASE
CREATE OR REPLACE DATABASE TIME_TRAVEL;
//CREATE A SEQUENCE TO USE FOR THE TABLE.
//WE'LL USE THIS LATER WHEN WE PINPOINT A RECORD TO DELETE.
CREATE OR REPLACE SEQUENCE SEQ_TIME_TRAVEL
START = 1
INCREMENT = 1;
//CREATE A TABLE
CREATE OR REPLACE TABLE VERY_IMPORTANT_DATA
(ID NUMBER,
VERY VARCHAR(10),
IMPORTANT VARCHAR(20),
TABLE_DATA VARCHAR(10));
//INSERT 100 RECORDS INTO THE TABLE FROM THE SNOWFLAKE SAMPLE DATA
INSERT INTO VERY_IMPORTANT_DATA
SELECT SEQ_TIME_TRAVEL.NEXTVAL, 'VERY', 'IMPORTANT', 'DATA'
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
LIMIT 100;
//CONFIRM WE HAVE 100 RECORDS
SELECT COUNT(*) FROM VERY_IMPORTANT_DATA;
//DROP THE TABLE - OOPS!
DROP TABLE VERY_IMPORTANT_DATA;
//LOOK AT THE HISTORY OF THIS TABLE
//NOTE THE VALUE IN THE DROPPED_ON COLUMN
SHOW TABLES HISTORY LIKE '%VERY_IMPORTANT_DATA';
//UNDROP THE TABLE TO RESTORE IT
UNDROP TABLE VERY_IMPORTANT_DATA;
//CONFIRM THE TABLE IS BACK WITH 100 RECORDS
SELECT COUNT(*) FROM VERY_IMPORTANT_DATA;
//REVIEW THE TABLE METADATA AGAIN
SHOW TABLES HISTORY LIKE '%VERY_IMPORTANT_DATA';
//IMPORTANT: WAIT A FEW MINUTES BEFORE RUNNING THE NEXT BATCH OF
//QUERIES. THIS GIVES US A WINDOW OF HISTORY TO QUERY WITH A TIME
//OFFSET AFTER WE MODIFY THE TABLE.
//DELETE A SINGLE RECORD FROM THE TABLE
DELETE FROM VERY_IMPORTANT_DATA
WHERE ID = (SELECT MAX(ID)
             FROM VERY_IMPORTANT_DATA);
//CHECK THE METADATA TO GET THE MOST RECENT QUERY_ID
//(RELATING TO THE QUERY ABOVE)
SET QUERY_ID =
(SELECT TOP 1 QUERY_ID
FROM TABLE (INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE QUERY_TEXT LIKE 'DELETE FROM VERY_IMPORTANT_DATA%'
ORDER BY START_TIME DESC);
//CHECK THE VALUE STORED
SELECT $QUERY_ID;
//CREATE A CLONE OF THE ORIGINAL TABLE USING THE QUERY_ID
//OBTAINED FROM THE QUERY ABOVE
CREATE OR REPLACE TABLE VERY_IMPORTANT_DATA_V2 CLONE VERY_IMPORTANT_DATA
BEFORE (STATEMENT => $QUERY_ID);
//COMPARE BOTH TABLES TO VIEW THE 1 RECORD DIFFERENCE
SELECT *
FROM VERY_IMPORTANT_DATA_V2 V2
LEFT JOIN VERY_IMPORTANT_DATA V1 ON V2.ID = V1.ID
WHERE V1.ID IS NULL;
//RUN THE QUERY USING AN OFFSET TO A FEW MINUTES EARLIER AGAINST
//THE ORIGINAL TABLE. THIS QUERY WON'T RETURN ANY RECORDS.
SELECT *
FROM VERY_IMPORTANT_DATA_V2 V2
LEFT JOIN VERY_IMPORTANT_DATA AT(OFFSET => -60*7) V1 ON V2.ID = V1.ID
WHERE V1.ID IS NULL;

Summary

In this chapter, you learned about Snowflake’s Continuous Data Protection and how it aims to protect data throughout its valuable lifecycle. Snowflake encrypts data at rest by default using a hierarchy of keys at each level to minimize the impact of any data breaches.

Your organization may store highly sensitive data, which dictates a more comprehensive approach. In this instance, you can use the Business Critical edition of Snowflake and introduce your own customer key. This customer key merges with Snowflake’s key to provide you, the customer, with greater control over data access, allowing you to revoke access completely and pause all data operations within your account. This can make a significant difference in limiting the damage caused by a potential data breach.

You also now understand the differences between Time Travel and Fail-safe, the purpose of each, and what objects they apply to. You explored the data retention period, along with how you can fine-tune this to match both short- and long-lived objects, allowing you to get the best balance of cost against data protection.

In the next chapter, I’ll continue to broaden this discussion to look at how to approach business continuity. You’ll consider what options you have to guarantee data availability in a disaster recovery scenario, look at data replication across regions, and see how Snowflake can be configured in a multi-cloud environment.
