Not so long ago, protecting your data from accidental deletion or modification, hardware or software failure, or recovering the database after it was polluted by loading corrupt data into it was not a trivial affair.
Just to give you a flavor of the pain involved in protecting data, let’s open a window into the past. We’ll look at one key element, backing up data in a database.
To ensure you could roll back to a known position, you had to make database backups. These backups were cumbersome to deal with due to their large size, and making the actual backups (or restoring one!) took an incredibly long time by today’s standards. The process placed a heavy load on the database, and slowed-down user queries were commonplace. Not only that, but you also needed to schedule these backups around critical processes, such as loading data and releasing code into the live environment, so that if anything did go wrong, you had a known backup point to restore to.
So naturally, if you were a DBA, you looked for ways to mitigate these pain points. Making backups out of hours, backing up just the changed data during the working week, and taking a full backup on the weekend, when you had more time, was very typical.
Matters got even more complicated and time consuming when you had to maintain a standby database for disaster recovery, something I’ll touch upon in the next chapter on business continuity and disaster recovery.
Even if you didn’t experience this pain yourself (and I sincerely hope you didn’t!) hopefully you’re getting the idea of the amount of effort and complexity involved.
Thankfully, Snowflake has a range of features to help us recover gracefully from the unexpected events life sometimes throws our way. Collectively, the range of features in this category are referred to as the Continuous Data Protection (CDP) lifecycle.
Data Encryption
In the previous chapter, I touched on the fact that data in transit over the network is encrypted within Snowflake. Snowflake uses TLS encryption for all customer communication, ensuring data isn’t sent over the network in clear text.
Data at rest within Snowflake is always encrypted. This is the default configuration and cannot be switched off. If you opt to have an external stage (a storage area on a cloud storage platform), you can choose whether to encrypt the data within this location. Encrypting the data in this way is known as client-side encryption.
Snowflake will always encrypt data immediately when it stores it, regardless of whether the data arrives in an encrypted or unencrypted format.
If you choose to encrypt the data in the external stage, which is the recommended approach, then you need to ensure Snowflake can read this data when it arrives. To do this, you create a named stage object (using the CREATE STAGE command covered in Chapter 2), adding the MASTER_KEY parameter, and then load data from the stage into your Snowflake tables.
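As a sketch, the steps above might look like the following. The stage name, bucket URL, credentials, and key value are all hypothetical placeholders; the MASTER_KEY must be the Base64-encoded AES key that was used to encrypt the files client-side.

```sql
-- Hypothetical example: register an external stage whose files were
-- encrypted client-side, supplying the Base64-encoded AES master key
-- so Snowflake can decrypt the files as it loads them.
CREATE STAGE my_encrypted_stage
  URL = 's3://my-bucket/encrypted-data/'
  CREDENTIALS = (AWS_KEY_ID = '<key-id>' AWS_SECRET_KEY = '<secret>')
  ENCRYPTION = (TYPE = 'AWS_CSE' MASTER_KEY = '<base64-encoded-aes-key>');

-- Load the now-decryptable files into a table.
COPY INTO my_table FROM @my_encrypted_stage;
```

Once the stage is defined this way, users simply reference it by name; the key is held in the stage definition rather than being handed out to each user.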
When working with data in the external stage, users reference the named stage without needing access to the client-side encryption keys.
Data within internal stages (within Snowflake) is encrypted by default.
Encryption Key Management
Snowflake uses AES 256-bit encryption and a hierarchy of keys specific to each Snowflake account. Since Snowflake is a multi-tenant cloud service storing data across many customer accounts, this approach is important because it provides an additional layer of isolation, with a separate set of account master keys.
When a key encrypts another key, it is called wrapping. When the key is decrypted again, it is called unwrapping. At the very top of the hierarchy is the root key, which is wrapped in a hardware security module (HSM), a device designed specifically for securely storing digital keys.
Account and table master keys are automatically rotated by Snowflake when they are more than 30 days old. New keys are introduced and the previously active keys are retired. Retired keys remain available to decrypt the data they protected. Snowflake is clever enough to determine when a retired key is no longer required before it automatically destroys it. Only active keys are used to wrap child keys. Rotating keys frequently limits the amount of data each key is responsible for protecting.
Customer Managed Keys
But what if you don’t want to rely just on Snowflake to manage your keys? Say you’re storing sensitive information, such as military personnel and positions of key infrastructure. If this data fell into the wrong hands, lives could be in danger. As a result, your organization has stipulated the highest security requirements around data.
In this case, you can generate and store your own key and send it to Snowflake. Snowflake combines your key with its own, creating a composite key. If either key is revoked, the data in the Snowflake account can no longer be decrypted.
This added complexity results in greater control and flexibility for you as the customer. In the event of a data breach, you can choose to revoke or disable your key, effectively pressing pause on any data operations within your Snowflake account. Additionally, if your organization has strict requirements relating to key management, such as an aggressive key rotation policy, then maintaining a client-side key allows you to implement these policies, and in turn, tightly control the end-to-end data lifecycle.
Using a customer-managed key along with Snowflake’s key and Snowflake’s built-in user authentication creates three layers of data protection, which Snowflake calls Tri-Secret Secure.
To use customer-managed keys with Snowflake, you must have the Business Critical edition of Snowflake.
Time Travel
Time Travel allows you to query data at a point in time, within a defined time window using SQL. It’s a little like winding back the clock on your database, allowing you to view the exact state the data was in at a specific point in time.
As with many features Snowflake provides, there’s nothing you need to do in the background to maintain these copies of data. As the user, you just need to define how long to keep the version of the data, which I’ll cover in the next section on data retention.
By using some of the SQL extensions provided for Time Travel operations, you can restore tables, schemas, and databases in their entirety. This includes objects that have been dropped. Yes, there’s actually an UNDROP command!
When I discussed cloning in a previous chapter, I mentioned that cloning uses Time Travel behind the scenes. You can create clones of objects at a point in time.
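For instance, cloning a table as it existed at an earlier point in time might look like this (the table names and offset are hypothetical):

```sql
-- Hypothetical example: clone a table as it existed one hour ago,
-- using Time Travel's AT clause with a negative offset in seconds.
CREATE TABLE orders_restored CLONE orders
  AT (OFFSET => -3600);
```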
You can also query data from the past, regardless of how many times the data has been modified. You can imagine how helpful this can be when investigating data issues raised by your consumers. It allows you to understand how a record has been modified over time by executing basic SQL commands. A common example is to leverage Time Travel to roll back the data to a previous state before erroneous data was loaded by an ETL process.
Data Retention Periods
To support the flexibility Time Travel offers, Snowflake maintains versions of the data from before it was updated or deleted. It keeps these versions for as long as the data retention period specifies.
Table 5-1. Retention Periods (in days)

| Object type | Standard Min | Standard Default | Standard Max | Enterprise+ Min | Enterprise+ Default | Enterprise+ Max |
|---|---|---|---|---|---|---|
| Temporary or transient objects | 0 | 1 | 1 | 0 | 1 | 1 |
| Permanent objects | 0 | 1 | 1 | 0 | 1 | 90 |
To change the data retention period, you can use the ACCOUNTADMIN role to set the value for the DATA_RETENTION_TIME_IN_DAYS parameter. Interestingly, you can use this parameter when creating a database, schema, or table to override the global default. This means if you have a small amount of business-critical data (and you’re running the Enterprise edition or above) in your database, you could decide to set the retention period to 90 days while leaving all other objects at the default of 1 day. It is important to remember that increasing the data retention period will also increase storage costs. This is why I recommend you set this at a schema or table level depending on your specific requirements.
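As a sketch, overriding the retention period for a single business-critical table (the object and column names here are hypothetical) might look like:

```sql
-- Hypothetical example: keep 90 days of Time Travel history for one
-- critical table, while the rest of the database stays at the default.
ALTER TABLE sales.critical_balances
  SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- The parameter can also be set at creation time:
CREATE TABLE sales.audit_log (
  id          NUMBER,
  recorded_at TIMESTAMP_LTZ
)
DATA_RETENTION_TIME_IN_DAYS = 90;
```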
Querying Historical Data
To query previous versions of objects, you can use the AT or BEFORE clauses. The specified point can be a timestamp, a time offset (from the current point in time), or a previously executed statement, as the following examples demonstrate.
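The following sketches illustrate each variant; the table name, timestamp, and query ID are hypothetical:

```sql
-- 1. At a specific timestamp:
SELECT * FROM orders
  AT (TIMESTAMP => '2023-06-01 09:00:00'::TIMESTAMP_LTZ);

-- 2. At an offset: the state of the table 30 minutes (1,800 seconds) ago:
SELECT * FROM orders AT (OFFSET => -1800);

-- 3. Before a given statement ran (e.g., the erroneous update):
SELECT * FROM orders
  BEFORE (STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');
```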
Dropping and Undropping Historical Data
When an object is dropped and Time Travel is enabled, the data is not removed from the account. Instead, the version of the object is held in the background for the data retention period.
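To see what can be recovered, and then recover it, a sketch (with a hypothetical table name) might be:

```sql
-- List tables, including dropped ones still within their retention period.
SHOW TABLES HISTORY LIKE 'orders%' IN SCHEMA sales;

-- Restore the most recently dropped version of the table.
UNDROP TABLE sales.orders;
```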
The result set includes all dropped objects; you’ll see multiple records if an object has been dropped more than once. The DROPPED_ON column shows the date and time when the object was dropped.
Fail-safe
After the data retention period associated with Time Travel ends, data can no longer be viewed within your account. However, that’s not the end of the story! For a further, non-configurable 7 days, data from permanent objects is held in something called Fail-safe.
Fail-safe acts as a last-chance saloon in the event of operational failures. Only Snowflake employees can access Fail-safe, and it may take several hours to recover data from this area. Snowflake’s documentation states that Fail-safe is provided on a best-effort basis, meaning you should not rely on it as part of a disaster recovery plan.
Underlying Storage Concepts
With both Time Travel and Fail-safe, you are charged for the data stored. The charge is calculated for each 24-hour period from the time the data changed, based on the number of days since the data was modified along with your data retention setting (see Table 5-1).
When you drop or truncate a table, entire copies of the data are held in Time Travel and Fail-safe. In all other cases, Snowflake is intelligent enough to work out which data it needs to store to recover individual rows if required, so storage usage reflects only the modified records as a proportion of the overall table size.
Temporary and Transient Tables
You now understand the data retention periods across the different Snowflake editions and object types. You are also aware that you can override the global data retention setting on a per-object basis. It’s therefore worth developing an approach that ensures critical data can be recovered, while short-lived data, which will never be restored, doesn’t incur unnecessary storage costs.
You should always use permanent tables for data intended for regular end-user consumption. Typically, this is a presentation layer, perhaps containing facts and dimensions, though in other cases your users may need access to pre-transformed, granular data.
Temporary tables exist only for the duration of the session that created them, or until they are explicitly dropped. They aren’t visible to any session other than the one that created them. They are useful for ad hoc pieces of exploratory work, or occasionally as a working area to prepare some data for a specific request.
Transient tables are like permanent tables in that they need to be explicitly dropped and can be accessed by multiple sessions. However, they don’t take up any space in Fail-safe and therefore won’t incur Fail-safe storage costs. Additionally, they are only available in Time Travel for a maximum of 1 day.
These transient tables are ideal to use as working tables, which form part of your data ingestion or ETL process, especially if each day you’re staging millions of records from a data source. You can allow these records to pass through transient tables on the way to the warehouse without having to be concerned about storage costs related to Time Travel or Fail-safe.
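Creating such a staging table is straightforward; as a hypothetical sketch:

```sql
-- Hypothetical example: a transient staging table for an ETL pipeline.
-- No Fail-safe copy is kept, and Time Travel is capped at 1 day
-- (here it is switched off entirely with a retention period of 0).
CREATE TRANSIENT TABLE staging_orders (
  order_id  NUMBER,
  loaded_at TIMESTAMP_LTZ
)
DATA_RETENTION_TIME_IN_DAYS = 0;
```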
Bringing It All Together
Summary
In this chapter, you learned about Snowflake’s Continuous Data Protection and how it aims to protect data throughout its lifecycle. Snowflake encrypts data at rest by default, using a hierarchy of keys to minimize the impact of any data breach.
Your organization may store highly sensitive data, which dictates a more comprehensive approach. In this instance, you can use the Business Critical edition of Snowflake and introduce your own customer key. This customer key merges with Snowflake’s key to provide you, the customer, with greater control over data access, allowing you to revoke access completely and pause all data operations within your account. This can make a significant difference in limiting the damage caused by a potential data breach.
You also now understand the differences between Time Travel and Fail-safe, the purpose of each, and what objects they apply to. You explored the data retention period, along with how you can fine-tune this to match both short- and long-lived objects, allowing you to get the best balance of cost against data protection.
In the next chapter, I’ll continue to broaden this discussion to look at how to approach business continuity. You’ll consider what options you have to guarantee data availability in a disaster recovery scenario, look at data replication across regions, and see how Snowflake can be configured in a multi-cloud environment.