Currently three cloud providers dominate the market: Microsoft, Amazon, and Google. It just so happens that these are the same three choices you have when deciding where to run Snowflake.
However, these giants of cloud computing are certainly not immune to failure. These organizations are at the forefront of innovation, leading the way into the uncharted territory of highly complex, distributed computing at a massive scale.
In 2015, Google suffered a minor outage across some services after four successive lightning strikes hit the local power grid supplying its data center in Belgium.
In 2017, AWS’ S3 experienced a four-hour outage in the US-EAST-1 region. A vast number of websites and applications were impacted. Don’t forget: If you are running Snowflake with an external stage, it will reside on either S3, GCS, or Azure Blob storage. In this particular scenario, you would have experienced some impact if your external stage pointed to an AWS storage location within this region.
In March of 2021, Microsoft was carrying out a data migration effort at the same time as a business-as-usual task to rotate keys. Suddenly users found themselves unable to log in. The result was a 14-hour outage whose impact wasn't limited to Azure services; Office, Teams, and other Microsoft applications were also affected.
It is worth pointing out that Snowflake itself is also subject to outages, just as the underlying cloud providers are. There is a steady stream of organizations willing to move their workloads to the cloud in order to realize the promise of cost savings along with the ability to scale on demand. However, these recent failures should act as a reminder that we need to consider how such outages might impact our business-critical functions.
One thing struck me when writing this chapter: This is the first chapter in the book in which the approach you take in setting up data replication with Snowflake doesn’t differ that much from traditional RDBMSs I used in the past. Obviously, we’re working in the cloud now, which brings new benefits, but if you have experience in setting up data to be replicated between two sites, many of the concepts still hold true.
Regions and Availability Zones
Before I get into the specific capabilities of Snowflake in this area, I want to ensure that you are comfortable with some fundamentals of what I’ll discuss in this chapter. If you’re already familiar with regions and availability zones, you can move straight on to the next section.
A region relates to a physical location, such as US West (Oregon), Europe (London), or Asia Pacific (Sydney), where the cloud service providers deploy data centers. A logical grouping of one or more data centers within a region is referred to as an availability zone (AZ). Each AZ is completely independent of the next, with redundant power and networking.
Typically, a region consists of a minimum of three AZs. This allows cloud-based services to maintain multiple, redundant copies of data across AZs within the same region. For example, AWS's S3 service automatically stores copies of its data across separate AZs, thus ensuring data durability. If an AZ goes down because of a power failure, services can continue to run uninterrupted from the remaining AZs within that region.
Data Replication, Failover, and Failback
Ensuring that you design your system to handle failures is critical to the success of your solution. Depending on the criticality of the service in question, not every organization will need to maintain a global footprint of services across multiple regions. It’s all about balancing risk and cost with business continuity in mind.
Data replication ensures you have another copy of your data stored in a database that is physically separate from your primary database instance. This provides an insurance policy against losing your primary database due to hardware or software failure and power or network outages.
With Snowflake you can replicate data between multiple regions and across multiple cloud service providers too. You can have one Snowflake account in Azure replicating data to another account running on AWS. Therefore, if the Azure service suffers a major outage, you can fail over to your AWS standby service, thereby allowing your business to continue to function as normal.
You can instruct Snowflake to automatically replicate changes between databases in different regions without the need to manually configure ETL jobs or data pipelines to copy data.
You can also determine the frequency at which the data replication service runs, fine-tuning it to meet your own maximum accepted data loss in the event of a disaster. As you’d expect, data is encrypted in transit between cloud environments, and it’s also compatible with Tri-Secret Secure, which we discussed in the previous chapter.
Failover and failback are features of Snowflake’s Business Critical edition or higher.
Primary and Secondary Databases
When you enable data replication, you need to define the primary database as part of the setup. Under business-as-usual (BAU) operations, changes are written to and committed to this primary database first. When users or applications query the service, data is returned from this database. When the primary database is replicated to another account, this copy is called the secondary database.
The secondary database is in a read-only mode, meaning changes cannot be applied directly to it by a user or application. If this were permitted, then it wouldn’t be in sync with the primary database, breaking the replication process. When a DDL or DML operation is carried out on the primary database, those changes are periodically batched up and replayed against the secondary database. All data along with any changes to objects such as schemas, tables, or views are all replicated.
Only objects at the database level are replicated. Users (and their privileges), roles, warehouses, shares, and resource monitors are not replicated.
If you are running the Business Critical edition and you configure replication to an account on a lower edition, Snowflake will throw an error. This is a good feature and was put in place by design to prevent exposing sensitive data by mistake. As discussed, the Business Critical edition is typically used by organizations that handle highly sensitive data, which is why Snowflake recognizes this and tries to help out in this scenario. You can override this default behavior by using the IGNORE EDITION CHECK clause when executing the statement that enables replication, which I'll cover later.
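To illustrate, here is a minimal sketch of the override, assuming hypothetical organization, account, and database names (myorg, dev_account, mydb1):

```sql
-- Hypothetical names; replace with your own organization, account, and database.
-- Without IGNORE EDITION CHECK, this statement fails if myorg.dev_account
-- runs a lower edition than Business Critical.
ALTER DATABASE mydb1
  ENABLE REPLICATION TO ACCOUNTS myorg.dev_account
  IGNORE EDITION CHECK;
```

Use the override deliberately: you are explicitly accepting the risk of replicating sensitive data to a lower-edition account.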
You should be aware that compute resources are used to support data replication operations.
Promoting Databases
The process of promoting a secondary database to a primary database is not an automated one. To promote a secondary database to a primary database, a role with the OWNERSHIP privilege on the database is required to execute the ALTER DATABASE mydb1 PRIMARY; command. You must then point your connections to the newly promoted primary database.
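As a sketch, the promotion itself is a single statement, run from the account that holds the secondary database (the database name mydb1 is hypothetical):

```sql
-- Run on the account holding the secondary database, using a role with
-- the OWNERSHIP privilege on the database. The former primary, if still
-- reachable, becomes a read-only secondary.
ALTER DATABASE mydb1 PRIMARY;
```

When the original region recovers, failing back is the same operation executed against the original account.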
Client Redirect
At the time of writing, the client redirect feature was in preview. However, I want to touch upon it here as the aim of this capability is to expose a secure URL for your organization’s Snowflake databases regardless of which Snowflake account they reside within. Although you still need to manually promote a secondary database to a primary as described above, this capability negates the need to manually update the URL your client connections and applications are using.
The hostname in the connection URL is composed of your organization name and the connection object name in addition to a common domain name: organization_name-connection_name.snowflakecomputing.com.
Take note that the hostname does not specify the account to which you are connecting. An account administrator determines the account to use by assigning an account to serve as the primary connection. When you use this URL to connect to Snowflake, you are connecting to the account that has been defined as the primary connection.
If an outage that impacts the primary database connection occurs, the administrator can point this URL to a Snowflake account that stores the secondary databases. Throughout the outage you can continue to use the same URL to access your databases.
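Because the feature was in preview at the time of writing, the details may have evolved, but the flow looked roughly like the following sketch (connection and account names are hypothetical):

```sql
-- On the account currently serving traffic (the primary connection):
CREATE CONNECTION myconn;
ALTER CONNECTION myconn ENABLE FAILOVER TO ACCOUNTS myorg.dr_account;

-- On the failover account, create a replica of that connection:
CREATE CONNECTION myconn AS REPLICA OF myorg.prod_account.myconn;

-- During an outage, promote the replica; the shared URL
-- (myorg-myconn.snowflakecomputing.com) now resolves to this account:
ALTER CONNECTION myconn PRIMARY;
```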
Business Continuity
Process Flow
The following set of diagrams walks through how the failover and failback processes work. Say your organization has two accounts, one in AWS's US East (Ohio) region and the other in Google's asia-east2 (Hong Kong) region.
Monitoring Replication Progress
Reconciling the Process
You may want to periodically conduct your own checks to ensure the primary and secondary databases are in sync.
Snowflake recommends using the HASH_AGG function across a random set of tables in all databases. This function computes a fingerprint over the rows you provide it with. You can provide it with an entire table (or a subset of a table using a WHERE clause or a Time Travel clause such as AT (TIMESTAMP => ...), for example) and it will return an aggregate signed 64-bit hash value.
You can then use the hash values from both databases to compare for any differences. You can automate this process by writing the results to a metadata table.
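A minimal sketch of such a check, using hypothetical table and metadata-table names; run the same query against the primary and the secondary and compare the resulting hashes:

```sql
-- Fingerprint an entire table:
SELECT HASH_AGG(*) FROM sales.public.orders;

-- Pin both sides to the same point in time with Time Travel so that
-- in-flight changes don't produce a false mismatch:
SELECT HASH_AGG(*)
FROM sales.public.orders
  AT (TIMESTAMP => '2023-01-01 00:00:00'::TIMESTAMP_LTZ);

-- Persist the result for automated comparison (metadata table is hypothetical):
INSERT INTO admin.replication_checks
SELECT CURRENT_ACCOUNT(), 'sales.public.orders', HASH_AGG(*), CURRENT_TIMESTAMP()
FROM sales.public.orders;
```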
Data Loss
Recovery Point Objective (RPO): The maximum amount of data loss that is acceptable when recovering from a disaster or outage, measured in time (for example, one hour's worth of data).
Recovery Time Objective (RTO): The maximum downtime the business can tolerate while the service is recovered, again measured in time.
While the RTO for Snowflake can be minimal because you don't need to deal with restoring physical backups, the RPO is important to consider when scheduling the data refresh from your primary to your secondary databases. You should schedule your data refreshes so that if you lose your primary instance, any data loss falls well within your RPO.
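One way to keep an eye on this, sketched here with a hypothetical database name, is to query the refresh history on the secondary account and compare the time since the last successful refresh against your RPO:

```sql
-- On the secondary account: when did recent refreshes start and finish?
-- The gap since the last completed refresh is your current exposure.
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DATABASE_REFRESH_HISTORY('mydb1'));
```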
Bringing It All Together
Here the topics you’ve learned about in this chapter are brought together in a real-world, practical example.
As with many features of Snowflake, it’s possible to configure primary and secondary databases from either the UI or using SQL. For this example, you’re going to use the web UI. Regardless of the approach, you need to use the ACCOUNTADMIN role before you begin the configuration.
The Example Scenario
1. Configure replication and failover (this promotes the local database to become the primary database).
2. Select an account to replicate the data to.
3. Create a secondary database on the replicated account and refresh the data.
4. Monitor the initial data refresh.
5. Schedule ongoing data refreshes between the primary and secondary databases.
Steps
Step 1: Configure Replication and Failover
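If you prefer SQL to the web UI, this step corresponds roughly to the following statements, run on the source account as ACCOUNTADMIN (organization, account, and database names are hypothetical). Note that in the SQL form, the choice of target account (Step 2) is folded into the TO ACCOUNTS clause:

```sql
-- Promotes the local database to primary and permits replication:
ALTER DATABASE mydb1 ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;

-- Failover additionally requires Business Critical edition or higher:
ALTER DATABASE mydb1 ENABLE FAILOVER TO ACCOUNTS myorg.dr_account;
```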
Step 2: Select an Account to Replicate the Data to
Step 3: Create a Secondary Database on the Replicated Account
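A sketch of the SQL equivalent of this step, run on the target account (names are hypothetical):

```sql
-- Create the secondary as a replica of the primary database:
CREATE DATABASE mydb1 AS REPLICA OF myorg.prod_account.mydb1;

-- Trigger the initial data refresh:
ALTER DATABASE mydb1 REFRESH;
```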
Step 4: Monitor the Initial Data Refresh
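In SQL, monitoring the refresh from the secondary account looks roughly like this (database name hypothetical):

```sql
-- Reports the current refresh, phase by phase, with progress and timings:
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DATABASE_REFRESH_PROGRESS('mydb1'));
```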
Step 5: Schedule Ongoing Data Refreshes
You can schedule ongoing refreshes to refresh your secondary database(s). Clicking Schedule Refresh brings up the dialog box (Figure 6-16) that contains a SQL template.
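The SQL template boils down to a task on the secondary account. Here is a hedged sketch, with hypothetical task, warehouse, and database names and a ten-minute interval chosen for illustration; size the interval to fit your RPO:

```sql
-- Refresh the secondary every 10 minutes:
CREATE TASK refresh_mydb1
  WAREHOUSE = refresh_wh
  SCHEDULE = '10 MINUTE'
AS
  ALTER DATABASE mydb1 REFRESH;

-- Tasks are created in a suspended state; resume to start the schedule:
ALTER TASK refresh_mydb1 RESUME;
```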
Summary
In this chapter, you learned what capabilities Snowflake provides for disaster recovery scenarios. Although many concepts such as data replication, failover, failback, and RPO/RTO are long established in traditional database administration, moving to the cloud naturally brings with it some newer concepts and terminology, such as regions and availability zones.
You explored how to configure the primary and secondary databases before scheduling a regular data refresh. In the event of an outage of the primary database, you learned the process flow for failing over to the secondary database to maintain business operations.
At the end of this chapter, you saw a practical example of configuring this scenario using the web UI. You should now be super confident in approaching this area. Not only will you be able to ensure you capture and document the right requirements, which will help you design your solution, but you will also be able to implement your design as well!
In the next chapter, you’ll switch gears and move into sharing data with users both inside and outside your organization. You’re also going to delve into the data cloud, which allows you to select from a menu of data and work with it on demand with a click of a button!