© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Morton, Mastering Snowflake Solutions, https://doi.org/10.1007/978-1-4842-8029-4_6

6. Business Continuity and Disaster Recovery

Adam Morton
Sydney, NSW, Australia
 

Currently three cloud providers dominate the market: Microsoft, Amazon, and Google. It just so happens that these are the same three choices you have when deciding where to run Snowflake.

However, these giants of cloud computing are certainly not immune to failure. These organizations are at the forefront of innovation, leading the way into the uncharted territory of highly complex, distributed computing at a massive scale.

In 2015, Google suffered a minor outage on some services, following four successive lightning strikes on the local power grid that supplied power to its data center in Belgium.

In 2017, AWS’ S3 experienced a four-hour outage in the US-EAST-1 region. A vast number of websites and applications were impacted. Don’t forget: If you are running Snowflake with an external stage, it will reside on either S3, GCS, or Azure Blob storage. In this particular scenario, you would have experienced some impact if your external stage pointed to an AWS storage location within this region.

In March of 2021, Microsoft was carrying out a data migration effort at the same time as a business-as-usual task to rotate keys. Suddenly users found themselves unable to log in. This error resulted in a 14-hour outage, and the impact wasn’t limited to Azure services; Office, Teams, and other Microsoft applications were also affected.

It is worth pointing out that Snowflake itself is also subject to outages in the same way that the other third-party cloud providers are. There is a steady stream of organizations willing to move their workloads to the cloud in order to realize the promise of cost savings along with the ability to scale on-demand. However, these recent failures should act as a reminder that we need to consider how these outages might impact our business-critical functions.

One thing struck me when writing this chapter: This is the first chapter in the book in which the approach you take to setting up data replication with Snowflake doesn’t differ much from the traditional RDBMSs I used in the past. Obviously, we’re working in the cloud now, which brings new benefits, but if you have experience in setting up data replication between two sites, many of the concepts still hold true.

Regions and Availability Zones

Before I get into the specific capabilities of Snowflake in this area, I want to ensure that you are comfortable with some fundamentals of what I’ll discuss in this chapter. If you’re already familiar with regions and availability zones, you can move straight on to the next section.

A region relates to a physical location. Some examples are US West (Oregon), Europe (London), or Asia Pacific (Sydney). Here, the cloud service providers deploy data centers. A logical grouping of data centers within a region is referred to as an availability zone (AZ). Each AZ is completely independent from the next, with redundant power and networking supplies.

Typically, a region consists of a minimum of three AZs. This allows cloud-based services to maintain multiple, redundant copies of data across three AZs within the same region. For example, AWS’ S3 service automatically creates and stores its data across separate AZs, thus ensuring data durability. If an AZ goes down because of a power failure, the services can continue to run uninterrupted through one of the other two AZs within that region.

Data Replication, Failover, and Failback

Ensuring that you design your system to handle failures is critical to the success of your solution. Depending on the criticality of the service in question, not every organization will need to maintain a global footprint of services across multiple regions. It’s all about balancing risk and cost with business continuity in mind.

Data replication guarantees you’ll have another copy of your data stored in a database, which is physically separate from your primary database instance. This provides an insurance policy against losing your primary database due to hardware or software failure and power or network outages.

With Snowflake you can replicate data between multiple regions and across multiple cloud service providers too. You can have one Snowflake account in Azure replicating data to another account running on AWS. Therefore, if the Azure service suffers a major outage, you can fail over to your AWS standby service, thereby allowing your business to continue to function as normal.

You can instruct Snowflake to automatically replicate changes between databases in different regions without the need to manually configure ETL jobs or data pipelines to copy data.

You can also determine the frequency at which the data replication service runs, fine-tuning it to meet your own maximum accepted data loss in the event of a disaster. As you’d expect, data is encrypted in transit between cloud environments, and it’s also compatible with Tri-Secret Secure, which we discussed in the previous chapter.

Note

Failover and failback are features of Snowflake’s Business Critical edition or higher.

Primary and Secondary Databases

When you enable data replication, you need to define the primary database as part of the setup. Under business-as-usual (BAU) operations, changes are written to and committed to this primary database first. When users or applications query the service, data is returned from this database. When the primary database is replicated to another account, this copy is called the secondary database.

The secondary database is in a read-only mode, meaning changes cannot be applied directly to it by a user or application. If this were permitted, it would fall out of sync with the primary database, breaking the replication process. When a DDL or DML operation is carried out on the primary database, those changes are periodically batched up and replayed against the secondary database. All data, along with any changes to objects such as schemas, tables, or views, is replicated.

Note

Only objects at the database level are replicated. Users (their privileges), roles, warehouses, shares, and resource monitors are not replicated.

If you are running the Business Critical edition and you configure replication to an account on a lower edition, Snowflake will throw an error. This is a good feature and was put in place by design to prevent exposing sensitive data by mistake. As discussed, the Business Critical edition is typically used by organizations that handle highly sensitive data, which is why Snowflake recognizes this and tries to help out in this scenario. You can override this default behavior by using the IGNORE EDITION CHECK clause when executing the statement that enables replication, which I’ll cover later.
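To make the clause concrete, here is a hedged sketch of how it is appended to the statement that enables replication (the database name and account identifier are made up for illustration):

ALTER DATABASE sensitive_db
  ENABLE REPLICATION TO ACCOUNTS myorg.standard_edition_account  -- illustrative account identifier
  IGNORE EDITION CHECK;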

You should be aware that compute resources are used to support data replication operations.

Promoting Databases

The process of promoting a secondary database to a primary database is not automated. To promote a secondary database, a role with the OWNERSHIP privilege on the database must execute the ALTER DATABASE mydb1 PRIMARY; command. You must then point your connections to the newly promoted primary database.
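As a minimal sketch, using the placeholder database name from above and assuming your current role holds the required OWNERSHIP privilege:

USE ROLE db_owner_role;  -- illustrative; any role with OWNERSHIP on the database
ALTER DATABASE mydb1 PRIMARY;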

Client Redirect

At the time of writing, the client redirect feature was in preview. However, I want to touch upon it here as the aim of this capability is to expose a secure URL for your organization’s Snowflake databases regardless of which Snowflake account they reside within. Although you still need to manually promote a secondary database to a primary as described above, this capability negates the need to manually update the URL your client connections and applications are using.

The hostname in the connection URL is composed of your organization name and the connection object name in addition to a common domain name: organization_name-connection_name.snowflakecomputing.com.

Take note that the hostname does not specify the account to which you are connecting. An account administrator determines the account to use by assigning an account to serve as the primary connection. When you use this URL to connect to Snowflake, you are connecting to the account that has been defined as the primary connection.

If an outage impacts the account serving as the primary connection, the administrator can point this URL to a Snowflake account that stores the secondary databases. Throughout the outage, you can continue to use the same URL to access your databases.
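In SQL terms, the redirect itself is a single statement executed in the account that holds the secondary databases; a hedged sketch, assuming a connection object named my_connection:

ALTER CONNECTION my_connection PRIMARY;  -- the shared URL now resolves to this account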

Business Continuity

Process Flow

The following set of diagrams walks through how the failover and failback processes work. Say your organization has two accounts, one in AWS’ US East (Ohio) region and the other in Google’s asia-east2 (Hong Kong) region.

Initially you promote the local database instance in Ohio to serve as the primary database. You also configure database replication for the secondary account in Hong Kong, as in Figure 6-1.
Figure 6-1

Promoting the local database to primary by configuring replication

Next, within your secondary account, in Hong Kong, you create the secondary database and immediately kick off the data refresh to populate the database from the primary (Figure 6-2). The secondary database is in read-only mode.
Figure 6-2

Creating the secondary database and executing the data refresh

Let’s assume a major issue takes the US East (Ohio) region offline. You can then fail over to the secondary database in Hong Kong, which is promoted to become the primary, becoming writable in the process to support the traffic newly directed to this instance, as shown in Figure 6-3.
Figure 6-3

AWS Ohio region suffers an outage, resulting in the secondary database being promoted to the primary

Once the issue in the US East (Ohio) region has been resolved and your database comes back online, any data that has been written to the newly promoted primary database (in Hong Kong) will need to be replicated to this secondary database (Figure 6-4).
Figure 6-4

AWS Ohio is back online as the secondary database. Data from the primary is replicated

The databases are now in sync. You can now decide whether this configuration is satisfactory or whether you want to fail back. Let’s say most of your workloads and users are based in the US. Some users may experience a little more latency when using the instance based in Hong Kong. In this case, you decide to fail back and promote the secondary instance (currently the Ohio one) back to the primary role (Figure 6-5).
Figure 6-5

Promoting the AWS Ohio instance back to the primary
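In SQL terms, the failback shown in Figure 6-5 boils down to a refresh followed by a promotion; a minimal sketch, assuming a database named sales and run from the Ohio account:

ALTER DATABASE sales REFRESH;   -- catch up with the changes written in Hong Kong
ALTER DATABASE sales PRIMARY;   -- promote the Ohio instance back to the primary role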

Monitoring Replication Progress

You’re able to monitor the refresh duration within the Web UI in the Databases ➤ Replication tab. The Last Refresh Status contains a colored progress bar (Figure 6-6), which, when hovered over, provides a detailed breakdown of the time taken for each stage in the refresh process.
Figure 6-6

Detailed data refresh statistics

Highlighting the row of the database on this page brings up a sidebar displaying the current refresh status (Figure 6-7), along with the refresh start time and the number of bytes transferred.
Figure 6-7

Refresh History pane in the Web UI

You can also view the last 14 days of refresh history by querying the metadata using the DATABASE_REFRESH_HISTORY table function. If you need to go back further than this, use the REPLICATION_USAGE_HISTORY view. You can find this view in the ACCOUNT_USAGE schema within the SNOWFLAKE database.
-- Refresh history for the SALES database (last 14 days)
SELECT *
FROM TABLE(information_schema.database_refresh_history(SALES));
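For history beyond 14 days, a minimal sketch against the account usage view might look like this (the database name filter is illustrative):

SELECT start_time, end_time, database_name, credits_used, bytes_transferred
FROM snowflake.account_usage.replication_usage_history
WHERE database_name = 'SALES';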

Reconciling the Process

You may want to periodically conduct your own checks to ensure the primary and secondary databases are in sync.

Snowflake recommends using the HASH_AGG function across a random set of tables in both databases. This function creates a unique fingerprint for the rows you provide it with. You can provide it with an entire table (or a subset of a table using a WHERE clause or a Time Travel AT(TIMESTAMP => ...) clause, for example) and it will return an aggregate signed 64-bit hash value.

You can then use the hash values from both databases to compare for any differences. You can automate this process by writing the results to a metadata table.
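A minimal sketch of such a check, using an illustrative table name and timestamp; run the same statement on both the primary and the secondary account and compare the two hash values:

SELECT HASH_AGG(*) AS table_fingerprint
FROM sales.public.orders AT(TIMESTAMP => '2022-01-01 00:00:00'::timestamp_tz);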

Data Loss

When designing for business continuity, it’s important to understand your organization’s tolerance for data loss and downtime. There are two key metrics, which have been around for years, that cater to this:
  • Recovery Point Objective (RPO): The maximum amount of data the business can afford to lose when recovering from a disaster or outage, measured in time.

  • Recovery Time Objective (RTO): The maximum downtime the business can tolerate, again measured in time.

Figure 6-8 illustrates where these two metrics sit in the disaster recovery process.
Figure 6-8

RPO and RTO explained

While the RTO for Snowflake can be minimal, because you don’t need to deal with restoring physical backups, the RPO is important to consider when scheduling your data refresh from your primary to your secondary database. You should schedule your data refreshes so that if you lose your primary instance, any data loss falls well within your RPO.

Bringing It All Together

Here the topics you’ve learned about in this chapter are brought together in a real-world, practical example.

As with many features of Snowflake, it’s possible to configure primary and secondary databases either from the web UI or by using SQL. For this example, you’re going to use the web UI. Regardless of the approach, you need to use the ACCOUNTADMIN role before you begin the configuration.

The Example Scenario

You’re going to set up a cross-cloud configuration between two different cloud providers. To do this, you need to create primary and secondary databases before scheduling an ongoing refresh between the two databases. Figure 6-9 illustrates what the finished setup will look like.
Figure 6-9

This map shows what you’ll end up with from this practical example

In the following steps, I will walk you through how to carry out these tasks:
  1. Configure replication and failover (this promotes the local database to become the primary database).

  2. Select an account to replicate the data to.

  3. Create a secondary database on the replicated account and refresh the data.

  4. Monitor the initial data refresh.

  5. Schedule ongoing data refreshes between the primary and secondary databases.

Steps

Step 1: Configure Replication and Failover

In the Databases section of the web UI, you will see a Replication tab. If you select it and then highlight the database you wish to use, you can click Enable Replication, which brings up the dialog box shown in Figure 6-10.
Figure 6-10

Enabling replication

Step 2: Select an Account to Replicate the Data to

The Available to dropdown box lists all the accounts within your organization (Figure 6-11). Select the account you wish to replicate the data to and click Enable.
Figure 6-11

Selecting the available accounts for replication
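If you prefer SQL over the UI, steps 1 and 2 together correspond to enabling replication (and, optionally, failover) from the primary account; a minimal sketch with illustrative database and account names:

USE ROLE ACCOUNTADMIN;
ALTER DATABASE sales ENABLE REPLICATION TO ACCOUNTS myorg.secondary_account;
ALTER DATABASE sales ENABLE FAILOVER TO ACCOUNTS myorg.secondary_account;  -- requires Business Critical edition or higher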

Step 3: Create a Secondary Database on the Replicated Account

Now you can head on over to the secondary account. Log into the web UI and go to the Replication tab in the Databases section. Since you have configured replication to this account, you will see your database appear in the Available area (Figure 6-12).
Figure 6-12

Databases available on this account

Select the database and click Create Secondary Databases. The dialog box in Figure 6-13 will pop up. Ensure that the Refresh immediately after creation check box is checked and click Create.
Figure 6-13

The Create Secondary Databases dialog box
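The SQL equivalent of this step, run from the secondary account (names again illustrative):

CREATE DATABASE sales AS REPLICA OF myorg.primary_account.sales;
ALTER DATABASE sales REFRESH;  -- mirrors the Refresh immediately after creation option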

Step 4: Monitor the Initial Data Refresh

If you’ve followed the steps above, then the data refresh from the primary to the secondary database will run immediately. The Last Refresh Status column will display In progress, as shown in Figure 6-14.
Figure 6-14

The secondary database being refreshed with data from the primary account

Once the data refresh is complete, you can hover over the progress bar and see the duration of each stage of the refresh process (Figure 6-15).
Figure 6-15

Viewing the detailed data refresh statistics
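You can also check the progress of a running refresh in SQL; a minimal sketch, assuming the database is named sales:

SELECT *
FROM TABLE(information_schema.database_refresh_progress(sales));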

Step 5: Schedule Ongoing Data Refreshes

You can schedule ongoing refreshes to refresh your secondary database(s). Clicking Schedule Refresh brings up the dialog box (Figure 6-16) that contains a SQL template.

You can see that the template will create a new task. You can amend the schedule as needed and execute it for regular data refreshes in line with your RPO and RTO. A sketch of such a task follows Figure 6-16.
Figure 6-16

The Schedule Refresh dialog box
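A hedged sketch of what such a task might look like (the warehouse, task name, and schedule are all illustrative; align the schedule with your RPO):

CREATE TASK refresh_sales_secondary
  WAREHOUSE = my_wh
  SCHEDULE = '60 MINUTE'
AS
  ALTER DATABASE sales REFRESH;

ALTER TASK refresh_sales_secondary RESUME;  -- tasks are created in a suspended state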

Summary

In this chapter, you learned what capabilities Snowflake provides for disaster recovery scenarios. Although many concepts such as data replication, failover, failback, and RPO/RTO are firmly rooted in the past, moving to the cloud naturally brings with it some newer concepts and terminology, such as regions and availability zones.

You explored how to configure the primary and secondary databases before scheduling a regular data refresh. In the event of an outage of the primary database, you learned the process flow for failing over to the secondary database to maintain business operations.

At the end of this chapter, you saw a practical example of configuring this scenario using the web UI. You should now be super confident in approaching this area. Not only will you be able to ensure you capture and document the right requirements, which will help you design your solution, but you will also be able to implement your design!

In the next chapter, you’ll switch gears and move into sharing data with users both inside and outside your organization. You’re also going to delve into the data cloud, which allows you to select from a menu of data and work with it on demand with a click of a button!
