© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Morton, Mastering Snowflake Solutions, https://doi.org/10.1007/978-1-4842-8029-4_9

9. Advanced Performance Tuning

Adam Morton
Sydney, NSW, Australia
 

How do you tune the Snowflake data warehouse when there are no indexes and limited options available to tune the database itself?

Snowflake was designed for simplicity, to reduce administrative overhead, and to work straight out of the box. It keeps its promise pretty well in my experience, but at some stage your data volumes and user base may well reach a tipping point where you need to get your hands dirty.

Out of the box, you can increase the virtual warehouse size to improve performance. If you want to address the underlying issues in a more targeted way, you have several options, and those are the focus of this chapter. Generally, the best approach is to pinpoint the root cause of the performance issue, which then allows you to improve the query or reconfigure the warehouse appropriately. The pain point might lie in ingestion, transformation, or the way end user queries are written. The most effective solutions usually come from reviewing the symptoms and reaching a diagnosis before moving to a solution.

By all means, consider scaling up to a larger warehouse to improve query performance, but first identify and focus on the actual problem. In doing so, you may well discover that there are more effective and efficient solutions available.

Designing Tables for High Performance

Spending a little time considering how best to design tables will set you up for success as your solution grows. Therefore, it is really important to consider some best practices here.

Data Clustering

I covered micro-partitioning and clustering briefly earlier in this book. As a reminder, Snowflake stores data in small blocks of storage called micro-partitions. By default, these micro-partitions are arranged according to the order in which the data is loaded, which acts as the table’s natural clustering.

Snowflake collects rich statistics on these micro-partitions, which allows queries to avoid reading unnecessary parts of a table based on the query filters, known as pruning. To provide the maximum benefit, it is important that the underlying physical data is aligned to the query usage patterns.

As discussed, there is typically no need to specify a clustering key for most tables. Snowflake performs automatic tuning via the optimization engine and its use of micro-partitioning. However, for larger tables (greater than 1 TB), where the query profile indicates that a significant percentage of the total query duration is spent scanning data, you should consider explicitly defining (or changing) the clustering key on the table.

Clustering Key

Snowflake’s definition of a clustering key is “a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the data in the table in the same micro-partitions.”

Having an effective clustering key allows a query to scan fewer micro-partitions, pull less data into the cache, and return results more quickly to the user or application that submitted the query. As you can imagine, it’s pretty important to ensure you’ve selected the correct clustering key for large tables.

The clustering key is used to organize how data is stored across the micro-partitions, and once a clustering key is defined, Snowflake will automatically “recluster” the data within the micro-partitions that make up your tables to keep it organized for optimal performance. Co-locating similar data, such as records containing the same transaction date, can significantly reduce the work a query has to do when the transaction date is used as a filter or in a join between tables.

A clustering key can consist of more than one column. Snowflake recommends a maximum of three to four; beyond this point you’ll start to incur more costs due to frequent reclustering, which negates the performance benefits.

When selecting a clustering key, it is also important to consider the cardinality, that is, how many distinct values are in the column(s) to be used. Too few distinct values (such as a column containing only Male and Female) results in ineffective pruning; too many distinct values (such as a unique ID column) is also inefficient. When selecting a clustering key, aim to strike a balance between cardinality and the columns commonly used in query filters or joins.
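As a rough sketch of this assessment (the SALES table and its column names here are hypothetical), you can compare the distinct-value counts of candidate columns against the total row count before committing to a key:
SELECT COUNT(*)                        AS total_rows,
       COUNT(DISTINCT sale_date)       AS distinct_sale_dates,   -- moderate cardinality: good candidate
       COUNT(DISTINCT customer_gender) AS distinct_genders,      -- too few values: poor pruning
       COUNT(DISTINCT order_id)        AS distinct_order_ids     -- near-unique: expensive to maintain
FROM sales;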

Besides sheer table size, the other determining factor for changing your clustering key is when the query patterns against your data differ from the order in which the data is loaded.

For example, imagine you have a very large transactional table that contains order details. This data is loaded by Order Date. But your reports all query the data in this table by OrderID. In this instance, you’ll get better performance by setting a multi-column clustering key on the Order Date and OrderID columns.

You can specify the clustering key when you create a table as follows:
CREATE OR REPLACE TABLE TEST (C1 INT, C2 INT) CLUSTER BY (C1, C2);
Or you can alter an existing table:
ALTER TABLE TEST CLUSTER BY (C1, C2);

The order of the clustering keys in a multi-clustered table also matters. List first the columns that are relied on by the largest number of queries.

To view the clustering keys for a table, you can run the following command:
SHOW TABLES LIKE 'TEST';

Pruning Efficiency

The efficiency of pruning can be observed by comparing the Partitions scanned and Partitions total statistics in the table scan operators within the query profile. Using the earlier example with the TPC data, you can see that the table scan operator scanned 1,485 partitions out of a total of 7,509 (Figure 9-1).
Figure 9-1

Pruning statistics

The wider the gap between the number of partitions scanned and the total partitions, the better. After all, avoiding reading unnecessary partitions is the name of the game here.

If these numbers were much closer, this would inform you that pruning isn’t helping your query. In that case, for very large tables you should look into changing the clustering key.

For smaller tables, you could look to reorganize your query to include a filter that uses the existing clustering key. If a filter operator exists above the table scan that removes a significant number of records, this could be an indication that restructuring your query might be worth looking into.
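As a simple, hypothetical sketch (assuming a SALES table clustered on SALE_DATE), adding a predicate on the existing clustering key gives the pruning engine something to work with:
-- Before: filtering only on a non-clustered column forces a near-full table scan
SELECT SUM(amount) FROM sales WHERE customer_id = 1234;
-- After: an additional filter on the clustering key lets Snowflake skip micro-partitions
SELECT SUM(amount)
FROM   sales
WHERE  customer_id = 1234
AND    sale_date >= '2022-01-01';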

Clustering Depth

The clustering depth for a table is a value (always 1 or greater for any populated table) that measures the average depth of overlapping micro-partitions for the specified columns. The closer the number is to 1, the better the data is clustered, which means a query has to read fewer micro-partitions when filtering on those columns.

Several factors can impact the clustering depth. As DML operations are performed on a table, data will naturally become less clustered over time.

Snowflake provides a system function called SYSTEM$CLUSTERING_INFORMATION to help assess how well clustered a table is. Although SYSTEM$CLUSTERING_INFORMATION returns the clustering depth as part of its JSON result set, you can alternatively use the SYSTEM$CLUSTERING_DEPTH function, which returns just the depth of the table.

With both functions, you provide a table name along with, optionally, one or more columns. The function returns how well clustered the table is based on the columns provided.

Note

If you don’t specify any columns for either the SYSTEM$CLUSTERING_INFORMATION or SYSTEM$CLUSTERING_DEPTH function, the table’s existing clustering key is used by default.

Let’s have a look by running these functions against one of the TPC tables used earlier in this chapter:
use schema snowflake_sample_data.tpcds_sf10tcl;
SELECT SYSTEM$CLUSTERING_DEPTH('store_returns');
SELECT SYSTEM$CLUSTERING_INFORMATION('store_returns');
The results returned by the SYSTEM$CLUSTERING_INFORMATION function are as follows:
{
  "cluster_by_keys" : "LINEAR(SR_RETURNED_DATE_SK, SR_ITEM_SK)",
  "notes" : "Clustering key columns contain high cardinality key SR_ITEM_SK which might result in expensive re-clustering. Please refer to https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html for more information.",
  "total_partition_count" : 7509,
  "total_constant_partition_count" : 0,
  "average_overlaps" : 7.6508,
  "average_depth" : 5.0602,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 1,
    "00002" : 170,
    "00003" : 225,
    "00004" : 2046,
    "00005" : 2413,
    "00006" : 1909,
    "00007" : 682,
    "00008" : 63,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

Reclustering

If you do decide a recluster is necessary, use the ALTER TABLE command as follows:
ALTER TABLE TEST RECLUSTER;
You can also optionally choose to limit the recluster operation to a maximum number of bytes as the following code snippet demonstrates:
ALTER TABLE TEST RECLUSTER MAX_SIZE = 100000;
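If a clustering key is defined and you want to control the background maintenance, you can also pause and resume Automatic Clustering on the table, and the ACCOUNT_USAGE schema exposes the credits it consumes. The statements below are a sketch against the example TEST table:
ALTER TABLE TEST SUSPEND RECLUSTER;  -- pause automatic clustering on the table
ALTER TABLE TEST RESUME RECLUSTER;   -- resume it again
-- Review the credits consumed by automatic clustering over the last 30 days
SELECT TABLE_NAME, SUM(CREDITS_USED) AS CREDITS_USED
FROM   SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE  START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY TABLE_NAME;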

Designing High-Performance Queries

Optimizing Queries

To work out the mechanics of a query, you need to know where to go to view the query profile. If you go into the History tab, you’ll see a list of previously executed queries, as shown in Figure 9-2.
Figure 9-2

The history section of the web UI

If you click on one of the query IDs, it will take you through to the details for that query. Here, you can select the Profile tab. This will display the execution tree for that query.

Let’s execute a query in your environment. Go back to your query worksheet and select the small down arrow (Figure 9-3).
Figure 9-3

Opening tutorials

Select Open Tutorials and select Sample queries on TPC-DS data (Figure 9-4).
Figure 9-4

Sample tutorial queries

This will open up the following sample query:
-- TPC-DS_query1
with customer_total_return as
(select sr_customer_sk as ctr_customer_sk
,sr_store_sk as ctr_store_sk
,sum(SR_RETURN_AMT_INC_TAX) as ctr_total_return
from store_returns
,date_dim
where sr_returned_date_sk = d_date_sk
and d_year =1999
group by sr_customer_sk
,sr_store_sk)
 select  c_customer_id
from customer_total_return ctr1
,store
,customer
where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
from customer_total_return ctr2
where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
and s_store_sk = ctr1.ctr_store_sk
and s_state = 'NM'
and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id
limit 100;
You’re going to execute the first query, TPC-DS_query1. This will take a while to execute. Once it completes, head back into the History section. Click the query ID and select Profile. This will open up the query profile tree as well as the profile overview on the right-hand side of the window (Figure 9-5).
Figure 9-5

Profile Overview window

Looking at the Profile Overview tells you some important information about the query. Look at the Spilling section. This tells you that the intermediate results of the query are too large to fit in the memory of the virtual warehouse, so the data starts to “spill” to local storage. This additional I/O is very expensive and creates performance issues.

Handily, Snowflake breaks down the query into the most expensive nodes (Figure 9-6). This allows you to focus your efforts on what will give you the biggest return.
Figure 9-6

The most expensive nodes in the query profile section of the web UI

The arrows connecting each of the nodes are annotated with the number of records. The thicker the arrow, the more records are being transferred between elements of the query plan.

You can see in the window that this aggregate function is summing the SR_RETURN_AMT_INC_TAX​ field and is causing records to spill to disk (Figure 9-7).
Figure 9-7

Query profile showing spilling to disk

You can also see that the processing execution time is high at 85%, which suggests that compute is also constraining the execution time of the query.

If you click the Aggregate node, it takes you to that node in the query plan (Figure 9-8).
Figure 9-8

Aggregate node in the query plan

So what can you do? Well, the easiest thing is to scale up the size of your warehouse. This would add more compute and memory instantly, thus improving the query, but at a higher cost. However, as I mentioned, this isn’t always the best strategy.

You could decide to reduce the volume of records earlier on in the query, in an attempt to reduce the volume of 555.9 million records. This would reduce the work the aggregate function has to do. Perhaps you could filter the records by a date range or a particular product category, for example. You might be able to join it to another table in your database to restrict the records you need to deal with.

For example, let’s assume this particular query is executed by a data consumer multiple times throughout the day, while the underlying data is loaded in batch overnight by your ETL process. By creating an aggregated table once as part of the ETL before the business day starts, you allow for better performance for your data consumers. This approach is ideal when the data in the underlying dataset changes infrequently.
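A minimal sketch of this pattern is to rebuild an aggregate table at the end of the overnight ETL and point the consumers’ queries at it. The target table name here is hypothetical, and it would live in your own database rather than the read-only sample database:
CREATE OR REPLACE TABLE daily_store_returns AS
SELECT sr_returned_date_sk,
       sr_store_sk,
       SUM(sr_return_amt_inc_tax) AS total_return_amt,
       COUNT(*)                   AS return_count
FROM   store_returns
GROUP BY sr_returned_date_sk, sr_store_sk;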

Alternatively, you could create a materialized view (which I cover later in this chapter) to store the aggregated results.

Cardinality

A common mistake I see in queries is when users join multiple records from one table to multiple records in another table. Often this is because the wrong join columns have been selected, or the join condition is missing altogether!

This creates a huge number of records in the resulting query, an “exploding” join, if you will. This scenario, which in its extreme form (a missing join condition) is a Cartesian product, is easily identified by viewing the number of records in the query profile (Figure 9-9).
Figure 9-9

The “exploding” join

Here you can see that the tables feed 772 and 816 records into the join, respectively, but the join outputs over 235,000 records. This points to something being fundamentally wrong in the query and warrants further investigation.

Materialized Views

A materialized view is a precomputed data set based on the SELECT statement within the definition of the view. Since the data is precomputed, it returns data faster than a standard view, which computes its results at execution time.

Using a materialized view can help speed up expensive queries, especially if you’re running this query multiple times a day.

Note

Materialized views are only available in the Enterprise edition of Snowflake and above.

So why use a materialized view over a table? Materialized views can be very useful when the view returns a small amount of data relative to the underlying table used in the SELECT statement, or when that SELECT contains high-cost operations such as aggregations or processing of semistructured data.
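As a hedged sketch (the table and column names are hypothetical), a materialized view that precomputes an aggregation over a single large table looks like this; note that it respects the limitations listed later in this section:
CREATE MATERIALIZED VIEW returns_by_day_mv AS
SELECT returned_date,
       SUM(return_amount) AS total_return_amt,
       COUNT(*)           AS return_count
FROM   sales_returns
GROUP BY returned_date;
Queries that aggregate or filter by returned_date can then be served from the precomputed results rather than rescanning the base table.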

Materialized views are maintained automatically by Snowflake. If the underlying table data changes, Snowflake will recognize this and recompute the data. This removes a lot of complexity and points of failure when compared to manually developing your own solution to carry this out.

The Snowflake service that keeps materialized views up to date incurs additional costs. Firstly, it requires additional storage to hold the precomputed results. Secondly, when Snowflake needs to recompute the results, it uses compute, resulting in credit consumption. Therefore, it’s best to use materialized views when the underlying data changes less frequently.

Furthermore, the query optimizer in Snowflake will recognize when using a materialized view results in a lower-cost query plan and will transparently redirect the query to use the materialized view.

You can also decide to cluster a materialized view. In fact, you could create a set of materialized views, each with a different clustering key, to cater to different workloads. Where most of your queries access the base table data through clustered materialized views, this almost negates the impact of the clustering key on the base table itself.

However, be warned that there are some limitations when using materialized views. They could be significant depending on what you are looking to achieve. Here are the most severe limitations for you to be aware of:
  • A materialized view can query only a single table.

  • Joins, including self-joins, are not supported.

A materialized view cannot include the following:
  • UDFs (this limitation applies to all types of user-defined functions, including external functions)

  • Window functions

  • HAVING clauses

  • ORDER BY clauses

  • LIMIT clauses

Many aggregate functions are also not allowed in a materialized view definition. For a complete list, and for an up-to-date list of all materialized view limitations, visit the official Snowflake documentation page here:

https://docs.snowflake.com/en/user-guide/views-materialized.html#limitations-on-creating-materialized-views.

Search Optimization Service

Traditionally, data warehouses are designed for large analytical queries, such as aggregations and calculations over very large data sets. They fare less well when users run point lookup queries; in other words, highly selective queries that look up just a small number of records.

An example is when you need to retrieve a list of previous sales transactions of a customer based on their email address, like this:
SELECT D_DATE, SS_SALES_PRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF100TCL"."CUSTOMER" C
INNER JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF100TCL"."STORE_SALES" SS ON SS.SS_CUSTOMER_SK = C.C_CUSTOMER_SK
INNER JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCDS_SF100TCL"."DATE_DIM" DD ON SS.SS_SOLD_DATE_SK = DD.D_DATE_SK
WHERE C_EMAIL_ADDRESS = '[email protected]';

This is because the query engine needs to scan a large amount of data to find all the records that match the email address.

When you add search optimization to a table, Snowflake records metadata on the table to understand where all the data resides in the underlying micro-partitions. The query optimizer can use this metadata to find the relevant partitions faster than the normal pruning approach, significantly improving the performance of point lookup queries.

To apply the search optimization to a table, you need to have the ownership privilege on the table along with the search optimization privilege on the schema. The following code snippet shows how to do this:
GRANT ADD SEARCH OPTIMIZATION ON SCHEMA SALES TO ROLE BI_DEVELOPERS;
ALTER TABLE SALES.SALES_TRANSACTIONS ADD SEARCH OPTIMIZATION;

The search optimization service works best on tables that either aren’t clustered or that are frequently queried on columns other than the clustering key. In addition, the queries should be running for tens of seconds, with at least one of the columns used in the query filter having between 100,000 and 200,000 distinct values.

Note

A good way of obtaining an approximate distinct count for a column is to use the APPROX_COUNT_DISTINCT function in a SELECT statement. This uses the HyperLogLog algorithm under the hood, so it is very fast and cost efficient compared to an exact COUNT(DISTINCT).
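For example, to check whether the email column used in these lookups falls into that cardinality range, you could run a quick sketch like this against the CUSTOMER table from the example:
SELECT APPROX_COUNT_DISTINCT(C_EMAIL_ADDRESS) AS APPROX_DISTINCT_EMAILS
FROM   CUSTOMER;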

The following query simulates a point lookup by searching for a specific email address in the WHERE clause. It took 11.89 seconds to run and scanned all 347 partitions, as shown in Figure 9-10.
SELECT D_DATE, SS_SALES_PRICE
FROM CUSTOMER C
INNER JOIN STORE_SALES SS ON SS.SS_CUSTOMER_SK = C.C_CUSTOMER_SK
INNER JOIN DATE_DIM DD ON SS.SS_SOLD_DATE_SK = DD.D_DATE_SK
WHERE C_EMAIL_ADDRESS = '[email protected]';
Figure 9-10

Profile showing a point lookup with no search optimization applied

To improve the speed of the query, enable the search optimization service on the table by executing the following code:
GRANT ADD SEARCH OPTIMIZATION ON SCHEMA PUBLIC TO ROLE SYSADMIN;
ALTER TABLE CUSTOMER ADD SEARCH OPTIMIZATION;
-- ENSURE THE SESSION DOESN'T USE THE CACHE, WHICH WOULD SKEW THE RESULTS
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
You can use the SHOW TABLES command to check which tables have the search optimization feature switched on, as well as to view the progress of harvesting the required metadata from the micro-partitions and the total space used. Figure 9-11 and Figure 9-12 show the before and after effect of enabling the search optimization service.
Figure 9-11

SHOW TABLES results before enabling the search optimization service

Figure 9-12

SHOW TABLES results after enabling the search optimization service
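A brief sketch of that check follows; the RESULT_SCAN step is optional and simply narrows the SHOW output down to the relevant columns:
SHOW TABLES LIKE 'CUSTOMER';
-- Optionally inspect just the search optimization columns from the last SHOW command
SELECT "name", "search_optimization", "search_optimization_progress", "search_optimization_bytes"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));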

Once the SEARCH_OPTIMIZATION_PROGRESS field shows the process has reached 100%, you can run the same query again to search for the email address and observe the results. This time the query executes in 1.14 seconds, and looking at the query profile in Figure 9-13 you can see that the search optimization service is now leveraged by the query optimizer. The benefit comes from the query engine using the metadata to target just 1 partition out of 347.
Figure 9-13

Query profile detailing the impact of the search optimization service

Obviously, the larger the table, the more partitions that make up the table and the bigger the benefit of enabling search optimization on the table.

Optimizing Warehouse Utilization

You should experiment with different warehouse sizes against consistent, homogeneous workloads to avoid incurring additional costs and wasting resources.

It is important to balance tuning the SQL within your queries against the number of concurrent queries your solution can cater for. Ensuring you leverage virtual warehouses effectively means you can reduce the risk of queries queuing while reducing latency.

Imagine your ETL process is a series of tasks executing queries one after another against an X-Small warehouse.

One solution is to scale up to a bigger virtual warehouse to complete the work faster, but this only works for so long. At some stage the bottleneck will likely move elsewhere, leaving unused resources on your virtual warehouse and costing you, as the customer, money.

A better approach is to use a multi-cluster warehouse and design your ETL tasks to execute in parallel. Your ETL process can then take advantage of concurrent connections, allowing the warehouse to scale out and back in on demand, optimizing cost, performance, and throughput.
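A sketch of such a warehouse configuration follows; the warehouse name and exact values are illustrative rather than prescriptive, and multi-cluster warehouses require the Enterprise edition or above:
CREATE WAREHOUSE IF NOT EXISTS ETL_WH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1      -- scale in to a single cluster when load is light
  MAX_CLUSTER_COUNT = 4      -- scale out under concurrent ETL tasks
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60     -- seconds of inactivity before suspending
  AUTO_RESUME       = TRUE;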

In the example in Table 9-1 and Figure 9-14, you’ll observe a performance and cost sweet spot at the Large virtual warehouse size. As you continue to scale up beyond that point, the performance gains start to plateau even though you incur significantly higher costs. This shows that providing more CPU and memory only gets you so far.
Table 9-1

Performance and Spend by Warehouse Size

Warehouse size   Credits/hour   Credits/second   Workload execution time (seconds)   Total credits used
X-Small          1              0.0003           232                                  0.0644444
Small            2              0.0006           202                                  0.1122222
Medium           4              0.0011           127                                  0.1411111
Large            8              0.0022           80                                   0.1777778
X-Large          16             0.0044           67                                   0.2977778
2X-Large         32             0.0089           48                                   0.4266667
3X-Large         64             0.0178           39                                   0.6933333
4X-Large         128            0.0356           33                                   1.1733333

Figure 9-14

Credits used vs. workload execution duration

Executing your workload across different sizes of warehouses allows you to establish a baseline. Logging your results in this way will lead you to the optimum warehouse sizes.

Adopting a similar approach with your own workloads will help you land upon the most appropriate warehouse configuration for your organization.

To run performance tests without the results cache skewing your results, execute the following statement first:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
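Building on that statement, here is a sketch of the baselining loop, assuming a dedicated test warehouse named TEST_WH (hypothetical):
ALTER WAREHOUSE TEST_WH SET WAREHOUSE_SIZE = 'SMALL';
-- ... run the representative workload and record elapsed time and credits from the query history ...
ALTER WAREHOUSE TEST_WH SET WAREHOUSE_SIZE = 'MEDIUM';
-- ... rerun the same workload and compare, repeating for each size of interest ...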

Warehouse Utilization Patterns

It’s worth discussing different utilization patterns for your warehouses and how to interpret them. This can help when trying to decide how best to optimize your environment.

If you click on the Warehouses tab in the Snowflake web UI, as Figure 9-15 shows, you will see a list of your warehouses.
Figure 9-15

The warehouse section in the web UI

Clicking a warehouse name takes you into a display that shows your warehouse load over time (see Figure 9-16). It shows the relationship between running and queuing queries over time as a stacked bar chart. There’s also a slider at the bottom of this pane so you can change the time period.
Figure 9-16

Warehouse load over time in the web UI

You may see a few variations of queuing in your own environment. They can be broken down into three categories (a query sketch for retrieving these figures follows the list):
  • Queued: Queries that are queued while waiting for resources used by currently executing queries to become available.

  • Queued (Provisioning): Queries that are queued while they wait for the provisioning of new servers within the warehouse. You can expect to see this in the first few seconds after a warehouse begins to resume after being suspended.

  • Queued (Repairing): This kind of queuing is the least common. You see this while any underlying warehouse compute resources are being repaired.
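The running and queuing figures behind these charts can also be pulled programmatically; the following is a sketch using the INFORMATION_SCHEMA table function for the last 24 hours (the warehouse name is illustrative):
SELECT START_TIME,
       AVG_RUNNING,
       AVG_QUEUED_LOAD,         -- corresponds to "Queued"
       AVG_QUEUED_PROVISIONING  -- corresponds to "Queued (Provisioning)"
FROM   TABLE(INFORMATION_SCHEMA.WAREHOUSE_LOAD_HISTORY(
               DATE_RANGE_START => DATEADD('hour', -24, CURRENT_TIMESTAMP()),
               WAREHOUSE_NAME   => 'ETL_WH'))
ORDER BY START_TIME;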

What follows is a range of warehouse query load charts to illustrate good and bad utilization patterns. This should guide you when determining if you need to make any performance changes within your own environment. Figure 9-17 shows what good looks like.
Figure 9-17

A well-utilized warehouse

This usage pattern shows a well-utilized warehouse. While there’s some queuing of queries, these elements are very minimal. A small amount of queuing like this on a well-utilized warehouse is no bad thing. You can see in the chart that the majority of the time, queries are executing with zero queuing while following a fairly uniform pattern. This means the available resources provisioned with the warehouse are well used.

Figure 9-18 shows an on-demand warehouse for a relatively infrequent usage pattern.
Figure 9-18

A well-utilized on-demand warehouse

The gaps in the chart highlight when no queries are executing. In this case, the warehouse is set to auto-suspend, so once the queries have completed, the warehouse moves into a suspended state and, importantly, does not consume any credits.

A couple of use cases where you might see a pattern similar to this are intraday micro-batch loading of data, or sporadic event-based messages that land in an S3 bucket and trigger Snowpipe to load the data into Snowflake.

Even though this graph looks very different from Figure 9-17, the bottom line is that it’s still a good pattern and one that matches the workload well.

Figure 9-19 shows an underutilized warehouse pattern.
Figure 9-19

An underutilized warehouse

In this chart, you can see there are a lot of unused resources, which is denoted by the amount of whitespace across the top of the chart.

From this graph, we can make a couple of observations:
  • There is a relatively small number of queries being executed in comparison to the size of the warehouse.

  • The warehouse is frequently online throughout the day but is being underutilized, resulting in credits being unnecessarily consumed.

You can take a couple of approaches to improve utilization here. If the warehouse is, say, a Medium, you can reduce its size gradually and review the utilization pattern following each change. This may result in fewer resources being wasted and a reduction in operating costs.

Alternatively, if you have more than one warehouse with a similar profile in your environment, you can consider combining one or more workloads onto a single warehouse. This may give you even greater economies of scale, removing the unused headroom from each virtual warehouse while maximizing use of the resources on just one warehouse.

Figure 9-20 shows a high number of queuing queries.
Figure 9-20

A high number of queuing queries

In this instance, I know that this warehouse serves a data science team that frequently executes a large number of queries throughout the business day. They also run data transformation processes for other business processes using the same warehouse across the day.

Users have been complaining of slow query performance. This pattern shows a warehouse that is being saturated with the number of queries it is trying to perform.

Tip

Consider separating out business critical workloads using dedicated warehouses to avoid contention.

In this scenario, you could consider creating a separate, smaller warehouse to split the two different workloads, which should help with some of the queuing issues. Alternatively, if you decide to keep both workloads on the same warehouse, you could configure the warehouse as multi-cluster and adjust the min and max cluster counts appropriately. This gives Snowflake the flexibility to scale the warehouse out during peak times throughout the business day and shut down clusters when the workload drops off in the evening and overnight.
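Both options are sketched below; the warehouse names and values are hypothetical, and the multi-cluster option requires the Enterprise edition or above:
-- Option 1: move the transformation workload to its own, smaller warehouse
CREATE WAREHOUSE IF NOT EXISTS TRANSFORM_WH
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
-- Option 2: keep both workloads together but let the warehouse scale out at peak times
ALTER WAREHOUSE DATA_SCIENCE_WH SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD';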

Leveraging Caching

I discussed caching earlier in this book. You learned that Snowflake caches data in both the virtual warehouse (the data cache) and the cloud services layer (the results cache). Ensuring that you design your solution to take advantage of the available cache is a great way to improve overall query performance on Snowflake while keeping costs at a minimum.

Once you understand how caching works, you’ll find yourself configuring and designing your virtual warehouses and data models a little differently.

Data visualization tools can also take advantage of caching. Consider dashboards containing business KPIs that are produced each day. Their usage pattern is well defined, and many executives and managers will run these reports at the start of each day with the same parameters. In this instance, you’ll want to focus on ensuring these queries hit the cache as often as possible.

Once the initial dashboard query is executed, the results are placed into the results cache and made available for all subsequent retrievals for the next 24 hours. The initial query will, of course, take a little longer, as it needs to compute the results on the warehouse before placing them in the results cache. To mitigate this, you may want to “warm” the cache by running the SQL queries behind these dashboards once your data loads have completed. This ensures the cache holds the latest, most up-to-date data from the get-go.
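One way to automate that warm-up is a scheduled task that runs the dashboard query shortly after the ETL window closes. The names and schedule below are hypothetical, and the query text must match the dashboard’s query exactly for the results cache to be reused:
CREATE OR REPLACE TASK warm_kpi_dashboard_cache
  WAREHOUSE = REPORTING_WH
  SCHEDULE  = 'USING CRON 0 6 * * * UTC'   -- after the overnight load completes
AS
  SELECT report_date, region, SUM(revenue) AS total_revenue
  FROM   kpi_daily_sales
  GROUP BY report_date, region;
ALTER TASK warm_kpi_dashboard_cache RESUME;  -- tasks are created suspended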

The data cache on the virtual warehouse sits on SSD storage. It is gradually aged out based on the date the data was last requested. Although you have less control over this cache, there are some techniques you can employ to maximize the hit ratio of this cache.

Good programming practice always advocates avoiding the use of SELECT * in queries. In this context, such a query retrieves all columns from database storage into the warehouse cache. This is slower and less efficient than a narrower query that targets just the columns it requires. The more data each query pulls into the cache, the quicker the cache fills up, and the fewer queries it can serve, which is an important factor when you consider how many queries might hit your warehouse every day.

Monitoring Resources and Account Usage

Resource Monitors

It’s important to set up monitoring to manage and control the spend on your Snowflake account. Resource monitors in Snowflake allow you to configure actions in response to certain thresholds being hit at either the account or the warehouse level.

Your storage is often very inexpensive, therefore your main focus should be centered on the compute so you can control it more closely.

To create a resource monitor, you need to use the ACCOUNTADMIN role. This is required to initially set it up, but once created you can grant access to lower-level roles.

As with most things in Snowflake, you can opt to do this in either SQL or in the web UI. The examples use SQL to help you understand all the available options.

If you decide to use the UI, you can also populate the configuration values and click Show SQL to easily get the SQL code.

Resource monitors are typically set at the monthly level, in line with Snowflake’s billing cycle, although you can go all the way down to daily if you wish.

The credit quota is the number of credits used as the limit. Once this threshold is met, you can trigger certain actions in response, as you’ll see shortly. At the start of a new month (or whichever frequency you have selected), the consumed amount resets to zero.

When your threshold is met, referred to as a trigger, you have three choices of response:
  • Notify: You can be notified when a certain trigger is met. No other action is taken. Important: You must have enabled notifications within the web UI for your account to receive these notifications!

  • Suspend: The warehouse will not accept any new queries but will allow any currently executing queries to complete. This behavior may result in you exceeding your threshold, depending on the active workload on your warehouse when the trigger is met.

  • Suspend immediate: This is the only option that guarantees you do not go beyond your threshold. It’s the most defensive option and will kill any queries currently executing on your warehouse.

Here is an example of how to set up a resource monitor using SQL at the account level:
USE ROLE ACCOUNTADMIN;
CREATE RESOURCE MONITOR MONTHLY_ACCOUNT_BUDGET
WITH
  CREDIT_QUOTA    = 1000
  FREQUENCY       = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
TRIGGERS
  ON 90 PERCENT DO NOTIFY
  ON 95 PERCENT DO SUSPEND
  ON 99 PERCENT DO SUSPEND_IMMEDIATE;
ALTER ACCOUNT SET RESOURCE_MONITOR = MONTHLY_ACCOUNT_BUDGET;
For finer-grained control, you can set up a resource monitor at the warehouse level. This is useful, in addition to account-level monitoring, when you have a cross-charging policy in place with your data consumers. For example, if you have a cross-charging agreement with your data science team, and the team has a dedicated warehouse, they may want to be notified or have certain mechanisms in place to prevent overspend. Setting up a resource monitor at the warehouse level is the best way to cater for that situation, as the following code example illustrates:
USE ROLE ACCOUNTADMIN;
CREATE RESOURCE MONITOR DATA_SCIENCE_WH_BUDGET
WITH
  CREDIT_QUOTA    = 10
  FREQUENCY       = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
TRIGGERS
  ON 70 PERCENT DO NOTIFY
  ON 80 PERCENT DO NOTIFY
  ON 90 PERCENT DO NOTIFY
  ON 99 PERCENT DO SUSPEND
  ON 100 PERCENT DO SUSPEND_IMMEDIATE;
ALTER WAREHOUSE DATA_SCIENCE_WH SET RESOURCE_MONITOR = DATA_SCIENCE_WH_BUDGET;

Query History

Snowflake maintains a history of all queries executed on the account. In the web UI you can select the History icon (Figure 9-21).
Figure 9-21

The history icon in the web UI

In the History section, you can view all the queries along with other useful information such as the user, warehouse, duration of the query, and bytes scanned (Figure 9-22).
Figure 9-22

The query history

You can also filter on a number of fields to assist you with searching through the history (Figure 9-23).
Figure 9-23

Query history filter

However, this history only covers recent activity (the most recent 14 days). To get a full history for the previous 12 months, you need to use the SNOWFLAKE database. This is a read-only, shared database provided out of the box by Snowflake.

The SNOWFLAKE database contains two schemas, ACCOUNT_USAGE and READER_ACCOUNT_USAGE. The latter stores metadata relating to any reader accounts created as part of any secure data sharing within your environment. However, in this case, you’re going to focus on the ACCOUNT_USAGE schema.

Within this schema is a QUERY_HISTORY view. This view stores query history for the previous 365 days.

The following SQL snippet can help identify potential query performance issues: queries that run for more than 5 minutes, scan more than a megabyte of data, and either spill more than a megabyte to local or remote storage or scan less than 10 percent of their data from the cache:
SELECT  QUERY_ID,
        ROUND(BYTES_SCANNED/1024/1024) AS MB_SCANNED,
        TOTAL_ELAPSED_TIME/1000        AS SECS_ELAPSED,
        (PARTITIONS_SCANNED/NULLIF(PARTITIONS_TOTAL,0)) * 100 AS TABLE_SCAN_PCT,
        PERCENTAGE_SCANNED_FROM_CACHE*100 AS PCT_FROM_CACHE,
        BYTES_SPILLED_TO_LOCAL_STORAGE,
        BYTES_SPILLED_TO_REMOTE_STORAGE
FROM   SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE (BYTES_SPILLED_TO_LOCAL_STORAGE > 1024 * 1024 OR   -- spilled more than 1 MB locally
       BYTES_SPILLED_TO_REMOTE_STORAGE > 1024 * 1024 OR  -- spilled more than 1 MB remotely
       PERCENTAGE_SCANNED_FROM_CACHE < 0.1)              -- little benefit from the data cache
AND    TOTAL_ELAPSED_TIME > 300000                       -- ran for more than 5 minutes
AND    BYTES_SCANNED > 1024 * 1024                       -- scanned more than 1 MB
ORDER BY TOTAL_ELAPSED_TIME DESC;

In particular, you should pay close attention to TABLE_SCAN_PCT and BYTES_SPILLED_TO_LOCAL_STORAGE along with BYTES_SPILLED_TO_REMOTE_STORAGE.

A high TABLE_SCAN_PCT value indicates that the query isn’t able to effectively take advantage of clustering and is scanning more data than necessary to satisfy the query. In this instance, evaluating the effectiveness of the clustering key and the selectivity of the query are the immediate next steps.

If you have high MBs spilled to either local or remote storage, consider moving the query to a bigger warehouse or scaling up the existing warehouse if appropriate.

Useful References

Snowflake offers some excellent query snippets across the metadata it provides. The following links should come in very handy for your own environment:

Resource Optimization: Performance:

https://quickstarts.snowflake.com/guide/resource_optimization_performance_optimization/index.html?index=..%2F..index

Resource Optimization: Usage Monitoring:

https://quickstarts.snowflake.com/guide/resource_optimization_usage_monitoring/index.html?index=..%2F..index#0

Resource Optimization: Setup and Configuration:

https://quickstarts.snowflake.com/guide/resource_optimization_setup/index.html?index=..%2F..index#0

Resource Optimization: Billing Metrics:

https://quickstarts.snowflake.com/guide/resource_optimization_billing_metrics/index.html?index=..%2F..index#0

Summary

In this chapter, you lifted the hood on Snowflake to get into those areas that make it tick. You took a deep dive into how to design your tables and queries for maximum performance.

You looked at how best to design tables for high performance. This included a detailed look into clustering, how to assess the effectiveness of an existing clustering key, and what action you can take if you’re not seeing the desired benefits against your workloads. You also learned best practices around data types to ensure you don’t fall into any common design pitfalls.

Following this, you moved to the next logical level: how to design your queries most effectively. You looked at how you can read and draw conclusions from query profiles to help pinpoint and track down the root cause of any performance issues. As part of this section, you explored other common usage patterns and where the use of materialized views or the search optimization service could help with performance.

The configuration of warehouses has a significant impact on the performance of your overall environment. A very tactical yet costly approach is to continually increase the size of your warehouse to compensate for bad performance. You saw evidence that this approach only takes you so far. You worked through some example utilization patterns so you can identify what good and bad look like, and how you can best address suboptimal performance.

Lastly, you looked into monitoring. This includes how to leverage resource monitors to notify you or curtail any unwarranted spend. In addition, you also looked at how to use the query history to proactively monitor performance of queries across your environment, potentially alerting you to performance risks before they become a business-critical showstopper.

Finally, I included links to useful scripts Snowflake has helpfully provided for you to monitor performance on your own solution.

You can trace a lot of what we discussed in this chapter back to Chapter 1 on Snowflake’s architecture. This bloodline runs throughout everything you do in Snowflake, so I cannot stress enough that it is worth taking the time to really understand the fundamentals of what makes Snowflake tick!

As Snowflake and its market positioning evolve, it continues to expand its footprint well beyond traditional data warehousing boundaries. As part of this extended ecosystem of tools, its elastic scalability and high-performance features make it an ideal candidate to store and process data for front-end applications traditionally served by independent tools and technologies.

The focus of the final chapter is a whistle-stop tour of the ever-evolving range of tools at your disposal, which allow you to build custom applications that can also operate at scale.
