How do you tune the Snowflake data warehouse when there are no indexes and limited options available to tune the database itself?
Snowflake was designed for simplicity, to reduce administrative overhead, and to work straight out of the box. It keeps its promise pretty well in my experience, but at some stage your data volumes and user base may well reach a tipping point where you need to get your hands dirty.
Out of the box, you can increase the virtual warehouse size to improve performance. If you want to address the underlying issues in a more considered way, you have several options, and that’s the focus of this chapter. Generally, the best approach is to pinpoint the root cause of the performance issue, which allows you to take steps to improve the query or reconfigure the warehouse. This pain point might lie in the ingestion, the transformation, or the way end-user queries are written. The most effective solutions usually come from reviewing the symptoms and arriving at a diagnosis, rather than jumping straight to a fix.
By all means, consider scaling up to a larger warehouse to improve query performance, but first identify and focus on the actual problem. In doing so, you may well discover that there are more effective and efficient solutions available.
Designing Tables for High Performance
Spending a little time considering how best to design tables will set you up for success as your solution grows. Therefore, it is really important to consider some best practices here.
Data Clustering
I covered micro-partitioning and clustering briefly earlier in this book. As a reminder, Snowflake stores data in small blocks of data called micro-partitions. These micro-partitions are arranged using a clustering key, which is automatically decided by Snowflake based on the order the data is loaded.
Snowflake collects rich statistics on these micro-partitions, which allows queries to avoid reading unnecessary parts of a table based on the query filters, known as pruning. To provide the maximum benefit, it is important that the underlying physical data is aligned to the query usage patterns.
As discussed, there is typically no need to specify a clustering key for most tables. Snowflake performs automatic tuning via the optimization engine and its use of micro-partitioning. However, for larger data sets (greater than 1 TB) and if the query profile indicates that a significant percentage of the total duration time is spent scanning data, you should consider changing the clustering key on the table.
Clustering Key
Snowflake’s definition of a clustering key is “a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the data in the table in the same micro-partitions.”
Having an effective clustering key allows a query to scan fewer micro-partitions, pull less data into the cache, and return the results to the user or application that submitted the query more quickly. As you can imagine, it’s pretty important to ensure you’ve selected the correct clustering key for large tables.
Snowflake will automatically decide the best clustering key for you as well as “recluster” the data within the micro-partitions that make up your tables to keep it organized for optimum performance. The clustering key is used to organize how data is stored across the micro-partitions. Co-locating similar data, such as records containing the same transaction date, can significantly reduce the work required by the query when the transaction date is used as a filter on a query or in a join between tables.
A clustering key can consist of more than one column. Snowflake recommends a maximum of three to four; beyond this point you’ll start to incur more costs due to frequent reclustering, which negates the performance benefits.
When selecting a clustering key, it is also important to consider the cardinality, that is, how many distinct values exist within the column(s) to be used. Too few distinct values, such as Male and Female, will result in ineffective pruning; too many distinct values, such as a unique ID column, will also be inefficient. Aim to strike a balance between cardinality and the columns commonly used in query filters or joins.
Apart from the table being very large, the other determining factor for when you should consider changing your clustering key is when the usage pattern against your data differs from the order in which the data is loaded.
For example, imagine you have a very large transactional table that contains order details. This data is loaded by Order Date. But your reports all query the data in this table by OrderID. In this instance, you’ll get better performance by setting a multi-column clustering key on the Order Date and OrderID columns.
The order of the columns in a multi-column clustering key also matters. List first the columns that the greatest number of queries rely on, as shown in the sketch below.
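To make this concrete, here is a minimal sketch of defining a multi-column clustering key, assuming a hypothetical orders table and illustrative column names (order_date, order_id):

```sql
-- Hypothetical orders table; names are illustrative only.
-- Change the clustering key on an existing table to match the dominant query filters.
ALTER TABLE orders CLUSTER BY (order_date, order_id);

-- The clustering key can also be declared when the table is created.
CREATE OR REPLACE TABLE orders (
    order_id   NUMBER,
    order_date DATE,
    amount     NUMBER(12, 2)
)
CLUSTER BY (order_date, order_id);
```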
Pruning Efficiency
The wider the gap between the number of partitions scanned and the total number of partitions (both reported in the query profile), the better. After all, avoiding reading unnecessary partitions is the name of the game here.
If these numbers were much closer, this would inform you that pruning isn’t helping your query. In that case, for very large tables you should look into changing the clustering key.
For smaller tables, you could look to reorganize your query to include a filter that uses the existing clustering key. If a filter operator exists above the table scan that removes a significant number of records, this could be an indication that restructuring your query might be worth looking into.
Clustering Depth
The clustering depth for a table provides a value (always 1 or greater for any populated table) that tells you, on average, how many micro-partitions overlap for a given column in a table. The closer the number is to 1, the better the data is clustered, meaning a query has to read fewer micro-partitions for a particular column.
Several factors can impact the clustering depth. As DML operations are performed on a table, data will naturally become less clustered over time.
Snowflake provides a system function called SYSTEM$CLUSTERING_INFORMATION to help assess how well clustered a table is. Although SYSTEM$CLUSTERING_INFORMATION returns the clustering depth as part of its JSON result set, you could alternatively use the SYSTEM$CLUSTERING_DEPTH function, which returns just the depth of the table.
With both functions, you provide a table name along with optionally one or several columns. The function returns how well clustered the table is based on the columns provided.
If you don’t specify any columns for either function, the existing clustering key is used by default.
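Here is a minimal sketch of calling both functions, assuming the same hypothetical orders table and order_date column:

```sql
-- Returns a JSON document including the average clustering depth,
-- partition overlap counts, and a depth histogram for the given column(s).
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');

-- Returns just the average clustering depth for the given column(s).
SELECT SYSTEM$CLUSTERING_DEPTH('orders', '(order_date)');

-- With no column list, both functions fall back to the table's defined clustering key.
SELECT SYSTEM$CLUSTERING_DEPTH('orders');
```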
Reclustering
Designing High-Performance Queries
Optimizing Queries
If you click on one of the query IDs, it will take you through to the details for that query. Here, you can select the Profile tab. This will display the execution tree for that query.
Looking at the Profile Overview tells you some important information about the query. Look at the Spilling section. This tells you that the intermediate results of the query are too large to fit in the memory of the virtual warehouse, so the data starts to “spill” onto local storage. This additional I/O is very expensive and creates performance issues.
The arrows connecting each of the nodes are annotated with the number of records. The thicker the arrow, the more records are being transferred between elements of the query plan.
You can also see that the processing execution time is high at 85%, which suggests that compute is also constraining the execution time of the query.
So what can you do? Well, the easiest thing is to scale up the size of your warehouse. This would add more compute and memory instantly, thus improving the query, but at a higher cost. However, as I mentioned, this isn’t always the best strategy.
You could decide to reduce the volume of records earlier on in the query, in an attempt to reduce the volume of 555.9 million records. This would reduce the work the aggregate function has to do. Perhaps you could filter the records by a date range or a particular product category, for example. You might be able to join it to another table in your database to restrict the records you need to deal with.
For example, let’s assume this particular query is executed by a data consumer multiple times throughout the day, while the underlying data is loaded in batch overnight by your ETL process. Creating an aggregated table once, as part of the ETL before the business day starts, allows for better performance for your data consumers. This works well because the underlying data changes infrequently, only as part of the overnight load.
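A minimal sketch of this pre-aggregation approach is shown below, assuming hypothetical table and column names (order_lines, order_date, product_category, order_amount):

```sql
-- Rebuild the aggregate once per day at the end of the ETL run, so intraday
-- dashboard queries read a small summary table instead of the raw order lines.
CREATE OR REPLACE TABLE daily_order_summary AS
SELECT order_date,
       product_category,
       SUM(order_amount) AS total_amount,
       COUNT(*)          AS order_count
FROM   order_lines
GROUP  BY order_date, product_category;
```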
Alternatively, you could create a materialized view (which I cover later in this chapter) to store the aggregated results.
Cardinality
A common mistake I see in queries is when users join multiple records from one table with multiple records in another table. Often it is because the wrong columns to join the data have been selected, or the join condition is missing altogether!
Here you can see that the two tables provide 772 and 816 records, respectively, into the join, yet the output from the join generates over 235k records. This points to something being fundamentally wrong in the query and warrants further investigation.
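When you suspect this kind of fan-out, a quick check is to look for duplicate values in the join key on the side you expect to be unique. The sketch below assumes a hypothetical customers table joined on customer_id:

```sql
-- If customer_id should be unique in this table, any rows returned here
-- explain why the join is multiplying records.
SELECT customer_id,
       COUNT(*) AS rows_per_key
FROM   customers
GROUP  BY customer_id
HAVING COUNT(*) > 1
ORDER  BY rows_per_key DESC;
```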
Materialized Views
A materialized view is a precomputed data set based on the SELECT statement within the definition of the view. Since the data is precomputed, it is faster at returning data than a standard view, which retrieves the results at execution time.
Using a materialized view can help speed up expensive queries, especially if you’re running this query multiple times a day.
Materialized views are only available in the Enterprise edition of Snowflake and above.
So why use a materialized view over a table? This can be very useful when the view returns a small amount of data relative to the underlying table or tables used in the SELECT statement. Examples of high-cost operations are aggregations, windowed functions, or when working with semistructured data.
Materialized views are maintained automatically by Snowflake. If the underlying table data changes, Snowflake will recognize this and recompute the data. This removes a lot of complexity and points of failure when compared to manually developing your own solution to carry this out.
The Snowflake service that keeps materialized views up to date incurs additional costs. Firstly, it requires additional storage to hold the precomputed results. Secondly, when Snowflake needs to recompute the results, it will use some compute, resulting in credit consumption. Therefore, materialized views are best used when the underlying data changes relatively infrequently.
Furthermore, the query optimizer in Snowflake will recognize where using a materialized view results in a lower cost query plan and, as a result, will redirect the query to use the materialized views.
You can also decide to cluster a materialized view. In fact, you could create a set of materialized views, each with a different clustering key, to cater for different workloads. Where most of your queries access the base table data through these clustered materialized views, clustering the base table itself becomes far less important.
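As an illustration, here is a minimal sketch of a clustered materialized view, assuming a hypothetical sales table with order_date, product_category, and amount columns:

```sql
-- Precompute an expensive aggregate over a single base table; Snowflake keeps
-- the view in sync automatically as the underlying data changes.
CREATE MATERIALIZED VIEW daily_sales_mv
  CLUSTER BY (order_date)
AS
SELECT order_date,
       product_category,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM   sales
GROUP  BY order_date, product_category;
```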
Materialized views do, however, come with a number of limitations:

A materialized view can query only a single table.
Joins, including self-joins, are not supported.
UDFs are not supported (this limitation applies to all types of user-defined functions, including external functions).
Window functions are not supported.
HAVING clauses are not supported.
ORDER BY clauses are not supported.
LIMIT clauses are not supported.
Many aggregate functions are also not allowed in a materialized view definition. For the complete and up-to-date list of materialized view limitations, see the official Snowflake documentation.
Search Optimization Service
Traditionally data warehouses are designed for large analytics queries, such as aggregations and calculations, over very large data sets. They fare less well when users run point lookup queries, or in other words, queries that are highly selective and therefore look up just a small number of records.
This is because the query engine needs to scan a large amount of data to find all the records that match a highly selective filter, such as a specific email address.
When you add search optimization to a table, Snowflake records metadata on the table to understand where all the data resides in the underlying micro-partitions. The query optimizer can use this metadata to find the relevant partitions faster than the normal pruning approach, significantly improving the performance of point lookup queries.
The search optimization service works best on tables that either aren’t clustered or when a table is frequently queried on columns other than the clustering key. In addition to this, the queries should also be running for tens of seconds, with at least one of the columns used in the query filter having between 100-200k distinct values.
A good way of obtaining an approximate distinct count for a column is to use the APPROX_COUNT_DISTINCT function in a SELECT statement. Because it returns an approximation rather than an exact count, it is very fast and cost efficient.
Obviously, the larger the table, the more partitions that make up the table and the bigger the benefit of enabling search optimization on the table.
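The following sketch pulls these steps together, assuming a hypothetical customers table with an email column used for point lookups:

```sql
-- Check the approximate cardinality of the candidate lookup column.
SELECT APPROX_COUNT_DISTINCT(email) FROM customers;

-- Enable the search optimization service on the table
-- (an Enterprise edition feature with its own storage and compute costs).
ALTER TABLE customers ADD SEARCH OPTIMIZATION;

-- SHOW TABLES reports whether search optimization is enabled and how far
-- the initial build has progressed.
SHOW TABLES LIKE 'customers';
```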
Optimizing Warehouse Utilization
You should experiment with different warehouse sizes for each homogeneous workload to avoid incurring additional costs and wasting resources.
It is important to balance tuning the SQL within your queries against the number of concurrent queries your solution can cater for. Ensuring you leverage virtual warehouses effectively means you can reduce the risk of queries queuing while reducing latency.
Imagine your ETL process is a series of tasks executing queries one after another against an X-Small warehouse.
One solution to improve concurrency is to scale up to a bigger virtual warehouse to complete the work faster, but this will only work for so long. At some stage, when using this strategy, the bottleneck will likely move elsewhere, leaving unused resources on your virtual warehouse and costing you, as the customer, money.
A better approach is to use a multi-cluster warehouse and design your ETL tasks to execute in parallel. This means your ETL process can take advantage of concurrent connections, allowing your warehouse to scale out and back in on demand, optimizing cost, performance, and throughput.
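As a rough illustration, the following sketch creates a multi-cluster warehouse for such a workload; the name and settings are hypothetical, and multi-cluster warehouses require Enterprise edition or above:

```sql
CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WITH WAREHOUSE_SIZE    = 'XSMALL'
       MIN_CLUSTER_COUNT = 1
       MAX_CLUSTER_COUNT = 3        -- scale out when concurrent ETL tasks queue up
       SCALING_POLICY    = 'STANDARD'
       AUTO_SUSPEND      = 60       -- suspend after 60 seconds of inactivity
       AUTO_RESUME       = TRUE;
```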
Performance and Spend by Warehouse Size
| | X-Small | Small | Medium | Large | X-Large | 2X-Large | 3X-Large | 4X-Large |
|---|---|---|---|---|---|---|---|---|
| Credits/hour | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
| Credits/second | 0.0003 | 0.0006 | 0.0011 | 0.0022 | 0.0044 | 0.0089 | 0.0178 | 0.0356 |
| Workload execution time (seconds) | 232 | 202 | 127 | 80 | 67 | 48 | 39 | 33 |
| Total credits used | 0.0644444 | 0.1122222 | 0.1411111 | 0.1777778 | 0.2977778 | 0.4266667 | 0.6933333 | 1.1733333 |
Executing your workload across different sizes of warehouses allows you to establish a baseline. Logging your results in this way will lead you to the optimum warehouse sizes.
Adopting a similar approach with your own workloads will help you land upon the most appropriate warehouse configuration for your organization.
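If you want to log these results from Snowflake’s own metadata rather than manually, a simple starting point is the ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY view. The query below is a minimal sketch of summing credits per warehouse per day:

```sql
-- Credits consumed per warehouse per day over the last 30 days.
SELECT warehouse_name,
       DATE_TRUNC('day', start_time) AS usage_date,
       SUM(credits_used)             AS credits_used
FROM   snowflake.account_usage.warehouse_metering_history
WHERE  start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP  BY warehouse_name, usage_date
ORDER  BY usage_date, warehouse_name;
```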
Warehouse Utilization Patterns
It’s worth discussing different utilization patterns for your warehouses and how to interpret them. This can help when trying to decide how best to optimize your environment.
Queued: Queries that are queued while waiting for resources used by currently executing queries to become available.
Queued (Provisioning): Queries that are queued while they wait for the provisioning of new servers within the warehouse. You can expect to see this in the first few seconds after a suspended warehouse resumes.
Queued (Repairing): This kind of queuing is the least common. You see this while any underlying warehouse compute resources are being repaired.
This usage pattern shows a well-utilized warehouse. While there’s some queuing of queries, these elements are very minimal. A small amount of queuing like this on a well-utilized warehouse is no bad thing. You can see in the chart that the majority of the time, queries are executing with zero queuing while following a fairly uniform pattern. This means the available resources provisioned with the warehouse are well used.
The gaps in the chart highlight when no queries are executing. In this case, the warehouse is set to auto-suspend, so once the queries have completed, the warehouse moves into a suspended mode and, importantly, does not consume any credits.
A couple of use cases where you might see a pattern similar to this are intraday micro-batch loading of data or sporadic event-based messages that land in an S3 bucket and trigger Snowpipe to load the data into Snowflake.
Even though this graph looks very different from Figure 9-17, the bottom line is that it’s still a good pattern and matches your workload well.
In this chart, you can see there are a lot of unused resources, which is denoted by the amount of whitespace across the top of the chart.
There is a relatively small number of queries being executed in comparison to the size of the warehouse.
The warehouse is frequently online throughout the day but is being underutilized, resulting in credits being unnecessarily consumed.
You can take a couple of approaches to improve utilization here. Assuming the warehouse is a Medium, you can reduce its size gradually and review the utilization pattern following each change. This may result in fewer wasted resources and a reduction in operating costs.
Alternatively, if you have more than one warehouse with a similar profile in your environment, you can consider consolidating those workloads onto a single warehouse. This may give you even greater economies of scale, removing the unused headroom from each virtual warehouse while making the best use of the resources of just one warehouse.
In this instance, I know that this warehouse serves a data science team that frequently executes a large number of queries throughout the business day. They also run data transformation processes for other business processes using the same warehouse across the day.
Users have been complaining of slow query performance. This pattern shows a warehouse that is being saturated with the number of queries it is trying to perform.
Consider separating out business critical workloads using dedicated warehouses to avoid contention.
In this scenario, you could consider creating a separate, smaller warehouse to separate the two different workloads, which should help with some of the queuing issues you’re seeing. Alternatively, if you decide to keep both workloads on the same warehouse, you could configure the warehouse to be multi-cluster and adjust the min and max values appropriately. This would give Snowflake the flexibility to scale the warehouse out during peak times throughout the business day before shutting down some clusters when the workload drops off into the evening and overnight.
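Both options can be expressed in a few statements. The sketch below uses hypothetical warehouse names (transform_wh, data_science_wh) and settings, and multi-cluster scaling assumes Enterprise edition or above:

```sql
-- Option 1: a separate, smaller warehouse dedicated to the transformation workload.
CREATE WAREHOUSE IF NOT EXISTS transform_wh
  WITH WAREHOUSE_SIZE = 'SMALL'
       AUTO_SUSPEND   = 60
       AUTO_RESUME    = TRUE;

-- Option 2: keep both workloads together, but let the existing warehouse
-- scale out during the business day and back in overnight.
ALTER WAREHOUSE data_science_wh
  SET MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY    = 'STANDARD';
```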
Leveraging Caching
I discussed caching earlier in this book. You learned that Snowflake caches data in both the virtual warehouse (the data cache) and the cloud services layer (the results cache). Ensuring that you design your solution to take advantage of the available cache is a great way to improve overall query performance on Snowflake while keeping costs at a minimum.
Once you understand how caching works, you’ll find yourself configuring and designing your virtual warehouses and data models a little differently.
Data visualization tools can also take advantage of caching. Dashboards containing business KPIs are produced each day. Their usage pattern is well defined, and many executives and managers will run this report at the start of each day with the same parameters. In this instance, you’ll want to focus on ensuring these queries hit the cache as often as possible.
Once the initial dashboard is executed, the query results will be pulled into the results cache and made available for all subsequent retrievals for the next 24 hours. The initial query will, of course, take a little longer as it needs to fetch the results from the warehouse before placing them in the results cache. To mitigate this, you may want to “warm” the cache by running the SQL queries behind these dashboards once your data loads have completed. This ensures the cache has the latest, most up-to-date data available from the get-go.
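One way to warm the cache is to schedule the dashboard queries with a Snowflake task that runs after the nightly load. The sketch below uses hypothetical object names and an illustrative schedule; note that the results cache is only reused when a later query’s text matches exactly:

```sql
-- Re-run the dashboard query shortly after the overnight ETL completes, so the
-- first business user of the day is served from the results cache.
CREATE OR REPLACE TASK warm_dashboard_cache
  WAREHOUSE = reporting_wh
  SCHEDULE  = 'USING CRON 0 6 * * * UTC'
AS
  SELECT region,
         SUM(total_amount) AS revenue
  FROM   daily_order_summary
  GROUP  BY region;

-- Tasks are created in a suspended state and must be resumed to start running.
ALTER TASK warm_dashboard_cache RESUME;
```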
The data cache on the virtual warehouse sits on SSD storage. It is gradually aged out based on the date the data was last requested. Although you have less control over this cache, there are some techniques you can employ to maximize the hit ratio of this cache.
Good programming practice always advocates avoiding the use of SELECT * in queries. In this context, such a query retrieves all the associated columns from the database storage into the warehouse cache. This is obviously slower and less efficient than a narrower query that targets just the columns it requires. The more data each query pulls into the cache, the quicker the cache fills up and the sooner previously cached data is aged out, meaning fewer queries can be served from the cache. This is an important factor when you consider just how many queries might be hitting your warehouse every day.
Monitoring Resources and Account Usage
Resource Monitors
It’s important to set up monitoring to manage and control the spend on your Snowflake account. Resource monitors in Snowflake allow you to configure actions in response to certain thresholds being hit at either the account or the warehouse level.
Storage is usually very inexpensive; compute is where the bulk of your spend occurs, so that is where you should focus your monitoring and control.
To create a resource monitor, you need to use the ACCOUNTADMIN role. This is required to initially set it up, but once created you can grant access to lower-level roles.
As with most things in Snowflake, you can opt to do this in either SQL or in the web UI. The examples use SQL to help you understand all the available options.
If you decide to use the UI, you can also populate the configuration values and click Show SQL to easily get the SQL code.
Resource monitors are typically set at the monthly level, in line with Snowflake’s billing cycle, although you can go all the way down to daily if you wish.
The credit quota is the number of credits to be used as the limit. Once this threshold is met, you can trigger certain actions in response, as you’ll see shortly. At the start of a new month (or whichever frequency you have selected), the count restarts from zero.
Notify: You can be notified when a certain trigger is met. No other action is taken. Important: You must have enabled notifications within the web UI for your account to receive these notifications!
Suspend: This option suspends the warehouse, so it will not accept any new queries. It will, however, allow any currently executing queries to complete. This behavior may well result in you exceeding your threshold, depending on the active workload on your warehouse when the trigger is met.
Immediate: This is the only option that guarantees you do not go beyond your threshold. It’s the most defensive option and will kill any queries currently executing on your warehouse. The sketch following this list shows how these actions fit into a resource monitor definition.
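Here is a minimal sketch of creating a resource monitor and attaching it to a warehouse; the names and thresholds are hypothetical:

```sql
-- Monthly monitor capped at 100 credits, with escalating actions.
CREATE OR REPLACE RESOURCE MONITOR monthly_etl_monitor
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
       TRIGGERS ON 75  PERCENT DO NOTIFY
                ON 100 PERCENT DO SUSPEND
                ON 110 PERCENT DO SUSPEND_IMMEDIATE;

-- Attach the monitor to a specific warehouse
-- (use ALTER ACCOUNT SET RESOURCE_MONITOR for an account-level monitor).
ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_etl_monitor;
```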
Query History
However, this only goes back so far in time. To get a full list of history for the previous 12 months, you need to resort to using the SNOWFLAKE database. This is a read-only, shared database provided out of the box by Snowflake.
The SNOWFLAKE database contains two schemas, ACCOUNT_USAGE and READER_ACCOUNT_USAGE. The latter stores metadata relating to any reader accounts created as part of any secure data sharing within your environment. However, in this case, you’re going to focus on the ACCOUNT_USAGE schema.
Within this schema is a QUERY_HISTORY view. This view stores query history for the previous 365 days.
In particular, you should pay close attention to TABLE_SCAN_PCT and BYTES_SPILLED_TO_LOCAL_STORAGE along with BYTES_SPILLED_TO_REMOTE_STORAGE.
A high TABLE_SCAN_PCT value indicates that the query isn’t able to effectively take advantage of clustering and is scanning more data than necessary to satisfy the query. In this instance, evaluating the effectiveness of the clustering key and the selectivity of the query are the immediate next steps.
If you have high MBs spilled to either local or remote storage, consider moving the query to a bigger warehouse or scaling up the existing warehouse if appropriate.
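As a starting point, the sketch below queries the view directly; TABLE_SCAN_PCT is derived here from PARTITIONS_SCANNED and PARTITIONS_TOTAL, and the thresholds are illustrative only:

```sql
-- Queries from the last 7 days that scanned most of their partitions or spilled.
SELECT query_id,
       warehouse_name,
       total_elapsed_time / 1000 AS elapsed_seconds,
       partitions_scanned,
       partitions_total,
       ROUND(100 * partitions_scanned / NULLIF(partitions_total, 0), 1) AS table_scan_pct,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM   snowflake.account_usage.query_history
WHERE  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND  (partitions_scanned / NULLIF(partitions_total, 0) > 0.8
        OR bytes_spilled_to_local_storage > 0
        OR bytes_spilled_to_remote_storage > 0)
ORDER  BY total_elapsed_time DESC
LIMIT  50;
```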
Useful References
Snowflake offers some excellent query snippets across the metadata it provides. The following links should come in very handy for your own environment:
Resource Optimization: Performance:
Resource Optimization: Usage Monitoring:
Resource Optimization: Setup and Configuration:
https://quickstarts.snowflake.com/guide/resource_optimization_setup/index.html?index=..%2F..index#0
Resource Optimization: Billing Metrics:
Summary
In this chapter, you lifted the hood on Snowflake to get into those areas that make it tick. You took a deep dive into how to design your tables and queries for maximum performance.
You looked at how best to design tables for high performance. This included a detailed look into clustering, how to assess the effectiveness of an existing clustering key, and what action you can take if you’re not seeing the desired benefits against your workloads. You also learned best practices around data types to ensure you don’t fall into any common design pitfalls.
Following this, you moved to the next logical level: how to design your queries most effectively. You looked at how you can read and draw conclusions from query profiles to help pinpoint and track down the root cause of any performance issues. As part of this section, you explored other common usage patterns and where the use of materialized views or the search optimization service could help with performance.
The configuration of warehouses has a significant impact on the performance of your overall environment. A very tactical yet costly approach is to continually increase the size of your warehouse to compensate for poor performance. You saw evidence that this approach only takes you so far. You worked through some example utilization patterns so you can identify what good and bad look like, and how best to address suboptimal performance.
Lastly, you looked into monitoring. This includes how to leverage resource monitors to notify you or curtail any unwarranted spend. In addition, you also looked at how to use the query history to proactively monitor performance of queries across your environment, potentially alerting you to performance risks before they become a business-critical showstopper.
Finally, I included links to useful scripts Snowflake has helpfully provided for you to monitor performance on your own solution.
You can trace a lot of what we discussed in this chapter back to Chapter 1 on Snowflake’s architecture. This bloodline runs throughout everything you do in Snowflake, so I cannot stress enough that it is worth taking the time to really understand the fundamentals of what makes Snowflake tick!
As Snowflake evolves, so does its market positioning, and it continues to expand its footprint well beyond the boundaries of data warehousing. As part of this extended ecosystem of tools, its elastic scalability and high-performance features make it an ideal candidate to store and process data for front-end applications traditionally served by independent tools and technologies.
The focus of the final chapter is a whistle-stop tour of the ever-evolving range of tools at your disposal, which allow you to build custom applications that can also operate at scale.