© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Morton, Mastering Snowflake Solutions, https://doi.org/10.1007/978-1-4842-8029-4_1

1. Snowflake Architecture

Adam Morton, Sydney, NSW, Australia

In 2020, the global pandemic took everyone by surprise. It disrupted markets, closed international borders, and turned each of our everyday lives upside down. Overnight, companies were forced to shift their entire workforce to a remote working model.

The organizations that previously understood the value of technology and used it to underpin their corporate strategy were able to navigate these choppy waters successfully, transitioning almost seamlessly to this “new” way of working. For the rest, it shone a light on the inefficiencies and inherent weaknesses that surface when companies fail to innovate and invest in technology quickly enough.

Technology and Data Are Inseparable

Organizations are constantly looking for ways to differentiate themselves from their competitors in a crowded marketplace. They want to provide superior products and services. They want to offer a better, more personalized customer experience. They want to acquire and retain customers, price more intelligently, predict what you want to buy (even before you know it!), predict risk more effectively, know which asset classes to invest in, and identify fraud more accurately and quickly. Take a moment to think about the things your organization wants to achieve, and I bet you can’t accomplish those things without data!

Unlocking Business Value

These days, I tend to get involved with helping clients understand how they can achieve their strategic business objectives, such as the ones just mentioned, through effective use of technology, data, and analytics. You simply cannot have the former without the latter.

For example, automating manual tasks can have a huge impact on the operational efficiency of an organization. Arming employees with modern technology allows them to streamline mundane, repeatable activities and frees up time to focus on higher value-adding work. This is a never-ending battle. Just today I was copying and pasting between spreadsheets. I’m supportive of automation while also recognizing that it’s impossible to automate everything; sometimes I just need to copy and paste to get something done.

In other situations, I’ve seen instances of teams operating in silos. I’ve seen people sitting next to each other, fulfilling the same requirement, with two copies of the same data, which they downloaded to their own machines! They simply hadn’t spoken to each other about what they were each working on.

The ability to collaborate and share data brings many benefits. It removes the need for the same set of data to be sent over email, downloaded from BI applications, and stored on individual hard drives in files. Having a set of well managed and governed data stored in one place, which is used many times over, ensures consistent answers to business questions and opens the door for the development of new products, for example.

Business Agility Is More Important Than Ever

Gone are the days of the business sending IT a request for a new requirement, only to wait months and months before finally receiving something that didn’t quite hit the mark. Today we have very different expectations and the ways of working are quite different. Business and IT teams now need to form a partnership and work in a collaborative and iterative manner.

This has coincided over the last few years with a move away from a traditional waterfall-style delivery model to an agile-based approach. This brings together multi-disciplined teams of technical and business people to deliver new products and services in rapid development iterations called sprints.

From a data and analytics perspective, this shift also takes advantage of advances in technology to move towards a methodology called DataOps. DataOps is a combination of agile, DevOps, and lean manufacturing processes with the aim of automating as much of the development lifecycle as possible while introducing process efficiencies. This improves quality and reduces the time it takes to deliver value.

All Hail the Cloud!

Historically, storage and compute were tightly coupled. You couldn’t have one without the other. At the start of a project, you had to predict the size of your server and project the future demand on your service for the next two to three years before putting in your order, paying the money upfront, and crossing your fingers that nothing changed in the meantime! This led to over- or under-provisioning of resources, both of which are bad and best avoided.

You want to store massive amounts of data? Sure thing! But you’ll also need to pay for a significant amount of compute resources to go along with it. What happens if you plan on infrequently accessing huge volumes of data while occasionally carrying out compute-intensive operations on it? You’ll need to pay upfront for lots of compute resources that won’t get used much, along with a lot of storage! Thankfully, the world has moved on.

Many companies have spiky workloads with the need to scale up their compute resources instantly to cater for short bursts of high demand. For example, the retail sector will typically anticipate spikes in traffic for Black Friday sales. Reality TV shows regularly have short windows of time for live voting; think about the X-Factor live final. They need to ensure they have huge amounts of resources on standby, ready to cater to the demand.

With the advent of cloud computing, storage is relatively cheap in comparison to compute. You can imagine the cost savings of being able to shut down the compute resources when all your data consumers go home for the evening rather than having it up and running 24x7.

One of the key differentiators of Snowflake is the decoupling of storage and compute. This provides the ability to independently scale either storage or compute depending on your specific needs. This design allows you to run multiple workloads on the same set of data without resource contention.

Decisions, Decisions, Decisions!

When designing data platforms, you need to make key decisions. These decisions can have certain tradeoffs, especially when considering potential future impacts. For example, you don’t want to make a design decision upfront that backs you into a corner if requirements evolve in the future. You need to balance the needs of today while considering those potential needs of tomorrow, and it’s a tricky balancing act. Being confident in your decisions requires a solid understanding of Snowflake’s building blocks. In this chapter, you’ll dig into Snowflake’s architecture in some depth and learn what makes it tick.

Snowflake Architecture

Snowflake’s architecture can be broken down into three specific areas (shown in Figure 1-1):
  • Database storage : Snowflake reorganizes structured and semi-structured data into its internally optimized, compressed, columnar format.

  • Query processing : Snowflake uses compute and memory resources provided by virtual warehouses to execute and process queries and data operations.

  • Cloud services : A collection of supporting services that coordinate activities across the platform, from users logging in to query optimization.

Figure 1-1. Snowflake’s layered architecture

These services operate much like a postal service. The database storage is the mailroom, designed to efficiently organize and store all the different letters and parcels arriving from many different places. Some parcels are very large, some are small, and others are just regular letters, but they must all find their place within the mailroom.

Query processing acts as the courier. It handles the logistics of taking the incoming mail from the sender and working out the quickest and most efficient route to deliver it. Once delivered, it can then obtain a response and return it to the sender.

The virtual warehouse provides the resources to allow these deliveries to take place. Sometimes all that might be required is a bicycle; other times a van or truck is needed. In specific cases, a plane might be required for international items.

The cloud services layer is the HQ, the big boss. It’s the end-to-end tracking system providing oversight across all services. It ensures the post is secured for access when in transit and keeps the lights on to ensure the post reaches its destination within the guaranteed time scale.

Keep this metaphor in mind as you look at each of these services in detail in the following pages. Understanding how these features complement each other really helps when thinking about how to design efficient solutions with Snowflake.

Database Storage

With thousands upon thousands of letters and parcels coming into the mailroom each day, you need a system to store and organize them. Without a process to help find a specific item of post, you’ll struggle to locate what you need and very quickly the whole system will be in complete disarray.

A stable foundation is critical to guaranteeing the efficiency of any service built on top. This is exactly where this first layer of the Snowflake’s architecture comes into play: organizing and tracking where data is stored for efficient retrieval.

Micro Partitions

A fundamental building block in establishing this foundation is something called micro partitions. They are relatively small blocks of storage (between 50MB and 500MB of uncompressed data) that sit in the underlying cloud provider’s data store, whether that be AWS S3, Google Cloud Storage, or Azure Blob Storage.

Essentially Snowflake treats each micro partition as a unit of DML. Using this approach simplifies a lot of internal operations within the system.

As data lands in Snowflake, a few key things happen, which are completely transparent to the user. These operations create the solid foundation, which in turn enables lightning-fast query performance. You can break these operations down roughly as follows:
  1. Divide and map the incoming data into micro partitions using the ordering of the data as it is inserted/loaded.
  2. Compress the data.
  3. Capture and store metadata.
Figure 1-2 illustrates how micro partitioning works in Snowflake for the first six records in the table shown.
Figure 1-2. Snowflake’s micro partitions

Bear in mind that, given the small size of the partitions, for very large tables we could be talking millions of these micro partitions. This granularity brings additional flexibility, allowing for finer-grained query pruning. Imagine having a transactional table with 10 years of sales history but you only want to look at yesterday’s sales. You can imagine the huge benefit of targeting just the data you need rather than scanning the entire table to return such a small proportion of the available data.
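
As a quick illustration of that idea (the table and column names below are hypothetical), a query that filters on a date only needs to touch the micro partitions whose metadata shows they could contain matching rows, rather than scanning all 10 years of history:

-- Hypothetical SALES table with a SALE_DATE column.
-- Snowflake can skip any micro partition whose min/max metadata
-- for SALE_DATE falls outside the filter range.
SELECT store_id,
       SUM(sale_amount) AS total_sales
FROM   sales
WHERE  sale_date = CURRENT_DATE - 1
GROUP BY store_id;

The query profile in the web interface reports partitions scanned versus partitions total, which is a quick way to confirm how much pruning actually took place.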

What Is the Benefit of Micro Partitioning?

The metadata associated with micro partitions allows Snowflake to optimize the most expensive area of database processing: the I/O operations (reading and writing to storage).

This process of narrowing down a query to only read what is absolutely required to satisfy the query is referred to as pruning. Pruning can be broken down into two distinct stages:
  1. When a SQL query is executed with a WHERE clause, the metadata is used to locate the micro partitions that hold the required data. Instead of searching through all micro partitions, Snowflake targets just those that are relevant.
  2. Once the relevant micro partitions have been identified in the first stage, the second stage of pruning is applied. The header of each partition is read to identify the relevant columns, further avoiding reading any more data than is required.

This is much like looking for information in a book. You could search through all pages at random to attempt to find what you’re looking for. Or you could use the index and be much more efficient in finding what you need.

The same approach largely holds true for semistructured data. Snowflake will attempt to convert the paths within the data into columns (known as sub-columns in this scenario) under the covers in an opportunistic fashion to support optimization. By doing this Snowflake can also capture the metadata on these sub-columns in the same way it does for regular, structured data. If a user writes a query that targets a sub-path in JSON data, as an example, this will be optimized in exactly the same way as a regular column.
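
For example, if JSON documents land in a VARIANT column (the table, column, and path names below are purely illustrative), a query against a sub-path benefits from the same metadata and pruning as a query against a regular column:

-- Hypothetical ORDERS_RAW table with a VARIANT column named SRC.
SELECT src:customer.country::string AS country,
       COUNT(*)                     AS order_count
FROM   orders_raw
WHERE  src:order_date::date >= '2021-01-01'
GROUP BY country;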

In summary , Snowflake essentially applies a generic approach to partitioning on every single table within the database without the need for any user intervention.

Partitioning in the Pre-Snowflake World

In other, more traditional database software, it is left to the DBA to optionally define a partitioning strategy over the stored data. Only occasionally have I witnessed partitions being deployed in those environments. When they were used, it was in response to a very specific requirement or to fix an isolated performance issue. In some cases, this was down to a lack of knowledge that the functionality existed, but in the majority of cases it was due to the time and maintenance overhead involved. Snowflake has removed this overhead entirely!

Data Clustering

In Snowflake , data stored in tables is sorted along natural dimensions such as date or geographic regions. This concept is known as data clustering and is defined automatically as data is loaded into Snowflake. And it works just fine most of the time.

However, there may well be circumstances where your table grows to be very large (over 1TB) or large amounts of DML have been applied to it. This causes the natural clustering to degrade over time, which will ultimately impact query performance.

In this instance, Snowflake provides you with the ability to select a different clustering key based on your specific needs. I will be covering this functionality in more detail in Chapter 9. For now, simply being aware that clustering exists and understanding the role it plays is all you need to know.
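
For now, here is a hedged sketch of what that looks like in practice (the table and column names are hypothetical); Chapter 9 covers when and why you would actually do this:

-- Define a clustering key on an existing table.
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Check how well the table is currently clustered on those columns.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');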

Virtual Warehouses

Let’s start with a brief refresher on virtual warehouses. If you’ve been using Snowflake for a while and are comfortable with the idea of virtual warehouses, you can skip this section and jump to query processing.

So, what is a virtual warehouse? Essentially, it’s a bundle of compute resources and memory. You need a virtual warehouse to perform pretty much any DML activity on the data within Snowflake. This includes loading data into tables. When you have written some code you need to execute, or you’re ready to load data, then you need to associate this with a warehouse so you can execute it.

Virtual warehouses come in t-shirt sizes ranging from Extra Small (X-Small) through to 6X-Large. Table 1-1 details the full range of virtual warehouse sizes. The size of the virtual warehouse directly correlates to the number of credits required to run it. After the first minute, which you are always billed for, credits are calculated on a per-second basis while the virtual warehouse is running.
Table 1-1. Virtual Warehouse Sizes

Warehouse Size    Servers/Cluster
X-Small           1
Small             2
Medium            4
Large             8
X-Large           16
2X-Large          32
3X-Large          64
4X-Large          128
5X-Large          256
6X-Large          512

It is recommended that you start with the smallest virtual warehouse size and experiment by running homogeneous workloads against different virtual warehouse sizes until you find the best balance between performance and cost.
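
A minimal sketch of that workflow, using a hypothetical warehouse name, is to create a small warehouse, run a representative workload against it, and resize it only if the performance/cost balance isn’t right:

-- Start small; the warehouse is created suspended so it consumes no credits yet.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  INITIALLY_SUSPENDED = TRUE;

-- If the workload needs more horsepower, scale up (and back down) at will.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';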

Caching

Before moving on to talk in more detail about virtual warehouses, I want to briefly touch on caching. Going back to our postal service analogy, the cache is for the places that frequently receive a large amount of the same type of delivery. Because you know what this looks like, and how to deliver it, you can create a special section for it in the mail room to promote efficiency.

In a similar way, when a query is executed against a virtual warehouse for the first time, the result set is pushed into the cache. When Snowflake receives the same query again, it can rapidly return the result set to the user without needing to re-execute the query.

In Snowflake, there are two types of cache: the result cache and local disk cache.

Result Cache

The result cache stores the result set of every query executed in the past 24 hours. If a user submits a subsequent query that matches the previous query, then an attempt to retrieve the result set from the result cache is made. There are some caveats to this approach to consider:
  • If the underlying data that makes up the result set changes, this will invalidate the result set held in the cache.

  • If the query contains a function that needs to be evaluated at execution time (referred to as non-deterministic), then the cache cannot be used.

  • The user executing the query must have the correct privileges for all the tables used in the query.

The result cache is not specific to an individual virtual warehouse. It serves the whole environment. Any query by any user on the account that meets the criteria above can take advantage of the result cache, regardless of the virtual warehouse they are using.
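
One way to see (and control) this behavior is the USE_CACHED_RESULT session parameter. The sketch below assumes a hypothetical SALES table; run the same query twice and compare the timings, then disable the result cache if you want to force a full re-execution, for example when benchmarking:

SELECT COUNT(*) FROM sales;   -- first run: executed by the warehouse
SELECT COUNT(*) FROM sales;   -- identical query: served from the result cache

-- Disable the result cache for the current session.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;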

Local Disk Cache

The local disk cache holds the table data used to satisfy SQL queries. The data is fetched from remote storage and cached on the local solid-state disk (SSD) of the virtual warehouse. Figure 1-3 shows where the different types of cache sit within the architecture.
Figure 1-3. The different types of cache within Snowflake

We’ll revisit caching and how it impacts performance in Chapter 9.

Configuring Virtual Warehouses

Depending on your workloads there are several factors that will influence your decision when configuring a virtual warehouse. Figure 1-4 shows the available options to be considered when configuring a virtual warehouse in Snowflake.
Figure 1-4. Snowflake’s Configure Warehouse page

Number of Clusters

As part of the warehouse configuration settings, you have the ability to set the minimum and maximum number of clusters. The minimum and maximum values can be anything between 1 and 10.

If you have workloads that can spike rapidly, requiring more resources, or you have many concurrent users accessing the same data, then setting up the warehouse to run in multi-cluster mode will help. Multi-cluster warehouses are those with a maximum cluster setting greater than 1.

Note

Moving from a smaller to a larger virtual warehouse is referred to as scaling up, while adding additional clusters is known as scaling out.

When your warehouse initially starts, it will provision the minimum number of clusters you have configured. Figure 1-5 illustrates what a typical demand profile might look like over the course of a day. As more and more concurrent connections are made, the need for resources also increases. Snowflake recognizes this and automatically provisions additional clusters to cater for the demand. Looking at Figure 1-5, you can see that the peak demand for resources occurs at 11:00 and 15:30. During this time, more clusters are provisioned until the number of clusters meets the maximum configured value. In the same way, as demand drops off, clusters are removed. This automatic scaling out of clusters in response to demand is extremely effective and easy to configure.
Figure 1-5. Example demand profile

Scaling Policy

You can further influence the scaling behavior by setting a scaling policy. The default is Standard, where performance will be prioritized over saving credits. You can also select Economy, where Snowflake will wait longer before choosing to provision additional clusters and will be more aggressive when looking to decommission clusters as demand for resources starts to fall.

If you have predictable, high-concurrency workloads, you can opt to run a multi-cluster warehouse in maximized mode. This is when the minimum and maximum cluster values are set to the same value. When a warehouse starts in this mode, it provisions all of its clusters immediately.
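
As a hedged sketch (the warehouse name and limits are illustrative, and multi-cluster warehouses require Enterprise Edition or higher), an auto-scaling warehouse with the Economy policy looks like this; setting MIN_CLUSTER_COUNT equal to MAX_CLUSTER_COUNT would instead run it in maximized mode:

-- Auto-scale between 1 and 4 clusters, favoring credit savings over
-- immediately spinning up extra clusters.
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'ECONOMY';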

Note

Keep in mind that credits are consumed based on warehouse size, number of clusters per warehouse (for multi-cluster warehouses), and the length of time each cluster runs.

What about predictable workloads without high concurrency? In this case, an appropriate approach is to create multiple virtual warehouses for different purposes. For example, create a virtual warehouse for your data ingestion pipelines, one for your Finance department, another for Marketing, and one for your data visualization tool.

Separating workloads out this way not only removes any contention but provides greater flexibility and control around credit usage. You can stop, start, and resize any of these virtual warehouses independently to cater for specific requirements your users or applications may have. Furthermore, this provides visibility into which virtual warehouses are being used the most.

I have worked for organizations that set Snowflake up as a central shared service for the entire business. They allowed departments to use the service under a cross-charging model based upon how many credits each department consumed. In this case, having a dedicated virtual warehouse for each department worked very well. Each department had its own resources, and the company had visibility into the number of credits consumed. It was transparent and easy to manage while guaranteeing the internal customers a consistent level of service.

Auto Suspend

You can configure any of your virtual warehouses to suspend automatically when they are idle. This is useful when a warehouse may not be queried for large parts of the day. Using a small time increment such as the default of 10 minutes makes sense here because credits are billed on a per-second basis. Setting a very low value of 2-3 minutes doesn’t really make sense, as you are automatically billed for the first minute anyway and your warehouses might be continually suspending and restarting.
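
When set via SQL, AUTO_SUSPEND is expressed in seconds. A minimal sketch, reusing the hypothetical warehouse from earlier, suspends it after 10 minutes of inactivity and resumes it automatically when the next query arrives:

ALTER WAREHOUSE analytics_wh SET
  AUTO_SUSPEND = 600   -- 10 minutes of inactivity
  AUTO_RESUME  = TRUE; -- wake up automatically on the next query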

Note

Suspending a virtual warehouse will also clear out the local disk cache. It’s important to consider how many queries take advantage of the cache when deciding how often you’ll auto suspend your virtual warehouses.

Query Processing

The job of the query processing layer is to take submitted queries and, working hand in hand with the cloud services layer (which I’ll cover in the next section), plan and execute them.

The overall process looks like this:
  1. A query is executed on Snowflake and can originate from ODBC, JDBC, or the web interface.
  2. The optimizer sees the query and first attempts to satisfy it from the result cache.
  3. The optimizer then employs data pruning to narrow down the location of the data to be processed and generates a query plan.
  4. The resources provided by the virtual warehouse are used to scan only the data needed from the local SSD cache or database storage before processing and retrieving the results.
  5. Finally, the result is processed, returned to the user, and popped into the result cache for future use.

I cover caching in detail in Chapter 9.

Cloud Services

Cloud services is the Head Office of our postal service, the brains of the operation. This collection of services ties together everything we’ve discussed so far.

In this section, I’ll briefly cover the services managed in this layer:
  • Authentication

  • Infrastructure management

  • Metadata management

  • Query parsing and optimization

  • Access control

Authentication

The authentication service allows users and applications to log on to the Snowflake platform. Authentication can take the form of single sign-on (where users authenticate through an external, SAML 2.0-compliant identity provider), key-pair authentication, or simply a username and password.

I discuss the details of these features, including multi-factor authentication, in Chapter 4.
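
As a brief, hedged illustration of the key-pair option (the user name is hypothetical and the key value is a truncated placeholder), an administrator associates a public key with a user, and the client then authenticates with the matching private key:

-- Attach an RSA public key to a user.
ALTER USER jsmith SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';

-- Review the properties currently set for the user.
DESCRIBE USER jsmith;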

Infrastructure Management

The infrastructure management service looks after the underlying cloud infrastructure, including storage buckets and the provisioning and decommissioning of clusters to support virtual warehouses.

Metadata Management

Snowflake’s metadata management service collects metadata as various operations are carried out on the Snowflake platform, answering questions such as:
  • Who is using what data?

  • Which users are logged in?

  • When was the table last loaded? Is it being used?

  • Which columns are important in a table, and which columns are least used?

  • Where is table data stored along with clustering information?

The metadata is exposed in Snowflake via a layer of system-defined views and some table functions. These objects reside in the Information_Schema and are designed for easy end user access to obtain information about the Snowflake environment.
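
For example, you can answer questions like “which tables exist and how big are they?” or “what ran recently?” directly in SQL (the view and the QUERY_HISTORY table function below are part of every database’s INFORMATION_SCHEMA):

-- Tables in the current database, with row counts and storage size.
SELECT table_schema, table_name, row_count, bytes
FROM   information_schema.tables
WHERE  table_type = 'BASE TABLE'
ORDER BY bytes DESC;

-- The most recent queries visible to you, via a table function.
SELECT query_text, total_elapsed_time, warehouse_name
FROM   TABLE(information_schema.query_history(result_limit => 20))
ORDER BY start_time DESC;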

Additionally, statistics form part of this metadata collection and are gathered automatically when data is loaded into Snowflake. This is another big benefit over legacy databases, where the DBA was responsible for keeping statistics up to date. I’ve seen a few approaches to this, some more successful than others. Some DBAs simply refreshed all statistics on a weekly basis for the entire database. This would usually take place out of hours, as it would take a long time!

A slightly more advanced option is to query the metadata to work out how much of a table’s data has changed since the statistics were last refreshed; if a certain proportion of the table has changed, the statistics for that table are refreshed. The query optimizer relies on these statistics when generating a query plan. The downside of out-of-date (stale) statistics is that the optimizer generates a sub-optimal plan: it simply won’t know which partitions and columns the data is stored within to allow for effective pruning. The impact? Poor performance. Snowflake negates all of this by ensuring statistics are always kept up to date automatically.

Query Parsing and Execution

This service works out the most efficient way to process queries. It parses the query to ensure there are no syntactical errors and manages the query execution. This service relies upon the resources provided by the query processing layer to execute the query before finally passing the result set to the user or application. I discuss this process in more detail in Chapter 9.

Access Control

This service ensures that users can only access or carry out operations on the objects and data they are permitted to access, based on their privileges.

Snowflake adopts a role-based access control model. This means users are granted roles and those roles are granted permissions to carry out actions against objects. I discuss access control in more detail in Chapter 4.
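
A minimal sketch of that model, using hypothetical role, database, and user names: create a role, grant it privileges on objects, and then grant the role to users.

-- Create a role and give it read access to the tables in a schema.
CREATE ROLE IF NOT EXISTS analyst;
GRANT USAGE  ON DATABASE sales_db        TO ROLE analyst;
GRANT USAGE  ON SCHEMA   sales_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;

-- Grant the role to a user; the user inherits its privileges.
GRANT ROLE analyst TO USER jsmith;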

Summary

In this chapter, you learned about Snowflake’s three-layer architecture. You looked at how the database storage layer effectively stores both relational and semistructured data in small blocks of storage called micro partitions. You learned how Snowflake automatically clusters the data while collecting metadata to support fast and efficient retrieval. You gained an understanding of how to work with virtual warehouses within the query processing layer, along with some key considerations when configuring the settings to match your workloads and usage patterns. And, finally, you explored the various components that make up the cloud services layer, which ties all the services together and provides a seamless service.

The remainder of this book builds upon the understanding of Snowflake’s architecture you gained in this chapter. In the next chapter, you’ll look at how to load data into Snowflake. You’ll also evaluate some of the different methods available, such as when to bring data into Snowflake versus leaving it in cloud storage, and how to handle bulk loads, near-real-time streaming data, and both structured and unstructured data. Along the way, I’ll help you assess these approaches by calling out the advantages of each, along with some practical examples.
