© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Morton, Mastering Snowflake Solutions, https://doi.org/10.1007/978-1-4842-8029-4_7

7. Data Sharing and the Data Cloud

Adam Morton, Sydney, NSW, Australia

In organizations of all shapes and sizes, data is locked in silos. It takes time to locate data, integrate it, and share it so insights can be obtained. Therefore, decisions are made on intuition rather than fact.

Companies face a range of challenges when they need to share data internally between departments and externally across suppliers, customers, and vendors. The time it takes to extract this data from the many disparate data sources that exist, combined with transferring it using something like FTP, means the data quickly becomes outdated and stale.

These traditional methods of sharing data result in several challenges, such as additional storage costs for all parties, ETL development costs, latency, and new points of failure along with additional security concerns.

The need to share data quickly and efficiently is becoming increasingly important in today’s global economy. Some organizations view this as an opportunity to monetize the data they hold by sharing it more widely, while others want to consume data from a wider range of sources to enrich their own data sets.

The introduction of APIs attempted to alleviate some of the challenges of data sharing. However, this approach still requires time and effort to build, develop, and maintain the APIs. It also requires creating copies of the data that cannot be recalled if a data sharing agreement is terminated. Additionally, there’s often a limit to the volume of data and the types of questions data consumers can ask of it, and consumers must learn how to query and integrate with the API.

Snowflake provides a better way to securely share data, one that doesn’t require shuffling physical copies of data around the network. This is the primary focus of this chapter.

The Data Cloud

Before you can understand the Data Cloud, you need to understand the current world of moving and exchanging data across businesses. A great example to draw upon to help explain this is Blockbuster vs. Netflix.

Anyone old enough to remember a VCR may also recall Blockbuster, once the dominant movie rental franchise. If you enjoyed movies, you probably visited a Blockbuster store more than a few times. Let's look at its business model and why it is no longer with us.

Every time a new movie came out, Blockbuster had to take the master copy of the VHS tape and make millions of copies of it, before distributing a few of them to each store. As a customer, you had to go to the store, pay a rental fee to take that VHS tape home, and stick it into your VCR, hoping the tape was in good working condition and the previous customer had rewound it before returning it.

This process was repeated every time a new movie came out. Just imagine how many rentals were lost because there weren't enough tapes on the shelves in each store, simply because Blockbuster lacked the capacity to make enough copies or distribute them fast enough. Or how many customers were frustrated because the tape was damaged or a VCR was defective?

This business model is similar to what you see businesses do when they share and exchange their data: when a new movie comes out, they make copies, and those copies are handed to a customer. Then they hope the tapes work and the customer has a proper VCR so they can watch it. You can see that this model is not very scalable, secure, or customer centric. Figure 7-1 illustrates Blockbuster's distribution model from store to consumer.
Figure 7-1

Blockbuster’s distribution of physical tapes to consumers

You can only make and distribute so many copies using the resources you have, and once you hand out one of those tapes, you have no idea what happens to it. Also, customers are not always able to find an available tape or to play it on their own VCR.

In today's world, sharing data works much the same way. A new version of the data becomes available, the data provider starts making copies of it, and sends the files to businesses that need to use that dataset. Providers also can't handle high demand, so they usually deal with only the biggest consumers. Just like movies, data replication is limited by the provider's capacity to duplicate and distribute the data.

Enter Netflix's business model, which put an end to Blockbuster's dominance. Netflix improved the customer experience and lowered costs by simplifying and eliminating distribution and consumption-related problems: no lost rentals due to a lack of tapes or DVDs, no delays in getting movies into the hands of customers, plus a much cheaper and more scalable distribution model.

Netflix has one copy of the movie stored in the cloud and everyone simply streams that copy live, instantly. If a movie gets changed or a show adds a new episode, one copy is put on the server and everyone can see it whenever they wish (Figure 7-2). This method is fully scalable: it doesn't matter whether there are 100 users or millions, because the content and processes don't have to be replicated for each user.
Figure 7-2

Netflix’s distribution model

We know this process as streaming movies. If I want to watch something, I simply click on that content on any device and start watching it right that second. If a new episode is available or if the producer changes scenes in an existing movie, I don't need to do anything on my end. It just shows up on my screen. I’ve got access to millions of movies and other video content that could never fit in a Blockbuster store, and everything is always there and accessible right now and live.

Snowflake's data sharing brings the same streaming content delivery concept to delivering and consuming data for businesses. It makes providing, delivering, and consuming data between a data provider and consumer as easy as watching a movie on Netflix or some other streaming service.

The data provider has one copy of the data that they can stream to any number of consumers; it doesn't matter whether that's one consumer or thousands. Consumers have one-click, instant access to any of the data without waiting. The way they access that data is much like streaming movies: it is always live, and the consumer never has to worry about updating or refreshing it.

Netflix is a read-only medium where consumers watch movies. If you are a business that simply wants to access streaming data to look at it or analyze in isolation, Netflix is a good analogy.

However, business is not a read-only consumption model. Business data access is a content creation model. Businesses want access to data that is not theirs so they can blend and augment it with their own internal datasets and generate new, valuable datasets of their own. Those new datasets are sometimes used strictly by that business, but more often they need to be shared (streamed) with other parties because there is value in it for both the provider and the consumer.

Imagine you are a YouTube content creator. You have your own recorded content, but you also need to add scenes and segments from other media content on the YouTube platform.

You can cut and paste clips to add to your recording and make an awesome video that could get millions of views and make a lot of money. You are essentially enhancing your content with other third-party content that is readily available on the same platform with a single mouse click.

Now you are not only able to access and watch millions of pieces of third-party content instantly, as you do on Netflix, but you can also create brand new content of your own by blending your own recordings with others to end up with much more valuable video assets.

Once the new video is created, you can also distribute it to millions of viewers with a single mouse click, without any major cost to you, and you can monetize it based on how many users watch your content and how long and how often they watch it.

Unlike Blockbuster, where you lose all visibility once the VHS tape leaves your hands, YouTube gives you all the usage stats you could think of: who is watching, for how long, which parts they watch, and which parts they skip. This is huge because it allows you to continuously improve your content based on real-life telemetry data, putting you in a position to avoid creating content that doesn't work while doubling down on the videos that get the most attention.

Snowflake's Data Cloud follows the same principles. It allows you to stream third-party data content live, blend that content with your own internal data, and possibly apply machine learning models to enrich it even further, creating brand new data assets that are much more valuable than any of the individual datasets alone.

Now that you have this great data content, you can freely use it internally to increase revenue, improve margins, or decrease costs. You can also share the data in a streaming fashion with other organizations, where it simply shows up in their warehouse at the click of a button, either for free or as a paid dataset that lets you monetize it.

On top of all that, you get instant feedback on consumers' usage of your data, showing who is using it and how, which you can feed back into product improvement.

All of that with just a few mouse clicks to obtain, use, and distribute data but none of the headaches that come with maintenance, administration, refreshing, and distributing.

Data Sharing

Data sharing allows you to share the data you have within your Snowflake account securely and efficiently either within your organization or externally, joining thousands of other organizations making data available across public clouds as data consumers, data providers, and data service providers.

With data sharing in Snowflake, no actual data is moved. This means you do not need to generate additional physical copies of the data, which saves time, reduces storage costs, and reduces risk: the more copies of data you make, the greater the likelihood of them diverging or becoming stale and out of date. The only cost to consumers is the compute resources needed to query the shared data.

All sharing of data is managed by Snowflake's services and metadata layer, and is made possible by Snowflake's unique architecture, in which storage is separated from compute.

The Data Marketplace

The Data Marketplace is available directly from the web UI in Snowflake and enables users to discover both free and paid data sources from a wide range of data providers.

You are able to obtain immediate, one-click access to data from within the Data Marketplace, which looks and feels like any other database within your Snowflake account. In addition, this data is live, meaning the data provider refreshes and updates their data feed so you automatically benefit from getting access to the latest data without having to do anything yourself.

The ability to consume data in this way quickly and easily allows you to augment your own internal datasets to generate new business insights.
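For example, once a Marketplace listing has been added to your account, you query it like any other read-only database. Here is a minimal sketch, assuming a hypothetical weather listing; the database, schema, and column names are purely illustrative.

//QUERY A (HYPOTHETICAL) MARKETPLACE DATABASE LIKE ANY OTHER DATABASE
SELECT CITY, FORECAST_DATE, AVG_TEMP_C
FROM WEATHER_SHARE.PUBLIC.DAILY_FORECAST
WHERE CITY = 'Sydney'
ORDER BY FORECAST_DATE DESC
LIMIT 7;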

Monetization

Snowflake has opened up an exciting possibility with the ease with which you can become a data provider and monetize your existing data assets. For example, your organization may store data that other organizations may deem highly valuable. Examples include up-to-date investment and trading data, operational sensor and device logs from machinery, or marketing data that identifies the behaviors and trends of certain demographics or customer segments, to name just a few of the typical use cases that can take advantage of the Data Marketplace.

The best place to start on your monetization journey is to assess the data sets you currently have and their potential value on the marketplace. When carrying out this assessment, it's prudent to look at how you can add even more value to this data, for example by enhancing it with machine learning techniques to generate insights and analytics. This could make it more attractive to potential customers.

Selecting a pricing strategy should also be carefully considered. Think about your data acquisition costs and the time and effort it takes to refresh the data on schedule and ensure it is free from data quality issues. How much does it cost to enhance and package the data to get it ready for market, including any augmentation or enrichment? How hard would it be for your customers to obtain this data elsewhere, and what value does it bring to their organization? Doing your research up front is important, as it will ultimately lead to a commercial decision about which datasets are truly worthwhile to bring to market.

Data Exchange

In addition to the public Data Marketplace, Snowflake allows you to create and host your own Data Exchange. This allows you to publish and share data sets securely between a selected group of members you choose to invite. Setting up your own Data Exchange removes the need for FTP, file shares, and cumbersome ETL processes to create files for your data consumers, and goes some way toward breaking down data silos in your organization.

Providers and Consumers

A data provider is the Snowflake account that makes the data available to other Snowflake accounts to consume. You’re able to share a database with one or more Snowflake accounts. When a database is shared, you can still use the granular access control functionality to manage access to specific objects within a shared database.

A data consumer is any account that chooses to create a database from a share made available by a data provider. Once a consumer creates a database, it acts and behaves like any Snowflake database.

What Is a Share?

Let’s say you have a table in a database in your Snowflake account that contains important information and you want to share it with others. To do so, you need to create a share.

You can create a share and choose to add the table in your database to the share before providing access to another account with whom you'd like to share the data. In practice, all you're really doing is sharing a pointer to your data with another account.

When you create a share, Snowflake creates an object in your account that contains all the information required to share a database. This information consists of the privileges that grant access to the database, the schema, and the specific objects within that schema that should be shared.

Once a data consumer creates a database from the share, it’ll create a read-only version of the database containing all the objects you, as the provider, have granted access to.
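As a minimal sketch of both sides of this process, the statements below create and populate a share on the provider side and then consume it on the consumer side. The database, schema, table, account, and role names are all hypothetical.

//PROVIDER ACCOUNT: CREATE A SHARE AND GRANT ACCESS TO THE OBJECTS TO BE SHARED
CREATE SHARE SALES_SHARE;
GRANT USAGE ON DATABASE SALES TO SHARE SALES_SHARE;
GRANT USAGE ON SCHEMA SALES.STORE TO SHARE SALES_SHARE;
GRANT SELECT ON TABLE SALES.STORE.DAILY_SALES TO SHARE SALES_SHARE;
//MAKE THE SHARE AVAILABLE TO A CONSUMER ACCOUNT
ALTER SHARE SALES_SHARE ADD ACCOUNTS = CONSUMER_ACCOUNT_1;
//CONSUMER ACCOUNT: CREATE A READ-ONLY DATABASE FROM THE SHARE AND EXPOSE IT TO A ROLE
CREATE DATABASE SALES_FROM_PROVIDER FROM SHARE PROVIDER_ACCOUNT.SALES_SHARE;
GRANT IMPORTED PRIVILEGES ON DATABASE SALES_FROM_PROVIDER TO ROLE ANALYST;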

Note

Any new objects added to a share immediately become available to all consumers.

Shares are controlled by the data provider, meaning that access to a share, or any objects within a share, can be revoked at any time. To terminate a data share, you can run the following command:
DROP SHARE <share_name>;

Dropping a share instantly prevents the share from being accessed by any consumer accounts. Any queries to the share will subsequently fail. After you have dropped a share, you can recreate it with the same name but this won’t restore any of the databases from the previously dropped share. Snowflake will treat the recreated share as a completely new share.

In Figure 7-3, the provider hosts two databases, Sales and Orders. Each database contains two separate schemas. The provider decides to share the Store schema from the Sales database with a consumer, while the Transactions schema from the Orders database is shared with another consumer. In this scenario, the consumers have read-only access to only these schemas; they cannot view the Salesperson or Shipping schemas. You can choose to share entire databases or individual database objects such as tables or views, for example.
Figure 7-3

Data sharing example from a provider to consumers

Reader Accounts

Data sharing is only available to customers with Snowflake accounts. However, what if the consumer is not a Snowflake customer? For this scenario, you have the ability to create a reader account which, as the name suggests, allows read-only access to a share. The share resides within the provider’s account, which means a reader account can only consume data from the provider account that created it.

This means that the provider is responsible for all costs generated by the reader account. Thankfully, by assigning the reader account a dedicated virtual warehouse, you can track its usage and bill the reader account if required.
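Here is a minimal sketch of creating a reader account from the provider account; the account name, administrator credentials, and share name are illustrative.

//CREATE A READER ACCOUNT FOR A CONSUMER WITHOUT THEIR OWN SNOWFLAKE ACCOUNT
CREATE MANAGED ACCOUNT FRANCHISEE_READER
  ADMIN_NAME = 'franchisee_admin',
  ADMIN_PASSWORD = 'Str0ngPassw0rd!2022',
  TYPE = READER;
//RETRIEVE THE LOCATOR AND URL OF THE NEW READER ACCOUNT
SHOW MANAGED ACCOUNTS;
//ADD THE READER ACCOUNT TO AN EXISTING SHARE USING THE LOCATOR RETURNED ABOVE
ALTER SHARE SALES_SHARE ADD ACCOUNTS = <reader_account_locator>;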

Using a Dedicated Database for Data Sharing

One of the methods I personally like to adopt is to create a dedicated database for sharing. This database contains only secure views (by the way, secure views are currently the only type of view that supports data sharing) that reference the database or databases where the data actually resides.

Figure 7-4 shows what this might look like. A view is created in a dedicated database. This view joins data together from both the Orders and Sales databases to provide transactions by store. This view is then added to a share, which the consumer is granted access to.

I find that this approach is cleaner, more transparent, and allows for easier management of the data sharing process. Using this approach means I am only sharing secure views and all these views sit within one database. It also means that databases I have within my organization remain unaffected by any new or changing data sharing needs.

Remember that new objects created in an existing share are made visible to the data consumer immediately. This approach makes it harder for a user to mistakenly share an object, as the only reason they would need to create anything in this database is for sharing purposes in the first place!

It is worth pointing out that any database referenced by one of the secure views used for data sharing must be granted the REFERENCE_USAGE privilege to the share, as shown in the sketch below.
Figure 7-4

Using a dedicated database and secure views for data sharing
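Here is a minimal sketch of this pattern, using the Sales and Orders databases from the earlier figures; the schema, table, and column names in the view definition are hypothetical.

//CREATE A DEDICATED DATABASE CONTAINING ONLY THE SECURE VIEWS USED FOR SHARING
CREATE DATABASE SHARING_DB;
CREATE SCHEMA SHARING_DB.SHARED_VIEWS;
CREATE SECURE VIEW SHARING_DB.SHARED_VIEWS.STORE_TRANSACTIONS AS
SELECT S.STORE_NO, T.TRANSACTION_ID, T.AMOUNT
FROM SALES.STORE.STORES S
JOIN ORDERS.TRANSACTIONS.TRANSACTIONS T ON T.STORE_NO = S.STORE_NO;
//ADD THE SECURE VIEW TO A SHARE
CREATE SHARE STORE_TRANSACTIONS_SHARE;
GRANT USAGE ON DATABASE SHARING_DB TO SHARE STORE_TRANSACTIONS_SHARE;
GRANT USAGE ON SCHEMA SHARING_DB.SHARED_VIEWS TO SHARE STORE_TRANSACTIONS_SHARE;
GRANT SELECT ON VIEW SHARING_DB.SHARED_VIEWS.STORE_TRANSACTIONS TO SHARE STORE_TRANSACTIONS_SHARE;
//EACH DATABASE REFERENCED BY THE SECURE VIEW MUST GRANT REFERENCE_USAGE TO THE SHARE
GRANT REFERENCE_USAGE ON DATABASE SALES TO SHARE STORE_TRANSACTIONS_SHARE;
GRANT REFERENCE_USAGE ON DATABASE ORDERS TO SHARE STORE_TRANSACTIONS_SHARE;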

Data Clean Rooms

If you're reading this book and you come from a data warehousing background, it's quite possible that you've never come across the term data clean room before. In this brief section I'll cover the basic concepts of a data clean room and the problem it aims to solve.

Until recently, an individual's activities on the Web could be tracked using cookies. This allowed brands and retailers to use that information to push relevant advertising into their web sessions. As a result, customer privacy suffered, which led to the introduction of some of the data confidentiality legislation covered in Chapter 4, such as GDPR. It is now more difficult for organizations that relied on tracking user behavior to gain access to this data in the same way, which has fueled the rise of the data clean room.

A data clean room is a concept that enables advertisers and brands to match user-level data without sharing any PII or raw data with each other. It's a secure, closed environment that follows a privacy-first principle. It allows brands to match their own data with their partners' data to improve analysis and targeting at a granular user level without exposing that information to each other.

Major advertising platforms like Facebook, Amazon, and Google use data clean rooms to provide advertisers with matched data on the performance of their ads on their platforms, for example.

You can use Snowflake’s data sharing capabilities in combination with the various security and access controls to provide the same functionality. User-level data goes into the clean room and aggregated insights come out, giving the brands and partners the information they need while protecting customer data.
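As a highly simplified sketch of one possible arrangement, assume the partner shares hashed, user-level impression data into your account (PARTNER_DB below) and your own customer data lives in BRAND_DB; all object names are hypothetical. Only aggregated overlap counts are exposed to analysts.

//EXPOSE ONLY AGGREGATED MATCH COUNTS TO ANALYSTS, NEVER ROW-LEVEL DATA
CREATE SECURE VIEW CLEANROOM.PUBLIC.AUDIENCE_OVERLAP AS
SELECT P.CAMPAIGN_ID,
       COUNT(DISTINCT B.EMAIL_HASH) AS MATCHED_CUSTOMERS
FROM BRAND_DB.CRM.CUSTOMERS B
JOIN PARTNER_DB.ADS.IMPRESSIONS P ON B.EMAIL_HASH = P.EMAIL_HASH
GROUP BY P.CAMPAIGN_ID
//A MINIMUM GROUP SIZE MAKES IT HARDER TO RE-IDENTIFY INDIVIDUALS FROM THE OUTPUT
HAVING COUNT(DISTINCT B.EMAIL_HASH) >= 25;
GRANT SELECT ON VIEW CLEANROOM.PUBLIC.AUDIENCE_OVERLAP TO ROLE MARKETING_ANALYST;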

Bringing It All Together

Now you've built up significant knowledge of the capabilities Snowflake offers. Let's use several of them together to solve a real-world problem.

The Example Scenario

Let's say you need to share sales performance data with several franchisees who run stores for you in a different region. They'd like up-to-date sales data for their stores in near real time. All sales data for all stores resides in a master sales table. Each franchisee wants to see, and is only permitted to see, its own store's data.

Previously, with your old on-premises database technology, you needed a series of jobs running throughout the day to query the database and output the data to a flat file. The file would be uploaded to a secure FTP server and transferred to the franchisees' FTP servers, where they could pick it up. The jobs sometimes failed due to bandwidth constraints on the network. It also took a long time to copy the file because it had to move from one geographic region to another. With all this data movement and several points of failure, the process was a huge resource drain on the team that supported it.

Thankfully your company recently migrated its data from this legacy database to Snowflake in the cloud!

Now your job is to not just recreate the file transfer process but find a better way of making this data available to your franchisees. You immediately think of using the data sharing feature you’ve read about. However, you know that creating a share from your primary database isn’t the most efficient way forward on its own due to the latency involved in querying cross-region.

You consider creating a replica of the primary database in the region where this franchisee is located. You would rather avoid the cost and inefficiency of replicating the entire database, just to create a share of the sales master table.

You wonder if you can use streams and tasks to capture changes to the sales master table and push them into a new database, as well as creating a set of secure views (each one filtering the data to just the required store) over the replicated table.

It would then be possible to replicate just this database along with the secure views to the region where the franchisees are located. Shares can then be created to map each secure view to each franchisee.

Figure 7-5 illustrates the design you’re looking to put in place here.
Figure 7-5

A diagram of the practical example including streams, tasks, replication, and data sharing using secure views

The following code shows how to achieve this in Snowflake:
USE ROLE ACCOUNTADMIN;
//IN YOUR LOCAL ACCOUNT, CREATE OR REPLACE A DATABASE WITH A SUBSET OF DATA
CREATE OR REPLACE DATABASE SOURCEDB;
CREATE OR REPLACE SCHEMA SALES;
//CREATE A SEQUENCE TO USE FOR THE TABLE.
CREATE OR REPLACE SEQUENCE SEQ_SALES
START = 1
INCREMENT = 1;
//CREATE THE SALES_MASTER TABLE
CREATE OR REPLACE TABLE SALES.SALES_MASTER
(ID NUMBER,
STORE_NO INT,
SALES_DATE TIMESTAMP_TZ,
SALES INT,
REGION VARCHAR(10));
INSERT INTO SALES.SALES_MASTER
SELECT SOURCEDB.SALES.SEQ_SALES.NEXTVAL, 100, '2021-06-07 02:21:10.00 -0700', 100, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 100, '2021-06-08 02:21:10.00 -0700', 190, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 100, '2021-06-09 02:21:10.00 -0700', 104, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 100, '2021-06-10 02:21:10.00 -0700', 150, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 101, '2021-06-07 02:21:10.00 -0700', 3201, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 101, '2021-06-08 02:21:10.00 -0700', 2987, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 101, '2021-06-09 02:21:10.00 -0700', 3241, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 101, '2021-06-10 02:21:10.00 -0700', 3829, 'APAC'
UNION
SELECT SEQ_SALES.NEXTVAL, 102, '2021-06-07 02:21:10.00 -0700', 675, 'EUR'
UNION
SELECT SEQ_SALES.NEXTVAL, 102, '2021-06-08 02:21:10.00 -0700', 435, 'EUR'
UNION
SELECT SEQ_SALES.NEXTVAL, 102, '2021-06-09 02:21:10.00 -0700', 867, 'EUR'
UNION
SELECT SEQ_SALES.NEXTVAL, 102, '2021-06-10 02:21:10.00 -0700', 453, 'EUR';
SELECT * FROM SALES.SALES_MASTER;
//CREATE A DATABASE TO STORE THE APAC SALES TABLE AND SECURE VIEWS
CREATE OR REPLACE DATABASE PRIMARYDB;
CREATE OR REPLACE SCHEMA SALES;
CREATE OR REPLACE TABLE PRIMARYDB.SALES.SALES_MASTER_APAC
(ID NUMBER,
STORE_NO INT,
SALES_DATE TIMESTAMP_TZ,
SALES INT);
//SEED THE TABLE WITH EXISTING RECORDS
INSERT INTO PRIMARYDB.SALES.SALES_MASTER_APAC
SELECT ID, STORE_NO,SALES_DATE, SALES FROM SOURCEDB.SALES.SALES_MASTER WHERE REGION = 'APAC';
//CREATE A SECURE VIEW FOR EACH FRANCHISEE STORE
CREATE OR REPLACE SECURE VIEW PRIMARYDB.SALES.FRANCHISEE_100 AS SELECT * FROM PRIMARYDB.SALES.SALES_MASTER_APAC WHERE STORE_NO = 100;
CREATE OR REPLACE SECURE VIEW PRIMARYDB.SALES.FRANCHISEE_101 AS SELECT * FROM PRIMARYDB.SALES.SALES_MASTER_APAC WHERE STORE_NO = 101;
SELECT * FROM PRIMARYDB.SALES.FRANCHISEE_100;
SELECT * FROM PRIMARYDB.SALES.FRANCHISEE_101;
//SET UP A STREAM TO RECORD CHANGES MADE TO THE SOURCE TABLE
CREATE OR REPLACE STREAM SOURCEDB.SALES.STR_APAC_SALES ON TABLE SOURCEDB.SALES.SALES_MASTER APPEND_ONLY = TRUE;
//SET UP A TASK TO LIFT THE CHANGES FROM THE SOURCE DATABASE AND INSERT THEM TO THE PRIMARYDB DATABASE
CREATE OR REPLACE TASK SOURCEDB.SALES.TSK_APAC_SALES
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '1 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('SOURCEDB.SALES.STR_APAC_SALES')
AS
  INSERT INTO PRIMARYDB.SALES.SALES_MASTER_APAC SELECT ID, STORE_NO, SALES_DATE, SALES FROM SOURCEDB.SALES.STR_APAC_SALES WHERE METADATA$ACTION = 'INSERT' AND STORE_NO IN (100,101) AND REGION = 'APAC';
ALTER TASK SOURCEDB.SALES.TSK_APAC_SALES RESUME;
//INSERT A NEW SALES RECORD INTO THE SALES MASTER TABLE FOR STORE 100
INSERT INTO SOURCEDB.SALES.SALES_MASTER
SELECT SOURCEDB.SALES.SEQ_SALES.NEXTVAL, 100, '2021-06-13 02:21:10.00 -0700', 999, 'APAC';
//CHECK THE STREAM FOR THE RECORD ADDED
SELECT * FROM SOURCEDB.SALES.STR_APAC_SALES WHERE METADATA$ACTION = 'INSERT';
//CHECK FOR THE NEW RECORD
SELECT *
FROM PRIMARYDB.SALES.SALES_MASTER_APAC;
//ENABLE REPLICATION OF THE PRIMARY DATABASE TO THE SECONDARY ACCOUNT IN THE CONSUMER REGION
ALTER DATABASE PRIMARYDB ENABLE REPLICATION TO ACCOUNTS AZURE_APAC.ACMEPROVIDERACCOUNT2;
//IN THE SECONDARY ACCOUNT, CREATE A LOCAL REPLICA OF THE PRIMARY DATABASE
CREATE DATABASE SECONDARYDB
  AS REPLICA OF AP-SOUTHEAST-2.ACMEPROVIDERACCOUNT1.PRIMARYDB;
//SCHEDULE REFRESH OF THE SECONDARY DATABASE
CREATE OR REPLACE TASK REFRESH_SECONDARYDB_TASK
  WAREHOUSE = MYWH
  SCHEDULE = '10 MINUTE'
AS
  ALTER DATABASE SECONDARYDB REFRESH;
ALTER TASK REFRESH_SECONDARYDB_TASK RESUME;
//CREATE OR REPLACE A SHARE FOR EACH FRANCHISEE
CREATE OR REPLACE SHARE SHARE100;
CREATE OR REPLACE SHARE SHARE101;
//ADD OBJECTS TO THE SHARE:
GRANT USAGE ON DATABASE SECONDARYDB TO SHARE SHARE100;
GRANT USAGE ON SCHEMA SECONDARYDB.SALES TO SHARE SHARE100;
GRANT SELECT ON VIEW SECONDARYDB.SALES.FRANCHISEE_100 TO SHARE SHARE100;
GRANT USAGE ON DATABASE SECONDARYDB TO SHARE SHARE101;
GRANT USAGE ON SCHEMA SECONDARYDB.SALES TO SHARE SHARE101;
GRANT SELECT ON VIEW SECONDARYDB.SALES.FRANCHISEE_101 TO SHARE SHARE101;
//ADD ONE OR MORE CONSUMER ACCOUNTS TO THE SHARE
ALTER SHARE SHARE100 ADD ACCOUNTS=FRANCHISEE_ACCOUNT_100;
ALTER SHARE SHARE101 ADD ACCOUNTS=FRANCHISEE_ACCOUNT_101;

Summary

In this chapter on data sharing and the Data Cloud, you investigated how Snowflake solves the challenge of slow, error-prone ways of sharing data internally within an organization and externally with third-party consumers. You discovered that by leveraging the global scale of the cloud computing backbone, Snowflake can make physical movement of data a concept relegated to the history books.

Now consumers can get their hands on the latest, most up-to-date data on demand when they need it. They can also do this quickly and easily, without the need for either party to build and configure APIs for data access.

You explored a practical example based on a real-world situation that incorporated a number of Snowflake features you learned earlier in the book.
