Chapter 12. Sampling Data Sets

To sample or not to sample? That is the question. In a world where data volumes are growing, storage solutions are getting cheaper, and data creation is easier than ever, data preppers must decide whether to use a sample subset and understand the implications of doing so. This chapter will look at why sampling should be used with caution, when you might need to sample, and what techniques you can use to sample data in Prep Builder.

One Simple Rule: Use It All If Possible

The reason we use data is to find the story, trends, and outliers that will help us make better decisions in our everyday and working lives. So why not aim to use all the data and information you can?

Using the full data set is not always possible, though, most often because of its sheer size. The reason Preppin’ Data exists is that data frequently needs to be prepared before it can be analyzed. To do that, we need to know what can be cleaned completely and what cannot. If a data set can’t be cleaned completely, it makes sense to remove the sections that can’t be cleaned. That is not what is meant by sampling, though. Sampling means using a subset of the full data set, not because the data can’t be cleaned but for a host of other reasons.

Sampling to Work Around Technical Limitations

A sample allows you to take the data you need to clean and freeze it in time to deal with the two main technical challenges of data prep:

Volume of data
A sample lets you set up your analysis on a manageable subset and then run the full data set against that logic (a pattern sketched after this list).
Velocity of data
A sample limits the amount of continual change, allowing you to set up the logic before permitting more frequent updates.
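
To make the pattern concrete, here is a minimal sketch in Python with pandas rather than Prep Builder itself; the file name and column names are hypothetical. The cleaning logic is built and tested against a small sample, then the identical function is run over the full data set:

    import pandas as pd

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        """Prep logic, built and tested against a sample."""
        df = df.dropna(subset=["order_id"])                  # drop incomplete records
        df["sale_date"] = pd.to_datetime(df["sale_date"])    # fix the date type
        df["region"] = df["region"].str.strip().str.title()  # tidy a text category
        return df

    # Develop and test against a small sample first...
    sample = pd.read_csv("sales.csv", nrows=1_000)
    cleaned_sample = clean(sample)

    # ...then run the identical logic against the full data set.
    cleaned_full = clean(pd.read_csv("sales.csv"))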

Let’s look at both in more detail.

Volume of Data

The world is swimming in data, and there is no sign of that changing. Many data sets have become too big to store in files, so databases are increasingly used to handle the volume. The good news is that if your data is in a database, someone has at least architected how the data is stored. The likely data prep challenges here are:

  • Joining other data sets together. Database tables are rarely in the perfect form for your analysis. When working out what is useful for your analysis and what isn’t, it can be beneficial to take small samples of the data tables to assess how they can be joined or which data fields you need from each. Samples also allow you to work faster by avoiding the long processing times of joining large data sets together, especially if you don’t get the join condition(s) right the first time (a join check is sketched after this list).

  • Determining your ideal structure for analysis. Database tables are designed for storing data, not optimizing your analysis. Pivoting data, removing unnecessary fields like database keys/IDs, and filtering to relevant time periods are common techniques that you can apply more quickly to samples than the full data set. Related considerations are:

    • What columns are there? Do you need to add more (calculations) or remove some?

    • How clean are those data fields? Check whether the data set has concatenated fields that you need to break apart, text strings that need cleaning to become meaningful categories, or stray characters slipping into your measures. By taking a sample, you can break the overall cleaning step into more manageable sections. After you’ve applied this cleaning logic back to the original data set, you can tweak it to cover the remaining challenges in the data.
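
As an illustration of the first point, here is a hedged pandas sketch; the tables, file names, and key column are all hypothetical. A few hundred rows from each table are enough to confirm the join condition before committing to the full tables:

    import pandas as pd

    # A few rows from each table are enough to test the join condition.
    orders = pd.read_csv("orders.csv").head(500)
    customers = pd.read_csv("customers.csv").head(500)

    # validate="m:1" raises an error if customer_id is not unique in customers.
    joined = orders.merge(customers, on="customer_id", how="left", validate="m:1")

    # Row counts and unmatched keys reveal whether the join condition is right.
    print(len(orders), "orders ->", len(joined), "joined rows")
    print(joined["customer_name"].isna().sum(), "orders with no matching customer")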

Velocity of Data

Media streaming services and the Internet of Things are two areas currently facing the daily battle of data velocity. Long gone are the days when an overnight batch run to update key data sources was sufficient. The speed at which data is created by users of modern media services and digital platforms means that data preparation can be a constantly moving target. By using samples of this data, you can avoid many of the pitfalls of trying to use all the data, all the time. You create logic for a sample, and as more data floods in, you can apply that same logic to the live stream by simply removing the restriction that defined the sample, as sketched below.
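
One simple way to impose such a restriction, sketched here in pandas with a hypothetical event stream, is a fixed date window that is removed once the logic is ready:

    import pandas as pd

    events = pd.read_parquet("events.parquet")  # a hypothetical fast-growing source

    # The sample restriction: freeze the stream to a single day while building logic.
    dev = events[events["event_time"].between("2023-06-01", "2023-06-02")]

    # ...build and test the prep logic against dev...

    # Once the logic works, remove the restriction and run it on everything.
    prod = events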

Other Reasons for Sampling

There are other aspects of data preparation where sampling can prove beneficial.

Reduce Build Times

Like many others, I spent my early career working in large corporations. While these environments can present fantastic opportunities, they can also produce a serious amount of frustration with slow computers, slow servers, and slow connections between them. In one institution, I used two computers so one could run queries while I built the next set of queries on the other. My love of coffee comes from having to fill my time while queries ran on both machines at the same time. Using samples to set up the data structure and analysis was key to keeping my pace of delivery high and my caffeine levels lower.

Once I had the queries structured, the joins checked, and the relevant filters in place, I still had to wait for the full data set to run, but I did so with confidence that I had done everything I could to ensure it needed to run only once.

Determine What You Need

Sampling also comes in handy when you don’t actually know what you need. Chapter 2 advocated sketching out what you need to complete your analysis. Sometimes, though, the only way to refine that need is to start forming the analysis. Data preparation requires multiple iterations just as much as data analysis does, as people learn and ask follow-up questions. Samples of data can give you a feel for the additional changes you might want to make. Writing an output that takes a long time to form is only an issue if you have to do it time and time again.

Sampling data not only speeds up the preparation process and makes it easier to understand your data but also allows you to more easily communicate the challenges you are facing to others.

Sampling Techniques

By default, Prep Builder uses samples whenever you connect a data set to it (Figure 12-1).

Figure 12-1. Tableau’s default input sampling

Through the clever use of algorithms, Prep Builder lets you begin to see the overall shape of your data, its categories, and the distribution of its values without having to run the full data set through each step until you are ready to generate the final output (at which point Tableau will use the full input data set unless instructed otherwise). If you have a small data set, Tableau will use all of it; for wider data sets (more columns), the default sampling returns fewer rows than it does for thinner data sets (fewer columns).
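
Tableau does not publish the algorithm, but one way to picture the width-versus-rows trade-off is a fixed “cell budget,” an assumption made purely for illustration:

    # Illustration only: Tableau does not publish its sampling algorithm.
    # A fixed "cell budget" captures the trade-off between width and rows.
    CELL_BUDGET = 1_000_000

    def sampled_rows(n_columns: int) -> int:
        """More columns means fewer sampled rows under a fixed budget."""
        return CELL_BUDGET // n_columns

    print(sampled_rows(10))   # 100000 rows for a thin data set
    print(sampled_rows(100))  # 10000 rows for a wide one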

Even though Prep Builder is sampling the data, you might want to control the sample yourself. There are two basic controls for this within the Input step.

Fixed Number of Rows

The fixed number of rows will be taken from the first rows in the data source. These will not necessarily be in any prescribed order, as different sources load and store data in different ways. Figure 12-2 shows where you can set the number of rows you want returned.

Figure 12-2. The “Fixed number of rows” option
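
The equivalent idea outside Prep Builder, sketched in pandas with a hypothetical file: reading only the first N rows is fast because the reader can stop early, but which rows you get depends entirely on how the source stores them:

    import pandas as pd

    # Take only the first 10,000 rows; the reader can stop as soon as it has them.
    sample = pd.read_csv("sales.csv", nrows=10_000)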

Random Sample

You can control the sample further by changing the “Sampling method” option to “Random sample” (Figure 12-3). Unlike the default “Quick select” method, this option makes Prep Builder select random rows from across the data set. To make this selection, Prep Builder has to work through the whole data set, so expect it to take longer.

Figure 12-3. The “Random sample” option
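
Again as a rough pandas analogue (hypothetical file), a random sample requires reading the whole source before rows can be drawn from across it:

    import pandas as pd

    full = pd.read_csv("sales.csv")                  # the whole file is read first
    sample = full.sample(n=10_000, random_state=42)  # then random rows are drawn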

You can always define your own sample using a bespoke filter in Prep Builder, so you control what the smaller data set contains (Figure 12-4). Just remember that by setting this yourself you are biasing the data set, so be careful with any analytical conclusions you draw from this approach.

Figure 12-4. Setting up a filter in the Input step
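
A pandas sketch of the same idea, with a hypothetical region filter, shows why such a sample is biased by construction:

    import pandas as pd

    df = pd.read_csv("sales.csv")

    # A bespoke "sample": one region, chosen by hand.
    sample = df[df["region"] == "North"]

    # Caution: any aggregate of this sample describes the North region only,
    # because the filter has deliberately biased the data.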

When Not to Sample

Sampling can speed up the data prep process, but what if you want to use Prep for the analytical process too? With the Profile pane showing the distribution of values and highlighting records of data, it’s possible to answer your stakeholders’ questions within Prep itself. In this case, you would want to avoid using a sample to prevent issues when you’re asked questions like:

  • How many of these are there?

  • When was the first sale we made?

  • Can you confirm this has never happened?

If the input data set were sampled, it would be difficult to give a confident answer to any of these questions. So, in this situation, you’d want to simply change the input to use all the data in Prep. This might increase processing times, but you gain the assurance that all the data is present. Note that when you run a flow by writing an output, the whole data set is written regardless of any sample on the input, so you only need to switch to the full data set on the Input step when answering questions within Prep itself.
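
A short pandas sketch (hypothetical file and columns) shows why: answers computed from a sample can simply be wrong for the whole data set:

    import pandas as pd

    full = pd.read_csv("sales.csv")
    full["sale_date"] = pd.to_datetime(full["sale_date"])
    sample = full.head(10_000)

    # "How many of these are there?" -- only the full data set knows.
    print(len(sample), "vs", len(full))

    # "When was the first sale we made?" -- the sample may not contain it.
    print(sample["sale_date"].min(), "vs", full["sale_date"].min())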

Summary

Sampling data can be a useful technique for increasing the speed and agility of your data preparation. But any sample by definition is not the whole data set. To produce reliable analytical findings, you really need to use the full data set where possible and make it available to your users. This way, outliers and patterns can be discovered that a sample might not have revealed.
