Chapter 45. Storing Your Data

One of the key considerations in data preparation is where to hold the output. After all, what is the point of doing all that hard work if you then put the data somewhere that is:

  • Inaccessible to those who need to use the data

  • Slow or unresponsive

  • Not protected against accidental overwrites, risking permanent loss of the source data

Let’s consider each of these scenarios in turn to determine what you should consider when writing your output to a location (Figure 45-1).

Figure 45-1. The Output step in Prep

Inaccessibility

As the previous chapter discussed, it can be challenging to find the right balance between data openness and data security. More restrictive data legislation is being passed across the world as the general public realizes the value of their personal data and the potential effects of a data breach on their lives. At the same time, without breaking any rules or laws, giving people freedom to work with data will lead to future innovation and better, more efficient decisions. So how can you strike this balance? Well, let’s first consider the absolute don’ts of data access.

Don’t Break the Law

There are some things that you just can’t do, and breaking the law is one of them. The following are two important things to keep in mind.

Personally identifiable information

Personally identifiable information (PII) is any data that can identify an individual. For operational reasons, you may need to be able to identify someone (e.g., to check the balance in their bank account), but for analytical purposes, this should rarely be necessary. This isn’t a book on data security, so I won’t go into too much detail here, but the point is that you should restrict access to any data that could be used to identify the individuals it describes.
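One common way to restrict identifying data while keeping it useful for analysis is pseudonymization: replacing raw identifiers with stable, non-reversible tokens before the data ever reaches the analytical environment. The sketch below shows the idea in Python; the record layout, column names, and salt are all invented for illustration, not taken from any real system.

```python
import hashlib

# Hypothetical customer records; the columns are illustrative only.
customers = [
    {"customer_id": "C001", "email": "ann@example.com", "balance": 120.50},
    {"customer_id": "C002", "email": "bob@example.com", "balance": 75.00},
]

def pseudonymize(value: str, salt: str = "analytics-salt") -> str:
    """Replace a PII value with a stable, non-reversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

# Analysts receive tokens instead of raw identifiers, so joins across
# tables still work while individuals remain unidentifiable.
analytical_view = [
    {"customer_token": pseudonymize(c["customer_id"]),
     "balance": c["balance"]}
    for c in customers
]
```

Because the same input always produces the same token, analysts can still count distinct customers or join tables, but the email addresses and raw IDs never leave the operational side.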

The right to be forgotten

Numerous pieces of legislation have been created or restructured over the last few years protecting an individual’s right to have their data removed from your organization’s possession. In the EU, this policy is known as the right to be forgotten. To comply with these laws, be sure there is a clear trail of where data is used and what it is used for, and delete it once it has been used in your analysis or is no longer relevant.

Don’t Delete Operational Data

Operational systems—the technology systems that allow you to make payments, take orders, or provide services—must not be affected by your analytical queries. All of these systems rely on data and often store it in databases. If you are querying these systems directly, you are one poor query away from causing a lot of damage by deleting operational data points or slowing down key operational processes. If it is legal and useful for your analysis, copy this data into a specialist analytical environment. This way, you are querying a database that, in the event of an issue, will not affect the key systems your organization relies on to operate.
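The copy-then-query pattern described above can be sketched in a few lines. This example uses two in-memory SQLite databases as stand-ins for an operational system and an analytical environment; the table and column names are invented for illustration.

```python
import sqlite3

# Stand-in for the operational system (read from it once, carefully).
operational = sqlite3.connect(":memory:")
operational.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
operational.executemany("INSERT INTO orders VALUES (?, ?)",
                        [(1, 9.99), (2, 24.50)])
operational.commit()

# Stand-in for the analytical environment, safely separate.
analytical = sqlite3.connect(":memory:")
analytical.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# One read against the operational system; all later analytical queries
# hit the copy, so a bad query can't harm the system the business runs on.
rows = operational.execute("SELECT id, amount FROM orders").fetchall()
analytical.executemany("INSERT INTO orders VALUES (?, ?)", rows)
analytical.commit()

copied = analytical.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In practice this extraction would be scheduled and incremental, but the principle is the same: analysts query the copy, never the live system.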

So, with these aspects in mind, let’s look at the other side of the coin and treat data as an asset, not a liability. Making data accessible by following this list of do’s is key for any organization that wants to progress and develop.

Do Grant Access to Data for the Experts

Not being familiar with what is in the data can lead to poor decisions. This isn’t a matter of technical skills but of understanding the context of the data. Giving subject-matter experts access to the data, and working with them to understand exactly what each column represents, will ultimately ensure the data source gets documented and becomes useful. Otherwise, you are storing, or potentially using, data that could be misconstrued. Storing the data in a location the business experts can’t access will result in “opinion-driven” rather than data-driven decisions.

Do Document Your Sources

Storing data in a way that obscures what it contains won’t lead to success. Curating data sources so that end users can understand them is more important than saving small amounts of storage space. Clearly naming column headers, creating views on top of tables to “humanize” the language, and publishing data sources on user-friendly platforms like Tableau Server are all ways to document a data source and make it more accessible.
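The “humanizing” view mentioned above is easy to demonstrate. The sketch below, using SQLite via Python, renames cryptic warehouse columns into business-friendly ones; the raw table and its abbreviated column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A raw warehouse table with terse, system-oriented column names.
conn.execute("CREATE TABLE sls_tx (cst_id INTEGER, tx_amt REAL, tx_dt TEXT)")
conn.execute("INSERT INTO sls_tx VALUES (42, 19.99, '2023-05-01')")

# A view that translates the schema into plain business language.
# End users query the view, never the raw table.
conn.execute("""
    CREATE VIEW sales AS
    SELECT cst_id AS customer_id,
           tx_amt AS sale_amount,
           tx_dt  AS sale_date
    FROM sls_tx
""")

row = conn.execute("SELECT customer_id, sale_amount FROM sales").fetchone()
```

Because a view stores no data of its own, this documentation layer costs almost nothing while making every downstream query self-explanatory.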

Slow/Unresponsive Performance

One major consideration when you are deciding where to store data is the response time. In an age where people all over the world are able to ask questions online and get answers in seconds, a data set that takes 20 seconds to load can feel positively glacial. Ensuring data sets are responsive to the queries being made is key to putting data at the heart of your organization. Not all data lives in such a responsive state; many data sets are stored on slow, archaic databases. As data preppers, though, our task is not just to clean input data sets but also to ensure the output data set will be responsive. Ultimately, if it isn’t, your users will tell you by not using the data set.

Overwriting Risks

Although Tableau Desktop is a read-only tool, the same can’t be said for Prep Builder. Any alterations a user makes in Tableau Desktop will never change the underlying data source. This frees users to experiment and try out new techniques or queries without fear of damaging anything. If you enable the same level of freedom in Prep Builder, however, you run the risk of users overwriting data that you might not be able to recover.

This risk is nothing new for anyone running data infrastructure. For decades, database environments have involved a balancing act between managing limited resources and meeting the needs of users. Reduce the administrator’s workload, and you hand less experienced users more responsibility. Tilt the balance the other way, and everyone is stuck waiting for the administrator before they can do anything. A perfect equilibrium is probably impossible, but some approaches can help.

Grant Read-Only Access

Giving people access to the raw data can help them find what they need without putting huge processing loads on the data storage environment. At a large bank, the idea of giving users Tableau Desktop was initially met with resistance for fear that the increased demand would overwhelm an already stretched processing environment. Thanks to the quality of the drivers Tableau uses, the opposite actually happened: in the majority of cases, queries became more optimized, and users got dramatically more value from the data assets. With Desktop, this was an easy move, as the tool is read-only and therefore couldn’t affect the underlying data.

Still, you should look at Prep Builder in a similar way. The only difference is that instead of producing visual analysis, users will be producing cleaner, more streamlined data sets. Giving the users a set location to publish these to (a “sandbox” or “playground”) not only will give them the confidence to try to gain more value from the data but also can improve specifications for future developments, since users will know what they need to do to empower themselves and others. However, remember that with Prep, you can write the data back to its source with the Output to Database option (Chapter 20). To prevent any issues with accidental overwrites, give data preppers access solely to read-only information.
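Read-only access is usually enforced through database-level permissions, which vary by platform. As a minimal illustration of the principle, the sketch below uses SQLite’s URI `mode=ro` flag as a stand-in: an accidental overwrite fails loudly instead of destroying data. The file path and table are invented for the example.

```python
import os
import sqlite3
import tempfile

# Create a small database to act as the "warehouse".
path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE facts (k TEXT, v INTEGER)")
rw.execute("INSERT INTO facts VALUES ('rows', 1)")
rw.commit()
rw.close()

# Data preppers connect read-only: queries work, writes are rejected.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
can_read = ro.execute("SELECT v FROM facts").fetchone()[0]

try:
    ro.execute("INSERT INTO facts VALUES ('rows', 2)")
    write_blocked = False
except sqlite3.OperationalError:
    # SQLite refuses: "attempt to write a readonly database"
    write_blocked = True
```

On a production database, the equivalent is a role or account granted only SELECT privileges; the effect is the same, and Prep Builder’s Output to Database option simply cannot overwrite what the connection isn’t allowed to touch.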

Train Before Publishing

The Prep Builder flow itself doesn’t have to be run as an output. The process of cleaning data can be beneficial both for users, who learn what they would like to achieve and how to do so, and for those who hardcode the results (i.e., write them to the database), who get to see the details of the cleaning process. This means the users’ requirements, which may have been spurious or iterative, are now clarified before the administrator spends time implementing this work. Over time, the skills the users develop will empower them to do the publishing work themselves, as they will know how to use the tool and the environment correctly.

So, Where Do You Write That Output?

The short answer to where to write your output is...it depends. It depends on the user, the type of data involved, the responsiveness of the database, the organization’s investment in the data platform, and the key roles supporting it. Empowering end users and allowing them to learn over time will promote data-driven decision-making in your organization. Ensuring that they can’t go too far wrong is good for both them and the platform administrator, as fixing mistakes can be time-consuming. Where possible, writing data sources to a centrally managed location will help protect them, especially as data privacy laws and corporate policies become more stringent.

Central locations, like databases and shared spaces, mean that potential data users know where to look for data sets that may answer their questions. Potential users need to be clear on what they can and cannot use, and what is still being tested. Organizations will need to create data repositories for validated data sets, as well as repositories for data sets that are in development. These environments are dramatically different, so be sure to check with the data experts in your organization about how to store the data you have prepared.

Good documentation will ease the migration from a development environment to a “productionalized” data set and help reduce the time it takes to fix issues that may arise.

Note

Documentation in Prep Builder was covered in Chapter 42.

Summary

Creating an analytical data store that allows users to work with their data is a worthwhile investment that has huge benefits for the organization. Allowing users to access Tableau Prep, which is focused on making data preparation skills easier to learn, is another plus in an organization’s approach to data work. Combining the two is a strategy that has a lot of upsides but may take some time to achieve. Setting this as a target is a great starting point if data is locked down or potential users don’t yet have the skills to access the data sources they need. Those skills will develop over time, and navigating this learning curve is far better than a situation where data is unclearly documented or locked away for fear of overwriting or misuse.
