Chapter 44. Managing Data

Data management and information security are growing focus areas for consumers and companies. This chapter is not a guide to the relevant laws; rather, it highlights the areas that most Prep Builder users should consider. After learning about the types of sensitive data you may come across, you’ll be better equipped to handle and use the data correctly. This chapter will also cover when and why you may wish to delete data.

Throughout this book, you may have picked up on the fact that I support the use of data and data proliferation. This is because I have seen many organizations lock down data access so much that it becomes impossible to make data-driven decisions, either due to the lack of data or lack of practical experience working with it once it does become available. This is not to say that you should be reckless with data access, but preventing data use altogether can cause consumers to make poor, even harmful, “gut instinct” decisions.

What Is Sensitive Data?

Data sensitivity is measured in many different ways depending on the organization, the sector in which it operates, and the data’s subject matter. Most data is categorized by its sensitivity—that is, the extent to which the accidental or purposeful release of this data would cause issues (legal or otherwise) for the organization that owns the data or the subjects of the data. Typically, organizations define three to five levels of data sensitivity. This section will describe four of the most common levels.

Public

Public information has been pulled from publicly available sources and is probably free to use. The data may come from government sources or just be data that is used openly—for example, spatial data sets (like post codes/zip codes), census responses, or social media posts.

Confidential

Confidential data has probably been processed; that is, it has been refined from the raw data. After investing the time and effort in that refinement process, the data holders might not want to openly share the data with potential competitors. If this information were to leak to the public, there would be no consequences for the entities covered by the data, but the organization would lose the competitive advantage gained through its refinement work.

Strictly Confidential

Strictly confidential data is likely to be information on your sales, customers, and products. This is data that you cannot share and that you do not want your competitors to see. This data may contain lists of your customers’ details, but nothing that is so sensitive that it would impact the customer if the data were leaked. This level of sensitivity is likely to include intellectual property within the company that may cover data assets like financial models and projections.

Restricted

Restricted data is the most sensitive data the organization holds. For large organizations, this can cover a vast spectrum, from customer-sensitive data like banking details to demographic information that could potentially be harmful if leaked to the public. Not all organizations store information like political affiliations or sexual orientation, but the data can be a proxy for this (e.g., bank transactions that show donations to a political campaign).

Restricted data can also include company Price-Sensitive Information (PSI), which relates to company shares traded on the stock market. If PSI is not managed correctly, shareholders can potentially engage in insider trading (i.e., trading shares based on information that isn’t publicly available, a practice that is not just unfair but also illegal). People with access to this information need to understand how to manage this data; they must be prevented from trading shares themselves or divulging the information to those who do. Breaking rules around restricted information poses a risk not just of damaging the organization’s reputation but also of incurring fines or even facing imprisonment.
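The four levels above form an ordered scale, which is worth capturing explicitly when access decisions are automated. The sketch below uses a hypothetical `Sensitivity` enum and sharing rule; real organizations name and number their tiers differently, so treat this as an illustration only.

```python
from enum import IntEnum

# Hypothetical tier names mirroring the four levels described above;
# IntEnum makes the levels comparable (RESTRICTED > PUBLIC, etc.).
class Sensitivity(IntEnum):
    PUBLIC = 1
    CONFIDENTIAL = 2
    STRICTLY_CONFIDENTIAL = 3
    RESTRICTED = 4

def can_share_externally(level: Sensitivity) -> bool:
    """Only public data is safe to share outside the organization."""
    return level == Sensitivity.PUBLIC

print(can_share_externally(Sensitivity.PUBLIC))      # True
print(can_share_externally(Sensitivity.RESTRICTED))  # False
```

Because the enum is ordered, a policy check like “mask anything above Confidential” becomes a simple comparison rather than a list of special cases.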

Managing Data Based on Sensitivity

Striking the balance between overprotection, which prevents people from being able to do their work, and underprotection, which risks misusing sensitive data, is a challenge but far from impossible. Working with the users of the data to understand what they intend to do with it is the key to achieving this balance.

Having data sources that are stored centrally but enable you to grant or gain access quickly is a good target to aim for. Removing the small data sets held individually on colleagues’ laptops and personal drives can help ensure data is up-to-date and correctly removed once it is no longer relevant. Providing a centralized data source will ensure that data is accurate, as multiple people will be using the data sets and can make adjustments as they find errors.

The challenge with centralized data sets is getting access in a timely fashion so people can answer the questions they have as they arise. Centralized data sets often include data across the sensitivity spectrum, so permissions are tightly controlled. It’s important to have a fast turnaround and a devolved permission approval process, where multiple people are authorized to grant access to the individual making the request. Without these measures, requests can be held up indefinitely by a single approver who is away from the office or busy.
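A devolved approval process can be sketched very simply: each data source has several authorized approvers, and any one of them can grant a request. The names and source keys below are hypothetical, and a real system would also log and audit each grant.

```python
# Hypothetical registry: multiple approvers per data source, so a
# request is never blocked by one person being out of the office.
approvers = {
    "sales_db": {"alice", "bob", "carol"},
}

def approve_access(source: str, approver: str, requester: str) -> bool:
    """Grant access if this approver is authorized for this source."""
    if approver not in approvers.get(source, set()):
        return False
    print(f"{requester} granted access to {source} by {approver}")
    return True
```

The design point is the set of approvers per source: if `alice` is on holiday, `bob` or `carol` can still unblock the request.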

For data preppers, practicing in a “sandbox” environment similar to the centralized stores can help them master the following skills:

  • Connecting to a data store

  • Using naming conventions for software and data sources

  • Writing to the data store

  • Removing content from the data store once it is no longer needed

Practicing these skills helps ensure that the data prepper will use the right terms in the centralized production environment and have a process developed in Prep that mirrors what needs to happen in production.
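The four sandbox skills above can be rehearsed end to end with an in-memory SQLite database standing in for the centralized store, so the whole exercise runs locally with no risk to production. The naming convention shown (`<subject>_<detail>_<stage>`) is hypothetical; use whatever convention your production environment mandates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # 1. connect to the data store
table = "sales_orders_dev"                  # 2. apply the naming convention
conn.execute(f"CREATE TABLE {table} (order_id INTEGER, amount REAL)")
conn.execute(f"INSERT INTO {table} VALUES (1, 99.50)")   # 3. write to the store
rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
conn.execute(f"DROP TABLE {table}")         # 4. remove content once done
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
conn.close()
```

After step 4, `remaining` is empty: nothing is left behind in the store, which is exactly the habit the sandbox is meant to build.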

Production Versus Development Environments

Production environment is a term not covered much in this book thus far. Production environments are very tightly controlled, as they store much of the regulatory or otherwise important reporting of a company’s performance. In contrast, a development environment is where a data set or query will be tested before being placed into the production environment, so it is more flexible and has fewer rules constraining the use of the data sets it holds.

Not all data is prepared perfectly the first time, in the same way that a report or analytical dashboard is rarely production-ready right away. You’ll often need to iterate based on the feedback of others using the data. This is why you need development environments to test the data preparation flow as well as the resulting data set. Only once the asset has been tested and is approved for widespread use and in key reports should you move the flow into a production setup. Again, the production environment is more tightly controlled, so most people will not have permission to write content to it, nor should they, lest they make mistakes that may be very difficult to resolve.

Deleting Data

So, if you understand the sensitivity of the data and have tested the data set in a development environment before publishing the flow to a production environment, are you done? Well, not really. You also need to consider when to delete data. In this section, we’ll cover the two most common reasons for doing so.

When Data Becomes Outdated or Irrelevant

Data can become less relevant and potentially less accurate over time. When creating a data source, you should think about how long to retain that data. Obviously, designating a date for deleting the table or records doesn’t mean you have to do it then. You can always reassess the data for its relevance and accuracy, but setting a date will at least keep you from putting off the decision to keep or remove it.
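Setting a concrete review date for each data source keeps the keep-or-remove decision from being deferred indefinitely. The two-year retention window below is a hypothetical policy, not a legal recommendation.

```python
from datetime import date, timedelta

# Hypothetical policy: flag a data set for review two years after
# creation, rather than leaving its lifespan open-ended.
RETENTION = timedelta(days=730)

def review_due(created: date, today: date) -> bool:
    """True once the data set has passed its scheduled review date."""
    return today >= created + RETENTION

print(review_due(date(2021, 1, 1), date(2023, 6, 1)))  # True: past review date
```

As the chapter notes, reaching the review date doesn’t force deletion; it just forces the decision to be made.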

When a Customer or Client Leaves

You should retain data only as long as you are legally allowed to. This rule has become much stricter with the EU’s General Data Protection Regulation (GDPR), adopted in 2016 and enforced from May 2018. Detailed customer data should be removed when the customer leaves. To be able to do that, you must know or be able to find out where all of that customer’s data actually resides—in which tables in what systems. If data is distributed through a lot of different sources and files, this is a more difficult process. Having data sources in a centralized location means that when you remove customer data from that central source, the change will also flow through to the data sets derived from it.
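Removing a departed customer from a central store means deleting their rows from every table that references them. The sketch below uses SQLite with hypothetical table and column names; a real system would also need to track down the customer in any downstream extracts and files.

```python
import sqlite3

# Hypothetical central store with a parent table and a child table
# that both carry the customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (7, 'A. Customer');
    INSERT INTO orders    VALUES (1, 7), (2, 7), (3, 8);
""")

def forget_customer(conn, customer_id):
    """Delete the customer from every table that references them."""
    for table in ("orders", "customers"):   # children first, then parent
        conn.execute(f"DELETE FROM {table} WHERE customer_id = ?",
                     (customer_id,))
    conn.commit()

forget_customer(conn, 7)
```

Because every data set draws from this one store, a single deletion here propagates to everything built on top of it, which is the advantage of centralization described above.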

Summary

Overall, data management and information security aren’t the most fun subjects, but understanding them can make working with data much easier and faster. Having novice data preppers learn and develop these skills within a controlled “sandbox” environment can help ensure the security and integrity of the data once they are granted access to the centralized production environment.
