15

Data Quality and Management

This chapter is all about making sure that you have high-quality data that is easy to access and use. Instead of focusing on policies, such as data governance, we will look at the standards and techniques used to make sure that you have the best data possible, because inaccurate data can only produce inaccurate results. First, we will go over what quality control is, when you should use it, and what sorts of things it checks. Next, we will talk about specific methods for quality control. Finally, we will cover the concept of master data management and when to apply it.

In this chapter, we’re going to cover the following main topics:

  • Understanding quality control
  • Validating quality
  • Understanding master data management

Understanding quality control

Quality control is the process of testing data to ensure data integrity. Here, we will go over when to perform quality control checks and what sorts of things these checks are trying to find. Remember, bad data leads to bad results, and bad results are worse than no results because they are actively misleading. It is important that your data is as accurate as possible, and to do that, you need quality control.

When to check for quality

While you will probably automate as much of the quality control process as you can, there are times beyond the routine when it is important to check the quality of your data. You may use different quality control techniques in different instances, but in general, you need to check your data any time there is a major change. Lots of things may qualify as a major change, but the most common are as follows:

  • Data acquisition

Data acquisition is whenever you get new data. This doesn’t just mean adding an observation to an existing dataset; it applies every time you or your company acquires a new dataset. Before you use this new data, you have to check it. You must, whenever possible, ensure that the data has not been gathered or entered in a way that inherently introduces bias. It can be the cleanest, prettiest dataset in the world, but if it was collected in a way that makes it inherently biased, there is little you can do. At best, if you know how it is biased, you can attempt to account for it; at worst, the data is useless. The other thing you need to do is check the current state of the data, which is a more standard quality control check.

  • Data transformation

Data transformations are any time the data is changed from one form to another. There are lots of things that count as transformations, including intrahops, pass-throughs, and conversions – for example, normalizing the data or reformatting a variable to match the format of a variable from another table. It can also be something as small as changing the units you use to report or changing the time zone. All of these transformations should be done in a new variable; try to avoid transforming the original data, or at least make sure the data is backed up. In this situation, you are mainly testing whether the transformation was performed correctly and that all of the data is still accurate; a short code sketch at the end of this section shows what such a check can look like.

  • Data manipulation

Data manipulation is when you change the shape of the data, not its content. You can think of this as data wrangling. It can include breaking one variable down into multiple variables, combining multiple variables into one, changing the level of the data, or anything else that makes the spreadsheet look different without actually changing the values. Things like combining variables or breaking them down look like you are changing the data, but the same data is there; it is just represented differently. For example, you have a variable, FullName, but you want to break it down into two variables, FirstName and LastName, or vice versa. The name is exactly the same; it is just stored in one cell or two. As with transformations, you are just making sure the manipulation went smoothly, so you are looking to see whether any new errors or mistakes were introduced (the sketch at the end of this section includes an example of this kind of check).

  • Final product

The final product is one more check before your report. After you have run your analysis, this is just going back and making sure you didn’t goof before your dashboard goes live to hundreds or thousands of people, or before you hand in the technical report you spent months on to the CEO.

These are the major times to check the quality of your data outside of routine maintenance, but these are not the only times you can check your data. If you are in doubt, it never hurts to double-check. It is better to be safe than sorry.
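To make the transformation and manipulation checks mentioned above a little more concrete, here is a minimal sketch in Python using pandas. The column names (FullName, weight_lb) and the conversion factor are hypothetical stand-ins; the exact checks you run will depend on your own data.

```python
import pandas as pd

df = pd.DataFrame({
    "FullName": ["Ada Lovelace", "Alan Turing"],
    "weight_lb": [150.0, 160.0],
})

# Transformation: convert units into a NEW column, leaving the original intact
df["weight_kg"] = df["weight_lb"] * 0.453592

# Quality check: the transformation should be reversible within rounding error
assert (df["weight_kg"] / 0.453592 - df["weight_lb"]).abs().max() < 1e-9

# Manipulation: split FullName into FirstName and LastName without losing data
df[["FirstName", "LastName"]] = df["FullName"].str.split(" ", n=1, expand=True)

# Quality check: recombining the pieces should reproduce the original values
assert (df["FirstName"] + " " + df["LastName"]).equals(df["FullName"])
```

The key idea is that the original columns are untouched and the checks simply confirm that nothing was lost or distorted along the way.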

Data quality dimensions

Okay, you now know when to check, but what exactly are you checking? The experts at CompTIA have identified a few key dimensions or specific things for which you should look. They are as follows:

  • Data consistency
  • Data accuracy
  • Data completeness
  • Data integrity
  • Data attribute limitations

Data consistency is making sure your data is the same wherever and however it appears. This can apply at different levels. Within a variable, consistency means the values are reported the same way every time. Within a database or broader system, it means that if the same data is stored in multiple places, the copies should match. No matter the level, you are effectively making sure the data is uniform and things are the same throughout.

Data accuracy is whether or not the data is correct. This is, as you might guess, important. However, it is not always the easiest thing to check. A lot of this comes down to how the data is collected, but whenever possible, it is best practice to check your data against an outside source, or to go back to the original source to make sure the information you have stored is accurate and that nothing has changed.

Data completeness is checking for gaps. These gaps can come in multiple forms. Occasionally, the gap is an entire variable that should have been collected with the rest of the data or that is required for a specific analysis. More often, you are looking for missing values within a variable that was collected. This goes right back to missing data, as discussed in Chapter 4, Cleaning and Processing Data.

Data integrity actually includes consistency, accuracy, and completeness, but it is slightly more than that. In certain fields, such as pharmaceuticals, data integrity has very defined rules and definitions, including making sure you know who entered every piece of data, when, and how. If you want to work with pharmaceutical data, make sure you look up these rules in detail. Luckily, for the exam, you only need to know that data integrity as a dimension is looking at consistency, accuracy, completeness, and security at a high level. This is less about specific data values and more about whether the process produces and maintains high-quality data.

Data attribute limitations are just another way of saying data constraints, as discussed in the previous chapter. Here, you are checking to make sure that only valid data can be entered in a given field.
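Several of these dimensions lend themselves to quick programmatic checks. The following is a minimal sketch in Python with pandas; the column names, allowed age range, and sample values are made-up assumptions for illustration, not part of any standard.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["NY", "ny", "CA", None],
    "age": [34, 29, 41, 250],
})

# Consistency: the same value should be reported the same way every time
inconsistent = df["state"].dropna().nunique() != df["state"].dropna().str.upper().nunique()

# Completeness: how much of each variable is missing?
missing_share = df.isna().mean()

# Attribute limitations (constraints): only valid values should be present
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]

print(inconsistent, missing_share["state"], len(out_of_range))
```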

Data quality rules and metrics

Data quality rules and metrics are guidelines created by the company, usually by whoever is in charge of data governance. These are, effectively, cutoff scores and format templates that state clearly which data is acceptable and which is not. A lot of these rules focus on conformity, or making sure that all data of a certain type is in a specific format with a specific size. In other words, conformity, or non-conformity, makes up the rules that govern data consistency. Other rules, such as rows passed or rows failed, give clear numerical thresholds for what is considered usable data and what isn’t. If enough rows meet the guidelines and expectations, the data can be used; if not, it can’t.
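As a rough illustration of how such a rule might be automated, here is a small Python sketch. The conformity rule (a five-digit postal code) and the 95 percent cutoff are invented for this example; in practice, your data governance team defines the real rules and thresholds.

```python
import pandas as pd

df = pd.DataFrame({"postal_code": ["12345", "98765", "ABCDE", "4321"]})

# Conformity rule: postal codes must be exactly five digits
passed = df["postal_code"].str.fullmatch(r"\d{5}").fillna(False)

rows_passed = int(passed.sum())
rows_failed = int((~passed).sum())
pass_rate = rows_passed / len(df)

# Metric: the dataset is usable only if at least 95% of rows pass
print(f"passed={rows_passed}, failed={rows_failed}, usable={pass_rate >= 0.95}")
```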

This gives you a general idea of what quality control is and when it should be applied. Next, we will look at a few methods for actually checking the quality of the data.

Validating quality

There are many ways to check the quality of your data, but there are a few forms that are more common than others. Let’s take a look at the most popular, which are also the ones you need to know for the exam. These include the following:

  • Cross-validation
  • Sample/spot check
  • Reasonable expectations
  • Data profiling
  • Data audits

Some of these are pretty self-explanatory, but let’s go into a little more detail for each.

Cross-validation

Cross-validation is a statistical technique that checks whether the results of another analysis can be generalized. It has many different uses. It can check how effective a data model is, particularly when you are checking for overfitting. Often, it is used to figure out what the hyperparameters of your model should be, which makes it a great tool for reducing test error. Cross-validation can be applied in several different ways to check the quality and function of your data and data models.
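As an illustration of the idea, here is a minimal cross-validation sketch using scikit-learn. The dataset, model choice, and five-fold setup are arbitrary assumptions made purely for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on four folds, test on the fifth, rotate, and average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Widely varying fold scores can hint at overfitting or unstable data
print(scores, scores.mean())
```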

Sample/spot check

Sample/spot checks are usually quick checks that focus on one or two data quality dimensions. They can be scheduled and held at regular intervals or after you change the data in any meaningful way, but often, these happen when you notice something unusual with the data while you are working with it. Maybe you get an error or a result that doesn’t make any sense and you go back to look at the raw data to see whether there is something wrong with the data itself.

Reasonable expectations

Reasonable expectations are simply whether or not the data makes any sense. This can come in a few different forms. Sometimes, this is just a gut check to see whether or not the data is anywhere near what you expect it to be. Occasionally, this is a more formalized process, where professionals decide what count as “reasonable” values, and an error or flag goes off every time the data goes outside of those boundaries. If you work for a company that races snails professionally, and all of the snails in the past have weighed between 3.66 g and 7.86 g, you should be concerned if, suddenly, all of the new values are over 1,000 g. Some major change has taken place, and it is far more likely to be an error than every snail growing past 1 kg, but you need to investigate.
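Sticking with the snail-racing example, a formalized reasonable-expectations check could be as simple as the sketch below. The padded bounds are an assumption for illustration; real boundaries would be set by people who know the data.

```python
import pandas as pd

weights_g = pd.Series([4.2, 5.1, 7.3, 1042.0, 6.8])

# Historical snail weights fall between 3.66 g and 7.86 g; pad the bounds slightly
LOWER, UPPER = 3.0, 9.0

# Flag anything outside the reasonable range for a human to investigate
flagged = weights_g[(weights_g < LOWER) | (weights_g > UPPER)]
print(flagged)  # 1042.0 g is almost certainly a data entry error, not a giant snail
```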

Data profiling

Data profiling is a formal process applied to entire databases that checks for data quality and problems. While we call this a process, it is usually divided into multiple processes that each check specific aspects of the database. There are several types, but the main three are as follows:

  • Structure discovery
  • Content discovery
  • Relationship discovery

These each focus on their namesake: structure, content, and relationships, respectively. Data profiling can check how efficiently your database is laid out, what types of data it contains, and even how the different data objects are connected. This kind of high-level check of database quality and function only comes from data profiling. Data profiling is important but is rarely performed by a data analyst; it is usually left in the hands of database administrators or other database specialists.
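Full data profiling is normally done with dedicated tools by database specialists, but you can get a feel for structure, content, and relationship discovery with a few lines of pandas. This is a toy sketch with made-up data, not a substitute for a real profiling tool.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": ["2023-01-02", "2023-02-14", "2023-03-01"],
    "plan": ["basic", "pro", "basic"],
})

# Structure discovery: which columns exist and what types do they hold?
print(df.dtypes)

# Content discovery: what values actually live in each column?
print(df.describe(include="all"))

# Relationship discovery (very roughly): could customer_id serve as a key?
print(df["customer_id"].is_unique)
```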

Data audits

Data audits check to see whether a dataset is appropriate for a specific goal. They are systematic and usually performed on a schedule. For example, products that release information on a weekly or monthly schedule will almost always have a data audit once for every release cycle. They should also be performed at every stage of the data life cycle. This process takes a lot of time and is often left to a quality assurance analyst or something similar.

Automated checks

Of course, it is best to automate the quality checks as much as possible. You will always need human eyes on it at some point, but the more you can automate, the fewer hours you have to spend checking and double-checking countless rows and columns. There are many different things you can automate to make your life easier, but the exam wants to point out two specifically:

  • Data field to data type validation
  • Number of data points

Data field to data type validation is simply an automated check to ensure that the values in a field are of the expected data type and size, or within the expected value range. Alternatively, this can be enforced as a data constraint, which stops inappropriate data from ever being entered into the field.

The number of data points is a way to automate completeness checks. If you have 1,000 observations in a dataset, but a specific variable only has 500 data points, that means that half of the values for that variable are missing. It is a quick check and easy to automate.
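Both of these automated checks are straightforward to script. The sketch below assumes a pandas DataFrame and a hypothetical table of expected types; adapt the names and expectations to your own data.

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3, 4], "amount": [9.99, None, 4.50, None]})

# Data field to data type validation: each field should hold its expected type
expected_types = {"order_id": "int64", "amount": "float64"}
type_ok = {col: str(df[col].dtype) == t for col, t in expected_types.items()}

# Number of data points: compare non-null counts against the number of observations
counts = df.notna().sum()
completeness = counts / len(df)

print(type_ok)        # {'order_id': True, 'amount': True}
print(completeness)   # amount is only 50% populated, which warrants a closer look
```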

There are many approaches to performing quality control checks, but there is no one perfect solution. The only way to ensure that you have access to quality data is to use a combination of these methods based on your particular circumstances. Now that you have an idea about how to approach data quality, let’s move on to master data management.

Understanding master data management

Master data management (MDM) is effectively the process of creating and managing a centralized data system. The idea is that you create a golden record, or a single source of truth. Having one place where all of your data is clean, standardized, consolidated, and up to date has many benefits. You end up with higher data quality, higher data integrity, cleaner data, and much faster and easier access to the data, and you can even format the data so that it is ready to report. Also, if you create new data objects, you can connect them to the golden record and automatically populate their fields with the appropriate data. You can even automate compliance checks. Effectively, you have the highest-quality version of the data on tap, ready to use by anyone who has access.

That said, MDM is not a practice that is used by every company. It sounds great, but it can also be a lot of work and expensive to set up. There are even variations, where MDM is only practiced on specific types of data. Maybe a company only has a golden record for their customer data or PII, and everything else is left separate. Alternatively, they can have multiple golden records, each for a different department or data type. Setting up and maintaining MDM is not the easiest thing to do, but there are specific tools and software to help.

When to use MDM

While many companies use MDM for its general benefits, there are specific times when it really comes in handy. These times are as follows:

  • Mergers and acquisitions

Mergers and acquisitions happen when two companies combine, or one company buys another company out. Times like this are always a struggle when it comes to data. The two companies probably had very different data, formatted differently and stored in databases with very different structures. MDM helps by creating a single, consistent record from these two sources that is easy to access and use.

  • Policy

Policy is in reference to compliance. Having all of the records nice, neat, and in one place makes it much easier to check to see whether you meet all regulations and restrictions. This is especially important when you deal heavily with protected data such as PII, PHI, or PCI. If you work with a lot of protected data, or data with several regulations, MDM is a great way to protect yourself.

  • Streamlining data access

Streamlining data access means that you can get the data you need faster. All of the data you need can be accessed from a single table. No joins or fancy queries are required with MDM. It just makes it easier.

Processes of MDM

Okay, you have a rough idea of what MDM is and when it is useful, but let’s talk briefly about what the process looks like. While it can be a long and complicated process, we’ve broken it down into a few simple, generalized steps:

  • Consolidation

Consolidation is the actual creation of the golden record. Here, you are combining data from multiple sources into one place. Note that the separate original tables still exist, and any updates to the golden record automatically update those underlying data sources. This way, if you have the same variable in multiple places, you don’t need to update each copy separately or worry that they might not be consistent. Updating the golden record will update every data source so that they all say the exact same thing.

  • Standardization

Standardization is making things uniform. This can be standardizing field names between tables, standardizing units, standardizing formats, or even standardizing data entry regulations. All of your data is on the same page literally, so it is time to get it on the same page figuratively. This makes sure that all of your data works together and can be used as a whole.

  • Data dictionary

Data dictionaries are documents that give definitions and attributes for every variable in a dataset. They can also include information on structure, relationships, and data organization. Data dictionaries are one of the most important tools a data analyst can have, and they should be used whether or not you practice MDM. Databases are rarely the property of one person. Often, you have multiple people using a database, or at the very least, the database will be handed off to someone else when you leave. Without proper documentation, no one will understand what the data is or how it should be used. This is especially important if your data uses shorthand for variable names. For example, RNSVBQ17 means absolutely nothing by itself, but a dictionary can tell you that this variable holds patient responses to the Registered Nurse Survey Version B Question 17. Hopefully, it will even tell you what that question was and how it relates to other variables. Data dictionaries can be handwritten notes or software-generated documents; it doesn’t matter, as long as one exists.
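A software-generated data dictionary can start out as small as the sketch below, which pulls variable names, types, counts, and human-written descriptions into a single table. The variable names and descriptions here are hypothetical; the point is simply that the document exists and travels with the data.

```python
import pandas as pd

df = pd.DataFrame({"RNSVBQ17": [3, 4, 2], "patient_id": [101, 102, 103]})

# Human-written descriptions for each variable (the part software cannot guess)
descriptions = {
    "RNSVBQ17": "Patient response to Registered Nurse Survey Version B, Question 17",
    "patient_id": "Unique identifier assigned at intake",
}

# Build a simple data dictionary: one row per variable
dictionary = pd.DataFrame({
    "variable": df.columns,
    "dtype": [str(df[c].dtype) for c in df.columns],
    "non_null": df.notna().sum().values,
    "description": [descriptions.get(c, "") for c in df.columns],
})

dictionary.to_csv("data_dictionary.csv", index=False)
```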

Summary

This chapter has gone over some important information for ensuring the highest possible quality of your data. First, we covered what quality control is, when it should be applied, and what sorts of things it looks into. Then, we looked at some common methods and approaches to quality control. Finally, we went over what MDM is, when it should be applied, and the processes behind it.

Congratulations! This wraps up the educational content of this book. You have now covered every concept that will be on the exam. There is a lot of information covered in this book, so feel free to go back and review any sections with which you had any problems. Next, it’s time for the practice exams!

Practice questions

Let’s try to practice the material in this chapter with a few example questions.

Questions

  1. Which of the following is an appropriate time to check the quality of your data?
    1. After data manipulation
    2. After data transformation
    3. Before the final report
    4. All of the above
  2. Which of the following is a data quality dimension?
    1. Data completeness
    2. Data retention
    3. Rows passed
    4. Data manipulation
  3. Which of the following is a structured formal process for identifying the quality and efficiency of an entire database?
    1. Cross-validation
    2. Spot check
    3. Reasonable expectations
    4. Data profiling
  4. Over 9,000! is a power company that is rapidly growing in its area. Recently, it purchased another power company that was one of its major competitors. This is an appropriate time to institute MDM. True or false?
    1. True
    2. False
  5. Creating a document that explains what the variables in a dataset are, how they are used, and how they connect to one another represents which part of the MDM process?
    1. Data audits
    2. Consolidation
    3. Data dictionary
    4. Standardization

Answers

Now we will briefly go over the answers to the questions. If you got one wrong, make sure to review the topic in this chapter before continuing:

  1. The answer is: All of the above

These are all circumstances where you should check for quality.

  2. The answer is: Data completeness

Data completeness is the only data quality dimension listed. Data retention is a section in the data use agreement, rows passed is an aspect of a data quality rule, and data manipulation is a general concept of data shaping.

  3. The answer is: Data profiling

While these are all methods of validating quality, data profiling is the only structured approach to checking the quality of an entire database.

  4. The answer is: True

One company buying another is called an acquisition, and acquisitions and mergers are one of the major reasons for MDM. This is because you have two completely separate datasets, one from each company, and you have to merge them into one dataset that is still somehow usable.

  5. The answer is: Data dictionary

A data dictionary is, as you would expect, a dictionary of your data. It gives definitions for every variable, as well as how they are used and how they relate to other variables. These are very useful tools and an important part of the MDM process.
