The data elements that we capture in our transactional systems every day have enormous potential! We can generate interesting information to see whether we are still on track with our strategic objectives, how our business is doing, or how successful we are at what we do. With this gathered information and data we are able to take a variety of actions to improve our services, products, business operations, and so on. As described in Chapter 2, Unfolding Your Data Journey, we see issues with data quality and the need for data management in some form in order to advance in our data and analytics maturity.
Remember that the power of (big) data is not in the data itself but in how you use it!
Data management is the process of managing, maintaining, operationalizing, and securing your data environment. Data’s importance has grown significantly in recent decades; where we once thought data registers, metadata, or even data management were unnecessary, data now appears as an asset on company profit-and-loss statements and balance sheets. As a result, as previously stated, data has enormous potential and value. This makes sense given the rapidly increasing amount of data. When working toward high-quality dashboards and reports, we must consider our data environment.
To be honest, there are numerous books, methods, articles, and websites that discuss or describe data management, master data management, data governance, data quality, and other related topics. We’d like to share our hands-on approach to data management with you in this chapter. This way, regardless of how large or small your organization is, you will be able to address data management and take care of that important step in your data and analytics maturity and solve data quality issues from the start of your journey.
In this chapter, we will discuss the following topics:
As previously stated (Chapter 2, Unfolding Your Data Journey), when we create information for our organization using data from our transactional systems, we encounter a variety of issues such as poor data quality, missing data, varying definitions of elementary data fields, and so on. Although business intelligence (BI) is not the cause of this problem, its use makes it painfully clear.
One of the primary reasons we should address and work on data quality issues is to save money! According to Thomas Redman’s book Getting in Front on Data, poor data quality accounts for 50% of an average organization’s operational costs. He also claims that organizations can cut 80% of their operational costs by improving data quality.
Another reason is to improve process quality and thus service quality. Customers appreciate it when a process runs smoothly the first time, and we will have happy customers. Good data is required to keep processes running smoothly and to achieve the goals that they must handle. Working on data quality is thus an important aspect not only for process improvement and progress toward error-free processes but also for a good reputation. Lower costs and satisfied customers are not the only reasons. Data is an important part of company strategy today, and thus an important asset for the company. It is possible to achieve efficiency and effectiveness in business operations by generating insights from the available data environment. Poor data quality stymies the efficient and successful implementation of data-centric business strategies, ultimately costing a company a lot of money!
Figure 5.1 describes the five fundamental points that explain the true potential of data and how to fully utilize it:
Figure 5.1 – Chart showing when data becomes valuable
To determine a data strategy in which data is stored, described, and accessible, so that we can analyze it in a trustworthy manner, we must have processes in place for data quality, data consistency, privacy, and security.
Our systems are overflowing with data to support both primary and secondary processes. This information is stored in the underlying databases of our transactional systems. We do this to support the steps in our processes and to create a large history of data in our databases. When we start extracting data, some complicated issues can occur: think of differently formatted data elements, working with various platforms, or the frequency at which we can capture data. We need to take care that the extraction of data remains clean, consistent, and flowing. Unfortunately, data is not easily extracted from systems and converted into information. According to author Thomas Redman, a data element goes through two basic stages in its life:
When we examine data thoroughly, we discover that the majority of our registered data is never used again after it is stored in our systems. When the data is of high quality, we can use it for data-informed decision-making, planning, and business processes.
The quality of data becomes apparent only when it is used to generate dashboards and reports. As previously stated, we frequently see data quality issues arise when an organization begins extracting and visualizing data, and it becomes painfully clear that we must address the elements of data management and data quality in our data and analytics processes.
By talking to business users, you are able to determine whether or not a data element is relevant for dashboards and reports. You can do this with the following types of questions:
Improving data quality is only possible if you look for the causes of bad data. There are three factors that influence the quality of data:
We were able to do a project with various health care organizations a few years ago. This was a fun project because health care is a subject close to our hearts. We began with projects to truly understand what needed to be reported to authorities, as well as what needed to be improved in processes, and so on.
We began with some visualizations and reports. We discovered a data quality issue with some of the analyzed data fields from the initial visualizations and reports. To figure out what to look for, we talked with the product owner about which fields were primary and which were secondary. To concentrate on the fundamentals, we identified several tables that needed to be addressed and discussed on a daily basis. For example, if you want to register a patient in the system, you’ll need their date of birth, address, urgency of the visit, gender, and so on.
These fundamental elements are required for classification and reporting to internal and external organizations. However, the accuracy of register time, dates, and so on is also important. Because data quality was not mentioned previously, we proposed adding a section to the dashboard with several elementary tables that had inconsistencies in registration quality. Figure 5.2 shows an example:
Figure 5.2 – Example of a data quality report
By including those reports, organizations were able to discuss data quality in a productive manner, and as we discovered, the data quality of those elementary fields was not only discussed but also corrected every morning! As a result, they noticed that the quality of their registrations, visualizations, and reporting was improving.
One of the most amazing moments during those times was when one of our customers told us that they were proud to see the data quality improving and that the beginning of the day was the ideal time to discuss the actions to improve.
Many organizations want to get more returns from data and grow to a higher maturity level after delivering a solid data infrastructure and the first reports based on that new infrastructure. In practice, we see that data-informed decision-making does not succeed or even begin without the support of management and the entire organization.
Embracing data literacy will help you develop your decision-making skills, and people will learn to ask the right questions, interpret the findings, and take appropriate action. It is critical to give data management a prominent place within the data and analytics team or within your organization to ensure a steady supply of information. But keep in mind that data management is more than just buying a tool. It is, once again, a process that must be thoroughly set up in our data and analytics world. Figure 5.3 depicts the five steps in the framework that we use in our projects and the project approach. The data management framework will be described in detail in the following sections:
Figure 5.3 – The five-step approach of data management
When you are able to define and work on the displayed steps, you can actually begin (no matter how small the step) improving your data quality in a practical sense. Starting with the fundamental fields and adding them to your dashboards, you have the option of working to improve your data quality. Of course, the data quality reports must be discussed and actions taken; this is the human factor that is required to actually improve!
In the following section, we will go over the five steps in more detail.
Wanting everything is the same as wanting nothing at all. A data intake provides information on the most important technical and functional properties of a data element. By determining which data is required to measure organizational goals, you will only collect the information that is required for data-informed decision-making. This is the crucial part of creating and having a data strategy.
To be more specific, establishing a data strategy that is driven by the business strategy of an organization will help to guide all data management activities. A solid data strategy should also include people, processes, and technology to ensure that data as an asset is managed by the organization. To get attention and buy-in from the whole organization, the following steps could be considered:
Getting buy-in within an organization can be arranged in several ways. We have three main points for you that can be taken care of in an easy manner:
According to Bernard Marr’s book, Data Strategy:
As we discussed earlier in this chapter, it is no longer about the data we have at this point—it is more about your company and what you want to achieve with it. When you understand how data can assist you in activating your strategic objectives, you will be able to move forward.
Obtaining all of your data is not the way to proceed on your data journey. There are various types of data fields, and that is what you should consider before deciding what you want to use and how you want to use it. In Figure 5.4, we depict an information pyramid to illustrate how different data or information needs exist at each level of an organization. In Figure 5.4, the information and data streams are divided into three categories: Strategic, Tactical, and Operational.
Each of those levels has its own informational, and therefore data, needs. Decision-making takes place at all three levels but varies in the level of detail, the number of targets, and so on. The bottom level is the Operational level, where we will find fewer strategic decision-making reports but more detailed information on what to do today, what my personal scores are, and so on. When we move up to the Tactical and Strategic levels, more aggregated data is required. Although management needs to be able to ask their supporting teams to analyze the values and targets, for which detailed information is needed in the analysis section, the management team must be able to see at a glance how the organization, department, or team is performing:
Figure 5.4 – The information pyramid
A data vision describes a set of decisions and choices made by an organization in order to map out high-level actions to achieve high-level objectives. To develop a solid data vision, we must first understand what we require in terms of data usage, including where it is stored, collected, maintained, shared, and used. To actually succeed in meeting your data and analytics goals, we need common methods and formalized primary and secondary processes. These methods and formalized processes are essential for establishing your data vision and therefore support your data strategy. We can use some common steps to achieve this, such as identifying your data, where your data is stored, how you can get your data, and if you are able to combine and enrich your data. In the following section, we will go over the steps to determining your data strategy.
This crucial step, we believe, should be included in your measurement plan (functional design). When you describe the organizational goals, you should be able to locate the data you need to create visualizations or reports.
It is critical to determine what you require, how to obtain it, and whether you can use it. You will need to understand where it is stored, what is required to retrieve the data, and what the descriptions are (the metadata); if your organization is larger, you will perhaps also find data stewards and data owners.
As a general rule, if your organization truly views data as an asset and wishes to use it, your data strategy must ensure that the data can be identified, described, and used.
Saving your data in a secure manner is critical for your organization and your data and analytics environment. In practice, we see data stored in small, extracted data marts (created by any tool), using relational schemas, star schemas, and so on.
When developing a solid data strategy, we must consider the fact that data must be accessible and shareable. When we have it in a secure location that is described and accessible, there is no need to copy the data and perform our technical magic on it over and over again.
We need to be able to retrieve data from a single point of truth, as we explain to our students and customers. Why should we rebuild connections or transformations over and over again when doing so adds no value? The only thing that will happen is a massive increase in the number of dashboards and reports that contain various calculations that are not centrally designed or registered.
We also see a shift in which data is reused in more systems that support management decision support or business processes.
Data integration (DI) solutions and so-called data pipelines are now commonly used to combine and enrich data. We can store structured data from our transactional systems as well as unstructured data collected primarily from external environments in our data environments.
We see in our projects that DI is not defined as a specific role within some organizations, so there is no cohesion between teams for collaboration (everybody is focused on their own bit of information or specific data integration projects). As a result, the work is mostly dispersed throughout the organization. This is a risk that we should be aware of; as organizations grow in size, you will most likely require a dedicated person to oversee the data teams. This will be covered further in our People section.
We discussed four steps to define the way we think about the data strategy and where we should focus our efforts. Keep in mind that data must be shareable, for example by pulling it from a data warehouse or another type of data storage environment. We should always approach it from a business necessity rather than a technical or IT necessity. For extracting, integrating, and transforming data, the needs and desires of the business must always come first.
The good news is that we’ve seen some fantastic data integration solutions where we can go data shopping. We can select datasets from a data environment in a webstore-like environment using those tools.
When we can work with data and integrate it into our environments, we can discuss data quality and even justify targets for our data quality levels from a governance standpoint.
Only when the standard is known can data quality be measured. What are the content expectations, and is the field absolutely necessary? Is 80% sufficient? Remember that even when the systems are completely filled, that says nothing about the quality of the data. We can determine the desired business rules and standards through a series of structured workshops, interviews, or questionnaires. As a small example, data entry is done by operations, and that data is stored in our databases. When we have 100 fields that need to be filled by our operations, we could expect a data filling or storage rate of 90%. But data entry and storage is one thing—that could be 90%—while the quality is often much lower—as low as 60%.
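The difference between the filling rate and the quality rate can be made concrete with a small sketch. The field, the example values, and the validity rule below are our own invented illustrations, not taken from any specific system:

```python
# Hypothetical illustration: fill rate vs. quality rate for one field.
# The field values and the validity rule are invented for this example.

def fill_rate(values):
    """Share of records where the field is filled at all."""
    filled = [v for v in values if v not in (None, "", "unknown")]
    return len(filled) / len(values)

def quality_rate(values, is_valid):
    """Share of records where the field is filled AND passes the business rule."""
    valid = [v for v in values
             if v not in (None, "", "unknown") and is_valid(v)]
    return len(valid) / len(values)

# Example: Dutch-style postal codes, rule = 4 digits followed by 2 letters.
postal_codes = ["1234AB", "9999ZZ", "12AB", "", None,
                "5678CD", "0000  ", "4321EF", "unknown", "8765GH"]
rule = lambda v: (len(v.strip()) == 6
                  and v[:4].isdigit() and v[4:6].isalpha())

print(f"fill rate:    {fill_rate(postal_codes):.0%}")      # 70%
print(f"quality rate: {quality_rate(postal_codes, rule):.0%}")  # 50%
```

Note how the field looks reasonably filled (70%), while far fewer values actually meet the agreed standard (50%), which is exactly the gap we describe above.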
Determining the standard is a decision that must be made by management (support) in order to place data management on the strategic management agenda.
The end result will be that we will have a solid data strategy and will be able to measure our organizational goals. However, the human factor in this case is the next necessity that we must address, in order to identify why data is required, who is interested, and so on. As a result, we must consider that there are various types of roles within an organization, which are described in the following section.
People are required within an organization; they work with transactional systems to support our primary and secondary processes, and all of them could (and should) be interested in data and contribute to improving data quality. This depends on the person’s position within the organization and the type of work they perform. In this section, we will discuss the various roles that can be addressed if you want to improve data quality in your organization. Whether and how you implement these roles depends entirely on the organization. From our perspective, a larger organization could staff all of them, while a smaller organization could have fewer roles, or combined roles.
Who has an interest in improving data quality depends on their position within an organization:
When an organization hires a chief data officer (CDO) (usually in larger organizations), the CDO is ultimately in charge of data across the entire organization and is a member of the management team. They assist organizations in transitioning to a digital, but most importantly, information-driven way of working.
In the following section, we will go into greater detail about the roles of the data owner, data steward, and CDO, as we see a growing need for more organizations to professionalize their data strategy and the need for those typical data office roles.
The data owner is responsible for the data within a specific data domain, such as a data owner for HR data or facility management data. A data owner is responsible for ensuring that the information in their domain is properly managed across multiple systems and business activities.
As a result, the data owner must understand who the company is, what it wants, and how it is structured. The data owner understands the systems, which systems are used, who is responsible, and that data quality is critical in order to provide consistent and correct information to your end customers across your various channels.
The data steward is a broad-minded data specialist who acts as a liaison between IT and the business within the organization. A data steward’s responsibilities include ensuring the data’s correctness, completeness, integrity, and quality. In addition to the data owner, they play a specific role in discussing data quality topics with the employees of that specific department.
On the one hand, the data steward is a source of information for all types of data-related questions. On the other hand, with the available data sources, they are constantly searching for (new) possibilities.
They understand the meaning and reliability of data, as well as its applications. The data steward is familiar with the company’s processes, data sources, and the organization and its customers. They maintain the connection between customers and the organization by analyzing and presenting data in the best possible way.
The CDO is ultimately responsible for an organization’s digital transformation, but most importantly, for an information-driven way of working. Aside from being a data-focused manager, they are also a data evangelist who double-checks the figures and numbers on the dashboards and inspires the organization to work from the standpoint of data-informed decision-making. The CDO must ensure that data is managed correctly and oversees data management activities.
If this position is filled with sufficient mandate and decisiveness, the CDO will be able to quickly demonstrate their value to the organization. In short, the CDO is the driving force behind an intelligent and data-informed organization.
By telling the story over and over, you build support and spread the importance of good data. Make certain that everyone in the organization understands why you should work on data quality, which obstacles to expect, and where opportunities for improvement exist.
You are able to achieve this by doing the following:
A BI environment requires information that is consistent, trustworthy, and meaningful, and that has integrity. As a result, the data quality process aims to achieve and maintain high data quality, ensuring better information for your business.
If you want to start improving your data quality, we can categorize the data quality forces as follows: data discovery, profiling, rules, monitoring, correction, and quality reporting.
In Figure 5.5, we see the six-step model in more detail:
Figure 5.5 – The data quality process
When you begin gathering requirements for mapping out the required information, there are ways to design what is important for your organization based on your organization’s objectives (see Chapter 6, Aligning with Organizational Goals).
The next step is to identify the basic data fields for each process. What is important is that decisions must be made about whether or not to address data quality for a given field (for example, you tackle an elementary field but skip a less important one, perhaps because it is not used for decision-making). See, for example, the story of pre-washing in Chapter 2, Unfolding Your Data Journey.
Data profiling is a method for data analysis that can give a quick insight into the value, structure, and quality of data. In Figure 5.6, we introduce you to several techniques that are involved in data profiling:
Figure 5.6 – Data profiling
With the different techniques described, it is possible to generate insight to determine how far data deviates from the norm. Data can also be examined for quality issues. These signals define the scope of the following step: improving data quality!
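The profiling idea can be sketched in a few lines of plain Python. The records and field names below are invented for illustration; in a real project, you would typically run a profiling tool over the actual source tables:

```python
# A minimal data profiling sketch over invented example records.
from collections import Counter

records = [
    {"city": "Amsterdam", "age": 34},
    {"city": "amsterdam", "age": 34},    # inconsistent casing
    {"city": "Utrecht",   "age": None},  # missing value
    {"city": "",          "age": 212},   # suspicious age
    {"city": "Utrecht",   "age": 51},
]

def profile(records, field):
    """Basic profile: counts, missing values, distinct values, extremes."""
    values = [r[field] for r in records]
    filled = [v for v in values if v not in (None, "")]
    report = {
        "count": len(values),
        "missing": len(values) - len(filled),
        "distinct": len(set(filled)),
        "top_values": Counter(filled).most_common(3),
    }
    numeric = [v for v in filled if isinstance(v, (int, float))]
    if numeric:  # only meaningful for numeric fields
        report["min"], report["max"] = min(numeric), max(numeric)
    return report

print(profile(records, "city"))  # distinct count exposes the casing issue
print(profile(records, "age"))   # max exposes the outlier (212)
```

Even a tiny profile like this surfaces the typical signals: missing values, inconsistent spellings, and values that deviate far from the norm.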
In today’s data world, it is also possible to use algorithms to visualize data quality. This can be a fairly simple algorithm that depicts outliers in the data—for example, detecting peaks and troughs of customers based on the first few digits of the postal codes. Peaks and dips in the collected, stored data can indicate a data quality problem.
Or, from a data science perspective, a neural network (NN) that predicts the value of an attribute based on other data elements, such as an algorithm that predicts a salary based on the age of a person, how many hours per week the person works, and the job title. The next step is then to compare the predicted salary with the registered salary.
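A full NN is beyond the scope of this chapter, but the underlying idea, comparing a registered value against what we would expect from similar records, can be sketched very simply. The records and the 40% tolerance below are invented assumptions for illustration only:

```python
# Sketch of a predictive quality check (no real neural network here):
# flag registered salaries that deviate strongly from peers with the
# same job title. Data and the tolerance are invented for illustration.
from statistics import median

records = [
    {"name": "A", "job": "nurse",   "salary": 3100},
    {"name": "B", "job": "nurse",   "salary": 3300},
    {"name": "C", "job": "nurse",   "salary": 31000},  # likely a typo
    {"name": "D", "job": "analyst", "salary": 4200},
]

def suspicious(records, tolerance=0.4):
    """Return names whose salary deviates more than `tolerance`
    from the median salary of the same job title."""
    by_job = {}
    for r in records:
        by_job.setdefault(r["job"], []).append(r["salary"])
    flagged = []
    for r in records:
        expected = median(by_job[r["job"]])
        if expected and abs(r["salary"] - expected) / expected > tolerance:
            flagged.append(r["name"])
    return flagged

print(suspicious(records))  # record C stands out for manual review
```

A trained model would produce a better "expected" value than a simple peer median, but the comparison step, predicted versus registered, is the same.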
Another example that an algorithm can easily detect is duplicate customers, or duplicate place and province names: Province of North Holland, Prov Noord-Holland, Province of NH, and so on. When you are able to detect anomalies, you will be able to identify and correct them sufficiently. When an algorithm such as this is trained well, it is possible to filter out a large number of errors in no time; in this way, machine learning (ML) can make a radical difference. Data fields that should in any case be checked include the following elements so that a valid statement can be made with regard to data quality:
Which fields must be checked or are fundamental for your data-informed decision-making must be determined per process, source, dashboard, or report so that we can focus on the correct data quality aspects. Aside from the previously mentioned elements, the following controlling aspects must be considered:
A method for conducting your research could be as follows: determine what the quality issues are using data research. Investigate the data by performing, for example, a fault cluster analysis (based on eliminating possibilities that could cause a fault to occur). Alternatively, conduct an event analysis (which studies when the data is used and where errors can occur).
Then, determine the impact on the organization (what is fundamental and what is not).
Applying data rules can be done from two types of perspectives: a technical point of view and a business perspective.
From a technical perspective:
Intermezzo – a data quality issue causes problems
We created an amazing ServiceDesk application some time ago (back in 2008). After working on the project, we were able to analyze the data, from incident to machine, to see which software was installed on that machine and which provider we had to address our question to, which was amazing! The management was completely unfamiliar with the transition from high-level aggregations to such a detailed level. Not having to wait 15 days for reports from our administration office, but having direct insight and more proactive actionability on the first of the month, saved a lot of money at the time (and yes, a very positive business case!).
So, at the time, we were mostly walking around with a laptop, and we could easily show the facts displayed on the dashboard. However, if we wanted to know which machines had issues, we had to have the configuration ID (CI) of that machine, but if the CI was not a required field, you can imagine what happened during that time. However, by shortening the feedback loop, displaying the results, and preaching about the importance of data quality, we were able to help everyone understand why we needed to fill that CI. At the very least, data quality was on the team’s mind.
We ended up with a 99% correctly filled CI field in our system and were able to improve the machines that were causing problems.
Monitoring allows for insight into the developments and trends of data storage and data quality. It is possible to detect an increase or decrease in the number of fields to be registered by monitoring the storage of those fields that are leading, or elementary, for example. We have an example displayed in Figure 5.7, with a suggestion on how to track the data storage within the transactional systems. This enables structural issues to be addressed directly at their source. These measurements must be repeated on a regular basis, not only during management meetings but also in regular discussions with business users:
Figure 5.7 – Data storage monitoring
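The kind of tracking shown in Figure 5.7 can be sketched as a simple time series of fill rates with a drop alarm. The dates, rates, and the 5% threshold below are invented assumptions, meant only to show the mechanism:

```python
# Sketch of monitoring the fill rate of an elementary field over time,
# flagging days where registration quality drops. Data are invented.

daily_fill_rate = {          # date -> share of records with the field filled
    "2023-03-01": 0.92,
    "2023-03-02": 0.93,
    "2023-03-03": 0.81,      # a noticeable drop worth discussing
    "2023-03-04": 0.94,
}

def flag_drops(series, max_drop=0.05):
    """Return dates where the fill rate fell more than max_drop
    compared to the previous day."""
    dates = sorted(series)
    return [d for prev, d in zip(dates, dates[1:])
            if series[prev] - series[d] > max_drop]

print(flag_drops(daily_fill_rate))   # -> ['2023-03-03']
```

The flagged dates are exactly the ones to bring into the regular discussions with business users, so that structural causes can be addressed at the source.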
Data correction is the step of cleaning, organizing, and migrating data so that it is properly protected and serves its intended purpose. It is a misconception that data correction means deleting business data that is no longer needed.
With regard to data that does not meet the standard, it is necessary to discuss whether the data should be cleaned, enriched, duplicated, and/or standardized. Addressing these issues can be done partly with the help of automated conversion rules and partly manually. Nevertheless, it should also be discussed with the data owners so that they are able to address data issues with their teams. This also forms the basis for the design of the data warehouse and data quality reports, which we will cover next.
New solutions are on the rise; with those new techniques, you will be able to show the data in, for example, a table, highlight it, and give your dashboard and report users the opportunity to correct the data on the fly! The new technology helps to see and correct the data and write it back to the original source systems. This technique is called write-back. Figure 5.8 shows an example of correcting budget figures:
Figure 5.8 – Using a write-back functionality
With this new technology, we are able to address, discuss, and correct data quality even better, and all of this from a pragmatic point of view.
Tracking, reporting, and visualizing data, creating data flows, and monitoring data quality are all critical steps in reporting on data quality. This step requires determining which items can be converted automatically and which should be corrected by business users. Figure 5.9 is a standard data warehouse environment design with five steps that can be added in a simple and accessible manner:
Figure 5.9 – Standard data warehouse setup with quality checks
In this example, we have included five points that can be addressed to improve the quality of your data:
You can easily start setting up the processes one by one by organizing your data environment in such a way that data management has a place and a prominent role within your data and analytics team(s).
Monitoring, continuous improvement, and visualization in a data quality dashboard provide insights into the organization’s data quality development and trends. Any structural issues can be dealt with directly at the source. Figure 5.10 shows elements that we can track when our data environment is properly configured, as well as data elements for monitoring our data environment:
Figure 5.10 – Data quality dashboard
This screenshot contains elements that we have discussed in this chapter. These metrics will need to be discussed on a regular basis, not only during management meetings but also during regular discussions with business users. Only in this manner can we address data quality while also improving it.
Control typically refers to an organization’s integrated and controlled processing of data on a strategic, tactical, and operational level in order to achieve the desired quality and availability. The processes of organizing, cataloging, locating, storing, retrieving, and maintaining data are the subject of data management.
Control measures can be set up in various ways that mostly depend on the type of organization:
A data office is simply the team responsible for ensuring that the data within an organization best supports the organization’s objectives. At the highest level, the data office ensures that a comprehensive data vision and strategy are in place and being implemented. A data strategy is made up of three distinct topics:
This section contains several technical options for organizations. We decided to highlight some critical topics that organizations should address in order to advance in their data management or data quality processes.
To be able to register metadata (the glue that holds everything together) and thus speak a universal data language, data definitions must be recorded centrally. A data definition clarifies the meaning and naming of data, such as by storing (and using) table and attribute names and describing their meaning. A universal data language is analogous to two people (or systems) communicating; they understand each other better because they use common data definitions.
An important basis in data architecture is the recording of metadata in a metadata repository. This repository contains definitions of systems, datasets, concepts, data models, and data flows (data lineage) together. It is powered by the dictionary and a data modeling tool. The data architecture ensures that systems can exchange meaningful data and that systems, reports, and analyses build on good data.
It is important to address and understand what is going on during the first data and analytics projects. It is important to speak the same data language and understand the definition of a data field, but also, for example, how net revenue is calculated. We have seen many times in our projects that profit or revenue is calculated in several ways; therefore, it is necessary to focus on standard descriptions and definitions. The resulting concepts should be recorded in a dictionary, which can simply be a list of concepts or a thesaurus that also describes the global interrelationships between the concepts.
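A dictionary entry does not need to be complicated; it can start as a few structured records. The terms, definitions, and owners below are our own invented examples, not prescriptions:

```python
# A minimal data dictionary kept as structured records. In practice this
# could live in a spreadsheet or a dedicated tool; entries are invented.

data_dictionary = {
    "net_revenue": {
        "definition": "Gross revenue minus returns, discounts, and allowances.",
        "source_system": "ERP",
        "owner": "Finance data owner",
        "unit": "EUR",
    },
    "date_of_birth": {
        "definition": "Patient date of birth as registered at intake.",
        "source_system": "Patient registration",
        "owner": "Care administration data owner",
        "unit": "ISO 8601 date",
    },
}

def describe(term):
    """Look up a term; an unknown term is a signal to define it first."""
    entry = data_dictionary.get(term)
    if entry is None:
        return f"'{term}' is not defined yet - add it before using it in a report!"
    return (f"{term}: {entry['definition']} "
            f"(source: {entry['source_system']}, owner: {entry['owner']})")

print(describe("net_revenue"))
print(describe("gross_margin"))  # not defined yet
```

The point is not the technology but the habit: every field used in a dashboard has exactly one recorded definition, one source, and one owner.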
A data dictionary or company thesaurus can be completed in a variety of ways, including software solutions, Excel overviews, and solutions within data and analytics solutions. To explain it a little more in detail, we will tell you the story of one of our projects in the next intermezzo.
Intermezzo – data definitions are necessary!
We were working on a project a while back to create a new informational stream between two source systems. Those two source systems were created using different technologies. We were supposed to construct a direct message transport from the source system where the notification was received to the second source system. It was necessary to plan a visit and pass by someone to check and discuss some things from this second source system.
However, the addresses in the two systems did not match because they were interpreted and programmed differently in both systems. So, occasionally, a person passing by the client went to the wrong address, and the client was fined for not being home.
There was a connection between the governmental systems (where people in a municipality lived) in this source system, but a person could be officially registered somewhere but be living at another address.
In this case, we advised the project members and client to return to the drawing board and discuss the data fields that needed to be transported correctly from one system to the other.
A data modeling tool is required from an IT standpoint not only to keep track of the data pipelines but also to serve the right information areas with the help of a created data environment. We will not recommend any software because this book is not about tools and is completely agnostic. The tools out there in the world are amazing, and usable for all organizations. What you choose is determined by the type of organization and the tools that are already in place. To be honest, there is no such thing as a bad tool in the world; choose wisely and remember that it is fine to have more than one tool—it is just important to understand where each tool is used and what is built with it! You can then grow beyond your wildest dreams.
It is important to recognize that poor data quality or a lack of data management can lead to a number of issues. In addition, if you do not have a data vision or data strategy that supports your organizational objectives, your organization is likely to focus on the wrong (non-relevant) objectives. Having a data strategy and a clear vision of where you want to go with your data and analytics plans helps an organization advance in its data and analytics maturity.
We now have a better understanding of data, data management, and data quality after reading this chapter. We have provided you with a five-step framework that includes data strategy, data people, data processes, data control, and data IT steps.
The data quality process is divided into six steps: discovering what data you need, profiling, rules (cleansing, correcting, and so on), monitoring the filling of data elements in your source systems, correction, and, finally, the amazing part where we see that you can actually measure and report on your data quality. We concluded with the last section of this crucial chapter, in which we discussed IT. Remember—if your data foundation and data quality aren’t correct, your reports will never be!
In the following chapter, we will discuss organizational goals and how important it is to design, measure, and display them in such a way that an organization can track its strategic objectives.