Chapter 1

Purpose, Scope and Audience

Abstract

This book describes an approach to making data governance repeatable, reliable, and cost-effective in your organization. Our approach uses a Playbook, a play-by-play description of how to set expectations and perform data governance jobs right the first time. Before we dig deeper into the Playbook, we highlight the impact of data issues on organizations, industries, countries, and individuals. While many of the stories were taken from recent headlines and seem like once-in-a-million events, some of the stories describe recurring themes. These stories illustrate the type of data governance issues the Playbook was designed to address and reflect real problems that can ultimately cost an organization money or its reputation.

Keywords

Data analytics; Data governance; Data management; Playbook; Reference data
 
This book describes an approach to making data governance repeatable, reliable, and cost-effective in your organization. Our approach uses a Playbook, a play-by-play description of how to set expectations as well as perform data governance jobs right the first time.
Before we dig deeper into the Playbook, we want to highlight the impact of data issues on organizations, industries, countries, and you. While many of the stories were taken from recent headlines and seem like once-in-a-million events, some describe recurring themes. These stories illustrate the type of data governance issues the Playbook was designed to solve and reflect real problems that can ultimately cost an organization money or its reputation.

Spotting the Need for Data and Analytics Governance: Industry Examples

The examples we cover are a sample; new stories and situations surface every day. Data governance conferences, books, and articles abound with stories from the trenches. Stories describing issues and solutions have been the primary way to describe approaches to data governance because, until now, there have been few best practices that have worked well across multiple organizations. The examples include a few stories around the industry of “you.” Often, the best way to understand the impact of something in our professional lives is to understand the impact that it has on us.
The general concept of “data” is simple to understand at a high level. But as soon as someone actually needs to work with data to perform their job, knowing how to use data quickly becomes difficult and requires additional skills many people have not yet acquired. Similarly, while you may think that writing policies for data usage and access should be simple, understanding all the ways that data is used at an organization makes it difficult to write an effective policy.

Industry: You and Your Country

Target, a major retailer, was hit hard by a data breach in 2013. Fraudsters were initially thought to have made off with the payment data of 40 million customers and the personal information, such as phone numbers and email addresses, of 70 million customers. Shortly afterward, the CIO and then the CEO were replaced.
Neiman Marcus, a major retailer, was also hit hard. Hackers moved through its systems unnoticed for eight months. The number of customers exposed was initially estimated at 1.1 million; that estimate was later reduced to 350,000. As in the Target intrusion, malware was installed directly on payment terminals.
Hold Security, based out of Milwaukee, claimed that a Russian hacker group stole 1.2 billion usernames and passwords. While it is believed that the hackers plan to use the information to send out health-products email spam, the full extent of the theft is not yet clear. Spamming our mailboxes may not seem like it has a large impact, but spam filters are not perfect, and if the hackers sell the lists to companies, you may receive email that must be manually deleted. If each affected person spends 1–2 s identifying and deleting a spam email that slipped through the filters, over 600,000 h of time is wasted across people just like you. That is equivalent to an organization of over 300 people working an entire year doing nothing but deleting email.
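To put the arithmetic in perspective, here is a back-of-envelope sketch in Python; the 1.2 billion affected users, 2 s per email, and 2000 working hours per year are assumptions drawn from the story above, not measured figures.

    # Back-of-envelope estimate of time lost to one extra spam email per affected user.
    affected_users = 1_200_000_000      # credentials reportedly stolen
    seconds_per_email = 2               # assumed time to spot and delete one spam email
    work_hours_per_year = 2_000         # roughly 40 h/week for 50 weeks

    total_hours = affected_users * seconds_per_email / 3600
    person_years = total_hours / work_hours_per_year

    print(f"{total_hours:,.0f} hours, about {person_years:,.0f} person-years")
    # Roughly 666,667 hours, or an organization of 300+ people for a full year.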
Facebook claims over 1 billion active users and is one of the largest social networking sites in the world. It has faced constant criticism over its privacy policies. Several lawsuits and complaints to the Federal Trade Commission (FTC) have highlighted that Facebook’s policies allow the routine use of members’ images and names in commercial advertising without consent. Facebook has updated its Statement of Rights and Responsibilities and Data Use policies. It was later disclosed that Facebook ran experiments on its members, without consent, to test the impact of user-interface changes on members’ emotions and actions. Facebook noted that recent updates to its data use policy clarified the language in the policy but did not actually change the policy that allows it to perform these experiments.
After a lengthy lawsuit in the European Union, Google launched a “right to be forgotten” webform to handle removal requests. European citizens can request that links to information about them be removed from search results. The webform is the first step toward supporting the court ruling; Google says the form is used to evaluate requests, and the process for how to comply with the ruling will come later. Factors such as the public’s need for the information and its timeliness will be considered for each request. Other search engines such as Bing and Yahoo must follow suit.
All of these stories highlight issues that affect you directly:
• You have provided a substantial amount of personal and financial information, either explicitly or implicitly, to many different types of organizations. Most of these organizations have different privacy and data use policies and different levels of maturity in implementing them.
• Nearly all companies have explicit data policies. You may not be aware of them, you may disagree with them, or they may be hard to understand.
• At least 1 billion people knowingly share a significant amount of detailed information that can be used by fraudsters for identity theft, or that can be easily stolen. At the very least, unauthorized access to this data could cause you pain and suffering.
• The impact on both you and the country can be huge when your customer data is not well managed.

Industry: Manufacturing

An international computer manufacturer operates direct and indirect sales channels. Manufacturers have generally focused on operational quality metrics as the key management model – ie, manage by the numbers. Manufacturing is amenable to metrics-based management because a vast majority of supply chain processes are automated or electronically monitored. The era of Deming and other quality gurus raised the bar not only by lowering costs but also by improving product and process quality. Improving quality often also lowers costs, which leads some business managers to report that quality is free.
When an organization has a direct sales channel to consumers, it should also focus on ensuring the customer experience is optimally tuned. The customer experience is often directly tied to product quality and supply chain variability. For example, a manufacturer delivering a laptop to a customer may care about the following types of metrics:
• Shipping target variance
The percentage of times that you ship to a customer within a certain time window.
• Shipped product defect rate
The rate at which a customer requires or perceives to require parts, service, or system replacement within the initial ownership period, say 60 days.
• Missing or wrong rate
The rate at which the customer receives or perceives to have been sent the wrong order or did not receive the order.
• Service delivery on-time percentage and first-time resolution percentage
The percentage of service calls where the technician arrived within the contractual or committed hours.
The percentage of service calls where the technician resolved the issue on the first visit.
These customer experience metrics reflect what customers have said impact their economic behavior over time. However, similar to all customer experience metrics, creating these metrics, setting targets, and monitoring them requires substantial amounts of integrated data from many different groups. Identifying and defining these metrics may take a substantial amount of time and may involve subcalculations with many nuances.
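As an illustration of how much definitional work hides inside even one metric, the following sketch computes an on-time shipment percentage under two hypothetical rules for handling missing ship dates. The record layout and field names are invented for the example; they are not taken from the manufacturer in the story.

    from datetime import date

    # Hypothetical order records; real data would come from order-entry and shipping systems.
    orders = [
        {"order_date": date(2015, 3, 1), "ship_date": date(2015, 3, 4),  "promised_days": 5},
        {"order_date": date(2015, 3, 2), "ship_date": None,              "promised_days": 5},  # missing ship date
        {"order_date": date(2015, 3, 3), "ship_date": date(2015, 3, 10), "promised_days": 5},
    ]

    def on_time_ship_pct(orders, exclude_missing_dates=True):
        """Percentage of orders shipped within the promised window.
        The rule for missing dates is exactly the kind of decision a
        business owner, not IT, has to make."""
        evaluated, on_time = 0, 0
        for o in orders:
            if o["ship_date"] is None:
                if exclude_missing_dates:
                    continue           # drop from the denominator
                evaluated += 1         # or keep it and count it as late
                continue
            evaluated += 1
            if (o["ship_date"] - o["order_date"]).days <= o["promised_days"]:
                on_time += 1
        return 100.0 * on_time / evaluated if evaluated else 0.0

    print(on_time_ship_pct(orders))                                # 50.0 (missing dates excluded)
    print(on_time_ship_pct(orders, exclude_missing_dates=False))   # 33.3 (missing dates counted as late)

The same dataset yields two different "on-time" results depending on a single business rule, which is why the questions listed next must be answered before the metric means anything.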
Here are some examples of the types of issues that may occur when defining these metrics:
• Are the metrics gathered at the individual component level (eg, monitor, computer, printer) or at the customer level? What about smaller peripherals?
• If products have staggered ship orders, how is this evaluated?
• Are all countries included in the calculation? Should South America be included in North America’s rollup?
• Which orders should be excluded from the calculation? What if an order is missing key date fields such as order entry date or ship date?
• What should the time window be for shipping target? Promised date or a general level of service commitment such as 5 days? What if the customer asked for a ship and delivery date outside of the 5 days?
• How do you count business days versus nonbusiness days? Do you count holidays? What if there was corporate work during a holiday, such as the end-of-year holiday?
• What defines a ship date? When the factory order entry system provides a specific code? Or should it be the invoiced date?
• Should leased systems be counted?
• Should these metrics be tracked on product exchanges?
• Should you count employee purchases? How do you define an employee?
• If we are using a percentage, should the denominator be a fiscal quarter or a month?
• Who maintains the table for “field incidents” that is used to determine service delivery events?
• What hierarchy should be used for reporting the metrics?
As the metrics need to be sliced and diced, the number of combinations between the various factors such as hierarchies and master data becomes large (Fig. 1.1). Merely trying to establish a common vocabulary, a first step, can become a huge hurdle.
Figure 1.1 Key hierarchies and master data involved in customer experience reporting.
Here are some key thoughts to extract from this story:
• Most of these questions need to be resolved by business managers or executives.
• All of these questions require a substantial amount of effort around defining, locating, and determining the data’s “readiness” to calculate the metrics.
• In addition to the actual “event” data, there is a substantial amount of related data (eg, country hierarchy, product hierarchy, and lists of valid product codes).

Industry: Financial Services

Financial service firms provide many complex services to customers. In addition to standard deposits and loans, financial service organizations engage in global money management and movement activities. By helping create liquidity in the financial marketplace, financial service firms provide service to individuals, companies, countries as well as the marketplace itself.
As drugs, terrorism, and human trafficking have risen to the forefront of global politics, new policies have been directed at squeezing the money supply behind these acts. Some estimates put the total amount of “bad” money flowing through the global economy at close to 1 trillion US dollars annually. If “bad” money were a country, its total Gross Domestic Product (GDP) would rank among the largest economies in the world.
A money launderer takes money produced from illegal activities and cleans it so that it can be used to purchase legitimate assets such as a house or car. Today, even clean money is a problem. Some donors send their clean money to bad people. For example, clean money, such as revenue generated from selling cigarettes, can turn dirty if the cigarettes were smuggled into high-cigarette-tax states to generate higher profits. Sometimes, clean money is wired (electronically transferred) directly to terrorists. The entire money movement process is global, complex, and a source of concern for many countries. Ultimately, money in the hands of “bad” guys comes back to haunt sovereign states.
The Bank Secrecy Act of 1970 as well as the USA PATRIOT Act of 2001 created authority and policies to reduce financial institutions’ role in funding terrorist activities. As part of the broad anti-money laundering agenda, multiple US government agencies, working in concert with other countries, enacted laws and imposed regulations to support anti-money laundering detection and prosecution.
Federal enforcement actions against banks have increased. Banks have been hit with consent orders and penalties from the Office of the Comptroller of the Currency (OCC), the Financial Crimes Enforcement Network (FinCEN), and other federal and state regulators for material weaknesses in their Anti-Money Laundering (AML) and Bank Secrecy Act (BSA) programs (the BSA is also known as the Currency and Foreign Transactions Reporting Act of 1970). OCC consent orders must be resolved or the bank will be shut down or severely restricted. The list below shows some recent fines as documented on the FinCEN website:
• JPMorgan Chase: $461 million
• TD Bank: $37.5 million
• First Bank of Delaware: $15 million
• TC National Bank: $10 million
• HSBC: $9.1 million (multiple assessments)
• Saddle River Valley Bank: $4.1 million
Recently, regulators have started aggressively pursuing both corporate penalties as well as personal penalties against company officers in order to create an environment of deterrence.
There are specific regulatory requirements to ensure that a bank is not doing business with a money launderer. Financial institutions must perform checks, which are called “controls.”
For example, when a bank opens an account, it must perform Customer Due Diligence (CDD) to determine if the account is legitimate. If the due diligence results are negative, a bank can and should decline to open the account and may even close an active account. CDD is part of a broader Know Your Customer (KYC) theme. KYC involves gathering significant amounts of information about a customer and using that information to assess customer risk.
Another example of a control is the set of steps taken before an electronic wire transfer can be sent. The receiver’s name, country, and other information must be validated. When a wire is created, the name and destination country are listed on the wire instructions. It should be easy to match the name and destination against a known sanctions list, but some banks use multiple internal lists to represent country codes, and these lists sometimes conflict. Various international groups such as the International Organization for Standardization (ISO) have created “master” lists of frequently used country codes. However, the ISO list is not always quickly updated, and country names can appear or change before the ISO list reflects them. Furthermore, the same two- or three-letter code can map to different countries on an ISO list and a non-ISO list.
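The following sketch illustrates the kind of reconciliation check a bank might run between an internal country-code list and a reference list; the codes, names, and mismatches shown are invented for illustration, not taken from any real sanctions or ISO list.

    # Illustrative only: flag internal country codes that disagree with a reference list.
    iso_codes      = {"US": "United States", "MM": "Myanmar", "CD": "Congo, Dem. Rep."}
    internal_codes = {"US": "United States", "BU": "Burma",   "CD": "Congo"}

    def reconcile(internal, reference):
        """Return codes missing from the reference list or whose names differ."""
        issues = []
        for code, name in internal.items():
            if code not in reference:
                issues.append((code, name, "code not on reference list"))
            elif reference[code].lower() != name.lower():
                issues.append((code, name, f"name differs from reference: {reference[code]}"))
        return issues

    for issue in reconcile(internal_codes, iso_codes):
        print(issue)
    # ('BU', 'Burma', 'code not on reference list')
    # ('CD', 'Congo', 'name differs from reference: Congo, Dem. Rep.')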
Another control is used when assessing commercial banking services. Similar to the due diligence performed on an individual, a bank wishing to do business with a company must understand who that company is and how it generates money, and must assess its AML risk. For example, a company that is a money services business or a “cash”-oriented business, like a cleaning or massage service, will have a higher AML risk score. Validating a company’s “industry affiliation” is an important step in the process. Like country names, industry affiliations are kept on standardized and nonstandardized lists. Similar to the country name and abbreviation issue, industry affiliation can be misunderstood. FinCEN recently assessed a civil money penalty on Mian, doing business as Tower Package Store, because it determined, with help from the Internal Revenue Service (IRS), that Mian was operating as a money services business. A favorable industry affiliation can sometimes dramatically reduce the risk score in some risk models and create a false sense of security.
In addition to controls, simply defining the terms used in the policies can be difficult. In 2012, FinCEN was preparing to issue updated guidelines on its policies and, per its policy, solicited feedback on the changes before finalizing the guidelines. The policy update was fairly explicit, but there were areas that needed further clarity. In its public feedback, the American Bankers Association (ABA) expressed that more definition was needed:

…And, within each type of legal entity there are many variations among the states. Therefore, ABA believes and strongly urges FinCEN to create a chart or outline of the many different entities that exist and then indicate the perceived risks for each type of entity and the appropriate steps that should be taken. Otherwise, the expectations are too vague to be workable and may not reflect actual levels of risk. Clearly, with the great spectrum of varying legal entities, one size fits all approach cannot begin to work.

Here are some key thoughts to extract from these stories:
• The financial impact of not performing a function, such as BSA compliance and fraud prevention, can be large and material. Many of these functions are now data driven.
• Changes in policies at an industry level can have a large and sometimes confusing impact on individual companies. Running a healthy communication and consensus process is important so that the intent of the regulation is understood and can be effectively implemented.
• There are substantial amounts of change management issues associated with data including timeliness and correctness.
• A few, critical data elements can make the difference between doing something well, such as following BSA regulations, or putting your organization at serious reputational and financial risk.

Industry: Healthcare

The healthcare industry is undergoing massive change. The Patient Protection and Affordable Care Act (ACA) changed the risk, reimbursement, and power structure in the industry. Rapid consolidation of physician groups is leading to stronger pricing resistance against insurance companies as physicians fear future reimbursement reductions. Overall, healthcare spending represents roughly one third of total government spending today. Anything that touches healthcare touches a large part of the US economy.
Health information is private information. A broad set of policies has been enacted in the healthcare industry to enhance protection of personal healthcare information (PHI). Title II of the Health Insurance Portability and Accountability Act (HIPAA) established national standards for electronic healthcare transactions and national identifiers for providers, health insurance plans, and employers. It also established civil money penalties for violations of its provisions.
The healthcare industry, often led by the federal government, creates initiatives to accelerate value delivery, with special emphasis on high-cost chronic diseases. For example, the National Institutes of Health (NIH), part of the U.S. Department of Health and Human Services, sponsored the cancer Biomedical Informatics Grid (caBIG) project, now retired. Focused on improving cancer outcomes, the program accelerated investment and capabilities along several fronts. One focus area was improving access to and use of cancer data, including extensive amounts of PHI. It has been and continues to be a challenge to share data in healthcare settings due to HIPAA and other policies. Because cancer funding is highly fragmented and spread across many states, cancer-related data is created and housed in many different research, government, and commercial locations. To make this data more sharable, the caBIG project worked to make data dictionaries and medical taxonomies broadly available and to make running queries across distributed databases as easy as using a spreadsheet. This makes data more accessible and usable to all cancer researchers.
The caBIG program funded projects to establish common vocabularies. Common vocabularies help researchers identify data for their queries. A word describing a condition used in one location needs to mean the same thing in another location. Medicine is still largely implemented and practiced locally with localized standards and practices. Merely identifying data of interest to a particular research query is the first step. There are many cultural and policy impediments to moving and using healthcare data in consolidated facilities. These impediments reduce the ability to run integrated analysis across larger datasets to improve analysis results. Additional caBIG-funded projects created tools to allow a cancer query to execute on a local database and return results to the query requester. The highly fragmented state of healthcare research databases coupled with privacy restrictions has led to greatly reduced access to data for analysis and significantly higher healthcare costs.
Understanding what data is available and its fitness for use is an important step in analyzing data to answer questions. The Centers for Medicare & Medicaid Services (CMS) was ordered to provide broader access to the data it collects as part of its mission of administering Medicare. CMS made a dataset available that describes reimbursement dollars and prescription counts for physicians who provide Medicare services.
Once the data was released, many articles were published by experts to explain the data. For example, it was found that some physicians had prescribed more expensive drugs even though proven, lower-cost alternatives were readily available. The data also suggested that some physicians wrote excessive prescription counts, high enough that it was nearly impossible to correlate the counts with any plausible patient schedule. It was later explained that prescriptions are often attributed to a department head or another physician manager rather than to the physician providing the actual care.
These datasets were also truncated. Due to HIPAA concerns, physicians who had fewer patients than a threshold were excluded because it is possible to infer actual patient specifics from very small datasets.
The cost of combining data together to answer healthcare queries can be high. As part of the desire to improve clinical analysis, the healthcare industry decided to migrate from a specific set of industry classification codes to a newer set. The industry is in the process of moving to the ICD-10 standard from the current ICD-9 standard. The ICD codes describe diagnosis and procedures performed while delivering healthcare services. The diagnosis codes are the boxes on the form your healthcare physician fills out during a doctor’s visit. ICD-10 codes are more descriptive than ICD-9 and allow analysts to perform more fine-grained analysis between the diagnosis, the procedures used to treat the underlying issue, and eventually outcomes. ICD-10 also enhances the ability to analyze patient data by episode of care (longitudinal analysis) versus one-off interactions with the healthcare industry. Even though it may seem like a fairly straightforward process to migrate from ICD-9 to ICD-10, the road has been very difficult. The industry started moving to ICD-10 a decade ago. Fig. 1.2 summarizes how ICD-9 and ICD-10 differ.
In addition to the cost of migrating technical systems being much larger than expected, it has also been found that changing diagnosis codes changes reimbursement levels, because the reimbursement rules are based on these codes. Changes affecting reimbursement are highly susceptible to political forces and to conservative, incremental change. As you might expect, there have been multiple delays in implementation deadlines and increased costs, as well as extensive and advanced financial analysis and modeling around the changes. Healthcare organizations simulate code changes based on their current book of business (their current customers and transactions) and project the impact on future reimbursement under the new codes. Based on the simulation, they provide feedback to the government. Then the cycle repeats. If reimbursement goes down, negotiations around reimbursement become intertwined with the technical changes, causing further delays and costs.
Figure 1.2 ICD-9 versus ICD-10.
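To make the simulation idea concrete, here is a deliberately simplified sketch of remapping claims from old to new codes and comparing projected reimbursement. The code mapping, rates, and claim counts are invented, and real ICD-9 to ICD-10 mappings are frequently one-to-many rather than one-to-one as assumed here.

    # Simplified sketch: remap claims to new codes and compare projected reimbursement.
    icd9_to_icd10 = {"250.00": "E11.9", "410.9": "I21.3"}      # hypothetical 1:1 mapping
    reimbursement = {
        "icd9":  {"250.00": 120.0, "410.9": 900.0},            # invented rates per claim
        "icd10": {"E11.9": 115.0, "I21.3": 925.0},
    }
    claims = [("250.00", 40), ("410.9", 5)]                    # (code, count) in the current book of business

    old_total = sum(reimbursement["icd9"][code] * n for code, n in claims)
    new_total = sum(reimbursement["icd10"][icd9_to_icd10[code]] * n for code, n in claims)

    print(f"current: ${old_total:,.0f}  projected: ${new_total:,.0f}  delta: ${new_total - old_total:,.0f}")

Even in this toy version, a small rate difference on a high-volume code changes the projected total, which is why reimbursement modeling and the technical migration become intertwined.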
Here are some key thoughts to extract from these stories:
• Healthcare data is complex and difficult to analyze because of the nature of healthcare issues as well as the set of policies that have been enacted to meet society’s privacy needs.
• Changing access to healthcare data could significantly affect healthcare costs.
• The need for innovation in certain data areas, such as providing access to data while still honoring privacy needs, is important to improving outcomes.
• Understanding what data exists where and its appropriateness to answer questions are key ingredients in increasing healthcare value delivery.
• The cost of changing data, such as ICD codes, can be substantial. Even small changes can trigger waves of efforts and anxiety. In general, trying to make large changes all at once can take a long time. Migrating to ICD-10, still in progress, has taken several years.

What This Book Is About and Why It Is Needed Now

The industry stories note many data issues that, if addressed, could decrease costs and improve outcomes – the definition of value. Many business issues can be mitigated or eliminated with information-based approaches.
This book describes a detailed approach to executing data governance consistently with the objective of improving business outcomes. Organizations need to execute data governance better than they do today. Regardless of whether your organization uses the term data governance to describe how it manages data, it manages its data using some type of process, even if that process is not explicit or purposeful. Some organizations may be actively managing their data, but they may be doing so without the benefit of three decades’ worth of experience in the information management field. If you have data, you have some form of data governance. An organization has a responsibility to manage that data appropriately for its operations. As consumers, we have the right to ask organizations to manage the data we create.
An organization has many functions such as finance, human resources, sales and marketing, and manufacturing. There are many ways organizations can configure these functions. Most of these functions have common characteristics across companies. For example, the finance department produces financial statements each quarter. Finance departments also have highly variable responsibilities. For example, some finance departments may also own the strategic planning process. Sales and marketing, which has little regulatory motivation to standardize its functions, also shows remarkable similarity between companies. For example, many sales and marketing departments have a sales-lead pipeline process and a marketing events calendar-planning model.
The roles and activities of these functions have been studied and analyzed for decades, which has resulted in a multitude of organizational models, process definitions, and technology components to make these functions more efficient and effective. These functions provide rich ground to draw from when new functions develop. For example, many companies want to improve their ability to innovate over time. While innovation management may sound like a contradiction in terms because innovation suggests a free-wheeling activity with little if any process, innovation is being studied and analyzed so it can be managed as a repeatable and reliable process.
Data governance needs to be a repeatable and reliable process. Data governance needs to be consistent.
While corporate functions are often reorganized and changed and employees come and go, the jobs that need to be done in an organization remain mostly the same. Organizational functions are often structured around the types of jobs that are performed and the need to perform them efficiently. To become repeatable and reliable over time, a process needs explicit details that describe it as well as the flexibility to adapt when needed.
Without specific and descriptive details, it is very difficult to maintain consistency. However, even being specific is not enough. Schools and universities have created specialized classes in finance, accounting, human resources, supply chain, and management that cover the gamut of jobs performed in a company. Training can provide foundational knowledge, but the basic ingredients must then be mixed together in a special, organization-specific way. Many organizations provide specific job aids, corporate functional training, and other supportive efforts so the jobs can be done regardless of the functional configuration or the employees executing them.
For outcomes that are heavily data-driven, organizations need detailed, step-by-step procedures for performing data governance jobs on a daily basis, and they want to adapt those procedures to their organization as it changes.
These needs can be seen by the types of questions we encounter:
• What is the core focus of data governance? It seems to cover everything.
• How do we ensure our data governance can prove its relevance and effectiveness?
• How do we justify ongoing investment beyond initial funding?
• How do we clarify and reduce confusion around the scope, roles, and execution of activities under data governance?
• What should data governance do and not do?
• How do we take it to the next level?
• Data governance seemed to start with a bang, but now it feels like it is languishing and no one shows up to the meetings anymore. Why is that?
• How do we avoid reinventing the wheel every time something around data governance comes up?
• Why do the people we put in data governance jobs keep quitting?
The industry stories demonstrate that data-related issues and their disproportionate impact on an organization are on the rise. Fortunately, at the same time, organizations have increased their data governance efforts. At this point, there is widespread adoption of data governance activities. Unfortunately, many organizations have tried multiple attempts at standing up data governance and have found it difficult to sustain them over time.
This book provides a detailed description of the jobs data governance needs to perform and detailed processes required to accomplish those jobs all collected together in a “playbook.” The Playbook captures the tactical steps each player needs to take in order to consistently perform the jobs that need to get done.
A Playbook approach is needed now because many organizations have finally realized that while it is fairly easy to start a data governance program, it is difficult to sustain it. Each organization will have a slightly different approach reflecting its unique characteristics. No single, static Playbook is appropriate for all organizations, and since organizations change over time, the Playbook must also adapt.
This book provides the starting point and adaptation process for your organization’s Playbook. We cover all the plays, the supporting components, and how to maintain and customize the Playbook for your company. Additionally, we provide numerous tactical examples while blending in four decades of information management best practices.

Basic Concepts

Data is the electronic representation of information required for a company to operate and fulfill its purpose. Some companies, such as banks, are very data intensive. Other organizations are less data intensive. Every organization engages in data governance, whether explicitly managed and supported or implicitly assumed and unsupported. Some organizations have found that executing data governance is critical, while others have found that an ad-hoc approach meets their needs. In all information-intensive organizations, data and the capabilities to manage that data well are a critical part of fulfilling the organization’s purpose. Data governance has become a critical capability.
We like to explain data governance by describing the data-related “jobs” that need to be performed. You can consider these jobs to be similar to objectives or goals. Employees perform these jobs following a process. The Playbook describes the process.
There are four core data governance jobs. Sometimes the scope of data governance appears to be large and absorbs many more jobs than those listed below. Lumping more “jobs” into data governance can cause confusion. A broad definition also makes data governance difficult to organize and execute because it includes too many noncore issues peripheral to the underlying problems.
The four jobs are:
• Locate and catalog data
• Understand data limitations and constraints
• Determine data readiness for use
• Improve and control
That’s it!
Some organizations have defined data governance (perhaps with capital letters – Data Governance) to be a broad set of programs, projects, and initiatives with impressive desired outcomes and large funding. Others may define it to include substantial amounts of technical development resources for implementing data management software programs. These are all choices organizations have made for their data governance programs. Implementation models are highly variable. By carefully designing how these four jobs are defined and performed, you can reduce the overall cost footprint of data governance and reduce confusion.
In the next few sections, we describe common organizational areas and capabilities that are often lumped into data governance. We argue that they are separate and distinct but intersect with data governance.

Information and Data Management

Information management generally refers to the following areas:
• Extract, Transform, and Load (ETL)
The process of moving data from one resting place to another.
• Data Quality (DQ)
A characteristic or state of the data that affects its usefulness for answering questions or solving problems.
• Data controls
A set of monitoring points that detect scenario-specific data issues.
Controls already exist in companies. For example, requiring approval to purchase office supplies is a financial control. A data control is the capability to detect, prevent, or correct data issues that may affect the data’s ability to be used. For example, if data passes through a system prior to being used to create a list of customers, a data control could use a statistical test to monitor whether the number of customers passing through the system varies too much. If the variation is large, an alert would be generated and the issue resolved (a minimal sketch of such a control appears after this list).
• Master Data Management (MDM) and Reference Data Management (RDM)
The process of managing lists and hierarchies of data.
A list could be a list of countries, a list of customers, or the hierarchy of products and product categories your organization sells.
• Data Modeling (DM)
A data model is typically a graphic with boxes and lines representing business concepts and their relationships.
Data modeling is the process of creating models of your data.
The models are often used to communicate with stakeholders or to develop a technical system.
• Data Marts and Warehousing (DM and DW)
The process of grouping data together and preparing it to answer business questions.
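To make the data-control example above concrete, here is a minimal sketch of a record-count control. It assumes daily customer counts are roughly stable and that three standard deviations is an acceptable alert threshold; both are choices the control owner would have to make.

    import statistics

    def count_control(history, todays_count, z_threshold=3.0):
        """Alert when today's record count deviates too far from recent history."""
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return todays_count != mean
        z = abs(todays_count - mean) / stdev
        return z > z_threshold      # True means "raise an alert and investigate"

    daily_customer_counts = [10120, 9984, 10240, 10075, 10011, 10198, 10150]
    print(count_control(daily_customer_counts, 10090))   # False: within normal variation
    print(count_control(daily_customer_counts, 6200))    # True: investigate the feed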
Information management grew into a strong discipline over two decades ago. There are differences between “information management” and “data management,” but the distinction between the two terms is not important for this book and we will use the two terms interchangeably.
There has been rapid convergence to best practices for managing and deploying information management resources and capabilities. However, consistently achieving expected business outcomes in the information management areas has remained elusive. Reasons include:
• Lack of business sponsorship or an “internal customer”
• Lack of talent for and commitment to
Planning
Resourcing
Execution
• Misunderstanding of roles and capabilities
• Poor data quality
The top three issues above are really caused by management deficiencies: starting projects when they should not be started or when they lack adequate support to be successful.
We have seen severe deficiencies in many organizations’ capability to define realistic roles for the IT group. Business groups sometimes contribute to this problem by completely delegating oversight and responsibility for information management projects to the IT group. IT groups are often eager to absorb responsibility for creating full solutions. However, business groups do not always need end-to-end solutions, as they can cost-effectively do part of the work themselves. Sometimes, having IT do less and ensuring responsibility is placed where it is best executed is a better solution. Organizations that experience multiple, successive information management project failures often have IT groups carrying too much scope. Business groups that have their own analytics applications, such as SAS or R, are fully capable of working on the “last mile” of information delivery.
It’s better to stick to an “understand the jobs” approach. If we understand the jobs that need to be performed, we know what needs to be done regardless of who does the job. Once the organization’s dynamics have been mapped into the “jobs,” responsibility for each job can be assigned and the Playbook adapted appropriately.
The Playbook provides workstreams across stages and phases that collectively perform the data governance jobs. The Playbook workstreams we cover in this book intersect the following information management areas:
• Data quality: The Playbook provides a specific workstream for working through data-quality issues.
• Data controls: The Playbook has specific provisions to identify needed data controls.
• Master data/Reference data management: The Playbook provides a specific workstream for working through master data management issues.
• Data modeling/Data Catalog: The Playbook speaks directly to capturing definitions and locating data that would support conceptual, logical, and physical data models.
• Data marts and warehousing: The Playbook creates repeatable processes that can be owned and executed by any group that wants execution consistency. This will help improve data mart and warehouse development value delivery and success rates over time.

Business Intelligence

The term “business intelligence” has been used to refer to the following areas:
• Reporting
• Reporting tools and applications
• Data mining
• Analytics, which is sometimes described differently than data mining
• A general state of knowledge about business operations
• A market research group
• All of the above
The term can be confusing without more context. Business resources often use the term to refer to activity that uses data and math to answer business questions. IT resources often use the term to refer to reporting and reporting applications.
Using any of the above definitions, business intelligence depends on data. The data used in business intelligence is often:
• Aggregated
Example: Quarterly summaries across geographies.
• Integrated
Example: The sales totals from all products are summed together.
• Cleansed
Example: All sales are included with a ship date in the calendar quarter.
• Attestable
Example: The head of operations knows that the cost calculation is properly represented in a performance metric.
Business intelligence data is a major beneficiary of data governance activities. Data suitable for aggregation and integration must be cleansed to a certain degree. The field of data cleansing and the concept of cleansing data can be a bit abstract, but we define it as the activity that changes data so it is more suitable for use. Restructuring dates consistently so that sales figures can be aggregated is an example of data cleansing.
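As a small illustration of the date-restructuring example, the sketch below normalizes differently formatted ship dates into a single representation; the input formats listed are assumptions, and real feeds typically require a longer list plus an exception queue.

    from datetime import datetime

    # Normalize differently formatted ship dates so quarterly sales can be aggregated.
    RAW_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"]   # formats we expect to encounter

    def normalize_date(raw):
        for fmt in RAW_FORMATS:
            try:
                return datetime.strptime(raw, fmt).date().isoformat()
            except ValueError:
                continue
        return None   # unparseable values go to a data-quality queue, not silently dropped

    for raw in ["03/31/2015", "2015-03-31", "31-Mar-2015", "Q1 2015"]:
        print(raw, "->", normalize_date(raw))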
Integrating or aggregating data often occurs across functional groups. Data governance can help ensure that each data concept can be conceptually “added” together. Data governance is not a “specific” answer to solve disconnects between groups that do not want to agree to integrate and aggregate their data. If groups do not agree, they do not agree. While the Playbook has processes that can be used across groups, organizational dynamics can still affect the ability to execute the process regardless of who owns the data governance jobs. We provide some insight into how to manage these organizational dynamics in later chapters.
Attestable data is data that has data controls applied to it. Data controls allow the data consumer to attest that the information is correct for their needs. Attestable data requires structured processes and well-defined controls and the Playbook provides activities for creating these controls.
The best way to think about the Playbook and business intelligence is that the Playbook is not business intelligence – the Playbook makes business intelligence better.

Math, Statistics, Data Mining, Artificial Intelligence, Analytics, and More

Businesses cannot get enough analytics. In the past few years, the claimed use of analytics to solve large-scale, global problems as well as help you decide what to purchase next has dominated media outlets.
Historically, analytics was physics- and engineering-oriented rather than business-oriented. Applied and theoretical mathematics were targeted at understanding and explaining various physical phenomena. For example, new mathematical approaches have been developed to understand why Newtonian physics breaks down at small scales.
As populations grew, more scientists and engineers were available to work on other problems. Mathematics turned its attention to solving business and social issues such as improving health, alleviating traffic, or building timesaving machines such as dishwashers and airplanes. Mathematics and statistics became a potent corporate weapon, but initially only large corporations could afford the talent.
At some point, the use of mathematics went into hyperdrive. Starting with mundane tasks such as reducing the costs of mailings for magazine subscription solicitations (ie, junk mail), the field of statistics morphed into the field of data mining and machine learning. Thousands of analytically oriented employees worked on problems such as how to optimally select ads and optimally place them on websites. Fairly quickly, advertising became the primary focus of the largest companies on Earth.
The widespread availability of talent and infrastructure led to dramatic decreases in the cost of deploying business analytics, and a funny thing happened on the way to profit heaven. The data used in the analytics suddenly became a bottleneck to achieving better analytical outcomes. Data is noisy due to the nature of data collection (eg, web page data entry) and is often conflicting. As data was integrated from significantly larger numbers of disparate sources, poor data quality reduced the effectiveness of the analytical models.
Large areas of the information management space can trace their growth to the rise of analytics and the rise of systems to create analytical outputs. New industries arose to address the data warehousing space. A plethora of data-quality tools, now integral to the dataflow processing inside a company, became routine.
The rise of data-quality issues and their impact on analytics rapidly created the need to drive agreement across aggregation, integration, cleansing, and attestation. New issues arose, such as: who in an organization should have access to certain data elements? Who gets to make that decision? Although these questions are really business management questions, employees began to believe that data governance should answer them.
This book is not about the math behind statistics or data mining, but it does deal with the need for improved data and is specifically designed to make consuming data in analytical applications consistent for analysts. Just like business intelligence, the Playbook makes data mining better.

Big Data, NoSQL, Data Scientists, Clouds and Social Media

Big Data may solve all of the world’s problems – at least this is what you may have been led to believe from vendors and the press. The success of web and social media companies such as Google and Facebook, coupled with their openness concerning algorithms and commodity infrastructure software, has given rise to the Big Data movement. Originally targeted at search, ad selection, and ad placement, Big Data technologies are now applied to a multitude of industry problems and are starting to see wider use. Along with the rise of Big Data and social media, a new kind of employee, the data scientist, has emerged. A data scientist is able to fuse math and information management together at large scale to arrive at business insight.
With the rise of Big Data, NoSQL databases also took center stage. No longer restricted to a single machine, NoSQL databases promise to span commodity infrastructure on demand to solve large data management problems. Many believe that NoSQL databases have a flexible schema more suitable to applications that require highly connected data and specific access patterns. With different guarantees than traditional relational database management systems (RDBMS), NoSQL databases and related technologies such as graph databases have become a common complement to Big Data.
For example, many NoSQL databases running on clusters offer “eventual consistency.” Eventual consistency fits social media interactions well. Most social media users are fine if the time it takes to reflect a new “post” is a few minutes or hours. The same guarantees applied in the business world, say around trading data, might be costly. The use of Big Data for storing specific types of data across large clusters provides cost-effective solutions to web-dominant applications that require enormous scale or to business applications where these guarantees are sufficient.
Big Data technologies became synonymous with cloud computing. Because Big Data technologies often use parallel computation and data management processes, they inherently assume that single-system limitations will always be present and therefore target large clusters. Early Web companies relied on commodity clusters to scale as their businesses grew, ie, scaling horizontally instead of vertically became the norm. Public clouds, and even private clouds to some extent, allowed these companies to take advantage of pay-as-you-go plans as their needs increased. As security and other factors became more significant to some commercial clients, public and private clouds became the basis for deploying Big Data technologies.
The motivation behind Big Data technologies is rooted in a specific set of processing scenarios. Today, these technologies are entering mainstream use and are applied to other scenarios, such as offloading processing from more expensive, dedicated systems like data warehouse appliances. Big Data and NoSQL have shifted the cost and capability landscape and will continue to do so as these technologies mature. Big Data enjoys the most success where the technology fits the need. For example, Big Data is very useful upstream in the life sciences discovery process, where assays generate enormous datasets and eventual consistency guarantees are fine.
Interestingly, current parallel extract, transform, and load (ETL) tools already contain the same components that popular Hadoop-based Big Data systems are developing, such as parallel filesystems, resource management and scheduling, and parallel algorithms. However, the cost of purchasing ETL tools and keeping the talent needed to run them can be large and beyond the reach of many smaller companies. Whether true or not, some businesses feel that the cost curve for newer Big Data technologies has shifted enough to make these technologies viable at their organization. While Big Data technology will certainly be misapplied to situations that do not benefit from its value proposition, Big Data will be present in the landscape and data governance needs to take these technologies into account.
Big Data technologies have the capabilities needed to store and process data. While the specific technologies may be different compared to traditional data management technologies, Big Data and traditional technologies are conceptually the same – they are applications that hold, manage, and process data.
Big Data deployments can strain a data governance program because of issues such as:
• Flexible and changing schemas make it a challenge to have up-to-date information on “which” data is “where.”
• Big Data is often applied to unstructured data (eg, news stories vs. tabular data). Unstructured data can make it harder to understand “what’s in there” and is more difficult to interpret and interconnect than tabular data.
• The number of processing layers in Big Data architectures is often larger than traditional environments. More layers lead to more data to be tracked. For example, Hadoop is both a storage system and a processing system. The creation and retention of large datasets in an analytical processing sequence is normal. Many of these intermediate datasets are large and retained so that the analytical process can be restarted or resumed midcourse.
• The results of Big Data, especially for web applications, are often quickly applied to customer-facing situations, but there is more risk that the data’s correctness for a particular situation may be lower than that of datasets that have been processed and curated. While it is true that all datasets typically have unresolved quality issues, Big Data often accelerates business velocity, thereby increasing the risk of unintended customer experiences.
• Datasets may be located in highly disparate locations especially when cloud applications are considered. Similar to the data governance story around healthcare data and distributed datasets, data that is distributed and under the control of third parties is often more difficult to work with and understand.
While it may seem that the integration of Big Data is really just a policy definition exercise, the technology’s different cost points make the volume and diversity of data managed under Big Data “bigger.” Data may move faster and be more complex. Many organizations already find it difficult to apply data governance to the data they oversee today in traditional technologies. Data governance eventually needs 100% coverage of the data in the organization, and Big Data deployments increase the surface area that must be covered. These issues are not insurmountable. Just as vendor products make using Big Data technologies easier, the data managed under Big Data technologies will be added to the queue of data governance work. When it comes to Big Data, the Playbook applies just as it does to any other data.

Maturity Model

Maturity models help you assess your capabilities across a range of capability areas. By assessing your capabilities, you can better understand your current maturity level and identify a reasonable target level to achieve in the future. The results can be used to communicate expectations for advancement and create a sense of urgency for action.
The maturity levels are often set by an independent body or based on analysis of the industry or function. In addition to serving as a great communication and consensus-driving tool about current capabilities, the categories and levels also help educate an organization on what capabilities need improvement. Most maturity models are descendants of the Carnegie Mellon University’s Software Engineering Institute’s Capability Maturity Model Integration (CMMI) for application development.
There are multiple maturity models available in the information management space. Many of the information management maturity models such as the Enterprise Data Management Council’s Data Management Maturity Model (http://www.edmcouncil.org/dmm) overlap somewhat with the capabilities needed to implement data governance well. CMMI Institute’s own Data Management Maturity Model (DMM) is specifically designed to align your data management strategy to business results. These organizations also work together to help improve alignment between the models. We greatly encourage you to use these models to understand the capabilities that may be important to you across the entire data management landscape.
The Playbook is not a capability maturity model. While we find that maturity models are great at communicating and identifying improvement areas, organizations are constrained because they do not have a detailed execution plan for data governance that provides day-to-day guidance. The Playbook describes an execution path represented by a series of specific activities that should be performed to successfully perform the data governance jobs. The areas described in this book cover those that are most commonly addressed by many organizations.
There are variations in how some data governance activities can be performed. You also need to look into your organization’s toolbox to find and select tools that help you execute. The selected set of options forms the “toolkit,” which could be very simple, such as using spreadsheets and email, or more complex, such as using advanced business glossary and workflow tools to manage the workload and meet complex communication and validation requirements.
The Playbook contains a basic set of activities needed to accomplish the jobs in a reasonable, cost-effective manner. Essentially it describes how to conduct data governance operations. The Playbook is designed to be customized, and optional activities added by an organization can reflect different maturity levels.
We tend to think about maturity along two distinct dimensions:
• Execution maturity
Execution maturity is more aligned with the capability maturity model concept. To execute a process, you must have the capabilities to perform that process. Execution and capabilities are related.
• Service maturity
Service maturity is aligned with the concept of coverage. For example, if you execute the Playbook workstreams and are only covering 25% of the data your organization has, then your coverage is low. Coverage is sometimes more important than execution maturity.
Fig. 1.3 shows that maturity along these two dimensions can vary not only from company to company but also, within a company, from department to department. Typically, maturity starts low, and different groups often wind up in different parts of the grid over time. For example, as Human Resources becomes more quantitative, its ability to mature its execution increases.
Figure 1.3 Execution versus service maturity.
These two maturity dimensions are the “breadth” and “depth” of your Playbook. Execution maturity increases as you adopt a Playbook and adapt it to your organization; we address it by providing a process for adapting the Playbook and updating it over time. Service maturity increases over time as you repeatedly apply the Playbook to more and more of your organization’s data, treating data governance as an ongoing operational process. Most maturity model literature calls out consistency of execution as a key outcome of using a maturity model.

The Playbook as an Organizing Process Model

We have provided examples about data issues and their impact on people, organizations, and countries. There are many areas in an organization that need data to operate. While recent decades have created a good set of best practices around data management, gaps still cause organizations to miss their targets.
The Playbook is a set of processes an organization uses to consistently execute the core set of data governance jobs. The Playbook is organized by workstream. Workstreams group together a set of activities. You can think of workstreams as a selected set of data governance jobs that have been grouped together to realize a specific outcome.
Adaptations of the Playbook for your organization may add additional jobs and activities. We have tried to focus on the core data governance jobs, activities, and deliverable outcomes. You may feel that additional activities and outcomes should be added. Feel free to do so. Adapting the Playbook is essential.
Each workstream in the Playbook contains a small set of content as illustrated in Fig. 1.4:
• A set of activities that make up steps in the workstream.
• A set of inputs and outputs for each step.
• A set of roles and responsibilities for each step and what jobs the roles perform in that step.
Figure 1.4 Playbook general schematic.
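To make the schematic concrete, the following Python sketch shows one possible way to represent a workstream, its activities, and each activity’s inputs, outputs, and roles. The class names, fields, and the sample Data Cataloging step are illustrative assumptions on our part, not a specification taken from the Playbook.

# A minimal, illustrative data model for the Playbook schematic in Fig. 1.4.
# Class names, fields, and the sample activity are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Activity:
    """One step in a workstream: its inputs, outputs, and assigned roles."""
    name: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    roles: List[str] = field(default_factory=list)  # e.g., "Data Steward"

@dataclass
class Workstream:
    """A group of activities executed together to realize a specific outcome."""
    name: str
    outcome: str
    activities: List[Activity] = field(default_factory=list)

# Hypothetical example: one step from a Data Cataloging workstream.
cataloging = Workstream(
    name="Data cataloging",
    outcome="Information about your data collected in one place",
    activities=[
        Activity(
            name="Capture business terms for a data set",
            inputs=["Source system documentation"],
            outputs=["Business glossary entries"],
            roles=["Data Steward", "Subject Matter Expert (SME)"],
        )
    ],
)
print(f"{cataloging.name}: {len(cataloging.activities)} activity defined")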
You can also consider the Playbook to be a “methodology” or “route map” for performing jobs. However, the words methodology and route map do not always convey a specific level of detail. We think the word “playbook” communicates expectations around the level of detail found in the activity descriptions.
This book covers the workstreams we think are table stakes for an organization to manage its data well. The workstreams are:
• Data stewardship
Managing data on a daily basis.
• Data cataloging
Collecting information about your data together.
• Data quality
Improving the state of your data so it is ready for use.
• Master/reference data
Managing lists and groupings of data used by many different data consumers.
The Playbook is designed to be adapted to your organization, and the workstreams we cover in this book are good starting points. We also provide insight on how to customize the Playbook. You can start using it as is, but we think most organizations will find value in tailoring the Playbook’s language and activity descriptions to the actual set of functional groups, technologies, and data issues they consider important.

What You Can Get From This Book

The Playbook describes the roles needed to perform each activity. You can think of the role list as a simpler view of a traditional roles and responsibilities matrix. Working with data often involves fluid processes, and the adjustments we inherently make in our daily work to keep an organization moving are numerous.
The roles we cover in the Playbook include:
• Chief Data Officer
• Business Sponsor
• Business Data Owner
• Data Analyst
• Data-Quality Analyst
• Program Leader (Business or IT)
• Data Steward
• Data Steward Manager
• Subject Matter Expert (SME)
A single individual may play multiple roles, or the same role may be played by several people. Because of this fluidity, a detailed roles and responsibilities matrix can be too fine-grained for most practitioners to use easily and is sometimes difficult to maintain.
If you are playing or will play one of these roles, the Playbook provides a specific set of instructions around the activities you are expected to perform and the outcomes you are expected to produce. Many labels can be given to these roles, and you may find that other terms are already used in your organization. You may want to change the labels or expand the list as part of customizing the Playbook. However, the more labels you create, the harder it may be to communicate “who” is doing “what,” “when,” “why,” and “how.”
In addition to obtaining daily guidance around activities, you can look at the value of the Playbook based on your role in the organization. In the following, we list several categories of readers and the value a person in each category can obtain from using the Playbook.
• Business management and executives
This category includes line managers and executives, including managing directors, senior directors, directors, VPs and SVPs, as well as senior analyst- to analyst-level business workers in revenue areas, cross-functional areas such as finance, marketing, and HR, and operations areas.
This category also includes people tasked with building or running governance, quality, or control functions or programs in an organization. In this case, the Playbook provides a specific set of activities that can be deployed immediately and evolved over time.
By providing explicit workstreams, activities, and roles, the Playbook gives management and executives concrete guidance for each person: they know what activities each person should follow and what outcomes each person should achieve. In other words, the Playbook establishes a concrete action plan that can be managed.
• Data management/information delivery professionals
While many of these labels typically describe professionals in the IT group, there are many data management professionals in other groups as well, for example, in data preparation groups attached to an analytics function in a line of business or a functional group such as underwriting/actuarial sciences or claims processing.
This book provides explicit guidance on activities that should be performed in a data governance role and what to expect as outputs of the workstreams. If a Playbook has been deployed at a company, data management professionals will know how to adjust their own expectations, project plans, and efforts to be consistent with the Playbook model.
• Enterprise data services
An enterprise data services group often lives in the IT group and provides data services as needed for a project. For example, if projects need data quality profiling or data cataloging support using the approved corporate data cataloging application, an enterprise data services group can provide a targeted level of support. Data services groups typically provide services to all projects and programs across both IT and business groups. Data services will require data governance activities to be sustainable.
The Playbook helps in two ways. First, it helps enterprise data services groups identify the sets of services they should offer to support data governance jobs. Second, it identifies where data services can play a role in the process. This improves communication of the services’ value, increases their utilization, and improves their overall viability.
• Project and program management
Project or program managers often live in one of two groups: the IT group, or an operations or business functional group.
Project and program management professionals can view this book as a process description for performing data governance and managing many aspects of the data.
A Playbook is not a System/Software Development Lifecycle (SDLC) methodology, although Playbooks do intersect and overlap with parts of an SDLC. IT-oriented project and program managers can use the Playbook to tune or instill best practices into their development management process.
• Application development
Application development professionals almost always deal with data issues in their applications. Many applications can be viewed from a technology perspective as software programs that manage data in a database using a set of transactions or queries.
The Playbook helps application development professionals understand the steps they should follow to access and use the best data. The Playbook also makes application development easier by enabling discipline and support around data areas, which directly reduces their development effort and risk. For example, if the Data Cataloging workstream is performed as described in this book, the outputs from the activities can be directly used as inputs into a developer’s analysis and design steps, reducing or removing the need to perform those steps in their project.
• Risk management
This category includes many different areas of a company, including anti-money laundering, fraud examiners, internal auditors, enterprise risk managers, and external auditors and examiners.
Internal risk management professionals can view this book as a set of best practices the organization should follow to reduce risks caused by data that is not ready or suited for a risk management purpose. This book covers the concept of data controls, which, like regular corporate controls, are a set of processes and procedures that should be in place to ensure the data used in business processes is of suitable readiness.
External auditors and examiners can use this book to establish or enhance audit or examination guides. While the nature of external auditing varies by industry and regulatory area, a Playbook establishes a baseline. Deviations from the Playbook can identify additional areas to investigate.

Summary

We introduced the area of data governance through multi-industry examples and highlighted the different discussion areas that will be covered in the book. We also defined basic terms and concepts. This chapter touched on a variety of topics that often intersect a discussion of data governance, including maturity models, Big Data, and operational models. We created a foundation for understanding what a Playbook is and why this approach is needed now. Finally, we identified the benefits of using a Playbook for different audiences.