Chapter 4. Data Governance Over a Data Life Cycle

In previous chapters, we introduced governance: what it means, the tools and processes that make it a reality, and the people and process aspects of governance. This chapter brings those concepts together and provides a data lifecycle approach to operationalizing governance within your organization.

You will learn about the data lifecycle and its phases, data lifecycle management, applying data governance over a data lifecycle, crafting a data governance policy, best practices for each lifecycle phase, applicable examples, and considerations for implementing governance. For some readers, this chapter will validate what you already know; for others, it will plant seeds and help you consider how these lessons can be applied within your organization. The chapter introduces many concepts that will help you get started on the journey to making governance a reality. Before getting into the detailed aspects of governance, it’s important to center our understanding on data lifecycle management and what it means for governance.

What is a data lifecycle?

Defining a data lifecycle is easier said than done. If you look up definitions, or even the phases, of a data lifecycle, you will quickly realize that they vary from one author to another and from one organization to another. There is no single right way to think about the stages a piece of data goes through. However, we can all agree that each phase has characteristics that distinguish it from the other phases, and because of these differences, the way to think about governance also varies as each piece of data moves through the lifecycle. In this chapter, we define a data lifecycle as the order of stages a piece of data goes through, from its initial generation or capture to its eventual archival or deletion at the end of its useful life.

It’s important to point out that this definition tries to capture the essence of what happens to a piece of data. However, not all data goes through each phase, and these phases are logical dependencies, not actual data flows.

Organizations work with both transactional data and analytical data. In this chapter, we focus primarily on the analytics data lifecycle, from the point when data is ingested into a platform all the way to when it is analyzed, visualized, archived, and purged.

Transactional systems are databases that are optimized to run day-to-day transactional operations. These are fully optimized systems that allow for a high number of concurrent users and transaction types. Even though these systems generate data, most of them are not optimized to run analytics processes. Analytical systems, on the other hand, are optimized to run analytical processes. These databases store historical data from various sources, including CRM systems, IoT sensors, logs, transactional data (sales, inventory), and many more. They allow data analysts, business analysts, and even executives to run queries and reports against the data stored in the analytic database.

As you can see, transactional and analytical data can have completely different lifecycles, depending on what an organization chooses to do. That said, for many organizations, transactional data is usually moved to an analytics system for analysis and will therefore undergo the phases of a data lifecycle we outline in the following section.

Proper oversight of data throughout its lifecycle is essential to optimizing its usefulness and minimizing potential for errors. Data governance is at the core of making data work for businesses. Defining this process end-to-end across the data lifecycle is needed to operationalize data governance and make it a reality. And because each phase has distinct governance needs, this ultimately helps the mission of data governance.

Phases of a data lifecycle

As mentioned earlier, you will see a data lifecycle represented in many different ways, and there’s no right or wrong answer. Whichever framework you choose for your organization has to be the one guiding the processes and procedures you put in place. Each phase of the data lifecycle, as shown in Figure 4-1, has distinct characteristics. In this section, we will go through each phase of the lifecycle as we define it, examine what each phase means, and walk through its implications for governance.

Data Creation

The first phase of the data lifecycle is the creation or capture of data. Data is generated from multiple sources, in different formats (structured or unstructured) and at different frequencies (batch vs. stream). Customers can choose to use existing data connectors, build ETL pipelines, and/or leverage third-party ingestion tools to load data into a data platform or storage system. Metadata (data about data) can also be created and captured in this phase. You will notice data creation and data capture used interchangeably, mostly because of the source of data. When new data is created, it is referred to as data creation; when existing data is funneled into a system, it is referred to as data capture.

In Chapter 1, we mentioned that the rate at which data is generated is growing exponentially, with IDC predicting that worldwide data will grow to 175 zettabytes by 2025.¹ This is enormous! Data is typically created in one of these three ways:

Data acquisition: this is when an organization acquires data which has been produced by a third-party organization
Data entry: this is when new data is manually entered by humans or devices within the organization
Data capture: this is when data generated by various devices in an organization, such as IoT sensors, is captured

It’s important to mention that data can be generated in other ways; however, the three ways mentioned above pose significant data governance challenges. For example, what are the different checks and balances for data acquired outside your organization? There are probably contracts and agreements that outline how the enterprise is allowed to use this data and for what purposes. There might also be limitations on who can access that specific data. All of these have implications for governance. Later in the chapter, we will look at how to think about governance during this phase and call out the different tools you should think about when designing your governance strategy.

Data Processing

Once data has been captured, it is then processed, without yet deriving any value from it for the enterprise. This is done prior to its use. Data processing is also referred to as data maintenance and this is when data goes through processes such as integration, cleaning, scrubbing or extract-transform-load (ETL) to get it ready for storage and eventual analysis.

In this phase, some of the governance implications that you will come across are data lineage, data quality and data classification. All these have been discussed in much more detail in Chapter 2. To make governance a reality, how do you make sure that as data is being processed, its lineage is tracked and maintained? In addition, checking data quality is very important to make sure we’re not missing any important values, before storing this data. You should also think about data classification. How are you dealing with sensitive information? What is it? How are we ensuring management and access of this data so it doesn’t potentially get into the wrong hands? There are a lot of governance considerations during this phase. We will deep dive into these concepts later in the chapter.

Data Storage

The third phase in the data lifecycle is data storage, where both data and metadata are stored on storage systems and devices with the appropriate levels of protection. Because we’re focusing on the analytics data lifecycle, a storage system could be a data warehouse, a data mart, or a data lake. Data should be encrypted at rest to protect it from intrusions and attacks. In addition, data needs to be backed up to ensure redundancy in the event of a data loss, accidental deletion, or disaster.

Data Usage

This phase is important to understanding how data is consumed within an organization to support the organization’s objectives and operations. In this phase, data becomes truly useful and empowers the organization to make informed business decisions when it can be viewed, analyzed and/or visualized for insights. In this phase, users get to ask all types of questions of the data, via a user interface or business intelligence tools, with the hope of getting ‘good’ answers. This is where the rubber meets the road, especially when confirming whether the governance processes already instituted in previous phases truly work. If data quality is not implemented correctly, the types of answers you will receive will be incorrect, might not make too much sense and could potentially jeopardize your business operations.

In this phase, data itself may be the product or service that the organization offers. If data is indeed the product, then different governance policies need to be enacted to ensure proper handling of this data.

Because data is consumed by multiple internal and external stakeholders and processes during this phase, proper access management and audits are key. In addition, there might be regulatory or contractual constraints on how data may actually be used, and part of the role of data governance is to ensure that these constraints are observed accordingly.

Data Archiving

In this phase, data is removed from all active production environments and copied to another environment. It is no longer processed, used or published, but is stored in case it is needed again in an active production environment. Because the volume of data generated is growing, it’s also natural that the volume of archived data inevitably grows. In this phase, no maintenance or general usage occurs. A data governance plan should guide the retention of this data and define the length of time it will be stored including the different controls that will be applied to this data.

Data Destruction

In this final phase, data is destroyed. Data destruction or purging refers to the removal of every copy of data from an organization, typically done from an archive storage location. Even if you wanted to save all your data forever, it’s just not feasible. It’s very expensive to store data that is not in use and compliance issues create the need to get rid of data you no longer need. The primary challenge of this phase is ensuring that all the data is properly destroyed and at the right time.

Before destroying any data, it is critical to confirm whether there are any policies in place that would require you to retain the data for a certain period of time. Coming up with the right timeline for this cycle means understanding state and federal regulations, industry standards and governance policies to ensure that the right steps are taken. You will also need to prove that the purge has been done properly which ensures that data doesn’t consume more resources than necessary at the end of its useful life.

You should now have a solid understanding of the different phases of a data lifecycle and what some of the governance implications are. As stated previously, these phases are logical dependencies and not necessarily actual data flows. Some pieces of data might go back and forth between different processing systems before being stored. Others, such as data kept in a data lake, might skip processing altogether, get stored first, and be processed later. Data does not need to pass through all the phases.

We’re sure you’ve heard the phrase “Rome was not built in a day,” and that is the spirit of this data lifecycle. Applying data governance in an organization is a daunting task and can be very overwhelming. However, if you think about your data within these logical lifecycle phases, implementing governance becomes a task that can be broken down by phase and then thought through and implemented accordingly.

Data lifecycle management

Now that you understand the data lifecycle, another common term you will run into is data lifecycle management (DLM). Many authors use data lifecycle and data lifecycle management interchangeably. Even though it is tempting to bundle the two together, it’s important to realize that a data lifecycle can exist without data lifecycle management. DLM refers to a comprehensive, policy-based approach to managing the flow of data throughout its lifecycle, from creation to the time when it becomes obsolete and is purged. When an organization is able to define and organize its lifecycle processes and practices into repeatable steps, it is practicing DLM. As you start learning about DLM, you will quickly run into the data management plan (DMP), so let’s look at what it means and what it entails.

Data management plan

A data management plan (DMP) defines how data will be managed, described, and stored. In addition, it will define standards you will use and how data will be handled and protected throughout its lifecycle. You will primarily see data management plans required to drive research projects within institutions, but the concepts of the process are fundamental to implementing governance. Because of this, it’s worth us doing a deep dive into them and seeing how these could be applied to implement governance within an organization.

With governance, you will quickly realize that there isn’t a lack of templates and frameworks - see example from Massachusetts Institute of Technology (MIT). You simply need to pick a plan or framework that works for your project and organization and march ahead. There’s not one right or wrong way to do it; you just need to do it. If you choose to use a data management plan, here is some quick guidance to get you started. The concepts here are more fundamental than the template itself: if you are able to capture them in a document, you’re well ahead of the curve.

Guidance 1: Identify the data to be captured or collected

Data volume is important to help you determine infrastructure costs and people time. It’s important to know how much data you’re expecting and the types of data you will be collecting.

Types: Outline the various types of data you will be collecting. Are they structured or unstructured? This will help determine the right infrastructure to use
Sources: Where is the data coming from? Are there restrictions on how this data can be used or manipulated? What are those rules? All these need to be documented
Volume: This can be a little difficult especially with the exponential growth in data, however, planning for that increase early on and projecting what it could be, would set you apart and help you be prepared for the future

Guidance 2: Define how the data will be organized

Now that you know the type, sources, and volume of data you’re collecting, you need to determine how that data will be managed. What tools do you need across the data lifecycle? Do you need a data warehouse? Which type? From which vendor? Or do you need a data lake? Or do you need both? Understanding these implications and what each means will allow you to better define what your governance policies will need to be. There are many regulations that govern how data can and cannot be used, and understanding them is vital.

Guidance 3: Document a data storage and preservation strategy

Disasters happen, and ensuring that you’ve adequately prepared for one is very important. How long will a piece of data be accessible, and by whom? How will the data be stored and protected over its life? As we mentioned previously, data purging needs to happen according to the rules set forth. In addition, it is important to understand your systems’ backup and retention policies.

Guidance 4: Define data policies

It’s important to document how data will be managed and shared. Identify the licensing and sharing agreements that pertain to the data you’re collecting. Are there restrictions that the organization should adhere to? What are the legal and ethical restrictions on access to and use of sensitive data, for example? With the introduction of regulations like GDPR, CCPA, and many more, these restrictions can easily get confusing and even become contradictory. In this step, ensure that all the applicable data policies are captured accordingly. This also helps in case you’re audited.

Guidance 5: Define roles and responsibilities

Chapter 3 defined roles and responsibilities. With those roles in mind, determine which are the right ones for your organization and what each one means for you. Which teams will be responsible for metadata management and data discovery? Who will ensure governance policies are followed all the way? And many more.

A DMP should provide your organization and others an easy to follow roadmap that will guide and explain to others how data will be treated throughout its lifecycle. Think of this as a living document that evolves with your organization as new datasets are captured and as new laws and regulations are enacted.
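
One way to keep a DMP a living document is to hold its key sections in a structured form that can be checked for gaps. The following is a minimal Python sketch, not a standard template: the class name `DataManagementPlan`, its field names, and the `missing_sections` helper are all hypothetical and simply mirror the five guidance items above.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class DataManagementPlan:
    """Hypothetical container for the five guidance items of a DMP."""

    data_types: List[str] = field(default_factory=list)       # Guidance 1: types
    sources: List[str] = field(default_factory=list)          # Guidance 1: sources
    expected_monthly_volume_gb: Optional[float] = None        # Guidance 1: volume
    storage_systems: List[str] = field(default_factory=list)  # Guidance 2
    retention_and_backup: str = ""                            # Guidance 3
    policies: List[str] = field(default_factory=list)         # Guidance 4
    roles: Dict[str, str] = field(default_factory=dict)       # Guidance 5

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: a plan that has only covered Guidance 1 so far.
plan = DataManagementPlan(
    data_types=["structured"],
    sources=["CRM"],
    expected_monthly_volume_gb=500.0,
)
print(plan.missing_sections())
```

Running a check like this whenever a new dataset or regulation arrives is one simple way to notice which parts of the plan still need an owner.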

If this were a data management plan for a research project, it would have included a lot more steps and items for consideration. Those plans tend to be more robust because they guide the entire research project and its data end to end. We will cover a lot more concepts later in the chapter, so we chose items that transfer easily to creating a governance policy and plan for your organization.

Applying governance over the data lifecycle

We’ve gone through the fundamental concepts thus far, so let’s bring everything together and look at how you can apply governance over the data lifecycle. Governance needs to bring together people, processes, and technology to govern data throughout its lifecycle. In Chapter 2, we outlined a robust set of tools to make governance a reality, and Chapter 3 focused on the people and process side of things. It’s important to point out that implementing governance is complicated, and there’s no easy way to simply apply everything and be done. Most technologies need to be stitched together, and as you can imagine, they come from different vendors with different implementations. You would need to integrate a best-in-class suite of products and services to make things work, which is not a trivial task. Another option is to purchase a fully integrated data platform or governance platform.

Data governance framework

Frameworks help you visualize the plan and there are several frameworks that can help you think about governance across the data lifecycle. Figure 4-2 is one such framework where we highlight all the concepts from Chapter 2, overlaid with the concepts we’ve discussed in this chapter.

This framework oversimplifies things to make them easier to understand; it assumes things are linear, from left to right, which is usually not the case. When data is ingested from various sources on the left, this is simply the point of data creation or capture. That data is then processed, stored, and consumed by the different stakeholders, including data analysts, data engineers, and data stewards.

Data archiving and data destruction are not reflected in this framework because those take place beyond the point when data is used. As we previously outlined, during archiving, data is removed from all active production environments. It is no longer processed, used or published but is stored in case it is needed again in the future. And destruction is when data comes to the end of life and it is removed according to set forth guidelines and procedures.

One discrepancy you will quickly notice is that metadata management should be considered from the point of data creation, where enterprises need to discover and curate the data as it’s ingested (especially for sensitive data), through to when data is stored and discovered in the applicable storage system. Archiving, even though it is mentioned within data management, tends to happen when the data’s usefulness is done and it is removed from production environments. Even though archiving is an important part of governance, this diagram implies that it takes place in the middle of the data lifecycle. That said, it’s also possible to have an archiving strategy when data is simply stored in the applicable storage systems, so we cannot completely rule this out.

As you look at Figure 4-2, it’s important to remember that these are logical representations of the phases a piece of data goes through, from left to right, and not necessarily the actual step by step flow of the data. There’s a lot of back and forth that happens between each phase, and not all pieces of data go through each one of these phases.

Frameworks are good at providing a holistic view of things, but they are not the be-all and end-all. Make sure that whichever framework you select works for your organization and your data.

Data governance in practice

OpenStreetMap (OSM) was created by Steve Coast in the UK in 2004, and was inspired by the success of Wikipedia. It is open source, which means it’s created by people like you and free to use under an open license. It was a response to the proliferation of siloed, proprietary international geographical data sources and dozens of mapping software products that didn’t talk to each other. OSM has grown to over 2 million contributors, and what’s amazing is that it works. It works well enough to be a trusted source of data for a number of Fortune 500 companies as well as small and medium-size businesses. With so many contributors, OSM is successful because it established data standards early in the process and ensured contributors adhered to them. As you can imagine, a crowdsourced mapping system without a way to standardize contributor data could go wrong very quickly. Defining governance standards can bring value to your organization and provide trusted data for your users.

And now that you have an understanding of the data lifecycle with an overlay of the different governance tools, let’s dive further into how the different data governance tools we outlined in Chapters 1 and 2 can be applied and used across this lifecycle. This section also includes best practices which can then help you start to define your organization’s data standards.

Data creation

As previously mentioned, this is the initial phase of the data lifecycle, where data is created or captured. During this phase, an organization can choose to capture both the metadata and the lineage of the data. Metadata describes the data (it is data about data), while lineage describes where the data came from. Capturing both during this initial phase sets you up well for the later phases.

In addition, processes such as classification and profiling can be employed as well, especially if you’re dealing with sensitive data assets. Data should also be encrypted in transit to offer protection from intrusions and attacks. Cloud service providers such as Google Cloud offer encryption in transit and at rest by default.

Define your data types

Establish a set of guidelines for categorizing data that takes into account the sensitivity of the information as well as its criticality and value to the organization. Profiling and classifying data helps inform which governance policies and procedures apply to the data.
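
As an illustration, classification guidelines can be expressed as simple rules that map detected information types to a sensitivity level. The sketch below rests on several assumptions: the regular expressions, the info type names, and the three sensitivity tiers (restricted, confidential, internal) are hypothetical, and pattern matching alone is far less thorough than a dedicated classification tool.

```python
import re

# Hypothetical detectors for two kinds of sensitive values.
DETECTORS = {
    "US_SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify_column(values):
    """Return the sorted info types found in a sample of column values."""
    found = set()
    for value in values:
        for info_type, pattern in DETECTORS.items():
            if pattern.fullmatch(str(value)):
                found.add(info_type)
    return sorted(found)


def sensitivity(info_types):
    """Map detected info types to a (hypothetical) sensitivity tier."""
    if "US_SSN" in info_types:
        return "restricted"
    if info_types:
        return "confidential"
    return "internal"


# Sample columns, invented for this example.
columns = {
    "ssn": ["123-45-6789", "987-65-4321"],
    "contact": ["ada@example.com", "n/a"],
    "country": ["US", "KE"],
}
labels = {name: sensitivity(classify_column(vals)) for name, vals in columns.items()}
print(labels)
```

The resulting labels can then drive which governance policies apply to each column, for example who may access it and how long it is retained.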

Data processing

During this phase, data goes through processes such as integration, cleaning, scrubbing, or extract-transform-load (ETL) prior to its use, to get it ready for storage and eventual analysis. It’s important that the integrity of the data is preserved during this phase, which is why data quality plays a critical role.

Lineage needs to be captured and tracked here as well, to ensure that end users understand which processes led to which transformations and, ultimately, where the data originated. We heard this from one user: “It would be nice to have a better understanding of the lineage of data. When finding where a certain column in a table comes from, I need to manually dig through the source code of that table and follow that trail (if I have access). Automate this process.” This is a common pain point felt by many and one where DLM and governance play a critical role.
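
To make the idea of automated lineage concrete, here is a minimal sketch of recording lineage at processing time and then tracing a column back to its origins. It is an in-memory illustration only: the `LineageGraph` class, the asset names, and the job names are hypothetical, and a real deployment would persist this information in a metadata or catalog service.

```python
from collections import defaultdict


class LineageGraph:
    """Tiny in-memory lineage store: which inputs and process produced each asset."""

    def __init__(self):
        self._inputs = defaultdict(set)  # asset -> direct upstream assets
        self._process = {}               # asset -> process that produced it

    def record(self, output, inputs, process):
        """Register that `process` derived `output` from `inputs`."""
        self._inputs[output].update(inputs)
        self._process[output] = process

    def upstream(self, asset):
        """Return every asset that `asset` depends on, directly or indirectly."""
        seen, stack = set(), [asset]
        while stack:
            for parent in self._inputs.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def produced_by(self, asset):
        """Return the process that produced `asset`, or None for a source asset."""
        return self._process.get(asset)


# Each processing job records what it read and what it wrote.
lineage = LineageGraph()
lineage.record("sales_clean.amount", ["raw_sales.amt"], "clean_sales_job")
lineage.record("report.revenue", ["sales_clean.amount", "fx.rate"], "revenue_job")
print(sorted(lineage.upstream("report.revenue")))
```

With records like these, answering “where does this column come from?” becomes a lookup rather than a manual trawl through source code.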

Document data quality expectations

Different data consumers may have different data quality requirements, so it’s important to provide a means to document data quality expectations as well as techniques and tools for supporting the data validation and monitoring process. The right processes for data quality management will provide measurably trustworthy data for analysis.
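
One lightweight way to document data quality expectations is to write them down as data and validate incoming rows against them. The sketch below assumes a hypothetical `orders` feed. The column names, the `ORD-` identifier format, and the two check types (`not_null`, `matches`) are invented for illustration and are not taken from any particular data quality tool.

```python
import re

# Documented expectations for a hypothetical "orders" feed.
EXPECTATIONS = [
    {"column": "order_id", "check": "not_null"},
    {"column": "order_id", "check": "matches", "pattern": r"ORD-\d{6}"},
    {"column": "amount", "check": "not_null"},
]


def validate(rows, expectations):
    """Return a list of (row_index, column, check) tuples for every failure."""
    failures = []
    for index, row in enumerate(rows):
        for exp in expectations:
            value = row.get(exp["column"])
            if exp["check"] == "not_null":
                ok = value is not None and value != ""
            elif exp["check"] == "matches":
                ok = value is not None and re.fullmatch(exp["pattern"], str(value)) is not None
            else:
                raise ValueError(f"unknown check: {exp['check']}")
            if not ok:
                failures.append((index, exp["column"], exp["check"]))
    return failures


rows = [
    {"order_id": "ORD-000001", "amount": 10.0},
    {"order_id": "17", "amount": None},  # badly formatted key and a missing value
]
failures = validate(rows, EXPECTATIONS)
print(failures)
```

Because the expectations are plain data, different consumers can keep their own lists, and the failure counts can be monitored over time.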

Data storage

In this phase, both data and metadata are stored, ready for analysis. Data should be encrypted at rest to protect it from intrusions and attacks. In addition, data needs to be backed up to ensure redundancy.

Automated data protection and recovery

Because data is stored on storage devices in this phase, find solutions and products that provide automated data protection, so that exposed data cannot be read. These protections include encryption at rest, encryption in transit, data masking, and permanent deletion. In addition, implement a robust recovery plan to protect your business when a disaster strikes.

Data usage

In this phase, data is analyzed for insights and consumed by multiple internal and external stakeholders and processes in the organization. Analyzed data is also visualized and used to support the organization’s objectives and operations, and business intelligence tools play a critical role here.

A data catalog is vital to helping users discover data assets using captured metadata. Privacy, access management, and auditing are paramount at this stage; together they ensure that the right people and systems are accessing and sharing only the data they should. Furthermore, there might be regulatory or contractual constraints on how data may actually be used, and part of the role of data governance is to ensure that these constraints are observed.

Data access management

It’s important to provide data services that allow data consumers to access their data with ease. Define identities, groups, and roles, and assign access rights to establish a level of managed access. This ensures that only authorized and authenticated individuals and systems are able to access data assets according to defined rules.
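
The relationship between identities, groups, roles, and access rights can be sketched in a few lines. The example below is a simplified, hypothetical model: the user, group, role, and dataset names are made up, and production systems would rely on the platform’s own identity and access management service rather than hand-rolled checks.

```python
# Hypothetical roles and the actions each one grants.
ROLE_ACTIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "steward": {"read", "write", "tag"},
}

# Hypothetical group memberships (identities grouped by team).
GROUP_MEMBERS = {
    "analysts": {"amina", "raj"},
    "data-engineering": {"lee"},
}

# Access rights: (group, dataset) -> role.
GRANTS = {
    ("analysts", "sales_released"): "viewer",
    ("data-engineering", "sales_released"): "editor",
    ("data-engineering", "sales_ingest"): "editor",
}


def is_allowed(user, action, dataset):
    """Return True if any of the user's groups holds a role permitting the action."""
    for (group, granted_dataset), role in GRANTS.items():
        if granted_dataset != dataset:
            continue
        if user in GROUP_MEMBERS.get(group, set()) and action in ROLE_ACTIONS[role]:
            return True
    return False  # deny by default


print(is_allowed("amina", "read", "sales_released"))   # analyst reading released data
print(is_allowed("amina", "read", "sales_ingest"))     # analyst reading restricted ingest data
```

Granting rights to groups rather than individuals keeps the rule set small, and the deny-by-default return means that anything not explicitly granted is refused.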

Data archiving

In this phase, data is removed from all active production environments. It is no longer processed, used or published but is stored in case it is needed again in the future. Data classification should guide the retention and disposal method of data.

Automated data protection plan

Perimeter security, as a way to prevent unauthorized individuals from accessing data, is not and never has been sufficient for protecting data. The same protections applied in data storage apply here as well, ensuring that exposed data cannot be read; these include encryption at rest, data masking, and permanent deletion. In addition, in case of a disaster in which archived data is needed in a production environment, it’s important to have a well-defined process to revive this data and make it useful.

Data destruction

Finally, data is destroyed or rather removed from the enterprise at the end of its useful life. Before purging any data, it is critical to confirm whether there are any policies in place that would require you to retain the data for a certain period of time. Data classification should guide the retention and disposal method of data.

Create a compliance policy

Coming up with the right timeline for this cycle means understanding state and federal regulations, industry standards, and governance policies, and staying up to date on changes to them. Doing so ensures that the right steps are taken and lets you prove that the purge has been done properly. It also ensures that data doesn’t consume more resources than necessary at the end of its useful life.

IT stakeholders are urged to revisit the guidelines for destroying data every 12-18 months to ensure compliance, since rules change often.
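
A retention schedule of this kind can be captured as a small, reviewable table plus a rule for applying it. The sketch below is illustrative only: the data classes, the retention periods, and the legal hold flag are hypothetical and do not reflect the requirements of any specific regulation.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, per data class.
RETENTION_DAYS = {"web_logs": 90, "financial_records": 7 * 365}


def disposition(data_class, created_on, today, legal_hold=False):
    """Return 'retain' or 'purge' for a dataset under the hypothetical schedule."""
    if legal_hold:
        return "retain"  # a hold overrides the schedule
    expires_on = created_on + timedelta(days=RETENTION_DAYS[data_class])
    return "purge" if today >= expires_on else "retain"


def purge_record(dataset, data_class, today):
    """Build an audit entry that can later serve as evidence of the purge."""
    return {"dataset": dataset, "data_class": data_class, "purged_on": today.isoformat()}


today = date(2024, 6, 1)
print(disposition("web_logs", date(2024, 1, 1), today))
print(disposition("financial_records", date(2020, 1, 1), today))
print(purge_record("clickstream_2024_01", "web_logs", today))
```

Keeping the schedule in one place makes the periodic review straightforward: when a rule changes, only the table needs to be updated, and the audit entries show what was purged and when.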

Example scenario

Here’s an example scenario of how data could move through a data platform with the framework in Figure 4-2.

Scenario

Let’s say that a business wants to ingest data onto a cloud data platform, like Google Cloud, Amazon Web Services (AWS) or Azure Synapse and share it with data analysts. This data may include sensitive elements like US Social Security Numbers, phone numbers, and email addresses. Here are the different pieces it might go through:

Business configures an ingestion data pipeline using a batch or streaming service.
1. Goal: As they move raw data into the platform, it will need to be scanned, classified, and tagged before it can be processed, manipulated and finally stored
2. Staged ingestion buckets:
  1. Ingest - heavily restricted
  2. Released - processed data
  3. Admin Quarantine - needs review
Data is then scanned and classified for sensitive information like PII
Some data may be redacted, obfuscated, or anonymized/de-identified. This process may generate new metadata such as what keys were used for tokenization. This metadata would be captured at this stage
Data is tagged with PII tags/labels
Aspects of data quality can be assessed, e.g., are there any missing values, are primary keys in the right format, etc.
Start to capture data provenance information for lineage
As data moves between the different services along the lifecycle, it is encrypted in transit
Once ingestion and processing are complete, the data will need to be stored in a data warehouse and/or a data lake, where it is encrypted at rest. Backup and recovery processes need to be employed as well, in case of a disaster
While in storage, additional business and technical metadata can be added to it and cataloged and users need to be able to discover and find the data
Audit trails need to be captured throughout this data lifecycle and made visible as needed. Audits allow you to check the effectiveness of controls in order to quickly mitigate threats and evaluate overall security health
Throughout this process, it’s important to ensure that the right people and services have access and permissions to the right data across the data platform using a robust Identity and Access Management (IAM) solution
You need to be able to run analytics and visualize the results for use. In addition to access management, additional privacy, de-identification, anonymization tools may be employed
Once this data is no longer needed in a production environment, it is then archived for a determined period of time to maintain compliance
At the end of its useful life, it would be completely removed from the data platform and destroyed
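
The scan, de-identify, tag, and stage steps in this scenario can be sketched as a single function applied to each ingested record. This is a minimal illustration under stated assumptions, not how any particular cloud service works: the regular expressions, tag names, bucket names, key, and key identifier are all hypothetical, and a managed scanning and tokenization service would normally replace the hand-written patterns.

```python
import hashlib
import hmac
import re

# Hypothetical patterns for two kinds of sensitive values.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def tokenize(value, key):
    """Replace a sensitive value with a keyed, non-reversible token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def process_record(record, key, key_id):
    """Scan one ingested record, de-identify PII, tag it, and pick a bucket."""
    tags, cleaned, needs_review = set(), {}, False
    for field_name, value in record.items():
        if value is None or value == "":
            needs_review = True  # simple quality check: missing value
            cleaned[field_name] = value
            continue
        text = str(value)
        if SSN_RE.search(text):
            tags.add("PII:US_SSN")
            text = SSN_RE.sub(lambda m: tokenize(m.group(), key), text)
        if EMAIL_RE.search(text):
            tags.add("PII:EMAIL")
            text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
        cleaned[field_name] = text
    return {
        "data": cleaned,
        "tags": sorted(tags),
        # New metadata generated by de-identification: which key was used.
        "metadata": {"tokenization_key_id": key_id} if "PII:US_SSN" in tags else {},
        "bucket": "admin_quarantine" if needs_review else "released",
    }


# The key below is a placeholder; real keys belong in a key management service.
result = process_record(
    {"name": "Ada", "ssn": "123-45-6789", "contact": "ada@example.com"},
    key=b"demo-key",
    key_id="key-v1",
)
print(result["tags"], result["bucket"])
```

Records with missing values are routed to the quarantine bucket for review, while clean, de-identified records move to the released bucket where analysts can be granted access.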

Operationalizing data governance

It’s one thing to have a plan, but it’s something else to ensure that plan works for your organization. NASA learned this the hard way. In September 1999, after almost 10 months of travel to Mars, the $125 million Mars Climate Orbiter lost communication and then burned and broke into pieces, merely 37 miles away from the planet’s surface. The analysis found that, while NASA had used the metric system, one of its partners had used the British Imperial System. This inconsistency was not discovered until the orbiter attempted to enter orbit around Mars, leading to a complete loss of the spacecraft. This, of course, was crushing to the team. After this incident, proper checks and balances were implemented to ensure this did not happen again.²

To catch issues like the one NASA experienced early and rectify them before a disaster happens, start by creating a data governance policy. A data governance policy is a living, breathing document that provides a set of rules, policies and guidance for safeguarding an organization’s data assets.

What is a data governance policy?

A data governance policy is a documented set of guidelines for ensuring that an organization’s data and information assets are managed consistently and used properly. A data governance policy is essential in order to implement governance. The guidelines will include individual policies for data quality, access, security, privacy and usage, which are paramount for managing data across its lifecycle. In addition, data governance policies center on establishing roles and responsibilities for data that include access, disposal, storage, backup, and protection, which should all be familiar concepts. This document helps to bring everything together towards a common goal.

Data governance policies are usually created by a data governance committee or data governance council which is made up of business executives and other data owners. This policy document defines a clear data governance structure for the executive team, managers and line workers to follow in their daily operations.

To get started operationalizing governance, a data governance charter template could be useful. Figure 4-3 shows an example template that could help you socialize your ideas across the organization and get the conversation started. Information in this template will funnel directly into your data governance policy.

Use the data governance charter template to kick off the conversation and get your team assembled. Once you have a team that’s bought into your vision, mission and goals, that is the team that will help you create and define your governance policy.

Importance of a data governance policy

When you have a business idea and go to friends to socialize it and possibly get their buy-in, you will quickly run into someone who asks for a business plan: “Do you have a business plan you can share so I can read more about this idea and what your plans are?” A data governance policy plays the same role. It allows you to document all the important elements of operationalizing governance according to your organization’s needs and objectives. It also provides consistency within the organization over a long period of time. It is the document that everyone will refer to when questions and issues arise. It should be reviewed regularly and updated when things in the organization change. You can consider it your business plan or, at the other extreme, your governance bible.

When a data governance policy is well drafted, it will ensure:

Consistent, efficient and effective management of the data assets throughout the organization, data lifecycle, and over time
An appropriate level of protection for the organization’s data assets, based on their value and risk as determined by the data governance committee
Appropriate protection and security levels for different categories of data, as established by the governance committee

Developing a data governance policy

A data governance policy is usually authored by the data governance committee or an appointed data governance council. This committee will establish comprehensive policies for the organization’s data programs that outline how data will be collected, stored, used and protected. The committee will identify risks and regulatory requirements and look into how they will impact or disrupt the business.

Once all the risks have been identified and assessed, the data governance committee will then draft policy guidelines and procedures that will ensure the organization has the data program that was envisioned. When a policy is well written, it helps capture the strategic vision of the data program. The vision for the governance program could be to drive digital transformation for the organization, to get insights that drive new revenue, or even to use data to provide new products or services. Whichever is the case for your organization, the policies drafted should all coalesce towards the articulated vision and mission as outlined in the data governance charter template.

Part of the process of developing a data governance policy is establishing expectations, wants, and needs of key stakeholders through interviews, meetings, and informal conversations. This will help you get valuable input but it’s also an opportunity to secure additional buy-in for the program.

Data governance policy structure

A well-crafted policy should be unique to your organization’s vision, mission and goals. Don’t get hung up on every single piece of information on this template; use it as a guide to help you think things through. With that in mind, your governance policy should address:

Vision and mission for the program: If you used a data governance charter template as outlined in Figure 4-3 to get buy-in from other stakeholders, that means you already have this information readily available. As mentioned before, the vision for the governance program could be to drive digital transformation for the organization, or get insights to drive new revenue or even use data to provide new products or services.
Policy purpose: Capture goals for your organization’s data governance program, as well as metrics for determining success. The mission and vision of the program should drive the goals and success metrics.
Policy scope: Document the data assets covered by this governance policy. In addition, inventory the data sources and the data classifications (sensitive, confidential or publicly available), along with the levels of security and protection required for each classification
Definitions and terms: The data governance policy is usually viewed by stakeholders across the organization who might not be familiar with certain terms. Use this section to document terms and definitions so that everyone is on the same page
Policy principles: Define rules and standards for the governance program you’re looking to set up, along with the procedures and programs to enforce them. The rules could cover data access (who has access to what data), data usage (how the data will be used and what’s acceptable), data integration (what transformations the data will undergo), and data integrity (expectations around data quality). Develop best practices to protect data and ensure regulations and compliance are effectively documented.
Program structure: Define roles and responsibilities (R&Rs), which are positions within the organization that will oversee elements of the governance program. A RACI chart could help you map out who is responsible, who is accountable, who needs to be consulted, and who should be kept informed about changes. Detailed information on governance R&Rs is outlined in Chapter 3 of the book.
Policy review: Determine when the policy will be reviewed and updated and how adherence to the policy will be monitored, measured and remedied
Further assistance: Document the right people to address questions from the team and other stakeholders
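
The policy scope item above pairs each data classification with a required level of protection. One way to make that pairing checkable is to record it as data. This is a minimal sketch under stated assumptions: the control names, the mapping itself, and the helper `missing_controls` are hypothetical and would be replaced by whatever your governance committee decides.

```python
from enum import Enum


class Classification(Enum):
    PUBLIC = "publicly available"
    CONFIDENTIAL = "confidential"
    SENSITIVE = "sensitive"


# Hypothetical mapping from classification to the controls the policy requires.
REQUIRED_CONTROLS = {
    Classification.PUBLIC: {"audit_log"},
    Classification.CONFIDENTIAL: {"audit_log", "encrypt_at_rest", "access_review"},
    Classification.SENSITIVE: {"audit_log", "encrypt_at_rest", "access_review", "de_identify"},
}


def missing_controls(classification, applied):
    """Return the required controls that have not been applied to a data asset."""
    return REQUIRED_CONTROLS[classification] - set(applied)
```

An empty result means the asset meets the policy for its classification; anything else is a gap to raise during policy review.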

It’s not enough to document a data governance policy as outlined in Figure 4-4; communicating it to all stakeholders is equally important. This could involve a combination of group meetings and training, one-on-one conversations, recorded training videos, and written communication.

In addition, review performance regularly with your data governance team to ensure that you’re still on the right track. This also means regularly reviewing your data governance policy to make sure it still reflects the current needs of the organization and program.

Roles and responsibilities

When operationalizing governance over a data lifecycle, you will interact with many stakeholders within the organization and you will need to bring them together to work on this common goal. While it might be tempting to say definitively which roles do what at which part of the data lifecycle, as outlined in Chapter 3, many data governance frameworks revolve around a complex interplay of many roles and responsibilities. The reality is that most companies are rarely able to staff governance roles exactly or even fully, due to a lack of employee skill sets or, more commonly, a simple lack of headcount. For this reason, employees working in the information and data space of their company often wear different user “hats”.

We will not go into detail about roles and responsibilities in this chapter, because they’re covered in detail in Chapter 3. You still need to define what these look like within your organization and how they will interplay with each other to make governance a reality for you. This will typically be outlined in a RACI matrix describing who is responsible, accountable, to be consulted and to be informed for a given enforcement, process, policy or standard.
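
A RACI matrix is easy to keep as structured data and to sanity-check. The sketch below is illustrative: the activities and roles are made up, and the two rules it enforces (exactly one Accountable role and at least one Responsible role per activity) are a common convention assumed here rather than something this chapter prescribes.

```python
# Hypothetical RACI matrix: activity -> {role: "R", "A", "C", or "I"}.
raci = {
    "Define retention policy": {
        "Data owner": "A", "Data steward": "R", "Legal": "C", "Analysts": "I",
    },
    "Approve access requests": {
        "Data owner": "A", "Data steward": "R", "Security": "C",
    },
}


def raci_problems(matrix):
    """List activities that break the assumed one-A, at-least-one-R convention."""
    problems = []
    for activity, assignments in matrix.items():
        letters = list(assignments.values())
        if letters.count("A") != 1:
            problems.append(f"{activity}: needs exactly one Accountable role")
        if letters.count("R") < 1:
            problems.append(f"{activity}: needs at least one Responsible role")
    return problems
```

Running `raci_problems(raci)` on the matrix above returns an empty list, which is the result you want before circulating the matrix to stakeholders.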

Step-by-step guidance

By the time you get to this section of the book, you should know that data governance goes beyond the selection and implementation of products and tools. The success of a data governance program depends on the combination of people, processes and tools all working together to make governance a reality. This section will feel very familiar because it gathers all the elements discussed in the previous section on data governance policy and puts them in a step-by-step process to show you how to get started. It also digs deeper into those concepts:

Build the business case: As previously mentioned, data governance takes time and is expensive. In addition, data governance initiatives will often vary in scope and objectives. Depending on where the initiative originates, you need to be able to build a business case that identifies critical business drivers and justifies the effort and investment of data governance. It should identify the pain points, outline perceived data risks and indicate how governance helps the organization mitigate those risks and enable better business outcomes. It’s OK to start small, strive for quick wins and build up ambitions over time. Set clear, measurable, and specific goals. You cannot control what you cannot measure; therefore, outline success metrics. The data governance charter template in Figure 4-3 is a good way to get started
Document guiding principles: Develop and document core principles associated with governance and, of course, with the project you’re looking to get off the ground. A core principle of your governance strategy could be to make consistent and confident business decisions based on trustworthy data aligned with all the various purposes for the use of the data assets. Another could be to meet regulatory requirements and avoid fines, or to optimize staff effectiveness by providing data assets that meet the desired data quality thresholds. Define principles that are core to your business and project. If you’re still new to this area, there are a lot of resources available. Online, there are several vendor-agnostic, not-for-profit associations, such as the Data Governance Institute (DGI), the Data Management Association (DAMA), the Data Governance Professionals Organization (DGPO) and the Enterprise Data Management Council, that provide great resources and bring together business, IT and data professionals dedicated to advancing the discipline of data governance. In addition, find out whether there are any local data governance meetup groups or conferences that you can attend, such as the Data Governance and Information Quality Conference, DAMA International Events, or the Financial Information Summit
Get management buy-in: It should be no surprise that without management buy-in, your governance initiative can easily be dead from the get-go. Management controls the big decisions and the funding that you need. Outlining important KPIs and how your plan helps to move them will get management’s attention. Engage data governance champions and get buy-in from the key senior stakeholders. Present your business case and guiding principles to C-level management for approval. You need allies on your side to help make the case. And once the project has gotten off the ground, communicate frequently.
Develop an operating model: Once you have management approval, it’s time to get to work. How do you integrate this governance plan into the way of doing business in your enterprise? We introduced you to the data governance policy, which can come in very handy during this process. During this stage, define the data governance roles and responsibilities, and then describe the processes and procedures for the data governance council and data stewardship teams, who will define and implement policies as well as review and remediate identified data issues. Leverage the content from the data management policy plan to help you define your operating model. Data governance is a team effort, with deliverables from all parts of the business.
Develop a framework for accountability: As with any project you’re looking to bring to market, establishing a framework for assigning custodianship and responsibility for critical data domains is paramount. Define ownership. Make sure the “data owners” are visible across the data landscape. Provide a methodology to ensure that everyone is accountable for contributing to data usability. Refer back to your data management policy; it has probably started to capture some of these dependencies.
Develop taxonomies and ontologies: This is where a lot of the education you’ve collected thus far comes in handy. Working closely with governance associations, leaning on your peers and simply learning about things online will help you in this step. There may be a number of governance directives associated with data classification, organization, and, in the case of sensitive information, data protection. To enable your data consumers to comply with those directives, there must be a clear definition of the categories (for organizational structure) and classifications (for assessing data sensitivity). These should be captured in your data governance policy.
Assemble the right technology stack: Once you’ve assigned data governance roles to your staff and defined and approved your processes and procedures, you should then assemble a suite of tools that facilitates implementation, ongoing validation of compliance with data policies, and accurate compliance reporting. Map infrastructure, architecture, and tools. Your data governance framework must be a sensible part of your enterprise architecture, the IT landscape and the tools needed. We talked about technology in previous sections, so we won’t go into detail about it here. What’s important is finding tools and technology that work for you and satisfy the organizational objectives you laid out.
Establish education and training: As highlighted earlier, for data governance to work, it needs buy-in across the organization. You need to ensure that your organization is keeping up and still bought into the project you presented. It’s therefore important to raise awareness of the value of data governance by developing educational materials highlighting data governance practices, procedures, and the use of supporting technology. Plan for regular training sessions to reinforce good data governance practices. Wherever possible use business terms and translate the academic parts of the data governance discipline into meaningful content in the business context.
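
The taxonomy step above calls for clearly defined categories and classifications. Once those vocabularies exist, a small check can flag catalog entries that do not use them. This is a sketch only: the vocabulary values, the catalog layout, and the function `invalid_tags` are assumptions chosen for illustration.

```python
# Hypothetical controlled vocabularies agreed by the governance council.
CATEGORIES = {"customer", "finance", "operations"}         # organizational structure
CLASSIFICATIONS = {"public", "confidential", "sensitive"}  # data sensitivity


def invalid_tags(catalog):
    """Return names of datasets whose category or classification is not in the vocabulary."""
    return sorted(
        name
        for name, tags in catalog.items()
        if tags.get("category") not in CATEGORIES
        or tags.get("classification") not in CLASSIFICATIONS
    )


catalog = {
    "orders": {"category": "finance", "classification": "confidential"},
    "clicks": {"category": "web", "classification": "public"},  # unknown category
    "payroll": {"category": "operations"},                      # classification missing
}
```

Here `invalid_tags(catalog)` reports `clicks` and `payroll`, the two entries a data steward would need to fix or that would prompt an extension of the vocabulary.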

Considerations for governance across a data lifecycle

Data governance has been around since there was data to govern, but it was mostly viewed as an IT function. Implementing data governance across the data lifecycle is no walk in the park. Here are considerations you will need to think about as you implement governance in your organization. These should not be surprising to you, because you will quickly notice that they touch on a lot of aspects we introduced in Chapters 1 and 2, as well as this chapter.

Deployment time

Crafting and setting up governance processes across the data lifecycle takes a lot of time, effort and resources. In this chapter, we have introduced a lot of concepts, ideas and ways to think about operationalizing governance across the data lifecycle, and you can see that it gets overwhelming very quickly. There is no one-size-fits-all solution; you need to identify what is unique about your business and then forge a plan that works for you. Automation can reduce the deployment time compared with hand-coded governance processes. In addition, artificial intelligence is seen as a way to get your arms around data governance in the future, especially for things like autodiscovery of sensitive data and metadata management. That means, as you look for solutions in the market, find out how much automation and integration is built into each one, how well it works for your environment and situation, and whether it automates the most difficult parts of your workflow. In a hybrid and even multi-cloud world, this becomes even more complex and further increases the deployment time.
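
To give a feel for what autodiscovery of sensitive data means in practice, here is a deliberately simple, rule-based sketch. It is not how any particular product works: the two regular expressions, the labels, and the function `scan_columns` are assumptions, and commercial tools rely on far richer detection than pattern matching.

```python
import re

# Hypothetical patterns; real scanners use much more than two regexes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_columns(rows):
    """Map each column name to the sorted sensitive-data types found in its values."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for label, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(label)
    return {column: sorted(labels) for column, labels in findings.items()}
```

Given sample rows from a table, the result tells you which columns deserve a sensitivity tag in the catalog, which is the part of the workflow that is tedious to do by hand.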

Complexity and cost

Complexity comes in many forms. In Chapter 1, we talked about how much the data landscape is changing and just how quickly data is being produced in the world. Another complexity is a lack of defined industry standards for things like metadata. We touched on this in Chapter 2. In most cases, metadata does not obey the same policies and controls as the underlying data itself, and the lack of metadata standards means that different products and processes present this information in different ways. Another complexity is the sheer number of tools, processes and infrastructure needed to make governance a reality. In order to deliver comprehensive governance, organizations must either integrate best-of-breed solutions, which are often complex and very expensive (high license and maintenance costs), or buy turnkey, integrated solutions, which are expensive and scarcer in the market. With this in mind, Cloud Service Providers (CSPs) are building data platforms with all these governance capabilities built in, thus creating a one-stop shop and simplifying the process for customers. As an organization, research and compare the different data platforms provided by CSPs and see which one works for you. Some businesses choose to leave some of their data on premises; for the data that can move to the cloud, however, these CSPs are now building robust tools and processes to help customers govern their data end-to-end on the platform. In addition, companies such as Informatica, Alation, and Collibra offer governance-specific platforms and products that can be implemented in your organization.
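
The absence of a metadata standard usually shows up as different tools using different field names for the same thing. A common workaround is to map each tool’s fields onto one canonical schema. The following is a minimal sketch; the canonical field names, the aliases, and `normalize_metadata` are all hypothetical.

```python
# Hypothetical aliases that different tools might use for the same metadata field.
FIELD_ALIASES = {
    "owner": {"owner", "data_owner"},
    "description": {"description", "desc", "comment"},
    "updated_at": {"updated_at", "last_modified", "modified"},
}


def normalize_metadata(record):
    """Rename known fields to canonical names and drop fields with no known alias."""
    normalized = {}
    for canonical, aliases in FIELD_ALIASES.items():
        for key, value in record.items():
            if key.lower() in aliases:
                normalized[canonical] = value
                break
    return normalized
```

Dropping unrecognized fields is a design choice made here for brevity; in practice you may prefer to keep them under a separate key so that nothing is lost silently.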

Changing regulation environment

In previous chapters, we outlined the implications of a constantly changing regulatory environment, with the introduction of the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). We will not go into that detail here; however, regulations define a lot of what must be done and implemented to ensure governance. They outline how certain types of data need to be handled and which types of controls need to be in place, and sometimes they go as far as outlining the repercussions of noncompliance. Complying with regulations is absolutely something your organization needs to think about as you implement data governance over the data lifecycle.

Location of data

In order to fully implement governance over a data lifecycle, understanding which data is on premises and which is in the cloud is very important. Furthermore, understanding how data in these locations will interact along the lifecycle adds complexity. In the current paradigm, most organizational data lives both on premises and in the cloud, so having systems and tools that allow for hybrid and even multicloud scenarios is paramount. In Chapter 1, we talked about why governance is easier in the public cloud, primarily because the public cloud has several features that make data governance easier to implement, monitor, and update. In many cases, these features are unavailable or cost-prohibitive in on-premises systems. Data should be protected no matter where it is located, so a viable data lifecycle management plan will incorporate governance for all data at all times.

Organizational culture

As you know, culture is one of those intangible things in an organization, but one that plays an important role in how an organization functions. In Chapter 3, we touched on how an organization can create a culture of privacy and security that allows employees to understand how data should be managed and treated, so that they are good stewards of proper data handling and usage. In this section, we’re referring to organizational culture more broadly, which often dictates what people do and how they behave. Your organization might be open, allowing people to easily raise questions and concerns; in such an environment, when something goes wrong, people are more likely to speak up. In other organizations, where people are reprimanded for every little thing, they will be more afraid to speak up and report when things are not working or when things go wrong. In these environments, governance is more difficult to implement, because without transparency and proper reporting, mistakes are usually not discovered until much later. In the NASA example we provided earlier in this chapter, there were a couple of people within the organization who noticed the discrepancy in the data and even reported it. Their reports were ignored by management, and we all know what happened. Things did not end well for NASA. Remember, instituting governance in an organization is often met with resistance, especially if the organization is accustomed to decentralized operations. Creating an environment where functions are centralized across the data lifecycle simply means that these areas have to adhere to processes that they might not have been used to in the past, but processes that are for the larger good of the organization.

Conclusion

Data lifecycle management is paramount to implementing governance and ensures that useful data is clean, accurate and readily available to users. In addition, it ensures that your organization remains compliant at all times.

In this chapter, we introduced you to data lifecycle management and how to apply governance over the data lifecycle. We then looked closely at operationalizing governance and at the role a data governance policy plays in ensuring that an organization’s data and information assets are managed consistently and used properly. Finally, we provided step-by-step guidance for implementing governance and finished with the considerations for governance across the data lifecycle, including deployment time, complexity and cost, and organizational culture.