Chapter 2. Ingredients of Data Governance: Tools

Many of the tasks related to data governance can benefit from automation or machine learning heuristics. In this chapter, we review some of the tools commonly referred to when discussing data governance.

When evaluating a data governance system, see if it supports the following features. All of the capabilities below are crucial for a complete, end-to-end set of tools supporting the processes and personnel responsible for the task. Deep dives into the various processes and solutions will follow in later chapters.

The Enterprise Dictionary

To begin, it is important to understand how an organization works with data and enables governance. Usually, there is an “Enterprise Dictionary” or a “Policy Book” of some kind.

The enterprise dictionary is an agreed-upon repository of the infotypes used by the organization: the data elements that the organization processes and derives insights from. An infotype is a piece of information with a singular meaning, for example “email address”, “street address”, or even “salary amount”.

In order to refer to individual fields of information, and drive a governance policy accordingly, you need to name those pieces of information.

This “enterprise dictionary” is the collection of the information types (infotypes) used by the organization, and is normally owned by either the legal department (where the focus is compliance) or the data office (where the focus is standardization of the data elements used).

Once the enterprise dictionary is defined, the various individual infotypes can be grouped into data classes—and a policy can be defined for each data class.

This document can take many shapes, from a paper document to a tool that encodes the principles below, but it generally contains the following kinds of information:

Enterprise Dictionary: Data Classes

A good enterprise dictionary will contain a listing of the kinds of data the organization processes. These are infotypes (as described above) collected into groups that are treated in a common way from a policy management perspective. For example, an organization will not want to treat “street address”, “phone number”, “city, state”, and “zip code” differently in a granular manner, but rather to set a policy such as “all location information for consumers must be accessible only to a privileged group of personnel and be kept for a maximum of 30 days”. This means that the enterprise dictionary, described above, will actually contain a hierarchy of infotypes: at the leaf nodes are the individual infotypes (e.g., “address”, “email”), and at the root nodes you will find a data class, or a sensitivity classification (sometimes both).

Figure 2-1 shows an example of such a hierarchy from a fictional organization.

Figure 2-1. A data class hierarchy

In the data class hierarchy above, you can see how infotypes such as IMEI (cellular device hardware ID), phone number, and IP address were grouped together under PII. For this organization, these are easily identifiable automatically, and policies are defined on “all PII data elements”. PII is paired with PHI (protected health information), and both are grouped under the “restricted data” category. It is likely that there are further policies defined on all data grouped under the “restricted” heading.
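To make this concrete, here is a minimal sketch in Python of how such a hierarchy could be represented so that a policy lookup can walk from an individual infotype up to its data class and sensitivity classification. All infotype and class names are illustrative, not a standard taxonomy.

# A toy enterprise dictionary hierarchy: leaf nodes are infotypes,
# parents are data classes or sensitivity classifications.
HIERARCHY = {
    "restricted": ["pii", "phi"],
    "pii": ["phone_number", "imei", "ip_address", "email_address"],
    "phi": ["diagnosis_code", "prescription"],
}

def parents_of(infotype):
    """Return the chain of data classes above an infotype (leaf to root)."""
    chain = []
    current = infotype
    found = True
    while found:
        found = False
        for parent, children in HIERARCHY.items():
            if current in children:
                chain.append(parent)
                current = parent
                found = True
                break
    return chain

# A policy defined on "pii" or "restricted" applies to "imei" as well:
print(parents_of("imei"))  # ['pii', 'restricted']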

Data classes are usually maintained by a central body within the organization, as policies on “types of data classes” usually impact compliance with regulation.

Some example data classes seen across many organizations are:

PII—personally identifiable information

This is data, such as name, address, or personal phone number, that can be used to uniquely identify a person. For a retailer, this can be a customer list. Other examples include lists of employee data, lists of third-party vendors, and similar information.

Financial information

This is data such as transactions, salaries, and benefits, or any kind of data that can include information of financial value.

Business Intellectual Property

This is information related to the success and differentiation of the business.

The above are examples, and the variety and kind will change with the business vertical and interest. Do note that data classes are a combination of information elements belonging to one topic. For example, a phone number is usually not a data class, but PII (of which phone number is a member) is normally a data class.

Enterprise Policy Book

We have already discussed the relationship between data classes and policies. Frequently, along with the data class specification, the central data office, or legal, will define an “enterprise policy book”. This is a specification that uses the data classes (answering: what kinds of data do we process, as an organization) and elaborates on what we are allowed, and not allowed, to do with the data we have. This is a crucial element in the following respects.

For compliance, the organization needs to be able to prove, to a regulator, that they have the right policies in place around handling of the data. A regulator will require the organization to submit the policy book and proof (usually from audit logs) of compliance with the policies. The regulator will require evidence of procedures to ensure that the policy book is enforced, and may even comment on the policies themselves.

For limiting liability, risk management, and exposure to legal action, an organization will usually define a maximum (and a minimum) retention period for data. This is important because law enforcement agencies, during an investigation, may require certain kinds of data which the organization must therefore be able to supply. In the case of financial institutions, for example, it is common to find requirements for holding certain kinds of data (transactions, for example) for a minimum of several years. Other kinds of data pose a liability: you cannot leak or lose control of data that you don’t have.

Another kind of policy will be access control. For data, access control goes beyond “yes/no” and into “partial access”: for example, accessing the data when some bits have been “starred out”, or accessing the data after a deterministic encryption transformation, which still allows acting on distinct values, or grouping by these values, without being exposed to the underlying cleartext. Partial access can be thought of as a spectrum of access, ranging from zero access to ever-increasing detail about the data in question (format only, number of digits only, tokenized rendition… to full access); see Figure 2-2 below.

Figure 2-2. Examples of varying levels of access for sensitive data.

Normally, a policy book will specify the following (a minimal sketch of a policy book entry, in code, follows the list):

  • Who (in the organization, outside the organization) can access a data class

  • The retention policy for the data class (how long data is preserved)

  • Data Residency/locality rules, if applicable

  • How the data can be processed (OK/NOK for Analytics, Machine learning, etc)

  • Other considerations by the organization
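As a minimal sketch, a single policy book entry for a data class can be encoded as structured data that tools can later enforce. All field names and values here are invented examples; a real policy book is defined by legal or the central data office for your organization.

# An illustrative policy book entry for one data class (values are examples only).
policy_book = {
    "pii": {
        "allowed_accessors": ["group:privileged-analysts"],   # who can access
        "retention_days_max": 30,                              # how long data is kept
        "retention_days_min": 0,
        "residency": ["EU"],                                   # locality rules, if applicable
        "allowed_processing": {"analytics": True, "ml_training": False},
        "notes": "Consumer location data must not be exported.",
    },
}

def retention_for(data_class):
    return policy_book[data_class]["retention_days_max"]

print(retention_for("pii"))  # 30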

The policy book, and with it the enterprise dictionary, describe the data managed by the organization. Now let’s discuss specific tools and functionality that can accelerate data governance work and optimize personnel time.

Per-Use Case Data Policies

Data can have different meanings, and different policies may be applicable, when the data use case is taken into consideration. An illustrative example is a furniture manufacturer that collects personal data (names, addresses, contact numbers) in order to ensure delivery. The very same data could potentially be used for marketing purposes, but very often consent was not granted for marketing (while, at the same time, I would very much like that sofa to be delivered to my home). The use case, or purpose, of the data access should ideally be an overlay on top of your organizational membership and organizational roles. One way to think about this is as a “window” through which the analyst can select data, specifying the purpose ahead of time and potentially moving the data into a different container for that purpose (the marketing database, for example), all with an audit artifact and lineage tracking for later review.
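One way this purpose overlay could look in practice, sketched with hypothetical names, is an access check that considers the declared purpose alongside the requester’s identity and records both for audit.

# Illustrative purpose-aware access check; all names and values are hypothetical.
CONSENTED_PURPOSES = {
    "customer_address": {"delivery"},          # consent granted for delivery only
    "customer_email": {"delivery", "support"},
}

AUDIT_LOG = []

def request_access(user, infotype, purpose):
    allowed = purpose in CONSENTED_PURPOSES.get(infotype, set())
    AUDIT_LOG.append({"user": user, "infotype": infotype,
                      "purpose": purpose, "granted": allowed})
    return allowed

print(request_access("analyst@example.com", "customer_address", "delivery"))   # True
print(request_access("analyst@example.com", "customer_address", "marketing"))  # False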

Data Classification and Organization

To control the governance of data, it is beneficial to automate, at least in part, the classification of data into infotypes at the very least, although even greater automation is sometimes adopted. A data classifier will look at unstructured data, or even a set of columns in structured data, and infer “what” the data is: for example, it will identify various representations of phone numbers, bank accounts, addresses, location indicators, and more.

An example classifier is Google’s Cloud Data Loss Prevention (DLP) (https://cloud.google.com/dlp); another is Amazon’s Macie service (https://aws.amazon.com/macie).
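For example, here is a minimal sketch using the Cloud DLP Python client. It assumes the google-cloud-dlp library is installed and a project with the DLP API enabled; the request fields follow the public samples and may vary by client version.

# Minimal Cloud DLP inspection sketch; PROJECT_ID is a placeholder.
from google.cloud import dlp_v2

PROJECT_ID = "my-project"  # placeholder

dlp = dlp_v2.DlpServiceClient()
response = dlp.inspect_content(
    request={
        "parent": f"projects/{PROJECT_ID}",
        "inspect_config": {
            "info_types": [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
            "include_quote": True,
        },
        "item": {"value": "Call me at (415) 555-0100 or mail jane@example.com"},
    }
)
for finding in response.result.findings:
    # Each finding names the detected infotype and (optionally) the matched text.
    print(finding.info_type.name, finding.quote)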

Automation of data classification can be accomplished in two main ways:

  • Identify data classes on ingest—triggering a classification job on the addition of data sources

  • Trigger a data classification job periodically, reviewing samples of your data

When it is possible, identifying new data sources and classifying them as they are added to the data warehouse is most efficient, but sometimes, with legacy or federated data, this is not possible.

Upon classifying data, you can, depending on the desired level of automation:

  1. Tag the data as “belonging to a class” (see above in enterprise dictionary)

  2. Automatically (or manually) apply policies that control access to, and retention of, the data according to the definition of the data class

  3. Record the “purpose” or context for which the data is accessed or manipulated

Data Cataloging and Metadata Management

When talking about data, data classification, and data classes, we need to discuss the “metadata”, or the “information about information”: where it’s stored and what governance controls there are on it, specifically. It would be naive to think that metadata should simply obey the same policies and controls as the underlying data itself. There are many cases, in fact, where this would be a hindrance. Consider, for example, searching in a metadata catalog for a specific table containing customer names. While you may not have access to the table itself, knowing such a table exists is valuable (you can then request access, you can attempt to review the schema and figure out whether this table is relevant, and you can avoid creating another iteration of this information if it already exists). Another example is data-residency-sensitive information, which must not leave a certain national border; the same restriction does not necessarily apply to the information about the existence of the data itself, which may be relevant in a global search. A final example is information about a listing of phone calls (who called whom, from where, when), which can potentially be more sensitive than the actual calls themselves, as a call list places certain people at certain times at certain locations.

Crucial to metadata management is a data catalog, a tool to manage this metadata. While enterprise data warehouses, such as Google BigQuery, are efficient at processing data, you probably want a tool that spans multiple storage systems to hold the information about the data. This includes where the data is and what technical information is associated with it (think: table schema, table name, column name, column description), but it should also allow for the attachment of additional “business” metadata, such as who in the organization owns the data, whether the data is locally generated or externally purchased, whether it relates to production use cases or testing, and so on.

As your data governance strategy grows, you will want to attach the particulars of data governance information to the data in a data catalog: data class, data quality, sensitivity, and so on. It is useful to have these dimensions of information schematized, so that you can run a faceted search such as “show me all data of type:table that has a certain data class, in the production environment”.

A data catalog clearly needs to efficiently index all this information and be able to present it to the users whose permissions allow it, using high-performing search and discovery tooling.
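A minimal sketch of what schematized catalog entries and a faceted search over them might look like follows; the entries, field names, and values are invented for illustration.

# Toy catalog entries with schematized governance metadata; all values invented.
CATALOG = [
    {"name": "orders", "type": "table", "env": "production",
     "data_class": "financial", "owner": "sales-data-team"},
    {"name": "customers", "type": "table", "env": "production",
     "data_class": "pii", "owner": "crm-team"},
    {"name": "customers_test", "type": "table", "env": "test",
     "data_class": "pii", "owner": "crm-team"},
]

def faceted_search(**facets):
    """Return entries matching every requested facet, e.g. type='table'."""
    return [entry for entry in CATALOG
            if all(entry.get(key) == value for key, value in facets.items())]

# "Show me all data of type table, with data class pii, in production."
print(faceted_search(type="table", data_class="pii", env="production"))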

Data Assessment and Profiling

A key step in most insight generation workflows, as you sift through data, is to review that data for outliers, which are probably the result of data entry errors or are simply inconsistent with the rest of the data. In many cases, you will need to normalize the data for the general case before deriving insights.

The reason for normalizing data is to ensure both data quality and consistency (sometimes data entry errors lead to data inconsistencies). This is especially important when later using the data for machine learning models, which are susceptible to extracting generalizations from erroneous data.

Data preparation and cleanup is accomplished by a data engineer as that person onboards a new data source. The data engineer will look for empty fields, out-of-bound values (for example, people with ages over 200 or under 0), or just plain errors (a string where a number is expected). There are tools to easily review a sample of the data and make the cleanup process easier, for example Dataprep by Trifacta (https://cloud.google.com/dataprep) and Stitch (https://www.stitchdata.com).

These cleanup processes work to ensure that use cases such as generating a machine learning model are not skewed by data outliers. Ideally, data should be profiled to detect anomalies per column; a determination should be made on whether the anomalies make sense in the relevant context (customers shopping in a physical store outside of store hours are probably an error, while late-night online ordering is very much a reality); and, once the bounds of acceptable data for each field are established, automated rules should be set to prepare and clean up any batch of data or any event stream for ingestion.
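As an illustration of this kind of rule-based cleanup, here is a sketch using pandas on an invented sample; the bounds and rules are examples, and in practice they would come from profiling your own data.

# Illustrative profiling/cleanup rules using pandas; data and bounds are invented.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 250, -1, 41],        # 250 and -1 are out-of-bound entry errors
    "order_total": ["19.99", "twenty", "5.00", None],
})

# Rule 1: ages must fall within plausible bounds; out-of-bound values become NaN.
df["age"] = df["age"].where((df["age"] >= 0) & (df["age"] <= 120))

# Rule 2: order_total must be numeric; anything else (e.g. "twenty") becomes NaN.
df["order_total"] = pd.to_numeric(df["order_total"], errors="coerce")

# Rule 3: surface rows that still have empty fields so they can be triaged.
print(df[df.isna().any(axis=1)])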

Data Quality

Data quality is an important parameter both in determining the relevant use cases for a data source and in the ability to rely on the data for further calculations or inclusion with other data sets. You can assess data quality by looking at the data source and understanding where it physically came from (error-prone human entry? fuzzy IoT devices optimizing for quantity, not quality? a highly exact mobile app event stream?). Knowing the quality of data sources should guide the joining of data sets of varying quality, because low-quality data will reduce confidence in higher-quality sources. Data quality management processes include creating controls for validation, enabling quality monitoring and reporting, supporting the triage process for assessing the level of incident severity, enabling root cause analysis and recommendation of remedies to data issues, and data incident tracking.

There should be different confidence levels assigned to different quality data sets. There should also be considerations around allowing (or at least curating) resultant data sets with mixed-quality ancestors. The right processes for data quality management will provide measurably trustworthy data for analysis.

Lineage Tracking

Data does not live in a vacuum: it is generated by certain sources, undergoes various transformations, aggregations, and additions, and eventually supports certain insights. There is a lot of valuable context in the source of the data and how it was manipulated along the way, and it is crucial to track it. This is data lineage.

A couple of examples of why lineage tracking is important: one is understanding the quality of a resulting dashboard or aggregate. If that end product was generated from high-quality data, but later the information is merged with lower-quality data, that leads to a different interpretation of the dashboard. Another example is viewing, in a holistic manner, the movement of a sensitive data class across the organization’s data landscape, making sure sensitive data is not inadvertently exposed into unauthorized containers.

Lineage tracking should be able to, first and foremost, present a calculation of resultant metrics such as “quality”, or whether or not the data was “tainted” with sensitive information, and later be able to present a visual graph of the data traversal itself. This graph is very useful for debugging purposes, but less so for other purposes.

Lineage tracking is also important when thinking about explaining decisions later on. By identifying the input information into a decision-making algorithm (think of a neural net, or a machine learning model), you can later rationalize why some business decisions (e.g., loan approvals) were made in a certain way, both for past decisions and for future ones.

The above also brings up the importance of the temporal dimension of lineage: the more sophisticated solutions track lineage across time, capturing not only the current inputs to a dashboard but also what those inputs were in the past, and how the landscape evolved.
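A minimal sketch of how such derived metrics might be computed over a lineage graph follows; the graph, quality scores, and sensitivity flags are all invented for illustration.

# Toy lineage graph: each node lists its direct inputs; values are invented.
LINEAGE = {
    "revenue_dashboard": ["orders_clean", "customer_segments"],
    "orders_clean": ["orders_raw"],
    "customer_segments": ["customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}
QUALITY = {"orders_raw": 0.9, "customers_raw": 0.5}      # source quality scores
SENSITIVE = {"customers_raw"}                             # sources tainted with PII

def resolve(node):
    """Propagate quality (worst ancestor wins) and sensitivity up the graph."""
    inputs = LINEAGE[node]
    if not inputs:
        return QUALITY.get(node, 1.0), node in SENSITIVE
    results = [resolve(i) for i in inputs]
    quality = min(q for q, _ in results)
    tainted = any(t for _, t in results) or node in SENSITIVE
    return quality, tainted

print(resolve("revenue_dashboard"))  # (0.5, True): low-quality, PII-tainted ancestry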

Key Management and Encryption

One consideration when storing data in any kind of system is whether to store it in plain text or to encrypt it. Data encryption provides another layer of protection (beyond protecting the data traffic itself), as only the systems or users that have the keys can derive meaning from the data. There are several implementations of data encryption:

  • Data encryption where the underlying storage can access the key—this allows the underlying storage system to achieve efficient storage via data compression (encrypted data usually does not compress well). When the data is accessed outside the bounds of the storage system, for example if a physical disk is taken out of a data center, the data is unreadable and therefore secure.

  • Data encryption where the data is encrypted by a key inaccessible to the storage system, usually managed separately by the customer. This provides, in some cases, protection from a bad actor within the storage provider itself, but results in inefficient storage and performance impact.

  • Just-in-time decryption, where in some cases, for some users, it is useful to decrypt certain data as it is being accessed, as a form of access control. In this case, encryption works to protect some data classes (think “customer name”) while still allowing insights such as “total aggregate revenues from all customers”, or, “top 10 customers by revenue” or even identifying subjects who meet some condition, with the option to ask for de-masking these subjects later via a trouble ticket.

All data in Google Cloud is encrypted by default both in transit and at rest, ensuring that customer data is always protected from intrusions and attacks. Customers can also choose Customer-managed encryption keys (CMEK) using Cloud KMS or Customer-supplied encryption keys (CSEK) when they need more control over their data.

To provide the strongest protections, your encryption options should be native to the cloud platform/data warehouse you choose. The big cloud platforms all have native key management services, which usually allow you to perform operations on keys without revealing the actual keys. In this case, there are actually two keys in play:

A Data encryption key (DEK)

Used to directly encrypt the data by the storage system.

A key encryption key (KEK)

Used to protect the data encryption key, and resides within a protected service, a key management service.

A Sample Key Management Scenario

Figure 2-3. Key Management Scenario

In the scenario depicted in Figure 2-3, the table (on the right) is encrypted in chunks, with the red data encryption key.1 The data encryption key is not stored with the table, but is stored in a protected form (wrapped) by a green key encryption key. The key encryption key resides (only) in the key management service.

To access the data, a user (or process) follows the following steps:

  1. Request the data, instructing the data warehouse (BigQuery) to use the “green key” to unwrap the data encryption key, basically passing the key ID.

  2. BigQuery retrieves the protected DEK from the table metadata, and accesses the key management service, supplying the wrapped key.

  3. The key management service unwraps the data encryption key, while the KEK never leaves the vault of the key management service.

  4. BigQuery uses the DEK to access the data, and then discards it, never storing it in a persistent manner.

The scenario above ensures that the key encryption key never leaves a secure, separate store (the KMS) and that the data encryption key never resides on disk, only in memory and only when needed.
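The same DEK/KEK pattern can be sketched locally in a few lines of Python with the cryptography package. This is purely an illustration of envelope encryption, not how BigQuery or Cloud KMS is implemented; in a real system the KEK never leaves the KMS, and the wrap/unwrap calls below would be KMS API calls.

# Envelope encryption sketch using the "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

# The KEK would live (only) inside the key management service.
kek = Fernet(Fernet.generate_key())

# 1. Generate a DEK and encrypt the data with it.
dek_bytes = Fernet.generate_key()
ciphertext = Fernet(dek_bytes).encrypt(b"row: alice, 555-0100")

# 2. Wrap the DEK with the KEK and store only the wrapped form next to the data.
wrapped_dek = kek.encrypt(dek_bytes)

# 3. To read: unwrap the DEK via the KMS, decrypt, then discard the plaintext DEK.
plain_dek = kek.decrypt(wrapped_dek)
print(Fernet(plain_dek).decrypt(ciphertext))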

Data Retention and Data Deletion

An important item in the data governance tool chest is not just access to data but also the capability to control how long data is kept, setting both maximum and minimum retention values. Identifying data that should survive occasional storage-space optimization because it is more valuable to retain has many obvious use cases; setting a maximum retention time for a data class and then deleting the data may seem less obvious. Consider, however, that retaining PII presents the challenges of proper disclosure, informed consent, and transparency. Getting rid of PII after a short duration (e.g., retaining location only while on the commute) simplifies the above.
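A minimal sketch of what automated retention enforcement could look like follows; the records, data class tags, and retention periods are invented, and real periods would come from the policy book.

# Illustrative retention sweep; retention periods and records are invented.
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = {"pii": 30, "financial": 365 * 7}  # max retention per data class

records = [
    {"id": 1, "data_class": "pii", "created": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 2, "data_class": "financial", "created": datetime(2024, 1, 5, tzinfo=timezone.utc)},
]

def sweep(records, now=None):
    """Return only the records still within their data class's retention window."""
    now = now or datetime.now(timezone.utc)
    kept = []
    for record in records:
        max_age = timedelta(days=RETENTION_DAYS[record["data_class"]])
        if now - record["created"] <= max_age:
            kept.append(record)
    return kept

print(sweep(records))  # the PII record is dropped once it is older than 30 days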

Workflow Management for Data Acquisition

One of the key workflows tying together all the tools mentioned above is data acquisition. This workflow usually begins with an analyst seeking data to perform a task. The analyst, through the power of a well-implemented data governance plan, is able to access the data catalog for the organization and, through a multifaceted search query, review relevant data sources. Data acquisition continues with identifying the relevant data source and seeking an access grant to it. The governance controls route the request to the right authorizing personnel, and access is granted in the relevant data warehouse, enforced through the native controls of that warehouse. This workflow of identifying a task, shopping for relevant data, identifying the relevant data, and acquiring access to it constitutes a safe data access workflow, because each level of access (data appears in search, data is acquired, data is queried) is a distinct data governance stage.

IAM—Identity and Access Management

When talking about data acquisition, it’s important to detail how access control works. Access control relies on user authentication and, per user, on the authorization of that user to access certain data and on the conditions of access.

User authentication: the objective of authenticating a user is to determine that “you are who you say you are”. Any user (and, for that matter, any service or application) operates under a set of permissions and roles tied to that identity. The importance of securely authenticating a user is clear: if I can impersonate a different user, there is a risk of assuming that user’s roles and privileges and breaking data governance.

Authentication was traditionally accomplished by supplying a password tied to the user requesting access. This method has the obvious drawback that anyone who has somehow gained access to the password can gain access to everything that user has access to. Nowadays, proper authentication requires:

Something you know

This will be your password, or passphrase, and should be hard to guess and regularly changed.

Something you have

This serves as a second factor of authentication. After providing the right passphrase, a user will be prompted to prove that they have a device (a cell phone able to accept single-use codes, a hardware token), adding another layer of security. The underlying assumption is that if you misplace that “object”, you will report it promptly, ensuring the token cannot be used by others.

Something you are

Sometimes, for another layer of security, the user will add biometric information to the authentication request: a fingerprint, a facial scan or similar.

Additional context

Another often-used layer of security is ensuring that an authenticated user can only access certain information from within a specific, sanctioned application or device, or under other conditions. Such additional context often includes the following (a simple sketch of a contextual access check follows the list):

  • Being able to access corporate information only from corporate hardware (sanctioned and cleared by central IT). This, for example, eliminates the risk of “using the spouse’s device to check email” without the benefit of the corporate anti-malware software installed by default on corporate hardware.

  • Being able to access certain information only during working hours—thus eliminating the risk of personnel using their off-hours time to manipulate sensitive data, perhaps when those employees are not in appropriate surroundings or not alert to risk.

  • Limiting access to sensitive information while not logged in to the corporate network—using internet cafés, for example, and risking network eavesdropping.
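Here is the promised sketch of how such context conditions could be combined into an access decision; the devices, networks, and hours are invented examples.

# Illustrative context-aware access check; conditions and values are invented.
CORPORATE_DEVICES = {"laptop-0042"}
CORPORATE_NETWORKS = {"office-vpn"}
WORKING_HOURS = range(8, 19)  # 08:00-18:59

def context_allows(device_id, network, hour):
    checks = [
        device_id in CORPORATE_DEVICES,     # sanctioned hardware only
        network in CORPORATE_NETWORKS,      # no internet cafes
        hour in WORKING_HOURS,              # working hours only
    ]
    return all(checks)

print(context_allows("laptop-0042", "office-vpn", 14))   # True
print(context_allows("personal-phone", "cafe-wifi", 23)) # False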

The topic of authentication is the cornerstone of access control, and each organization will define its own balance between risk aversion and user authentication friction. It is a known maxim that the more “hoops” employees need to jump through in order to access data, the more those employees will seek to avoid the complexity, leading to shadow IT and information siloing, both of which run counter to data governance (data governance seeks to promote data access for all, under proper restrictions). There are volumes written on this topic in detail.2

User Authorization and Access Management

Once the user is properly authenticated, access is determined by a process of checking whether the user is authorized to access, or otherwise perform an operation on, the data object in question, be it a table, a dataset, a pipeline, or streaming data.

Data is a rich medium, and sample access policies can be:

  • For reading the data directly (performing “select” SQL statement on a table, reading a file)

  • For reading/editing the metadata associated with the data—for a table, this would be the schema (the names and types of columns, the table name); for a file, this would be the file name. In addition, metadata also includes the creation date, update date, and last read date.

  • For updating the content, without adding new content.

  • For copying the data or exporting it.

  • There are also access controls associated with workflows, such as performing an extraction/transformation/load (ETL) operation for moving and reshaping the data (replacing rows/columns with others).

We have expanded here on the policies mentioned for data classes above, which also cover partial read access; partial access can be its own authorized function.

It’s important to define identities, groups, and roles, and assign access rights to establish a level of managed access.

IAM (Identity and Access Management) should provide role management for every user, with the capability to flexibly add custom roles that group together meaningful permissions relevant to your organization, ensuring that only authorized and authenticated individuals and systems are able to access data assets according to defined rules. Enterprise-scale IAM should also provide context (the IP address, device, and time the access request is being generated from). As good governance results in context-specific role and permission determination before any data access, the IAM system should scale to millions of users issuing multiple data access requests per second.
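As a minimal sketch (role, permission, and user names are invented), custom roles can be modeled as named groups of permissions, with the access check reduced to a set lookup.

# Illustrative custom roles grouping permissions; all names are invented.
ROLES = {
    "data_reader": {"table.read"},
    "data_steward": {"table.read", "table.metadata.edit", "table.export"},
}
USER_ROLES = {
    "analyst@example.com": ["data_reader"],
    "steward@example.com": ["data_steward"],
}

def has_permission(user, permission):
    granted = set()
    for role in USER_ROLES.get(user, []):
        granted |= ROLES[role]
    return permission in granted

print(has_permission("analyst@example.com", "table.export"))  # False
print(has_permission("steward@example.com", "table.export"))  # True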

Summary

In this chapter, we have gone through the basic ingredients of data governance: the importance of having a policy book containing the data classes managed, and how to clean up the data, secure it, and control access. Now it is time to go beyond the tooling and discuss the ingredients of data governance: people and processes.

1 Protection of data at rest is a broad topic; a good starter book is Applied Cryptography by Bruce Schneier.

2 An example book about identity and access management is Identity and Access Management by Ertem Osmanoglu.
