Defining the data reservoir ecosystem
The data reservoir solution defines the types of technology components necessary to manage and share a wide variety of information. In addition, it considers the ecosystem of processes and teams that interact with these technical components to ensure that the data in the data reservoir is available, findable, useful, and properly protected.
This chapter takes a closer look at the ecosystem around the data reservoir.
This chapter includes the following sections:
2.1 How does the data reservoir support the business?
After sketching the overall architecture of the data reservoir, Erin begins working with the users of the reservoir across the business. Her goal is twofold: she wants to get to the next level of detail on their use cases, but also wants to get their buy-in on the role and use of the reservoir. She is confident that by starting with specific, narrow initial proofs of concept (POCs) for each of her constituent groups, she can demonstrate value. She also intends for this effort to shake out the system before the larger use cases arrive.
2.1.1 Extended data warehouse
EightBar Pharmaceuticals (EbP) has had a data warehouse as part of its IT infrastructure for a number of years. At the time it was originally conceived, the hope was that the infrastructure would satisfy the ever-growing information consumption needs of EbP. However, as time has gone on, there have been some notable issues:
Keeping the warehouse up to date with the changing needs of the business has proven to be the biggest challenge. The business information and the needs of the business are changing faster than the IT staff can evolve the data warehouse.
Although the reports created from the warehouses do provide a great deal of value, they could be more valuable if newer types of information, such as the unstructured data sources that are cropping up frequently, could also be incorporated in them.
The costs associated with the warehouse continue to grow. The payment to their warehouse vendor gets larger every year, as they must pay for more storage and more compute cycles.
The frequency of access varies widely across the data, although the storage costs are the same for all of it. With nowhere else to put some of this data, the warehouse starts to contain increasingly obsolete historical data, which is used less and less in the more contemporary business reports.
A data reservoir can help address many of these issues:
The data reservoir provides support for a wider variety of storage mechanisms than are found in the traditional data warehousing environment. Many of these allow information to be stored in formats where prior knowledge of the information structure is not needed. This allows the storage to evolve with the information itself as it changes over time. Using tools such as IBM Big SQL over open source components such as HBase, business intelligence (BI) report builders can continue to use the SQL that they are familiar with, but over an ever more diverse set of data.
The overall storage costs of the different repositories in the data reservoir can be dramatically lowered compared to those of the data warehouse alone. This lower price point means that storage decisions get easier, with complementary information going to cheaper repositories, while the warehouse remains dedicated to the core data for the traditional BI reports that it runs.
Older warehouse data can also be moved to cheaper storage. Using federated interfaces, the existing BI reports can run as-is, but against data sets that are now a combination of those in the warehouse and those in the reservoir (see the sketch after this list).
New styles of repositories can enable a broader range of analytical processing. This is particularly true for unstructured or semi-structured data, where text analytics and graph-based analytics become cheap and effective.
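The federated access described in the list above can be illustrated with a short sketch. Two in-memory SQLite databases stand in for the data warehouse and the reservoir's SQL layer (such as Big SQL); the table, columns, and cutoff year are invented for illustration. The point is simply that the report logic does not change when older rows move to cheaper storage.

import sqlite3
import pandas as pd

def load_sales(connection, predicate):
    # The same query shape runs against either store.
    return pd.read_sql(f"SELECT region, revenue FROM sales WHERE {predicate}",
                       connection)

warehouse = sqlite3.connect(":memory:")   # stands in for the data warehouse
reservoir = sqlite3.connect(":memory:")   # stands in for the reservoir store

warehouse.execute("CREATE TABLE sales (region TEXT, revenue REAL, year INT)")
warehouse.execute("INSERT INTO sales VALUES ('NE', 120000, 2014)")
reservoir.execute("CREATE TABLE sales (region TEXT, revenue REAL, year INT)")
reservoir.execute("INSERT INTO sales VALUES ('NE', 95000, 2009)")

# The existing BI aggregation is unchanged; only the location of old data moved.
sales = pd.concat([load_sales(warehouse, "year >= 2012"),
                   load_sales(reservoir, "year < 2012")], ignore_index=True)
print(sales.groupby("region")["revenue"].sum())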
After further discussions with Erin, the warehouse team is excited about using the data reservoir as a way to extend their data warehouse. The lower storage costs will finally allow them to bring some other data sources online that have been a challenge for them in the past. The flexible storage mechanisms will also allow them to keep up with the changing information itself.
2.1.2 Self-service information library
Erin next spends some time with Callie Quartile, the data scientist. Today at EbP, Callie finds that much of her time goes to just tracking down the information that she needs. The IT staff are great, but they are very busy. Having to interrupt them constantly to ask where a particular piece of data is, or where last year's version of a report lives, becomes tiresome, to say the least. After she has found a particular piece of information, it is difficult to know when it was last updated or by whom. It can even be hard to figure out who is responsible for the information if she wants more detail about how the information is structured or the meaning of certain elements within it.
Erin explains to Callie how the data reservoir will be associated with a catalog. That catalog will track the assets within the reservoir and be everyone's window into the content there. An important aspect of a catalog (such as the IBM InfoSphere Information Governance Catalog) is the ability to provide a self-service experience for people such as Callie. This includes the ability to use simple phrases or terms to search for associated assets. After an asset is found, it should be clear who the curator and information owner are, when it was added to the catalog, and in many cases what the lineage of the asset is. Many elements of the data sources that Callie finds can be linked back to business terms that define the precise semantics of the values themselves. This allows Callie to use the numerical information with much more confidence.
This can all be done without requiring IT involvement, allowing Callie to find, recognize, and use the information much more quickly.
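As a rough illustration of that self-service experience, the sketch below queries a reservoir catalog for assets associated with a business term. The endpoint, parameters, and response fields are hypothetical stand-ins, not the actual InfoSphere Information Governance Catalog interface.

import requests

CATALOG = "https://catalog.ebp.example/api/assets"   # hypothetical endpoint

def find_assets(term):
    """Search the catalog for assets associated with a business term."""
    response = requests.get(CATALOG, params={"term": term}, timeout=30)
    response.raise_for_status()
    return response.json()

for asset in find_assets("adverse event"):
    # Each hit carries the governance detail Callie needs to judge the asset.
    print(asset["name"], asset["owner"], asset["last_updated"], asset["lineage_url"])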
2.1.3 Shared analytics
In talking with Erin, Callie shares another long-standing frustration: sharing analytical results with colleagues. She and the other data scientists all end up with file after file of SAS data sets or R results, and no effective way of sharing them. Callie explains that the sharing issue is not just about getting access to the files, but also about understanding the sources that contributed to the analytic results themselves. She makes it clear that for the analytics to be reusable, the data scientists need to be able to understand what data contributed to a particular result set.
Erin again explains the role of the reservoir and its catalog. Analytic results can be published back to the reservoir, but when this is done, more than the results themselves should be published: metadata about the sources and, in many cases, the analytic operations that were performed to obtain the results should be included. As tooling around the reservoir evolves, this metadata publication along with the result sets will become automatic.
With the analytic result sets published to the reservoir, and details about the analytic operations that were used to create the result set, the collaboration between Callie and her colleagues can change dramatically. They will be able to shop for analytics, in the same way that they can use the catalog to shop for data within the reservoir.
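A minimal sketch of what such a publication might carry is shown below. The endpoint and payload layout are assumptions made for illustration; the key idea is that the result set is registered together with its source assets and the operations that produced it.

import requests

result_metadata = {
    "name": "trial_outcome_clusters_2014Q3",
    "produced_by": "callie.quartile",
    "result_location": "reservoir://analytics/trial_outcome_clusters_2014Q3.csv",
    "source_assets": [
        "reservoir://clinical/trial_results_2014",
        "reservoir://reference/patient_demographics",
    ],
    "operations": [
        {"step": 1, "tool": "R", "action": "filter to completed trials"},
        {"step": 2, "tool": "R", "action": "k-means clustering, k=5"},
    ],
}

# Register the result set and its provenance so colleagues can find and reuse it.
requests.post("https://catalog.ebp.example/api/results",   # hypothetical endpoint
              json=result_metadata, timeout=60)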
2.1.4 Tailored consumption
For a data reservoir to serve the diverse community of users that need information, it must have a certain amount of contextual awareness.
Predefined vocabularies and business object models can be used by the data reservoir catalog to match against the data itself. This allows users such as Harry Hopeful, the sales specialist for EbP, to query the repository using terminology that is familiar to them. The catalog uses that language to locate the assets that best match Harry's search and objectives. In this way, the same physical assets within the reservoir might be arrived at through different paths by Harry or Tessa Tube, the lead researcher. Each of them operates in a different part of the business with a different vocabulary, but in some cases will need to find the same assets.
This tailored navigation is also supported by a corresponding tailored consumption. Harry needs to find certain data sets, but after they are found, he wants to see, manipulate, and use those data sets using the terms that he is familiar with, not whatever database table or column names were assigned by a contract extract, transform, and load (ETL) developer several years ago.
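A small sketch of that idea, assuming the term assignments have already been retrieved from the catalog: the physical column names are simply presented under the business terms Harry uses.

import pandas as pd

physical = pd.DataFrame(
    {"CUST_NM": ["Acme Clinic"], "RGN_CD": ["NE"], "REV_AMT": [125000.0]}
)

# Hypothetical term assignments pulled from the governance catalog.
business_terms = {
    "CUST_NM": "Customer Name",
    "RGN_CD": "Sales Region",
    "REV_AMT": "Quarterly Revenue",
}

sales_view = physical.rename(columns=business_terms)
print(sales_view.columns.tolist())
# ['Customer Name', 'Sales Region', 'Quarterly Revenue']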
2.1.5 Confident use
Erin and Jules Keeper, the new Chief Data Officer (CDO) at EbP, are also clear that enabling self-service, storing analytics, and allowing easy consumption will all be for naught if the information that is found cannot be used with confidence.
Confident use of information requires many factors:
The information itself enters the reservoir as part of a well-defined governance program.
The meaning of the data is captured and easily understood. This starts with simple metadata about a particular asset, such as a database table's column names and their types. This information will often be enhanced with an association to a corresponding business term or element within an industry vocabulary.
It is easy to identify the sources of the data. Any legal or usage restrictions, if those exist, should be clear.
2.2 Process tools and lifecycles
The data reservoir is more than a set of passive repositories. The data within the data reservoir is being actively managed to ensure that people can find and access the data that they need.
2.2.1 The need for self-service
The initial objectives of a data reservoir are typically aimed at a small subset of the information available within an enterprise. It is unlikely that a big bang approach, where all data is fed into the reservoir and then switched on, will be acceptable to the business stakeholders or feasible to manage from a project perspective. Initially, most data reservoirs will demonstrate immediate value to pockets of information users. Over time, as the value for this small subset of users is demonstrated, further phases of the project will increase the volume of data being fed into the reservoir and broaden the set of users within the enterprise who operate within it. Quick wins in those initial phases, where tangible business value can be demonstrated, are key to ensuring business buy-in from stakeholders across the enterprise.
While the data reservoir implementation is in its infancy, management of the operational aspects of the reservoir can be relatively lightweight. Typically, users of the reservoir will have a similar level of skills, and importantly, given the limited volume of data, will likely be familiar with the type of information that can be discovered within the reservoir. However, over time as the value proposition is proven, as increasing numbers of users want access to increasing volumes of data, the management of the reservoir requires a more automated approach.
2.2.2 Facets of self-service
Self-service can be considered from various perspectives:
A user being provided with all the capabilities they need to be able to search and find the data that they need without support
A data owner being able to ensure that the correct level of governance is enforced on their data
A system administrator being notified of impending resource allocations being exceeded on a file system
A fraud investigation officer being provided with the tools they need to investigate suspicious activity within the reservoir
In all of these situations, self-service allows these individuals to perform their tasks effectively and without support.
2.2.3 Enablers of self-service
Initial stages of a data reservoir might not be focused on the self-service requirements of the future state of the reservoir. However, it is important to understand upfront and plan for how self-service can be used to ensure that the operation of the reservoir is as efficient as possible. Two of the key aspects to self-service are workflow and catalog management.
2.2.4 Workflow for self-service
There are various items to consider when defining a workflow for self-service:
What is a business process?
A business process is a series of actionable steps, tied together into a logical sequence to achieve a business result. The actionable steps can be human-centric or system-centric. Human-centric steps typically require an individual to perform a unit of work, often through a user interface. System-centric steps define automatic steps within the business process such as running a business rule, reading/writing from a system, or sending out notifications. A business process will consist of many human and system steps tied together to provide a definition of how individuals and systems should coordinate their interaction with data.
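As a toy rendering of this definition, the sketch below models a business process as an ordered list of human-centric and system-centric steps. Real processes would be modeled in BPM tooling rather than written by hand like this; the step names and actions are invented.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    kind: str                       # "human" or "system"
    action: Callable[[dict], dict]  # the unit of work for this step

def run_process(steps: List[Step], context: dict) -> dict:
    # Steps run in sequence, each handing its updated context to the next.
    for step in steps:
        print(f"[{step.kind}] {step.name}")
        context = step.action(context)
    return context

# "Add a new data source to the reservoir" as three coordinated steps.
process = [
    Step("Owner classifies the source", "human",
         lambda ctx: {**ctx, "confidentiality": "Confidential"}),
    Step("Apply masking rule", "system",
         lambda ctx: {**ctx, "masked": ctx["confidentiality"] != "Unclassified"}),
    Step("Notify curator", "system",
         lambda ctx: {**ctx, "notified": True}),
]

print(run_process(process, {"source": "clinical_trial_feed"}))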
Workflow versus Business Process
Often, the terms business process and workflow are used interchangeably. However, a business process can be described as the abstract definition of the work that must be done, whereas the workflow is the concrete implementation of the business process. Workflows can be defined in many ways, but when a workflow is defined as an implementation of a business process, specialized software, known as Business Process Management (BPM) software, can be used to model, build, and define complex workflows. IBM Business Process Manager is an example of this specialist software. As the number and variety of people using the data reservoir increases, workflow becomes an enabler of efficient and effective operation of the data reservoir, connecting people and ensuring action is taken to address missing, lost, or incorrect data.
For more information about Workflow, see:
Types of Workflow for self-service
When considering workflow as an enabler to self-service, workflow can be classified into the areas shown in Table 2-1.
Table 2-1 Forms of workflow
Data quality management: A collection of workflows within the system that enforce the accuracy of the data and the governance policies associated with the data.
Data curation: A collection of workflows that support the curation of the data within the reservoir. These workflows provide mechanisms for curators to refine the metadata and respond to notifications that inaccurate metadata exists.
Data protection: Workflows that manage the granting of access to data, allow the system to notify security and compliance officers of security alerts and suspicious activities, and allow for auditing of system usage.
Data lifecycle management: Workflows that manage information assets within the reservoir or artifacts of the governance program. Typical examples include managing the lifecycle of a piece of reference data and managing the lifecycle of changes in policy and rules that need to be applied to the data within the reservoir.
Data movement and orchestration: Workflows that define the flow of data into and through the reservoir, including the steps that must be followed for new sources to be shared within the reservoir. They can also define the masking events that should occur before the data is shared with a specific group of users.
For more information about each type of workflow, see Chapter 5, “Operating the data reservoir” on page 105.
2.2.5 Catalog management for self-service
The heart of the data reservoir is the catalog, which contains the descriptions of the data in the data reservoir and related detail that defines how this data is being managed.
The catalog is the first access point for most people using the data reservoir. The accuracy and the usefulness of its content will create the first impression of the data reservoir's usefulness and quality. Therefore, its design and the effort to populate it with useful and relevant content are key success factors for the data reservoir. Human curators maintain some of this content, incorporating feedback from users along with input from automated processes that survey and monitor the activity in the data reservoir.
2.3 Defining the information governance program
Establishing a governance program for the data reservoir is not just a necessary step, but an essential one for the reservoir effort to succeed. Without a governance program, the data reservoir can easily turn into a place where any and all manner of information is dumped. At first glance, this might seem to be part of the goal of the reservoir: a place to put and share the information. Indeed, that is the goal, but for it to be information, not just raw bits and bytes, governance is needed to ensure that the data that is shared is properly characterized and curated. This section details how to set up a governance program.
2.3.1 Core elements of the governance program
There are three core elements to establishing a governance program that can operate at scale:
Classification
Data in the data reservoir is classified using simple classification schemes that express the business's perspective on its value, sensitivity, confidentiality, and integrity. Processes and infrastructure are also classified according to their capabilities, as are the different types of activity. These classifications become the vocabulary used to express how the data reservoir manages data.
Policies and rules
For each classification of data, corresponding policies and rules dictate how that data is handled in different situations. The rules are expressed using the classification schemes (see 2.3.3, "Classification schemes"). For example, sensitive information must be stored in a secure data repository. The rules support one or more policies that define the wanted end state for the data reservoir. Together, the policies and rules make up the requirements of the information governance program.
Policy implementation and enforcement
The rules are implemented in the data reservoir processes. Some rules test whether a wanted state is true or not. These are called verification rules. If the wanted state is not true, an exception is raised. The exception can be corrected, logged and ignored if it is not worth fixing, or granted an exemption for a period of time. This is a common pattern for data quality because the errors are already present in the data. Enforcement rules, in contrast, are able to force the wanted state. They are more common in the protection of data in the reservoir. If an enforcement rule fails, it is due to an infrastructure failure or a setup error. A sketch of both styles of rule follows.
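The sketch below contrasts the two styles. The rule bodies and field names are illustrative only: a verification rule raises an exception when the wanted state does not hold, whereas an enforcement rule forces the wanted state.

class GovernanceException(Exception):
    """Raised when a verification rule finds that the wanted state is not true."""

def verify_sensitive_in_secure_store(asset):
    """Verification rule: sensitive data must live in a secure repository."""
    if asset["confidentiality"] == "Sensitive" and not asset["repository_secured"]:
        raise GovernanceException(
            f"{asset['name']} is Sensitive but stored in an unsecured repository"
        )

def enforce_masking(record, sensitive_fields=("ssn", "patient_id")):
    """Enforcement rule: mask sensitive values before they reach an unsecured zone."""
    return {k: ("****" if k in sensitive_fields else v) for k, v in record.items()}

verify_sensitive_in_secure_store(
    {"name": "trial_results", "confidentiality": "Sensitive", "repository_secured": True}
)
print(enforce_masking({"ssn": "123-45-6789", "visit_date": "2014-06-01"}))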
These core elements support the information governance principles.
2.3.2 Information governance principles
The information governance principles are the top-level information governance policies. They are typically the first set of definitions that the governance leader, Jules Keeper, establishes because they underpin all other information governance decisions.
The information governance principles define the scope of the information governance program. This is a fairly comprehensive set that covers the responsibilities of individuals in their use and management of information.
These first three principles outline the scope of the information governance program, defining it to cover all information. At a minimum, all information is potentially shareable and must be classified to determine how it is managed. Some classifications indicate that the information is of sufficiently low value and sensitivity that it does not require any special management. As the classification levels rise, the requirements increase.
1. Information is a company asset. It will be managed according to the prescribed governance policies.
2. Information is identified and classified. All information that is stored by the company will be identified and classified according to its sensitivity and content. This classification will determine how the information will be governed.
3. Information is a sharable resource and should be made available for all legitimate needs of the company.
These four principles define the roles that people assume when they work with information:
1. Information is owned. There is an individual responsible for the appropriate management and governance of each information collection.
2. Information users are identified. An individual will be identified and be accountable for each access and change they make to information.
3. Information users are responsible. Individuals are responsible for safeguarding the information that they own, access, and use.
4. Decision makers are responsible for ensuring they use information of appropriate integrity for their work.
These three principles establish the three main disciplines related to information governance: Information lifecycle management, information protection, and information quality:
1. Information is kept as long as it is needed. Information that is no longer needed will be disposed of correctly.
2. Information is protected. Information is secured from unauthorized access and use.
3. Information quality is everyone's responsibility. Information is validated and where necessary it is corrected and made complete.
This principle establishes the need for information architecture to ensure that information is being managed in the most efficient and cost effective manner:
Information is managed in a cost effective manner. This is achieved through a well-defined information architecture that follows standards and best practices.
This last principle establishes the point that just because something is technically possible and legal, it is not necessarily appropriate to do. Thought must be given to the consequences and the impact on customer trust and the related brand image:
Information and analytics will only be used for approved, ethical purposes.
The information governance principles are then supported by obligations, delegations, and standards.
Obligations
The information governance obligations are policies that have been delegated from other governance focus areas, governance domains, and business teams. They can be thought of as subpolicies under the policies defined by the originating teams.
The following are typical examples:
Risk Management: The risk management team might impose requirements for classification, data quality, and lineage.
Intellectual Property Protection: Policies that define the requirements for protecting the organization's intellectual property.
Export Controls: Defining the restrictions when moving information between countries.
Financial Reporting: Providing policies related to the collection, management, and retention of financial information.
Privacy: Defining the requirements for safeguarding the privacy of individuals and organizations.
Human Resources: Definition of policies around the management and retention of data about employees.
Marketing: Definition of policies on how customer information should be used.
Delegations
Delegations are policies where the responsibility for implementation has been passed to another governance team. These policies are documented in the catalog to explain where responsibility lies.
Information Security: The information security team typically takes responsibility for managing access control and setting standards for infrastructure security. The information governance classifications provide a business perspective on where these controls need to be applied.
IT Governance: Information governance needs a reliable IT infrastructure to operate successfully. IT infrastructure failures can lead to loss or corruption of data. If data is not available when people need it, they tend to take private copies of the data that they manage themselves.
Approaches
The information governance approaches define the approved standards, best practices, and architecture that underpin the information governance program. Enforcement of standards typically reduces cost, avoids errors, and speeds up the development of new IT capability. The following are examples of typical standards:
Standards for data structures and definitions
Standards for reference data
Standards for authoritative sources
Standards for protection of information
Standards for retention and disposal of information
Standards for managing information quality
2.3.3 Classification schemes
Classification is at the heart of information governance. It characterizes the type, value, and cost of information or the mechanism that manages it. The design of the classification schemes is key to controlling the cost and effectiveness of the information governance program.
A classification scheme consists of a discrete set of values that are used to describe one facet of an asset's character. When an asset is classified with a particular classification scheme, it is assigned one or more of these classification values.
The information governance policies and rules are then expressed in terms of the classification values to provide explicit guidance on how information of that classification should be managed and used.
Every classification scheme has a definition for what it means if information is unclassified.
Each classification scheme must be memorable and meaningful to individuals creating and using information. This is why classification schemes can be applied to the information resources at multiple levels of granularity to provide the correct level of behavior at an appropriate cost.
The classification schemes shown below are suggestions for an information governance program. These classification schemes are implemented in the InfoSphere Information Governance Catalog. Business classifications, role classifications, resource classifications, and semantic classifications are implemented as terms in the glossary. Technical data classes are implemented as data classes.
There are five main groups of classification schemes:
Business Classifications: Business classifications characterize information from a business perspective. This captures its value, how it is used, and the impact to the business if it is misused.
Role Classifications: Role classifications characterize the relationship that an individual has to a particular type of data.
Resource Classifications: Resource classifications characterize the capability of the IT infrastructure that supports the management of information. A resource's capability is partly due to its innate functions and partly controlled by the way it has been configured.
Activity Classifications: Activity classifications help to characterize procedures, actions, and automated processes.
Semantic Classification: Semantic classification identifies the meaning of an information element. The classification scheme is a glossary of concepts from relevant subject areas. These glossaries are industry-specific and are included with industry models. The semantic classifications are defined at two levels:
 – Subject area classification
 – Business term classification
Business Classifications
Business classifications characterize information from a business perspective. This captures its value, how it is used, and the impact to the business if it is misused.
Confidentiality
Confidentiality is used to classify the impact of disclosing information to unauthorized individuals:
Unclassified: Unclassified information is information that is publicly known. Open data is an example of unclassified data.
Internal Use: This information is used to drive the everyday functions of the organization. It is widely known within the organization and should not be shared with external parties. However, if it leaks, the impact on the organization is minimal.
Confidential: This information is only for people with a need to know. It is key information that, if disclosed beyond this group, could have a localized negative impact.
 – Business Confidential: This information provides an organization with a competitive advantage.
 – Partner Confidential: This information is about a partner organization (such as customer or supplier) that has requested that this information be kept in confidence.
 – Personal Information: This information is about an individual. Disclosure of this information could harm or expose the individual.
Sensitive: This information is only for people with a need to know. This information can have lasting damage if it is disclosed beyond this group.
 – Sensitive Personal: This information is about an individual. Disclosure of this information could cause lasting harm or exposure to the individual.
 – Sensitive Financial: This information relates to the financial health of the business and disclosure could have a lasting impact on the financial health of the organization.
 – Sensitive Operational: This information relates to how a business is operating. For example, this might involve high value proprietary procedures and practices. Disclosure of this information beyond the trusted group of people could expose the organization to fraud, threat, or loss of competitive advantage in the long term.
Restricted: This type of information is restricted to a small group of trusted people. Inappropriate disclosure could be illegal or seriously damage the organization's business. Every copy of this information must be tracked and accounted for. There are three subtypes:
 – Restricted Financial: Details of the financial health of the organization.
 – Restricted Operational: Details of the operational strategy, approaches, and health of the organization.
 – Trade Secret: Core ideas and intellectual property that underpins the business.
Retention
Retention is used to characterize the length of time this information is likely to be relevant or needed by the organization.
Unclassified: This type of information is useful but not critical to the organization. It will be retained for the default retention period of two years.
Temporary: This information is a copy of information that is held elsewhere and will be removed shortly.
Project lifetime: This information is needed for the lifetime of the associated project. It can be archived after the project completes.
Team lifetime: This information is needed for the lifetime of the associated team. It can be deleted or archived after the team disbands.
Managed lifetime: Managed lifetime determines how long information should be kept before it is archived. This value can be set by the business because there is no regulatory requirement that controls the retention period. The subtypes defined for this classification are Six Months Retention, One Year Retention, Two Years Retention, Five Years Retention, Ten Years Retention, and Fifty Years Retention.
Controlled lifetime: This information's retention is controlled by regulation or legal action. The length of time is typically dependent on the type of information and will be captured in the related rules.
Permanent: This information is likely to be needed for an extended period unless the business of the organization changes dramatically, at which time the retention of this information should be revisited.
Confidence
Confidence indicates the known level of quality of the information, and consequently how much confidence to have in the information. The following are suggested confidence levels:
Unclassified: New data that has not had any analysis applied to it. Its level of confidence is unknown. This data might turn out to be high quality, but until it has been assessed, it should be treated with caution.
Obsolete: This information collection is out of date and has been superseded by another information collection. It should only be used to investigate past decisions that were made using this information.
Archived: This information has been archived and is available for historical analysis.
Original: Original information comes from operational applications. It has not been enhanced, and so contains the information values that were used by the teams during normal operations. Often this information has a localized and narrow perspective.
Authoritative: Authoritative information is the best knowledge that the organization has on the subject area. This information is continually managed and improved. This information has the highest level of confidence.
Severity
Severity is used to classify the impact of a particular failure of the information technology infrastructure or issue with the data values stored in an information collection. The following are example severity levels:
Severity 1: Critical Situation/System Down/Information Unusable. Business critical software component is inoperable or a critical interface has failed. This indicates that you are unable to use the program, resulting in a critical impact on operations. This condition requires an immediate solution.
Severity 2: Severe impact. A software component is severely restricted in its use, causing significant business impact. This indicates that the program is usable, but is severely limited.
Severity 3: Moderate impact. A noncritical software component is malfunctioning, causing moderate business impact. This indicates that the program is usable with less significant features.
Unclassified: Minimal impact. A noncritical software component is malfunctioning, causing minimal impact, or a nontechnical request is made.
Business impact
The business impact classification defines how critical an information collection is to the organization's ability to do business. This classification is typically associated with business continuity and disaster recovery planning. However, it is also an indication of the value of the information in the information collection.
Unclassified: Information that is used by an individual, so only that individual is impacted.
Marginal: Occasional or background, non-essential work is impacted.
Important: Parts of the business are unable to function properly.
Critical: The business is not able to function until capability is restored.
Catastrophic: The business is lost and restoration is unlikely.
Role Classifications
Role classifications are used to control the types of data that an individual can see. They are typically relative to the data itself. Some role classifications relate to a subject area or entity/attribute type, whereas others relate to instances, particularly when it comes to personal data. As such, some role classifications can be calculated dynamically rather than manually assigned. Here are some examples of more static user roles:
Information Owner: A person who is accountable for the correct classification and management of the information within a system or store.
Information Curator: A person who is responsible for creating, maintaining, and correcting any errors in the description of the information store in the governance catalog.
Information Steward: A person who is responsible for correcting any errors in the actual information in the information store.
Examples of classifications that are more instance-based might be labels that show the relationship between a user of data and the data subject:
Close Neighbor
Relative
Colleague
Manager
Spouse
These types of classification are specific to the instance and need to be dynamically calculated.
Resource classifications
Resource classifications characterize the capability of the IT infrastructure that supports the management of information. A resource's capability is partly due to its innate functions and partly controlled by the way it has been configured.
Governance zone
The governance zone provides a coarse-grained grouping of information systems and information collections for a particular type of usage. The governance zones are overlapping, so an information system or information collection can be in multiple zones.
These zones are commonly found in a data reservoir:
Traditional IT zones
 – Landing area zone
The landing area zone contains raw data just received through the Data Ingestion component from system of record applications and other sources. This data has had minimal verification and reformatting performed on it. Processes inside the data reservoir called data refineries take this data and process it to improve its quality, simplify its structure, add new insight, and link related information together.
 – Integrated warehouse and marts zone
The integrated warehouse and marts zone contains consolidated and summarized historical information that is managed for reporting and analytics.
 – Shared operational information zone
The shared operational information zone has information sources that contain consolidated operational information that is being shared by multiple systems. This zone includes the master data hubs, content hubs, reference data hubs, and activity data hubs. They support most of the service interfaces of the data reservoir.
 – Audit data zone
This is where log information about the usage of data in the data reservoir is kept. Analytics models run in this zone to detect suspicious activity. It is also used by security experts for investigating suspicious activity and for auditing of the data reservoir operations.
 – Archive data zone
Archive data is no longer needed for production, but has potential value for investigations, audit, and understanding historical trends.
Self-service zones
 – Descriptive data zone
The descriptive data zone contains the metadata that describes and drives the management of the data in the data reservoir. This zone starts out as a simple metadata catalog, but as the business gains self-service and governance capability, the descriptive data zone grows in sophistication.
 – Information delivery zone
The information delivery zone contains information that has been prepared for use by the lines of business. Typically this zone contains a simplified view of information that can be easily understood and used by spreadsheets and visualization tools. Business users access this zone through the View-based Interaction subsystem.
 – Deposited data zone
The deposited data zone is an area where the users of the data reservoir can store their own files, either for safekeeping or for sharing. The inclusion of the deposited data zone helps to reduce data leakage from the data reservoir because business and analytics teams are not required to set up their own local file stores.
 – Test data zone
The test data zone contains obfuscated data for testing. It is nested in the deep data zone, and is used by developers and analysts when testing new functions and analytical models.
Analytic zones
 – Discovery Zone
The discovery zone contains data that is potentially useful for exploring for new analytics. Experienced analysts from the line of business typically use this zone for these purposes:
 • Browse catalog to locate the data they want to work with
 • Understand the characteristics of the data from the catalog description
 • Populate a sandbox with interesting data (this sandbox is typically in the exploration zone)
 – Exploration zone
The exploration zone contains the data that the analysts and data scientists work with to analyze a situation or create analytics. Users of this zone reformat and summarize the data to understand how a process works, locate unusual values (outliers), and identify interesting patterns of data for use with a new analytical algorithm.
 – Analytics production zone
The analytics production zone contains detailed information that is used by production analytics to create new insight and summaries for the business. This data is kept for some time after the analytics processing is complete to enable detailed investigation of the original facts if the analytics processing discovers unexpected values. There is a large overlap in the data elements found in the analytics production zone and the exploration zone because the production analytics models are typically developed in the exploration zone. Repositories in this zone will have production service level agreements (SLAs) applied to them, particularly as the organization becomes more data and analytics driven.
 – Derived insight zone
The derived insight zone identifies data that has been created as a result of production analytics. This data is unique to the data reservoir and might need additional procedures for backup and archive.
Transport Security Classification
The transport security classification describes the ability of an information provisioning technology to protect information from interception and tampering while it is being provisioned between systems. It typically uses these classifications:
Unclassified: Unsecured
Secured: Access to the technology is controlled so that only approved processes can access it.
Encrypted: Information is encrypted so that even if it is accessed, no one can read or alter the values.
Information store location
Location classifies where an information store is located. The definitions below are examples from an information governance scheme centered on a data reservoir. The classification helps to identify which sources are part of the reservoir and which are connected. The catalog includes sources that are outside of the reservoir to enable lineage to be captured. Each information store can only have one location:
Unclassified
A potential source of information for the data reservoir. It is present in the information governance catalog to advertise that it exists. However, no attempt has yet been made to integrate it with the data reservoir.
Internal system
A system that is owned by the organization but sits outside of the data reservoir, exchanging data with the data reservoir repositories.
Third-party source
A system that is operated by a third party.
Adjacent reservoir
An information source that is managed by a different data reservoir. This adjacent reservoir is either sending or receiving information.
Data reservoir repository
A core repository of the data reservoir.
Data reservoir service store
A sandbox or data mart that contains information for the business to use for analytics.
Data refinery store
A private information store that is used internally in the data reservoir to transform raw data into useful information.
Activity classifications
Activity classifications help to characterize procedures, actions, and automated processes.
Business process type
The information governance program must be seen to support the business strategy directly and be flexible enough to adapt to changing business needs. To that end, the information governance program should cover five types of business process:
Communication: Ensuring each individual employee is aware of their roles and responsibilities related to the use and management of information.
Compliance: Ensuring requirements are met and incidents of non-compliance are reported.
Exemption: Handling special cases in an effective and timely manner.
Feedback: Measuring the effectiveness of the program and handling suggestions for improvement and complaints.
Vitality: Evolving the program to support new requirements and reach deeper into the organization.
These business processes can be manual procedures, tasks, or automated processes, and together they keep the governance program grounded in the needs of the organization.
Control point decision classification
Control points are decisions made in business processes that determine the response to a governance requirement. The governance program should provide descriptions of how to proceed after each possible choice is made. It has these classifications:
Correct to comply: The data or processing environment will be changed to bring it into compliance.
Request exemption: The current situation will not be changed and an exemption requested. Typically, exemptions are for a set time period. Many permanent exemption requests suggest that the governance program is not meeting the needs of the business.
Ignore: This is used when what is causing the situation has low business impact and a conscious decision is made to ignore it, at least for the short term. For example, there might be quality errors detected in the contact details of the main customer database. A control point decision can be made not to correct contact details for customers that have been inactive for more than two years.
Request clarification: Governance requirements are unclear and more information is needed.
Enforcement point classification
The enforcement point classification is used to characterize the behavior of an automated rule or component implementation. Typically it is one of these components:
Verification rules: Testing that a particular governance requirement has been met. For example, verifying that an attribute contains valid values. If the rule fails, an exception is raised.
Enforcement rules: Ensuring that a particular governance requirement is met. For example, masking sensitive data as it is saved in an unsecured repository.
Examples of enforcement point classifications include copy, delete, mask, validate values, validate completeness, derive values, enrich, standardize, archive, back up, link, merge, collapse, and raise exception.
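A sketch of how these classifications might drive automation follows. The classification names come from the list above; the implementations are simplified placeholders, with a governance rule naming the action and the classification selecting the implementation to run.

def mask(value):
    # Replace the value entirely; real masking would be format-preserving.
    return "****"

def standardize(value):
    return value.strip().upper()

def raise_exception(value):
    raise ValueError(f"Value failed a governance check: {value!r}")

ENFORCEMENT_POINTS = {
    "mask": mask,
    "standardize": standardize,
    "raise exception": raise_exception,
}

# A governance rule names the action; the classification selects the implementation.
action = ENFORCEMENT_POINTS["standardize"]
print(action("  ne region "))                    # -> "NE REGION"
print(ENFORCEMENT_POINTS["mask"]("123-45-6789")) # -> "****"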
Semantic classification
Semantic classification identifies the meaning of an information element. The classification scheme is a glossary of concepts from relevant subject areas. These glossaries are industry-specific and are included with IBM industry models.
The semantic classifications are defined at two levels:
Subject area classification
Provides a coarse-grained classification of the source and use of information. This classification is typically used to provide a context or scope to the business classifications: for example, to define retention periods for Controlled lifetime information collections, or to specify the subject area that an authoritative information collection supports.
Business term classification
This uses the business terms that are defined in the business glossary to provide a fine-grained semantic classification for information elements. This helps people find the data that they need and helps prevent integration errors as information is copied and consolidated between information collections.
Data classifications
Business classifications typically define the type of governance that is required. Data classifications characterize the way that data is typed and supported technically, which is key information when automating the actions associated with information governance.
Data classes
Data classes define the fine-grained logical data types. They are used to determine which implementation of a governance action to run. The following are examples of data classes:
Personal information such as first name and surname, gender, age, Passport Number, Personal Identification Number, personal income, date of birth, country-related identification number, and driver's license number
Company name
Location information such as address, city, postal code, province/state, and country
Financial information such as Credit Card Number, Credit Card Verification Number, and Account number
Contact information such as phone number, email address, Internet Protocol Address, Uniform Resource Locator, and computer host name
The concept of a data class is common between IBM InfoSphere and IBM InfoSphere Optim™ products.
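The following sketch shows one way automated data class detection might work: simple patterns assign a data class to a column based on a sample of its values, which can then select the governance action to run. The patterns and threshold are illustrative, not the detection logic of any particular product.

import re

DATA_CLASS_PATTERNS = {
    "US Social Security Number": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "Credit Card Number":        re.compile(r"^\d{13,16}$"),
    "Email Address":             re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_column(values, threshold=0.9):
    """Return the data class whose pattern matches most of the sample values."""
    for data_class, pattern in DATA_CLASS_PATTERNS.items():
        matches = sum(1 for v in values if pattern.match(str(v)))
        if values and matches / len(values) >= threshold:
            return data_class
    return "Unclassified"

print(classify_column(["123-45-6789", "987-65-4321"]))  # US Social Security Number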
2.3.4 Governance Rules
Information governance rules are defined to explain how the information governance policies will be implemented. Typically they apply to the activity of a system or team, but they can also apply to specific types of data. Either way, they are organized according to the business owners who are responsible for their implementation.
There is a governance rule defined for each situation where an information policy is relevant. The rules are then expressed in terms of the business classifications. For example, the owner of a collaboration space might define a set of governance rules that cover these points (a sketch follows the list):
Which classifications of data can be posted in the collaboration space
For those classifications of data that are allowed, what restrictions on use and access must be observed
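A minimal sketch of such a rule set follows. The specific classifications, permissions, and restrictions are invented for illustration; the point is that the rules are expressed against the confidentiality classifications rather than against individual data sets.

COLLABORATION_SPACE_RULES = {
    "Unclassified": {"allowed": True,  "restriction": None},
    "Internal Use": {"allowed": True,  "restriction": "members of the space only"},
    "Confidential": {"allowed": True,  "restriction": "mask personal information first"},
    "Sensitive":    {"allowed": False, "restriction": None},
    "Restricted":   {"allowed": False, "restriction": None},
}

def can_post(classification):
    """Return whether data of this classification may be posted, and any restriction."""
    rule = COLLABORATION_SPACE_RULES.get(classification, {"allowed": False})
    return rule["allowed"], rule.get("restriction")

print(can_post("Confidential"))  # (True, 'mask personal information first')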
2.3.5 Business terminology glossary
For reservoir catalogs such as the IBM InfoSphere Information Governance Catalog, the classification process includes assignment of assets to business terms. Business terms are part of a business glossary, where terms are organized into folder-like structures called business categories. A business term captures the vocabulary used by the business and includes a textual description that defines the term itself. By associating assets in the catalog with one or more business terms, a powerful semantic is created: the association says that this information asset is classified by this particular business term. This classification allows searching by business terms, so users can find all the assets that are associated with a business concept. Navigation can also be facilitated in the other direction with a classification in place. After a particular information asset is found, semantically related assets can be found by seeing what other assets have been classified by the business terms associated with this asset.
For some technical users, seeing the actual term-to-asset assignments can be interesting. For most others, this information will be used to drive more meaningful search and to provide suggested assets that are related to the one selected, all in a fully automated manner.
Creating a full business glossary from scratch can be difficult, but fortunately there are usually starting points to ease the process. Jules Keeper and his team started with industry-standard glossaries that facilitate interaction between different pharmaceutical companies and government regulators. Predefined vertical glossaries for other industries (and more) can be obtained from sources such as the IBM Industry Models.
For more information about IBM Industry Models, see:
2.3.6 EbP starts its governance program
Jules Keeper sees the governance program as a key part of his role as Chief Data Officer (CDO) for EbP. As an industry veteran, he has been responsible for establishing governance programs at several previous companies. He has learned that it is vital to start with a fairly narrow vertical slice through the domain to be governed, establish early success there, and build on it.
With the information governance principles in place, Jules creates a core governance team. The primary initial goal of the team is to establish the core set of policies, rules, and procedures that will govern the information in the reservoir. As the scope of the reservoir grows and changes over time, so will the composition of the governance team itself.
In this case, the team begins with a vertical domain slice that deals with the US Food and Drug Administration (FDA). Within this domain, they further focus on the handling of patient information for clinical trials. Working from a set of existing FDA vocabularies and other internal sources, the team assembles a core business glossary. These terms give explicit meaning to what would seem to be self-evident terms such as Patient. The clear definition removes any ambiguity about when a person moves from being a candidate for a particular trial to a patient in that trial, and thus can be properly considered for inclusion in reports.
As part of this process, the team also identifies personally identifiable information (PII). Compliance is required with US and other countries' statutes around personal privacy and the handling of personally identifiable information. Proper handling of PII is a key business responsibility for Jules Keeper and is essential for the success of the reservoir itself. US Social Security numbers, patient identifiers, and other data elements are given formal definitions in terms of their structure so that they can be tagged by automated means when information containing these fields is added to the reservoir. Rules for who can see the data, and what transformations must be applied to it (masked, removed, blocked), are all defined. These rules can include the zones within the reservoir where data with unmasked PII is allowed and the zones that require masking.
It might seem like all PII should be masked by default on entry to the reservoir, but there can be valid business reasons for adding it unmasked. Providing strong governance around those fields is the next step, so that only authorized individuals can see the unmasked data. One of those use cases is fraud investigation: a risk officer, for instance, will often need access to all the data to facilitate an investigation.
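The sketch below illustrates zone-aware handling of PII under these assumptions: unmasked PII is tolerated only in zones the governance team has approved, and only for roles such as a risk officer. The zone names, role names, and fields are illustrative.

UNMASKED_PII_ZONES = {"landing area", "audit data"}
AUTHORIZED_ROLES = {"risk officer", "security officer"}

def pii_view(record, zone, role, pii_fields=("ssn", "patient_id")):
    """Return the record, masking PII unless the zone and role both permit it."""
    if zone in UNMASKED_PII_ZONES and role in AUTHORIZED_ROLES:
        return record
    return {k: ("****" if k in pii_fields else v) for k, v in record.items()}

record = {"patient_id": "P-00017", "ssn": "123-45-6789", "outcome": "complete"}
print(pii_view(record, zone="discovery", role="data scientist"))  # masked
print(pii_view(record, zone="audit data", role="risk officer"))   # unmasked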
Another area that Jules directs the core governance team to look into is information lineage. Information lineage shows the primary upstream sources and the downstream users for pieces of information. This perspective can be invaluable in determining whether a particular data set is the correct one to use, whether it is up to date, and so on. However, lineage needs to be governed like anything else. The team establishes rules about which assets must be published to the reservoir with lineage and which ones do not need it. Further, they also define policies that validate lineage on a periodic basis for certain key data sources. Because lineage is often built and reported through automated means, it can be subject to bugs and errors like any other software process. It is important that a selected subset of lineage flows is regularly audited for correctness. Given the large number of assets in the reservoir and the huge number of flows that are defined, it is impractical to check them all. Checking the key flows can help build confidence that the others are proper too.
2.3.7 Automating Curation Tasks
Much of what has been described can be classified as various types of curation, that is, giving and maintaining meaning for a collection of information assets. For the reservoir to scale, it is vital to use both automated and social curation mechanisms.
The following automated curation tasks have already been described:
Assignment of terms to assets
This can be done on a contingent basis, automatically, and then confirmed (in batch) by others, or it can be fully automated. The effectiveness of automated assignment can vary with asset types, but expect this to be an area where research and technology advancements rapidly increase effectiveness.
Data type determination and assignment to data classes
This can discover US Social Security numbers, credit card numbers, and so on, in data sets. These data classes can be used to determine which verification rules to run, and to identify potential cases where data has been misclassified.
Assignment to other ontologies
It turns out the business glossary is just one means of classifying the assets in the repository. Other classification schemes are not tied directly to the business language used, but can instead reflect reusable logic. For example, a geographic classification of an asset would understand that in the US, an address exists within a town, which is a type of political jurisdiction. A town is in a county, which is another political jurisdiction, which is in a state, and so on. This geographical classification can be used by data scientists for a query such as finding “all clinical trials and patients conducted in Orange County, California, in 2012”.
Social curation uses the knowledge of the users of the data reservoir. Although automated techniques should do the bulk of the work, there is no substitute for a subject matter expert (SME) tagging, commenting on, rating, or classifying a particular asset or set of assets. These users and others can also rate data sources (zero to five stars, for example). Different users can have different weights assigned to their ratings and classifications depending on the zone and domain of the reservoir that they are operating in. All of this allows more assets to be discovered and suggested in a self-service manner.
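A small sketch of the weighted-rating idea, with invented users and weights: each curator's zero-to-five star rating counts according to the weight assigned to that curator for this part of the reservoir.

def weighted_rating(ratings, weights, default_weight=1.0):
    """Combine zero-to-five star ratings, weighting each user's vote."""
    total = sum(weights.get(user, default_weight) * stars
                for user, stars in ratings.items())
    weight_sum = sum(weights.get(user, default_weight) for user in ratings)
    return total / weight_sum if weight_sum else 0.0

ratings = {"tessa": 5, "harry": 3, "callie": 4}
weights = {"tessa": 2.0, "callie": 1.5}   # SMEs for this subject area weigh more

print(round(weighted_rating(ratings, weights), 2))  # -> 4.22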
2.3.8 Policies for administering the reservoir
Jules and the governance team do not stop just at policies around data privacy. Other policies are defined to make sure that the reservoir itself is maintained in a manner that ensures that relevant data is easy to find.
To do this, they also define policies for data retention. Certain data sets are added to the reservoir from well-defined sources on a scheduled basis. How long should these be retained? Where do they go when they are retired? If the reservoir catalog supports it, policies can also be defined that handle unused data, archiving it as needed.
2.4 Creating a culture that gets value from a data reservoir
The data reservoir is part of a profound shift in the culture at EbP toward data-centric decision making. With the proper information from the reservoir and easy-to-use analytic tools, the days of acting on gut instinct are replaced by data-rich analysis based on trusted data.
2.4.1 Reservoir as a vital daily tool
To make this shift, both Erin and Jules realize that the reservoir must be a vital daily tool for knowledge workers such as Callie and Tessa. Callie, Tessa, and other knowledge workers have been involved from day one of the project and had a key voice in selecting the vendor for the catalog and collaboration tools. The knowledge workers considered these key points in their evaluation:
Is there a notion of project or investigation that allows virtual notebooks to be created that target a particular analytic task or research item?
Can their favorite analytic tools easily load data from the reservoir and produce results? Is lineage automatically generated for these analytic flows?
Are quality and trust scores computed automatically for data sets within the reservoir?
Can the knowledge workers easily flag bad data, rate good data highly, add tags/labels, and otherwise curate the data that they find there?
Is it easy to find and use lineage for information assets?
The goal for knowledge workers like Callie or Tessa is that the data reservoir becomes something like a trusted assistant or colleague: always ready to help find the data that they need, help determine whether the data is good or bad, record analytic results, and share those results and the overall investigation in a form that others can find and reuse.
2.4.2 Reassuring information suppliers
So far this section has focused on the consumption side of the reservoir in making it a vital component, but the supply side is also key. Information suppliers, be they other business units or other entities within the company, need to be assured that the data that they supply to the reservoir is both managed properly and eventually used. The governance program that Jules has developed helps with the concerns about the way information is managed. Individual information suppliers can have a say in the policies that are defined for their data sources. Usage statistics and other metrics for the reservoir itself can also show which data sets are being accessed and with what frequency.
2.5 Setting limits on the use of information
This chapter has touched briefly on some of the ethical and security aspects of data within the reservoir, but these are such important topics that they are worth a more detailed look.
2.5.1 Controlling information access
The goal of the reservoir is to get high quality, trusted data to the correct people in a self-service manner. This allows them to do their jobs and creates value for the company. However, the reservoir's wide user community means that data must be used appropriately and only by those who are authorized to do so.
The key to information security is the appropriate classification of data, either by the owner or by a trusted curator upon entry to the data reservoir. This defines the type of protection that is appropriate for the data. The classification can be applied at the information collection or attribute level depending on the variety of data in the data source.
Access is authorized by the business owner of the data using a well-defined business process that maintains a record of the people with access rights and then makes requests to the IT team to update the access control list.
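The following Python sketch shows how a classification-driven access check might look, assuming each asset carries a sensitivity classification and the business owner maintains an access list per classification. The classification labels, user names, and can_access function are hypothetical.

# Minimal sketch of a classification-driven access check. The access lists
# and classification labels are illustrative assumptions.
ACCESS_LISTS = {
    "confidential": {"erin", "jules"},
    "internal":     {"erin", "jules", "callie", "tessa"},
    "public":       None,  # no restriction
}

def can_access(user, asset_classification):
    """Return True if the user may access data with this classification."""
    allowed = ACCESS_LISTS.get(asset_classification)
    if allowed is None:
        # "public" is unrestricted; an unknown classification is denied.
        return asset_classification in ACCESS_LISTS
    return user in allowed

print(can_access("callie", "confidential"))  # False
print(can_access("callie", "internal"))      # True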
2.5.2 Auditing and fraud prevention
Applications tend to limit the amount of data an individual can see and the actions they can take. The purpose of the data reservoir is to remove these restrictions by freeing the data from boundaries of the original application. However, with this freedom comes the responsibility to use data responsibly.
After information access controls are in place, it is vital to monitor patterns of usage and raise flags or alarms when access falls outside the boundaries of acceptable use. The combination of actionable policies that prevent data from being exposed to inappropriate individuals and monitoring that looks for access outside prescribed policies helps ensure that the data reservoir community adheres to governance policies and procedures.
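As a minimal illustration of such monitoring, the following Python sketch flags users who exceed a daily access limit or who access data outside normal hours. The thresholds, event format, and function name are assumptions for illustration only.

# Minimal sketch of usage monitoring over (user, asset, hour-of-day) events.
# Thresholds here are purely illustrative.
from collections import Counter

def flag_unusual_access(events, daily_limit=100, allowed_hours=range(6, 22)):
    """Flag users who exceed a daily access limit or access data off-hours."""
    counts = Counter(user for user, _, _ in events)
    flags = []
    for user, count in counts.items():
        if count > daily_limit:
            flags.append((user, f"{count} accesses exceeds daily limit"))
    for user, asset, hour in events:
        if hour not in allowed_hours:
            flags.append((user, f"off-hours access to {asset} at {hour:02d}:00"))
    return flags

events = [("tessa", "trial_summary", 10)] * 3 + [("unknown01", "patients", 2)]
print(flag_unusual_access(events))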
One other area of information use in the reservoir concerns the terms and conditions around purchased data sets, and other data sets that might have precise legal restrictions on their use. Proper information governance policies and rules must be attached to these assets so that the terms and conditions can be clearly seen, honored by users, and in some cases enforced by the reservoir itself.
A strong set of policies and procedures for information access and audit contributes to the confidence in the reservoir itself. Information suppliers can be confident that their information is being used within the terms and conditions that they have specified, and by people that are authorized to see and use the data. Information users can be confident that the information they find in the reservoir can be used without further restrictions when used within the clearly stated and easily identifiable policies and rules. Both are key components of a self-service data reservoir.
2.5.3 Ethical use
The ability to process large amounts of data from multiple sources is advancing faster than governmental and other regulatory bodies can define the proper use of information. This creates a gray area between what is currently clearly legal and what is newly possible in terms of insights gained from what have, until now, been disparate data sources.
This situation leaves companies, and the individuals within those companies, to use their best judgment in the ethical treatment of this data. IBM has published a paper, Ethics for big data and analytics1, discussing some of these issues. The paper highlights some of the factors to consider when deciding what constitutes ethical use of information:
Context
For what purpose was the data originally surrendered? For what purpose is the data now being used? How far removed from the original context is its new use? Is this appropriate?
Consent and choice
What are the choices given to an affected party? Do they know that they are making a choice? Do they really understand what they are agreeing to? Do they really have an opportunity to decline? What alternatives are offered?
Reasonable
Are the depth and breadth of the data used, and the relationships derived from it, reasonable for the application in which they are used?
Substantiated
Are the sources of data used appropriate, authoritative, complete, and timely for the application?
Ownership
Who owns the resulting insight? What are their responsibilities towards it in terms of its protection and the obligation to act?
Fair
How equitable are the results of the application to all parties? Is everyone properly compensated?
Considered
What are the consequences of the data collection and analysis?
Access
What access to data is given to the data subject?
Accountable
How are mistakes and unintended consequences detected and repaired? Can the interested parties check the results that affect them?
Each of these aspects can be used to evaluate a specific use of data in the data reservoir and to design appropriate measures into the solution to safeguard both the privacy of the data subjects and the reputation of the organization.
An important point to remember is that there is no fixed definition of the ethical use of information. It varies among people of different backgrounds and age groups, and collectively our perception of what is expected and what seems creepy changes all the time. Therefore, it is important for the organization to have a position it is comfortable with, to be transparent about its use of data with data subjects and stakeholders, and to provide an appropriate process through which people can raise concerns and have redressed any situation they consider an unethical use of data about them.
2.5.4 Crossing national and jurisdictional boundaries
If your company maintains data that crosses national boundaries, you must pay particular attention to a growing and diverse set of laws that governs the allowable use of data for individuals within a particular country and the movement of data across national boundaries.
As a starting point to get basic information about data protection laws worldwide, see the Data Protection Handbook at:
If you are operating across national or other jurisdictional boundaries, you must consider regional requirements for data handling and establish policies and rules that respect those requirements. This can result in particular reservoir repositories being sited in particular countries, with limits on who can access the data and the activities in which this data can be used.
Where regulations differ wildly from country to country, it might be easier to have a data reservoir in each country with links between them to share summarized data where needed.
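The following Python sketch illustrates how a cross-border transfer check might be expressed as policy data that the reservoir evaluates before moving or sharing a data set. The jurisdictions and allowed destinations shown are purely illustrative assumptions, not statements of actual law.

# Minimal sketch of a cross-border transfer check, assuming each data set is
# tagged with its jurisdiction. The rules below are illustrative only.
ALLOWED_DESTINATIONS = {
    "EU": {"EU"},               # example rule: this data stays inside the EU
    "US": {"US", "EU"},
    "SG": {"SG", "US", "EU"},
}

def transfer_allowed(source_jurisdiction, target_jurisdiction):
    """Return True if data from the source jurisdiction may move to the target."""
    return target_jurisdiction in ALLOWED_DESTINATIONS.get(source_jurisdiction, set())

print(transfer_allowed("EU", "US"))  # False under this illustrative rule set
print(transfer_allowed("US", "EU"))  # True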
2.6 Conclusions
The data reservoir is an ecosystem that enables an organization to share information both effectively and safely. The definition of how this ecosystem works is embedded in the information governance program.
This governance program must be comprehensive to ensure that the ecosystem operates effectively. It must also be flexible, because a wide variety of data is stored and it is not appropriate to apply a single set of standards and processes to all types of data.
 

1 Ethics for big data and analytics, available at http://www.ibmbigdatahub.com/whitepaper/ethics-big-data-and-analytics