Chapter 9

Governing Big Data and Analytics

Abstract

We begin this chapter very much with the end in mind. The majority of work we've seen published related to governing Big Data analytics is an extension of approaches used in the structured data world. Given the recent Big Data phenomenon and some significant differences between the data approaches, technologies, sources, and challenges, it would seem more useful to step back and consider the scope of governance options available to you, which is exactly what we've chosen to do here. We therefore think that the end you want to keep in mind for Big Data analytics is different from the one used for a structured or legacy data environment. Data governance and controls have been maturing alongside traditional application and data management systems for decades. There was not, in our experience, a moment when anyone questioned whether or not structured data systems, applications, stores, and delivery services would be required. Quite the opposite: governance and controls were introduced as it became clear that the size and scope of the systems required them. In the current Big Data environment, that presumption has still not been entirely proven.

Keywords

Analytics; Big Data; Data analytics; Governance; Security

Big Data

We begin this chapter very much with the end in mind. The majority of work we’ve seen published related to governing Big Data analytics is an extension of approaches used in the structured data world. Given the recent Big Data phenomenon and some significant differences between the data approaches, technologies, sources, and challenges, it seems more useful to step back and consider the scope of governance options available, which is exactly what we’ve chosen to do here. Thus we think the end you may want to keep in mind for Big Data analytics is different from the one used for a structured or legacy data environment. Data governance and controls have been maturing alongside traditional application and data management systems for decades. There was not, in our experience, a moment when anyone questioned whether or not structured data systems, applications, stores, and delivery services would be required. In fact, the opposite was true: governance and controls were introduced as it became clear that the size and scope of the systems required them. In the current Big Data environment, that presumption has still not been entirely proven.
Most firms, governments, and nonprofits considering Big Data initiatives and investments have the opportunity to openly question the terms under which they are going to make those investments. Their business and other functions are already operating and effectively automated with traditional computer systems and data management, including some controls and governance. But the opportunities surrounding Big Data are so new that we must analyze and understand the terms under which we would engage in those options. So we’ve come to the considered conclusion that Big Data and analytic governance is first and foremost a function of making an informed investment decision. We really believe that governance over Big Data is lateral; we must first choose to engage in investing in and controlling Big Data as opposed to adding governance and controls to existing structured data environments.
This choice highlights the need for the demand management and rationalization approaches we discussed in the previous chapter. It also highlights the value of a broader, business-case approach to Big Data. Data-centric business cases often rest on the foundation of structured data-delivery systems and add value layers based on analytic, forecasting, and behavioral-change-oriented systems. We’re all familiar with the use of Big Data techniques and technologies in the online world to track consumer preferences and purchasing and to drive crowd-level marketing and communications. Thus business-case analysis has to include the potential benefits that come from these kinds of outcomes.
We also think that, because Big Data is a relatively new phenomenon, it is important to use current, worked examples and cases to highlight where governance and controls are needed or are most useful. In our experience with Big Data programs, federal government and commercial firms seeking to ingest very high volumes and diversities of data do not always have a well-defined purpose. Often these programs are focused on collecting data and storing it so that it can be provisioned dynamically for multiple, undefined uses. This pattern is fairly commonplace in the work we have done in the intelligence, civilian, and federal sectors, and we were surprised by how common the approach is in the commercial space as well. So we’ll use a series of scenarios to discover the kinds of Big Data sources, streams, storage options, and business outcomes that can result. Before we engage in scenario-based reviews, it may be useful to consider three real-world examples unfolding today.
It’s important to note in these three real-world examples that, in each case, there was a conscious decision to engage Big Data methods and technologies. That is the first real moment of governance over Big Data. Choosing to engage in Big Data methods and outcomes is the initial governance entry point for Big Data and should not be dismissed or assumed in any governance model. This becomes increasingly important as more of the projects we consider in the Big Data world entail extending existing systems or replacing them. For opportunities that are substantially similar to those we’ve used to build traditional systems, we should be able to leverage business-case thinking around cost-benefit analysis. In fact, in a direct comparison to traditional system development approaches, many of the initial business cases that have launched successful Big Data efforts showed that Big Data can provide tremendous savings in cost and time.

Example 1—Preventing Litigation: An Early Warning System to Get Big Value out of Big Data, by William “Bill” Inmon and Nelson E. Brestoff

This book captures the legal and technical ramifications of joining the Big Data world with at least one of the domains within the structured data world. For decades, Bill Inmon has led the development of architectures, solution approaches, technologies, and business outcomes for the largest global consumers and producers of information technology. He is best known for his work in data warehousing and data integration, and he has continued that work into the Big Data world by showing how complex, textual data can be disambiguated and selectively mapped into the structured data world. Bill has shown that it is critical to associate objects in the Big Data world with master data objects that are well defined in the structured world. Examples include customers, suppliers, counterparties, products, and so on. In his work on preventing litigation, he has extended this paradigm to show that it is possible to scan through thousands, hundreds of thousands, or even millions of documents related to contractual obligations and structured in unknown or highly varied ways. The ability to scan these documents electronically and provision the text in a way that allows for disambiguation and integration with other data is a game changer for litigation, due diligence, and the legal and business proceedings that often follow. It also establishes a new watermark for the movement toward more intelligent systems, replacing heavy manual workloads with rapid, reliable analysis and integration of critical data.
In litigation prevention, Bill and his partner prove that a textual disambiguation engine allows information to be pooled from hundreds of thousands of contracts and related to the key counterparties, time periods, and conditions embedded in these long, complex documents. Now questions can be answered quickly about what potential liabilities, responsibilities, or other issues are associated with each counterparty to the firm, for what period of time, and, potentially, with what dollar limitations and impact. This work was previously done, to a limited extent and at best, by manual effort and labor from law firms and individual firms. As a result, it could take months or years to complete. Bill’s software and architecture prove that this is another area where Big Data methods that connect Big Data content with structured data concepts are viable and can take orders of magnitude in cost and time out of the equation.
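To make the linking concept concrete, here is a minimal, purely illustrative sketch in Python of relating counterparty mentions found in contract text to master data identifiers. The company names, identifiers, and the naive matching rule are our own assumptions for illustration; this is not a depiction of the patented engine described above.

# Illustrative sketch only: relate counterparty mentions found in contract text
# to master data identifiers so obligations can be queried by party.
# Names, IDs, and the naive substring match are hypothetical assumptions.
MASTER_COUNTERPARTIES = {
    "acme pharma": "CP-0001",
    "globex logistics": "CP-0002",
}
def link_contract_to_master(contract_id, contract_text):
    """Return (contract_id, master_id) pairs for counterparties named in the text."""
    text = contract_text.lower()
    return [(contract_id, master_id)
            for name, master_id in MASTER_COUNTERPARTIES.items()
            if name in text]  # naive substring match, for illustration only
if __name__ == "__main__":
    sample = "This agreement between Acme Pharma and the supplier limits liability to $2M."
    print(link_contract_to_master("CTR-2016-0042", sample))  # [('CTR-2016-0042', 'CP-0001')]

In practice, the master data objects on the structured side would already be governed under the Playbook's stewardship controls, which is what makes this kind of linkage reliable.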
The other aspect of this solution is the decision frame he provides for using a Big Data approach. He frames the governance decision concerning Big Data investment around a business problem that has specific, quantifiable, financial, and legal impacts. So if a pharmaceutical industry participant needs to gauge their legal exposure over a particular drug or other product, they would engage in this investment to quickly scope an analysis of all legal documents and commitments related to that drug. There are many other aspects to Bill’s technology, and we’ve chosen to highlight it because it was one of the first in this space and holds several key patents around its capabilities. Bill’s approach also applies controls to the data store it produces, ensuring appropriate change-control and stewardship capabilities. This approach provides end-to-end Big Data governance, starting with the investment decision and completing the cycle with change and quality controls over the data it generates. All of the content that is produced and integrated is then subject to change controls in a typical stewardship approach consistent with the Playbook.

Example 2—Digital Healthcare (Telehealth 2.0)

We have identified Big Data investment options and benefits with the president and CEO of a leading Telehealth and digital healthcare firm. The CEO and his team had already started to leverage Big Data methods and technologies to support enhanced care-delivery options for a large system of community members. While many of the benefits and business-case details are highly proprietary and sensitive, we have permission to share a couple of key aspects of the CEO’s approach and success. The firm has always espoused the need to enhance medical care delivery at every step of the process. It works very closely with care providers and recipients to ensure that the value-added mapping of services and the quality-assurance process extend beyond initial service delivery. This desire to extend into every area of care provision was an early distinction for the firm. According to the CEO, governing Big Data means making intelligent investments in Big Data partners, methods, and technologies, all focused on specific improvements in care quality and patient outcomes. The firm has always been focused on care efficacy across digital channels of delivery.
A key aspect of the firm’s success is its ability to ensure that quality-assurance metrics are embedded in the scheduling, selection, engagement, and follow-through aspects of care delivery. Many of these metrics require gathering data from Big Data sources, such as the Web and mobile and cellular phone platforms. These metrics must be tracked, baselined, and analyzed on a recurring basis, which requires large analytic capabilities in advanced, cloud-driven architectures. So the CEO’s team thoughtfully engaged in a multitude of partnerships, such as one with Microsoft for the Azure cloud, to ensure they have an end-to-end, secure environment through which to provide and analyze care efficacy. Their Big Data investment decisions have necessitated additional strategic partnerships and investments to scale out the initial Big Data collection and analysis required by the business. That thinking and execution is what enables this CEO to leapfrog the competition in terms of the size of the membership base he can support, the level of effective care he can provide, and his ability to diversify types of care more rapidly than any other market participant.
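As a minimal sketch of what “tracked, baselined, and analyzed on a recurring basis” can mean in code, consider the following Python fragment; the metric, its values, and the two-standard-deviation rule are our own illustrative assumptions, not the firm’s actual quality-assurance model.

# Minimal sketch: baseline a care-quality metric and flag deviations.
# The metric, its values, and the k-standard-deviation rule are illustrative assumptions.
from statistics import mean, stdev
def flag_deviation(history, latest, k=2.0):
    """Flag the latest reading if it falls more than k standard deviations from the baseline."""
    baseline, spread = mean(history), stdev(history)
    return baseline, abs(latest - baseline) > k * spread
if __name__ == "__main__":
    follow_through_rate = [0.91, 0.93, 0.90, 0.92, 0.94]  # weekly follow-through on scheduled care
    baseline, alert = flag_deviation(follow_through_rate, latest=0.78)
    print(f"baseline={baseline:.2f}, alert={alert}")  # baseline=0.92, alert=True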

Example 3—Chief Data Officer Integrates Big Data for Financial Services Firm

In a recent industry interview, Derek Strauss, a chief data officer at a major financial services firm, discussed his team’s insights into leveraging Big Data. The team leverages data from Web, mobile, and other platforms about its customers’ preferences and investment options to provide affinity-based suggestions for customer consideration. His firm and its clients clearly see the value in this approach. His clients receive options and insights tailored to their status and interests. His firm is able to maintain a very close relationship with its clients by providing these valuable insights in a controlled and timely manner. In this case, Derek was able to identify the business case to integrate Big Data into the firm’s customer environment, with the appropriate controls and investment that entails.
This investment approach extends the data governance controls Derek had already put in place using his “Seven Streams of Data Resource Management” approach. A final Data Playbook was then constructed from the existing Playbook methods and the Seven Streams approach to support his team’s work. We encourage you to review some of the basis for that work in the publications Derek wrote with his partners Bill Inmon and Genia Neuschloss, such as DW 2.0: The Architecture for the Next Generation of Data Warehousing.
Derek also noted that they had engaged the college and university community to crowdsource data-science skills and insights. The crowdsourcing approach has become integrated into the Big Data movement and repeatedly proves to have high value where it is targeted appropriately. Once again, deciding whether to engage and invest, and on what terms that investment makes sense, is basic governance. Derek’s choice proved to be highly valuable: it gave him access to some of the freshest thinking in research and analysis, and it allowed him to tap into a potential recruiting source for the firm. His willingness to partner with these learning institutions and the data-science skills they brought to the table allowed him to sidestep the high initial investment required to build a pool of dedicated data scientists, while ensuring a consistent refresh of thinking and methods. This is a clear example of leadership governance as a focus, as opposed to passive or control governance. Leadership governance moves beyond current models to ensure the investment in the Big Data approach, in this case data science, is made in ways that scale and refresh themselves sustainably.
In the interview, Derek also noted that he had worked with his teams to improve their foundational work for data stewardship, governance, and quality as they engaged in these other areas. Derek is very clear that Big Data layers are built on top of traditional data skills and systems, and he emphasizes the value of strengthening those foundations. His ability to engage his executive sponsors on a regular basis and show the value of improving his foundation while building additional layers of value is what makes it possible for him to sustain his firm’s competitive data advantage.
Each of these examples demonstrates the need to engage in Big Data as a formal governance decision and to do so from the beginning. They also identify the different types of benefits that can be obtained and the discipline required to see them through. Note that, in each case, the discussion begins with the business challenge or opportunity and then proceeds to consider Big Data methods and technologies as an alternative to traditional systems-development approaches.

Personal Big Data

There is, of course, a personal dimension to Big Data governance—ie, the choice to opt in or opt out of having the Big Data you generate tracked by others. This is becoming a very complex area, but it starts with a simple premise: the digital footprint we leave behind when we engage in digital activities is subject to collection and analysis by others. In the European Union, this became especially important when a court decided that a search engine like Google has to comply with a citizen’s “right to be forgotten.” Essentially, under certain circumstances, search engines have to remove the links to a person’s personal information. This legal framework attempts to force digital-footprint trackers to eliminate personally identifiable information from their vast collections. There continues to be some testing around abstracting people’s identity while keeping their digital footprint. In the meantime, we have some choice over what level of digital-footprint tracking and analysis we support. There is also the initial governance decision we make when considering Big Data devices and activities. This can include the use of fitness-tracking wearable devices, location tracking on our smartphones, and even biometric sensors within wireless communications. In each case, we are choosing to generate a larger digital footprint that is at least captured on a smartphone or computer and potentially shared with commercial interests. From a personal perspective, governing our Big Data footprint is a function of understanding what choices we have before we engage in generating more Big Data.
As we move toward understanding scenario-based approaches to Big Data governance and application, we should consider the way Big Data and analytics are delivered and operated within scenarios. The managed analytics services, or MAS, space is rapidly emerging as a next-generation approach to delivering on the promise of Big Data without the large, fixed investments or lengthy time periods normally required to achieve results. There are many providers in this space serving different industries and market segments. One example, Elevondata, has proven that a fully managed, service-based approach can be cost-effective and rapidly produce business results for small, medium, and even global clients. This end-to-end managed service starts with full data provisioning and requirements analysis with business-case scenarios, which allow for flexible target benefits. The creation of the Big Data architectural constructs, applications, services, delivery mechanisms, and consumption tools is all part of the managed-service process. The limited initial investment allows a rapid business-case decision to be made, and the short ramp-up time means that initial discovery and analysis results are delivered in weeks or months, as opposed to the months or years typical of the traditional systems-development approach. Elevondata has delivered enough of these managed-service solutions to show the ongoing benefit of a managed-service operator who leverages cloud- and distributed-computing architectures to provide search capabilities without large fixed investments. It is important to keep this kind of managed-service approach in mind as we look at our scenarios, since we may need to partner with managed-service providers in order to establish and operate governance controls for Big Data.

Governing Big Data in Large and Small Ways

The standard approach to Big Data in the data governance community suggests that we take the same basic steps for governing Big Data that we do for structured data. We’ve already discussed the fact that there is a larger step we get to take with Big Data, one we haven’t traditionally had the choice to make with structured and legacy data: the choice of whether to engage in Big Data methods and technologies at all. That is the major decision point when considering Big Data approaches, so it is important to formalize the decision and its supporting business case. There is some truth to the approach that applies traditional data governance methods to Big Data, but those methods apply only after a business-case-based decision has been made to engage in Big Data.
Once that decision is made, traditional data governance methods, as explained in the Playbook, are applicable to portions of Big Data structures. In each of the following scenarios, the presumption is that the metadata tags connected to Big Data objects, which are stored in file-oriented locations (eg, the Hadoop filesystem), must be governed and standardized just as with traditional structured data. The range of values available to describe metadata tags for Big Data objects, such as media type, subject, provider, contributor, etc., should be standardized, and the tags should be required fields for any Big Data object we choose to capture, store, and analyze. Governing Big Data objects in a file-store environment has special challenges and conditions. The first challenge is to recognize that these file stores are predicated on the notion of redundancy. In a Hadoop filesystem, mass redundancy is expected, due to the nature of the Big Data collection process and the lack of transformation or normalization steps in its collection. This is the biggest single disparity between Big Data stores, such as data lakes, and traditional data warehouses.
The key concept here is the need to govern or control Big Data objects or files at the metadata level. Standardizing metadata tags and their available values, as well as the required minimum metadata, ensures that vast quantities of Big Data files can be searched, filtered, and analyzed based on the unique properties of those files. Simple metadata, such as the source, time of collection, file format, and object represented, are examples of basic, minimum required metadata. In large ways, we govern Big Data principally through how we decide to invest in and leverage Big Data solutions. In small ways, we govern through our control of the minimum required metadata, our identification of uniquely Big Data files and objects, and our comparison and analysis of them in vast storage arrays or clouds.
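A minimal sketch of how such minimum required metadata might be enforced at ingest time follows; the tag names and the allowed media-type values are our own illustrative assumptions drawn from the examples above, not a prescribed standard.

# Minimal sketch: reject Big Data objects landing in a file store unless the
# minimum required metadata tags are present and use standardized values.
# Tag names and allowed values are illustrative assumptions, not a fixed standard.
REQUIRED_TAGS = {"source", "collected_at", "media_type", "subject", "provider"}
ALLOWED_MEDIA_TYPES = {"text", "image", "video", "audio", "sensor"}
def validate_object_metadata(tags):
    """Return a list of problems; an empty list means the object may be ingested."""
    problems = [f"missing tag: {t}" for t in sorted(REQUIRED_TAGS - set(tags))]
    if tags.get("media_type") not in ALLOWED_MEDIA_TYPES:
        problems.append(f"nonstandard media_type: {tags.get('media_type')}")
    return problems
if __name__ == "__main__":
    incoming = {"source": "delivery-truck-gps", "collected_at": "2016-06-01T14:03:00Z",
                "media_type": "sensor", "provider": "fleet-telemetry"}
    print(validate_object_metadata(incoming))  # ['missing tag: subject']

This is the small-ways end of governance: a simple, enforceable rule at the point of collection that keeps the file store searchable later.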

Scenario 1—Hospital Workflow Model

[Figure: Hospital workflow model]
This graphic is a simple depiction of a hospital workflow. At each stage in the workflow, we can visualize the kind of Big Data footprint additions that occur and begin to think about the kinds of questions the hospital, patients, and providers want to be able to answer for any given interaction. This workflow shows two different admission channels: the helicopter pad on the roof and the ambulance portal on the main floor. Each of these clearly carries different components of the digital signature or footprint, such as patient arrival information and condition upon arrival. Visualize, if you will, a patient on a gurney in either a helicopter or an ambulance. The patient would have some type of sensors connected to them, such as a heart monitor or blood-pressure gauge. These sensors would be streaming information on the patient’s condition to both the caregivers on the mobile platform and the hospital-based caregivers. Then also imagine that the admission process provides information about the patient and their history, ideally via electronic medical record access rather than manual intake procedures and imports. Big Data collection about the patient’s condition and route would make it possible to record their status and condition at intake and admission rather than after, as has been the common practice. This information would provide a wealth of knowledge about the kinds of conditions under which most patients are admitted to the ER and could be broken down by time of day, day of week, day of year, age, demographic, method of conveyance, source of referral, decision to route to this hospital, and so forth. It’s important to use scenarios to couple the types of Big Data that can be provisioned with the kinds of questions that would provide business value for any investment in Big Data.
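As a hedged sketch of what one such streamed reading might look like when packaged as a Big Data object with its minimum metadata attached, consider the following Python fragment; the field names, units, and values are our own illustrative assumptions.

# Illustrative sketch: package a single vital-sign reading from an ambulance or
# helicopter monitor as a tagged event ready for the hospital's data lake.
# Field names, units, and the transport mechanism are assumptions for illustration.
import json, time
def vitals_event(patient_ref, heart_rate_bpm, systolic_mmhg, diastolic_mmhg, conveyance):
    """Build one tagged reading; in practice this would be published to a stream."""
    return {
        "metadata": {
            "source": conveyance,  # eg, 'ambulance' or 'helicopter'
            "collected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "media_type": "sensor",
            "subject": patient_ref,  # opaque intake reference, not identifying data
        },
        "reading": {
            "heart_rate_bpm": heart_rate_bpm,
            "blood_pressure_mmhg": [systolic_mmhg, diastolic_mmhg],
        },
    }
if __name__ == "__main__":
    print(json.dumps(vitals_event("intake-7431", 112, 145, 95, "ambulance"), indent=2))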
Extending the scenario above, we can see where the patient proceeds through to a surgical theater, then beyond into intensive care, and finally a patient room. In each of these settings, we can imagine additions to the patient’s digital footprint, including doctors’ notes and videos from the surgery itself, nursing and physician care notes from the intensive care unit, and patient inputs about their pain levels, preferred medication options, and other conditions. Finally, in the patient room as well as in the ICU, we see the introduction of video and other capture tools to help healthcare providers monitor patient conditions remotely. All of these data inputs add to the Big Data collection, or digital footprint, that can support both the individual patient and aggregated analysis for care improvement and cost controls. Over the past decade, much has been written and proven about the value of checklist-based surgical and intensive care in hospitals. Regimented use of checklists has dramatically reduced postop infections in intensive care and has improved overall levels of patient recovery in many facilities. Big Data collection of the type we are describing would further prove the value of those checklists, as well as their contents and use. Similarly, from a cost-control perspective, tracking supplies using data-collection methods such as Radio Frequency ID (RFID) tags and Bluetooth connections would provide insight into cost-containment options. While this is a very simple hospital workflow example, it provides dozens of instances of Big Data collection points and types, as well as business questions that can be answered by them.

Scenario 2—Personalized Online Sales and Delivery

[Figure: Personalized online sales and delivery workflow]
This scenario allows us to think about typical logistical issues for personal delivery and to layer current and emerging trends onto those issues, which enables us to see how Big Data is helpful in making key decisions. Let’s call this company “Nile” and use it to explain its Personal Best delivery service. The cornerstone concept of Personal Best is the rapid, reliable delivery of even the smallest orders on a personal level. In this workflow, we can see how orders are pulled together, routed, and delivered to customers. The data that is generated from each step of this process helps with every aspect of performance management, from just-in-time inventory to intelligent pricing for products and shipping to least-cost, most-effective routing algorithms for delivery. The information that is generated from sensors in the product and the delivery vehicle would drive the tuning of these algorithms and the improvement of service over time.
This Big Data approach would also provide the basis for comparing traditional delivery vehicles and channels, such as trucks, to emerging options like drones. Comparing land-based to air-based delivery would begin with cost and effectiveness issues, but it would almost certainly have to add risk and control issues that do not typically arise in land-based delivery. So we can imagine that, if we were to engage in a trial of drone-based delivery in certain areas, we would want to understand what factors determined the best locales for the trial and what additional information or Big Data we would need to provision from the drone delivery.
Our land-based vehicles produce GPS data, which is sent via cellular connection to the data lake in a hosted cloud. This data is used both for real-time tracking of all delivery vehicles and their contents and for long-term tuning of the least-cost, most-effective route algorithms. The GPS data is then compared with time of day, weather conditions, driver experience, and other factors to improve delivery algorithms. In a pilot with drone-delivery vehicles, additional information, including weather conditions, wind speed and direction, altitude flown, route taken, and the obstacle-avoidance patterns required for delivery, is also added to the algorithm. This is but one example of where a business-process change leads to additional Big Data components and analysis.
Imagine this delivery service also comparing public delivery methods, such as the US Postal Service, and drop-and-ship locations, such as local packaging and delivery stores, as options. Each of these options carries with it additional Big Data sources and analysis requirements. When constructing a pilot program to consider trying new methods and processes, such as drones or drop-and-ship locations, it is clear that planning for additional Big Data sources and analysis is a critical component. What is less obvious is the fact that the types of Big Data to be provisioned and analyzed may drive additional Big Data governance in the form of minimum-governed metadata. The drone example is instructive because, in that case, we could expect to see video inputs from the drone to help understand the effects of weather, obstacle avoidance, and the efficiency of its delivery patterns. The addition of this new video format may require additional metadata that must be provided with the video files. One example of additional, minimum-governed metadata might be the duration of the video encapsulated in the file. Another might be the relationship of that video file to others that may have been taken in a sequence and should therefore be considered together, as the sketch below illustrates. This scenario gives us another set of business and technical challenges to consider when employing Big Data methods and technologies to enhance business processes and performance.
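As a small sketch of how that additional, minimum-governed metadata for drone video files might be expressed, consider the following Python fragment; the field names and the sequencing convention are our own illustrative assumptions.

# Illustrative sketch: extend the minimum metadata for a new file type (drone
# delivery video) with duration and sequence relationships, as discussed above.
# Field names and the sequence convention are illustrative assumptions.
def video_metadata(file_name, duration_seconds, sequence_id, sequence_position):
    return {
        "media_type": "video",
        "file_name": file_name,
        "duration_seconds": duration_seconds,  # new required field for video objects
        "sequence_id": sequence_id,  # groups clips captured in one delivery run
        "sequence_position": sequence_position,  # order of this clip within that run
    }
if __name__ == "__main__":
    clips = [video_metadata(f"drone42_clip{i}.mp4", 90, "run-2016-06-01-42", i) for i in range(1, 4)]
    for clip in sorted(clips, key=lambda c: c["sequence_position"]):  # reassemble the run in order
        print(clip["file_name"], clip["duration_seconds"], "seconds")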

Scenario 3—Cloud-based Information Technology

[Figure: Cloud-based information technology management]
This scenario provides a fairly simple view of managing cloud-based technology assets. We know that cloud computing is supported by data centers, server farms, and other constructs, which are fully virtualized by cloud delivery mechanisms. If we take the perspective of the cloud-service provider, who needs to track the status of their technology assets in terms of performance, utilization, capacity, and security, we need to identify the types of information those servers and their environment can provide. This is a commercial or industrial version of the “Internet of Things” approach that is so prevalent in the retail market. We see the Internet of Things coming to life in cars, appliances, and even retail settings. In this example, we’re looking at what are typically referred to as instrumented servers, which track their conditions and report them over the Internet. One of the key benefits of cloud computing is elastic computing capacity, or the ability to provide surge levels of computing capacity on demand from a pool of shared resources.
Elastic computing and related dynamic capacity models require constant monitoring of overall capacity utilization, as well as the ability to dynamically employ additional capacity that may be available but not in production. So we have an example where Big Data is both inbound, as a means of monitoring status and conditions, and outbound, as a means of changing the capacity and performance of the cloud-computing environment. In traditional operational applications, this is referred to as closed-loop analytics. In those settings, we employ an analytics engine that is able to change things like pricing or availability commitments for orders based on volumes, time of day, and other factors in the operational environment. In this model, changes to online capacity and tracking of its use, for billing purposes as well as for performance management, require constant bidirectional Big Data movement and analytics. This is another environment where physical sensors are often employed to track the physical conditions of the data center, including temperature and power fluctuations. Video sources may also be included in order to confirm or deny the presence of fire, water, or other physical hazards. All of these inputs are critical to precisely managing the performance of the data center and the cloud it supports. This example provides a more industrial view of how the Internet of Things and additional Big Data sources are combined to provide real-time updates and changes in the way facilities and technology are managed.
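As a minimal sketch of the closed-loop idea in code, consider the following Python fragment; the utilization thresholds and the scaling step are our own assumptions, not any particular provider’s policy.

# Minimal sketch of closed-loop capacity analytics: inbound utilization telemetry
# drives an outbound scaling decision. Thresholds and step size are assumptions.
def scaling_decision(utilization_samples, active_nodes, high=0.80, low=0.30, step=2):
    """Return the node count implied by recent average utilization."""
    avg = sum(utilization_samples) / len(utilization_samples)
    if avg > high:
        return active_nodes + step  # surge: bring standby capacity online
    if avg < low and active_nodes > step:
        return active_nodes - step  # idle: release capacity (and its cost)
    return active_nodes  # steady state: no change
if __name__ == "__main__":
    last_five_minutes = [0.86, 0.91, 0.84, 0.88, 0.90]  # per-minute utilization readings
    print(scaling_decision(last_five_minutes, active_nodes=10))  # 12

The inbound telemetry and the outbound scaling action together form the closed loop: the same Big Data stream that measures the environment is used to change it.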

Big Data and Analytic Delivery Example—analytics.usa.gov

[Figure: analytics.usa.gov real-time dashboard]
The last component of governing Big Data and analytics is to look at the way they are delivered and consumed. It’s only through this lens that we understand the impact that change control, data integrity, and even visualization approaches have on the realization of data value. This example is from the federal government site that provides real-time statistical updates on a variety of aspects of government and its citizen services. It’s a valuable example because it highlights a number of choices the Big Data delivery team has to make, as well as the underlying value of governance in the quality and integrity of the information provided.
We’ve started our discussion of Big Data governance by reminding ourselves that we have options when it comes to choosing whether to use it, when we use it, and on what terms we use it. Those same options have to be considered when it comes time for the visualization and exploitation of Big Data, or basic delivery and consumption. The prevailing practice for Big Data projects is to include, from the outset, sample visualizations based on use cases or larger storyboards. These establish the purpose of the Big Data initiative, who its primary audience is, and what kinds of questions it’s intended to answer. Marrying the answers to those questions with appropriate and impactful visualizations is a key aspect of Big Data analytics.
This example demonstrates two different ways that delivering on Big Data analytics can fail. If we learned that the data on this website was deeply flawed and unreliable, our belief in the entire site and its promise would be diminished to the point of not using the service. Similarly, if the data is of high integrity and timeliness but the visualization selected is too complex or undermines the actual understanding needed from the data, we will be unable to rely on the site for decision support and other uses. The Playbook approach to governing data controls and analytics is equally important in this area. The Playbook talks about how to govern analytic models, and it’s important to understand that those include the visualization and consumption models, not just the internal algorithms and data structures. Best practice in delivering Big Data analytics now includes embedding feedback capabilities into the delivery mechanism, whether in a website or an application. This allows users to easily provide feedback where they see difficulties in the usability of the interface or perceive gaps in the viability of the data. This example helps us understand how the combination of data controls and visualization alignment with user needs drives value from all the underlying work that has been done to capture the data and deliver it in this format.
Governing analytic models, including computational and presentation formats, is a critical function in the Playbook. Consider the minimum required metadata for analytic presentations. We need to be able to identify the type of visualization being used, as well as other heuristic factors, in order to gauge the effectiveness of the presentation for the report’s intended audience. We also need to track metadata related to the feedback we receive so that we can assemble a meaningful picture of user satisfaction over time. These are just two simple examples of using metadata-based governance for Big Data analytics and visualization.
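A brief sketch of those two kinds of metadata in code follows; the field names and the rating scale are our own illustrative assumptions, not a prescribed standard.

# Illustrative sketch: minimum metadata for an analytic presentation and for the
# feedback it collects, so user satisfaction can be assembled over time.
# Field names and the rating scale are assumptions, not a prescribed standard.
from statistics import mean
presentation = {
    "visualization_type": "time_series_bar",  # which chart form the audience sees
    "intended_audience": "agency-web-managers",
    "data_refreshed_at": "2016-06-01T00:00:00Z",
}
feedback_log = [
    {"presentation": "time_series_bar", "usability_rating": 4, "data_gap_reported": False},
    {"presentation": "time_series_bar", "usability_rating": 2, "data_gap_reported": True},
]
def satisfaction_summary(log):
    """Return the average usability rating and the count of reported data gaps."""
    return mean(f["usability_rating"] for f in log), sum(f["data_gap_reported"] for f in log)
if __name__ == "__main__":
    print(presentation["visualization_type"], satisfaction_summary(feedback_log))  # time_series_bar (3, 1)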

Conclusion

This chapter focused on two aspects of governing Big Data and analytics. The first is the decision to employ Big Data and analytics, which is the simplest form of governance: deciding where and when to invest in your information technology. The second aspect of governing Big Data and analytics is the adjustments we make for the way the data operates. While many of the methods and approaches we use for traditional or structured data are still important to employ here, they have to be applied in different ways, owing to the different forms the data takes and the redundancy with which it is stored.