Chapter 6. Data Unification Brings Out the Best in Installed Data Management Strategies

Companies are now investing heavily in technology designed to control and analyze their expanding pools of data, reportedly spending $44 billion on big data analytics alone in 2014. Relatedly, data management software now accounts for over 40 percent of total software spending in the US. As companies pursue strategies like ETL (extract, transform, and load), MDM (master data management), and data lakes, it’s critical to understand that while these technologies give organizations a significant handle on their data, they still fall short on speed and scalability, and can therefore delay, or fail to surface, the insights that drive better decision making.

Data is generally too siloed and too diverse for systems like ETL, MDM, and data lakes to handle on their own, and analysts spend too much time finding and preparing data manually. At the same time, the nature of this work defies complete automation. Data unification is an emerging strategy that catalogs datasets, combines data across the enterprise, and publishes the data for easy consumption. Used as a frontend strategy, data unification accelerates the flow of well-organized data into ETL and MDM systems and data lakes, increasing the value of those systems and the insights they enable. In this chapter, we’ll explore how data unification works with installed data management solutions, allowing businesses to embrace data volume and variety for more productive data analyses.

Positioning ETL and MDM

When enterprise data management software first emerged, it was built to address the data variety and scale of its era. ETL technologies have been around in some form since the 1980s. Today, the ETL vendor market is full of large, established players, including Informatica, IBM, and SAP, with mature offerings that boast massive installed bases spanning virtually every industry. ETL makes short work of repackaging data for a different use—for example, taking inventory data from a car parts manufacturer and plugging it into systems at dealerships that provide service, or cleaning customer records for more efficient marketing efforts.

Extract, Transform, and Load

Most major applications, from finance and accounting to operations, are built using ETL products. ETL products have three primary functions for integrating data sources into single, unified datasets for consumption (a minimal code sketch follows the list):

  1. Extracting data from data sources within and outside of the enterprise

  2. Transforming the data to fit the particular needs of the target store, which includes conducting joins, rollups, lookups, and cleaning of the data

  3. Loading the resulting transformed dataset into a target repository, such as a data warehouse for archiving and auditing, a reporting tool for advanced analytics (e.g., business intelligence), or an operational database/flat file to act as reference data
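
As a rough illustration of these three functions, the sketch below runs a miniature ETL flow in plain Python. The parts table, dealer CSV, and warehouse schema are hypothetical stand-ins; commercial ETL products wrap the same extract, transform, and load steps in connectors, scheduling, and monitoring.

```python
import csv
import io
import sqlite3

# Extract: pull raw records from two hypothetical sources.
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE parts (part_id TEXT, qty INTEGER)")
source_db.executemany("INSERT INTO parts VALUES (?, ?)",
                      [("P-100", 12), ("P-200", 0), ("P-100", 3)])
inventory = source_db.execute("SELECT part_id, qty FROM parts").fetchall()

dealer_csv = io.StringIO("part_id,dealer\nP-100,Acme Motors\nP-200,Beta Auto\n")
dealers = {row["part_id"]: row["dealer"] for row in csv.DictReader(dealer_csv)}

# Transform: roll up quantities per part, look up the dealer, drop empty stock.
rolled_up = {}
for part_id, qty in inventory:
    rolled_up[part_id] = rolled_up.get(part_id, 0) + qty
transformed = [(pid, qty, dealers.get(pid, "UNKNOWN"))
               for pid, qty in rolled_up.items() if qty > 0]

# Load: write the conformed rows into a target warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE part_inventory (part_id TEXT, qty INTEGER, dealer TEXT)")
warehouse.executemany("INSERT INTO part_inventory VALUES (?, ?, ?)", transformed)
print(warehouse.execute("SELECT * FROM part_inventory").fetchall())
```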

Master Data Management

MDM arrived shortly after ETL to provide an authoritative, top-down approach to data verification. A centralized dataset serves as a “golden record,” holding the approved values for all records, and the MDM system performs exacting checks to ensure that this central dataset contains the most up-to-date and accurate information. For critical business decision making, most systems depend on a consistent definition of “master data,” which is information referring to the core operational elements of the business. The primary functions of master data management include the following (a brief survivorship sketch follows the list):

  • Consolidating all master data records to create a comprehensive understanding of each entity, such as an address or dollar figure

  • Establishing survivorship, or selecting the most appropriate attribute values for each record

  • Cleansing the data by validating the accuracy of the values

  • Ensuring compliance of the resulting single “good” record related to each entity as it is added or modified
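
The survivorship step in particular is easy to picture in code. The toy sketch below merges duplicate customer records into a single “golden record” by keeping, for each attribute, the freshest non-empty value; the field names and the recency-wins policy are illustrative assumptions, since real MDM products support many competing survivorship strategies.

```python
from datetime import date

# Two source records describing the same customer, from different systems.
source_records = [
    {"customer_id": "C42", "email": "j.doe@example.com", "phone": "",
     "updated": date(2015, 3, 1), "source": "CRM"},
    {"customer_id": "C42", "email": "", "phone": "+1-617-555-0100",
     "updated": date(2015, 6, 9), "source": "billing"},
]

def golden_record(records, attributes):
    """Merge duplicate records for one entity into a single surviving record."""
    golden = {}
    for attr in attributes:
        candidates = [r for r in records if r.get(attr)]
        if candidates:
            # Survivorship policy: the most recently updated non-empty value wins.
            golden[attr] = max(candidates, key=lambda r: r["updated"])[attr]
    return golden

print(golden_record(source_records, ["customer_id", "email", "phone"]))
# {'customer_id': 'C42', 'email': 'j.doe@example.com', 'phone': '+1-617-555-0100'}
```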

Clustering to Meet the Rising Data Tide

Enterprise data has changed dramatically in the last decade, creating new difficulties for products that were built to handle mostly static data from relatively few sources. These products have been extended and overextended to adjust to modern enterprise data challenges, but the workaround strategies and patches that have been developed are no match for current expectations.

Today’s tools, like Hadoop and Spark, help organizations reduce the cost of data processing and give companies the ability to host massive and diverse datasets. With the growing popularity of Hadoop, a significant number of organizations have been creating data lakes, where they store data derived from structured and unstructured data sources in its raw format.

Upper management and shareholders are challenging their companies to become more competitive using this data. Businesses need to integrate massive information silos—both archival and streaming—and accommodate sources that change constantly in content and structure. Further, every organizational change brings new demand for data integration or transformation. The cost in time and effort to make all of these sources analysis-ready is prohibitive.

There is a chasm between the data we can access thanks to Hadoop and Spark and the ordered information we need to perform analysis. While Hadoop, ETL, and MDM technologies (as well as many others) prove to be useful tools for storing and gaining insight from data, collectively they can’t resolve the problem of bringing massive and diverse datasets to bear on time-sensitive decisions.

Embracing Data Variety with Data Unification

Data variety isn’t a problem; it is a natural and perpetual state. While a single data format is the most effective starting point for analysis, data comes in a broad spectrum of formats for good reason. Data sets typically originate in their most useful formats, and imposing a single format on data negatively impacts that original usefulness.

This is the central struggle for organizations looking to compete through better use of data. The value of analysis is inextricably tied to the amount and quality of the data used, but data siloed throughout the organization is inherently hard to reach and hard to use. The prevailing strategy is to perform analysis with the data that is easiest to reach and use, putting expediency over diligence in the interest of using data before it becomes out of date. For example, a review of suppliers may look only at the largest vendor contracts, hunting for small changes that might make a meaningful impact, rather than accounting for every vendor in a comprehensive analysis that could return five times the savings.

Data unification represents a philosophical shift, allowing data to be raw and organized at the same time. Without changing the source data, data unification prepares the varying data sets for any purpose through a combination of automation and human intelligence.

The process of unifying data requires three primary steps (a schematic sketch in code follows the list):

  1. Catalog: Generate a central inventory of enterprise metadata. A central, platform-neutral record of metadata, available to the entire enterprise, provides visibility into what relevant data is available. This enables data to be grouped by logical entities (customers, partners, employees), making it easier for companies to discover the data necessary to answer critical business questions.

  2. Connect: Make data across silos ready for comprehensive analysis at any time while resolving duplications, errors, and inconsistencies among the source data’s attributes and records. Scalable data connection enables data to be applied to more kinds of business problems. This includes matching multiple entities by taking into account relationships between them.

  3. Publish: Deliver the prepared data to the tools used within the enterprise to perform analysis—from a simple spreadsheet to the latest visualization tools. This can include functionality that allows users to set custom definitions and enrich data on the fly. Being able to manipulate external data as easily as if it were their own allows business analysts to use that data to resolve ambiguities, fill in gaps, enrich their data with additional columns and fields, and more.
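
To make the three steps concrete, here is a schematic sketch over two toy sources. The sources, the attribute mapping, and the crude name-based matching are assumptions made purely for illustration, not a description of any particular product.

```python
import csv
import sys

crm     = [{"cust_name": "Acme Corp.", "city": "Boston"}]
billing = [{"customer": "ACME Corp", "city": "Boston"}]
sources = {"crm": crm, "billing": billing}

# 1. Catalog: record which attributes each source exposes.
catalog = {name: sorted(rows[0].keys()) for name, rows in sources.items()}
print("catalog:", catalog)

# 2. Connect: map source attributes onto one logical schema and merge records
#    that refer to the same entity (a crude normalized-name match here).
attribute_map = {"cust_name": "name", "customer": "name", "city": "city"}

def normalize(name):
    return name.lower().replace(".", "").replace(",", "").strip()

unified = {}
for rows in sources.values():
    for row in rows:
        mapped = {attribute_map[k]: v for k, v in row.items()}
        unified[normalize(mapped["name"])] = mapped  # later sources overwrite earlier ones

# 3. Publish: deliver the unified records to downstream tools (CSV on stdout here).
writer = csv.DictWriter(sys.stdout, fieldnames=["name", "city"])
writer.writeheader()
writer.writerows(unified.values())
```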

Data Unification Is Additive

Data unification has significant value on its own, but when added to an IT environment that already includes strategies like ETL, MDM, and data lakes, it turns those technologies into the best possible versions of themselves. It creates an ideal data set for these technologies to perform the functions for which they are intended.

Data Unification and Master Data Management

The increasing volume of data sources, and the frequency with which they change, pose a serious threat to MDM speed and scalability. Given the highly manual nature of traditional MDM operations, managing more than a dozen data sources requires a large investment of time and money, so it is often hard to economically justify scaling the operation to cover all data sources. Additionally, the speed at which data sources are integrated is limited by how quickly employees can work, and that pace becomes increasingly inadequate as data volume grows.

Further, MDM products are deterministic, requiring matching rules to be generated up front. It takes manual effort to understand what constitutes a potential match and then to define the appropriate matching rules; in matching addresses, for example, thousands of rules may need to be written. This process becomes increasingly difficult to manage as the number of data sources grows, with the risk that by the time new rules (or rule changes) have been implemented, business requirements will already have changed.
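
To see why rule counts balloon, consider the kind of hand-written matching rules a deterministic approach relies on. The two rules below are hypothetical examples; every new abbreviation, punctuation quirk, or unit-number format tends to demand another one.

```python
import re

def rule_standardize_street_suffix(address):
    """Rule 1: expand a common street-suffix abbreviation."""
    return re.sub(r"\bSt\.?\b", "Street", address)

def rule_strip_unit(address):
    """Rule 2: ignore apartment/suite qualifiers when comparing."""
    return re.sub(r",?\s*(Apt|Suite|Unit)\s*\S+$", "", address, flags=re.I)

def addresses_match(a, b, rules=(rule_standardize_street_suffix, rule_strip_unit)):
    # Every new data quirk tends to mean another rule appended here.
    for rule in rules:
        a, b = rule(a), rule(b)
    return a.lower() == b.lower()

print(addresses_match("12 Main St, Apt 4", "12 Main Street"))  # True
```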

Using data unification, MDM can include the long tail of data sources as well as handle frequent updates to existing sources—reducing the risk that the project requirements will have changed before the project is complete. Data unification, rather than replacing MDM, works in unison with it as a system of reference, recommending new “golden records” via matching capability and acting as a repository for keys.

Data Unification and ETL

ETL is highly manual, slow, and not scalable to the number of sources used in contemporary business analysis. Integrating data sources with ETL requires a lot of up-front work to define requirements and target schemas and to establish rules for matching entities and attributes. After all of this work is complete, developers must manually apply these rules to map source attributes to the target schema and to deduplicate or cluster entities that appear in many variations across sources.

Data unification’s probabilistic matching provides a far better engine than ETL’s rules when it comes to matching records across all of these sources. Data unification also works hand-in-hand with ETL as a system of reference to suggest transformations at scale, particularly for joins and rollups. This results in a faster time-to-value and more scalable operation.
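
The contrast is easiest to see in a toy example. The sketch below scores record pairs by weighted attribute similarity and accepts pairs above a threshold; the standard library’s difflib stands in for the trained models and blocking strategies a real data unification platform would use, and the weights and threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher

# Assumed attribute weights and acceptance threshold, for illustration only.
WEIGHTS = {"name": 0.7, "city": 0.3}
THRESHOLD = 0.75

def similarity(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted similarity across the attributes we care about."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

a = {"name": "Acme Corporation", "city": "Boston"}
b = {"name": "ACME Corp.",       "city": "Boston"}

score = match_score(a, b)
print(f"score={score:.2f}", "match" if score >= THRESHOLD else "no match")
```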

Changing Infrastructure

Additionally, data unification solves the biggest challenges associated with changing infrastructure—namely, unifying datasets in Hadoop to connect and clean the data so that it’s ready for analytics. Data unification creates integrated, clean datasets with unrivaled speed and scalability. Because of the scale of business data today, it is very expensive to move Hadoop-based data outside of the data lake. Data unification can handle all of the large-scale processing within the data lake, eliminating the need to replicate the entire data set.
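
As a sketch of what processing within the data lake can look like, the snippet below reads, aligns, and deduplicates a dataset with Spark and writes the result back to cluster storage, so the raw data never leaves the lake. It assumes a Spark environment and hypothetical HDFS paths, and uses a single column rename plus dropDuplicates as a crude stand-in for the richer matching a unification platform would apply.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-lake-unification").getOrCreate()

# Read raw data in place; nothing is copied out of the lake.
raw = spark.read.parquet("hdfs:///lake/raw/customers")

unified = (raw
           .withColumnRenamed("cust_name", "name")   # align source schemas
           .dropDuplicates(["name", "city"]))        # crude entity dedup

# Write the unified dataset back to lake storage for downstream analytics.
unified.write.mode("overwrite").parquet("hdfs:///lake/unified/customers")
```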

Data unification delivers more than technical benefits. By unifying enterprise data, companies can also unify their organizations. Cataloging and connecting dark, disparate data into a unified view, for example, shows analysts what data is available and who controls access to it. This dramatically reduces discovery and preparation effort for business analysts and “gatekeeping” time for IT.

Probabilistic Approach to Data Unification

The probabilistic approach to data unification is reminiscent of Google’s full-scale approach to web search and connection. This approach draws from the best of machine and human learning to find and connect hundreds or thousands of data sources (both visible and dark), as opposed to the few that are most familiar and easiest to reach with traditional technologies.

The first step in using a probabilistic approach is to catalog all metadata available to the enterprise in a central, platform-neutral place, using both machine learning and advanced collaboration capabilities. The data unification platform automatically connects the vast majority of sources while resolving duplications, errors, and inconsistencies among the source data. The next step is critical to the success of a probabilistic approach: where algorithms can’t resolve connections automatically, the system must call for expert human guidance. The system needs to work with the people in the organization who know the data best, having them weigh in on mappings and on improving the quality and integrity of the data. Expert feedback can be fed back into the system to improve its algorithms, but it will always play a role in this process. The data is then provided to analysts in a ready-to-consume condition, eliminating the time and effort otherwise required for data preparation.
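
A minimal sketch of that human-in-the-loop pattern, under assumed score thresholds and a hypothetical expert-review hook, might look like this:

```python
# Auto-resolve pairs the model is confident about, queue ambiguous pairs for
# expert review, and treat the expert's answers as new training signal. The
# thresholds, score function, and ask_expert hook are illustrative assumptions.

AUTO_ACCEPT, AUTO_REJECT = 0.90, 0.30

def triage(candidate_pairs, score):
    """Split candidate record pairs into matches, non-matches, and questions."""
    matches, non_matches, needs_review = [], [], []
    for pair in candidate_pairs:
        s = score(pair)
        if s >= AUTO_ACCEPT:
            matches.append(pair)
        elif s <= AUTO_REJECT:
            non_matches.append(pair)
        else:
            needs_review.append(pair)  # route to a domain expert
    return matches, non_matches, needs_review

def incorporate_feedback(needs_review, ask_expert, training_examples):
    """Expert answers resolve ambiguous pairs and feed future model improvements."""
    for pair in needs_review:
        training_examples.append((pair, ask_expert(pair)))
    return training_examples

# Tiny usage example with a stand-in scoring function.
pairs = [("Acme Corp", "ACME Corporation"), ("Acme Corp", "Beta LLC")]
score = lambda p: 0.95 if p[0][:4].lower() == p[1][:4].lower() else 0.5
print(triage(pairs, score))
```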
