Chapter 2. An Alternative Approach to Data Management

For much of the 60 years or so that organizations have been managing data in electronic form, there has been an overpowering desire to subdue it through centralized planning and architectural initiatives.

These initiatives have had a variety of names over the years, including the most familiar: “information architecture,” “information engineering,” and “master data management.” Underpinning them has been a set of key attributes and beliefs:

  • Data needs to be centrally controlled.

  • Modeling is an approach to controlling data.

  • Abstraction is a key to successful modeling.

  • An organization’s information should all be defined in a common fashion.

  • Priority is on efficiency in information storage (a given data element should only be stored once).

  • Politics, ego, and other common human behaviors are irrelevant to data management (or at least not something that organizations should attempt to manage).

Each of these statements has at least a grain of truth in it, but taken together and to their full extent, I have come to believe that they simply don’t work as the foundation for data management. I rarely find business users who believe they work either, and this dissatisfaction has been brewing for a long time. For example, in the 1990s I interviewed a marketing manager at Xerox Corporation who had also spent some time in IT at the same company. He explained that the company had “tried information architecture” for 25 years, but got nowhere—they always thought they were doing it incorrectly.

Centralized Planning Approaches

Most organizations have had similar results from their centralized architecture and planning approaches.

Not only do centralized planning approaches waste time and money, but they also drive a wedge between those doing the planning and those who will actually use the information and technology. Lengthy reviews, abstract deliverables, and incongruous goals can lead to months of frustration without results.

The complexity and detail of centralized planning approaches often mean that they are never completed; when they are, managers frequently decide not to implement them. The resources devoted to central data planning are often redeployed into other IT projects of more tangible value. If by chance the plans are implemented, they are typically hopelessly out of date by the time they go into effect.

As an illustration of how the key tenets of centralized information planning are not consistent with real organizational behavior, let’s look at one: the assumption that all information needs to be common.

Common Information

Common information—agreement within an organization on how to define and use key data elements—is a useful thing, to be sure. But it’s also helpful to know that uncommon information—information definitions that suit the purposes of a particular group or individual—can also be useful to a particular business function, unit, or work group. Companies need to strike a balance between these two desirable goals.

After speaking with many managers and professionals about common information, and reflecting on the subject carefully, I formulated “Davenport’s Law of Common Information” (you can Google it, but don’t expect a lot of results). If by some strange chance you haven’t heard of Davenport’s Law, it goes like this:

The more an organization knows or cares about a particular business entity, the less likely it is to agree on a common term and meaning for it.

I first noticed this paradoxical observation at American Airlines more than a decade ago. Company representatives told me during a research visit that they had 11 different usages of the term “airport.” As a frequent traveler on American Airlines planes, I was initially a bit concerned about this, but when they explained it, the proliferation of meanings made sense. They said that the cargo workers at American Airlines viewed anyplace you can pick up or drop off cargo as the airport; the maintenance people viewed anyplace you can fix an airplane as the airport; the people who worked with the International Air Transport Association relied on its list of international airports, and so on.

Information Chaos

So, just like Newton being hit on the head with an apple and discovering gravity, the key elements of Davenport’s Law hit me like a brick. This was why organizations were having so many problems creating consensus around key information elements. I also formulated a few corollaries to the law, such as:

If you’re not arguing about what constitutes a “customer,” your organization is probably not very passionate about customers.

Davenport’s Law, in my humble opinion, makes it much easier to understand why companies all over the world have difficulty establishing common definitions of key terms within their organizations.

Of course, this should not be an excuse for organizations to allow alternative meanings of key terms to proliferate. Even though there is a good reason why they proliferate, organizations may have to limit—or sometimes even stop—the proliferation of meanings and agree on one meaning for each term. Otherwise they will continue to find that when the CEO asks multiple people how many employees the company has, he/she will get different answers. The proliferation of meanings, however justifiable, leads to information chaos.

But Davenport’s Law offers one more useful corollary about how to stop the proliferation of meanings. Here it is:

A manager’s passion for a particular definition of a term will not be quenched by a data model specifying an alternative definition.

If a manager has a valid reason to prefer a particular meaning of a term, he/she is unlikely to be persuaded to abandon it by a complex, abstract data model that is difficult to understand in the first place, and is likely never to be implemented.

Is there a better way to get adherence to a single definition of a term?

Here’s one final corollary:

Consensus on the meaning of a term throughout an organization is achieved not by data architecture, but by data arguing.

Data modeling doesn’t often lead to dialog, because it’s simply not comprehensible to most nontechnical people. If people don’t understand your data architecture, it won’t stop the proliferation of meanings.

What Is to Be Done?

There is little doubt that something needs to be done to make data integration and management easier. In my research, I’ve conducted more than 25 extended interviews with data scientists about what they do, and how they go about their jobs. I concluded that a more appropriate title for data scientists might actually be “data plumbers.” It is often so difficult to extract, clean, and integrate data that data scientists can spend up to 90% of their working time doing those tasks. It’s no wonder that big data often involves “small math”—after all the preparation work, there isn’t enough time left to do sophisticated analytics.

This is not a new problem in data analysis. The dirty little secret of the field is that someone has always had to do a lot of data preparation before the data can be analyzed. The problem with big data is partly that there is a large volume of it, but mostly that we are often trying to integrate multiple sources. Combining multiple data sources means that for each source, we have to determine how to clean, format, and integrate its data. The more sources and types of data there are, the more plumbing work is required.

So let’s assume that data integration and management are necessary evils. But what particular approaches to them are most effective? Throughout the remainder of this chapter, I’ll describe five approaches to realistic, effective data management:

  1. Take a federal approach to data management.

  2. Use all the new tools at your disposal.

  3. Don’t model, catalog.

  4. Keep everything simple and straightforward.

  5. Use an ecological approach.

Take a Federal Approach to Data Management

Federal political models—of which the United States is one example—don’t try to get consensus on every issue. They have some laws that are common throughout the country, and some that are allowed to vary from state to state or by region or city. It’s a hybrid approach to the centralization/decentralization issue that bedevils many large organizations. Its strength is its practicality, in that it’s easier to get consensus on some issues than on all of them. If there is a downside to federalism, it’s that there is usually a lot of debate and discussion about which rights are federal, and which are states’ or other units’ rights. The United States has been arguing about this issue for more than 200 years.

While federalism does have some inefficiencies, it’s a good model for data management. It means that some data should be defined commonly across the entire organization, and some should be allowed to vary. Some should have a lot of protections, and some should be relatively open. That will reduce the overall effort required to manage data, simply because not everything will have to be tightly managed.

Your organization will, however, have to engage in some “data arguing.” Hashing things out around a table is the best way to resolve key issues in a federal data approach. You will have to argue about which data should be governed by corporate rights, and which will be allowed to vary. Once you have identified corporate data, you’ll then have to argue about how to deal with it. But I have found that if managers feel that their issues have been fairly aired, they are more likely to comply with a policy that goes against their preferences.
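
To make the outcome of that arguing concrete, here is a minimal sketch in Python (my own illustration; the domain names, stewards, and protection levels are hypothetical, not a prescription from any particular company) of how the results of a federal data policy might be recorded: each data domain is tagged as corporately governed or locally governed, along with how tightly it is protected.

    # A minimal sketch of recording the outcome of "data arguing" in a federal
    # model: some data domains get a single corporate definition, others are
    # allowed to vary locally. Domain names, stewards, and protection levels
    # are hypothetical examples.
    from dataclasses import dataclass
    from enum import Enum

    class Governance(Enum):
        CORPORATE = "corporate"   # one definition, enforced organization-wide
        LOCAL = "local"           # definitions may vary by unit or function

    @dataclass
    class DataDomain:
        name: str
        governance: Governance
        protection: str           # e.g., "restricted", "internal", "open"
        steward: str              # who is accountable for the definition

    policy = [
        DataDomain("customer", Governance.CORPORATE, "restricted", "Chief Data Office"),
        DataDomain("employee_headcount", Governance.CORPORATE, "internal", "HR"),
        DataDomain("campaign_response", Governance.LOCAL, "internal", "Marketing"),
    ]

    # Only corporately governed domains need one agreed definition; the rest may vary.
    corporate = [d.name for d in policy if d.governance is Governance.CORPORATE]
    print(corporate)  # ['customer', 'employee_headcount']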

Use All the New Tools at Your Disposal

We now have a lot of powerful tools for processing and analyzing data, but up to now we haven’t had them for cleaning, integrating, and “curating” data. (“Curating” is a term often used by librarians; pharmaceutical firms typically employ many such curators to manage their scientific literature.) These tools are sorely needed and are beginning to emerge. One source I’m close to is a startup called Tamr, which aims to help “tame” your data using a combination of machine learning and crowdsourcing. Tamr isn’t the only new tool for this set of activities, though, and since I am an advisor to the company, I’d advise you to do your own investigation. The founders of Tamr (both of whom have also contributed to this report) are Andy Palmer and Michael Stonebraker. Palmer is a serial entrepreneur and incubator founder in the Boston area.

Stonebraker is the database architect behind INGRES, Vertica, VoltDB, Paradigm4, and a number of other database tools. He’s also a longtime computer science professor, now at MIT. As noted in his chapter of this report, we share a common view of how well centralized information architecture approaches work in large organizations.

In a research paper published in 2013, Stonebraker and several co-authors wrote that they had tested “Data-Tamer” (as it was then known) in three separate organizations. They found that the tool reduced the cost of data curation in those organizations by about 90%.

I like the idea that Tamr uses two separate approaches to solving the problem. If the data problem is somewhat repetitive and predictable, the machine learning approach will develop an algorithm that will do the necessary curation. If the problem is a bit more ambiguous, the crowdsourcing approach can ask people who are familiar with the data (typically the owners of that data source) to weigh in on its quality and other attributes. Obviously the machine learning approach is more efficient, but crowdsourcing at least spreads the labor around to the people who are best qualified to do it. These two approaches are, together, more successful than the top-down approaches that many large organizations have employed.
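
As a rough illustration of that division of labor, the sketch below routes confident algorithmic matches straight through while sending ambiguous ones to people who know the data. This is my own sketch, not Tamr’s implementation; the similarity measure, the thresholds, and the ask_data_owner helper are all hypothetical.

    # A rough sketch of hybrid curation: accept high-confidence matches
    # automatically, route ambiguous ones to a human who knows the data.
    # Not Tamr's actual code; thresholds and helpers are hypothetical.
    from difflib import SequenceMatcher

    def match_confidence(a: dict, b: dict) -> float:
        """Crude similarity score between two records' name fields."""
        return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

    def ask_data_owner(a: dict, b: dict) -> bool:
        """Stand-in for a crowdsourced question to the data source's owner."""
        answer = input(f"Are '{a['name']}' and '{b['name']}' the same entity? [y/n] ")
        return answer.strip().lower() == "y"

    def is_duplicate(a: dict, b: dict,
                     auto_threshold: float = 0.9,
                     review_threshold: float = 0.6) -> bool:
        score = match_confidence(a, b)
        if score >= auto_threshold:
            return True                    # confident match: accept automatically
        if score >= review_threshold:
            return ask_data_owner(a, b)    # ambiguous: ask a human expert
        return False                       # confident non-match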

A few months before writing this chapter, I spoke with several managers from companies that are working with Tamr. Thomson Reuters is using the technology to curate “core entity master” data—creating clear and unique identities of companies and their parents and subsidiaries. Previous in-house curation efforts, relying on a handful of data analysts, found that 30–60% of entities required manual review. Thomson Reuters believed manual integration would take up to six months to complete, and would achieve 95% precision (the share of suggested matches that were true matches) and 95% recall (the share of true matches that were identified).

Thomson Reuters looked to Tamr’s machine-driven, human-guided approach to improve this process. After converting the company’s XML files to CSVs, Tamr ingested three core data sources—factual data on millions of organizations, with more than 5.4 million records. Tamr deduplicated the records and used “fuzzy matching” to find suggested matches, with the goal of achieving high accuracy rates while reducing the number of records requiring review. In order to scale the effort and improve accuracy, Tamr applied machine learning algorithms to a small training set of data and fed guidance from Thomson Reuters’ experts back into the system.
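
A much-simplified sketch of that kind of workflow appears below. It is not the Thomson Reuters pipeline; the records, the hand-labeled matches, and the similarity rule are invented. It simply shows how suggested matches can be scored against a small labeled set to estimate the precision and recall figures discussed above.

    # A simplified sketch of deduplication followed by evaluation against a
    # small hand-labeled set. Records, labels, and the similarity rule are
    # invented; this is not the actual Thomson Reuters pipeline.
    from difflib import SequenceMatcher
    from itertools import combinations

    records = [
        {"id": 1, "name": "Acme Corp"},
        {"id": 2, "name": "ACME Corporation"},
        {"id": 3, "name": "Acme Corp (UK) Ltd"},   # a subsidiary: a distinct entity here
        {"id": 4, "name": "Zenith Industries"},
        {"id": 5, "name": "Zenith"},
    ]

    # Hand-labeled truth: pairs of ids that really refer to the same entity.
    true_matches = {(1, 2), (4, 5)}

    def similar(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Suggest matches above a similarity threshold (crude "fuzzy matching").
    suggested = {
        (r1["id"], r2["id"])
        for r1, r2 in combinations(records, 2)
        if similar(r1["name"], r2["name"]) >= 0.6
    }

    true_positives = suggested & true_matches
    precision = len(true_positives) / len(suggested) if suggested else 0.0
    recall = len(true_positives) / len(true_matches) if true_matches else 0.0
    # The subsidiary is wrongly suggested (hurting precision) and the short
    # "Zenith" record is missed (hurting recall).
    print(f"suggested={sorted(suggested)} precision={precision:.2f} recall={recall:.2f}")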

The “big pharma” company Novartis is also using Tamr. Novartis has many different sources of biomedical data that it employs in research processes, making curation difficult. Mark Schreiber, then an “informatician” at Novartis Institutes for Biomedical Research (he has since moved to Merck), oversaw the testing of Tamr going all the way back to its academic roots at MIT. He is particularly interested in the tool’s crowdsourcing capabilities, as he wrote in a blog post:

The approach used gives you a critical piece of the workflow bridging the gap between the machine learning/automated data improvement and the curator. When the curator isn’t confident in the prediction or their own expertise, they can distribute tasks to your data producers and consumers to ask their opinions and draw on their expertise and institutional memory, which is not stored in any of your data systems.

I also spoke with Tim Kasbe, the COO of Gloria Jeans, which is the largest “fast fashion” retailer in Russia and Ukraine. Gloria Jeans has tried out Tamr on several different data problems, and found it particularly useful for identifying and removing duplicate loyalty program records. Here are some results from that project:

We loaded data for about 100,000 people and families and ran our algorithms on them and found about 5,000 duplicated entries. A portion of these represented people or families that had signed up for multiple discount cards. In some cases, the discount cards had been acquired in different locations or different contact information had been used to acquire them. The whole process took about an hour and did not need deep technical staff due to the simple and elegant Tamr user experience. Getting to trustworthy data to make good and timely decisions is a huge challenge this tool will solve for us, which we have now unleashed on all our customer reference data, both inside and outside the four walls of our company.

I am encouraged by these reports that we are on the verge of a breakthrough in this domain. But don’t take my word for it—do a proof of concept with one of these types of tools.

Don’t Model, Catalog

One of the paradoxes of IT planning and architecture is that those activities have made it more difficult for people to find the data they need to do their work. According to Gartner, much of the roughly $3–4 trillion invested in enterprise software over the last 20 years has gone toward building and deploying software systems and applications to automate and optimize key business processes in the context of specific functions (sales, marketing, manufacturing) and/or geographies (countries, regions, states, etc.). As each of these idiosyncratic applications is deployed, an equally idiosyncratic data source is created. The result is that data is extremely heterogeneous and siloed within organizations.

For generations, companies have created “data models,” “master data models,” and “data architectures” that lay out the types, locations, and relationships of all the data that they have now and will have in the future. Of course, those models rarely get implemented exactly as planned, given the time and cost involved. As a result, organizations have no guide to what data they actually have in the present and how to find it. Instead of creating a data model, they should create a catalog of their data—a straightforward listing of what data exists in the organization, where it resides, who’s responsible for it, and so forth.

One reason why companies don’t create simple catalogs of their data is that the result is often somewhat embarrassing and irrational. Data is often duplicated many times across the organization. Different data is referred to by the same term, and the same data by different terms. A lot of data that the organization no longer needs is still hanging around, and data that the organization could really benefit from is nowhere to be found. It’s not easy to face up to all of the informational chaos that a cataloging effort can reveal.

Perhaps needless to say, however, cataloging data is worth the trouble and initial shock at the outcome. A data catalog that lists what data the organization has, what it’s called, where it’s stored, who’s responsible for it, and other key metadata can easily be the most valuable information offering that an IT group can create.
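
To show how little structure such a catalog actually requires, here is a hypothetical sketch (the field names and example entries are mine, not a standard schema): a flat list of entries recording what exists, where it lives, and who owns it, plus a naive keyword search over that list.

    # A hypothetical sketch of a basic data catalog: a flat list of entries
    # describing what exists, where it lives, and who is responsible for it.
    # Field names and example entries are illustrative, not a standard schema.
    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        name: str                  # what the data set is called
        description: str           # what it contains, in business terms
        location: str              # system or path where it resides
        owner: str                 # who is responsible for it
        tags: list[str] = field(default_factory=list)

    catalog = [
        CatalogEntry("customer_master", "Deduplicated customer records",
                     "warehouse.sales.customer_master", "Sales Operations",
                     ["customer", "pii"]),
        CatalogEntry("loyalty_signups", "Loyalty program enrollments by store",
                     "crm.loyalty.signups", "Marketing", ["customer", "loyalty"]),
    ]

    def find(entries: list[CatalogEntry], term: str) -> list[CatalogEntry]:
        """Naive keyword search across names, descriptions, and tags."""
        term = term.lower()
        return [e for e in entries
                if term in e.name.lower()
                or term in e.description.lower()
                or any(term in t for t in e.tags)]

    for entry in find(catalog, "customer"):
        print(entry.name, "->", entry.location, f"(owner: {entry.owner})")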

Cataloging Tools

Given that IT organizations have been more preoccupied with modeling the future than describing the present, enterprise vendors haven’t really addressed the catalog tool space to a significant degree. There are several catalog tools for individuals and small businesses, and several vendors of ETL (extract, transform, and load) tools have some cataloging capabilities built into their offerings. Some also tie a catalog to a data governance process, although “governance” is right up there with “bureaucracy” as a term that makes many people wince.

At least a few data providers and vendors are actively pursuing catalog work, however. One company, Enigma, has created a catalog for public data, for example. The company has compiled a set of public databases, and you can simply browse through its catalog (for free if you are an individual) and check out what data you can access and analyze. That’s a great model for what private enterprises should be developing, and I know of some companies (including Tamr, Informatica, Paxata, and Trifacta) that are developing tools to help companies develop their own catalogs.

In industries such as biotech and financial services, for example, you increasingly need to know what data you have—and not only so you can respond to business opportunities. Industry regulators are also concerned about what data you have and what you are doing with it. In biotech companies, for example, any data involving patients has to be closely monitored and its usage controlled, and in financial services firms there is increasing pressure to keep track of customers’ and partners’ “legal entity identifiers,” and to ensure that dirty money isn’t being laundered.

If you don’t have any idea of what data you have today, you’re going to have a much tougher time adhering to the demands from regulators. You also won’t be able to meet the demands of your marketing, sales, operations, or HR departments. Knowing where your data is seems perhaps the most obvious tenet of information management, but thus far, it has been among the most elusive.

Keep Everything Simple and Straightforward

While data management is a complex subject, traditional information architectures are generally more complex than they need to be. They are usually incomprehensible not only to nontechnical people, but also to the technical people who didn’t have a hand in creating them. From IBM’s Business Systems Planning—one of the earliest architectural approaches—up through master data management (MDM), architectures feature complex and voluminous flow diagrams and matrices. Some look like the circuitry diagrams for the latest Intel microprocessors. MDM has the reasonable objective of ensuring that all important data within an organization comes from a single authoritative source, but it often gets bogged down in discussions about who’s in charge of data and whose data is most authoritative.

It’s unfortunate that information architects don’t emulate architects of physical buildings. While they definitely require complex diagrams full of technical details, good building architects don’t show those blueprints to their clients. For clients, they create simple and easy-to-digest sketches of what the building will look like when it’s done. If it’s an expensive or extensive building project, they may create three-dimensional models of the finished structure.

More than 30 years ago, Michael Hammer and I created a new approach to architecture based primarily on “principles.” These are simple, straightforward articulations of what an organization believes and wants to achieve with information management; the equivalent of a sketch for a physical architect. Here are some examples of the data-oriented principles from that project:

  • Data will be owned by its originator but will be accessible to higher levels.

  • Critical data items in customer and sales files will conform to standards for name, form, and semantics.

  • Applications should be processed where data resides.

We suggested that an organization’s entire list of principles—including those for technology infrastructure, organization, and applications, as well as data management—should take up no more than a single page. Good principles can be the drivers of far more detailed plans, but they should be articulated at a level that facilitates understanding and discussion by businesspeople. In this age of digital businesses, such simplicity and executive engagement are far more critical than they were in 1984.

Use an Ecological Approach

I hope I have persuaded you that enterprise-level models (or really models at any level) are not sufficient to change individual and organizational behavior with respect to data. But now I will go even further and argue that neither models, nor technology, nor policy, nor any other single factor is enough to move behavior in the right direction. Instead, organizations need a broad, ecological approach to data-oriented behaviors.

In 1997 I wrote a book called Information Ecology: Mastering the Information and Knowledge Environment (Oxford University Press). It was focused on this same idea—that multiple factors and interventions are necessary to move an organization in a particular direction with regard to data and technology management. Unlike engineering-based models, ecological approaches assume that technology alone is not enough to bring about the desired change, and that with multiple interventions an environment can evolve in the right direction. In the book, I describe one organization, a large UK insurance firm called Standard Life, that adopted the ecological approach and made substantial progress on managing its customer and policy data. Of course, no one—including Standard Life—ever achieves perfection in data management; all one can hope for is progress.

In Information Ecology, I discussed the influence on a company’s data environment of a variety of factors, including staff, politics, strategy, technology, behavior and culture, process, architecture, and the external information environment. I’ll explain the lesser-known aspects of this model briefly.

Staff, of course, refers to the types of people and skills that are present to help manage information. Politics refers primarily to the type of political model for information that the organization employs; as noted earlier, I prefer federalism for most large companies. Strategy is the company’s focus on particular types of information and particular objectives for it. Behavior and culture refers to the particular information behaviors (e.g., not creating new data sources but instead reusing existing ones) that the organization is trying to elicit; in the aggregate they constitute “information culture.” Process involves the specific steps that an organization undertakes to create, analyze, disseminate, store, and dispose of information. Finally, the external information environment consists of information sources and uses outside of the organization’s boundaries that the organization may draw on to improve its information situation. Most organizations have architectures and technology in place for data management, but they have few, if any, of these other types of interventions.

I am not sure that these are now (or ever were) the only types of interventions that matter, and in any case the salient factors will vary across organizations. But I am quite confident that an approach that employs multiple factors to achieve an objective (for example, to achieve greater use of common information) is more likely to succeed than one focused only on technology or architectural models.

Together, the approaches I’ve discussed in this chapter comprise a common-sense philosophy of data management that is quite different from what most organizations have employed. If for no other reason, organizations should try something new because so many have yet to achieve their desired state of data management.
