Chapter 5. From DevOps to DataOps

Why It’s Time to Embrace “DataOps” as a New Discipline

Over the past 10 years, the technology industry has experienced the emergence of “DevOps.” This new set of practices and tools has improved the velocity, quality, predictability, and scale of software engineering and deployment. Starting at the large Internet companies, the trend toward DevOps is now transforming the way that systems are developed and managed inside the enterprise—often dovetailing with enterprise cloud adoption initiatives. Regardless of your opinion about on-prem versus multitenant cloud infrastructure, the adoption of DevOps is improving how quickly new features and functions are delivered at scale for end users.

There is a lot to learn from the evolution of DevOps, across the modern Internet as well as within the modern enterprise—most notably for those who work with data every day.

At its core, DevOps is about the combination of software engineering, quality assurance, and technology operations (Figure 5-1). DevOps emerged because traditional systems management (as opposed to software development management) was not adequate to meet the needs of modern, web-based application development and deployment.

Figure 5-1. DevOps in the enterprise

From DevOps to DataOps

It’s time for data engineers and data scientists to embrace a new, similar discipline—let’s call it “DataOps”—that at its core addresses the needs of data professionals inside the modern enterprise.

Two trends are creating the need for DataOps:

  1. The democratization of analytics is giving more individuals access to cutting-edge visualization, data modeling, machine learning, and statistical tools.

  2. The implementation of “built-for-purpose” database engines is improving the performance and accessibility of large quantities of data, at unprecedented velocity. The techniques to improve beyond legacy relational DBMSs vary across markets, and this has driven the development of specialized database engines such as StreamBase, Vertica, VoltDB, and SciDB.

    More recently, Google made its massive Cloud Bigtable database (the same one that powers Google Search, Maps, YouTube, and Gmail) available to everyone as a scalable NoSQL database service accessed through the Apache HBase API.

Together, these trends create pressure from both “ends of the stack.” From the top of the stack, users want access to more data in more combinations. From the bottom of the stack, more data is available than ever before—some aggregated, but much of it not. The only way for data professionals to deal with the pressure of heterogeneity from both the top and bottom of the stack is to embrace a new approach to managing data. This new approach blends operations and collaboration. The goal is to organize and deliver data from many sources, to many users, reliably. At the same time, it’s essential to maintain the provenance required to support reproducible data flows.

Defining DataOps

DataOps is a data management method used by data engineers, data scientists, and other data professionals that emphasizes:

  • Communication

  • Collaboration

  • Integration

  • Automation

DataOps acknowledges the interconnected nature of data engineering, integration, quality, and security and privacy. It aims to help an organization rapidly deliver data that accelerates analytics, and to enable previously impossible analytics.

The “ops” in DataOps is very intentional. The operation of infrastructure required to support the quantity, velocity, and variety of data available in the enterprise today is radically different from what traditional data management approaches have assumed. The nature of DataOps embraces the need to manage many data sources and many data pipelines, with a wide variety of transformations.

Changing the Fundamental Infrastructure

While people have been managing data for a long time, we’re at a point now where the quantity, velocity, and variety of data available to a modern enterprise can no longer be managed without a significant change in the fundamental infrastructure. The design of this infrastructure must focus on:

  • The thousands of sources that are not centrally controlled, and which frequently change their schemas without notification (much in the way that websites change frequently without notifying search engines)

  • Treating these data sources (especially tabular data sets) as if they were websites being published inside of an organization

DataOps challenges preconceived notions of how to engage with the vast quantities of data being collected every day. Satisfying the enormous appetite for this data requires that we sort it in a way that is rapid, interactive, and flexible. The key to DataOps is that you don’t have to theorize and manage your data schemas up front, with a misplaced idealism about how the data should look.
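
To ground that idea, here is a minimal sketch, in Python, of what bottom-up schema handling can look like in practice: instead of fixing a schema up front, a small poller fingerprints the columns each source actually exposes and flags drift when they change, much as a crawler notices that a website has changed. The source name, file path, and state file are illustrative assumptions, not features of any particular product.

    import csv
    import hashlib
    import json
    from pathlib import Path

    STATE_FILE = Path("schema_state.json")  # last observed schema per source (assumed local state)

    def observed_schema(csv_path: str) -> list[str]:
        """Read only the header row of a CSV source and return its column names."""
        with open(csv_path, newline="") as f:
            return next(csv.reader(f))

    def fingerprint(columns: list[str]) -> str:
        """Stable hash of the column list, so schemas can be compared cheaply."""
        return hashlib.sha256("|".join(columns).encode()).hexdigest()

    def check_for_drift(source_name: str, csv_path: str) -> bool:
        """Return True if the source's schema changed since the last poll, and record what was seen."""
        state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
        columns = observed_schema(csv_path)
        new_fp = fingerprint(columns)
        old_fp = state.get(source_name, {}).get("fingerprint")
        state[source_name] = {"fingerprint": new_fp, "columns": columns}
        STATE_FILE.write_text(json.dumps(state, indent=2))
        return old_fp is not None and old_fp != new_fp

    # Hypothetical source: a departmental extract that may change without notice.
    if check_for_drift("sales_crm_extract", "customers.csv"):
        print("Schema changed; flag downstream pipelines for review rather than failing silently.")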

DataOps Methodology

Using DataOps methodology, you start with the data as it is and work from the bottom up. You work with it, integrate it, uncover insights along the way, and find more data and more data sources that support or add to what you have discovered. Eventually, you come away with higher-quality outcomes than if you had tried to sort through the information from the top down with a specific goal in mind.

DataOps methodology brings a more agile approach to interrogating and analyzing data, on a very large scale. At some point, what you want is all the data. If you have all the data in a clear, comprehensible format, then you can actually see things that other people can’t see. But you can’t reach that monumental goal by simply declaring that you’re going to somehow conjure up all of the data in one place—instead, you have to continually iterate, execute, evaluate, and improve, just like when you are developing software.

If you want to do a better job with the quality of the data you are analyzing, you’ve got to develop information-seeking behaviors. The desire to look at more information and use more data sources gives you better signals from the data and uncovers more potential sources of insight. This creates a virtuous cycle: as data is utilized and processed, it becomes well organized and accessible, allowing more data to emerge and enter the ecosystem.

Any enterprise data professional knows that data projects can quickly become insurmountable if they rely heavily on manual processes. DataOps requires automating many of these processes to quickly incorporate new data into the existing knowledge base. First-generation DataOps tools (such as Tamr’s Data Unification platform) focus on making agile data management easier.
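
As a hedged illustration of the kind of manual step worth automating (a generic sketch, not Tamr's API), the snippet below folds each newly arrived batch of records into an existing knowledge base keyed by record ID, so reruns are idempotent and new data is incorporated as it appears rather than hand-reconciled.

    import json
    from pathlib import Path

    KNOWLEDGE_BASE = Path("knowledge_base.json")  # assumed local store, keyed by record id

    def incorporate(new_records: list[dict], key: str = "id") -> dict:
        """Merge a batch of new records into the knowledge base; repeated runs are harmless."""
        kb = json.loads(KNOWLEDGE_BASE.read_text()) if KNOWLEDGE_BASE.exists() else {}
        for record in new_records:
            record_id = str(record[key])
            kb[record_id] = {**kb.get(record_id, {}), **record}  # newest values win
        KNOWLEDGE_BASE.write_text(json.dumps(kb, indent=2))
        return kb

    # Each batch from a new source is folded in automatically as it arrives.
    incorporate([{"id": 101, "name": "Acme Corp", "region": "EMEA"}])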

Integrating DataOps into Your Organization

Much of what falls under the umbrella of big data analytics today involves idiosyncratic and manual processes for breaking down data. Often, companies will have hundreds of people sifting through data for connections, or trying to find overlap and repetition. Despite the investment of these resources, new sources of data actually make this work harder—much, much harder—which means more data can limit instead of improve outcomes. DataOps tools will eliminate this superlinear relationship between data sources and the amount of resources required to manage them, making data management automated and truly scalable.

To integrate this revolutionary data management method into an enterprise, you need two basic components. The first is cultural—enterprises need to create an environment of communication and cooperation among data analytics teams. The second component is technical—workflows will need to be automated with technologies like machine learning to recommend, collect, and organize information. This groundwork will help radically simplify administrative debt and vastly improve the ability to manage data as it arrives.

The Four Processes of DataOps

As illustrated in Figure 5-2, four processes work together to create a successful DataOps workflow:

  • Engineering

  • Integration

  • Quality

  • Security

Within the context of DataOps, these processes work together to create meaningful methods of handling enterprise data. Without them, working with data becomes expensive, unwieldy, or—worse—insecure.

Figure 5-2. Four processes of DataOps

Data Engineering

Organizations trying to leverage all the possible advantages derived from mining their data need to move quickly to create repeatable processes for productive analytics. Instead of starting with a specific analytic in mind and working through a manual process to get to that endpoint, the data sifting experience should be optimized so that the most traditionally challenging, but least impactful, aspects of data analysis are automated.

Take, for example, the management of customer information in a CRM database or other database product. Sorting through customer data to make sure that the information is accurate is a challenge that many organizations either address manually—which is bad—or don’t address at all, which is worse. No company should have to settle for bad data or be overwhelmed by working with its data in an age when machine learning can be applied as a balm for these problems.
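
To make that concrete, here is a minimal sketch of the kind of duplicate detection this implies, using only plain string similarity from the Python standard library; a production system would substitute a learned matching model, and the sample records are invented for illustration.

    from difflib import SequenceMatcher
    from itertools import combinations

    # Hypothetical CRM rows; real sources are far larger and messier.
    customers = [
        {"id": 1, "name": "Acme Corporation", "email": "info@acme.com"},
        {"id": 2, "name": "ACME Corp.", "email": "info@acme.com"},
        {"id": 3, "name": "Globex", "email": "sales@globex.com"},
    ]

    def name_similarity(a: str, b: str) -> float:
        """Crude, case-insensitive name similarity; a trained model would replace this."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def candidate_duplicates(rows, threshold: float = 0.6):
        """Yield pairs of record IDs that look like the same customer, for review or merge."""
        for left, right in combinations(rows, 2):
            if left["email"] == right["email"] or name_similarity(left["name"], right["name"]) >= threshold:
                yield left["id"], right["id"]

    print(list(candidate_duplicates(customers)))  # [(1, 2)]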

The central problem of previous approaches to data management was the lack of automation. The realities of manually bringing together data sources restricted projects’ goals and therefore limited the focus of analytics—and if the analytical outcomes did not match the anticipated result, the whole effort was wasted. Moving to DataOps ensures that foundational work for one project can give a jump-start to the next, which expands the scope of analytics.

A bias toward automation is even more critical when addressing the huge variety of data sources that enterprises have access to. Only enterprises that engineer with this bias will truly be able to be data-driven—because only these enterprises will begin to approach that lofty goal of gaining a handle on all of their data.

To serve your enterprise customers the right way, you have to deliver the right data. To do this, you need to engineer a process that automates getting the right data to your customers, and to make sure that the data is well integrated for those customers.

Data Integration

Data integration is the mapping of physical data entities so that one piece of data can be differentiated from another.

Many data integration projects fail because most people and systems lack the ability to differentiate data correctly for a particular use case. There is no one schema to rule them all; rather, you need the ability to flexibly create new logical views of your data within the context of your users’ needs. Existing enterprise processes usually merge information too literally, leading to inaccurate data points. For example, CRM projects are often riddled with duplicate customer names and inaccurate email data, and attributes like location or email address may be assigned without ever being validated.

Tamr’s approach to data integration is “machine driven, human guided.” The “machines” (computers running algorithms) group data that is similar and should be integrated into a single data point. A small team of skilled analysts validates whether the data is right or wrong, and their feedback informs the machines, continually improving the quality of automation over time. This cycle can remove inaccuracies and redundancies from data sets, which is vital to finding value and creating new views of data for each use case.
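
The following schematic sketch shows the shape of that cycle under stated assumptions (it is not Tamr's implementation): a matching model scores candidate pairs, the machine accepts those above a confidence threshold, analysts confirm or reject a reviewed sample, and their verdicts adjust the threshold used on the next pass.

    def machine_propose(scored_pairs: list[tuple[str, str, float]], threshold: float):
        """Machine step: accept candidate pairs the model scores above the current threshold."""
        return [(a, b) for a, b, score in scored_pairs if score >= threshold]

    def rejection_rate(proposals, analyst_verdicts: dict) -> float:
        """Human step: the share of reviewed proposals that analysts reject."""
        reviewed = [p for p in proposals if p in analyst_verdicts]
        if not reviewed:
            return 0.0
        return sum(1 for p in reviewed if not analyst_verdicts[p]) / len(reviewed)

    def run_cycle(scored_pairs, analyst_verdicts, threshold=0.7, step=0.05):
        """One iteration: propose, review, then tighten or relax the threshold for the next pass."""
        proposals = machine_propose(scored_pairs, threshold)
        rate = rejection_rate(proposals, analyst_verdicts)
        # Many rejections -> raise the bar; few rejections -> trust the machine a little more.
        next_threshold = threshold + step if rate > 0.2 else max(threshold - step, 0.5)
        return proposals, next_threshold

    pairs = [("cust-1", "cust-2", 0.92), ("cust-4", "cust-9", 0.71)]
    verdicts = {("cust-1", "cust-2"): True, ("cust-4", "cust-9"): False}
    accepted, next_threshold = run_cycle(pairs, verdicts)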

This is a key part of DataOps, but it doesn’t work if there is nothing actionable that can be drawn from the data being analyzed. That value depends on the quality of the data being examined.

Data Quality

Quality is purely subjective. DataOps moves you toward a system that recruits users to improve data quality in a bottom-up, bidirectional way. The system should be bottom-up in the sense that data quality is not some theoretical end state imposed from on high, but rather is the result of real users engaging with and improving the data. It should be bidirectional in that the data can be manipulated and dynamically changed.

If a user discovers some weird pattern or duplicates while analyzing data, resolving these issues immediately is imperative; your system must give users this ability to submit instant feedback. It is also important to be able to manipulate and add more data to an attribute as correlating or duplicate information is uncovered.
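
A minimal sketch of what such instant feedback might look like, assuming an in-memory record store and invented field names: the correction is applied immediately and logged with who changed what and when, so the fix is both bidirectional and traceable.

    from datetime import datetime, timezone

    # Hypothetical in-memory store; a real system would persist records and feedback durably.
    records = {"cust-7": {"email": "jane.doe@examplecom", "segment": "enterprise"}}
    feedback_log = []

    def submit_correction(record_id: str, field: str, new_value: str, user: str) -> None:
        """Apply a user's fix right away and keep who/what/when for provenance."""
        old_value = records[record_id].get(field)
        records[record_id][field] = new_value
        feedback_log.append({
            "record": record_id, "field": field,
            "old": old_value, "new": new_value,
            "user": user, "at": datetime.now(timezone.utc).isoformat(),
        })

    # A user spots a malformed email while analyzing the data and fixes it on the spot.
    submit_correction("cust-7", "email", "jane.doe@example.com", user="analyst_42")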

Flexibility is also key—the user should be open to what the data reveals, and approach the data as a way to feed an initial conjecture.

Data Security

Companies usually approach data security in one of two ways—either they apply the concept of access control, or they monitor usage.

The idea of an access control policy is that there has to be a way to trace back who has access to which information. This ensures that sensitive information rarely falls into the wrong hands. Actually implementing an access control policy can slow down the process of data analysis, though—and this is the existing infrastructure for most organizations today.

At the same time, many companies don’t worry about who has access to which sets of data. They want data to flow freely through the organization; they put a policy in place about how information can be used, and they watch what people use and don’t use. However, this leaves companies potentially susceptible to malicious misuse of data.

Both of these data protection techniques pose a challenge to combining various data sources, and make it tough for the right information to flow freely.

As part of a system that uses DataOps, these two approaches need to be combined: some level of access control plus usage monitoring. Companies need to manage who is using their data and why, and they also need to be able to trace how people are using the information they hope will yield new insights. This framework for managing the security of your data is necessary if you want to create a broad data asset that is also protected. Combining access control with usage monitoring will make your data both more fluid and more secure.
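
As a rough sketch of that combination, assuming a simple role-based policy table and Python's standard logging module, every read is checked against the policy and recorded whether it is allowed or denied, so access control and usage monitoring happen in the same place.

    import logging

    logging.basicConfig(level=logging.INFO)
    audit_log = logging.getLogger("data_access")

    # Hypothetical policy: which roles may read which datasets.
    ACCESS_POLICY = {
        "customer_pii": {"support", "compliance"},
        "web_analytics": {"analyst", "support"},
    }

    def read_dataset(dataset: str, user: str, role: str) -> str:
        """Enforce the access policy and record every attempt, allowed or denied."""
        allowed = role in ACCESS_POLICY.get(dataset, set())
        audit_log.info("user=%s role=%s dataset=%s allowed=%s", user, role, dataset, allowed)
        if not allowed:
            raise PermissionError(f"{role} may not read {dataset}")
        return f"rows from {dataset}"  # stand-in for the real query

    # Usage is monitored on every call, whether or not access is granted.
    read_dataset("web_analytics", user="jsmith", role="analyst")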

Better Information, Analytics, and Decisions

By incorporating DataOps into existing data analysis processes, a company stands to gain a more granular, better-quality understanding of the information it has and how best to use it. The most effective way to maximize a system of data analytics is through viewing data management not as an unwieldy, monolithic effort, but rather as a fluid, incremental process that aligns the goals of many disciplines.

If you balance out the four processes we’ve discussed (engineering, integration, quality, and security), you’ll empower the people in your organization and give them a game-changing way to interact with data and to create analytical outcomes that improve the business.

Just as the movement to DevOps fueled radical improvements in the overall quality of software and unlocked the value of information technology to many organizations, DataOps stands to radically improve the quality and access to information across the enterprise, unlocking the true value of enterprise data.
