Chapter 5. Key Components of a DataOps Ecosystem

There are many ways to think about the potential components of a next-generation data ecosystem for the enterprise. Our friends at DataKitchen have done a good job with their post on the subject, which draws on some solid work by the Eckerson Group. To simplify the question of what you might buy versus build, and which vendors you might consider, I’ve laid out the primary components of a next-generation enterprise data ecosystem based on the environments I’ve seen people configuring over the past 8 to 10 years and the tools, new and old, that are available. We can summarize these components as follows:

  • Catalog/Registry

  • Movement/ETL

  • Alignment/Unification

  • Storage

  • Publishing

  • Feedback

  • Governance

I provide a brief summary of each of these components in the sections that follow.

Catalog/Registry

Over the past 5 to 10 years, a key function has emerged as a critical starting point in the development of a functional DataOps ecosystem: the data catalog/registry. A number of FOSS and commercial projects attempt to provide tools that enable large enterprises to answer the simple question, “What data do you have?” Apache Atlas, Alation, and Waterline are the ones I see most often in my work at Tamr and in discussions with my chief data officer friends. I’ve always believed that the best data catalog/registry is a “vendor neutral” system that crawls and registers all tabular datasets (both files and tables inside database management systems) using a combination of automation and human guidance.
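
To make the automation half of that idea concrete, here is a minimal sketch of a vendor-neutral registry crawler: it walks a file share, records the schema and approximate size of every CSV it finds in a small SQLite registry, and leaves a reviewed flag for a human curator to set later. The paths, table name, and columns are hypothetical; a real catalog would also crawl database systems and capture much richer metadata and lineage.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical registry table: one row per discovered tabular dataset.
DDL = """
CREATE TABLE IF NOT EXISTS dataset_registry (
    path        TEXT PRIMARY KEY,   -- where the dataset lives
    columns     TEXT,               -- comma-separated header column names
    approx_rows INTEGER,            -- rough size, useful for triage
    reviewed    INTEGER DEFAULT 0   -- set to 1 once a human curator confirms it
)
"""

def crawl_and_register(root: str, registry_db: str = "catalog.db") -> None:
    """Walk a directory tree and register every CSV file found."""
    conn = sqlite3.connect(registry_db)
    conn.execute(DDL)
    for path in Path(root).rglob("*.csv"):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])
            approx_rows = sum(1 for _ in reader)
        # Upsert so a re-crawl refreshes metadata without losing human review.
        conn.execute(
            "INSERT INTO dataset_registry (path, columns, approx_rows) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(path) DO UPDATE SET "
            "columns = excluded.columns, approx_rows = excluded.approx_rows",
            (str(path), ",".join(header), approx_rows),
        )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    crawl_and_register("/data/shared")  # hypothetical mount point
```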

The problem with vendor-specific systems is that their implementation inevitably devolves into wars between vendor-oriented camps within your organization, and you end up not with one catalog/registry but with several, which rather defeats the purpose. If the vendor providing the catalog system is more open and interoperable (per Chapter 4 on the principles of a DataOps ecosystem), it’s likely OK. This is not an impossible thing to build yourself; I’ve seen two or three built in the past few years that function well. But it is probably just a matter of time before something like Apache Atlas evolves to provide the basics of a vendor-neutral system that is highly open and interoperable.

Movement/ETL

There are so many options for data movement that it’s a bit mind-numbing. They range from the established extract, transform, and load (ETL) and extract, load, and transform (ELT) vendors (Informatica, Talend, Oracle, IBM, Microsoft) to the new breed of movement vendors (my favorites are StreamSets, DataKitchen, and KNIME) and the cloud data platform vendors such as Google Cloud Platform (GCP)/Dataflow, Microsoft Azure, Amazon Web Services (AWS), Databricks, Snowflake, and the dozens of new ones that every VC in the Bay Area is funding by the month.

The most interesting dynamic here, in my opinion, is that most large enterprises use Python for most of their data pipelining, and the idea of broadly mandating a single pipelining tool for their users seems like a bit of a reach. Python is so easy for people to learn, and if you use an orchestration framework such as Apache Airflow, you get significant control and scale without all the overhead and restrictions of a proprietary framework. If you need massive ingest performance and scale, I think that something like StreamSets is your best bet, and I have seen this work incredibly well at GSK. However, most large companies will have requirements and heterogeneity among their pipelines that make Python and Airflow a better fit as the lowest common denominator across their enterprise.
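
To illustrate what that lowest common denominator can look like, here is a minimal sketch of a daily pipeline expressed with Airflow’s TaskFlow API (assuming Airflow 2.4 or later). The DAG name and the extract/load steps are hypothetical placeholders; in practice the tasks would call your source systems and warehouse.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def customer_ingest():
    @task
    def extract() -> list:
        # Hypothetical source; in practice this would call an API or read files.
        return [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]

    @task
    def load(rows: list) -> None:
        # Hypothetical sink; in practice this would write to a warehouse table.
        print(f"loading {len(rows)} rows")

    load(extract())

customer_ingest()
```

Because the pipeline is ordinary Python, swapping a task’s body for a call into a higher-performance tool later does not change the orchestration around it.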

One of the benefits of having a more open, interoperable, and best-of-breed approach is that you can adopt high-performance movement tools incrementally as you need them. For example, you can use Python and Airflow as the baseline or default across your organization and then titrate in high-performance tools like StreamSets where you need the scale and performance. In the long term, this enables your ecosystem of tools to evolve gracefully and helps you avoid massive single-vendor lift-and-shift projects, which are prone to failure, despite the expectations that any single vendor might want to set along with an eight-figure-plus proposal.

Alignment/Unification

The tools required to create consistency in data need to be strongly rooted in the use of three key methods: rules, models, and human feedback. These three methods, implemented with an eye toward an Agile process, are essential to taming the large challenge of variety in enterprise data. Traditional tools that depend on a single data architect to define, a priori, static schemas, controlled vocabularies, taxonomies, ontologies, and relationships are inadequate to solve the challenges of data variety in the modern enterprise. The thoughtful engineering of data pipelines, combined with rules, active learning–based probabilistic models of how data fits together, and deliberate human feedback to provide subject matter expertise and handle corner cases, is essential to long-term success in aligning and molding data broadly.
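
As a rough illustration of how those three methods fit together, the sketch below decides whether two records refer to the same entity: a deterministic rule fires first, a simple similarity score stands in for a trained probabilistic model, and ambiguous pairs are queued for human review (whose answers would, in practice, feed back into model training). The field names and thresholds are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

REVIEW_QUEUE = []  # ambiguous pairs awaiting subject matter expert judgment

def same_entity(a: dict, b: dict, lo: float = 0.6, hi: float = 0.9) -> Optional[bool]:
    """Combine rules, a model score, and human feedback to match two records."""
    # 1. Rules: deterministic knowledge encoded up front.
    if a.get("email") and a.get("email") == b.get("email"):
        return True

    # 2. Model: a trained classifier in production; a simple string
    #    similarity over names stands in for it here.
    score = SequenceMatcher(None, a.get("name", ""), b.get("name", "")).ratio()
    if score >= hi:
        return True
    if score <= lo:
        return False

    # 3. Human feedback: route the uncertain middle band to experts.
    REVIEW_QUEUE.append((a, b, score))
    return None
```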

Storage

The biggest change in data systems over the past 10 years has been the evolution of data storage platforms, both cloud and on-premises. When my partner Mike and I started Vertica back in 2004, the database industry had plateaued in terms of new design patterns. At the time, everyone in the enterprise was using traditional “row stores” for every type of workload, regardless of the fundamental fit. This was the gist of Mike and Ugur Cetintemel’s paper “‘One Size Fits All’: An Idea Whose Time Has Come and Gone” and my USENIX talk in 2010. Overall, the key next-generation platforms that are top of mind now include the following:

  • AWS: Redshift, Aurora, et al.

  • GCP: Bigtable, Spanner

  • Azure: SQL Services et al.

  • Databricks

  • Snowflake

  • Vertica

  • Postgres

The capabilities available in these platforms are dramatic, and the pace of improvement is truly exceptional. I’m deliberately not putting the traditional big vendors (IBM, Oracle, Teradata) on this list because most of them are falling further behind by the day. The cloud platform vendors have a huge advantage over the on-premises vendors: they can improve their systems radically and quickly, without the latency of long on-premises release cycles and the proverbial game of “telephone” that on-premises customers and vendors play, in which it takes quarters at best, and more often years or even decades, to get improvements identified, prioritized, engineered, and delivered into production.

The pace of change is what I believe will make the cloud-oriented platform vendors more successful than the on-premises platforms in the medium term. The best customers running on-premises today are configuring their infrastructures to be compatible with cloud platforms so that when they migrate, they can do so with minimal change. These smart on-premises customers will have the advantage of a healthy abstraction from the cloud platform vendors’ proprietary services, which could otherwise “lock in” customers who start on that cloud platform from scratch.

Publishing

When data is organized and cleaned, providing a mechanism to broadly publish high-quality data is essential. This component delivers both machine- and human-readable forms of dynamic datasets that have been broadly and consistently prepared. It also provides methods to recommend new data (rows, columns, values, relationships) that is discovered bottom-up in the data over time. These methods can be instrumented into consumption endpoints such as analytic tools, so that as new data becomes available, recommendations can be made dynamically to data consumers who might want to use the continuously improved and updated data in their analytics, operational systems, and day-to-day work.
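
As a sketch of what such a publishing step might look like, the function below writes a prepared dataset to a versioned CSV alongside a machine-readable manifest, and flags columns that are new since the last published version so that downstream tools could surface them as recommendations. The directory layout and manifest fields are hypothetical.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

def publish(rows: list, name: str, out_dir: str = "published") -> dict:
    """Publish a prepared dataset as a versioned CSV plus a machine-readable manifest."""
    out = Path(out_dir) / name
    out.mkdir(parents=True, exist_ok=True)
    columns = sorted({key for row in rows for key in row})

    # Compare against the previous manifest, if any, to spot newly added columns.
    manifest_path = out / "manifest.json"
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new_columns = [c for c in columns if c not in previous.get("columns", [])]

    version = f"{datetime.now(timezone.utc):%Y%m%dT%H%M%S}"
    with open(out / f"{version}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    manifest = {
        "name": name,
        "version": version,
        "columns": columns,
        "new_columns": new_columns,   # candidates to recommend to consumers
        "row_count": len(rows),
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

On a later publish that adds a column, the new_columns field in the manifest would list it, which is the hook an analytic tool could use to recommend the new data to interested consumers.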

Feedback

This is perhaps the most impactful and least appreciated component of a next-generation DataOps ecosystem. Currently, data in most enterprises flows unidirectionally from sources through deterministic and idiosyncratic pipelines toward data warehouses, marts, and spreadsheets, and that is where the story stops. There are remarkably few systematic feedback mechanisms that let information flow from the point of consumption back into the pipelines, and all the way back to the sources, so that the data can be improved over time. Most large organizations lack any “queue” of data problems identified by data consumers. At Tamr we’ve created Steward to help address this problem, providing a vendor-neutral queue of consumer-reported data problems for the enterprise.
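
This is not Steward, but the sketch below shows the shape of the idea: a vendor-neutral issue queue that any consumption endpoint can write to and that pipeline owners can drain. The fields and statuses are hypothetical, and a real system would persist the queue rather than hold it in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataIssue:
    dataset: str           # e.g., a published dataset name
    column: str            # the field the consumer is questioning
    description: str       # what looks wrong, in the consumer's own words
    reported_by: str
    status: str = "open"   # open -> triaged -> fixed at source
    reported_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

ISSUE_QUEUE = []  # in-memory stand-in for a durable, shared queue

def report_issue(dataset: str, column: str, description: str, user: str) -> DataIssue:
    """Called from the consumption endpoint (a BI tool, notebook, or application)."""
    issue = DataIssue(dataset, column, description, user)
    ISSUE_QUEUE.append(issue)
    return issue

def open_issues(dataset: str) -> list:
    """Pipeline owners pull the queue for the datasets they maintain."""
    return [i for i in ISSUE_QUEUE if i.dataset == dataset and i.status == "open"]
```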

Governance

Governance is a key component of a modern ecosystem. The evolution of data privacy has pushed governance to the top of the priority list, driven by the need to comply with many important new regulations. My belief is that the best governance infrastructure focuses on codifying information access policy and enforcing that policy across users, in the context of key roles that are aligned with the policy. Focusing governance on information use helps you avoid trying to boil the proverbial infinite ocean of data source complexity. Having a codified information access policy, along with methods for executing that policy in real time as users consume data, should be the core goal of any governance infrastructure initiative.
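
A minimal sketch of that idea, assuming a purely role- and column-based policy: the policy is codified as data, and a single enforcement function applies it at the moment of consumption. The roles, datasets, and columns are hypothetical; a real implementation would also handle row-level rules, auditing, and policy versioning.

```python
# Hypothetical codified access policy: roles mapped to the columns they may see.
POLICY = {
    "analyst":   {"customers": {"id", "region", "segment"}},
    "marketing": {"customers": {"id", "region", "segment", "email"}},
}

def enforce(role: str, dataset: str, rows: list) -> list:
    """Apply the policy at consumption time by dropping unauthorized columns."""
    allowed = POLICY.get(role, {}).get(dataset, set())
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]

# An analyst querying customer data never sees the email column.
rows = [{"id": 1, "region": "EMEA", "segment": "SMB", "email": "a@example.com"}]
print(enforce("analyst", "customers", rows))
```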
