Chapter 5. AI Data Pipeline

In God we trust; all others bring data.

W. Edwards Deming

There are now more mobile devices than people on the planet, and each one collects data every second on our habits, physical activity, locations traveled, and daily preferences. Every day we create 2.5 quintillion bytes of data, and it comes from everywhere: IoT sensors in the home, social media posts, pictures, videos, purchase transactions, and GPS location data monitoring our every move.

Data is even touted as being more important and valuable than oil. For that reason, companies are creating vast repositories of raw data (typically called data lakes) that hold both historical and real-time data. Being able to apply AI to this enormous quantity of data is a dream of many companies across industries. To do so, you have to pick the right set of tools not only to store the data but also to access it as efficiently as possible. Current tools are evolving, and the way in which you store and present your data must change accordingly. Failure to do so will leave you and your data behind. To illustrate this point, a study by MIT professor Erik Brynjolfsson found that firms using data-driven decision making are 5% more productive and profitable than their competitors. Additional research shows that organizations using analytics see a payback of $13.01 for every dollar spent.

As we’ve seen so far, if large amounts of high-quality data are a prerequisite for a successful implementation of AI in the enterprise, then a process for obtaining and preparing the data is equally critical.

In previous chapters, we’ve covered some of the major applications of AI in the enterprise, from NLP to chatbots to computer vision. We’ve also discussed the importance of data to all of these implementations. What we’ve implied, yet haven’t actually touched on to this point, is the concept of a data pipeline that forms the backbone for all these AI implementations.

Whether you use off-the-shelf technology or build your own, these AI solutions can’t be effective without a data pipeline. With the numerous third-party solutions on the market like IBM Watson, it’s easy to forget that even the simplest implementation needs a data pipeline. For example, in a computer vision solution you still need to find representative images, train a model on them, and then provide a mechanism for repeating this loop with new and better data as the algorithm improves. With NLP, you still need to feed the APIs source text to process and then often train custom models with data from your domain. With chatbots, your initial data pipeline would focus on the known questions and answers from your existing customer support logs, and then on building a process to capture new data to feed back into the chatbot. As you move from SaaS offerings toward developing your own AI solutions, data pipeline needs grow even larger and become even more critical to the overall implementation of AI in enterprise applications.

Not only is the data pipeline a crucial component of performing AI, but it also applies elsewhere in the enterprise—specifically in analytics and business intelligence. While the intricacies of creating a full AI data pipeline are outside the scope of this book, next we’ll provide a high-level guide for getting started.

So what exactly is a data pipeline for AI? Dataconomy defines a data pipeline as “an ideal mix of software technologies that automate the management, analysis and visualization of data from multiple sources, making it available for strategic use.” Data preparation, a data platform, and discovery are all significant pieces of an effective pipeline. Unfortunately, a data pipeline can be one of the most expensive parts of the enterprise AI solution.

Key to this process is making sure data can be accessed in an integrated manner instead of sitting in different silos, both internal and external to the enterprise. This ability to access and analyze real-time or at least recent data is key to an AI data pipeline (Figure 5-1).

Figure 5-1. AI data pipeline

As we’ll discuss in the next few sections, outside of the actual data itself, the two most popular components of an AI data pipeline are Apache Hadoop and Apache Spark. In fact, an IBM study of organizations with more than 1,000 employees from 90 countries showed that 57% of surveyed firms either already had or planned to implement a pipeline based on Hadoop/Spark.

Preparing for a Data Pipeline

At a high level, there are several areas enterprises must focus on to effectively build and maintain a data pipeline. The key consideration before even beginning the process is to make sure all stakeholders have bought into the idea of having a data pipeline. There are no shortcuts here, and enterprises must be sure of their preparedness in several areas. First, are your data storage and cloud practices solid? Are employees trained in and aware of them? Next, where is the data stored? Frequently in the enterprise, the data is located in numerous silos, controlled and maintained by different groups in the company. Is everyone bought in and prepared to break down these data silos in order to build the AI pipeline? Finally, is there a process for cleaning data and repairing any issues with metadata? It’s crucial that the enterprise embraces modern data science practices rather than simply rehashing its existing business intelligence systems.

Sourcing Big Data

Just like AI itself, the term big data has a variety of definitions depending on who you talk to. But generally speaking, once the data you’re collecting is too large to fit in memory or in your existing storage systems, you’re entering big data territory. Depending on the infrastructure in your enterprise, this threshold will obviously vary considerably. As the adage goes, you’ll know it when you see it! For our purposes, we’ll define big data as bringing structured and unstructured data together in one place to do some analysis with AI.

IBM describes big data as having four major dimensions: volume, velocity, variety, and veracity. Volume refers to the sheer amount of data involved. As discussed in previous chapters, we’re inundated with data, and it isn’t slowing down. From the over 500 million tweets per day to the 350 billion annual meter readings used to better predict power consumption, enterprises generate large amounts of information. Velocity asks how fast you’re storing and processing your data. Many applications are extremely time-sensitive, so the velocity of the data with respect to your storage and application processing is critical; in areas like fraud detection and customer support, being able to access the data quickly is essential. Variety refers to the diversity of the collected data, from structured to unstructured. This includes text, video, audio, clicks, IoT sensor data, payment records, and more. Finally, big data must have veracity: how accurate or trustworthy is the data? When 1 in 3 business leaders don’t trust the data they need to make decisions, it’s clear that data quality remains a widespread problem.

Returning to our previous discussion of AI winters and how the convergence of various trends has allowed AI to flourish again, a subtheme of this is how big data technology has come to the forefront. Commodity hardware, inexpensive storage, open source software and databases, and the adoption of APIs all have enabled new strategies for creating data pipelines.

Storage: Apache Hadoop

Originally written in Java by Doug Cutting, Hadoop is an open source framework for distributed storage and processing of large data sets, using MapReduce for parallel computation. Incredibly popular and rapidly growing, Hadoop has been estimated to reach a global market of $21 billion by 2018. So just what is Hadoop? IBM Analytics defines it as “a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.”

Hadoop can store data from many sources, serving as a centralized location for storing the data needed for machine learning. Apache Hadoop is itself an ecosystem, made popular by its ability to run on commodity hardware. Two major Hadoop concepts relevant to an AI data pipeline are HDFS and MapReduce. Built to support MapReduce, the Hadoop Distributed File System (HDFS) can hold both structured and unstructured data, providing resilient, scalable storage across multiple computers. HDFS is a purpose-built filesystem for storing big data, while MapReduce is a programming paradigm that refers to two distinct tasks: map and reduce. The map job takes a set of data and converts it to key/value pairs. The reduce job then takes this output from the map job and combines it into a smaller set of key/value pairs for summary operations.
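To make the two stages concrete, here is a minimal, single-machine sketch of the classic word-count example in Python. A real Hadoop job would distribute the map and reduce tasks across many nodes (and the shuffle step would happen over the network), but the flow of key/value pairs is the same.

    from itertools import groupby
    from operator import itemgetter

    lines = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Map: emit a (key, value) pair for every word in the input.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: sort and group the pairs by key (Hadoop performs this
    # step between map and reduce).
    mapped.sort(key=itemgetter(0))

    # Reduce: combine the values for each key into a smaller summary set.
    counts = {word: sum(count for _, count in group)
              for word, group in groupby(mapped, key=itemgetter(0))}

    print(counts)
    # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}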

Like other powerful programming tools, MapReduce allows developers to write code without needing to understand the underlying complexity of distributed systems. If you’d like more detailed information on Hadoop, HDFS, and MapReduce, the following books are excellent resources:

Hadoop as a Data Lake

Hadoop is often used as a data lake. Again, while definitions vary, data lakes are typically considered shared storage environments for large amounts of varying data types, both structured and unstructured. This data can then be used for a variety of applications including analytics and machine learning.

The main feature of a data lake is the ability to centrally store and process raw data that would previously have been too expensive to keep. In contrast to data warehouses, which store structured, processed data, data lakes hold large amounts of raw data in its native format, including structured, semistructured, and unstructured data. Hadoop shines at storing both structured and unstructured data, making it an excellent tool for data lakes.

While data lakes have numerous benefits, from supporting data discovery to analytics and reporting, they do come with a caveat. As an IBM report stated: “Without proper management and governance, a data lake can quickly become a data swamp.”

Discovery: Apache Spark

Created in 2009 at the University of California, Berkeley’s AMPLab, Apache Spark is an open source distributed computing framework that uses in-memory processing to speed up analytic applications. Written in Scala (though it also provides APIs for Java, Python, and R), Spark is not necessarily a replacement for Hadoop, but is instead complementary: it can run on top of Hadoop and take advantage of Hadoop’s previously discussed benefits.

Contributing to its popularity among data scientists, the technology is extremely fast. According to Databricks, Apache Spark can be up to 100x faster than Hadoop MapReduce for large-scale data processing. This speed enables it to solve machine learning problems at a much greater scale than other solutions. Additionally, Spark comes with built-in libraries for working with structured data, stream processing, graphs, and machine learning (Figure 5-2). Numerous third-party projects have also created a thriving ecosystem around Spark.

Figure 5-2. Apache Spark stack

Spark itself doesn’t have a persistent data store; instead, it keeps data in memory for processing. It’s important to reiterate that Spark isn’t a database. It connects to external data sources, usually Hadoop’s HDFS, but also to virtually any commercial or open source data store developers are already using or familiar with, such as HBase, Cassandra, MapR, MongoDB, Hive, Google Cloud, and Amazon S3. Selecting a database for your application is outside the scope of this book, but it’s helpful to know that Spark supports a wide variety of popular options.
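As a rough sketch of what connecting Spark to a data source looks like, the PySpark snippet below starts a local session and reads JSON files from an HDFS path. The application name, path, and event_type column are hypothetical and assume PySpark is installed; the same spark.read call can point at S3, local files, or any of the other sources mentioned above.

    from pyspark.sql import SparkSession

    # Start a local Spark session; "pipeline-demo" is just an illustrative name.
    spark = SparkSession.builder.master("local[*]").appName("pipeline-demo").getOrCreate()

    # Hypothetical HDFS path; the read API is the same for other supported sources.
    events = spark.read.json("hdfs:///data/raw/events/*.json")

    events.printSchema()
    # "event_type" is an assumed column in the hypothetical JSON data.
    events.groupBy("event_type").count().show()

    spark.stop()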

Spark Versus MapReduce

While using Spark doesn’t necessarily preclude the use of MapReduce, it does in many ways compete with MapReduce. For our purposes, there are two main differences to consider. First, and most important, is where the data lives during processing: MapReduce writes data to disk between steps, constantly reading and writing, while Spark keeps the data in memory. Writing to disk is much slower, which is why Spark often sees performance gains of up to 100x over MapReduce. Second, development is considered easier and more expressive, since in addition to map and reduce, Spark also provides operations such as filter, join, and group-by, as shown in the sketch below.
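To illustrate that expressiveness, here is a small hypothetical PySpark example (the table contents and column names are invented) that filters, joins, and groups two DataFrames in a single chained expression, rather than as a series of separate map and reduce jobs.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("spark-ops-demo").getOrCreate()

    # Two small, made-up DataFrames standing in for real tables.
    orders = spark.createDataFrame(
        [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 99.0)],
        ["order_id", "customer", "amount"])
    customers = spark.createDataFrame(
        [("alice", "US"), ("bob", "DE")],
        ["customer", "country"])

    # filter, join, and group-by expressed directly in one chained expression.
    (orders.filter(orders.amount > 20)
           .join(customers, "customer")
           .groupBy("country")
           .sum("amount")
           .show())

    spark.stop()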

Machine Learning with Spark

As previously mentioned, Spark has a built-in module for machine learning called MLlib. This is an integrated, scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, and collaborative filtering. Having this library native to Spark makes machine learning much more accessible, easy, and scalable for developers. It’s easy to get started locally from a command line and then move on to full cluster deployments. With machine learning built in, Spark becomes a foundation for data-centric applications across the organization. Since much of machine learning centers on repeated iterations over the training data, Spark’s ability to keep data in memory makes this process much faster and more efficient.
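As a minimal local sketch, assuming PySpark is installed, the example below trains a logistic regression classifier on a tiny hand-made data set using the DataFrame-based spark.ml API (the newer counterpart to the original RDD-based MLlib library). The numbers are invented purely for illustration.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

    # A tiny hand-made training set: a label and a feature vector per row.
    train = spark.createDataFrame(
        [(0.0, Vectors.dense([0.0, 1.1])),
         (1.0, Vectors.dense([2.0, 1.0])),
         (0.0, Vectors.dense([0.1, 1.3])),
         (1.0, Vectors.dense([1.8, 0.9]))],
        ["label", "features"])

    # Train a logistic regression model; the iterative fitting happens in memory.
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(train)

    model.transform(train).select("label", "prediction").show()

    spark.stop()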

Summary

Big data is being captured everywhere and is growing rapidly. Where is your data stored, and what can you do now to future-proof it for machine learning? While the enterprise typically embraces the tried and true—in this case, Hadoop and Spark—other technologies show great promise and are being embraced in research as well as at startup companies. Some of these open source projects include TensorFlow, Caffe, Torch, Chainer, and Theano.

We’ve covered some of the basics of AI, discussed essential technologies like NLP and chatbots, and now provided an overview of the AI data pipeline. Though various tools and techniques have been discussed throughout, in the next chapter, we’ll wrap up the discussion of AI in the enterprise with a look at how to move forward and really begin your AI journey.
