Data analytics pipeline

Data modeling is the process of using data to build predictive models. Data can also be used for descriptive and prescriptive analysis. But before we can make use of data, it has to be fetched from several sources, stored, assimilated, cleaned, and engineered to suit our goal. The sequence of operations that needs to be performed on data is akin to a manufacturing pipeline, where each subsequent step adds value to the potential end product and each progression requires a new person or skill set.

The various steps in a data analytics pipeline are shown in the following diagram: 

Figure 1.2: Steps in a data analytics pipeline
  1. Extract Data
  2. Transform Data
  3. Load Data
  4. Read & Process Data
  5. Exploratory Data Analysis
  6. Create Features
  7. Build Predictive Models
  8. Validate Models
  9. Build Products
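
To make Steps 1 to 3 more concrete, here is a minimal sketch of an extract-transform-load (ETL) flow in Python with pandas. The file name sales.csv, the column names, and the SQLite destination are hypothetical placeholders; real pipelines will differ in their sources, transformations, and target storage.

    import sqlite3
    import pandas as pd

    # Extract: read raw data from a source (a hypothetical CSV export here).
    raw = pd.read_csv("sales.csv", parse_dates=["order_date"])

    # Transform: drop rows missing key fields, fix types, and derive a field.
    raw = raw.dropna(subset=["customer_id", "amount"])
    raw["amount"] = raw["amount"].astype(float)
    raw["order_month"] = raw["order_date"].dt.to_period("M").astype(str)

    # Load: write the cleaned table into a database (SQLite for simplicity).
    with sqlite3.connect("analytics.db") as conn:
        raw.to_sql("sales", conn, if_exists="replace", index=False)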

These steps can be combined into three high-level categories: data engineering, data science, and product development.

  • Data Engineering: Step 1 to Step 3 in the preceding diagram fall into this category. It deals with sourcing data from a variety of sources, creating a suitable database and table schema, and loading the data into that database. There can be many approaches to these steps, depending on the following:
    • Type of data: Structured (tabular data) versus unstructured (such as images and text) versus semi-structured (such as JSON and XML)
    • Velocity of data updates: Batch processing versus real-time data streaming
    • Volume of data: Distributed (or cluster-based) storage versus single-instance databases
    • Variety of data: Document storage, blob storage, or a data lake
  • Data Science: Step 4 to Step 8 in figure 1.2 fall into the category of data science. This is the phase where the data is made usable and then used to learn patterns, extrapolate them, and predict future outcomes. Data science can be further subdivided into two phases.

Step 4 to Step 6 comprise the first phase, wherein the goal is to understand the data better and make it usable. Making the data usable requires considerable effort to clean it by removing invalid characters and handling missing values. It also involves understanding the nitty-gritty of the data at hand: what is the distribution of the data, what is the relationship between different data variables, is there a causal relationship between the input and outcome variables, and so on. It also involves exploring numerical transformations (features) that might explain this causation better. This phase entails the real forensic effort that goes into the ultimate use of data.

To use an analogy, bamboo seeds remain buried in the soil for years with no sign of a sapling, and then a sapling suddenly sprouts and grows into a full bamboo tree within months. This phase of data science is akin to the underground preparation the bamboo seeds undergo before that rapid growth. It is like the stealth mode of a startup, in which a lot of time and effort is committed. And this is where the pandas library, the protagonist of this book, finds its raison d'être and sweet spot.
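
As an illustration of Steps 4 to 6, the following sketch uses pandas to read, clean, summarize, and feature-engineer a dataset. The file name customers.csv and all column names are hypothetical placeholders used only to show the shape of this phase.

    import pandas as pd

    # Read and process: load the data and fix obvious quality issues.
    df = pd.read_csv("customers.csv")
    df["age"] = pd.to_numeric(df["age"], errors="coerce")        # invalid entries become NaN
    df = df.dropna(subset=["age", "monthly_spend"])               # drop rows missing key fields

    # Exploratory data analysis: distributions and relationships.
    print(df["monthly_spend"].describe())                         # distribution of one variable
    print(df[["age", "monthly_spend", "tenure_months"]].corr())   # pairwise relationships

    # Create features: transformations that may explain the outcome better.
    df["spend_per_tenure"] = df["monthly_spend"] / df["tenure_months"].clip(lower=1)
    df["age_band"] = pd.cut(df["age"], bins=[0, 25, 40, 60, 120],
                            labels=["<25", "25-40", "40-60", "60+"])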

Step 7 and Step 8 constitute the second phase, in which patterns (the parameters of a mathematical expression) are learned from historical data and extrapolated to future data. It involves a lot of experimentation and iteration to get to the optimal results. But if Step 4 to Step 6 have been done with the utmost care, this phase can be implemented fairly quickly thanks to the wealth of packages available in Python, R, and many other data science tools. Of course, it requires a sound understanding of the math and algorithms behind the applied model in order to tweak its parameters to perfection.
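
For Steps 7 and 8, one common route in Python (though by no means the only one) is scikit-learn. The sketch below assumes a cleaned DataFrame df with numeric feature columns and a binary churned column; the feature names echo the earlier sketch, and the churned column is an assumption for illustration.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Split historical data so the model can be validated on unseen records.
    X = df[["age", "monthly_spend", "tenure_months", "spend_per_tenure"]]
    y = df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Build a predictive model: learn the parameters of a mathematical expression.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Validate the model on held-out data before trusting its extrapolations.
    scores = model.predict_proba(X_test)[:, 1]
    print("ROC AUC:", roc_auc_score(y_test, scores))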

  • Product Development: This is the phase where all the hard work bears fruit and all the insights, results, and patterns are served to users in a way that they can consume, understand, and act upon. It might range from building a dashboard on top of the data, with additional derived fields, to an API that calls a trained model and returns predictions on incoming data. A product can also be built to encompass all the stages of the data pipeline, from extracting the data to building a predictive model or creating an interactive dashboard.
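
As a small, hypothetical sketch of the product side, the snippet below wraps a previously trained and saved model in an HTTP endpoint. Flask is just one of many possible choices, and the file name model.joblib and the expected input fields are assumptions for illustration.

    import joblib
    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.joblib")  # a model trained and saved in an earlier step

    @app.route("/predict", methods=["POST"])
    def predict():
        # Turn the incoming JSON payload into a one-row DataFrame with the expected columns.
        payload = request.get_json()
        features = pd.DataFrame([payload])
        probability = model.predict_proba(features)[0, 1]
        return jsonify({"churn_probability": float(probability)})

    if __name__ == "__main__":
        app.run(port=5000)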

Apart from these steps in the pipeline, there are some additional steps that might come into the picture. This is due to the rapidly evolving nature of the data landscape. For example, deep learning, which is used extensively to build intelligent products around image, text, and audio data, often requires the training data to be labeled with a category or augmented if the quantity is too small to create an accurate model.

For example, an object detection task on video data might require the creation of training data for object boundaries and object classes using some tools, or even manually. Data augmentation helps with image data by creating slightly perturbed versions of existing images (rotated or grainy images, for example) and adding them to the training data. For a supervised learning task, labels are mandatory. These labels are generally generated together with the data. For example, to train a churn model, a dataset with customer descriptions and information about whether and when they churned is required. This information is generally available in the company's CRM tool.
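
To illustrate how such a label might be assembled, the sketch below joins a hypothetical customer table exported from a CRM tool with a table of churn events and derives a binary churned column; all file and column names are placeholders.

    import pandas as pd

    # Customer descriptions and churn events, e.g. exported from a CRM tool.
    customers = pd.read_csv("crm_customers.csv")        # one row per customer
    churn_events = pd.read_csv("crm_churn_events.csv")  # customer_id, churn_date

    # Left-join so customers without a churn event keep NaN in churn_date.
    labeled = customers.merge(churn_events, on="customer_id", how="left")

    # The supervised label: 1 if the customer has churned, 0 otherwise.
    labeled["churned"] = labeled["churn_date"].notna().astype(int)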
