Chapter 2
IN THIS CHAPTER
Defining big data
Looking at some sources of big data
Distinguishing between data science and data engineering
Hammering down on Hadoop
Exploring solutions for big data problems
Checking out a real-world data engineering project
There’s a lot of hype around big data these days, but most people don’t really understand what it is or how they can use it to improve their lives and livelihoods. This chapter defines the term big data, explains where big data comes from and how it’s used, and outlines the roles that data engineers and data scientists play in the big data ecosystem. It introduces the fundamental big data concepts you need in order to start generating your own ideas and plans for leveraging big data and data science in your business workflow. (Hint: Mastering some of the technologies discussed in this chapter can also improve your lifestyle, because it opens up opportunities for landing a well-paid position that offers excellent lifestyle benefits.)
Big data is data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it doesn’t fit the structural requirements of traditional database architectures. Whether data volumes rank in the terabyte or petabyte scales, data-engineered solutions must be designed to meet requirements for the data’s intended destination and use.
Three characteristics (known as “the three Vs”) define big data: volume, velocity, and variety. Because the three Vs of big data are continually expanding, newer, more innovative data technologies must continuously be developed to manage big data problems.
Big data volume starts at a lower limit of about 1 terabyte and has no upper limit. If your organization owns at least 1 terabyte of data, it’s probably a good candidate for a big data deployment.
A lot of big data is created nowadays through automated processes and instrumentation, and because data storage costs are relatively inexpensive, system velocity is often the limiting factor. Big data also tends to be low-value on a per-record basis. Consequently, you need systems that are able to ingest a lot of it, on short order, to generate timely and valuable insights.
In engineering terms, data velocity is data volume per unit time. Big data enters an average system at velocities ranging from 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second. Many data-engineered systems are required to have a latency of less than 100 milliseconds, measured from the time the data is created to the time the system responds. Throughput requirements can easily be as high as 1,000 messages per second in big data systems! High-velocity, real-time moving data presents an obstacle to timely decision making. The capabilities of data-handling and data-processing technologies often limit data velocities.
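As a back-of-the-envelope illustration (the function and figures below are my own, built from the numbers cited above, not from any particular system), velocity as volume per unit time can be computed directly:

```python
def throughput_bytes_per_sec(total_bytes, seconds):
    """Data velocity = data volume / time elapsed."""
    return total_bytes / seconds

# 30 GB arriving in one second is the high end cited in the text.
high_end = throughput_bytes_per_sec(30 * 1024**3, 1)

# 1,000 messages per second at a hypothetical 1 KB each is a
# comparatively modest 1 MB/s stream.
msg_stream = throughput_bytes_per_sec(1000 * 1024, 1)

print(high_end)    # bytes per second at the high end
print(msg_stream)  # bytes per second for the message stream
```

The spread between those two numbers (several orders of magnitude) is exactly why a system engineered for one velocity regime can fail badly in another.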
Big data gets even more complicated when you add unstructured and semistructured data to structured data sources. This high-variety data comes from a multitude of sources. The most salient point about it is that it’s composed of a combination of datasets with differing underlying structures (either structured, unstructured, or semistructured). Heterogeneous, high-variety data is often composed of any combination of graph data, JSON files, XML files, social media data, structured tabular data, weblog data, and data that’s generated from click-streams.
Structured data can be stored, processed, and manipulated in a traditional relational database management system (RDBMS). This data can be generated by humans or machines, and is derived from all sorts of sources, from click-streams and web-based forms to point-of-sale transactions and sensors. Unstructured data comes completely unstructured — it’s commonly generated from human activities and doesn’t fit into a structured database format. Such data could be derived from blog posts, emails, and Word documents. Semistructured data doesn’t fit into a structured database system, but is nonetheless structured by tags that are useful for creating a form of order and hierarchy in the data. Semistructured data is commonly found in databases and file systems. It can be stored as log files, XML files, or JSON data files.
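To make the distinction concrete, here is a minimal sketch (the record and field names are invented for illustration) that parses a semistructured JSON record and flattens it into a structured row with fixed columns, the kind of shape an RDBMS table expects:

```python
import json

# A semistructured record: tagged and hierarchical, but not tabular.
raw = '{"user": {"id": 42, "name": "Ada"}, "action": "click", "tags": ["promo", "mobile"]}'

record = json.loads(raw)

# Flatten the nested structure into a structured row with fixed columns.
row = {
    "user_id": record["user"]["id"],
    "user_name": record["user"]["name"],
    "action": record["action"],
    "tag_count": len(record["tags"]),
}

print(row)
```

The JSON tags supply enough order and hierarchy to make this flattening mechanical; truly unstructured data (say, the body of an email) offers no such handles.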
Big data is being continually generated by humans, machines, and sensors everywhere. Typical sources include data from social media, financial transactions, health records, click-streams, log files, and the Internet of things — a web of digital connections that joins together the ever-expanding array of electronic devices we use in our everyday lives. Figure 2-1 shows a variety of popular big data sources.
Data science and data engineering are two different branches within the big data paradigm — an approach wherein huge velocities, varieties, and volumes of structured, unstructured, and semistructured data are being captured, processed, stored, and analyzed using a set of techniques and technologies that is completely novel compared to those that were used in decades past.
Both are useful for deriving knowledge and actionable insights from raw data. Both are essential elements for any comprehensive decision-support system, and both are extremely helpful when formulating robust strategies for future business management and growth. Although the terms data science and data engineering are often used interchangeably, they’re distinct domains of expertise. In the following sections, I introduce concepts that are fundamental to data science and data engineering, and then I show you the differences in how these two roles function in an organization’s data processing system.
If science is a systematic method by which people study and explain domain-specific phenomena that occur in the natural world, you can think of data science as the scientific domain that’s dedicated to knowledge discovery via data analysis.
Data scientists use mathematical techniques and algorithmic approaches to derive solutions to complex business and scientific problems. Practitioners use data science’s predictive methods to derive insights that are otherwise unattainable. In business and in science, data science methods can provide more robust decision-making capabilities:
Data science is a vast and multidisciplinary field. To call yourself a true data scientist, you need to have expertise in math and statistics, computer programming, and your own domain-specific subject matter.
Using data science skills, you can do things like this:
Data scientists must have extensive and diverse quantitative expertise to be able to solve these types of problems.
If engineering is the practice of using science and technology to design and build systems that solve problems, you can think of data engineering as the engineering domain that’s dedicated to building and maintaining data systems for overcoming data-processing bottlenecks and data-handling problems that arise due to the high volume, velocity, and variety of big data.
Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big datasets. Data engineers often have experience working with and designing real-time processing frameworks and massively parallel processing (MPP) platforms (discussed later in this chapter), as well as RDBMSs. They generally code in Java, C++, Scala, and Python. They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data into more manageably sized datasets. Simply put, with respect to data science, the purpose of data engineering is to engineer big data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.
Using data engineering skills, you can, for example:
Data engineers need solid skills in computer science, database design, and software engineering to be able to perform this type of work.
The roles of data scientist and data engineer are frequently confused and conflated by hiring managers. If you look at position descriptions for companies that are hiring, you’ll often see mismatched titles and roles, or listings that simply expect applicants to do both data science and data engineering.
Because many organizations combine and confuse roles in their data projects, data scientists are sometimes stuck spending a lot of time learning to do the job of a data engineer, and vice versa. To get the highest-quality work product in the least amount of time, hire a data engineer to process your data and a data scientist to make sense of it for you.
Lastly, keep in mind that data engineer and data scientist are just two small roles within a larger organizational structure. Managers, middle-level employees, and organizational leaders also play a huge part in the success of any data-driven initiative. The primary benefit of incorporating data science and data engineering into your projects is to leverage your external and internal data to strengthen your organization’s decision-support capabilities.
Because big data’s three Vs (volume, velocity, and variety) don’t allow for the handling of big data using traditional relational database management systems, data engineers had to become innovative. To get around the limitations of relational systems, data engineers turn to the Hadoop data processing platform to boil down big data into smaller datasets that are more manageable for data scientists to analyze.
In the following sections, I introduce you to MapReduce, Spark, and the Hadoop distributed file system. I also introduce the programming languages you can use to develop applications in these frameworks.
MapReduce is a parallel distributed processing framework that can be used to process tremendous volumes of data in-batch — where data is collected and then processed as one unit with processing completion times on the order of hours or days. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples (with respect to MapReduce, tuples refer to key-value pairs by which data is grouped, sorted, and processed). In layman’s terms, MapReduce uses parallel distributed computing to transform big data into manageable-size data.
MapReduce jobs implement a sequence of map tasks and reduce tasks across a distributed set of servers. In the map task, you map the data to key-value pairs, transform it, and filter it. Then you assign the data to nodes for processing. In the reduce task, you aggregate that data down to smaller-size datasets. Data from the reduce step is transformed into a standard key-value format, where the key acts as the record identifier and the value is the value being identified by the key. The cluster’s computing nodes process the map tasks and reduce tasks that are defined by the user.
This work is done in two steps:
Map the data.
The incoming data must first be delegated into key-value pairs and divided into fragments, which are then assigned to map tasks. Each computing cluster (a group of nodes that are connected to each other and perform a shared computing task) is assigned a number of map tasks, which are subsequently distributed among its nodes. Upon processing of the key-value pairs, intermediate key-value pairs are generated. These intermediate key-value pairs are sorted by their key values, and the sorted list is divided into a new set of fragments. The number of these new fragments is the same as the number of reduce tasks.
Reduce the data.
Every reduce task has a fragment assigned to it. The reduce task simply processes the fragment and produces an output, which is also a key-value pair. Reduce tasks are also distributed among the different nodes of the cluster. After the task is completed, the final output is written onto a file system.
In short, you can use MapReduce as a batch-processing tool, to boil down and begin to make sense of a huge volume, velocity, and variety of data by using map and reduce tasks to tag the data by (key, value) pairs, and then reduce those pairs into smaller sets of data through aggregation operations — operations that combine multiple values from a dataset into a single value. A diagram of the MapReduce architecture is shown in Figure 2-2.
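The map, shuffle-and-sort, and reduce steps described above can be sketched in miniature. The following is a single-machine toy (a hypothetical word count over invented lines of text, not Hadoop itself), but the key-value grouping mirrors what Hadoop does across a cluster:

```python
from collections import defaultdict

def map_task(line):
    # Map: emit intermediate (key, value) pairs -- here, (word, 1).
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce: aggregate all values for one key into a single result.
    return (key, sum(values))

lines = ["big data moves fast", "big data is big"]
intermediate = [pair for line in lines for pair in map_task(line)]
result = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())

print(result["big"])   # 3
print(result["data"])  # 2
```

In a real Hadoop job, the map and reduce functions run on different nodes of the cluster, and the shuffle step moves intermediate pairs between them over the network; the logic, however, is exactly this.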
Recall that MapReduce is a batch processor and can’t process real-time, streaming data. Sometimes, though, you need to query big data streams in real time, and you just can’t do that sort of thing using MapReduce. In these cases, use a real-time processing framework instead.
A real-time processing framework is, as its name implies, a framework that processes data in real time (or near-real time) as that data streams and flows into the system. Real-time frameworks process data in microbatches; they return results in a matter of seconds rather than the hours or days that MapReduce requires. Real-time processing frameworks either
Although MapReduce was historically the main processing framework in a Hadoop system, Spark has recently made some major advances in assuming MapReduce’s position. Spark is an in-memory computing application that you can use to query, explore, analyze, and even run machine learning algorithms on incoming, streaming data in near–real-time. Its power lies in its processing speed — the ability to process and make predictions from streaming big data sources in three seconds flat is no laughing matter. Major vendors such as Cloudera have been pushing for the advancement of Spark so that it can be used as a complete MapReduce replacement, but it isn’t there yet.
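The micro-batch idea behind frameworks like Spark Streaming can be simulated in plain Python. This sketch does not use Spark itself; it simply illustrates grouping an incoming stream into small batches and aggregating each batch as it arrives (the readings and batch size are invented):

```python
def microbatch(stream, batch_size):
    # Group an incoming stream into fixed-size micro-batches.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any trailing partial batch
        yield batch

# Simulated sensor readings arriving as a stream.
readings = [3, 5, 2, 8, 6, 1, 9]

# Aggregate each micro-batch as it "arrives": results become available
# per batch, within seconds, instead of only after the whole dataset
# has been collected and batch-processed.
batch_sums = [sum(b) for b in microbatch(readings, 3)]
print(batch_sums)  # [10, 15, 9]
```

The design trade-off is latency versus completeness: each result covers only its own small slice of the stream, but you get it almost immediately.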
Real-time, stream-processing frameworks are quite useful in a multitude of industries — from stock and financial market analyses to e-commerce optimizations, and from real-time fraud detection to optimized order logistics. Regardless of the industry in which you work, if your business is impacted by real-time data streams that are generated by humans, machines, or sensors, a real-time processing framework would be helpful to you in optimizing and generating value for your organization.
The Hadoop distributed file system (HDFS) uses clusters of commodity hardware for storing data. Hardware in each cluster is connected, and this hardware is composed of commodity servers — low-cost, low-performing generic servers that offer powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the costs involved in storing big data.
The HDFS is characterized by these three key features:
The Hadoop platform is the premier platform for large-scale data processing, storage, and management. This open-source platform is generally composed of the HDFS, MapReduce, Spark, and YARN, all working together.
Within a Hadoop platform, the workloads of applications that run on the HDFS (such as MapReduce and Spark) are divided among the nodes of the cluster, and the output is stored on the HDFS. A Hadoop cluster can be composed of thousands of nodes. To keep the costs of input/output (I/O) processes low, MapReduce jobs are performed as close to the data as possible: the reduce-task processors are positioned as close as possible to the outgoing map-task data that needs to be processed. This design facilitates the sharing of computational requirements in big data processing.
Hadoop also supports hierarchical organization. Some of its nodes are classified as master nodes, and others are categorized as slaves. The master service, known as JobTracker, is designed to control several slave services. A single slave service (also called TaskTracker) is distributed to each node. The JobTracker controls the TaskTrackers and assigns Hadoop MapReduce tasks to them. YARN, the resource manager, acts as an integrated system that performs resource management and scheduling functions.
Looking past Hadoop, alternative big data solutions are on the horizon. These solutions make it possible to work with big data in real-time or to use alternative database technologies to handle and process it. In the following sections, I introduce you to massively parallel processing (MPP) platforms and the NoSQL databases that allow you to work with big data outside of the Hadoop environment.
Massively parallel processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. If your goal is to deploy parallel processing on a traditional data warehouse, an MPP may be the perfect solution.
To understand how MPP compares to a standard MapReduce parallel-processing framework, consider that MPP runs parallel computing tasks on costly custom hardware, whereas MapReduce runs them on inexpensive commodity servers. Consequently, MPP processing capabilities are cost restrictive. MPP is quicker and easier to use than standard MapReduce jobs, however. That’s because MPP can be queried using Structured Query Language (SQL), whereas native MapReduce jobs are written in the more complicated Java programming language.
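To illustrate why SQL access matters, here is the kind of aggregation that would take a hand-written map/reduce job, expressed instead as a single declarative query. (SQLite stands in here for an MPP warehouse; the table and data are invented for illustration.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# One declarative SQL statement replaces a hand-coded map task
# (emit (region, amount) pairs) plus a reduce task (sum per region).
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY REGION ORDER BY region"
).fetchall()

print(rows)  # [('east', 150.0), ('west', 250.0)]
```

On an MPP platform, a query like this is automatically parallelized across the machine’s processors, so the analyst gets distributed execution without writing any distribution logic.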
A traditional RDBMS isn’t equipped to handle big data demands. That’s because it is designed to handle only relational datasets constructed of data that’s stored in clean rows and columns and thus is capable of being queried via SQL. RDBMSs are not capable of handling unstructured and semistructured data. Moreover, RDBMSs simply don’t have the processing and handling capabilities that are needed for meeting big data volume and velocity requirements.
This is where NoSQL comes in. NoSQL databases are non-relational, distributed database systems that were designed to rise to the big data challenge. NoSQL databases step out past the traditional relational database architecture and offer a much more scalable, efficient solution. NoSQL systems facilitate non-SQL data querying of non-relational or schema-free, semistructured and unstructured data. In this way, NoSQL databases are able to handle the structured, semistructured, and unstructured data sources that are common in big data systems.
NoSQL offers four categories of non-relational databases: graph databases, document databases, key-value stores, and column family stores. Because NoSQL offers native functionality for each of these separate types of data structures, it offers very efficient storage and retrieval functionality for most types of non-relational data. This adaptability and efficiency make NoSQL an increasingly popular choice for handling big data and for overcoming the processing challenges that come along with it.
The NoSQL applications Apache Cassandra and MongoDB are used for data storage and real-time processing. Apache Cassandra is a popular column family store (its wide-column data model is sometimes loosely described as a key-value store), and MongoDB is a document-oriented type of NoSQL database. MongoDB uses dynamic schemas and stores JSON-like documents, and it is the most popular document store on the NoSQL market.
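MongoDB-style documents are essentially JSON objects with flexible, per-document fields. The sketch below mimics a document-store query using plain Python dictionaries (no MongoDB server is involved, and the collection, documents, and `find` helper are invented for illustration):

```python
# A "collection" of schema-free documents -- note that the fields
# differ from document to document, which an RDBMS table can't allow.
collection = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Grace", "rank": "Admiral"},
    {"_id": 3, "name": "Alan", "city": "London", "field": "logic"},
]

def find(coll, filter_doc):
    # Return documents whose fields match the filter, in the spirit
    # of a document database's query-by-example interface.
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in filter_doc.items())]

matches = find(collection, {"city": "London"})
print([doc["name"] for doc in matches])  # ['Ada', 'Alan']
```

Because each document carries its own structure, new fields can appear at any time without a schema migration, which is exactly the flexibility that makes document stores attractive for high-variety data.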
A Fortune 100 telecommunications company had large datasets that resided in separate data silos — data repositories that are disconnected and isolated from other data storage systems used across the organization. With the goal of deriving data insights that lead to revenue increases, the company decided to connect all of its data silos and then integrate that shared source with other contextual, external, non-enterprise data sources as well.
The Fortune 100 company was stocked to the gills with all the traditional enterprise systems: ERP, ECM, CRM — you name it. Slowly, over many years, these systems grew and segregated into separate information silos. (Check out Figure 2-3 to see what I mean.) Because of the isolated structure of the data systems, otherwise useful data was lost and buried deep within a mess of separate, siloed storage systems. Even if the company knew what data it had, it would be like pulling teeth to access, integrate, and utilize it. The company rightfully believed that this restriction was limiting its business growth.
To optimize its sales and marketing return on investments, the company wanted to integrate external, open datasets and relevant social data sources that would provide deeper insights into its current and potential customers. But to build this 360-degree view of the target market and customer base, the company needed to develop a sophisticated platform across which the data could be integrated, mined, and analyzed.
The company had the following three goals in mind for the project:
To meet the company’s goals, data engineers moved the company’s datasets to Hadoop clusters. One cluster hosted the sales data, another hosted the human resources data, and yet another hosted the talent management data. Data engineers then modeled the data using the linked data format — a format that facilitates a joining of the different datasets in the Hadoop clusters.
After this big data platform architecture was put into place, queries that would have traditionally taken several hours to perform could be performed in a matter of minutes. New queries were generated after the platform was built, and these queries also returned efficient results within a few minutes’ time.
The following list describes some of the benefits that the telecommunications company now enjoys as a result of its new big data platform: