Introduction

Many would say the byword for success in the modern era has to be data. The sheer amount of information stored, processed, and moved around over the past 50 years has seen a staggering increase, without any end in sight. Enterprises are hungry to acquire and process more data to get a leg up on the competition. Scientists especially are looking to use data-intensive methods to advance research in ways that were not possible only a few decades ago.

With this worldwide demand, data-intensive applications have gone through a remarkable transformation since the start of the 21st century. We have seen wide adoption of big data frameworks such as Apache Hadoop, Apache Spark, and Apache Flink. The amazing advances being made in the fields of machine learning (ML) and deep learning (DL) are taking the big data era to new heights for both enterprise and research communities. These fields have further broadened the scope of data-intensive applications, which demand faster and more integrable systems that can operate on both specialized and commodity hardware.

Data-intensive applications deal with storing and extracting information, which involves disk access. They can be computing intensive as well: deep learning and machine learning applications not only consume massive amounts of data but also perform a substantial number of computations. Because of their memory, storage, and computational requirements, these applications need resources beyond what a single computer can provide.

There are two main branches of data-intensive processing applications: streaming data and batch data. Streaming data analytics is the continuous processing of data as it arrives, while batch data analytics processes a dataset as a complete unit. In practice, streaming and batch applications often work together. Machine learning and deep learning mostly fall under the batch category, though streaming ML algorithms exist as well. When we say machine learning or deep learning, we mostly mean the training of models, as it is the most compute-intensive aspect.
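To make the distinction concrete, here is a minimal, framework-agnostic sketch (the function and class names are illustrative, not from any particular system): the same average computed batch style over a complete dataset and streaming style over records arriving one at a time.

```python
def batch_average(values):
    # Batch: the complete dataset is available as a single unit.
    return sum(values) / len(values)

class StreamingAverage:
    # Streaming: records arrive continuously, so we keep only
    # running state and can report a current result at any point.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

readings = [4.0, 8.0, 6.0]
stream = StreamingAverage()
for r in readings:
    latest = stream.update(r)

# Once the stream has seen the full dataset, both styles agree.
assert latest == batch_average(readings)
```

The streaming version never needs the whole dataset in hand, which is exactly what lets streaming frameworks operate on unbounded inputs.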

Users run these applications on laptops, clouds, high-performance clusters, graphics processing unit (GPU) clusters, and even supercomputers. Such systems have different hardware and software environments. While we may develop our applications to deploy in one environment, the framework we use to construct them needs to work in a variety of settings.

History of Data-Intensive Applications

Data-intensive applications have a long history from even before the start of the map-reduce era. With the increasing popularity of internet services, the need for storing and processing large datasets became vitally important starting around the turn of the century. In February 2001, Doug Laney published a research note titled “3D Data Management: Controlling Data Volume, Velocity, and Variety” [1]. Since then, we have used the so-called “3 Vs” to describe the needs of large-scale data processing.

From a technology perspective, the big data era began with the MapReduce paper from Jeffrey Dean and Sanjay Ghemawat at Google [2]. Apache Hadoop was created as an open source implementation of the map-reduce framework, along with the Hadoop Distributed File System, which followed the design of the Google File System [3]. The simple map-reduce interface for data processing attracted many companies and researchers. At the time, the network was often the bottleneck, so the primary focus was to bring the computation to where the data were kept.
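The appeal of that simple interface is easy to see in a single-machine sketch (the helper names here are ours, not Hadoop's): the user writes only a map function and a reduce function, and the framework takes care of grouping intermediate values by key, and, at scale, of distribution and fault handling. Word count is the canonical example from the MapReduce paper.

```python
from collections import defaultdict

def map_fn(document):
    # Emit (word, 1) for every word -- the classic word-count mapper.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all partial counts emitted for one key.
    return word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    # The "framework": run the map phase, group values by key,
    # then run the reduce phase. A real system does these steps
    # across many machines, with shuffling and fault handling.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_mapreduce(["big data", "big deal"], map_fn, reduce_fn)
# result == {"big": 2, "data": 1, "deal": 1}
```

Everything outside the two user-supplied functions is the framework's responsibility, which is precisely why the model was so easy to adopt.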

A whole ecosystem of storage solutions and processing systems rose around these ideas. Some of the more notable processing systems include Apache Spark, Apache Flink, Apache Storm, and Apache Tez. Storage system examples are Apache HBase, Apache Cassandra, and MongoDB.

With more data came the need to learn from them. Machine learning algorithms have been in development since the 1950s when the first perceptron [4] was created and the nearest neighbor algorithm [5] was introduced. Since then, many algorithms have appeared steadily over the years to better learn from data. Indeed, most of the deep learning theory was created in the 1980s and 1990s.

Despite this long evolution, it is fair to say that modern deep learning as we know it took off around 2006 with the introduction of the Deep Belief Network (DBN). This was followed by the remarkable success of AlexNet in 2012 [6]. The primary reason for this shift was the increase in computational power in the form of parallel computing, which allowed neural networks to grow several orders of magnitude larger than what had been achieved in the past. A direct consequence has been the increase in the number of layers in a neural network, which is why it is called deep learning.

With machine learning and deep learning, users needed more interactive systems to explore data and run experiments quickly. This paved the way for Python-based APIs for data processing. The success of Python libraries such as NumPy and Pandas contributed to the language's popularity for data-intensive applications. While all these changes were taking place, computer hardware was going through a remarkable transformation as well.

Hardware Evolution

Since the introduction of the first microprocessor by Intel in 1971, there have been tremendous advances in CPU architectures, leading to quantum leaps in performance. Moore's law and Dennard scaling together drove the performance of microprocessors until recently. Moore's law is Gordon Moore's observation that the number of transistors on a chip doubles roughly every two years. Dennard scaling states that the power density of MOSFETs stays roughly constant from one generation to the next. The combination of these two principles suggested that the number of computations these microprocessors could perform for the same amount of energy would double about every 1.5 years.
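As a back-of-the-envelope check of that combined effect, assuming the 1.5-year doubling figure above, the implied improvement over a decade works out to roughly a hundredfold:

```python
# Performance per unit of energy doubling every 1.5 years:
years = 10
doubling_period = 1.5                     # years per doubling
growth = 2 ** (years / doubling_period)   # roughly 100x over a decade
```

This compounding is why each processor generation sped up applications across the board, with no changes to the software.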

For half a century, this phenomenon helped speed up applications across the board with each new generation of processors. Dennard scaling, however, broke down around 2006, and clock frequencies of microprocessors hit a wall around 4 GHz. This led to the multicore era of microprocessors, with some motherboards even supporting multiple CPU sockets. The result of this evolution is single computers with core counts that can go up to 128 today. Programming multicore computers requires more consideration than programming traditional single-core CPUs, which we will explore in detail later.
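A flavor of that extra consideration can be sketched with Python's standard library (the function names are illustrative): the work must be explicitly partitioned into chunks, fanned out to workers, and the partial results combined, none of which a sequential loop requires. A thread pool keeps the sketch self-contained; for CPU-bound Python code, a process pool would typically be used instead to sidestep the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    # Each worker computes an independent partial result.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Partition the input into roughly equal chunks, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fan out the chunks, then combine the partial results.
        return sum(pool.map(partial_sum_of_squares, chunks))

data = list(range(1_000))
assert parallel_sum_of_squares(data) == sum(x * x for x in data)
```

Partitioning, synchronization, and result combination are exactly the concerns that data-intensive frameworks take off the programmer's hands.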

Alongside the multicore evolution came the rise of GPUs as general-purpose computing processors. This trend got a boost from the exponential growth of machine learning and deep learning. Compared to CPUs, GPUs pack thousands of lightweight cores, which has paved the way for accelerated computing kernels that speed up computations.

GPUs were not the only type of accelerator to emerge. Intel's Xeon Phi line, for example, initially came out as accelerators. Field Programmable Gate Arrays (FPGAs) are now being used to develop custom accelerators, especially to improve deep learning training and inference. The trend has gone further with the development of custom chips known as Application-Specific Integrated Circuits (ASICs). Google's Tensor Processing Unit (TPU) is a popular ASIC solution for accelerating the training of large models.

The future of hardware evolution is leaning further toward accelerators, to the point where devices will carry multiple accelerators designed for different application needs. This is known as accelerator-level parallelism. For instance, Apple's A14 chip has accelerators for graphics, neural network processing, and cryptography.

Data Processing Architecture

Modern data-intensive applications consist of data storage and management systems as well as data science workflows, as illustrated in Figure I-1. Data management is mostly an automated process once the systems are developed and deployed. Data science, on the other hand, is a process that calls for human involvement.


Figure I-1: Overall data processing architecture

Data Storage and Querying

Figure I-1 shows a typical data processing architecture. The first step is to ingest data from various sources into raw storage. The sources can be internal or external to an organization, the most common being web services and IoT devices, which produce streaming data. Further data can be accumulated from batch processes, such as the output of an application run. Depending on our requirements, we can analyze these data to some degree before they are stored.

The raw data storage serves as a rich trove for extracting structured data. Ingesting data into and extracting it from such storage requires data processing tools that can operate at massive scale.

The raw storage then supports more specialized use cases that require subsets of the data. We can store such subsets in specialized formats for efficient querying and model building, a practice known as data warehousing. In the early stages of data-intensive applications, large-scale data storage and querying were the dominant use cases.

Data Science Workflows

Data science workflows have grown into an integral part of modern data-intensive applications. They involve data preparation, analysis, reflection, and dissemination, as shown in Figure I-2. Data preparation works with sources like databases and files to retrieve, clean, and reformat data.


Figure I-2: Data science workflow

Data analysis includes modeling, training, debugging, and model validation. Once the analysis is complete, we can make comparisons in the reflection step to see whether the analysis meets our needs. This is an iterative process in which we try different models and tweak them to get the best results.
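A toy sketch of that reflection loop (the candidate "models" below are hypothetical stand-ins for real trained models): score several candidates against known data and keep the best performer before iterating again.

```python
def mean_squared_error(model, xs, ys):
    # Average squared difference between predictions and targets.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]          # generated by y = 2x + 1

candidates = {
    "linear_a": lambda x: 2 * x + 1,
    "linear_b": lambda x: 3 * x,
    "constant": lambda x: 4,
}

# Reflection: compare the candidates and select the best performer.
scores = {name: mean_squared_error(m, xs, ys) for name, m in candidates.items()}
best = min(scores, key=scores.get)
# best == "linear_a", which fits the data exactly (error 0.0)
```

In a real workflow each iteration would also involve tweaking hyperparameters and retraining, but the compare-and-select structure is the same.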

After the models are finalized, we can deploy them, create reports, and catalog the steps we took in the experiments. The actual code related to learning algorithms may be small compared to all the other systems and applications surrounding and supporting it [7].

The data science workflow is an interactive process with human involvement every step of the way. Scripting languages like Python are a good fit for such interactive environments, with the ability to quickly prototype and test various hypotheses. Technologies like Jupyter notebooks are used extensively by data scientists in their workflows.

Data-Intensive Frameworks

Whether it is large-scale data management or a data scientist evaluating results in a small cluster, we use frameworks and APIs to develop and run data-intensive applications. These frameworks provide APIs and the means to run applications at scale, handling failures and various hardware features. The frameworks are designed to run different workloads and applications:

  • Streaming data processing frameworks—Process continuous data streams.
  • Batch data processing frameworks—Process large datasets as complete units, performing extract, transform, and load (ETL) operations.
  • Machine/deep learning frameworks—Designed to run models that learn from data through iterative training.
  • Workflow systems—Combine data-intensive applications to create larger applications.

There are many frameworks available for the classes of applications found today. From the outside, even the frameworks designed to solve a single specific class of applications might look quite diverse, with different APIs and architectures. But if we look at the core of these frameworks and the applications built on top of them, we find there are similar techniques being used. Seeing as how they are trying to solve the same problems, all of them use similar data abstractions, techniques for running at scale, and even fault handling. Within these similarities, there are differences that create distinct frameworks for various application classes.

There Is No Silver Bullet

It is hard to imagine one framework solving all our data-intensive application needs. Building frameworks that work at scale for a complex area such as data-intensive applications is a colossal challenge. We need to keep in mind that, like any other software project, frameworks are built with finite resources and time constraints. Various programming languages are used when developing them, each with its own benefits and limitations. There are always trade-offs between usability and performance, and sometimes the most user-friendly APIs do not deliver the best performance.

Some software designs and architectures for data-intensive applications are best suited for only certain application classes. The frameworks are built according to these architectures to solve specific classes of problems but may not be that effective when applied to others. What we see in practice is frameworks are built for one purpose and are being adapted for other use cases as they mature.

Data processing is a complicated space with demanding hardware, software, and application requirements. On top of this, these demands have been evolving rapidly. At times, there are so many options and variables it seems impossible to choose the correct hardware and software for our data-intensive problems. Having a deep understanding of how things work beneath the surface can help us make better choices when designing our data processing architectures.

Foundations of Data-Intensive Applications

Data-intensive applications incorporate ideas from many domains of computer science. This includes areas such as computer systems, databases, high-performance computing, cloud computing, distributed computing, programming languages, computer networks, statistics, data structures, and algorithms.

We can study data-intensive applications from three perspectives: data storage and management, learning from data, and scaling data processing. Data storage and management is the first step in any data analytics pipeline. It can involve techniques ranging from relational databases to large-scale data lakes.

Learning from data can take many forms depending on the data type and use case. For example, computing basic statistics, clustering data into groups, finding interesting patterns in graph data, joining tables of data to enrich information, and fitting functions to existing data are several ways we learn from data. The overarching goal of the algorithmic perspective is to form models from data that can be used to predict something in the future.

A key enabler for these two perspectives is the third one, where data processing is done at scale. Therefore, we will primarily look at data-intensive applications from a distributed processing perspective in this book. Our focus is to understand how data-intensive applications run at scale, utilizing the various computing resources available. We take principles from databases, parallel computing, and distributed computing to delve deep into the ideas behind data-intensive applications operating at scale.

Who Should Read This Book?

This book is aimed at data engineers and data scientists. If you develop data-intensive applications or are planning to do so, the information contained herein can provide insight into how things work regarding various frameworks and tools. This book can be helpful if you are trying to make decisions about what frameworks to choose in your applications.

You should have foundational knowledge of computer science in general before reading any further. A basic understanding of networks and computer architecture will help you follow the content.

Our target audience is the curious reader who likes to understand how data-intensive applications function at scale. Whether you are considering developing applications using certain frameworks or you are developing your own applications from scratch for highly specialized use cases, this book will help you to understand the inner workings of the systems. If you are familiar with data processing tools, it can deepen your understanding about the underlying principles.

Organization of the Book

The book starts by introducing the challenges of developing and executing data-intensive applications at scale. It then gives an introduction to data and storage systems and cluster resources. The next few chapters describe the internals of data-intensive frameworks, covering data structures, programming models, messaging, and task execution, along with a few case studies of existing systems. Finally, we discuss fault tolerance techniques and finish the book with performance implications.

  • Chapter 1: Scaling Data-Intensive Applications—Describes serial and parallel applications for data-intensive applications and the challenges faced when running them at scale.
  • Chapter 2: Data and Storage—Overview of the data storage systems used in data processing. Both hardware and software solutions are discussed.
  • Chapter 3: Computing Resources—Introduces the computing resources used in data-intensive applications and how they are managed in large-scale environments.
  • Chapter 4: Data Structures—Describes the data abstractions used in data analytics applications and the importance of using memory correctly to speed up applications.
  • Chapter 5: Programming Models—Discusses various programming models available and the APIs for data analytics applications.
  • Chapter 6: Messaging—Examines how the network is used by data analytics applications to process data at scale by exchanging data between the distributed processes.
  • Chapter 7: Parallel Tasks—Shows how to execute tasks in parallel in data analytics applications, combining computations with messaging.
  • Chapter 8: Case Studies—Studies of a few widely used systems to highlight how the principles we discussed in the book are applied in practice.
  • Chapter 9: Fault Tolerance—Illustrates handy techniques for handling faults in data-intensive applications.
  • Chapter 10: Performance and Productivity—Defines various metrics used for measuring performance and discusses productivity when choosing tools.

Scope of the Book

Our focus here is the fundamentals of parallel and distributed computing and how they are applied in data processing systems. We take examples from existing systems to explain how these are used in practice, and we are not focusing on any specific system or programming language. Our main goal is to help you understand the trade-off between the available techniques and how they are used in practice. This book does not try to teach parallel programming.

You will also be introduced to a few examples of deep learning systems. Although DL at scale works according to the principles we describe here, we will not be going into any great depth on the topic. We also will not be covering any specific data-intensive framework on configuration and deployment. There are plenty of specific books and online resources on these frameworks.

SQL is a popular choice for querying large datasets. Still, you will not find much about it here because it is a complex subject of its own, involving query parsing and optimization, which is not the focus of this book. Instead, we look at how the programs are executed at scale.

References

At the end of each chapter, we list some important references that paved the way for the discussions included there. Most of the content you will find is the result of work done by researchers and software engineers over a long period. We include these references for any reader interested in learning more about the topics discussed.

References

  1. D. Laney, “3D Data Management: Controlling Data Volume, Velocity and Variety,” META Group Research Note, vol. 6, no. 70, p. 1, 2001.
  2. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Sixth Symposium on Operating Systems Design and Implementation, pp. 137–150, 2004.
  3. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” presented at the 19th ACM Symposium on Operating Systems Principles, 2003.
  4. F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
  5. T. Cover and P. Hart, “Nearest Neighbor Pattern Classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
  6. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
  7. D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” Advances in Neural Information Processing Systems, vol. 28, pp. 2503–2511, 2015.