Chapter 3. Streaming Data Pipelines

After data is collected from real-time sources, it is added to a data stream. A stream contains a sequence of events, made available over time, with each event containing data from the source plus metadata identifying source attributes. Streams can be untyped, but more commonly, the data content of streams is described through internal (as part of metadata) or external data-type definitions. Streams are unbounded, continually changing, potentially infinite sets of data, which are very different from the traditional bounded, static, and limited batches of data, as shown in Figure 3-1. In this chapter, we discuss streaming data pipelines.

Figure 3-1. Difference between streams and batches

Here are the major purposes of data streams:

  • Facilitate asynchronous processing

  • Enable parallel processing of data

  • Support time-series analysis

  • Move data between components in a data pipeline

  • Move data between nodes in a clustered processing platform

  • Move data across network boundaries, including datacenter to datacenter, and datacenter to cloud

  • Do this in a reliable and guaranteed fashion that handles failure and enables recovery

Streams facilitate asynchronous handling of data. Data flow, stream processing, and data delivery do not need to be tightly coupled to the ingestion of data: these can work somewhat independently. However, if the data consumption rate does not match the ingestion rate, it can lead to a backlog that needs to be dealt with either through back-pressure, or persistence, which we cover in detail later in this chapter.

Streams also enable parallel processing of data. When a logical data stream is present across multiple nodes in a clustered processing platform, the node on which a particular event will be processed can be determined through a stream partitioning mechanism. This mechanism utilizes a key, or other feature of the data to consistently map events to nodes in a deterministic and repeatable manner.
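
As a minimal Python sketch of such a mechanism (the key field, event shape, and partition count are illustrative assumptions, not any particular platform's API), a stable hash of a key can map events to partitions deterministically:

    import hashlib

    def partition_for(event: dict, num_partitions: int, key_field: str = "customer_id") -> int:
        """Deterministically map an event to a partition by hashing its key.

        A stable hash (not Python's built-in hash(), which is salted per
        process) guarantees that the same key always lands on the same
        partition, across runs and across nodes.
        """
        key = str(event[key_field]).encode("utf-8")
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % num_partitions

    # Every event for customer 42 is processed on the same node or thread.
    event = {"customer_id": 42, "amount": 99.5}
    print(partition_for(event, num_partitions=4))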

The data delivered to streams is often multitemporal in nature. This means that the data might have multiple timestamps that can be used for time-series analysis. Timestamps might be present in the original data, or metadata, or can be injected into the stream event at the time of collection or processing. These timestamps enable event sequencing, time-based aggregations, and other key features of stream processing.
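
To make this concrete, here is a small, hypothetical Python sketch of an event carrying two timestamps: one from the source and one injected at collection time (the field names are this sketch's assumptions):

    import time
    from dataclasses import dataclass, field

    @dataclass
    class StreamEvent:
        """A stream event carrying data plus multitemporal metadata."""
        data: dict
        event_time: float  # when the event occurred at the source
        collection_time: float = field(default_factory=time.time)  # stamped on ingest

    # The source supplies event_time; collection_time is injected by the reader.
    e = StreamEvent(data={"op": "INSERT", "table": "orders"}, event_time=1700000000.0)
    # Time-based processing can sequence or aggregate on either clock:
    lag_seconds = e.collection_time - e.event_time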

Let’s begin our examination of streams through (perhaps) their most important function: moving data between threads, processes, servers, and datacenters in a scalable way, with very low latency.

Moving Data

The most important thing to understand about a stream is that it is a logical entity. A single named stream can comprise multiple physical components running in different locations: it has one logical definition but potentially many physical manifestations. The stream is an abstraction over multiple implementations that enables it to move data efficiently in many different network topologies.

To understand the various possibilities, let’s use a simple example of a source reader collecting data in real time and writing it to a stream. A target writer reads from this stream and delivers the data in real time to a destination.

Figure 3-2 illustrates the components involved in this simple data flow, and Table 3-1 provides a description of each.

Figure 3-2. Elements of a simple streaming data flow
Table 3-1. Components of a streaming data flow

Source: The origin of real-time data; for example, a database, files, messaging, and so on
Reader: Collects real-time data from the source and writes to a stream
Stream: The continuous movement of data elements from one component, thread, or node to the next
Network: Delineates different network locations; for example, on-premises and cloud
Node: A machine on which processes run
Process: An operating system process running on a node, potentially with many threads
Thread: An independent and concurrent programming flow within a process
Component: An item running within a thread that can interact with streams
Writer: Receives real-time data from a stream and writes to a target
Target: The destination for real-time data; for example, a database, Hadoop, and so on

In all cases, the reader will write to a named stream, and the writer will receive data from the same named stream. The simplest way in which this flow could work is that everything runs within a single thread, in a single process, on a single node, as depicted in Figure 3-3.

The stream implementation in this case can be a simple method (or function) call given that the reader is directly delivering data to the writer. The data transfer through the stream is synchronous, and no serialization of data is required because the reader and writer operate in the same memory space. However, the direct coupling of components means that the writer must consume events from the reader as soon as they are available but cannot write concurrently with reading. Any slowness on the writing side will slow down reading, potentially leading to lag.
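
A Python sketch of this degenerate case might look like the following, where the "stream" is nothing more than a synchronous function call (the event shape is illustrative):

    def writer(event: dict) -> None:
        # Deliver the event to the target; the reader blocks while this runs.
        print("delivered:", event)

    def reader(source_events, deliver=writer) -> None:
        # The "stream" is a direct, synchronous call: no queue,
        # no serialization, and no concurrency between read and write.
        for event in source_events:
            deliver(event)

    reader([{"id": 1}, {"id": 2}])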

Figure 3-3. A single-thread named stream

To enable concurrency, a multithreaded model is required, with reader and writer operating independently and concurrently.

The stream in this case needs to span threads and is most commonly implemented as a queue. This queue can be in-memory only, or it can spill to disk as necessary to handle sizing requirements. The reader and writer can now run asynchronously and at different speeds, with the stream acting as a buffer, handling occasional writer slowness up to the limit of the queue size. As with the single-threaded model, no data serialization is required.
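
Here is a minimal Python sketch of this model, using a bounded in-memory queue as the stream (the queue size and the sentinel convention are illustrative choices):

    import queue
    import threading

    stream = queue.Queue(maxsize=10_000)  # in-memory stream acting as a buffer
    SENTINEL = object()                   # signals end of stream

    def reader(source_events):
        for event in source_events:
            stream.put(event)  # blocks (back-pressure) once the buffer is full
        stream.put(SENTINEL)

    def writer():
        while True:
            event = stream.get()
            if event is SENTINEL:
                break
            print("delivered:", event)  # runs concurrently with reading

    t_read = threading.Thread(target=reader, args=([{"id": i} for i in range(5)],))
    t_write = threading.Thread(target=writer)
    t_read.start(); t_write.start()
    t_read.join(); t_write.join()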

In multithreaded applications, operating system scheduling can cause bottlenecks between threads. Even in a multicore or multi-CPU system, there is no guarantee that separate threads will run on different cores. If the reader and writer threads are running on the same core, performance will be no better than, and possibly worse than, a single-threaded implementation.

Multiprocess models can help with this, using processor affinity to assign CPU cores to particular processes.

In this case, the reader and writer are running in different operating system processes, so the stream needs to span the memory space of both. This can be done in a number of ways, utilizing shared memory, using Transmission Control Protocol (TCP) or other socket connections, or implementing the stream utilizing a third-party messaging system. To move data between processes, it will need to be serialized as bytes, which will generate additional overhead.
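
As an illustrative Python sketch, the standard library's multiprocessing queue makes the serialization cost visible: every event is pickled to cross the process boundary (the None end-of-stream marker is an assumption of this sketch):

    import multiprocessing as mp

    def writer(stream: mp.Queue) -> None:
        while True:
            event = stream.get()  # unpickled after crossing the process boundary
            if event is None:     # None is used here as an end-of-stream marker
                break
            print("delivered:", event)

    if __name__ == "__main__":
        stream = mp.Queue()  # events are pickled to cross the process boundary
        proc = mp.Process(target=writer, args=(stream,))
        proc.start()
        for event in [{"id": 1}, {"id": 2}]:
            stream.put(event)  # serialization overhead is paid on every put
        stream.put(None)
        proc.join()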

The natural extension of this topology is to run the reader and writer threads on separate nodes, with the stream spanning both locations, as demonstrated in Figure 3-4.

Figure 3-4. Running reader and writer threads on separate nodes

This ensures full processor utilization but eliminates the possibility of using shared memory for the stream implementation. Instead, the stream must use TCP communication, or utilize a third-party messaging system. As with the previous example, data must be serialized as bytes to be sent over the wire between nodes. Latency over TCP between nodes is higher than between processes, which can increase overall data flow latency. This topology is also useful for the case in which sources or targets are accessible only from a particular physical machine. These nodes could be running in the same network domain, or spanning networks, in an on-premises-to-cloud topology, for example.
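
The following is a hedged Python sketch of one wire format such a cross-node stream might use, framing each serialized event with a length prefix over TCP (production systems typically use more compact binary encodings and handle reconnection, which this omits):

    import json
    import socket
    import struct

    def send_event(sock: socket.socket, event: dict) -> None:
        """Serialize an event and send it with a 4-byte length prefix."""
        payload = json.dumps(event).encode("utf-8")  # serialization cost
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_event(sock: socket.socket) -> dict:
        """Read one length-prefixed event from the wire and deserialize it."""
        (length,) = struct.unpack("!I", _recv_exactly(sock, 4))
        return json.loads(_recv_exactly(sock, length))

    def _recv_exactly(sock: socket.socket, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed the stream")
            buf += chunk
        return buf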

Spanning networks can introduce additional requirements to the stream implementation. For example, the on-premises network might not be reachable from the cloud. There might be firewall or network routing implications. It is common for the on-premises portion to connect into the cloud, enabling data delivery, but not the other way around.

Streams also enable parallel processing of data through partitioning. For cases in which a single reader or writer cannot handle the real-time data generation rate, it might be necessary to use multiple instances running in parallel. For example, if we have change data capture (CDC) data being generated at a rate of 100,000 operations per second, but a single writer can manage only 50,000 operations per second, splitting the load across two writers might solve the problem.

We must then carefully think through how the data is partitioned. After all, arbitrary partitioning could lead to timing issues and data inconsistencies, because two writers running asynchronously can potentially deliver events out of order.

Within a single node and process, we can achieve parallelism by running multiple writer threads from the same stream, as shown in Figure 3-5.

Figure 3-5. Achieving parallelism by running multiple writer threads from the same stream

Each thread will receive a portion of the data based on a partitioning scheme and deliver it to the target simultaneously. The maximum recommended number of writer threads depends on a number of criteria, but it should generally not exceed the number of available CPU cores (minus one core for reading), assuming threads are allocated appropriately across cores (which they often are not). The stream should take care of delivering partitioned data appropriately to each thread in parallel.
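
Here is an illustrative Python sketch combining the two ideas: a stable hash routes each event to one of several writer threads, each draining its own partition queue (the thread count and key field are assumptions of this sketch):

    import hashlib
    import queue
    import threading

    NUM_WRITERS = 3
    partitions = [queue.Queue() for _ in range(NUM_WRITERS)]

    def partition_of(event: dict) -> int:
        # A stable hash of the key maps each event to one writer thread,
        # so events with the same key stay ordered on a single thread.
        digest = hashlib.sha256(str(event["key"]).encode()).digest()
        return int.from_bytes(digest[:8], "big") % NUM_WRITERS

    def writer_loop(part: queue.Queue) -> None:
        while True:
            event = part.get()
            if event is None:  # end-of-stream marker
                break
            print(threading.current_thread().name, "delivered", event)

    threads = [threading.Thread(target=writer_loop, args=(p,), name=f"writer-{i}")
               for i, p in enumerate(partitions)]
    for t in threads:
        t.start()
    for event in ({"key": k, "seq": n} for n in range(2) for k in ("a", "b")):
        partitions[partition_of(event)].put(event)
    for p in partitions:
        p.put(None)
    for t in threads:
        t.join()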

For greater levels of parallelism, it might be necessary to run multiple writer instances across multiple nodes.

Again, the stream needs to take care of partitioning the data, in this case sending it to different nodes rather than to separate threads. It should also be possible to combine the two parallelism mechanisms, with multiple threads running on multiple nodes, to take best advantage of available CPU cores. The level of parallelism possible will depend greatly on the nature of the data and the requirement for continual consistency.

For example, it might be possible to partition CDC data by table, if the operations acting on those tables are independent. However, if changes are made to related tables (e.g., the submission of an order that makes modifications to multiple tables), the resulting events might need to be handled in order. This might require partitioning by customer or location so that all related events are processed in the same partition.
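
One possible way to express such a partitioning policy in Python follows; the table names and key fields are hypothetical, and a real deployment would derive table relatedness from the schema:

    # Tables whose changes must stay ordered relative to one another share
    # a key; independent tables simply partition by table name.
    RELATED_TABLES = {"orders", "order_items", "payments"}  # assumed order flow

    def cdc_partition_key(change: dict) -> str:
        if change["table"] in RELATED_TABLES:
            # Keep all events for one customer's order flow in one
            # partition, preserving their relative order.
            return f"customer:{change['customer_id']}"
        return f"table:{change['table']}"

    print(cdc_partition_key({"table": "order_items", "customer_id": 7}))  # customer:7
    print(cdc_partition_key({"table": "audit_log", "customer_id": 7}))    # table:audit_log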

These examples have dealt with the simple case of reading data from a source and writing to a target. It should be clear that there are many possible implementation options within even this basic use case to deal with throughput, scale, and latency. However, many real-world use cases require some degree of stream processing, which requires multiple streams and the notion of a pipeline.

The Power of Pipelines

A streaming data pipeline is a data flow in which events transition through one or more processing steps between being collected by a “reader” and delivered by a “writer.” We discuss these processing steps in more detail later in the book, but for now it is sufficient to understand them at a high level. In general, each step reads from a stream and can filter, transform, aggregate, enrich, and correlate data (often through a SQL-like language) before delivering it to a subsequent stream.

Figure 3-6 presents a basic pipeline performing some processing of data (for example, filtering) in a single step between reader and writer.

We can expand this to multiple steps, each outputting to an intermediate stream, as depicted in Figure 3-7.

Figure 3-6. Basic pipeline performing filtering in a single step
Figure 3-7. Performing processes using multiple steps

The rules and topologies discussed in the previous section also apply to these pipelines. Each of the streams in Figure 3-7 can have multiple implementations enabling single-threaded, multithreaded, multiprocess, and multinode processing, with or without partitioning and parallelization. The introduction of additional capabilities such as persistent streams, windowing, event storage, key/value stores, and caching adds further complications to the physical implementations of data pipelines.
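
As a toy illustration of a multistep pipeline within a single process, each stage in the Python sketch below is a generator reading from the previous "stream"; in a real platform, each stage boundary could instead be a queue, a socket, or a persistent stream:

    def read(source_events):
        yield from source_events  # reader

    def filter_step(events):
        return (e for e in events if e["amount"] > 0)  # drop non-positive amounts

    def enrich_step(events):
        return ({**e, "currency": "USD"} for e in events)  # add a field

    def write(events):
        for e in events:
            print("delivered:", e)  # writer

    # Each step reads from the previous stream and outputs to the next one.
    write(enrich_step(filter_step(read([{"amount": 5}, {"amount": -1}]))))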

Stream-processing platforms need to handle deployment of arbitrarily complex data pipelines atomically (i.e., the whole pipeline is deployed or nothing is), adopting sensible default stream implementations based on partitioning, parallelism, resource usage, and other metrics while still allowing users to specify certain behaviors to optimize flows in production environments, as scale dictates.

Persistent Streams

As mentioned earlier, a data stream is an unbounded, continuous sequence of events in which each event comprises data and metadata (including timestamp) fields from external or intermediary data sources. Traditionally, to run continuous processing queries over streams, stream publishers and consumers used a classic publish/subscribe model in which main memory was used to bound a portion of the streaming data. This bound portion, either a single event or multiple events, was then examined for processing purposes and subsequently discarded so as not to exhaust main memory. Streams as such were always transient in nature: as soon as events in a stream were discarded, they could no longer be accessed.

There are several challenges that naturally arise when streams are processed in a purely in-memory manner, as just described:

  • A subscriber must deal with streams as they arrive. The consumption model is thus very tightly coupled to the publisher. If a publisher publishes an event but the subscriber is not available (for example, due to a failure), the event cannot be made available to the subscriber.

    If multiple data streams arrive into the stream-processing system, a subsequent replay of those streams from external systems cannot guarantee the exact order of previously acknowledged events if those events are discarded from memory.

  • The publisher of the stream can stall if the consumer of the stream is slow to receive it. This has consequences for processing throughput.

To address these challenges, persistent streams are first reliably and efficiently written to disk, prior to processing, in a way that preserves the order of events. This allows the external source to write the incoming stream’s sequence of events to disk, and allows subscribers to consume those events independently of the publisher. Crucially, this persistence should be transparent from an implementation standpoint: readers and writers interact with a persistent stream exactly as they would with a transient one.
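
Here is a toy Python sketch of the idea behind a persistent stream: an append-only log on disk that preserves event order and lets each subscriber replay independently from its own offset (real systems add segmenting, acknowledgment tracking, and compaction, all omitted here):

    import json

    class PersistentStream:
        """A toy append-only log: events are written to disk in order, and
        each consumer reads independently via its own offset."""

        def __init__(self, path: str):
            self.path = path

        def append(self, event: dict) -> None:
            with open(self.path, "a", encoding="utf-8") as f:
                f.write(json.dumps(event) + "\n")  # order preserved on disk

        def read_from(self, offset: int):
            with open(self.path, encoding="utf-8") as f:
                for i, line in enumerate(f):
                    if i >= offset:
                        yield i, json.loads(line)  # consumer tracks its offset

    stream = PersistentStream("/tmp/orders.log")
    stream.append({"id": 1}); stream.append({"id": 2})
    # A subscriber that was down can replay from its last acknowledged offset:
    for offset, event in stream.read_from(0):
        print(offset, event)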
