Understanding Spark Streaming

For real-time processing in Apache Spark, the current focus is on Structured Streaming, which is built on top of the DataFrame/Dataset infrastructure. Using the DataFrame abstraction allows streaming, machine learning, and Spark SQL to benefit from the Spark SQL engine's Catalyst optimizer and its regular improvements (for example, Project Tungsten). Nevertheless, to more easily understand Structured Streaming, it is worthwhile to understand the fundamentals of its predecessor, Spark Streaming. The following diagram represents a Spark Streaming application data flow involving the Spark driver, workers, streaming sources, and streaming targets:

The description of the preceding diagram is as follows:

  1. Starting with the Spark Streaming Context (SSC), the driver will execute long-running tasks on the executors (that is, the Spark workers).
  2. Once the code defined within the driver is started (by calling ssc.start()), the receiver on the executors (Executor 1 in this diagram) receives a data stream from the streaming sources. Spark Streaming can receive data from sources such as Kafka or Twitter, and/or you can build your own custom receiver. As the data stream arrives, the receiver divides it into blocks and keeps these blocks in memory.
  3. These data blocks are replicated to another executor for high availability.
  4. The block ID information is transmitted to the block manager master on the driver, thus ensuring that each block of data in memory is tracked and accounted for.
  5. For every batch interval configured within SSC (commonly, this is every 1 second), the driver will launch Spark tasks to process the blocks. Those blocks are then persisted to any number of target data stores, including cloud storage (for example, S3, WASB), relational data stores (for example, MySQL, PostgreSQL, and so on), and NoSQL stores.
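The steps above can be illustrated with a small sketch. This is not Spark code; it is a pure-Python toy (the function name `micro_batch` and the record-count cutoff are made up for illustration) showing the core micro-batch idea: an unbounded stream is divided into blocks, and each completed block is handed off for processing, much as the driver launches tasks against the receiver's blocks at every batch interval.

```python
def micro_batch(stream, records_per_block=3):
    """Toy illustration of Spark Streaming's micro-batch model:
    the incoming stream is divided into small blocks, and each
    block is processed as a unit. A record count stands in for
    the time-based batch interval (commonly 1 second)."""
    block = []
    for record in stream:
        block.append(record)
        if len(block) == records_per_block:
            yield list(block)   # hand a completed block to processing
            block.clear()
    if block:                   # flush the final partial block
        yield list(block)

# A "stream" of seven records becomes three blocks.
blocks = list(micro_batch(range(7)))
# blocks == [[0, 1, 2], [3, 4, 5], [6]]
```

In real Spark Streaming the division is driven by time (the batch interval configured on the StreamingContext) rather than by record count, and the blocks are replicated and tracked by the block manager as described above.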

In the following sections, we will review recipes with Discretized Streams or DStreams (the fundamental streaming building block) and then perform global aggregations by performing stateful calculations on DStreams. We will then simplify our streaming application by using structured streaming while at the same time gaining performance optimizations.
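The stateful calculation mentioned above can be sketched in plain Python. This is an analogue of the DStream pattern (in the spirit of updateStateByKey), not actual PySpark code: each micro-batch contributes per-key counts that are merged into a running global state.

```python
from collections import Counter

def update_state(batch_counts, state):
    """Merge one micro-batch's per-key counts into the running
    (global) state -- the essence of a stateful DStream operation."""
    new_state = Counter(state)
    new_state.update(batch_counts)
    return new_state

# Three micro-batches of words arriving over time.
batches = [["spark", "streaming"], ["spark"], ["streaming", "spark"]]
state = Counter()
for batch in batches:
    state = update_state(Counter(batch), state)
# state now holds word counts aggregated across all batches
```

In actual Spark Streaming, the framework checkpoints this state and calls your update function for you on every batch; Structured Streaming subsumes the same pattern with a streaming groupBy aggregation.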
