List of Figures

Chapter 1. Introducing Storm

Figure 1.1. A batch processor and how data flows into it

Figure 1.2. A stream processor and how data flows into it

Figure 1.3. Example of how Storm may be used within a system

Figure 1.4. How Storm can be used with other technologies

Figure 1.5. Hadoop and how data flows into it

Figure 1.6. Storm and how data flows into it

Chapter 2. Core Storm concepts

Figure 2.1. Mock-up of dashboard for a running count of changes made to a repository

Figure 2.2. The commit count problem broken down into a series of steps with defined inputs and outputs

Figure 2.3. A topology is a graph with nodes representing computations and edges representing results of computations.

Figure 2.4. Design mapped to the definition of a Storm topology

Figure 2.5. Format for displaying tuples in figures throughout the book

Figure 2.6. Two types of tuples in the topology: one for the commit message and another for the email

Figure 2.7. Identifying the two streams in our topology

Figure 2.8. Topology with four streams

Figure 2.9. A spout reads from the feed of commit messages.

Figure 2.10. Bolts perform processing on the commit messages and associated emails within those messages.

Figure 2.11. There are normally multiple instances of a particular bolt emitting tuples to multiple instances of another bolt.

Figure 2.12. Individual instances of a bolt can emit to any number of instances of another bolt.

Figure 2.13. Each stream in the topology will have its own stream grouping.

Figure 2.14. Using a shuffle grouping between our spout and first bolt

Figure 2.15. Use a fields grouping for the bolt that will have a separate in-memory map for each bolt instance.

Figure 2.16. Storm’s class hierarchy for the spout

Figure 2.17. The spout listens to the feed of commit messages and emits a tuple for each commit message.

Figure 2.18. Storm’s class hierarchy for the bolt

Figure 2.19. The two bolts in our topology: the first bolt extracts the email from the commit message and the second bolt maintains an in-memory map of emails to commit counts.

Chapter 3. Topology design

Figure 3.1. Using check-ins to build a heat map of bars

Figure 3.2. Transforming input tuples to end tuples via a series of operations

Figure 3.3. Heat map design mapped to Storm concepts

Figure 3.4. The spout listens to the fire hose of social check-ins and emits a tuple for each check-in.

Figure 3.5. The geocode lookup bolt accepts a social check-in and retrieves the coordinates associated with that check-in.

Figure 3.6. The heat map builder bolt accepts a tuple with time and geocode and emits a tuple containing a time interval and a list of geocodes.

Figure 3.7. The Persistor bolt accepts a tuple with a time interval and a list of geocodes and persists that data to a data store.

Figure 3.8. Heat map topology

Figure 3.9. Focusing our parallelization changes on the Checkins spout

Figure 3.10. Four Checkins instances emitting tuples to one GeocodeLookup instance results in the GeocodeLookup instance being a bottleneck.

Figure 3.11. Four Checkins instances emitting tuples to eight GeocodeLookup instances

Figure 3.12. A worker node is a physical or virtual machine that’s running a JVM, which executes the logic in the spouts and bolts.

Figure 3.13. Executors (threads) and tasks (instances of spouts/bolts) run on a JVM.

Figure 3.14. Updated topology with the TimeIntervalExtractor bolt

Figure 3.15. Updated topology stream groupings

Figure 3.16. Parallelizing all the components in our topology

Figure 3.17. Adding city to the tuple being emitted by GeocodeLookup and having TimeIntervalExtractor pass the city along in its emitted tuple

Figure 3.18. The heat map topology design as a series of functional components

Figure 3.19. The HeatMap topology design as points of repartition

Chapter 4. Creating robust topologies

Figure 4.1. Conceptual solution of the e-commerce credit card authorization flow

Figure 4.2. E-commerce credit card authorization mapped to Storm concepts

Figure 4.3. The AuthorizeCreditCard bolt accepts an incoming tuple from the RabbitMQSpout and emits a tuple regardless of whether or not the credit card was authorized.

Figure 4.4. The ProcessedOrderNotification bolt accepts an incoming tuple from the AuthorizeCreditCard bolt and notifies an external system without emitting a tuple.

Figure 4.5. Initial state of the tuple tree

Figure 4.6. Tuple tree after the AuthorizeCreditCard bolt emits a tuple

Figure 4.7. Tuple tree after the AuthorizeCreditCard bolt acks its input tuple

Figure 4.8. Tuple tree after the ProcessedOrderNotification bolt acks its input tuple

Figure 4.9. Conceptual solution of the e-commerce credit card authorization flow with an extra step for providing better at-least-once processing

Chapter 5. Moving from local to remote topologies

Figure 5.1. Nimbus and Supervisors and their responsibilities inside a Storm cluster

Figure 5.2. The Zookeeper Cluster and its role within a Storm cluster

Figure 5.3. A worker process consists of one or more executors, with each executor consisting of one or more tasks.

Figure 5.4. A hypothetical breakdown of a worker node with multiple worker processes, executors, and tasks for the credit card authorization topology

Figure 5.5. Extracted contents of a Storm release zip

Figure 5.6. Extracted contents of a Storm release zip

Figure 5.7. The command for deploying your topology to a Storm cluster

Figure 5.8. The Cluster Summary screen shows details for the entire Storm cluster.

Figure 5.9. The Topology summary screen shows details for a specific topology.

Figure 5.10. The Cluster Summary screen of the Storm UI

Figure 5.11. The Cluster Summary section on the Cluster Summary screen of the Storm UI

Figure 5.12. The Topology summary section on the Cluster Summary screen of the Storm UI

Figure 5.13. The Supervisor summary section on the Cluster Summary screen of the Storm UI

Figure 5.14. The Nimbus Configuration section on the Cluster Summary screen of the Storm UI

Figure 5.15. The Topology summary screen of the Storm UI

Figure 5.16. The Topology summary section on the Topology summary screen of the Storm UI

Figure 5.17. The Topology actions section on the Topology summary screen of the Storm UI

Figure 5.18. The Topology stats section on the Topology summary screen of the Storm UI

Figure 5.19. The Spouts section on the Topology summary screen of the Storm UI

Figure 5.20. The Bolts section on the Topology summary screen of the Storm UI, up to Capacity column

Figure 5.21. The Bolts section on the Topology summary screen of the Storm UI, remaining columns

Figure 5.22. The Topology Configuration section on the Topology summary screen of the Storm UI

Figure 5.23. The bolt summary screen in the Storm UI

Figure 5.24. The Component Summary section for a bolt in the Storm UI

Figure 5.25. The Bolt stats section in the Storm UI

Figure 5.26. The Input stats section for a bolt in the Storm UI

Figure 5.27. The Output stats section for a bolt in the Storm UI

Figure 5.28. The Executors section for a bolt in the Storm UI, through Capacity column

Figure 5.29. The Executors section for a bolt in the Storm UI, remaining columns

Figure 5.30. The Errors section for a bolt in the Storm UI

Chapter 6. Tuning in Storm

Figure 6.1. The Find My Sale! topology: its components and data points

Figure 6.2. The Find My Sale! design mapped to Storm concepts

Figure 6.3. The spout emits a tuple for each customer ID that it receives.

Figure 6.4. The FindRecommendedSales bolt accepts a customer ID in its input tuple and emits a tuple containing a customer ID and a list of sales IDs.

Figure 6.5. The bolt for looking up sales details accepts a customer ID and list of sales IDs in its input tuple and emits a tuple containing a customer ID and list of Sale objects containing the details of each sale.

Figure 6.6. The save recommended sales bolt accepts an input tuple containing the customer ID and a list of Sale objects and persists that information to a database.

Figure 6.7. Topology summary screen of the Storm UI

Figure 6.8. Identifying the important parts of the Storm UI for our tuning lesson

Figure 6.9. Storm UI shows a minimal change in capacity after a first attempt at increasing parallelism for our first two bolts.

Figure 6.10. Storm UI showing minimal change in capacity after doubling the number of executors for our first two bolts

Figure 6.11. Storm UI showing improved capacity for all three bolts in our topology

Figure 6.12. Because max spout pending is greater than the total number of tuples we can process at one time, it’s not a bottleneck.

Figure 6.13. LatencySimulator constructor arguments explained

Figure 6.14. Storm UI showing the last error for each of our bolts

Chapter 7. Resource contention

Figure 7.1. The various types of nodes in a Storm cluster and worker nodes broken down as worker processes and their parts

Figure 7.2. Many worker processes running on a worker node

Figure 7.3. Diagnosing issues for a particular topology in the Storm UI

Figure 7.4. Looking at the Executors and Errors sections of the Storm UI for a particular bolt to determine the type of issue the bolt is having, along with the worker nodes and worker processes that bolt is executing on

Figure 7.5. Example Storm cluster where all of the worker processes have been assigned to topologies

Figure 7.6. Storm UI: Zero free slots could mean your topologies are suffering from slot contention.

Figure 7.7. Worker processes, executors, and tasks mapping to the JVM, threads, and instances of spouts/bolts, with the threads/instances contending for memory in the same JVM

Figure 7.8. GC log output showing the output for -XX:+PrintGCDateStamps, -XX:+PrintGCTimeStamps, and -XX:+PrintGCApplicationConcurrentTime

Figure 7.9. GC log output showing details of a minor garbage collection of young generation memory

Figure 7.10. GC log output showing details of a major garbage collection of tenured generation memory

Figure 7.11. GC log output showing entire heap values and complete GC time

Figure 7.12. A worker node has a fixed amount of memory that’s being used by its worker processes along with any other processes running on that worker node.

Figure 7.13. sar command breakdown

Figure 7.14. Output of sar -S 1 3 for reporting swap space utilization

Figure 7.15. Output of sar -u 1 3 for reporting CPU utilization

Figure 7.16. Reducing the number of worker processes per worker node in a cluster where there are unused worker processes

Figure 7.17. Reducing the number of worker processes per worker node in a cluster where there are no unused worker processes, resulting in more worker nodes being added

Figure 7.18. Output of sar -u 1 3 for reporting CPU utilization and, in particular, I/O wait times

Figure 7.19. The output for the iotop command and what to look for when determining if a worker node is experiencing disk I/O contention

Chapter 8. Storm internals

Figure 8.1. Commit count topology design and flow of data between the spout and bolts

Figure 8.2. Commit count topology running across two worker nodes, with one worker process executing a spout and bolt and another worker process executing a bolt

Figure 8.3. Breaking down the flow of data through the topology into six sections, each of which highlights something different within an executor or how data is passed between executors

Figure 8.4. Focusing on data flowing into the spout

Figure 8.5. The spout reads messages off the queue containing commit messages and converts those messages into tuples. The main thread on the executor handles emitted tuples, passing them to the executor’s outgoing queue.

Figure 8.6. Focusing on passing tuples within the same JVM

Figure 8.7. A more detailed look at a local transfer of tuples between the commit feed listener spout and the email extractor bolt

Figure 8.8. Focusing on a bolt that emits a tuple

Figure 8.9. The executor for our bolt, with two threads and two queues

Figure 8.10. Focusing on sending tuples between JVMs

Figure 8.11. The steps that occur during a remote transfer of a tuple between executors on different JVMs

Figure 8.12. Focusing on a bolt that does not emit a tuple

Figure 8.13. The executor for the email counter bolt with a main thread that pulls a tuple off the incoming disruptor queue and sends that tuple to the bolt to be processed

Figure 8.14. A worker process broken down with its internal threads and queue along with an executor and its internal threads, queues, and a task

Figure 8.15. An executor with multiple tasks

Figure 8.16. Mapping out the steps taken when determining the destination task for an emitted tuple

Figure 8.17. Snapshot of a debug log output for a bolt instance

Figure 8.18. Breaking down the debug log output lines for the send/receive queue metrics

Chapter 9. Trident

Figure 9.1. Trident topologies operate on streams of batches of tuples whereas native Storm topologies operate on streams of individual tuples.

Figure 9.2. Distribution of a Kafka topic as a group of partitions on many Kafka brokers

Figure 9.3. A partition contains an immutable, ordered sequence of messages, where the consumers reading these messages maintain offsets for their read positions.

Figure 9.4. Trident topology for internet radio application

Figure 9.5. The Trident Kafka spout will be used for handling incoming play logs.

Figure 9.6. Operation for deserializing JSON into Trident tuples for each of the artist, title, and tags fields

Figure 9.7. We want to move from a stream with tuples containing multiple values to multiple streams with tuples containing single values.

Figure 9.8. Counting each of the artist, title, and tag values and persisting those values to a store

Figure 9.9. Breaking down the stateQuery operation

Figure 9.10. Our Trident topology broken down into spouts and bolts in the Storm UI

Figure 9.11. Our Trident topology displayed on the Storm UI after naming each of the operations

Figure 9.12. How our Trident operations are mapped down into bolts

Figure 9.13. The Storm UI with named operations for both the Trident topology and DRPC stream

Figure 9.14. How the Trident and DRPC streams and operations are being mapped down into bolts

Figure 9.15. Partitions are distributed across Storm worker processes (JVMs) and operated on in parallel.

Figure 9.16. Partitioned stream with a series of batches in between two operations

Figure 9.17. Kafka topic partitions and how they relate to the partitions within a Trident stream

Figure 9.18. Result of applying a parallelism hint of four to the groupBy operations in our Trident topology
