Chapter 5. Real-World Systems

Fast data architectures raise the bar for the “ilities” of distributed data processing. Whereas batch jobs seldom last more than a few hours, a streaming pipeline is designed to run for weeks, months, or even years. If you wait long enough, even the most obscure problem is likely to occur.

The umbrella term reactive systems captures the qualities that real-world systems must exhibit. These systems must be:

Responsive

The system always responds in a timely manner, even if that response is a notice that full service is unavailable due to some failure.

Resilient

The system is resilient against failure of any one component, such as server crashes, hard drive failures, network partitions, etc. Leveraging replication prevents data loss and enables a service to keep going using the remaining instances. Leveraging isolation prevents cascading failures.

Elastic

You can expect the load to vary considerably over the lifetime of a service. It’s essential to implement dynamic, automatic scalability, both up and down, based on load.

Message driven

While fast data architectures are obviously focused on data, here we mean that all services respond to directed commands and queries, and in turn use messages to send commands and queries to other services.

Batch-mode and interactive systems have traditionally had less stringent requirements for these qualities, but fast data architectures are online systems, where downtime and data loss are serious, costly problems. Developers who have focused on analytics tools that run in the back office suddenly find themselves forced to learn new skills for distributed systems programming and operations when implementing these architectures.

Some Specific Recommendations

Most of the components we’ve discussed strive to support the reactive qualities to one degree or another. Of course, you should follow all of the usual recommendations about good management and monitoring tools, disaster recovery plans, etc., which I won’t repeat here. However, here are some specific recommendations:

  • Ingest all inbound data into Kafka first, then consume it with the stream processors. For all the reasons I’ve highlighted, you get durable, scalable, resilient storage as well as multiple consumers, replay capabilities, etc. You also get the uniform simplicity and power of event log and message queue semantics as the core concepts of your architecture.

  • For the same reasons, write data back to Kafka for consumption by downstream services. Avoid direct connections between services, which are less resilient.

  • Because Kafka Streams leverages the distributed management features of Kafka, you should use it when you can to add processing capabilities with minimal additional management overhead in your architecture.

  • For integration microservices, use Reactive Streams–compliant protocols for direct message and data exchange, so that backpressure provides a resilient flow-control mechanism.

  • Use Mesos, YARN, or a similarly mature management infrastructure for processes and resources, with proven scalability, resiliency, and flexibility. I don’t recommend Spark’s standalone-mode deployments, except for relatively simple deployments that aren’t mission critical, because Spark provides only limited support for these features.

  • Choose your databases wisely (if you need them). Do they provide distributed scalability? How resilient against data loss and service disruption are they when components fail? Understand the CAP trade-offs you need and how well they are supported by your databases.

  • Seek professional production support for your environment, even when using open source solutions. It’s cheap insurance and it saves you time (which is money).
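To make the “ingest into Kafka first” and “write back to Kafka” recommendations concrete, here is a minimal sketch of producer settings that favor durability over raw throughput. The helper name and broker addresses are assumptions for illustration; the property keys (`acks`, `retries`, `enable.idempotence`) are standard Kafka producer configuration.

```java
import java.util.Properties;

public class ResilientProducerConfig {
    // Hypothetical helper: builds durability-focused producer settings.
    public static Properties durableProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Wait for all in-sync replicas to acknowledge each write,
        // so a single broker failure cannot lose acknowledged data.
        props.put("acks", "all");
        // Retry transient failures instead of dropping records.
        props.put("retries", "2147483647");
        // Prevent duplicates introduced by those retries.
        props.put("enable.idempotence", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties p = durableProducerProps("broker1:9092,broker2:9092");
        System.out.println(p.getProperty("acks")); // prints "all"
    }
}
```

Pairing `acks=all` with idempotent retries is the usual trade: each write costs a little more latency, but an acknowledged record survives the loss of any single broker.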
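Similarly, a Kafka Streams application inherits much of Kafka’s distributed management through configuration alone. A minimal sketch, where the application ID and broker list are placeholders: `replication.factor` governs the internal changelog and repartition topics, and `num.standby.replicas` keeps warm copies of local state for faster failover.

```java
import java.util.Properties;

public class StreamsResilienceConfig {
    // Hypothetical helper; appId and bootstrapServers are placeholders.
    public static Properties resilientStreamsProps(String appId,
                                                   String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", appId);
        props.put("bootstrap.servers", bootstrapServers);
        // Replicate internal changelog/repartition topics so local
        // state can be rebuilt if an instance is lost.
        props.put("replication.factor", "3");
        // Keep a warm standby of each local state store,
        // shortening failover time when an instance dies.
        props.put("num.standby.replicas", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties p = resilientStreamsProps("my-app", "broker1:9092");
        System.out.println(p.getProperty("replication.factor")); // prints "3"
    }
}
```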
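The backpressure idea behind Reactive Streams can be sketched with the JDK’s own `java.util.concurrent.Flow` interfaces, a minimal illustration rather than a full integration: the subscriber signals demand with `request(1)`, so a fast publisher can never overwhelm a slow consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class BackpressureSketch {
    // A subscriber that pulls one element at a time via request(1),
    // the Reactive Streams flow-control idiom.
    static class OneAtATime implements Flow.Subscriber<Integer> {
        final List<Integer> received = new ArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);            // signal demand for exactly one element
        }
        public void onNext(Integer item) {
            received.add(item);
            subscription.request(1); // ask for the next only when ready
        }
        public void onError(Throwable t) { done.countDown(); }
        public void onComplete() { done.countDown(); }
    }

    public static List<Integer> run() {
        OneAtATime sub = new OneAtATime();
        try (SubmissionPublisher<Integer> pub = new SubmissionPublisher<>()) {
            pub.subscribe(sub);
            for (int i = 1; i <= 5; i++) pub.submit(i);
        } // close() completes the stream after all items are delivered
        try {
            sub.done.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sub.received;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [1, 2, 3, 4, 5]
    }
}
```

`SubmissionPublisher.submit` blocks when a subscriber’s demand is exhausted, which is exactly the flow control that keeps an integration microservice from being flooded by an upstream producer.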
