Chapter 12. Scalable Frameworks

The advent of social networking, interactive media, and deep analysis has caused the amount of data processed daily to skyrocket. For data scientists, it's no longer just a matter of finding the most appropriate and accurate algorithm to mine data; it is also about leveraging multi-core CPU architectures and distributed computing frameworks to solve problems in a timely fashion. After all, how valuable is a data mining application if the model does not scale?

Scala developers have many options for building classification and regression applications for very large datasets. This chapter covers Scala parallel collections, the actor model, the Akka framework, and Apache Spark in-memory clusters. The following topics are covered in this chapter:

  • An introduction to Scala parallel collections
  • Evaluation of performance of a parallel collection on multi-core CPUs
  • The actor model and reactive systems
  • Clustered and reliable distributed computing using Akka
  • A design of the computational workflow using Akka routers
  • An introduction to Apache Spark clustering and its design principles
  • Using Spark MLlib for clustering
  • Relative performance tuning and evaluation of Spark
  • Benefits and limitations of the Apache Spark framework

An overview

Support for distributed and concurrent processing is provided by a stack of frameworks and libraries. Scala's concurrent and parallel collection classes leverage the threading capabilities of the Java virtual machine. Akka implements a reliable actor model, which was originally introduced as part of the Scala standard library. The Akka framework supports remote actors, routing, load-balancing protocols, dispatchers, clusters, events, and configurable mailbox management. It also provides support for different transport modes, supervisory strategies, and typed actors. Apache Spark's resilient distributed datasets, with their advanced serialization, caching, and partitioning capabilities, leverage the Scala and Akka libraries.

The following stack representation illustrates the interdependencies between frameworks:

The stack representation of scalable frameworks using Scala

Each layer adds new functionality to the previous one to increase scalability. The Java virtual machine runs as a process within a single host. Scala concurrent classes support effective deployment of an application by leveraging multi-core CPU capabilities without the need to write multithreaded code by hand. Akka extends the actor paradigm to clusters with advanced messaging and routing options. Finally, Apache Spark leverages Scala higher-order collection methods and the Akka implementation of the actor model to provide large-scale data processing systems with better performance and reliability, through its resilient distributed datasets and in-memory persistence.
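To make the second layer of the stack concrete, here is a minimal sketch of how the Scala concurrent classes can spread a computation across cores without any explicit thread management. It uses `scala.concurrent.Future` from the standard library rather than the parallel collections discussed later in the chapter, and the object `ParallelSum`, the method `sumOfSquares`, and the parameter `numChunks` are hypothetical names chosen for illustration only.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helper for illustration; not taken from the chapter's code.
object ParallelSum {

  // Splits the data into at most numChunks chunks, sums the squares of each
  // chunk in its own Future, then adds up the partial results.
  def sumOfSquares(data: Vector[Double], numChunks: Int): Double = {
    require(numChunks > 0, "numChunks must be positive")

    // Chunk size is rounded up so that no more than numChunks chunks exist.
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numChunks).toInt)

    // Each Future is scheduled on the global execution context, which is
    // backed by a thread pool sized from the number of available cores.
    val partials: Vector[Future[Double]] =
      data.grouped(chunkSize).toVector.map { chunk =>
        Future(chunk.map(x => x * x).sum)
      }

    // Blocking is acceptable in a small sketch; the 30-second timeout is an
    // arbitrary assumption.
    Await.result(Future.sequence(partials), 30.seconds).sum
  }
}
```

For example, `ParallelSum.sumOfSquares(Vector(1.0, 2.0, 3.0), numChunks = 2)` evaluates the chunks `(1.0, 2.0)` and `(3.0)` independently and returns `14.0`. The caller never creates or joins a thread; the execution context decides how the work is mapped onto the available cores.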
