Large-scale machine learning APIs in Spark

In this section, we will describe two key concepts introduced by the Spark machine learning libraries (Spark MLlib and Spark ML) and the most widely used algorithms they implement, which align with the supervised and unsupervised learning techniques discussed in the previous sections.

Spark machine learning libraries

As already stated, in the pre-Spark era, big data modelers typically built their ML models using statistical languages such as R, STATA, and SAS, and data engineers then re-implemented the same models in Java, for example, to deploy them on Hadoop.

However, this kind of workflow lacks efficiency, scalability, and throughput, often sacrifices accuracy, and suffers from extended execution times.

Using Spark, the same ML model can be rebuilt, adapted, and deployed in a single environment, making the whole workflow much more efficient, robust, and faster, and allowing you to move from raw data to actionable insight more quickly. The Spark machine learning libraries are divided into two packages: Spark MLlib (spark.mllib) and Spark ML (spark.ml).

Spark MLlib

MLlib is Spark's scalable machine learning library: an extension of the Spark Core API that provides an easy-to-use collection of machine learning algorithms, implemented in Java, Scala, and Python. Spark supports local vector and matrix data types stored on a single machine, as well as distributed matrices backed by one or more RDDs (a short sketch of these data types follows Table 1):

| ML tasks | Discrete | Continuous |
| --- | --- | --- |
| Supervised | Classification: logistic regression (and regularized variants), linear SVM, naïve Bayes, decision trees, random forests, gradient-boosted trees | Regression: linear regression (and regularized variants), linear least squares, lasso and ridge regression, isotonic regression |
| Unsupervised | Clustering: K-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), bisecting K-means, streaming K-means | Dimensionality reduction and matrix factorization: principal component analysis, singular value decomposition, alternating least squares |
| Reinforcement | N/A | N/A |
| Recommender systems | Collaborative filtering (for example, Netflix-style recommendation) | N/A |

Table 1: Spark MLlib at a glance.

Legend:
  • Continuous: making predictions about continuous variables; for example, predicting the maximum temperature for the upcoming days
  • Discrete: assigning discrete class labels to particular observations as prediction outcomes; for example, in weather forecasting, predicting whether a day will be sunny, rainy, or snowy
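To give a flavor of these data types, here is a minimal Scala sketch (assuming an existing SparkContext named sc, for example from the Spark shell) that creates local dense and sparse vectors, a local matrix, and a distributed RowMatrix backed by an RDD:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Dense and sparse local vectors, stored on a single machine
val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A 3x2 dense local matrix, supplied in column-major order
val local = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// A distributed matrix backed by an RDD of local vectors
val distributed = new RowMatrix(sc.parallelize(Seq(dense, sparse)))
println(s"${distributed.numRows()} x ${distributed.numCols()}")
```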

Spark MLlib has many strengths. For example, the algorithms, implemented in Scala, Java, and Python, are highly scalable and leverage Spark's ability to work with massive amounts of data. They are fast because they are designed for parallel computing with in-memory operation, which is up to 100 times faster than MapReduce data processing (they also support disk-based operation, which is about 10 times faster than standard MapReduce processing), using the Dataset, DataFrame, or Directed Acyclic Graph (DAG)-based RDD APIs.

They are also diverse: they cover common machine learning algorithms for regression analysis, classification, clustering, recommender systems, text analytics, and frequent pattern mining, and they cover all the steps required to build scalable machine learning applications.
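As a concrete illustration of the RDD-based API, the following sketch clusters a set of points with MLlib's K-means. The input path data/points.txt is hypothetical; each line is assumed to hold space-separated numeric features, and sc is an existing SparkContext:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a text file of space-separated features into an RDD of vectors
val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache() // keep the RDD in memory across the iterative algorithm

// Cluster the data into two groups, running at most 20 iterations
val model = KMeans.train(points, 2, 20)

// Evaluate clustering quality via the within-set sum of squared errors
println(s"WSSSE = ${model.computeCost(points)}")
```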

Spark ML

Spark ML adds a new set of machine learning APIs that let users quickly assemble and configure practical machine learning pipelines on top of Datasets. It aims to provide a uniform set of high-level APIs built on top of DataFrames, rather than RDDs, that help users create and tune such pipelines. The Spark ML API standardizes machine learning algorithms, making it easier for data scientists to combine multiple algorithms into a single pipeline or data workflow.

Spark ML uses the concept of the DataFrame (in Scala and Java now subsumed by the Dataset API, but still the main programming interface in Python and R), introduced in the Spark 1.3.0 release as part of Spark SQL, as its machine learning dataset. A DataFrame can hold diverse data types, such as columns storing text, feature vectors, and true labels. In addition, Spark ML uses the transformer to transform one DataFrame into another, while the estimator is fitted on a DataFrame to produce a new transformer. The pipeline API chains multiple transformers and estimators together to specify an ML workflow, and the concept of the parameter gives all transformers and estimators a common API for specifying their settings during the development of an ML application (see the pipeline sketch after Table 2):

| ML tasks | Discrete | Continuous |
| --- | --- | --- |
| Supervised | Classification: logistic regression, decision tree classifier, random forest classifier, gradient-boosted tree classifier, multilayer perceptron classifier, one-vs-rest classifier | Regression: linear regression, decision tree regression, random forest regression, gradient-boosted tree regression, survival regression |
| Unsupervised | Clustering: K-means, latent Dirichlet allocation (LDA) | Tree ensembles: random forests, gradient-boosted trees |
| Reinforcement | N/A | N/A |
| Recommender systems | N/A | N/A |

Table 2: Spark ML at a glance (legend same as Table 1).
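The following minimal sketch shows how the transformer, estimator, pipeline, and parameter concepts fit together in a simple text classification workflow. It assumes an existing SparkSession named spark, and the four-row training set is entirely made up for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A tiny labeled training set held in a DataFrame
val training = spark.createDataFrame(Seq(
  (0L, "spark rdd dataframe", 1.0),
  (1L, "hadoop mapreduce", 0.0),
  (2L, "spark ml pipeline", 1.0),
  (3L, "relational database", 0.0)
)).toDF("id", "text", "label")

// Transformers: Tokenizer and HashingTF turn raw text into feature vectors
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Estimator: fitting LogisticRegression yields a transformer (the model);
// setMaxIter and setRegParam are examples of the shared parameter API
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// Pipeline: chain the stages into a single ML workflow and fit it
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// The fitted PipelineModel is itself a transformer over new DataFrames
model.transform(training).select("id", "probability", "prediction").show()
```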

As shown in Table 2, Spark ML provides several classification, regression, decision tree, and tree ensemble algorithms, as well as clustering algorithms, all implemented for developing ML pipelines on top of DataFrames. An advanced optimization algorithm under active development, Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), extends L-BFGS so that it can effectively handle L1 regularization and elastic net (see the Spark ML advanced topics page at https://spark.apache.org/docs/latest/ml-advanced.html).
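In practice, elastic net regularization is exposed as a parameter on the learner itself. A minimal sketch of configuring it, with illustrative values, looks like this:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// elasticNetParam mixes the penalties: 0.0 is pure L2 (ridge),
// 1.0 is pure L1 (lasso), and values in between blend the two
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.1)        // overall regularization strength
  .setElasticNetParam(0.5) // a 50/50 mix of L1 and L2
```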

Important notes for practitioners

Currently, only Pearson's and Spearman's correlations are supported, with more to be added in future Spark releases. Unlike other statistical functions, stratified sampling is also supported by Spark and can be performed on RDDs of key-value pairs; however, some of its functionality is not yet available to Python developers. There are currently no reinforcement learning modules in the Spark machine learning libraries (refer to Table 1 and Table 2). The current implementation of Spark MLlib provides a parallel implementation of FP-growth for mining frequent patterns and association rules; however, you will have to customize the algorithm yourself to mine maximal frequent patterns. We will provide a scalable ML application for privacy-preserving maximal frequent pattern mining in upcoming chapters.
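The sketch below, using entirely made-up toy data and assuming an existing SparkContext named sc, illustrates the three facilities just mentioned: Pearson's and Spearman's correlation, stratified sampling via sampleByKey, and parallel FP-growth:

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.mllib.stat.Statistics

// Pearson and Spearman correlation between two RDD[Double] series
val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val y = sc.parallelize(Seq(2.0, 4.0, 5.0, 9.0))
println(Statistics.corr(x, y, "pearson"))
println(Statistics.corr(x, y, "spearman"))

// Stratified sampling on a key-value RDD: one fraction per stratum (key)
val data = sc.parallelize(Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd')))
val fractions = Map(1 -> 0.5, 2 -> 1.0)
val sample = data.sampleByKey(withReplacement = false, fractions)

// Parallel FP-growth for frequent pattern mining over toy transactions
val transactions = sc.parallelize(Seq(
  Array("bread", "milk"),
  Array("bread", "butter"),
  Array("bread", "milk", "butter")))
val fpModel = new FPGrowth().setMinSupport(0.5).run(transactions)
fpModel.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")} : ${itemset.freq}")
}
```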

Another fact is that the current implementation of collaborative filtering-based recommendation in Spark does not support real-time streaming data; in later chapters, however, we will show a practical recommender system based on click-stream data using association rule mining (see Tom M. Mitchell, The Discipline of Machine Learning, CMU, 2006, http://www.cs.cmu.edu/). Moreover, some algorithms are not available or are yet to be added to Spark ML; dimensionality reduction is a notable example.
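For reference, a minimal batch (non-streaming) collaborative filtering sketch with MLlib's alternating least squares, using made-up ratings and assuming an existing SparkContext named sc, looks like this:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// A tiny, made-up set of (user, product, rating) triples
val ratings = sc.parallelize(Seq(
  Rating(user = 1, product = 10, rating = 4.0),
  Rating(user = 1, product = 20, rating = 1.0),
  Rating(user = 2, product = 10, rating = 5.0)))

// rank = 10 latent factors, 10 iterations, lambda = 0.01 regularization
val model = ALS.train(ratings, 10, 10, 0.01)

// Predict how user 2 would rate product 20
println(model.predict(2, 20))
```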

Developers can, however, seamlessly combine the implementations of these techniques found in Spark MLlib with the algorithms found in Spark ML to build hybrid or interoperable ML applications, as sketched below. Beyond the multilayer perceptron classifier, deeper brain-inspired neural network learning algorithms covering multiclass, two-class, and regression problems are not yet implemented in the Spark ML APIs.
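As a sketch of such interoperability, suppose df is a Spark ML DataFrame whose features column stores org.apache.spark.mllib.linalg.Vector values (the Spark 1.x convention; both df and the column layout are assumptions here). Its rows can be pulled down to an RDD and fed to MLlib's singular value decomposition, which Spark ML does not offer:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Bridge from the DataFrame world (Spark ML) to the RDD world (Spark MLlib):
// extract the feature vectors as an RDD and build a distributed row matrix
val featureRDD = df.select("features").rdd.map(_.getAs[Vector](0))
val matrix = new RowMatrix(featureRDD)

// Reduce to the top two singular components with MLlib's SVD
val svd = matrix.computeSVD(2, computeU = true)
println(svd.s) // the leading singular values
```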
