In this section, we will describe two key concepts introduced by the Spark machine learning libraries (Spark MLlib and Spark ML) and the most widely used implemented algorithms that align with the supervised and unsupervised learning techniques discussed in the preceding sections.
As already stated, in the pre-Spark era, big data modelers typically used to build their ML models using statistical languages such as R, STATA, and SAS. Then the data engineers used to re-implement the same model in Java, for example, to deploy on Hadoop.
However, this kind of workflow lacks efficiency and scalability, suffers from low throughput, risks losing accuracy during re-implementation, and incurs long execution times.
Using Spark, the same ML model can be rebuilt, adapted, and deployed, making the whole workflow much more efficient, robust, and fast, and giving you hands-on insight for improving performance. The Spark machine learning libraries are divided into two packages: Spark MLlib (spark.mllib) and Spark ML (spark.ml).
MLlib is Spark's scalable machine learning library, an extension of the Spark Core API that provides a library of easy-to-use machine learning algorithms. The algorithms are implemented in Java, Scala, and Python. Spark provides support for local vectors and matrix data types stored on a single machine, as well as distributed matrices backed by one or more RDDs:
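To make the local vector representation concrete, here is a small plain-Python sketch (the helper functions are illustrative, not Spark's actual linear algebra API): a dense vector stores every entry, while a sparse vector stores only its length plus the indices and values of its non-zero entries, which is how MLlib's sparse vectors are laid out.

```python
# Illustrative dense vs. sparse local vectors (hypothetical helpers,
# not Spark's API). A sparse vector of length `size` keeps only the
# non-zero positions, so storage is proportional to the non-zeros.

def dense_dot(a, b):
    """Dot product of two dense vectors (plain lists of floats)."""
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(size, indices, values, dense):
    """Dot product of a sparse vector (size, indices, values) with a dense one."""
    return sum(v * dense[i] for i, v in zip(indices, values))

dense = [1.0, 0.0, 3.0]
# The same vector in sparse form: length 3, non-zeros at positions 0 and 2.
print(dense_dot(dense, dense))                   # 10.0
print(sparse_dot(3, [0, 2], [1.0, 3.0], dense))  # 10.0
```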
**Spark MLlib**

| ML tasks | Discrete | Continuous |
| --- | --- | --- |
| Supervised | **Classification:** logistic regression (and regularized variants), linear SVM, naïve Bayes, decision trees, random forests, gradient-boosted trees | **Regression:** linear regression (and regularized variants), linear least squares, lasso and ridge regression, isotonic regression |
| Unsupervised | **Clustering:** K-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), bisecting K-means, streaming K-means | **Dimensionality reduction, matrix factorization:** principal component analysis (PCA), singular value decomposition (SVD), alternating least squares (ALS) |
| Reinforcement | N/A | N/A |
| Recommender systems | **Collaborative filtering:** Netflix-style recommendation | N/A |

Table 1: Spark MLlib at a glance.
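As a concrete illustration of one entry in the table, the following is a minimal single-machine sketch of Lloyd's K-means algorithm in plain Python, on one-dimensional points for brevity. This is illustrative only; Spark's implementation parallelizes the assignment and averaging steps across the cluster:

```python
import random

# Minimal single-machine sketch of Lloyd's K-means (the algorithm behind
# MLlib's K-means), on 1-D points for brevity. Spark distributes the
# assignment and averaging steps over an RDD of points.

def kmeans(points, k, iterations=10, seed=42):
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # pick k initial centers
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(kmeans(points, 2))  # two centers, near 1.0 and 9.0
```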
Spark MLlib's strengths are numerous. For example, the algorithms, implemented in Scala, Java, and Python, are highly scalable and leverage Spark's ability to work with massive amounts of data. They are fast because they are designed for parallel, in-memory computation, which can be up to 100 times faster than MapReduce's standard data processing (they also support disk-based operation, which is still about 10 times faster than standard MapReduce processing), using the Dataset, DataFrame, or Directed Acyclic Graph (DAG)-based RDD APIs.
They are also diverse: they cover common machine learning algorithms for regression analysis, classification, clustering, recommender systems, text analytics, and frequent pattern mining, and they cover all the steps required to build scalable machine learning applications.
Spark ML adds a new set of machine learning APIs that let users quickly assemble and configure practical machine learning pipelines on top of Datasets. Spark ML aims to offer a uniform set of high-level APIs built on top of DataFrames, rather than RDDs, that help users create and tune practical machine learning pipelines. The Spark ML API standardizes machine learning algorithms, making it easier for data scientists to combine multiple algorithms into a single pipeline or data workflow.
Spark ML uses the concept of the DataFrame (although the DataFrame class has been replaced by Dataset&lt;Row&gt; in Java, it remains the main programming interface in Python and R), which was introduced in the Spark 1.3.0 release from Spark SQL as the machine learning dataset. A DataFrame can hold diverse data types, such as columns storing text, feature vectors, and the true labels for the data. In addition, Spark ML uses the transformer to transform one DataFrame into another, while the estimator is fit on a DataFrame to produce a new transformer. The pipeline API, on the other hand, can chain multiple transformers and estimators together to specify an ML data workflow. The concept of the parameter was introduced to let all transformers and estimators share a common API for specifying their parameters during the development of an ML application:
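The transformer/estimator/pipeline pattern can be sketched in a few lines of plain Python (the Scaler, ScalerModel, and Pipeline classes below are hypothetical stand-ins, not the real spark.ml API): an estimator's fit() learns something from the data and returns a transformer, and a pipeline walks its stages in order, fitting estimators and applying the resulting transformers.

```python
# Hypothetical sketch of the transformer/estimator/pipeline pattern;
# the real spark.ml classes operate on DataFrames, not plain lists.

class Scaler:
    """Estimator: fit() learns the max magnitude and returns a transformer."""
    def fit(self, data):
        return ScalerModel(max(abs(x) for x in data))

class ScalerModel:
    """Transformer produced by the estimator: rescales data into [-1, 1]."""
    def __init__(self, m):
        self.m = m
    def transform(self, data):
        return [x / self.m for x in data]

class Pipeline:
    """Chains stages: estimators are fit, and their models transform the data."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        models = []
        for stage in self.stages:
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)   # feed the next stage
            models.append(model)
        return models

data = [2.0, 4.0, -8.0]
models = Pipeline([Scaler()]).fit(data)
print(models[0].transform(data))  # [0.25, 0.5, -1.0]
```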
**Spark ML**

| ML tasks | Discrete | Continuous |
| --- | --- | --- |
| Supervised | **Classification:** logistic regression, decision tree classifier, random forest classifier, gradient-boosted tree classifier, multilayer perceptron classifier, one-vs-rest classifier | **Regression:** linear regression, decision tree regression, random forest regression, gradient-boosted tree regression, survival regression |
| Unsupervised | **Clustering:** K-means, latent Dirichlet allocation (LDA) | **Tree ensembles:** random forests, gradient-boosted trees |
| Reinforcement | N/A | N/A |
| Recommender systems | N/A | N/A |

Table 2: Spark ML at a glance (legend same as Table 1).
As shown in Table 2, Spark ML also provides several classification, regression, decision tree, and tree ensemble algorithms, as well as clustering algorithms, implemented for developing ML pipelines on top of DataFrames. An optimization algorithm under active development is Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), an extension of L-BFGS that can effectively handle L1 regularization and elastic net (see also the Spark ML advanced topics at https://spark.apache.org/docs/latest/ml-advanced.html).
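The reason L1 regularization needs a specialized solver such as OWL-QN is that the lasso penalty is non-differentiable at zero; the key ingredient is a soft-thresholding (proximal) step that plain L-BFGS lacks. The following plain-Python sketch (illustrative, not Spark code) shows that step on a one-dimensional problem:

```python
# Soft-thresholding: the proximal operator of lam * |w|. Methods that
# handle L1 regularization (OWL-QN, coordinate descent for the lasso)
# all rely on this shrink-and-snap-to-zero behavior, which smooth
# gradient methods cannot produce on their own.

def soft_threshold(x, lam):
    """Shrinks x toward zero by lam, snapping small values exactly to 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

# For f(w) = 0.5 * (w - 3)**2 + lam * |w|, one proximal-gradient step
# with step size 1 from w = 0 lands on the lasso solution max(3 - lam, 0).
print(soft_threshold(3.0, 1.0))  # 2.0 (coefficient shrunk by lam)
print(soft_threshold(0.5, 1.0))  # 0.0 (small coefficient zeroed out)
```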
However, currently only Pearson's and Spearman's correlations are supported, with more to be added in future Spark releases. Unlike the other statistical functions, stratified sampling is also supported by Spark and can be performed on RDDs of key-value pairs; however, some of its functionality is not yet available to Python developers. There are currently no reinforcement learning modules in the Spark machine learning libraries (please refer to Table 1 and Table 2). The current implementation of Spark MLlib provides a parallel implementation of FP-growth for mining frequent patterns and association rules. However, you will have to customize the algorithm to mine maximal frequent patterns. We will provide a scalable ML application for mining privacy-preserving maximal frequent patterns in upcoming chapters.
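The idea behind stratified sampling on key-value pairs can be sketched in plain Python as follows (the sample_by_key function below is a hypothetical stand-in for Spark's per-key sampling on pair RDDs, not its API): each key defines a stratum, and each stratum gets its own sampling fraction.

```python
import random

# Hypothetical sketch of stratified sampling over key-value pairs:
# the fraction of records kept is controlled independently per key.

def sample_by_key(pairs, fractions, seed=7):
    """Keep each (key, value) pair with the probability given for its key."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

pairs = [("spam", i) for i in range(100)] + [("ham", i) for i in range(100)]
sample = sample_by_key(pairs, {"spam": 0.1, "ham": 0.5})
# Roughly 10 "spam" and 50 "ham" pairs survive the sampling.
print(sum(1 for k, _ in sample if k == "spam"),
      sum(1 for k, _ in sample if k == "ham"))
```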
Another fact is that the current implementation of the collaborative filtering-based recommendation system in Spark does not support real-time stream data; however, in later chapters we will show a practical recommender system based on click-stream data using association rule mining (see Mitchell, Tom M., The Discipline of Machine Learning, 2006, http://www.cs.cmu.edu/). Moreover, some algorithms are not yet available in Spark ML; dimensionality reduction is a notable example.
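The association rule mining idea behind such a recommender can be sketched in plain Python (illustrative code with made-up session data, not Spark's FP-growth implementation): a rule A → B is kept when the pair's support over all sessions and the rule's confidence both clear their thresholds.

```python
from collections import Counter
from itertools import combinations

# Illustrative support/confidence computation for association rules over
# click-stream sessions. Spark MLlib's FP-growth derives the frequent
# itemsets first; here we count pairs directly since the data is tiny.

sessions = [
    {"home", "laptops", "cart"},
    {"home", "laptops"},
    {"home", "phones"},
    {"home", "laptops", "cart"},
]

def rules(sessions, min_support=0.5, min_confidence=0.6):
    n = len(sessions)
    item_count = Counter(i for s in sessions for i in s)
    pair_count = Counter(p for s in sessions
                         for p in combinations(sorted(s), 2))
    out = []
    for (a, b), c in pair_count.items():
        for lhs, rhs in ((a, b), (b, a)):            # try both directions
            support, confidence = c / n, c / item_count[lhs]
            if support >= min_support and confidence >= min_confidence:
                out.append((lhs, rhs, support, confidence))
    return out

for lhs, rhs, s, c in rules(sessions):
    print(f"{lhs} -> {rhs} (support={s:.2f}, confidence={c:.2f})")
```

A rule such as "laptops -> cart" would then drive the recommendation: a visitor who clicked the laptops page is shown the cart-related suggestion.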
However, developers can seamlessly combine the implementations of these techniques found in Spark MLlib with the rest of the algorithms found in Spark ML to build hybrid or interoperable ML applications. More general brain-inspired neural network and perceptron learning algorithms, covering multiclass, two-class, and regression problems, are not yet fully implemented in the Spark ML APIs.