Spark machine learning libraries

In this section, we will describe the two main machine learning libraries (Spark MLlib and Spark ML) and the most widely used algorithms they implement. The ultimate goal is to give you some familiarity with the machine learning capabilities of Spark, since many people still think of Spark as only a general-purpose in-memory big data processing or cluster computing framework. This is not the case; the information here will help you understand what can be done with the Spark machine learning APIs, and it will make it easier for you to explore them and to deploy real-life machine learning pipelines using Spark MLlib and Spark ML.

Machine learning with Spark

In the pre-Spark era, big data modelers typically built their ML models using statistical languages such as R and SAS. A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong; in short, an ML model is an object that takes an input, does some processing, and finally produces an output. Data engineers then had to re-implement the same model in Java to deploy it on Hadoop. This kind of workflow lacks efficiency, scalability, throughput, and accuracy, and it suffers from extended execution times. With Spark, the same ML model can be built, adapted, and deployed in one place, making the whole workflow much more efficient, robust, and faster, and giving you hands-on insight into how to increase performance. The main goal of the Spark machine learning libraries is to make practical machine learning scalable, fast, and easy. They consist of common and widely used machine learning algorithms and their utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, and are divided into two packages: Spark MLlib (spark.mllib) and Spark ML (spark.ml).

Spark MLlib

MLlib is the machine learning library of Spark. It is a distributed, low-level library built on top of the Spark core runtime, with APIs available in Scala, Java, Python, and R. MLlib focuses on learning algorithms and their utilities, providing not only machine learning analytical capabilities but also optimized general-purpose primitives for developing large-scale machine learning pipelines. The major learning utilities include classification, regression, clustering, recommender systems, and dimensionality reduction. The main components of Spark MLlib are described in the following sections.

Data types

Spark provides support for local vector and matrix data types stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and matrices are simple data models that serve as public interfaces. The vector and matrix operations depend heavily on linear algebra, so you are recommended to gain some background knowledge in it before using these data types.
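
As a minimal sketch of these data types, the following Scala snippet builds a dense local vector, the equivalent sparse vector, and a small dense local matrix using the spark.mllib linear algebra package; the values are arbitrary illustration data:

    import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}

    // A dense local vector backed by a double array
    val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

    // The same vector in sparse form: size, indices of the non-zero entries, and their values
    val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    // A 3x2 dense local matrix; the values are supplied in column-major order
    val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))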

Basic statistics

Spark not only provides column summaries and basic statistics on RDDs, but also supports calculating the correlation between two series of data, as well as more complex operations such as pairwise correlations among many series, which are common operations in statistics. Currently, however, only Pearson's and Spearman's correlations are supported, with more expected in future Spark releases. Unlike the other statistical functions, stratified sampling is performed on RDDs of key-value pairs; note that some of its functionality is not yet available to Python developers.
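
As a small example, assuming an active SparkContext named sc and some hypothetical numeric data, column summary statistics and a Pearson correlation can be computed roughly as follows:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
    import org.apache.spark.rdd.RDD

    // Hypothetical RDD of feature vectors; replace with your own data
    val observations: RDD[Vector] = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))

    // Column-wise summary statistics (mean, variance, non-zero counts, and so on)
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
    println(summary.mean)
    println(summary.variance)

    // Pearson's correlation between two series of the same length
    val seriesX: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
    val seriesY: RDD[Double] = sc.parallelize(Array(11.0, 22.0, 33.0, 44.0))
    val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")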

For hypothesis testing, Spark provides Pearson's chi-squared tests for goodness of fit and independence, a powerful technique in statistics for determining whether a result is statistically significant enough to satisfy a claim. Spark also provides online implementations of some tests, such as streaming significance testing, to support use cases like A/B testing that are typically performed on real-time streaming data. Another useful feature of Spark is the set of factory methods for generating random double RDDs or vector RDDs, which are handy for randomized algorithms, prototyping, performance testing, and hypothesis testing. Finally, the current Spark MLlib provides facilities for computing kernel density estimates from sample RDDs, a useful technique for visualizing empirical probability distributions.
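
For instance, a chi-squared goodness-of-fit test on a vector of observed frequencies (illustrative values; without an expected vector, a uniform distribution is assumed) might look like this:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Hypothetical observed frequencies to be tested against a uniform distribution
    val observed = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)
    val goodnessOfFitResult = Statistics.chiSqTest(observed)
    println(goodnessOfFitResult)  // prints the statistic, degrees of freedom, method, and p-value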

Classification and regression

Classification is the process of organizing, differentiating, and understanding new data objects, that is, deciding which group they belong to, on the basis of training data. In statistical computing, two types of classification exist: binary classification (also commonly referred to as binomial classification) and multiclass classification. Binary classification is the task of classifying the data objects of a given observation into two groups. Support Vector Machines (SVMs), logistic regression, decision trees, random forests, gradient-boosted trees, and Naive Bayes are implemented for it as of the latest release of Spark.
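
A minimal binary classification sketch, assuming an active SparkContext named sc and a placeholder LIBSVM file path, could train a logistic regression model as follows:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    // Load labeled data in LIBSVM format (the path is a placeholder); labels must be 0 or 1
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_binary_classification_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 11L)

    // Train a binary logistic regression model with the L-BFGS optimizer
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

    // Pair each prediction on the held-out set with the true label for later evaluation
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      (model.predict(features), label)
    }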

Multiclass classification, on the other hand, is the task of classifying the data objects of a given observation into more than two groups. Logistic regression, decision trees, random forests, and Naive Bayes are implemented for multiclass classification. However, more complex classification algorithms, such as multi-level classification and the multiclass perceptron, have not been implemented yet. Regression analysis is also a statistical process, one that estimates the relationships among variables or observations; unlike classification, it models a continuous outcome and involves several techniques for modeling and analyzing data objects. Currently, the following algorithms are supported by the Spark MLlib library (a short regression sketch follows the list):

  • Linear least squares
  • Lasso
  • Ridge regression
  • Decision trees
  • Random forests
  • Gradient-boosted trees
  • Isotonic regression
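
As a rough sketch of the regression side, here is a linear least squares model trained with stochastic gradient descent on a few hypothetical labeled points, again assuming a SparkContext named sc:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Hypothetical labeled points: a continuous label followed by a feature vector
    val parsed = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
      LabeledPoint(2.0, Vectors.dense(2.0, 1.0)),
      LabeledPoint(3.0, Vectors.dense(3.0, 1.5))))

    // Train a linear least squares model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsed, numIterations)

    // Mean squared error on the training data as a quick sanity check
    val mse = parsed.map { p =>
      val err = model.predict(p.features) - p.label
      err * err
    }.mean()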

Recommender system development

An intelligent and scalable recommender system is an emerging application that is currently being developed by many enterprises to expand their business and to automate recommendations for their customers. The collaborative filtering approach is the most widely used algorithm in recommender systems; it aims to fill in the missing entries of a user-item association matrix. Netflix is a well-known example, whose movie recommendation engine is reported to be worth several million dollars to its business. However, the current implementation of Spark MLlib provides only the model-based collaborative filtering technique.

The advantage of a model-based collaborative filtering algorithm is that users and products can be described by a small set of latent factors, which are then used to predict the missing entries with the Alternating Least Squares (ALS) algorithm. The disadvantage is that user ratings or feedback cannot be taken directly into consideration when predicting an interest. Interestingly, open source developers are also working on a memory-based collaborative filtering technique to be incorporated into Spark MLlib, in which user rating data could be used to compute the similarity between users or items, making the ML models more versatile.
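
A minimal model-based collaborative filtering sketch with ALS, using a few hypothetical (user, product, rating) triples and an assumed SparkContext named sc, looks roughly like this:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical explicit ratings: (user ID, product ID, rating)
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0),
      Rating(1, 20, 3.0),
      Rating(2, 10, 5.0),
      Rating(2, 30, 1.0)))

    // Factorize the user-item matrix; rank is the number of latent factors
    val rank = 10
    val numIterations = 10
    val lambda = 0.01
    val model = ALS.train(ratings, rank, numIterations, lambda)

    // Predict the missing entry for user 1 and product 30
    val predictedRating = model.predict(1, 30)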

Clustering

Clustering is an unsupervised machine learning problem and technique. Its aim is to group subsets of entities with one another based on some notion of similarity; it is often used for exploratory analysis and for developing hierarchical supervised learning pipelines. Spark MLlib provides support for various clustering models, such as K-means, Gaussian mixture, Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA), bisecting K-means, and streaming K-means for real-time streaming data. We will discuss supervised, unsupervised, and reinforcement learning in more detail in upcoming chapters.
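
For example, a K-means clustering of a handful of hypothetical two-dimensional points, assuming a SparkContext named sc, can be sketched as:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical two-dimensional points forming two obvious groups
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // Cluster the data into two groups
    val numClusters = 2
    val numIterations = 20
    val clusters = KMeans.train(points, numClusters, numIterations)

    // Within Set Sum of Squared Errors, a rough measure of clustering quality
    val wssse = clusters.computeCost(points)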

Dimensionality reduction

Working with high-dimensional data is both appealing and demanding, given the complexities that come with big data. One of the problems with high-dimensional data is the presence of unwanted features or variables. Since not all of the measured variables are necessarily important for building the model or answering the questions of interest, you might need to reduce the search space. Therefore, based on certain considerations or requirements, we need to reduce the dimensionality of the original data before building any model, without sacrificing its original structure.

The current implementation of the MLlib API supports two dimensionality reduction techniques: Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), both for tall-and-skinny matrices stored in a row-oriented format and for vectors. The SVD technique has some performance issues; PCA, however, is the most widely used technique for dimensionality reduction. These two techniques are very useful in large-scale ML applications, but they require a strong background in linear algebra.
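
A small sketch of both techniques on a hypothetical tall-and-skinny RowMatrix, assuming a SparkContext named sc, might look like this:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Hypothetical row-oriented matrix of observations
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 9.0)))
    val mat = new RowMatrix(rows)

    // Top 2 singular values and the corresponding singular vectors
    val svd = mat.computeSVD(2, computeU = true)

    // Top 2 principal components, then project the rows onto the reduced space
    val pc = mat.computePrincipalComponents(2)
    val projected = mat.multiply(pc)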

Feature extraction and transformation

Spark provides different techniques for making feature engineering easier, including Term frequency-inverse document frequency (TF-IDF), Word2Vec, StandardScaler, ChiSqSelector, and so on. If you are working, or planning to work, on a text mining ML application, TF-IDF would be an interesting option from Spark MLlib. TF-IDF provides a feature vectorization method that reflects the importance of a term to a document in the corpus, which is very helpful for developing a text analytics pipeline.
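
A TF-IDF sketch over a tiny hypothetical corpus, assuming a SparkContext named sc, is shown below; each document is simply a sequence of terms:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.rdd.RDD

    // Hypothetical corpus: each document is a sequence of terms
    val documents: RDD[Seq[String]] = sc.parallelize(Seq(
      "spark makes big data simple".split(" ").toSeq,
      "machine learning with spark mllib".split(" ").toSeq))

    // Hash each term into a fixed-size feature space to get term frequencies
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(documents)
    tf.cache()

    // Re-weight the term frequencies by inverse document frequency
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)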

In addition, you might be interested in using Word2Vec, which computes distributed vector representations of the words in a corpus, for the text analysis in your ML application. This feature of Word2Vec will eventually make your generalization and model estimation more robust when faced with novel patterns. You also have StandardScaler, which normalizes extracted features by scaling them to unit variance and/or removing the mean, based on column summary statistics; this pre-processing step is needed when building a scalable ML application and is typically performed on the samples in the training dataset. Now suppose you have extracted features through these methods; you will next need to select which features to incorporate into your ML model. For feature selection, you might therefore be interested in the ChiSqSelector algorithm of Spark MLlib, which tries to identify the relevant features during model building. The reason is obviously to reduce the size of the feature space, and hence of the search space in tree-based approaches, and to improve both the speed and the statistical learning behavior of the algorithms.
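
For instance, feature standardization with StandardScaler could be sketched as follows, assuming a SparkContext named sc and a placeholder LIBSVM file path:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils

    // Load labeled points (the path is a placeholder)
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // Fit the scaler on the training features; withMean requires dense vectors
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))

    // Produce features with zero mean and unit variance
    val scaledData = data.map { p =>
      (p.label, scaler.transform(Vectors.dense(p.features.toArray)))
    }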

Frequent pattern mining

Mining frequent items, maximal frequent patterns/itemsets, contiguous frequent patterns or subsequences, and other substructures is usually among the first steps in analyzing a large-scale dataset, before starting to build ML models. The current implementation of Spark MLlib provides a parallel implementation of FP-growth for mining frequent patterns and association rules. It also provides an implementation of another popular algorithm, PrefixSpan, for mining sequential patterns. However, you will have to customize these algorithms accordingly for mining maximal frequent patterns. We will present a scalable ML application for mining privacy-preserving maximal frequent patterns in upcoming chapters.
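
A minimal FP-growth sketch over a few hypothetical transactions, assuming a SparkContext named sc, is given below; the minimum support threshold of 0.6 is an arbitrary illustration value:

    import org.apache.spark.mllib.fpm.FPGrowth

    // Hypothetical transactions: each is an array of items
    val transactions = sc.parallelize(Seq(
      Array("bread", "milk", "butter"),
      Array("bread", "milk"),
      Array("milk", "butter")))

    // Mine itemsets that appear in at least 60% of the transactions
    val fpg = new FPGrowth().setMinSupport(0.6).setNumPartitions(2)
    val model = fpg.run(transactions)

    // Print each frequent itemset together with its frequency
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }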

Spark ML

Spark ML is an ALPHA component that adds a new set of machine learning APIs, letting users quickly assemble and configure practical machine learning pipelines on top of DataFrames. Before praising the features and advantages of Spark ML, we should understand DataFrames, on which machine learning techniques can be applied and developed for a wide variety of data types, such as vectors, unstructured data (that is, raw text), images, and structured data. In order to support this variety of data types and make application development easier, Spark ML has recently adopted the DataFrame and Dataset abstractions from Spark SQL.

A DataFrame or Dataset can be created either implicitly or explicitly from an RDD of objects of basic or structured types. The goal of Spark ML is to provide a uniform set of high-level APIs built on top of DataFrames and Datasets, rather than RDDs, to help users create and tune practical machine learning pipelines. Spark ML also provides feature estimators and transformers for developing scalable ML pipelines, and it systematizes many ML algorithms and APIs so that it becomes even easier to combine multiple algorithms into a single pipeline, or data workflow, using the DataFrame and Dataset concepts.

The three basic steps in feature engineering are feature extraction, feature transformation, and feature selection, and Spark ML provides implementations of several algorithms to make these steps easier. Extraction provides the facility for extracting features from raw data, transformation provides the facility for scaling, converting, or modifying the features found in the extraction step, and selection helps to pick a subset from the larger set of features produced by the second step. Spark ML also provides several classification (logistic regression, decision tree classifier, random forest classifier, and more), regression (linear regression, decision tree regression, random forest regression, survival regression, and gradient-boosted tree regression), decision tree and tree ensemble (random forests and gradient-boosted trees), and clustering (K-means and LDA) algorithms implemented for developing ML pipelines on top of DataFrames; a minimal pipeline sketch follows. We will discuss RDDs and DataFrames and their underlying operations in more detail in Chapter 3, Understanding the Problem by Understanding the Data.
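
The following is a minimal text classification pipeline sketch in the spirit of Spark ML, assuming a SparkSession named spark and a tiny hypothetical training DataFrame; it chains a tokenizer, a feature hasher, and a logistic regression estimator into one pipeline:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Hypothetical training DataFrame with id, text, and label columns
    val training = spark.createDataFrame(Seq(
      (0L, "spark makes big data simple", 1.0),
      (1L, "hadoop mapreduce batch jobs", 0.0),
      (2L, "spark mllib pipelines", 1.0),
      (3L, "plain batch processing", 0.0))).toDF("id", "text", "label")

    // Declare the stages: tokenize the text, hash the tokens into features, then fit a classifier
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting the pipeline runs each stage in order and produces a single PipelineModel
    val model = pipeline.fit(training)

Once fitted, the same PipelineModel can be applied to new data with model.transform, which is what makes combining multiple algorithms into a single data workflow straightforward.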
