Chapter 10.  Configuring and Working with External Libraries

This chapter shows you how to use external libraries to expand your data analysis and make Spark more versatile. Examples will be given for deploying third-party packages and libraries for machine learning applications with Spark Core and ML/MLlib. We will also discuss how to compile and use external libraries with the core libraries of Spark for time series analysis. As promised, we will also discuss how to configure SparkR for improved exploratory data manipulation and operations. In a nutshell, the following topics will be covered throughout this chapter:

  • Third-party ML libraries with Spark
  • Using external libraries when deploying Spark ML on a cluster
  • Time series analysis using the Spark-TS package of Cloudera
  • Configuring SparkR with RStudio
  • Configuring Hadoop run-time on Windows

In order to provide a user-friendly environment for the developer, it is also possible to incorporate third-party APIs and libraries with Spark Core and other APIs such as Spark MLlib/ML, Spark Streaming, GraphX, and so on. Interested readers should refer to the following site, which is listed on the Spark website as Third-Party Packages: https://spark-packages.org/.

This website is a community index of third-party packages for Apache Spark. To date, there are a total of 252 packages registered on this site, as shown in Table 1:

Table 1: Third-party libraries for Spark based on application domain
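Packages listed on this index can typically be pulled in at launch time with Spark's `--packages` option, which downloads the package and its dependencies from Maven Central or the Spark Packages repository. A minimal sketch follows; the `spark-csv` coordinates are a real example from the index, while `com.example.MyApp` and `myapp.jar` are illustrative placeholders:

```shell
# Launch the Spark shell with a third-party package from spark-packages.org.
# Coordinates follow the groupId:artifactId:version Maven convention; check
# each package's page on https://spark-packages.org/ for current coordinates.
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

# The same flag works with spark-submit when deploying an application
# (application class and JAR names here are placeholders):
spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 \
  --class com.example.MyApp \
  myapp.jar
```

The downloaded JARs are placed on both the driver and executor classpaths automatically, which is usually more convenient than shipping JARs by hand with `--jars`.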

Third-party ML libraries with Spark

The 55 third-party machine learning libraries include libraries for neural data analysis, generalized clustering, streaming, topic modelling, feature selection, matrix factorization, distributed DataFrames for distributed ML, model matrices, a Stanford CoreNLP wrapper for Spark, social network analysis, deep learning, assemblies of fundamental statistics, binary classifier calibration, and tokenizers for DataFrames.

Table 2 provides a summary of the most useful packages based on use cases and application areas of machine learning. Interested readers should visit the respective websites for more insights:

  • thunder, ScalaNetwork (Neural networks): thunder supports large-scale neural data analysis with Spark; ScalaNetwork is a neural network implementation written in Scala.

  • generalized-kmeans-clustering, patchwork, bisecting-kmeans, spark-knn (Clustering): generalized-kmeans-clustering generalizes the Spark MLlib K-means clusterer to support arbitrary distance functions; patchwork is a highly scalable grid-density clustering algorithm for Spark MLlib; bisecting-kmeans is a prototype implementation of bisecting K-means clustering on Spark; spark-knn implements the k-nearest neighbors algorithm on Spark.

  • spark-ml-streaming, streaming-matrix-factorization, twitter-stream-ml (Streaming): spark-ml-streaming visualizes streaming machine learning in Spark; streaming-matrix-factorization is a streaming recommendation engine using matrix factorization with user and product bias; twitter-stream-ml performs machine learning over Twitter's stream using Apache Spark, a web server, and the Lightning graph server.

  • pipeline (Docker-based pipelining): an end-to-end, real-time, advanced analytics big data reference pipeline using Spark, Spark SQL, Spark Streaming, ML, MLlib, GraphX, Kafka, Cassandra, Redis, Apache Zeppelin, Spark Notebook, IPython/Jupyter Notebook, Tableau, H2O Flow, and Tachyon.

  • dllib, CaffeOnSpark, dl4j-spark-ml (Deep learning): dllib is a deep learning tool running on Apache Spark; users download the tool as a .jar and can then integrate it with Spark to develop deep-learning-based applications. CaffeOnSpark is a scalable deep learning framework that runs within the Spark executors, based on peer-to-peer (P2P) communication. dl4j-spark-ml can be used to develop deep-learning-based ML applications by integrating with Spark ML.

  • kNN_IS, sparkboost, spark-calibration (Classification): kNN_IS is an iterative Spark-based design of the k-nearest neighbors classifier for big data; sparkboost is a distributed implementation of AdaBoost.MH and MP-Boost using Apache Spark; spark-calibration assesses binary classifier calibration (that is, how well classifier outputs match observed class proportions) in Spark.

  • Zen (Regression): Zen provides a platform for large-scale and efficient machine learning on top of Spark; for example, logistic regression, linear regression, Latent Dirichlet Allocation (LDA), factorization machines, and Deep Neural Networks (DNNs) are implemented in the current release.

  • modelmatrix, spark-infotheoretic-feature-selection (Feature engineering): modelmatrix provides robust feature engineering through pipelines of feature extractors and feature selectors, focused on building sparse feature-vector-based pipelines. spark-infotheoretic-feature-selection, on the other hand, is a feature selection framework based on Information Theory, with algorithms including mRMR, InfoGain, JMI, and other commonly used FS filters.

  • spark-knn-graphs (Graph processing): Spark algorithms for building and processing k-NN graphs.

  • TopicModeling (Topic modelling): distributed topic modelling on Apache Spark.

  • Spark.statistics (Statistics): apart from SparkR, Spark.statistics works as an assembler of basic statistics implementations based on the Spark core.

Table 2: Summary of the most useful third-party packages based on use cases and application areas of machine learning with Spark
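When one of these packages needs to be compiled into an application rather than passed at launch time, it can usually be declared as a build dependency. A minimal build.sbt sketch follows; the project name, Spark version, and resolver URL are illustrative, so check each package's page on spark-packages.org for its current coordinates:

```scala
// build.sbt -- a minimal sketch; versions and coordinates are illustrative.
name := "third-party-ml-demo"

scalaVersion := "2.10.6"

// Spark itself is marked "provided" because the cluster runtime supplies it;
// third-party packages are regular compile-scope dependencies.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided"
)

// Many packages on spark-packages.org are published to this resolver
// rather than Maven Central.
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
```

Building with `sbt package` (or `sbt assembly` for a fat JAR that bundles the third-party code) then produces an artifact ready for `spark-submit`.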
