In Chapter 2, Machine Learning Best Practices, readers learned some of the theoretical underpinnings of basic machine learning techniques. Chapter 3, Understanding the Problem by Understanding the Data, described basic data manipulation using Spark's APIs such as RDD, DataFrame, and Dataset. Chapter 4, Extracting Knowledge through Feature Engineering, described feature engineering from both the theoretical and the practical point of view. In this chapter, the reader will learn the practical know-how needed to apply supervised and unsupervised techniques quickly and effectively to new problems on the available data, through some widely used examples that build on the understanding gained in the previous chapters. These examples will be demonstrated from the Spark perspective. In a nutshell, the following topics will be covered throughout this chapter:
As stated in Chapter 1, Introduction to Data Analytics with Spark, and Chapter 2, Machine Learning Best Practices, machine learning techniques can be categorized into three major classes of algorithms: supervised learning, unsupervised learning, and recommender systems. Classification and regression algorithms are widely used in supervised learning application development, whereas clustering falls into the category of unsupervised learning. In this section, we will describe some examples of the supervised learning technique.
We will then show how the same examples can be implemented using Spark. An example of the clustering technique is discussed in the section Unsupervised learning. A regression technique often models the past relationship between variables to predict their future changes (up or down). In contrast, a classification technique takes a set of data with known labels and learns how to label new records based on that information. Here are two real-life examples of classification and regression algorithms, respectively:
On the other hand, clustering and dimensionality reduction are commonly used for unsupervised learning. Here are some examples:
As already stated, a supervised learning application makes predictions based on a set of examples, and the goal is to learn general rules that map inputs to outputs in line with the real world. For example, a dataset for spam filtering usually contains spam messages as well as non-spam messages, so we know which messages in the training set are spam and which are not. Supervised learning is therefore the machine learning technique of inferring a function from labeled training data. The following steps are involved in supervised learning tasks:
The dataset used to train the ML model is, in this case, labeled with the value of interest, and a supervised learning algorithm looks for patterns in those labels. Once the algorithm has found the required patterns, they can be used to make predictions for unlabeled test data.
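To make the train-then-predict idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a tiny, made-up set of labeled messages and uses a simple naive Bayes classifier with add-one smoothing as one possible supervised algorithm. The names `TRAINING_DATA`, `train`, and `predict` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training set of (message, label) pairs; not a real corpus.
TRAINING_DATA = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("lunch with the project team on monday", "non-spam"),
]


def train(examples):
    """Count words per label; these counts are the learned 'patterns'."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training messages
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the most probable label (naive Bayes with add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Smoothed log likelihood of each word given the label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
print(predict(model, "claim your free prize"))        # unlabeled message
print(predict(model, "agenda for the team meeting"))  # unlabeled message
```

With this toy data, the first unlabeled message shares its words with the spam examples and the second with the non-spam examples, so the learned counts drive the two predictions in opposite directions. A real application would need far more data and a held-out test set to judge accuracy.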
Supervised learning has diverse uses and is commonly applied in bioinformatics, cheminformatics, database marketing, handwriting recognition, information retrieval, object recognition in computer vision, optical character recognition, spam detection, pattern recognition, speech recognition, and so on; in most of these applications, the classification technique is used. Supervised learning has also been described as a special case of downward causation in biological systems.
More on how the supervised learning technique works from the theoretical perspective can be found in these books: Vapnik, V. N., The Nature of Statistical Learning Theory (2nd Ed.), Springer Verlag, 2000; and Mohri, M., Rostamizadeh, A., and Talwalkar, A., Foundations of Machine Learning, The MIT Press, 2012, ISBN 9780262018258.
Classification is a family of supervised machine learning algorithms that designate an input as belonging to one of several predefined classes. Some common use cases for classification include:
Classification data is labeled, for example, as spam/non-spam or fraud/non-fraud. Machine learning assigns a label or class to new data. You classify something based on predetermined features. Features are the "if" questions that you ask, and the label is the answer to those questions. For example, if an object walks, swims, and quacks like a duck, then the label would be duck. Similarly, if a flight's departure or arrival is late by more than, say, 1 hour, it would be labeled as a delay; otherwise, it would be labeled as not a delay.
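The relationship between features and a label in the flight example can be sketched as follows. This is only an illustration of the labeling rule described above, not a trained classifier. The `Flight` class, its field names, and `label_flight` are hypothetical, and the 60-minute threshold simply encodes the "say 1 hour" from the text.

```python
from dataclasses import dataclass

# Assumed threshold: the "say 1 hour" mentioned in the text.
DELAY_THRESHOLD_MINUTES = 60


@dataclass
class Flight:
    """Hypothetical feature record for one flight."""
    departure_delay_minutes: float
    arrival_delay_minutes: float


def label_flight(flight, threshold=DELAY_THRESHOLD_MINUTES):
    """Answer the 'if' question: is either delay above the threshold?"""
    worst_delay = max(flight.departure_delay_minutes,
                      flight.arrival_delay_minutes)
    return "delay" if worst_delay > threshold else "not a delay"


print(label_flight(Flight(departure_delay_minutes=75, arrival_delay_minutes=10)))
print(label_flight(Flight(departure_delay_minutes=5, arrival_delay_minutes=60)))
```

In a supervised setting, labels produced this way would form the training data, and the classifier would then learn to predict the label for new flights from other features known before departure.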