"What machine learning algorithm should I use?" is a very frequently asked question from novice machine learning practitioners, but the answer is always "it depends." More elaborately:
The reality is that even the most experienced data scientists or data engineers can't give a straight recommendation about which ML algorithm performs best before trying them all. Most statements of agreement or disagreement begin with "It depends on...". Habitually, you might wonder whether there are cheat sheets of machine learning algorithms and, if so, how to use them. Several data scientists we talked to said that the only sure way to find the very best algorithm is to try all of them; so there is no shortcut, dude! Let's make it clear: suppose you have a set of data and you want to do some clustering. Depending on whether your data is labeled or unlabeled, the task could just as well be classification or regression. Now, the first concern that arises in your mind is: which algorithm should I choose?
You will always expect the best answer, one that is well justified and explains everything someone should consider. In this section, we will try to answer these questions with our modest machine learning knowledge.
The recommendations or suggestions we provide here are aimed at everyone from novice data scientists just learning machine learning to expert data scientists trying to choose an optimal algorithm to start with the Spark ML APIs. That means they involve some generalizations and oversimplifications, but they will point you in a safe direction, believe us! Suppose you are planning to develop an ML system to answer questions based on rules of the following form:
IF feature X has property Z THEN do Y

Affirmatively, there should also be rules such as:

IF feature X has property Z THEN it is sensible to try Y using property Z and avoid W

However, what is sensible and what is not depends on several factors, such as the accuracy you need, the training time you can afford, and the properties of your data:
Moreover, if you want a general answer to a general problem, we recommend The Elements of Statistical Learning (Hastie Trevor, Tibshirani Robert, Friedman Jerome, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2009) for a fresh start. Nevertheless, we also recommend weighing the following algorithmic properties: accuracy, training time, and linearity.
Getting the most accurate results from your ML application isn't always essential. Depending on what you want to use it for, sometimes an approximation is adequate. If that is the case, you may be able to reduce the processing time drastically by adopting approximate methods. Once you are familiar with the workflow of the Spark machine learning APIs, you will enjoy the advantage of having more approximation methods at hand, because those approximation methods tend to keep the overfitting problem out of your ML model automatically.
The execution time required to finish the data preprocessing and build the model varies a great deal across algorithms, their inherent complexity, and, of course, their robustness. Training time is often closely related to accuracy. In addition, you will often discover that some algorithms are more sensitive to the number of data points than others. However, when time is at a premium, and especially when the dataset is large, it can drive the choice of algorithm. Therefore, if you are particularly concerned with time, try sacrificing some accuracy or performance and use a simple algorithm that fulfils your minimum requirements.
Many recently developed machine learning algorithms make use of linearity (also available in Spark MLlib and Spark ML). For example, linear classification algorithms assume that classes can be separated by a straight line, or by its higher-dimensional equivalent. A linear regression algorithm, on the other hand, assumes that data trends follow a simple straight line. These assumptions are not naive for some machine learning problems; however, there are other cases where accuracy will suffer. Despite their hazards, linear algorithms are very popular with data engineers and data scientists as a first line of attack. Moreover, these algorithms tend to be algorithmically simple and fast, so model training is quick.
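As a quick illustration, the following is a minimal sketch of fitting a linear regression model with the Spark ML API. The input path and the parameter values (iteration count, regularization strength, elastic net mix) are placeholder assumptions for illustration only:

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LinearRegressionExample")
  .master("local[*]")
  .getOrCreate()

// Load a labeled dataset in LibSVM format (placeholder path)
val training = spark.read.format("libsvm")
  .load("data/sample_linear_regression_data.txt")

// Linear regression assumes the label follows a straight-line (hyperplane) trend
val lr = new LinearRegression()
  .setMaxIter(10)           // maximum optimization iterations (assumed value)
  .setRegParam(0.3)         // regularization strength (assumed value)
  .setElasticNetParam(0.8)  // L1/L2 mixing parameter (assumed value)

val lrModel = lr.fit(training)

// Inspect the fitted hyperplane and the training error
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
println(s"RMSE: ${lrModel.summary.rootMeanSquaredError}")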
You will find many machine learning datasets available for free at http://machinelearningmastery.com/tour-of-real-world-machine-learning-problems/ or at the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/). The following data and parameter properties should also be considered first:
Parameters are the knobs that a data scientist like you gets to turn when setting up an algorithm. They are numbers that affect the algorithm's performance, such as the error tolerance or the number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting these settings right. Typically, algorithms with a large number of parameters require trial and error to find an optimal combination.
Despite the fact that this is a great way to span the parameter space, the model building or training time increases exponentially with the number of parameters. This is both a dilemma and a time-performance trade-off. On the positive side, having many parameters characteristically indicates greater flexibility in an ML algorithm, and, secondly, your ML application can achieve much better accuracy.
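To see how quickly the parameter space grows, consider a minimal grid-search sketch using Spark ML's tuning utilities; the grid values, fold count, and input path below are illustrative assumptions. With three values for one parameter and two for another, 3 x 2 = 6 candidate models are trained per fold:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParameterTuningExample")
  .master("local[*]")
  .getOrCreate()

val training = spark.read.format("libsvm")
  .load("data/sample_libsvm_data.txt") // placeholder path

val lr = new LogisticRegression()

// Every added parameter multiplies the number of candidate models
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01, 0.001))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

// 3-fold cross-validation: 6 candidates x 3 folds = 18 training runs
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val bestModel = cv.fit(training).bestModel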
If your training set is small, high-bias, low-variance classifiers such as Naive Bayes have an advantage over low-bias, high-variance classifiers such as kNN, since the latter will overfit. Low-bias, high-variance classifiers, on the other hand, start to win out as your training set grows, since they have lower asymptotic error; high-bias classifiers simply aren't powerful enough to provide accurate models at that point. You can also think of this as a trade-off between generative and discriminative models.
For certain types of experimental datasets, the number of extracted features can be very large compared to the number of data points itself. This is often the case with genomics, biomedical, or textual data. A large number of features can swamp some learning algorithms, making training time ridiculously long. Support vector machines are particularly well suited to this case, thanks to their high accuracy, nice theoretical guarantees regarding overfitting, and an appropriate kernel.
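When the feature count becomes a burden like this, dimensionality reduction before training can also help. The following is a minimal sketch using Spark ML's PCA transformer on a tiny made-up dataset; the vectors and the choice of k = 3 are purely illustrative:

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PCAExample")
  .master("local[*]")
  .getOrCreate()

// A toy dataset of 5-dimensional feature vectors (made-up values)
val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  Vectors.dense(6.0, 1.0, 9.0, 2.0, 8.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

// Project the 5-dimensional features down to 3 principal components
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

pca.transform(df).select("pcaFeatures").show(false)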
In this section, we provide some special notes on the most commonly used machine learning algorithms and techniques. The techniques we emphasize are logistic regression, linear regression, recommender systems, SVM, decision trees, random forests, Bayesian methods, and decision forests, decision jungles, and their variants. Table 3 shows the pros and cons of some widely used algorithms, including where and when to choose them.
Linear regression (LR)
Pros: Very fast, often running in near-constant time, so model building time is low; easy to understand; intrinsically simple; less prone to overfitting and underfitting; has low variance.
Cons: Often unsuitable for complex data modelling; unable to capture nonlinear relationships without transforming the input dataset; works better with only a single decision boundary; requires a large sample size to achieve stable results; high bias.
Better at: Numerical datasets with a large collection of features; widely used in the biological, behavioral, and social sciences to predict possible relationships among variables; works well for numerical as well as categorical variables; used in various fields, including the medical and social sciences.

Decision trees (DT)
Pros: Low model building and prediction time; robust against noise and missing values; high accuracy.
Cons: Interpretation is hard with large and complex trees; duplication may occur within the same sub-tree; possible issues with diagonal decision boundaries.
Better at: Highly accurate classification; medical diagnosis and prognosis; credit risk analytics.

Neural networks (NN)
Pros: Extremely powerful and robust; capable of modelling very complex relationships; can work without knowledge of the underlying data.
Cons: Prone to both overfitting and underfitting; high training and prediction time; computationally expensive, requiring significant computing power; the model is not readable or reusable.
Better at: Image processing; video processing; human-intelligence tasks; robotics; deep learning.

Random forest (RF)
Pros: Good for bagged trees; low variance; high accuracy; can handle the overfitting problem.
Cons: Not easy to visualize and interpret; high training and prediction time.
Better at: Dealing with multiple features that may be correlated; biomedical diagnosis and prognosis; applicable to both classification and regression.

Support vector machines (SVM)
Pros: High accuracy.
Cons: Susceptible to overfitting and underfitting; no numerical stability; computationally expensive, requiring large computing power.
Better at: Image classification; handwriting recognition.

K-nearest neighbors (K-NN)
Pros: Simple and powerful; lazy training; applicable to both multiclass classification and regression.
Cons: High training and prediction time; needs an accurate distance function; low performance on high-dimensional datasets.
Better at: Low-dimensional datasets; anomaly detection such as outlier detection; fault detection in semiconductors; gene expression; protein-protein interaction.

K-means
Pros: Linear execution time; performs better than hierarchical clustering; excellent with hyper-spherical clusters.
Cons: Results are not repeatable and lack consistency; requires prior knowledge of K.
Better at: Large datasets; not a good choice if the natural clusters occurring in the dataset are non-spherical.

Latent Dirichlet Allocation (LDA)
Pros: Applicable to large-scale text datasets; can overcome the overfitting problem of pLSA; applicable to both document classification and clustering through topic modelling.
Cons: Cannot cope with highly dimensional and complex text databases; requires the number of topics to be specified; cannot find the optimum level of granularity; the Hierarchical Dirichlet Process (HDP) is the better choice in such cases.
Better at: Document classification and clustering through topic modelling on large-scale text datasets; applicable in NLP and other text analytics.

Naive Bayes (NB)
Pros: Computationally fast; simple to implement; works well with high dimensions; can handle missing values; adaptable, since the model can be updated with new training data without rebuilding it.
Cons: Relies on the independence assumption, so performs badly if the assumption is not met; relatively low accuracy.
Better at: Data with lots of missing values; cases where the dependencies of features on each other are similar between features; spam filtering and classification; classifying a news article as about technology, politics, or sports; text mining.

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)
Pros: Reflect real intuitions about the data; allow estimation of probabilities in high-dimensional data; dramatic reduction in the size of the data; both rest on strong linear algebra.
Cons: Too expensive for many applications such as Twitter and web analytics; disastrous for tasks with fine-grained classes; need a proper understanding of the linearity; complexity is often cubic, so computationally slower.
Better at: SVD is applied to low-rank matrix approximation, image processing, bioinformatics, signal processing, and NLP; PCA is used for interest rate derivatives portfolios, neuroscience, and so on; both are suitable for high-dimensional, multivariate data.
Table 3: Pros and cons of some widely used algorithms
Logistic regression is a powerful tool, widely used for two-class and multiclass classification, since it's fast as well as simple. It uses an S-shaped curve instead of a straight line, making it a natural fit for partitioning data into groups. It provides linear class boundaries, so when you use it, make sure a linear approximation is something you can live with. Unlike decision trees or SVMs, it also has a nice probabilistic interpretation, so you will be able to update your model easily to adapt to new datasets.
Therefore, the recommendation is: use it if you want the flavor of a probabilistic framework, or if you expect to receive more training data in the future that you will incorporate into your model. As mentioned previously, linear regression fits a line, plane, or hyperplane to the dataset. It's a workhorse, simple and fast, but it may be overly simplistic for some problems.
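A minimal sketch of two-class logistic regression with the Spark ML API follows; the input path and parameter values are placeholder assumptions:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LogisticRegressionExample")
  .master("local[*]")
  .getOrCreate()

// Binary-labeled dataset in LibSVM format (placeholder path)
val training = spark.read.format("libsvm")
  .load("data/sample_libsvm_data.txt")

// The sigmoid (S-shaped) curve yields linear class boundaries
val lr = new LogisticRegression()
  .setMaxIter(10)    // assumed value
  .setRegParam(0.01) // assumed value

val model = lr.fit(training)

// The probabilistic interpretation: each prediction carries a class probability
model.transform(training)
  .select("label", "probability", "prediction")
  .show(5)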
We have already talked about the accuracy and performance issues of the most used ML algorithms and tools. However, beyond accuracy, research on recommender systems is also concerned with other factors, such as diversity. A recommendation system with good accuracy and high intra-list diversity will therefore be the winner, and, as a result, your product will be precious to your target customers. It is, however, more effective to let users re-rate items, rather than showing only new items. If your clients have extra requirements that need to be fulfilled, such as privacy or security, your system has to be able to deal with the related privacy issues.
This is particularly important because customers have to provide some personal information, so it is recommended not to expose that sensitive information publicly.
Building user profiles using robust techniques or algorithms such as collaborative filtering, on the other hand, can be problematic from the privacy perspective. Moreover, research in this area has found that user demographics may influence how satisfied users are with recommendations (see Joeran Beel, Stefan Langer, Andreas Nürnberger, and Marcel Genzmehr, The Impact of Demographics (Age and Gender) and Other User Characteristics on Evaluating Recommender Systems, in Trond Aalberg, Milena Dobreva, Christos Papatheodorou, Giannis Tsakonas, and Charles Farrugia (eds.), Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, Springer, pp. 400-404, 2013).
Although serendipity, a measure of how surprising the recommendations are, is crucial, trust in the recommender system ultimately needs to be built. This can be done by explaining how the system generates its recommendations, and why it recommends an item even when it has little demographic information from the user.
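As a concrete starting point, here is a minimal collaborative filtering sketch using Spark ML's ALS (alternating least squares) recommender; the user/item IDs, ratings, column names, and parameter values are made-up assumptions:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ALSExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up explicit ratings: (userId, itemId, rating)
val ratings = Seq(
  (0, 10, 4.0f), (0, 11, 1.0f),
  (1, 10, 5.0f), (1, 12, 2.0f),
  (2, 11, 3.0f), (2, 12, 4.0f)
).toDF("userId", "itemId", "rating")

// Matrix factorization via alternating least squares
val als = new ALS()
  .setMaxIter(5)     // assumed value
  .setRegParam(0.01) // assumed value
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")

val model = als.fit(ratings)

// Predicted ratings for the known user-item pairs
model.transform(ratings).show()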
If users do not trust the system at all, they will not provide any demographic information and will not re-rate items. As for SVMs, according to Cawley et al. (G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010), Support Vector Machines have several advantages, such as high accuracy and nice theoretical guarantees regarding overfitting.
These promising features of SVM really would help you, and it is suggested that you use it frequently. On the other hand, the cons are that it is susceptible to overfitting and underfitting, lacks numerical stability, and is computationally expensive, requiring large computing power (see Table 3).
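If you want to try an SVM with the Spark ML API, recent Spark releases (2.2 and later) include a linear SVM classifier, LinearSVC; earlier releases offer SVMWithSGD in the older spark.mllib package. A minimal LinearSVC sketch follows, with a placeholder path and assumed parameter values:

import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LinearSVCExample")
  .master("local[*]")
  .getOrCreate()

val training = spark.read.format("libsvm")
  .load("data/sample_libsvm_data.txt") // placeholder path

// A linear SVM maximizes the margin between the two classes
val lsvc = new LinearSVC()
  .setMaxIter(10)   // assumed value
  .setRegParam(0.1) // regularization guards against overfitting (assumed value)

val model = lsvc.fit(training)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")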
Decision trees are cool because of their usability: they make it easy to interpret and explain the machine learning problem at hand. In parallel, they can easily handle feature interactions. Most importantly, they are often non-parametric. Therefore, even if you are an ordinary data scientist with limited working proficiency, you don't need to worry about issues such as outliers, parameter setting, and tuning. You can also rely on decision trees to relieve you of the stress of handling data linearity; more technically, whether your data is linearly separable or not, you need not be worried. On the contrary, there are some cons as well: for example, interpretation becomes hard with large and complex trees, and duplication may occur within the same sub-tree (see Table 3).
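Here is a minimal decision tree sketch with the Spark ML API; it omits the feature/label indexing stages used in the fuller Spark examples, and the input path and depth value are assumptions:

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DecisionTreeExample")
  .master("local[*]")
  .getOrCreate()

val data = spark.read.format("libsvm")
  .load("data/sample_libsvm_data.txt") // placeholder path
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Non-parametric: no linearity assumption, and few knobs to tune
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5) // shallow trees stay interpretable (assumed value)

val model = dt.fit(training)

// The learned tree can be printed and inspected, which aids interpretability
println(model.toDebugString)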
Random forests are quite popular and a winner for data scientists, since they perform well on a wide range of classification problems. They are usually slightly ahead of SVMs in terms of usability and run faster for most classification problems. In addition, they scale as the datasets you have available grow. In parallel, you don't need to worry about tuning a cluster of parameters; with SVMs, on the contrary, you need to take care of many parameters and much tuning when handling your data.
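A minimal random forest sketch with the Spark ML API follows; the input path, split ratio, and tree count are assumptions:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RandomForestExample")
  .master("local[*]")
  .getOrCreate()

val data = spark.read.format("libsvm")
  .load("data/sample_libsvm_data.txt") // placeholder path
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Averaging many decorrelated trees lowers variance and curbs overfitting
val rf = new RandomForestClassifier()
  .setNumTrees(100) // assumed value

val model = rf.fit(training)
val predictions = model.transform(test)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test accuracy = $accuracy")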
Decision forests, decision jungles, and boosted decision trees are all based on decision trees, a foundational machine learning concept. There are many variants of decision trees; nonetheless, they all do the same thing: subdivide the feature space into regions with the same label. To avoid the overfitting problem, a large set of trees is constructed with mathematical and statistical formulations, such that the trees are uncorrelated.
The averaged result is referred to as a decision forest, which avoids the overfitting problem as stated earlier. However, the disadvantage is that decision forests can use a lot of memory. Decision jungles, on the other hand, are a variant that consumes less memory at the expense of a slightly longer training time. Fortunately, boosted decision trees avoid overfitting by limiting the number of subdivisions and the number of data points permitted in each region.
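In Spark ML, boosted decision trees are exposed as GBTClassifier (gradient-boosted trees); the following minimal sketch shows how the depth and per-region data point limits described above map onto its parameters. The input path and values are assumptions:

import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BoostedTreesExample")
  .master("local[*]")
  .getOrCreate()

val data = spark.read.format("libsvm")
  .load("data/sample_libsvm_data.txt") // placeholder path

// Boosting builds shallow trees sequentially; limiting the subdivisions
// and the data points per region is what curbs overfitting
val gbt = new GBTClassifier()
  .setMaxIter(10)             // number of boosting stages, i.e., trees (assumed)
  .setMaxDepth(3)             // limits the number of subdivisions per tree (assumed)
  .setMinInstancesPerNode(10) // minimum data points permitted in each region (assumed)

val model = gbt.fit(data)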
When the experimental or sample dataset is large, the Bayesian method often provides results for parametric models that are very similar to those produced by other classical statistical methods. Some potential advantages of using the Bayesian method were summarized by Elam et al. (W. T. Elam, B. Scruggs, F. Eggert, and J. A. Nicolosi, Advantages and Disadvantages of Methods for Obtaining XRF NET Intensities, © JCPDS-International Centre for Diffraction Data, 2011, ISSN 1097-0002). For example, it provides a natural way of combining prior information with data. Therefore, as a data scientist, you can incorporate past information about the parameters and form a prior distribution for future analysis of new datasets. It also provides inferences that are conditional on the data, without the need for asymptotic approximation.
It also provides a suitable setting for a wide range of models, such as hierarchical models and missing data problems. There are disadvantages to Bayesian analysis as well. For example, it does not tell you how to select a prior; indeed, there is no single correct way to choose one. Therefore, if you do not proceed with caution, you might generate many false positive or false negative results, and these often come with a high computational cost when the number of parameters in the model is large.
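Spark ML's closest built-in Bayesian classifier is Naive Bayes, where the smoothing parameter plays the role of a simple additive (Laplace) prior, a small-scale illustration of combining prior information with data. A minimal sketch, with a placeholder path and assumed smoothing value, follows:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NaiveBayesExample")
  .master("local[*]")
  .getOrCreate()

val data = spark.read.format("libsvm")
  .load("data/sample_libsvm_data.txt") // placeholder path

// Additive (Laplace) smoothing acts as a weak prior over feature counts
val nb = new NaiveBayes()
  .setSmoothing(1.0) // assumed value

val model = nb.fit(data)
model.transform(data).select("label", "prediction").show(5)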