Choosing the right algorithm for your application

What machine learning algorithm should I use? is a question that novice machine learning practitioners ask very frequently, but the answer is always It depends. More elaborately:

  • It depends on the volume, quality, complexity, and nature of the data to be used
  • It depends on external factors and parameters, such as your computing system's configuration or the underlying infrastructure
  • It depends on what you want to do with the answer
  • It depends on how the mathematical and statistical formulation of the algorithm was translated into machine instructions for the computer
  • And it depends on how much time you have
Figure 11 provides a complete workflow for choosing the right algorithm for your ML problem. Note, however, that some of these tricks might not work, depending on the data and problem types:

    Figure 11: A workflow for choosing the right algorithm

The reality is that even the most experienced data scientists or data engineers can't give a straight recommendation about which ML algorithm will perform best before trying them all. Most statements of agreement or disagreement begin with It depends on... Habitually, you might wonder whether there are cheat sheets of machine learning algorithms and, if so, how to use them. Several data scientists we talked to said that the only sure way to find the very best algorithm is to try all of them; there is no shortcut! Let's make this concrete: suppose you have a set of data and you want to do some clustering. However, depending on whether your data is labeled or unlabeled, and whether the labels are categories or numeric values, the task could technically turn out to be classification or regression instead. Now, the first concerns that come to your mind are:

  • Which factors should I consider before choosing an appropriate algorithm? Or should I just choose an algorithm randomly?
  • How do I choose any data pre-processing algorithm or tools that can be applied to my data?
  • What sort of feature engineering techniques should I be using to extract the useful features?
  • What factors can improve the performance of my ML model?
  • How can I adapt my ML application to new data types?
  • Can I scale up my ML application for large-scale datasets? And so on.

You will always expect an answer that is well justified and explains everything you should consider. In this section, we will try to answer these questions with our modest machine learning knowledge.

Considerations when choosing an algorithm

The recommendations or suggestions we provide here are aimed at everyone from the novice data scientist who is just learning machine learning to the expert data scientist trying to choose an optimal algorithm to start with using the Spark ML APIs. That means they contain some generalizations and oversimplifications, but they will point you in a safe direction, believe us! Suppose you are planning to develop an ML system to answer questions based on rules of the following form:

  • IF feature X has property Z THEN do Y

Accordingly, there should also be rules such as:

  • IF X THEN it is sensible to try Y using property Z and avoid W

However, what is sensible and what is not depends on:

  • Your application and the expected complexity of the problem.
  • Size of the data set (that is, how many rows/columns, how many independent cases).
  • Is your dataset labeled or unlabeled?
  • The type of data and the kind of measurement, since data of a different nature suggests a different order or structure, right?
  • And, obviously, your practical experience in applying different methods efficiently and intelligently.

Moreover, if you want a general answer to a general problem, we recommend The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2009) for a fresh start. Nevertheless, we also recommend favoring algorithms that:

  • Show excellent accuracy
  • Have fast training times
  • Make good use of linearity

Accuracy

Getting the most accurate result from your ML application isn't always indispensable. Depending on what you want to use it for, an approximation is sometimes adequate. If that is the case, you may be able to reduce processing time drastically by settling for more approximate methods. Once you are familiar with the workflow of the Spark machine learning APIs, you will appreciate having more approximate methods at hand, because these methods also tend to keep your ML model from overfitting.
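As a rough illustration, the following minimal sketch shows how accuracy can be measured on held-out data with the Spark ML evaluator API, so you can judge how much accuracy an approximate method is actually costing you. The predictions DataFrame here is a hypothetical placeholder produced by some already-trained model:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// `predictions` is a hypothetical DataFrame, e.g. someModel.transform(testData),
// containing the usual "label" and "prediction" columns
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")     // other metrics include "f1" and "weightedPrecision"

val accuracy = evaluator.evaluate(predictions)
println(s"Accuracy on held-out data = $accuracy")
```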

Training time

The time required to finish the data preprocessing or to build the model varies a great deal across algorithms, depending on their inherent complexity and, of course, their robustness. Training time is often closely related to accuracy. In addition, you will often discover that some algorithms are more sensitive to the number of data points than others. When your time is limited, and especially when the dataset is large, that sensitivity can drive the choice of algorithm. Therefore, if you are particularly concerned about time, be prepared to sacrifice some accuracy or performance and use a simpler algorithm that fulfils your minimum requirements.
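A crude but effective way to compare the training time of candidate algorithms is simply to time their fit calls. In the sketch below, estimator and training are hypothetical placeholders for any Spark ML estimator and its training DataFrame:

```scala
// `estimator` and `training` are hypothetical placeholders: any Spark ML estimator
// (for example, a LogisticRegression instance) and a DataFrame of training data
val start = System.nanoTime()
val model = estimator.fit(training)
val elapsedSeconds = (System.nanoTime() - start) / 1e9
println(f"Training took $elapsedSeconds%.1f seconds")
```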

Linearity

Many machine learning algorithms make use of linearity (many of them are available in Spark MLlib and Spark ML). For example, linear classification algorithms assume that classes can be separated by a straight line, or by its higher-dimensional equivalent, a hyperplane. A linear regression algorithm, on the other hand, assumes that data trends follow a simple straight line. This assumption is not unreasonable for many machine learning problems; however, there are other cases where accuracy will suffer. Despite their hazards, linear algorithms are very popular with data engineers and data scientists as a first line of attack. Moreover, these algorithms tend to be algorithmically simple and fast to train.
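To make the idea concrete, here is a minimal, self-contained sketch of fitting a straight line with the Spark ML LinearRegression estimator; the tiny toy dataset is made up purely for illustration:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("LinearityDemo").master("local[*]").getOrCreate()

// A toy dataset whose label roughly follows a straight line in the single feature
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(1.0)),
  (2.1, Vectors.dense(2.0)),
  (2.9, Vectors.dense(3.0)),
  (4.2, Vectors.dense(4.0))
)).toDF("label", "features")

// The linear model assumes the trend is a straight line (a hyperplane in general)
val lr = new LinearRegression().setMaxIter(50).setRegParam(0.01)
val model = lr.fit(training)
println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```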

Talking to your data when choosing an algorithm

You will find many machine learning datasets available for free at http://machinelearningmastery.com/tour-of-real-world-machine-learning-problems/ or at the UC Irvine Machine Learning Repository (at http://archive.ics.uci.edu/ml/). The following data properties should also be considered first:

  • Number of parameters
  • Number of features
  • Size of the training dataset

Number of parameters

Parameters are the knobs that a data scientist like you gets to turn when setting up an algorithm. They are numbers that affect the algorithm's behavior, such as the error tolerance or the number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting these settings right. Typically, algorithms with a large number of parameters require the most trial and error to find a good combination.

Despite the fact that trial and error over a grid is a great way to span the parameter space, model building or training time increases exponentially with the number of parameters. This is both a dilemma and a time-performance trade-off. The positive sides are that, first, having many parameters typically indicates greater flexibility of an ML algorithm and, second, with that flexibility your ML application can often achieve much better accuracy.
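As an illustration of how quickly the search space grows, the following sketch grid-searches just three parameters of a logistic regression with Spark ML's CrossValidator; the training DataFrame with label and features columns is a hypothetical placeholder:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Every extra parameter multiplies the number of candidate models
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .addGrid(lr.maxIter, Array(50, 100))
  .build()                       // 3 x 3 x 2 = 18 candidate models

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)                // 18 x 3 = 54 model fits in total

// `training` is a hypothetical DataFrame with "label" and "features" columns
val bestModel = cv.fit(training).bestModel
```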

How large is your training set?

If your training set is small, high-bias, low-variance classifiers, such as Naive Bayes, have an advantage over low-bias, high-variance classifiers, such as kNN, since the latter will overfit. Low-bias, high-variance classifiers, on the other hand, start to win out as your training set grows, since they have lower asymptotic errors; at that point, high-bias classifiers are no longer powerful enough to provide accurate models. You can also think of this as the trade-off between generative and discriminative models.
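For reference, fitting a high-bias, low-variance classifier such as Naive Bayes takes only a few lines with the Spark ML API. The sketch below assumes a hypothetical smallTraining DataFrame whose features are non-negative (for example, word counts), as multinomial Naive Bayes requires:

```scala
import org.apache.spark.ml.classification.NaiveBayes

// `smallTraining` is a hypothetical DataFrame with a "label" column and a
// "features" column of non-negative values (for example, word counts)
val nb = new NaiveBayes()
  .setSmoothing(1.0)             // additive (Laplace) smoothing
  .setModelType("multinomial")

val nbModel = nb.fit(smallTraining)
```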

Number of features

For certain types of experimental datasets, the number of extracted features can be very large compared to the number of data points. This is often the case with genomic, biomedical, or textual data. A large number of features can swamp some learning algorithms, making training time impractically long. Support vector machines are particularly well suited to this case because of their high accuracy, nice theoretical guarantees regarding overfitting, and the possibility of choosing an appropriate kernel.
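One common remedy before training is to keep only the most informative features. As a hedged sketch, Spark ML's ChiSqSelector can do this for categorical labels; data is a hypothetical DataFrame with a very wide features column, and the cutoff of 500 features is arbitrary:

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// `data` is a hypothetical DataFrame with a (possibly very wide) "features"
// vector column and a categorical "label" column
val selector = new ChiSqSelector()
  .setNumTopFeatures(500)        // keep only the 500 most relevant features
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val reduced = selector.fit(data).transform(data)
```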

Special notes on widely used ML algorithms

In this section, we will provide some special notes on the most commonly used machine learning algorithms and techniques. The techniques we will emphasize are logistic regression, linear regression, recommender systems, SVMs, decision trees, random forests, Bayesian methods, and decision forests, decision jungles, and their variants. Table 3 shows the pros and cons of some widely used algorithms, including where and when to choose them.

Linear regression (LR)

  • Pros: Very fast, often running in near-constant time, so model building time is low; intrinsically simple and easy to understand as a model; has low variance and is less prone to overfitting and underfitting
  • Cons: Often unable to model complex data or to capture nonlinear relationships without transforming the input dataset; works with only a single decision boundary; requires a large sample size to achieve stable results; high bias
  • Better at: Numerical datasets with a large collection of features; works well with numerical as well as categorical variables; widely used in the biological, behavioral, medical, and social sciences to predict possible relationships among variables

Decision trees (DT)

  • Pros: Low model building and prediction time; robust against noise and missing values; high accuracy
  • Cons: Interpretation is hard with large and complex trees; duplication may occur within the same sub-tree; possible issues with diagonal decision boundaries
  • Better at: Targeting highly accurate classification; medical diagnosis and prognosis; credit risk analytics

Neural networks (NN)

  • Pros: Extremely powerful and robust; capable of modelling very complex relationships; can work without knowledge of the underlying data
  • Cons: Highly prone to overfitting and underfitting; high training and prediction time; computationally expensive, requiring significant computing power; the model is not readable or reusable
  • Better at: Image processing; video processing; human-intelligence tasks; robotics; deep learning

Random forest (RF)

  • Pros: An ensemble of bagged trees; low variance; high accuracy; can handle the overfitting problem
  • Cons: Not as easy to visualize and interpret; high training and prediction time
  • Better at: Dealing with multiple features that may be correlated; biomedical diagnosis and prognosis; can be applied to both classification and regression

Support vector machines (SVM)

  • Pros: High accuracy
  • Cons: Susceptible to overfitting and underfitting; numerical stability issues; computationally expensive, requiring large computing power
  • Better at: Image classification; handwriting recognition

K-nearest neighbors (K-NN)

  • Pros: Simple and powerful; no training phase (lazy learning); can be applied to both multiclass classification and regression
  • Cons: High prediction time; needs an accurate distance function; low performance with high-dimensional datasets
  • Better at: Low-dimensional datasets; anomaly detection such as outlier detection; fault detection in semiconductors; gene expression; protein-protein interaction

K-means

  • Pros: Linear execution time; often performs better than hierarchical clustering; excellent with hyper-spherical clusters
  • Cons: Results may not be repeatable and can lack consistency; requires prior knowledge of K; not a good choice if the natural clusters occurring in the dataset are non-spherical
  • Better at: Large datasets

Latent Dirichlet Allocation (LDA)

  • Pros: Can be applied to large-scale text datasets; can overcome the overfitting problem of pLSA; can be applied to both document classification and clustering through topic modelling
  • Cons: Struggles with very high-dimensional and complex text databases; requires the specification of the number of topics; cannot find the optimum level of granularity (the Hierarchical Dirichlet Process, HDP, is then the better choice)
  • Better at: Document classification and clustering through topic modelling on large-scale text datasets; can be applied in NLP and other text analytics

Naive Bayes (NB)

  • Pros: Computationally fast; simple to implement; works well with high dimensions; can handle missing values; adaptable, since the model can be updated with new training data without being rebuilt
  • Cons: Relies on the independence assumption, so performs badly if the assumption is not met; relatively low accuracy
  • Better at: Data with lots of missing values; data where the dependencies of features on each other are similar between features; spam filtering and classification; classifying a news article as technology, politics, or sports; text mining

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

  • Pros: Reflect real intuitions about the data; allow estimation of probabilities in high-dimensional data; dramatically reduce the size of the data; both are based on strong linear algebra
  • Cons: Too expensive for many applications such as Twitter and web analytics; disastrous for tasks with fine-grained classes; need a proper understanding of linearity; complexity is often cubic, so they are computationally slow
  • Better at: SVD is applied to low-rank matrix approximation, image processing, bioinformatics, signal processing, and NLP; PCA is used for interest rate derivative portfolios, neuroscience, and so on; both are suitable for high-dimensional, multivariate datasets

Table 3: Pros and cons of some widely used algorithms

Logistic regression and linear regression

Logistic regression is a powerful tool that is used around the globe for two-class and multiclass classification, since it is fast as well as simple. It uses an S-shaped curve instead of a straight line, making it a natural fit for partitioning data into groups. It provides linear class boundaries, so when you use it, make sure a linear approximation is something you can live with. Unlike decision trees or SVMs, it also has a nice probabilistic interpretation, so you will be able to update your model easily to adapt to new datasets.

Therefore, the recommendation is to use it if you want a probabilistic framework or if you expect to receive more training data in the future that you will want to incorporate into your model. As mentioned previously, linear regression fits a line, plane, or hyperplane to the dataset. It is a workhorse, simple and fast, but it may be overly simplistic for some problems.
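A minimal logistic regression sketch with the Spark ML API might look as follows; training is again a hypothetical DataFrame with label and features columns, and setFamily assumes Spark 2.1 or later:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// `training` is a hypothetical DataFrame with "label" and "features" columns
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)             // regularization keeps the linear boundary from overfitting
  .setFamily("multinomial")      // use "binomial" for plain two-class problems

val model = lr.fit(training)
// The probabilistic interpretation: each prediction comes with a probability column
model.transform(training).select("features", "probability", "prediction").show(5)
```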

Recommendation systems

We have already talked about the accuracy and performance issues of the most commonly used ML algorithms and tools. However, beyond accuracy, research on recommender systems is also concerned with another factor: diversity. A recommendation system with good accuracy and high intra-list diversity will be the winner, and, as a result, your product will be valuable to your target customers. It is, however, more effective to let users re-rate items, rather than showing only new items. If your clients have extra requirements that need to be fulfilled, such as privacy or security, your system also has to be able to deal with the related privacy issues.

This is particularly important because customers have to provide some personal information as well, so it is recommended not to expose that sensitive information publicly.

Building user profiles using robust techniques or algorithms such as collaborative filtering could, on the other hand, be problematic from the privacy perspective. Moreover, research in this area has found that user demographic information may influence how satisfied users are with recommendations (see also Joeran Beel, Stefan Langer, Andreas Nürnberger, Marcel Genzmehr, The Impact of Demographics (Age and Gender) and Other User Characteristics on Evaluating Recommender Systems, in Trond Aalberg, Milena Dobreva, Christos Papatheodorou, Giannis Tsakonas, and Charles Farrugia (eds.), Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, Springer, pp. 400-404, retrieved 1 November 2013).
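Collaborative filtering itself is available in Spark ML through the ALS (alternating least squares) estimator. The sketch below is a minimal example assuming a hypothetical ratings DataFrame with userId, itemId, and rating columns; recommendForAllUsers requires Spark 2.2 or later:

```scala
import org.apache.spark.ml.recommendation.ALS

// `ratings` is a hypothetical DataFrame with "userId", "itemId", and "rating" columns
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(10)                   // number of latent factors
  .setMaxIter(10)
  .setRegParam(0.1)

val alsModel = als.fit(ratings)
// Recommend the top five items for every user (Spark 2.2+)
val topItems = alsModel.recommendForAllUsers(5)
```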

Although serendipity, a measure of how surprising the recommendations are, is crucial, trust ultimately needs to be built in the recommender system. This can be done by explaining how the system generates its recommendations, and why it recommends an item even when it has little demographic information about the user.

Therefore, if the user does not trust the system at all, they will neither provide any demographic information nor re-rate the items.

Support vector machines (SVMs)

According to Cawley and Talbot (G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010), there are several advantages to support vector machines:

  • You can tackle the over-fitting problem, since SVMs provide you with a regularization parameter
  • SVMs use the kernel trick, which lets you build knowledge about the problem into the model by engineering the kernel
  • An SVM is developed, designed, and defined as a convex optimization problem, so there is no concept of local minima
  • It approximates a bound on the test error rate, and there is a significant and well-studied body of theory behind it

These promising features of SVMs will really help you, and it is suggested that you use them frequently (a minimal Spark sketch follows the list below). On the other hand, the cons are:

  • The theory only really covers the determination of the model parameters for given values of the regularization and kernel parameters; choosing the kernel and its parameters is therefore left to you.
  • There might be a worse scenario as well, where the kernel model itself is quite sensitive to over-fitting the model selection criterion.
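As a minimal sketch, Spark ML (2.2 and later) offers LinearSVC, a linear SVM trained with hinge loss; note that it does not expose the kernel trick mentioned above, so only a linear boundary is available. The training DataFrame with label and features columns is a hypothetical placeholder:

```scala
import org.apache.spark.ml.classification.LinearSVC

// `training` is a hypothetical DataFrame with "label" and "features" columns.
// LinearSVC fits a *linear* SVM with hinge loss; kernels are not supported here.
val svm = new LinearSVC()
  .setMaxIter(100)
  .setRegParam(0.1)              // the regularization parameter mentioned above

val svmModel = svm.fit(training)
println(s"Coefficients: ${svmModel.coefficients}, intercept: ${svmModel.intercept}")
```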

Decision trees

Decision trees are cool because of their usability: they are easy to interpret and make it easy to explain the machine learning problem at hand. They also handle feature interactions easily and, most importantly, they are often non-parametric. Therefore, even if you are an ordinary data scientist with limited working proficiency, you don't need to worry about issues such as outliers, parameter setting, and tuning. You can also rely on decision trees to take away the stress of handling data linearity; more technically, you need not worry about whether your data is linearly separable or not. On the other hand, there are some cons as well (a minimal Spark sketch follows the list below). For example:

  • In some cases, decision trees are not suitable; for example, they don't support online learning on real-time datasets. In that case, you have to rebuild your tree when new examples or datasets arrive; more technically, the model cannot adapt incrementally.
  • Secondly, if you are not careful, they will easily overfit.
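Here is a minimal decision tree sketch with the Spark ML API; training is a hypothetical DataFrame with label and features columns, and printing the learned tree illustrates why trees are so easy to explain:

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// `training` is a hypothetical DataFrame with "label" and "features" columns
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5)                // limiting depth is the simplest guard against overfitting
  .setImpurity("gini")

val dtModel = dt.fit(training)
// The learned tree can be printed and inspected node by node
println(dtModel.toDebugString)
```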

Random forests

Random forests are quite popular and a winner for many data scientists, since they handle a wide range of classification problems very well. They are usually slightly ahead of SVMs in terms of usability and run faster for most classification problems. In addition, they scale well as your datasets grow. You also don't need to worry about tuning a bunch of parameters to get reasonable results, unlike SVMs, where careful parameter tuning and data handling are required.
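A minimal random forest sketch with the Spark ML API, again assuming a hypothetical training DataFrame with label and features columns:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// `training` is a hypothetical DataFrame with "label" and "features" columns
val rf = new RandomForestClassifier()
  .setNumTrees(100)              // more trees lower the variance of the ensemble
  .setMaxDepth(10)
  .setFeatureSubsetStrategy("auto")

val rfModel = rf.fit(training)
// Feature importances are handy when many (possibly correlated) features are involved
println(rfModel.featureImportances)
```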

Decision forests, decision jungles, and variants

Decision forests, decision jungles, and boosted decision trees are all based on decision trees, a foundational machine learning concept. There are many variants of decision trees; nonetheless, they all do the same thing, which is subdividing the feature space into regions with mostly the same label. To avoid the overfitting problem, a large set of trees is constructed, using mathematical and statistical formulations that keep the trees as uncorrelated as possible.

The average of all these trees is referred to as a decision forest; it is an ensemble of trees that avoids the overfitting problem stated earlier. However, the disadvantage is that decision forests can use a lot of memory. Decision jungles, on the other hand, are a variant that consumes less memory at the expense of a slightly longer training time. Boosted decision trees, in turn, avoid overfitting by limiting the number of subdivisions and the number of data points permitted in each region.
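Spark ML does not implement decision jungles, but its closest readily available relative of boosted decision trees is the gradient-boosted tree classifier, sketched below under the usual assumption of a hypothetical training DataFrame with label and features columns (note that GBTClassifier in Spark 2.x handles binary classification only):

```scala
import org.apache.spark.ml.classification.GBTClassifier

// `training` is a hypothetical DataFrame with a binary "label" and a "features" column
val gbt = new GBTClassifier()
  .setMaxIter(50)                // number of boosting rounds (trees)
  .setMaxDepth(4)                // shallow trees limit the subdivisions per tree
  .setStepSize(0.1)              // learning rate

val gbtModel = gbt.fit(training)
```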

Bayesian methods

When the experimental or sample dataset is large, Bayesian methods often provide results for parametric models that are very similar to the results produced by classical statistical methods. Some potential advantages of using Bayesian methods were summarized by Elam et al. (W. T. Elam, B. Scruggs, F. Eggert, and J. A. Nicolosi, Advantages and Disadvantages of Methods for Obtaining XRF NET Intensities, ©JCPDS-International Centre for Diffraction Data 2011, ISSN 1097-0002). For example, they provide a natural way of combining prior information with data. Therefore, as a data scientist, you can incorporate past information about the parameters into a prior distribution for the analysis of new datasets. They also provide inferences that are conditional on the data, without the need for asymptotic approximations.

Bayesian analysis also provides a suitable setting for a wide range of models, such as hierarchical models and missing-data problems. There are, however, disadvantages too. For example, it does not tell you how to select a prior, and there is no single correct way to choose one. Therefore, if you do not proceed with caution, you might generate many false positive or false negative results, often at a high computational cost when the number of parameters in the model is large.
