Introducing Estimators

The Estimator class, just like the Transformer class, was introduced in Spark 1.3. The Estimators, as the name suggests, estimate the parameters of a model or, in other words, fit the models to data.
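The fit-then-predict contract can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: the class names `SimpleLinearEstimator` and `SimpleLinearModel` are hypothetical and are not part of the PySpark API. They mimic the idea that an Estimator's `fit` method learns parameters from data and returns a fitted model that can then transform new data.

```python
class SimpleLinearModel:
    """Hypothetical fitted model: holds the learned parameters."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def transform(self, xs):
        # Apply the learned parameters to new records.
        return [self.slope * x + self.intercept for x in xs]


class SimpleLinearEstimator:
    """Hypothetical estimator: fit() estimates parameters from data."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        # Ordinary least squares for a single feature.
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        slope = cov_xy / var_x
        intercept = mean_y - slope * mean_x
        return SimpleLinearModel(slope, intercept)


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = SimpleLinearEstimator().fit(xs, ys)
predictions = model.transform([5.0])
```

The estimator itself stores no learned state; everything it learns lives in the model object it returns, which is what makes the fitted model reusable on new data.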

In this recipe, we will introduce two models: the linear SVM acting as a classification model, and a linear regression model predicting the forest elevation.

Here is a list of all of the Estimators, or machine learning models, available in the ML module:

  • Classification:
    • LinearSVC is an SVM model for linearly separable problems. The SVM's kernel is linear, so its decision function has the form $w^{T}x + b$ (a hyperplane), where $w$ is the vector of coefficients (or a normal vector to the hyperplane), $x$ is a record, and $b$ is the offset.

    • LogisticRegression is the default, go-to classification model for linearly separable problems. It uses the logistic function to calculate the probability of a record being a member of a particular class.

    • DecisionTreeClassifier is a decision tree-based model used for classification purposes. It builds a binary tree with the proportions of classes in the terminal nodes determining the class membership.
    • GBTClassifier is a member of the group of ensemble models. The Gradient-Boosted Trees (GBT) build several weak models that, when combined, form a strong classifier. The model can also be applied to solve regression problems. 
    • RandomForestClassifier is also a member of the ensemble group of models. Unlike GBT, however, random forests grow fully grown decision trees, and the total error reduction is achieved by reducing variance (while GBTs reduce bias). Just like GBT, these models can also be used to solve regression problems.
    • NaiveBayes uses Bayes' theorem of conditional probability, $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, to classify observations based on evidence and prior assumptions about the probability and likelihood.
    • MultilayerPerceptronClassifier is derived from the field of artificial intelligence, and, more narrowly, artificial neural networks. The model consists of a directed graph of artificial neurons that mimic (to some extent) the fundamental building blocks of the brain.
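    • OneVsRest is a reduction technique that turns a multinomial problem into a set of binary ones: for each class, it trains a binary classifier that separates that one class from all the rest.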
  • Regression:
    • AFTSurvivalRegression is a parametric model that predicts life expectancy and assumes that a marginal effect of one of the features accelerates or decelerates a process failure.
    • DecisionTreeRegressor, a counterpart of DecisionTreeClassifier, is applicable for regression problems.
    • GBTRegressor, a counterpart of GBTClassifier, is applicable for regression problems.
    • GeneralizedLinearRegression is a family of linear models that allows us to specify different link functions. Unlike linear regression, which assumes the normality of error terms, the Generalized Linear Model (GLM) allows the error terms to follow other distributions.
    • IsotonicRegression fits a free-form and non-decreasing line to data.
    • LinearRegression is the benchmark of regression models. It fits a straight line (or a plane defined in linear terms) through the data.
    • RandomForestRegressor, a counterpart of RandomForestClassifier, is applicable for regression problems.
  • Clustering:
    • BisectingKMeans is a model that begins with all observations in a single cluster and iteratively splits the data into k clusters.
    • KMeans separates data into k (user-defined) clusters by iteratively updating the cluster centroids and reassigning the data points, so that the sum of squared distances between the data points and their cluster centroids is minimized.
    • GaussianMixture uses k Gaussian distributions to break the dataset down into clusters.
    • LDA: The Latent Dirichlet Allocation is a model frequently used in topic mining. It is a statistical model that makes use of some unobserved (or unnamed) groups to cluster observations. For example, a PLANE_linked cluster might include words such as engine, flaps, or wings.
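
To make the hyperplane form mentioned for LinearSVC in the list above more concrete, here is a minimal plain-Python sketch of how a linear decision function w·x + b can be turned into a class label. The coefficient values, the records, and the helper names `decision_value` and `classify` are assumptions made up for illustration; they are not PySpark functions, and a fitted Spark model would learn the coefficients and the offset from data.

```python
def decision_value(w, x, b):
    # Dot product of the coefficients and the record, plus the offset.
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    # Records on or above the hyperplane get label 1, the rest label 0.
    return 1 if decision_value(w, x, b) >= 0 else 0


# Hypothetical, hand-picked parameters (not learned from data).
w = [2.0, -1.0]
b = -1.0

records = [[2.0, 1.0], [0.0, 1.0]]
scores = [decision_value(w, x, b) for x in records]
labels = [classify(w, x, b) for x in records]
```

The sign of the score tells us on which side of the hyperplane a record falls, and its magnitude grows with the distance from the boundary.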