An overview of statistical and machine learning

The field of Artificial Intelligence (AI) is not new. Thirty years ago, apart from robotics, there was little sense of where the field was heading. In the past decade in particular, interest in AI and machine learning has grown considerably. In the broadest sense, these fields aim to 'discover and learn something useful' about the environment. The information gathered leads to the discovery of new algorithms, which in turn raises the question, "how do we process high-dimensional data and deal with uncertainty?"

Machine learning aims to generate classifying expressions that are simple enough for humans to follow. They must mimic human reasoning sufficiently to provide insights into the decision process. As with statistical approaches, background knowledge may be exploited in the development phase. Statistical learning plays a key role in many areas of science, and the science of learning in turn plays a key role in statistics, data mining, and artificial intelligence, intersecting with engineering and other disciplines.

The difference between statistical learning and machine learning is that statistics emphasizes inference, whereas machine learning emphasizes prediction. With statistics, the general approach is to infer the process by which the data was generated. With machine learning, one wants to know how to predict the future characteristics of the data with respect to some variable. There is considerable overlap between the two fields, and experts often argue for one view over the other. Let's leave this debate to the experts and select a few areas to discuss in this chapter; the following chapter contains more elaborate examples of machine learning. Here are some of the algorithms:

  • Regression or forecasting
  • Linear and quadratic discriminant analysis
  • Classification
  • Nearest neighbor
  • Naïve Bayes
  • Support vector machines
  • Decision trees
  • Clustering

Machine learning algorithms are broadly categorized as supervised learning, unsupervised learning, reinforcement learning, and deep learning. In supervised learning, the training data is labeled, and the labels act like a teacher supervising the assignment of classes. Unsupervised learning has no labeled training data at all, whereas supervised learning has completely labeled training data. Semi-supervised learning falls between the two: it makes use of unlabeled data in addition to a (usually small) amount of labeled data for training.
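To make the distinction concrete, here is a minimal sketch (using scikit-learn on a synthetic two-cluster dataset invented purely for illustration) that contrasts a supervised classifier, which learns from labels, with an unsupervised clustering algorithm, which must discover structure on its own:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2],   # cluster A
               rng.randn(50, 2) - [2, 2]])  # cluster B

# Supervised: labels are available, so a classifier learns from them
y = np.array([0] * 50 + [1] * 50)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[1.5, 1.5]]))   # predicts the class of a new point

# Unsupervised: no labels; the algorithm must discover the structure itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])              # cluster assignments it inferred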

As the context of this book is data visualization, we will only discuss a few algorithms in the following sections.

K-nearest neighbors

The first machine learning algorithm that we will look at is k-nearest neighbors (k-NN). k-NN does not build a model from the training data. Instead, it compares a new, unlabeled piece of data to every piece of existing data, finds the most similar pieces (the nearest neighbors), and looks at their labels. Specifically, it considers the top k most similar pieces of data from the known dataset (k is an integer, usually less than 20). The following code demonstrates a k-nearest neighbors plot:

from numpy import random,argsort,sqrt
from pylab import plot,show
import matplotlib.pyplot as plt

def knn_search(x, data, K):

  """ k nearest neighbors """

  ndata = data.shape[1]
  K = K if K < ndata else ndata
  # euclidean distances from the query point to every data point
  sqd = sqrt(((data - x)**2).sum(axis=0))
  idx = argsort(sqd) # sorting
  # return the indexes of K nearest neighbors
  return idx[:K]

data = random.rand(2,200) # random dataset
x = random.rand(2,1) # query point

neig_idx = knn_search(x,data,10)

plt.figure(figsize=(12,12))

# plotting the data and the input point
plot(data[0,:], data[1,:], 'o', color='#9a88a1')
plot(x[0,0], x[1,0], 'o', color='#9a88a1', markersize=20)

# highlighting the neighbors
plot(data[0,neig_idx],data[1,neig_idx],'o', 
  markerfacecolor='#BBE4B4',markersize=22,markeredgewidth=1)

show()

The approach to k-nearest neighbors is as follows:

  • Collect data using any method
  • Prepare the numeric values that are needed for a distance calculation
  • Analyze the data with any appropriate method
  • Train: none (there is no training step in k-NN)
  • Test: calculate the error rate
  • Apply: perform a k-nearest neighbor search for a query and take some action based on the top k neighbors identified

In order to test out a classifier, you can start with some known data, hide the answers from the classifier, and ask the classifier for its best guess; comparing those guesses against the hidden labels yields the error rate. A minimal voting classifier built on the preceding search is sketched after the following figure.

(Figure: the k-nearest neighbors plot, with the neighbors of the query point highlighted)
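Building on the knn_search function above, here is a minimal sketch of the voting step. The labels array and the knn_classify helper are invented here for illustration; they are not part of the earlier listing:

import numpy as np

def knn_classify(x, data, labels, K):
    """ classify the query x by majority vote among its K nearest neighbors """
    idx = knn_search(x, data, K)              # reuse the search defined above
    votes = labels[idx]                       # labels of the K neighbors
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]          # the most common label wins

# hypothetical labels: 0 for the left half of the unit square, 1 for the right
labels = (data[0, :] > 0.5).astype(int)
print(knn_classify(x, data, labels, 10))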

Generalized linear models

Regression is a statistical process to estimate the relationships among variables. More specifically, regression helps you understand how the typical value of the dependent variable changes when any one of the independent variables is varied.

Linear regression is the oldest type of regression. It can be used for interpolation, but it is not well suited to predictive analytics: it is sensitive to outliers, and cross-correlations between the input variables can make the estimated coefficients unstable. The following sketch illustrates the outlier sensitivity.
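As a minimal, purely illustrative sketch (the data here is synthetic), fitting an ordinary least-squares line with and without a single extreme point shows how much one outlier can move the slope:

import numpy as np

rng = np.random.RandomState(1)
x = np.linspace(0, 10, 20)
y = 2 * x + 1 + rng.randn(20)            # points near the line y = 2x + 1

slope, intercept = np.polyfit(x, y, 1)
print("clean fit:    slope=%.2f" % slope)     # close to 2

y_out = y.copy()
y_out[-1] += 40                          # add one extreme outlier
slope_o, _ = np.polyfit(x, y_out, 1)
print("with outlier: slope=%.2f" % slope_o)   # pulled noticeably away from 2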

Bayesian regression behaves like a kind of penalized estimator and is more flexible and stable than traditional linear regression. It assumes that you have some prior knowledge about the regression coefficients, and the statistical analysis is carried out within the framework of Bayesian inference.

We will discuss a set of methods in which the target value (y) is expected to be a linear combination of the input variables (x1, x2, …, xn). In notation, the predicted target value is as follows:

$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
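For instance, a minimal sketch with made-up weights shows how such a prediction is computed:

import numpy as np

w0 = 0.5                          # intercept (w0), chosen for illustration
w = np.array([1.0, -2.0, 0.3])    # weights w1..w3, chosen for illustration
x = np.array([2.0, 1.0, 4.0])     # one input vector (x1, x2, x3)

y_hat = w0 + np.dot(w, x)         # y^ = w0 + w1*x1 + w2*x2 + w3*x3
print(y_hat)                      # 0.5 + 2.0 - 2.0 + 1.2 = 1.7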

Now, let's take a look at the Bayesian linear regression model. A logical question one may ask is, "why Bayesian?" The answers are as follows:

  • Bayesian models are more flexible
  • The Bayesian model is more accurate in small samples (may depend on priors)
  • Bayesian models can incorporate prior information

Bayesian linear regression

First, let's take a look at a graphical model for linear regression. In this model, let's say we are given data values—D = ((x1, y1), (x2, y2), … (xn, yn)) —and our goal is to model this data and come up with a function, as shown in the following equation:

$y = f(x) = w^T x, \qquad Y_i \sim \mathcal{N}(w^T x_i, \sigma^2), \quad i = 1, \dots, n$

Here, w is a weight vector, and each Yi is normally distributed, as shown in the preceding equation. The Yi are random variables; by conditioning on the observed data Yi = yi, we can predict the corresponding y for a new input x, as shown in the following code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

from sklearn.linear_model import BayesianRidge

np.random.seed(0)
n_samples, n_features = 200, 200

X = np.random.randn(n_samples, n_features)  # Gaussian data
# Create weights with a precision of 4.
theta = 4.
w = np.zeros(n_features)

# Only keep 8 weights of interest
relevant_features = np.random.randint(0, n_features, 8)
for i in relevant_features:
    w[i] = stats.norm.rvs(loc=0, scale=1. / np.sqrt(theta))

alpha_ = 50.
noise = stats.norm.rvs(loc=0, scale=1. / np.sqrt(alpha_), size=n_samples)
y = np.dot(X, w) + noise

# Fit the Bayesian Ridge Regression 
clf = BayesianRidge(compute_score=True)
clf.fit(X, y)

# Plot the true and estimated weights, and a histogram of the weights
plt.figure(figsize=(11,10))
plt.title("Weights of the model", fontsize=18)
plt.plot(clf.coef_, 'b-', label="Bayesian Ridge estimate")
plt.plot(w, 'g-', label="Ground truth weights")
plt.xlabel("Features", fontsize=16)
plt.ylabel("Values of the weights", fontsize=16)
plt.legend(loc="best", prop=dict(size=12))
plt.figure(figsize=(11,10))
plt.title("Histogram of the weights", fontsize=18)
plt.hist(clf.coef_, bins=n_features, log=True)
plt.plot(clf.coef_[relevant_features], 5 * np.ones(len(relevant_features)),
         'ro', label="Relevant features")
plt.ylabel("Features", fontsize=16)
plt.xlabel("Values of the weights", fontsize=16)
plt.legend(loc="lower left")
plt.show()

The following two plots are the results of the program:

(Figures: the weights of the model, comparing the Bayesian Ridge estimates to the ground truth weights, and the histogram of the weights with the relevant features marked)
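Once fitted, the model can also be used for prediction. Here is a minimal sketch continuing from the preceding listing; the query points X_new are new random draws, invented purely for illustration, and BayesianRidge's return_std option is used to obtain the predictive standard deviation:

# Predict on a few new points; return_std=True also yields the
# predictive standard deviation of the Bayesian model
X_new = np.random.randn(3, n_features)
y_mean, y_std = clf.predict(X_new, return_std=True)
for m, s in zip(y_mean, y_std):
    print("prediction: %.3f +/- %.3f" % (m, s))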