In the first chapter, you were acquainted with some rudimentary concepts regarding data processing, clustering, and classification. This chapter is dedicated to the creation and maintenance of a flexible end-to-end workflow to train and classify data. The first section of the chapter introduces a data-centric (functional) approach to create number-crunching applications.
You will learn how to:
Data is the lifeline of any scientist, and the selection of data providers is critical in developing or evaluating any statistical inference or machine learning algorithm.
We briefly introduced the concept of a model in the Model categorization section in Chapter 1, Getting Started.
What constitutes a model? Wikipedia provides a reasonably good definition of a model as understood by scientists [2:1]:
A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.
…
Models that are rendered in software allow scientists to leverage computational power to simulate, visualize, manipulate and gain intuition about the entity, phenomenon or process being represented.
In statistics and probability theory, a model describes the data one might observe from a system, accounting for uncertainty and noise. A model allows us to infer rules, make predictions, and learn from data.
A model is composed of features, also known as attributes or variables, and a set of relations between those features. For instance, the model represented by the function f(x, y) = x.sin(2y) has two features, x and y, and a relation, f. These two features are assumed to be independent. If the model is subject to a constraint such as f(x, y) < 20, the assumption of independence no longer holds.
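The model above can be sketched in a few lines of Scala. The object and method names are hypothetical, chosen only to mirror the features x and y and the relation f from the text:

```scala
// A minimal sketch of a model: two features, x and y, and a relation f.
object ModelExample {
  // The relation f(x, y) = x.sin(2y)
  def f(x: Double, y: Double): Double = x * math.sin(2 * y)

  // A constraint such as f(x, y) < 20 couples the two features:
  // the admissible values of x now depend on y, and vice versa,
  // so they can no longer be treated as independent.
  def satisfiesConstraint(x: Double, y: Double): Boolean = f(x, y) < 20
}
```

For example, `ModelExample.f(2.0, math.Pi/4)` evaluates to 2.0, since sin(π/2) = 1.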
An astute Scala programmer would associate a model with a monoid for which the set is a group of observations and the operator is the function implementing the model.
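As a sketch of that analogy, a monoid can be written as a trait with an identity element and an associative operator; the trait and object names here are illustrative, not from any particular library. Note that the analogy only holds when the model's operator is in fact associative and admits an identity:

```scala
// A monoid: a set T with an associative binary operator and an identity.
trait Monoid[T] {
  def zero: T                 // identity element
  def op(a: T, b: T): T       // must be associative
}

// The "set" here is groups (sequences) of observations; the operator
// merges two groups, so a collection of observations can be folded
// through the model.
object ObsMonoid extends Monoid[Vector[Double]] {
  val zero: Vector[Double] = Vector.empty
  def op(a: Vector[Double], b: Vector[Double]): Vector[Double] = a ++ b
}
```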
Models come in a variety of shapes and forms:
Confusion between a model and a design is quite common in computer science because these terms mean different things to different people, depending on the subject. The following metaphors should help you understand these two concepts:
The selection of a model's features is the process of discovering and documenting the minimum set of variables required to build the model. Scientists assume that data contains many redundant or irrelevant features: redundant features provide no information beyond what the already selected features convey, and irrelevant features provide no useful information at all.
Feature selection consists of two consecutive steps:
The process of evaluating each possible subset of features to find the one that maximizes the objective function or minimizes the error rate is computationally intractable for large datasets. A model with n features requires 2^n - 1 evaluations.
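The exponential growth is easy to see by enumerating the non-empty subsets of a feature set. This sketch (the object and method names are hypothetical) counts and lists them:

```scala
// Exhaustive feature-subset search: a model with n features has
// 2^n - 1 non-empty subsets to evaluate, which grows exponentially.
object SubsetSearch {
  // Number of non-empty subsets of n features
  def numSubsets(n: Int): Long = (1L << n) - 1

  // Enumerate every non-empty subset of the given features
  def nonEmptySubsets(features: Set[String]): List[Set[String]] =
    features.subsets.filter(_.nonEmpty).toList
}
```

Even at n = 30 there are over a billion subsets, which is why greedy or heuristic selection strategies are used in practice.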
An observation is a set of indirect measurements of hidden, also known as latent, variables, which may be noisy or contain a high degree of correlation and redundancies. Using raw observations in a classification task would very likely produce inaccurate results. Using all features in each observation also incurs a high computation cost.
The purpose of feature extraction is to reduce the number of variables or dimensions of the model by eliminating redundant or irrelevant features. The features are extracted by transforming the original set of observations into a smaller set, at the risk of losing some vital information embedded in the original set.
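As a minimal illustration of dimension reduction, the sketch below drops low-variance feature columns from a set of observations. The object name, the threshold, and the row-major layout are assumptions for illustration; real extraction techniques such as principal components analysis transform, rather than merely drop, features:

```scala
// Naive feature extraction: discard columns whose variance falls
// below a threshold, since a near-constant feature carries little
// information for classification.
object VarianceFilter {
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  // rows: observations, one Seq of feature values per observation.
  // Returns the observations restricted to high-variance columns.
  def extract(rows: Seq[Seq[Double]], threshold: Double): Seq[Seq[Double]] = {
    val keep = rows.transpose.zipWithIndex.collect {
      case (col, i) if variance(col) > threshold => i
    }
    rows.map(row => keep.map(row))
  }
}
```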