Chapter 2. Hello World!

In the first chapter, you were acquainted with some rudimentary concepts regarding data processing, clustering, and classification. This chapter is dedicated to the creation and maintenance of a flexible end-to-end workflow to train and classify data. The first section of the chapter introduces a data-centric (functional) approach to create number-crunching applications.

You will learn how to:

  • Apply the concept of monadic design to create dynamic workflows
  • Leverage some of Scala's advanced patterns, such as the cake pattern, to build portable computational workflows
  • Take into account the bias-variance trade-off in selecting a model
  • Overcome overfitting in modeling
  • Break down data into training, test, and validation sets
  • Implement model validation in Scala using precision, recall, and F score

Modeling

Data is the lifeline of any scientist, and the selection of data providers is critical in developing or evaluating any statistical inference or machine learning algorithm.

A model by any other name

We briefly introduced the concept of a model in the Model categorization section in Chapter 1, Getting Started.

What constitutes a model? Wikipedia provides a reasonably good definition of a model as understood by scientists [2:1]:

A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.

Models that are rendered in software allow scientists to leverage computational power to simulate, visualize, manipulate and gain intuition about the entity, phenomenon or process being represented.

In statistics and probability theory, a model describes the data one might observe from a system, capturing the uncertainty and noise in those observations. A model allows us to infer rules, make predictions, and learn from data.

A model is composed of features, also known as attributes or variables, and a set of relations between those features. For instance, the model represented by the function f(x, y) = x.sin(2y) has two features, x and y, and a relation, f. Those two features are assumed to be independent. If the model is subject to a constraint such as f(x, y) < 20, the independence assumption no longer holds.

An astute Scala programmer would associate a model with a monoid whose underlying set is a collection of observations and whose operator is the function implementing the model.
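The monoid analogy can be sketched in a few lines of Scala. This is an illustrative example only; the trait and object names (Monoid, ObsMonoid) are ours, not from any library, and observations are simplified to plain vectors of doubles that combine by concatenation:

```scala
// A monoid is a set with an associative binary operator and an identity element.
trait Monoid[T] {
  def zero: T                 // identity element
  def op(a: T, b: T): T       // associative operator
}

// The "set" is collections of observations; the operator merges them.
object ObsMonoid extends Monoid[Vector[Double]] {
  override def zero: Vector[Double] = Vector.empty[Double]
  override def op(a: Vector[Double], b: Vector[Double]): Vector[Double] = a ++ b
}

// Combining observations is associative and zero is neutral:
val combined = ObsMonoid.op(Vector(1.0, 2.0), ObsMonoid.op(Vector(3.0), ObsMonoid.zero))
```

In a real workflow the operator would be the function implementing the model rather than simple concatenation; the point is only that observations and their combinator obey the monoid laws.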

Models come in a variety of shapes and forms:

  • Parametric: This consists of functions and equations (for example, y = sin(2t + w))
  • Differential: This consists of ordinary and partial differential equations (for example, dy = 2x.dx)
  • Probabilistic: This consists of probability distributions (for example, the Poisson distribution, p(x|λ) = exp(x.logλ – λ)/x!)
  • Graphical: This consists of graphs that abstract out the conditional independence between variables (for example, p(x,y|c) = p(x|c).p(y|c))
  • Directed graphs: This consists of temporal and spatial relationships (for example, a scheduler)
  • Numerical method: This consists of computational methods such as finite difference, finite elements, or Newton-Raphson
  • Chemistry: This consists of formulas and components (for example, H2O, 2Fe + 3Cl2 = 2FeCl3, and so on)
  • Taxonomy: This consists of a semantic definition and relationship of concepts (for example, APG/Eudicots/Rosids/Huaceae/Malvales)
  • Grammar and lexicon: This consists of a syntactic representation of documents (for example, the Scala programming language)
  • Inference logic: This consists of rules (for example, IF (stock vol > 1.5 * average) AND rsi > 80 THEN …)

Model versus design

The confusion between a model and design is quite common in computer science, the reason being that these terms have different meanings for different people depending on the subject. The following metaphors should help with your understanding of these two concepts:

  • Modeling: This describes something you know. A model makes an assumption, which becomes an assertion if proven correct (for example, the US population, p, increases by 1.2 percent a year: p(t+1) = 1.012.p(t)).
  • Designing: This manipulates the representation of things you don't know. Designing can be regarded as the exploration phase of modeling (for example, what are the features that contribute to the growth of the US population? Birth rate? Immigration? Economic conditions? Social policies?).
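The population-growth assertion above can be turned into a few lines of Scala. This is a sketch only; the function name, the initial population figure, and the default rate are illustrative assumptions:

```scala
// Project a population forward under the modeling assertion
// p(t+1) = (1 + rate) * p(t), with rate = 1.2 percent per year.
def project(p0: Double, years: Int, rate: Double = 0.012): Double =
  (0 until years).foldLeft(p0)((p, _) => p * (1.0 + rate))

// Hypothetical usage: a population of 320 million projected 10 years out.
val p10 = project(320.0e6, 10)
```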

Selecting features

The selection of a model's features is the process of discovering and documenting the minimum set of variables required to build the model. Scientists assume that data contains many redundant or irrelevant features. Redundant features provide no information beyond what is already given by the selected features, and irrelevant features provide no useful information at all.

Feature selection consists of two consecutive steps:

  1. Searching for new feature subsets.
  2. Evaluating these feature subsets using a scoring mechanism.

The process of evaluating each possible subset of features to find the one that maximizes the objective function or minimizes the error rate is computationally intractable for large datasets. A model with n features requires 2^n – 1 evaluations.
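The exhaustive search can be sketched in Scala by enumerating the 2^n – 1 non-empty subsets as bit masks. This is an illustrative sketch, not a recommended implementation; the scoring function is an assumption supplied by the caller:

```scala
// Exhaustive feature-subset search: enumerate every non-empty subset of
// n features (as a bit mask from 1 to 2^n - 1) and keep the subset that
// maximizes the caller-supplied score.
def bestSubset[T](features: Vector[T])(score: Vector[T] => Double): Vector[T] = {
  val n = features.size
  val subsets = (1 until (1 << n)).map { mask =>
    features.zipWithIndex.collect { case (f, i) if (mask & (1 << i)) != 0 => f }
  }
  subsets.maxBy(score)
}
```

With n = 20 features this already evaluates over a million subsets, which is why the exhaustive approach does not scale and heuristic search or greedy selection is used in practice.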

Extracting features

An observation is a set of indirect measurements of hidden, also known as latent, variables, which may be noisy or contain a high degree of correlation and redundancies. Using raw observations in a classification task would very likely produce inaccurate results. Using all features in each observation also incurs a high computation cost.

The purpose of feature extraction is to reduce the number of variables or dimensions of the model by eliminating redundant or irrelevant features. The features are extracted by transforming the original set of observations into a smaller set, at the risk of losing some vital information embedded in the original set.
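One minimal form of such a transformation is filtering out near-constant features, on the assumption that a feature with very low variance carries little information. The sketch below is illustrative only; the threshold value and the column-per-feature layout are our assumptions, and real pipelines use richer techniques such as principal component analysis:

```scala
// Population variance of a single feature (one column of the dataset).
def variance(xs: Vector[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => (x - mean) * (x - mean)).sum / xs.size
}

// Keep only the features whose variance reaches the threshold;
// each inner Vector is one feature across all observations.
def selectByVariance(columns: Vector[Vector[Double]],
                     threshold: Double): Vector[Vector[Double]] =
  columns.filter(col => variance(col) >= threshold)
```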
