In the most general sense, a model is an approximate description of a portion of reality. Models are essential to science and, in fact, to any area of knowledge: it is only possible to comprehend the world by concentrating on a small part of it at a time and making suitable simplifications.
Models can take many forms: a verbal description, a set of mathematical equations, or a segment of computer code. In this book, we are interested in a specific kind of model, the probabilistic or statistical model, which represents the variability that occurs in a nondeterministic experiment.
We use the term experiment in this book in a somewhat non-technical sense: an experiment is any observation of an event of interest. Examples include observing the number of visitors to a website, conducting an opinion poll, or running a clinical trial. The main characteristics of an experiment, for our purposes, are that it can be repeated and that it involves randomness, that is, each repetition of the same experiment may produce a different outcome.
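As a minimal sketch of what repeating an experiment looks like in code, the following simulates the website-visitor example. The function name observe_visitors, the choice of a Poisson distribution, and the mean of 20 visitors are assumptions made purely for illustration; they are not taken from the text.

```python
import numpy.random as rnd

# A seeded generator makes the simulated repetitions reproducible.
rng = rnd.default_rng(42)

def observe_visitors(mean_visitors=20):
    """Perform one repetition of the (hypothetical) experiment:
    count the visitors to a website during one hour.

    The Poisson distribution and the default mean are illustrative
    assumptions, not values given in the text.
    """
    return int(rng.poisson(mean_visitors))

# Repeat the same experiment five times. Each repetition is performed
# under identical conditions, yet the observed counts generally differ.
outcomes = [observe_visitors() for _ in range(5)]
print(outcomes)
```

Each call to observe_visitors() plays the role of one repetition of the experiment, and the variation among the printed counts is the randomness that a statistical model is meant to describe.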
The models that we will consider take the form of random variables. A random variable is an idealized representation of a random outcome that takes numerical values. It is important to realize that a random variable is an abstraction: it does not represent the outcome of a particular experiment; it only models the results we expect to obtain once the experiment is actually performed.
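The distinction between a random variable and the outcome of a particular experiment can be sketched with scipy.stats, where a distribution object stands for the abstraction and sampling from it stands for performing the experiment. The coin-tossing setup below (10 tosses of a fair coin, modeled as a binomial random variable) is an assumption chosen only for illustration.

```python
import scipy.stats as st

# The random variable itself: an abstract model of the number of heads
# in 10 tosses of a fair coin (an illustrative assumption).
# No experiment has been performed at this point.
X = st.binom(n=10, p=0.5)

# Properties of the model, available before any data is observed.
print(X.mean())   # expected number of heads, n * p = 5.0
print(X.pmf(5))   # probability of observing exactly 5 heads

# Performing the experiment: each draw is one concrete outcome,
# which need not equal the expected value.
samples = X.rvs(size=8, random_state=0)
print(samples)
```

Here X describes what results to expect, while samples holds the results of eight simulated repetitions; the model stays the same no matter which outcomes happen to be drawn.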
In the remainder of this chapter, we will discuss how statistical models are formulated and describe the most important models used in data analysis.
Before running the examples in this chapter, start the Jupyter Notebook. After the default imports, run the following commands in a cell:
from pandas import Series, DataFrame
import numpy.random as rnd
import scipy.stats as st
You are now ready to start running the code for this chapter.