Taking a short detour through probability theory

In order to appreciate Bayes' theorem, we need to get a hold of the following technical terms:

  • Random variable: This is a variable whose value depends on chance. A good example is the act of flipping a coin, which might turn up heads or tails. If a random variable can take on only a limited number of values, we call it discrete (such as a coin flip or a dice roll); otherwise, we call it a continuous random variable (such as the temperature on a given day). Random variables are often typeset as capital letters.
  • Probability: This is a measure of how likely it is for an event to occur. We denote the probability of an event, e, happening as p(e), which must be a number between 0 and 1 (or between 0 and 100%). For example, for the random variable X, denoting a coin toss, we can describe the probability of getting heads as p(X = heads) = 0.5 (if the coin is fair).
  • Conditional probability: This is the measure of the probability of an event, given that another event has occurred. We denote the probability of event y happening, given that we know event x has already happened, as p(y|x) (read as p of y given x). For example, the probability of tomorrow being Tuesday if today is Monday is p(tomorrow will be Tuesday|today is Monday) = 1. Of course, there's also the chance that we won't see a tomorrow, but let's ignore that for now.
  • Probability distribution: This is a function that tells you the probability of different events happening in an experiment. For discrete random variables, this function is also called the probability mass function. For continuous random variables, this function is also called the probability density function. For example, the probability distribution of the coin toss, X, would take the value 0.5 for X = heads, and 0.5 for X = tails. Across all possible outcomes, the distribution must add up to 1.
  • Joint probability distribution: This is basically the preceding probability function applied to multiple random variables. For example, when flipping two fair coins, A and B, the joint probability distribution would enumerate all possible outcomes—(A = heads, B = heads), (A = heads, B = tails), (A = tails, B = heads), and (A = tails, B = tails)—and tell you the probability of each of these outcomes.
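To make these definitions a bit more concrete, here is a minimal sketch in plain Python (the dictionary names, such as p_A and joint, are purely illustrative) that enumerates the joint probability distribution of two fair coin flips and checks that it adds up to 1:

```python
from itertools import product

# Probability distributions of two fair coins, A and B.
p_A = {'heads': 0.5, 'tails': 0.5}
p_B = {'heads': 0.5, 'tails': 0.5}

# Because the two flips are independent, the joint probability of each
# outcome is simply the product of the individual probabilities.
joint = {(a, b): p_A[a] * p_B[b] for a, b in product(p_A, p_B)}

for outcome, prob in joint.items():
    print(outcome, prob)        # each of the four outcomes has probability 0.25

print(sum(joint.values()))      # the whole distribution adds up to 1.0
```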

Now, here's the trick: if we think of a dataset as a random variable, X, then a machine learning model is basically trying to learn the mapping of X onto a set of possible target labels, Y. In other words, we are trying to learn the conditional probability, p(Y|X), which is the probability that a random sample drawn from X has a target label, Y.
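The same counting logic carries over from coins to data and labels. The following toy sketch (the weather-themed feature values and labels are made up purely for illustration) estimates p(Y|X) directly from a handful of (x, y) pairs:

```python
from collections import Counter

# Each pair is (feature value x, target label y) -- an invented toy dataset.
data = [('sunny', 'play'), ('sunny', 'play'), ('sunny', 'stay home'),
        ('rainy', 'stay home'), ('rainy', 'stay home'), ('rainy', 'play')]

counts_xy = Counter(data)                  # joint counts of (x, y)
counts_x = Counter(x for x, _ in data)     # marginal counts of x

# Empirical estimate of p(Y = 'play' | X = 'sunny'):
p_play_given_sunny = counts_xy[('sunny', 'play')] / counts_x['sunny']
print(p_play_given_sunny)                  # 2/3, roughly 0.67
```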

There are two different ways to learn p(Y|X), as briefly mentioned earlier:

  • Discriminative models: This class of models directly learns p(Y|X) from training data without wasting any time trying to understand the underlying probability distributions (such as p(X), p(Y), or even p(X, Y)). This approach applies to pretty much every learning algorithm we have encountered so far: linear regression, k-nearest neighbors, decision trees, and so on.
  • Generative models: This class of models learns everything about the underlying probability distribution and then infers p(Y|X) from the joint probability distribution, p(X, Y). Because these models know p(X, Y), they not only can tell us how likely it is for a data point to have a particular target label but can also generate entirely new data points. Bayesian models are one example of this class of models; a rough sketch of the idea follows below.
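Here is that sketch, reusing the made-up weather data from the previous example: first learn the joint distribution p(X, Y) by counting, then infer p(Y|X) from it via the definition of conditional probability, and finally draw brand-new samples from p(X, Y). The variable names are again just illustrative, not part of any library's API:

```python
import random
from collections import Counter

data = [('sunny', 'play'), ('sunny', 'play'), ('sunny', 'stay home'),
        ('rainy', 'stay home'), ('rainy', 'stay home'), ('rainy', 'play')]

# Learn the joint distribution p(X, Y) from counts.
n = len(data)
p_xy = {pair: count / n for pair, count in Counter(data).items()}

# Infer the conditional p(Y | X = 'sunny') from the joint distribution.
p_x_sunny = sum(p for (x, _), p in p_xy.items() if x == 'sunny')
p_y_given_sunny = {y: p / p_x_sunny
                   for (x, y), p in p_xy.items() if x == 'sunny'}
print(p_y_given_sunny)          # roughly {'play': 0.67, 'stay home': 0.33}

# Because we know p(X, Y), we can also generate entirely new data points.
outcomes, probs = zip(*p_xy.items())
print(random.choices(outcomes, weights=probs, k=3))
```

A discriminative model would stop at the conditional estimate; the generative model's extra knowledge of p(X, Y) is what makes the sampling step in the last two lines possible.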

Cool, so how do Bayesian models actually work? Let's have a look at a specific example.
