Chapter 7. Basic Statistics

This chapter will focus on the statistical knowledge required by any aspiring data scientist.

We will explore ways of obtaining and sampling data while limiting bias, and then use statistical measures to quantify and visualize that data. Using the z-score and the empirical rule, we will see how to standardize data for both graphing and interpretability.

In this chapter, we will look at the following topics:

  • How to obtain and sample data
  • The measures of center, variance, and relative standing
  • Normalization of data using the z-score
  • The empirical rule

What are statistics?

This might seem like an odd question to ask, but I am frequently surprised by the number of people who cannot answer this simple and yet powerful question: what are statistics? Statistics are the numbers you always see on the news and in the paper. Statistics are useful when trying to prove a point or trying to scare someone, but what are they?

To answer this question, we need to back up for a minute and talk about why we even measure them in the first place. The goal of this field is to try to explain and model the world around us. To do that, we have to take a look at the population.

We can define a population as the entire pool of subjects of an experiment or a model.

Essentially, your population is who you care about. Who are you trying to talk about? If you are trying to test whether smoking leads to heart disease, your population would be the smokers of the world. If you are trying to study teenage drinking problems, your population would be all teenagers.

Now, suppose you want to ask a question about your population. For example, if your population is all of your employees (assume that you have 1,000 employees), perhaps you want to know what percentage of them use illicit drugs. The numerical answer to this question is called a parameter.

We can define a parameter as a numerical measurement describing a characteristic of a population.

For example, if you ask all 1,000 employees and 100 of them are using drugs, the rate of drug use is 10%. The parameter here is 10%.
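
As a minimal sketch, the parameter can be computed directly when every member of the population has been observed. The list `population` below is hypothetical data, constructed only to match the example (1,000 employees, 100 of whom use drugs).

```python
# Hypothetical population built to match the example in the text:
# 1 means the employee uses drugs, 0 means they do not.
population = [1] * 100 + [0] * 900

# A parameter is computed over the ENTIRE population, not a subset.
parameter = sum(population) / len(population)

print(f"Population drug use rate: {parameter:.0%}")
```

Because every employee is counted, there is no estimation involved here: the result is the parameter itself.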

However, let's get real; you probably can't ask every single employee whether they are using drugs. What if you have over 10,000 employees? It would be very difficult to track everyone down in order to get your answer. When this happens, measuring the parameter directly becomes impractical. In this case, we can estimate the parameter instead.

First, we will take a sample of the population.

We can define a sample of a population as a subset (not necessarily random) of the population.

So, suppose we ask 200 of your 1,000 employees. Of these 200, suppose 26 use drugs, making the drug use rate 13% (26/200). Here, 13% is not a parameter because we didn't get a chance to ask everyone. This 13% is an estimate of a parameter. Do you know what that's called?

That's right, a statistic!

We can define a statistic as a numerical measurement describing a characteristic of a sample of a population.
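
The difference between a parameter and a statistic can be sketched with a small simulation. Everything here is an assumption made for illustration: the `population` list is hypothetical data with a true rate of 10%, the helper `sample_rate` is not from any library, and the sample is drawn as a simple random sample even though, as noted above, a sample is not necessarily random.

```python
import random

# Hypothetical population: 1,000 employees, 100 of whom (10%) use drugs.
population = [1] * 100 + [0] * 900

def sample_rate(pop, n, seed=None):
    """Return the proportion of 1s in a simple random sample of size n."""
    rng = random.Random(seed)     # seeded generator for reproducibility
    sample = rng.sample(pop, n)   # draws n members without replacement
    return sum(sample) / n

# The statistic: a rate computed from only 200 of the 1,000 employees.
statistic = sample_rate(population, 200, seed=0)
print(f"Sample drug use rate: {statistic:.1%}")
```

Changing the seed draws a different sample and will generally produce a different statistic, while the parameter of the population stays fixed at 10%.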

A statistic is just an estimate of a parameter. It is a number that attempts to describe an entire population by describing a subset of that population. This is necessary because you cannot realistically hope to give a survey to every single teenager or to every single smoker in the world. That's what the field of statistics is all about: taking samples of populations and running tests on these samples.

So, the next time you are given a statistic, remember that the number describes only a sample of the population, not the entire pool of subjects.
