Point estimates of a population parameter and sampling distributions are both useful, but these approaches have the following two main issues:
For these reasons and more, we may turn to a concept known as the confidence interval.
A confidence interval is a range of values based on a point estimate that contains the true population parameter at some confidence level.
Confidence is an important concept in advanced statistics, and its meaning is sometimes misconstrued. Informally, a confidence level does not represent the probability that one particular interval is correct; instead, it represents how often intervals built this way will contain the true population parameter across repeated samples. For example, if you want the procedure to capture the true population parameter 95% of the time, you set your confidence level to 95%.
Calculating a confidence interval involves finding a point estimate and then incorporating a margin of error to create a range. The margin of error is a value that represents our uncertainty about the point estimate and is based on our desired confidence level, the variance of the data, and the size of our sample. There are many ways to calculate confidence intervals; for the purpose of brevity and simplicity, we will look at a single way of taking the confidence interval of a population mean. For this confidence interval, we need the following: a point estimate (the sample mean), an estimate of the standard error of the mean (the sample standard deviation divided by the square root of the sample size), and the degrees of freedom (the sample size minus 1).
Obtaining these numbers might seem arbitrary but, trust me, there is a reason for all of them. However, again for simplicity, I will use prebuilt Python modules, as shown, to calculate our confidence interval and then demonstrate its value:
import math
import numpy as np
from scipy import stats

sample_size = 100  # the size of the sample we wish to take

# a sample of sample_size taken from the 9,000 breaks population from before
sample = np.random.choice(a=breaks, size=sample_size)

sample_mean = sample.mean()  # the sample mean of the break lengths sample
sample_stdev = sample.std()  # sample standard deviation

# estimated standard error of the mean
sigma = sample_stdev / math.sqrt(sample_size)

# the confidence level is passed positionally because its keyword was
# renamed from alpha to confidence in newer SciPy releases
stats.t.interval(0.95,  # Confidence level 95%
                 df=sample_size - 1,  # Degrees of freedom
                 loc=sample_mean,  # Sample mean
                 scale=sigma)  # Standard error of the mean

# (36.36, 45.44)
To reiterate, this range of values (from 36.36 to 45.44) represents a confidence interval for the average break time with 95% confidence.
We already know that our population mean is 39.99, and the interval does include it.
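To make the margin of error less of a black box, here is a minimal sketch that builds the same kind of interval by hand and compares it with `stats.t.interval`. The `sample` array is a hypothetical stand-in generated from a seeded normal distribution, not the breaks data, and the names `std_error`, `t_crit`, and `margin_of_error` are chosen here for illustration only.

```python
import math

import numpy as np
from scipy import stats

# Hypothetical stand-in for the break-length sample; any 1-D numeric sample works.
rng = np.random.default_rng(0)
sample = rng.normal(loc=40, scale=20, size=100)

n = sample.size
sample_mean = sample.mean()
# Same estimate as in the text: standard deviation (NumPy default, ddof=0) over sqrt(n).
std_error = sample.std() / math.sqrt(n)

confidence = 0.95
# Two-sided critical value: leaves (1 - confidence) / 2 in each tail of the t distribution.
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
margin_of_error = t_crit * std_error

# Point estimate plus and minus the margin of error.
manual_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
library_interval = stats.t.interval(confidence, df=n - 1, loc=sample_mean, scale=std_error)

print(manual_interval)
print(library_interval)
```

Under these assumptions the two printed intervals should agree, which shows that the prebuilt call is doing nothing more than point estimate plus or minus critical value times standard error.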
I mentioned earlier that the confidence level is not a measure of how accurate our interval is but of how often an interval built this way will contain the population parameter at all.
To better understand the confidence level, let's take 10,000 confidence intervals and see how often our population mean falls in the interval. First, let's make a function, as illustrated, that makes a single confidence interval from our breaks data:
# function to make confidence interval
def makeConfidenceInterval():
    sample_size = 100
    sample = np.random.choice(a=breaks, size=sample_size)
    sample_mean = sample.mean()  # sample mean
    sample_stdev = sample.std()  # sample standard deviation
    # estimated standard error of the mean
    sigma = sample_stdev / math.sqrt(sample_size)
    # confidence level passed positionally (keyword is alpha in older SciPy, confidence in newer)
    return stats.t.interval(0.95, df=sample_size - 1, loc=sample_mean, scale=sigma)
Now that we have a function that will create a single confidence interval, let's create a procedure that will test the probability that a single confidence interval will contain the true population parameter, 39.99:
times_in_interval = 0.
for i in range(10000):
    interval = makeConfidenceInterval()
    if 39.99 >= interval[0] and 39.99 <= interval[1]:
        # if 39.99 falls in the interval
        times_in_interval += 1

print(times_in_interval / 10000)

# 0.9455
Success! We see that about 95% of our confidence intervals contained our actual population mean. Estimating population parameters through point estimates and confidence intervals is a relatively simple and powerful form of statistical inference.
Let's also take a quick look at how the size of confidence intervals changes as we change our confidence level. Let's calculate confidence intervals for multiple confidence levels and look at how large the intervals are by looking at the difference between the two numbers. Our hypothesis will be that as we make our confidence level larger, we will likely see larger confidence intervals to be surer that we catch the true population parameter:
for confidence in (.5, .8, .85, .9, .95, .99):
    confidence_interval = stats.t.interval(confidence, df=sample_size - 1, loc=sample_mean, scale=sigma)
    # the length of the confidence interval
    length_of_interval = round(confidence_interval[1] - confidence_interval[0], 2)
    print("confidence {0} has an interval of size {1}".format(confidence, length_of_interval))

confidence 0.5 has an interval of size 2.56
confidence 0.8 has an interval of size 4.88
confidence 0.85 has an interval of size 5.49
confidence 0.9 has an interval of size 6.29
confidence 0.95 has an interval of size 7.51
confidence 0.99 has an interval of size 9.94
We can see that as we wish to be more confident in our interval, our interval expands in order to compensate.
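One way to see why the interval expands is to look at the t critical value on its own. The sketch below assumes the same degrees of freedom as the text (a sample of 100) and prints the two-sided critical value for each confidence level; the names `df` and `critical_values` are illustrative and not part of the earlier code.

```python
from scipy import stats

df = 99  # degrees of freedom for a sample of 100, as in the text

critical_values = []
for confidence in (0.5, 0.8, 0.85, 0.9, 0.95, 0.99):
    # Two-sided critical value: leaves (1 - confidence) / 2 in each tail.
    t_crit = stats.t.ppf((1 + confidence) / 2, df=df)
    critical_values.append(t_crit)
    print("confidence {0}: critical value {1:.3f}".format(confidence, t_crit))
```

The width of the interval is twice the critical value times the standard error, so with the sample held fixed, the width grows in step with the critical value as the confidence level rises.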
Next, we will take our concept of confidence levels and look at statistical hypothesis testing in order to both expand on these topics and also create (usually) even more powerful statistical inferences.