Introducing kernel density estimation

Kernel density estimation is a process by which we can estimate the shape of a dataset. Once we have estimated the shape of a dataset, we can compute the probability with which a particular event will occur.

In this section, we're going to introduce the kernel density estimator. The kernel density estimator requires a kernel function, so we are going to discuss the requirements of a kernel function and how the normal distribution meets those requirements. Finally, we're going to compute the KDE of a small set of values. Kernel density estimation tries to estimate the shape of a dataset. All data has a shape - we could also refer to this as its density - and that shape is not always clear. Once we have estimated the shape of a dataset, we can compute the probability of a particular observation.

We require a kernel function, and in this section we will use the normal distribution. There are three requirements of a kernel, and several different formulas meet these requirements. The first is that a kernel needs to be smooth. Secondly, a kernel needs to be symmetrical and non-negative. Symmetrical means that, for a given position on the negative side of the axis, the kernel has the exact same value at the matching position on the positive side. Non-negative simply means that the curve sits at or above 0. Finally, the area under the curve must be 1. So, the formula for the normal distribution, expressed as an integral from negative infinity to infinity, should give a result of 1.
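In standard notation, with mean \mu and standard deviation \sigma, that integral is:

    \int_{-\infty}^{\infty} \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \, dx = 1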

What's nice about the normal distribution is that it doesn't matter what you set as the mean or the standard deviation; the result will always be 1. So, here's the general algorithm for kernel density estimation. For each value in our dataset, we're going to compute a normal curve with a mean at that value, so the curve itself is going to shift around. We're going to take all of those curves and add them together; and then, finally, we're going to plot the result.

So, let's go over to our notebook, and we're going to take a small set of values, only three, as shown in the following example:
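A minimal sketch of that snippet (in a GHCi or IHaskell session; the exact notebook formatting may differ):

    values :: [Double]
    values = [1, 1, 5]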

We are calling this values, and we have passed in [1, 1, 5]. Let's imagine that we have a dataset and we're trying to understand the shape of this dataset, but we only have these three values to work with. We can take the range of these values, and by looking at them, you can tell the range is 1 to 5, as shown in the following example:
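In the notebook, that check might look like this:

    (minimum values, maximum values)
    -- (1.0,5.0)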

Now, whenever we compute a kernel density estimation, we need to set a domain. I like to pick my domain to run from 5 less than the lowest value all the way up to 5 greater than the highest. So, we need to set a domain from -4, stepping through -3.9 and so on, all the way up to 10, as shown in the following example:
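Using Haskell's arithmetic sequence syntax, this gives a grid from -4 to 10 in steps of 0.1:

    domain :: [Double]
    domain = [-4, -3.9 .. 10]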

That is a fairly broad domain, and I think this produces nice plots. Our next step is to compute a normal curve for each of our values. So, curve1 will be the normal with a mean of 1, because 1 is in our dataset, and we will keep the standard deviation at 1, as shown in the following example:
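A sketch of that step, assuming a normal probability density helper like the one below (the book may define it elsewhere; the name normal is taken from the text):

    -- Normal probability density with mean mu and standard deviation sigma
    normal :: Double -> Double -> Double -> Double
    normal mu sigma x =
      exp (negate ((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt (2 * pi))

    curve1 :: [Double]
    curve1 = map (normal 1 1) domain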

curve2 will be our normal with a mean of 5, because that's another one of our values. Again, we will keep the standard deviation at 1, as shown in the following example:
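Likewise, under the same assumed helper:

    curve2 :: [Double]
    curve2 = map (normal 5 1) domain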

So, for each value in our dataset, we compute a normal curve whose mean sits at that value's position. Let's go ahead and create a list of all of our curves, in the following way:
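One way to build that list is with a comprehension, one curve per value (so the repeated 1 contributes two identical curves):

    curves :: [[Double]]
    curves = [ map (normal v 1) domain | v <- values ]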

Now we need to compute the kernel density estimation by adding all of the curves together, as shown in the following example:
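A sketch of that step:

    kde :: [Double]
    kde = foldl1 (zipWith (+)) curves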

We have called this kde, and we have used the tool foldl1. foldl1 is a nice little Haskell power tool for combining a list. We have then used zipWith with (+) over our list of curves, which adds up all of our curve data element by element. Now let's go ahead and plot this information to see what it looks like, as shown in the following example:
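Assuming the Graphics.EasyPlot package for plotting (whichever library the notebook actually uses, the key step is pairing each domain point with its kde value):

    import Graphics.EasyPlot

    -- Plot (x, y) pairs: the domain against the summed curves
    plot X11 $ Data2D [Title "KDE of [1,1,5]"] [] (zip domain kde)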

We have zipped our domain with our kde. The following graph shows the shape of our dataset:

Now, don't try to read too much into the y axis at this point. Just notice that, at about position 1, we have a very high likelihood, and, at position 5, we have another high likelihood. There is a dip at position 3, which is midway between 1 and 5. Also notice that the graph trails off on either side. So, this is the shape of our [1, 1, 5] dataset. Now, what I'd like to do is make the area under this curve equal to 1, and we can do that in the following way:
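A sketch of that normalization, dividing every point by the sum of all the points:

    kdeAdj :: [Double]
    kdeAdj = map (/ sum kde) kde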

We have called it kdeAdj, and we have simply divided each point by the sum of all the kde values. Now, if we compute the sum of kdeAdj, we will get the following output:
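That check is simply the following; mathematically the result is exactly 1, though floating-point rounding can leave it a hair off:

    sum kdeAdj
    -- approximately 1.0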

We've got a value that's almost exactly 1, and now we are going to plot kdeAdj, as shown in the following example:
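The plot call mirrors the earlier one (again assuming Graphics.EasyPlot):

    plot X11 $ Data2D [Title "Adjusted KDE of [1,1,5]"] [] (zip domain kdeAdj)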

The output for this would be as follows:

Now, this won't change the shape of our dataset, but it does introduce probabilities for events. For example, assuming that we have a continuous dataset, the probability that we will get exactly 1 is a little over 2.5%; the probability that we'll see a 5 looks like it's a little under 1.5%; and the probability that we'll get a 3 is about half a percent. Notice that the graph is continuous, so we can see the probability of every point along the shape of our dataset, not just at the integer positions. We now have enough background to introduce the KDE function. In the next section, we're going to introduce an application of the kernel density estimator, along with the kernel density estimator function itself.
