While outliers can reveal all kinds of useful intelligence such as issues in data capture or associated process patterns, they can cause problems when you are building data visualizations. The most common issue is to do with scale, as you can see in the following example.
For this recipe, we will make use on inline data load which gives us the information on the number of call made for each month. Add the following code into the data load editor and reload the Qlik Sense application:
Data: LOAD * INLINE [ Month, Date, Calls Jan, 27/01/15, 25 Jan, 28/01/15, 27 Jan, 29/01/15, 25 Jan, 30/01/15, 600 Jan, 31/01/15, 22 Feb, 01/02/15, 20 Feb, 02/02/15, 19 Feb, 03/02/15, 21 Feb, 04/02/15, 1 Feb, 05/02/15, 600 ]; .
Average Calls
.Avg(If (Calls > Aggr(NODISTINCT Fractile(Calls, 0.1), Month) and Calls < Aggr(NODISTINCT Fractile(Calls, 0.9), Month),Calls))
If we look at the source data we loaded at the beginning of the recipe it is clear that there are some outliers present. To exclude these and get a real picture of the normal, average amount of calls, we remove the top and bottom 10 percent of the value. This is done using the fractile
function. The fractile
function calculates the cut-off point for 10 and 90 percent based on our data. The Aggr
function is needed because fractile
is an aggregation function being nested inside another aggregation.
Another method of handling outliers is not to exclude them from the expression, but hide them from what is visualized. For example, if a data point far exceeds the norm, you can set the axis limit to the second largest value; this focuses the visualization on the points that are closely related. You can do this by going to the object properties:
=Max(aggr(avg(Calls),Date),2)
Here, we simply work out what the second largest number is and set that as the axis limit. This way we can produce an all inclusive line graph by date, albeit one data point will be off the screen.