How to do it...

Let's start with a popular definition of an outlier.

A point, $x$, that meets the following criterion:

$$Q_1 - 1.5 \cdot IQR \le x \le Q_3 + 1.5 \cdot IQR$$

is not considered an outlier; any point outside this range is. In the preceding equation, $Q_1$ is the first quartile (25th percentile), $Q_3$ is the third quartile (75th percentile), and $IQR$ is the interquartile range, defined as the difference between $Q_3$ and $Q_1$: $IQR = Q_3 - Q_1$.
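To make the rule concrete, here is a minimal worked example in plain Python. The sample values are hypothetical and are not taken from the recipe's dataset. The quartiles are computed exactly with the standard library's `statistics.quantiles`, whereas the recipe itself uses Spark's approximate quantiles, so the numbers are only meant to illustrate the arithmetic.

```python
from statistics import quantiles

# Hypothetical sample: nine ordinary values and one suspicious one.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# n=4 returns the three quartile cut points (Q1, median, Q3).
q1, _, q3 = quantiles(values, n=4, method='inclusive')

iqr = q3 - q1              # interquartile range
lower = q1 - 1.5 * iqr     # lower cut-off point
upper = q3 + 1.5 * iqr     # upper cut-off point

# Anything outside [lower, upper] is flagged as an outlier.
outliers = [v for v in values if v < lower or v > upper]

print(q1, q3, iqr)     # 3.25 7.75 4.5
print(lower, upper)    # -3.5 14.5
print(outliers)        # [100]
```

Only the value 100 falls outside the range from -3.5 to 14.5, so it is the only point the rule treats as an outlier.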

To flag the outliers, follow these steps:

  1. Let's calculate our ranges first:
features = ['Displacement', 'Cylinders', 'FuelEconomy']
quantiles = [0.25, 0.75]

cut_off_points = []

for feature in features:
    # The third argument is the acceptable relative error of the estimate
    quants = imputed.approxQuantile(feature, quantiles, 0.05)

    IQR = quants[1] - quants[0]
    # Store the lower and upper cut-off points for this feature
    cut_off_points.append((feature, [
        quants[0] - 1.5 * IQR,
        quants[1] + 1.5 * IQR,
    ]))

cut_off_points = dict(cut_off_points)
  2. Next, we flag the outliers:
outliers = imputed.select(*['id'] + [
    (
        (imputed[f] < cut_off_points[f][0]) |
        (imputed[f] > cut_off_points[f][1])
    ).alias(f + '_o') for f in features
])
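The flagging logic can also be traced without a Spark session. The following is a plain-Python sketch, not the recipe's code: the rows, the cut-off values, and the `flag_row` helper are all hypothetical, chosen only to mirror the shape of `cut_off_points` and the `_o` suffix used for the flag columns.

```python
# Hypothetical rows standing in for records of the DataFrame.
rows = [
    {'id': 1, 'Displacement': 2.0, 'Cylinders': 4},
    {'id': 2, 'Displacement': 8.4, 'Cylinders': 4},
    {'id': 3, 'Displacement': 3.5, 'Cylinders': 16},
]

# Hypothetical cut-off points: feature -> [lower, upper].
cut_offs = {
    'Displacement': [-1.0, 7.0],
    'Cylinders': [1.0, 9.0],
}


def flag_row(row, cut_offs):
    """Return the row id plus one Boolean '<feature>_o' flag per feature."""
    flags = {'id': row['id']}
    for feature, (lower, upper) in cut_offs.items():
        value = row[feature]
        # True when the value lies outside the [lower, upper] range.
        flags[feature + '_o'] = value < lower or value > upper
    return flags


flagged = [flag_row(row, cut_offs) for row in rows]

# Ids of rows with at least one flagged feature.
outlier_ids = [
    f['id'] for f in flagged
    if any(v for k, v in f.items() if k.endswith('_o'))
]

print(flagged[1])    # {'id': 2, 'Displacement_o': True, 'Cylinders_o': False}
print(outlier_ids)   # [2, 3]
```

As in the Spark version, each feature gets its own Boolean column, so you can tell which feature caused a row to be flagged.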