Introduction

Now that we have a thorough understanding of how RDDs and DataFrames work and what they can do, we can start preparing ourselves and our data for modeling. 

To paraphrase a quote often attributed to Albert Einstein:

"The universe and the problems with any dataset are infinite, and I am not sure about the former."

The preceding is, of course, a joke. However, any dataset you work with, be it acquired at work, found online, collected yourself, or obtained through any other means, is dirty until proven otherwise; you should not trust it, play with it, or even look at it until you have proven to yourself that it is sufficiently clean (there is no such thing as totally clean data).

What problems can your dataset have? Well, to name a few:

  • Duplicated observations: These arise through system faults or operator error
  • Missing observations: These can emerge due to sensor problems, respondents' unwillingness to answer a question, or simply some data corruption
  • Anomalous observations: Observations that stand out when compared with the rest of the dataset or the population
  • Encoding problems: Text fields that are not normalized (for example, words are not stemmed or synonyms are used interchangeably), are written in different languages, or contain gibberish input; date and datetime fields may also be encoded inconsistently
  • Untrustworthy answers (especially true for surveys): When respondents lie for any reason; this type of dirty data is much harder to work with and clean up
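Before we turn to Spark, the first two problem types can be illustrated in a few lines of plain Python; the rows below are invented for this sketch and are not part of the chapter's dataset:

```python
# Toy records: (id, model, fuel economy); values invented for illustration
records = [
    (1, 'GTI', 25),
    (2, 'GTI', 25),     # same car recorded twice under a new id
    (2, 'GTI', 25),     # exact duplicate of the previous row
    (3, '330i', None),  # missing fuel-economy value
]

n_total = len(records)                                # all rows
n_distinct = len(set(records))                        # exact duplicates collapse
n_missing = sum(1 for row in records if None in row)  # rows with a missing field

print(n_total, n_distinct, n_missing)  # 4 3 1
```

Note that only the exact duplicate disappears in the distinct count; the row recorded twice under different IDs survives, which is why duplicate detection usually needs to consider subsets of columns as well.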

As you can see, your data can be plagued by thousands upon thousands of traps that are just waiting for you to fall into them. Cleaning up the data and getting familiar with it is what we (as data scientists) do 80% of the time (the remaining 20% we spend building models and complaining about cleaning data). So fasten your seatbelt and prepare for a bumpy ride; it is necessary if we are to trust our data and get familiar with it.

In this chapter, we will work with a small dataset of 22 records:

dirty_data = spark.createDataFrame([
(1,'Porsche','Boxster S','Turbo',2.5,4,22,None)
, (2,'Aston Martin','Vanquish','Aspirated',6.0,12,16,None)
, (3,'Porsche','911 Carrera 4S Cabriolet','Turbo',3.0,6,24,None)
, (3,'General Motors','SPARK ACTIV','Aspirated',1.4,None,32,None)
, (5,'BMW','COOPER S HARDTOP 2 DOOR','Turbo',2.0,4,26,None)
, (6,'BMW','330i','Turbo',2.0,None,27,None)
, (7,'BMW','440i Coupe','Turbo',3.0,6,23,None)
, (8,'BMW','440i Coupe','Turbo',3.0,6,23,None)
, (9,'Mercedes-Benz',None,None,None,None,27,None)
, (10,'Mercedes-Benz','CLS 550','Turbo',4.7,8,21,79231)
, (11,'Volkswagen','GTI','Turbo',2.0,4,None,None)
, (12,'Ford Motor Company','FUSION AWD','Turbo',2.7,6,20,None)
, (13,'Nissan','Q50 AWD RED SPORT','Turbo',3.0,6,22,None)
, (14,'Nissan','Q70 AWD','Aspirated',5.6,8,18,None)
, (15,'Kia','Stinger RWD','Turbo',2.0,4,25,None)
, (16,'Toyota','CAMRY HYBRID LE','Aspirated',2.5,4,46,None)
, (16,'Toyota','CAMRY HYBRID LE','Aspirated',2.5,4,46,None)
, (18,'FCA US LLC','300','Aspirated',3.6,6,23,None)
, (19,'Hyundai','G80 AWD','Turbo',3.3,6,20,None)
, (20,'Hyundai','G80 AWD','Turbo',3.3,6,20,None)
, (21,'BMW','X5 M','Turbo',4.4,8,18,121231)
, (22,'GE','K1500 SUBURBAN 4WD','Aspirated',5.3,8,18,None)
], ['Id','Manufacturer','Model','EngineType','Displacement',
'Cylinders','FuelEconomy','MSRP'])

Throughout the subsequent recipes, we will clean up the preceding dataset and learn a little bit more about it.
