Initial exploration

Before anything else, we need to take a look at the data itself, as well as its columns and rows. It's reasonable to start data exploration by understanding the following:

  1. What specific values look like. Use df.head(N), df.tail(N), or df.sample(N) to retrieve (and print) the first N, last N, or N random rows of the dataset. For head and tail, N defaults to 5; for sample, it defaults to 1 (one row). Alternatively, the sample method can take a frac argument, which returns a fraction of the records: for example, df.sample(frac=0.25) returns 25% of the initial dataset. Note that printing will omit some columns in the middle if there are too many of them.
  2. The overall shape of the dataset, that is, the number of rows and columns. To get it, use the df.shape attribute, which returns a tuple of two numbers: the first is the number of rows, and the second is the number of columns. Alternatively, len(df) returns the number of rows. For a wide dataframe with many columns, it is often useful to print all the column names by converting the column index to a list: list(df.columns). Without the conversion, printing df.columns can hide the names in the middle when there are many of them (this truncation, which applies both to Series and to rows in a dataframe, can be changed via the pandas display settings, if needed).
  3. The data type of each column, available through the df.dtypes attribute of the dataframe or the df[col].dtype attribute of a particular column. The first returns (so we can print) a Series holding the data type of each column, indexed by column name. The latter returns a single dtype.
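
The three checks above can be sketched as follows. This is a minimal illustration, not the book's own code: the small dataframe and its column names are hypothetical stand-ins for the real dataset, and random_state is set only to make the sampling repeatable.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (8 rows, 4 columns).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "date": ["1 May 1940", "June 1941", "1942", "3 Jan 1943",
             "1943-44", "July 1944", "1945", "unknown"],
    "casualties": ["1,200 killed", "unknown", "300+", "2,000 (est.)",
                   "none", "45 wounded", "1,000-1,500", "unknown"],
})

# 1. What specific values look like.
first_rows = df.head()       # first 5 rows (the default)
last_rows = df.tail(2)       # last 2 rows
one_row = df.sample()        # 1 random row (the default)
quarter = df.sample(frac=0.25, random_state=0)  # 25% of the rows

# 2. The overall shape.
n_rows, n_cols = df.shape    # (rows, columns)
column_names = list(df.columns)

# 3. Data types: a Series for the whole dataframe, one dtype for a column.
all_dtypes = df.dtypes
id_dtype = df["id"].dtype

# Show every column when printing, without changing the global setting.
with pd.option_context("display.max_columns", None):
    print(first_rows)
print(n_rows, n_cols)
print(all_dtypes)
```

Using pd.option_context keeps the display change local to one block; pd.set_option("display.max_columns", None) would apply it for the rest of the session instead.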

In our case, all but one column in this dataset are objects, which usually means strings. Furthermore, many are vaguely structured and clearly not ready for quantitative analysis: before we run the numbers, we first need to extract them. From the output of df.head(), it is clear that most columns require cleaning in order to be useful. So that's what we'll do next.
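
A quick way to confirm this kind of observation is to count columns per data type and list the ones that are not numeric. The sketch below again uses a hypothetical toy dataframe rather than the actual dataset; the exact dtype reported for text columns (object or a dedicated string type) depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy data: one numeric column, three loosely structured text columns.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "date": ["1 May 1940", "June 1941", "1942"],
    "casualties": ["1,200 killed", "unknown", "300+"],
})

# How many columns there are of each data type.
dtype_counts = df.dtypes.value_counts()

# Columns that are not numeric yet and therefore need cleaning first.
text_columns = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]

print(dtype_counts)
print(text_columns)
```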
