How it works...

First, we specify the path to our dataset. Like all the other datasets we use in this book, census_income.csv is located in the data folder, which is accessible from the parent folder.

Next, we use the .read property of SparkSession, which returns a DataFrameReader object. The first parameter to the .csv(...) method specifies the path to the data. Our dataset has the column names in the first row, so we set the header option to instruct the reader to use that row for the column names rather than treating it as data. The inferSchema parameter instructs the DataFrameReader to detect the data type of each column automatically, at the cost of an extra pass over the data.

Let's check whether the data types were inferred correctly:

census.printSchema()

The preceding code produces the following output:

As you can see, the data types of certain columns were detected properly; without the inferSchema parameter, every column would default to a string.
