Counting the total road length in 2011 revisited

In this next demonstration, you will approach the same problem of totaling the road length, this time using pandas. To start, create a file called pandas_intro.py that imports the pandas module as follows:

import pandas

Reading CSV data using the pandas module is quite simple. Pandas combines the process of opening a file with the process of reading and parsing the data. To read a CSV file, you can use the pandas.read_csv() function, which takes as input the path to a file and returns a pandas dataframe. In the following continuation of pandas_intro.py, the pandas.read_csv() function is used to read the data from artificial_roads_by_region.csv into a pandas dataframe called roads:

import pandas

## read the csv file into a pandas dataframe
roads = pandas.read_csv("../data/input_data/artificial_roads_by_region.csv")
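If you would like to try pandas.read_csv() without the project data at hand, the following is a minimal sketch that reads a small CSV from an in-memory string instead of a file. The region names and numbers are made up for illustration and are not taken from artificial_roads_by_region.csv; only the "region name" and year column headers mirror the ones used in this section.

```python
import io
import pandas

# Hypothetical stand-in for the real CSV file (values are invented)
csv_text = """region name,2011,2010
Region A,10.5,10.0
Region B,20.0,19.5
"""

# read_csv accepts a path or a file-like object, so StringIO works here
sample = pandas.read_csv(io.StringIO(csv_text))

print(type(sample))  # the pandas DataFrame class
print(sample.shape)  # (number of rows, number of columns)
```

The first line of the text is treated as the header row, so the sample dataframe has two rows and three columns.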

In the next step, I will add a couple of lines to pandas_intro.py to print the column headers of the dataframe using the Python list() function. The list() function converts any iterable to a Python list. Iterating over a pandas dataframe yields its column labels, so calling list() on a dataframe returns a Python list of its column headers.

import pandas

## read the csv file into a pandas dataframe
roads = pandas.read_csv("../data/input_data/artificial_roads_by_region.csv")

## print out the column headers
print("column headers:")
print(list(roads))
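As a quick check of this behavior, here is a small sketch on made-up data (the sample values are hypothetical). It also shows that the same headers are available through the dataframe's columns attribute, which some readers find more explicit.

```python
import io
import pandas

# Hypothetical one-row CSV with the same kind of headers as the project file
csv_text = "region name,2011,2010\nRegion A,10.5,10.0\n"
sample = pandas.read_csv(io.StringIO(csv_text))

# list() iterates over the dataframe, which yields the column labels
headers = list(sample)
print(headers)

# the columns attribute holds the same labels
print(list(sample.columns) == headers)
```

Note that the year headers come back as strings such as "2011", which is why the columns are indexed with quoted names later on.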

In pandas, columns can be indexed using the column names, similar to the way dictionaries are indexed using keys. In the following continuation of pandas_intro.py, the 2011 column is selected from the dataframe and printed. An individual column selected from a pandas dataframe is a slightly different object, called a pandas series, that functions similarly to a dataframe:

....
# print(list(roads))

## extract roads from 2011
roads_2011 = roads['2011']
print(roads_2011)
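To see the difference between a dataframe and a series for yourself, here is a minimal sketch using a hand-built dataframe. The data is invented; only the indexing pattern matches the listing above.

```python
import pandas

# Hypothetical data standing in for the roads dataframe
sample = pandas.DataFrame({
    "region name": ["Region A", "Region B"],
    "2011": [10.5, 20.0],
})

# indexing with a single column name returns a series, not a dataframe
column = sample["2011"]

print(type(column))    # the pandas Series class
print(column.name)     # the series remembers which column it came from
print(column.iloc[0])  # first value, selected by position
```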

Printing a long pandas dataframe or series shows only the beginning and end of each column so that the output won't flood your terminal. The following is the output on my machine from running pandas_intro.py at this stage:

You may have noticed that several of the values printed to the output are shown as NaN. This is because pandas automatically detects empty fields in the CSV file and assigns them its own marker for a missing (NA) value. You do not need to remove NA values for this project, as the pandas functions used here skip over them by default.
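The following sketch illustrates this on a tiny made-up CSV in which one field is left empty. The data is hypothetical; isna() and sum() are standard series methods.

```python
import io
import pandas

# Hypothetical CSV: Region B has no value for 2011
csv_text = """region name,2011
Region A,10.5
Region B,
Region C,20.0
"""
sample = pandas.read_csv(io.StringIO(csv_text))

# isna() flags the entries that pandas treated as missing
missing = sample["2011"].isna()
print(missing.tolist())  # [False, True, False]

# sum() skips missing values by default
total = sample["2011"].sum()
print(total)  # 30.5
```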

Converting between datatypes in pandas can be done using the dataframe.astype() or series.astype() function. Both functions take as input a string that indicates the data type to convert to. Recall that in the previous exercise the 2011 column was originally a string datatype and needed to be converted to a float datatype. In the following continuation of pandas_intro.py, the values of the roads_2011 series are converted to a float data type:

....
# print(roads_2011)

## convert the data type from string to float
roads_2011_2 = roads_2011.astype('float')
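Here is a small sketch of the conversion on a series that really does hold strings. The values are made up, and the series is built by hand rather than read from the project file.

```python
import pandas

# Hypothetical series whose values are stored as text
text_values = pandas.Series(["10.5", "20.0", "3"])

# astype returns a new series; the original is left unchanged
float_values = text_values.astype("float")

print(float_values.dtype)  # float64
print(float_values.sum())  # 33.5
```

Because astype() returns a new object, the result has to be assigned to a name, as is done with roads_2011_2 above.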

Once the data is converted, the last step is to add all of the values in the roads_2011_2 series. The dataframe.sum() function or the series.sum() function can be used to add the values of a pandas dataframe or pandas series, respectively. Both skip over NA values by default. In the following continuation of pandas_intro.py, the sum of the values in the 2011 column is calculated:

....
roads_2011_2 = roads_2011.astype('float')

## find the sum of the values from 2011
total_2011 = roads_2011_2.sum()

print("total length of roads as of 2011:")
print(total_2011)

At this stage, running pandas_intro.py should produce the following output:

Pandas also makes it possible to select multiple columns at once using a list of column names. In the following continuation of pandas_intro.py, a new dataframe is created that contains every column except the non-numerical column with the region name:

....
# print(total_2011)

## create a list of columns to extract
columns = ["2011","2010","2009","2008","2007","2006","2005","2004","2003","2002","2001","2000"]
## extract the numerical data variables
roads_num = roads[columns]
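Typing out every year by hand is error-prone, so you may prefer to build the list with a list comprehension. The following is a sketch of that idea on a hypothetical dataframe with invented values; it assumes, as in the listing above, that the year headers are strings running from 2011 down to 2000.

```python
import pandas

# build ["2011", "2010", ..., "2000"] instead of typing it out
columns = [str(year) for year in range(2011, 1999, -1)]

# hypothetical data: two regions, the same made-up values for every year
data = {"region name": ["Region A", "Region B"]}
for year in columns:
    data[year] = [1.0, 2.0]
sample = pandas.DataFrame(data)

# indexing with a list of names returns a dataframe with just those columns
subset = sample[columns]
print(list(subset))
print(subset.shape)  # (2, 12)
```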

Alternatively, the previous step could be expressed more concisely using the dataframe.drop() function, which drops a specified set of columns or rows. Passing axis=1 tells pandas that the label refers to a column rather than a row:

## the more concise way
roads_num = roads.drop("region name",axis=1)
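As a sketch on made-up data, the following shows that dataframe.drop() returns a new dataframe and leaves the original untouched. It also shows the columns keyword, which is an equivalent way of saying the same thing that some readers find easier to read than axis=1.

```python
import pandas

# Hypothetical data standing in for the roads dataframe
sample = pandas.DataFrame({
    "region name": ["Region A", "Region B"],
    "2011": [10.5, 20.0],
    "2010": [10.0, 19.5],
})

# two equivalent ways of dropping a column
by_axis = sample.drop("region name", axis=1)
by_keyword = sample.drop(columns=["region name"])

print(list(by_axis))               # only the year columns remain
print(by_axis.equals(by_keyword))  # True
print(list(sample))                # the original still has all three columns
```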

Finally, the dataframe.sum() function can be used to add multiple columns at once. In the following continuation of pandas_intro.py, the total for every year is found simultaneously.

....
roads_num = roads.drop("region name",axis=1)

## sum along the vertical axis for all columns
total_by_year = roads_num.sum(axis=0)

print("total road length by year:")
print(total_by_year)

Note that an axis value of 0 was passed to the dataframe.sum() function in order to specify that the sum should take place along the vertical axis, that is, down the rows of each column. The result is a series with one total per column.
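To make the role of the axis concrete, here is a minimal sketch on invented numbers that sums the same dataframe both ways. The region labels and values are hypothetical.

```python
import pandas

# Hypothetical numeric data, with regions as the row labels
sample = pandas.DataFrame(
    {"2011": [10.5, 20.0], "2010": [10.0, 19.5]},
    index=["Region A", "Region B"],
)

# axis=0 sums down the rows: one total per year column
per_year = sample.sum(axis=0)
print(per_year["2011"])  # 30.5

# axis=1 sums across the columns: one total per region row
per_region = sample.sum(axis=1)
print(per_region["Region A"])  # 20.5
```

For the road data, axis 0 is the one you want, since adding a single region's values across different years would count the same roads several times.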

Running pandas_intro.py at this point should print out a series of summed columns:

That's it for pandas! The way the pandas module works in Python is analogous to the way tabular data is represented and processed in R. This will become clearer when I introduce R in Chapter 6, Cleaning Numerical Data - An Introduction to R and RStudio. Because much of the functionality of the pandas module is similar to R, I have covered it only briefly here. For more on pandas, I've made a link to the pandas documentation available in the Links and Further Reading document of the external resources.
