Benefits of using pandas

pandas forms a core component of the Python data analysis ecosystem. Its distinguishing feature is that the data structures it provides are naturally suited to data analysis: primarily the DataFrame and, to a lesser extent, the Series (1D vectors) and the Panel (3D tables).
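As a rough illustration of the two main structures, the following sketch builds a Series and a DataFrame by hand. The variable names are hypothetical, and the values are copied from the first three Belarus rows of the sample data used later in this section.

```python
import pandas as pd

# A Series is a labeled one-dimensional array; here the index holds the years.
co2 = pd.Series([5.91, 5.87, 6.03], index=[2000, 2001, 2002], name="CO2Emissions")

# A DataFrame is a two-dimensional table; each column is a Series
# and all columns share the same row index.
belarus = pd.DataFrame(
    {
        "Year": [2000, 2001, 2002],
        "CO2Emissions": [5.91, 5.87, 6.03],
        "PowerConsumption": [2988.71, 2996.81, 2982.77],
    }
)

print(co2.loc[2001])            # label-based lookup on the Series
print(belarus["CO2Emissions"])  # selecting one column returns a Series
print(belarus.shape)            # (number of rows, number of columns)
```

Selecting a single column of the DataFrame yields a Series, which is why the two structures are usually discussed together.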

Simply put, pandas and statsmodels can be described as Python's answer to R, the data analysis and statistical programming language that provides both data structures, such as R data frames, and a rich statistical library for data analysis.

The benefits of pandas compared to using a language such as Java, C, or C++ for data analysis are manifold:

  • Data representation: It can concisely represent data in a form that's naturally suited to data analysis via its DataFrame and Series data structures. Doing the equivalent in Java/C/C++ requires many lines of custom code, because these general-purpose languages do not ship with built-in tabular data structures for data analysis.
  • Data subsetting and filtering: It permits easy subsetting and filtering of data, procedures that are a staple of doing data analysis.
  • Concise and clear code: Its concise and clear API allows the user to focus more on their core goal, rather than having to write a lot of scaffolding code in order to perform routine tasks. For example, reading a CSV file into a DataFrame data structure in memory takes two lines of code, while doing the same task in Java/C/C++ requires many more lines of code or calls to non-standard libraries, as illustrated below. Let's suppose that we had the following data to read:

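| Country | Year | CO2Emissions | PowerConsumption | FertilityRate | InternetUsagePer1000People | LifeExpectancy | Population |
|---------|------|--------------|------------------|---------------|----------------------------|----------------|------------|
| Belarus | 2000 | 5.91 | 2988.71 | 1.29 | 18.69 | 68.01 | 1.00E+07 |
| Belarus | 2001 | 5.87 | 2996.81 |  | 43.15 |  | 9970260 |
| Belarus | 2002 | 6.03 | 2982.77 | 1.25 | 89.8 | 68.21 | 9925000 |
| Belarus | 2003 | 6.33 | 3039.1 | 1.25 | 162.76 |  | 9873968 |
| Belarus | 2004 |  | 3143.58 | 1.24 | 250.51 | 68.39 | 9824469 |
| Belarus | 2005 |  |  | 1.24 | 347.23 | 68.48 | 9775591 |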

In a CSV file, this data that we wish to read would look like the following:

Country,Year,CO2Emissions,PowerConsumption,FertilityRate,InternetUsagePer1000,LifeExpectancy,Population
Belarus,2000,5.91,2988.71,1.29,18.69,68.01,1.00E+07
Belarus,2001,5.87,2996.81,,43.15,,9970260
Belarus,2002,6.03,2982.77,1.25,89.8,68.21,9925000
...
Philippines,2000,1.03,514.02,,20.33,69.53,7.58E+07
Philippines,2001,0.99,535.18,,25.89,,7.72E+07
Philippines,2002,0.99,539.74,3.5,44.47,70.19,7.87E+07
...
Morocco,2000,1.2,489.04,2.62,7.03,68.81,2.85E+07
Morocco,2001,1.32,508.1,2.5,13.87,,2.88E+07
Morocco,2002,1.32,526.4,2.5,23.99,69.48,2.92E+07
...

The data here is taken from World Bank Economic data, available at http://data.worldbank.org.

In Java, we would have to write the following code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CSVReader {
    public static void main(String[] args) {
        // The first command-line argument is the path to the CSV file
        String csvFile = args[0];
        CSVReader csvReader = new CSVReader();
        List<Map<String, String>> dataTable = csvReader.readCSV(csvFile);
    }

    public List<Map<String, String>> readCSV(String csvFile) {
        String line;
        String delim = ",";
        // Initialize list of maps, each representing a line of the CSV file
        List<Map<String, String>> data = new ArrayList<>();
        // try-with-resources closes the reader once reading is finished
        try (BufferedReader bReader = new BufferedReader(new FileReader(csvFile))) {
            bReader.readLine(); // Skip the header row
            // Read the CSV file, line by line
            while ((line = bReader.readLine()) != null) {
                // A limit of -1 keeps trailing empty fields
                String[] row = line.split(delim, -1);
                Map<String, String> csvRow = new HashMap<>();
                csvRow.put("Country", row[0]);
                csvRow.put("Year", row[1]);
                csvRow.put("CO2Emissions", row[2]);
                csvRow.put("PowerConsumption", row[3]);
                csvRow.put("FertilityRate", row[4]);
                csvRow.put("InternetUsage", row[5]);
                csvRow.put("LifeExpectancy", row[6]);
                csvRow.put("Population", row[7]);
                data.add(csvRow);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data;
    }
}

But, using pandas, it would take just two lines of code:

import pandas as pd
worldBankDF=pd.read_csv('worldbank.csv')
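The subsetting and filtering benefits listed earlier can be sketched on the same data. The example below is a minimal sketch under a few assumptions: it reads the sample rows from an in-memory string via `io.StringIO` rather than from `worldbank.csv`, so that it runs without the file, and names such as `world_bank_df` and `high_co2` are hypothetical, as is the 1.25 threshold.

```python
import io

import pandas as pd

# Sample rows copied from the CSV listing above (the elided "..." rows are left out).
csv_text = """Country,Year,CO2Emissions,PowerConsumption,FertilityRate,InternetUsagePer1000,LifeExpectancy,Population
Belarus,2000,5.91,2988.71,1.29,18.69,68.01,1.00E+07
Belarus,2001,5.87,2996.81,,43.15,,9970260
Belarus,2002,6.03,2982.77,1.25,89.8,68.21,9925000
Philippines,2000,1.03,514.02,,20.33,69.53,7.58E+07
Philippines,2001,0.99,535.18,,25.89,,7.72E+07
Philippines,2002,0.99,539.74,3.5,44.47,70.19,7.87E+07
Morocco,2000,1.2,489.04,2.62,7.03,68.81,2.85E+07
Morocco,2001,1.32,508.1,2.5,13.87,,2.88E+07
Morocco,2002,1.32,526.4,2.5,23.99,69.48,2.92E+07
"""

# read_csv accepts a file path or any file-like object.
world_bank_df = pd.read_csv(io.StringIO(csv_text))

# Subsetting: keep only a few columns.
subset = world_bank_df[["Country", "Year", "CO2Emissions"]]

# Filtering: a boolean mask selects the rows that satisfy a condition.
belarus_rows = world_bank_df[world_bank_df["Country"] == "Belarus"]
high_co2 = world_bank_df[world_bank_df["CO2Emissions"] > 1.25]

# Empty CSV fields are read as NaN, so missing values can be counted directly.
missing_life_expectancy = world_bank_df["LifeExpectancy"].isna().sum()

print(subset.head())
print(len(belarus_rows), len(high_co2), missing_life_expectancy)
```

Each of these operations is a single expression, whereas the list-of-maps representation in the Java version would need an explicit loop for every one of them.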

In addition, pandas is built upon the NumPy library and hence inherits many of the performance benefits of this package, especially when it comes to numerical and scientific computing. One oft-touted drawback of Python is that, as an interpreted scripting language, it runs slowly relative to compiled languages such as Java, C, or C++. This drawback largely does not apply to pandas, because its core operations are delegated to compiled code in NumPy and in pandas' own extension modules.
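To make the NumPy point concrete, the sketch below computes the same total twice: once with an explicit Python loop and once with a vectorized pandas expression. The data is synthetic and hypothetical, not taken from the World Bank file, and no timings are claimed here; the example only shows that the two approaches agree while the second hands the arithmetic to NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 synthetic values from a seeded random generator.
rng = np.random.default_rng(seed=0)
values = pd.Series(rng.random(100_000))

# Pure-Python approach: the interpreter processes one element at a time.
loop_total = 0.0
for v in values:
    loop_total += v * 2.0

# Vectorized approach: the multiplication and the sum each run as a single
# operation over the whole underlying NumPy array.
vectorized_total = (values * 2.0).sum()

# The two totals agree up to floating-point rounding.
print(np.isclose(loop_total, vectorized_total))
```

Wrapping each approach in the standard library's `timeit` module is one way to measure the difference on a given machine.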
