Comparing data

In this section, we will learn how to compare data in Python. We will use the pandas module for this purpose.

pandas is an open-source library that provides easy-to-use data structures and data analysis tools, which makes importing and analyzing data simpler.

Before starting with the example, make sure that pandas is installed on your system. You can install it as follows:

pip3 install pandas     --- For Python 3

or

pip install pandas      --- For Python 2
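To confirm that the installation worked, a quick check such as the following can help. This is only a minimal sketch; the variable name version is chosen here for illustration.

```python
import pandas as pd

# If this import succeeds, pandas is available to this interpreter.
version = pd.__version__
print("pandas version:", version)
```

If the import fails with a ModuleNotFoundError, pandas was most likely installed for a different Python interpreter than the one running the script.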

We will work through an example of comparing data with pandas. First, we will create two CSV files: student1.csv and student2.csv. Our script will then compare the contents of these two files and print the rows that differ between them.

Create the student1.csv file with the following content:

Id,Name,Gender,Age,Address
101,John,Male,20,New York
102,Mary,Female,18,London
103,Aditya,Male,22,Mumbai
104,Leo,Male,22,Chicago
105,Sam,Male,21,Paris
106,Tina,Female,23,Sydney

Create the student2.csv file with the following content:

Id,Name,Gender,Age,Address
101,John,Male,21,New York
102,Mary,Female,20,London
103,Aditya,Male,22,Mumbai
104,Leo,Male,23,Chicago
105,Sam,Male,21,Paris
106,Tina,Female,23,Sydney
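If you would rather not type the files by hand, the following sketch writes both of them with the standard library. The helper name write_samples and its directory parameter are assumptions made for this illustration; the file names and contents are the ones listed above.

```python
from pathlib import Path

STUDENT1 = """Id,Name,Gender,Age,Address
101,John,Male,20,New York
102,Mary,Female,18,London
103,Aditya,Male,22,Mumbai
104,Leo,Male,22,Chicago
105,Sam,Male,21,Paris
106,Tina,Female,23,Sydney
"""

STUDENT2 = """Id,Name,Gender,Age,Address
101,John,Male,21,New York
102,Mary,Female,20,London
103,Aditya,Male,22,Mumbai
104,Leo,Male,23,Chicago
105,Sam,Male,21,Paris
106,Tina,Female,23,Sydney
"""


def write_samples(directory="."):
    """Write both sample files into `directory` and return their paths."""
    target = Path(directory)
    paths = []
    for name, content in (("student1.csv", STUDENT1), ("student2.csv", STUDENT2)):
        path = target / name
        path.write_text(content, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Writes into the current directory, where compare_data.py expects the files.
    for path in write_samples():
        print("wrote", path)
```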

Now, we will create a compare_data.py script and write the following content in it:

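import pandas as pd

# Load both CSV files into DataFrames
df1 = pd.read_csv("student1.csv")
df2 = pd.read_csv("student2.csv")

# Turn every row into a tuple so the rows can be stored in sets
s1 = set(tuple(values) for values in df1.values.tolist())
s2 = set(tuple(values) for values in df2.values.tolist())

# Rows found in only one file; the result is not stored, and the
# two prints below show each side of it separately
s1.symmetric_difference(s2)
print(pd.DataFrame(list(s1.difference(s2))), ' ')   # only in student1.csv
print(pd.DataFrame(list(s2.difference(s1))), ' ')   # only in student2.csv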

Run the script as follows:

$ python3 compare_data.py

Output:
     0     1       2   3         4
0  102  Mary  Female  18    London
1  104   Leo    Male  22   Chicago
2  101  John    Male  20  New York

     0     1       2   3         4
0  101  John    Male  21  New York
1  104   Leo    Male  23   Chicago
2  102  Mary  Female  20    London

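In the preceding example, we compared the data in two CSV files: student1.csv and student2.csv. We first converted the rows of our DataFrames (df1, df2) into sets of tuples (s1, s2). The symmetric_difference() set method returns the rows that appear in only one of the two sets. The two difference() calls split that result into the rows found only in student1.csv and the rows found only in student2.csv, and each part is printed as a DataFrame. Because sets are unordered, the rows may be printed in a different order from one run to the next, and because the DataFrames are rebuilt from plain tuples, the columns are labeled 0 to 4 instead of with the original header names.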
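The same set-based comparison can be made easier to read by sorting the differing rows and restoring the column names. The following is a minimal sketch rather than the original script: the helper rows_only_in is a name introduced here for illustration, and the CSV contents are embedded with io.StringIO only so that the example runs without the files on disk.

```python
import io

import pandas as pd

CSV1 = """Id,Name,Gender,Age,Address
101,John,Male,20,New York
102,Mary,Female,18,London
103,Aditya,Male,22,Mumbai
104,Leo,Male,22,Chicago
105,Sam,Male,21,Paris
106,Tina,Female,23,Sydney
"""

CSV2 = """Id,Name,Gender,Age,Address
101,John,Male,21,New York
102,Mary,Female,20,London
103,Aditya,Male,22,Mumbai
104,Leo,Male,23,Chicago
105,Sam,Male,21,Paris
106,Tina,Female,23,Sydney
"""

df1 = pd.read_csv(io.StringIO(CSV1))
df2 = pd.read_csv(io.StringIO(CSV2))


def rows_only_in(left, right):
    """Return the rows of `left` that have no exact match in `right`."""
    left_rows = set(map(tuple, left.values.tolist()))
    right_rows = set(map(tuple, right.values.tolist()))
    # Sorting gives a stable order; the tuples sort by Id first.
    only_left = sorted(left_rows - right_rows)
    # Reuse the original header so the columns keep their names.
    return pd.DataFrame(only_left, columns=left.columns)


only_in_1 = rows_only_in(df1, df2)
only_in_2 = rows_only_in(df2, df1)

print(only_in_1)
print(only_in_2)
```

With the sample data, each result holds the three rows whose Age differs between the files (Ids 101, 102, and 104), now in Id order and under the Id, Name, Gender, Age, and Address headers.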
