Importing a file

To import a file, we are going to use from_csv or read_csv, depending on the version of pandas installed. DataFrame.from_csv was deprecated in pandas 0.21 and removed in pandas 1.0. If you are using an older pandas release (typical of Python 2.x environments), the following snippet can be used to import the data:

from pandas import DataFrame
#importing average rating (the .tsv files are tab-separated)
rating = DataFrame.from_csv("title.ratings.tsv", index_col=None, sep="\t")
#importing episode and season info
episode = DataFrame.from_csv("episode.tsv", index_col=None, sep="\t")

With a newer pandas release (as is usual with Python 3.x), the preceding code throws deprecation warnings or, from pandas 1.0 onward, fails because from_csv no longer exists. To overcome this, use read_csv instead:

import pandas as pd
#both files are tab-separated, so the separator is "\t"
rating = pd.read_csv("title.ratings.tsv", index_col=None, sep="\t")
episode = pd.read_csv("episode.tsv", index_col=None, sep="\t")
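
As a minimal sketch of why the separator matters, the following reads a small tab-separated sample from an in-memory string instead of from disk. The rows are invented for illustration, and the column names averageRating and numVotes are assumptions; only tconst is named in this section.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample; the values are invented for illustration.
sample_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.6\t1500\n"
    "tt0000002\t6.1\t200\n"
)

# sep="\t" splits each line on tab characters, giving one column per field.
sample = pd.read_csv(io.StringIO(sample_tsv), index_col=None, sep="\t")

# With a separator that never occurs in the data (a comma here),
# every line ends up in a single column.
wrong = pd.read_csv(io.StringIO(sample_tsv), index_col=None, sep=",")

print(sample.shape)  # three columns
print(wrong.shape)   # one column
```

If a freshly imported DataFrame has only one column, a wrong separator is a likely cause.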

Now both files are imported. To verify that they were imported correctly, we can look at the first 10 entries of each file using the head method:

episode.head(10)

The command should output the first 10 entries, shown as follows:

Figure 15.3: First 10 entries from episode file

Similarly, we can verify the top 10 entries from the rating file like so:

rating.head(10)

It should generate output similar to the following screenshot:

Figure 15.4: Output for the first 10 entries from the rating file
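
Besides looking at the first rows, it can help to check the row count and column names. The following is a small sketch on a made-up DataFrame; the identifiers and the seasonNumber column are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical frame with 12 rows; the identifiers are invented.
demo = pd.DataFrame({
    "tconst": ["tt%07d" % i for i in range(1, 13)],
    "seasonNumber": [1] * 12,
})

first_ten = demo.head(10)  # explicit row count
first_five = demo.head()   # head() returns 5 rows when no count is given

print(len(first_ten), len(first_five))
print(demo.shape)          # (rows, columns) of the full frame
print(list(demo.columns))  # column names as parsed
```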

In both DataFrames, tconst is the title identifier, and it has the same meaning in each. To perform the analysis, we need to combine the two files so that matching entries from each are paired. Luckily, pandas provides a merge function for this task:

#merge joins on the columns the two DataFrames share (here, tconst),
#keeping the titles, for example tt0000000, that appear in both
re = rating.merge(episode)

This should merge the two files on tconst. To verify, let's check the first 10 entries with re.head(10):

Figure 15.5: Merged file, first 10 entries
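
To make the behaviour of merge concrete, here is a minimal sketch on two made-up DataFrames. The rows and the averageRating and seasonNumber columns are assumptions for illustration. Called without extra arguments, merge performs an inner join on the columns the two frames have in common, so the explicit on="tconst" and how="inner" below only spell out the defaults for this case.

```python
import pandas as pd

# Hypothetical stand-ins for the rating and episode tables.
rating_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.6, 6.1, 7.0],
})
episode_demo = pd.DataFrame({
    "tconst": ["tt0000002", "tt0000003", "tt0000004"],
    "seasonNumber": [1, 1, 2],
})

# Inner join on tconst: only titles present in both frames survive.
merged_demo = rating_demo.merge(episode_demo, on="tconst", how="inner")

# The argument-free call gives the same result, because tconst is the
# only shared column and "inner" is the default join type.
same_as_default = rating_demo.merge(episode_demo).equals(merged_demo)

print(merged_demo)
print(same_as_default)
```

Titles that occur in only one of the two frames (tt0000001 and tt0000004 here) are dropped from the result.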

To verify that the tables have been merged correctly, let's take the first entry from Figure 15.3, which shows a tconst of tt0033908, and look up the corresponding entry in the rating table. We can use the loc indexer from pandas to locate the entry. The command should look like the following:

rating.loc[rating['tconst'] == "tt0033908"]

The command says: find the entries in the rating DataFrame where tconst is tt0033908. The output should look like the following:

Figure 15.6: Find the entries where tconst is tt0033908
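
The lookup works in two steps: the comparison builds a boolean Series, and loc keeps the rows where that Series is True. The sketch below shows both steps on a made-up DataFrame; the rows and the averageRating column are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the rating table.
rating_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.6, 6.1, 7.0],
})

# Step 1: compare every tconst with the target, one boolean per row.
mask = rating_demo["tconst"] == "tt0000002"

# Step 2: loc keeps only the rows where the mask is True.
match = rating_demo.loc[mask]

print(mask.tolist())
print(match)
```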

If we compare Figure 15.6 with Figure 15.5, we can see that the files have been merged correctly.
