This section is included to help students perform the activities in the book. It contains the detailed steps students need to follow to achieve the objectives of each activity.
These are the steps to complete this activity:
import random
LIMIT = 100
random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]
random_number_list
The sample output is as follows:
list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
list_with_divisible_by_3
The sample output is as follows:
length_of_random_list = len(random_number_list)
length_of_3_divisible_list = len(list_with_divisible_by_3)
difference = length_of_random_list - length_of_3_divisible_list
difference
The sample output is as follows:
62
NUMBER_OF_EXPERIMENTS = 10
difference_list = []
for i in range(0, NUMBER_OF_EXPERIMENTS):
    random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]
    list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
    length_of_random_list = len(random_number_list)
    length_of_3_divisible_list = len(list_with_divisible_by_3)
    difference = length_of_random_list - length_of_3_divisible_list
    difference_list.append(difference)
difference_list
The sample output is as follows:
[64, 61, 67, 60, 73, 66, 66, 75, 70, 61]
avg_diff = sum(difference_list) / float(len(difference_list))
avg_diff
The sample output is as follows:
66.3
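As a quick sanity check (an aside, not part of the original steps), the average difference of about 66 agrees with theory: randint(0, LIMIT) draws from 101 values, 34 of which are divisible by 3, so the expected difference per 100 draws is 100 × (1 − 34/101) ≈ 66.3:

```python
# Count the multiples of 3 in the inclusive range [0, 100]: 0, 3, ..., 99
multiples_of_3 = len([n for n in range(0, 101) if n % 3 == 0])

# Expected number of draws (out of 100) that are NOT divisible by 3
expected_difference = 100 * (1 - multiples_of_3 / 101)
print(multiples_of_3, round(expected_difference, 1))  # → 34 66.3
```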
These are the steps to complete this activity:
type(multiline_text)
The output is as follows:
str
len(multiline_text)
The output is as follows:
4475
multiline_text = multiline_text.replace('\n', "")
Then, we will print and check the output:
multiline_text
The output is as follows:
# remove special chars, punctuation etc.
cleaned_multiline_text = ""
for char in multiline_text:
    if char == " ":
        cleaned_multiline_text += char
    elif char.isalnum():  # using the isalnum() method of strings
        cleaned_multiline_text += char
    else:
        cleaned_multiline_text += " "
cleaned_multiline_text
The output is as follows:
list_of_words = cleaned_multiline_text.split()
list_of_words
The output is as follows:
len(list_of_words)
The output is 852.
unique_words_as_dict = dict.fromkeys(list_of_words)
len(list(unique_words_as_dict.keys()))
The output is 340.
for word in list_of_words:
    if unique_words_as_dict[word] is None:
        unique_words_as_dict[word] = 1
    else:
        unique_words_as_dict[word] += 1
unique_words_as_dict
The output is as follows:
You just created, step by step, a unique word counter using all the neat tricks that you just learned.
top_words = sorted(unique_words_as_dict.items(), key=lambda key_val_tuple: key_val_tuple[1], reverse=True)
top_words[:25]
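For comparison (an aside, not part of the activity), the standard library's collections.Counter implements the same count-and-sort pattern in two lines; the sample sentence below is made up for illustration:

```python
from collections import Counter

# Counter builds the word -> count mapping in one pass
words = "the quick brown fox jumps over the lazy dog the end".split()
counts = Counter(words)

# most_common() returns (word, count) pairs sorted by count, descending
print(counts.most_common(1))  # → [('the', 3)]
```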
These are the steps to complete this activity:
from itertools import permutations, dropwhile
permutations?
dropwhile?
You will see a long list of definitions after each ?. We will skip it here.
permutations(range(3))
The output is as follows:
<itertools.permutations at 0x7f6c6c077af0>
for number_tuple in permutations(range(3)):
    print(number_tuple)
    assert isinstance(number_tuple, tuple)
The output is as follows:
(0, 1, 2)
(0, 2, 1)
(1, 0, 2)
(1, 2, 0)
(2, 0, 1)
(2, 1, 0)
An extra task can be to check the actual type that dropwhile returns without casting:
for number_tuple in permutations(range(3)):
    print(list(dropwhile(lambda x: x <= 0, number_tuple)))
The output is as follows:
[1, 2]
[2, 1]
[1, 0, 2]
[1, 2, 0]
[2, 0, 1]
[2, 1, 0]
import math
def convert_to_number(number_stack):
    final_number = 0
    for i in range(0, len(number_stack)):
        final_number += (number_stack.pop() * (math.pow(10, i)))
    return final_number

for number_tuple in permutations(range(3)):
    number_stack = list(dropwhile(lambda x: x <= 0, number_tuple))
    print(convert_to_number(number_stack))
The output is as follows:
12.0
21.0
102.0
120.0
201.0
210.0
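A string-based alternative (a sketch, not the book's approach) avoids the float results of math.pow by joining the digits and casting once with int:

```python
from itertools import permutations, dropwhile

numbers = []
for number_tuple in permutations(range(3)):
    # Drop leading zeros, then join the remaining digits into one integer
    trimmed = dropwhile(lambda x: x <= 0, number_tuple)
    numbers.append(int("".join(str(d) for d in trimmed)))
print(numbers)  # → [12, 21, 102, 120, 201, 210]
```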
These are the steps to complete this activity:
from itertools import zip_longest
def return_dict_from_csv_line(header, line):
    # Zip them
    zipped_line = zip_longest(header, line, fillvalue=None)
    # Use dict comprehension to generate the final dict
    ret_dict = {kv[0]: kv[1] for kv in zipped_line}
    return ret_dict
first_line = fd.readline()  # fd is the file object for the CSV file, opened earlier
header = first_line.replace(" ", "").split(",")
for i, line in enumerate(fd):
    line = line.replace(" ", "").split(",")
    d = return_dict_from_csv_line(header, line)
    print(d)
    if i > 10:
        break
The output is as follows:
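To see why zip_longest (rather than zip) matters here, consider a short CSV line; the header and values below are made up for illustration. Missing trailing fields are padded with None instead of being silently dropped:

```python
from itertools import zip_longest

header = ["name", "age", "city"]
line = ["Alice", "30"]          # the last field is missing

# zip would stop at the shorter sequence; zip_longest pads with fillvalue
row = {key: value for key, value in zip_longest(header, line, fillvalue=None)}
print(row)  # → {'name': 'Alice', 'age': '30', 'city': None}
```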
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Hint: The Pandas function for reading a CSV file is 'read_csv'.
# Don't forget that all functions in Pandas can be accessed by syntax like pd.{function_name}
df=pd.read_csv("Boston_housing.csv")
df.head(10)
The output is as follows:
df.shape
The output is as follows:
(506, 14)
df1=df[['CRIM','ZN','INDUS','RM','AGE','DIS','RAD','TAX','PTRATIO','PRICE']]
df1.tail(7)
The output is as follows:
for c in df1.columns:
    plt.title("Plot of " + c, fontsize=15)
    plt.hist(df1[c], bins=20)
    plt.show()
The output is as follows:
plt.scatter(df1['CRIM'],df1['PRICE'])
plt.show()
The output is as follows:
We can understand the relationship better if we plot log10(crime) versus price.
plt.scatter(np.log10(df1['CRIM']),df1['PRICE'],c='red')
plt.title("Crime rate (Log) vs. Price plot", fontsize=18)
plt.xlabel("Log of Crime rate",fontsize=15)
plt.ylabel("Price",fontsize=15)
plt.grid(True)
plt.show()
The output is as follows:
df1['RM'].mean()
The output is 6.284634387351788.
df1['AGE'].median()
The output is 77.5.
df1['DIS'].mean()
The output is 3.795042687747034.
# Create a Pandas series and directly compare it with 20
# You can do this because Pandas series is basically NumPy array and you have seen how to filter NumPy array
low_price=df1['PRICE']<20
# This creates a Boolean array of True, False
print(low_price)
# True = 1, False = 0, so the sum of this NumPy array is the number of houses priced below 20,000,
# and the mean is the fraction of such houses.
# You can convert that fraction into a percentage by multiplying by 100
pcnt=low_price.mean()*100
print("Percentage of houses with <20,000 price is: ", pcnt)
The output is as follows:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 True
…
500 True
501 False
502 False
503 False
504 False
505 True
Name: PRICE, Length: 506, dtype: bool
Percentage of houses with <20,000 price is: 41.50197628458498
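The trick of averaging a Boolean array generalizes; here is a minimal self-contained illustration with made-up prices:

```python
import numpy as np

prices = np.array([15, 25, 10, 30, 18])
low_price = prices < 20          # Boolean array: [True, False, True, False, True]

# True counts as 1 and False as 0, so the mean is the fraction of True values
print(low_price.mean() * 100)    # → 60.0
```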
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("adult_income_data.csv")
df.head()
The output is as follows:
names = []
with open('adult_income_names.txt','r') as f:
    for line in f:
        f.readline()
        var = line.split(":")[0]
        names.append(var)
names
The output is as follows:
names.append('Income')
df = pd.read_csv("adult_income_data.csv",names=names)
df.head()
The output is as follows:
df.describe()
The output is as follows:
Note that only a small number of columns are included. Many variables in the dataset have multiple factors or classes.
# Make a list of all variables with classes
vars_class = ['workclass','education','marital-status',
'occupation','relationship','sex','native-country']
for v in vars_class:
    classes = df[v].unique()
    num_classes = df[v].nunique()
    print("There are {} classes in the '{}' column. They are: {}".format(num_classes, v, classes))
    print("-"*100)
The output is as follows:
df.isnull().sum()
The output is as follows:
df_subset = df[['age','education','occupation']]
df_subset.head()
The output is as follows:
df_subset['age'].hist(bins=20)
The output is as follows:
<matplotlib.axes._subplots.AxesSubplot at 0x19dea8d0>
df_subset.boxplot(column='age',by='education',figsize=(25,10))
plt.xticks(fontsize=15)
plt.xlabel("Education",fontsize=20)
plt.show()
The output is as follows:
Before doing any further operations, we need to use the apply method we learned in this chapter. It turns out that when reading the dataset from the CSV file, all the strings came with a whitespace character in front. So, we need to remove that whitespace from all the strings.
def strip_whitespace(s):
    return s.strip()
# Education column
df_subset['education_stripped']=df['education'].apply(strip_whitespace)
df_subset['education']=df_subset['education_stripped']
df_subset.drop(labels=['education_stripped'],axis=1,inplace=True)
# Occupation column
df_subset['occupation_stripped']=df['occupation'].apply(strip_whitespace)
df_subset['occupation']=df_subset['occupation_stripped']
df_subset.drop(labels=['occupation_stripped'],axis=1,inplace=True)
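As an aside, pandas can do this stripping without a custom function via the .str accessor; a minimal sketch with made-up values:

```python
import pandas as pd

df_demo = pd.DataFrame({"education": [" Bachelors", " HS-grad ", " Masters"]})

# .str.strip() applies str.strip element-wise across the whole column
df_demo["education"] = df_demo["education"].str.strip()
print(df_demo["education"].tolist())  # → ['Bachelors', 'HS-grad', 'Masters']
```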
This is the sample warning message, which you should ignore:
# Conditional clauses and join them by & (AND)
df_filtered=df_subset[(df_subset['age']>=30) & (df_subset['age']<=50)]
Check the contents of the new dataset:
df_filtered.head()
The output is as follows:
answer_1=df_filtered.shape[0]
answer_1
The output is as follows:
1630
print("There are {} people of age between 30 and 50 in this dataset.".format(answer_1))
The output is as follows:
There are 1630 people of age between 30 and 50 in this dataset.
df_subset.groupby('occupation').describe()['age']
The output is as follows:
The code returns 79 rows × 1 columns.
Is there a particular occupation group with very low representation? Perhaps we should remove that data, because a group with very little data won't be useful in analysis. Just by looking at the preceding table, you should be able to see that the Armed-Forces group has only 9 data points. But how can we detect this in general? By plotting the count column in a bar chart. Note how the first argument to the barh function is the index of the DataFrame, which holds the summary stats of the occupation groups. We can see that the Armed-Forces group has almost no data. This exercise teaches you that, sometimes, the outlier is not just a value but can be a whole group. The data of this group is fine, but the group is too small to be useful for any analysis, so it can be treated as an outlier in this case. Always use your business knowledge and engineering judgment for such outlier detection, and to decide how to process the outliers.
occupation_stats= df_subset.groupby(
'occupation').describe()['age']
plt.figure(figsize=(15,8))
plt.barh(y=occupation_stats.index,
width=occupation_stats['count'])
plt.yticks(fontsize=13)
plt.show()
The output is as follows:
df_1 = df[['age',
'workclass',
'occupation']].sample(5,random_state=101)
df_1.head()
The output is as follows:
The second dataset is as follows:
df_2 = df[['education',
'occupation']].sample(5,random_state=101)
df_2.head()
The output is as follows:
Merging the two datasets together:
df_merged = pd.merge(df_1,df_2,
on='occupation',
how='inner').drop_duplicates()
df_merged
The output is as follows:
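The mechanics of an inner join on a shared key can be seen in miniature with made-up frames (an illustration, not the activity's data):

```python
import pandas as pd

df_a = pd.DataFrame({"occupation": ["Sales", "Tech"], "age": [39, 50]})
df_b = pd.DataFrame({"occupation": ["Tech", "Sales"], "education": ["Masters", "HS-grad"]})

# An inner join keeps only occupation values present in both frames
merged = pd.merge(df_a, df_b, on="occupation", how="inner")
print(merged.shape)  # → (2, 3)
```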
These are the steps to complete this activity:
from bs4 import BeautifulSoup
import pandas as pd
fd = open("List of countries by GDP (nominal) - Wikipedia.htm", "r")
soup = BeautifulSoup(fd)
fd.close()
all_tables = soup.find_all("table")
print("Total number of tables are {} ".format(len(all_tables)))
There are 9 tables in total.
data_table = soup.find("table", {"class": '"wikitable"|}'})
print(type(data_table))
The output is as follows:
<class 'bs4.element.Tag'>
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
The output is as follows:
3
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
len(data_tables)
The output is as follows:
3
source_names = [source.findAll('a')[0].getText() for source in sources_list]
print(source_names)
The output is as follows:
['International Monetary Fund', 'World Bank', 'United Nations']
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
rows1 = data_tables[0][0].findAll('tbody')[0].findAll('tr')[1:]
data_rows1 = [[td.get_text().strip() for td in tr.findAll('td')] for tr in rows1]
df1 = pd.DataFrame(data_rows1, columns=header1)
df1.head()
The output is as follows:
header2 = [th.getText().strip() for th in data_tables[1][0].findAll('thead')[0].findAll('th')]
header2
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
rows2 = data_tables[1][0].findAll('tbody')[0].findAll('tr')[1:]
def find_right_text(i, td):
    if i == 0:
        return td.getText().strip()
    elif i == 1:
        return td.getText().strip()
    else:
        index = td.text.find("♠")
        return td.text[index+1:].strip()
data_rows2 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows2]
df2 = pd.DataFrame(data_rows2, columns=header2)
df2.head()
The output is as follows:
header3 = [th.getText().strip() for th in data_tables[2][0].findAll('thead')[0].findAll('th')]
header3
The output is as follows:
['Rank', 'Country', 'GDP(US$MM)']
rows3 = data_tables[2][0].findAll('tbody')[0].findAll('tr')[1:]
data_rows3 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows3]
df3 = pd.DataFrame(data_rows3, columns=header3)
df3.head()
The output is as follows:
These are the steps to complete this activity:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("visit_data.csv")
df.head()
The output is as follows:
As we can see, there is data where some values are missing, and if we examine this, we will see some outliers.
print("First name is duplicated - {}".format(any(df.first_name.duplicated())))
print("Last name is duplicated - {}".format(any(df.last_name.duplicated())))
print("Email is duplicated - {}".format(any(df.email.duplicated())))
The output is as follows:
First name is duplicated - True
Last name is duplicated - True
Email is duplicated - False
There are duplicates in both the first and last names, which is normal. However, as we can see, there is no duplicate in email. That's good.
# Notice that we have different ways to format boolean values for the % operator
print("The column Email contains NaN - %r " % df.email.isnull().values.any())
print("The column IP Address contains NaN - %s " % df.ip_address.isnull().values.any())
print("The column Visit contains NaN - %s " % df.visit.isnull().values.any())
The output is as follows:
The column Email contains NaN - False
The column IP Address contains NaN - False
The column Visit contains NaN - True
The column visit contains some None values. Given that the final task at hand will probably be predicting the number of visits, we cannot do anything with rows that do not have that information. They are a type of outlier. Let's get rid of them.
# There are various ways to do this. This is just one way. We encourage you to explore other ways.
# But before that we need to store the previous size of the data set and we will compare it with the new size
size_prev = df.shape
df = df[np.isfinite(df['visit'])] # This reassigns df, so the original DataFrame is lost after this operation.
size_after = df.shape
# Notice how parameterized format is used and then the indexing is working inside the quote marks
print("The size of previous data was - {prev[0]} rows and the size of the new one is - {after[0]} rows".
format(prev=size_prev, after=size_after))
The output is as follows:
The size of previous data was - 1000 rows and the size of the new one is - 974 rows
plt.boxplot(df.visit, notch=True)
The output is as follows:
{'whiskers': [<matplotlib.lines.Line2D at 0x7fa04cc08668>,
<matplotlib.lines.Line2D at 0x7fa04cc08b00>],
'caps': [<matplotlib.lines.Line2D at 0x7fa04cc08f28>,
<matplotlib.lines.Line2D at 0x7fa04cc11390>],
'boxes': [<matplotlib.lines.Line2D at 0x7fa04cc08518>],
'medians': [<matplotlib.lines.Line2D at 0x7fa04cc117b8>],
'fliers': [<matplotlib.lines.Line2D at 0x7fa04cc11be0>],
'means': []}
The boxplot is as follows:
As we can see, we have data in this column in the interval (0, 3000). However, the main concentration of the data is between ~700 and ~2300.
df1 = df[(df['visit'] <= 2900) & (df['visit'] >= 100)] # Notice the powerful & operator
# Here we rely on the fact that format() silently ignores extra arguments beyond the replacement targets
print("After getting rid of outliers the new size of the data is - {}".format(*df1.shape))
After getting rid of the outliers, the new size of the data is 923.
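The band-filtering idea can be tried on a toy series (made-up values) to confirm what the & operator keeps:

```python
import pandas as pd

visits = pd.Series([5, 700, 1500, 2300, 3100])

# Keep only the values inside the fixed band, as in the activity
kept = visits[(visits >= 100) & (visits <= 2900)]
print(kept.tolist())  # → [700, 1500, 2300]
```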
This is the end of the activity for this chapter.
These are the steps to complete this activity:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)
def status_check(r):
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1
status_check(response)
The output is as follows:
Success!
1
contents = response.content.decode(response.encoding)
soup = BeautifulSoup(contents, 'html.parser')
# Empty list to hold all the http links in the HTML page
lst_links=[]
# Find all the href tags and store them in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))
lst_links[:30]
The output is as follows:
['/wiki/Main_Page',
'/catalog/',
'/ebooks/',
'/browse/recent/last1',
'/browse/scores/top',
'/wiki/Gutenberg:Offline_Catalogs',
'/catalog/world/mybookmarks',
'/wiki/Main_Page',
'https://www.paypal.com/xclick/business=donate%40gutenberg.org&item_name=Donation+to+Project+Gutenberg',
'/wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation',
'http://www.ibiblio.org',
'http://www.pgdp.net/',
'pretty-pictures',
'#books-last1',
'#authors-last1',
'#books-last7',
'#authors-last7',
'#books-last30',
'#authors-last30',
'/ebooks/1342',
'/ebooks/84',
'/ebooks/1080',
'/ebooks/46',
'/ebooks/219',
'/ebooks/2542',
'/ebooks/98',
'/ebooks/345',
'/ebooks/2701',
'/ebooks/844',
'/ebooks/11']
booknum=[]
for i in range(19, 119):
    link = lst_links[i]
    link = link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n = re.findall('[0-9]+', link)
    if len(n) == 1:
        # Append the file number cast as an integer
        booknum.append(int(n[0]))
print (" The file numbers for the top 100 ebooks on Gutenberg are shown below "+"-"*70)
print(booknum)
The output is as follows:
The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1342, 84, 1080, 46, 219, 2542, 98, 345, 2701, 844, 11, 5200, 43, 16328, 76, 74, 1952, 6130, 2591, 1661, 41, 174, 23, 1260, 1497, 408, 3207, 1400, 30254, 58271, 1232, 25344, 58269, 158, 44881, 1322, 205, 2554, 1184, 2600, 120, 16, 58276, 5740, 34901, 28054, 829, 33, 2814, 4300, 100, 55, 160, 1404, 786, 58267, 3600, 19942, 8800, 514, 244, 2500, 2852, 135, 768, 58263, 1251, 3825, 779, 58262, 203, 730, 20203, 35, 1250, 45, 161, 30360, 7370, 58274, 209, 27827, 58256, 33283, 4363, 375, 996, 58270, 521, 58268, 36, 815, 1934, 3296, 58279, 105, 2148, 932, 1064, 13415]
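The reason the loop slices lst_links[19:119] rather than scanning every link is that other hrefs also contain a single number; a small illustration with made-up links:

```python
import re

links = ["/ebooks/1342", "/ebooks/84", "#books-last1"]
booknum = []
for link in links:
    n = re.findall("[0-9]+", link)
    if len(n) == 1:
        booknum.append(int(n[0]))

# The anchor link "#books-last1" also matches, which is why the
# activity restricts itself to the positions of the ebook links
print(booknum)  # → [1342, 84, 1]
```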
You will notice a lot of empty spaces/blanks here and there. Ignore them. They are part of the HTML page's markup and its whimsical nature:
print(soup.text[:2000])
The output is as follows:
if (top != self) {
top.location.replace (http://www.gutenberg.org);
alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');
}
Top 100 - Project Gutenberg
Online Book Catalog
Book Search
-- Recent Books
-- Top 100
-- Offline Catalogs
-- My Bookmarks
Main Page
…
Pretty Pictures
Top 100 EBooks yesterday —
Top 100 Authors yesterday —
Top 100 EBooks last 7 days —
Top 100 Authors last 7 days —
Top 100 EBooks last 30 days —
Top 100 Authors last 30 days
Top 100 EBooks yesterday
Pride and Prejudice by Jane Austen (1826)
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1367)
A Modest Proposal by Jonathan Swift (1020)
A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens (953)
Heart of Darkness by Joseph Conrad (887)
Et dukkehjem. English by Henrik Ibsen (761)
A Tale of Two Cities by Charles Dickens (741)
Dracula by Bram Stoker (732)
Moby Dick; Or, The Whale by Herman Melville (651)
The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde (646)
Alice's Adventures in Wonderland by Lewis Carrol
# Temp empty list of Ebook names
lst_titles_temp=[]
start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')
for i in range(100):
    lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i])
lst_titles = []
for i in range(100):
    id1, id2 = re.match('^[a-zA-Z ]*', lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])
for l in lst_titles:
    print(l)
The output is as follows:
Pride and Prejudice by Jane Austen
Frankenstein
A Modest Proposal by Jonathan Swift
A Christmas Carol in Prose
Heart of Darkness by Joseph Conrad
Et dukkehjem
A Tale of Two Cities by Charles Dickens
Dracula by Bram Stoker
Moby Dick
The Importance of Being Earnest
Alice
Metamorphosis by Franz Kafka
The Strange Case of Dr
Beowulf
…
The Russian Army and the Japanese War
Calculus Made Easy by Silvanus P
Beyond Good and Evil by Friedrich Wilhelm Nietzsche
An Occurrence at Owl Creek Bridge by Ambrose Bierce
Don Quixote by Miguel de Cervantes Saavedra
Blue Jackets by Edward Greey
The Life and Adventures of Robinson Crusoe by Daniel Defoe
The Waterloo Campaign
The War of the Worlds by H
Democracy in America
Songs of Innocence
The Confessions of St
Modern French Masters by Marie Van Vorst
Persuasion by Jane Austen
The Works of Edgar Allan Poe
The Fall of the House of Usher by Edgar Allan Poe
The Masque of the Red Death by Edgar Allan Poe
The Lady with the Dog and Other Stories by Anton Pavlovich Chekhov
These are the steps to complete this activity:
import urllib.request, urllib.parse, urllib.error
import json
The following cell will not be executed in the solution notebook because the author cannot give out their private API key.
with open('APIkeys.json') as f:
    keys = json.load(f)
    omdbapi = keys['OMDBapi']
The final URL to be passed should look like this: http://www.omdbapi.com/?t=movie_name&apikey=secretapikey.
serviceurl = 'http://www.omdbapi.com/?'
apikey = '&apikey='+omdbapi
def print_json(json_data):
    list_keys = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer',
                 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Ratings',
                 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID']
    print("-"*50)
    for k in list_keys:
        if k in list(json_data.keys()):
            print(f"{k}: {json_data[k]}")
    print("-"*50)
def save_poster(json_data):
    import os
    title = json_data['Title']
    poster_url = json_data['Poster']
    # Splits the poster URL by '.' and picks up the last string as the file extension
    poster_file_extension = poster_url.split('.')[-1]
    # Reads the image file from the web
    poster_data = urllib.request.urlopen(poster_url).read()
    savelocation = os.path.join(os.getcwd(), 'Posters')
    # Creates a new directory if the directory does not exist. Otherwise, just use the existing path.
    if not os.path.isdir(savelocation):
        os.mkdir(savelocation)
    filename = os.path.join(savelocation, str(title) + '.' + poster_file_extension)
    f = open(filename, 'wb')
    f.write(poster_data)
    f.close()
def search_movie(title):
    try:
        url = serviceurl + urllib.parse.urlencode({'t': str(title)}) + apikey
        print(f'Retrieving the data of "{title}" now... ')
        print(url)
        uh = urllib.request.urlopen(url)
        data = uh.read()
        json_data = json.loads(data)
        if json_data['Response'] == 'True':
            print_json(json_data)
            # Downloads the poster of the movie if one is available
            if json_data['Poster'] != 'N/A':
                save_poster(json_data)
        else:
            print("Error encountered: ", json_data['Error'])
    except urllib.error.URLError as e:
        print(f"ERROR: {e.reason}")
search_movie("Titanic")
The following is the retrieved data for Titanic:
http://www.omdbapi.com/?t=Titanic&apikey=17cdc959
--------------------------------------------------
Title: Titanic
Year: 1997
Rated: PG-13
Released: 19 Dec 1997
Runtime: 194 min
Genre: Drama, Romance
Director: James Cameron
Writer: James Cameron
Actors: Leonardo DiCaprio, Kate Winslet, Billy Zane, Kathy Bates
Plot: A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.
Language: English, Swedish
Country: USA
Awards: Won 11 Oscars. Another 111 wins & 77 nominations.
Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '89%'}, {'Source': 'Metacritic', 'Value': '75/100'}]
Metascore: 75
imdbRating: 7.8
imdbVotes: 913,780
imdbID: tt0120338
--------------------------------------------------
search_movie("Random_error")
The following is the retrieved output for "Random_error":
http://www.omdbapi.com/?t=Random_error&apikey=17cdc959
Error encountered: Movie not found!
Look for a folder called Posters in the same directory you are working in. It should contain a file called Titanic.jpg. Check the file.
These are the steps to complete this activity:
import sqlite3
conn = sqlite3.connect("petsdb")
# a tiny function to make sure the connection is successful
def is_opened(conn):
    try:
        conn.execute("SELECT * FROM persons LIMIT 1")
        return True
    except sqlite3.ProgrammingError as e:
        print("Connection closed {}".format(e))
        return False
print(is_opened(conn))
The output is as follows:
True
conn.close()
print(is_opened(conn))
The output is as follows:
False
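You can rehearse this open/query/close pattern safely with a throwaway in-memory database (a sketch, not the activity's petsdb file; the table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # temporary database, discarded on close
conn.execute("CREATE TABLE persons (id INTEGER, last_name TEXT)")
conn.execute("INSERT INTO persons VALUES (1, NULL)")
conn.execute("INSERT INTO persons VALUES (2, 'Smith')")

res = conn.execute("SELECT count(*) FROM persons WHERE last_name IS NULL")
print(res.fetchone())  # → (1,)
conn.close()
```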
conn = sqlite3.connect("petsdb")
c = conn.cursor()
for ppl, age in c.execute("SELECT count(*), age FROM persons GROUP BY age"):
    print("We have {} people aged {}".format(ppl, age))
The output is as follows:
for ppl, age in c.execute(
    "SELECT count(*), age FROM persons GROUP BY age ORDER BY count(*) DESC"):
    print("Highest number of people is {} and came from {} age group".format(ppl, age))
    break
The output is as follows:
Highest number of people is 5 and came from 73 age group
res = c.execute("SELECT count(*) FROM persons WHERE last_name IS null")
for row in res:
    print(row)
The output is as follows:
(60,)
res = c.execute("SELECT count(*) FROM (SELECT count(owner_id) FROM pets GROUP BY owner_id HAVING count(owner_id) >1)")
for row in res:
    print("{} people have more than one pet".format(row[0]))
The output is as follows:
43 people have more than one pet
res = c.execute("SELECT count(*) FROM pets WHERE treatment_done=1")
for row in res:
    print(row)
The output is as follows:
(36,)
res = c.execute("SELECT count(*) FROM pets WHERE treatment_done=1 AND pet_type IS NOT null")
for row in res:
    print(row)
The output is as follows:
(16,)
res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port'")
for row in res:
    print(row)
The output is as follows:
(49,)
res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port' AND pets.treatment_done=1")
for row in res:
    print(row)
The output is as follows:
(11,)
These are the steps to complete this activity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
education_data_link="http://data.un.org/_Docs/SYB/CSV/SYB61_T07_Education.csv"
df1 = pd.read_csv(education_data_link)
df1.head()
The output is as follows:
df1 = pd.read_csv(education_data_link,skiprows=1)
df1.head()
The output is as follows:
df2 = df1.drop(['Region/Country/Area','Source'],axis=1)
df2.columns=['Region/Country/Area','Year','Data','Enrollments (Thousands)','Footnotes']
df2.head()
The output is as follows:
df2['Footnotes'].unique()
The output is as follows:
type(df2['Enrollments (Thousands)'][0])
The output is as follows:
str
def to_numeric(val):
    """
    Converts a given string (with one or more commas) to a numeric value
    """
    if ',' not in str(val):
        result = float(val)
    else:
        val = str(val)
        val = ''.join(str(val).split(','))
        result = float(val)
    return result
df2['Enrollments (Thousands)']=df2['Enrollments (Thousands)'].apply(to_numeric)
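A vectorized alternative (an aside; the values below are made up) does the same comma removal and conversion with built-in pandas methods:

```python
import pandas as pd

enrollments = pd.Series(["1,234", "56", "7,890,123"])

# Remove the thousands separators, then convert the whole column at once
numeric = pd.to_numeric(enrollments.str.replace(",", "", regex=False))
print(numeric.tolist())  # → [1234, 56, 7890123]
```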
df2['Data'].unique()
The output is as follows:
df_primary = df2[df2['Data']=='Students enrolled in primary education (thousands)']
df_secondary = df2[df2['Data']=='Students enrolled in secondary education (thousands)']
df_tertiary = df2[df2['Data']=='Students enrolled in tertiary education (thousands)']
primary_enrollment_india = df_primary[df_primary['Region/Country/Area']=='India']
primary_enrollment_USA = df_primary[df_primary['Region/Country/Area']=='United States of America']
primary_enrollment_india
The output is as follows:
primary_enrollment_USA
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_india['Year'],primary_enrollment_india['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in India (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_USA['Year'],primary_enrollment_USA['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in the United States of America (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
Data imputation: Clearly, we are missing some data. Let's say we decide to impute these data points by simple linear interpolation between the available data points. We can take out a pen and paper or a calculator and compute those values and manually create a dataset somehow. But being a data wrangler, we will of course take advantage of Python programming, and use pandas imputation methods for this task. But to do that, we first need to create a DataFrame with missing values inserted – that is, we need to append another DataFrame with missing values to the current DataFrame.
(For India) Append the rows corresponding to the missing years: 2004-2009 and 2011-2013.
missing_years = [y for y in range(2004,2010)]+[y for y in range(2011,2014)]
missing_years
The output is as follows:
[2004, 2005, 2006, 2007, 2008, 2009, 2011, 2012, 2013]
dict_missing = {'Region/Country/Area':['India']*9,'Year':missing_years,
                'Data':['Students enrolled in primary education (thousands)']*9,
                'Enrollments (Thousands)':[np.nan]*9,'Footnotes':[np.nan]*9}
df_missing = pd.DataFrame(data=dict_missing)
primary_enrollment_india=primary_enrollment_india.append(df_missing,ignore_index=True,sort=True)
primary_enrollment_india
The output is as follows:
primary_enrollment_india.sort_values(by='Year',inplace=True)
primary_enrollment_india.reset_index(inplace=True,drop=True)
primary_enrollment_india
The output is as follows:
primary_enrollment_india.interpolate(inplace=True)
primary_enrollment_india
The output is as follows:
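What interpolate does to the NaN rows is easiest to see on a tiny series with made-up values; each gap is filled evenly between the known neighbors:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, np.nan, 160.0])

# Default linear interpolation spaces the fills evenly between 100 and 160
print(s.interpolate().tolist())  # → [100.0, 120.0, 140.0, 160.0]
```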
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_india['Year'],primary_enrollment_india['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in India (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
missing_years = [2004]+[y for y in range(2006,2010)]+[y for y in range(2011,2014)]+[2016]
missing_years
The output is as follows:
[2004, 2006, 2007, 2008, 2009, 2011, 2012, 2013, 2016]
dict_missing = {'Region/Country/Area':['United States of America']*9,'Year':missing_years,
                'Data':['Students enrolled in primary education (thousands)']*9,
                'Enrollments (Thousands)':[np.nan]*9,'Footnotes':[np.nan]*9}
df_missing = pd.DataFrame(data=dict_missing)
primary_enrollment_USA=primary_enrollment_USA.append(df_missing,ignore_index=True,sort=True)
primary_enrollment_USA.sort_values(by='Year',inplace=True)
primary_enrollment_USA.reset_index(inplace=True,drop=True)
primary_enrollment_USA.interpolate(inplace=True)
primary_enrollment_USA
The output is as follows:
Still, the first entry (2004) remains missing, because a forward interpolation has no earlier point to work from. A backward pass with a limit of 1 fills it in:
primary_enrollment_USA.interpolate(method='linear',limit_direction='backward',limit=1,inplace=True)
primary_enrollment_USA
The output is as follows:
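The backward pass can be sketched on a toy Series (illustrative values only): a plain forward interpolation cannot fill a leading NaN, while limit_direction='backward' can.

```python
import numpy as np
import pandas as pd

# Illustrative values only. A leading NaN has no earlier point to
# interpolate from; limit_direction='backward' fills it from the
# nearest known value, and limit=1 caps the fill at one cell.
s = pd.Series([np.nan, 10.0, 20.0, 30.0])
filled = s.interpolate(method='linear', limit_direction='backward', limit=1)
print(filled.tolist())
```
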
plt.figure(figsize=(8,4))
plt.bar(primary_enrollment_USA['Year'],primary_enrollment_USA['Enrollments (Thousands)'])
plt.title("Enrollment in primary education in the United States of America (in thousands)",fontsize=16)
plt.grid(True)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel("Year", fontsize=15)
plt.show()
The output is as follows:
These are the steps to complete this activity:
df3=pd.read_csv("India_World_Bank_Info.csv")
The output is as follows:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-45-9239cae67df7> in <module>()
…..
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3
We can try using the error_bad_lines=False option in this kind of situation.
df3=pd.read_csv("India_World_Bank_Info.csv",error_bad_lines=False)
df3.head(10)
The output is as follows:
Even so, the output may not be what we want: the offending lines (three fields where one was expected) are simply dropped, and the surviving rows are still not split into proper columns. We need to specify the correct delimiter.
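Note that error_bad_lines was deprecated in later pandas releases; from pandas 1.3 onward the equivalent is on_bad_lines='skip'. A minimal sketch with a made-up malformed CSV:

```python
import io
import pandas as pd

# Made-up malformed input: the third line has three fields where the
# header promises one.
bad_csv = "col1\na\nb,c,d\ne\n"

# pandas >= 1.3; older versions use error_bad_lines=False instead.
df = pd.read_csv(io.StringIO(bad_csv), on_bad_lines='skip')
print(df['col1'].tolist())  # ['a', 'e']
```

The bad line is skipped rather than raising ParserError, which mirrors the behaviour relied on in this step.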
df3=pd.read_csv("India_World_Bank_Info.csv",error_bad_lines=False,delimiter='\t')
df3.head(10)
The output is as follows:
df3=pd.read_csv("India_World_Bank_Info.csv",error_bad_lines=False,delimiter='\t',skiprows=4)
df3.head(10)
The output is as follows:
df4=df3[df3['Indicator Name']=='GDP per capita (current US$)'].T
df4.head(10)
The output is as follows:
df4.reset_index(inplace=True)
df4.head(10)
The output is as follows:
df4.drop([0,1,2],inplace=True)
df4.reset_index(inplace=True,drop=True)
df4.head(10)
The output is as follows:
df4.columns=['Year','GDP']
df4.head(10)
The output is as follows:
df4.tail(20)
The output is as follows:
df_gdp=df4.iloc[43:57].copy()
df_gdp
The output is as follows:
df_gdp.reset_index(inplace=True,drop=True)
df_gdp
The output is as follows:
df_gdp['Year']
The output is as follows:
df_gdp['Year']=df_gdp['Year'].apply(int)
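The Year values arrive as strings after the transpose, hence the cast; pd.to_numeric is an equivalent vectorized alternative (toy values shown):

```python
import pandas as pd

# Toy values: year labels read as strings, converted to integers so a
# later merge on 'Year' matches the integer years in the other frame.
years = pd.Series(['2003', '2004', '2005'])
years_int = pd.to_numeric(years)
print(years_int.tolist())  # [2003, 2004, 2005]
```
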
These are the steps to complete this activity:
primary_enrollment_with_gdp=primary_enrollment_india.merge(df_gdp,on='Year')
primary_enrollment_with_gdp
The output is as follows:
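A toy sketch (invented numbers, not the activity's data) of what merge on 'Year' does: it keeps only the years present in both frames, which is why the missing years were imputed before this step.

```python
import pandas as pd

# Invented numbers, for illustration only.
enroll = pd.DataFrame({'Year': [2003, 2004],
                       'Enrollments (Thousands)': [100.0, 110.0]})
gdp = pd.DataFrame({'Year': [2004, 2005], 'GDP': [500.0, 550.0]})

# Default inner merge: only 2004 appears in both frames, so only that
# row survives.
merged = enroll.merge(gdp, on='Year')
print(merged)
```
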
primary_enrollment_with_gdp.drop(['Data','Footnotes','Region/Country/Area'],axis=1,inplace=True)
primary_enrollment_with_gdp
The output is as follows:
primary_enrollment_with_gdp = primary_enrollment_with_gdp[['Year','Enrollments (Thousands)','GDP']]
primary_enrollment_with_gdp
The output is as follows:
plt.figure(figsize=(8,5))
plt.title("India's GDP per capita vs primary education enrollment",fontsize=16)
plt.scatter(primary_enrollment_with_gdp['GDP'],
primary_enrollment_with_gdp['Enrollments (Thousands)'],
edgecolor='k',color='orange',s=200)
plt.xlabel("GDP per capita (US $)",fontsize=15)
plt.ylabel("Primary enrollment (thousands)",fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.grid(True)
plt.show()
The output is as follows:
These are the steps to complete this activity:
import sqlite3
with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS "
                   "education_gdp(Year INT, Enrollment FLOAT, GDP FLOAT, PRIMARY KEY (Year))")
with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    for i in range(14):
        year = int(primary_enrollment_with_gdp.iloc[i]['Year'])
        enrollment = primary_enrollment_with_gdp.iloc[i]['Enrollments (Thousands)']
        gdp = primary_enrollment_with_gdp.iloc[i]['GDP']
        #print(year,enrollment,gdp)
        cursor.execute("INSERT INTO education_gdp (Year,Enrollment,GDP) VALUES(?,?,?)",
                       (year,enrollment,gdp))
If we look at the current folder, we should see a file called Education_GDP.db, and if we can examine that using a database viewer program, we can see the data transferred there.
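If no database viewer is handy, the same check can be done from Python with a SELECT. A self-contained sketch against an in-memory database (the row values are invented):

```python
import sqlite3

# Self-contained sketch: create the same table shape in memory,
# insert one invented row, and read it back with a SELECT.
with sqlite3.connect(":memory:") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS "
                   "education_gdp(Year INT, Enrollment FLOAT, GDP FLOAT, PRIMARY KEY (Year))")
    cursor.execute("INSERT INTO education_gdp (Year,Enrollment,GDP) VALUES(?,?,?)",
                   (2003, 100.0, 500.0))
    cursor.execute("SELECT Year, Enrollment, GDP FROM education_gdp")
    rows = cursor.fetchall()
    print(rows)  # [(2003, 100.0, 500.0)]
```

Running the same SELECT against Education_GDP.db should report the 14 rows written above.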
In these activities, we have examined a complete data wrangling flow, including reading data from the web and a local drive, filtering, cleaning, quick visualization, imputation, indexing, merging, and writing back to a database table. We also wrote custom functions to transform some of the data and saw how to handle situations where we may get errors upon reading the file.