By the end of this chapter, you will be able to:
This chapter explains the concepts of databases, including their creation, manipulation and control, and transforming tables into pandas DataFrames.
This chapter of our data journey is focused on RDBMS (Relational Database Management Systems) and SQL (Structured Query Language). In the previous chapter, we stored and read data from a file. In this chapter, we will read structured data, design access to the data, and create query interfaces for databases.
Data has been stored in RDBMS format for years. The reasons behind it are as follows:
As we can see in the following chart, the market of DBMS is big. This chart was produced based on market research that was done by Gartner, Inc. in 2016:
We will learn and play around with some basic and fundamental concepts of database and database management systems in this chapter.
An RDBMS is a piece of software that manages data (represented for the end user in a tabular form) on physical hard disks and is built using the Codd's relational model. Most of the databases that we encounter today are RDBMS. In recent years, there has been a huge industry shift toward a newer kind of database management system, called NoSQL (MongoDB, CouchDB, Riak, and so on). These systems, although in some aspects they follow some of the rules of RDBMS, in most cases reject or modify them.
The RDBMS structure consists of three main elements, namely the storage engine, query engine, and log management. Here is a diagram that shows the structure of a RDBMS:
The following are the main concepts of any RDBMS structure:
We will focus on the query engine in this chapter.
Structured Query Language or SQL (pronounced sequel), as it is commonly known, is a domain-specific language that was originally designed based on E.F. Codd's relational model and is widely used in today's databases to define, insert, manipulate, and retrieve data from them. It can be further sub-divided into four smaller sub-languages, namely DDL (Data Definition Language), DML (Data Manipulation Language), DQL (Data Query Language), and DCL (Data Control Language). There are several advantages of using SQL, with some of them being as follows:
The main areas of focus for the following topic will be DDL, DML, and DQL. The DCL part is more for database administrators.
Notice the use of uppercase letters. It is not a specification and you can use lowercase letters, but it is a widely followed convention and we will use that in this book.
Once you define and insert data in a database, it can be represented as follows:
Another thing to remember about RDBMS is relations. Generally, in a table, we have one or more columns that will have unique values for each row in the table. We call them primary keys for the table. We should be aware that we will encounter unique values across the rows, which are not primary keys. The main difference between them and primary keys is the fact that a primary key cannot be null.
By using the primary key of one table and mentioning it as a foreign key in another table, we can establish relations between two tables. A certain table can be related to any finite number of tables. The relations can be 1:1, which means that each row of the second table is uniquely related to one row of the first table, or 1 :N, N:1, or N: M. An example of relations is as follows:
With this brief refresher, we are now ready to jump into hands-on exercises and write some SQL to store and retrieve data.
In this topic, we will focus on how to write some basic SQL commands, as well as how to connect to a database from Python and use it effectively within Python. The database we will choose here is SQLite. There are other databases, such as Oracle, MySQL, Postgresql, and DB2. The main tricks that you are going to learn here will not change based on what database you are using. But for different databases, you will need to install different third-party Python libraries (such as Psycopg2 for Postgresql, and so on). The reason they all behave the same way (apart for some small details) is the fact that they all adhere to PEP249 (commonly known as Python DB API 2).
This is a good standardization and saves us a lot of headaches while porting from one RDBMS to another.
Most of the industry standard projects which are written in Python and use some kind of RDBMS as the data store, most often relay on an ORM or Object Relational Mapper. An ORM is a high-level library in Python which makes many tasks, while dealing with RDBMS, easier. It also exposes a more Pythonic API than writing raw SQL inside Python code.
In this exercise, we will look into the first step toward using a RDBMS in Python code. All we are going to do is connect to a database and then close the connection. We will also learn about the best way to do this:
import sqlite3
conn = sqlite3.connect("chapter.db")
conn.close()
This conn object is the main connection object, and we will need that to get a second type of object in the future once we want to interact with the database. We need to be careful about closing any open connection to our database.
with sqlite3.connect("chapter.db") as conn:
pass
In this exercise, we have connected to a database using Python.
In this exercise, we will look at how we can create a table, and we will also insert data in it.
As the name suggests, DDL (Data Definition Language) is the way to communicate to the database engine in advance to define what the data will look like. The database engine creates a table object based on the definition provided and prepares it.
To create a table in SQL, use the CREATE TABLE SQL clause. This will need the table name and the table definition. Table name is a unique identifier for the database engine to find and use the table for all future transactions. It can be anything (any alphanumeric string), as long as it is unique. We add the table definition in the form of (column_name_1 data_type, column_name_2 data type, … ). For our purpose, we will use the text and integer datatypes, but usually a standard database engine supports many more datatypes, such as float, double, date time, Boolean, and so on. We will also need to specify a primary key. A primary key is a unique, non-null identifier that's used to uniquely identify a row in a table. In our case, we use email as a primary key. A primary key can be an integer or text.
The last thing you need to know is that unless you call a commit on the series of operations you just performed (together, we formally call them a transaction), nothing will be actually performed and reflected in the database. This property is called atomicity. In fact, for a database to be industry standard (to be useable in real life), it needs to follow the ACID (Atomicity, Consistency, Isolation, Durability) properties:
with sqlite3.connect("chapter.db") as conn:
This code will work once you add the snippet from step 3.
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS user (email text, first_name text, last_name text, address text, age integer, PRIMARY KEY (email))")
cursor.execute("INSERT INTO user VALUES ('[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31)")
cursor.execute("INSERT INTO user VALUES ('[email protected]', 'Tom', 'Fake', '456 Fantasy lane, Fantasu City', 39)")
conn.commit()
This will create the table and write two rows to it with data.
In the preceding exercise, we created a table and stored data in it. Now, we will learn how to read the data that's stored in this database.
The SELECT clause is immensely powerful, and it is really important for a data practitioner to master SELECT and everything related to it (such as conditions, joins, group-by, and so on).
The * after SELECT tells the engine to select all of the columns from the table. It is a useful shorthand. We have not mentioned any condition for the selection (such as above a certain age, first name starting with a certain sequence of letters, and so on). We are practically telling the database engine to select all the rows and all the columns from the table. It is time-consuming and less effective if we have a huge table. Hence, we would want to use the LIMIT clause to limit the number of rows we want.
You can use the SELECT clause in SQL to retrieve data, as follows:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
rows = cursor.execute('SELECT * FROM user')
for row in rows:
print(row)
The output is as follows:
The syntax to use the SELECT clause with a LIMIT as follows:
SELECT * FROM <table_name> LIMIT 50;
This syntax is a sample code and will not work on Jupyter notebook.
This will select all the columns, but only the first 50 rows from the table.
In this exercise, we will use the ORDER BY clause to sort the rows of user table with respect to age:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
rows = cursor.execute('SELECT * FROM user ORDER BY age DESC')
for row in rows:
print(row)
The output is as follows:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
rows = cursor.execute('SELECT * FROM user ORDER BY age')
for row in rows:
print(row)
Notice that we don't need to specify the order as ASC to sort it into ascending order.
In this exercise, we are going to add a column using ALTER and UPDATE the values in the newly added column.
The UPDATE command is used to edit/update any row after it has been inserted. Be careful when using it because using UPDATE without selective clauses (such as WHERE) affects the entire table:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("ALTER TABLE user ADD COLUMN gender text")
cursor.execute("UPDATE user SET gender='M'")
conn.commit()
rows = cursor.execute('SELECT * FROM user')
for row in rows:
print(row)
We have updated the entire table by setting the gender of all the users as M, where M stands for male.
In this exercise, we will learn about a concept that we have already learned about in pandas. This is the GROUP BY clause. The GROUP BY clause is a technique that's used to retrieve distinct values from the database and place them in individual buckets.
The following diagram explains how the GROUP BY clause works:
In the preceding diagram, we can see that the Col3 column has only two unique values across all rows, A and B.
The command that's used to check the total number of rows belonging to each group is as follows:
SELECT count(*), col3 FROM table1 GROUP BY col3
Add female users to the table and group them based on the gender:
cursor.execute("INSERT INTO user VALUES ('[email protected]', 'Shelly', 'Milar', '123, Ocean View Lane', 39, 'F')")
rows = cursor.execute("SELECT COUNT(*), gender FROM user GROUP BY gender")
for row in rows:
print(row)
The output is as follows:
We have been working with a single table and altering it, as well as reading back the data. However, the real power of an RDBMS comes from the handling of relationships among different objects (tables). In this section, we are going to create a new table called comments and link it with the user table in a 1: N relationship. This means that one user can have multiple comments. The way we are going to do this is by adding the user table's primary key as a foreign key in the comments table. This will create a 1: N relationship.
When we link two tables, we need to specify to the database engine what should be done if the parent row is deleted, which has many children in the other table. As we can see in the following diagram, we are asking what happens at the place of the question marks when we delete row1 of the user table:
In a non-RDBMS situation, this situation can quickly become difficult and messy to manage and maintain. However, with an RDBMS, all we have to tell the database engine, in very precise ways, is what to do when a situation like this occurs. The database engine will do the rest for us. We use ON DELETE to tell the engine what we do with all the rows of a table when the parent row gets deleted. The following code illustrates these concepts:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")
sql = """
CREATE TABLE comments (
user_id text,
comments text,
FOREIGN KEY (user_id) REFERENCES user (email)
ON DELETE CASCADE ON UPDATE NO ACTION
)
"""
cursor.execute(sql)
conn.commit()
The ON DELETE CASCADE line informs the database engine that we want to delete all the children rows when the parent gets deleted. We can also define actions for UPDATE. In this case, there is nothing to do on UPDATE.
The FOREIGN KEY modifier modifies a column definition (user_id, in this case) and marks it as a foreign key, which is related to the primary key (email, in this case) of another table.
You may notice the strange looking cursor.execute("PRAGMA foreign_keys = 1") line in the code. It is there just because SQLite does not use the normal foreign key features by default. It is this line that enables that feature. It is typical to SQLite and we won't need it for any other databases.
We have created a table called comments. In this section, we will dynamically generate an insert query, as follows:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")
sql = "INSERT INTO comments VALUES ('{}', '{}')"
rows = cursor.execute('SELECT * FROM user ORDER BY age')
for row in rows:
email = row[0]
print("Going to create rows for {}".format(email))
name = row[1] + " " + row[2]
for i in range(10):
comment = "This is comment {} by {}".format(i, name)
conn.cursor().execute(sql.format(email, comment))
conn.commit()
Pay attention to how we dynamically generate the insert query so that we can insert 10 comments for each user.
In this exercise, we will learn how to exploit the relationship we just built. This means that if we have the primary key from one table, we can recover all the data needed from that table and also all the linked rows from the child table. To achieve this, we will use something called a join.
A join is basically a way to retrieve linked rows from two tables using any kind of primary key - foreign key relation that they have. There are many types of join, such as INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS. They are used in different situations. However, most of the time, in simple 1: N relations, we end up using an INNER join. In Chapter 1: Introduction to Data Wrangling with Python, we learned about sets, then we can view an INNER JOIN as an intersection of two sets. The following diagram illustrate the concepts:
Here, A represents one table and B represents another. The meaning of having common members is to have a relationship between them. It takes all of the rows of A and compares them with all of the rows of B to find the matching rows that satisfy the join predicate. This can quickly become a complex and time-consuming operation. Joins can be very expensive operations. Usually, we use some kind of where clause, after we specify the join, to shorten the scope of rows that are fetched from table A or B to perform the matching.
In our case, our first table, user, has three entries, with the primary key being the email. We can make use of this in our query to get comments just from Bob:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()s
cursor.execute("PRAGMA foreign_keys = 1")
sql = """
SELECT * FROM comments
JOIN user ON comments.user_id = user.email
WHERE user.email='[email protected]'
"""
rows = cursor.execute(sql)
for row in rows:
print(row)
The output is as follows:
('[email protected]', 'This is comment 0 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 1 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 2 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 3 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 4 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 5 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 6 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 7 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 8 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
('[email protected]', 'This is comment 9 by Bob Codd', '[email protected]', 'Bob', 'Codd', '123 Fantasy lane, Fantasu City', 31, None)
In the previous exercise, we saw that we can use a JOIN to fetch the related rows from two tables. However, if we look at the results, we will see that it returned all the columns, thus combining both tables. This is not very concise. What about if we only want to see the emails and the related comments, and not all the data?
There is some nice shorthand code that lets us do this:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")
sql = """
SELECT comments.* FROM comments
JOIN user ON comments.user_id = user.email
WHERE user.email='[email protected]'
"""
rows = cursor.execute(sql)
for row in rows:
print(row)
Just by changing the SELECT statement, we made our final result look as follows:
('[email protected]', 'This is comment 0 by Bob Codd')
('[email protected]', 'This is comment 1 by Bob Codd')
('[email protected]', 'This is comment 2 by Bob Codd')
('[email protected]', 'This is comment 3 by Bob Codd')
('[email protected]', 'This is comment 4 by Bob Codd')
('[email protected]', 'This is comment 5 by Bob Codd')
('[email protected]', 'This is comment 6 by Bob Codd')
('[email protected]', 'This is comment 7 by Bob Codd')
('[email protected]', 'This is comment 8 by Bob Codd')
('[email protected]', 'This is comment 9 by Bob Codd')
In this exercise, we are going to delete a row from the user table and observe the effects it will have on the comments table. Be very careful when running this command as it can have a destructive effect on the data. Please keep in mind that it has to almost always be run accompanied by a WHERE clause so that we delete just a part of the data and not everything:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")
cursor.execute("DELETE FROM user WHERE email='[email protected]'")
conn.commit()
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")
rows = cursor.execute("SELECT * FROM user")
for row in rows:
print(row)
Observe that the user Bob has been deleted.
Now, moving on to the comments table, we have to remember that we had mentioned ON DELETE CASCADE while creating the table. The database engine knows that if a row is deleted from the parent table (user), all the related rows from the child tables (comments) will have to be deleted.
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")
rows = cursor.execute("SELECT * FROM comments")
for row in rows:
print(row)
The output is as follows:
('[email protected]', 'This is comment 0 by Tom Fake')
('[email protected]', 'This is comment 1 by Tom Fake')
('[email protected]', 'This is comment 2 by Tom Fake')
('[email protected]', 'This is comment 3 by Tom Fake')
('[email protected]', 'This is comment 4 by Tom Fake')
('[email protected]', 'This is comment 5 by Tom Fake')
('[email protected]', 'This is comment 6 by Tom Fake')
('[email protected]', 'This is comment 7 by Tom Fake')
('[email protected]', 'This is comment 8 by Tom Fake')
('[email protected]', 'This is comment 9 by Tom Fake')
We can see that all of the rows related to Bob are deleted.
In this exercise, we will see how we can update rows in a table. We have already looked at this in the past but, as we mentioned, at a table level only. Without WHERE, updating is often a bad idea.
Combine UPDATE with WHERE to selectively update the first name of the user with the email address [email protected]:
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")
cursor.execute("UPDATE user set first_name='Chris' where email='[email protected]'")
conn.commit()
rows = cursor.execute("SELECT * FROM user")
for row in rows:
print(row)
The output is as follows:
We have looked into many fundamental aspects of storing and querying data from a database, but as a data wrangling expert, we need our data to be packed and presented as a DataFrame so that we can perform quick and convenient operations on them:
import pandas as pd
columns = ["Email", "First Name", "Last Name", "Age", "Gender", "Comments"]
data = []
with sqlite3.connect("chapter.db") as conn:
cursor = conn.cursor()
Use the execute method from the cursor to set "PRAGMA foreign_keys = 1"
cursor.execute("PRAGMA foreign_keys = 1")
sql = """
SELECT user.email, user.first_name, user.last_name, user.age, user.gender, comments.comments FROM comments
JOIN user ON comments.user_id = user.email
WHERE user.email = '[email protected]'
"""
rows = cursor.execute(sql)
for row in rows:
data.append(row)
df = pd.DataFrame(data, columns=columns)
In this activity, we have the persons table:
We have the pets table:
As we can see, the id column in the persons table (which is an integer) serves as the primary key for that table and as a foreign key for the pet table, which is linked via the owner_id column.
The persons table has the following columns:
The pets table has the following columns:
The name of the SQLite DB is petsdb and it is supplied along with the Activity notebook.
These steps will help you complete this activity:
The solution for this activity can be found on page 324.
We have come to the end of the database chapter. We have learned how to connect to SQLite using Python. We have brushed up on the basics of relational databases and learned how to open and close a database. We then learned how to export this relational database into Python DataFrames.
In the next chapter, we will be performing data wrangling on real-world datasets.