Data analysis starts with data. It is therefore beneficial to work with data storage systems that are simple to set up, operate and where the data access does not become a problem in itself. In short, we would like to have database systems that are easy to embed into our data analysis processes and workflows. In this module, we focus mostly on the Python side of the database interaction, and we will learn how to get data into and out of pandas data structures.
There are numerous ways to store data. In this chapter, we are going to learn to interact with three main categories: text formats, binary formats and databases. We will focus on two storage solutions, MongoDB and Redis. MongoDB is a document-oriented database, which is easy to start with, since we can store JSON documents and do not need to define a schema upfront. Redis is a popular in-memory data structure store on top of which many applications can be built. It is possible to use Redis as a fast key-value store, but Redis supports lists, sets, hashes, bit arrays and even advanced data structures such as HyperLogLog out of the box as well.
Text is a great medium and it's a simple way to exchange information. The following statement is taken from a quote attributed to Doug McIlroy: Write programs to handle text streams, because that is the universal interface.
In this section we will start reading and writing data from and to text files.
Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.
pandas supports a number of functions for reading data from a text file into a DataFrame object. The most simple one is the read_csv()
function. Let's start with a small example file:
$ cat example_data/ex_06-01.txt Name,age,major_id,sex,hometown Nam,7,1,male,hcm Mai,11,1,female,hcm Lan,25,3,female,hn Hung,42,3,male,tn Nghia,26,3,male,dn Vinh,39,3,male,vl Hong,28,4,female,dn
In the preceding example file, each column is separated by comma and the first row is a header row, containing column names. To read the data file into the DataFrame object, we type the following command:
>>> df_ex1 = pd.read_csv('example_data/ex_06-01.txt') >>> df_ex1 Name age major_id sex hometown 0 Nam 7 1 male hcm 1 Mai 11 1 female hcm 2 Lan 25 3 female hn 3 Hung 42 3 male tn 4 Nghia 26 3 male dn 5 Vinh 39 3 male vl 6 Hong 28 4 female dn
We see that the read_csv
function uses a comma as the default delimiter between columns in the text file and the first row is automatically used as a header for the columns. If we want to change this setting, we can use the sep
parameter to change the separated symbol and set header=None
in case the example file does not have a caption row.
See the following example:
$ cat example_data/ex_06-02.txt Nam 7 1 male hcm Mai 11 1 female hcm Lan 25 3 female hn Hung 42 3 male tn Nghia 26 3 male dn Vinh 39 3 male vl Hong 28 4 female dn >>> df_ex2 = pd.read_csv('example_data/ex_06-02.txt', sep = ' ', header=None) >>> df_ex2 0 1 2 3 4 0 Nam 7 1 male hcm 1 Mai 11 1 female hcm 2 Lan 25 3 female hn 3 Hung 42 3 male tn 4 Nghia 26 3 male dn 5 Vinh 39 3 male vl 6 Hong 28 4 female dn
We can also set a specific row as the caption row by using the header
that's equal to the index of the selected row. Similarly, when we want to use any column in the data file as the column index of DataFrame, we set index_col
to the name or index of the column. We again use the second data file example_data/ex_06-02.txt
to illustrate this:
>>> df_ex3 = pd.read_csv('example_data/ex_06-02.txt', sep = ' ', header=None, index_col=0) >>> df_ex3 1 2 3 4 0 Nam 7 1 male hcm Mai 11 1 female hcm Lan 25 3 female hn Hung 42 3 male tn Nghia 26 3 male dn Vinh 39 3 male vl Hong 28 4 female dn
Apart from those parameters, we still have a lot of useful ones that can help us load data files into pandas objects more effectively. The following table shows some common parameters:
Parameter |
Value |
Description |
---|---|---|
|
Type name or dictionary of type of columns |
Sets the data type for data or columns. By default it will try to infer the most appropriate data type. |
|
List-like or integer |
The number of lines to skip (starting from 0). |
|
List-like or dict, default None |
Values to recognize as |
|
List |
A list of values to be converted to Boolean True as well. |
|
List | |
|
|
If the |
|
|
The thousands separator |
|
|
Limits the number of rows to read from the file. |
|
|
If set to True, a DataFrame is returned, even if an error occurred during parsing. |
Besides the read_csv()
function, we also have some other parsing functions in pandas:
Function |
Description |
---|---|
|
Read the general delimited file into DataFrame |
|
Read a table of fixed-width formatted lines into DataFrame |
|
Read text from the clipboard and pass to |
In some situations, we cannot automatically parse data files from the disk using these functions. In that case, we can also open files and iterate through the reader, supported by the CSV module in the standard library:
$ cat example_data/ex_06-03.txt Nam 7 1 male hcm Mai 11 1 female hcm Lan 25 3 female hn Hung 42 3 male tn single Nghia 26 3 male dn single Vinh 39 3 male vl Hong 28 4 female dn >>> import csv >>> f = open('data/ex_06-03.txt') >>> r = csv.reader(f, delimiter=' ') >>> for line in r: >>> print(line) ['Nam', '7', '1', 'male', 'hcm'] ['Mai', '11', '1', 'female', 'hcm'] ['Lan', '25', '3', 'female', 'hn'] ['Hung', '42', '3', 'male', 'tn', 'single'] ['Nghia', '26', '3', 'male', 'dn', 'single'] ['Vinh', '39', '3', 'male', 'vl'] ['Hong', '28', '4', 'female', 'dn']
We saw how to load data from a text file into a pandas data structure. Now, we will learn how to export data from the data object of a program to a text file. Corresponding to the read_csv()
function, we also have the to_csv()
function, supported by pandas. Let's see the following example:
>>> df_ex3.to_csv('example_data/ex_06-02.out', sep = ';')
The result will look like this:
$ cat example_data/ex_06-02.out 0;1;2;3;4 Nam;7;1;male;hcm Mai;11;1;female;hcm Lan;25;3;female;hn Hung;42;3;male;tn Nghia;26;3;male;dn Vinh;39;3;male;vl Hong;28;4;female;dn
If we want to skip the header line or index column when writing out data into a disk file, we can set a False
value to the header and index parameters:
>>> import sys >>> df_ex3.to_csv(sys.stdout, sep=' ', header=False, index=False) 7 1 male hcm 11 1 female hcm 25 3 female hn 42 3 male tn 26 3 male dn 39 3 male vl 28 4 female dn
We can also write a subset of the columns of the DataFrame to the file by specifying them in the columns
parameter:
>>> df_ex3.to_csv(sys.stdout, columns=[3,1,4], header=False, sep=' ') Nam male 7 hcm Mai female 11 hcm Lan female 25 hn Hung male 42 tn Nghia male 26 dn Vinh male 39 vl Hong female 28 dn
With series objects, we can use the same function to write data into text files, with mostly the same parameters as earlier.