Abstracting data

Blaze can abstract many different data structures and expose them through a single, easy-to-use API. This provides consistent behavior and reduces the need to learn multiple interfaces for handling data. If you know pandas, there is not really that much to learn, as the differences in syntax are subtle. We will go through some examples to illustrate this.

Working with NumPy arrays

Getting data from a NumPy array into the DataShape object of Blaze is extremely easy. First, let's create a simple NumPy array: we first load NumPy and then create a matrix with two rows and three columns:

import numpy as np
import blaze as bl  # Blaze, used below to wrap the array

simpleArray = np.array([
        [1,2,3],
        [4,5,6]
    ])

Now that we have an array, we can abstract it with Blaze's DataShape structure:

simpleData_np = bl.Data(simpleArray)

That's it! Simple enough.

In order to peek inside the structure you can use the .peek() method:

simpleData_np.peek()

You should see an output similar to what is shown in the following screenshot:


You can also use the .head(...) method, which will be familiar to those of you versed in pandas' syntax.

Note

The difference between .peek() and .head(...) is that .head(...) allows the specification of the number of rows as its only parameter, whereas .peek() does not allow that and will always print the top 10 records.
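Blaze borrows .head(...) from pandas, so the row-limiting behavior is easy to verify on a plain DataFrame (a comparison sketch using pandas rather than Blaze itself; the data is made up):

```python
import pandas as pd

# A small frame with 20 rows; .head(n) returns only the first n rows,
# mirroring how Blaze's .head(...) limits the size of the preview
df = pd.DataFrame({'a': range(20)})
preview = df.head(3)
print(len(preview))  # 3
```

Unlike .peek(), the argument lets you dial the preview to exactly the number of rows you want to inspect.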

If you want to retrieve the first column of your DataShape, you can use indexing:

simpleData_np[0]

You should see a table, as shown here:


On the other hand, if you were interested in retrieving a row, all you would have to do (like in NumPy) is transpose your DataShape:

simpleData_np.T[0]

What you will then get is presented in the following figure:


Notice that the name of the column is None. DataShapes, just like pandas' DataFrames, support named columns. Thus, let's specify the names of our fields:

simpleData_np = bl.Data(simpleArray, fields=['a', 'b', 'c'])

Now you can retrieve the data simply by calling the column by its name:

simpleData_np['b']

In return, you will get the following output:


As you can see, defining the fields transposes the NumPy array and, now, each element of the array forms a row, unlike when we first created the simpleData_np.

Working with pandas' DataFrame

Since pandas' DataFrame internally uses NumPy data structures, translating a DataFrame to DataShape is effortless.

First, let's create a simple DataFrame. We start by importing pandas:

import pandas as pd

Next, we create a DataFrame:

simpleDf = pd.DataFrame([
        [1,2,3],
        [4,5,6]
    ], columns=['a','b','c'])

We then transform it into a DataShape:

simpleData_df = bl.Data(simpleDf)

You can retrieve data in the same manner as with the DataShape created from the NumPy array. Use the following command:

simpleData_df['a']

Then, it will produce the following output:


Working with files

A DataShape object can be created directly from a .csv file. In this example, we will use a dataset that consists of 404,536 traffic violations that happened in Montgomery County, Maryland.

Note

We downloaded the data from https://catalog.data.gov/dataset/traffic-violations-56dda on 8/23/16; the dataset is updated daily, so the number of traffic violations might differ if you retrieve the dataset at a later date.

We store the dataset locally in the ../Data folder. However, we modified the dataset slightly so that it could be stored in MongoDB: in its original form, with date columns, reading the data back from MongoDB caused errors. We filed a bug with Blaze for this issue (https://github.com/blaze/blaze/issues/1580):

import odo
traffic = bl.Data('../Data/TrafficViolations.csv')

If you do not know the names of the columns in a dataset, you can get them from the DataShape. To get a list of all the fields, use the following command:

print(traffic.fields)

Tip

Those of you familiar with pandas will easily recognize the similarity between the .fields and .columns attributes, as these work in essentially the same way: both return the list of columns (in the case of a pandas DataFrame) or the list of fields, as columns are called in the case of a Blaze DataShape.
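The pandas side of that comparison can be shown in two lines (a sketch on toy data, not the traffic dataset):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

# .columns is pandas' counterpart to Blaze's .fields
field_names = list(df.columns)
print(field_names)  # ['a', 'b', 'c']
```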

Blaze can also read directly from a GZipped archive, saving space:

traffic_gz = bl.Data('../Data/TrafficViolations.csv.gz')

To validate that we get exactly the same data, let's retrieve the first two records from each structure. You can either call the following:

traffic.head(2)

Or you can choose to call:

traffic_gz.head(2)

It produces the same results (columns abbreviated here):


Notice, however, that retrieving data from the archived file takes significantly more time, because Blaze needs to decompress the data first.
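pandas reads GZipped CSV files transparently as well, which makes the space-versus-time trade-off easy to demonstrate (a sketch on made-up data and file names, written to a temporary folder):

```python
import os
import tempfile

import pandas as pd

tmp = tempfile.mkdtemp()
plain_path = os.path.join(tmp, 'violations.csv')
gz_path = os.path.join(tmp, 'violations.csv.gz')

# Write the same data twice: once plain, once GZipped
df = pd.DataFrame({'Stop_year': [2013, 2014], 'Fine': [100, 250]})
df.to_csv(plain_path, index=False)
df.to_csv(gz_path, index=False, compression='gzip')

# Compression is inferred from the .gz extension on read;
# the contents come back identical, only slower to decompress
assert pd.read_csv(plain_path).equals(pd.read_csv(gz_path))
```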

You can also read from multiple files at one time and create one big dataset. To illustrate this, we have split the original dataset into four GZipped datasets by year of violation (these are stored in the ../Data/Years folder).

Blaze uses odo to handle saving DataShapes to a variety of formats. To save the traffic violations data by year, you can call odo like this:

import odo
for year in traffic.Stop_year.distinct().sort():
    odo.odo(traffic[traffic.Stop_year == year], 
        '../Data/Years/TrafficViolations_{0}.csv.gz'
        .format(year))

The preceding instruction saves the data into GZip archives, but you can save it to any of the formats mentioned earlier. The first argument to the .odo(...) method specifies the input object (in our case, the DataShape filtered to the traffic violations of a single year), and the second argument is the output object: the path to the file we want to save the data to. As we are about to learn, storing data is not limited to files.
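The same split-by-year-and-export pattern can be sketched in plain pandas with .groupby(...) (toy data, hypothetical temporary folder; only the Stop_year column name is taken from the book's dataset):

```python
import os
import tempfile

import pandas as pd

# A toy stand-in for the traffic data
traffic = pd.DataFrame({'Stop_year': [2013, 2013, 2014],
                        'Fine': [100, 250, 80]})

out_dir = tempfile.mkdtemp()
# One GZipped CSV per distinct year, like the odo loop above
for year, group in traffic.groupby('Stop_year'):
    group.to_csv(
        os.path.join(out_dir, 'TrafficViolations_{0}.csv.gz'.format(year)),
        index=False, compression='gzip')

print(sorted(os.listdir(out_dir)))
# ['TrafficViolations_2013.csv.gz', 'TrafficViolations_2014.csv.gz']
```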

To read from multiple files you can use the asterisk character *:

traffic_multiple = bl.Data(
    '../Data/Years/TrafficViolations_*.csv.gz')
traffic_multiple.head(2)

The preceding snippet, once again, will produce a familiar table:

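If you ever need the same wildcard behavior without Blaze, the standard library's glob plus pandas' concat achieves it (a sketch that first creates two small yearly files in a temporary folder):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two yearly CSV files to read back
tmp = tempfile.mkdtemp()
for year, fine in [(2013, 100), (2014, 250)]:
    pd.DataFrame([[year, fine]], columns=['Stop_year', 'Fine']).to_csv(
        os.path.join(tmp, 'TrafficViolations_{0}.csv'.format(year)),
        index=False)

# The * wildcard matches both files, mirroring Blaze's multi-file read
paths = sorted(glob.glob(os.path.join(tmp, 'TrafficViolations_*.csv')))
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
print(len(combined))  # 2
```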

Blaze's reading capabilities are not limited to .csv or GZip files: you can also read data from JSON or Excel files (both .xls and .xlsx), from HDFS, or from bcolz-formatted files.

Tip

To learn more about the bcolz format, check its documentation at https://github.com/Blosc/bcolz.

Working with databases

Blaze can also easily read from SQL databases such as PostgreSQL or SQLite. While SQLite would normally be a local database, PostgreSQL can run either locally or on a server.

Blaze, as mentioned earlier, uses odo in the background to handle the communication to and from the databases.

Note

odo is one of the requirements for Blaze and it gets installed along with the package. Check it out here https://github.com/blaze/odo.

In order to execute the code in this section, you will need two things: a running local instance of a PostgreSQL database, and a locally running MongoDB database.

Tip

In order to install PostgreSQL, download the package from http://www.postgresql.org/download/ and follow the installation instructions for your operating system found there.

To install MongoDB, go to https://www.mongodb.org/downloads and download the package; the installation instructions can be found here http://docs.mongodb.org/manual/installation/.

Before you proceed, we assume that you have a PostgreSQL database up and running at localhost:5432, and a MongoDB database running at localhost:27017.

We have already loaded the traffic data to both of the databases and stored them in the traffic table (PostgreSQL) or the traffic collection (MongoDB).

Tip

If you do not know how to upload your data, I have explained this in my other book https://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-cookbook.

Interacting with relational databases

Let's read the data from the PostgreSQL database now. The Uniform Resource Identifier (URI) for accessing a PostgreSQL database has the following syntax: postgresql://<user_name>:<password>@<server>:<port>/<database>::<table>.

To read the data from PostgreSQL, you just pass the URI to bl.Data(...) - Blaze will take care of the rest:

traffic_psql = bl.Data(
    'postgresql://{0}:{1}@localhost:5432/drabast::traffic'
    .format('<your_username>', '<your_password>')
)

We use Python's .format(...) method to fill in the string with the appropriate data.

Tip

Substitute your credentials to access your PostgreSQL database in the previous example. If you want to read more about the .format(...) method, you can check out the Python 3.5 documentation https://docs.python.org/3/library/string.html#format-string-syntax.
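Building the URI is ordinary string formatting; the credentials below are placeholders, and the database and table names simply follow the book's example:

```python
# .format(...) substitutes the positional arguments into the
# {0} and {1} slots of the URI template
uri = ('postgresql://{0}:{1}@localhost:5432/drabast::traffic'
       .format('user', 'secret'))
print(uri)  # postgresql://user:secret@localhost:5432/drabast::traffic
```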

It is quite easy to output the data to either the PostgreSQL or SQLite databases. In the following example, we will output traffic violations that involved cars manufactured in 2016 to both PostgreSQL and SQLite databases. As previously noted, we will use odo to manage the transfers:

traffic_2016 = traffic_psql[traffic_psql['Year'] == 2016]
# Drop the target tables first if they already exist
# odo.drop('sqlite:///traffic_local.sqlite::traffic2016')
# odo.drop('postgresql://{0}:{1}@localhost:5432/drabast::traffic2016'
#     .format('<your_username>', '<your_password>'))
# Save to SQLite
odo.odo(traffic_2016,
    'sqlite:///traffic_local.sqlite::traffic2016')
# Save to PostgreSQL
odo.odo(traffic_2016,
    'postgresql://{0}:{1}@localhost:5432/drabast::traffic2016'
    .format('<your_username>', '<your_password>'))

In a similar fashion to pandas, to filter the data, we effectively select the Year column (the traffic_psql['Year'] part of the first line) and create a Boolean flag by checking whether each and every record in that column equals 2016. By indexing the traffic_psql object with such a truth vector, we extract only the records where the corresponding value equals True.
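The two-step mechanics of that filter are identical in pandas, so they can be demonstrated on toy data (a comparison sketch; the column name follows the book's example):

```python
import pandas as pd

cars = pd.DataFrame({'Year': [2014, 2016, 2016],
                     'Fine': [90, 120, 45]})

# Comparing the column to a scalar yields a Boolean Series...
mask = cars['Year'] == 2016
print(mask.tolist())  # [False, True, True]

# ...and indexing with it keeps only the rows where the flag is True
cars_2016 = cars[mask]
print(len(cars_2016))  # 2
```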

The two commented-out lines should be uncommented if you already have the traffic2016 tables in your databases; otherwise, odo will append the data to the end of the existing table.

The URI for SQLite is slightly different from the one for PostgreSQL; it has the following syntax: sqlite:///<relative/path/to/db.sqlite>::<table_name>.

Reading data from the SQLite database should be trivial for you by now:

traffic_sqlt = bl.Data(
    'sqlite:///traffic_local.sqlite::traffic2016'
)
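Because SQLite ships with Python, the whole write-then-read round trip can also be sketched with the standard library and pandas (an in-memory database stands in for traffic_local.sqlite; the table name follows the book's example):

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database instead of a file on disk
conn = sqlite3.connect(':memory:')
traffic_2016 = pd.DataFrame({'Year': [2016, 2016], 'Fine': [120, 45]})

# Write the table, then read it straight back
traffic_2016.to_sql('traffic2016', conn, index=False)
roundtrip = pd.read_sql('SELECT * FROM traffic2016', conn)
print(len(roundtrip))  # 2
conn.close()
```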

Interacting with the MongoDB database

MongoDB has gained a lot of popularity over the years. It is a simple, fast, and flexible document-based database, and a go-to storage solution for full-stack developers using the MEAN.js stack, in which M stands for Mongo (see http://meanjs.org).

Since Blaze is meant to work in a very familiar way no matter what your data source, reading from MongoDB is very similar to reading from PostgreSQL or SQLite databases:

traffic_mongo = bl.Data(
    'mongodb://localhost:27017/packt::traffic'
)