To speed up our work in data analysis, we will convert our csv file into a SQLite3 database. With a SQLite3 database, our work becomes easier: we spend less time on code that guards the integrity of the data and more time filtering, slicing, and combining columns of data to meet our needs.
We wish to craft a function that will produce a SQLite3 database from the contents of a csv file. SQLite3 is a relational database management system. In csv files, we explore data through records, where every field in a record has a column heading. Relational databases take much the same approach, with a few important differences. A relational database comprises one or more tables, each identified by a name. Each table comprises zero or more records, where each record has one field per column heading, making a table analogous to a csv file. An important distinction between a table and a csv file is that a table requires each field to adhere to a data type, whereas a csv file treats all fields as text.
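To make that distinction concrete, here is a minimal sketch (with a hypothetical subset of earthquake columns) of the same record viewed first as raw CSV text and then as a typed table row:

```haskell
-- A raw CSV record: every field is just text.
csvRecord :: [String]
csvRecord = ["2015-03-01T12:00:00Z", "36.1", "-97.6", "5.0"]

-- The same record as a typed row, the way a relational database
-- would store it (hypothetical column subset for illustration).
data QuakeRow = QuakeRow
  { qTime  :: String  -- ISO 8601 timestamp, stored as TEXT
  , qLat   :: Double  -- stored as REAL
  , qLon   :: Double  -- stored as REAL
  , qDepth :: Double  -- stored as REAL
  } deriving Show

-- Converting the text fields to typed fields. Unlike a csv file,
-- the typed row rejects values that do not parse as the right type.
typedRow :: QuakeRow
typedRow = QuakeRow (head csvRecord)
                    (read (csvRecord !! 1))
                    (read (csvRecord !! 2))
                    (read (csvRecord !! 3))
```

A malformed field (say, `"5.0m"` in the depth column) would slip silently through the text version but fail to `read` in the typed version, which is exactly the kind of integrity check a database table gives us for free.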
To begin preparing our environment, make sure that you have SQLite3 and the Haskell Database Connectivity (HDBC) package installed on your system. HDBC is known as a database abstraction layer. We could interface directly with SQLite3, but then all our code would be tied to SQLite3. By using an abstraction layer, we can keep using SQLite3 or switch to a different relational database system in the future (such as MySQL or PostgreSQL) with minimal code changes. Within Haskell, we are going to need the HDBC package, the sqlite package, and the HDBC-sqlite3 package. On Debian-based systems, the system-level packages can be installed with apt-get on the command line, as follows:
sudo apt-get install sqlite3
sudo apt-get install libghc-hdbc-sqlite3-dev
To install all of the necessary modules, we can run the following cabal install line:
cabal install HDBC sqlite HDBC-sqlite3
Got all of this installed? Good. As with the previous example, we will break the function into smaller functions, each with its own responsibility. In this example, we will craft a function to open the csv file (similar to the function used in the previous example) and a second function to create the database from a CSV expression.
Those of you who are using Windows can also participate, but you will need to download and install the SQLite binaries from http://www.sqlite.org. There, you will find the Sqlite and HDBC-sqlite packages. These can be hooked in during installation using the --extra-lib-dirs and --extra-include-dirs flags.
To craft our function to create a SQLite3 database from a CSV value, we will require the desired table name, the desired filename for the SQLite3 file, and the type information for each column in the CSV value. For each column in our table, we can reuse the column name from the CSV value, but this presents a problem. Often, the column names used in csv files contain spaces and symbols, neither of which are handled elegantly by database systems or by the programmers and analysts who use them. One option is to manually edit the csv file with the desired column names. While this works, it immediately makes our dataset less portable: to transfer our programs to someone else, we would have to ask them to manually edit any new data sources in the same way. I humbly request that you resist any temptation to manually edit your data sources except when you have to correct something nuanced that cannot be automated.
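If you do need to clean up awkward headings, the rename can be automated instead of done by hand. Here is a hypothetical helper, not part of the chapter's code, that makes a heading SQL-friendly by replacing every character that is not a letter or digit with an underscore:

```haskell
import Data.Char (isAlphaNum)

-- Hypothetical helper: make a csv column heading safe for SQL by
-- replacing spaces and symbols with underscores.
sanitizeColumnName :: String -> String
sanitizeColumnName = map (\c -> if isAlphaNum c then c else '_')
```

For example, `sanitizeColumnName "mag type"` yields `"mag_type"`. Because the rule is deterministic, anyone who runs your program on a fresh copy of the data gets the same column names, which keeps the dataset portable.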
The USGS has an excellent glossary describing each of the columns in their datasets, including plain English definitions, data types, and numerical ranges. Each of their columns is represented by a single word (or a short phrase in camel case). We have no need to rename the columns in this dataset.
Let us examine the column headings used in the earthquake dataset. Using the glossary information found at http://earthquake.usgs.gov/earthquakes/feed/v1.0/glossary.php, I included the data type information, which is as follows:
- time: ISO 8601 Timestamp String
- latitude: Decimal
- longitude: Decimal
- depth: Decimal
- mag: Decimal
- magType: String
- nst: Integer
- gap: Decimal
- dmin: Decimal
- rms: Decimal
- net: Decimal
- id: String
- updated: ISO 8601 Timestamp String
- place: String
- type: String

There are four data types used in the dataset: Integer, Decimal, String, and ISO 8601 Timestamp String (which can be used by the time and date functions in SQLite3). Each data type must be converted to its equivalent type in SQLite3. Fortunately, these conversions are relatively straightforward and can be listed as follows:
- Integer becomes INTEGER
- Decimal becomes REAL
- String becomes TEXT
- ISO 8601 Timestamp String becomes TEXT
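This mapping can be captured in a small lookup function. The helper below is a hypothetical sketch for illustration; the chapter's code instead passes the SQLite3 types directly in the field list:

```haskell
-- Hypothetical mapping from the glossary's type names to
-- SQLite3 column types.
glossaryTypeToSQLite :: String -> String
glossaryTypeToSQLite "Integer" = "INTEGER"
glossaryTypeToSQLite "Decimal" = "REAL"
glossaryTypeToSQLite _         = "TEXT"  -- String and ISO 8601 timestamps
```

Note that SQLite3 is lenient about column types (it uses type affinity rather than strict typing), so TEXT is a safe catch-all for the string-like glossary types.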
Before we get much further, we need to include our import statements for our database libraries, which are as follows:
> import Database.HDBC
> import Database.HDBC.Sqlite3
> import Data.List (intercalate)
Here is the function that turns a CSV expression into a database. Rather than return anything, we print confirmation messages on the screen, as follows:
-- Converts a CSV expression into an SQL database.
-- Returns "Successful" if successful,
-- error message otherwise.
convertCSVToSQL :: String -> FilePath -> [String] -> CSV -> IO ()
convertCSVToSQL tableName outFileName fields records =
  -- Check to make sure that the number of
  -- columns matches the number of fields
  if nfieldsInFile == nfieldsInFields
    then do
      -- Open a connection
      conn <- connectSqlite3 outFileName
      -- Create a new table
      run conn createStatement []
      -- Load contents of CSV file into table
      stmt <- prepare conn insertStatement
      executeMany stmt
        (tail (filter (\record -> nfieldsInFile == length record) sqlRecords))
      -- Commit changes
      commit conn
      -- Close the connection
      disconnect conn
      -- Report that we were successful
      putStrLn "Successful"
    else
      putStrLn "The number of input fields differs from the csv file."
  where
    nfieldsInFile   = length $ head records
    nfieldsInFields = length fields
    createStatement = "CREATE TABLE " ++ tableName ++
                      " (" ++ intercalate ", " fields ++ ")"
    insertStatement = "INSERT INTO " ++ tableName ++ " VALUES (" ++
                      intercalate ", " (replicate nfieldsInFile "?") ++ ")"
    sqlRecords = map (\record -> map (\element -> toSql element) record) records
There's a lot to explain in the above code. Here are the related HDBC statements:
- The connectSqlite3 statement will create a blank SQLite3 file.
- The run conn createStatement [] statement creates an initial table within the database. Inspecting the createStatement line in the where clause reveals that this is a simple SQL statement to create a table based on the types and column names supplied by the fields input expression. The run statement allows us to perform string interpolation on SQL statements by replacing the ? symbols with values supplied in a list. We did not use this particular feature when running createStatement, but we must still pass an empty list.
- The stmt <- prepare conn insertStatement statement prepares the INSERT SQL statement for batch processing of each record.
- The executeMany stmt statement performs the batch processing of the SQL statement for each record in the CSV expression whose number of columns equals the number of column headings. If a record in your CSV expression has too many or too few fields, it is skipped.
- The commit conn statement does just what its name suggests: it commits the recent changes to the file.
- The disconnect conn statement formally disconnects from the database (that is, it closes the file).
- In the where clause, you may notice that all the fields in the database are converted to their respective SQL types using the toSql function.

Here is the convertCSVFileToSQL function that will open a csv file and pass the contents of our file to the convertCSVToSQL function. Since this is functionally similar to the applyToColumnInCSVFile function mentioned earlier, we will spare you the details of how it works. We will test the two functions introduced in this section together:
-- Converts a CSV file to an SQL database file.
-- Prints "Successful" if successful,
-- error message otherwise.
convertCSVFileToSQL :: String -> String -> String -> [String] -> IO ()
convertCSVFileToSQL inFileName outFileName tableName fields = do
  -- Open and read the CSV file
  input <- readFile inFileName
  let records = parseCSV inFileName input
  -- Check to make sure this is a good csv file
  either handleCSVError convertTool records
  where
    convertTool = convertCSVToSQL tableName outFileName fields
    handleCSVError _ = putStrLn "This does not appear to be a CSV file."
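Before running the conversion on a real dataset, the SQL-statement construction from convertCSVToSQL's where clause can be sanity-checked in isolation. The sketch below mirrors that logic as standalone pure functions (the names are my own, for illustration):

```haskell
import Data.List (intercalate)

-- Mirrors the createStatement binding from convertCSVToSQL.
buildCreateStatement :: String -> [String] -> String
buildCreateStatement tableName fields =
  "CREATE TABLE " ++ tableName ++ " (" ++ intercalate ", " fields ++ ")"

-- Mirrors the insertStatement binding: one ? placeholder per column,
-- to be filled in by executeMany.
buildInsertStatement :: String -> Int -> String
buildInsertStatement tableName n =
  "INSERT INTO " ++ tableName ++ " VALUES (" ++
  intercalate ", " (replicate n "?") ++ ")"
```

For example, `buildCreateStatement "oneWeek" ["time TEXT", "mag REAL"]` produces `CREATE TABLE oneWeek (time TEXT, mag REAL)`, and `buildInsertStatement "oneWeek" 3` produces `INSERT INTO oneWeek VALUES (?, ?, ?)`.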
We will begin our testing, as follows:
> :l LearningDataAnalysis02.hs
> convertCSVFileToSQL "all_week.csv" "earthquakes.sql" "oneWeek" ["time TEXT", "latitude REAL", "longitude REAL", "depth REAL", "mag REAL", "magType TEXT", "nst INTEGER", "gap REAL", "dmin REAL", "rms REAL", "net REAL", "id TEXT", "updated TEXT", "place TEXT", "type TEXT"]
Successful
At the completion of this step, a new file named earthquakes.sql will exist in your current working directory. We will inspect our new database, as follows:
> conn <- connectSqlite3 "earthquakes.sql"
> magnitudes <- quickQuery' conn "SELECT mag FROM oneWeek" []
> :t magnitudes
magnitudes :: [[SqlValue]]
Note that this type information comes from a simple SQL statement applied to our database. The data is returned as a list of rows, where each row is a list of SqlValue values. We will inspect the first element that was returned by converting it into a Double using the fromSql function, as follows:
> fromSql $ head $ head magnitudes :: Double 0.5
To end this chapter, let's apply the data returned by the SQL statement to the average function that was written at the beginning of the chapter:
> let magnitudesDouble = map (\record -> fromSql $ head record :: Double) magnitudes
> average magnitudesDouble
1.6950637393767727
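If you are following along without the earlier code loaded, here is a minimal stand-in for the average function referenced above, assuming (as the usage here suggests) that it takes a list of Double:

```haskell
-- Minimal arithmetic mean; a stand-in for the average function
-- written earlier in the chapter.
average :: [Double] -> Double
average xs = sum xs / fromIntegral (length xs)
```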