Converting csv files to the SQLite3 format

To speed up our venture into data analysis, we will convert our csv file into a SQLite3 database. With a SQLite3 database, our work becomes easier: we spend less time writing code to maintain the integrity of the data and more time filtering, splicing, and combining columns of data to meet our needs.

We wish to craft a function that will produce a SQLite3 database based on the contents of a csv file. SQLite3 is a relational database management system. In csv files, we explore data through records, where every field in a record has a column heading. Relational databases take much the same approach, with a few important differences. A relational database comprises one or more tables, each identified by a name. Each table comprises zero or more records, where each record holds one field per column heading, making a table analogous to a csv file. An important distinction between a table and a csv file is that a table requires each field to adhere to a data type, whereas a csv file treats all fields as text.

Preparing our environment

To begin preparing our environment, make sure that you have SQLite3 and the Haskell Database Connectivity (HDBC) package installed on your system. HDBC is known as a database abstraction layer. We could interface directly with SQLite3, but then we would be tied to SQLite3 in all our code. By using an abstraction layer, we can continue using SQLite3 or switch to a different relational database system (such as MySQL or PostgreSQL) in the future with minimal code changes. Within Haskell, we are going to need the HDBC package, the sqlite package, and the HDBC-sqlite3 package. On Debian-based systems, we can install SQLite3 and its development library with apt-get on the command line, as follows:

sudo apt-get install sqlite3
sudo apt-get install libghc-hdbc-sqlite3-dev

To install all of the necessary Haskell packages, we can run the following cabal install line:

cabal install HDBC sqlite HDBC-sqlite3
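
As a quick, optional sanity check, you can confirm that the Haskell packages are visible by loading the HDBC modules in GHCi; if both imports succeed without complaint, the installation is in place:

> import Database.HDBC
> import Database.HDBC.Sqlite3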

Got all of this installed? Good. As with the previous example, we will break the function into smaller functions, each with its own responsibility. In this example, we will craft a function to open the csv file (which is similar to the function used in the previous example) and a second function to create the database based on a CSV expression.

Those of you who are using Windows can also follow along, but you will need to download and install the SQLite binaries from http://www.sqlite.org. The Haskell packages can then be pointed at those binaries during installation using the --extra-lib-dirs and --extra-include-dirs flags.
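
As a rough sketch of what that looks like (the C:\sqlite3 paths below are placeholders for wherever you unpacked the SQLite headers and binaries, not required locations):

cabal install HDBC sqlite HDBC-sqlite3 --extra-include-dirs=C:\sqlite3 --extra-lib-dirs=C:\sqlite3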

Describing our needs

To craft our function to create a SQLite3 database from a CSV value, we will require the desired table name, the desired filename for the SQLite3 file, and the type information for each column in the CSV value. For each column in our table, we can reuse the column name from the CSV value, but this presents a potential problem. Often, the column names used in csv files contain spaces and symbols, neither of which is handled elegantly by database systems or by the programmers and analysts who use them. One option is to manually edit the csv file to use the desired column names. While this is a solution, our dataset immediately becomes less portable. To transfer our programs to someone else, we will have to request that they also manually edit any new data sources to use the desired column names. I humbly request that you resist any temptation to manually edit your data sources except when you have to correct something nuanced that cannot be automated.
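
If you do run into headings that contain spaces or symbols, one way to automate the cleanup instead of editing the file by hand is a small renaming function. The sanitizeHeading function below is a hypothetical sketch rather than part of this chapter's code; it replaces every character that is not a letter, digit, or underscore with an underscore:

import Data.Char (isAlphaNum)

-- Replace every character that is not a letter, digit, or underscore
-- with an underscore so that the heading is safe to use as a column name.
sanitizeHeading :: String -> String
sanitizeHeading = map (\c -> if isAlphaNum c || c == '_' then c else '_')

For example, sanitizeHeading "Station Name (primary)" evaluates to "Station_Name__primary_".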

Inspecting column information

The USGS has an excellent glossary describing each of the columns in their datasets, including plain English definitions, data types, and numerical ranges. Each of their columns is represented by a single word (or a short phrase in camel case). We have no need to rename the columns in this dataset.

Let us examine the column headings used in the earthquake dataset. Using the glossary found at http://earthquake.usgs.gov/earthquakes/feed/v1.0/glossary.php, I have included the data type of each column, as follows:

  • time: ISO 8601 Timestamp String
  • latitude: Decimal
  • longitude: Decimal
  • depth: Decimal
  • mag: Decimal
  • magType: String
  • nst: Integer
  • gap: Decimal
  • dmin: Decimal
  • rms: Decimal
  • net: Decimal
  • id: String
  • updated: ISO 8601 Timestamp String
  • place: String
  • type: String

There are four data types used in the dataset: Integer, Decimal, String, and ISO 8601 Timestamp String (which can be used by the time and date functions in SQLite3). Each data type must be converted to its equivalent type in SQLite3. Fortunately, these conversions are relatively straightforward and can be listed as follows (a small helper sketch follows the list):

  • Integer will become INTEGER
  • Decimal will become REAL
  • String will become TEXT
  • ISO 8601 Timestamp String will become TEXT
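
If you would rather encode this mapping once instead of remembering it, a tiny lookup function does the job. The sqliteType helper below is a hypothetical sketch and is not used by the rest of the chapter, which simply writes the SQLite3 type names out by hand:

-- Hypothetical helper: map a glossary type name to its SQLite3 column type.
sqliteType :: String -> String
sqliteType "Integer"                   = "INTEGER"
sqliteType "Decimal"                   = "REAL"
sqliteType "String"                    = "TEXT"
sqliteType "ISO 8601 Timestamp String" = "TEXT"
sqliteType _                           = "TEXT"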

Before we get much further, we need to include our import statements for our database libraries, which are as follows:

> import Database.HDBC
> import Database.HDBC.Sqlite3
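
The function we are about to write also uses intercalate from Data.List. If it is not already in scope from earlier in the chapter, add this import as well:

> import Data.List (intercalate)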

Crafting our functions

Here is the function that turns a CSV expression into a database. Rather than return anything, we print confirmation messages on the screen, as follows:

-- Converts a CSV expression into an SQL database
-- Prints "Successful" if successful,
-- an error message otherwise.
convertCSVToSQL ::
    String -> FilePath -> [String] -> CSV -> IO ()
convertCSVToSQL tableName outFileName fields records =
    -- Check to make sure that the number of
    --   columns matches the number of fields
    if nfieldsInFile == nfieldsInFields then do
      -- Open a connection
      conn <- connectSqlite3 outFileName

      -- Create a new table
      run conn createStatement []
      -- Load the contents of the CSV file into the table
      stmt <- prepare conn insertStatement
      executeMany stmt (tail
        (filter (\record -> nfieldsInFile == length record)
        sqlRecords))

      -- Commit the changes
      commit conn
      -- Close the connection
      disconnect conn
      -- Report that we were successful
      putStrLn "Successful"
    else
      putStrLn
        "The number of input fields differs from the csv file."

  where
    nfieldsInFile = length $ head records
    nfieldsInFields = length fields
    createStatement = "CREATE TABLE " ++
                      tableName ++
                      " (" ++ (intercalate ", " fields) ++ ")"
    insertStatement = "INSERT INTO " ++
                      tableName ++ " VALUES (" ++
                      (intercalate ", "
                      (replicate nfieldsInFile "?")) ++ ")"
    sqlRecords = map (\record ->
                   map (\element -> toSql element) record
                 ) records

There's a lot to explain in the above code. Here are the related HDBC statements:

  • The connectSqlite3 statement creates a blank SQLite3 file (or opens the file if it already exists). The run conn createStatement [] statement creates an initial table within the database. Inspecting the createStatement line in the where clause reveals that this is a simple SQL statement to create a table based on the types and column names supplied by the fields input expression (see the sketch after this list). The run statement allows us to parameterize SQL statements by replacing the ? symbols with values supplied in a list. We did not use this particular feature in createStatement, but we must still pass an empty list.
  • The stmt <- prepare conn insertStatement statement prepares the INSERT SQL statement for the batch processing of each record.
  • The executeMany stmt statement performs the batch processing of the INSERT statement for every record in the CSV expression whose number of fields equals the number of column headings. If a record in your CSV expression has too many or too few fields, it is skipped.
  • The commit conn statement does just what its name suggests. It commits the recent changes to the file.
  • The disconnect conn statement formally disconnects from the database (that is, it closes the file).
  • In the where clause, you may notice that every field from the CSV expression is converted to an SQL value using the toSql function before it is inserted.
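
To make the string building in the where clause concrete, here is what the two statements evaluate to for a small, hypothetical two-column table (the quakes table and its columns here are made up for illustration, and intercalate is assumed to be in scope):

> let fields = ["name TEXT", "mag REAL"]
> "CREATE TABLE quakes (" ++ intercalate ", " fields ++ ")"
"CREATE TABLE quakes (name TEXT, mag REAL)"
> "INSERT INTO quakes VALUES (" ++ intercalate ", " (replicate 2 "?") ++ ")"
"INSERT INTO quakes VALUES (?, ?)"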

Here is the convertCSVFileToSQL function that will open a csv file and pass the contents of our file to the convertCSVToSQL function. Since this is functionally similar to the applyToColumnInCSVFile function mentioned earlier, we will spare you the details of how it works. We will also test the two functions introduced in this section together:

-- Converts a CSV file to an SQL database file
-- Prints "Successful" if successful, error message otherwise
convertCSVFileToSQL ::
    String -> String -> String -> [String] -> IO ()
convertCSVFileToSQL inFileName outFileName tableName fields = do
    -- Open and read the CSV file
    input <- readFile inFileName
    let records = parseCSV inFileName input

    -- Check to make sure this is a good csv file
    either handleCSVError convertTool records
  where
    convertTool = convertCSVToSQL tableName outFileName fields
    handleCSVError err =
        putStrLn "This does not appear to be a CSV file."

We will begin our testing, as follows:

> :l LearningDataAnalysis02.hs
> convertCSVFileToSQL "all_week.csv" "earthquakes.sql" "oneWeek" ["time TEXT", "latitude REAL", "longitude REAL", "depth REAL", "mag REAL", "magType TEXT", "nst INTEGER", "gap REAL", "dmin REAL", "rms REAL", "net REAL", "id TEXT", "updated TEXT", "place TEXT", "type TEXT"]
Successful

At the completion of this step, a new file named earthquakes.sql will exist in your current working directory. We will inspect our new database, as follows:

> conn <- connectSqlite3 "earthquakes.sql"
> magnitudes <- quickQuery' conn "SELECT mag FROM oneWeek" []
> :t magnitudes
magnitudes :: [[SqlValue]]

Note the type information here: the result of this simple SQL query is returned as a list of rows, where each row is a list of SqlValue values. We will inspect the first element that was returned by converting it into a Double using the fromSql function, as follows:

> fromSql $ head $ head magnitudes :: Double
0.5
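
Note that fromSql throws an exception if a value cannot be converted to the requested type. HDBC also provides safeFromSql, which returns an Either value instead, in case you would rather handle a failed conversion explicitly; the ConvertError type comes from the convertible package (Data.Convertible), on which HDBC builds:

> import Data.Convertible (ConvertError)
> safeFromSql (head $ head magnitudes) :: Either ConvertError Double
Right 0.5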

To end this chapter, let's apply the data returned by the SQL statement to the average function that was written at the beginning of the chapter:

> let magnitudesDouble = map (\record -> fromSql $ head record :: Double) magnitudes
> average magnitudesDouble
1.6950637393767727