A.3. Using databases with R

Sometimes you want to use R to work with data in a database. Usually this is because the data is already in a database, or you want to use a high-performance database (such as Postgres or Apache Spark) to manipulate data at speed.

If your data is small enough to fit in memory (or you can spin up a large enough computer to make this so, say on Amazon EC2, Microsoft Azure, or Google Cloud), we suggest bringing the data over to R using DBI::dbReadTable() and then using data.table. Except for the data transfer time, this will be very hard to beat. Note, however, that writing large results back to a database is not fully supported on all R database drivers (sparklyr, in particular, explicitly does not support this).
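For concreteness, here is a minimal sketch of that pattern (ours, not from the original text), assuming the RSQLite and data.table packages are installed; it uses a throwaway in-memory SQLite database and an invented sales table purely for illustration:

library("DBI")
library("data.table")

# Throwaway in-memory database holding a small invented table (illustration only).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sales",
             data.frame(region = c("East", "West", "East"),
                        units  = c(10, 25, 7)))

# Pull the whole table into R, then work locally with data.table.
sales <- as.data.table(dbReadTable(con, "sales"))
sales[, .(total_units = sum(units)), by = region]

dbDisconnect(con)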

If you want to work with data in a database (which we usually do for our clients), then we suggest using a query generator such as rquery or dbplyr. We also believe the idea of thinking in terms of Codd relational operators (or thinking in terms of SQL databases) is very beneficial, so playing around with one of the preceding systems can be well worth the effort.

A.3.1. Running database queries using a query generator

Example

Ranking customer offers

We are given a table of data keyed by customer name and product name. For each of these key pairs, we have a suggested price discount fraction and a predicted discount offer affinity (both produced by machine learning models of the type we have been discussing in this book). Our task is to take this table and select the two offers with the highest predicted affinity for each customer. The business goal is this: we want to show each customer only these two offers, and none of the others.

To simulate this task, we will take some arbitrary data and copy it from R to a Postgres database. To run this example, you would need your own Postgres database, and copy in your own connection details, including host, port, username, and password. The purpose of this exercise is to give a taste of working with databases from R and a taste of thinking in Codd relational terms (the basis for many data processing systems, including dplyr).[9]

9

The full example and worked solution are available here: https://github.com/WinVector/PDSwR2/blob/master/BestOffers/BestOffers.md.

First, we set up our database connection and copy some data into this fresh database:

library("rquery")

raw_connection <- DBI::dbConnect(RPostgres::Postgres(),
                                 host = 'localhost',
                                 port = 5432,
                                 user = 'johnmount',
                                 password = '')              1

dbopts <- rq_connection_tests(raw_connection)                2
db <- rquery_db_info(
  connection = raw_connection,
  is_dbi = TRUE,
  connection_options = dbopts)

data_handle <- rq_copy_to(                                   3
  db,
  'offers',
  wrapr::build_frame(
    "user_name"  , "product"                        , "discount", "predicted_offer_affinity" |
      "John"     , "Pandemic Board Game"            , 0.1       , 0.8596                     |
      "Nina"     , "Pandemic Board Game"            , 0.2       , 0.1336                     |
      "John"     , "Dell XPS Laptop"                , 0.1       , 0.2402                     |
      "Nina"     , "Dell XPS Laptop"                , 0.05      , 0.3179                     |
      "John"     , "Capek's Tales from Two Pockets" , 0.05      , 0.2439                     |
      "Nina"     , "Capek's Tales from Two Pockets" , 0.05      , 0.06909                    |
      "John"     , "Pelikan M200 Fountain Pen"      , 0.2       , 0.6706                     |
      "Nina"     , "Pelikan M200 Fountain Pen"      , 0.1       , 0.616                      ),
  temporary = TRUE,
  overwrite = TRUE)

  • 1 Uses DBI to connect to a database. In this case, it connects to a local PostgreSQL instance; substitute your own connection details.
  • 2 Builds an rquery wrapper for the connection
  • 3 Copies some example data into the database
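As a quick sanity check (our addition, not part of the original example), you can peek at the freshly copied table directly through the DBI connection:

# Optional: look at a few rows of the remote table we just created.
DBI::dbGetQuery(raw_connection, "SELECT * FROM offers LIMIT 3")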

Now we will solve the problem by thinking relationally. We work in steps; with experience, we would see that to solve this problem we want to assign a per-user rank to each offer and then filter down to the ranks we want.

We will work this example using the rquery package. In rquery, window functions are available through the extend() method.[10] extend() can calculate a new column based both on a partition of the data (by user_name) and an ordering within each partition (by predicted_offer_affinity). It is easiest to demonstrate this in action.

10

The odd name “extend” was chosen out of respect for the source of these ideas: Codd’s relational algebra.

data_handle %.>%
  extend(.,                                                  1
         simple_rank = rank(),                               2
         partitionby = "user_name",                          3
         orderby = "predicted_offer_affinity",               4
         reverse = "predicted_offer_affinity") %.>%
  execute(db, .) %.>%                                        5
  knitr::kable(.)                                            6

# |user_name |product                        | discount| predicted_offer_affinity| simple_rank|
# |:---------|:------------------------------|--------:|------------------------:|-----------:|
# |Nina      |Pelikan M200 Fountain Pen      |     0.10|                  0.61600|           1|
# |Nina      |Dell XPS Laptop                |     0.05|                  0.31790|           2|
# |Nina      |Pandemic Board Game            |     0.20|                  0.13360|           3|
# |Nina      |Capek's Tales from Two Pockets |     0.05|                  0.06909|           4|
# |John      |Pandemic Board Game            |     0.10|                  0.85960|           1|
# |John      |Pelikan M200 Fountain Pen      |     0.20|                  0.67060|           2|
# |John      |Capek's Tales from Two Pockets |     0.05|                  0.24390|           3|
# |John      |Dell XPS Laptop                |     0.10|                  0.24020|           4|

  • 1 Pipes our data into the extend() method. Notice that we use the wrapr dot pipe.
  • 2 We’ll calculate rank(), or the order of the data rows within each window.
  • 3 The ranking will be recalculated for each user (our window partition).
  • 4 The window ordering that controls the rank will be from predicted_offer_affinity, reversed (largest first).
  • 5 Translates the operation plan into SQL, sends it to the database for execution, and brings the results back to R
  • 6 Pretty-prints the results

The question is this: how did we know to use the extend method and what options to set? That requires some experience with relational systems. There are only a few primary operations (adding derived columns, selecting columns, selecting rows, and joining tables) and only a few options (such as the partition and order when adding a windowed column). So the technique can be learned. The power of the theory is that just about any common data transform can be written in terms of these few fundamental data operators.
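To make those few primitives concrete, here is a minimal sketch (our own, not part of the worked example) that touches each of them using rquery on local table descriptions; the purchases_df table and the derived column are invented for illustration, and cat(format(...)) simply prints the operator tree, so no database is needed:

# Small local stand-ins (invented data, for illustration only).
offers_df <- data.frame(
  user_name = c("John", "Nina"),
  product = c("Pandemic Board Game", "Dell XPS Laptop"),
  discount = c(0.1, 0.05),
  predicted_offer_affinity = c(0.8596, 0.3179))
purchases_df <- data.frame(
  user_name = c("John", "Nina"),
  total_spend = c(250, 480))

sketch <- local_td(offers_df) %.>%
  extend(.,                                  # add a derived column
         discounted_affinity = predicted_offer_affinity * (1 - discount)) %.>%
  select_columns(.,                          # keep only some columns
                 c("user_name", "product", "discounted_affinity")) %.>%
  select_rows(.,                             # keep only some rows
              discounted_affinity > 0.2) %.>%
  natural_join(., local_td(purchases_df),    # join to another table
               by = "user_name", jointype = "LEFT")

cat(format(sketch))                          # print the resulting operator tree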

Now, to solve our full problem, we combine this operator with a few more relational operators (again using the wrapr dot pipe). This time we’ll have the result written into a remote table (so the intermediate data never moves to or from R), and then copy only the final result back after the calculation is complete.

ops <- data_handle %.>%                                           1
  extend(.,                                                       2
         simple_rank = rank(),
         partitionby = "user_name",
         orderby = "predicted_offer_affinity",
         reverse = "predicted_offer_affinity") %.>%
  select_rows(.,                                                  3
              simple_rank <= 2) %.>%
  orderby(., c("user_name", "simple_rank"))                       4

result_table <- materialize(db, ops)                              5

DBI::dbReadTable(db$connection, result_table$table_name) %.>%     6
  knitr::kable(.)

# |user_name |product                   | discount| predicted_offer_affinity| simple_rank|
# |:---------|:-------------------------|--------:|------------------------:|-----------:|
# |John      |Pandemic Board Game       |     0.10|                   0.8596|           1|
# |John      |Pelikan M200 Fountain Pen |     0.20|                   0.6706|           2|
# |Nina      |Pelikan M200 Fountain Pen |     0.10|                   0.6160|           1|
# |Nina      |Dell XPS Laptop           |     0.05|                   0.3179|           2|

  • 1 Defines our sequence of operations
  • 2 Marks each row with its simple per-user rank
  • 3 Selects the top two ranked rows (highest predicted affinity) for each user
  • 4 Orders the rows by user name and rank
  • 5 Runs the result in the database, instantiating a new result table
  • 6 Copies the result back to R and pretty-prints it

The reason we saved the operation plan in the variable ops is that we can do a lot more with it than just execute it. For example, we can create a diagram of the planned operations, as in figure A.5.

Figure A.5. rquery operation plan diagram
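Figure A.5 is a rendering of that operation plan. If you want to draw a similar diagram yourself, recent versions of rquery include an op_diagram() helper whose Graphviz output can be rendered with the DiagrammeR package (a sketch, assuming both packages are installed):

ops %.>%
  op_diagram(.) %.>%      # Graphviz dot text describing the plan
  DiagrammeR::grViz(.)    # render the diagram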

Also (and this is the big point), we can see the SQL that actually gets sent to the database. Without a query generator (such as rquery or dbplyr), we would have to write something like this SQL by hand:

ops %.>%
  to_sql(., db) %.>%
  cat(.)

## SELECT * FROM (
##  SELECT * FROM (
##   SELECT
##    "user_name",
##    "product",
##    "discount",
##    "predicted_offer_affinity",
##    rank ( ) OVER (  PARTITION BY "user_name" ORDER BY "predicted_offer_affinity" DESC ) AS "simple_rank"
##   FROM (
##    SELECT
##     "user_name",
##     "product",
##     "discount",
##     "predicted_offer_affinity"
##    FROM
##     "offers"
##    ) tsql_17135820721167795865_0000000000
##  ) tsql_17135820721167795865_0000000001
##  WHERE "simple_rank" <= 2
## ) tsql_17135820721167795865_0000000002 ORDER BY "user_name", "simple_rank"

The issue is that relational thinking is productive, but SQL itself is fairly verbose. In particular, SQL expresses sequencing or composition as nesting, which means we read from the inside out. A lot of the grace of Codd’s ideas is recovered when we move to an operator notation (such as seen in dplyr or rquery).
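For comparison (this is our sketch, not part of the original example), the same top-two-offers query can be phrased in dplyr/dbplyr notation over the same connection; min_rank(desc(...)) becomes a RANK() window function, and show_query() prints the generated SQL. This assumes the dplyr and dbplyr packages are installed:

library("dplyr")

tbl(raw_connection, "offers") %>%
  group_by(user_name) %>%
  mutate(simple_rank = min_rank(desc(predicted_offer_affinity))) %>%
  filter(simple_rank <= 2) %>%
  arrange(user_name, simple_rank) %>%
  show_query()   # or collect() to bring the result back to R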

A longer treatment (with more references) of this example can be found here: https://github.com/WinVector/PDSwR2/blob/master/BestOffers/BestOffers.md.

Relational data manipulation thinks in terms of operators, which we briefly touched on here, and data organization, which is the topic of our next section.

A.3.2. How to think relationally about data

The trick to thinking relationally about data is this: for every table, classify the columns into a few important themes, and work with the natural relations between these themes. One view of the major column themes is provided in table A.1.

Table A.1. Major SQL column themes

Natural key columns
Description: In many tables, one or more columns taken together form a natural key that uniquely identifies the row. Some data (such as running logs) doesn’t have natural keys (many rows may correspond to a given timestamp).
Common uses and treatments: Natural keys are used to sort data, control joins, and specify aggregations.

Surrogate key columns
Description: Surrogate key columns are key columns (collections of columns that uniquely identify rows) that don’t have a natural relation to the problem. Examples of surrogate keys include row numbers and hashes. In some cases (like analyzing time series), the row number can be a natural key, but usually it’s a surrogate key.
Common uses and treatments: Surrogate key columns can be used to simplify joins; they tend not to be useful for sorting and aggregation. Surrogate key columns must not be used as modeling features, as they don’t represent useful measurements.

Provenance columns
Description: Provenance columns are columns that contain facts about the row, such as when it was loaded. The ORIGINSERTTIME, ORIGFILENAME, and ORIGFILEROWNUMBER columns added in section 2.3.1 are examples of provenance columns.
Common uses and treatments: Provenance columns shouldn’t be used in analyses, except for confirming you’re working on the right dataset, selecting a dataset (if different datasets are commingled in the same table), and comparing datasets.

Payload columns
Description: Payload columns contain actual data, such as prices and counts.
Common uses and treatments: Payload columns are used for aggregation, grouping, and conditions. They can also sometimes be used to specify joins.

Experimental design columns
Description: Experimental design columns include sample grouping like ORIGRANDGROUP from section 2.3.1, or data weights like the PWGTP* and WGTP* columns we mentioned in section 7.1.1.
Common uses and treatments: Experimental design columns can be used to control an analysis (to select subsets of data, or as weights in modeling operations), but they should never be used as features in an analysis.

Derived columns
Description: Derived columns are functions of other columns or groups of columns. An example would be the day of week (Monday through Sunday), which is a function of the date. Derived columns can be functions of keys (which means they’re unchanging in many GROUP BY queries, even though SQL will insist on specifying an aggregator such as MAX()) or functions of payload columns.
Common uses and treatments: Derived columns are useful in analysis. A full normal form database doesn’t have such columns; in normal forms, the idea is to not store anything that can be derived, which eliminates certain types of inconsistency (such as a row with the date February 1, 2014, and the day of week Wednesday, when the correct day of week is Saturday). But during analyses, it’s always a good idea to store intermediate calculations in tables and columns: it simplifies code and makes debugging much easier.

The point is that analysis is much easier if you have a good taxonomy of column themes for every supplied data source. You then design SQL command sequences to transform your data into a new table where the columns are just right for analysis. In the end, you should have tables where every row is an event you’re interested in, and every needed fact is already available in a column (which has been called a model matrix for a very long time, or a denormalized table in relational database terms).
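As a small illustration of that workflow (our own sketch, reusing raw_connection and the offers table from section A.3.1, with an invented customers attribute table), the following query denormalizes the data into one analysis-ready row per offer, with a derived column already attached:

# Invented customer-attribute table, for illustration only.
DBI::dbWriteTable(raw_connection, "customers",
                  data.frame(user_name = c("John", "Nina"),
                             region    = c("East", "West")),
                  temporary = TRUE, overwrite = TRUE)

# Denormalize: one row per offer, with the customer attribute and a
# derived column in place for analysis.
DBI::dbGetQuery(raw_connection, "
  SELECT
    o.user_name,
    o.product,
    c.region,
    o.discount,
    o.predicted_offer_affinity,
    o.predicted_offer_affinity * (1 - o.discount) AS discounted_affinity
  FROM offers o
  JOIN customers c
    ON c.user_name = o.user_name
")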

Further database reading

Our go-to database reference is Joe Celko's SQL for Smarties, Fourth Edition (Morgan Kaufmann, 2011).
