Chapter 11. Documentation and deployment

This chapter covers

  • Producing effective milestone documentation
  • Managing project history using source control
  • Deploying results and making demonstrations

In this chapter, we’ll survey techniques for documenting and deploying your work. We will work through specific scenarios, and point to resources for further study if you want to master the techniques being discussed. The theme is this: now that you can build machine learning models, you should explore tools and procedures to become proficient at saving, sharing, and repeating successes. Our mental model for this chapter (figure 11.1) emphasizes that this chapter is all about sharing what you model. Table 11.1 sets out some more-specific goals in this direction.

Figure 11.1. Mental model

Table 11.1. Chapter goals

  • Produce effective milestone documentation: A readable summary of project goals, data provenance, steps taken, and technical results (numbers and graphs). Milestone documentation is usually read by collaborators and peers, so it can be concise and can often include actual code. We’ll demonstrate a great tool for producing excellent milestone documentation: the R knitr and rmarkdown packages, which we will refer to generically as R markdown. R markdown is a product of the “reproducible research” movement (see Christopher Gandrud’s Reproducible Research with R and RStudio, Second Edition, Chapman and Hall, 2015) and is an excellent way to produce a reliable snapshot that not only shows the state of a project, but allows others to confirm the project works.
  • Manage a complete project history: It makes little sense to have exquisite milestone or checkpoint documentation of how your project worked last February if you can’t get a copy of February’s code and data. This is why you need good version control discipline to protect code, and good data discipline to preserve data.
  • Deploy demonstrations: True production deployments are best done by experienced engineers. These engineers know the tools and environment they will be deploying to. A good way to jump-start production deployment is to have a reference application. This allows engineers to experiment with your work, test corner cases, and build acceptance tests.

This chapter explains how to share your work—even sharing it with your future self. We’ll discuss how to use R markdown to create substantial project milestone documentation and automate reproduction of graphs and other results. You’ll learn about using effective comments in code, and using Git for version management and for collaboration. We’ll also discuss deploying models as HTTP services and applications.

For some of the examples, we will use RStudio, which is an integrated development environment (IDE) that is a product of RStudio, Inc. (and not part of R/CRAN itself). Everything we show can be done without RStudio, but RStudio supplies a basic editor and some single-button-press alternatives to some scripting tasks.

11.1. Predicting buzz

Example

For our example scenario, we want to use metrics collected about the first few days of article views to predict the long-term popularity of an article. This can be important for selling advertising and predicting and managing revenue. To be specific: we will use measurements taken during the first eight days of an article’s publication to predict if the article will remain popular in the long term.

Our tasks for this chapter are to save and share our Buzz model, document the model, test the model, and deploy the model into production.

To simulate our example scenario of predicting long-term article popularity or buzz, we will use the Buzz dataset from http://ama.liglab.fr/datasets/buzz/. We’ll work with the data found in the file TomsHardware-Relative-Sigma-500.data.txt.[1] The original supplied documentation (TomsHardware-Relative-Sigma-500.names.txt and BuzzDataSetDoc.pdf) tells us the Buzz data is structured as shown in table 11.2.

1

All files mentioned in this chapter are available from https://github.com/WinVector/PDSwR2/tree/master/Buzz.

Table 11.2. Buzz data description

  • Rows: Each row represents many different measurements of the popularity of a technical personal computer discussion topic.
  • Topics: Topics include technical issues about personal computers such as brand names, memory, overclocking, and so on.
  • Measurement types: For each topic, measurement types are quantities such as the number of discussions started, number of posts, number of authors, number of readers, and so on. Each measurement is taken at eight different times.
  • Times: The eight relative times are named 0 through 7 and are likely days (the original variable documentation is not completely clear and the matching paper has not yet been released). For each measurement type, all eight relative times are stored in different columns in the same data row.
  • Buzz: The quantity to be predicted is called buzz and is defined as being true or 1 if the ongoing rate of additional discussion activity is at least 500 events per day averaged over a number of days after the observed days. Likely buzz is a future average of the seven variables labeled NAC (the original documentation is unclear on this).

In our initial Buzz documentation, we list what we know (and, importantly, admit what we’re not sure about). We don’t intend any disrespect in calling out issues in the supplied Buzz documentation. That documentation is about as good as you see at the beginning of a project. In an actual project, you’d clarify and improve unclear points through discussions and work cycles. This is one reason why having access to active project sponsors and partners is critical in real-world projects.

In this chapter, we’ll use the Buzz model and dataset as is and concentrate on demonstrating the tools and techniques used in producing documentation, deployments, and presentations. In actual projects, we advise you to start by producing notes like those in table 11.2. You’d also incorporate meeting notes to document your actual project goals. As this is only a demonstration, we’ll emphasize technical documentation: data provenance and an initial trivial analysis to demonstrate we have control of the data. Our example initial Buzz analysis is found here: https://github.com/WinVector/PDSwR2/blob/master/Buzz/buzzm.md. We suggest you skim it before we work through the tools and steps used to produce the documents in our next section.

11.2. Using R markdown to produce milestone documentation

The first audience you’ll have to prepare documentation for is yourself and your peers. You may need to return to previous work months later, and it may be in an urgent situation like an important bug fix, presentation, or feature improvement. For self/peer documentation, you want to concentrate on facts: what the stated goals were, where the data came from, and what techniques were tried. You assume that as long as you use standard terminology or references, the reader can figure out anything else they need to know. You want to emphasize any surprises or exceptional issues, as they’re exactly what’s expensive to relearn. You can’t expect to share this sort of documentation with clients, but you can later use it as a basis for building wider documentation and presentations.

The first sort of documentation we recommend is project milestone or checkpoint documentation. At major steps of the project, you should take some time out to repeat your work in a clean environment (proving you know what’s in intermediate files and you can in fact recreate them). An important, and often neglected, milestone is the start of a project. In this section, we’ll use the knitr and rmarkdown R packages to document starting work with the Buzz data.

Documentation scenario: Share the ROC curve for the Buzz model

Our first task is to build a document that contains the ROC curve for the example model. We want to be able to rebuild this document automatically if we change model or evaluation data, so we will use R markdown to produce the document.

11.2.1. What is R markdown?

R markdown is a variation of the Markdown document specification[2] that allows the inclusion of R code and results inside documents. The concept of processing a combination of code and text should be credited to the R Sweave package[3] and, before that, to Knuth’s formative ideas of literate programming.[4] In practice, you maintain a master file that contains both user-readable documentation and chunks of program source code. The document types supported by R markdown include Markdown, HTML, LaTeX, and Word. LaTeX format is a good choice for detailed, typeset, technical documents. Markdown format is a good choice for online documentation and wikis.

2

Markdown itself is a popular document-formatting system based on the idea of imitating how people hand-annotate emails: https://en.wikipedia.org/wiki/Markdown.

3

Sweave ships with R as part of the utils package; see Friedrich Leisch, “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis,” Compstat 2002 Proceedings.

4

Donald E. Knuth, “Literate Programming,” The Computer Journal 27, no. 2 (1984): 97–111.

The engine that performs the document creation task is called knitr. knitr’s main operation is called a knit: knitr extracts and executes all of the R code and then builds a new result document that assembles the contents of the original document plus pretty-printed code and results. Figure 11.2 shows how knitr treats documents as pieces (called chunks) and transforms chunks into sharable results.

Figure 11.2. R markdown process schematic

The process is best demonstrated by a few examples.

A simple R markdown example

Markdown (http://daringfireball.net/projects/markdown/) is a simple web-ready format that’s used in many wikis. The following listing shows a simple Markdown document with R markdown annotation blocks denoted with ```.

Listing 11.1. R-annotated Markdown
---                                                                            1
title: "Buzz scoring example"
output: github_document
---
```{r, include = FALSE}                                                        2
# process document with knitr or rmarkdown.
# knitr::knit("Buzz_score_example.Rmd") # creates Buzz_score_example.md
# rmarkdown::render("Buzz_score_example.Rmd",
#                   rmarkdown::html_document()) # creates Buzz_score_example.html
```                                                                            3

Example scoring (making predictions with) the Buzz data set.                   4

First attach the `randomForest` package and load the model and test data.
```{r}                                                                         5
suppressPackageStartupMessages(library("randomForest"))
lst <- readRDS("thRS500.RDS")
varslist <- lst$varslist
fmodel <- lst$fmodel
buzztest <- lst$buzztest
rm(list = "lst")
```

Now show the quality of our model on held-out test data.                      6
```{r}                                                                        7
buzztest$prediction <-
      predict(fmodel, newdata = buzztest, type = "prob")[, 2, drop = TRUE]

WVPlots::ROCPlot(buzztest, "prediction",
                 "buzz", 1,
                 "ROC curve estimating quality of model predictions on held-out data")
```

  • 1 YAML (yet another markup language) header specifying some metadata: title and default output format
  • 2 An R markdown “start code chunk” annotation. The include = FALSE directive says the block is not shown in the rendering.
  • 3 End of the R markdown block; all content between the start and end marks is treated as R code and executed.
  • 4 Free Markdown text
  • 5 Another R code block. In this case, we are loading an already produced randomForest model and test data.
  • 6 More free text
  • 7 Another R code chunk

The contents of listing 11.1 are available in the file https://github.com/WinVector/PDSwR2/blob/master/Buzz/Buzz_score_example.Rmd. In R we’d process it like this:

rmarkdown::render("Buzz_score_example.Rmd", rmarkdown::html_document())

This produces the new file Buzz_score_example.html, which is a finished report in HTML format. Adding this sort of ability to your workflow (either using Sweave or knitr/rmarkdown) is game changing.

The purpose of R markdown

The purpose of R markdown is to produce reproducible work. The same data and techniques should be rerunnable to get equivalent results, without requiring error-prone direct human intervention such as selecting spreadsheet ranges or copying and pasting. When you distribute your work in R markdown format (as we do in section 11.2.3), anyone can download your work and, without great effort, rerun it to confirm they get the same results you did. This is the ideal standard of scientific research, but is rarely met, as scientists usually are deficient in sharing all of their code, data, and actual procedures. knitr collects and automates all the steps, so it becomes obvious if something is missing or doesn’t actually work as claimed. knitr automation may seem like a mere convenience, but it makes the essential work listed in table 11.3 much easier (and therefore more likely to actually be done).

Table 11.3. Maintenance tasks made easier by R markdown

  • Keeping code in sync with documentation: With only one copy of the code (already in the document), it’s not so easy to get out of sync.
  • Keeping results in sync with data: Eliminating all by-hand steps (such as cutting and pasting results, picking filenames, and including figures) makes it much more likely you’ll correctly rerun and recheck your work.
  • Handing off correct work to others: If the steps are sequenced so a machine can run them, then it’s much easier to rerun and confirm them. Also, having a container (the master document) to hold all your work makes managing dependencies much easier.

11.2.2. knitr technical details

To use knitr on a substantial project, you need to know more about how knitr code chunks work. In particular, you need to be clear how chunks are marked and what common chunk options you’ll need to manipulate. Figure 11.3 shows the steps to prepare an R markdown document.

Figure 11.3. The R markdown process

knitr block declaration format

In general, a knitr code block starts with the block declaration (```{r in Markdown and <<>>= in LaTeX). The first string is the name of the block (must be unique across the entire project). After that, a number of comma-separated option=value chunk option assignments are allowed.
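For example, a Markdown chunk declaration naming the chunk dataload and setting two of the options described in table 11.4 might look like the following (the chunk name and option choices here are our own illustration):

```{r dataload, cache = TRUE, echo = FALSE}
buzzdata <- read.table("TomsHardware-Relative-Sigma-500.data.txt",
                       header = FALSE, sep = ",")
```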

knitr chunk options

A sampling of useful option assignments is given in table 11.4.

Table 11.4. Some useful knitr options

  • cache: Controls whether results are cached. With cache = FALSE (the default), the code chunk is always executed. With cache = TRUE, the code chunk isn’t executed if valid cached results are available from previous runs. Cached chunks are essential when you are revising knitr documents, but you should always delete the cache directory (found as a subdirectory of where you’re using knitr) and do a clean rerun to make sure your calculations are using current versions of the data and settings you’ve specified in your document.
  • echo: Controls whether source code is copied into the document. With echo = TRUE (the default), pretty-formatted code is added to the document. With echo = FALSE, code isn’t echoed (useful when you only want to display results).
  • eval: Controls whether code is evaluated. With eval = TRUE (the default), code is executed. With eval = FALSE, it’s not (useful for displaying instructions).
  • message: Set message = FALSE to direct R message() commands to the console running R instead of to the document. This is useful for issuing progress messages to the user that you don’t want in the final document.
  • results: Controls what’s to be done with R output. Usually you don’t set this option, and output is intermingled (with ## comments) with the code. A useful option is results = 'hide', which suppresses output.
  • tidy: Controls whether source code is reformatted before being printed. We used to set tidy = FALSE, as one version of knitr misformatted R comments when tidying.

Most of these options are demonstrated in our Buzz example, which we’ll work through in the next section.

11.2.3. Using knitr to document the Buzz data and produce the model

The model we were just evaluating was itself produced using an R markdown script: the file buzzm.Rmd found at https://github.com/WinVector/PDSwR2/tree/master/Buzz. Knitting this file produced the Markdown result buzzm.md and the saved model file thRS500.RDS that drives our examples. All steps we’ll mention in this chapter are completely demonstrated in the Buzz example directory. We’ll show excerpts from buzzm.Rmd.

Buzz data notes

For the Buzz data, the preparation notes can be found in the files buzzm.md and buzzm.html. We suggest viewing one of these files and table 11.2. The original description files from the Buzz project (TomsHardware-Relative-Sigma-500.names.txt and BuzzDataSetDoc.pdf) are also available at https://github.com/WinVector/PDSwR2/tree/master/Buzz.

Confirming data provenance

Because knitr is automating steps, you can afford to take a couple of extra steps to confirm the data you’re analyzing is in fact the data you thought you had. For example, we’ll start our Buzz data analysis by confirming that the SHA cryptographic hash of the data we’re starting from matches what we thought we had downloaded. This is done (assuming your system has the sha cryptographic hash installed) as shown in the following listing (note: always look to the first line of chunks for chunk options such as cache = TRUE).

Listing 11.2. Using the system() command to compute a file hash
```{r dataprep}
infile <- "TomsHardware-Relative-Sigma-500.data.txt"
paste('checked at', date())
system(paste('shasum', infile), intern = TRUE)             1
buzzdata <- read.table(infile, header = FALSE, sep = ",")
...

  • 1 Runs a system-installed cryptographic hash program (this program is outside of R’s install image)

This code sequence depends on a program named shasum being on your execution path. You have to have a cryptographic hash installed, and you can supply a direct path to the program if necessary. Common locations for a cryptographic hash include /usr/bin/shasum, /sbin/md5, and fciv.exe, depending on your actual system configuration.

This code produces the output shown in figure 11.4. In particular, we’ve documented that the data we loaded has the same cryptographic hash we recorded when we first downloaded the data. Having confidence you’re still working with the exact same data you started with can speed up debugging when things go wrong. Note that we’re using the cryptographic hash only to defend against accident (using the wrong version of a file or seeing a corrupted file) and not to defend against adversaries or external attacks. For documenting data that may be changing under external control, it is critical to use up-to-date cryptographic techniques.

Figure 11.4. knitr documentation of Buzz data load

Figure 11.5 is the same check, rerun in 2019, which gives us some confidence we are in fact dealing with the same data.

Figure 11.5. knitr documentation of Buzz data load 2019: buzzm.md
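If a command-line hash program such as shasum isn’t available on a system, a portable alternative (not used in our Buzz example) is to compute the hash inside R with the digest package. A minimal sketch, assuming the digest package is installed:

library("digest")

infile <- "TomsHardware-Relative-Sigma-500.data.txt"
# SHA-1 hash of the file contents, computed without any external program
digest(infile, algo = "sha1", file = TRUE)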

Recording the performance of the naive analysis

The initial milestone is a good place to try to record the results of a naive “just apply a standard model to whatever variables are present” analysis. For the Buzz data analysis, we’ll use a random forest modeling technique (not shown here, but in our knitr documentation) and apply the model to test data.

Save your data!

Always save a copy of your training data. Remote data (URLs, databases) has a habit of changing or disappearing. To reproduce your work, you must save your inputs.

Listing 11.3. Calculating model performance
``` {r}
rtest <- data.frame(truth = buzztest$buzz,
                    pred = predict(fmodel, newdata = buzztest,
                                   type = "prob")[, 2, drop = TRUE])
print(accuracyMeasures(rtest$pred, rtest$truth))
```

    ## [1] "precision= 0.832402234636871 ; recall= 0.84180790960452"
    ##      pred
    ## truth FALSE TRUE
    ##     0   584   30
    ##     1    28  149
    ##   model  accuracy        f1 dev.norm       AUC
    ## 1 model 0.9266751 0.8370787  0.42056 0.9702102
Using milestones to save time

Now that we’ve gone to all the trouble to implement, write up, and run the Buzz data preparation steps, we’ll end our knitr analysis by saving the R workspace. We can then start additional analyses (such as introducing better variables for the time-varying data) from the saved workspace. In the following listing, we’ll show how to save a file, and how to again produce a cryptographic hash of the file (so we can confirm work that starts from a file with the same name is in fact starting from the same data).

Listing 11.4. Saving data
Save variable names, model, and test data.
``` {r}
fname <- 'thRS500.RDS'
items <- c("varslist", "fmodel", "buzztest")
saveRDS(object = list(varslist = varslist,
                      fmodel = fmodel,
                      buzztest = buzztest),
        file = fname)
message(paste('saved', fname))  # message to running R console
print(paste('saved', fname))    # print to document
```

    ## [1] "saved thRS500.RDS"
``` {r}
paste('finished at', date())
```

    ## [1] "finished at Thu Apr 18 09:33:05 2019"
``` {r}
system(paste('shasum', fname), intern = TRUE)  # write down file hash
```

    ## [1] "f2b3b80bc6c5a72079b39308a5758a282bcdd5bf  thRS500.RDS"
knitr takeaway

In our knitr example, we worked through the steps we’ve done for every dataset in this book: load data, manage columns/variables, perform an initial analysis, present results, and save a workspace. The key point is that because we took the extra effort to do this work in knitr, we have the following:

  • Nicely formatted documentation (buzzm.md)
  • Shared executable code (buzzm.Rmd)

This makes debugging (which usually involves repeating and investigating earlier work), sharing, and documentation much easier and more reliable.

Project organization, further reading

To learn more about R markdown we recommend Yihui Xie, Dynamic Documents with R and knitr (CRC Press, 2013). Some good ideas on how to organize a data project in reproducible fashion can be found in Reproducible Research with R and RStudio, Second Edition.

11.3. Using comments and version control for running documentation

Another essential record of your work is what we call running documentation. Running documentation is less formal than milestone/checkpoint documentation and is easily maintained in the form of code comments and version control records. Undocumented, untracked code runs up a great deal of technical debt (see http://mng.bz/IaTd) that can cause problems down the road.

Example

Suppose you want to work on formatting Buzz modeling results. You need to save this work to return to it later, document what steps you have taken, and share your work with others.

In this section, we’ll work through producing effective code comments and using Git for version control record keeping.

11.3.1. Writing effective comments

R’s comment style is simple: everything following a # (that isn’t itself quoted) until the end of a line is a comment and ignored by the R interpreter. The following listing is an example of a well-commented block of R code.

Listing 11.5. Example code comments
#' Return the pseudo logarithm, base 10.
#'
#' Return the pseudo logarithm (base 10) of x, which is close to
#' sign(x)*log10(abs(x)) for x such that abs(x) is large
#' and doesn't "blow up" near zero.  Useful
#' for transforming wide-range variables that may be negative
#' (like profit/loss).
#'
#' See: \url{http://www.win-vector.com/blog/2012/03/modeling-trick-the-signed-pseudo-logarithm/}
#'
#' NB: This transform has the undesirable property of making most
#' signed distributions appear bi-modal around the origin, no matter
#' what the underlying distribution really looks like.
#' The argument x is assumed to be numeric and can be a vector.
#'
#' @param x numeric vector
#' @return pseudo logarithm, base 10 of x
#'
#' @examples
#'
#' pseudoLog10(c(-5, 0, 5))
#' # should be: [1] -0.7153834  0.0000000  0.7153834
#'
#' @export
#'
pseudoLog10 <- function(x) {
  asinh(x / 2) / log(10)
}

When such comments (with the #' marks and @ marks) are included in an R package, the documentation management engine can read the structured information and use it to produce additional documentation and even online help. For example, when we saved the preceding code in an R package at https://github.com/WinVector/PDSwR2/blob/master/PseudoLog10/R/pseudoLog10.R, we could use the roxygen2 R package to generate the online help shown in figure 11.6.

Figure 11.6. roxygen2-generated online help
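The help-generation step itself is a single call. A minimal sketch, assuming the commented code has been saved into a package directory named PseudoLog10:

# read the #' comments and write the package's man/*.Rd help files
roxygen2::roxygenise("PseudoLog10")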

Good comments include what the function does, what types the arguments are expected to be, limits of domain, why you should care about the function, and where it’s from. Of critical importance are any NB (nota bene, or note well) or TODO notes. It’s vastly more important to document any unexpected features or limitations in your code than to try to explain the obvious. Because R variables don’t have types (only the objects they’re pointing to have types), you may want to document what types of arguments you’re expecting. It’s critical to state whether a function works correctly on lists, data frame rows, vectors, and so on.

For more on packages and documentation, we recommend Hadley Wickham, R Packages: Organize, Test, Document, and Share Your Code (O’Reilly, 2015).

11.3.2. Using version control to record history

Version control can both maintain critical snapshots of your work in earlier states and produce running documentation of what was done by whom and when in your project. Figure 11.7 shows a cartoon “version control saves the day” scenario that is in fact common.

Figure 11.7. Version control saving the day

In this section, we’ll explain the basics of using Git (http://git-scm.com/) as a version control system. To really get familiar with Git, we recommend a good book such as Jon Loeliger and Matthew McCullough’s Version Control with Git, Second Edition, (O’Reilly, 2012). Or, better yet, work with people who know Git. In this chapter, we assume you know how to run an interactive shell on your computer (on Linux and OS X you tend to use bash as your shell; on Windows you can install Cygwin—http://www.cygwin.com).

Working in bright light

Sharing your Git repository means you’re sharing a lot of information about your work habits and also sharing your mistakes. You’re much more exposed than when you just share final work or status reports. Make this a virtue: know you’re working in bright light. One of the most critical features in a good data scientist (perhaps even before analytic skill) is scientific honesty.

To get most of the benefit from Git, you need to become familiar with a few commands, which we will demonstrate in terms of specific tasks next.

Choosing a project directory structure

Before starting with source control, it’s important to settle on and document a good project directory structure. Reproducible Research with R and RStudio, Second Edition, has good advice and instructions on how to do this. A pattern that’s worked well for us is to start a new project with the directory structure described in table 11.5.

Table 11.5. A possible project directory structure

  • Data: Where we save original downloaded data. This directory must usually be excluded from version control (using the .gitignore feature) due to file sizes, so you must ensure it’s backed up. We tend to save each data refresh in a separate subdirectory named by date.
  • Scripts: Where we store all code related to analysis of the data.
  • Derived: Where we store intermediate results that are derived from data and scripts. This directory must be excluded from source control. You also should have a master script that can rebuild the contents of this directory in a single command (and test the script from time to time).
  • Results: Similar to Derived, but this directory holds smaller, later results (often based on derived) and hand-written content. These include important saved models, graphs, and reports. This directory is under version control, so collaborators can see what was said when. Any report shared with partners should come from this directory.
Starting a Git project using the command line

When you’ve decided on your directory structure and want to start a version-controlled project, do the following:

  1. Start the project in a new directory. Place any work either in this directory or in subdirectories.
  2. Move your interactive shell into this directory and type git init. It’s okay if you’ve already started working and there are already files present.
  3. Exclude any subdirectories you don’t want under source control with .gitignore control files, as in the example below.
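For example, a minimal .gitignore at the top of the project, excluding the Data and Derived directories of table 11.5 (the R session-file entries are our own suggestion), would contain lines like these:

Data/
Derived/
.RData
.Rhistory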

You can check if you’ve already performed the init step by typing git status. If the init hasn’t been done, you’ll get a message similar to fatal: Not a git repository (or any of the parent directories): .git. If the init has been done, you’ll get a status message telling you something like on branch master and listing facts about many files.

The init step sets up in your directory a single hidden file tree called .git and prepares you to keep extra copies of every file in your directory (including subdirectories). Keeping all of these extra copies is called versioning, and is what is meant by version control. You can now start working on your project: save everything related to your work in this directory or some subdirectory of this directory.

Again, you only need to init a project once. Don’t worry about accidentally running git init a second time; that’s harmless.

Using add/commit pairs to checkpoint work
Get nervous about uncommitted state

Here’s a good rule of thumb for Git: you should be as nervous about having uncommitted changes as you should be about not having clicked Save. You don’t need to push/pull often, but you do need to make local commits often (even if you later squash them with a Git technique called rebasing).

As often as practical, enter the following two commands into an interactive shell in your project directory:

git add -A       1
git commit       2

  • 1 Stages results to commit (specifies what files should be committed)
  • 2 Actually performs the commit

Checking in a file is split into two stages: add and commit. This has some advantages (such as allowing you to inspect before committing), but for now just consider the two commands as always going together. The commit command should bring up an editor where you enter a comment as to what you’re up to. Until you’re a Git expert, allow yourself easy comments like “update,” “going to lunch,” “just added a paragraph,” or “corrected spelling.” Run the add/commit pair of commands after every minor accomplishment on your project. Run these commands every time you leave your project (to go to lunch, to go home, or to work on another project). Don’t fret if you forget to do this; just run the commands next time you remember.

A “wimpy commit” is better than no commit

We’ve been a little loose in our instructions to commit often and not worry too much about having a long commit message. Two things to keep in mind are that usually you want commits to be meaningful with the code working (so you tend not to commit in the middle of an edit with syntax errors), and good commit notes are to be preferred (just don’t forgo a commit because you don’t feel like writing a good commit note).

Using git log and git status to view progress

Any time you want to know about your work progress, type either git status to see if there are any edits you can put through the add/commit cycle, or git log to see the history of your work (from the viewpoint of the add/commit cycles).

The following listing shows the git status from our copy of this book’s examples repository (https://github.com/WinVector/PDSwR2).

Listing 11.6. Checking your project status
$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean

And the next listing shows a git log from the same project.

Listing 11.7. Checking your project history
$ git log
commit d22572281d40522bc6ab524bbdee497964ff4af0 (HEAD -> master, origin/master)
Author: John Mount <[email protected]>
Date:   Tue Apr 16 16:24:23 2019 -0700

    technical edits ch7

The indented lines are the text we entered at the git commit step; the dates are tracked automatically.

Using Git through RStudio

The RStudio IDE supplies a graphical user interface to Git that you should try. The add/commit cycle can be performed as follows in RStudio:

  • Start a new project. From the RStudio command menu, select Project > Create Project, and choose New Project. Then select the name of the project and what directory to create the new project directory in; leave the type as (Default), and make sure Create a Git Repository for this Project is checked. When the new project pane looks something like figure 11.8, click Create Project, and you have a new project.
    Figure 11.8. RStudio new project pane

  • Do some work in your project. Create new files by selecting File > New > R Script. Type some R code (like 1/5) into the editor pane and then click the save icon to save the file. When saving the file, be sure to choose your project directory or a subdirectory of your project.
  • Commit your changes to version control. Figure 11.9 shows how to do this. Select the Git control pane in the top right of RStudio. This pane shows all changed files as line items. Check the Staged check box for any files you want to stage for this commit. Then click Commit, and you’re done.
    Figure 11.9. RStudio Git controls

You may not yet deeply understand or like Git, but you’re able to safely check in all of your changes every time you remember to stage and commit. This means all of your work history is there; you can’t clobber your committed work just by deleting your working file. Consider all of your working directory as “scratch work”—only checked-in work is safe from loss.

Your Git history can be seen by pulling down on the Other Commands gear (shown in the Git pane in figure 11.9) and selecting History (don’t confuse this with the nearby History pane, which is command history, not Git history). In an emergency, you can find Git help and find your earlier files. If you’ve been checking in, then your older versions are there; it’s just a matter of getting some help in accessing them. Also, if you’re working with others, you can use the push/pull menu items to publish and receive updates. Here’s all we want to say about version control at this point: commit often, and if you’re committing often, all problems can be solved with some further research. Also, be aware that since your primary version control is on your own machine, you need to make sure you have an independent backup of your machine. If your machine fails and your work hasn’t been backed up or shared, then you lose both your work and your version repository.

11.3.3. Using version control to explore your project

Up until now, our model of version control has been this: Git keeps a complete copy of all of our files each time we successfully enter the pair of add/commit lines. We’ll now use these commits. If you add/commit often enough, Git is ready to help you with any of the following tasks:

  • Tracking your work over time
  • Recovering a deleted file
  • Comparing two past versions of a file
  • Finding when you added a specific bit of text
  • Recovering a whole file or a bit of text from the past (undo an edit)
  • Sharing files with collaborators
  • Publicly sharing your project (à la GitHub at https://github.com/, GitLab at https://gitlab.com/, or Bitbucket at https://bitbucket.org)
  • Maintaining different versions (branches) of your work

And that’s why you want to add and commit often.

Getting help on Git

For any Git command, you can type git help [command] to get usage information. For example, to learn about git log, type git help log.

Finding out who wrote what and when

In section 11.3.1, we implied that a good version control system can produce a lot of documentation on its own. One powerful example is the command git blame. Look what happens if we download the Git repository https://github.com/WinVector/PDSwR2 (with the command git clone git@github.com:WinVector/PDSwR2.git) and run the command git blame Buzz/buzzapp/server.R (to see who “wrote” each line in the file).

Listing 11.8. Finding out who committed what
git blame Buzz/buzzapp/server.R
4efb2b78 (John Mount 2019-04-24 16:22:43 -0700  1) #
4efb2b78 (John Mount 2019-04-24 16:22:43 -0700  2) # This is the server logic of a Shiny web application. You can run the
4efb2b78 (John Mount 2019-04-24 16:22:43 -0700  3) # application by clicking 'Run App' above.
4efb2b78 (John Mount 2019-04-24 16:22:43 -0700  4) #

The git blame information takes each line of the file and prints the following:

  • The prefix of the line’s Git commit hash. This is used to identify which commit the line we’re viewing came from.
  • Who committed the line.
  • When they committed the line.
  • The line number.
  • And, finally, the contents of the line.
git blame doesn’t tell the whole story

It is important to understand that many of the updates that git blame reports may be mechanical (somebody using a tool to reformat files), or somebody acting on somebody else’s behalf. You must look at the commits to see what happened. In this particular example, the commit message was “add Nina’s Shiny example,” so this was work done by Nina Zumel, who had delegated checking it in to John Mount.

A famous example of abusing similar lines of code metrics was the attempt to discredit Katie Bouman’s leadership in creating the first image of a black hole. One of the (false) points raised was that collaborator Andrew Chael had contributed more lines of code to the public repository. Fortunately, Chael himself responded, defending Bouman’s role and pointing out the line count attributed to him was machine-generated model files he had checked into the repository as part of his contribution, not authored lines of code.

Using git diff to compare files from different commits

The git diff command allows you to compare any two committed versions of your project, or even to compare your current uncommitted work to any earlier version. In Git, commits are named using large hash keys, but you’re allowed to use prefixes of the hashes as names of commits.[5] For example, the following listing demonstrates finding the differences in two versions of https://github.com/WinVector/PDSwR2 in a diff or patch format.

5

You can also create meaningful names for commits with the git tag command.

Listing 11.9. Finding line-based differences between two committed versions
diff --git a/CDC/NatalBirthData.rData b/CDC/NatalBirthData.rData
...
+++ b/CDC/prepBirthWeightData.R
@@ -0,0 +1,83 @@
+data <- read.table("natal2010Sample.tsv.gz",
+                   sep = "\t", header = TRUE, stringsAsFactors = FALSE)
+
+# make a boolean from Y/N data
+makevarYN = function(col) {
+  ifelse(col %in% c("", "U"), NA, col=="Y")
+}
...
Try to not confuse Git commits and Git branches

A Git commit represents the complete state of a directory tree at a given time. A Git branch represents a sequence of commits and changes as you move through time. Commits are immutable; branches record progress.

Using git log to find the last time a file was around
Example

At some point there was a file named Buzz/buzz.pdf in our repository. Somebody asks us a question about this file. How do we use Git to find when this file was last in the repository, and what its contents had been?

After working on a project for a while, we often wonder, when did we delete a certain file and what was in it at the time? Git makes answering this question easy. We’ll demonstrate this in the repository https://github.com/WinVector/PDSwR2. We remember the Buzz directory having a file named buzz.pdf, but there is no such file now and we want to know what happened to it. To find out, we’ll run the following:

git log --name-status -- Buzz/buzz.pdf
commit 96503d8ca35a61ed9765edff9800fc9302554a3b
Author: John Mount <[email protected]>
Date:   Wed Apr 17 16:41:48 2019 -0700

    fix links and re-build Buzz example

D       Buzz/buzz.pdf

We see the file was deleted by John Mount. We can view the contents of this older file with the command git checkout 96503d8^1 -- Buzz/buzz.pdf. The 96503d8 is the prefix of the commit number (which was enough to specify the commit that deleted the file), and the ^1 means “the state of the file one commit before the named commit” (the last version before the file was deleted).

11.3.4. Using version control to share work

Example

We want to work with multiple people and share results. One way to use Git to accomplish this is by individually setting up our own repository and sharing with a central repository.

In addition to producing work, you must often share it with peers. The common (and bad) way to do this is emailing zip files. Most of the bad sharing practices take excessive effort, are error prone, and rapidly cause confusion. We advise using version control to share work with peers. To do that effectively with Git, you need to start using additional commands such as git pull, git rebase, and git push. Things seem more confusing at this point (though you still don’t need to worry about branching in its full generality), but are in fact far less confusing and less error-prone than ad hoc solutions. We almost always advise sharing work in a star workflow, where each worker has their own repository, and a single common “naked” repository (a repository with only Git data structures and no ready-to-use files) is used to coordinate (thought of as a server or gold standard, often named origin). Figure 11.10 shows one arrangement of repositories that allows multiple authors to collaborate.

Figure 11.10. Multiple repositories working together

The usual shared workflow is like this:

  • Continuously— Work, work, work.
  • Frequently— Commit results to the local repository using a git add/git commit pair.
  • Every once in a while— Pull a copy of the remote repository into our view with some variation of git pull and then use git push to push work upstream.

The main rule of Git is this: don’t try anything clever (push/pull, and so on) unless you’re in a “clean” state (everything committed, confirmed with git status).

Setting up remote repository relations

For two or more Git repositories to share work, the repositories need to know about each other through a relation called remote. A Git repository is able to share its work to a remote repository by the push command and pick up work from a remote repository by the pull command. The next listing shows the declared remotes for the authors’ local copy of the https://github.com/WinVector/PDSwR2 repository.

Listing 11.10. git remote
$ git remote --verbose
origin  git@github.com:WinVector/PDSwR2.git (fetch)
origin  git@github.com:WinVector/PDSwR2.git (push)

The remote relation is set when you create a copy of a repository using the git clone command or can be set using the git remote add command. In listing 11.10, the remote repository is called origin—this is the traditional name for a remote repository that you’re using as your master or gold standard. (Git tends not to use the name master for repositories because master is the name of the branch you’re usually working on.)
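For example, here is how the remote shown in listing 11.10 could have been established (a sketch using this book’s repository):

# cloning sets up the "origin" remote automatically
git clone git@github.com:WinVector/PDSwR2.git

# alternately, in an already-initialized repository, declare the remote by hand
git remote add origin git@github.com:WinVector/PDSwR2.git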

Using push and pull to synchronize work with remote repositories

Once your local repository has declared some other repository as remote, you can push and pull between the repositories. When pushing or pulling, always make sure you’re clean (have no uncommitted changes), and you usually want to pull before you push (as that’s the quickest way to spot and fix any potential conflicts). For a description of what version control conflicts are and how to deal with them, see http://mng.bz/5pTv.

Usually, for simple tasks we don’t use branches (a technical version control term), and we use the rebase option on pull so that it appears that every piece of work is recorded into a simple linear order, even though collaborators are actually working in parallel. This is what we call an essential difficulty of working with others: time and order become separate ideas and become hard to track (and this is not a needless complexity added by using Git—there are such needless complexities, but this is not one of them).

The new Git commands you need to learn are these:

  • git push (usually used in the git push -u origin master variation)
  • git pull (usually used in the git fetch; git merge -m pull master origin/master or git pull --rebase origin master variations)

Typically, two authors may be working on different files in the same project at the same time. As you can see in figure 11.11, the second author to push their results to the shared repository must decide how to specify the parallel work that was performed. Either they can say the work was truly in parallel (represented by two branches being formed and then a merge record joining the work), or they can rebase their own work to claim their work was done “after” the other’s work (preserving a linear edit history and avoiding the need for any merge records). Note: before and after are tracked in terms of arrows, not time.

Figure 11.11. git pull: rebase versus merge

Merging is what’s really happening, but rebase is much simpler to read. The general rule is that you should only rebase work you haven’t yet shared (in our example, Worker B should feel free to rebase their edits to appear to be after Worker A’s edits, as Worker B hasn’t yet successfully pushed their work anywhere). You should avoid rebasing records people have seen, as you’re essentially hiding the edit steps they may be basing their work on (forcing them to merge or rebase in the future to catch up with your changed record keeping).

Keep notes

Git commands are confusing; you’ll want to keep notes. One idea is to write a 3 × 5 card for each command you’re regularly using. Ideally, you can be at the top of your Git game with about seven cards.

For most projects, we try to use a rebase-only strategy. For example, this book itself is maintained in a Git repository. We have only two authors who are in close proximity (so able to easily coordinate), and we’re only trying to create one final copy of the book (we’re not trying to maintain many branches for other uses). If we always rebase, the edit history will appear totally ordered (for each pair of edits, one is always recorded as having come before the other), and this makes talking about versions of the book much easier (again, before is determined by arrows in the edit history, not by time stamp).

Don’t confuse version control with backup

Git keeps multiple copies and records of all of your work. But until you push to a remote destination, all of these copies are on your machine in the .git directory. So don’t confuse basic version control with remote backups; they’re complementary.

A bit on the Git philosophy

Git is interesting in that it automatically detects and manages so much of what you’d have to specify with other version control systems (for example, Git finds which files have changed instead of you having to specify them, and Git also decides which files are related). Because of the large degree of automation, beginners usually severely underestimate how much Git tracks for them. This makes Git fairly quick except when Git insists you help decide how a possible global inconsistency should be recorded in history (either as a rebase or a branch followed by a merge record). The point is this: Git suspects possible inconsistency based on global state (even when the user may not think there is such) and then forces the committer to decide how to annotate the issue at the time of commit (a great service to any possible readers in the future). Git automates so much of the record keeping that it’s always a shock when you have a conflict and have to express opinions on nuances you didn’t know were being tracked. Git is also an “anything is possible, but nothing is obvious or convenient” system. This is hard on the user at first, but in the end is much better than an “everything is smooth, but little is possible” version control system (which can leave you stranded).

11.4. Deploying models

Good data science shares a rule with good writing: show, don’t tell. And a successful data science project should include at least a demonstration deployment of any techniques and models developed. Good documentation and presentation are vital, but at some point, people have to see things working and be able to try their own tests. We strongly encourage partnering with a development group to produce the actual production-hardened version of your model, but a good demonstration helps recruit these collaborators.

Example

Suppose you are asked to make your model predictions available to other software so they can be reflected in reports and used to make decisions. This means you must somehow “deploy your model.” This can vary from scoring all data in a known database, to exporting the model for somebody else to deploy, to setting up your own web application or HTTP service.

The statistician or analyst’s job often ends when the model is created or a report is finished. For the data scientist, this is just the acceptance phase. The real goal is getting the model into production: scoring data that wasn’t available when the model was built and driving decisions made by other software. This means that helping with deployment is part of the job. In this section, we will outline useful methods for achieving different styles of R model deployment.

We outline some deployment methods in table 11.6.

Table 11.6. Methods to deploy models

  • Batch: Data is brought into R, scored, and then written back out. This is essentially an extension of what you’re already doing with test data.
  • Cross-language linkage: R supplies answers to queries from another language (C, C++, Python, Java, and so on). R is designed with efficient cross-language calling in mind (in particular the Rcpp package), but this is a specialized topic we won’t cover here.
  • Services: R can be set up as an HTTP service to take new data as an HTTP query and respond with results.
  • Export: Often, model evaluation is simple compared to model construction. In this case, the data scientist can export the model and a specification for the code to evaluate the model, and the production engineers can implement (with tests) model evaluation in the language of their choice (SQL, Java, C++, and so on).
  • PMML: PMML, or Predictive Model Markup Language, is a shared XML format that many modeling packages can export to and import from. If the model you produce is covered by R’s package pmml, you can export it without writing any additional code. Then any software stack that has an importer for the model in question can use your model.
Models in production

There are some basic defenses one should set up when placing a model in production. We mention these because we rarely see these valuable precautions taken:

  • All models and all predictions from models should be annotated with the model version name and a link to the model documentation. This simple precaution has saved one of the authors when they were able to show a misclassification was not from the model they had just deployed, but from a human tagger.
  • Machine learning model results should never be directly used as decisions. Instead, they should be an input to configurable business logic that makes decisions. This allows both patching the model to make it more reasonable (such as bounding probability predictions into a reasonable range such as 0.01 to 0.99) and turning it off (changing the business logic to not use the model prediction in certain cases).

You always want the last stage in any automated system to be directly controllable. So even a trivial business logic layer that starts by directly obeying a given model’s determination is high value, as it gives a place where you can correct special cases.
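For illustration, here is a minimal sketch of such a business logic layer in R. The function name, the bounds, and the kill switch are our own invention, not part of the Buzz example:

# configurable business logic wrapping a raw model score
apply_score_policy <- function(score,
                               use_model = TRUE,    # kill switch
                               lower = 0.01, upper = 0.99) {
  if(!use_model) {
    return(rep(NA_real_, length(score)))  # model switched off: abstain
  }
  pmin(upper, pmax(lower, score))  # bound predictions into a plausible range
}

Downstream software then consumes apply_score_policy(prediction) rather than the raw model output, giving you a single place to patch special cases or disable the model.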

We’ve already demonstrated batch operation of models each time we applied a model to a test set. We won’t work through an R cross-language linkage example as it’s very specialized and requires knowledge of the system you’re trying to link to. We’ll demonstrate service and export strategies.

11.4.1. Deploying demonstrations using Shiny

Example

Suppose we want to build an interactive dashboard or demo for our boss. Our boss wants to try different classification thresholds against our Buzz score to see what precision and recall are available at each threshold. We could do this as a graph, but we are asked to do this as an interactive service (possibly part of a larger drill-down/exploration service).

We will solve this scenario by using Shiny, a tool for building interactive web applications in R. Here we will use Shiny to let our boss pick the threshold that converts our Buzz score into a “will Buzz”/“won’t Buzz” decision. The entire code for this demonstration is in the Buzz/buzzapp directory of https://github.com/WinVector/PDSwR2.

The easiest way to run the Shiny application is to open the file server.R from that directory in RStudio. Then, as shown in figure 11.12, there will be a button on the upper right of the RStudio editor pane called Run App. Clicking this button will run the application.

Figure 11.12. Launching the Shiny server from RStudio

The running application will look like figure 11.13. The user can move the threshold control slider and get a new confusion matrix and model metrics (such as precision and recall) for each slider position.

Figure 11.13. Interacting with the Shiny application

Shiny’s programming principles are based on an idea called reactive programming: the user specifies what values may change due to user interventions, and the Shiny software then handles rerunning and updating the display as the user uses the application. Shiny is a very large topic, but you can get started by copying an example application and editing it to fit your own needs.
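To give a feel for the style, here is a minimal sketch of a threshold-picking application in the same spirit (our own simplified illustration, not the actual code in Buzz/buzzapp):

suppressPackageStartupMessages(library("randomForest"))
library("shiny")

lst <- readRDS("thRS500.RDS")  # model and test data, as in listing 11.1
buzztest <- lst$buzztest
buzztest$prediction <- predict(lst$fmodel, newdata = buzztest,
                               type = "prob")[, 2, drop = TRUE]

ui <- fluidPage(
  sliderInput("threshold", "Classification threshold:",
              min = 0, max = 1, value = 0.5),
  tableOutput("confusion")
)

server <- function(input, output) {
  # renderTable() re-runs automatically whenever the slider value changes
  output$confusion <- renderTable({
    as.data.frame.matrix(table(pred = buzztest$prediction >= input$threshold,
                               truth = buzztest$buzz))
  }, rownames = TRUE)
}

shinyApp(ui, server)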

Further Shiny reading

We don’t currently have a Shiny book recommendation. A good place to start on Shiny documentation, examples, and tutorials is https://shiny.rstudio.com.

11.4.2. Deploying models as HTTP services

Example

Our model looked good in testing, and our boss likes working with our interactive web application. So we now want to fully “put our model in production.” In this case, the model is considered “in production” if other servers can send data to it and get scored results. That is, our model is to be deployed in production as part of a service-oriented architecture (SOA).

Our model can be used by other software either by linking to it or having the model exposed as a service. In this case, we will deploy our Buzz model as an HTTP service. Once we have done this, other services at our company can send data to our model for scoring. For example, a revenue management dashboard can send a set of articles it is managing to our model for “buzz scoring,” meaning the buzz score can be incorporated into this dashboard. This is more flexible than having our Buzz model score all known articles in a database, as the dashboard can ask about any article for which it has the details.

One easy way to demonstrate an R model in operation is to expose it as an HTTP service. In the following listing, we show how to do this for our Buzz model (predicting discussion topic popularity). Listing 11.11 shows the first few lines of the file PDSwR2/Buzz/plumber.R. This .R file can be used with the plumber R package to expose our model as an HTTP service, either for production use or testing.

Listing 11.11. Buzz model as an R-based HTTP service
library("randomForest")           1

lst <- readRDS("thRS500.RDS")
varslist <- lst$varslist
fmodel <- lst$fmodel
buzztest <- lst$buzztest
rm(list = "lst")

#* Score a data frame.
#* @param d data frame to score
#* @post /score_data
function(d) {
  predict(fmodel, newdata = d, type = "prob")
}

  • 1 Attaches the randomForest package, so we can run our randomForest model

We would then start the server with the following code:

library("plumber")
r <- plumb("plumber.R")
r$run(port=8000)

The next listing is the contents of the file PDSwR2/Buzz/RCurl_client_example.Rmd, and shows how to call the HTTP service from R. However, this is just to demonstrate the capability—the whole point of setting up an HTTP service is that something other than R wants to use the service.

Listing 11.12. Calling the Buzz HTTP service
library("RCurl")
library("jsonlite")

post_query <- function(method, args) {                              1
  hdr <- c("Content-Type" = "application/x-www-form-urlencoded")
  resp <- postForm(
    paste0("http://localhost:8000/", method),
    .opts = list(httpheader = hdr,
                 postfields = toJSON(args)))
  fromJSON(resp)
}

data <- read.csv("buzz_sample.csv",
                 stringsAsFactors = FALSE,
                 strip.white = TRUE)

scores <- post_query("score_data",
                     list(d = data))
knitr::kable(head(scores))
tab <- table(pred = scores[, 2]>0.5, truth = data$buzz)
knitr::kable(tab)

  • 1 Wraps the services as a function

This produces the result PDSwR2/Buzz/RCurl_client_example.md, shown in figure 11.14 (also saved in our example GitHub repository).

Figure 11.14. Top of HTML form that asks server for Buzz classification on submit

For more on plumber, we suggest starting with the plumber package documentation: https://CRAN.R-project.org/package=plumber.

11.4.3. Deploying models by export

It often makes sense to export a copy of the finished model from R, instead of attempting to reproduce all the details of model construction in another system or to use R itself in production. When exporting a model, you’re depending on development partners to handle the hard parts of hardening a model for production (versioning, dealing with exceptional conditions, and so on). Software engineers tend to be good at project management and risk control, so sharing projects with them is a good opportunity to learn.

The steps required depend a lot on the model and data treatment. For many models, you only need to save a few coefficients. For random forests, you need to export the trees. In all cases, you need to write code in your target system (be it SQL, Java, C, C++, Python, Ruby, or other) to evaluate the model.

One of the issues of exporting models is that you must repeat any data treatment. So part of exporting a model is producing a specification of the data treatment (so it can be reimplemented outside of R).

Exporting random forests to SQL with tidypredict
Exercise: Run our random forest model in SQL

Our goal is to export our random forest model as SQL code that can be then run in a database, without any further use of R.

The R package tidypredict[6] provides methods to export models such as our random forest Buzz model to SQL, which could then be run in a database. We will just show a bit of what this looks like. The random forest model consists of 500 trees that vote on the answer. The top of the first tree is shown in figure 11.15 (random forest trees tend not to be that legible). Remember that trees classify by making sequential decisions from the top-most node down.

6

See the tidypredict package documentation: https://CRAN.R-project.org/package=tidypredict.

Figure 11.15. The top of the first tree (of 500) from the random forest model

Now let’s look at the model that tidypredict converted to SQL. The conversion was performed in the R markdown file PDSwR2/Buzz/model_export.Rmd, which produces the rendered result PDSwR2/Buzz/model_export.md. We won’t show the code here, but instead show the first few lines of what the first random forest tree is translated into:

CASE
 WHEN (`num.displays_06` >= 1517.5 AND
       `avg.auths.per.disc_00` < 2.25 AND
       `num.displays_06` < 2075.0) THEN ('0')
 WHEN (`num.displays_03` >= 1114.5 AND
       `atomic.containers_01` < 9.5 AND
       `avg.auths.per.disc_00` >= 2.25 AND
       `num.displays_06` < 2075.0) THEN ('0')
 WHEN ...

The preceding code is enumerating each path from the root of the tree down. Remember that decision trees are just huge nested if/else blocks, and SQL writes if/else as CASE/WHEN. Each SQL WHEN clause is a path in the original decision tree. This is made clearer in figure 11.16.

Figure 11.16. Annotating CASE/WHEN paths

In the SQL export, each tree is written as a series of WHEN cases over all of its paths, allowing the tree calculation to be performed in SQL. As a user, we would evaluate a tree by tracing down from the root node, moving left or right at each node depending on the node conditions. The SQL code instead evaluates all paths from the root to the leaves and keeps the result from the unique path for which all the conditions are met. It is an odd way to evaluate a tree, but it converts everything into an all-at-once formula that can be exported to SQL.
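For reference, the conversion is driven by a single call. Here is a minimal sketch of the sort of code used in model_export.Rmd (assuming the tidypredict and dbplyr packages; the exact code in that file may differ):

library("tidypredict")
library("dbplyr")

# translate the random forest into SQL CASE/WHEN logic;
# simulate_dbi() supplies a generic SQL dialect without needing a live database
sql <- tidypredict_sql(fmodel, simulate_dbi())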

The overall idea is this: we have exported the random forest model into a format something else can read, SQL. Somebody else can own the finished model from that point on.

An important export system to consider using is the Predictive Model Markup Language (PMML), which is an XML standard for sharing models across different systems.[7]

7

See, for example, the PMML package https://CRAN.R-project.org/package=pmml.

11.4.4. What to take away

You should now be comfortable demonstrating R models to others. Deployment and demonstration techniques include

  • Setting up a model as an HTTP service that can be experimented with by others
  • Setting up micro applications using Shiny
  • Exporting models so a model application can be reimplemented in a production environment

Summary

In this chapter, we worked on managing and sharing your work. In addition, we showed some techniques to set up demonstration HTTP services and export models for use by other software (so you don’t add R as a dependency in production). At this point, you have been building machine learning models for some time, and you now have some techniques for working proficiently with models over time and with collaborators.

Here are some key takeaways:

  • Use knitr to produce significant reproducible milestone/checkpoint documentation.
  • Write effective comments.
  • Use version control to save your work history.
  • Use version control to collaborate with others.
  • Make your models available to your partners for experimentation, testing, and production deployment.

In our next chapter, we will explore how to formally present and explain your work.
