A.1. Installing the tools

The primary tool for working our examples will be R, and possibly RStudio. But other tools (databases, version control, compilers, and so on) are also highly recommended. You may also need access to online documentation or other help to get all of these tools to work in your environment. The distribution sites we list are a good place to start.

A.1.1. Installing Tools

The R environment is a set of tools and software that can be installed on Unix, Linux, Apple macOS, and Windows.

R

We recommend installing the latest version of R from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org, or a mirror. CRAN is the authoritative central repository for R and R packages. CRAN is supported by The R Foundation and the R Development Core Team. R itself is an official part of the Free Software Foundation’s GNU project distributed under a GPL 2 license. R is used at many large institutions, including the United States Food and Drug Administration.[1]

For this book, we recommend using R version 3.5.0 or newer.
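One way to confirm the version requirement from inside R is sketched below. The variable names are ours, chosen for illustration; getRversion() is a standard base R function.

```r
# Minimum R version recommended by the book.
min_version <- "3.5.0"

# getRversion() returns the running version as an object that can be
# compared directly against a version string.
current_version <- getRversion()
r_is_recent_enough <- current_version >= min_version

if (r_is_recent_enough) {
  message("R ", as.character(current_version),
          " meets the recommended minimum of ", min_version)
} else {
  warning("R ", as.character(current_version),
          " is older than ", min_version, "; consider upgrading from CRAN")
}
```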

To work with R, you need a text editor designed for plain (non-rich) text. Such editors include Atom, Emacs, Notepad++, Pico, Programmer’s Notepad, RStudio, Sublime Text, TextWrangler, vim, and many more. These are in contrast to rich text editors (which are not appropriate for programming tasks) such as Microsoft Word or Apple TextEdit.

RStudio

We suggest that when working with R, you consider using RStudio. RStudio is a popular cross-platform integrated development environment supplied by the company RStudio, Inc. (https://www.rstudio.com). RStudio supplies a built-in text editor and convenient user interfaces for common tasks such as installing software, rendering R markdown documents, and working with source control. RStudio is not an official part of R or CRAN, and should not be confused with R or CRAN.

An important feature of RStudio is the file browser and the set-directory/go-to-directory controls that are hidden in the gear icon of the file-browsing pane, which we point out in figure A.1.

Figure A.1. RStudio file-browsing controls

RStudio is not a requirement to use R or to work through the examples in this book.

Git

Git is a source control or version management system that is very useful for preserving and sharing work. To install Git, please follow the appropriate instructions from https://git-scm.com.

Data science always involves a lot of tools and collaboration, so the willingness to try new tools is a flexibility one needs to develop.

The book-support materials

All of the book-support materials are freely available from GitHub: https://github.com/WinVector/PDSwR2, as shown in figure A.2. The reader should download them in their entirety either using git clone with the URL https://github.com/WinVector/PDSwR2.git or by downloading a complete zip file by using the “Clone or Download” control at the top right of the GitHub page.

Figure A.2. Downloading the book materials from GitHub

Another way to download the book material is to use RStudio and Git. Select File > New Project > Create Project from Version Control > Git. That will bring up a dialog box as shown in figure A.3. You can fill in the Git URL and download the book materials as a project.

Figure A.3. Cloning the book repository

We will refer to this directory as PDSwR2 throughout the book, and all files and paths we mention are either in this directory or a subdirectory. Please be sure to look in this directory for any README or errata files.
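As a quick check that the download worked, the following sketch lists any README or errata files at the top of the directory. The path in book_dir is a hypothetical location that we assume for illustration; replace it with wherever you unpacked PDSwR2.

```r
# Hypothetical location of the unpacked repository; adjust to your own path.
book_dir <- path.expand(file.path("~", "PDSwR2"))

if (dir.exists(book_dir)) {
  # List top-level files whose names contain "readme" or "errata",
  # ignoring upper/lower case.
  notes <- list.files(book_dir, pattern = "readme|errata", ignore.case = TRUE)
  print(notes)
} else {
  message("No directory found at ", book_dir,
          "; clone or unzip the repository first")
}
```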

Some features of the support directory include these:

  • All example data used in the book.
  • All example code used in the book. The examples from the book are available in the subdirectory CodeExamples, and also as the zip file CodeExamples.zip. In addition to this, the entire set of examples, rerun and rerendered, are shared in RenderedExamples. (All paths should be relative to where you have unpacked the book directory PDSwR2.)

R packages

A great advantage of R is the CRAN central package repository. R has standardized package installation through the install.packages() command. An installed package is typically not fully available for use in a project until the package is also attached for use by the library() command.[2] A good practice is this: any sort of R script or work should attach all the packages it intends to use as a first step. Also, in most cases scripts should not call install.packages(), as this changes the R installation, which should not be done without user supervision.

[2] In R, installing a package is a separate step from attaching it for use. install.packages() makes package contents potentially available; after that, library() readies them for use. A handy mnemonic is this: install.packages() sets up new appliances in your kitchen, and library() turns them on. You don’t have to install things very often; however, you often have to turn things back on.
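The practice of attaching packages as a first step, without installing anything, might look like the following sketch of a script preamble. The names required_pkgs and missing_pkgs are ours, and we list two packages that ship with R only so the sketch runs anywhere; a real script would list the packages it actually uses.

```r
# Packages this script needs. These two ship with R, so the sketch runs
# on any installation; a real script would name its own packages here.
required_pkgs <- c("stats", "utils")

# Check what is installed without changing the R installation.
is_installed <- vapply(required_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Stop and tell the user, rather than calling install.packages() for them.
  stop("Please install: ", paste(missing_pkgs, collapse = ", "))
}

# Attach everything up front, as the first step of the script.
for (pkg in required_pkgs) {
  library(pkg, character.only = TRUE)
}
```

Failing early with a clear message leaves the decision to install in the user's hands.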

Installing the required packages

To install the set of packages required to work all the examples in this book, first download the book repository as described previously. Then look in the first directory or top directory of this repository: PDSwR2. In this directory, you will find the file packages.R. You can open this file with a text editor, and it should look like the following (though it may be more up to date than what is shown here).

# Please have an up to date version of R (3.5.*, or newer)
# Answer "no" to:
# Do you want to install from sources the packages which need compilation?
update.packages(ask = FALSE, checkBuilt = TRUE)

pkgs <- c(
    "arules", "bitops", "caTools", "cdata", "data.table", "DBI",
    "dbplyr", "DiagrammeR", "dplyr", "e1071", "fpc", "ggplot2",
    "glmnet", "glmnetUtils", "gridExtra", "hexbin", "kernlab",
    "igraph", "knitr", "lime", "lubridate", "magrittr", "MASS",
    "mgcv", "pander", "plotly", "pwr", "randomForest", "readr",
    "readxl", "rmarkdown", "rpart", "rpart.plot", "RPostgres",
    "rqdatatable", "rquery", "RSQLite", "scales", "sigr", "sqldf",
    "tidypredict", "text2vec", "tidyr", "vtreat", "wrapr", "WVPlots",
    "xgboost", "xts", "webshot", "zeallot", "zoo")

install.packages(
    pkgs,
    dependencies = c("Depends", "Imports", "LinkingTo"))

To install everything, run every line of code in this file from R.[3]

[3] The preceding code can be found as the file packages.R at https://github.com/WinVector/PDSwR2. We could call it PDSwR2/packages.R, which could mean the file from the original GitHub URL or from a local copy of the GitHub repository.

Unfortunately, there are many reasons the install can fail: incorrect copy/paste, no internet connection, improperly configured R or RStudio, insufficient permissions to administer the R install, out-of-date versions of R or RStudio, missing system requirements, or no or incorrect C/C++/Fortran compiler. If you run into these problems, it is best to find a forum or expert to help you work through these steps. Once everything is successfully installed, R is a self-contained environment where things just work.

Not all packages are needed for all examples, so if you have trouble with the overall install, just try to work the examples in the book. Here’s a caveat: if you see a library(pkgname) command fail, please try install.packages('pkgname') to install the missing package. The preceding package list is just trying to get everything out of the way in one step.
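To see which packages are still missing after a partial install, a sketch along the following lines can help. Here pkgs holds only a few names from the book's list, for illustration, and the install call is left commented out so nothing changes without your say-so.

```r
# A few names from the book's package list, for illustration.
pkgs <- c("ggplot2", "dplyr", "vtreat")

# installed.packages() returns a matrix with one row per installed package;
# the row names are the package names.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(pkgs, installed)

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install only what is missing (needs an internet connection):
  # install.packages(missing_pkgs)
} else {
  message("All listed packages are installed")
}
```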

Other tools

R’s capabilities can be enhanced by using tools such as Perl,[4] gcc/clang, gfortran, Git, Rcpp, TeX, pandoc, ImageMagick, and Bash shell. Each of these is managed outside of R, and how to maintain them depends on your computer, operating system, and system permissions. Unix/Linux users have the easiest time installing these tools, and R is primarily developed in a Unix environment.[5] RStudio will install some of the extra tools. macOS users may need Apple’s Xcode tools and Homebrew (https://brew.sh) to have all the required tools. Windows users who wish to write packages may want to research Rtools (https://cran.r-project.org/bin/windows/Rtools/).

[5] For example, we share notes on rapidly configuring R and RStudio Server on an Amazon EC2 instance here: www.win-vector.com/blog/2018/01/setting-up-rstudio-server-quickly-on-amazon-ec2/.

Windows users may need Rtools to compile packages; however, this should not be strictly necessary, as most current packages are available from CRAN in a precompiled form (at least for macOS and 64-bit Windows). macOS users may need to install the Xcode compiler (available from Apple) to compile packages. All of these are steps you probably want to skip until you need the ability to compile.
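To see which of these external tools R can already find, one option is to ask for them by command name. The names in tools below are the usual ones on Unix-like systems, which is an assumption on our part; they may differ on Windows.

```r
# Command names as they usually appear on Unix-like systems (an assumption;
# names can differ on Windows).
tools <- c("git", "gcc", "clang", "gfortran", "make", "pandoc", "perl", "bash")

# Sys.which() returns the full path of each command, or "" if not found.
tool_paths <- Sys.which(tools)

# Summarize which tools are visible on the search path.
tool_report <- data.frame(tool = tools,
                          found = nzchar(tool_paths),
                          row.names = NULL)
print(tool_report)
```

A tool reported as not found may still be installed but missing from the search path that R sees.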

A.1.2. The R package system

R is a broad and powerful language and analysis workbench in and of itself. But one of its real strengths is the depth of the package system and packages supplied through CRAN. To install a package from CRAN, just type install.packages('nameofpackage'). To use an installed package, type library(nameofpackage).[6] Any time you type library('nameofpackage') or require('nameofpackage'), you’re assuming you’re using a built-in package or you’re able to run install.packages('nameofpackage') if needed. We’ll return to the package system again and again in this book. To see what packages are present in your session, type sessionInfo().

[6] Actually, library('nameofpackage') also works with quotes. The unquoted form works in R because R has the ability to delay argument evaluation (so an undefined nameofpackage doesn’t cause an error) as well as the ability to snoop the names of argument variables (most programming languages rely only on references or values of arguments). Given that a data scientist has to work with many tools and languages throughout the day, we prefer to not rely on features unique to one language unless we really need the feature. But the “official R style” is without the quotes.
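The quoted form also makes it easier to attach a package whose name is held in a variable. The sketch below uses the stats package, which ships with R, so it runs without installing anything; the variable names are ours.

```r
# The package name held as an ordinary string.
pkg <- "stats"

# Quoted form: the name is passed as a string literal.
library("stats")

# With a variable, character.only = TRUE tells library() to use the
# variable's value instead of treating `pkg` itself as the package name.
library(pkg, character.only = TRUE)

# Report the packages attached in this session.
info <- sessionInfo()
print(info$basePkgs)          # base packages attached in this session
print(names(info$otherPkgs))  # other attached packages (NULL if none)
```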

Changing your CRAN mirror

You can change your CRAN mirror at any time with the chooseCRANmirror() command. This is handy if the mirror you’re working with is slow.
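chooseCRANmirror() is interactive. In a script, the mirror can instead be set through the repos option, as in this sketch. The URL shown is CRAN's cloud mirror; any other mirror URL could be substituted.

```r
# Show the repositories currently configured for this session.
repos <- getOption("repos")
print(repos)

# Point the CRAN entry at a specific mirror, for this session only.
# Other configured repositories, if any, are left unchanged.
repos["CRAN"] <- "https://cloud.r-project.org"
options(repos = repos)

print(getOption("repos"))
```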

A.1.3. Installing Git

We advise installing Git version control before we show you how to use R and RStudio. This is because without Git, or a tool like it, you’ll lose important work. Not just your own work: you’ll lose important client work. A lot of data science work (especially the analysis tasks) involves trying variations and learning things. Sometimes you learn something surprising and need to redo earlier experiments. Version control keeps earlier versions of all of your work, so it’s exactly the right tool to recover code and settings used in earlier experiments. Git is available in precompiled packages from https://git-scm.com.

A.1.4. Installing RStudio

RStudio supplies a text editor (for editing R scripts) and an integrated development environment for R. Before downloading RStudio from https://www.rstudio.com, you should install both R and Git as we described earlier.

The RStudio product you initially want is called RStudio Desktop and is available precompiled for Windows, Linux, and macOS.

When you’re first starting with RStudio, we strongly recommend turning off both the “Restore .RData into workspace at startup” and “Save workspace to .RData on exit” features. Having these settings on (the default) makes it hard to reliably “work clean” (a point we will discuss in section A.3). To turn off these features, open the RStudio options pane (found under menus such as RStudio > Preferences, Tools > Global Options, Tools > Options, or similar, depending on your operating system), and then alter the two settings as indicated in figure A.4.

Figure A.4. RStudio options
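To check whether an earlier session has already left a saved workspace behind, a sketch like the following can be run from the R console. It only inspects the current working directory, which is an assumption about where the file would be. Outside RStudio, starting R with the --no-save and --no-restore command-line flags has a similar effect to the two settings.

```r
# A saved workspace is stored in a file named .RData; here we look for one
# in the current working directory only.
workspace_file <- file.path(getwd(), ".RData")
has_saved_workspace <- file.exists(workspace_file)

if (has_saved_workspace) {
  message("Found a saved workspace at ", workspace_file,
          "; it may be restored at startup")
} else {
  message("No saved workspace in ", getwd())
}

# List the objects currently defined in the global environment.
print(ls(envir = globalenv()))
```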

A.1.5. R resources

A lot of the power of R comes from its large family of packages, available from the CRAN repository. In this section, we’ll point out some packages and documentation.

Installing R views

R has an incredibly deep set of available libraries. Usually, R already has the package you want; it’s just a matter of finding it. A powerful way to find R packages is using views: http://cran.r-project.org/web/views/.

You can also install all the packages (with help documentation) from a view with a single command (though be warned: this can take an hour to finish). For example, here we’re installing a huge set of time series libraries all at once:

install.packages('ctv', repos = 'https://cran.r-project.org')
library('ctv')
# install.views('TimeSeries') # can take a LONG time

Once you’ve done this, you’re ready to try examples and code.

Online R resources

A lot of R help is available online. Some of our favorite resources include these:
