Caching
Normally, each render of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is `cache: true`.
You can enable the knitr cache at the document level for caching the results of all computations in a document using standard YAML options:
---
title: "My Document"
execute:
  cache: true
---
You can also enable caching at the chunk level for caching the results of computation in a specific chunk:
```{r}
#| cache: true
# code for lengthy computation...
```
When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.
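To make the mechanism concrete, here is a minimal sketch in base R of the idea of reusing a stored result for as long as the code is unchanged. It is an illustration only, not how knitr is implemented: `run_cached()` is a hypothetical helper, and it compares the code text directly, whereas knitr's own bookkeeping is more involved.

```{r}
# Hypothetical helper (not part of knitr): evaluate `code`, given as a string,
# only when it differs from the code stored alongside the cached result.
run_cached <- function(code, cache_file) {
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    if (identical(cached$code, code)) {
      return(cached$value) # code unchanged: reuse the stored result
    }
  }
  value <- eval(parse(text = code), envir = globalenv())
  saveRDS(list(code = code, value = value), cache_file)
  value
}

cache_file <- tempfile(fileext = ".rds")
first  <- run_cached("Sys.time()", cache_file)     # computed and stored
second <- run_cached("Sys.time()", cache_file)     # same code: read from disk
third  <- run_cached("Sys.time() + 1", cache_file) # code changed: recomputed
identical(first, second)
```

Note that the helper looks only at the code text, so anything the code reads from outside (files, other objects) can change without triggering a rerun.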
The caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the `processed_data` chunk depends on the `raw-data` chunk:
```{r}
#| label: raw-data
#| cache: true
rawdata <- readr::read_csv("a_very_large_file.csv")
```
```{r}
#| label: processed_data
#| cache: true
processed_data <- rawdata |>
  filter(!is.na(import_var)) |>
  mutate(new_variable = complicated_transformation(x, y, z))
```
Caching the `processed_data` chunk means that it will get rerun if the dplyr pipeline is changed, but it won’t get rerun if the `read_csv()` call changes. You can avoid that problem with the `dependson` chunk option:
```{r}
#| label: processed_data
#| cache: true
#| dependson: "raw-data"
processed_data <- rawdata |>
  filter(!is.na(import_var)) |>
  mutate(new_variable = complicated_transformation(x, y, z))
```
`dependson` should contain a character vector of every chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.
Note that the chunks won’t update if `a_very_large_file.csv` changes, because knitr caching tracks changes only within the `.qmd` file. If you want to also track changes to that file, you can use the `cache.extra` option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is `file.mtime()`: it returns when the file was last modified. Then you can write:
```{r}
#| label: raw-data
#| cache: true
#| cache.extra: !expr file.mtime("a_very_large_file.csv")
rawdata <- readr::read_csv("a_very_large_file.csv")
```
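To see why a modification time is a reasonable thing to feed to `cache.extra`, the following sketch checks that `file.mtime()` returns a different value once a file has been touched. It uses a temporary file as a stand-in for the real CSV, and `Sys.setFileTime()` to simulate a later edit; both are assumptions made for illustration.

```{r}
# A temporary stand-in for a_very_large_file.csv
path <- tempfile(fileext = ".csv")
writeLines(c("x", "1"), path)

before <- file.mtime(path)

# Simulate the file being modified one minute later
Sys.setFileTime(path, before + 60)
after <- file.mtime(path)

# The value differs, so an expression based on it would change too
after > before
```

Because the value of the expression changes whenever the file's timestamp does, a chunk that uses it in `cache.extra` would be treated as out of date after the file is rewritten.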
We’ve followed the advice of David Robinson to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the `dependson` specification.
As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with `knitr::clean_cache()`.
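As a hedged sketch of how that might look in practice: knitr's documentation notes that `clean_cache()` has to be called from a code chunk inside the document, because it needs to know the document's chunk labels, and that with the default `clean = FALSE` it only lists the cache files it believes belong to chunks that are no longer in the document. The chunk label below is hypothetical.

```{r}
#| label: check-stale-cache
#| cache: false
# List cache files that knitr thinks are no longer needed; nothing is deleted.
knitr::clean_cache(clean = FALSE)

# After checking the list, uncomment to actually remove those files.
# knitr::clean_cache(clean = TRUE)
```

Listing first and deleting second is the cautious order, since the identification can be wrong when several documents share one cache directory.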
Exercises
- Set up a network of chunks where `d` depends on `c` and `b`, and both `b` and `c` depend on `a`. Have each chunk print `lubridate::now()`, set `cache: true`, and then verify your understanding of caching.