-
Notifications
You must be signed in to change notification settings - Fork 27
/
targets.qmd
187 lines (138 loc) · 10.5 KB
/
targets.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
---
execute:
freeze: auto
---
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE)
```
```{r, message = FALSE, warning = FALSE, echo = FALSE}
library(targets)
library(tarchetypes)
```
# Targets {#targets}
A target is a high-level steps of the computational pipeline, and it a piece of work that you define with your custom functions. A target runs some R code and saves the returned R object to storage, usually a single file inside `_targets/objects/`.
## Target names
A target is an abstraction. The `targets` package automatically manages data storage and retrieval under the hood, which means you do not need to reference a target's data file directly (e.g. `_targets/objects/your_target_name`). Instead, your R code should refer to a target name as if it were a variable in an R session. In other words, from the point of view of the user, a target is an R object in memory. That means a target name must be a valid visible symbol name for an R variable. The name must not begin with a dot, and it must be a string that lets you assign a value, e.g. `your_target_name <- TRUE`. For stylistic considerations, please refer to the [tidyverse style guide syntax chapter](https://style.tidyverse.org/syntax.html).
## What a target should do
Like a good function, a good target generally does one of three things:
1. Create a dataset.
2. Analyze a dataset with a model.
3. Summarize an analysis or dataset.
If a function gets too long, you can split it into nested sub-functions that make your larger function easier to read and maintain.
## How much a target should do
The `targets` package automatically skips targets that are already up to date, so it is best to define targets that maximize time savings. Good targets usually
1. Are large enough to subtract a decent amount of runtime when skipped.
1. Are small enough that some targets can be skipped even if others need to run.
1. Invoke no side effects such as modifications to the global environment. (But targets with `tar_target(format = "file")` can save files.)
1. Return a single value that is
i. Easy to understand and introspect.
i. Meaningful to the project.
i. Easy to save as a file, e.g. with `readRDS()`. Please avoid [non-exportable objects](https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html) as target return values or global variables.
Regarding the last point above, it is possible to customize the storage format of the target. For details, enter `?tar_target` in the console and scroll down to the description of the `format` argument.
## Working with tools outside R
Each target runs R code, so to invoke a tool outside R, consider `system2()` or [`processx`](https://processx.r-lib.org) to call the appropriate system commands. This technique allows you to run shell scripts, Python scripts, etc. from within R. External scripts should ideally be tracked as input files using `tar_target(format = "file")` as described in [section on external input files](#data). There are also specialized R packages to retrieve data from remote sources and invoke web APIs, including [`rnoaa`](https://github.com/ropensci/rnoaa), [`ots`](https://github.com/ropensci/ots), and [`aws.s3`](https://github.com/cloudyr/aws.s3), and you may wish to use [custom cues](https://docs.ropensci.org/targets/reference/tar_cue.html) to automatically invalidate a target when the upstream remote data changes.
## Side effects
Like a good pure function, a good target should return a single value and not produce side effects. (The exception is [output file targets](#data) which create files and return their paths.) Avoid modifying the global environment with calls to `data()` or `source()`. If you need to source scripts to define global objects, please do so at the top of your target script file (default: `_targets.R`) just like `source("R/functions.R")` from the [walkthrough vignette](#walkthrough).
## Dependencies
Consider the following pipeline.
```{r, eval = FALSE}
# _targets.R file
library(targets)
library(tarchetypes)
global_object <- 3
inner_function <- function(argument) {
local_object <- 1
argument + global_object + local_object + 2
}
outer_function <- function(object) {
object + inner_function(object) + 1
}
list(
tar_target(
name = second_target,
command = outer_function(first_target) + 2
),
tar_target(
name = first_target,
command = 2
)
)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, output = FALSE}
tar_script({
global_object <- 3
inner_function <- function(argument) {
local_object <- 1
argument + global_object + local_object + 2
}
outer_function <- function(object) {
object + inner_function(object) + 1
}
list(
tar_target(
name = second_target, command = outer_function(first_target) + 2),
tar_target(name = first_target, command = 2)
)
})
```
In order to run properly, `second_target` needs up-to-date versions of `first_target` and `outer_function()`. In other words, `first_target` and `outer_function()` are *dependencies* of `second_target`. Likewise, `inner_function()` is a dependency of `outer_function()`, and `global_object` is a dependency of `inner_function()`. The `targets` package searches commands and functions for dependencies, noting global symbols like `global_object` and ignoring local symbols like `argument` and `local_object`. The `tar_deps()` function emulates behavior for you.^[`tar_deps()` uses the `findGlobals()` function from the [`codetools`](https://CRAN.R-project.org/package=codetools) package, with some minor adjustments. See <https://adv-r.hadley.nz/expressions.html?q=ast#ast-funs> for more information on static code analysis.]
```{r}
tar_deps(outer_function(first_target) + 2)
```
```{r}
tar_deps(
function(argument) {
local_object <- 1
argument + global_object + local_object + 2
}
)
```
After it discards dangling symbols like `{` and `<-`, `targets` translates the dependency information into a *dependency [graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics))* that you can visualize with `tar_visnetwork()`. It is good practice to make sure this graph has the correct nodes connected with the correct edges.
```{r, eval = FALSE}
# R console
tar_visnetwork()
```
```{r, echo = FALSE, eval = TRUE}
tar_visnetwork(callr_arguments = list(show = FALSE))
```
The dependency graph is a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) (DAG) representation of the pipeline, where each node is a target or global object and each directed edge indicates where a downstream node depends on an upstream node. The DAG is not always a [tree](https://en.wikipedia.org/wiki/Tree_(graph_theory)), but it never contains a [cycle](https://en.wikipedia.org/wiki/Cycle_(graph_theory)) because no target is allowed to directly or indirectly depend on itself. The dependency graph should show a natural progression of work from left to right.^[If you have hundreds of targets, then [`tar_visnetwork()`](https://docs.ropensci.org/targets/reference/tar_visnetwork.html) may be slow. If that happens, consider temporarily commenting out some targets in `_targets.R` just for visualization purposes.] `targets` uses [static code analysis](https://adv-r.hadley.nz/expressions.html#ast-funs) to build the graph, so the order of `tar_target()` calls in the target list does not matter. However, `targets` does not support self-referential [loops](https://en.wikipedia.org/wiki/Loop_(graph_theory)) or other [cycles](https://en.wikipedia.org/wiki/Cycle_(graph_theory)).
When you run the pipeline with `tar_make()`^[or `tar_make_clustermq()` or `tar_make_future()`], `targets` runs the correct targets in the correct order with the correct resources according to the graph. For example, by the time `second_target` starts running, `targets` makes sure:
1. Dependency target `first_target` has already finished running.
2. Dependencies `first_target` and `outer_function()` are up to date.
3. Dependencies `first_target` and `outer_function()` are loaded into memory for `second_target` to use.
```{r, eval = TRUE}
# R console
tar_make()
```
At this point, any of the following changes will cause the next `tar_make()` to rerun `second_target`.
* Change the value of `global_object`.
* Change the body or arguments of `inner_function()`.
* Change the body or arguments of `outer_function()`.
* Change the command or value of `first_target`.
* Change the command of `second_target`.
## Return value
The return value of a target should be an R object that can be saved to disk and hashed.
### Saving
The object should be compatible with the storage format you choose using the `format` argument of `tar_target()` or `tar_option_set()`. For example, if the format is `"rds"` (default), then the target should return an R object that can be saved with `saveRDS()` and safely loaded properly into another session. Please avoid returning [non-exportable objects](https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html) such as connection objects, `Rcpp` pointers, `xgboost` matrices, and `greta` models^[Special exceptions are granted to Keras and Torch models, which can be safely returned from targets if you specify `format = "keras"` or `format = "torch"`.].
### Hashing
Once a target is saved to disk, `targets` computes a [`digest`](https://eddelbuettel.github.io/digest/) hash to track changes to the data file(s). These hashes are used to decide whether each target is up to date or needs to rerun. In order for the hash to be useful, the data you return from a target must be an accurate reflection of the underlying content of the data. So please try to return the actual data instead of an object that wraps or points to the data. Otherwise, the package will make incorrect decisions regarding which targets can skip and which need to rerun.
### Workaround
As a workaround, you can write custom functions to create temporary instances of these non-exportable/non-hashable objects and clean them up after the task is done. The following sketch creates a target that returns a database table while managing a transient connection object.
```{r, eval = FALSE}
# _targets.R
library(targets)
library(tarchetypes)
get_from_database <- function(table, ...) {
con <- DBI::dbConnect(...)
on.exit(close(con))
dbReadTable(con, table)
}
list(
tar_target(
table_from_database,
get_from_database("my_table", ...), # ... has use-case-specific arguments.
format = "feather" # Requires that the return value is a data frame.
)
)
```