-
Notifications
You must be signed in to change notification settings - Fork 27
/
data.qmd
158 lines (120 loc) · 7.19 KB
/
data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
execute:
freeze: auto
---
# Local data {#data}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE)
```
```{r, message = FALSE, warning = FALSE, echo = FALSE, eval = TRUE}
library(targets)
library(tarchetypes)
```
:::{.callout-tip}
## Performance
See the [performance chapter](#performance) for options, settings, and other choices to make storage and memory more efficient for large data workflows.
:::
During a pipeline, `targets` manages R objects in memory and writes them to files on disk. It also stores target-level metadata in a compact central text file.
## Memory
`targets` uses random access memory (RAM) while the pipeline is running. Each target loads its upstream dependencies into memory and returns an R object in memory. After a target runs or loads, `tar_make()` either keeps the object in memory or discards it, depending on the settings in `tar_target()` and `tar_option_set()`. Set `memory = "transient"` to release the target whenever possible, and set `garbage_collection = TRUE` to run garbage collection with `gc()` just before the target runs. Alternatively, set `memory = "persistent"` to keep the object in memory and reduce costly interactions with the file system. The trade-off between memory and file I/O depends on your computing platform. See the [performance chapter](#performance) for more details.
## Local data store
In addition to memory, the pipeline writes data to files on disk. `tar_make()` creates a special data folder called `_targets/` at the root of your project.
```{r, eval = FALSE}
fs::dir_tree("_targets")
_targets
├── meta
│ ├── crew
│ ├── meta
│ ├── process
│ └── progress
├── objects
│ ├── target1
│ ├── target2
│ ├── dynamic_branch_c7bcb4bd
│ ├── dynamic_branch_285fb6a9
│ └── dynamic_branch_874ca381
├── scratch # tar_make() deletes this folder after it finishes.
└── user # for gittargets user data
```
The two most important parts are:
1. `_targets/meta/meta`: a text file with key target-level metadata.
2. `_targets/objects/`: a folder with the output data of each target.
Consider this pipeline:
```{r, eval = FALSE, echo = TRUE}
library(targets)
library(tarchetypes)
list(
tar_target(
name = target1,
command = 11 + 46,
format = "rds",
repository = "local"
)
)
```
`tar_make()` does the following:
1. Run the command of `target1` and observe a return value of `57`.
1. Save the value `57` to `_targets/objects/target1` using [saveRDS()](https://stat.ethz.ch/R-manual/R-devel/library/base/help/readRDS.html).
1. Append a line to `_targets/meta/meta` containing the hash, time stamp, file size, warnings, errors, and execution time of `target1`.
1. Append a line to `_targets/meta/progress` to indicate that `target1` finished.
Remarks:
* To read the value of `target1` back into R, `tar_read(target1)` is much better than `readRDS("_targets/objects/target1")`.
* The `format` argument of `tar_target()` controls how `tar_make()` saves the return value. The default is `"rds"`, and there are more efficient formats such as `"qs"` and `"feather"`. Some of these formats require external packages. See <https://docs.ropensci.org/targets/reference/tar_target.html#storage-formats> for details.
* For efficiency, `tar_make()` does not write to `_targets/meta/meta` or `_targets/meta/progress` every single time a target completes. Instead, it waits and gathers a backlog of text lines in memory, then writes whole batches of lines at a time. This behavior risks losing metadata in the event of a crash, but it reduces costly interactions with the file system. The `seconds_meta` argument controls how often `tar_make()` writes metadata. `seconds_reporter` does the same for messages printed to the R console.
## External files
Some pipelines work with custom external files outside `_targets/`. The user is still responsible for reading and writing these files. However, the pipeline can track them, detect changes, and decide whether to rerun or skip the targets that the files depend on. Simply create a file target.
In a file target,
1. `tar_target()` has `format = "file"`.
2. The command returns a character vector of file paths.
Consider this pipeline:
```{r, eval = FALSE, echo = TRUE}
# _targets.R
library(targets)
library(tarchetypes)
create_output <- function(file) {
data <- read.csv(file)
output <- head(data)
write.csv(output, "output.csv")
"output.csv"
}
list(
tar_target(name = input, command = "data.csv", format = "file"),
tar_target(name = output, command = create_output(input), format = "file")
)
```
In the dependency graph, `output` depends on `input` because the command of `output` mentions the symbol `input`.
```{r, eval = TRUE, echo = FALSE}
tar_script({
library(targets)
library(tarchetypes)
create_output <- function(file) {
data <- read.csv(file)
output <- head(data)
write.csv(output, "output.csv")
"output.csv"
}
list(
tar_target(name = input, command = "data.csv", format = "file"),
tar_target(name = output, command = create_output(input), format = "file")
)
})
```
```{r, eval = FALSE, echo = TRUE}
tar_visnetwork()
```
```{r, eval = TRUE, echo = FALSE}
tar_visnetwork()
```
Before the pipeline first runs, `data.csv` exists, but `output.csv` does not. During `tar_make()`, the `input` target tracks `data.csv`, and the `output` target creates and tracks `output.csv`. If `data.csv` changes before the next `tar_make()`, then both `input` and `output` rerun. If something outside the pipeline changes `output.csv`, then `output` reruns.
Remarks:
* A file target can have both input and output files.
* A file target can include directory paths as well as individual file paths.
* `format = "file_fast"` is just like `format = "file"`, except that it uses file modification time stamps to check if a file is up to date. If the time stamp of the file agrees with the time stamp in the metadata, the file is considered up to date. Otherwise, targets recomputes the hash of the file to make a final determination.
## Clean up local files
There are [multiple functions](https://docs.ropensci.org/targets/reference/index.html#section-clean) to remove or clean up local target storage. Some of them also delete cloud data if your pipeline uses an AWS or GCP bucket (see the [next chapter](#cloud-storage)).
* [`tar_destroy()`](https://docs.ropensci.org/targets/reference/tar_destroy.html) removes `_targets/` data store and any cloud data from the pipeline.
* [`tar_prune()`](https://docs.ropensci.org/targets/reference/tar_prune.html) deletes data and metadata irrelevant to the current pipeline in `_targets.R`.
* [`tar_delete()`](https://docs.ropensci.org/targets/reference/tar_delete.html) deletes specific data files from `_targets/objects/` and the cloud. It does not modify metadata.
* [`tar_invalidate()`](https://docs.ropensci.org/targets/reference/tar_invalidate.html) removes metadata from specific targets but keeps their data files in `_targets/objects/`.
* [`tar_meta_delete()`](https://docs.ropensci.org/targets/reference/tar_meta_delete.html) removes `_targets/meta/` files and their copies on the cloud.