---
execute:
freeze: auto
---
# Cloud storage {#cloud-storage}
:::{.callout-warning}
## Cost
[Amazon S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/) are paid services. Amazon and Google charge not only for data storage, but also for operations that query or modify that data. Read <https://aws.amazon.com/s3/pricing/> and <https://cloud.google.com/storage/pricing> for details.
:::
:::{.callout-tip}
## Package version
This chapter requires `targets` version 1.3.0 or higher. Please visit the [installation instructions](https://docs.ropensci.org/targets/#installation).
:::
`targets` can store [data and metadata](#data) on the cloud, either with [Amazon Web Service (AWS) Simple Storage Service (S3)](https://aws.amazon.com/s3/) or [Google Cloud Platform (GCP) Google Cloud Storage (GCS)](https://cloud.google.com/storage/).
## Benefits
### Store less data locally
1. Use `tar_option_set()` and `tar_target()` to opt into cloud storage and configure options.
1. `tar_make()` uploads target data to a cloud bucket *instead of* the local [`_targets/objects/`](#data) folder. Likewise for [file targets](https://books.ropensci.org/targets/data.html#external-files).^[For cloud targets, `format = "file_fast"` has no purpose, and it automatically switches to `format = "file"`.]
1. Every `seconds_meta` seconds, `tar_make()` uploads the metadata while still keeping local copies in the [`_targets/meta/`](#data) folder (see the sketch after this list).^[Metadata snapshots are synchronous, so a long target with `deployment = "main"` may block the main R process and delay uploads.]
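As a quick sketch of the metadata cadence, the call below throttles snapshots to roughly one every 15 seconds. It assumes a `targets` version in which `tar_make()` accepts the `seconds_meta` argument.
```{r, eval = FALSE, echo = TRUE}
# Upload/append metadata about every 15 seconds instead of the default.
# (Assumes tar_make() accepts seconds_meta in your version of targets.)
tar_make(seconds_meta = 15)
```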
### Inspect the results on a different computer
1. `tar_meta_download()` downloads the latest metadata from the bucket to the local [`_targets/meta/`](#data) folder.^[Functions `tar_meta_upload()`, `tar_meta_sync()`, and `tar_meta_delete()` also manage cloud metadata.]
1. Helpers like `tar_read()` read the local metadata and access the target data in the bucket, as sketched below.
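A minimal sketch of this workflow on the second computer, assuming the project code and cloud credentials are already configured there (`your_target` is a placeholder name):
```{r, eval = FALSE, echo = TRUE}
# On the second computer, from the project root:
tar_meta_download()   # Refresh the local metadata in _targets/meta/.
tar_read(your_target) # Read the target's value from the bucket.
```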
### Track history
1. [Turn on versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html) in your [bucket](https://aws.amazon.com/s3/) (for AWS, see the sketch after this list).
1. `tar_make()` records the versions of the target data in `_targets/meta/meta`.
1. Commit `_targets/meta/meta` to the same [version-controlled repository](https://github.com) as your R code.
1. Roll back to a [prior commit](https://git-scm.com/docs/git-commit) to roll back the [local metadata](#data) and give `targets` access to prior versions of the target data.
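For step (1) on AWS, a hedged sketch using `paws.storage` (`YOUR_BUCKET` is a placeholder):
```{r, eval = FALSE, echo = TRUE}
# Enable versioning on an existing S3 bucket.
svc <- paws.storage::s3()
svc$put_bucket_versioning(
  Bucket = "YOUR_BUCKET",
  VersioningConfiguration = list(Status = "Enabled")
)
```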
## Setup
### AWS setup
Skip these steps if you already have an [AWS](https://aws.amazon.com) account and [bucket](https://aws.amazon.com/s3/).
1. Sign up for a free tier account at <https://aws.amazon.com/free>.
1. Read the [Simple Storage Service (S3) instructions](https://docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html) and practice in the [web console](https://console.aws.amazon.com/s3/).
1. Install the `paws.storage` R package: `install.packages("paws.storage")`.
1. Follow the [`paws` documentation](https://www.paws-r-sdk.com/#credentials) to set your [AWS security credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html).
1. Create an [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html), either in the [web console](https://console.aws.amazon.com/s3/) or with `paws.storage::s3()$create_bucket()`.
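For the last step, a minimal sketch with `paws.storage` (the bucket name and region are placeholders, and bucket names must be globally unique):
```{r, eval = FALSE, echo = TRUE}
# Create an S3 bucket from R.
svc <- paws.storage::s3()
svc$create_bucket(
  Bucket = "YOUR_BUCKET",
  CreateBucketConfiguration = list(LocationConstraint = "us-east-2")
)
```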
### GCP setup
Skip these steps if you already have a [GCP](https://cloud.google.com) account and [bucket](https://cloud.google.com/storage/).
1. Activate a Google Cloud Platform account at <https://cloud.google.com>.
1. Install the `googleCloudStorageR` R package: `install.packages("googleCloudStorageR")`.
1. Follow the [`googleCloudStorageR` setup instructions](https://code.markedmondson.me/googleCloudStorageR/articles/googleCloudStorageR.html) to authenticate into Google Cloud and enable required APIs.
1. Create a [Google Cloud Storage (GCS)](https://cloud.google.com/storage/) bucket, either in the [web console](https://cloud.google.com/storage/) or with `googleCloudStorageR::gcs_create_bucket()` (sketched below).
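For the last step, a minimal sketch (the bucket name and project ID are placeholders):
```{r, eval = FALSE, echo = TRUE}
# Create a GCS bucket from R after authenticating.
googleCloudStorageR::gcs_create_bucket(
  name = "YOUR_BUCKET",
  projectId = "YOUR_GCP_PROJECT_ID"
)
```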
### Pipeline setup
Use `tar_option_set()` to opt into cloud storage and declare options. For AWS:^[`cue = tar_cue(file = FALSE)` is no longer recommended for [cloud storage](https://books.ropensci.org/targets/data.html#cloud-storage). This unwise shortcut is no longer necessary, as of <https://github.com/ropensci/targets/pull/1181> (`targets` version >= 1.3.2.9003).]
1. `repository = "aws"`
1. `resources = tar_resources(aws = tar_resources_aws(bucket = "YOUR_BUCKET", prefix = "YOUR/PREFIX"))`
Details:
* The process is analogous for GCP (see the sketch after this list).
* The `prefix` is just like `tar_config_get("store")`, but for the cloud. It controls where the data objects live in the bucket, and it should not conflict with other projects.
* Arguments `repository`, `resources`, and `cue` of `tar_target()` override their counterparts in `tar_option_set()`.
* In `tar_option_set()`, `repository` controls the target data, and `repository_meta` controls the metadata. However, `repository_meta` defaults to `repository`, so to continuously upload the metadata, it usually suffices to set e.g. `repository = "aws"` in `tar_option_set()`.
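Since the process is analogous for GCP, here is a sketch of the equivalent opt-in (bucket and prefix are placeholders):
```{r, eval = FALSE, echo = TRUE}
# GCP analogue of the AWS options above.
tar_option_set(
  repository = "gcp",
  resources = tar_resources(
    gcp = tar_resources_gcp(
      bucket = "YOUR_BUCKET",
      prefix = "YOUR/PREFIX"
    )
  )
)
```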
## Example
Consider a pipeline with two simple targets.
```{r, eval = FALSE, echo = TRUE}
# Example _targets.R file:
library(targets)
library(tarchetypes)
tar_option_set(
repository = "aws",
resources = tar_resources(
aws = tar_resources_aws(
bucket = "my-test-bucket-25edb4956460647d",
prefix = "my_project_name"
)
)
)
write_file <- function(data) {
saveRDS(data, "file.rds")
"file.rds"
}
list(
tar_target(data, rnorm(5), format = "qs"),
tar_target(file, write_file(data), format = "file")
)
```
As usual, `tar_make()` runs the correct targets in the correct order. Both data files now live in bucket `my-test-bucket-25edb4956460647d` at S3 key paths that begin with [prefix](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html) `my_project_name`. Neither `_targets/objects/data` nor `file.rds` exists locally because `repository` is `"aws"`.
```{r, eval = FALSE, echo = TRUE}
tar_make()
#> ▶ start target data
#> ● built target data [0 seconds]
#> ▶ start target file
#> ● built target file [0.002 seconds]
#> ▶ end pipeline [1.713 seconds]
```
At this point, suppose you switch to a different computer. Download the latest metadata with `tar_meta_download()`, as sketched below, and your results will be up to date.
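A minimal sketch of the download step, assuming the new computer already has the project code and cloud credentials:
```{r, eval = FALSE, echo = TRUE}
# On the new computer, from the project root:
tar_meta_download() # Download the latest metadata files to _targets/meta/.
```
After the download, `tar_make()` skips the up-to-date targets: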
```{r, eval = FALSE, echo = TRUE}
tar_make()
#> ✔ skip target data
#> ✔ skip target file
#> ✔ skip pipeline [1.653 seconds]
```
`tar_read()` reads the local metadata and then reads the target data from the bucket.
```{r, eval = FALSE, echo = TRUE}
tar_read(data)
#> [1] -0.74654607 -0.59593497 -1.57229983 0.40915323 0.02579023
```
For a file target, `tar_read()` downloads the file to its original location and returns the path.
```{r, eval = FALSE, echo = TRUE}
path <- tar_read(file)
path
#> [1] "file.rds"
readRDS(path)
#> [1] -0.74654607 -0.59593497 -1.57229983 0.40915323 0.02579023
```