Initial try on tar_download #9

Closed
wants to merge 1 commit into from

Conversation

noamross
Contributor

Prework

Summary

This is an initial try at tar_download(), which builds on top of tar_change(). tar_download() creates a file target that is re-downloaded when the remote URL's last-modified time or ETag changes.

This is a draft, as I'm still working through a bug and trying to understand some NSE issues, but I thought I would show my progress so far for comment.
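
As a rough illustration of the change check described above (not the code in this PR), the decision to re-download can be reduced to comparing a signature built from the ETag and Last-Modified response headers. The helper names `header_signature()` and `needs_download()` are hypothetical, and the header values are made up for the example; in practice they might come from a HEAD request made with curl.

```r
# Hypothetical helpers, not part of tarchetypes or targets.

# Build a comparable signature from a named list of response headers.
# Header names are matched case-insensitively; missing headers become "".
header_signature <- function(headers) {
  names(headers) <- tolower(names(headers))
  pick <- function(key) {
    value <- headers[[key]]
    if (is.null(value)) "" else as.character(value)
  }
  paste(pick("etag"), pick("last-modified"), sep = "|")
}

# A download is needed if the file is absent or the signature changed.
needs_download <- function(old_signature, headers, path) {
  !file.exists(path) || !identical(old_signature, header_signature(headers))
}

# Made-up header values for illustration.
old <- list(ETag = "\"abc\"", `Last-Modified` = "Tue, 11 Aug 2020 10:00:00 GMT")
new <- list(etag = "\"def\"", `last-modified` = "Wed, 12 Aug 2020 10:00:00 GMT")

existing <- tempfile()
file.create(existing)

needs_download(header_signature(old), old, existing)  # FALSE
needs_download(header_signature(old), new, existing)  # TRUE
```

Keeping the comparison separate from the network call means the logic can be tested offline, which matters for the "no internet" case discussed later in this thread.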

Related GitHub issues and pull requests

  • Ref: #

Checklist

  • This pull request is not a draft.

@noamross
Contributor Author

Should I not be using tar_change() but rather building on top of tar_change_pair(), which I think is the equivalent of tar_target_raw()? @wlandau

Member

@wlandau wlandau left a comment

Thanks for the PR, @noamross! I really appreciate your help with this archetype, especially because you are one of the folks who is probably going to use it the most. I agree with most of what you sketched out. I think the issues in my comments and some new tests will get us most of the way there.

@@ -32,7 +32,8 @@ Suggests:
 digest (>= 0.6.25),
 knitr (>= 1.28),
 rmarkdown (>= 2.1),
-testthat (>= 2.3.2)
+testthat (>= 2.3.2),
+curl
Member

Would you add a minimum version for curl?

stop_on_no_internet = FALSE,
...
) {
if(!requireNamespace("curl", quietly = TRUE)) {
Member

Would you use tarchetypes::assert_package("curl", "tar_download() requires the package 'curl' to be installed")?

stop("tar_download() requires the the package 'curl' to be installed")
}
handle <- handle %||% curl::new_handle()
tar_change(
Member

As you pointed out, we should use tar_change_pair().

destdir = ".",
handle = NULL,
stop_on_no_internet = FALSE,
...
Member

Let's consider making these arguments formal. For other archetypes, I found this useful so I could change the defaults in a way that users notice.

#' @export
tar_download_file <- function(url, destfile, destdir, stop_on_no_internet = FALSE, handle = curl::new_handle())
{
if (!curl::has_internet()) {
Member

Other archetypes might require an internet connection, so it may be nice to put this in a new assert_internet() in R/utils_assert.R.

Contributor Author

Your assertion patterns don't currently allow for "warning only" options, which I want here. (The use-case being that in the absence of a connection the workflow should be able to continue with current versions of files.) Do you want to incorporate that into your assertion setup or should I just do something conditional here?

Member

@wlandau wlandau Aug 12, 2020

I see. How about a validate_internet() utility?

validate_internet <- function(assert_internet = FALSE) {
  # TRUE: error without a connection; FALSE: warn and cancel the target.
  trn(assert_internet, assert_internet(), try_cancel_internet())
}

try_cancel_internet <- function(msg = NULL) {
  if (!curl::has_internet()) {
    warn_validate("no internet")
    tar_cancel()
  }
}

warn_validate <- function(...) {
  warning(warning_validate(...))
}

warning_validate <- function(...) {
  structure(
    list(message = paste0(..., collapse = ""), call = NULL),
    class = c(
      "condition_validate",
      "condition_tarchetypes",
      "warning",
      "condition"
    )
  )
}

Contributor Author

I still need to test for internet to determine whether to cancel the target. Should this return a logical value?

Contributor Author

Honestly for me atm this seems like a way to turn 5 lines of readable code into 20 harder-to-understand lines.

Member

I just updated the code above to include a call to tar_cancel().

There are more functions, but they help us avoid nested if/else logic, and I think each function might be reusable for other archetypes. And warn_validate() and warning_validate() support custom conditions for warnings, which make exception handling and testing nicer.
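
To illustrate the point about custom conditions, here is a minimal base R sketch of how a classed warning like the one built by warning_validate() can be caught selectively. It reuses the class names from the snippet above; `check_connection()` and its `online` argument are hypothetical stand-ins so that the example runs without curl or targets.

```r
# Same constructor as in the comment above.
warning_validate <- function(...) {
  structure(
    list(message = paste0(..., collapse = ""), call = NULL),
    class = c(
      "condition_validate",
      "condition_tarchetypes",
      "warning",
      "condition"
    )
  )
}

# Hypothetical stand-in for try_cancel_internet(): the connection status is
# passed in so the example does not depend on curl or targets.
check_connection <- function(online) {
  if (!online) {
    warning(warning_validate("no internet"))
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# A handler keyed on the custom class catches only this kind of warning.
result <- tryCatch(
  check_connection(online = FALSE),
  condition_validate = function(w) conditionMessage(w)
)
result  # "no internet"

# In a test, the class can be asserted directly.
caught <- tryCatch(check_connection(FALSE), warning = function(w) w)
inherits(caught, "condition_tarchetypes")  # TRUE
```

Because the handler matches on class rather than on message text, tests do not break if the wording of the warning changes.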

stop("No internet. Cannot check url: ", url)
else
warning("No internet. Cannot check url: ", url)
tar_cancel()
Member

Could we namespace calls to targets functions?

@@ -0,0 +1,104 @@
#' @title Download a file from a remote source, checking for changes
#' first
#' @description Create a target that downnloads a file if it has changed since
Member

Just flagging this for proofreading when it comes time to polish things up.

@wlandau
Member

wlandau commented Aug 11, 2020

> Should I not be using tar_change() but rather building on top of tar_change_pair(), which I think is the equivalent of tar_target_raw()? @wlandau

Exactly, the NSE issues you faced are the exact reason for tar_target_raw(). tar_change_pair() should probably be called tar_change_raw().

tar_download <- function(
name,
url,
destfile = basename(url),
Member

Do we need both destdir and destfile arguments? What about a single path argument that defaults to basename(url)?

Contributor Author

I like this split because, for me, it's a common pattern to download a lot of files from different sources into data/ or downloads/ (or data/downloads), so it makes an easy pattern to iterate over, and conceptually I separate where I put something from its name. I don't feel strongly about it, but a lot of functions split these concepts (e.g., rmarkdown::render).
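
The iteration pattern described here could look roughly like the following base R sketch. The URLs and the `downloads` directory name are invented for illustration, and `dest_path()` is a hypothetical helper, not a function from this PR.

```r
# Hypothetical helper: combine a shared destination directory with a
# per-file name that defaults to the last component of the URL.
dest_path <- function(url, destfile = basename(url), destdir = ".") {
  file.path(destdir, destfile)
}

# Invented example URLs.
urls <- c(
  "https://example.com/data/a.csv",
  "https://example.org/files/b.csv"
)

# One directory for everything, names taken from the URLs.
paths <- dest_path(urls, destdir = "downloads")
paths  # "downloads/a.csv" "downloads/b.csv"

# A single file can still be renamed without touching the directory.
dest_path(urls[1], destfile = "renamed.csv", destdir = "downloads")
```

With a single path argument, the same loop would need a file.path() call at every call site, which is the trade-off under discussion.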

destfile = basename(url),
destdir = ".",
handle = NULL,
stop_on_no_internet = FALSE,
Member

Maybe assert_internet instead of stop_on_no_internet?

@wlandau
Member

wlandau commented Aug 14, 2020

Also, please feel free to add yourself as a contributor in the DESCRIPTION.

@wlandau wlandau mentioned this pull request Sep 8, 2020
3 tasks
@wlandau
Member

wlandau commented Sep 15, 2020

On reflection, I think we might be better off thinking about input URLs as formats: ropensci/targets#154.

@wlandau
Member

wlandau commented Sep 15, 2020

Though if we do move to a format instead of an archetype, I'm not sure what to do about the curl handles. Not sure it's worth adding an option in tar_option_get() for something so specific.

@wlandau
Member

wlandau commented Sep 15, 2020

Yup, tar_download() still seems worth it because tar_target(format = "url") will not be able to accept custom curl handles.

@wlandau
Member

wlandau commented Sep 18, 2020

@noamross, after some reflection and refactoring, I decided to allow custom curl handles for the built-in format = "url" through the resources argument.

tar_target(
  url,
  "https://httpbin.org/etag/test",
  format = "url",
  resources = list(handle = curl::new_handle(...))
)

So I think the functionality in the current PR is now covered directly in targets.

I built URLs directly into targets to allow easier dynamic branching across URLs. So what do you think about a version of tar_download() that is shorthand for this?

list(
  tar_files(urls, rep("https://httpbin.org/etag/test", 2), format = "url"),
  tar_target(downloads, download.file(urls), pattern = map(urls))
)

Combined with some HPC, this could make parallel downloads a lot easier.

@wlandau wlandau closed this Oct 13, 2020
@wlandau
Member

wlandau commented Oct 17, 2020

Sorry, I didn't mean to close your PR @noamross. I don't know why that happened, and I don't know why I can't reopen it.

@wlandau
Member

wlandau commented Oct 17, 2020

I definitely don't remember closing it, especially not within the last few days, and I 100% meant to keep it open because of the alternative direction available to tar_download(). Looks like there is no commit hash tied to the "closed" message, so it wasn't closed due to a typo in a commit message. This is so odd.

@petrbouchal

Just to say I would appreciate having this in {tarchetypes} - I almost started implementing my own before discovering this. By now this is one of the few archetypes/helpers I miss and would use frequently, ideally in conjunction with the new tar_timestamp().
