Added `download_file()`; fixed missing word in `write_disk()` docs #599

hrbrmstr · 2019-06-19T13:15:33Z

GET + write_disk() does a fine job dealing with downloads but is somewhat verbose to type.

Base download.file() works fine but isn't cached-file-aware like write_disk() is.

GET + write_disk() is not multi-URL/file-aware like base download.file() is.

Base download.file() doesn't truly have all of the nice browser-like HTTP operations like GET does.

Abusing bandwidth (i.e. needlessly re-downloading) even in these modern times is not cool.

Thus begat download_file().

The function:

supports single URL/path and multiple URLs/paths for multi-file downloading (tho does not perform these ops in parallel like download.file() can in some circumstances)
is cached-file aware but also validates that a given URL is a valid URL (from a URL structure perspective, not "reach out and touch someone" perspective) when returning a response.
returns a data frame (classed as a tibble) with a summary of the operations that includes the requested URL, file path, status code and whether a cached file was used. A 399 status code is used to identify whether a URL resource was already in a cached state since 304 Not Modified does not make sense (since the function does not test for existence) and 399 is not likely to be used in our lifetime as a "standard" 300-level response

Example(s):

library(httr)
library(tibble)

tmp1 <- tempfile()
tmp2 <- tempfile()
tmp3 <- tempfile()

(download_file("https://google.com", tmp1)) # downloads fine
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE

(download_file("https://google.com", tmp1)) # doesn't re-download since it's cached
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         399 TRUE

(download_file("https://google.com", tmp1, overwrite = TRUE)) # re-downloads (overwrites file)
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE

(download_file("https://google.com", tmp2)) # re-downloads (new file)
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE

(download_file("badurl", tmp3)) # handles major errors gracefully
## # A tibble: 1 x 4
##   url    path  status_code cache_used
##   <chr>  <chr>       <int> <lgl>     
## 1 badurl <NA>           NA FALSE

# multi-file example with no-caching
(download_file(
  c(rep("https://google.com", 2), "badurl"),
  c(tmp1, tmp2, tmp3),
  overwrite = TRUE
))
## # A tibble: 3 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE     
## 2 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE     
## 3 badurl       <NA>                                           NA FALSE

# multi-file example with caching
(download_file(
  c(rep("https://google.com", 2), "badurl"),
  c(tmp1, tmp2, tmp3),
  overwrite = FALSE
))
## # A tibble: 3 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         399 TRUE      
## 2 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         399 TRUE      
## 3 badurl       <NA>                                           NA FALSE

There are tests for single & multi-file downloads.

I also added a missing word to the path parameter in write_disk() docs.

devtools::check() passed.

jennybc

I like the idea of this convenience function. It's not my call but I wonder if this sort of layer is something we would book against httr2 (https://github.com/r-lib/httr2/issues) or perhaps there is no reason to not just get on with it.

jennybc · 2019-06-19T20:11:18Z

R/download-file.R

+download_file <- function(url, path, overwrite = FALSE, ...) {
+
+  url <- as.character(url)
+  path <- as.character(path)


This makes me think of something I do in usethis::use_course() and recently decided to not present as a PR into curl (jeroen/curl#187): trying to determine local file name from the Content-Disposition header (and falling back to base name of the URL).

@hrbrmstr Do you have any thoughts on that? Maybe the curl issue is the best place to put them btw.

That's a super gd question. I have a similar idiom setup for some light crawls we do but hadn't thought to make it generic somewhere.

I wonder if adding a "helper" to/for write_disk() might be a way to go (not sure curl is the right place for it).

Something like (seriously off the top of my head) write_disk <- function(path, infer = FALSE, overwrite = FALSE)

where if infer were TRUE it'd have the behaviour you posited:

look for Content-Disposition and use that if found (potentially sanitizing it since it's theoretically untrusted content)

use last part of URL if ^^ does not exist, BUT

use a generated UUID-ish name if ^^ is a /

AND store it in the directory pointed to in path.

What you describe it basically exactly what I do in usethis (Content-Disposition > last bit of URL > random name + sanitization):

https://github.com/r-lib/usethis/blob/master/R/course.R

👍 It'd be pretty straightforward for me to add this to write_disk() (on holiday in 2 days so wldn't be until after that unless it seriously downpours in Acadia :-) if folks deem that a worthy add.

hadley · 2023-10-31T20:31:17Z

httr has been superseded in favour of httr2, so is no longer under active development. Thanks for using httr, thanks for contributing, and my apologies that your contribution never made it into the package.

hrbrmstr added 3 commits June 19, 2019 08:57

added download_file(); fixed missing word in write_disk() docs

b1d6f8e

fixed example in download_file()

55ca7da

tweaked docs a bit in download_file()

d278108

jennybc reviewed Jun 19, 2019

View reviewed changes

hadley closed this Oct 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added `download_file()`; fixed missing word in `write_disk()` docs #599

Added `download_file()`; fixed missing word in `write_disk()` docs #599

hrbrmstr commented Jun 19, 2019

jennybc left a comment

jennybc Jun 19, 2019

hrbrmstr Jun 19, 2019

jennybc Jun 19, 2019

hrbrmstr Jun 19, 2019

hadley commented Oct 31, 2023

Added download_file(); fixed missing word in write_disk() docs #599

Added download_file(); fixed missing word in write_disk() docs #599

Conversation

hrbrmstr commented Jun 19, 2019

jennybc left a comment

Choose a reason for hiding this comment

jennybc Jun 19, 2019

Choose a reason for hiding this comment

hrbrmstr Jun 19, 2019

Choose a reason for hiding this comment

jennybc Jun 19, 2019

Choose a reason for hiding this comment

hrbrmstr Jun 19, 2019

Choose a reason for hiding this comment

hadley commented Oct 31, 2023

Added `download_file()`; fixed missing word in `write_disk()` docs #599

Added `download_file()`; fixed missing word in `write_disk()` docs #599