Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse "Content-Disposition" headers #187

Closed
jennybc opened this issue Jun 4, 2019 · 4 comments
Closed

Parse "Content-Disposition" headers #187

jennybc opened this issue Jun 4, 2019 · 4 comments

Comments

@jennybc
Copy link
Contributor

jennybc commented Jun 4, 2019

We have a pair of functions in usethis (use_course() and, in the dev version, use_zip()) that download a ZIP file and unpack it. They have two user-friendly features:

  1. The default name of the ZIP file is determined by what you're downloading, just like in the browser! This primarily relies on the "Content-Disposition" header.
  2. The unzipping tries hard to be smart about unpacking.

For an hour or two this weekend, I aspired to make a PR to curl to parse the "Content-Disposition" header, paving the way towards a curl_download() variant that could determine destfile for itself. But I gave up, feeling convinced that to do this with the rigor necessary in the curl package (vs a usethis function for interactive use) is too much work.

Leaving a few notes here for discussion, since I bothered to research this. Maybe one day the landscape will change and the cost-benefit analysis looks better. Ideally, an existing parser could just be used/embedded.


Concrete examples of how the header looks in cases I see all the time:

DropBox folder download: "attachment; filename=\"foo.zip\"; filename*=UTF-8''foo.zip\"
GitHub repo download: "attachment; filename=foo-master.zip"

Problem is, lots of weird edge cases are possible, in terms of quoting and specifying the encoding.

curl itself even punts on the filename* field:

https://github.com/curl/curl/blob/3538026f6f145b2811f4d515992565d6cbe969b0/src/tool_cb_hdr.c#L109-L110

and dealing with this properly is an official TODO:

https://curl.haxx.se/docs/todo.html#UTF_8_filenames_in_Content_Dispo


Main resource on how to use this header:

RFC 6266 Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP)
https://tools.ietf.org/html/rfc6266

An extensive test suite exploring lots of edge cases and how various browser handle them (or, rather, fail to):
http://test.greenbytes.de/tech/tc2231/

Some Python efforts, that convey a general sense of neglect/abandonment:


@jeroen it's fine if you want to just close this, w/ or w/o making some comments. I just wanted to put these links somewhere.

@jennybc
Copy link
Contributor Author

jennybc commented Jun 4, 2019

Poking around in curl source (curl curl, not this package 🙂) does make me wonder ... could this package somehow access the plain old filename field of Content-Disposition, since curl does? Or is the ability to get at this limited to the command-line tool curl and is not exposed in libcurl? I suspect that is unfortunately the case.

@jay
Copy link

jay commented Jun 4, 2019

is the ability to get at this limited to the command-line tool curl and is not exposed in libcurl?

Yes, you would have to parse it out of the header as the curl tool does.

@dmi3kno
Copy link

dmi3kno commented Jul 12, 2019

I am parsing content-disposition In guess_basename() helper here https://github.com/dmi3kno/polite/blob/master/inst/templates/polite_template.R

@jeroen
Copy link
Owner

jeroen commented Jul 26, 2019

I might change my mind in the future, but for now I won't implement this in the curl package for the following reasons:

  • Because libcurl does not expose a content-disposition parser anyway, there is no need to have this in the curl package other than convenience. It might be a better fit for httr.
  • As noted above, there are a lot of edge cases. Because curl is an infrastructural package, I only want to expose things that work reliably and consistently.

If I were to write such a function I would recommend doing it in R, basically as in tidy_download. I.e. first download to a temporary file and then after the download has succeeded, check the response headers from the handle and possibly rename the file.

I do not recommend writing directly to the filename suggested by the content-disposition in C, because it is difficult to implement and makes malformed content-disposition headers possibly result in a crash or security vulnerability.

@jeroen jeroen closed this as completed Jul 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants