Parse "Content-Disposition" headers #187

jennybc · 2019-06-04T16:21:37Z

We have a pair of functions in usethis (use_course() and, in the dev version, use_zip()) that download a ZIP file and unpack it. They have two user-friendly features:

The default name of the ZIP file is determined by what you're downloading, just like in the browser! This primarily relies on the "Content-Disposition" header.
The unzipping tries hard to be smart about unpacking.

For an hour or two this weekend, I aspired to make a PR to curl to parse the "Content-Disposition" header, paving the way towards a curl_download() variant that could determine destfile for itself. But I gave up, feeling convinced that to do this with the rigor necessary in the curl package (vs a usethis function for interactive use) is too much work.

Leaving a few notes here for discussion, since I bothered to research this. Maybe one day the landscape will change and the cost-benefit analysis looks better. Ideally, an existing parser could just be used/embedded.

Concrete examples of how the header looks in cases I see all the time:

DropBox folder download: "attachment; filename=\"foo.zip\"; filename*=UTF-8''foo.zip\"
GitHub repo download: "attachment; filename=foo-master.zip"

Problem is, lots of weird edge cases are possible, in terms of quoting and specifying the encoding.

curl itself even punts on the filename* field:

https://github.com/curl/curl/blob/3538026f6f145b2811f4d515992565d6cbe969b0/src/tool_cb_hdr.c#L109-L110

and dealing with this properly is an official TODO:

https://curl.haxx.se/docs/todo.html#UTF_8_filenames_in_Content_Dispo

Main resource on how to use this header:

RFC 6266 Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP)
https://tools.ietf.org/html/rfc6266

An extensive test suite exploring lots of edge cases and how various browser handle them (or, rather, fail to):
http://test.greenbytes.de/tech/tc2231/

Some Python efforts, that convey a general sense of neglect/abandonment:

https://pypi.org/project/rfc6266-parser/ (a fork of next one which appears to be abandoned)
https://pypi.org/project/rfc6266/
Issue on requests https://github.com/kennethreitz/requests/issues/5031

@jeroen it's fine if you want to just close this, w/ or w/o making some comments. I just wanted to put these links somewhere.

The text was updated successfully, but these errors were encountered:

jennybc · 2019-06-04T16:40:56Z

Poking around in curl source (curl curl, not this package 🙂) does make me wonder ... could this package somehow access the plain old filename field of Content-Disposition, since curl does? Or is the ability to get at this limited to the command-line tool curl and is not exposed in libcurl? I suspect that is unfortunately the case.

jay · 2019-06-04T18:30:08Z

is the ability to get at this limited to the command-line tool curl and is not exposed in libcurl?

Yes, you would have to parse it out of the header as the curl tool does.

dmi3kno · 2019-07-12T08:18:40Z

I am parsing content-disposition In guess_basename() helper here https://github.com/dmi3kno/polite/blob/master/inst/templates/polite_template.R

jeroen · 2019-07-26T12:23:18Z

I might change my mind in the future, but for now I won't implement this in the curl package for the following reasons:

Because libcurl does not expose a content-disposition parser anyway, there is no need to have this in the curl package other than convenience. It might be a better fit for httr.
As noted above, there are a lot of edge cases. Because curl is an infrastructural package, I only want to expose things that work reliably and consistently.

If I were to write such a function I would recommend doing it in R, basically as in tidy_download. I.e. first download to a temporary file and then after the download has succeeded, check the response headers from the handle and possibly rename the file.

I do not recommend writing directly to the filename suggested by the content-disposition in C, because it is difficult to implement and makes malformed content-disposition headers possibly result in a crash or security vulnerability.

jennybc mentioned this issue Jun 19, 2019

Added download_file(); fixed missing word in write_disk() docs r-lib/httr#599

Closed

jeroen closed this as completed Jul 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse "Content-Disposition" headers #187

Parse "Content-Disposition" headers #187

jennybc commented Jun 4, 2019 •

edited

Loading

jennybc commented Jun 4, 2019

jay commented Jun 4, 2019

dmi3kno commented Jul 12, 2019

jeroen commented Jul 26, 2019

Parse "Content-Disposition" headers #187

Parse "Content-Disposition" headers #187

Comments

jennybc commented Jun 4, 2019 • edited Loading

jennybc commented Jun 4, 2019

jay commented Jun 4, 2019

dmi3kno commented Jul 12, 2019

jeroen commented Jul 26, 2019

jennybc commented Jun 4, 2019 •

edited

Loading