Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URLencoding breakes S3 key #77

Closed
realknorke opened this issue Jun 17, 2021 · 7 comments · Fixed by #78
Closed

URLencoding breakes S3 key #77

realknorke opened this issue Jun 17, 2021 · 7 comments · Fixed by #78

Comments

@realknorke
Copy link

When uploading a file to S3 like this:

>>> from s3path import S3Path
>>> s=S3Path('/my-bucket/tmp/foo=bar/objname')
>>> s.write_text("foobar")
/path/to/my/venv/lib64/python3.8/site-packages/smart_open/s3.py:220: UserWarning: ignoring the following deprecated transport parameters: ['multipart_upload_kwargs', 'object_kwargs', 'resource_kwargs', 'session']. See <https://github.com/RaRe-Technologies/smart_open/blob/develop/MIGRATING_FROM_OLDER_VERSIONS.rst> for details
  warnings.warn(message, UserWarning)
6

The location is transformed and the = is urlencoded by a as_uri() call. In S3 I now have a tmp/foo%3Dbar/ prefix. But I expected an tmp/foo=bar/ prefix.

The as_uri() call in s3path.py leads to unexpected behaviour and breakes the behaviour of s3path <=v0.3

@realknorke
Copy link
Author

Maybe related to #59

@maresb
Copy link
Contributor

maresb commented Jun 18, 2021

I think there is a good discussion to be had regarding the precise meaning of RFC2396, and what smart_open and boto3 mean specifically by "URI", and what the .as_uri() method should return.

It might be that there are upstream "bugs" in smart_open and/or boto3 regarding the escaping and lack of escaping of URIs. However, I don't yet understand the exact conditions outlined in the RFC for when it's okay to not escape characters in a URI.

In my first reading, it appears that the encoding of a URI is context-dependent, and characters only need to be escaped when not escaping them would lead to ambiguity. Thus it may be acceptable behavior to have a URI which is not url-encoded.

Some passages from the RFC which support this view:

  • If the data for a URI component would conflict with the reserved purpose, then the conflicting data must be escaped before forming the URI.

  • Characters in the "reserved" set are not reserved in all contexts. The set of characters actually reserved within any given URI component is defined by that component. In general, a character is reserved if the semantics of the URI changes if the character is replaced with its escaped US-ASCII encoding.

In smart_open, they provide the following S3 URI template:

[secret:key@][host[:port]@]bucket/object

I'm not sure to what extent s3path handles the optional components of this URI.

In the meantime, I have a patch incoming, which should "fix" the bug so that it gets smart_open working again...

@maresb
Copy link
Contributor

maresb commented Jun 21, 2021

Thanks @liormizr for the merge!!!

@liormizr
Copy link
Owner

Deployed
Version: 0.3.02

@JohnHBrock
Copy link

It looks like #78 broke S3Path.exists() whenever the URL contains special characters. The problem is that S3Path.key is used for the existence check, which still does URI encoding.

@liormizr
Copy link
Owner

@JohnHBrock can you please send me an example?

@JohnHBrock
Copy link

I think this is OK actually, sorry for the false alarm. We had a workaround in place for the URL encoding issue, and that workaround is subtly broken by #78, but that's fine because #78 lets us remove the workaround. 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants