Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always charset decode, regardless of content type. #90

Closed
wants to merge 1 commit into from

Conversation

dracos
Copy link

@dracos dracos commented Nov 14, 2017

For example, application/json; charset=utf-8 is one obvious contender that should be decoded. If no charset is given or detected, assume none rather than ISO-8859-1.

Fixes #72. Fixes #36.

For example, `application/json; charset=utf-8` is one obvious contender
that should be decoded. If no charset is given or detected, assume none
rather than ISO-8859-1.
@vanHoesel
Copy link
Member

vanHoesel commented Mar 11, 2018

Rational

@dracos , I can see your point of view, but I'd strongly disagree with this pull-request.

application/json is a binary format, that just happens to look human readable; other binary formats like image/jpeg are much harder to understand by humans.

So, the only part that $mess->decoded_content can do is decode the 'transmission encoding', from gzip to unzipped etc. Would this had been a text/* content, than it could be 'decoded' according to the charset that comes as an (not so) optional argument.

RTFRFC

See RFC-6657 – MIME Charset Default Update §3 New Rules for Default "charset" Parameter Values for "text/*" Media Types.

Basically, every text/* MIME-Type SHOULD have it's specific default and senders of messages SHOULD include the charset parameter (except for payloads that can have their own internal specifications like text/html and text/xml - those SHOULD NOT have a charset parameter to avoid conflicts)

Now, regarding to application/json - which obviously is NOT a text/* type, it is said in RFC 2046 – Media Types §4.1.2 Charset Parameter

Other media types than subtypes of "text" might choose to employ the charset parameter as defined here, but ...

Which means that the binary application/json can 'borrow' that parameter for it's own purpose. JSON itself specifies in RFC 7158 – JSON §8.1 Character Encoding that:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Recommendation

@vanHoesel, @dracos and @everyone-else

# get the 'Content Encoding' decoded representational data 
my $content_data = $mess->decoded_content();
# this is the 'binary data` for the payload: application/json
# from here you are on your own, just as if it would be `video/mpeg`

# not entirely on your own, here is how to get your data structure
my $content_type = $mess->headers('Content-Type);
my $charset = ( $content_type =~ m/charset=([-a-z]+)/i )->[0] || 'utf-8';
my $perl_struct = JSON->new->decode (decode $charset, $content_data);

Conclusion

This merge-request (and possibly other related to charset/encoding issues) SHOULD be declined

@dracos
Copy link
Author

dracos commented Mar 11, 2018

You could perhaps approach this with a slightly less confrontational/rude attitude, or at least that's how it reads here :(

The JSON RFC that you quote explains that it must be encoded in Unicode, as you say. So ignoring any other issues about charset or not (and note I do agree with your rationale) would you be happy if decoded_content did decode the content as Unicode if the content type was application/json because according to the RFC that must be how it is encoded? This would seem consistent with the purpose and activity of the package.

Alternatively, would it be okay if decoded_content was changed to decode any content type - but only if the optional charset override was provided (so you're explicitly telling it that you want it to try and decode something)?

I have already dealt with this in my own code, but I was only trying to help improve this code, even if in the end it is only some more explanation in the documentation. Thanks for your time.

@vanHoesel
Copy link
Member

From HTTP::Mesage $mess->decoded_content

$mess->decoded_content( %options )

Returns the content with any Content-Encoding undone and for textual content the raw content encoded to Perl's Unicode strings. If the Content-Encoding or charset of the message is unknown this method will fail by returning undef.

Question is though ... what is textual content ? is that MIME type type/* or anything that can be read by (compared to computers, super intelligent) humans.

Decoding application/xml was probably not a good idea.

Where as decoding text/xml (which is an alias) would be a good idea, but (as of RFC 6657 §3.a )any charset definition with text/xml would be discouraged, as it can conflict with the one inside the XML itself.

@vanHoesel
Copy link
Member

Sorry @dracos , in no way did I want to sound rude

@vanHoesel
Copy link
Member

vanHoesel commented Mar 11, 2018

arguably - trying to be consistent with the package - we could add another 'intelligent' thing inside decoded_content and make another exception rule for it. But your proposal was quite a big shortcut, passing by all the rules we have for HTTP / MIME / Charsets

@dracos
Copy link
Author

dracos commented Mar 11, 2018

Yes, I realise that now, and thanks for the detailed background information. As I said, I was just trying to be helpful, and hope that something can still come out of it :)

The docs could certainly expand on "textual content" – and yes, I think my idea would be to what I mentioned above, as you also suggest, having a special case of it knowing about the common application/json format (and I'd be happy to try and code that up).

"Decoding application/xml was probably not a good idea." - it (and ...+xml) do have charset in their media type definitions, at least, in the same way as text/xml (apart from fallbacks).

@dracos
Copy link
Author

dracos commented Mar 11, 2018

I'll close this and come back with something specifically for application/json for consideration.

@vanHoesel
Copy link
Member

As you may have understood, me personally am a bit resistant in having modules and frameworks doing the 'smart' thing, rather then making things explicit for users.

If we would update the documentation and «expand on "textual content"», then ...

I will not oppose a merge-request as already proposed in #36 (decoded_content isn't decoding charset if content-type is application/json).

But it would also give some leeway to actually decode anything that does have a charset ... as mentioned by @SineSwiper in #74 (->decoded_content should decode application/json, etc [rt.cpan.org #82963) and remove the entire

if ($self->content_is_text || (my $is_xml = $self->content_is_xml)) {

Still, being smart will be tricky, as specifications do change and thus will either start breaking when updating to new specs ... or not update at all, being the default for text/* a good example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants