Always charset decode, regardless of content type. #90

dracos · 2017-11-14T20:37:11Z

For example, application/json; charset=utf-8 is one obvious contender that should be decoded. If no charset is given or detected, assume none rather than ISO-8859-1.

Fixes #72. Fixes #36.

For example, `application/json; charset=utf-8` is one obvious contender that should be decoded. If no charset is given or detected, assume none rather than ISO-8859-1.

vanHoesel · 2018-03-11T14:45:46Z

Rational

@dracos , I can see your point of view, but I'd strongly disagree with this pull-request.

application/json is a binary format, that just happens to look human readable; other binary formats like image/jpeg are much harder to understand by humans.

So, the only part that $mess->decoded_content can do is decode the 'transmission encoding', from gzip to unzipped etc. Would this had been a text/* content, than it could be 'decoded' according to the charset that comes as an (not so) optional argument.

RTFRFC

See RFC-6657 – MIME Charset Default Update §3 New Rules for Default "charset" Parameter Values for "text/*" Media Types.

Basically, every text/* MIME-Type SHOULD have it's specific default and senders of messages SHOULD include the charset parameter (except for payloads that can have their own internal specifications like text/html and text/xml - those SHOULD NOT have a charset parameter to avoid conflicts)

Now, regarding to application/json - which obviously is NOT a text/* type, it is said in RFC 2046 – Media Types §4.1.2 Charset Parameter

Other media types than subtypes of "text" might choose to employ the charset parameter as defined here, but ...

Which means that the binary application/json can 'borrow' that parameter for it's own purpose. JSON itself specifies in RFC 7158 – JSON §8.1 Character Encoding that:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Recommendation

@vanHoesel, @dracos and @everyone-else

# get the 'Content Encoding' decoded representational data 
my $content_data = $mess->decoded_content();
# this is the 'binary data` for the payload: application/json
# from here you are on your own, just as if it would be `video/mpeg`

# not entirely on your own, here is how to get your data structure
my $content_type = $mess->headers('Content-Type);
my $charset = ( $content_type =~ m/charset=([-a-z]+)/i )->[0] || 'utf-8';
my $perl_struct = JSON->new->decode (decode $charset, $content_data);

Conclusion

This merge-request (and possibly other related to charset/encoding issues) SHOULD be declined

dracos · 2018-03-11T16:26:08Z

You could perhaps approach this with a slightly less confrontational/rude attitude, or at least that's how it reads here :(

The JSON RFC that you quote explains that it must be encoded in Unicode, as you say. So ignoring any other issues about charset or not (and note I do agree with your rationale) would you be happy if decoded_content did decode the content as Unicode if the content type was application/json because according to the RFC that must be how it is encoded? This would seem consistent with the purpose and activity of the package.

Alternatively, would it be okay if decoded_content was changed to decode any content type - but only if the optional charset override was provided (so you're explicitly telling it that you want it to try and decode something)?

I have already dealt with this in my own code, but I was only trying to help improve this code, even if in the end it is only some more explanation in the documentation. Thanks for your time.

vanHoesel · 2018-03-11T16:29:23Z

From HTTP::Mesage $mess->decoded_content

$mess->decoded_content( %options )

Returns the content with any Content-Encoding undone and for textual content the raw content encoded to Perl's Unicode strings. If the Content-Encoding or charset of the message is unknown this method will fail by returning undef.

Question is though ... what is textual content ? is that MIME type type/* or anything that can be read by (compared to computers, super intelligent) humans.

Decoding application/xml was probably not a good idea.

Where as decoding text/xml (which is an alias) would be a good idea, but (as of RFC 6657 §3.a )any charset definition with text/xml would be discouraged, as it can conflict with the one inside the XML itself.

vanHoesel · 2018-03-11T16:34:27Z

Sorry @dracos , in no way did I want to sound rude

vanHoesel · 2018-03-11T16:40:54Z

arguably - trying to be consistent with the package - we could add another 'intelligent' thing inside decoded_content and make another exception rule for it. But your proposal was quite a big shortcut, passing by all the rules we have for HTTP / MIME / Charsets

dracos · 2018-03-11T16:43:15Z

Yes, I realise that now, and thanks for the detailed background information. As I said, I was just trying to be helpful, and hope that something can still come out of it :)

The docs could certainly expand on "textual content" – and yes, I think my idea would be to what I mentioned above, as you also suggest, having a special case of it knowing about the common application/json format (and I'd be happy to try and code that up).

"Decoding application/xml was probably not a good idea." - it (and ...+xml) do have charset in their media type definitions, at least, in the same way as text/xml (apart from fallbacks).

dracos · 2018-03-11T16:44:47Z

I'll close this and come back with something specifically for application/json for consideration.

vanHoesel · 2018-03-11T19:00:04Z

As you may have understood, me personally am a bit resistant in having modules and frameworks doing the 'smart' thing, rather then making things explicit for users.

If we would update the documentation and «expand on "textual content"», then ...

I will not oppose a merge-request as already proposed in #36 (decoded_content isn't decoding charset if content-type is application/json).

But it would also give some leeway to actually decode anything that does have a charset ... as mentioned by @SineSwiper in #74 (->decoded_content should decode application/json, etc [rt.cpan.org #82963) and remove the entire

if ($self->content_is_text || (my $is_xml = $self->content_is_xml)) {

Still, being smart will be tricky, as specifications do change and thus will either start breaking when updating to new specs ... or not update at all, being the default for text/* a good example.

Always charset decode, regardless of content type.

530ca55

For example, `application/json; charset=utf-8` is one obvious contender that should be decoded. If no charset is given or detected, assume none rather than ISO-8859-1.

dracos mentioned this pull request Nov 14, 2017

Possible character encoding problem in 2.2 mysociety/fixmystreet#1905

Closed

This was referenced Mar 11, 2018

->decoded_content should decode application/json, etc [rt.cpan.org #82963] #72

Closed

decoded_content isn't decoding charset if content-type is application/json #36

Closed

dracos closed this Mar 11, 2018

dracos mentioned this pull request Mar 11, 2018

Allow decoding of application/json and application/javascript. #99

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always charset decode, regardless of content type. #90

Always charset decode, regardless of content type. #90

dracos commented Nov 14, 2017

vanHoesel commented Mar 11, 2018 •

edited

Loading

dracos commented Mar 11, 2018

vanHoesel commented Mar 11, 2018

vanHoesel commented Mar 11, 2018

vanHoesel commented Mar 11, 2018 •

edited

Loading

dracos commented Mar 11, 2018

dracos commented Mar 11, 2018

vanHoesel commented Mar 11, 2018

Always charset decode, regardless of content type. #90

Always charset decode, regardless of content type. #90

Conversation

dracos commented Nov 14, 2017

vanHoesel commented Mar 11, 2018 • edited Loading

Rational

RTFRFC

Recommendation

Conclusion

dracos commented Mar 11, 2018

vanHoesel commented Mar 11, 2018

vanHoesel commented Mar 11, 2018

vanHoesel commented Mar 11, 2018 • edited Loading

dracos commented Mar 11, 2018

dracos commented Mar 11, 2018

vanHoesel commented Mar 11, 2018

vanHoesel commented Mar 11, 2018 •

edited

Loading

vanHoesel commented Mar 11, 2018 •

edited

Loading