Allow decoding of application/json and application/javascript. #99

dracos · 2018-03-11T17:05:09Z

For application/json, this detects whether the content is UTF-8, UTF-16, or UTF-32 and tries to return it decoded. It honours the charset and default_charset options. It also allows text/json (if UTF-8).
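As a rough illustration of the kind of detection described here, the sketch below guesses a JSON body's Unicode encoding from where NUL octets fall among its first four bytes, which is the heuristic given in RFC 4627 §3. The helper name `guess_json_charset` is hypothetical and is not taken from this PR, whose implementation may differ; byte order marks are deliberately not handled.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the PR): guess the Unicode encoding of a
# JSON body from where NUL octets fall among its first four bytes.
# The 32-bit patterns are checked first because they also match the
# shorter 16-bit patterns.
sub guess_json_charset {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\x00[^\x00]/;
    return 'UTF-32LE' if $bytes =~ /\A[^\x00]\x00\x00\x00/;
    return 'UTF-16BE' if $bytes =~ /\A\x00[^\x00]/;
    return 'UTF-16LE' if $bytes =~ /\A[^\x00]\x00/;
    return 'UTF-8';    # no NUL pattern matched; BOMs are not examined
}

# Round trip: encode a sample document, guess its encoding, decode it.
my $json  = qq({"name":"caf\x{e9}"});
my $bytes = encode('UTF-16LE', $json);
my $guess = guess_json_charset($bytes);
my $text  = decode($guess, $bytes);
```

The heuristic relies on a JSON text starting with ASCII characters, so it is only a guess for arbitrary input.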

It also decodes application/javascript, for which RFC 4329 permits a charset parameter, so it can be treated in the same way XML is.

This is a new approach to #90, which was too broad. Fixes #36. Fixes the main part of #72.

vanHoesel

Please also expand on "textual content" in the POD for =item $mess->decoded_content

vanHoesel · 2018-03-11T19:36:15Z

lib/HTTP/Headers.pm

+    my $ct = shift->content_type;
+    # text/json is not standard but still used by various servers.
+    # No issue including it as well.
+    return $ct eq 'application/json' || $ct eq 'text/json';


Would you please broaden this criterion so that it also includes application/*+json?
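A minimal sketch of what the broadened check could look like. The function name `is_json_type` is hypothetical, and the sketch assumes it is given a bare, lowercased media type with any parameters already stripped.

```perl
use strict;
use warnings;

# Hypothetical predicate (not the PR's code). Expects a bare, lowercased
# media type with parameters such as "; charset=..." already removed.
sub is_json_type {
    my ($ct) = @_;
    return 1 if $ct eq 'application/json' || $ct eq 'text/json';
    # Structured-syntax suffix, e.g. application/hal+json
    return 1 if $ct =~ m{\Aapplication/[^/\s]+\+json\z};
    return 0;
}
```

Anchoring the pattern at both ends keeps look-alikes such as application/jsonp from matching.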

vanHoesel · 2018-03-11T19:52:17Z

lib/HTTP/Message.pm

+	    my $charset = lc(
+	        $opt{charset} ||
+		$opt{default_charset} ||
+		$self->content_charset ||


It is interesting that you chose not to include $self->content_type_charset, which specifically uses the charset from the Content-Type header (as in application/json; charset=utf-16). I'd have thought that would be the preferred way.

Also, sending a BOM is discouraged, according to RFC 7158 – JSON §8.1 String and Character Issues / Character Encoding

So I'd certainly suggest adding content_type_charset before content_charset
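To make the precedence question concrete, here is a stand-alone sketch. Both function names, the argument keys, and the exact ordering (explicit override, then the Content-Type parameter, then a sniffed value, then the default) are assumptions for illustration, not the PR's code or HTTP::Message's API.

```perl
use strict;
use warnings;

# Hypothetical: pull the charset parameter out of a raw Content-Type value.
# Deliberately returns an explicit undef so that the call keeps its slot
# when used inside a list.
sub header_charset {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return lc $1 if $content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i;
    return undef;
}

# Hypothetical: the first non-empty candidate wins. The order shown is an
# assumption; it puts the declared header charset ahead of a sniffed one.
sub pick_charset {
    my (%arg) = @_;
    my @candidates = (
        $arg{charset},                         # caller's explicit override
        header_charset($arg{content_type}),    # declared in Content-Type
        $arg{sniffed},                         # guessed from the body
        $arg{default_charset},                 # caller's fallback
    );
    for my $c (@candidates) {
        return lc $c if defined $c && length $c;
    }
    return undef;    # nothing known; the caller decides what to do
}
```

With this ordering, a response declaring charset=utf-16 is decoded as UTF-16 even when the body looks like UTF-8, unless the caller overrides it.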

Sorry, my reading of what you wrote in #90 was that you'd want to ignore the charset header; if we're happy to use it (I certainly am!), then we could have it take the same code flow as text/* which is easier. Updated.

If you want decoded_content to do something smart and extend "textual content" beyond text/* types (as we apparently already do for application/xml), then use what has been specified in the Content-Type.

What I meant was that this behaviour is outside the specs: only text/* types have a charset ... so the charset should be ignored and the content passed on to the processor (that would be the module parsing the content body into a Perl structure). It is 'binary data', and 'we' shouldn't try to be smart but leave it to the processor. Just ask yourself: why munge binary data for JSON or XML and not for audio samples ...

but okay ... let's be 'smart' - but that is not my preference

vanHoesel · 2018-03-12T00:31:48Z

Hey @dracos ... I just realised ... we can not release this as is

Since application/json has not been automagically decoded before, people had to do it manually themselves, working around this caveat ... guess what will happen to anyone who has ever implemented an API client using JSON data exchange

...

yep, hell breaks loose and everyone then gets double-decoded rubbish!

Sorry man, for all the hard work you put into this; I guess the only way to get it fixed is by using some additional module, as is done with HTTP::Message::JSON, which comes bundled with LWP::JSON::Tiny

... correct me if I'm wrong
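The double-decoding concern raised above can be reproduced with Encode alone. This is a sketch of the failure mode under the assumption that an existing client decodes the UTF-8 octets itself; it does not use HTTP::Message.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text = "caf\x{e9}";               # the characters the server meant to send
my $wire = encode('UTF-8', $text);    # the octets on the wire

# Current behaviour: application/json content comes back as octets, so an
# existing client decodes it once itself.
my $once = decode('UTF-8', $wire);

# If the library began decoding as well, the same client code would now be
# decoding an already-decoded character string.
my $twice = decode('UTF-8', $once);

printf "once ok: %s, twice ok: %s\n",
    ($once  eq $text ? 'yes' : 'no'),
    ($twice eq $text ? 'yes' : 'no');
```

Pure-ASCII payloads survive a second decode unchanged, which is why this kind of breakage tends to surface only once non-ASCII data arrives.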

dracos · 2018-03-12T09:08:44Z

Hi, yes, I see what you mean. One possible solution would be to add a new option to decoded_content that says you know about and want this behaviour, though I'm not sure what it could be called.

vanHoesel · 2018-03-12T09:20:27Z

On my way to $work I was thinking about this. Instead of adding more params, what about having a method whose name says what it does ...

unpacked_and_charset_decoded_content

and be specific in the documentation about what we mean by 'charset_decoded', which content types it applies to, and in which order the method picks its charset

vanHoesel · 2018-03-12T09:27:06Z

oh .. 'unpacked' .. should that then actually return a 203 Non-Authoritative Information, since we have transformed the data ?

oalders · 2018-03-14T13:07:25Z

As far as adding a new method or allowing a param to be passed to decoded_content, I'd be interested to hear from @karenetheridge, @skaji and @genio.

topaz · 2021-10-25T20:13:23Z

I'm very interested in a fix for this, or at the very least a more explicit warning in the documentation. Some backward-compatible way to always automatically decode charsets declared as a suffix of the Content-Type header - rather than just for "textual" content types - would be appreciated.

Currently, the documentation uses "...and for textual content...", presumably as a way to tersely gloss over the current behavior of only decoding charsets attached to text/*, application/xml, and *+xml content types. Because I misunderstood the intent of the documentation and had insufficient testing for character set decoding edge cases I assumed were being handled, this behavior has caused an issue in a production system I operate. An explicit description of this behavior in the documentation might prevent others from the same fate.

vanHoesel · 2021-10-25T21:23:20Z

Hey @topaz ,

3½ yrs later . . .

would you please be so kind as to come up with a proposal for the updated POD, such that it might have prevented your production outage?

topaz · 2021-10-25T21:39:09Z

I can make this into a pull request, but since I see this PR already has code on it and I'm not sure what happens if two different branches are associated with a PR, I'll just put it here for now. If you'd like a PR of this instead, just let me know.

Replace the first sentence from this POD entry with:

Returns the content with any C<Content-Encoding> undone and, for textual content
(C<Content-Type> values starting with C<text/>, exactly matching
C<application/xml>, or ending with C<+xml>), the raw content's character set
decoded into Perl's Unicode string format. Note that this
L<does not currently|https://github.com/libwww-perl/HTTP-Message/pull/99>
attempt to decode declared character sets for any other content types like
C<application/json> or C<application/javascript>.

(Again, just to be clear, my long-term preference would be to have this capability added as a feature somehow.)
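The rule spelled out in the proposed POD wording above can be written as a small predicate. This is a sketch of the documented behaviour only; the name `is_textual_type` is hypothetical, and it assumes a bare, lowercased media type.

```perl
use strict;
use warnings;

# Hypothetical predicate mirroring the POD wording: text/*, exactly
# application/xml, or anything ending in +xml counts as "textual".
sub is_textual_type {
    my ($ct) = @_;
    return 1 if $ct =~ m{\Atext/};
    return 1 if $ct eq 'application/xml';
    return 1 if $ct =~ m{\+xml\z};
    return 0;
}
```

Under this rule application/json and application/javascript are not textual, so a charset declared for them is left undecoded.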

oalders · 2021-10-29T16:05:11Z

@topaz a pull request would be helpful for this. We could merge that for the time being so that the documentation of the current behaviour is clearer.

topaz · 2021-10-30T22:58:31Z

@oalders #166

kcaran · 2024-04-12T19:24:23Z

Found this issue after trying to figure out why my queries weren't being decoded correctly. I'm always a little disappointed when things can't get fixed because someone might get the wrong decoding. They could have (should have?) used $r->content instead of $r->decoded_content, since it really isn't decoded!

oalders · 2024-04-12T19:27:28Z

@kcaran do you have any suggestions for improving this?

kcaran · 2024-04-15T15:52:54Z

Hi Olaf. Maybe I'm misinterpreting its purpose, but I would expect decoded_content() to decode the response based on the charset and content() to return the raw data. But I wouldn't want decoded_content() to return an error if the response didn't have a charset - the fallback would be to return the raw data.
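The fallback described here might look like the following sketch. The function `decode_or_raw` is hypothetical and stands apart from HTTP::Message; it returns the octets untouched when no usable charset is given or when decoding fails.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode when a usable charset is known, otherwise
# hand back the raw octets instead of raising an error.
sub decode_or_raw {
    my ($bytes, $charset) = @_;
    return $bytes unless defined $charset && length $charset;

    # find_encoding returns undef for names Encode does not recognise.
    my $enc = find_encoding($charset) or return $bytes;

    # FB_CROAK makes malformed input die so that eval can catch it;
    # LEAVE_SRC keeps $bytes intact for the fallback return.
    my $text = eval {
        $enc->decode($bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : $bytes;
}
```

A caller cannot tell from the return value alone whether decoding happened, which is one reason a separate method or explicit option was suggested earlier in the thread.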

My suggestion would be to accept the pull request. :-)

haarg · 2024-04-15T19:50:36Z

This PR needs tests before it can be accepted.

dracos changed the title ~~Allow decoding of application/json.~~ Allow decoding of application/json and application/javascript. Mar 11, 2018

vanHoesel reviewed Mar 11, 2018

dracos force-pushed the application-json branch 3 times, most recently from 328b154 to 8f82a98 Compare March 11, 2018 20:47

dracos added 2 commits March 11, 2018 20:49

Allow decoding of application/json.

421246d

This will detect if the file is UTF-8, UTF-16, or UTF-32, and try and return the content decoded. It will allow use of the charset/ default_charset options. Also allow text/json (if UTF-8).

Allow decoding of application/javascript.

3a6f948

This media type has a charset parameter as per RFC4329, so can be treated in the same way that XML is.

dracos force-pushed the application-json branch from 8f82a98 to 3a6f948 Compare March 11, 2018 20:49

This was referenced Mar 12, 2018

decoded_content isn't decoding charset if content-type is application/json #36

Closed

->decoded_content should decode application/json, etc [rt.cpan.org #82963] #72

Closed

This was referenced Oct 30, 2021

Clarify POD for decoded_content charset decoding behavior topaz/HTTP-Message#1

Closed

Clarify documentation for decoded_content #166

Merged
