Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow decoding of application/json and application/javascript. #99

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dracos
Copy link

@dracos dracos commented Mar 11, 2018

This will, for application/json, detect if the file is UTF-8, UTF-16, or UTF-32, and try and return the content decoded. It will allow use of the charset/ default_charset options. Also allow text/json (if UTF-8).

It will also decode application/javascript which is permitted a charset parameter in RFC4329, so can be treated in the same way XML is.

New way of doing #90 which was too broad. Fixes #36. Fixes the main part of #72.

@dracos dracos changed the title Allow decoding of application/json. Allow decoding of application/json and application/javascript. Mar 11, 2018
Copy link
Member

@vanHoesel vanHoesel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also expand on "textual content" in the POD for =item $mess->decoded_content

my $ct = shift->content_type;
# text/json is not standard but still used by various servers.
# No issue including it as well.
return $ct eq 'application/json' || $ct eq 'text/json';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you please broaden this criteria, so it includes application/*+json too

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

my $charset = lc(
$opt{charset} ||
$opt{default_charset} ||
$self->content_charset ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is interesting that you choose to not include $self->content_type_charset, which specifically uses the charset from the Content-Type header (like application/json; charset=utf-16). I'd thought that that would be the preferred way.

Also, it is discouraged to send the BOM, according to RFC 7158 – JSON §8.1 String and Character Issues / Character Encoding

So, I'd suggest to certainly add content_type_charset before content_charset

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my reading of what you wrote in #90 was that you'd want to ignore the charset header; if we're happy to use it (I certainly am!), then we could have it take the same code flow as text/* which is easier. Updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want decoded_content do something smart, and extend "textual context" beyond text/* type (like we apparently also seem to do for application/xml ... then use what has been specified in the Content-Type

What I meant, was that this behaviour is outside the specs, only text/* have charset ... and thus should be ignored and should be passed in to the processor (that would be the module parsing the content-body into Perl structure). It is 'binary data' and 'we' shouldn't try to be smart but leave it to the processor. Just ask yourself, why munge on binary data for json or xml and not on audio-samples ...

but okay ... let's be 'smart' - but that is not my preference

@dracos dracos force-pushed the application-json branch 3 times, most recently from 328b154 to 8f82a98 Compare March 11, 2018 20:47
This will detect if the file is UTF-8, UTF-16, or UTF-32,
and try and return the content decoded. It will allow use
of the charset/ default_charset options.

Also allow text/json (if UTF-8).
This media type has a charset parameter as per RFC4329, so can
be treated in the same way that XML is.
@vanHoesel
Copy link
Member

Hey @dracos ... I just realised ... we can not release this as is

Since application/json has not been automagically been decoded before and people had to do it manually themselves, working around this caveat ... guess what will happen with anyone ever implemented an API-Client using JSON data exchange

...

yep, hell breaks loose and everyone gets then double decoded rubbish!

Sorry man, for all that hard work you put in this, guess the only way to get it fixed, is by using some additional module, as done with HTTP::Message::JSON that comes bundled with LWP::JSON::Tiny

... correct me if I'm wrong

@dracos
Copy link
Author

dracos commented Mar 12, 2018

Hi, yes, I see what you mean. One possible solution is a new option to decoded_content could be added that specifies that you know and want this behaviour, though I'm not sure what you could call it.

@vanHoesel
Copy link
Member

vanHoesel commented Mar 12, 2018

on my way to $work, I was thinking about this but instead of adding more params, what about having a method that is called what it is doing ...

unpacked_and_charset_decoded_content

and be specific in documentation of what we mean with 'charset_decoded', for which content-types it applies and in which order the method pick it's charset

@vanHoesel
Copy link
Member

oh .. 'unpacked' .. should that then actually return a 203 Non-Authoritative Information, since we have transformed the data ?

@oalders
Copy link
Member

oalders commented Mar 14, 2018

As far as adding a new method or allowing a param to be passed to decoded_content, I'd be interested to hear from @karenetheridge, @skaji and @genio.

@topaz
Copy link
Contributor

topaz commented Oct 25, 2021

I'm very interested in a fix for this, or at very least a more explicit warning in the documentation. Some backward-compatible way to always automatically decode charsets declared as a suffix of the Content-Type header - rather than just "textual" content types - would be appreciated.

Currently, the documentation uses "...and for textual content...", presumably as a way to tersely gloss over the current behavior of only decoding charsets attached to text/*, application/xml, and *+xml content types. Because I misunderstood the intent of the documentation and had insufficient testing for character set decoding edge cases I assumed were being handled, this behavior has caused an issue in a production system I operate. An explicit description of this behavior in the documentation might prevent others from the same fate.

@vanHoesel
Copy link
Member

Hey @topaz ,

3½ yrs later . . .

would you please be so kind to come up with a proposal for the updated POD, such that it might have prevented your production outage?

@topaz
Copy link
Contributor

topaz commented Oct 25, 2021

I can make this into a pull request, but since I see this PR already has code on it and I'm not sure what happens if two different branches are associated with a PR, I'll just put it here for now. If you'd like a PR of this instead, just let me know.

Replace the first sentence from this POD entry with:

Returns the content with any C<Content-Encoding> undone and, for textual content
(C<Content-Type> values starting with C<text/>, exactly matching
C<application/xml>, or ending with C<+xml>), the raw content's character set
decoded into Perl's Unicode string format. Note that this
L<does not currently|https://github.com/libwww-perl/HTTP-Message/pull/99>
attempt to decode declared character sets for any other content types like
C<application/json> or C<application/javascript>.

(Again, just to be clear, my long-term preference would be to have this capability added as a feature somehow.)

@oalders
Copy link
Member

oalders commented Oct 29, 2021

@topaz a pull request would be helpful for this. We could merge that for the time being so that the documentation of the current behaviour is clearer.

@topaz
Copy link
Contributor

topaz commented Oct 30, 2021

@oalders #166

@kcaran
Copy link

kcaran commented Apr 12, 2024

Found this issue after trying to figure out why my queries weren't being decoded correctly. I'm always a little disappointed when things can't get fixed because someone might get the wrong decoding. They could have (should have?) used $r->content instead of $r->decoded_content, since it really isn't decoded!

@oalders
Copy link
Member

oalders commented Apr 12, 2024

@kcaran do you have any suggestions for improving this?

@kcaran
Copy link

kcaran commented Apr 15, 2024

Hi Olaf. Maybe I'm misinterpreting its purpose, but I would expect decoded_content() to decode the response based on the charset and content() to return the raw data. But I wouldn't want decoded_content() to return an error if the response didn't have a charset - the fallback would be to return the raw data.

My suggestion would be to accept the pull request. :-)

@haarg
Copy link
Member

haarg commented Apr 15, 2024

This PR needs tests before it can be accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

decoded_content isn't decoding charset if content-type is application/json
6 participants