Handling of malformed data truncates string #88

Closed
romanofski opened this issue Jan 5, 2017 · 5 comments
Comments

@romanofski

I wonder whether the Encode module's handling of malformed data has regressed. I'm using the following one-liner, whose input contains a byte that is not valid UTF-8:

perl -MEncode -E 'say Encode::decode(qq(UTF-8), "Mu\361oz", Encode::FB_HTMLCREF)'

and expect it to return:

Muñoz

I've tested this on Fedora 24 with Perl 5.22 and Encode 2.84, which returns the entire string, including the replacement for the invalid character.

When I run the same decode on Fedora 25 with Perl 5.24 and Encode 2.88, I get a truncated string:

Muñ

Testing outside the Fedora packages, the problem seems to have been introduced in 2.87, since 2.86 still returns a non-truncated result.

Disclaimer: My Perl experience is very limited. Perhaps I've missed something important and this is expected behaviour.

@romanofski
Author

We've just noticed that this seems to happen only when two extra bytes follow the invalid byte, as in the example. If I add more, e.g.:

perl -MEncode -E 'say Encode::decode(qq(UTF-8), "Mu\361lololooz", Encode::FB_HTMLCREF)'

it correctly returns:

Muñlololooz

romanofski pushed a commit to romanofski/rpmgrill that referenced this issue Jan 5, 2017
This adjusts the input string based on a possible regression with
Encode (see dankogai/p5-encode#88).
@pali
Contributor

pali commented Jan 13, 2017

Read the documentation: https://metacpan.org/pod/Encode#FB_PERLQQ-FB_HTMLCREF-FB_XMLCREF

When you decode, \xHH is inserted for a malformed character, where HH is the hex representation of the octet that could not be decoded to utf8. When you encode, \x{HHHH} will be inserted, where HHHH is the Unicode code point (in any number of hex digits) of the character that cannot be found in the character repertoire of the encoding.

The HTML/XML character reference modes are about the same. In place of \x{HHHH}, HTML uses &#NNN; where NNN is a decimal number, and XML uses &#xHHHH; where HHHH is the hexadecimal number.

@pali
Contributor

pali commented Jan 13, 2017

You are probably hitting the problem fixed in pull request #84. Can you try the version from git master?

@romanofski
Author

Tested it with git master (HEAD at a70d0f6) and the problem seems to be fixed. I went back to b426e97 and ran into the problem again. I suppose the issue can be closed. Many thanks!

@pali
Contributor

pali commented Jan 26, 2017

If you can no longer reproduce the problem with the latest git version, then it is really fixed and you can close this bug.
