Handling of malformed data truncates string #88

romanofski · 2017-01-05T04:59:26Z

I wonder if the Encode module's behaviour when handling malformed data has regressed. I'm using the following one-liner with an invalid (non-UTF-8) byte:

perl -MEncode -E 'say Encode::decode(qq(UTF-8), "Mu\361oz", Encode::FB_HTMLCREF)'

and expect it to be returned as:

Mu&#241;oz

I've tested this on Fedora 24 with Perl 5.22 and Encode 2.84 which returns the entire string including the replaced invalid characters.

When I try decode on Fedora 25 with Perl 5.24 and Encode 2.88 I get a truncated string:

Mu&#241;

Testing without the Fedora packages, it seems the problem was introduced in 2.87, since 2.86 still returns a non-truncated result.

Disclaimer: My Perl experience is very limited. Perhaps I've missed something important and this is expected behaviour.


romanofski · 2017-01-05T05:30:23Z

We've just noticed that this only seems to happen if exactly two extra bytes follow the invalid character, as in the example. If I add more, e.g.:

perl -MEncode -E 'say Encode::decode(qq(UTF-8), "Mu\361lololooz", Encode::FB_HTMLCREF)'

it correctly returns:

Mu&#241;lololooz


pali · 2017-01-13T12:23:53Z

Read the documentation: https://metacpan.org/pod/Encode#FB_PERLQQ-FB_HTMLCREF-FB_XMLCREF

When you decode, \xHH is inserted for a malformed character, where HH is the hex representation of the octet that could not be decoded to utf8. When you encode, \x{HHHH} will be inserted, where HHHH is the Unicode code point (in any number of hex digits) of the character that cannot be found in the character repertoire of the encoding.

The HTML/XML character reference modes are about the same. In place of \x{HHHH}, HTML uses &#NNN; where NNN is a decimal number, and XML uses &#xHHHH; where HHHH is the hexadecimal number.

pali · 2017-01-13T15:19:19Z

You are probably facing the problem fixed in pull request #84. Can you try the version from git master?

romanofski · 2017-01-17T04:37:48Z

Tested it with git master (HEAD at a70d0f6) and the problem seems to be fixed. I went back to b426e97 and ran into the problem again. I suppose the issue can be closed. Many thanks!

pali · 2017-01-26T22:00:46Z

If you can no longer reproduce the problem with the latest git version, then it really is fixed and you can close this bug.

romanofski mentioned this issue Jan 5, 2017

Tests work on f25 default-to-open/rpmgrill#14

Merged

romanofski pushed a commit to romanofski/rpmgrill that referenced this issue Jan 5, 2017

Adjust the input for sanitize_text testing

37f9a7c

This adjusts the input string based on a possible regression with Encode (see dankogai/p5-encode#88).

romanofski pushed a commit to romanofski/rpmgrill that referenced this issue Jan 5, 2017

Adjust the input for sanitize_text testing

c33a016

This adjusts the input string based on a possible regression with Encode (see dankogai/p5-encode#88).

romanofski closed this as completed Jan 27, 2017
