Incorrect parsing of UTF-16 escape sequences #37

schlndh · 2021-07-01T11:46:53Z

I'm trying to parse a string that contains the character 😀 (U+1F600) escaped as a UTF-16 surrogate pair, but the result is incorrect. I debugged it a little, and the issue seems to be that the UTF-16 code units are submitted to unicodeToUtf8 one by one, rather than first being decoded into a single Unicode code point that is then submitted to unicodeToUtf8.

Here is code that reproduces the problem:

// This doesn't work.
echo \Peast\Peast::latest('"\uD83D\uDE00"')->parse()->getBody()[0]->getExpression()->getValue() . "\n";
// These work.
echo json_decode('"\uD83D\uDE00"') . "\n";
echo \Peast\Syntax\Utils::unicodeToUtf8(0x0001F600) . "\n";
echo \Peast\Peast::latest('"\u{1F600}"')->parse()->getBody()[0]->getExpression()->getValue() . "\n";
echo \Peast\Peast::latest('"😀"')->parse()->getBody()[0]->getExpression()->getValue() . "\n";

// This also works (i.e. it shows the smiley face).
console.log("\uD83D\uDE00");

I tested this with PHP 7.4 and the latest master (b33fa0d).
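
The difference between converting each UTF-16 unit separately and decoding the pair first can be sketched as follows. This is only an illustration: `encodeCodePoint` and `combineSurrogates` are hypothetical helpers, not part of Peast, and whether `unicodeToUtf8` behaves exactly like `encodeCodePoint` for surrogate values is an assumption.

```php
<?php
// Hypothetical helper (not Peast code): encode one code point as UTF-8 bytes.
// It deliberately does not reject surrogates, to mimic a per-unit conversion.
function encodeCodePoint(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: combine a high surrogate (0xD800-0xDBFF) and a low
// surrogate (0xDC00-0xDFFF) into the code point they represent.
function combineSurrogates(int $high, int $low): int
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

$high = 0xD83D;
$low = 0xDE00;

// One unit at a time: two 3-byte sequences, which is not valid UTF-8.
$perUnit = encodeCodePoint($high) . encodeCodePoint($low);
// Pair decoded first: a single 4-byte sequence for U+1F600.
$combined = encodeCodePoint(combineSurrogates($high, $low));

echo bin2hex($perUnit) . "\n";  // eda0bdedb880
echo bin2hex($combined) . "\n"; // f09f9880
```

The second byte sequence is the one that `json_decode` and the `\u{1F600}` form produce.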


mck89 · 2021-07-11T16:39:05Z

I think the form in which a single character is represented by two code units is Modified UTF-8 with surrogate pairs, which is used when converting from UTF-16 to UTF-8.

I'm still trying to understand whether that is the case and, if so, whether there is some way to group those characters without refactoring the one-by-one logic (this problem should affect strings, templates and variable names).

I'm very busy right now, but I will try to work on it in a few weeks.
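
One way to group the two escapes without touching per-escape conversion could be a pre-processing pass that rewrites each adjacent surrogate pair into the `\u{...}` form, which the report above shows is already parsed correctly. This is a rough sketch, not the fix that was eventually released. `mergeSurrogateEscapes` is a hypothetical name. The sketch also ignores escaped backslashes (for example `\\uD83D`) and assumes the `\u{...}` syntax is acceptable for the target ECMAScript version.

```php
<?php
// Hypothetical pre-processing step (not Peast code): rewrite every
// "\uD8xx-\uDBxx" escape immediately followed by a "\uDCxx-\uDFxx" escape
// into a single "\u{...}" code point escape.
function mergeSurrogateEscapes(string $source): string
{
    // High surrogate escape followed directly by a low surrogate escape.
    $pattern = '/\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            $high = hexdec($m[1]);
            $low = hexdec($m[2]);
            $codePoint = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
            return sprintf('\\u{%X}', $codePoint);
        },
        $source
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $source;
}

echo mergeSurrogateEscapes('"\uD83D\uDE00"') . "\n"; // "\u{1F600}"
```

Escapes that are not part of a pair, such as `\u0041`, pass through unchanged.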

mck89 · 2021-07-24T11:39:56Z

I've just released a new version with surrogate pair support in strings and templates. There was no need to change variable name parsing, since surrogate pairs are not allowed in variable names. Thank you for reporting!

mck89 closed this as completed in cd50aa9 Jul 24, 2021
