Prohibit surrogate code units #290

Merged

@gibson042 (Collaborator) commented Jul 18, 2022

Surrogate code units (U+D800 through U+DFFF) cannot be encoded into UTF-8.
Ref #268
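
As a rough illustration of the UTF-8 constraint, the sketch below checks code points around the surrogate block, which Unicode defines as U+D800 through U+DFFF (high surrogates U+D800 through U+DBFF, low surrogates U+DC00 through U+DFFF). It relies on Python's strict `utf-8` codec rejecting lone surrogates. The helper names are hypothetical and are not taken from this pull request.

```python
# Hypothetical helpers for illustration; not taken from the pull request.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_surrogate(cp: int) -> bool:
    """Return True if cp lies in the Unicode surrogate block."""
    return SURROGATE_FIRST <= cp <= SURROGATE_LAST


def encodes_to_utf8(cp: int) -> bool:
    """Return True if Python's strict UTF-8 codec accepts the code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Probe both edges of the block and the high/low surrogate boundary.
for cp in (0xD7FF, 0xD800, 0xDBFF, 0xDC00, 0xDFFF, 0xE000):
    print(f"U+{cp:04X} surrogate={is_surrogate(cp)} utf8={encodes_to_utf8(cp)}")
```

Under these assumptions, the code points just outside the block encode normally, while every code point inside it is rejected by the codec.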

This covers the most important part of #268, which is UTF-8 compatibility. I would also like to prohibit the noncharacters that are permanently reserved for Unicode-internal use (U+FDD0 through U+FDEF, plus U+nFFFE and U+nFFFF where n is 0x0 through 0x10; two of these, U+FFFE and U+FFFF, are also invalid XML characters). The same goes for control characters (U+0000 through U+001F and U+007F through U+009F; the first 32 are not valid unescaped inside a JSON string, nor in XML apart from the specific exceptions for tab, line feed, and carriage return). Those can be addressed in a followup. Any of them that should be allowed in string contents will need a corresponding escape sequence, similar to how \\ represents a single \.
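
The two additional ranges described above can be sketched as simple predicates. This is only an illustration of the code point ranges named in the comment, not the grammar change proposed in any followup, and the function names are hypothetical.

```python
# Hypothetical predicates for illustration; names are not from the project.
MAX_CODE_POINT = 0x10FFFF


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # (cp & 0xFFFE) == 0xFFFE matches U+nFFFE and U+nFFFF for n in 0x0..0x10.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_control(cp: int) -> bool:
    """C0 controls (U+0000..U+001F), DEL (U+007F), and C1 controls (U+0080..U+009F)."""
    return 0x00 <= cp <= 0x1F or 0x7F <= cp <= 0x9F


noncharacters = sum(is_noncharacter(cp) for cp in range(MAX_CODE_POINT + 1))
controls = sum(is_control(cp) for cp in range(MAX_CODE_POINT + 1))
print(noncharacters, controls)
```

Counting over the whole code space gives 66 noncharacters (32 in the U+FDD0 block plus 2 per plane) and 65 control characters (32 C0 controls, DEL, and 32 C1 controls).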

@mihnita (Collaborator) left a comment

Thank you!
M.

@eemeli merged commit 012d13a into unicode-org:develop on Aug 17, 2022
echeran pushed a commit that referenced this pull request Sep 20, 2022