-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode 9 updates #70
Conversation
- New rules GB10/(12/13) are used to combine emoji-zwj sequences/ (force grapheme breaks every two RI codepoints). Unfortunately this breaks statelessness of grapheme-boundary determination. Deal with this by ignoring the problem in utf8proc_grapheme_break, and by hacking in a special case in decompose - ZWJ moved to its own boundclass, update what is now GB9 accordingly. - Add comments to indicate which rule a given case implements - The Number of bound classes Now exceeds 4 bits, expand to 8 and reorganize fields
Please note that evaluation of GB10 (grapheme breaks between emoji zwj sequences) | ||
and GB 12/13 (regional indicator code points) require knowledge of previous characters | ||
and are thus not handled by this function. This may result in an incorrect break before | ||
and E_Modifier class codepoint and an incorrectly missing break between two |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an E_Modifier?
This almost certainly needs version number bumps |
tbc == UTF8PROC_BOUNDCLASS_EXTEND) | ||
*last_boundclass = UTF8PROC_BOUNDCLASS_E_BASE; | ||
else | ||
*last_boundclass = tbc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to export this logic somehow so that we can use it in the Julia grapheme iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mhm, I believe just adding an extra out parameter to give you the override class should be fine.
Ok, I've updated the API to expose the state override. Since that's an ABI incompatible change, I've bumped the MAJOR version accordingly. Please review. |
@@ -19,9 +19,9 @@ UCFLAGS = $(CFLAGS) $(PICFLAG) $(C99FLAG) $(WCFLAGS) -DUTF8PROC_EXPORTS | |||
# not API compatibility: MAJOR should be incremented whenever *binary* | |||
# compatibility is broken, even if the API is backward-compatible | |||
# Be sure to also update these in MANIFEST and CMakeLists.txt! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you forget to update CMakeLists.txt too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you also need to update utf8proc.h
for the API version change.
Maybe better to do this first, then update #68? |
* matching the rules in Unicode 8.0.0. | ||
* | ||
* @warning If the state parameter is used, `utf8proc_grapheme_break` must be called | ||
* IN ORDER on ALL potentital breaks in a string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential
Do we need to hold off on doing the version bump if we want to do #68 right after? |
I think we can do a version bump for this (definitely needed), and if we want to merge further big changes we don't need to change the version again until we do the next release. |
Ok. |
Good to merge? |
There is a warning about an unused variable: lbc_override |
Oops, thanks. |
I also realized I made this static. Will fix. |
{ | ||
int lbc_override = lbc; | ||
if (state && *state != UTF8PROC_BOUNDCLASS_START) | ||
lbc_override = *state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Keno, I just noticed that lbc_override
seems to never be used. Did you mean to pass lbc_override
to grapheme_break_simple
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a while, but yes, that looks right, esp by looking at the first commit in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already seem to have found that bug though: https://github.com/JuliaLang/utf8proc/blob/master/utf8proc.c#L290
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the only substantiative changes to the code required are the new rules in TR29, but please do check the Unicode change log (http://www.unicode.org/versions/Unicode9.0.0/) for other changes that may affect this library.