Unicode 9 updates #70

Keno · 2016-06-24T03:25:37Z

I believe the only substantive changes to the code required are the new rules in TR29, but please do check the Unicode change log (http://www.unicode.org/versions/Unicode9.0.0/) for other changes that may affect this library.

- New rules GB10 and GB12/13 are used to combine emoji-zwj sequences (GB10) and to force grapheme breaks every two RI codepoints (GB12/13). Unfortunately, this breaks the statelessness of grapheme-boundary determination. Deal with this by ignoring the problem in utf8proc_grapheme_break and by hacking a special case into decompose.
- ZWJ moved to its own boundclass; update what is now GB9 accordingly.
- Add comments to indicate which rule a given case implements.
- The number of bound classes now exceeds 4 bits; expand to 8 and reorganize fields.
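
To illustrate why GB12/13 cannot be decided from two adjacent codepoints alone, here is a minimal sketch using a toy two-class model. The names `toy_class`, `toy_break_pairwise`, `toy_break_counting`, and `toy_count_clusters` are hypothetical and not part of utf8proc, and counting the length of the RI run is just one possible way to carry the state, not necessarily what this PR does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bound classes; not utf8proc's enum. */
typedef enum { TOY_OTHER, TOY_RI } toy_class;

/* Pairwise view: with only (prev, next) visible, RI x RI looks the same
   everywhere, so a stateless function can only ever answer "no break". */
static bool toy_break_pairwise(toy_class prev, toy_class next) {
    return !(prev == TOY_RI && next == TOY_RI);
}

/* Stateful view: *ri_run holds the number of consecutive RIs ending at
   `prev`. A break between two RIs is allowed only after a complete pair. */
static bool toy_break_counting(toy_class prev, toy_class next, size_t *ri_run) {
    assert(ri_run != NULL);
    *ri_run = (prev == TOY_RI) ? *ri_run + 1 : 0;
    if (prev == TOY_RI && next == TOY_RI)
        return (*ri_run % 2) == 0;
    return true; /* everything else breaks in this toy model */
}

/* Count clusters by visiting every adjacent pair, in order. */
static size_t toy_count_clusters(const toy_class *s, size_t n, bool use_state) {
    assert(s != NULL || n == 0);
    if (n == 0) return 0;
    size_t clusters = 1, ri_run = 0;
    for (size_t i = 1; i < n; i++) {
        bool brk = use_state ? toy_break_counting(s[i - 1], s[i], &ri_run)
                             : toy_break_pairwise(s[i - 1], s[i]);
        if (brk) clusters++;
    }
    return clusters;
}
```

For four consecutive regional indicators, the pairwise variant reports a single cluster, while the counting variant reports two, which is the pairing GB12/13 asks for.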

tkelman · 2016-06-24T10:13:13Z

utf8proc.c

+   Please note that evaluation of GB10 (grapheme breaks between emoji zwj sequences)
+   and GB 12/13 (regional indicator code points) require knowledge of previous characters
+   and are thus not handled by this function. This may result in an incorrect break before
+   and E_Modifier class codepoint and an incorrectly missing break between two


an E_Modifier?

tkelman · 2016-06-24T10:28:26Z

This almost certainly needs version number bumps

stevengj · 2016-06-24T13:49:25Z

utf8proc.c

+             tbc == UTF8PROC_BOUNDCLASS_EXTEND)
+      *last_boundclass = UTF8PROC_BOUNDCLASS_E_BASE;
+    else
+      *last_boundclass = tbc;


Would be nice to export this logic somehow so that we can use it in the Julia grapheme iterator.

Mhm, I believe just adding an extra out parameter to give you the override class should be fine.
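
A rough sketch of what such an out parameter could look like, assuming a toy set of bound classes. The names `toy_bc`, `toy_break_pair`, and `toy_break_export` are hypothetical and are not utf8proc's API; only the fold of Extend into a preceding E_Base mirrors the diff quoted above, and the rule set is heavily simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bound classes; not utf8proc's enum. */
typedef enum {
    TOY_START, TOY_OTHER, TOY_EXTEND, TOY_E_BASE, TOY_E_MODIFIER
} toy_bc;

/* Pairwise rules only (a small subset of UAX #29). */
static bool toy_break_pair(toy_bc lbc, toy_bc tbc) {
    if (lbc == TOY_START) return true;                  /* GB1 */
    if (tbc == TOY_EXTEND) return false;                /* GB9 */
    if (lbc == TOY_E_BASE && tbc == TOY_E_MODIFIER)
        return false;                                   /* GB10, adjacent case */
    return true;                                        /* GB999 */
}

/* Returns whether a break is allowed between lbc and tbc, and writes to
   *next_lbc the class the caller should pass as lbc for the next pair.
   An Extend following an E_Base is folded into E_Base, so a later
   E_Modifier still attaches to the base. */
static bool toy_break_export(toy_bc lbc, toy_bc tbc, toy_bc *next_lbc) {
    assert(next_lbc != NULL);
    bool brk = toy_break_pair(lbc, tbc);
    *next_lbc = (lbc == TOY_E_BASE && tbc == TOY_EXTEND) ? TOY_E_BASE : tbc;
    return brk;
}
```

A caller such as a grapheme iterator would keep the returned class and feed it back as `lbc` on the next call. For the sequence E_Base, Extend, E_Modifier this yields no break at either position, whereas the plain pairwise check on (Extend, E_Modifier) reports a break.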

Keno · 2016-06-25T14:46:34Z

OK, I've updated the API to expose the state override. Since that's an ABI-incompatible change, I've bumped the MAJOR version accordingly. Please review.

Keno · 2016-06-25T14:49:17Z

We might want to do this at the same time as #68 (which may also require a rebase for this). cc @benibela

stevengj · 2016-06-27T19:30:57Z

Makefile

@@ -19,9 +19,9 @@ UCFLAGS = $(CFLAGS) $(PICFLAG) $(C99FLAG) $(WCFLAGS) -DUTF8PROC_EXPORTS
 # not API compatibility: MAJOR should be incremented whenever *binary*
 # compatibility is broken, even if the API is backward-compatible
 # Be sure to also update these in MANIFEST and CMakeLists.txt!


Did you forget to update CMakeLists.txt too?

Note that you also need to update utf8proc.h for the API version change.

stevengj · 2016-06-27T19:33:00Z

Maybe better to do this first, then update #68?

stevengj · 2016-06-27T19:34:28Z

utf8proc.h

+ *              matching the rules in Unicode 8.0.0.
+ *
+ * @warning If the state parameter is used, `utf8proc_grapheme_break` must be called
+ *          IN ORDER on ALL potential breaks in a string.


Keno · 2016-06-27T19:53:35Z

Do we need to hold off on doing the version bump if we want to do #68 right after?

tkelman · 2016-06-27T21:06:24Z

I think we can do a version bump for this (definitely needed), and if we want to merge further big changes we don't need to change the version again until we do the next release.

Keno · 2016-06-27T21:33:49Z

Ok.

Keno · 2016-06-28T12:59:00Z

Good to merge?

benibela · 2016-07-02T12:51:50Z

There is a warning about an unused variable: lbc_override

Keno · 2016-07-02T13:02:19Z

Oops, thanks.

Keno · 2016-07-02T13:04:11Z

I also realized I made this static. Will fix.


stevengj · 2017-09-12T16:30:10Z

utf8proc.c

+{
+  int lbc_override = lbc;
+  if (state && *state != UTF8PROC_BOUNDCLASS_START)
+    lbc_override = *state;


@Keno, I just noticed that lbc_override seems to never be used. Did you mean to pass lbc_override to grapheme_break_simple?

It's been a while, but yes, that looks right, esp by looking at the first commit in this PR.

We already seem to have found that bug though: https://github.com/JuliaLang/utf8proc/blob/master/utf8proc.c#L290
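
The effect of the unused variable can be reproduced with a small sketch, assuming toy bound classes and a simplified rule set. All names here (`toy_bc`, `toy_break_simple`, `toy_break_stateful`, `toy_count_breaks`) are hypothetical, and the `use_override` flag exists only to contrast the two behaviours side by side; resetting the state to OTHER after a regional-indicator pair is one assumed way of forcing the break.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bound classes; not utf8proc's enum. */
typedef enum { TOY_START, TOY_OTHER, TOY_RI } toy_bc;

/* Pairwise rules only: never break between two regional indicators. */
static bool toy_break_simple(toy_bc lbc, toy_bc tbc) {
    if (lbc == TOY_START) return true;
    if (lbc == TOY_RI && tbc == TOY_RI) return false;
    return true;
}

/* Stateful wrapper. With use_override == false the override is computed
   but ignored, which is the mistake discussed in this thread. */
static bool toy_break_stateful(toy_bc lbc, toy_bc tbc, toy_bc *state,
                               bool use_override) {
    toy_bc lbc_override = lbc;
    if (state && *state != TOY_START)
        lbc_override = *state;
    bool brk = toy_break_simple(use_override ? lbc_override : lbc, tbc);
    if (state) {
        /* After a complete RI pair, pretend the second RI was OTHER so
           that the next RI starts a new cluster. */
        if (lbc_override == TOY_RI && tbc == TOY_RI)
            *state = TOY_OTHER;
        else
            *state = tbc;
    }
    return brk;
}

/* Count breaks between adjacent elements, visiting every pair in order. */
static size_t toy_count_breaks(const toy_bc *s, size_t n, bool use_override) {
    assert(s != NULL || n == 0);
    toy_bc state = TOY_START;
    size_t breaks = 0;
    for (size_t i = 1; i < n; i++)
        if (toy_break_stateful(s[i - 1], s[i], &state, use_override))
            breaks++;
    return breaks;
}
```

With four regional indicators in a row, passing the override to the simple check produces one interior break (two pairs), while ignoring it produces none, so the state is updated but never has any effect.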

Keno added 2 commits June 23, 2016 23:53

Import Unicode 9 data

68a577c

Keno force-pushed the kf/unicode9 branch from 10b2c56 to 68a577c Compare June 24, 2016 03:53

Keno changed the title ~~WIP: Unicode 9 updates~~ Unicode 9 updates Jun 24, 2016

tkelman reviewed Jun 24, 2016

stevengj mentioned this pull request Jun 24, 2016

Unicode 9 support #71

Closed

stevengj reviewed Jun 24, 2016

Keno force-pushed the kf/unicode9 branch from bc29543 to 120a75a Compare June 25, 2016 14:44

Keno force-pushed the kf/unicode9 branch from d70e8ac to 5697c96 Compare June 25, 2016 14:56

stevengj reviewed Jun 27, 2016

Update Grapheme break API to expose state override

987e72f

Keno force-pushed the kf/unicode9 branch from 5697c96 to c18ad80 Compare June 27, 2016 20:15

Bump MAJOR version

2a83fe7

Keno force-pushed the kf/unicode9 branch from c18ad80 to 2a83fe7 Compare June 27, 2016 21:35

stevengj merged commit 41c6b23 into master Jun 28, 2016

stevengj deleted the kf/unicode9 branch June 28, 2016 20:04

stevengj added a commit that referenced this pull request Jun 28, 2016

note Unicode 9 support (from #70) in README

9a0b87b

This was referenced Jul 11, 2016

Tag release with Unicode 9 support #72

Closed

Convert compiler warnings to errors for Travis builds #73

Merged

This was referenced Jul 12, 2016

restore old grapheme_break API #74

Closed

update to Unifont 9 (for Unicode 9 charwidths) #75

Merged

stevengj added a commit that referenced this pull request Jul 13, 2016

the ABI version was already bumped in #62, does not need to be bumped…

cb2a3e4

… again in #70

orbisvicis mentioned this pull request Sep 2, 2017

[Question] Grapheme Boundary State #110

Closed

stevengj reviewed Sep 12, 2017
