-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compact the _Unicode_property_data tables. #2757
Conversation
The Unicode data files used to generate the extended grapheme cluster tables contain contiguous ranges split over several lines, for example: 2060..2064 ; Control # Cf [5] WORD JOINER..INVISIBLE PLUS 2065 ; Control # Cn <reserved-2065> 2066..206F ; Control # Cf [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES Instead of creating separate table entries, create one entry for the combined range. Especially the Extended_Pictographic property has a lot of ranges that can be combined. Combining these ranges reduces the size of the tables and should improve performance. This change reduces the total number of entries in the tables _Grapheme_Break_property_data and _Extended_Pictographic_property_values by 445, saving 2670 bytes of data.
@barcharcraz I assume you want to have a look at this. |
tools/unicode_properties_parse/grapheme_break_property_data_gen.py
Outdated
Show resolved
Hide resolved
tools/unicode_properties_parse/grapheme_break_property_data_gen.py
Outdated
Show resolved
Hide resolved
Thanks for noticing this and writing an elegant improvement! 😻 I've pushed very small changes to the Python script for the function name and comment. Additionally, I have verified that the output of the script, followed by clang-format, exactly matches the change to the product header. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great change! It looks like they repeat stuff like this in order to keep things more "stable" over unicode updates, and to split "unrelated" emoji, even if they happen to be packed in one after the other.
We don't need to worry about either of those things here.
Oh, I think I'm wrong about why they do this. It's to line up the ranges with other properties! So for Gbpdata if a character has the same GBP as the preceding range but a different General category things are split. They must figure it's easier to rejoin the ranges than to split them apart, which is quite true. For emoji-data it's more common to see the breaks, because they are breaking the ranges if any of the other emoji properties are different. |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
Thx 4 tiny tbls! 😹 📉 🎉 |
Congratulations on stepping up to contribute to the best open-source C++ Standard Library! |
Thanks for acknowledging my contributions to libc++ :-P |
🔥🔥🔥 |
Co-authored-by: Stephan T. Lavavej <[email protected]>
The Unicode data files used to generate the extended grapheme cluster
tables contain contiguous ranges split over several lines, for example:
2060..2064 ; Control # Cf [5] WORD JOINER..INVISIBLE PLUS
2065 ; Control # Cn
2066..206F ; Control # Cf [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES
Instead of creating separate table entries, create one entry for the
combined range. Especially the Extended_Pictographic property has a lot
of ranges that can be combined. Combining these ranges reduces the size
of the tables and should improve performance.
This change reduces the total number of entries in the tables
_Grapheme_Break_property_data and _Extended_Pictographic_property_values
by 445, saving 2670 bytes of data.