Compact the _Unicode_property_data tables. #2757

mordante · 2022-06-03T18:28:51Z

The Unicode data files used to generate the extended grapheme cluster
tables contain contiguous ranges split over several lines, for example:

2060..2064 ; Control # Cf [5] WORD JOINER..INVISIBLE PLUS
2065 ; Control # Cn <reserved-2065>
2066..206F ; Control # Cf [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES

Instead of creating separate table entries, create one entry for the
combined range. The Extended_Pictographic property in particular has
many ranges that can be combined. Combining these ranges reduces the
size of the tables and should improve performance.

This change reduces the total number of entries in the tables
_Grapheme_Break_property_data and _Extended_Pictographic_property_values
by 445, saving 2670 bytes of data.
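
The merging step described above can be sketched as follows. This is a minimal illustration, not the code in grapheme_break_property_data_gen.py: the `PropertyRange` class and the `compact_ranges` function are hypothetical names, and the sketch assumes each entry is an inclusive code point range with a single property value.

```python
from dataclasses import dataclass


@dataclass
class PropertyRange:
    """Hypothetical table entry: an inclusive code point range and its property."""
    lower: int
    upper: int
    prop: str


def compact_ranges(ranges):
    """Merge neighbouring ranges that touch and carry the same property."""
    result = []
    for r in sorted(ranges, key=lambda x: x.lower):
        last = result[-1] if result else None
        if last is not None and last.prop == r.prop and last.upper + 1 == r.lower:
            # Contiguous with the previous entry: extend it instead of adding one.
            last.upper = r.upper
        else:
            # Copy so the caller's input entries are never mutated.
            result.append(PropertyRange(r.lower, r.upper, r.prop))
    return result


# The three lines from the example in the description.
entries = [
    PropertyRange(0x2060, 0x2064, "Control"),
    PropertyRange(0x2065, 0x2065, "Control"),
    PropertyRange(0x2066, 0x206F, "Control"),
]
compacted = compact_ranges(entries)
```

With the example input, the three entries collapse into a single 2060..206F entry. Ranges separated by a gap, or adjacent ranges with different property values, are left as separate entries.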

mordante · 2022-06-03T18:30:43Z

@barcharcraz I assume you want to have a look at this.

tools/unicode_properties_parse/grapheme_break_property_data_gen.py

StephanTLavavej · 2022-06-03T22:55:00Z

Thanks for noticing this and writing an elegant improvement! 😻 I've pushed very small changes to the Python script for the function name and comment.

Additionally, I have verified that the output of the script, followed by clang-format, exactly matches the change to the product header.

barcharcraz

This is a great change! It looks like they repeat stuff like this in order to keep things more "stable" over Unicode updates, and to split "unrelated" emoji, even if they happen to be packed in one after the other.

We don't need to worry about either of those things here.

barcharcraz · 2022-06-03T23:15:18Z

Oh, I think I'm wrong about why they do this. It's to line up the ranges with other properties! So for Gbpdata, if a character has the same GBP as the preceding range but a different General category, the range is split. They must figure it's easier to rejoin the ranges than to split them apart, which is quite true.

For emoji-data it's more common to see the breaks, because they are breaking the ranges if any of the other emoji properties are different.
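
The splitting behaviour described here can be checked against the example lines from the PR description. The sketch below is an illustration only: `LINE_RE` and `parse_line` are hypothetical, and the regular expression assumes the line layout seen in that example (range, `;`, property value, `#`, General category), not the full data file grammar.

```python
import re

# Assumed layout: "<lower>[..<upper>] ; <property> # <general category> ..."
LINE_RE = re.compile(
    r"^(?P<lower>[0-9A-F]+)(?:\.\.(?P<upper>[0-9A-F]+))?"
    r"\s*;\s*(?P<prop>\w+)\s*#\s*(?P<gc>\w+)"
)


def parse_line(line):
    """Return (lower, upper, property, general_category), or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    lower = int(m["lower"], 16)
    # A single code point has no "..upper" part; treat it as a one-element range.
    upper = int(m["upper"], 16) if m["upper"] else lower
    return lower, upper, m["prop"], m["gc"]


lines = [
    "2060..2064 ; Control # Cf [5] WORD JOINER..INVISIBLE PLUS",
    "2065 ; Control # Cn <reserved-2065>",
    "2066..206F ; Control # Cf [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES",
]
parsed = [parse_line(line) for line in lines]

properties = {p[2] for p in parsed}        # the break property on each line
categories = [p[3] for p in parsed]        # the General category on each line
contiguous = all(a[1] + 1 == b[0] for a, b in zip(parsed, parsed[1:]))
```

For these three lines the break property is the same throughout and the ranges are contiguous, while the General category changes from Cf to Cn and back, which matches the explanation that the file splits ranges wherever another property differs.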

StephanTLavavej · 2022-06-10T21:31:34Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2022-06-12T09:45:47Z

Thx 4 tiny tbls! 😹 📉 🎉

CaseyCarter · 2022-06-12T17:42:56Z

Congratulations on stepping up to contribute to the best open-source C++ Standard Library!

mordante · 2022-06-12T20:26:39Z

> Congratulations on stepping up to contribute to the best open-source C++ Standard Library!

Thanks for acknowledging my contributions to libc++ :-P

CaseyCarter · 2022-06-12T23:13:10Z

> > Congratulations on stepping up to contribute to the best open-source C++ Standard Library!

> Thanks for acknowledging my contributions to libc++ :-P

🔥🔥🔥

Co-authored-by: Stephan T. Lavavej <[email protected]>

mordante requested a review from a team as a code owner June 3, 2022 18:28

StephanTLavavej added the performance (Must go faster) and format (C++20/23 format) labels Jun 3, 2022

StephanTLavavej reviewed Jun 3, 2022

Code review feedback.

796f7c9

StephanTLavavej approved these changes Jun 3, 2022

StephanTLavavej assigned barcharcraz Jun 3, 2022

barcharcraz approved these changes Jun 3, 2022

StephanTLavavej unassigned barcharcraz Jun 3, 2022

StephanTLavavej self-assigned this Jun 10, 2022

StephanTLavavej merged commit 82acfcf into microsoft:main Jun 12, 2022

fsb4000 pushed a commit to fsb4000/STL that referenced this pull request Aug 13, 2022

Compact the _Unicode_property_data tables. (microsoft#2757)

3f7a8d1

Co-authored-by: Stephan T. Lavavej <[email protected]>
