Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compact the _Unicode_property_data tables. #2757

Merged
merged 2 commits into from
Jun 12, 2022

Conversation

mordante
Copy link
Contributor

@mordante mordante commented Jun 3, 2022

The Unicode data files used to generate the extended grapheme cluster
tables contain contiguous ranges split over several lines, for example:

2060..2064 ; Control # Cf [5] WORD JOINER..INVISIBLE PLUS
2065 ; Control # Cn
2066..206F ; Control # Cf [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES

Instead of creating separate table entries, create one entry for the
combined range. Especially the Extended_Pictographic property has a lot
of ranges that can be combined. Combining these ranges reduces the size
of the tables and should improve performance.

This change reduces the total number of entries in the tables
_Grapheme_Break_property_data and _Extended_Pictographic_property_values
by 445, saving 2670 bytes of data.

The Unicode data files used to generate the extended grapheme cluster
tables contain contiguous ranges split over several lines, for example:

  2060..2064    ; Control # Cf   [5] WORD JOINER..INVISIBLE PLUS
  2065          ; Control # Cn       <reserved-2065>
  2066..206F    ; Control # Cf  [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIGIT SHAPES

Instead of creating separate table entries, create one entry for the
combined range. Especially the Extended_Pictographic property has a lot
of ranges that can be combined. Combining these ranges reduces the size
of the tables and should improve performance.

This change reduces the total number of entries in the tables
_Grapheme_Break_property_data and _Extended_Pictographic_property_values
by 445, saving 2670 bytes of data.
@mordante mordante requested a review from a team as a code owner June 3, 2022 18:28
@mordante
Copy link
Contributor Author

mordante commented Jun 3, 2022

@barcharcraz I assume you want to have a look at this.

@StephanTLavavej StephanTLavavej added performance Must go faster format C++20/23 format labels Jun 3, 2022
@StephanTLavavej
Copy link
Member

Thanks for noticing this and writing an elegant improvement! 😻 I've pushed very small changes to the Python script for the function name and comment.

Additionally, I have verified that the output of the script, followed by clang-format, exactly matches the change to the product header.

Copy link
Member

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great change! It looks like they repeat stuff like this in order to keep things more "stable" over unicode updates, and to split "unrelated" emoji, even if they happen to be packed in one after the other.

We don't need to worry about either of those things here.

@barcharcraz
Copy link
Member

Oh, I think I'm wrong about why they do this. It's to line up the ranges with other properties! So for Gbpdata if a character has the same GBP as the preceding range but a different General category things are split. They must figure it's easier to rejoin the ranges than to split them apart, which is quite true.

For emoji-data it's more common to see the breaks, because they are breaking the ranges if any of the other emoji properties are different.

@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 82acfcf into microsoft:main Jun 12, 2022
@StephanTLavavej
Copy link
Member

Thx 4 tiny tbls! 😹 📉 🎉

@CaseyCarter
Copy link
Member

CaseyCarter commented Jun 12, 2022

Congratulations on stepping up to contribute to the best open-source C++ Standard Library! :trollface:

@mordante
Copy link
Contributor Author

Congratulations on stepping up to contribute to the best open-source C++ Standard Library! :trollface:

Thanks for acknowledging my contributions to libc++ :-P

@CaseyCarter
Copy link
Member

Congratulations on stepping up to contribute to the best open-source C++ Standard Library! :trollface:

Thanks for acknowledging my contributions to libc++ :-P

🔥🔥🔥

fsb4000 pushed a commit to fsb4000/STL that referenced this pull request Aug 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
format C++20/23 format performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants