-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[stdlib] Add full unicode support for character casing functions #3496
base: nightly
Are you sure you want to change the base?
[stdlib] Add full unicode support for character casing functions #3496
Conversation
Signed-off-by: Maxim Zaks <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mzaks this is great! . Just a bit of a nitpick: I think _unicode.mojo
and _unicode_lookups.mojo
should be in the utils package
stdlib/src/collections/_unicode.mojo
Outdated
if (rune >> 7) == 0: # This is 1 byte ASCII char | ||
p[0] = rune.cast[DType.uint8]() | ||
return 1 | ||
|
||
@always_inline | ||
fn _utf8_len(val: UInt32) -> Int: | ||
alias sizes = SIMD[DType.uint32, 4]( | ||
0, 0b1111_111, 0b1111_1111_111, 0b1111_1111_1111_1111 | ||
) | ||
var values = SIMD[DType.uint32, 4](val) | ||
var mask = values > sizes | ||
return int(mask.cast[DType.uint8]().reduce_add()) | ||
|
||
var num_bytes = _utf8_len(rune) | ||
var shift = 6 * (num_bytes - 1) | ||
var mask = UInt32(0xFF) >> (num_bytes + 1) | ||
var num_bytes_marker = UInt32(0xFF) << (8 - num_bytes) | ||
p[0] = (((rune >> shift) & mask) | num_bytes_marker).cast[DType.uint8]() | ||
for i in range(1, num_bytes): | ||
shift -= 6 | ||
p[i] = (((rune >> shift) & 0b00111111) | 0b10000000).cast[DType.uint8]() | ||
return num_bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code logic is used in many other places. I'll open a PR to fix chr
implementation which I put in #3239 and ended up mixed with other problems and getting forgotten.
!sync |
Signed-off-by: Maxim Zaks <[email protected]>
Co-authored-by: martinvuyk <[email protected]> Signed-off-by: Maxim Zaks <[email protected]>
Co-authored-by: martinvuyk <[email protected]> Signed-off-by: Maxim Zaks <[email protected]>
Co-authored-by: martinvuyk <[email protected]> Signed-off-by: Maxim Zaks <[email protected]>
da0b9bf
to
fa8c0c9
Compare
…:mzaks/mojo into feature/support-unicode-character-casing # Conflicts: # stdlib/src/collections/_unicode.mojo
Very good point. Just moved them to utils. |
Signed-off-by: Maxim Zaks <[email protected]>
c28d399
to
258a088
Compare
!sync |
I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/ It looks promising, but I think it will be better as a separate PR. |
what an awesome book, I have new reading material for a couple of months...
I think in this case it might or it might not be better. The lists are static and known at compile time yet the binary search is generic. I think you could just hardcode values and ranges here, though I'm not sure how that much branching would affect performance. But there are some quite big ranges is some places e.g.: alias has_uppercase_mapping = List[UInt32, hint_trivial_type=True](
...
0x007A, # LATIN SMALL LETTER Z z
0x00B5, # MICRO SIGN µ
0x00E0, # LATIN SMALL LETTER A WITH GRAVE à
...
0x0586, # ARMENIAN SMALL LETTER FEH ֆ
0x10D0, # GEORGIAN LETTER AN ა
...
0x2D2D, # GEORGIAN SMALL LETTER AEN ⴭ
0xA641, # CYRILLIC SMALL LETTER ZEMLYA ꙁ
...
) And if we are going to nitpick performance, I think ASCII letters should be prioritized and not use a generic algorithm for something that is realistically mostly going to be used for the first X amount of letters (we could use an ASCII optimized version and have another Unicode one, at the cost of checking the first UTF8 byte and the branch that follows) |
…arating (#47532) [External] [stdlib] Fix chr impl taking funcs to string_slice and separating Fix `chr` implementation taking funcs to `string_slice` and separating their respective functionalities. This code will be used elsewhere e.g., PR [#3496](#3496 (comment)) ORIGINAL_AUTHOR=martinvuyk <[email protected]> PUBLIC_PR_LINK=#3506 Co-authored-by: martinvuyk <[email protected]> Closes #3506 MODULAR_ORIG_COMMIT_REV_ID: dc8c96e58f5e272d01e3c26f6daf3ffe8f7c0b36
!sync |
The code I used to generate the lookup tables can be found here https://gist.github.com/mzaks/bbadaeebcf81a5200021af041568b26b