Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Add full unicode support for character casing functions #3496

Open
wants to merge 7 commits into
base: nightly
Choose a base branch
from

Conversation

mzaks
Copy link
Contributor

@mzaks mzaks commented Sep 18, 2024

The code I used to generate the lookup tables can be found here https://gist.github.com/mzaks/bbadaeebcf81a5200021af041568b26b

Copy link
Contributor

@martinvuyk martinvuyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mzaks this is great! . Just a bit of a nitpick: I think _unicode.mojo and _unicode_lookups.mojo should be in the utils package

stdlib/src/collections/_unicode.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/_unicode.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/_unicode.mojo Outdated Show resolved Hide resolved
Comment on lines 139 to 160
if (rune >> 7) == 0: # This is 1 byte ASCII char
p[0] = rune.cast[DType.uint8]()
return 1

@always_inline
fn _utf8_len(val: UInt32) -> Int:
alias sizes = SIMD[DType.uint32, 4](
0, 0b1111_111, 0b1111_1111_111, 0b1111_1111_1111_1111
)
var values = SIMD[DType.uint32, 4](val)
var mask = values > sizes
return int(mask.cast[DType.uint8]().reduce_add())

var num_bytes = _utf8_len(rune)
var shift = 6 * (num_bytes - 1)
var mask = UInt32(0xFF) >> (num_bytes + 1)
var num_bytes_marker = UInt32(0xFF) << (8 - num_bytes)
p[0] = (((rune >> shift) & mask) | num_bytes_marker).cast[DType.uint8]()
for i in range(1, num_bytes):
shift -= 6
p[i] = (((rune >> shift) & 0b00111111) | 0b10000000).cast[DType.uint8]()
return num_bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code logic is used in many other places. I'll open a PR to fix chr implementation which I put in #3239 and ended up mixed with other problems and getting forgotten.

@JoeLoser
Copy link
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Sep 19, 2024
mzaks and others added 4 commits September 19, 2024 15:46
@mzaks mzaks force-pushed the feature/support-unicode-character-casing branch from da0b9bf to fa8c0c9 Compare September 19, 2024 13:46
…:mzaks/mojo into feature/support-unicode-character-casing

# Conflicts:
#	stdlib/src/collections/_unicode.mojo
@mzaks
Copy link
Contributor Author

mzaks commented Sep 19, 2024

Hi @mzaks this is great! . Just a bit of a nitpick: I think _unicode.mojo and _unicode_lookups.mojo should be in the utils package

Very good point. Just moved them to utils.

@mzaks mzaks force-pushed the feature/support-unicode-character-casing branch from c28d399 to 258a088 Compare September 19, 2024 14:05
@JoeLoser
Copy link
Collaborator

!sync

@mzaks
Copy link
Contributor Author

mzaks commented Sep 21, 2024

I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/

It looks promising, but I think it will be better as a separate PR.
@JoeLoser what do you think?

@martinvuyk
Copy link
Contributor

https://en.algorithmica.org/hpc/data-structures/s-tree/

what an awesome book, I have new reading material for a couple of months...

I tried a performance improvement based on static B+Tree

I think in this case it might or it might not be better. The lists are static and known at compile time yet the binary search is generic. I think you could just hardcode values and ranges here, though I'm not sure how that much branching would affect performance. But there are some quite big ranges is some places e.g.:

alias has_uppercase_mapping = List[UInt32, hint_trivial_type=True](
    ...
    0x007A,  # LATIN SMALL LETTER Z z
    0x00B5,  # MICRO SIGN µ
    0x00E0,  # LATIN SMALL LETTER A WITH GRAVE à
    ...
    0x0586,  # ARMENIAN SMALL LETTER FEH ֆ
    0x10D0,  # GEORGIAN LETTER AN ა
    ...
    0x2D2D,  # GEORGIAN SMALL LETTER AEN ⴭ
    0xA641,  # CYRILLIC SMALL LETTER ZEMLYA ꙁ
    ...
)

And if we are going to nitpick performance, I think ASCII letters should be prioritized and not use a generic algorithm for something that is realistically mostly going to be used for the first X amount of letters (we could use an ASCII optimized version and have another Unicode one, at the cost of checking the first UTF8 byte and the branch that follows)

modularbot added a commit that referenced this pull request Sep 22, 2024
…arating (#47532)

[External] [stdlib] Fix chr impl taking funcs to string_slice and
separating

Fix `chr` implementation taking funcs to `string_slice` and separating their
respective functionalities.

This code will be used elsewhere e.g., PR
[#3496](#3496 (comment))

ORIGINAL_AUTHOR=martinvuyk
<[email protected]>
PUBLIC_PR_LINK=#3506

Co-authored-by: martinvuyk <[email protected]>
Closes #3506
MODULAR_ORIG_COMMIT_REV_ID: dc8c96e58f5e272d01e3c26f6daf3ffe8f7c0b36
@JoeLoser
Copy link
Collaborator

!sync

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants