Add Unicode Annex 31 methods to char #2693

Closed
wants to merge 7 commits into from


notriddle
Contributor

@notriddle notriddle commented Apr 24, 2019

@notriddle changed the title from "Create 0000-char-uax-31.md" to "Add Unicode Annex 31 methods to `char" on Apr 24, 2019
@notriddle changed the title from "Add Unicode Annex 31 methods to `char" to "Add Unicode Annex 31 methods to char" on Apr 24, 2019
@clarfonthey
Contributor

I think this could probably be done as a PR to Rust directly rather than an RFC.

@Centril added the labels A-primitive (Primitive types related proposals & ideas), A-string (Proposals relating to strings), and T-libs-api (Relevant to the library API team, which will review and decide on the RFC) on Apr 25, 2019
Member

@Manishearth Manishearth left a comment

I'm very iffy on this. I don't see a strong enough motivation to include it in the stdlib, while I see very clear reasons to keep things that change with every Unicode version out of the stdlib. We moved unicode_segmentation out of tree for similar reasons, despite it being very useful in Unicode-aware string handling.

a standardized set of code point categories for defining computer language syntax.

This is being used in production Rust code already.
Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer,
Member

This is for the unstable, old, non_ascii_idents feature, which we have already RFCd to change. The change will still need some function like this, but it may need tailoring for bidi characters. This is true for other implementors too, the XID functions often need tailoring.

They're not super hard to tailor, though, so we could still expose these functions to match the spec and let those who need tailoring suffix the calls with `|| ch == ... && ch != ...`

Contributor Author

@notriddle notriddle May 3, 2019

The XID part is currently unstable. But stable Rust does respect Pattern_White_Space, so there's already committed-to Annex 31-based syntax. https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876
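
For reference, Pattern_White_Space is a small, fixed set of 11 code points, so it can be written out by hand. The sketch below is illustrative only: `is_pattern_white_space` is a hypothetical helper name, not a std or rustc API. It is meant to be compared against the existing `char::is_whitespace`, which follows the separate White_Space property.

```rust
/// Hypothetical helper: returns true if `c` has the Unicode
/// Pattern_White_Space property (a set UAX #31 treats as fixed).
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}
```

The two properties differ in both directions: U+00A0 (no-break space) satisfies `char::is_whitespace` but not `is_pattern_white_space`, while U+200E (left-to-right mark) satisfies `is_pattern_white_space` but not `char::is_whitespace`.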

Member

As a concrete case of tailoring, Rust actually uses XID_Start | '_' and not a plain XID_Start. I actually got bitten by that when implementing the lexer for IntelliJ Rust a while ago :)
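
The tailoring described here can be sketched as a thin wrapper around whatever XID predicates are in use. This is a minimal sketch under stated assumptions: `is_ident_start` and `is_ident` are hypothetical names, and the XID predicates are passed in as parameters because std exposes no stable XID functions. It is not the rustc lexer's actual code.

```rust
/// Hypothetical tailoring: XID_Start plus '_'.
/// `xid_start` is supplied by the caller, e.g. from a Unicode data crate.
fn is_ident_start(c: char, xid_start: impl Fn(char) -> bool) -> bool {
    xid_start(c) || c == '_'
}

/// Checks a whole string: the first char must pass the tailored start check,
/// and every remaining char must pass the caller-supplied `xid_continue`.
fn is_ident(
    s: &str,
    xid_start: impl Fn(char) -> bool,
    xid_continue: impl Fn(char) -> bool,
) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) if is_ident_start(first, &xid_start) => chars.all(xid_continue),
        // Empty string, or a first char that fails the start check.
        _ => false,
    }
}
```

For a quick experiment without Unicode tables, `char::is_alphabetic` and `|c| c.is_alphanumeric() || c == '_'` can be passed as rough stand-ins, for example `is_ident("_foo", char::is_alphabetic, |c| c.is_alphanumeric() || c == '_')`. These stand-ins are approximations and do not match the real XID_Start and XID_Continue sets.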


# Drawbacks
[drawbacks]: #drawbacks

Member

Matching unicode versions is also a big issue here. This function will be inaccurate half the time as we don't always update our data files immediately (it's not always straightforward). Even if we do, the behavior of this function will change every year, and while we don't have a guarantee on stdlib behavior stability, this does mean that older compilers will lead to different results on code that compiles. This further makes me feel like this should be a versioned crate.

is_whitespace already opened the door to this issue, but I don't want to make it worse. is_whitespace is a small, relatively stable list, whereas XID expands all the time.

Contributor Author

Agreed. Added it.

# Motivation
[motivation]: #motivation

As a systems language, Rust is heavily used for parsing.
Member

To me this reads as motivation for why Rust needs such functions somewhere, but inclusion in the stdlib is a much higher bar, especially when this gets us tangled up with unicode versioning.

Yes, we already have is_whitespace, but that's an old API that was grandfathered in, and it has fewer unicode stability issues than XID.

Member

One argument for it being in std is discoverability, but if you're parsing some grammar that grammar will likely tell you about XID.

@matklad
Member

matklad commented Jul 24, 2019

cc rust-lang/rust#62848 which proposes to move in the opposite direction: remove these methods from libcore and use unicode_xid crate in the compiler.

@scottmcm
Member

rust-lang/rust#62848 appears to have reached consensus for the opposite direction; does that amount to an agreement to close this one?

@notriddle
Contributor Author

Yeah, I will definitely rescind this whole thing, if that's going to be the solution.

@matklad
Member

matklad commented Nov 24, 2019

@notriddle is_xid_start and is_xid_continue were removed from libcore. rustc now uses the unicode-xid crate as the source of truth for these definitions. And if one is interested specifically in Rust's definition of identifiers, there's rustc_lexer.

Based on the comment above I am going to optimistically close this issue and remove the nominated label, but please reopen if I misunderstood your intentions.

@matklad matklad closed this Nov 24, 2019
@notriddle notriddle deleted the patch-2 branch November 24, 2019 19:30
8 participants