Add Unicode Annex 31 methods to `char` #2693

notriddle · 2019-04-24T20:03:33Z

Rendered

clarfonthey · 2019-04-25T06:01:39Z

I think this could probably be done as a PR to Rust directly rather than an RFC.

Manishearth

I'm very iffy on this. I don't see a strong enough motivation to include it in the stdlib, while I see very clear reasons to keep things that change with every Unicode version out of the stdlib. We moved unicode_segmentation out of tree for similar reasons, despite it being very useful in Unicode-aware string handling.

text/0000-char-uax-31.md

Manishearth · 2019-05-01T15:18:44Z

text/0000-char-uax-31.md

+a standardized set of code point categories for defining computer language syntax.
+
+This is being used in production Rust code already.
+Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer,


This is for the old, unstable non_ascii_idents feature, which we have already RFC'd to change. The change will still need some function like this, but it may need tailoring for bidi characters. This is true for other implementors too: the XID functions often need tailoring.

They're not super hard to tailor, though, so we could still expose these functions to match the spec and let those who need tailoring just suffix the calls with `|| ch == ... && ch != ...`

notriddle · 2019-05-03

The XID part is currently unstable. But stable Rust does respect Pattern_White_Space, so there's already a committed-to Annex 31-based syntax. https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876
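To make the Pattern_White_Space point concrete, here is a minimal sketch of that property as a plain function. The function name `is_pattern_white_space` is hypothetical (it is not a std or rustc API); the code point list is the Pattern_White_Space set from Unicode's PropList.txt, which Unicode documents as a stable property.

```rust
/// Hypothetical helper: true if `c` has the Unicode Pattern_White_Space property.
/// The set below is spelled out by hand rather than taken from any std API.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}
```

This set is not the same as what `char::is_whitespace` checks (the White_Space property): U+00A0 NO-BREAK SPACE is White_Space but not Pattern_White_Space, while U+200E LEFT-TO-RIGHT MARK is Pattern_White_Space but not White_Space.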

matklad · 2019-06-12

As a concrete case of tailoring, Rust actually uses XID_Start | '_' and not a plain XID_Start. I actually got bitten by that when implementing the lexer for IntelliJ Rust a while ago :)
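The tailoring described in this thread could look roughly like the sketch below. Everything here is illustrative: `is_ident_start`, `is_ident`, and the `approx_*` functions are hypothetical names, and the `approx_*` predicates are only stand-ins built on `char::is_alphabetic` / `char::is_alphanumeric`, not the real XID_Start / XID_Continue properties. In practice the predicates would come from whatever versioned Unicode data source the implementor picks.

```rust
/// Tailored start check: the spec predicate, widened to also accept `_`.
/// The predicate is a parameter so the Unicode data can live outside this code.
fn is_ident_start(c: char, xid_start: impl Fn(char) -> bool) -> bool {
    xid_start(c) || c == '_'
}

/// Stand-in for XID_Start. ASSUMPTION: approximated with Alphabetic, which is
/// close to, but not identical to, the real property.
fn approx_xid_start(c: char) -> bool {
    c.is_alphabetic()
}

/// Stand-in for XID_Continue. ASSUMPTION: approximated with alphanumerics plus `_`.
fn approx_xid_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Checks a whole string: tailored start character, then continue characters.
/// Note: this accepts a lone `_`, which Rust itself does not treat as an identifier.
fn is_ident(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) if is_ident_start(first, approx_xid_start) => {
            chars.all(approx_xid_continue)
        }
        _ => false,
    }
}
```

With this shape, the untailored predicate stays faithful to the spec and each language adds or removes code points at the call site, which is the pattern suggested earlier in the thread.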

Manishearth · 2019-05-01T15:21:05Z

text/0000-char-uax-31.md

+
+# Drawbacks
+[drawbacks]: #drawbacks
+


Matching Unicode versions is also a big issue here. This function will be inaccurate half the time, as we don't always update our data files immediately (it's not always straightforward). Even if we do, the behavior of this function will change every year. While we don't guarantee stdlib behavior stability, this does mean that older compilers will produce different results for code that still compiles. This further makes me feel like this should be a versioned crate.

is_whitespace already opened the door to this issue, but I don't want to make it worse. is_whitespace is a small, relatively stable list, whereas XID expands all the time.

notriddle · 2019-05-03

Agreed. Added it.
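The versioning concern raised in this thread can be observed directly: std exposes the Unicode version its `char` tables were generated from as `char::UNICODE_VERSION`. The sketch below is one possible guard a parser could use; the function name `std_unicode_at_least` is hypothetical.

```rust
/// Hypothetical guard: true if this toolchain's Unicode tables are at least
/// version `major.minor`. The answer depends on the compiler that built the code.
fn std_unicode_at_least(major: u8, minor: u8) -> bool {
    let (std_major, std_minor, _update) = char::UNICODE_VERSION;
    (std_major, std_minor) >= (major, minor)
}
```

Because the constant is fixed per toolchain, the same source can give different answers under different compilers, which is the drift a separately versioned crate would avoid.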

Manishearth · 2019-05-01T15:26:27Z

text/0000-char-uax-31.md

+# Motivation
+[motivation]: #motivation
+
+As a systems language, Rust is heavily used for parsing.


To me this reads as motivation for why Rust needs such functions somewhere, but inclusion in the stdlib is a much higher bar, especially when this gets us tangled up with Unicode versioning.

Yes, we already have is_whitespace, but that's an old API that was grandfathered in, and it has fewer Unicode stability issues than XID.

One argument for it being in std is discoverability, but if you're parsing some grammar, that grammar will likely tell you about XID.

matklad · 2019-07-24T07:58:53Z

cc rust-lang/rust#62848, which proposes to move in the opposite direction: remove these methods from libcore and use the unicode_xid crate in the compiler.

scottmcm · 2019-09-21T01:01:45Z

rust-lang/rust#62848 appears to have reached consensus for the opposite direction; does that amount to an agreement to close this one?

notriddle · 2019-09-21T02:13:57Z

Yeah, I will definitely rescind this whole thing, if that's going to be the solution.

matklad · 2019-11-24T15:33:49Z

@notriddle is_xid_start and is_xid_continue were removed from libcore. rustc now uses the unicode-xid crate as the source of truth for these definitions. And if one is interested specifically in Rust's definition of identifiers, there's rustc_lexer.

Based on the comment above I am going to optimistically close this issue and remove the nominated label, but please reopen if I misunderstood your intentions.

Create 0000-char-uax-31.md

1856466

notriddle changed the title ~~Create 0000-char-uax-31.md~~ Add Unicode Annex 31 methods to `char Apr 24, 2019

notriddle changed the title ~~Add Unicode Annex 31 methods to `char~~ Add Unicode Annex 31 methods to char Apr 24, 2019

Acknowledge that is_xid_start and continue exist

42bfc63

Centril added the A-primitive, A-string, and T-libs-api labels Apr 25, 2019

mibac138 reviewed Apr 25, 2019

mibac138 reviewed Apr 25, 2019

mibac138 and others added 2 commits April 25, 2019 09:14

Update text/0000-char-uax-31.md

320cf1c

Co-Authored-By: notriddle <[email protected]>

Update text/0000-char-uax-31.md

8654dfb

Co-Authored-By: notriddle <[email protected]>

Manishearth reviewed May 1, 2019

mzji reviewed May 2, 2019

mzji and others added 3 commits May 2, 2019 18:34

Update text/0000-char-uax-31.md

6db8ba1

Co-Authored-By: notriddle <[email protected]>

Update 0000-char-uax-31.md

4143fef

Mention the back-compat drawback

b9d0c6c

scottmcm added the I-nominated label Sep 21, 2019

matklad closed this Nov 24, 2019

matklad removed the I-nominated label Nov 24, 2019

notriddle deleted the patch-2 branch November 24, 2019 19:30
