Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not consider # an emoji in the lexer #109754

Closed
wants to merge 1 commit into from

Conversation

estebank
Copy link
Contributor

@estebank estebank commented Mar 30, 2023

@estebank estebank added beta-nominated Nominated for backporting to the compiler in the beta channel. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Mar 30, 2023
@rustbot
Copy link
Collaborator

rustbot commented Mar 30, 2023

r? @TaKO8Ki

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 30, 2023
@estebank
Copy link
Contributor Author

r? @compiler-errors

@compiler-errors
Copy link
Member

Can you link to open-i18n/rust-unic#280 ? I think that may be the root cause of the issue.

r=me afterwards
@bors rollup

@compiler-errors
Copy link
Member

compiler-errors commented Mar 30, 2023

Actually, rust-unic doesn't seem to have been updated since 2019(!). I'm actually feeling like we may want to unland the original fix and fix the issue with 'emoji in a better way 😰 ....

edit: That is to say, I'm not confident that rust-unic will fix this any time soon so maybe it's best to approach this fix from another direction 🤔

@compiler-errors
Copy link
Member

Well, this is beta nominated anyways, so perhaps let's use that as an opportunity to discuss this in the compiler meeting tomorrow morning. r=me if we decide to go forward with this fix during that time.

@estebank
Copy link
Contributor Author

We've been using rust-unic in the lexer for a while now to recover identifiers with emojis in general, the PR with the regression only added a path to recover lifetimes in particular. Other codepaths were already excluding ASCII codepoints from consideration. It is disappointing to see such a useful project fall into disrepair, but maybe we can (as a separate effort) try to contact the maintainers (I see @CAD97 was involved a few years back and might be amenable to doing code reviews for fixes?). Otherwise, we could move to another crate like emojis for the checks.

@CAD97
Copy link
Contributor

CAD97 commented Mar 30, 2023

I can do code review, but I don't think I have publish rights (would need to double check), so it'd have to be published as a fork.

Nowadays I'd recommend building on icu4x if possible since that has a real backing rather than relying on the bandwidth of just one person for management.

A big advantage of icu4x is that the data provider can be separated from the algorithm/API provider, so the former doesn't have to wait on the latter for updates, and you get to choose exactly what data you pack. Here's the relevant loader.

(I also fell out of favor with unic's project structure around the time of my most recent commit landing there, and again nowadays tend to prefer icu4x. I can try to email @behnam over the weekend to see if I can get publishing permissions to keep what unic does have updated.)

@petrochenkov
Copy link
Contributor

petrochenkov commented Mar 30, 2023

Why the emoji detection logic even exists in the lexer is beyond me.

Yeah, some person thought it would improve diagnostics, with which I disagree, and it would be tolerable if it were harmless.
But if it causes any issues that already means it doesn't pull its weight.

@thomcc thomcc added the A-Unicode Area: Unicode label Mar 30, 2023
@thomcc
Copy link
Member

thomcc commented Mar 30, 2023

Checking for the Emoji character property is not going to be the right approach, as a number of characters (as you've found) have that property despite not being what most users would consider emojis (filtering ASCII might be enough, although a number of other things that don't really look like emoji, such as , , or ©, will still be present).

I think usually you'd look for full emoji sequences (or perhaps RGI_Emoji) to do this accurately, but I'm not sure. @Manishearth might know, and there might be a reasonable way to do this in icu4x.

@apiraino
Copy link
Contributor

Beta backport declined as per compiler team on Zulip. Seems a safer choice to revert #108031

@rustbot label -beta-nominated

@rustbot rustbot removed the beta-nominated Nominated for backporting to the compiler in the beta channel. label Mar 30, 2023
@Manishearth
Copy link
Member

So an annoying thing is that what is and isn't an emoji is kinda font-dependent too.

There are a couple properties that may be useful here:

  • Emoji_Presentation (list): This should render as an emoji by default
  • Extended_Pictographic (list): This character is pictographic and not from a writing system. Contains a lot of symbols too.
  • Emoji (list): Unicode considers it acceptable to render this character as an emoji, even if it does not recommend it by default. This contains a lot of stuff applications like Discord choose to render as emoji (e.g. ™, though I think Discord no longer does that particular one)

I suspect what you want is the first one. It's not perfect, but it's probably better. Here's what it excludes from Emoji. Note that only those with XIDC=N are relevant for this diagnostic since it's triggered by lex errors.

Yes, ICU4X has APIs for all of these properties, e.g. icu::properties::sets::load_emoji_presentation().

@estebank
Copy link
Contributor Author

t-compiler decided to revert the original PR

@estebank estebank closed this Mar 30, 2023
@CAD97
Copy link
Contributor

CAD97 commented Mar 31, 2023

To the specific case of this diagnostic, the "best" solution would probably be to somehow gate the "likely bad identifier" codepath to only characters which would otherwise get the "unknown start of token" error. `©` is not allowed in identifiers is probably essentially as useful as identifiers cannot contain emoji: `©` , anyway.

@Manishearth
Copy link
Member

Yeah though I think switching to Emoji_Presentation will also be a win.

@Manishearth
Copy link
Member

Also, the latest UAX 31 draft does have an extension for emoji in identifiers which we could include if someone is up to writing an RFC (I'm happy to let them know what would be required and help them along).

@CAD97
Copy link
Contributor

CAD97 commented Mar 31, 2023

Ooh, that's cool! Even if Rust doesn't use it for allowed identifiers (without a technical limitation, I don't see a reason not to, given we already support all sorts of "bad" identifiers via the normal XID sets), using that specification for the recognizing of bad identifiers for nicer errors makes a lot of sense.

The only thing that worries me a little is that part of it is

The removal of the code point U+FE0E VARIATION SELECTOR-15 (the Text Presentation Selector) from the set Continue.

which unless that codepoint is in the RGI emoji set added to both Start and Continue, makes the change make some strings which are currently identifiers not. (I understand the purpose: to prevent the use of an allowed identifier emoji in text presentation. And there's no simple way to say "allowed, but only after a codepoint which isn't part of an emoji sequence.") Though the use of this codepoint in a valid identifier should be rare enough that this should be fine? I guess that's for the eventual RFC to hash out.

@Manishearth
Copy link
Member

there's no simple way to say "allowed, but only after a codepoint which isn't part of an emoji sequence."

It's the opposite: the spec allows it after certain emoji-capable codepoints. But yes, this complicates things. For backcompat it probably makes sense to not touch VS and maybe just exclude emoji that contain characters that are current Rust syntax tokens (it's a very small handful)

@CAD97
Copy link
Contributor

CAD97 commented Mar 31, 2023

Well, I've confused myself sufficiently for today. VS15 = text presentation, VS16 = emoji presentation. The emoji profile removes VS15 from the identifier Continue, uses VS15 for marking the set of characters which are both pattern syntax and emoji presentation as syntax instead of identifiers, removes VS16 from the operator Continue, and has an example of using VS16(?) to mark ☕ as not an identifier, but syntax instead.

I think that because Rust doesn't use Pattern_Syntax for defining the set of syntactically meaningful characters, instead using a small list of ASCII (though IIRC all are in Pattern_Syntax), none of the characters in [[:Pattern_Syntax:]&[:Emoji_Presentation:]]1 should present an ambiguity for Rust, and we should just be able to continue to allow VS15 in identifiers without problems.

Or perhaps you're saying to exclude those characters from identifiers, since following them with VS15 should by the standard emoji profile make them behave as syntax instead of identifier.

Or perhaps you're just referring to cases like *️⃣, which is * followed by VS16 and a combining enclosing keycap. I don't think allowing these would cause any issues for actual emoji-in-identifiers support (though obviously it would for use just for improved diagnostics). What might result in interesting behavior is whether peoples' editors type ☺️ as (single codepoint, disallowed) or ☺️ (with VS16, allowed). (Chance the distinction survives getting posted as a GitHub comment: low. My Android inserts the VS16.)

It'd be nice if the UAX could make the handling of VS15 a bit clearer without having to read UTS51. I thought I understood2 without knowing the intricacies of emoji sequences, just the basics, but now I'm not sure I do anymore. The maybe-typo of using U+FE0F VS16 in the operator example instead of U+FE0E VS15 like discussed in the following prose certainly doesn't help.

Rephrasing, if I understood the spec correctly, the handling of VS15 is just to handle the emoji which are considered syntax. An alternative handling would be to instead just forbid the use of those characters as syntax (this is already the case in Rust), not adding them-plus-VS15 to the set of characters with syntactic use. Thus we can leave VS15 in identifiers without any issue.

I'm obviously interested in being clued in if/when an RFC is posted to adopt the emoji profile for Rust identifiers. But until then this doesn't matter too much.

Footnotes

  1. From iterating over the set with the regex crate: ⌚⌛⏩⏪⏫⏬⏰⏳◽◾☔☕♈♉♊♋♌♍♎♏♐♑♒♓♿⚓⚡⚪⚫⚽⚾⛄⛅⛎⛔⛪⛲⛳⛵⛺⛽✅✊✋✨❌❎❓❔❕❗➕➖➗➰➿⬛⬜⭐⭕

  2. What I initially read as the section resulting in: VS15 (text presentation) is no longer allowed in identifiers. VS16 (emoji presentation) is no longer allowed in syntax/operators. VS15 can be used to mark characters that changed from syntax to identifiers as syntax again.

@Manishearth
Copy link
Member

Yeah, and also I certainly am not planning on proposing this

I may propose an RFC for ZWNJ in the identifier profile, and I may propose some lint improvements as PRs. I would help someone doing a mathematical symbols thing and would sketch out what needs to be done for someone wanting to do emoji, but I don't think we have anyone who wants to do that.

Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Apr 10, 2023
…avidtwco

Revert "Don't recover lifetimes/labels containing emojis as character literals"

Reverts PR rust-lang#108031 per rust-lang#109754 (comment)

Fixes (doesnt close until beta backported) rust-lang#109746

This reverts commit e3f9db5.
This reverts commit 98b82ae.
This reverts commit 380fa26.
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Apr 11, 2023
…avidtwco

Revert "Don't recover lifetimes/labels containing emojis as character literals"

Reverts PR rust-lang#108031 per rust-lang#109754 (comment)

Fixes (doesnt close until beta backported) rust-lang#109746

This reverts commit e3f9db5.
This reverts commit 98b82ae.
This reverts commit 380fa26.
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Apr 11, 2023
…avidtwco

Revert "Don't recover lifetimes/labels containing emojis as character literals"

Reverts PR rust-lang#108031 per rust-lang#109754 (comment)

Fixes (doesnt close until beta backported) rust-lang#109746

This reverts commit e3f9db5.
This reverts commit 98b82ae.
This reverts commit 380fa26.
compiler-errors added a commit to compiler-errors/rust that referenced this pull request Apr 11, 2023
…avidtwco

Revert "Don't recover lifetimes/labels containing emojis as character literals"

Reverts PR rust-lang#108031 per rust-lang#109754 (comment)

Fixes (doesnt close until beta backported) rust-lang#109746

This reverts commit e3f9db5.
This reverts commit 98b82ae.
This reverts commit 380fa26.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Apr 12, 2023
…avidtwco

Revert "Don't recover lifetimes/labels containing emojis as character literals"

Reverts PR rust-lang#108031 per rust-lang#109754 (comment)

Fixes (doesnt close until beta backported) rust-lang#109746

This reverts commit e3f9db5.
This reverts commit 98b82ae.
This reverts commit 380fa26.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Unicode Area: Unicode S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

regression: no emojis in lifetime names (despite no emojis being present?)
9 participants