why doesn't \p{Emoji}+ match all emoji? #947
What version of regex are you using?
Describe the bug at a high level.
I wrote a regex based on \p{Emoji} to capture single emoji. It captures most emoji (in the Apple ecosystem), but some fail to match, such as: Activities: 🕹️... I can list them all if that is helpful.
What are the steps to reproduce the behavior?
Explained.
What is the actual behavior?
Explained.
What is the expected behavior?
Ideally it should capture as many emoji as possible. Or maybe that is impossible?
In the future, when filing issues, it would be helpful to provide code that I can run. That way, I can be sure I know what it is you're talking about. With that said, my guess here is that you're conflating \p{Emoji} with the concept of "emoji." \p{Emoji} is a Unicode property, and it is only one component of the Unicode emoji technical standard (UTS#51). It is beyond the scope of this crate specifically to provide an implementation of UTS#51. Now if you have a program that executes a regex using a Unicode property such as …
For example, just checking your first "joystick" example, we can see that it is behaving as expected. We need to do two things for that. First, we need to write a program to look at the actual behavior, and second, we need to check whether its output is consistent with Unicode's definition of the relevant properties. So here's the program:

use regex::Regex;

fn main() {
    let s = "🕹️";
    // Print every code point in the string as hex.
    for cp in s.chars() {
        println!("{:X}", u32::from(cp));
    }
    // Is every code point in the Emoji property?
    let re = Regex::new(r"^\p{Emoji}+$").unwrap();
    println!("all in emoji? {:?}", re.is_match(s));
    // Is every code point in either Emoji or Emoji_Component?
    let re = Regex::new(r"^[\p{Emoji}\p{Emoji_Component}]+$").unwrap();
    println!("all in emoji or emoji_component? {:?}", re.is_match(s));
}
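To make the "multiple codepoints" point concrete without depending on the regex crate, here is a minimal std-only sketch. It assumes the joystick in the question is U+1F579 followed by U+FE0F. The `code_points` and `label` helpers are hypothetical names made up for this illustration, and `label` only knows about those two code points.

```rust
/// Returns every code point of `s` as a `u32`, in order.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(u32::from).collect()
}

/// Hypothetical helper: labels only the two code points discussed here.
/// A real program would consult the Unicode emoji data instead.
fn label(cp: u32) -> &'static str {
    match cp {
        0x1F579 => "JOYSTICK (has the Emoji property)",
        0xFE0F => "VARIATION SELECTOR-16 (Emoji_Component, not Emoji)",
        _ => "not covered by this sketch",
    }
}

fn main() {
    // Assumption: the emoji is written with explicit escapes so that the
    // otherwise invisible variation selector shows up in the source.
    let s = "\u{1F579}\u{FE0F}";
    for cp in code_points(s) {
        println!("U+{:04X}  {}", cp, label(cp));
    }
}
```

If the string really does contain both code points, a pattern anchored as ^\p{Emoji}+$ cannot match it, because the second code point is only in Emoji_Component.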
I had a suspicion that your emoji were actually composed of multiple codepoints, and that it was likely that one of them wasn't in \p{Emoji}.
(I got somewhat lucky that the codepoints used here appear explicitly in the file. They may not, as the file is a sequence of ranges. So lookup might take a little more work than a simple grep query.) As we can see, …
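The remark about ranges can be sketched as follows. This is a rough std-only illustration, not how the regex crate does it. The `SAMPLE` lines are written in the style of Unicode's emoji data file but are an assumption chosen for illustration, not a verbatim excerpt, and `Entry`, `parse`, and `has_property` are hypothetical names.

```rust
/// One parsed entry: an inclusive code point range and the property it carries.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    start: u32,
    end: u32,
    property: String,
}

/// Illustrative sample only (an assumption, not a verbatim excerpt).
const SAMPLE: &str = "\
0030..0039    ; Emoji                # digits
1F579         ; Emoji                # joystick
1F601..1F606  ; Emoji                # a run of face emoji
FE0F          ; Emoji_Component      # variation selector-16
";

/// Parses lines shaped like `START..END ; Property # comment` or
/// `CP ; Property # comment`. Lines that do not fit are skipped.
fn parse(data: &str) -> Vec<Entry> {
    let mut entries = Vec::new();
    for line in data.lines() {
        // Drop the trailing comment, if any.
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        let (range, property) = match (fields.next(), fields.next()) {
            (Some(r), Some(p)) => (r, p),
            _ => continue,
        };
        // A single code point is treated as a range of length one.
        let (lo, hi) = match range.split_once("..") {
            Some((lo, hi)) => (lo, hi),
            None => (range, range),
        };
        if let (Ok(start), Ok(end)) =
            (u32::from_str_radix(lo, 16), u32::from_str_radix(hi, 16))
        {
            entries.push(Entry { start, end, property: property.to_string() });
        }
    }
    entries
}

/// Linear scan: true if `cp` falls inside any range listed for `property`.
fn has_property(entries: &[Entry], cp: u32, property: &str) -> bool {
    entries
        .iter()
        .any(|e| e.property == property && e.start <= cp && cp <= e.end)
}

fn main() {
    let table = parse(SAMPLE);
    // U+1F603 never appears literally in SAMPLE, so a grep for "1F603"
    // finds nothing, yet the range lookup still reports it.
    println!("grep finds 1F603? {}", SAMPLE.contains("1F603"));
    println!("1F603 in Emoji? {}", has_property(&table, 0x1F603, "Emoji"));
    println!("FE0F in Emoji? {}", has_property(&table, 0xFE0F, "Emoji"));
}
```

A plain text search only finds code points that happen to be range endpoints or standalone entries, which is why the lookup above parses the ranges first.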
In the future, when filing issues, it would be helpful to provide code that I can run. That way, I can be sure I know what it is you're talking about.
With that said, my guess here is that you're conflating \p{Emoji} with the concept of "emoji." \p{Emoji} is a Unicode property, and it is one component of the Unicode emoji technical standard. Actually detecting and extracting emoji requires implementing that UTS, which will use \p{Emoji} for sure, but it is nowhere close to sufficient to implement. Appendix A of UTS#51 outlines the various properties related to emoji, and Emoji is merely one of them. Presumably one would need to use all of them to implement proper emoji extraction. (I note…