Introduce new REGEX_POSSIBLE which contains the regex described in …
janlelis committed Oct 17, 2024
1 parent c777cd4 commit 5e3e380
Showing 9 changed files with 162 additions and 87 deletions.
68 changes: 0 additions & 68 deletions .github/workflows/codeql-analysis.yml

This file was deleted.

4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -3,11 +3,13 @@
### 3.7.0 (unreleased)

- Bump required Ruby slightly to 2.5
- Introduce new `REGEX_POSSIBLE` which contains the regex described in
  https://www.unicode.org/reports/tr51/#EBNF_and_Regex
- Fix that some valid subdivisions were not decompressed (`REGEX_VALID`)
- Be stricter about selection of tag characters in (`REGEX_WELL_FORMED`)
  - Only U+E0030..U+E0039, U+E0061..U+E007A allowed
  - Max tag sequence length
- Use native /\p{RI}/ regex for regional indicators
- Use native `/\p{RI}/` regex for regional indicators

### 3.6.0

32 changes: 17 additions & 15 deletions README.md
@@ -48,29 +48,31 @@ Matches (non-textual) Emoji of all kinds:

Regex | Description | Example Matches | Example Non-Matches
------------------------------|-------------|-----------------|--------------------
`Unicode::Emoji::REGEX` | **Use this if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *recommended* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` | `😴︎`, ``, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢` | `😴︎`, ``, `🏻`, `🇵🇵`
`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵` | `😴︎`, ``, `🏻`
`Unicode::Emoji::REGEX` | **Use this if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` | `😴︎`, ``, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢` | `😴︎`, ``, `🏻`, `🇵🇵`
`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵` | `😴︎`, ``, `🏻`
`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `🇵🇵`, `😴︎`, ``, `🏻`, `1` |

##### Picking the Right Emoji Regex

- Usually you just want `REGEX` (RGI set)
- If you want broader matching (e.g. more sub-regions), choose `REGEX_VALID`
- If you even want to match for invalid sequences, too, use `REGEX_WELL_FORMED`
- If you want a quick check for possible Emoji, which might contain false positives, use `REGEX_POSSIBLE` ([suggested in the Unicode Standard](https://www.unicode.org/reports/tr51/#EBNF_and_Regex))

Property | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed) | `REGEX_POSSIBLE`
---------|-----------------------------|-----------------------|-----------------------------------|-----------------
Region "🇵🇹" | Yes | Yes | Yes | Yes
Region "🇵🇵" | No | No | Yes | Yes
Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿" | Yes | Yes | Yes | Yes
Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿" | No | Yes | Yes | Yes
Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿" | No | No | Yes | Yes
ZWJ Sequence "🤾🏽‍♀️" | Yes | Yes | Yes | Yes
ZWJ Sequence "🤠‍🤢" | No | Yes | Yes | Yes

Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for details.

Property | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed)
---------|-----------------------------|-----------------------|----------------------------------
Region "🇵🇹" | Yes | Yes | Yes
Region "🇵🇵" | No | No | Yes
Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿" | Yes | Yes | Yes
Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿" | No | Yes | Yes
Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿" | No | No | Yes
ZWJ Sequence "🤾🏽‍♀️" | Yes | Yes | Yes
ZWJ Sequence "🤠‍🤢" | No | Yes | Yes

More info about valid vs. recommended Emoji in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
More info about valid vs. recommended Emoji also in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).

#### Singleton Regexes

@@ -83,7 +85,7 @@ Regex | Description | Example Matches | Example Non-Matc

#### Include Textual Emoji

By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes. However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:

Regex | Description | Example Matches | Example Non-Matches
------------------------------|-------------|-----------------|--------------------
20 changes: 20 additions & 0 deletions data/generate_constants.rb
@@ -177,6 +177,22 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
      emoji_well_formed_core_sequence,
    )

  emoji_possible_modification = \
    join(
      emoji_modifier,
      pack([VS16, EMOJI_KEYCAP_SUFFIX]) + "?",
      "[󠀠-󠁾]+󠁿" # raw tags
    )

  emoji_possible_zwj_element = \
    join(
      emoji_well_formed_flag_sequence,
      emoji_character + emoji_possible_modification + "?"
    )

  emoji_possible = \
    emoji_possible_zwj_element + "(?:" + pack(ZWJ) + emoji_possible_zwj_element + ")*"

  regexes = {}

  # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
@@ -188,6 +204,10 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
  # Matches basic singleton emoji and all kind of sequences
  regexes[:REGEX_WELL_FORMED] = Regexp.compile(emoji_well_formed_sequence)

  # Quick test which might lead to false positives
  # See https://www.unicode.org/reports/tr51/#EBNF_and_Regex
  regexes[:REGEX_POSSIBLE] = Regexp.compile(emoji_possible)

  # Matches only basic single, non-textual emoji
  # Ignores "components" like modifiers or simple digits
  regexes[:REGEX_BASIC] = Regexp.compile(
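
The composition added in this hunk can be tried in isolation with a minimal sketch like the one below. It rests on several assumptions: the `pack` and `join` methods here are guesses at what the script's helpers do (their definitions are not part of this diff), the three fragment strings are stand-ins built on Ruby's native Unicode properties rather than the values the real script derives from Unicode data, and `REGEX_POSSIBLE_SKETCH` is a hypothetical name.

```ruby
# Hypothetical stand-ins for the generator's helpers (not shown in the diff).
def pack(ords)
  Array(ords).pack("U*") # codepoint(s) -> UTF-8 string
end

def join(*alternatives)
  "(?:" + alternatives.join("|") + ")" # non-capturing alternation
end

ZWJ = 0x200D
VS16 = 0xFE0F
EMOJI_KEYCAP_SUFFIX = 0x20E3

# Assumed fragments; the real script computes these from Unicode data files.
emoji_character = "\\p{Emoji}"
emoji_modifier = "\\p{Emoji_Modifier}"
emoji_well_formed_flag_sequence = "\\p{Regional_Indicator}{2}"

# An emoji character may be followed by a skin tone modifier, by VS16 with an
# optional keycap suffix, or by raw tag characters closed by U+E007F.
emoji_possible_modification = join(
  emoji_modifier,
  pack([VS16, EMOJI_KEYCAP_SUFFIX]) + "?",
  "[\u{E0020}-\u{E007E}]+\u{E007F}"
)

# One ZWJ element is either a pair of regional indicators or an emoji
# character with an optional modification.
emoji_possible_zwj_element = join(
  emoji_well_formed_flag_sequence,
  emoji_character + emoji_possible_modification + "?"
)

# Any number of further elements may follow, each introduced by a ZWJ.
emoji_possible =
  emoji_possible_zwj_element + "(?:" + pack(ZWJ) + emoji_possible_zwj_element + ")*"

REGEX_POSSIBLE_SKETCH = Regexp.compile(emoji_possible)

puts "2\u{FE0F}\u{20E3} keycap: 2"[REGEX_POSSIBLE_SKETCH].codepoints.size # 3 codepoints matched
puts "8\u{20E3} text keycap: 8"[REGEX_POSSIBLE_SKETCH]                    # only the digit is matched
```

Note that the `?` appended after the packed `VS16` and keycap suffix applies to the keycap suffix alone, so a bare variation selector 16 is accepted while a keycap suffix without it is not. This is consistent with the spec in this commit expecting only the digit to match for non-emoji keycap sequences.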
2 changes: 1 addition & 1 deletion lib/unicode/emoji.rb
@@ -22,7 +22,7 @@ module Emoji
)

%w[
REGEX REGEX_VALID REGEX_WELL_FORMED REGEX_BASIC REGEX_TEXT REGEX_ANY REGEX_INCLUDE_TEXT
REGEX REGEX_VALID REGEX_WELL_FORMED REGEX_POSSIBLE REGEX_BASIC REGEX_TEXT REGEX_ANY REGEX_INCLUDE_TEXT
REGEX_VALID_INCLUDE_TEXT REGEX_WELL_FORMED_INCLUDE_TEXT REGEX_PICTO REGEX_PICTO_NO_EMOJI
].each do |const_name|
autoload const_name, File.join(generated_constants_dirpath, const_name.downcase)
3 changes: 3 additions & 0 deletions lib/unicode/emoji/constants.rb
@@ -24,6 +24,9 @@ module Emoji
    SPEC_TAGS = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze
    EMOJI_KEYCAP_SUFFIX = 0x20E3
    ZWJ = 0x200D
    VS15 = 0xFE0E
    VS16 = 0xFE0F
    ENCLOSING_KEYCAP = 0x20E3
    REGIONAL_INDICATORS = [*0x1F1E6..0x1F1FF].freeze
  end
end
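
`SPEC_TAGS` in the hunk above lists the tag characters that the changelog says are the only ones allowed in `REGEX_WELL_FORMED`. As a rough illustration of that rule at the codepoint level, a check might look like the sketch below. The method `tag_spec_allowed?` and the constant `CANCEL_TAG` are hypothetical and not part of the library, the base is assumed to be a single codepoint, and the maximum-length rule is left out because its value does not appear in this diff.

```ruby
# Copied from lib/unicode/emoji/constants.rb as shown in the diff.
SPEC_TAGS = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze

# U+E007F CANCEL TAG closes a tag sequence (hypothetical constant name).
CANCEL_TAG = 0xE007F

# Returns true when every tag character between the base and the closing
# CANCEL TAG is one of SPEC_TAGS. Hypothetical helper, not the library's API.
# Assumes the base emoji is exactly one codepoint.
def tag_spec_allowed?(sequence)
  *tags, terminator = sequence.codepoints.drop(1) # drop the base character
  return false if terminator != CANCEL_TAG || tags.empty?

  tags.all? { |codepoint| SPEC_TAGS.include?(codepoint) }
end

scotland = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}"
dollar   = "\u{1F3F4}\u{E0024}\u{E007F}"

puts tag_spec_allowed?(scotland) # true: tag letters g, b, s, c, t
puts tag_spec_allowed?(dollar)   # false: U+E0024 is outside SPEC_TAGS
```

The second input is the same black flag plus TAG DOLLAR SIGN sequence that the spec in this commit uses to show `REGEX_WELL_FORMED` matching only the black flag.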
8 changes: 8 additions & 0 deletions lib/unicode/emoji/generated/regex_possible.rb
@@ -0,0 +1,8 @@
# This file was generated by a script, please do not edit it by hand.
# See `$ rake generate_constants` and data/generate_constants.rb for more info.

module Unicode
module Emoji
REGEX_POSSIBLE = /(?:\p{RI}{2}|[\#\*0-9©®‼⁉™ℹ↔-↙↩↪⌚⌛⌨⏏⏩-⏳⏸-⏺Ⓜ▪▫▶◀◻-◾☀-☄☎☑☔☕☘☝☠☢☣☦☪☮☯☸-☺♀♂♈-♓♟♠♣♥♦♨♻♾♿⚒-⚗⚙⚛⚜⚠⚡⚧⚪⚫⚰⚱⚽⚾⛄⛅⛈⛎⛏⛑⛓⛔⛩⛪⛰-⛵⛷-⛺⛽✂✅✈-✍✏✒✔✖✝✡✨✳✴❄❇❌❎❓-❕❗❣❤➕-➗➡➰➿⤴⤵⬅-⬇⬛⬜⭐⭕〰〽㊗㊙🀄🃏🅰🅱🅾🅿🆎🆑-🆚🇦-🇿🈁🈂🈚🈯🈲-🈺🉐🉑🌀-🌡🌤-🎓🎖🎗🎙-🎛🎞-🏰🏳-🏵🏷-📽📿-🔽🕉-🕎🕐-🕧🕯🕰🕳-🕺🖇🖊-🖍🖐🖕🖖🖤🖥🖨🖱🖲🖼🗂-🗄🗑-🗓🗜-🗞🗡🗣🗨🗯🗳🗺-🙏🚀-🛅🛋-🛒🛕-🛗🛜-🛥🛩🛫🛬🛰🛳-🛼🟠-🟫🟰🤌-🤺🤼-🥅🥇-🧿🩰-🩼🪀-🪉🪏-🫆🫎-🫜🫟-🫩🫰-🫸](?:[🏻-🏿]|️⃣?|[󠀠-󠁾]+󠁿)?)(?:‍(?:\p{RI}{2}|[\#\*0-9©®‼⁉™ℹ↔-↙↩↪⌚⌛⌨⏏⏩-⏳⏸-⏺Ⓜ▪▫▶◀◻-◾☀-☄☎☑☔☕☘☝☠☢☣☦☪☮☯☸-☺♀♂♈-♓♟♠♣♥♦♨♻♾♿⚒-⚗⚙⚛⚜⚠⚡⚧⚪⚫⚰⚱⚽⚾⛄⛅⛈⛎⛏⛑⛓⛔⛩⛪⛰-⛵⛷-⛺⛽✂✅✈-✍✏✒✔✖✝✡✨✳✴❄❇❌❎❓-❕❗❣❤➕-➗➡➰➿⤴⤵⬅-⬇⬛⬜⭐⭕〰〽㊗㊙🀄🃏🅰🅱🅾🅿🆎🆑-🆚🇦-🇿🈁🈂🈚🈯🈲-🈺🉐🉑🌀-🌡🌤-🎓🎖🎗🎙-🎛🎞-🏰🏳-🏵🏷-📽📿-🔽🕉-🕎🕐-🕧🕯🕰🕳-🕺🖇🖊-🖍🖐🖕🖖🖤🖥🖨🖱🖲🖼🗂-🗄🗑-🗓🗜-🗞🗡🗣🗨🗯🗳🗺-🙏🚀-🛅🛋-🛒🛕-🛗🛜-🛥🛩🛫🛬🛰🛳-🛼🟠-🟫🟰🤌-🤺🤼-🥅🥇-🧿🩰-🩼🪀-🪉🪏-🫆🫎-🫜🫟-🫩🫰-🫸](?:[🏻-🏿]|️⃣?|[󠀠-󠁾]+󠁿)?))*/
end
end
8 changes: 8 additions & 0 deletions lib/unicode/emoji/generated_native/regex_possible.rb
@@ -0,0 +1,8 @@
# This file was generated by a script, please do not edit it by hand.
# See `$ rake generate_constants` and data/generate_constants.rb for more info.

module Unicode
module Emoji
REGEX_POSSIBLE = /(?:\p{RI}{2}|\p{Emoji}(?:\p{EMod}|️⃣?|[󠀠-󠁾]+󠁿)?)(?:‍(?:\p{RI}{2}|\p{Emoji}(?:\p{EMod}|️⃣?|[󠀠-󠁾]+󠁿)?))*/
end
end
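
The native pattern above contains invisible characters (VS16, ZWJ, and tag characters) that are hard to read in a diff. The sketch below rewrites it with `\u{...}` escapes and long property names, then scans a sample string to show the kind of false positive the README warns about. The constant names `POSSIBLE_ELEMENT` and `POSSIBLE` are hypothetical, and this is an approximation for reading purposes rather than the generated file itself.

```ruby
# One element: two regional indicators, or an emoji character optionally
# followed by a modifier, VS16 (+ optional keycap suffix), or raw tags.
POSSIBLE_ELEMENT = /\p{Regional_Indicator}{2}|\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?/

# Elements may be chained with U+200D ZERO WIDTH JOINER.
POSSIBLE = /#{POSSIBLE_ELEMENT}(?:\u{200D}#{POSSIBLE_ELEMENT})*/

text = "Room 7: \u{1F634} zzz"
matches = text.scan(POSSIBLE)
p matches # the digit 7 is reported alongside the sleeping face
```

Since digits, `#`, and `*` carry the `Emoji` property, a quick check like this one is best treated as a first filter, with one of the stricter regexes applied afterwards where precision matters.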
104 changes: 102 additions & 2 deletions spec/unicode_emoji_spec.rb
@@ -311,7 +311,7 @@
assert_equal "🏴󠁧󠁢󠁡󠁧󠁢󠁿", $&
end

it "does match invalid tag sequences" do
it "matches invalid base tag sequences" do
"😴󠁧󠁢󠁡󠁡󠁡󠁿 GB AAA" =~ Unicode::Emoji::REGEX_WELL_FORMED
assert_equal "😴󠁧󠁢󠁡󠁡󠁡󠁿", $&
end
@@ -321,7 +321,7 @@
assert_equal "🏴", $&
end

it "does not match too long tag sequences (only black flag is matched)" do
it "does not match invalid tag sequences (only black flag is matched)" do
"🏴󠀤󠁿 $" =~ Unicode::Emoji::REGEX_WELL_FORMED
assert_equal "🏴", $&
end
@@ -337,6 +337,106 @@
end
end

describe "REGEX_POSSIBLE" do
it "matches most singleton emoji codepoints" do
"😴 sleeping face" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "😴", $&
end

it "matches singleton emoji in combination with emoji variation selector" do
"😴\u{FE0F} sleeping face" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "😴\u{FE0F}", $&
end

it "matches singleton emoji (without VS) when in combination with text variation selector" do
"😴\u{FE0E} sleeping face" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "😴", $&
end

it "matches textual singleton emoji" do
"▶ play button" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "▶", $&
end

it "matches textual singleton emoji in combination with emoji variation selector" do
"▶\u{FE0F} play button" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "▶\u{FE0F}", $&
end

it "matches singleton 'component' emoji codepoints" do
"🏻 light skin tone" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🏻", $&
end

it "matches modified emoji if modifier base emoji is used" do
"🛌🏽 person in bed: medium skin tone" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🛌🏽", $&
end

it "matches modified emoji even if no modifier base emoji is used" do
"🌵🏽 cactus" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🌵🏽", $&
end

it "matches valid region flags" do
"🇵🇹 Portugal" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🇵🇹", $&
end

it "does match invalid region flags" do
"🇵🇵 PP Land" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🇵🇵", $&
end

it "matches emoji keycap sequences" do
"2️⃣ keycap: 2" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "2️⃣", $&
end

it "matches only digit of non-emoji keycap sequences" do
"8⃣ text keycap: 8" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "8", $&

"#⃣ text keycap: #" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "#", $&
end

it "matches recommended tag sequences" do
"🏴󠁧󠁢󠁳󠁣󠁴󠁿 Scotland" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🏴󠁧󠁢󠁳󠁣󠁴󠁿", $&
end

it "matches valid tag sequences, even though they are not recommended" do
"🏴󠁧󠁢󠁡󠁧󠁢󠁿 GB AGB" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🏴󠁧󠁢󠁡󠁧󠁢󠁿", $&
end

it "matches invalid base tag sequences" do
"😴󠁧󠁢󠁡󠁡󠁡󠁿 GB AAA" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "😴󠁧󠁢󠁡󠁡󠁡󠁿", $&
end

it "matches too long tag sequences" do
"🏴󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁿 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🏴󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁿", $&
end

it "matches invalid tag sequences" do
"🏴󠀤󠁿 $" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🏴󠀤󠁿", $&
end

it "matches recommended zwj sequences" do
"🤾🏽‍♀️ woman playing handball: medium skin tone" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🤾🏽‍♀️", $&
end

it "matches valid zwj sequences, even though they are not recommended" do
"🤠‍🤢 vomiting cowboy" =~ Unicode::Emoji::REGEX_POSSIBLE
assert_equal "🤠‍🤢", $&
end
end

describe "REGEX_BASIC" do
it "matches most singleton emoji codepoints" do
"😴 sleeping face" =~ Unicode::Emoji::REGEX_BASIC