Introduce new REGEX_POSSIBLE which contains the regex described in …

https://www.unicode.org/reports/tr51/#EBNF_and_Regex
janlelis · Oct 17, 2024 · 5e3e380
1 parent c777cd4
commit 5e3e380
Showing 9 changed files with 162 additions and 87 deletions.
diff --git a/.github/workflows/codeql-analysis.yml b/.github/workflows/codeql-analysis.yml
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,11 +3,13 @@
 ### 3.7.0 (unreleased)
 
 - Bump required Ruby slightly to 2.5
+- Introduce new `REGEX_POSSIBLE` which contains the regex described in
+  https://www.unicode.org/reports/tr51/#EBNF_and_Regex
 - Fix that some valid subdivisions were not decompressed (`REGEX_VALID`)
 - Be stricter about selection of tag characters in (`REGEX_WELL_FORMED`)
   - Only U+E0030..U+E0039, U+E0061..U+E007A allowed
   - Max tag sequence length
-- Use native /\p{RI}/ regex for regional indicators
+- Use native `/\p{RI}/` regex for regional indicators
 
 ### 3.6.0
 

diff --git a/README.md b/README.md
@@ -48,29 +48,31 @@ Matches (non-textual) Emoji of all kinds:
 
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX`       | **Use this if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *recommended* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
-`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`
-`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`
+`Unicode::Emoji::REGEX`       | **Use this if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
+`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`
+`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`
+`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` |
 
 ##### Picking the Right Emoji Regex
 
 - Usually you just want `REGEX` (RGI set)
 - If you want broader matching (e.g. more sub-regions), choose `REGEX_VALID`
 - If you also want to match invalid sequences, use `REGEX_WELL_FORMED`
+- If you want a quick check for possible Emoji, which might contain false positives, use `REGEX_POSSIBLE` ([suggested in the Unicode Standard](https://www.unicode.org/reports/tr51/#EBNF_and_Regex))
+
+Property | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed) | `REGEX_POSSIBLE`
+---------|-----------------------------|-----------------------|-----------------------------------|-----------------
+Region "🇵🇹"                    | Yes | Yes | Yes | Yes
+Region "🇵🇵"                   | No  | No  | Yes | Yes
+Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿"              | Yes | Yes | Yes | Yes
+Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿"              | No  | Yes | Yes | Yes
+Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿"              | No  | No  | Yes | Yes
+ZWJ Sequence "🤾🏽‍♀️"           | Yes | Yes | Yes | Yes
+ZWJ Sequence "🤠‍🤢"            | No  | Yes | Yes | Yes
 
 Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for details.
 
-Property | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed)
----------|-----------------------------|-----------------------|----------------------------------
-Region "🇵🇹"                    | Yes | Yes | Yes
-Region "🇵🇵"                   | No  | No  | Yes
-Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿"              | Yes | Yes | Yes
-Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿"              | No  | Yes | Yes
-Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿"              | No  | No  | Yes
-ZWJ Sequence "🤾🏽‍♀️"           | Yes | Yes | Yes
-ZWJ Sequence "🤠‍🤢"            | No  | Yes | Yes
-
-More info about valid vs. recommended Emoji in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
+More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
 
 #### Singleton Regexes
 
@@ -83,7 +85,7 @@ Regex                         | Description | Example Matches | Example Non-Matc
 
 #### Include Textual Emoji
 
-By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes. However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
+By default, textual Emoji (emoji characters with a text variation selector or those that have a default text presentation) are not matched by the regexes above (except by `REGEX_POSSIBLE`). If you wish to match them too, you can include them by appending the `_INCLUDE_TEXT` suffix:
 
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
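
The README change above notes that `REGEX_POSSIBLE` is a quick check that may produce false positives, such as single digits. A minimal sketch of how a caller might use it as a first pass follows. To keep it runnable without the gem, `possible_emoji` is a hypothetical stand-in that restates the native pattern added in this commit with long property names; `likely_emoji` and its ASCII-only filter are illustrative assumptions, not part of the library.

```ruby
# Hypothetical stand-in for Unicode::Emoji::REGEX_POSSIBLE (native variant),
# written with escapes so that no invisible characters appear in the source.
def possible_emoji
  /(?:\p{Regional_Indicator}{2}|\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?)(?:\u{200D}(?:\p{Regional_Indicator}{2}|\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?))*/
end

# Illustrative helper (an assumption, not gem API): collect every possible
# match, then drop matches that consist only of ASCII characters, i.e. the
# bare digits, "#" and "*" that the quick check also accepts.
def likely_emoji(text)
  text.scan(possible_emoji).reject(&:ascii_only?)
end

p likely_emoji("Room 101 \u{1F634} and 2\u{FE0F}\u{20E3}")
```

The ASCII filter only removes the most obvious false positives. For a stricter answer, each remaining candidate could be re-checked against `REGEX` or `REGEX_VALID`.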

diff --git a/data/generate_constants.rb b/data/generate_constants.rb
@@ -177,6 +177,22 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
       emoji_well_formed_core_sequence,
     )
 
+  emoji_possible_modification = \
+    join(
+      emoji_modifier,
+      pack([VS16, EMOJI_KEYCAP_SUFFIX]) + "?",
+      "[󠀠-󠁾]+󠁿" # raw tags
+    )
+
+  emoji_possible_zwj_element = \
+    join(
+      emoji_well_formed_flag_sequence,
+      emoji_character + emoji_possible_modification + "?"
+    )
+
+  emoji_possible = \
+    emoji_possible_zwj_element + "(?:" + pack(ZWJ) + emoji_possible_zwj_element + ")*"
+
   regexes = {}
 
   # Matches basic singleton emoji and all kinds of sequences, but restricts zwj and tag sequences to known sequences (RGI)
@@ -188,6 +204,10 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   # Matches basic singleton emoji and all kinds of sequences
   regexes[:REGEX_WELL_FORMED] = Regexp.compile(emoji_well_formed_sequence)
 
+  # Quick test which might lead to false positives
+  # See https://www.unicode.org/reports/tr51/#EBNF_and_Regex
+  regexes[:REGEX_POSSIBLE] = Regexp.compile(emoji_possible)
+
   # Matches only basic single, non-textual emoji
   # Ignores "components" like modifiers or simple digits
   regexes[:REGEX_BASIC] = Regexp.compile(

diff --git a/lib/unicode/emoji.rb b/lib/unicode/emoji.rb
@@ -22,7 +22,7 @@ module Emoji
     )
 
     %w[
-      REGEX REGEX_VALID REGEX_WELL_FORMED REGEX_BASIC REGEX_TEXT REGEX_ANY REGEX_INCLUDE_TEXT
+      REGEX REGEX_VALID REGEX_WELL_FORMED REGEX_POSSIBLE REGEX_BASIC REGEX_TEXT REGEX_ANY REGEX_INCLUDE_TEXT
       REGEX_VALID_INCLUDE_TEXT REGEX_WELL_FORMED_INCLUDE_TEXT REGEX_PICTO REGEX_PICTO_NO_EMOJI
     ].each do |const_name|
       autoload const_name, File.join(generated_constants_dirpath, const_name.downcase)

diff --git a/lib/unicode/emoji/constants.rb b/lib/unicode/emoji/constants.rb
@@ -24,6 +24,9 @@ module Emoji
     SPEC_TAGS                     = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze
     EMOJI_KEYCAP_SUFFIX           = 0x20E3
     ZWJ                           = 0x200D
+    VS15                          = 0xFE0E
+    VS16                          = 0xFE0F
+    ENCLOSING_KEYCAP              = 0x20E3
     REGIONAL_INDICATORS           = [*0x1F1E6..0x1F1FF].freeze
   end
 end
diff --git a/lib/unicode/emoji/generated/regex_possible.rb b/lib/unicode/emoji/generated/regex_possible.rb
@@ -0,0 +1,8 @@
+# This file was generated by a script, please do not edit it by hand.
+# See `$ rake generate_constants` and data/generate_constants.rb for more info.
+
+module Unicode
+  module Emoji
+    REGEX_POSSIBLE = /(?:\p{RI}{2}|[\#\*0-9©®‼⁉™ℹ↔-↙↩↪⌚⌛⌨⏏⏩-⏳⏸-⏺Ⓜ▪▫▶◀◻-◾☀-☄☎☑☔☕☘☝☠☢☣☦☪☮☯☸-☺♀♂♈-♓♟♠♣♥♦♨♻♾♿⚒-⚗⚙⚛⚜⚠⚡⚧⚪⚫⚰⚱⚽⚾⛄⛅⛈⛎⛏⛑⛓⛔⛩⛪⛰-⛵⛷-⛺⛽✂✅✈-✍✏✒✔✖✝✡✨✳✴❄❇❌❎❓-❕❗❣❤➕-➗➡➰➿⤴⤵⬅-⬇⬛⬜⭐⭕〰〽㊗㊙🀄🃏🅰🅱🅾🅿🆎🆑-🆚🇦-🇿🈁🈂🈚🈯🈲-🈺🉐🉑🌀-🌡🌤-🎓🎖🎗🎙-🎛🎞-🏰🏳-🏵🏷-📽📿-🔽🕉-🕎🕐-🕧🕯🕰🕳-🕺🖇🖊-🖍🖐🖕🖖🖤🖥🖨🖱🖲🖼🗂-🗄🗑-🗓🗜-🗞🗡🗣🗨🗯🗳🗺-🙏🚀-🛅🛋-🛒🛕-🛗🛜-🛥🛩🛫🛬🛰🛳-🛼🟠-🟫🟰🤌-🤺🤼-🥅🥇-🧿🩰-🩼🪀-🪉🪏-🫆🫎-🫜🫟-🫩🫰-🫸](?:[🏻-🏿]|️⃣?|[󠀠-󠁾]+󠁿)?)(?:‍(?:\p{RI}{2}|[\#\*0-9©®‼⁉™ℹ↔-↙↩↪⌚⌛⌨⏏⏩-⏳⏸-⏺Ⓜ▪▫▶◀◻-◾☀-☄☎☑☔☕☘☝☠☢☣☦☪☮☯☸-☺♀♂♈-♓♟♠♣♥♦♨♻♾♿⚒-⚗⚙⚛⚜⚠⚡⚧⚪⚫⚰⚱⚽⚾⛄⛅⛈⛎⛏⛑⛓⛔⛩⛪⛰-⛵⛷-⛺⛽✂✅✈-✍✏✒✔✖✝✡✨✳✴❄❇❌❎❓-❕❗❣❤➕-➗➡➰➿⤴⤵⬅-⬇⬛⬜⭐⭕〰〽㊗㊙🀄🃏🅰🅱🅾🅿🆎🆑-🆚🇦-🇿🈁🈂🈚🈯🈲-🈺🉐🉑🌀-🌡🌤-🎓🎖🎗🎙-🎛🎞-🏰🏳-🏵🏷-📽📿-🔽🕉-🕎🕐-🕧🕯🕰🕳-🕺🖇🖊-🖍🖐🖕🖖🖤🖥🖨🖱🖲🖼🗂-🗄🗑-🗓🗜-🗞🗡🗣🗨🗯🗳🗺-🙏🚀-🛅🛋-🛒🛕-🛗🛜-🛥🛩🛫🛬🛰🛳-🛼🟠-🟫🟰🤌-🤺🤼-🥅🥇-🧿🩰-🩼🪀-🪉🪏-🫆🫎-🫜🫟-🫩🫰-🫸](?:[🏻-🏿]|️⃣?|[󠀠-󠁾]+󠁿)?))*/
+  end
+end
diff --git a/lib/unicode/emoji/generated_native/regex_possible.rb b/lib/unicode/emoji/generated_native/regex_possible.rb
@@ -0,0 +1,8 @@
+# This file was generated by a script, please do not edit it by hand.
+# See `$ rake generate_constants` and data/generate_constants.rb for more info.
+
+module Unicode
+  module Emoji
+    REGEX_POSSIBLE = /(?:\p{RI}{2}|\p{Emoji}(?:\p{EMod}|️⃣?|[󠀠-󠁾]+󠁿)?)(?:‍(?:\p{RI}{2}|\p{Emoji}(?:\p{EMod}|️⃣?|[󠀠-󠁾]+󠁿)?))*/
+  end
+end
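
The native pattern above is easier to read when assembled from its parts, in the same order that `data/generate_constants.rb` builds it: an optional modification, a ZWJ element, then a ZWJ-joined repetition. The following is a rough stand-alone sketch, not the generator itself. The method name `build_possible_emoji_regex` and the local variable names are hypothetical, and the long property names are assumed to be equivalent to the `\p{RI}` and `\p{EMod}` aliases used in the generated file.

```ruby
# Hypothetical rebuild of the "possible emoji" pattern from its three parts.
def build_possible_emoji_regex
  vs16   = "\u{FE0F}" # emoji variation selector
  keycap = "\u{20E3}" # combining enclosing keycap
  zwj    = "\u{200D}" # zero width joiner

  # What may follow an emoji character: a skin tone modifier, VS16 with an
  # optional keycap, or a run of raw tag characters closed by CANCEL TAG.
  modification =
    "(?:\\p{Emoji_Modifier}|#{vs16}#{keycap}?|[\u{E0020}-\u{E007E}]+\u{E007F})"

  # One ZWJ element: a pair of regional indicators, or an emoji character
  # with an optional modification.
  element = "(?:\\p{Regional_Indicator}{2}|\\p{Emoji}#{modification}?)"

  # A possible emoji: one element, then any number of ZWJ-joined elements.
  Regexp.new("#{element}(?:#{zwj}#{element})*")
end

re = build_possible_emoji_regex
p "\u{1F1F5}\u{1F1F5} PP Land"[re] == "\u{1F1F5}\u{1F1F5}" # invalid flag still matches
p "8\u{20E3} text keycap"[re] == "8"                        # only the digit matches
```

Because the regional indicator pair is the first alternative, a flag is consumed as one unit before `\p{Emoji}` gets a chance to match a single indicator.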
diff --git a/spec/unicode_emoji_spec.rb b/spec/unicode_emoji_spec.rb
@@ -311,7 +311,7 @@
       assert_equal "🏴󠁧󠁢󠁡󠁧󠁢󠁿", $&
     end
 
-    it "does match invalid tag sequences" do
+    it "matches invalid base tag sequences" do
       "😴󠁧󠁢󠁡󠁡󠁡󠁿 GB AAA" =~ Unicode::Emoji::REGEX_WELL_FORMED
       assert_equal "😴󠁧󠁢󠁡󠁡󠁡󠁿", $&
     end
@@ -321,7 +321,7 @@
       assert_equal "🏴", $&
     end
 
-    it "does not match too long tag sequences (only black flag is matched)" do
+    it "does not match invalid tag sequences (only black flag is matched)" do
       "🏴󠀤󠁿 $" =~ Unicode::Emoji::REGEX_WELL_FORMED
       assert_equal "🏴", $&
     end
@@ -337,6 +337,106 @@
     end
   end
 
+  describe "REGEX_POSSIBLE" do
+    it "matches most singleton emoji codepoints" do
+      "😴 sleeping face" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "😴", $&
+    end
+
+    it "matches singleton emoji in combination with emoji variation selector" do
+      "😴\u{FE0F} sleeping face" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "😴\u{FE0F}", $&
+    end
+
+    it "matches singleton emoji (without VS) when in combination with text variation selector" do
+      "😴\u{FE0E} sleeping face" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "😴", $&
+    end
+
+    it "matches textual singleton emoji" do
+      "▶ play button" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "▶", $&
+    end
+
+    it "matches textual singleton emoji in combination with emoji variation selector" do
+      "▶\u{FE0F} play button" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "▶\u{FE0F}", $&
+    end
+
+    it "matches singleton 'component' emoji codepoints" do
+      "🏻 light skin tone" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🏻", $&
+    end
+
+    it "matches modified emoji if modifier base emoji is used" do
+      "🛌🏽 person in bed: medium skin tone" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🛌🏽", $&
+    end
+
+    it "matches modified emoji even if no modifier base emoji is used" do
+      "🌵🏽 cactus" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🌵🏽", $&
+    end
+
+    it "matches valid region flags" do
+      "🇵🇹 Portugal" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🇵🇹", $&
+    end
+
+    it "does match invalid region flags" do
+      "🇵🇵 PP Land" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🇵🇵", $&
+    end
+
+    it "matches emoji keycap sequences" do
+      "2️⃣ keycap: 2" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "2️⃣", $&
+    end
+
+    it "matches only digit of non-emoji keycap sequences" do
+      "8⃣ text keycap: 8" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "8", $&
+
+      "#⃣ text keycap: #" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "#", $&
+    end
+
+    it "matches recommended tag sequences" do
+      "🏴󠁧󠁢󠁳󠁣󠁴󠁿 Scotland" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🏴󠁧󠁢󠁳󠁣󠁴󠁿", $&
+    end
+
+    it "matches valid tag sequences, even though they are not recommended" do
+      "🏴󠁧󠁢󠁡󠁧󠁢󠁿 GB AGB" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🏴󠁧󠁢󠁡󠁧󠁢󠁿", $&
+    end
+
+    it "matches invalid base tag sequences" do
+      "😴󠁧󠁢󠁡󠁡󠁡󠁿 GB AAA" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "😴󠁧󠁢󠁡󠁡󠁡󠁿", $&
+    end
+
+    it "matches too long tag sequences" do
+      "🏴󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁿 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🏴󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁁󠁿", $&
+    end
+
+    it "matches invalid tag sequences" do
+      "🏴󠀤󠁿 $" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🏴󠀤󠁿", $&
+    end
+
+    it "matches recommended zwj sequences" do
+      "🤾🏽‍♀️ woman playing handball: medium skin tone" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🤾🏽‍♀️", $&
+    end
+
+    it "matches valid zwj sequences, even though they are not recommended" do
+      "🤠‍🤢 vomiting cowboy" =~ Unicode::Emoji::REGEX_POSSIBLE
+      assert_equal "🤠‍🤢", $&
+    end
+  end
+
   describe "REGEX_BASIC" do
     it "matches most singleton emoji codepoints" do
       "😴 sleeping face" =~ Unicode::Emoji::REGEX_BASIC