Provides regular expressions to find Emoji in strings, incorporating the latest Unicode / Emoji standards.
Additional features:
- A categorized list of Emoji (RGI: Recommended for General Interchange)
- Retrieve Emoji properties info about specific codepoints (Emoji_Modifier, Emoji_Presentation, etc.)
Emoji version: 16.0 (September 2024)
CLDR version (used for sub-region flags): 46 (October 2024)
gem "unicode-emoji"
The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.
require "unicode/emoji"
string = "String which contains all types of Emoji sequences:
- Singleton Emoji: 😴
- Textual singleton Emoji with Emoji variation: ▶️
- Emoji with skin tone modifier: 🛌🏽
- Region flag: 🇵🇹
- Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
- Keycap sequence: 2️⃣
- Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️
"
string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]
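To see why some of these count as sequences rather than single characters, it can help to look at the codepoints behind them. The following is a minimal sketch using only core Ruby; the helper name `codepoints_hex` is made up for this illustration and is not part of the gem.

```ruby
# Hypothetical helper (not part of the gem): list the codepoints of a string
# as uppercase hex, the notation used by the Unicode data files.
def codepoints_hex(string)
  string.codepoints.map { |cp| format("%04X", cp) }
end

# ZWJ sequence: base Emoji + skin tone modifier + ZWJ + female sign + VS16
handball = "\u{1F93E 1F3FD 200D 2640 FE0F}"
# Keycap sequence: digit + VS16 + combining enclosing keycap
keycap = "2\u{FE0F 20E3}"

p codepoints_hex(handball) # => ["1F93E", "1F3FD", "200D", "2640", "FE0F"]
p codepoints_hex(keycap)   # => ["0032", "FE0F", "20E3"]
```

A single visible Emoji can therefore span several codepoints, which is why a plain one-character match is not enough.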
Depending on your exact use case, you can choose between multiple levels of Emoji detection:
Regex | Description | Example Matches | Example Non-Matches |
---|---|---|---|
Unicode::Emoji::REGEX | Use this one if unsure! Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of recommended Emoji sequences (RGI/FQE) | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🤾🏽‍♀️ | 🤾🏽‍♀ , 🏌‍♀️ , 😴︎ , ▶ , 🏻 , 🇵🇵 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤠‍🤢 , 1 , 1⃣ |
Unicode::Emoji::REGEX_VALID | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of valid Emoji sequences | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ , 🤠‍🤢 | 😴︎ , ▶ , 🏻 , 🇵🇵 , 1 , 1⃣ |
Unicode::Emoji::REGEX_WELL_FORMED | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of well-formed Emoji sequences | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ , 🤠‍🤢 , 🇵🇵 | 😴︎ , ▶ , 🏻 , 1 , 1⃣ |
Unicode::Emoji::REGEX_POSSIBLE | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits (except for: unqualified keycap sequences) | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ , 🤠‍🤢 , 🇵🇵 , 😴︎ , ▶ , 🏻 , 1 | 1⃣ |
By default, textual Emoji (Emoji characters with a text variation selector or those that have a default text presentation) are not included in the default regexes (except in REGEX_POSSIBLE). If you want to match them too, you can include them in your regex by appending the _INCLUDE_TEXT suffix:
Regex | Description | Example Matches | Example Non-Matches |
---|---|---|---|
Unicode::Emoji::REGEX_INCLUDE_TEXT | REGEX + REGEX_TEXT | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🤾🏽‍♀️ , 😴︎ , ▶ , 1⃣ | 🤾🏽‍♀ , 🏌‍♀️ , 🏻 , 🇵🇵 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤠‍🤢 , 1 |
Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT | REGEX_VALID + REGEX_TEXT | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ , 🤠‍🤢 , 😴︎ , ▶ , 1⃣ | 🏻 , 🇵🇵 , 1 |
Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT | REGEX_WELL_FORMED + REGEX_TEXT | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ , 🤠‍🤢 , 🇵🇵 , 😴︎ , ▶ , 1⃣ | 🏻 , 1 |
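The difference between textual and non-textual Emoji comes down to the variation selector that follows the base character, or, if there is none, the character's default presentation. Here is a rough sketch using only core Ruby and its built-in `\p{Emoji_Presentation}` property. The `presentation` helper and the constant names are assumptions made for this illustration, and the helper expects a single Emoji character with an optional variation selector as input.

```ruby
TEXT_VS  = "\u{FE0E}" # VARIATION SELECTOR-15: requests text presentation
EMOJI_VS = "\u{FE0F}" # VARIATION SELECTOR-16: requests Emoji presentation

# Hypothetical helper: decide how a single Emoji character (plus optional
# variation selector) is meant to be displayed.
def presentation(string)
  return :text  if string.end_with?(TEXT_VS)
  return :emoji if string.end_with?(EMOJI_VS)

  # No selector given: fall back to the character's default presentation
  string[0].match?(/\p{Emoji_Presentation}/) ? :emoji : :text
end

p presentation("\u{1F634}")         # => :emoji (default Emoji presentation)
p presentation("\u{1F634}\u{FE0E}") # => :text  (forced by VS15)
p presentation("\u{25B6}")          # => :text  (default text presentation)
p presentation("\u{25B6}\u{FE0F}")  # => :emoji (forced by VS16)
```

The two `:text` cases correspond to what the `_INCLUDE_TEXT` variants additionally match.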
Regex | Description | Example Matches | Example Non-Matches |
---|---|---|---|
Unicode::Emoji::REGEX_INCLUDE_MQE | Like REGEX, but additionally includes Emoji with missing Emoji Presentation Variation Selectors, where the first partial Emoji has all required Variation Selectors | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ | 🏌‍♀️ , 😴︎ , ▶ , 🏻 , 🇵🇵 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤠‍🤢 , 1 , 1⃣ |
Unicode::Emoji::REGEX_INCLUDE_MQE_UQE | Like REGEX, but additionally includes Emoji with missing Emoji Presentation Variation Selectors | 😴 , ▶️ , 🛌🏽 , 🇵🇹 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ | 😴︎ , ▶ , 🏻 , 🇵🇵 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤠‍🤢 , 1 , 1⃣ |
List of MQE and UQE Emoji sequences
Matches only simple one-codepoint (+ optional variation selector) Emoji:
Regex | Description | Example Matches | Example Non-Matches |
---|---|---|---|
Unicode::Emoji::REGEX_BASIC | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | 😴 , ▶️ | 😴︎ , ▶ , 🏻 , 🛌🏽 , 🇵🇹 , 🇵🇵 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ , 🤠‍🤢 , 1 |
Unicode::Emoji::REGEX_TEXT | Matches only textual singleton Emoji (except for singleton components, like digits) | 😴︎ , ▶ | 😴 , ▶️ , 🏻 , 🛌🏽 , 🇵🇹 , 🇵🇵 , 2️⃣ , 🏴󠁧󠁢󠁳󠁣󠁴󠁿 , 🏴󠁧󠁢󠁡󠁧󠁢󠁿 , 🤾🏽‍♀️ , 🤾🏽‍♀ , 🏌‍♀️ , 🤠‍🤢 , 1 |
Here is a list of all Emoji that can be matched using the two regexes: character.construction/emoji-vs-text
While REGEX_BASIC is part of the above regexes, REGEX_TEXT is only included in the *_INCLUDE_TEXT or *_UQE variants.
1. Fully-qualified RGI Emoji ZWJ sequence
2. Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character)
3. Unqualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selector, including in the first Emoji character). Unqualified Emoji include all basic Emoji in Text Presentation (see column 11/12).
4. Non-RGI Emoji ZWJ sequence
5. Valid Region made from a pair of Regional Indicators
6. Any Region made from a pair of Regional Indicators
7. RGI Flag Emoji Tag Sequences (England, Scotland, Wales)
8. Valid Flag Emoji Tag Sequences (any known subdivision)
9. Any Emoji Tag Sequences (any tag sequence with any base)
10. Basic Default Emoji Presentation Characters or Text characters with Emoji Presentation Selector
11. Basic Default Text Presentation Characters or Basic Emoji with Text Presentation Selector
12. Non-Emoji (unqualified) keycap
Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap |
---|---|---|---|---|---|---|---|---|---|---|---|---|
REGEX | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX INCLUDE TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX INCLUDE MQE | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX INCLUDE MQE UQE | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX VALID | โ | โ | (โ )ยน | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX VALID INCLUDE TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX WELL FORMED | โ | โ | (โ )ยน | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX WELL FORMED INCLUDE TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX POSSIBLE | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX BASIC | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
REGEX TEXT | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ | โ |
ยน Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)
See spec files for detailed examples about which regex matches which kind of Emoji.
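The qualification levels (fully-qualified, minimally-qualified, unqualified) differ only in where the Emoji Presentation Selector (U+FE0F) is missing. The sketch below derives the less-qualified forms from a fully-qualified ZWJ sequence with plain string operations. The helper names are hypothetical and the code does not use the gem.

```ruby
VS16 = "\u{FE0F}" # Emoji Presentation Selector
ZWJ  = "\u{200D}" # ZERO WIDTH JOINER

# Hypothetical helper: keep selectors in the first ZWJ element, drop them in
# all later elements. Applied to a fully-qualified sequence whose later
# elements carry VS16, this yields a minimally-qualified form.
def without_later_vs16(sequence)
  first, *rest = sequence.split(ZWJ)
  [first, *rest.map { |part| part.delete(VS16) }].join(ZWJ)
end

# Hypothetical helper: drop every selector. The result is unqualified only
# if the first character itself needed VS16 (as with the golfer below).
def without_any_vs16(sequence)
  sequence.delete(VS16)
end

handball_fq = "\u{1F93E 1F3FD 200D 2640 FE0F}" # woman playing handball
golfing_fq  = "\u{1F3CC FE0F 200D 2640 FE0F}"  # woman golfing

p without_later_vs16(handball_fq) == "\u{1F93E 1F3FD 200D 2640}" # => true
p without_later_vs16(golfing_fq)  == "\u{1F3CC FE0F 200D 2640}"  # => true
p without_any_vs16(golfing_fq)    == "\u{1F3CC 200D 2640}"       # => true
```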
- Usually you just want REGEX (recommended Emoji set, RGI)
- Use REGEX_INCLUDE_MQE or REGEX_INCLUDE_MQE_UQE if you want to catch Emoji sequences with missing Variation Selectors
- If you want broader matching (any ZWJ sequences, more sub-region flags), choose REGEX_VALID
- If you need to match any region flag and any tag sequence, choose REGEX_WELL_FORMED
- Use the _INCLUDE_TEXT suffix with any of the above base regexes if you also want to match basic textual Emoji
- And finally, there is also the option to use REGEX_POSSIBLE, which is a simplified test for possible Emoji, comparable to REGEX_WELL_FORMED*. It might produce false positives; however, the regex is less complex and is suggested in the Unicode standard itself as a first check.
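To give an idea of what such a simplified first check can look like, here is a sketch that loosely follows the "possible Emoji" pattern from UTS #51, written with Ruby's built-in Unicode properties. It is an illustration only, not the gem's actual REGEX_POSSIBLE, and the constant names are made up for this example.

```ruby
# Regional Indicator symbols (two of them form a region flag)
RI = /[\u{1F1E6}-\u{1F1FF}]/

# One Emoji element: a base character, optionally followed by a skin tone
# modifier, by VS16 (plus optional keycap), or by a tag sequence.
ELEMENT = /\p{Emoji}(?:\p{Emoji_Modifier}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?/

# A pair of Regional Indicators, or one or more elements joined by ZWJ
POSSIBLE_EMOJI_SKETCH = /#{RI}#{RI}|#{ELEMENT}(?:\u{200D}#{ELEMENT})*/

p "a1\u{1F1F5 1F1F5}b".scan(POSSIBLE_EMOJI_SKETCH) == ["1", "\u{1F1F5 1F1F5}"] # => true
p "1\u{20E3}".scan(POSSIBLE_EMOJI_SKETCH)                                      # => ["1"]
```

Note how the plain digit is reported as a possible Emoji, while the unqualified keycap sequence is not matched as a whole. This is the kind of false positive mentioned above.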
Desc | Emoji | Escaped | REGEX (RGI/FQE) | REGEX_INCLUDE_MQE (RGI/MQE) | REGEX_VALID | REGEX_WELL_FORMED / REGEX_POSSIBLE |
---|---|---|---|---|---|---|
RGI ZWJ Sequence | 🤾🏽‍♀️ | \u{1F93E 1F3FD 200D 2640 FE0F} | ✅ | ✅ | ✅ | ✅ |
RGI ZWJ Sequence MQE | 🤾🏽‍♀ | \u{1F93E 1F3FD 200D 2640} | ❌ | ✅ | ✅ | ✅ |
Valid ZWJ Sequence, Non-RGI | 🤠‍🤢 | \u{1F920 200D 1F922} | ❌ | ❌ | ✅ | ✅ |
Known Region | 🇵🇹 | \u{1F1F5 1F1F9} | ✅ | ✅ | ✅ | ✅ |
Unknown Region | 🇵🇵 | \u{1F1F5 1F1F5} | ❌ | ❌ | ❌ | ✅ |
RGI Tag Sequence | 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | \u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F} | ✅ | ✅ | ✅ | ✅ |
Valid Tag Sequence | 🏴󠁧󠁢󠁡󠁧󠁢󠁿 | \u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F} | ❌ | ❌ | ✅ | ✅ |
Well-formed Tag Sequence | 😴󠁧󠁢󠁡󠁡󠁡󠁿 | \u{1F634 E0067 E0062 E0061 E0061 E0061 E007F} | ❌ | ❌ | ❌ | ✅ |
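The escaped forms in the table follow a simple construction: region flags are pairs of Regional Indicator symbols, and sub-region flags are a black flag followed by tag characters and a cancel tag. A minimal sketch of that construction in core Ruby could look like this. The method and constant names are assumptions for this example, and nothing here checks whether the resulting region or subdivision actually exists.

```ruby
REGIONAL_INDICATOR_A = 0x1F1E6 # REGIONAL INDICATOR SYMBOL LETTER A
BLACK_FLAG           = 0x1F3F4 # WAVING BLACK FLAG
TAG_OFFSET           = 0xE0000 # tag characters mirror ASCII at this offset
CANCEL_TAG           = 0xE007F # terminates a tag sequence

# Hypothetical helper: two-letter region code (like "PT") to flag sequence
def region_flag(code)
  code.upcase.each_char.map { |c| REGIONAL_INDICATOR_A + c.ord - "A".ord }.pack("U*")
end

# Hypothetical helper: subdivision code (like "gbsct") to tag sequence
def subdivision_flag(code)
  tags = code.downcase.each_char.map { |c| TAG_OFFSET + c.ord }
  [BLACK_FLAG, *tags, CANCEL_TAG].pack("U*")
end

p region_flag("PT") == "\u{1F1F5 1F1F9}" # => true (known region)
p region_flag("PP") == "\u{1F1F5 1F1F5}" # => true (well-formed, but unknown region)
p subdivision_flag("gbsct") == "\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}" # => true
```

Because any pair of letters produces a well-formed sequence, the distinction between RGI, valid, and well-formed has to come from data, not from the structure of the sequence.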
Please see the standard for more details, examples, explanations.
More info about valid vs. recommended Emoji can also be found in this blog article on Emojipedia.
Unicode::Emoji::REGEX_PICTO matches single codepoints with the Extended_Pictographic property. For example, it will match ✀ BLACK SAFETY SCISSORS.
Unicode::Emoji::REGEX_PICTO_NO_EMOJI matches single codepoints with the Extended_Pictographic property, but excludes Emoji characters.
See character.construction/picto for a list of all non-Emoji pictographic characters.
Unicode::Emoji::REGEX_ANY, same as \p{Emoji}. Deprecated: Will be removed or renamed in the future.
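A short illustration of why the bare `\p{Emoji}` property is rarely what you want: it also covers characters such as digits and the number sign, because they can start keycap sequences. The example uses core Ruby only.

```ruby
# Digits and "#" carry the Emoji property, ordinary letters do not
p "Call 911 #now".scan(/\p{Emoji}/) # => ["9", "1", "1", "#"]

# Real Emoji match as well, but only one codepoint at a time
p "\u{1F634}".match?(/\p{Emoji}/)   # => true
```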
Use Unicode::Emoji::LIST or the list method to get an ordered and categorized list of Emoji:
Unicode::Emoji.list.keys
# => ["Smileys & Emotion", "People & Body", "Component", "Animals & Nature", "Food & Drink", "Travel & Places", "Activities", "Objects", "Symbols", "Flags"]
Unicode::Emoji.list("Food & Drink").keys
# => ["food-fruit", "food-vegetable", "food-prepared", "food-asian", "food-marine", "food-sweet", "drink", "dishware"]
Unicode::Emoji.list("Food & Drink", "food-asian")
# => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]
Please note that categories might change with future versions of the Emoji standard, although this has not happened often.
A list of all Emoji (generated from this gem) can be found at character.construction/emoji.
Allows you to access the codepoint data for a single character from Unicode's emoji-data.txt file:
require "unicode/emoji"
Unicode::Emoji.properties "☝" # => ["Emoji", "Emoji_Modifier_Base"]
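If you only need a rough equivalent without the gem, a comparable lookup can be approximated with the Emoji properties built into Ruby's regex engine. This is a sketch under assumptions: the helper name and the property list are chosen for this example, and the results depend on the Unicode version shipped with your Ruby, not on the gem's data.

```ruby
# Property names as understood by Ruby's \p{...} syntax
EMOJI_PROPERTY_NAMES = %w[
  Emoji Emoji_Presentation Emoji_Modifier Emoji_Modifier_Base Emoji_Component
].freeze

# Hypothetical helper: list the properties that apply to a single character
def emoji_properties_sketch(char)
  EMOJI_PROPERTY_NAMES.select { |name| char.match?(/\A\p{#{name}}\z/) }
end

p emoji_properties_sketch("\u{261D}")  # => ["Emoji", "Emoji_Modifier_Base"]
p emoji_properties_sketch("\u{1F3FD}") # => ["Emoji", "Emoji_Presentation", "Emoji_Modifier", "Emoji_Component"]
p emoji_properties_sketch("a")         # => []
```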
- Unicode® Technical Standard #51
- Emoji categories
- Ruby gem which displays Emoji sequence names (as website)
- Part of unicode-x
- Copyright (C) 2017-2024 Jan Lelis https://janlelis.com. Released under the MIT license.
- Unicode data: https://www.unicode.org/copyright.html#Exhibit1