Support emoji for MTurk import / export #1773
Changes from 4 commits
02d46d3
4c7a7d5
b19d582
4601b35
6d8db06
09a408d
6f25042
a51fbda
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
import json | ||
import re | ||
import sys | ||
|
||
|
||
# Source: https://github.com/charman/mturk-emoji | ||
def replace_emoji_characters(s): | ||
"""Replace 4-byte characters with XML numeric character references | ||
|
||
This function takes a Unicode string containing 4-byte Unicode | ||
characters, e.g. 😀, and replaces each 4-byte character with its | ||
decimal XML numeric character reference, e.g.: | ||
|
||
&#128512; | ||
Review comments: "Use double quotes?"; "This is out of date, right?"; "Updated." |
||
|
||
Args: | ||
s (Unicode string): | ||
Returns: | ||
Unicode string with all 4-byte Unicode characters in the source | ||
string replaced with XML numeric character references | ||
Review comments: "Give an example?"; "Added example." |
||
""" | ||
|
||
def _emoji_match_to_span(emoji_match): | ||
""" | ||
Args: | ||
emoji_match (MatchObject): | ||
|
||
Returns: | ||
Unicode string | ||
""" | ||
return emoji_match.group().encode("ascii", "xmlcharrefreplace").decode() | ||
|
||
# The procedure for stripping Emoji characters is based on this | ||
# StackOverflow post: | ||
# http://stackoverflow.com/questions/12636489/python-convert-4-byte-char-to-avoid-mysql-error-incorrect-string-value | ||
if sys.maxunicode == 1114111: | ||
# Python was built with '--enable-unicode=ucs4' | ||
highpoints = re.compile("[\U00010000-\U0010ffff]") | ||
elif sys.maxunicode == 65535: | ||
# Python was built with '--enable-unicode=ucs2' | ||
highpoints = re.compile("[\uD800-\uDBFF][\uDC00-\uDFFF]") | ||
else: | ||
raise UnicodeError("Unable to determine if Python was built using UCS-2 or UCS-4") | ||
|
||
return highpoints.sub(_emoji_match_to_span, s) |
Review comments: "Add type hints"; "Added."
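
For reference, a minimal sketch of how the function might look with the requested type hints. It assumes Python 3 only, where `sys.maxunicode` is always 1114111, so the UCS-2 branch is dropped. `restore_emoji_characters` is a hypothetical inverse for the export direction and is not part of this diff; it would also convert any literal `&#N;` reference above U+FFFF that was already present in the text.

```python
import re

# Any code point outside the Basic Multilingual Plane (4 bytes in UTF-8).
_HIGHPOINTS = re.compile("[\U00010000-\U0010ffff]")

# Decimal numeric character references such as "&#128512;".
_CHAR_REF = re.compile(r"&#(\d+);")


def replace_emoji_characters(s: str) -> str:
    """Replace each 4-byte character with a decimal character reference."""

    def _match_to_ref(match: re.Match) -> str:
        # "xmlcharrefreplace" turns unencodable characters into "&#N;".
        return match.group().encode("ascii", "xmlcharrefreplace").decode()

    return _HIGHPOINTS.sub(_match_to_ref, s)


def restore_emoji_characters(s: str) -> str:
    """Hypothetical inverse: turn "&#N;" back into the character.

    Only references above U+FFFF are converted, so other references
    (for example "&#38;") are left untouched.
    """

    def _ref_to_char(match: re.Match) -> str:
        code_point = int(match.group(1))
        if 0x10000 <= code_point <= 0x10FFFF:
            return chr(code_point)
        return match.group()

    return _CHAR_REF.sub(_ref_to_char, s)


if __name__ == "__main__":
    encoded = replace_emoji_characters("hi 😀")
    print(encoded)
    print(restore_emoji_characters(encoded))
```

Characters inside the Basic Multilingual Plane, such as `é`, pass through unchanged in both directions.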