Zero width spaces (U+200b) inside the token #1010

Open
vvi56 opened this issue Feb 20, 2024 · 8 comments

Comments

@vvi56

vvi56 commented Feb 20, 2024

According to the standard, should space-like characters such as the zero width space (U+200B) be included in tokens or skipped like the normal space character?

Here are some examples:

  1. be_hse-ud-dev.conllu [screenshot]

  2. tr_penn-ud-train.conllu [screenshot]

  3. pt_pud-ud-test.conllu [screenshot]

@nschneid
Contributor

In English our policy is to retain the exact original string for each sentence. UniversalDependencies/UD_English-EWT#83 explains how we mark the token with SpecialEncoding=Yes and use CorrectForm to specify the normal spelling.

@vvi56
Author

vvi56 commented Feb 20, 2024

@nschneid yes, I see this case (U+00AD) in English-EWT.

Is the zero width space (U+200B) an instance of the space character (in which case it should be skipped), or a character that must be retained as part of the token, like the soft hyphen (U+00AD)?

Should we distinguish between normal spaces (U+0020) and all other spaces (Space Separator Unicode category: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:General_Category=Space_Separator:])?
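
For reference, the members of the Space Separator (Zs) category can be enumerated with Python's standard `unicodedata` module. This is only a sketch: the helper name `space_separators` is made up for illustration, and the exact list depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def space_separators():
    """Return every character whose General_Category is Zs."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


zs = space_separators()
for ch in zs:
    # Print the code point and its Unicode name, e.g. "U+00A0 NO-BREAK SPACE".
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}")
```

U+0020 and U+00A0 appear in this list, while U+200B does not, because it belongs to a different category.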

@nschneid
Contributor

Does it separate two words? If so I would say it should be encoded in UD with the SpacesAfter feature rather than as part of the word itself.

@arademaker
Contributor

I just fixed the Portuguese-PUD dataset. Thank you

@dan-zeman
Member

dan-zeman commented Feb 21, 2024

> According to the standard, should space-like characters such as the zero width space (U+200B) be included in tokens or skipped like the normal space character?

The guidelines say that "spaces" cannot occur in columns other than FORM, LEMMA and MISC, and in FORM and LEMMA they can occur only in expressions specifically defined for the language. However, it is not specified what types of spaces are meant.

The validator uses the \s special character (from the regex module for Python) to identify a "space". It should match any character in the Unicode category Zs ("Separator, space"). It also matches tabs and newlines, which are control characters (category Cc, part of the broader category C), but those are banned anyway and will result in a validation error elsewhere. U+200B ZERO WIDTH SPACE is in category Cf ("Other, format") and is not matched by \s; therefore a UD token containing this character is not invalid.
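
A quick way to check these categories is sketched here. It uses the standard-library `re` module rather than the third-party `regex` module that the validator relies on, under the assumption that both treat `\s` the same way for these particular characters.

```python
import re
import unicodedata

samples = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "TAB": "\t",
    "ZERO WIDTH SPACE": "\u200b",
    "SOFT HYPHEN": "\u00ad",
}

# Map each label to (Unicode category, whether \s matches the character).
results = {}
for label, ch in samples.items():
    category = unicodedata.category(ch)
    matched = re.fullmatch(r"\s", ch) is not None
    results[label] = (category, matched)
    print(f"{label:18} category={category} whitespace={matched}")
```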

Nevertheless, the character is still intended to separate words rather than to be part of them. So if it is preserved in treebanks, it should probably be a separate token. And if it is a separate token, I cannot see it tagged and attached as anything other than punctuation, although I cannot say I like such a solution.

It would also be possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad-hoc solution, and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because it contains characters that should be allowed inside words.
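
The ad-hoc variant could look roughly like the sketch below. It is not the validator's actual code: the function name `has_space_like` is hypothetical, the standard-library `re` module stands in for `regex`, and U+200B is assumed to be the only extra character added to the class.

```python
import re

# \s plus U+200B; more characters could be added to the class if needed.
SPACE_LIKE_RE = re.compile(r"[\s\u200B]")


def has_space_like(value):
    """Return True if the value contains whitespace or a zero width space."""
    return SPACE_LIKE_RE.search(value) is not None


print(has_space_like("毒\u200b\u200b物"))   # zero width spaces inside the form
print(has_space_like("co\u00adoperate"))    # soft hyphen (Cf) is not in the class
```

The soft hyphen case illustrates why the whole C category cannot simply be banned: it is a format character too, yet it is kept inside words.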

@vvi56
Author

vvi56 commented Feb 21, 2024

Here is the list of all instances of \x{200B} in FORM and LEMMA columns in UD 2.13:

  1. be_hse-ud-{dev,train}.conllu : ~20 times, always in the same context
    # text = <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg"><200b><200b></a><strong>Гэта сенсацыя!
    1   <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg">    <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg">          X   X   _   2   dep 2:dep   SpaceAfter=No
    2   <200b><200b>    <200b><200b>    X   X   _   6   parataxis   6:parataxis SpaceAfter=No
    3   </a>    </a>    X   X   _   2   dep 2:dep   SpaceAfter=No
  2. ru_gsd-ud-train.conllu : twice in one sentence
    # sent_id = train-s1679
    # text = Тем не менее, выборы на следующий год принесли победу Немецкой демократической партии (DDP) и Кёлер вернулся на свой <200b> <200b> пост в качестве министра финансов.
    ...
    21  свой    свой    DET PRP$    Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing   24  det _   _
    22  <200b>  <200b>  X   FW  _   24  amod    _   _
    23  <200b>  <200b>  X   FW  _   24  amod    _   _
    24  пост    пост    NOUN    NN  Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing   19  obl _   _
    ...
  3. tr_penn-ud-train.conllu : 4 times in different sentences
    5   <200b><200b>meslektaşlarına <200b><200b>meslektaşlarına NOUN    _   Case=Nom|Number=Sing|Person=3   6   obl _   _
    10  <200b><200b>ayarlamak   <200b><200b>ayarlamak   NOUN    _   Case=Nom|Number=Sing|Person=3   11  xcomp   _   _
    12  <200b><200b>yaptı   <200b><200b>yaptı   NOUN    _   Case=Nom|Number=Sing|Person=3   4   conj    _   _
    6   <200b><200b>otomobil    <200b><200b>otomobil    NOUN    _   Case=Nom|Number=Sing|Person=3   7   nmod    _   _
  4. zh_{gsd,gsdsimp}-ud-train.conllu : once
    3   毒<200b><200b>物    毒<200b><200b>物    NOUN    NN  _   21  nsubj   _   SpaceAfter=No|Translit=dú<200b><200b>wù|LTranslit=dú<200b><200b>wù
  5. pt_pud-ud-test.conllu : once, already fixed
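
A search like this can be reproduced with a few lines of Python. The sketch below is not the script used for the list above; the name `find_zwsp` is hypothetical, and it assumes well-formed CoNLL-U input with ten tab-separated columns. The sample rows are abridged from the rows shown above.

```python
ZWSP = "\u200b"


def find_zwsp(lines):
    """Yield (line_number, token_id, column_name) for FORM/LEMMA values containing U+200B."""
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a regular token line
        for name, value in (("FORM", cols[1]), ("LEMMA", cols[2])):
            if ZWSP in value:
                yield lineno, cols[0], name


sample = [
    "# sent_id = example\n",
    "3\t毒\u200b\u200b物\t毒\u200b\u200b物\tNOUN\tNN\t_\t21\tnsubj\t_\tSpaceAfter=No\n",
    "24\tпост\tпост\tNOUN\tNN\tAnimacy=Inan|Case=Acc|Gender=Masc|Number=Sing\t19\tobl\t_\t_\n",
]
hits = list(find_zwsp(sample))
print(hits)
```

For a real treebank file, the same function can be fed an open file object, for example `open(path, encoding="utf-8")`.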

@vvi56
Author

vvi56 commented Feb 21, 2024

There is only one character from the Space_Separator category in the FORM and LEMMA columns other than a normal space: \x{00A0} (NO-BREAK SPACE):

br_keb-ud-test.conllu :

9   100<00A0>000 100 000 NUM num Number=Plur 6   nsubj   _   _
6   16<00A0>345  16 345  NUM num Number=Plur 1   appos   _   _

All other spaces in the FORM and LEMMA columns are normal spaces (\x{0020}) used to encode multiwords, numbers and formulas. There are about 13K occurrences of the normal space character across many corpora.
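
A tally of this kind can be sketched as follows. The function name `count_space_separators` is hypothetical and the input is assumed to be well-formed CoNLL-U with ten tab-separated columns; the sample row is the first br_keb row above.

```python
from collections import Counter
import unicodedata


def count_space_separators(lines):
    """Count Zs characters in the FORM and LEMMA columns, keyed by code point."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a regular token line
        for value in cols[1:3]:  # FORM and LEMMA
            for ch in value:
                if unicodedata.category(ch) == "Zs":
                    counts[f"U+{ord(ch):04X}"] += 1
    return counts


sample = ["9\t100\u00a0000\t100 000\tNUM\tnum\tNumber=Plur\t6\tnsubj\t_\t_\n"]
counts = count_space_separators(sample)
print(counts)
```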

@vvi56
Author

vvi56 commented Feb 21, 2024

> It would also be possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad-hoc solution, and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because it contains characters that should be allowed inside words.

A refined rule for columns FORM and LEMMA could probably be: "The value cannot begin or end with a space-like character".
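
As a rough sketch of how such a rule might be checked: the names `is_space_like` and `violates_edge_rule` are hypothetical, and "space-like" is assumed here to mean anything `str.isspace()` accepts plus U+200B.

```python
def is_space_like(ch):
    """Unicode whitespace according to str.isspace(), plus the zero width space."""
    return ch.isspace() or ch == "\u200b"


def violates_edge_rule(value):
    """Return True if a FORM or LEMMA value begins or ends with a space-like character."""
    return bool(value) and (is_space_like(value[0]) or is_space_like(value[-1]))


print(violates_edge_rule("\u200b\u200bmeslektaşlarına"))  # leading zero width spaces
print(violates_edge_rule("毒\u200b\u200b物"))              # only internal ones
print(violates_edge_rule("100\u00a0000"))                 # internal no-break space
```

Under this reading, a token that consists solely of U+200B (as in the ru_gsd example) would also be rejected, since its first character is space-like.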

4 participants