Zero width spaces (U+200b) inside the token #1010

vvi56 · 2024-02-20T22:11:09Z

According to the standard, should space-like characters such as the zero width space (U+200B) be included in tokens or skipped like the normal space character?

Here are some examples:

be_hse-ud-dev.conllu
tr_penn-ud-train.conllu
pt_pud-ud-test.conllu

nschneid · 2024-02-20T23:00:31Z

In English our policy is to retain the exact original string for each sentence. UniversalDependencies/UD_English-EWT#83 explains how we mark the token with SpecialEncoding=Yes and use CorrectForm to specify the normal spelling.

vvi56 · 2024-02-20T23:34:06Z

@nschneid yes, I see this case (U+00AD) in English-EWT.

Is the zero width space (U+200B) an instance of the space character (in which case it should be skipped), or a character that must be retained as part of the token, like the soft hyphen (U+00AD)?

Should we distinguish between normal spaces (U+0020) and all other spaces (Space Separator Unicode category: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:General_Category=Space_Separator:] ) ?

nschneid · 2024-02-21T00:21:44Z

Does it separate two words? If so I would say it should be encoded in UD with the SpacesAfter feature rather than as part of the word itself.

arademaker · 2024-02-21T13:24:43Z

I just fixed the Portuguese-PUD dataset. Thank you

dan-zeman · 2024-02-21T18:16:26Z

> According to the standard should space-like characters such as zero width space (U+200b) be included in the tokens or skipped like the normal space character ?

The guidelines say that "spaces" cannot occur in columns other than FORM, LEMMA and MISC, and in FORM and LEMMA they can occur only in expressions specifically defined for the language. However, it is not specified what types of spaces are meant.

The validator uses the \s special character (from the regex module for Python) to identify a "space". It should match any character in the Unicode category Zs ("Separator, space"). It also matches tabs and newlines, which are in Unicode category C, but those are banned anyway and will result in a validation error elsewhere. U+200B ZERO WIDTH SPACE is in category Cf ("Other, format") and is not matched by \s; therefore a UD token with this character is not invalid.
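The category claims above can be checked with a few lines of Python. This is only a sketch: it uses the standard-library `re` and `unicodedata` modules rather than the third-party `regex` module the validator uses, on the assumption that `\s` behaves the same for these particular characters. The `CANDIDATES` table and the `describe` helper are illustrative names, not part of the validator.

```python
import re
import unicodedata

# Characters discussed in this thread, keyed by a readable label.
CANDIDATES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "TAB": "\t",
    "SOFT HYPHEN": "\u00ad",
    "ZERO WIDTH SPACE": "\u200b",
}


def describe(ch):
    """Return (Unicode general category, whether \\s matches) for one character."""
    return unicodedata.category(ch), re.fullmatch(r"\s", ch) is not None


for label, ch in CANDIDATES.items():
    category, is_space = describe(ch)
    print(f"U+{ord(ch):04X} {label}: category={category}, matched by \\s: {is_space}")
```

With this check, U+200B and U+00AD both report category Cf and are not matched by `\s`, while U+00A0 reports Zs and is matched.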

Nevertheless, the character is still intended to separate words rather than to be part of them. So if it is preserved in treebanks, it should probably be a separate token. And if it is a separate token, I cannot see it tagged and attached as anything other than punctuation, although I cannot say I like such a solution.

It would also be possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad hoc solution, and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because it contains characters that should be allowed inside words.
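A minimal sketch of what that ad hoc variant could look like, assuming U+200B is the only extra character added to the set. `SPACE_LIKE` and `contains_space_like` are hypothetical names, not validator code, and the sketch uses the standard-library `re` module, where the character is written `\u200b` rather than `\x{200B}`. The ZWNJ example (U+200C, also category Cf) is my own illustration of a format character that occurs inside words.

```python
import re

# Assumption: "space" means anything matched by \s, plus U+200B and nothing else.
SPACE_LIKE = re.compile(r"[\s\u200b]")


def contains_space_like(value):
    """Return True if the value contains \\s-whitespace or U+200B anywhere."""
    return SPACE_LIKE.search(value) is not None


# Zero width spaces inside a token are flagged.
print(contains_space_like("毒\u200b\u200b物"))
# A plain word is not flagged.
print(contains_space_like("пост"))
# U+200C ZERO WIDTH NON-JOINER is also in Cf but is left alone.
print(contains_space_like("می\u200cخواهم"))
```

Extending the character class one code point at a time keeps the rest of category C untouched, at the cost of having to decide each additional character individually.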

vvi56 · 2024-02-21T21:45:20Z

Here is the list of all instances of \x{200B} in FORM and LEMMA columns in UD 2.13:

be_hse-ud-{dev,train}.conllu : ~20 times, always in the same context

# text = <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg"><200b><200b></a><strong>Гэта сенсацыя!
1   <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg">    <a_href="https://telegra.ph/file/ff4e408b49529ff1b14dc.jpg">          X   X   _   2   dep 2:dep   SpaceAfter=No
2   <200b><200b>    <200b><200b>    X   X   _   6   parataxis   6:parataxis SpaceAfter=No
3   </a>    </a>    X   X   _   2   dep 2:dep   SpaceAfter=No

ru_gsd-ud-train.conllu : twice in one sentence

# sent_id = train-s1679
# text = Тем не менее, выборы на следующий год принесли победу Немецкой демократической партии (DDP) и Кёлер вернулся на свой <200b> <200b> пост в качестве министра финансов.
...
21  свой    свой    DET PRP$    Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing   24  det _   _
22  <200b>  <200b>  X   FW  _   24  amod    _   _
23  <200b>  <200b>  X   FW  _   24  amod    _   _
24  пост    пост    NOUN    NN  Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing   19  obl _   _
...

tr_penn-ud-train.conllu : 4 times in different sentences

5   <200b><200b>meslektaşlarına <200b><200b>meslektaşlarına NOUN    _   Case=Nom|Number=Sing|Person=3   6   obl _   _
10  <200b><200b>ayarlamak   <200b><200b>ayarlamak   NOUN    _   Case=Nom|Number=Sing|Person=3   11  xcomp   _   _
12  <200b><200b>yaptı   <200b><200b>yaptı   NOUN    _   Case=Nom|Number=Sing|Person=3   4   conj    _   _
6   <200b><200b>otomobil    <200b><200b>otomobil    NOUN    _   Case=Nom|Number=Sing|Person=3   7   nmod    _   _

zh_{gsd,gsdsimp}-ud-train.conllu : once

3   毒<200b><200b>物    毒<200b><200b>物    NOUN    NN  _   21  nsubj   _   SpaceAfter=No|Translit=dú<200b><200b>wù|LTranslit=dú<200b><200b>wù

pt_pud-ud-test.conllu : once, already fixed

vvi56 · 2024-02-21T22:09:44Z

Apart from the normal space, only one character from the Space_Separator category occurs in the FORM and LEMMA columns: \x{00A0} (NO-BREAK SPACE):

br_keb-ud-test.conllu :

9   100<00A0>000 100 000 NUM num Number=Plur 6   nsubj   _   _
6   16<00A0>345  16 345  NUM num Number=Plur 1   appos   _   _

All other spaces in the FORM and LEMMA columns are normal spaces (\x{0020}) used to encode multiwords, numbers and formulas. There are ~13K occurrences of the normal space character across many corpora.
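A count like this can be reproduced with a short script. The following is a sketch under stated assumptions, not the script actually used for the numbers above: it assumes tab-separated CoNLL-U token lines with 10 columns, and `count_unusual_spaces` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def count_unusual_spaces(lines):
    """Count Zs characters other than U+0020 in the FORM and LEMMA columns."""
    counts = Counter()
    for line in lines:
        # Strip only the newline: a bare strip() would also eat U+00A0.
        line = line.rstrip("\n")
        # Skip comments and the blank lines that separate sentences.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line
        for value in columns[1:3]:  # FORM and LEMMA
            for ch in value:
                if ch != " " and unicodedata.category(ch) == "Zs":
                    counts[f"U+{ord(ch):04X}"] += 1
    return counts


sample = [
    "# text = example\n",
    "9\t100\u00a0000\t100 000\tNUM\tnum\tNumber=Plur\t6\tnsubj\t_\t_\n",
    "\n",
]
print(count_unusual_spaces(sample))
```

To run it on a treebank, pass an open file object, for example `count_unusual_spaces(open(path, encoding="utf-8"))`.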

vvi56 · 2024-02-21T22:19:16Z

> It would be also possible to say that the validator should consider both \s and \x{200B} as "spaces". But this is an ad-hoc solution and I don't know whether there are other characters (and how many) that would deserve the same treatment. We cannot exclude the whole C category because there are characters that should be allowed inside words.

A refined rule for columns FORM and LEMMA could probably be: "The value cannot begin or end with a space-like character".
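As a rough sketch, the rule could be checked like this. It assumes "space-like" means Unicode whitespace as reported by `str.isspace()` (used here as an approximation of `\s`) plus U+200B; `is_space_like` and `violates_edge_rule` are hypothetical names.

```python
def is_space_like(ch):
    """Assumption: space-like = Unicode whitespace or U+200B ZERO WIDTH SPACE."""
    return ch.isspace() or ch == "\u200b"


def violates_edge_rule(value):
    """Return True if a FORM or LEMMA value begins or ends with a space-like character."""
    return bool(value) and (is_space_like(value[0]) or is_space_like(value[-1]))


# Leading zero width spaces, as in the tr_penn examples: violation.
print(violates_edge_rule("\u200b\u200bmeslektaşlarına"))
# Zero width spaces in the middle, as in the zh_gsd example: allowed by this rule.
print(violates_edge_rule("毒\u200b\u200b物"))
# No-break space inside a number, as in br_keb: allowed by this rule.
print(violates_edge_rule("100\u00a0000"))
```

Under this rule, a token consisting only of zero width spaces, as in be_hse and ru_gsd, would also be rejected, since its first character is space-like.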

arademaker mentioned this issue Feb 21, 2024

zero width space inside a token UniversalDependencies/UD_Portuguese-PUD#54

Closed
