Typo definition #858

nschneid · 2022-04-06T13:54:00Z

A potential rewrite of the Typo definition for discussion.

The new definition broadens beyond errors to include typographically unexpected spellings. For example:

encoding issues that are not technically errors (encoding error UD_English-EWT#83)
word-internal spacing being used on purpose for visual effect, like U P D A T E D (this is goeswith so Typo=Yes is required) (Goeswith edits UD_English-EWT#314)

This may shift the boundary with Style=Expr when it comes to odd but intentional spellings like CA$H.

Thoughts?

dan-zeman · 2022-04-06T15:13:50Z

Sounds good to me.

amir-zeldes · 2022-04-06T17:16:57Z

I have a comment and a question:

> unexpected capitalization choices do not fall under Typo=Yes

What about things like "i" for pronoun "I"? I think this is as unexpected in 'normal' text as other spelling abnormalities.

And a question: how does this relate to spoken data, where there may be dysfluencies that should receive a CorrectForm. Some examples:

we call it a sholids (for "a solid", as opposed to a liquid or gas)
they known about it (probably for "know", in context)

These are not 'typos' (since they're spoken), but basically have the same structure: unexpected form for which we can guess the CorrectForm. Do we need a parallel Dysfluency=Yes?

dan-zeman · 2022-04-06T19:22:57Z

> Do we need a parallel Dysfluency=Yes?

No, I think Typo=Yes is enough and its definition should be written in such a way that it generalizes to spoken data.

nschneid · 2022-04-06T19:30:00Z

Hmm...I think if it were to include speech errors as well as typographical errors it should be called "error", right? "Typo" really suggests a typographical artifact.

dan-zeman · 2022-04-06T21:05:11Z

> Hmm...I think if it were to include speech errors as well as typographical errors it should be called "error", right? "Typo" really suggests a typographical artifact.

Maybe. But it is already a part of the universal guidelines and I would not change it or add an extra feature just because the string sounds less adequate than it perhaps could be. "Pick a suboptimal standard and stick to it."

amir-zeldes · 2022-04-06T21:09:32Z

I agree with both of you :| it sounds ugly, but I admit it's silly to have two annotations that both basically imply an error with a CorrectForm. BTW for things like 'i' we now have corrected forms in GUM upstream, so we plan to propagate CorrectForms to the UD release as well, and indeed they are also specified for spoken data (in GUM the annotation name is sic, which is taken from the TEI inventory).

nschneid · 2022-04-06T21:26:00Z

Before we make a decision about the "Typo" name, let's discuss the capitalization policy.

For EWT we normalize capitalization somewhat in lemmas, but I don't think it makes sense to treat all-lowercase sentences (or for that matter all-caps sentences) as typos with CorrectForms—flexibility with capitalization comes with the territory in user-generated content.

Maybe the policy should be that it is up to each treebank to decide what to do about capitalization. Or, maybe the way to think about it is that lemmas (ideally) capture canonical capitalization, and therefore any capitalization deviation between the form and lemma can be detected without a special feature.

(Several weeks ago I floated the idea of developing a canonicalization standard, specifying all info needed to derive a well-edited sentence/text if possible with only superficial edits, but it seemed like nobody was interested in that.)

amir-zeldes · 2022-04-06T21:36:22Z

Yeah, a standard would be great, but I suspect it's very complicated... FWIW, GUM behavior is to not mark all caps as in any way an error, but mark it as rendered in all caps (<hi rend="caps">, so similar to how we mark up italics), whereas lower-casing something that should conventionally be upper case is marked with sic:

<sic ana="I">i</sic> <hi rend="caps">WANTED</hi> to go to <sic ana="SyntaxFest">syntaxfest</sic> in <sic ana="Bulgaria">bulgaria</sic>
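To make the markup above concrete, here is a minimal Python sketch of how a normalized reading could be derived from such a fragment. The function name `normalize_sic` is hypothetical and not part of GUM or any UD tooling; the sketch assumes a flat fragment (no nested elements) in which `sic` carries its target form in `ana`, and it treats `hi` as rendering only, so the all-caps word is left unchanged.

```python
import xml.etree.ElementTree as ET


def normalize_sic(fragment: str) -> str:
    """Replace each <sic> element with its `ana` value; keep other text as is.

    Hypothetical helper for illustration only. Assumes a flat fragment.
    """
    # Wrap the fragment in a dummy root so it parses as well-formed XML.
    root = ET.fromstring(f"<s>{fragment}</s>")
    parts = [root.text or ""]
    for el in root:
        original = "".join(el.itertext())
        if el.tag == "sic":
            # Use the target form; fall back to the original if `ana` is absent.
            parts.append(el.get("ana", original))
        else:
            # e.g. <hi rend="caps">: rendering information, not an error.
            parts.append(original)
        parts.append(el.tail or "")
    return "".join(parts)


example = (
    '<sic ana="I">i</sic> <hi rend="caps">WANTED</hi> to go to '
    '<sic ana="SyntaxFest">syntaxfest</sic> in <sic ana="Bulgaria">bulgaria</sic>'
)
print(normalize_sic(example))  # I WANTED to go to SyntaxFest in Bulgaria
```

An `ana=""` value would simply drop the token in this sketch, which is how an empty target hypothesis could be read.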

nschneid · 2022-04-07T00:47:47Z

Gotcha. Rendering=Caps could be a good MISC feature. In any case it should almost always be sufficient to compare the casing of the lemma to the casing of the form to see if it is nonstandard. I guess the weird exception would be if only an inflectional part of the word uses nonstandard casing, e.g. "wantED".

Capitalization (and punctuation) at sentence boundaries would be another thing that could be canonicalized, but when it comes to EWT I can list many higher-priority items like standardizing our approach to pronoun lemmas!
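The lemma/form comparison suggested above can be sketched in a few lines of Python. `casing_differs` is a hypothetical name, and the approach is an assumption about how such a check might work: it compares casing only over the prefix that the form and lemma share when case is ignored. That is why it cannot see the "wantED" case, where only the inflectional ending is oddly cased.

```python
def casing_differs(form: str, lemma: str) -> bool:
    """Return True if form and lemma differ in casing over their shared prefix.

    Illustrative sketch only: the shared prefix is computed case-insensitively,
    so irregular forms with no common prefix (e.g. went/go) are never flagged.
    """
    shared = 0
    for f, l in zip(form, lemma):
        if f.lower() != l.lower():
            break
        shared += 1
    return form[:shared] != lemma[:shared]


print(casing_differs("bulgaria", "Bulgaria"))  # True
print(casing_differs("WANTED", "want"))        # True
print(casing_differs("wanted", "want"))        # False
print(casing_differs("wantED", "want"))        # False: the exception noted above
```

Note that this flags all-caps and sentence-initial capitals as well, so a treebank would still need its own policy for which deviations count as nonstandard.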

amir-zeldes · 2022-04-07T20:30:31Z

One more question about superfluous material. Suppose I have this text:

When a young girl," the family moved south to Vichy, spending vacations at the paternal ancestral village of Mazirat

It's pretty clear the quotation marks after girl, are just a 'fat-finger' typo, and this is marked up in GUM with the target hypothesis "" (empty string). Should this have Typo=Yes? If not, what else? Should it have CorrectForm? If so, which is correct:

CorrectForm= (nothing, it shouldn't be there, so zero length string value)
CorrectForm=_ (because 'empty' is "_"; but then 'auto-construction' of normative text is hampered)

What do you think?

nschneid · 2022-04-07T21:56:41Z

I would say a spurious token should be considered reparandum, no need to tag it with Typo or CorrectForm.

sylvainkahane · 2022-04-08T07:51:33Z

A reparandum would need a repair. Reparandum-repair pairs in spoken data are something quite different.

nschneid · 2022-04-08T12:51:15Z

This is the current policy articulated here:

> If the text contains by error a word that should not be there, it can be treated similarly to speech disfluences, that is, attached to the following constituent via the reparandum relation. A relatively common instance in written language is that a word is typed twice in a row.

Maybe the name "reparandum" is being interpreted liberally but I don't see a better deprel to use for accidental extra words.

amir-zeldes · 2022-04-08T13:32:07Z

I agree with @sylvainkahane, it seems odd to call this a reparandum. I think that label should always point LTR, but you could have superfluous trailing punctuation. It seems least offensive to attach it as punct since it's not a case of the writer starting to utter punctuation, then changing their mind and repairing it with an alternative (what is the 'repair' for these quotation marks?)

nschneid · 2022-04-08T13:58:56Z

Sometimes a word is accidentally repeated, in which case it seems reasonable to treat the first as the reparandum and the second as its head.

If we are talking about a totally accidental token not syntactically/semantically connected to the sentence at all (which could be punctuation but need not be—suppose someone typing on their phone accidentally types "x" or "1"), maybe it should attach as reparandum to the root word of the sentence. Or maybe it should be dep. My personal interpretation of dep is "I have no idea what this word/phrase is doing here." :)

amir-zeldes · 2022-04-08T15:42:40Z

Discussing the deprel is also interesting, but assuming for some reason we don't want to use reparandum, shouldn't we use Typo and an empty CorrectForm to allow for the normalized version of the sentence to be reproduced using a straightforward procedure?
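As a rough illustration of the "straightforward procedure" mentioned here, the Python sketch below rebuilds a normalized sentence from (FORM, MISC) pairs, treating an empty `CorrectForm=` as "drop this token". This is only the option raised in this comment, not an adopted policy, and whether an empty value is acceptable in MISC is an assumption. The helper names `parse_misc` and `normalized_text` are hypothetical.

```python
def parse_misc(misc: str) -> dict:
    """Parse a CoNLL-U MISC string such as 'SpaceAfter=No|CorrectForm=know'."""
    if misc == "_":
        return {}
    # partition() keeps an empty value, so 'CorrectForm=' maps to ''.
    return dict(item.partition("=")[::2] for item in misc.split("|"))


def normalized_text(tokens) -> str:
    """Join tokens, preferring CorrectForm; an empty CorrectForm drops the token."""
    out = []
    need_space = False
    for form, misc in tokens:
        attrs = parse_misc(misc)
        text = attrs.get("CorrectForm", form)
        space_after = attrs.get("SpaceAfter") != "No"
        if text == "":
            # Spurious token: skip it, but keep any space that followed it.
            need_space = need_space or space_after
            continue
        if need_space and out:
            out.append(" ")
        out.append(text)
        need_space = space_after
    return "".join(out)


tokens = [
    ("When", "_"), ("a", "_"), ("young", "_"),
    ("girl", "SpaceAfter=No"), (",", "SpaceAfter=No"),
    ('"', "CorrectForm="),  # assumed encoding of the empty target hypothesis
    ("the", "_"), ("family", "_"),
]
print(normalized_text(tokens))  # When a young girl, the family
```

With `CorrectForm=_` instead, the same procedure would emit a literal underscore unless `_` were special-cased, which is the hampering of auto-construction noted earlier in the thread.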

nschneid · 2022-04-10T20:12:30Z

My gut feeling is that if a word is superfluous that should be indicated on the deprel (because it also pertains to the syntactic structure), not with an empty CorrectForm. But I'm open to being persuaded otherwise if there's a compelling reason that the deprel is insufficient.

nschneid · 2022-04-10T20:21:22Z

OK if we publish the revised Typo feature page? The extra word policy is given on a different page (https://universaldependencies.org/u/overview/typos.html) so feel free to open a separate issue for that.

amir-zeldes · 2022-04-13T17:18:12Z

LGTM!

Stormur · 2022-05-04T16:40:51Z

> Or, maybe the way to think about it is that lemmas (ideally) capture canonical capitalization, and therefore any capitalization deviation between the form and lemma can be detected without a special feature.

I do not agree with this approach. Lemmas should have nothing to do with capitalisation; it is a conventional fact that pertains to the form only. So I would rather generally enforce all-lowercase lemmas universally. "Canonical capitalisation" can be detected by comparing all occurring forms of a given lemma.
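One possible reading of "comparing all occurring forms of a given lemma" is a majority vote over casing patterns, sketched below in Python. The function names and the three-way pattern scheme (`lower`, `title`, `upper`) are assumptions made for illustration; nothing in the thread specifies how the comparison would actually be done.

```python
from collections import Counter, defaultdict


def case_pattern(form: str) -> str:
    """Classify a form's casing as 'upper', 'title' or 'lower' (coarse scheme)."""
    if len(form) > 1 and form.isupper():
        return "upper"
    if form[:1].isupper():
        return "title"
    return "lower"


def majority_patterns(pairs) -> dict:
    """Map each lowercased lemma to the most frequent casing pattern of its forms."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[lemma.lower()][case_pattern(form)] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


pairs = [
    ("France", "france"), ("France", "france"), ("france", "france"),
    ("NASA", "nasa"),
    ("or", "or"), ("OR", "or"), ("or", "or"),
]
print(majority_patterns(pairs))
# {'france': 'title', 'nasa': 'upper', 'or': 'lower'}
```

The vote is only as good as the corpus: sentence-initial capitals skew it, and any two words that share a lowercased lemma string are counted together.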

amir-zeldes · 2022-05-04T18:40:33Z

Would you say the lemma of France is "france"? Or the lemma of "NASA" is "nasa"? That could produce very strange cases where an acronym, say "OR" ('operating room', usually spelled in caps), looks like a common word - "or". Or how about in German, where capitalization distinguishes the NOUN lemma "Ansehen" (reputation, esteem) from the VERB lemma "ansehen" (look at something)?

Also note that in many inflectional languages, the dictionary form of a word is considered to be its nominative singular. If that form is conventionally capitalized, we would be ignoring the lemmatization conventions of that language by using a lowercase form which may never be used. It's easy to lower case the lemma field if you need to do that for some application, but reliably recovering that information if it is not included in the corpus is not possible.

Stormur · 2022-05-05T08:44:32Z

> Would you say the lemma of France is "france"?

Yes (does it really look so bad?)

> Or the lemma of "NASA" is "nasa"? That could produce very strange cases where an acronym, say "OR" ('operating room', usually spelled in caps), looks like a common word - "or".

Acronyms might be a different issue, since they start with a different nature and we might discuss if they really have "lemmas" (and if they're not treated as multiword tokens). I don't know if they are an appropriate example.

> Or how about in German, where capitalization distinguishes the NOUN lemma "Ansehen" (reputation, esteem) from the VERB lemma "ansehen" (look at something)?

Well, the distinction here is given by the POS. One cannot really think of lemmas and POSs in isolation. Anyway, there are innumerable other ambiguities that exist even without regard to capitalisation if one just looks at the lemma strings and ignores the other annotation layers. Besides, in this particular case, one might argue that we are indeed looking at the very same word: one is capitalised in nominal contexts, but in both cases it is the same verbal noun form (infinitive) of the same word, so capitalisation in the lemma is just continuing a rather arbitrary German orthographic convention (i.e. "all nouns have to be capitalised", which actually often means "all words in a nominal context"). But again, the POS can distinguish if one wants to treat them as different words.

> Also note that in many inflectional languages, the dictionary form of a word is considered to be its nominative singular. If that form is conventionally capitalized, we would be ignoring the lemmatization conventions of that language by using a lowercase form which may never be used. It's easy to lower case the lemma field if you need to do that for some application, but reliably recovering that information if it is not included in the corpus is not possible.

I don't understand exactly the issue at stake here.
In general, I would think of the lemma in the sense of the LEMMA field as different from the lemma in the sense of a dictionary entry. The more uniform they are, the better, also for data treatment. I could also reverse the argument by noting that even if you annotate a lemma "as it should be", with all its capitalisations right, in some texts you might never find it in that "correct" form.

dan-zeman · 2022-05-05T09:00:32Z

> > Would you say the lemma of France is "france"?
>
> Yes (does it really look so bad?)

Yes :-)

Stormur · 2022-05-05T09:07:23Z

> > > Would you say the lemma of France is "france"?
> >
> > Yes (does it really look so bad?)
>
> Yes :-)

It really is just habit!

amir-zeldes · 2022-05-05T13:19:48Z

> if you annotate a lemma "as it should be", with all its capitalisations right, in some texts you might never find it in that "correct" form.

Absolutely - that's why we have Typo and CorrectForm if we want to indicate that a spelling is non-standard. I would consider "fbi" to be a non-standard way of spelling FBI, and it could also be misspelled FPI, or FBY, but if I understand that the intention was to say "FBI" then the lemma should be FBI and we should have Typo=Yes.

> It really is just habit!

Yes, but it is a habit of the vast majority of the English-speaking community, so I don't really see a reason to deviate from it in lemmatization.

> LEMMA field as different from the lemma in the sense of dictionary entry. The more uniform they are, the better, also for data treatment.

I agree that dictionary entries are not always the same as UD lemmas, but if we can easily maintain parity between the two for capitalization, why not do it?

> in this particular case, one might argue that we are indeed looking at the very same word

No, I don't think so - I am not a German native speaker, but I very much doubt that most speakers see the noun Ansehen (esteem) as the same word as ansehen (look at). They are of course etymologically related, but this case is not the result of spontaneous conversion with transparent meaning. Evidence from separate entries in Duden, the standard German dictionary:

https://www.duden.de/rechtschreibung/Ansehen
https://www.duden.de/rechtschreibung/ansehen

Typo definition

75a73aa

nschneid added 3 commits April 7, 2022 09:58

elaborate on stylistic expressiveness, placeholder for grammatical er…

4a3393d

…orrs/dysfluencies

capitalization: leave it up to treebanks

4857bb4

broadly cover word form errors

e282137

dan-zeman reviewed Apr 7, 2022

nschneid added 2 commits April 10, 2022 16:15

limit discussion of Style feature

55096e4

extra words: limit statement about Type=Yes to reparandum tokens

717ec29

dan-zeman merged commit d4c66a9 into pages-source Apr 13, 2022

dan-zeman deleted the nschneid-typo-def branch April 13, 2022 18:16
