Typo definition #858

Merged
dan-zeman merged 6 commits into pages-source from nschneid-typo-def
Apr 13, 2022

Conversation

Contributor

@nschneid nschneid commented Apr 6, 2022

A potential rewrite of the Typo definition for discussion.

The new definition broadens beyond errors to include typographically unexpected spellings. For example:

This may shift the boundary with Style=Expr when it comes to odd but intentional spellings like CA$H.

Thoughts?

@dan-zeman
Member

Sounds good to me.

@amir-zeldes
Contributor

I have a comment and a question:

unexpected capitalization choices do not fall under Typo=Yes

What about things like "i" for pronoun "I"? I think this is as unexpected in 'normal' text as other spelling abnormalities.

And a question: how does this relate to spoken data, where there may be dysfluencies that should receive a CorrectForm? Some examples:

  • we call it a sholids (for "a solid", as opposed to a liquid or gas)
  • they known about it (probably for "know", in context)

These are not 'typos' (since they're spoken), but basically have the same structure: unexpected form for which we can guess the CorrectForm. Do we need a parallel Dysfluency=Yes?

@dan-zeman
Member

dan-zeman commented Apr 6, 2022

Do we need a parallel Dysfluency=Yes?

No, I think Typo=Yes is enough and its definition should be written in such a way that it generalizes to spoken data.

@nschneid
Contributor Author

nschneid commented Apr 6, 2022

Hmm...I think if it were to include speech errors as well as typographical errors it should be called "error", right? "Typo" really suggests a typographical artifact.

@dan-zeman
Member

Hmm...I think if it were to include speech errors as well as typographical errors it should be called "error", right? "Typo" really suggests a typographical artifact.

Maybe. But it is already a part of the universal guidelines and I would not change it or add an extra feature just because the string sounds less adequate than it perhaps could be. "Pick a suboptimal standard and stick to it."

@amir-zeldes
Contributor

I agree with both of you :| it sounds ugly, but I admit it's silly to have two annotations that both basically imply an error with a CorrectForm. BTW for things like 'i' we now have corrected forms in GUM upstream, so we plan to propagate CorrectForms to the UD release as well, and indeed they are also specified for spoken data (in GUM the annotation name is sic, which is taken from the TEI inventory).

@nschneid
Contributor Author

nschneid commented Apr 6, 2022

Before we make a decision about the "Typo" name, let's discuss the capitalization policy.

For EWT we normalize capitalization somewhat in lemmas, but I don't think it makes sense to treat all-lowercase sentences (or for that matter all-caps sentences) as typos with CorrectForms—flexibility with capitalization comes with the territory in user-generated content.

Maybe the policy should be that it is up to each treebank to decide what to do about capitalization. Or, maybe the way to think about it is that lemmas (ideally) capture canonical capitalization, and therefore any capitalization deviation between the form and lemma can be detected without a special feature.

(Several weeks ago I floated the idea of developing a canonicalization standard, specifying all info needed to derive a well-edited sentence/text if possible with only superficial edits, but it seemed like nobody was interested in that.)

@amir-zeldes
Contributor

amir-zeldes commented Apr 6, 2022

Yeah, a standard would be great, but I suspect it's very complicated... FWIW, GUM behavior is to not mark all caps as in any way an error, but mark it as rendered in all caps (<hi rend="caps">, so similar to how we mark up italics), whereas lower-casing something that should conventionally be upper case is marked with sic:

<sic ana="I">i</sic> <hi rend="caps">WANTED</hi> to go to <sic ana="SyntaxFest">syntaxfest</sic> in <sic ana="Bulgaria">bulgaria</sic>
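
To illustrate how markup of this kind could feed a normalized rendering, here is a minimal Python sketch. It assumes the fragment becomes well-formed XML once wrapped in a hypothetical `<s>` root and that every `sic` element carries its target form in `ana`. The function name `normalize_fragment` is made up for illustration and is not part of any GUM tooling.

```python
import xml.etree.ElementTree as ET


def normalize_fragment(fragment: str) -> str:
    """Replace each <sic> by its ana value; keep the text of other elements."""
    # Wrap in a throwaway root so the fragment parses as one XML document.
    root = ET.fromstring(f"<s>{fragment}</s>")
    parts = [root.text or ""]
    for el in root:
        if el.tag == "sic" and "ana" in el.attrib:
            # Target hypothesis stored on the element (may be an empty string).
            parts.append(el.attrib["ana"])
        else:
            # Rendering markup such as <hi rend="caps"> is kept verbatim.
            parts.append("".join(el.itertext()))
        parts.append(el.tail or "")
    return "".join(parts)


example = (
    '<sic ana="I">i</sic> <hi rend="caps">WANTED</hi> to go to '
    '<sic ana="SyntaxFest">syntaxfest</sic> in <sic ana="Bulgaria">bulgaria</sic>'
)
print(normalize_fragment(example))  # I WANTED to go to SyntaxFest in Bulgaria
```

The all-caps word passes through unchanged, matching the description above that caps are treated as rendering rather than as an error.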

@nschneid
Contributor Author

nschneid commented Apr 7, 2022

Gotcha. Rendering=Caps could be a good MISC feature. In any case it should almost always be sufficient to compare the casing of the lemma to the casing of the form to see if it is nonstandard. I guess the weird exception would be if only an inflectional part of the word uses nonstandard casing, e.g. "wantED".
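
A rough Python sketch of that form/lemma comparison, as a heuristic only: it compares the casing of the stretch that the form and the lemma share from the left, so inflectional endings are ignored. The function name `casing_differs` is hypothetical, and the sketch assumes the lemma carries canonical capitalization.

```python
def casing_differs(form: str, lemma: str) -> bool:
    """True if form and lemma differ in case on their shared leading stretch."""
    n = 0
    for f, l in zip(form, lemma):
        if f.lower() != l.lower():
            break  # form and lemma diverge here (e.g. inflection, suppletion)
        n += 1
    return form[:n] != lemma[:n]


print(casing_differs("bulgaria", "Bulgaria"))  # True
print(casing_differs("WANTED", "want"))        # True
print(casing_differs("wanted", "want"))        # False
print(casing_differs("wantED", "want"))        # False: the exception noted above
```

Sentence-initial capitalization ("The" with lemma "the") is also flagged by this check, so a caller would need to handle the first token of a sentence separately.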

Capitalization (and punctuation) at sentence boundaries would be another thing that could be canonicalized, but when it comes to EWT I can list many higher-priority items like standardizing our approach to pronoun lemmas!

_u-feat/Typo.md (review thread: outdated, resolved)
@amir-zeldes
Contributor

amir-zeldes commented Apr 7, 2022

One more question about superfluous material. Suppose I have this text:

  • When a young girl," the family moved south to Vichy, spending vacations at the paternal ancestral village of Mazirat

It's pretty clear the quotation marks after girl, are just a 'fat-finger' typo, and this is marked up in GUM with the target hypothesis "" (empty string). Should this have Typo=Yes? If not, what else? Should it have CorrectForm? If so, which is correct:

  • CorrectForm= (nothing, it shouldn't be there, so zero length string value)
  • CorrectForm=_ (because 'empty' is "_"; but then 'auto-construction' of normative text is hampered)

What do you think?

@nschneid
Contributor Author

nschneid commented Apr 7, 2022

I would say a spurious token should be considered reparandum, no need to tag it with Typo or CorrectForm.

@sylvainkahane
Contributor

A reparandum would need a repair. Reparandum-repair pairs in spoken data are something quite different.

@nschneid
Contributor Author

nschneid commented Apr 8, 2022

This is the current policy articulated here:

If the text contains by error a word that should not be there, it can be treated similarly to speech disfluencies, that is, attached to the following constituent via the reparandum relation. A relatively common instance in written language is that a word is typed twice in a row.

Maybe the name "reparandum" is being interpreted liberally but I don't see a better deprel to use for accidental extra words.

@amir-zeldes
Contributor

I agree with @sylvainkahane , it seems odd to call this a reparandum. I think that label should always point LTR, but you could have superfluous trailing punctuation. It seems least offensive to attach it as punct since it's not a case of the writer starting to utter punctuation, then changing their mind and repairing it with an alternative (what is the 'repair' for these quotation marks?)

@nschneid
Contributor Author

nschneid commented Apr 8, 2022

Sometimes a word is accidentally repeated, in which case it seems reasonable to treat the first as the reparandum and the second as its head.

If we are talking about a totally accidental token not syntactically/semantically connected to the sentence at all (which could be punctuation but need not be—suppose someone typing on their phone accidentally types "x" or "1"), maybe it should attach as reparandum to the root word of the sentence. Or maybe it should be dep. My personal interpretation of dep is "I have no idea what this word/phrase is doing here." :)

@amir-zeldes
Contributor

Discussing the deprel is also interesting, but assuming for some reason we don't want to use reparandum, shouldn't we use Typo and an empty CorrectForm to allow for the normalized version of the sentence to be reproduced using a straightforward procedure?
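
The "straightforward procedure" could look roughly like the Python sketch below. It is only an illustration under stated assumptions: tokens are given as hypothetical `(form, misc)` pairs, `CorrectForm` and `SpaceAfter=No` are read from the MISC string, and an empty `CorrectForm=` value means "delete this token", which is the proposal under discussion here rather than established UD practice. The helper names are made up.

```python
def parse_misc(misc: str) -> dict:
    """Split a CoNLL-U MISC string such as 'SpaceAfter=No|CorrectForm=I'."""
    if misc == "_":
        return {}
    feats = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        feats[key] = value  # 'CorrectForm=' yields an empty string value
    return feats


def normalized_text(tokens) -> str:
    """Rebuild the sentence, preferring CorrectForm over the surface form."""
    pieces = []
    for form, misc in tokens:
        feats = parse_misc(misc)
        pieces.append(feats.get("CorrectForm", form))
        if feats.get("SpaceAfter") != "No":
            pieces.append(" ")
    # Collapse the extra whitespace left behind by deleted tokens.
    return " ".join("".join(pieces).split())


tokens = [
    ("When", "_"), ("a", "_"), ("young", "_"),
    ("girl", "SpaceAfter=No"), (",", "SpaceAfter=No"),
    ('"', "CorrectForm="),  # spurious quotation mark, empty target
    ("the", "_"), ("family", "_"),
]
print(normalized_text(tokens))  # When a young girl, the family
```

With `CorrectForm=_` instead, this procedure would emit a literal underscore unless it special-cased that value, which is the drawback mentioned for the second option above.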

@nschneid
Contributor Author

My gut feeling is that if a word is superfluous that should be indicated on the deprel (because it also pertains to the syntactic structure), not with an empty CorrectForm. But I'm open to being persuaded otherwise if there's a compelling reason that the deprel is insufficient.

@nschneid
Contributor Author

OK if we publish the revised Typo feature page? The extra word policy is given on a different page (https://universaldependencies.org/u/overview/typos.html) so feel free to open a separate issue for that.

@amir-zeldes
Contributor

LGTM!

@dan-zeman dan-zeman merged commit d4c66a9 into pages-source Apr 13, 2022
@dan-zeman dan-zeman deleted the nschneid-typo-def branch April 13, 2022 18:16
@Stormur
Contributor

Stormur commented May 4, 2022

Or, maybe the way to think about it is that lemmas (ideally) capture canonical capitalization, and therefore any capitalization deviation between the form and lemma can be detected without a special feature.

I do not agree with this approach. Lemmas should have nothing to do with capitalisation; it is a conventional fact that pertains to the form only. So I would rather enforce all-lowercase lemmas universally. "Canonical capitalisation" can be detected by comparing all occurring forms of a given lemma.
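
One way to read "comparing all occurring forms" is a majority vote over casing patterns per lemma. The Python sketch below is an assumption about how that could work, not a procedure specified anywhere in this thread; `casing_pattern` and `canonical_casing` are hypothetical names, and the input is assumed to be `(form, lowercase_lemma)` pairs.

```python
from collections import Counter, defaultdict


def casing_pattern(form: str) -> str:
    """Coarse casing class of a word form."""
    if form.islower():
        return "lower"
    if form.istitle():
        return "title"
    if form.isupper():
        return "upper"
    return "mixed"


def canonical_casing(pairs) -> dict:
    """Most frequent casing pattern observed for each lemma."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[lemma][casing_pattern(form)] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


pairs = [
    ("France", "france"), ("France", "france"), ("france", "france"),
    ("NASA", "nasa"),
    ("or", "or"), ("or", "or"), ("OR", "or"),
]
print(canonical_casing(pairs))
# {'france': 'title', 'nasa': 'upper', 'or': 'lower'}
```

Two caveats apply: sentence-initial capitalization inflates the "title" count for common words, and a lemma seen only in nonstandard casing yields the wrong answer.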

@amir-zeldes
Contributor

Would you say the lemma of France is "france"? Or the lemma of "NASA" is "nasa"? That could produce very strange cases where an acronym, say "OR" ('operating room', usually spelled in caps), looks like a common word - "or". Or how about in German, where capitalization distinguishes the NOUN lemma "Ansehen" (reputation, esteem) from the VERB lemma "ansehen" (look at something)?

Also note that in many inflectional languages, the dictionary form of a word is considered to be its nominative singular. If that form is conventionally capitalized, we would be ignoring the lemmatization conventions of that language by using a lowercase form which may never be used. It's easy to lower case the lemma field if you need to do that for some application, but reliably recovering that information if it is not included in the corpus is not possible.

@Stormur
Contributor

Stormur commented May 5, 2022

Would you say the lemma of France is "france"?

Yes (does it really look so bad?)

Or the lemma of "NASA" is "nasa"? That could produce very strange cases where an acronym, say "OR" ('operating room', usually spelled in caps), looks like a common word - "or".

Acronyms might be a different issue, since they start with a different nature and we might discuss if they really have "lemmas" (and if they're not treated as multiword tokens). I don't know if they are an appropriate example.

Or how about in German, where capitalization distinguishes the NOUN lemma "Ansehen" (reputation, esteem) from the VERB lemma "ansehen" (look at something)?

Well, the distinction here is given by the POS. One cannot really think of lemmas and POSs in isolation. Anyway, there are innumerable other ambiguities that exist even without regard to capitalisation if one just looks at the lemma strings and ignores the other annotation layers. Besides, in this particular case, one might argue that we are indeed looking at the very same word: one is capitalised in nominal contexts, but in both cases it is the same verbal noun form (infinitive) of the same word, so capitalisation in the lemma is just continuing a rather arbitrary German orthographic convention (i.e. "all nouns have to be capitalised", which actually often means "all words in a nominal context"). But again, the POS can distinguish them if one wants to treat them as different words.

Also note that in many inflectional languages, the dictionary form of a word is considered to be its nominative singular. If that form is conventionally capitalized, we would be ignoring the lemmatization conventions of that language by using a lowercase form which may never be used. It's easy to lower case the lemma field if you need to do that for some application, but reliably recovering that information if it is not included in the corpus is not possible.

I don't understand exactly the issue at stake here.
In general, I would think of the lemma in the sense of the LEMMA field as different from the lemma in the sense of dictionary entry. The more uniform they are, the better, also for data treatment. I could also reverse the argument by noting that even if you annotate a lemma "as it should be", with all its capitalisations right, in some texts you might never find it in that "correct" form.

@dan-zeman
Member

Would you say the lemma of France is "france"?

Yes (does it really look so bad?)

Yes :-)

@Stormur
Contributor

Stormur commented May 5, 2022

Would you say the lemma of France is "france"?

Yes (does it really look so bad?)

Yes :-)

It really is just habit!

@amir-zeldes
Contributor

if you annotate a lemma "as it should be", with all its capitalisations right, in some texts you might never find it in that "correct" form.

Absolutely - that's why we have Typo and CorrectForm if we want to indicate that a spelling is non-standard. I would consider "fbi" to be a non-standard way of spelling FBI, and it could also be misspelled FPI, or FBY, but if I understand that the intention was to say "FBI" then the lemma should be FBI and we should have Typo=Yes.

It really is just habit!

Yes, but it is a habit of the vast majority of the English-speaking community, so I don't really see a reason to deviate from it in lemmatization.

LEMMA field as different from the lemma in the sense of dictionary entry. The more uniform they are, the better, also for data treatment.

I agree that dictionary entries are not always the same as UD lemmas, but if we can easily maintain parity between the two for capitalization, why not do it?

in this particular case, one might argue that we are indeed looking at the very same word

No, I don't think so - I am not a German native speaker, but I very much doubt that most speakers see the noun Ansehen (esteem) as the same word as ansehen (look at). They are of course etymologically related, but this case is not the result of spontaneous conversion with transparent meaning. Evidence from separate entries in Duden, the standard German dictionary:

https://www.duden.de/rechtschreibung/Ansehen
https://www.duden.de/rechtschreibung/ansehen

5 participants