Typo definition #858
Sounds good to me. |
I have a comment and a question:
What about things like "i" for pronoun "I"? I think this is as unexpected in 'normal' text as other spelling abnormalities. And a question: how does this relate to spoken data, where there may be dysfluencies that should receive a
These are not 'typos' (since they're spoken), but basically have the same structure: unexpected form for which we can guess the CorrectForm. Do we need a parallel |
No, I think |
Hmm...I think if it were to include speech errors as well as typographical errors it should be called "error", right? "Typo" really suggests a typographical artifact. |
Maybe. But it is already a part of the universal guidelines and I would not change it or add an extra feature just because the string sounds less adequate than it perhaps could be. "Pick a suboptimal standard and stick to it." |
I agree with both of you :| it sounds ugly, but I admit it's silly to have two annotations that both basically imply an error with a CorrectForm. BTW for things like 'i' we now have corrected forms in GUM upstream, so we plan to propagate CorrectForms to the UD release as well, and indeed they are also specified for spoken data (in GUM the annotation name is |
Before we make a decision about the "Typo" name, let's discuss the capitalization policy. For EWT we normalize capitalization somewhat in lemmas, but I don't think it makes sense to treat all-lowercase sentences (or for that matter all-caps sentences) as typos with CorrectForms—flexibility with capitalization comes with the territory in user-generated content. Maybe the policy should be that it is up to each treebank to decide what to do about capitalization. Or, maybe the way to think about it is that lemmas (ideally) capture canonical capitalization, and therefore any capitalization deviation between the form and lemma can be detected without a special feature. (Several weeks ago I floated the idea of developing a canonicalization standard, specifying all info needed to derive a well-edited sentence/text if possible with only superficial edits, but it seemed like nobody was interested in that.) |
Yeah, a standard would be great, but I suspect it's very complicated... FWIW, GUM behavior is to not mark all caps as in any way an error, but mark it as rendered in all caps (
|
Gotcha. Capitalization (and punctuation) at sentence boundaries would be another thing that could be canonicalized, but when it comes to EWT I can list many higher-priority items like standardizing our approach to pronoun lemmas! |
One more question about superfluous material. Suppose I have this text:
It's pretty clear the quotation marks after
What do you think? |
I would say a spurious token should be considered |
A reparandum would need a repair. Reparandum-repair pairs in spoken data are something quite different. |
This is the current policy articulated here:
Maybe the name "reparandum" is being interpreted liberally but I don't see a better deprel to use for accidental extra words. |
I agree with @sylvainkahane, it seems odd to call this a reparandum. I think that label should always point LTR, but you could have superfluous trailing punctuation. It seems least offensive to attach it as |
Sometimes a word is accidentally repeated, in which case it seems reasonable to treat the first as the reparandum and the second as its head. If we are talking about a totally accidental token not syntactically/semantically connected to the sentence at all (which could be punctuation but need not be—suppose someone typing on their phone accidentally types "x" or "1"), maybe it should attach as |
Discussing the deprel is also interesting, but assuming for some reason we don't want to use reparandum, shouldn't we use Typo and an empty CorrectForm to allow for the normalized version of the sentence to be reproduced using a straightforward procedure? |
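To make the "straightforward procedure" concrete, here is a minimal Python sketch, not an agreed UD convention. It assumes that `CorrectForm` and `SpaceAfter` live in the MISC column of 10-column CoNLL-U token lines, and that an empty correction is written as `CorrectForm=_`; that encoding is hypothetical, as are the helper names `parse_misc`, `normalized_text` and `row`. Multiword tokens and empty nodes are ignored.

```python
EMPTY = "_"  # assumed placeholder for an empty CorrectForm (hypothetical encoding)


def parse_misc(misc):
    """Turn a CoNLL-U MISC column into a dict; '_' alone means no attributes."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def normalized_text(token_lines):
    """Rebuild the corrected sentence text from 10-column CoNLL-U token lines."""
    pieces = []
    for line in token_lines:
        cols = line.split("\t")
        form, misc = cols[1], parse_misc(cols[9])
        correction = misc.get("CorrectForm")
        space_after = misc.get("SpaceAfter") != "No"
        if correction == EMPTY:
            # Superfluous token: drop it, but keep the space it carried if needed.
            if space_after and pieces and pieces[-1] != " ":
                pieces.append(" ")
            continue
        pieces.append(form if correction is None else correction)
        if space_after:
            pieces.append(" ")
    return "".join(pieces).strip()


def row(tok_id, form, misc="_"):
    """Build a token line with only ID, FORM and MISC filled in."""
    return "\t".join([tok_id, form] + ["_"] * 7 + [misc])


# Invented example: one misspelling and one spurious quotation mark.
sample = [
    row("1", "I"),
    row("2", "saw"),
    row("3", "teh", "CorrectForm=the"),
    row("4", "cat", "SpaceAfter=No"),
    row("5", '"', "CorrectForm=_|SpaceAfter=No"),
    row("6", "."),
]

print(normalized_text(sample))
```

On the invented sample this prints `I saw the cat.`. Under this assumption the normalized text can be recovered without consulting the deprel at all.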
My gut feeling is that if a word is superfluous that should be indicated on the deprel (because it also pertains to the syntactic structure), not with an empty CorrectForm. But I'm open to being persuaded otherwise if there's a compelling reason that the deprel is insufficient. |
OK if we publish the revised Typo feature page? The extra word policy is given on a different page (https://universaldependencies.org/u/overview/typos.html) so feel free to open a separate issue for that. |
LGTM! |
I do not agree with this approach. Lemmas should have nothing to do with capitalisation; it is a conventional fact that pertains to the form only. So I would rather generally enforce all-lowercase lemmas universally. "Canonical capitalisation" can be detected by comparing all occurring forms of a given lemma. |
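A rough Python sketch of what "comparing all occurring forms of a given lemma" could look like, under stated assumptions: lemmas are all-lowercase, the input is a list of (form, lemma) pairs, and the most frequent casing pattern is taken as canonical. The function names and the majority-vote heuristic are illustrative only, not an existing tool.

```python
from collections import Counter, defaultdict


def casing(form):
    """Classify a form's capitalisation pattern."""
    if form.islower():
        return "lower"
    if form.isupper():
        return "upper"
    if form[0].isupper() and form[1:].islower():
        return "title"
    return "mixed"


def canonical_casing(pairs):
    """Map each lowercased lemma to the casing pattern its forms show most often."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[lemma.lower()][casing(form)] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


# Invented (form, lemma) pairs for illustration.
pairs = [
    ("France", "france"), ("France", "france"), ("FRANCE", "france"),
    ("the", "the"), ("The", "the"), ("the", "the"),
    ("NASA", "nasa"), ("NASA", "nasa"),
]

print(canonical_casing(pairs))
```

Keyed on the lemma string alone, this merges homographs that differ only by case; including the POS in the key would keep them apart. Sentence-initial capitals also add noise for rare lemmas.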
Would you say the lemma of France is "france"? Or the lemma of "NASA" is "nasa"? That could produce very strange cases where an acronym, say "OR" ('operating room', usually spelled in caps), looks like a common word - "or". Or how about in German, where capitalization distinguishes the NOUN lemma "Ansehen" (reputation, esteem) from the VERB lemma "ansehen" (look at something)? Also note that in many inflectional languages, the dictionary form of a word is considered to be its nominative singular. If that form is conventionally capitalized, we would be ignoring the lemmatization conventions of that language by using a lowercase form which may never be used. It's easy to lower case the lemma field if you need to do that for some application, but reliably recovering that information if it is not included in the corpus is not possible. |
Yes (does it really look so bad?)
Acronyms might be a different issue, since they start with a different nature and we might discuss if they really have "lemmas" (and if they're not treated as multiword tokens). I don't know if they are an appropriate example.
Well, the distinction here is given by the POS. One cannot really think of lemmas and POSs as isolated. Anyway, there are innumerable other ambiguities that exist even without regard to capitalisation if one just looks at the lemma strings and ignores the other annotation layers. Besides, in this particular case, one might argue that we are indeed looking at the very same word: one is capitalised in nominal contexts, but in both cases it is the same verb noun form (infinitive) of the same word, so capitalisation in the lemma is just continuing a rather arbitrary German orthographic convention (i.e. "all nouns have to be capitalised", which actually often means "all words in a nominal context"). But again, the POS can distinguish if one wants to treat them as different words.
I don't understand exactly the issue at stake here. |
Yes :-) |
It really is just habit! |
Absolutely - that's why we have
Yes, but it is a habit of the vast majority of the English-speaking community, so I don't really see a reason to deviate from it in lemmatization.
I agree that dictionary entries are not always the same as UD lemmas, but if we can easily maintain parity between the two for capitalization, why not do it?
No, I don't think so - I am not a German native speaker, but I very much doubt that most speakers see the noun Ansehen (esteem) as the same word as ansehen (look at). They are of course etymologically related, but this case is not the result of spontaneous conversion with transparent meaning. Evidence from separate entries in Duden, the standard German dictionary: https://www.duden.de/rechtschreibung/Ansehen |
A potential rewrite of the Typo definition for discussion.
The new definition broadens beyond errors to include typographically unexpected spellings. For example:
goeswith
so Typo=Yes
is required) (Goeswith edits UD_English-EWT#314)
This may shift the boundary with Style=Expr when it comes to odd but intentional spellings like CA$H.
Thoughts?