Status of morphosyntactic features and values #985
Comments
This is a very interesting idea that would start moving UD from a theory of what the features and values of a language are, toward why they occur/how they are signaled (the form of a lexeme, agreement, context, etc.). What gives me some pause is the interaction between the categories of |
I agree it's an interesting view on the inventory of each language, though I would point out it depends on a variety of annotation choices. For example, the Gender of "she" is only lexical if we lemmatize "she" separately. We could have decided that all pronouns have a single lemma, and "she" is the inflected feminine form of that lemma - in English that might seem bizarre, but in other languages the boundary between inflection and derivation (which I take to fall under 'lexical' in this typology) can be murky. In English, for example, we could argue about whether the morphological comparative is lexical or inflectional, given that not all English adjectives can add -er. In any case, I think this is more a topic for an analytic paper than an annotation concern, since the classification into |
Thanks for your feedback.
@amir-zeldes You are right that there are cases where it is unclear whether a feature is lexical or inflectional. But there are always cases in corpus annotation where we must make a choice and the choice is not straightforward. It is clear that the case of pronouns is one of the most problematic. In some sense we already decide whether we consider the features to be lexical or inflectional when we choose the lemma form. For instance, how to interpret the fact that her as |
These are all PRON in UD, not DET. You're right that the lemma and case of her depend on whether it is possessive or not. Here is the full paradigm: https://universaldependencies.org/en/pos/PRON.html A current difference between GUM and EWT is that GUM applies |
I continue with the example of pronouns in English. My purpose is not to discuss the choices made in these treebanks, but just to take this as an example about the status of features. (By the way, we had exactly the same problems for the annotation of pronouns in French and I don't think that our annotation is optimal.) We have:
By definition, the lemma is the conventional name chosen for a lexeme. If I analyze this annotation, it means that we and us are the two forms of the lexeme we and |
I would like to emphasize that the main problem concerning the status of features is the use of some features as denominations (such as |
We had a long discussion when creating this table. I agree it may not be perfect from a theoretical perspective. One practical factor was preexisting lemmatization standards—as I recall, the established practice was to lemmatize possessives separately from non-possessives. Another factor was that there wasn't an obvious feature available to distinguish independent vs. dependent possessives (some sources call them both "genitives", but we decided to call them both |
Lemmatization is yet another, somewhat separate topic, which also has to do with lexicographic standards in a language, etymology, and more. In most Indo-European languages, the de-adjectival possessives (Lat. meus "my"), are distinguished from the true pronominal genitives (Lat. mei "of me", genitive < lemma ego). But @nschneid is right in saying that this is perhaps more of a standardization question, and indeed, many, much larger corpora than the UD ones are lemmatized, and breaking with their tradition would be a high price in terms of interoperability of linked open data resources. Personally I'm happy to keep things stable and would sooner change "my" to not be Case=Gen than change the lemma (and historically, it is in fact not Case=Gen, though "its/his/her" is) |
Features must not be used denominatively because this is not what they are meant for. There can be other spaces for that, for example With regard to the distinction between lexical and inflectional features, I have also mused about marking this distinction, but in the end I think that probably this "tag polysemy" can be maintained: the important thing is that the feature corresponds to some morphological property, e.g. in English If a feature can be determined only purely contextually, then I advocate for not annotating it: it simply is not there morphologically. Then the case of English -ing forms appears to me as one of random coincidence: the phonological material is the same, but we can actually distinguish the nominal (annotating is nice) from the adjectival (the annotating person) forms. This issue admittedly becomes a little tricky in that it approaches contextual annotation. Another case is Latin cum: the |
This is a follow-up of previous discussions we had to decide whether a feature must be instantiated or not (the last one concerning the Voice feature in English, see #290).

First, some features are associated with inflectional morphemes, while others are lexical features. Examples of lexical features are Gender, Number and Person on pronouns in English, while inflectional features are Number and Person on the verb agreeing with its subject:

Another example is the Gender agreement of the adjective and articles with the noun in French. Gender is a lexical feature of NOUNs (Gender[lex]), while Gender is an inflectional feature on ADJs or DETs (Gender[infl]):

Note that Definiteness is a lexical feature, while Number is an inflectional feature on NOUNs, ADJs and articles (it is a lexical feature on most other DETs). PronType is an inherently lexical feature.
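The [lex]/[infl] distinction described above can be sketched as a small lookup table. This is only an illustration, not part of any UD tool: the table name `FRENCH_STATUS`, the function `tag_feats`, and the choice of keying by (UPOS, feature) are assumptions made here, and the table simplifies the text (for instance, it does not separate articles from other DETs for Number). Definiteness is written with its UD feature name, Definite.

```python
# Hypothetical status table for French, following the examples in the text.
# Keys are (UPOS, feature name); values are the proposed status tags.
FRENCH_STATUS = {
    ("NOUN", "Gender"): "lex",
    ("ADJ", "Gender"): "infl",
    ("DET", "Gender"): "infl",
    ("NOUN", "Number"): "infl",
    ("ADJ", "Number"): "infl",
    ("DET", "Definite"): "lex",
    ("DET", "PronType"): "lex",
}


def tag_feats(upos, feats, status=FRENCH_STATUS):
    """Rewrite a CoNLL-U FEATS string so each known feature carries its status."""
    if feats == "_":  # "_" means the token has no features
        return feats
    tagged = []
    for pair in feats.split("|"):
        name, value = pair.split("=", 1)
        tag = status.get((upos, name))
        # Features without a declared status are left untouched.
        tagged.append(f"{name}[{tag}]={value}" if tag else pair)
    return "|".join(tagged)


print(tag_feats("NOUN", "Gender=Fem|Number=Sing"))
# Gender[lex]=Fem|Number[infl]=Sing
print(tag_feats("ADJ", "Gender=Fem|Number=Sing"))
# Gender[infl]=Fem|Number[infl]=Sing
```

Keeping the status in a side table rather than in the treebank itself matches the later remark that annotating every occurrence would be too costly.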
But there is a third use of morphosyntactic features: the denominative use. For instance, English has two participles, the so-called present and past participles. The English treebanks use the features Tense=Pres and Tense=Past to distinguish the two participles. This is quite problematic because the values of these participles are more aspectual than temporal:

Second, in some cases, an inflectional feature is not instantiated on a given lexeme. For instance, French has many ADJs which do not show variation in Gender, such as rouge ‘red’, facile ‘easy’, etc. Nevertheless, the value can generally be deduced from the context. For the French treebanks, we have thus instantiated the Gender feature each time its value could be deduced from the context. This could be indicated on the value:

English treebanks contain a lot of contextual values (due to the very poor inflectional morphology of English). For instance, every -ing verbal form can be VerbForm=Part or VerbForm=Ger. This can only be deduced from the context: no English verb has distinct forms for the present participle and the gerund. “Only-contextual features” could be distinguished:

Note that VerbForm=Part is just [ctxt] because the value is marked for past participles of some verbs (those distinguishing past participle and preterite). For past participles of transitive verbs, we have an opposition between perfect forms (she has driven the car) and passive forms (the car was driven):

The bare form of the verb is also ambiguous and can only be disambiguated contextually:
Note that the value VerbForm=Inf is only contextual [only-ctxt], but not the value VerbForm=Fin, since finiteness is marked for the 3SG present form, as well as for the past form of some verbs.

It means that we can distinguish features for which some values can be marked (VerbForm, Tense, etc.) and features for which all values are contextual (Voice).

Of course it would be too costly to add the status of features and values to each occurrence, but it would be useful for people exploiting a treebank to know the status of features and values. We could ask in the guidelines associated with the validator whether a feature is inflectional [infl], lexical [lex] or denominative [denom]. Maybe also whether the feature has values which are only contextual [only-ctxt].

Such information would be very useful for linguists exploiting the treebanks. If we currently study noun-adjective Gender agreement in French, it would be difficult with only the treebank to know when this agreement is actually marked. Same thing with the verb-subject Person agreement in English. And if we study Tense in English (without any knowledge of the language), we would get strange results due to the Tense feature on participles.
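As a rough illustration of how a treebank user might exploit such a per-language declaration, the sketch below filters out values declared [only-ctxt] before counting or querying features. Everything here is hypothetical: the names `ENGLISH_ONLY_CTXT`, `parse_feats` and `marked_feats` are invented for this example, and the table only encodes the two cases stated explicitly above (VerbForm=Inf, and all values of Voice).

```python
ALL = "*"  # sentinel: every value of the feature is only contextual

# Hypothetical declaration for English, based on the examples in this issue.
ENGLISH_ONLY_CTXT = {
    ("VERB", "VerbForm"): {"Inf"},  # VerbForm=Fin and =Part can be marked
    ("VERB", "Voice"): {ALL},       # all Voice values are contextual
}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict ('_' means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def marked_feats(upos, feats, only_ctxt=ENGLISH_ONLY_CTXT):
    """Keep only the features whose value is not declared only-contextual."""
    kept = {}
    for name, value in parse_feats(feats).items():
        ctxt_values = only_ctxt.get((upos, name), set())
        if ALL in ctxt_values or value in ctxt_values:
            continue  # value is never signaled by the form itself
        kept[name] = value
    return kept


print(marked_feats("VERB", "VerbForm=Inf"))
# {}
print(marked_feats("VERB", "Tense=Past|VerbForm=Part|Voice=Pass"))
# {'Tense': 'Past', 'VerbForm': 'Part'}
```

A fuller version would also need the [denom] status, for example to exclude Tense on participles from a study of English tense; that depends on VerbForm as well as UPOS, so a flat (UPOS, feature) key would not be enough.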