New validator rule: leaf-det-clf #1059
The errors in Hebrew are due to things like:
# x- so the RTL text doesn't make this unreadable
32 x-ה x-ה DET art PronType=Art 33 det _ Gloss=the|Ref=GEN_19.8
33 x-אֲנָשִׁ֤ים x-אישׁ NOUN subs Gender=Masc|Number=Plur 38 obl _ Gloss=man|Ref=GEN_19.8
34-35 x-הָאֵל֙ x-_ _ _ _ _ _ _ _
34 x-הָ x-ה DET art PronType=Art 35 det _ Gloss=the|Ref=GEN_19.8
35 x-אֵל֙ x-אל PRON prde Number=Plur|PronType=Dem 33 det _ Gloss=these|Ref=GEN_19.8
where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.) |
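For readers who want to see why this fragment trips the rule, here is a minimal sketch of the kind of check involved. It is not the validator's actual code: the function name, the simplified (ID, HEAD, DEPREL) rows, and the assumption that the rule simply flags any det or clf node that has a child are all illustrative.

```python
# Simplified rows (ID, HEAD, DEPREL) copied from the Hebrew fragment above.
# The multiword-token range line (34-35) carries no dependency and is omitted.
ROWS = [
    (32, 33, "det"),
    (33, 38, "obl"),
    (34, 35, "det"),
    (35, 33, "det"),
]

# Assumption: relations that the rule expects to be leaves.
LEAF_DEPRELS = {"det", "clf"}


def find_non_leaf(rows, leaf_deprels=LEAF_DEPRELS):
    """Return (parent_id, parent_deprel, child_id, child_deprel) per violation."""
    deprel_of = {tid: deprel for tid, _head, deprel in rows}
    violations = []
    for tid, head, deprel in rows:
        parent_rel = deprel_of.get(head)  # None if the head is outside the fragment
        if parent_rel is None:
            continue
        # Compare on the universal relation, ignoring any subtype after ":".
        if parent_rel.split(":")[0] in leaf_deprels:
            violations.append((head, parent_rel, tid, deprel))
    return violations


print(find_non_leaf(ROWS))
```

On this fragment the only hit is token 35 (the demonstrative, attached as det) with token 34 (its article, also det) as a child, which is exactly the configuration described above.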
@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew) |
If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word. |
I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things. |
I have one remaining error:
The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception? |
Repetition for emphasis: would The validator currently allows |
This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an How should we handle this better? |
No, for the German case it's not "among other teachers": notice that "other" is dative but "teacher" is not, so it's "among others, teachers". I think the mistake is the deprel det. This is not a determiner but an oblique modifier, just like English "among others". |
What about Here two examples in one sentence from Roman tragedies in UD Latin-CIRCSE: |
For spoken data, we need three relations to be added to the validator:
|
In Latvian, we have several expressions considered as compound pronouns in Latvian traditional grammar which consist of one particle and one pronoun. For example, kaut kāds where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs. We would like to annotate these expressions as Would you please consider allowing |
@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule? |
I think this new rule is fine, although while correcting, my colleagues and I have encountered a couple of cases that really do not look reducible to a trivial correction the way all the others are.
To summarise the above discussion, my two proposals are to deactivate this validation rule if:
|
We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:
Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the |
This new rule invalidates an analysis in my Low Saxon dataset that I just presented last spring in my LREC-COLING paper and discussed with other UD people at the conference, even with @dan-zeman himself, if I remember correctly. It is explained in Section 5.1 here: https://aclanthology.org/2024.lrec-main.1388.pdf The gloss and translation of the sentence can be found in Section 4.3. Attaching the possessor in dative case to the possessee instead of the determiner does not represent the way this construction works because 1) the dative possessor cannot be attached to the possessee without the determiner and 2) the possessee can be dropped while the determiner cannot. E.g., in the example in my paper, "In der Gemoene iarem." (literally "in the parish hers") is a valid answer to a specification question in whose service the person stands. (A note to German speakers: Masculine and neuter nouns show that this is indeed a dative, not a genitive.) |
@ftyers @jonorthwash Is there a way to get around the problem of a pronoun det with an appos in parentheses? This is something that might show up in a text: «his (John's) text is strange.» I would have: det(text, his) appos(his, John's)
In Latvian we also struggle with constructions similar to "such a high price that nobody could afford it" from the original post. |
Yes, @nschneid, I think the problem encountered in UD_Erzya-JR should be made explicit here.
`some like him (Stepan Ivanich) had gotten older...'
obl(syrelgadstʹ, ladso)
This could also be dealt with as a postposition, where the noun ‹lad› `way' in the Inessive case would contribute to the same ‹obl› dependency:
obl(syrelgadstʹ, sonze)
Departing from a ‹det› dependency, however, we could approach English (but this is not what EWT does).
His friends come from all over.
In linguistics, such a sentence might be quoted with an inserted identifier for contextual clarity, e.g.,
His (Fred's) friends come from all over.
Authors themselves [their very selves] might do the same thing with commas:
Since the validator does not allow words with a ‹det› dependency to take children, one might opt to follow a Swedish lead and change all instances of genitive-case personal pronoun ‹det› to ‹nmod:poss/nmod:det›, but wouldn't that go against the established norm? Here is an example of Swedish
In Swedish, the first and second person pronouns are associated with distinct determiners that are called pronouns in UD: vår, min, er, din. These words inflect according to their possessa, and therefore they might be seen as analogous to the Czech possessive determiners. `possessive determiners (which modify a nominal) (note that some languages use PRON for similar words): [cs] můj, tvůj, jeho, její, náš, váš, jejich' https://universaldependencies.org/ru/dep/nmod.html https://universaldependencies.org/en/dep/nmod.html So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns and genitive personal pronouns. There is disparity within the Russian corpora, alongside a consistent Czech. |
211 treebanks are invalidated by this new rule, and we need guidance on what to do before the freeze!!! Please provide brief and clear instructions, as aligning the treebanks with this rule requires a lot of work. |
I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.
Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:
det + nmod, e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun in a case like sufficient flour. In such a high price that nobody could afford it, I suppose "such" should have an advcl dependent?
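To make the "at least some reports" case concrete, the sketch below shows how a per-relation exception list would change the outcome. Everything in it is hypothetical: the function, the row layout (ID, FORM, HEAD, DEPREL), the root attachment of "reports", and above all the exception set, which is not the validator's actual list.

```python
# Only the relations stated above are encoded: det(reports, some) and
# nmod(some, least). Attaching "reports" to the root is an assumption.
ROWS = [
    (2, "least", 3, "nmod"),
    (3, "some", 4, "det"),
    (4, "reports", 0, "root"),
]


def leaf_violations(rows, allowed_children=None):
    """List (child_form, child_deprel, parent_deprel) for children of det/clf.

    allowed_children maps a parent relation ("det" or "clf") to the set of
    child relations tolerated under it. This is a hypothetical exception list.
    """
    allowed_children = allowed_children or {}
    rel_of = {tid: rel for tid, _form, _head, rel in rows}
    found = []
    for _tid, form, head, rel in rows:
        parent_rel = rel_of.get(head)
        if parent_rel is None:
            continue  # head is the root or lies outside the fragment
        base = parent_rel.split(":")[0]
        if base not in ("det", "clf"):
            continue
        if rel.split(":")[0] not in allowed_children.get(base, set()):
            found.append((form, rel, parent_rel))
    return found


# Strict reading: any child of det is reported.
print(leaf_violations(ROWS))
# Illustrative exception: tolerate nmod under det.
print(leaf_violations(ROWS, {"det": {"nmod"}}))
```

Under the strict reading "least" is reported as an nmod child of a det; with the illustrative exception the fragment passes. Re-annotating "at least" with ExtPos=ADV and advmod would only pass if advmod were in whatever set the real rule permits.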