New validator rule: leaf-det-clf #1059

Open
nschneid opened this issue Oct 8, 2024 · 19 comments

Comments

@nschneid
Contributor

nschneid commented Oct 8, 2024

I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.

Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:

  • det + nmod e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
  • "such"/det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun in a case like sufficient flour. In such a high price that nobody could afford it, I suppose "such" should have an advcl dependent?
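
For readers following along, the kind of check under discussion can be sketched roughly as below. This is only an illustration, not the validator's actual code: the function name `det_children` and the `(id, head, deprel)` triple format are made up here, and the allow-list is left as a parameter because the set of permitted child relations is exactly what this issue is about. The `case(least, at)` arc and the root attachment in the sample tree are assumptions added to complete the example.

```python
from typing import Iterable, List, Set, Tuple

Node = Tuple[int, int, str]  # (id, head, deprel); hypothetical minimal format


def det_children(nodes: Iterable[Node], allowed: Set[str]) -> List[Tuple[int, int, str]]:
    """Return (det_id, child_id, child_deprel) for each child of a det/clf node
    whose universal relation (subtype stripped) is not in `allowed`."""
    nodes = list(nodes)
    # Nodes that the rule expects to be leaves: anything attached as det or clf.
    leaf_like = {i for i, _, rel in nodes if rel.split(":")[0] in {"det", "clf"}}
    return [
        (head, i, rel)
        for i, head, rel in nodes
        if head in leaf_like and rel.split(":")[0] not in allowed
    ]


# "at least some reports": det(reports, some), nmod(some, least)
tree = [
    (1, 2, "case"),  # at    (assumed attachment)
    (2, 3, "nmod"),  # least
    (3, 4, "det"),   # some
    (4, 0, "root"),  # reports (treated as root for this fragment)
]
violations = det_children(tree, allowed={"fixed"})
print(violations)
```

With only `fixed` in the allow-list, the `nmod` child of "some" is reported; adding `nmod` to `allowed` would make the fragment pass.
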
@mr-martian
Contributor

The errors in Hebrew are due to things like

# x- so the RTL text doesn't make this unreadable
32	x-ה	x-ה	DET	art	PronType=Art	33	det	_	Gloss=the|Ref=GEN_19.8
33	x-אֲנָשִׁ֤ים	x-אישׁ	NOUN	subs	Gender=Masc|Number=Plur	38	obl	_	Gloss=man|Ref=GEN_19.8
34-35	x-הָאֵל֙	x-_	_	_	_	_	_	_	_
34	x-הָ	x-ה	DET	art	PronType=Art	35	det	_	Gloss=the|Ref=GEN_19.8
35	x-אֵל֙	x-אל	PRON	prde	Number=Plur|PronType=Dem	33	det	_	Gloss=these|Ref=GEN_19.8

where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.)
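
A rough way to locate such cases in a CoNLL-U file is sketched below. It is an assumption-laden illustration rather than the validator's logic: `read_basic_nodes` is a made-up helper, the Hebrew forms are replaced by their glosses to keep the snippet readable, and only the basic HEAD/DEPREL columns are inspected.

```python
def read_basic_nodes(conllu: str):
    """Yield (id, form, head, deprel) for regular word lines only.

    Comment lines, multiword-token ranges (e.g. 34-35) and empty nodes
    (e.g. 8.1) are skipped.
    """
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        yield int(cols[0]), cols[1], int(cols[6]), cols[7]


# Same structure as the fragment above, with glosses in place of the forms.
rows = [
    ["32", "the", "_", "DET", "art", "PronType=Art", "33", "det", "_", "_"],
    ["33", "man", "_", "NOUN", "subs", "Gender=Masc|Number=Plur", "38", "obl", "_", "_"],
    ["34-35", "the+these", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["34", "the", "_", "DET", "art", "PronType=Art", "35", "det", "_", "_"],
    ["35", "these", "_", "PRON", "prde", "Number=Plur|PronType=Dem", "33", "det", "_", "_"],
]
sample = "# fragment only\n" + "\n".join("\t".join(r) for r in rows)

nodes = list(read_basic_nodes(sample))
det_ids = {i for i, _, _, rel in nodes if rel == "det"}
# Every node whose head is itself attached as det.
hits = [(head, i, rel) for i, _, head, rel in nodes if head in det_ids]
print(hits)
```

Here the only hit is the article (34) attached as `det` to the demonstrative (35), which is itself a `det` of the noun.
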

@amir-zeldes
Contributor

@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew)

@mr-martian
Contributor

If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word.

@amir-zeldes
Contributor

I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things.

@colinbatchelor
Contributor

I have one remaining error:
[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: [L3 Syntax leaf-det-clf] 'det' not expected to have children (79:a:det --> 81:h-uile:compound)

The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception?

@nschneid
Contributor Author

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

@LeonieWeissweiler
Contributor

LeonieWeissweiler commented Oct 10, 2024

This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an ADP and the second is a DET that depends on it with the case relation.

How should we handle this better?

@nschneid
Contributor Author

unter anderem is sometimes treated as a fixed expression. Here is a case triggering the error:

[image: tree triggering the error]

I assume this means "among other teachers"—is there a reason not to analyze it as "among [other teachers]", with unter attaching to Lehrer?

@amir-zeldes
Contributor

No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others".

@FedeIure

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

What about flat:redup to mark repetition for emphasis?

Here are two examples in one sentence from Roman tragedies in UD Latin-CIRCSE:

[image: flat_redup_Latin_CIRCSE]

@sylvainkahane
Contributor

For spoken data, we need three relations to be added to the validator:

  • discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
  • parataxis for cases such as "a, I don't know how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthetical to the first "a". More exactly, we use parataxis:parenth in our spoken French treebanks.
  • dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.
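
To make the request concrete, here is a minimal sketch of what extending the allow-list could look like. The names `BASE_ALLOWED`, `SPOKEN_EXTRA`, `universal` and `child_ok` are hypothetical, and the base set contains only `fixed` because that is the one relation this thread reports as currently permitted; the real validator's list may differ.

```python
def universal(deprel: str) -> str:
    """Strip a subtype, e.g. 'parataxis:parenth' -> 'parataxis'."""
    return deprel.split(":", 1)[0]


# Assumption: stand-in for whatever the validator currently permits under det.
BASE_ALLOWED = {"fixed"}
# The three relations requested above for spoken data.
SPOKEN_EXTRA = {"discourse", "parataxis", "dep"}


def child_ok(child_deprel: str, allowed: set) -> bool:
    """True if a child with this relation may hang under a det node."""
    return universal(child_deprel) in allowed


allowed = BASE_ALLOWED | SPOKEN_EXTRA
checks = {
    rel: child_ok(rel, allowed)
    for rel in ["discourse", "parataxis:parenth", "dep", "compound"]
}
print(checks)
```

Comparing on the universal part of the relation means that language-specific subtypes such as parataxis:parenth are covered without listing each one.
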

@lrituma

lrituma commented Oct 15, 2024

In Latvian, we have several expressions, considered compound pronouns in traditional Latvian grammar, which consist of one particle and one pronoun. For example, kaut kāds, where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as discourse, dependent on the pronoun, and the pronoun occasionally becomes det if the expression describes a noun. This leads to a validation error.

The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs.

We would like to annotate these expressions as compound (instead of fixed) because the pronoun is the second element in the phrase and we feel that it is the head of the phrase: the pronoun inflects together with the noun and bears most of the semantic meaning of the expression.

Would you please consider allowing compound in this construction or is there any other option appropriate here?

@nschneid
Contributor Author

@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule?

@Stormur
Contributor

Stormur commented Oct 17, 2024

I think that this new rule is fine, even though, while correcting, my colleagues and I have encountered a couple of cases which really do not look reducible to a trivial correction like all the others.

  1. The already mentioned reduplication, which is treated through flat:redup in Latin treebanks. One example is quot quot from quot: while the latter means 'as many as', the reduplication has a distributive sense as in 'for each possible one...' (this expression is sometimes even univerbated). I do not think that annotating them separately, each depending on the head, is the right way to deal with them: here we do not have two or more different terms, but really the same one "cloning" itself. On the other hand, flat is really the closest relation we have to fixed, which would cause no problem, but is not a correct choice (well, in my opinion it is never the correct choice).
    • Problem: horizontal relation
  2. The phrase nostra qui remansissemus caede 'the murder of us who are left (behind)', but more literally 'our who are left murder', since nostra is the inflected possessive determiner for the 1st person plural. What happens here is that the possessive adds a nominal person, as it were, and this person is another referent beyond the noun caede 'murder' in this phrase; as such, the relative can target it (or at least, Cicero takes the liberty of doing so). We could not really justify an analysis where we shift the relative under the head noun, since the murder is not one of its arguments.
    • Problem: the relative clause dependent of the determiner cannot be traced back to the referent of its head

To summarise the above discussion, my two proposals are to deactivate this validation rule if:

  1. the child of det is a flat relation
  2. the head element has the feature Person, at least for acl:relcl
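
As a rough sketch of how these two exemptions could be expressed (the `Word` class and the `exempt` function are invented for illustration, not taken from the validator, and the feature values in the examples are simplified):

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    id: int
    head: int
    deprel: str
    feats: dict = field(default_factory=dict)


def exempt(det: Word, child: Word) -> bool:
    """True if the leaf-det-clf check should be skipped for this det/child pair."""
    # Proposal 1: the child is attached by flat (including subtypes such as flat:redup).
    if child.deprel.split(":")[0] == "flat":
        return True
    # Proposal 2: the det node carries Person and the child is a relative clause.
    if "Person" in det.feats and child.deprel == "acl:relcl":
        return True
    return False


# quot quot: second token attached to the first with flat:redup
quot1 = Word(1, 3, "det")
quot2 = Word(2, 1, "flat:redup")

# nostra qui remansissemus caede: relative clause attached to the possessive
nostra = Word(1, 4, "det", {"Person": "1"})  # illustrative feature set
remansissemus = Word(3, 1, "acl:relcl")

# A determiner without Person should still be flagged for a relative clause.
plain_det = Word(1, 2, "det")

results = (
    exempt(quot1, quot2),
    exempt(nostra, remansissemus),
    exempt(plain_det, remansissemus),
)
print(results)
```

Keying the second exemption on the Person feature keeps it narrow: ordinary articles and demonstratives would remain subject to the rule.
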

@amir-zeldes
Contributor

We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:

  1. one one = "one by one"
  2. two two = "two by two, in pairs"
  3. color color = "color for color, every color"

Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the nmod:unmarked relation, which is a subtype of nmod used without a case marker.

@jasiewert
Contributor

This new rule invalidates an analysis in my Low Saxon dataset that I just presented last spring in my LREC-COLING paper and discussed with other UD people at the conference, even with @dan-zeman himself, if I remember correctly. It is explained in Section 5.1 here: https://aclanthology.org/2024.lrec-main.1388.pdf The gloss and translation of the sentence can be found in Section 4.3.

Attaching the possessor in dative case to the possessee instead of the determiner does not represent the way this construction works because 1) the dative possessor cannot be attached to the possessee without the determiner and 2) the possessee can be dropped while the determiner cannot. E.g., in the example in my paper, "In der Gemoene iarem." (literally "in the parish hers") is a valid answer to a specification question in whose service the person stands. (A note to German speakers: Masculine and neuter nouns show that this is indeed a dative, not a genitive.)
The alternative of changing the determiners' tags to PRON in Low Saxon would go against UD's own definition of determiners. I would therefore join @nschneid in asking you to relax the error to a warning, or else ask for language-specific exceptions to the rule.

nschneid referenced this issue in UniversalDependencies/UD_Erzya-JR Oct 21, 2024
@ftyers @jonorthwash Is there a way to get around Pronoun det with appos in (). This is something that might show up in a text «his (John's) text is strange.» I would have: det(text, his) appos(his, John's)
@lauma
Contributor

lauma commented Oct 21, 2024

Also, in Latvian we struggle with constructions similar to "such a high price that nobody could afford it" from the original post as well.

@rueter
Contributor

rueter commented Oct 21, 2024

Yes, @nschneid, I think the problem encountered in UD_Erzya-JR should be made explicit, here.
In Erzya (myv), Moksha (mdf) and Skolt Saami (sms), genitive forms of personal pronouns are regularly connected to their possessa with a ‹det› dependency.

sent_id = EKS:2011:39:15:ČesnokovF
Конат-конат сонзэ (Степан Иваныч) ладсо сырелгадсть...
Konat-konat    sonze    (Stepan Ivanych)  ladso    syrelgadstʹ...
such-such.Pl  his/her  (St. I.)                    in.way   become.older.3Pl

`some like him (Stepan Ivanich) had gotten older...'

obl(syrelgadstʹ, ladso)
det(ladso, sonze)
appos(sonze, Stepan)

This could also be dealt with as a postposition, where the noun ‹lad› `way' in the Inessive case would contribute to the same ‹obl› dependency

obl(syrelgadstʹ, sonze)
case(sonze, ladso)
appos(sonze, Stepan)

Taking a ‹det› dependency as the starting point, however, we could also look at English (though this is not what EWT does).

His friends come from all over.
det(friends, his)

In linguistics, such a sentence might be quoted with an inserted identifier for contextual clarity, e.g.,

His (Fred's) friends come from all over.
det(friends, his)
appos(His, Fred's)

Authors themselves [their very selves], might do the same thing with commas:
His, Fred's, friends come from all over.
det(friends, his)
appos(His, Fred's)

Since the validator does not allow words with a ‹det› dependency to take children, one might opt to follow a Swedish lead and change all instances of genitive-case personal pronoun ‹det› to ‹nmod:poss/nmod:det›, but wouldn't that go against the established norm?

Here is an example of Swedish hennes ‹her› given with ‹nmod:poss› dependency
The genitive form of a third person singular personal pronoun 'her'

# sent_id = sv-ud-dev-78
# text = Börjar hennes jobb att delas av den moderne mannen?
1	Börjar	börja	VERB	VB|PRS|AKT	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
2	hennes	hon	PRON	PS|UTR/NEU|SIN/PLU|DEF	Definite=Def|Poss=Yes|PronType=Prs	3	nmod:poss	3:nmod:poss	_
3	jobb	jobb	NOUN	NN|NEU|SIN|IND|NOM	Case=Nom|Definite=Ind|Gender=Neut|Number=Sing	1	nsubj	1:nsubj|5:nsubj	_
4	att	att	PART	IE	_	5	mark	5:mark	_
5	delas	dela	VERB	VB|INF|SFO	VerbForm=Inf|Voice=Pass	1	xcomp	1:xcomp	_
6	av	av	ADP	PP	_	9	case	9:case	_
7	den	den	DET	DT|UTR|SIN|DEF	Definite=Def|Gender=Com|Number=Sing|PronType=Art	9	det	9:det	_
8	moderne	modern	ADJ	JJ|POS|MAS|SIN|DEF|NOM	Case=Nom|Definite=Def|Degree=Pos|Gender=Com|Number=Sing	9	amod	9:amod	_
9	mannen	man	NOUN	NN|UTR|SIN|DEF|NOM	Case=Nom|Definite=Def|Gender=Com|Number=Sing	5	obl:agent	5:obl:agent	SpaceAfter=No
10	?	?	PUNCT	MAD	_	1	punct	1:punct	_

In Swedish, the first and second person pronouns are associated with distinct determiners that are called pronouns in UD: vår, min, er, din. These words inflect according to their possessa, and therefore they might be seen as analogous to the Czech possessive determiners.

`possessive determiners (which modify a nominal) (note that some languages use PRON for similar words): [cs] můj, tvůj, jeho, její, náš, váš, jejich'
See also
https://universaldependencies.org/cs/dep/nmod.html
The Czech is consistent.

https://universaldependencies.org/ru/dep/nmod.html
I note that Russian also has ‹его карта› amod(карта, его),
translated as English ‹his card› amod(card, his).
Syntag appears to contradict this in ‹его мнению› `his opinion' det(мнению, его), but also `в его (и не только его, но и нашем) случае' ‹в его случае› `in his case' nmod(случае, его)

https://universaldependencies.org/en/dep/nmod.html
I note that the English provides ‹my office› nmod:poss(office, my)
which is the same coding as in EWT.

So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns, and genitive personal pronouns.

There is disparity within the Russian corpora, alongside a consistent Czech.

@johnnymoretti

211 treebanks are invalidated by this new rule, and we need guidance on what to do before the freeze!!! Please provide brief and clear instructions, as aligning the treebanks with this rule requires a lot of work.
