English nominal subtypes: merge :npmod and :tmod as :unmarked #1028

Open
nschneid opened this issue Apr 27, 2024 · 20 comments

@nschneid
Contributor

nschneid commented Apr 27, 2024

Because prepositions are so important in English, we have a well-established practice of distinguishing ordinary prepositional nmod and obl from other kinds via subtyping (nmod:poss, etc.).

In particular, nmod:tmod/obl:tmod have been used for non-prepositional temporal adjunct nominals like

  • It will happen Friday. (obl:tmod)
  • The party Friday was widely attended. (nmod:tmod)

in contrast to

  • It will happen on Friday. (obl)
  • The party on Friday was widely attended. (nmod)

tmod is part of the legacy of Stanford Dependencies. In light of current UD theory, it is an anomaly where the subtype reflects a semantic but not syntactic distinction (#893). Moreover, it is potentially confusing that only some temporal obliques (the prepositionless ones) receive the subtype.

Meanwhile, nmod:npmod/obl:npmod are used for OTHER non-prepositional adjunct nominals (in special constructions like "5 dollars a share" and "Shares eased a fraction"). The term "npmod" (derived from the npadvmod relation in Stanford Dependencies) has been a source of confusion and invokes a concept of NP that is not part of UD theory.

A discussion amongst the core group concluded that a subtype named :unmarked would be a less confusing way to implement the adpositional vs. non-adpositional distinction, for languages that choose to do so.

@amir-zeldes and I plan to implement this for our English corpora, by simply renaming both :tmod and :npmod to :unmarked. Perhaps English-Atis (@aslikuzgun), English-ESLSpok (@kristopherkyle), English-{LinES, Pronouns, PUD} (@AngledLuffa), English-ParTUT (@msang) would like to do so as well for consistency.

@nschneid
Contributor Author

As this is a trivial change to implement, but one that multiple treebanks may want to make in concert, is it better to update EWT/GUM before May 1 or wait until the next release?

@AngledLuffa

I'm not the right @ for LinES, but I can do it in the CoreNLP converter, PUD, and Pronouns

@LarsAhrenberg I can do it if you want me to do it to LinES

@AngledLuffa

Is this just literally a string replace over everything?

The only : relations marked in Pronouns are aux:pass and det:predet. Another job well done

@AngledLuffa

PUD has plenty. Please confirm if there's any intelligence required to do this, or just ESC-shift-5

@nschneid
Contributor Author

Simple replacement. Since EWT lacks any entity annotation whatsoever, for the :tmod ones I think I'll add TemporalNPAdjunct=Yes in MISC to retain the semantic information for posterity. Eventually we should annotate all temporal entities.
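
A minimal sketch of the replacement described above, assuming standard 10-column CoNLL-U input. The function name `convert_line` is hypothetical, and also rewriting the enhanced DEPS column is an assumption rather than something stated in this thread; this is not the script actually used for EWT.

```python
import re

# Old subtypes to merge, as proposed in this thread.
OLD_DEPREL = re.compile(r"^(nmod|obl):(tmod|npmod)$")
# The same labels inside the enhanced DEPS column, e.g. "3:obl:tmod"
# (assumption: enhanced dependencies should be renamed too).
OLD_IN_DEPS = re.compile(r"(?<=:)(nmod|obl):(?:tmod|npmod)(?=\||$)")


def convert_line(line):
    """Rename :tmod/:npmod to :unmarked on one CoNLL-U line.

    Comment lines, blank lines and anything that is not a
    10-column token row are returned unchanged.
    """
    line = line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 10:
        return line
    match = OLD_DEPREL.match(cols[7])
    if match:
        cols[7] = match.group(1) + ":unmarked"
        if match.group(2) == "tmod":
            # Keep the temporal information in MISC, as suggested above.
            feature = "TemporalNPAdjunct=Yes"
            items = [] if cols[9] == "_" else cols[9].split("|")
            if feature not in items:
                items.append(feature)
            cols[9] = "|".join(items)
    cols[8] = OLD_IN_DEPS.sub(r"\1:unmarked", cols[8])
    return "\t".join(cols)
```

Applied line by line to a `.conllu` file, this leaves every other relation untouched, so the change stays a pure renaming.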

@amir-zeldes
Contributor

> is it better to update EWT/GUM before May 1 or wait until the next release?

Not sure, time is a bit tight. And it's not just English, where I can update the GUM, Reddit and GENTLE repos. I know of at least UD Coptic and Hebrew IAHLTwiki, which I maintain and which use these labels, so I could change those, but I haven't coordinated with the annotators about this. Do you know if there are other datasets using these subtypes? I wouldn't want to create differences between datasets on short notice just for a renaming.

@dan-zeman
Member

  • nmod:npmod: Armenian, English, Hebrew, Western Armenian
  • obl:npmod: Ancient Hebrew, Coptic, English, Hebrew

  • acl:tmod: Vietnamese
  • advmod:tmod: Apurina, Erzya, Italian, Komi Permyak, Komi Zyrian, Latin, Moksha, Romanian, Skolt Sami
  • nmod:tmod: Chinese, English, Hebrew, Indonesian, Irish, Javanese, Moksha, Romanian, Sinhala, Telugu, Turkish, Uyghur
  • obl:tmod: Apurina, Arabic, Cantonese, Chinese, Classical Chinese, Danish, English, Erzya, Frisian Dutch, German, Hebrew, Hindi, Indonesian, Irish, Italian, Javanese, Komi Permyak, Komi Zyrian, Korean, Latin, Manx, Moksha, Old East Slavic, Old Irish, Portuguese, Romanian, Russian, Scottish Gaelic, Sinhala, Skolt Sami, Spanish, Tamil, Telugu, Thai, Turkish, Uyghur, Vietnamese, Warlpiri, Xibe
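
For maintainers who want to check their own data against the survey above, here is a rough sketch that counts relations carrying the two subtypes in CoNLL-U lines. The helper name `count_subtypes` is hypothetical, and this is not how the list above was produced.

```python
from collections import Counter

# Subtypes under discussion in this issue.
TARGET_SUBTYPES = {"npmod", "tmod"}


def count_subtypes(conllu_lines):
    """Count DEPREL values whose subtype is :npmod or :tmod."""
    counts = Counter()
    for line in conllu_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            continue  # blank line or malformed row
        deprel = cols[7]
        _base, _sep, subtype = deprel.partition(":")
        if subtype in TARGET_SUBTYPES:
            counts[deprel] += 1
    return counts
```

Passing an open file object works, since iterating over it yields lines; the result maps each full relation (for example `obl:tmod`) to its frequency.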

@nschneid
Contributor Author

OK let's not rush it then. Let's implement it in the 2.15 release.

dan-zeman modified the milestones: v2.14, v2.15 Apr 29, 2024

@mr-martian
Contributor

For Ancient Hebrew the usage of obl:npmod isn't "preposition-less non-temporal obl" but rather the construction argued about in #832, so I'd need a new label for those if there is to be an effort to eliminate :npmod in general.

@amir-zeldes
Contributor

@mr-martian I think obl:unmarked is about as informative/appropriate as obl:npmod, so you may as well switch too (not saying it's an ideal label, but the previous one also makes no sense in the context of dependencies)

@nschneid
Contributor Author

I started to draft a new issue about this, forgetting that this one existed. :D One bit of information not included above is the alternatives that were discussed, which I'll put for posterity:

  • "unmarked" is admittedly somewhat vague, but it was preferred over the other options considered: "bare" (might suggest lack of determiner), "caseless" (concern about confusion with inflectional case), "adv" or "advl" to signal the adverbial function (similar to advmod/advcl, plus UD regards adverbial PPs as nominals so the lack of the preposition doesn't distinguish adverbial nmods or obls from non-adverbial ones).

nschneid added a commit that referenced this issue Jun 22, 2024
nschneid added a commit that referenced this issue Jun 22, 2024
nschneid added a commit that referenced this issue Jun 22, 2024
nschneid referenced this issue in UniversalDependencies/UD_English-EWT Jun 22, 2024
@nschneid
Contributor Author

nschneid commented Jun 22, 2024

Implemented for EWT, and created some initial docs:

Still need to update more docs pages and mark old subtypes as deprecated.

What are implementation plans for other treebanks?

@LarsAhrenberg
Contributor

So far UD_English-LinES has used neither :npmod nor :tmod, but it seems quite straightforward to implement :unmarked so I put it up for version 2.15.

AngledLuffa added a commit to UniversalDependencies/UD_English-PUD that referenced this issue Jun 24, 2024
AngledLuffa added a commit to UniversalDependencies/UD_English-PUD that referenced this issue Jun 24, 2024
@AngledLuffa

I made a PR for PUD. I don't think it's relevant for Pronouns

AngledLuffa added a commit to UniversalDependencies/UD_English-PUD that referenced this issue Jun 24, 2024
@LarsAhrenberg
Contributor

Reviewing the outputs of my script adding :unmarked to obl and nmod tokens, I've come across a number of cases where I think the subrelation is reasonable but which are not covered in the initial docs (oblique, nmod). I would be grateful to hear the views of other people.

Multipart references to locations
at number four, Privet Drive
nmod:unmarked(four, Privet)

by way of Northfield , Minnesota
nmod:unmarked(Northfield, Minnesota)

Apposition like but without identity of reference:
blamed for letting the quality of life (a deplorable phrase) deteriorate
nmod:unmarked(quality, phrase)

Subject: The cost of enlargement
nmod:unmarked(Subject, cost)

Your amendments uphold two important principles: the right of rightholders to fair remuneration and the ...
nmod:unmarked(principles, right)

Personal pronoun + noun
I suppose you fellows remember...
nmod:unmarked(you, fellows)

Go back to Stromboli, you dumb bastard
nmod:unmarked(you, bastard)

Multi-word proper noun made adjective
a tall Puerto Rican man.
nmod:unmarked(man, Puerto), flat(Puerto, Rican)

Pre-head modifier like 'a couple'
leather red with a suppleness to it that is part gift, part effort
nmod:unmarked(gift, part), nmod:unmarked(effort, part)

Fronted or extraposed subject predicative
A kibbutznik seaman, he has just returned from a voyage.
obl:unmarked(returned, seaman)

These grew spontaneously one out of the other,
obl:unmarked(grew, one)

Sound imitations
Pop, would go one of the eight-inch guns;
obl:unmarked(go, Pop) or maybe it should be obj(go, Pop)
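
For readers who want to reproduce this kind of review, here is a minimal sketch of one way to collect candidates: plain nmod or obl tokens that have no `case` dependent. That criterion is an assumption for illustration, as is the function name `unmarked_candidates`; the script mentioned above may work differently.

```python
def unmarked_candidates(sentence_rows):
    """Return IDs of plain nmod/obl tokens with no `case` dependent.

    sentence_rows: the 10-column CoNLL-U rows of ONE sentence,
    each given as a list of strings.
    """
    # HEAD values of all tokens attached as case (including subtypes).
    heads_with_case = {
        row[6] for row in sentence_rows if row[7].split(":")[0] == "case"
    }
    return [
        row[0]
        for row in sentence_rows
        if row[7] in ("nmod", "obl") and row[0] not in heads_with_case
    ]
```

As the examples above suggest, a heuristic like this only produces candidates; appositions, pronoun + noun combinations and similar cases still need a manual decision.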

@nschneid
Contributor Author

nschneid commented Jul 5, 2024

> Sound imitations
> Pop, would go one of the eight-inch guns;
> obl:unmarked(go, Pop) or maybe it should be obj(go, Pop)

"Pop" can't be omitted so it looks like obj to me (with an inverted word order; cf. 'Never!' said John).

> Pre-head modifier like 'a couple'
> leather red with a suppleness to it that is part gift, part effort
> nmod:unmarked(gift, part), nmod:unmarked(effort, part)

Interesting...haven't thought about this one:

  • "part X": Is this like "3 parts sugar" in a recipe? Yeah nmod:unmarked probably makes sense by analogy to "a couple".
  • (You haven't commented on this but) my gut feeling is that "[part X], [part Y]" is an asyndetic coordination structure. You can paraphrase with "and". So conj(gift, effort).

> Multi-word proper noun made adjective
> a tall Puerto Rican man.
> nmod:unmarked(man, Puerto), flat(Puerto, Rican)

Because you can say "the man is Puerto Rican", I would lean toward treating the whole expression as an ADJ (ExtPos=ADJ). Thus: flat(Puerto/PROPN,ExtPos=ADJ Rican/ADJ) and amod(man, Puerto)

The rest have been discussed but not decided yet. See this paper for a synopsis and some proposals. If you want to contribute to the discussion: #455, UniversalDependencies/UD_English-EWT/issues/436, #751, #762, #933, #1024

amir-zeldes added a commit to amir-zeldes/gum that referenced this issue Jul 26, 2024
  * Replaces nmod/obl:npmod/tmod
  * Uses of tmod can be emulated using lemma list in label_trees.py (e.g. for generating PTB NP-TMP)
  * See UniversalDependencies/docs#1028
amir-zeldes added a commit to gucorpling/gentle that referenced this issue Jul 26, 2024
  * Replaces nmod/obl:npmod/tmod
  * Uses of tmod can be emulated using lemma list in label_trees.py (e.g. for generating PTB NP-TMP)
  * See UniversalDependencies/docs#1028
amir-zeldes added a commit to UniversalDependencies/UD_Hebrew-IAHLTwiki that referenced this issue Jul 26, 2024
  * Replaces nmod/obl:npmod/tmod
  * Used TemporalNPAdjunct=Yes in misc to preserve tmod info
  * See UniversalDependencies/docs#1028
amir-zeldes added a commit to UniversalDependencies/UD_Hebrew-IAHLTknesset that referenced this issue Jul 26, 2024
  * Replaces nmod/obl:npmod/tmod
  * Used TemporalNPAdjunct=Yes in misc to preserve tmod info
  * See UniversalDependencies/docs#1028
amir-zeldes added a commit to IAHLT/UD_Hebrew that referenced this issue Jul 26, 2024
  * Replaces nmod/obl:npmod/tmod
  * Used TemporalNPAdjunct=Yes in misc to preserve tmod info
  * See UniversalDependencies/docs#1028
amir-zeldes added a commit to UniversalDependencies/UD_Coptic-Scriptorium that referenced this issue Jul 26, 2024
  * Replaces :npmod subtype
  * See UniversalDependencies/docs#1028
@amir-zeldes
Contributor

OK, this change should now be done and documented for:

  • UD_English-GUM (upstream)
  • UD_English-GUMReddit (upstream)
  • UD_English-GENTLE (upstream)
  • UD_Coptic-Scriptorium (in UD dev already)
  • UD_Hebrew-IAHLTwiki (UD dev)
  • UD_Hebrew-IAHLTknesset (UD dev)
  • IAHLT/UD_Hebrew-HTB (outside UD)
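
The commit notes above mention two ways of keeping the old temporal distinction recoverable: a lemma list and the `TemporalNPAdjunct=Yes` MISC feature. A hedged sketch combining both follows; the lemma set is purely illustrative and hypothetical (it is not the list used in label_trees.py), and `is_temporal_unmarked` is an invented name.

```python
# Illustrative, hypothetical lemma list; a real one would be much longer.
TEMPORAL_LEMMAS = {
    "today", "tomorrow", "yesterday",
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
    "day", "week", "month", "year",
}


def is_temporal_unmarked(lemma, deprel, misc="_"):
    """Guess whether an :unmarked nominal would formerly have been :tmod."""
    if deprel not in ("nmod:unmarked", "obl:unmarked"):
        return False
    # An explicit MISC annotation takes priority over the lemma heuristic.
    if "TemporalNPAdjunct=Yes" in misc.split("|"):
        return True
    return lemma.lower() in TEMPORAL_LEMMAS
```

The MISC check is exact where the feature was added during conversion, while the lemma lookup is only an approximation and will miss temporal expressions whose head is not in the list.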

@nschneid
Contributor Author

Excellent!

Any updates regarding English-Atis (@aslikuzgun), English-ESLSpok (@kristopherkyle), English-ParTUT (@msang)? All of these use at least a subset of the {nmod:npmod, obl:npmod, nmod:tmod, obl:tmod} relations.

@nschneid
Contributor Author

I believe the English docs are now up to date, with mentions of :npmod and :tmod replaced with :unmarked.

I have not heard any objections to incorporating :unmarked into the remaining English corpora. @dan-zeman what is the policy regarding simple rule-based edits to other treebanks in the interest of within-language consistency?

@dan-zeman
Member

> I have not heard any objections to incorporating :unmarked into the remaining English corpora. @dan-zeman what is the policy regarding simple rule-based edits to other treebanks in the interest of within-language consistency?

It depends. If I know that a treebank is actively maintained (or was in the not-so-distant past), like EWT, I would hesitate to touch it without the current maintainer's consent. If I know that the data provider / last maintainer has been silent for a long time, I would just go and fix it. Ideally the validator should flag it as a new error and the treebanks should get their four-year grace period. But we currently have this mechanism only for the main guidelines, not for the language-specific relation subtypes.


6 participants