"Fort Smith", "Lake Superior", "Mount Fuji": flat or compound? #501

Closed
nschneid opened this issue Oct 27, 2017 · 9 comments
Comments

@nschneid
Contributor

For named entities with a descriptive term at the beginning like “Fort Smith”, it’s not clear to me how to decide between flat and compound. The UD_English corpus has just 3 tokens of “Fort” and it’s always attached as compound. Same for “Lake Superior” and “Mount Fuji”.

The use of compound here is a bit odd because typically the head of the (endocentric) compound is the supercategory: “Mississippi River” is a kind of river, but “Fort Smith” is a kind of fort, so it would be nice to have “Fort” as the head. Perhaps this is an argument in favor of flat. Then again, "_ River" and "Fort _" are standard templates for place names, so perhaps it is odd to assign them different syntactic relations.
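To make the contrast concrete, here is a minimal sketch of the two candidate analyses of "Fort Smith" as CoNLL-U fragments. It assumes the UD v2 convention that the first word of a flat structure is its technical head, and it uses a placeholder root attachment for whichever token heads the name. The helper names `read_tokens` and `name_head` are illustrative only, not part of any UD tool.

```python
# Two candidate analyses of "Fort Smith" as minimal CoNLL-U fragments.
# Only ID, FORM, UPOS, HEAD and DEPREL are filled in; other columns are "_".
# The attachment outside the name (HEAD 0 / root) is a placeholder.

COMPOUND = "\n".join([
    "1\tFort\t_\tPROPN\t_\t_\t2\tcompound\t_\t_",
    "2\tSmith\t_\tPROPN\t_\t_\t0\troot\t_\t_",
])

FLAT = "\n".join([
    "1\tFort\t_\tPROPN\t_\t_\t0\troot\t_\t_",
    "2\tSmith\t_\tPROPN\t_\t_\t1\tflat\t_\t_",
])


def read_tokens(fragment):
    """Parse CoNLL-U token lines into (id, form, head, deprel) tuples."""
    tokens = []
    for line in fragment.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tokens.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


def name_head(fragment):
    """Return the form of the token whose head lies outside the fragment."""
    tokens = read_tokens(fragment)
    ids = {tok_id for tok_id, _, _, _ in tokens}
    for _, form, head, _ in tokens:
        if head not in ids:
            return form
    return None


print(name_head(COMPOUND))  # Smith
print(name_head(FLAT))      # Fort
```

Under compound the name is headed by "Smith"; under flat "Fort" comes out on top, but only by the head-first convention, which carries no claim that "Fort" is the real head.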

Whatever the policy, I think this deserves mention in the docs.

(Related to #487)
(Prompted by a question from @bguil)

@manning
Contributor

manning commented Oct 29, 2017

I agree that we would be helped greatly by having conventions for things like this described in the docs.

I suspect that you should discount what is currently in UD_English. It was treebanked in a mixture of UD beta and UD v1, when flat did not exist, and despite attempts to update things, I think the choice here clearly reflects the UD v1 reality rather than careful thought.

My personal perspective for v2 is that flat makes more sense here. Certainly for some of these you get similar variants in both directions, and so flat seems better: Black Mountain and Mount Black; Lake Fern and Fern Lake.

My impression (as you see evidence of in #487) is that there is less than total agreement among UD people as to how to treat constructs of this and similar sorts. Some people (people from Prague, maybe speakers of languages with a lot of morphology) tend to prefer trying to give a diachronic syntactic analysis to such examples, whereas others (like me) tend to think these things are largely frozen, and flat makes more sense in many cases. For example, this is reflected in the somewhat schizophrenic discussion under flat for cases like Ludwig van Beethoven and Río de la Plata.

@sylvainkahane
Contributor

@nschneid you said "“Fort Smith” is a kind of fort, so it would be nice to have “Fort” as the head. Perhaps this is an argument in favor of flat.", which sounds strange to me, because flat is for headless constructions.

If you want to have "Fort" as the head, you should use compound or appos. The question is internal to English. English has a productive compound construction N2 N1 (with N1 as the head). The question is: does English also have a productive construction N1 N2? I think so.

@amir-zeldes
Contributor

I'm with @nschneid in wanting 'fort' and 'river' as the head, and with @sylvainkahane in thinking this speaks against flat. My worry is this, and I'm sure I'm repeating others here: we're trying to make syntactic annotation guidelines based on whether or not we think something is a named entity. But in my view that's not the job of treebanks, but of NER annotations.

To me flat means 'we can't figure out what's the head' or at least 'there's no good reason to prefer one or the other'. For the 'Jane Smith' type name cases, I think this holds because there's no real reason to prefer a first or last name as a head, and even if we pick one, it's not really the same syntactic relation as a compound (note: you can refer to her as 'Jane' or 'Smith'; I don't think you can do that with the fort, but indeed, you can refer to it as 'the fort').

I guess my TLDR is, if you can figure out what the head is, I'd prefer to mark it.

@nschneid
Contributor Author

What I think is really going on here is that speakers have fairly specific constructions for certain kinds of NEs. They know that it is "Fort _" but "_ River". Yet we can identify heads based on the semantics. So how about:

compound:name(Fort, Smith)
compound:name(River, Mississippi)

This could give the best of both worlds: the headedness coming from compound and the fact that these are idiosyncratic to certain kinds of NEs coming from :name. This would maintain the generalization that "normal" compounds in English are right-headed (which I think would be beneficial for parsers).
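As a rough illustration of how a consumer might use such a label, here is a short sketch. Note that `compound:name` is only the subtype proposed in this comment, not an established UD relation, and the function names `split_relation` and `type_noun` are hypothetical. Triples are written as (relation, head, dependent), following the notation above.

```python
# (relation, head, dependent) triples, written as in the proposal above.
# "compound:name" is the subtype proposed in this thread, not an existing label.
PROPOSED = [
    ("compound:name", "Fort", "Smith"),
    ("compound:name", "River", "Mississippi"),
]


def split_relation(deprel):
    """Split 'universal:subtype' into its parts; subtype is None if absent."""
    universal, _, subtype = deprel.partition(":")
    return universal, subtype or None


def type_noun(triple):
    """Return the head of a compound:name triple, i.e. the entity's type noun."""
    relation, head, _dependent = triple
    universal, subtype = split_relation(relation)
    if universal == "compound" and subtype == "name":
        return head
    return None


for triple in PROPOSED:
    print(triple[2], "->", type_noun(triple))
# Smith -> Fort
# Mississippi -> River
```

A tool that ignores subtypes would still see a plain compound here, while one that reads the subtype can recover the type noun regardless of whether it sits on the left or the right.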

@dan-zeman
Member

@manning Agreed with what you say about "people in Prague" :) although I don't know how large a proportion I represent. Anyway, I would love to see a doc page devoted in depth to this, taking the perspective of many languages. It's true that morphology in some languages speaks against a flat analysis where English seems "flatter". But in general I tend to agree with @amir-zeldes's comment that being part of an NE does not invalidate the syntax.

@nschneid
Contributor Author

nschneid commented Nov 8, 2017

It occurs to me that it is usually the Mississippi River, with a determiner, unlike Fort Smith (type on the left) or New York City (type on the right). Is this relevant to deciding between compound and flat? If so, would this criterion extend to other languages? I know there are idiosyncrasies in the use of determiners with names: e.g., in English, most country names take no determiner, but the Netherlands does.

@arademaker
Contributor

"For example, this is reflected in the somewhat schizophrenic discussion under flat for cases like Ludwig van Beethoven and Río de la Plata"

@manning, how do you prefer to analyse these cases of Ludwig van Beethoven and Río de la Plata?

"I guess my TLDR is, if you can figure out what the head is, I'd prefer to mark it."

@amir-zeldes I believe the problem is that different annotators would probably analyze these cases in different ways, making it difficult to obtain a consistent treebank.

@sylvainkahane
Contributor

@arademaker in which language? I think that Río de la Plata can have different analyses in Spanish, in languages related to Spanish (like French, where de and la exist), and in an unrelated language where this name is used. For English, I think the people in charge must decide whether it is transparent or not for an average speaker.

The same applies to your second remark, addressed to @amir-zeldes. The problem we discussed is specific to English and must be solved in the English-specific guidelines. If it is, it won't be a problem for annotators.

@amir-zeldes
Contributor

I agree with @sylvainkahane, and would like to add that the problem @arademaker mentioned is real and important to deal with, but it is not solved by using flat, since annotators can disagree about whether or not flat applies. Either way, clear guidelines are needed, and these depend to some extent on the applications we have in mind.

Maybe my view is skewed since I'm interested in entities: to me it seems important to know that "the River X" is a kind of river. But I also think that syntactically it is not really flat... To me this is different from "John Smith", where I have no compelling argument to prefer one part or the other as the head.

@dan-zeman dan-zeman added this to the v2.2 milestone Apr 24, 2018
@dan-zeman dan-zeman modified the milestones: v2.2, v2.4 Nov 13, 2018
@dan-zeman dan-zeman modified the milestones: v2.4, v2.5 Oct 6, 2019

6 participants