"Fort Smith", "Lake Superior", "Mount Fuji": flat or compound? #501
Comments
I agree that we would be helped greatly by having conventions for things like this described in the docs. I suspect that you should discount what is currently in UD_English, since the reality is that it was treebanked in a mixture of UD beta and UD v1, when flat did not exist, and despite attempts to update things, I think the choice here clearly reflects the UD v1 reality, not careful thought. My personal perspective for v2 is that My impression (as you see evidence of in #487) is that there is less than total agreement among UD people as to how to treat constructs of this and similar sorts. Some people (people from Prague, maybe speakers of languages with a lot of morphology) tend to prefer trying to give a diachronic syntactic analysis to such examples, whereas others (like me) tend to think these things are largely frozen, and
@nschneid you said "“Fort Smith” is a kind of fort, so it would be nice to have “Fort” as the head. Perhaps this is an argument in favor of If you want to have "Fort" as the head you should use
I'm with @nschneid in wanting 'fort' and 'river' as the head, and with @sylvainkahane in thinking this speaks against To me I guess my TLDR is, if you can figure out what the head is, I'd prefer to mark it.
What I think is really going on here is that speakers have fairly specific constructions for certain kinds of NEs. They know that it is "Fort _" but "_ River". Yet we can identify heads based on the semantics. So how about: compound:name(Fort, Smith) This could give the best of both worlds: the headedness coming from
@manning Agreed with what you say about "people in Prague" :) although I don't know how large a proportion I represent; anyway, I would love to see a doc page devoted in depth to this, and taking a perspective of many languages. It's true that morphology in some languages speaks against a
It occurs to me that it is usually the Mississippi River, with a determiner, unlike Fort Smith (type on left) or New York City (type on the right). Is this relevant to deciding between
@manning, how do you prefer to analyse these cases of
@amir-zeldes I believe the problem is that different annotators would probably analyze these in different ways, making it difficult to obtain a consistent treebank.
@arademaker in which language? I think that Rio de la Plata can have different analyses in Spanish, in languages related to Spanish (like French, where de and la exist), and in an unrelated language where this name is used. For English, I think the people in charge must decide whether it is transparent or not for an average speaker. The same applies to your second remark to @amir-zeldes. The problem we discussed is specific to English and must be solved in the English-specific guidelines. If it is, it won't be a problem for annotators.
I agree with @sylvainkahane, and would like to add that the problem @arademaker mentioned is real and important to deal with, but it is not solved by using Maybe my view is skewed since I'm interested in entities: to me it seems important to know that "the River X" is a kind of river. But I also think that syntactically it is not really flat... To me this is different from "John Smith", where I have no compelling argument to prefer one part or the other as the head.
For named entities with a descriptive term at the beginning like “Fort Smith”, it’s not clear to me how to decide between flat and compound. The UD_English corpus has just 3 tokens of “Fort” and it’s always attached as compound. Same for “Lake Superior” and “Mount Fuji”.

The use of compound here is a bit odd because typically the head of the (endocentric) compound is the supercategory: “Mississippi River” is a kind of river, but “Fort Smith” is a kind of fort, so it would be nice to have “Fort” as the head. Perhaps this is an argument in favor of flat. Then again, "_ River" and "Fort _" are standard templates for place names, so perhaps it is odd to assign them different syntactic relations.

Whatever the policy, I think this deserves mention in the docs.
(Related to #487)
(Prompted by a question from @bguil)
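
To make the headedness question concrete, here is a minimal Python sketch of the candidate analyses of "Fort Smith" raised in this issue, each written as relation(head, dependent). The `Dep` type, the dictionary labels, and the `type_word_is_head` helper are illustrative assumptions, not part of any UD tool. The sketch assumes the UD v2 convention that flat attaches later words to the first word.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    """One dependency edge, read as relation(head, dependent)."""
    relation: str
    head: str
    dependent: str


# Candidate analyses of "Fort Smith" (labels are ad hoc, for illustration only).
CANDIDATES = {
    # "Fort" attached as compound, as described for UD_English above.
    "compound_current": Dep("compound", "Smith", "Fort"),
    # flat: later words attach to the first word of the name.
    "flat": Dep("flat", "Fort", "Smith"),
    # compound, but with the descriptive term as head.
    "compound_type_head": Dep("compound", "Fort", "Smith"),
}


def type_word_is_head(dep: Dep, type_word: str) -> bool:
    """Return True if the descriptive term (e.g. "Fort") heads the name."""
    return dep.head == type_word


for label, dep in CANDIDATES.items():
    print(f"{label}: {dep.relation}({dep.head}, {dep.dependent}) "
          f"type word is head: {type_word_is_head(dep, 'Fort')}")
```

Only the first candidate leaves "Fort" as a dependent. By contrast, compound(River, Mississippi) already has the type word as head, which is the asymmetry the issue points out.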