inconsistent analysis of etc #820
Comments
That's a very complicated word that does not fit the distribution of any other word. Such words are called extenders by Overstreet (2005). In spoken French we analyzed them as CCONJ, even if they are not equivalent to coordinating conjunctions. Overstreet M. (2005). And stuff und so: Investigating Pragmatic Expressions in English and German. Journal of Pragmatics 37, 1845–1864. |
For English, as I understand it, the idea is that "etc" is a foreign word hence upos of A related thing that has been discussed but not resolved is the structure of "et al."—the options being
|
"etc" is a loan word in English, not a foreign word. X is not a good option. |
You mean as
Exactly, I think one of the reasons for this analysis, at least coming from GUM which has entity and coreference annotation, is that it behaves like a plural coordinate phrase and can corefer with one. So we can have:
In the English corpora, the xpos tag FW is usually automatically converted to X, it's only 'foreign' because the PTB guidelines treated it this way. I agree it's not ideal, but I'm not sure if it's worth making the correspondence with xpos more piecemeal by changing this specific word's upos tag (though it doesn't matter too much to me personally) |
Why not follow the German HDT and split et/cc cetera/noun? That is, etc is a MWT. The second case of @nschneid, right? |
But what upostag to use? That is why I prefer split "et cetera" |
Just "et cetera"? Are other abbreviations split as well? In the English tokenization we only split off clitics. |
Two additional remarks on the CCONJ analysis of "etc". |
@dan-zeman is right, this issue is part of the #181, should we close it here and continue there? I can't see etc tagged as ADV in Portuguese, but I may be wrong. We have 14 cases in Bosque. In #181, @manning was against splitting |
Changing the tokenization for etc. would be a pretty radical break with LDC and other corpus behavior in English, so I would be strongly against it, and as @nschneid points out it is a slippery slope opening a huge number of questions regarding what to split or not to split (we also don't split acronyms, and I don't see that 'etc' is fundamentally different) Latin "cetera" is a plural adjective meaning "remaining", so if it's not a foreign word, then I suppose it could be tagged with upos ADJ, but it's not that X offends me that much - the guidelines state that it is used with tokens that "for some reason cannot be assigned a real part-of-speech category", and I think it's OK that that guideline is fairly vague. As @sylvainkahane pointed out, it is basically a sui generis, so no other tag fits well. In any case, "etc" seems more complex than the simple integrated loan word example of "sombrero": https://universaldependencies.org/u/pos/X.html Happy to move this to #181 if preferred. |
I think I agree with almost everything @sylvainkahane writes, except that I don't come down on the side of CCONJ.
One word or two
Yes, "etc." has a history whereby it comes from two Latin words. But it just doesn't seem a good synchronic analysis to say that it should be two words. Would we next split up "another" because it comes from two English words? I think most linguists regard it as a mistake to try to preserve diachrony in a synchronic description. Evidence for it being one word synchronically includes:
Syntax
No one has argued against the current analysis and @sylvainkahane's argument here for
Part of speech
Several of the choices are definitely wrong:
The two plausible candidates correspond to the two halves of the meaning of "etc.": CCONJ or NOUN. I think we do have to accept that "etc." is a weird special word, and anything we do is shoving it into some category or another. @sylvainkahane gives the case for CCONJ. But I think we are better off calling it a NOUN:
|
Hmm. What about the argument that it can coordinate with non-nominals? "We need to mow the lawn, weed the garden, paint the mailbox, etc.". "Bees swarmed everywhere—inside the hive, above the tree, etc." Also, unlike other nouns, it must be the last element in a coordination. Non-Latinate paraphrases:
It seems to me that no standard POS is a great fit because "etc." has a very special distribution (last element of a coordination of any type). I could see this being an argument to call it |
This seems like a derived sense meaning "other miscellaneous/non-notable things", but of course nouns get derived from other parts of speech all the time. |
Another idea is to call it |
Though I can't find any instances in GUM or EWT, "both" can also occur post-coordination: "We invited him and her both" (meaning 'We invited both him and her'). So that would be another potential justification for |
Latvian doesn't use etc. particularly often, but there are two common abbreviations we would like to annotate in similar manner:
There are also a couple of rarer ones: u.t.j.p. (un tā jo projām 'and so on'), v.tml. (vai tamlīdzīgi 'or similar'), u.tml. (un tamlīdzīgi 'and similar'). Thus, after much discussion, we just assigned a separate tag ( For UD needs we currently convert them to Anyway, I am very interested in the final conclusions of this discussion :) |
I agree, @nschneid, that the fact that you can use “etc.” with things other than nominals is an argument against calling it a noun (though we do get unlike category coordination in English and you might possibly regard the verbal cases as ellipsis of “and [do] other things”). And I certainly agree that “no standard POS is a great fit”. I think we need to choose something as a convention. I agree that choosing CCONJ is also reasonable. I still suspect NOUN might be best. While in general “etc.” is final, one other usage to consider is that it can be repeated: “We’ll need sleeping bags, tents, water bottles, etc., etc.” |
If people insist on viewing it as nominal I would think PRON would make more sense than NOUN. It is vaguely similar to "everything-else"—both in meaning, and in that it doesn't have a plural ending despite referring to multiple items. But it also can't do things that nominals normally do, like head NPs (absent coordination), or be the antecedent for anaphora. |
I find this argument convincing for
Mm, if we agree it's essentially a nominal I would prefer noun, I think it would be odd to say that it's a loan-pronoun just for the semantic reason that it is unspecific, and typologically loan-pronouns are quite rare. I also don't think it's considered a pronoun in Latin despite being semantically vague, and there are also some oddities about its use, such as repeatability ("etc. etc.") which don't really fit that profile. |
In the spirit of putting all options on the table, we could also consider |
How wedded are we to the FWIW, CGEL (p. 1305) calls "both" and "either" determinatives (as the POS) whether they occur in determiner position of an NP, or "function as marker of the first coordinate in correlative coordination". I.e.: CGEL does NOT consider "both" or "either" to be coordinating conjunctions when they occur within coordinate structures.
If we were to decide that elaborations of a coordination relation are not |
Interesting, I didn't realize that. But at least those are modifiers within NPs right? What about |
I suppose so. |
Just adding another data point: the Punjabi translational equivalent ਆਦਿ ādi I tagged as |
As I mentioned above, currently the inventory of If anyone is curious, here is the distribution of the coordinate item in GUM: NOUN 10 Also wanted to add to @manning 's dictionary survey that dictionary.com concurs with Merriam Webster in labeling it as a noun (and listing the plural from @manning 's example as well): |
Not as overwhelmingly skewed in EWT—roughly 45 NOUN+PROPN, 10 VERB, 3 ADJ, 2 ADV. (I say "roughly" because some of them look like annotation errors.)
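For anyone who wants to reproduce this kind of count, here is a minimal sketch in plain Python. It assumes "etc." is attached via conj to the first conjunct (the usual UD coordination convention), so the head's upos stands in for the category of the coordination. The function names and the two sample sentences are made up for illustration; real numbers would come from reading the actual GUM or EWT .conllu files.

```python
from collections import Counter

def sentences(conllu_text):
    """Yield each sentence as a list of 10-column token rows."""
    rows = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if rows:
                yield rows
                rows = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Keep plain word lines only (skip MWT ranges and empty nodes).
            if len(cols) == 10 and cols[0].isdigit():
                rows.append(cols)
    if rows:
        yield rows

def etc_head_upos(conllu_text, forms=("etc", "etc.")):
    """Count the upos of the head of every 'etc' token attached as conj."""
    counts = Counter()
    for rows in sentences(conllu_text):
        upos_by_id = {r[0]: r[3] for r in rows}
        for r in rows:
            if r[1].lower() in forms and r[7] == "conj":
                head_upos = upos_by_id.get(r[6])
                if head_upos:
                    counts[head_upos] += 1
    return counts

def _sent(rows):
    return "\n".join("\t".join(r) for r in rows)

# Made-up sample, not taken from any treebank.
SAMPLE = _sent([
    ["1", "books", "book", "NOUN", "NNS", "_", "0", "root", "_", "_"],
    ["2", ",", ",", "PUNCT", ",", "_", "3", "punct", "_", "_"],
    ["3", "etc.", "etc.", "X", "FW", "_", "1", "conj", "_", "_"],
]) + "\n\n" + _sent([
    ["1", "run", "run", "VERB", "VB", "_", "0", "root", "_", "_"],
    ["2", ",", ",", "PUNCT", ",", "_", "3", "punct", "_", "_"],
    ["3", "jump", "jump", "VERB", "VB", "_", "1", "conj", "_", "_"],
    ["4", ",", ",", "PUNCT", ",", "_", "5", "punct", "_", "_"],
    ["5", "etc.", "etc.", "X", "FW", "_", "1", "conj", "_", "_"],
]) + "\n"

print(etc_head_upos(SAMPLE))
```

Tokens where "etc." is not attached as conj are skipped here, so annotation errors of the kind mentioned above would need a separate pass.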
That's the spelled-out version which can be pluralized as "etceteras". For "etc." it merely says "abbreviation", which is a cop-out IMO. :) https://www.dictionary.com/browse/etc Anyway I agree that "etc." is not as frequent as other Regarding coordination, I think there are multiple constructions at play:
|
Oh I realized another thing: In its post-coordination use, there is a standard way to emphasize the magnitude of the "etc."—by repeating it: I bought an apple, a banana, a carrot, etc. etc. Not by pluralizing it, as you would expect if it were nominal (*I bought an apple, a banana, a carrot, many etceteras), and not by adding an intensifier, as you would expect for an adjective or adverb (*I bought an apple, a banana, a carrot, very etc.). This repetition is not just a marginal thing, BTW: COCA has >2k hits for "etc etc". |
OK, but if we have to choose one, then it looks like EWT supports NOUN too
Based on frequencies, the main use is for lists of nominals (18/24 in GUM, I missed a few earlier because I forgot to search without the period too)
This idea will run into problems when there is only one item before "etc", as in "books etc." CCONJ basically operates in patterns like "X CCONJ/cc Y/conj", and in the
Sure, but I don't see how that would rule out a noun. I can say "all day it was just letters letters letters" and I don't think that detracts from "letters" being a noun (and here too, I would attach them via
Traditionally I think But TBH I have never felt that UD English needs upos=PART at all; in my opinion the best upos for those three items would have been:
The last one is maybe more debatable, but all of them look more plausible to me as particles than "etc.", maybe also because they are closed class items (function words, as you say), whereas "etc." is a scholarly loan, which although unique, seems to come from an open borrowing process (I don't want to see words like "op. cit.", "ibid." and "scil." or who knows what else creep into the particle class). I fully agree that "etc." is odd, but essentially I think having a noun that only appears in coordinations is less odd than a particle that only appears in coordinations, and actually shares some properties with referring expressions. |
Maybe "etc." started out as a scholarly loan—and the way we write it as an abbreviation reminds us of that—but I think ordinary people use it in spoken conversation with no idea of its Latin origins, and it is something of a function word even though we don't traditionally think of it when making lists of function words. That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be Agreed that "op. cit.", "ibid.", etc. (ha) are not a good fit for PART, and it's hard to imagine anyone using them without knowing they're scholarly jargon borrowed from Latin. |
I'm OK with that too.
Sorry, I didn't mean that the fact it's a borrowing is relevant, my intention was to say that, as a loanword, it comes from an open-ended process, and my expectation is that PART is a closed class. I could easily imagine other loans might behave idiosyncratically, and I wouldn't want them to seep into PART because we opened the door with "etc.". That's why I strongly prefer one of the open pos classes for "etc." (but that doesn't mean it has to be NOUN or ADJ; X is fine by me if you think that's better, and actually reflects xpos better). |
I'm less opposed to That doesn't address @aryamanarora's point, though, where the equivalent word is not salient as a borrowing in Punjabi. Yes, borrowings are more likely to end up in an open class, but if it now patterns distributionally like a closed-class item (or rather, unlike any open class item) I don't think the etymology should be relevant for choosing between non- |
I think that's just because it's an acronym, no? It distributes pretty similarly to "and + NOUN", and based on the general most common treatment of acronyms in UD as stand-ins for their heads, tagging it as NOUN doesn't seem so strange to me. But if that's controversial then X is fine for me too, as I said.
Agreed, I don't know Punjabi and I'm definitely not making any statements on how it should be tagged in other languages, especially ones where formal morphology plays a more significant role in choosing POS categories. Just for English, I think it behaves most similarly to an acronym standing for "and + NOUN". |
That's how etc is currently annotated in Latin treebanks using them, especially UDante (medieval, literary Latin). Features are applied to better frame it, specifically I am opposed to choosing any lexical part of speech for etc, given that this "word" has maximally generic applications. Since I however think that
I don't think this would be a problem: each abbreviation has its own history. Moreover, the problem with etc is that it has acquired its own life and cannot be truly analysed as its components anymore, and especially not as any other abbreviation, i.e. as simple graphical variant. |
A writeup of the various points of view on "etc." It was decided that, despite the unusual distribution, |
I have to admit I am quite perplexed by this final choice, even after reading the final writeup. If we can agree that etc and similar "words" are on the functional side, as their stated generic anaphoricity strongly suggests, then I do not see why
where ceterus is currently tagged as a I do not know if this derives from some generic resistance against opening the
Is this really so different from indefinite pronouns like some? |
The discussion in the guidelines group was mostly (although not entirely) about the use of the word in English, where it is a loanword but many speakers no longer perceive it as code switching. It is somehow assumed/hoped that the decision will be applicable to other languages that use etc. as a loanword, although it hasn't been discussed thoroughly (I think Swedish was mentioned as an example). I suppose that Latin has the liberty to treat the expression as what it really is etymologically, given that it is not a loanword there.
|
In EWT at least we consider some to be a DET, and someone to be a PRON. Honestly the only thing we all agree on is that there is no good category for "etc." (in English anyway). It's sort of functional, and associated mainly with coordination, but doesn't seem as grammatically "core" as pronouns, and doesn't exist in a paradigm, which is why I think PRON seemed unintuitive (and PART). Nouns like "other" and "rest" can also have similar meanings. In reality, maybe it lies somewhere in between NOUN and PRON. Somebody should do a distributional corpus study and write a paper on it! |
You are right, I made a mistake here and meant someone. But the pronominal uses of an element like some (i.e. heading its own phrase) also fit the discussion. As for other, I am surprised to hear of it as a For sure it is atypical with respect to the "standard" personal pronouns. I am not sure that (universally) coreness (there are always atypical exponents of a class) or paradigms make it an unviable option. I would like to insist again on the fact that if we agree that it is "sort of functional", this should cut off the discussion and make us converge on
However, this analysis is represented as rather universal in the guidelines. Maybe it should be shifted to the en pages, and the u left more general? In Latin we have indeed to choose if we want to be strict in treating it as a MWT and expanding it, but it might also warrant to be annotated as a "new" word, in the vein of id est 'that is' becomes a |
Note an earlier thread on attachment of "etc.": #483 |
Analyzing the expression etc in the Portuguese-Bosque corpus (UniversalDependencies/UD_Portuguese-Bosque#386), we identified inconsistencies in this annotation across other UD corpora:
English (EWT and GUM): uses upos X.
German (HDT): splits etc into et and cetera.
French (ParTUT, GSD and Sequoia): varies between INTJ (ParTUT), X and ADV (GSD) and ADV (Sequoia).
Spanish (AnCora and GSD): varies between PUNCT (AnCora) and ADV (GSD).
Italian (ISDT and VIT): varies between ADV (ISDT) and NOUN (VIT).
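A survey like the one above can be sketched with a few lines of plain Python over the CoNLL-U text of a treebank. This is only an illustration: the function name, the set of surface forms, and the sample rows are assumptions, and a treebank that splits the expression (as HDT does) would need its own matching rule.

```python
from collections import Counter

# Assumption: lowercase surface forms to look for; adjust per language.
ETC_FORMS = {"etc", "etc."}

def etc_upos_counts(conllu_text):
    """Count the upos assigned to each 'etc' token in a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, upos = cols[0], cols[1], cols[3]
        if not tok_id.isdigit():
            continue  # skip multiword token ranges and empty nodes
        if form.lower() in ETC_FORMS:
            counts[upos] += 1
    return counts

# Made-up rows for illustration, not taken from any treebank.
SAMPLE = "\n".join("\t".join(r) for r in [
    ["1", "books", "book", "NOUN", "NNS", "_", "0", "root", "_", "_"],
    ["2", "etc.", "etc.", "X", "FW", "_", "1", "conj", "_", "_"],
    ["3", "papers", "paper", "NOUN", "NNS", "_", "1", "conj", "_", "_"],
    ["4", "etc", "etc", "NOUN", "NN", "_", "1", "conj", "_", "_"],
])

print(etc_upos_counts(SAMPLE))
```

Running this once per treebank file and comparing the resulting counters would surface the kind of variation listed above.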