inconsistent analysis of etc #820

Closed
wellington36 opened this issue Nov 10, 2021 · 52 comments
Labels
dependencies English Latin standard needed universal UPOS Universal part-of-speech tags: definitions and examples

@wellington36

While analyzing the expression etc in the Portuguese-Bosque corpus (UniversalDependencies/UD_Portuguese-Bosque#386), we found that its annotation is inconsistent across other UD corpora:

  • English (EWT and GUM): use upos equal to X.

  • German (HDT): splits etc into et and cetera.

  • French (ParTUT, GSD and Sequoia): varies between INTJ (ParTUT), X and ADV (GSD) and ADV (Sequoia).

  • Spanish (AnCora and GSD): varies between PUNCT (AnCora) and ADV (GSD).

  • Italian (ISDT and VIT): varies between ADV (ISDT) and NOUN (VIT).
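
A survey like the one above can be reproduced with a few lines of Python. The following is only a minimal sketch: it assumes the standard 10-column CoNLL-U layout, and the sample sentences, the `ETC_FORMS` set and the function name `count_etc_upos` are hypothetical, invented for illustration rather than taken from any treebank.

```python
from collections import Counter

# Hypothetical CoNLL-U fragment, invented for illustration (not treebank data).
SAMPLE = """\
# text = books, pens, etc.
1\tbooks\tbook\tNOUN\t_\t_\t0\troot\t_\t_
2\t,\t,\tPUNCT\t_\t_\t3\tpunct\t_\t_
3\tpens\tpen\tNOUN\t_\t_\t1\tconj\t_\t_
4\t,\t,\tPUNCT\t_\t_\t5\tpunct\t_\t_
5\tetc.\tetc.\tX\t_\t_\t1\tconj\t_\t_

# text = livres, etc.
1\tlivres\tlivre\tNOUN\t_\t_\t0\troot\t_\t_
2\t,\t,\tPUNCT\t_\t_\t3\tpunct\t_\t_
3\tetc.\tetc.\tADV\t_\t_\t1\tconj\t_\t_
"""

# Assumption: these surface forms are the ones worth matching.
ETC_FORMS = {"etc", "etc.", "etcetera"}


def count_etc_upos(conllu_text):
    """Tally the UPOS column (4th field) of every token whose form looks like 'etc'."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed lines, multiword-token ranges and empty nodes
        if cols[1].lower() in ETC_FORMS:
            counts[cols[3]] += 1
    return counts


etc_counts = count_etc_upos(SAMPLE)
print(etc_counts)
```

To run it on a real treebank, read a `.conllu` file into a string and pass it to `count_etc_upos`; treebanks that split the expression into two tokens would need a separate check.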

@sylvainkahane
Contributor

That's a very complicated word that does not fit the distribution of any other word. Such words are called extenders by Overstreet (2005). In spoken French we analyzed them as CCONJ, even if they are not equivalent to coordinating conjunctions.
(More precisely et and caetera are analysed as ADVs and the idiom they form as a CCONJ.)
http://match.grew.fr/?corpus=SUD_French-Rhapsodie@latest&custom=618c2df1e11a8

Overstreet M. (2005). And stuff und so: Investigating Pragmatic Expressions in English and German. Journal of Pragmatics 37, 1845–1864.

@nschneid
Contributor

nschneid commented Nov 10, 2021

For English, as I understand it, the idea is that "etc" is a foreign word hence upos of X. But it attaches as conj (not cc, as first written).

A related thing that has been discussed but not resolved is the structure of "et al."—the options being

  • flat:foreign (treating it as a foreign idiom) and conj (like "etc.")
  • cc (treating "et" as a conjunction) and conj for "al."

@sylvainkahane
Contributor

"etc" is a loan word in English, not a foreign word. X is not a good option.
Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.

@amir-zeldes
Contributor

For English, as I understand it, the idea is that "etc" is a foreign word hence upos of X. But it attaches as cc.

You mean as conj right? That's how it is in both EWT and GUM

it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on"

Exactly, I think one of the reasons for this analysis, at least coming from GUM which has entity and coreference annotation, is that it behaves like a plural coordinate phrase and can corefer with one. So we can have:

  • [Lewis, Simons, & Fennig]i ... based on [Lewis et al.]i (actual GUM example!)

"etc" is a loan word in English, not a foreign word

In the English corpora, the xpos tag FW is usually automatically converted to X, it's only 'foreign' because the PTB guidelines treated it this way. I agree it's not ideal, but I'm not sure if it's worth making the correspondence with xpos more piecemeal by changing this specific word's upos tag (though it doesn't matter too much to me personally)

@dan-zeman dan-zeman added dependencies UPOS Universal part-of-speech tags: definitions and examples universal labels Nov 10, 2021
@dan-zeman dan-zeman added this to the v2.9 milestone Nov 10, 2021
@arademaker
Contributor

arademaker commented Nov 11, 2021

Why not follow German HDT and split it into et/cc and cetera/noun? That is, treat etc as an MWT.

The second case of @nschneid right?

@arademaker
Contributor

Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.

But what upostag to use? That is why I prefer to split "et cetera"

@nschneid
Contributor

Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.

But what upostag to use? That is why I prefer to split "et cetera"

Just "et cetera"? Are other abbreviations split as well? In the English tokenization we only split off clitics.

@dan-zeman
Member

This issue starts to overlap with #181 (and possibly also #112 and #516).

@sylvainkahane
Contributor

Two further points about the CCONJ analysis of "etc".
Semantically "etc" contains the meaning of "and": "A, B, etc" is always a (semantic) conjunction (as opposed to the disjunction "A or B").
Syntactically, "etc" excludes other CCONJs: "A and B", "A, B, etc", but *"A and B etc".
This mutual exclusion between "etc" and other CCONJs can allow us to consider that they belong to the same distributional class, even if "etc" occupies another position.
Of course, "etc" does not share all the properties of CCONJs, but it is the best choice among a list of bad choices. X is a no-choice. ADV does not make sense: "etc" has nothing in common with ADVs that modify a verb or an adjective. NOUN is worse: "etc" cannot occupy nominal positions and it can close any coordination (I would like to dance, jump, etc). PUNCT is used for written symbols that have only a suprasegmental counterpart in spoken language.

@arademaker
Contributor

@dan-zeman is right, this issue is part of #181; should we close it here and continue there? I can't see etc tagged as ADV in Portuguese, but I may be wrong. We have 14 cases in Bosque. In #181, @manning was against splitting et cetera, but that would solve the tag problem considering the analysis from @sylvainkahane above.

@amir-zeldes
Contributor

Changing the tokenization for etc. would be a pretty radical break with LDC and other corpus behavior in English, so I would be strongly against it, and as @nschneid points out it is a slippery slope opening a huge number of questions regarding what to split or not to split (we also don't split acronyms, and I don't see that 'etc' is fundamentally different)

Latin "cetera" is a plural adjective meaning "remaining", so if it's not a foreign word, then I suppose it could be tagged with upos ADJ, but it's not that X offends me that much - the guidelines state that it is used with tokens that "for some reason cannot be assigned a real part-of-speech category", and I think it's OK that that guideline is fairly vague. As @sylvainkahane pointed out, it is basically sui generis, so no other tag fits well. In any case, "etc" seems more complex than the simple integrated loan word example of "sombrero":

https://universaldependencies.org/u/pos/X.html

Happy to move this to #181 if preferred.

@manning
Contributor

manning commented Nov 14, 2021

I think I agree with almost everything @sylvainkahane writes, except that I don't come down on the side of CCONJ.

One word or two

Yes, "etc." has a history whereby it comes from two Latin words. But it just doesn't seem a good synchronic analysis to say that it should be two words. Would we next split up "another" because it comes from two English words? I think most linguists regard it as a mistake to try to preserve diachrony in a synchronic description. Evidence for it being one word synchronically includes:

  • It is frequently pronounced as [ɛksɛtɹʌ] or [ɪksɛtɹə] (perhaps even usually, though dictionaries are slow to realize it).
  • It is frequently written as "ect." (non-standard, but common, perhaps related to previous item).

Syntax

No one has argued against the current analysis and @sylvainkahane's argument here for conj: "Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj." This does seem to me the best way to treat it in the syntax. Treating it as cc would look very odd and not capture the idea of there being conjoined things. If you compare the two sentences "I'll bring sheets, etc." and "I'll bring sheets, towels". Then I think we are best off representing both of them with a conj: sheets --conj--> etc. and sheets --conj--> towels.
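
The parallel between the two sentences can be made concrete with a small sketch. The token lists below are a hypothetical, simplified annotation written for this illustration (clitic split assumed, sentence-final punctuation omitted); they are not copied from EWT or GUM, and `edges_with_relation` is an invented helper name.

```python
# Each token is (id, form, head_id, deprel). Hypothetical simplified analyses.
SHEETS_ETC = [
    (1, "I", 3, "nsubj"),
    (2, "'ll", 3, "aux"),
    (3, "bring", 0, "root"),
    (4, "sheets", 3, "obj"),
    (5, ",", 6, "punct"),
    (6, "etc.", 4, "conj"),
]

SHEETS_TOWELS = [
    (1, "I", 3, "nsubj"),
    (2, "'ll", 3, "aux"),
    (3, "bring", 0, "root"),
    (4, "sheets", 3, "obj"),
    (5, ",", 6, "punct"),
    (6, "towels", 4, "conj"),
]


def edges_with_relation(tokens, relation):
    """Return (head_form, dependent_form) pairs for every edge with the given deprel."""
    form_by_id = {tok_id: form for tok_id, form, _, _ in tokens}
    return [
        (form_by_id[head], form)
        for _, form, head, deprel in tokens
        if deprel == relation
    ]


etc_edges = edges_with_relation(SHEETS_ETC, "conj")
towels_edges = edges_with_relation(SHEETS_TOWELS, "conj")
print(etc_edges, towels_edges)
```

Under this analysis both sentences have exactly one conj edge headed by "sheets"; only the dependent differs.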

Part of speech

Several of the choices are definitely wrong:

  • X: "etc." has been around since late Middle English. It should be analyzed as a long incorporated loan word. The UD guidelines say of X: "This usage does not extend to ordinary loan words which should be assigned a normal part-of-speech." It is only an X in EWT for the reason @amir-zeldes notes, as automatic conversion by a default rule from LDC FW.
  • INTJ, PUNCT: It's not an interjection or punctuation. It just isn't.
  • ADV: This is the part of speech assigned by Oxford dictionaries of English. It's hard to understand why. I agree with @sylvainkahane that ""etc" has nothing in common with ADVs that modify a verb or an adjective."

The two plausible candidates correspond to the two halves of the meaning of "etc.": CCONJ or NOUN. I think we do have to accept that "etc." is a weird special word, and anything we do is shoving it into some category or another. @sylvainkahane gives the case for CCONJ. But I think we are better off calling it a NOUN:

@nschneid
Contributor

nschneid commented Nov 14, 2021

Hmm. What about the argument that it can coordinate with non-nominals? "We need to mow the lawn, weed the garden, paint the mailbox, etc.". "Bees swarmed everywhere—inside the hive, above the tree, etc."

Also, unlike other nouns, it must be the last element in a coordination.

Non-Latinate paraphrases:

  • @sylvainkahane points out "...and so on" is a valid paraphrase. Where this occurs in EWT it is advmod(on/ADV, so/ADV). GUM also treats both as ADV (although it is inconsistent about which is the head).

  • Another option is "...and more". Where this occurs we currently tag "more" as ADJ, though I'm not necessarily wedded to that.

It seems to me that no standard POS is a great fit because "etc." has a very special distribution (last element of a coordination of any type). I could see this being an argument to call it X (or ADV, in systems where that is the garbage category).

@nschneid
Contributor

This seems like a derived sense meaning "other miscellaneous/non-notable things", but of course nouns get derived from other parts of speech all the time.

@nschneid
Contributor

nschneid commented Nov 14, 2021

Another idea is to call it cc:postconj, by analogy to cc:preconj ("both...and", "(n)either...(n)or"). On this analysis it is not a conjunct but a marker that follows the last conjunct to refine its meaning as a non-exclusive list. A downside of this is that it can occur after just one item ("The box contains books etc."). So it would be weird to say that is postcoordination "etc." (perhaps there it could be called ADV/advmod; cf. postmodifying "plus" as in "1 year plus").

@nschneid
Contributor

Though I can't find any instances in GUM or EWT, "both" can also occur post-coordination: "We invited him and her both" (meaning 'We invited both him and her'). So that would be another potential justification for cc:postconj.

@lauma
Contributor

lauma commented Nov 15, 2021

Latvian doesn't use etc. particularly often, but there are two common abbreviations we would like to annotate in similar manner:

  1. u.c. from un citi 'and others'
  2. utt. from un tā tālāk 'and so on'

There are also a couple of rarer ones, u.t.j.p. (un tā jo projām 'and so on'), v.tml. (vai tamlīdzīgi 'or similar'), u.tml. (un tamlīdzīgi 'and similar'); thus, after much discussion, we just assigned a separate tag (yd, that is, abbreviations serving as discourse markers) to them in our local tagset.

For UD needs we currently convert them to SYM with role conj, and we annotate etc. the same way when texts in our corpus use it. The SYM tag was born out of pure desperation and a lack of understanding of how to treat it in UD style, but for conj our thinking was that these small abbreviations usually end some kind of list by indicating that the written list is incomplete and lists only some of the items the writer was thinking about. That is, the Latvian thinking was that the abbreviation works as the final element of the list.

Anyway, I am very interested in the final conclusions of this discussion :)

@manning
Contributor

manning commented Nov 15, 2021

I agree, @nschneid, that the fact that you can use “etc.” with things other than nominals is an argument against calling it a noun (though we do get unlike category coordination in English and you might possibly regard the verbal cases as ellipsis of “and [do] other things”). And I certainly agree that “no standard POS is a great fit”. I think we need to choose something as a convention. I agree that choosing CCONJ is also reasonable. I still suspect NOUN might be best. While in general “etc.” is final, one other usage to consider is that it can be repeated: “We’ll need sleeping bags, tents, water bottles, etc., etc.”

@nschneid
Contributor

If people insist on viewing it as nominal I would think PRON would make more sense than NOUN. It is vaguely similar to "everything-else"—both in meaning, and in that it doesn't have a plural ending despite referring to multiple items.

But it also can't do things that nominals normally do, like head NPs (absent coordination), or be the antecedent for anaphora.

@amir-zeldes
Contributor

“and [do] other things”

I find this argument convincing for NOUN, and I guess actually that type of ellipsis would probably work in Latin as well (VP etc.), and there too it would be superficially an unlike coordination, but "cetera" would remain an adjective.

PRON would make more sense than NOUN. It is vaguely similar to "everything-else"

Mm, if we agree it's essentially a nominal I would prefer noun, I think it would be odd to say that it's a loan-pronoun just for the semantic reason that it is unspecific, and typologically loan-pronouns are quite rare. I also don't think it's considered a pronoun in Latin despite being semantically vague, and there are also some oddities about its use, such as repeatability ("etc. etc.") which don't really fit that profile.

@nschneid
Contributor

nschneid commented Nov 15, 2021

And I certainly agree that “no standard POS is a great fit”. I think we need to choose something as a convention.

In the spirit of putting all options on the table, we could also consider PART. It is like a function word in that it only occurs in a particular grammatical construction. PART is essentially "miscellaneous function word".

@nschneid
Contributor

How wedded are we to the cc:preconj relation for "both X and Y"? I ask because it always felt weird to me to call those CCONJ just because they are elements of a coordinating construction, as they are not the elements that link the conjuncts, but rather markers that refine the nature of the coordination.

FWIW, CGEL (p. 1305) calls "both" and "either" determinatives (as the POS) whether they occur in determiner position of an NP, or "function as marker of the first coordinate in correlative coordination". I.e.: CGEL does NOT consider "both" or "either" to be coordinating conjunctions when they occur within coordinate structures.

cc:preconj is perhaps too specific anyway as it applies to only a few lemmas.

If we were to decide that elaborations of a coordination relation are not CCONJ or cc(:preconj), but rather (say) ADV/advmod, this would bear on what we do for "etc."

@nschneid
Contributor

Are there other examples of NOUNs that occur in lexically productive combinations, but just in one position of one particular construction? (Not hapaxes in frozen expressions like kith and kin.)

Perhaps all Chinese classifiers?

Interesting, I didn't realize that. But at least those are modifiers within NPs right?

What about PART, as it is a category for syntactically exceptional items? Possessive 's occurs only at the end of an NP and infinitive marker to only at the beginning of a clause, and these do not share the wider distribution of other categories in English.

@dan-zeman
Member

Perhaps all Chinese classifiers?

Interesting, I didn't realize that. But at least those are modifiers within NPs right?

I suppose so.

@aryamanarora
Member

aryamanarora commented Jan 19, 2022

Just adding another data point: the Punjabi translational equivalent ਆਦਿ ādi I tagged as PART since it takes no nominal declensions, has no apparent gender, only occurs at the end of coordinations--it doesn't seem to type well with any other part of speech. It also doesn't really have the same weirdness of et cetera as a potentially foreign word, since Sanskrit loans are common and fully incorporated into the lexicon in Punjabi.

@amir-zeldes
Contributor

amir-zeldes commented Jan 19, 2022

As I mentioned above, currently the inventory of PART in English is only the negation "not", infinitive "to" and the genitive "'s". All three are highly common, indeclinable function words; adding "etc.", which is a learned loan-item, seems out of place in that list, and also makes it a bit odd that it is coordinated so often with nouns (we say "dogs etc." but not "to etc.", "not etc." or "'s etc.") - of course coordination doesn't have to occur between like items, but it most often does.

If anyone is curious, here is the distribution of the coordinate item in GUM:

NOUN 10
PROPN 1
ADJ 1
VERB 1
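
A count of this kind could be obtained roughly as follows. This is a hedged sketch, not the query actually used for GUM: it assumes 10-column CoNLL-U input, assumes "etc." attaches to the first conjunct via conj, and uses invented sample sentences and hypothetical function names. It matches the form both with and without the period.

```python
from collections import Counter

# Hypothetical CoNLL-U sentences, invented for illustration (not treebank data).
SAMPLE = """\
1\tdogs\tdog\tNOUN\t_\t_\t0\troot\t_\t_
2\t,\t,\tPUNCT\t_\t_\t3\tpunct\t_\t_
3\tcats\tcat\tNOUN\t_\t_\t1\tconj\t_\t_
4\t,\t,\tPUNCT\t_\t_\t5\tpunct\t_\t_
5\tetc.\tetc.\tX\t_\t_\t1\tconj\t_\t_

1\tdance\tdance\tVERB\t_\t_\t0\troot\t_\t_
2\t,\t,\tPUNCT\t_\t_\t3\tpunct\t_\t_
3\tjump\tjump\tVERB\t_\t_\t1\tconj\t_\t_
4\t,\t,\tPUNCT\t_\t_\t5\tpunct\t_\t_
5\tetc\tetc\tX\t_\t_\t1\tconj\t_\t_
"""


def sentences(conllu_text):
    """Yield each sentence as a list of column lists, dropping comment lines."""
    block = []
    for line in conllu_text.splitlines():
        if line.strip():
            if not line.startswith("#"):
                block.append(line.split("\t"))
        elif block:
            yield block
            block = []
    if block:
        yield block


def head_upos_counts(conllu_text, forms=("etc", "etc.")):
    """Tally the UPOS of the token that each conj-attached 'etc' depends on."""
    counts = Counter()
    for sent in sentences(conllu_text):
        by_id = {cols[0]: cols for cols in sent}
        for cols in sent:
            if cols[1].lower() in forms and cols[7] == "conj":
                head = by_id.get(cols[6])  # HEAD column points at the first conjunct
                if head is not None:
                    counts[head[3]] += 1
    return counts


coordinate_counts = head_upos_counts(SAMPLE)
print(coordinate_counts)
```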

Also wanted to add to @manning 's dictionary survey that dictionary.com concurs with Merriam Webster in labeling it as a noun (and listing the plural from @manning 's example as well):

https://www.dictionary.com/browse/etcetera

@nschneid
Contributor

If anyone is curious, here is the distribution of the coordinate item in GUM:

NOUN 10 PROPN 1 ADJ 1 VERB 1

Not as overwhelmingly skewed in EWT—roughly 45 NOUN+PROPN, 10 VERB, 3 ADJ, 2 ADV. (I say "roughly" because some of them look like annotation errors.)

Also wanted to add to @manning 's dictionary survey that dictionary.com concurs with Merriam Webster in labeling it as a noun (and listing the plural from @manning 's example as well):

https://www.dictionary.com/browse/etcetera

That's the spelled-out version which can be pluralized as "etceteras". For "etc." it merely says "abbreviation", which is a cop-out IMO. :) https://www.dictionary.com/browse/etc

Anyway I agree that "etc." is not as frequent as other PART items, but is frequency a necessary criterion? I thought PART was basically for words that are extremely constrained and exceptional grammatically, and tend on the functional side.

Regarding coordination, I think there are multiple constructions at play:

  • Following any coordination, usually without a conjunction: We ate cake, drank beer, etc.: I would consider this the main use, and AFAICT it is unrestricted with respect to the type of conjunct—it can even work with non-constituent coordination: I bought Alice an apple, Bob a banana, Caleb a carrot, etc. Because it must occur at the end of the list, and conjuncts in comma-separated coordinations are usually reorderable, it seems to me not a conjunct but a special item associated with the whole coordination construction. (Yes, I know the etymology involved a conjunction + conjunct but I'm pretty sure most English speakers don't know that or process "etc." that way.) It would not be crazy to call it CCONJ along similar lines as cc:preconj items "both"/"neither"/"either" being tagged CCONJ. Or PART to say this is just an unusual item in a very particular construction.
  • Following a single item: We ordered cake etc.; COCA: someone could say to her husband the same things people say to you " aren't you sacrificing the chance to be with someone that you are physically attracted to etc? ": This looks to me like more of a discourse marker use, especially in the second case. Sort of metalinguistic, 'there's more I could say here along similar lines but you get the idea'. The "etc." is completely optional without changing grammaticality; contrast *I ate cookies, cake. You could replace "etc." with "and more", or "and so on", but not simply "more" or "so on". I think ADV would not be too terrible here (or, again, PART because it's weird).
  • The noun use meaning "miscellaneous thing", which is very rare, normally spelled as "etcetera", and can be (is usually?) pluralized, and takes the usual nominal modifiers. COCA: his ranching-related etceteras. I don't think I have ever used this item so I would not take it to be indicative of typical "etc." in my own grammar.

@nschneid
Contributor

Oh I realized another thing: In its post-coordination use, there is a standard way to emphasize the magnitude of the "etc."—by repeating it: I bought an apple, a banana, a carrot, etc. etc. Not by pluralizing it, as you would expect if it were nominal (*I bought an apple, a banana, a carrot, many etceteras), and not by adding an intensifier, as you would expect for an adjective or adverb (*I bought an apple, a banana, a carrot, very etc.).

This repetition is not just a marginal thing, BTW: COCA has >2k hits for "etc etc".

@amir-zeldes
Contributor

Not as overwhelmingly skewed in EWT—roughly 45 NOUN+PROPN, 10 VERB, 3 ADJ, 2 ADV

OK, but if we have to choose one, then it looks like EWT supports NOUN too

We ate cake, drank beer, etc.: I would consider this the main use

Based on frequencies, the main use is for lists of nominals (18/24 in GUM, I missed a few earlier because I forgot to search without the period too)

It would not be crazy to call it CCONJ along similar lines as cc:preconj items "both"/"neither"/"either" being tagged CCONJ

This idea will run into problems when there is only one item before "etc", as in "books etc." CCONJ basically operates in patterns like "X CCONJ/cc Y/conj", and in the cc:preconj pattern in "CCONJ/cc:preconj X CCONJ/cc Y/conj". If we only have "X etc.", then it is not clear what CCONJ is functioning as a coordinator for: we are missing the second conjunct IMO which is what licenses the coordinating conjunction.

there is a standard way to emphasize the magnitude of the "etc."—by repeating it: I bought an apple, a banana, a carrot, etc. etc.

Sure, but I don't see how that would rule out a noun. I can say "all day it was just letters letters letters" and I don't think that detracts from "letters" being a noun (and here too, I would attach them via conj)

is frequency a necessary criterion? I thought PART was basically for words that are extremely constrained and exceptional grammatically, and tend on the functional side.

Traditionally I think PART is something like a wastebasket for things that don't fit elsewhere (and seem to usually be indeclinable). In some languages they form organic classes based on some criterion (for example the Classical Greek particles, which unlike adverbs obey Wackernagel's Law).

But TBH I have never felt that UD English needs upos=PART at all; in my opinion the best upos for those three items would have been:

  • not - ADV
  • 's - ADP
  • to - SCONJ

The last one is maybe more debatable, but all of them look more plausible to me as particles than "etc.", maybe also because they are closed class items (function words, as you say), whereas "etc." is a scholarly loan, which although unique, seems to come from an open borrowing process (I don't want to see words like "op. cit.", "ibid." and "scil." or who knows what else creep into the particle class). I fully agree that "etc." is odd, but essentially I think having a noun that only appears in coordinations is less odd than a particle that only appears in coordinations, and actually shares some properties with referring expressions.

@nschneid
Contributor

Maybe "etc." started out as a scholarly loan—and the way we write it as an abbreviation reminds us of that—but I think ordinary people use it in spoken conversation with no idea of its Latin origins, and it is something of a function word even though we don't traditionally think of it when making lists of function words.

That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be X. Whether it's a borrowing or not should be irrelevant to choosing between NOUN, CCONJ, and PART.

Agreed that "op. cit.", "ibid.", etc. (ha) are not a good fit for PART, and it's hard to imagine anyone using them without knowing they're scholarly jargon borrowed from Latin.

@amir-zeldes
Contributor

then the correct tag would be X

I'm OK with that too.

Whether it's a borrowing or not should be irrelevant

Sorry, I didn't mean that the fact it's a borrowing is relevant, my intention was to say that, as a loanword, it comes from an open-ended process, and my expectation is that PART is a closed class. I could easily imagine other loans might behave idiosyncratically, and I wouldn't want them to seep into PART because we opened the door with "etc.". That's why I strongly prefer one of the open pos classes for "etc." (but that doesn't mean it has to be NOUN or ADJ; X is fine by me if you think that's better, and actually reflects xpos better).

@nschneid
Contributor

nschneid commented Jan 21, 2022

I'm less opposed to X than @manning and @sylvainkahane are. I agree with them in principle that it's a well-integrated word of English, but given that it doesn't seem to pattern distributionally like any other word of English, and it's often spelled as an abbreviation reflecting its origin, X may be a reasonable approach in practice.

That doesn't address @aryamanarora's point, though, where the equivalent word is not salient as a borrowing in Punjabi.

Yes, borrowings are more likely to end up in an open class, but if it now patterns distributionally like a closed-class item (or rather, unlike any open class item) I don't think the etymology should be relevant for choosing between non-X tags.

@amir-zeldes
Contributor

it doesn't seem to pattern distributionally like any other word of English

I think that's just because it's an acronym, no? It distributes pretty similarly to "and + NOUN", and based on the general most common treatment of acronyms in UD as stand-ins for their heads, tagging it as NOUN doesn't seem so strange to me. But if that's controversial then X is fine for me too, as I said.

That doesn't address @aryamanarora's point, though, where the equivalent word is not salient as a borrowing in Punjabi

Agreed, I don't know Punjabi and I'm definitely not making any statements on how it should be tagged in other languages, especially ones where formal morphology plays a more significant role in choosing POS categories. Just for English, I think it behaves most similarly to an acronym standing for "and + NOUN".

@Stormur
Contributor

Stormur commented Jan 21, 2022

That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be X. Whether it's a borrowing or not should be irrelevant to choosing between NOUN, CCONJ, and PART.

That's how etc is currently annotated in Latin treebanks using them, especially UDante (medieval, literary Latin). Features are applied to better frame it, specifically Abbr=Yes to acknowledge its origin and Compound=Yes to give back its structure. The choice of X is a kind of (literal) crux desperationis, since, as has been discussed here, it cannot really be assigned to anything else, and already in Latin it becomes very questionable whether it can be segmented into its components (et CCONJ 'and' and caeter-, neuter plural of undeterminable case from caeterus DET 'further (ones)'), let alone in the other languages into which it has been borrowed. I agree with the dependency relation of conj and think that this is a rather uncontroversial choice.

I am opposed to choosing any lexical part of speech for etc, given that this "word" has maximally generic application. However, since I think that X is the true "wastebasket" of parts of speech, once we abandon any idea of segmenting it and consider it a single unit, I can envision only one other choice which would make me feel more in harmony with the annotational universe:

  • PART: this follows from the fact that a particle is, as it were, the epitome of the function word. etc (and all its cousins, like the Greek κτλ = και τα λοιπά and many others) is pure function indeed: it is just an expander of a co-ordinated series of any length and at the same time acts as its closing. It does not have any autonomous meaning whatsoever: as said, it is maximally general. So, if I were to give etc citizenship among POSs, my first choice would be PART.

I could easily imagine other loans might behave idiosyncratically, and I wouldn't want them to seep into PART because we opened the door with "etc.".

I don't think this would be a problem: each abbreviation has its own history. Moreover, the problem with etc is that it has acquired a life of its own and can no longer be truly analysed as its components, and especially not like any other abbreviation, i.e. as a simple graphical variant.

@nschneid
Contributor

A writeup of the various points of view on "etc."

It was decided that, despite the unusual distribution, NOUN is the least objectionable tag, and conj is the appropriate deprel even if coordinated with things other than nominals (cf. "We went swimming, hiking, and other things").
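
For a treebank that currently tags the word as X, applying this convention could look roughly like the sketch below. It is an illustration under stated assumptions, not the change made in the referenced commits: the set of matched forms, the choice to leave XPOS and the deprel untouched, and the function name `retag_etc_line` are all hypothetical.

```python
# Assumption: only these surface forms are retagged.
ETC_FORMS = {"etc", "etc."}


def retag_etc_line(line, new_upos="NOUN"):
    """Set UPOS (4th CoNLL-U column) to new_upos for 'etc' tokens; leave other lines as-is."""
    cols = line.split("\t")
    if len(cols) == 10 and cols[0].isdigit() and cols[1].lower() in ETC_FORMS:
        cols[3] = new_upos  # XPOS (cols[4]) and DEPREL (cols[7]) are deliberately kept
        return "\t".join(cols)
    return line


before = "5\tetc.\tetc.\tX\tFW\t_\t1\tconj\t_\t_"
after = retag_etc_line(before)
comment = retag_etc_line("# text = books, pens, etc.")
print(after)
```

Any such bulk change would still need a manual pass, since the thread notes that some existing attachments look like annotation errors.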

nschneid added a commit that referenced this issue Sep 14, 2022
nschneid added a commit that referenced this issue Sep 14, 2022
@nschneid
Contributor

@Stormur
Contributor

Stormur commented Sep 15, 2022

A writeup of the various points of view on "etc."

It was decided that, despite the unusual distribution, NOUN is the least objectionable tag, and conj is the appropriate deprel even if coordinated with things other than nominals (cf. "We went swimming, hiking, and other things").

I have to admit I am quite perplexed by this final choice, even after reading the final writeup. If we can agree that etc and similar "words" are on the functional side, as their stated generic anaphoricity strongly suggests, then I do not see why PRON could not be appropriate, being the functional counterpart of NOUN. It surely has a very specific distribution; but it surely has a deictic nature and it also ties in well with its contrastive/indefinite origin, if this has some role (as the choice of ADV for usw = und so weiter in German points to):

  • et cetera 'and the other (things)'

where ceterus is currently tagged as a DET with contrastive meaning (PronType=Con) in Latin (also an indefinite reading might be available). But in general, I think that all such terms should follow a unified annotation as long as they behave the same, as they seem to do.

I do not know if this derives from some generic resistance against opening the PRON class to some "non canonical" (i.e. non personal) elements, but etc seems a perfect candidate; the biggest vulnus for me anyway is to see it associated with a lexical class. I do not get this objection from the writeup:

In general, I think the speaker is suggesting a few members of a list and implying more and there is usually no anaphoric relation where the context or text provides other referents.

Is this really so different from indefinite pronouns like some?

@dan-zeman
Member

where ceterus is currently tagged as a DET with contrastive meaning (PronType=Con) in Latin

The discussion in the guidelines group was mostly (although not entirely) about the use of the word in English, where it is a loanword but many speakers no longer perceive it as code switching. It is somehow assumed/hoped that the decision will be applicable to other languages that use etc. as a loanword, although it hasn't been discussed thoroughly (I think Swedish was mentioned as an example). I suppose that Latin has the liberty to treat the expression as what it really is etymologically, given that it is not a loanword there.

PRON was indeed discussed as one of the options. None of the options was welcomed as a good solution, so instead of endlessly repeating the same objections back and forth, we gradually eliminated them one-by-one through voting. NOUN survived.

@nschneid
Contributor

Is this really so different from indefinite pronouns like some?

In EWT at least we consider some to be a DET, and someone to be a PRON.

Honestly the only thing we all agree on is that there is no good category for "etc." (in English anyway). It's sort of functional, and associated mainly with coordination, but doesn't seem as grammatically "core" as pronouns, and doesn't exist in a paradigm, which is why I think PRON seemed unintuitive (and PART). Nouns like "other" and "rest" can also have similar meanings. In reality, maybe it lies somewhere in between NOUN and PRON. Somebody should do a distributional corpus study and write a paper on it!

@Stormur
Contributor

Stormur commented Sep 16, 2022

Is this really so different from indefinite pronouns like some?

In EWT at least we consider some to be a DET, and someone to be a PRON.

You are right, I made a mistake here and meant someone. But the pronominal uses of an element like some (i.e. heading its own phrase) also fit the discussion. As for other, I am surprised to hear of it as a NOUN... I see it much more as a (contrastive) DET (admitting pronominal uses).

For sure it is atypical with respect to the "standard" personal pronouns. I am not sure that (universally) coreness (there are always atypical exponents of a class) or paradigms make it an unviable option. I would like to insist again on the fact that if we agree that it is "sort of functional", this should cut off the discussion and make us converge on PRON in the dyad NOUN/PRON.

The discussion in the guidelines group was mostly (although not entirely) about the use of the word in English

However, this analysis is presented as rather universal in the guidelines. Maybe it should be moved to the en pages, and the u pages left more general? In Latin we do indeed have to choose whether we want to be strict in treating it as a MWT and expanding it, but it might also warrant annotation as a "new" word, in the same vein as id est 'that is' becoming a CCONJ. And I am rather convinced that the treatment of such a "universal coordination terminator" should be unified and no longer dependent on its specific etymology (e.g. I would not like an analysis as ADV in German, all the more so because it is not adverbially modifying anything).

@nschneid
Contributor

Note an earlier thread on attachment of "etc.": #483
