Name collisions #208
Comments
Does it make more sense to make the id an element or an attribute? I would hope that the id would only be used when necessary; there are presumably not that many such cases. |
Yes, this is an improvement! |
How would we figure out the paper-to-author mapping for old papers?
How would we figure out the paper-to-author mapping for new papers?
|
Hi everyone, this is my first time commenting. |
To the list of tools for finding the paper-to-author mappings, Sebastian Kohlmeier has suggested we use Semantic Scholar, and implied he might be willing to help out. https://api.semanticscholar.org I'm not sure about the exact abilities, but this could work well for when we have DOIs, at least. |
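A minimal sketch of what a DOI-based lookup might look like. The thread only gives the base URL; the `graph/v1/paper/DOI:...` path and the `authors` field with `authorId` and `name` keys are my understanding of the public Semantic Scholar API and should be checked against its documentation. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint layout; verify against the Semantic Scholar API docs.
API_ROOT = "https://api.semanticscholar.org/graph/v1/paper/"


def paper_url(doi):
    """Build the lookup URL for one DOI, requesting title and authors."""
    return f"{API_ROOT}DOI:{quote(doi, safe='/')}?fields=title,authors"


def author_pairs(record):
    """Extract (authorId, name) pairs from a decoded JSON paper record."""
    return [(a.get("authorId"), a.get("name")) for a in record.get("authors", [])]


def fetch_authors(doi):
    """Network call: fetch one paper record and return its author pairs."""
    with urlopen(paper_url(doi)) as response:
        return author_pairs(json.load(response))
```

The returned author identifiers are Semantic Scholar's own, so they would still have to be mapped onto whatever ID scheme the Anthology settles on.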
And Sebastian is quite enthusiastic about helping the ACL Anthology out with this. He'd like to mutually work to recognize and reduce errors in the system. |
Heng Ji's group may be able to help with this after the summer! She says: "it’s a standard xdoc entity clustering problem, here we have both paper content and author affiliation and citation network, which will make it easier." |
I was thinking I'd like to help on the project during the summer. |
Somewhat related to the discussion in #245: have a look at this example from
The result is that one of the Robert Mercers' pages incorrectly has some papers written by the other one. (This isn't a unique case; there's at least one more, which is similarly commented.) At some point we will assign an ID to these two authors, and each paper written by "Robert Mercer" will have an ID to disambiguate it. At that point, since the two authors do have self-chosen ways of writing their names unambiguously, should the website/BibTeX records for the ambiguous papers substitute in the unambiguous names? (CMoS 14.73 recommends inserting middle initials when needed for disambiguation.) EDIT: Using schemes similar to the ones being discussed in #245, these papers could be annotated as one of:
For this pair of authors, then, no ID would be needed. |
More generally, though, there are going to be overlaps not just in name variants but also in canonical representations, unless we take the Screen Authors Guild approach (your name has to be globally unique). I actually like that quite a bit but doubt we have the authority. Another solution to this is for each name, we manually list all that author's papers, e.g.,
I thought you had done this once in a PR. |
EDITED: Sorry, the "Comment" button pushed itself too early. To be sure, we'll definitely need IDs for when canonical names collide. So I think we need the In the case of Robert Mercer, we can write
and no
and for a paper written by Y. Liu (which there is), we would need both:
How does that look? The remaining question would be how to choose the ids. I hope we don't have to assign numbers and can find some more personal way to identify authors, but numbers are certainly straightforward. In the case of Chinese authors, I personally think it would make most sense to use Chinese names as identifiers. But maybe it should be up to the authors themselves. Regarding the |
After the discussion in #271, I believe that what @mjpost is aiming for makes the
which could be disambiguated equally well by writing something like:
Do we want to be able to get the same effect in two ways like this? If not, should we drop the |
Wouldn't it be enough to have the name database be:
If the XML already has IDs, it seems more foolproof to extract ID-tagged variants automatically from there. Otherwise, yes, I'd be absolutely in favor of only using the ID method everywhere, if we can achieve all of our variant handling needs that way. |
My interpretation of the first comment in this issue was that the id wasn't globally unique, but just enough to disambiguate. In other words, there could be Yang Lius with ids But maybe globally unique IDs would be easier to manage, in which case, yes, your YAML makes more sense. [Deleted; I later realized this is exactly what you were saying.] |
I think the name |
Yes, I think that's all reasonable. So I think the proposal on the table is for a name variant to have either of two forms:
or
|
It seems a globally unique Of the forms David lists above, I like the first variant better ( I think we should add other fields to name_variants.yaml to help us find the correct global ID. For example, when importing ACL 2038, we could take the START ID from the ingest and use that to look up the global ID and take the right papers. As for whether it should be an error: maybe. We should at least print a WARNING like what happens when there is an EDITOR field on a regular paper. |
I also like the idea of UTF-8... @mbollmann do you have any thoughts (like do you think that the id should take the place of the slug)? Regarding the two ways of annotating "Mattt Post", my meaning was that in the proposal on the table, as I understand it, both of the forms would be possible. I think that's a good thing -- it will make ingestion much easier, as IDs will only have to be looked up for variants that cause errors (for being nonexistent or ambiguous). Adding other fields to name_variants.yaml sounds good. If an email address is public (= appears on a paper), we could put it in there and use it to automate lookups. ATM we do not have access to START ids, but I imagine that would be an easy modification for Softconf. I think if two people share a variant, it should be an error. If a paper uses the shared variant, we have no way of knowing who the author is. |
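The lookup behaviour described above can be sketched roughly as follows: a name that maps to exactly one person resolves without any ID, and only nonexistent or ambiguous names need extra evidence such as a public email address. The record layout, the `emails` field, and all ids and addresses below are hypothetical, not the actual name_variants.yaml schema.

```python
from collections import defaultdict

# Hypothetical records, as they might look after loading a name database.
PEOPLE = [
    {"id": "yang-liu-1", "canonical": "Yang Liu", "variants": ["Y. Liu"],
     "emails": ["liu@one.example"]},
    {"id": "yang-liu-2", "canonical": "Yang Liu", "variants": [],
     "emails": ["liu@two.example"]},
    {"id": "matt-post", "canonical": "Matt Post", "variants": ["Mattt Post"],
     "emails": []},
]


def build_index(people):
    """Map every surface name to the set of ids using it, and emails to ids."""
    by_name = defaultdict(set)
    by_email = {}
    for person in people:
        for name in [person["canonical"], *person.get("variants", [])]:
            by_name[name].add(person["id"])
        for email in person.get("emails", []):
            by_email[email.lower()] = person["id"]
    return by_name, by_email


def resolve(name, by_name, by_email, email=None):
    """Return the id for a name, using the email only when the name is shared."""
    candidates = by_name.get(name, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    if email and by_email.get(email.lower()) in candidates:
        return by_email[email.lower()]
    if not candidates:
        raise LookupError(f"unknown name: {name!r}")
    raise LookupError(f"ambiguous name {name!r}: {sorted(candidates)}")
```

With this layout, `resolve("Mattt Post", ...)` needs no ID at all, while `resolve("Yang Liu", ...)` fails unless an email (or, in principle, a START ID) narrows the candidates to one.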
I was confused by this line—you must mean that shared variants not disambiguated with a global ID should be an error (that is, the complete XML |
Right -- what I meant is that the following should be, and already is, an error:
But come to think of it, I think we actually should allow the above. We are going to start allowing shared canonical names, so we should start allowing shared variants as well. Of course, if a shared name actually gets used in the XML without an id, it should be an error. I also think that, to error-proof new ingestions, it should be an error for the XML to have two authors with the same name, and one has an id and the other doesn't, i.e.
The reasoning is that if paper 2 gets ingested, two things could happen, both of them bad:
|
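A rough sketch of the two checks proposed in the comment above: a shared name used without an id is an error, and so is a name that appears both with and without an id. The XML shape assumed here (a plain-text `<author>` element with an optional `id` attribute) and the function name are illustrative assumptions, not the Anthology's actual schema or code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def check_author_ids(xml_text, shared_names=frozenset()):
    """Return a list of error messages for risky author annotations."""
    root = ET.fromstring(xml_text)
    ids_by_name = defaultdict(set)
    for author in root.iter("author"):
        name = " ".join((author.text or "").split())  # collapse whitespace
        ids_by_name[name].add(author.get("id"))  # None means "no id given"

    errors = []
    for name, ids in sorted(ids_by_name.items()):
        if None in ids and name in shared_names:
            errors.append(f"{name}: shared name used without an id")
        elif None in ids and len(ids) > 1:
            errors.append(f"{name}: some occurrences have an id and others do not")
    return errors
```

Running a check like this at ingestion time would flag the mixed case before a new paper could be silently attached to the wrong author page.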
In the above commit, I tried converting all the Advantages of slug-like ids:
Advantages of using the canonical name as the id:
|
I like your resolution in your comment above. That is, once there is a clash of surface name form between two authors, all subsequent additions to the Anthology must disambiguate all instances upon ingestion. I like simplicity, but it seems to me that in general, the purpose of IDs is to remove any possible ambiguity. If we permit arbitrary Unicode in the ID, we could have situations (or, would have trouble preventing situations) where there could be visually identical IDs that in reality map to different Unicode code points. This occurs often, for example, for Arabic letters, where there are distinct entries in the Arabic and Persian portions of the Unicode tables. So I don't know how to allow Unicode (and I like the Chinese piece) without opening up a can of worms that we may regret, even if only because we have to spend a bunch of time hammering out policy on the long tail of exceptions. This all suggests to me that IDs should be alphanumeric, but I am only 80% sure. |
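To make the concern concrete, here is a small sketch using one such pair: ARABIC LETTER KAF (U+0643) and ARABIC LETTER KEHEH (U+06A9), which can render very similarly in some word positions. Unicode normalization does not merge them, so an ID policy would have to handle such cases explicitly. The ASCII-only pattern is just one possible rule, not something the thread has agreed on.

```python
import re
import unicodedata

ARABIC_KAF = "\u0643"      # ARABIC LETTER KAF
PERSIAN_KEHEH = "\u06a9"   # ARABIC LETTER KEHEH, used in Persian


def same_after_nfkc(a, b):
    """True if two strings become identical under NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# One possible restriction: lowercase ASCII letters and digits, hyphen-separated.
ASCII_ID = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_id(candidate):
    """Accept only ids made of ASCII alphanumerics joined by single hyphens."""
    return ASCII_ID.fullmatch(candidate) is not None
```

Under this rule `is_valid_id("yang-liu")` passes, while any ID containing either of the two letters above is rejected, which sidesteps the look-alike problem at the cost of excluding native-script IDs.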
Yeah. For now, I'm continuing with PR #297 which uses slugs as ids. And as a test case, I've given the two Yang Lius ids of |
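As an illustration of what slug-style ids could look like, here is a minimal sketch that derives an ASCII slug from a name and optionally appends a qualifier to separate namesakes. This is not the slug function used in PR #297; `slug_id`, `make_id`, and the qualifier idea are assumptions made for the example.

```python
import re
import unicodedata


def slug_id(name):
    """Lowercase ASCII slug: strip accents, then join alphanumeric runs with '-'."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def make_id(name, qualifier=None):
    """Slug for a name, with an optional hypothetical disambiguating qualifier."""
    base = slug_id(name)
    return f"{base}-{slug_id(qualifier)}" if qualifier else base
```

Note that a name written entirely in a non-Latin script produces an empty slug here, so such names would need a romanized form or a hand-assigned id.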
Ph.D. affiliation is a good one. We could always take a PR from someone who really hated their ID, should that scenario arise. |
In addition to the above ideas, I would be in favor of introducing an (initially optional) OrcID field for all new paper submissions, with the goal of eventually requiring OrcID on paper submissions. |
There are still some specific authors that need to be disambiguated, and I'm hoping that volunteers may be able to help. Should I make a checklist here, or start a new issue? If the latter, should we close this issue? |
We now have a system in place, so let's close this issue. A checklist or rough description of what's needed in a new issue would be great, and then we can search for some help. |
…lved some reorganization of the data structures in AnthologyIndex.) acl-org#208, also fixes acl-org#305.
We have a way of merging name variations (#86), but we also have the opposite issue of distinct authors with identical surface representations (see discussion starting here).
As @mbollmann suggested, the way to solve this is with some kind of user identifier. One way to do this that is in line with the rest of the publishing world is to get ORCIDs, but this requires user action. An easier solution might be the following:
1. An <id> annotation can be added to any <author> line. The ID could be an arbitrary identifier as long as it is consistent across all the author lines.

This doesn’t address how to handle the fact that people could have overlapping surface representations but different canonical representations, or what URL to assign to their author pages, but I thought I’d get the discussion started.
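A small sketch of the effect this proposal would have on author pages, assuming author lines have already been parsed into (paper, name, id) triples. The paper keys and ids are made up for illustration.

```python
from collections import defaultdict


def author_pages(author_lines):
    """Group papers by (name, id); an id of None means no annotation was given."""
    pages = defaultdict(list)
    for paper, name, author_id in author_lines:
        pages[(name, author_id)].append(paper)
    return dict(pages)


# Hypothetical input: two distinct people who both publish as "Robert Mercer".
lines = [
    ("P1", "Robert Mercer", "a"),
    ("P2", "Robert Mercer", "b"),
    ("P3", "Robert Mercer", "a"),
    ("P4", "Matt Post", None),
]
pages = author_pages(lines)
```

Because the id only has to be consistent across one person's author lines, the two Robert Mercers end up on separate pages, while authors without a collision need no annotation.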