Name collisions #208

Closed
mjpost opened this issue Mar 23, 2019 · 27 comments

@mjpost
Member

mjpost commented Mar 23, 2019

We have a way of merging name variations (#86), but we also have the opposite issue of distinct authors with identical surface representations (see discussion starting here).

As @mbollmann suggested, the way to solve this is with some kind of user identifier. One way to do this that is in line with the rest of the publishing world is to get ORCIDs, but this requires user action. An easier solution might be the following:

  • We continue to use names as IDs. Each name is implicitly assigned an in-class identifier of 1.
  • An optional <id> annotation can be added to any <author> line. The ID could be an arbitrary identifier as long as it is consistent across all the author lines.
  • Name variants work as before, except our tuples are now (ID, canonical name, *variants).

This doesn’t address how to handle the fact that people could have overlapping surface representations but different canonical representations, or what URL to assign to their author pages, but I thought I’d get the discussion started.
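As a rough illustration of the tuple idea above, here is a minimal Python sketch. The `Person` class, its field names, and the `index` dictionary are hypothetical and only meant to show how an implicit identifier of 1 could coexist with explicit ones; this is not how the Anthology code stores names.

```python
from dataclasses import dataclass, field


# Hypothetical record mirroring the proposed (ID, canonical name, *variants) tuple.
@dataclass
class Person:
    canonical: str
    id: int = 1  # implicit in-class identifier when no <id> annotation is given
    variants: list[str] = field(default_factory=list)

    def key(self) -> tuple[str, int]:
        # A name alone is no longer unique; the (name, id) pair is.
        return (self.canonical, self.id)


people = [
    Person("Yang Liu"),        # implicitly ("Yang Liu", 1)
    Person("Yang Liu", id=2),  # a second, distinct Yang Liu
    Person("Robert L. Mercer", variants=["R. L. Mercer"]),
]
index = {p.key(): p for p in people}
```

Under this scheme, only the second Yang Liu would need an explicit annotation in the XML.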

@davidweichiang
Collaborator

Does it make more sense to make the id an element or attribute, i.e. <author id="123"><first>...? If it’s an element, one might be tempted to get all the text from the author node, erroneously making the id part of the name?

I would hope that the id would only be used when necessary... there are presumably not that many such cases.
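The pitfall with an `<id>` child element can be shown with Python's standard `xml.etree.ElementTree`. The `naive_name` helper below is a hypothetical stand-in for a consumer that concatenates all text under `<author>`; it is a sketch, not code from the Anthology.

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring(
    "<author><id>123</id><first>Yang</first><last>Liu</last></author>"
)
as_attribute = ET.fromstring(
    '<author id="123"><first>Yang</first><last>Liu</last></author>'
)


def naive_name(author):
    # Joins every text node under <author>, as a careless consumer might.
    return " ".join(t.strip() for t in author.itertext() if t.strip())


name_from_element = naive_name(as_element)      # "123 Yang Liu"
name_from_attribute = naive_name(as_attribute)  # "Yang Liu"
author_id = as_attribute.get("id")              # "123"
```

With the attribute form, the id stays out of the text content and is still available through `get("id")`.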

@mjpost
Member Author

mjpost commented Mar 23, 2019

Yes, this is an improvement!

@davidweichiang
Collaborator

How would we figure out the paper-to-author mapping for old papers?

How would we figure out the paper-to-author mapping for new papers?

  • Same as above
  • Ask publication chairs to send the Anthology the original START metadata (currently they discard it, I think), which contains author emails and affiliations; however, these can change, and emails need to be kept private
  • Ask START to include author START ids

@MartiHearst

Hi everyone, this is my first time commenting.
One issue that comes up is when other people add an author to their paper, e.g., a student adds their advisor and so on. It isn't always the case that an author is aware of how their name is being put into the system unless the system controls this, as is done in publishing systems that use something like ORCID.
So in short, we can't rely on an individual always monitoring how their own name is being used on a new paper themselves.

@mjpost
Member Author

mjpost commented Mar 27, 2019

To the list of tools for finding the paper-to-author mappings, Sebastian Kohlmeier has suggested we use Semantic Scholar, and implied he might be willing to help out.

https://api.semanticscholar.org

I'm not sure about the exact abilities, but this could work well for when we have DOIs, at least.

@MartiHearst

> To the list of tools for finding the paper-to-author mappings, Sebastian Kohlmeier has suggested we use Semantic Scholar, and implied he might be willing to help out.
>
> https://api.semanticscholar.org
>
> I'm not sure about the exact abilities, but this could work well for when we have DOIs, at least.

And Sebastian is quite enthusiastic about helping the ACL Anthology out with this. He'd like us to work together to recognize and reduce errors in the system.

@davidweichiang
Collaborator

Heng Ji's group may be able to help with this after the summer! She says: "it’s a standard xdoc entity clustering problem, here we have both paper content and author affiliation and citation network, which will make it easier."

@MartiHearst

I was thinking I'd like to help on the project during the summer.

@davidweichiang
Collaborator

davidweichiang commented Apr 12, 2019

Somewhat related to the discussion in #245: have a look at this example from name_variants.yaml:

- canonical: {first: Robert E., last: Mercer}
  comment: Western Ontario; also published as Robert Mercer, overlapping with another
    Robert Mercer
  variants:
  - {first: Robert E., last: MERCER}
  - {first: Robert, last: Mercer}
- canonical: {first: Robert L., last: Mercer}
  comment: IBM; also published as Robert Mercer, overlapping with another Robert Mercer
  variants:
  - {first: R. L., last: Mercer}
  - {first: R., last: MERCER}
  - {first: R., last: Mercer}

The result is that one of the Robert Mercers' pages incorrectly has some papers written by the other one. (This isn't a unique case; there's at least one more, which is similarly commented.)
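To make the failure mode concrete, here is a small Python sketch. It assumes, as a simplification, that every canonical name and variant is mapped to exactly one canonical author; `build_lookup` is a hypothetical helper, and the names are flattened to plain strings for brevity.

```python
# The two entries quoted above, reduced to plain strings.
entries = [
    {"canonical": "Robert E. Mercer",
     "variants": ["Robert E. MERCER", "Robert Mercer"]},
    {"canonical": "Robert L. Mercer",
     "variants": ["R. L. Mercer", "R. MERCER", "R. Mercer"]},
]


def build_lookup(entries):
    # Maps each surface form (canonical or variant) to one canonical name.
    lookup = {}
    for entry in entries:
        for name in [entry["canonical"], *entry["variants"]]:
            lookup[name] = entry["canonical"]
    return lookup


lookup = build_lookup(entries)
# Every paper signed "Robert Mercer" lands on one page, whoever wrote it.
resolved = lookup["Robert Mercer"]
```

Because the surface form alone is the key, papers that Robert L. Mercer published as "Robert Mercer" cannot be told apart from Robert E. Mercer's.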

At some point we will assign an ID to these two authors, and each paper written by "Robert Mercer" will have an ID to disambiguate it. At that point, since the two authors do have self-chosen ways of writing their names unambiguously, should the website/BibTeX records for the ambiguous papers substitute in the unambiguous names? (CMoS 14.73 recommends inserting middle initials when needed for disambiguation.)

EDIT: Using schemes similar to the ones being discussed in #245, these papers could be annotated as one of:

(c) <author><first complete="Robert L.">Robert</first><last>Mercer</last></author>
(d) <author><first original="Robert">Robert L.</first><last>Mercer</last></author>

For this pair of authors, then, no ID would be needed.

@mjpost
Member Author

mjpost commented Apr 12, 2019

More generally, though, there are going to be overlaps not just in name variants but also in canonical representations, unless we take the Screen Actors Guild approach (your name has to be globally unique). I actually like that quite a bit but doubt we have the authority.

Another solution is to manually list each author's papers under their name, e.g.,

- canonical: {first: Robert L., last: Mercer}
  comment: IBM; also published as Robert Mercer, overlapping with another Robert Mercer
  papers: PXX-YYYY, DXX-YYYY, etc.
  variants:
  - {first: R. L., last: Mercer}
  - {first: R., last: MERCER}
  - {first: R., last: Mercer}

I thought you had done this once in a PR.

@davidweichiang
Collaborator

davidweichiang commented Apr 12, 2019

EDITED: Sorry, the "Comment" button pushed itself too early.

To be sure, we'll definitely need IDs for when canonical names collide. So I think we need the complete attribute, which affects both disambiguation and display, and the id attribute, which affects disambiguation only.

In the case of Robert Mercer, we can write

<author><first complete="Robert L.">Robert</first><last>Mercer</last></author>

and no id is necessary, but in the case of Yang Liu, we have to write

<author id='002'><first>Yang</first><last>Liu</last></author>

and for a paper written by Y. Liu (there is one), we would need both:

<author id='002'><first complete="Yang">Y.</first><last>Liu</last></author>

How does that look?
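One way to read the three annotations above is sketched below in Python, assuming that `complete` replaces the printed first name for display and that `id` is carried along purely for disambiguation. `parse_author` is a hypothetical helper, not part of the Anthology code.

```python
import xml.etree.ElementTree as ET


def parse_author(xml_text):
    """Return (id, name as printed, name for display) for one <author>."""
    author = ET.fromstring(xml_text)
    first = author.find("first")
    last = author.findtext("last")
    printed = f"{first.text} {last}"
    # `complete`, when present, replaces the printed first name for display.
    display = f"{first.get('complete', first.text)} {last}"
    return author.get("id"), printed, display


mercer = parse_author(
    '<author><first complete="Robert L.">Robert</first><last>Mercer</last></author>'
)
liu = parse_author(
    "<author id='002'><first complete=\"Yang\">Y.</first><last>Liu</last></author>"
)
# mercer == (None, "Robert Mercer", "Robert L. Mercer")
# liu == ("002", "Y. Liu", "Yang Liu")
```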

The remaining question would be how to choose the ids. I hope we don't have to assign numbers and can find some more personal way to identify authors, but numbers are certainly straightforward. In the case of Chinese authors, I personally think it would make most sense to use Chinese names as identifiers. But maybe it should be up to the authors themselves.

Regarding the papers fields in name_variants.yaml: those were just to assist with editing the file, and I removed them, since that information was redundant with the XML and therefore just created more room for inconsistency.

@davidweichiang
Collaborator

After the discussion in #271, I believe that what @mjpost is aiming for makes the id and complete attributes redundant. Currently, we have situations like:

<paper>
   <author><first complete="Richard">R.</first><last>Evans</last></author>
</paper>
<paper>
   <author><first complete="Roger">R.</first><last>Evans</last></author>
</paper>

which could be disambiguated equally well by writing something like:

# XML
<paper>
   <author id="richard-evans"><first>R.</first><last>Evans</last></author>
</paper>
<paper>
   <author id="roger-evans"><first>R.</first><last>Evans</last></author>
</paper>

# name_variants.yaml
- canonical: {first: Richard, last: Evans}
  variants:
  - {id: richard-evans, first: R., last: Evans}
- canonical: {first: Roger, last: Evans}
  variants:
  - {id: roger-evans, first: R., last: Evans}

Do we want to be able to get the same effect in two ways like this? If not, should we drop the complete attribute and convert them all to ids?

@mbollmann
Member

Wouldn't it be enough to have the name database be:

- canonical: {first: Richard, last: Evans}
  id: richard-evans
- canonical: {first: Roger, last: Evans}
  id: roger-evans

If the XML already has IDs, it seems more foolproof to extract ID-tagged variants automatically from there.

Otherwise, yes, I'd be absolutely in favor of only using the ID method everywhere, if we can achieve all of our variant handling needs that way.
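Extracting ID-tagged variants from the XML could look roughly like the following Python sketch. The `<volume>` wrapper and the `variants_by_id` helper are assumptions made for the example; the third paper is invented to show two surface forms collecting under one id.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

xml_text = """
<volume>
  <paper><author id="richard-evans"><first>R.</first><last>Evans</last></author></paper>
  <paper><author id="roger-evans"><first>R.</first><last>Evans</last></author></paper>
  <paper><author id="roger-evans"><first>Roger</first><last>Evans</last></author></paper>
</volume>
"""


def variants_by_id(root):
    # Collects every surface form that appears under each author id.
    found = defaultdict(set)
    for author in root.iter("author"):
        author_id = author.get("id")
        if author_id is not None:
            name = f"{author.findtext('first')} {author.findtext('last')}"
            found[author_id].add(name)
    return found


variants = variants_by_id(ET.fromstring(xml_text))
```

With this, the name database only needs the canonical name and the id; the variants fall out of the data.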

@davidweichiang
Collaborator

davidweichiang commented Apr 22, 2019

My interpretation of the first comment in this issue was that the id wasn't globally unique, but just enough to disambiguate. In other words, there could be Yang Lius with ids 1 and 2, and there could also be Robert Mercers with ids 1 and 2. In that case, I thought it would make sense to put it alongside the first and last name.

But maybe globally unique IDs would be easier to manage, in which case, yes, your YAML makes more sense. [Deleted; I later realized this is exactly what you were saying.]

@mbollmann
Member

I think the name "id" raises the expectation that it will be globally unique. Either way, if you have two Yang Lius that you tag with 1 and 2, and then add another paper from Yang Liu at some later point, you have to look up what (if any) tags have previously been used to disambiguate them. At that point, there's no real advantage anymore to using simple disambiguation tags, so might as well make IDs globally unique, no?

@davidweichiang
Collaborator

Yes, I think that's all reasonable. So I think the proposal on the table is for a name variant to have either of two forms:

# XML
<author id="Matt Post"><first>Mattt</first><last>Post</last></author>

# name_variants.yaml
- canonical: {first: Matt, last: Post}
  id: Matt Post

or

# XML
<author><first>Mattt</first><last>Post</last></author>

# name_variants.yaml
- canonical: {first: Matt, last: Post}
  variants:
  - {first: Mattt, last: Post}

  • Are there any constraints on what an id should look like (spaces, punctuation, case-sensitive, Unicode, etc.)?

  • Should it be an error for a name to appear in the XML that is ambiguous (appears as a canonical or variant of more than one person) but lacks an id?

@mjpost
Member Author

mjpost commented Apr 24, 2019

It seems a globally unique ID is increasingly the way to go. For constraints I would suggest (a) it be UTF-8 and (b) we disallow a double-quote in the value.

Of the forms David lists above, I like the first variant better (<author id="Matt Post"><first>Mattt</first><last>Post</last></author>).

I think we should add other fields to name_variants.yaml to help us find the correct global ID. For example, when importing ACL 2038, we could take the START ID from the ingest and use that to look up the global ID and take the right papers.

As for whether it should be an error—maybe. We should at least print a WARNING like what happens when there is an EDITOR field on a regular paper.

@davidweichiang
Collaborator

I also like the idea of UTF-8... @mbollmann do you have any thoughts (like do you think that the id should take the place of the slug)?

Regarding the two ways of annotating "Mattt Post", my meaning was that in the proposal on the table, as I understand it, both of the forms would be possible. I think that's a good thing -- it will make ingestion much easier, as IDs will only have to be looked up for variants that cause errors (for being nonexistent or ambiguous).

Adding other fields to name_variants.yaml sounds good. If an email address is public (= appears on a paper), we could put it in there and use it to automate lookups. ATM we do not have access to START ids, but I imagine that would be an easy modification for Softconf.

I think if two people share a variant, it should be an error. If a paper uses the shared variant, we have no way of knowing who the author is.

@mjpost
Member Author

mjpost commented Apr 25, 2019

> I think if two people share a variant, it should be an error. If a paper uses the shared variant, we have no way of knowing who the author is.

I was confused by this line—you must mean that shared variants not disambiguated with a global ID should be an error (that is, the complete XML author tag, not the surface string itself). That makes sense and you've convinced me—though I would like to know how many of these we'll encounter the first time we enforce this.

@davidweichiang
Collaborator

davidweichiang commented Apr 25, 2019

Right -- what I meant is that the following should be, and already is, an error:

# name_variants.yaml
- canonical: {first: Robert L., last: Mercer}
  variants:
  - {first: Robert, last: Mercer}
- canonical: {first: Robert E., last: Mercer}
  variants:
  - {first: Robert, last: Mercer}

But come to think of it, I think we actually should allow the above. We are going to start allowing shared canonical names, so we should start allowing shared variants as well. Of course, if a shared name actually gets used in the XML without an id, it should be an error.

I also think that, to error-proof new ingestions, it should be an error for the XML to have two authors with the same name where one has an id and the other doesn't, i.e.

# name_variants.yaml
- canonical: {first: Robert L., last: Mercer}
  id: "Robert L. Mercer"
- canonical: {first: Robert E., last: Mercer}
  id: "Robert E. Mercer"

# XML
<paper id="1">
  <author id="Robert L. Mercer"><first>Robert</first><last>Mercer</last></author>
</paper>
<paper id="2">
  <author><first>Robert</first><last>Mercer</last></author>
</paper>

The reasoning is that if paper 2 gets ingested, two things could happen, both of them bad:

  • A third author could be created with canonical name "Robert Mercer"
  • Paper 1 could implicitly create the variant "Robert Mercer" for Robert L. Mercer, and paper 2 would therefore be assigned to Robert L. Mercer

So I think it should be an error, and a human needs to disambiguate it.
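A check for this mixed usage could be sketched in Python as follows. The `<volume>` wrapper and `find_mixed_usage` are hypothetical; the sketch only flags names that occur both with and without an `id`, and does not consult name_variants.yaml.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

xml_text = """
<volume>
  <paper id="1">
    <author id="Robert L. Mercer"><first>Robert</first><last>Mercer</last></author>
  </paper>
  <paper id="2">
    <author><first>Robert</first><last>Mercer</last></author>
  </paper>
</volume>
"""


def find_mixed_usage(root):
    """Return names that appear both with and without an id attribute."""
    ids_seen = defaultdict(set)
    for author in root.iter("author"):
        name = f"{author.findtext('first')} {author.findtext('last')}"
        ids_seen[name].add(author.get("id"))  # None when no id is given
    return sorted(
        name for name, ids in ids_seen.items() if None in ids and len(ids) > 1
    )


errors = find_mixed_usage(ET.fromstring(xml_text))  # ["Robert Mercer"]
```

An ingestion script could refuse to proceed while this list is non-empty, leaving the decision to a human.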

@davidweichiang
Collaborator

In the above commit, I tried converting all the complete attributes to id attributes. I used the slug of the canonical name as the id, but then had second thoughts. Maybe the id should simply be the canonical name -- that is, adopt the SAG policy for canonical names, and the id attribute disambiguates names by providing the (unique) canonical name when needed.

Advantages of slug-like ids:

  • They can be used directly in the URL of the author page
  • There aren't many confusable characters (like ş and ș, for example)
  • They are easy for anyone to type (and until we have an automatic name disambiguator, someone will have to type IDs for ambiguous names)

Advantages of using the canonical name as the id:

  • The modification to the code would be far simpler -- just a few lines in AnthologyIndex.register, I believe
  • Names are natural language and slugs are not, and we're all about natural language, right? In particular, Chinese characters could be appended to canonical names to make them unique
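For comparison, a slug-like id could be derived along these lines. This `slugify` is a generic sketch written for this example using only the standard library; it is not the slug function the Anthology actually uses, and its handling of non-Latin scripts is deliberately crude.

```python
import re
import unicodedata


def slugify(name):
    # Decompose accented letters, then drop anything outside ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


slug = slugify("Robert L. Mercer")  # "robert-l-mercer"
same = slugify("\u015f") == slugify("\u0219")  # ş and ș both reduce to "s"
empty = slugify("刘洋")  # "" because the characters have no ASCII form
```

The last line shows the trade-off: confusable characters collapse, but names written only in Chinese characters lose all of their information.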

@mjpost
Member Author

mjpost commented May 6, 2019

I like your resolution in your comment above. That is, once there is a clash of surface name form between two authors, all subsequent additions to the Anthology must disambiguate all instances upon ingestion.

I like simplicity, but it seems to me that in general, the purpose of IDs is to remove any possible ambiguity. If we permit arbitrary Unicode in the ID, we could have situations (or would have trouble preventing situations) where visually identical IDs in reality map to different Unicode code points. This occurs often, for example, with Arabic letters, where there are distinct entries in the Arabic and Persian portions of the Unicode tables.

So I don't know how to allow Unicode (and I like the Chinese piece) without opening up a can of worms that we may regret, even if only because we have to spend a bunch of time hammering out policy on the long tail of exceptions. This all suggests to me that IDs should be alphanumeric, but I am only 80% sure.
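The code-point concern can be illustrated with Python's `unicodedata` module. The two letters below are one example of an Arabic/Persian pair that can render alike in some positions; `ID_PATTERN` and `is_valid_id` are hypothetical, showing one possible reading of "alphanumeric" (lowercase ASCII letters and digits separated by hyphens).

```python
import re
import unicodedata

arabic_yeh = "\u064A"  # ARABIC LETTER YEH
farsi_yeh = "\u06CC"   # ARABIC LETTER FARSI YEH

# Distinct code points, and Unicode normalization does not merge them.
differ = arabic_yeh != farsi_yeh
still_differ = (
    unicodedata.normalize("NFKC", arabic_yeh)
    != unicodedata.normalize("NFKC", farsi_yeh)
)

# Hypothetical restriction: lowercase ASCII letters and digits, hyphen-separated.
ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_id(candidate):
    return ID_PATTERN.fullmatch(candidate) is not None


ok = is_valid_id("richard-evans")                  # True
rejected = is_valid_id("Matt Post")                # False: uppercase and a space
rejected_unicode = is_valid_id("yeh-" + farsi_yeh)  # False: non-ASCII letter
```

Restricting ids this way sidesteps the confusable-character question at the cost of ruling out ids in native scripts.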

@davidweichiang
Collaborator

Yeah. For now, I'm continuing with PR #297 which uses slugs as ids. And as a test case, I've given the two Yang Lius ids of yang-liu-ict and yang-liu-icsi -- the reasoning being that affiliations are a common way of disambiguating people, and one's PhD affiliation is something that can never change (unless you get a second PhD, I guess). Alternatives welcome.

@mjpost
Member Author

mjpost commented May 6, 2019

Ph.D. affiliation is a good one. We could always take a PR from someone who really hated their ID, should that scenario arise.

@dowobeha

dowobeha commented May 6, 2019

In addition to the above ideas, I would be in favor of introducing an (initially optional) ORCID field for all new paper submissions, with the goal of eventually requiring ORCID on paper submissions.

davidweichiang added a commit that referenced this issue May 6, 2019
…lved some reorganization of the data structures in AnthologyIndex.) #208, also fixes #305.
@davidweichiang
Collaborator

There are still some specific authors that need to be disambiguated, and I'm hoping that volunteers may be able to help. Should I make a checklist here, or start a new issue? If the latter, should we close this issue?

@mjpost
Member Author

mjpost commented May 9, 2019

We now have a system in place, so let's close this issue. A checklist or rough description of what's needed in a new issue would be great, and then we can search for some help.

najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021
…lved some reorganization of the data structures in AnthologyIndex.) acl-org#208, also fixes acl-org#305.