Author name URLs #623

Open
mjpost opened this issue Nov 5, 2019 · 25 comments
@mjpost
Member

mjpost commented Nov 5, 2019

I can't find where I commented on this, but now that ACL 2020 is collecting links to ORCID, Semantic Scholar, and Anthology pages, I'm reminded that we don't have stable author page names. For example, if another Matt Post comes along, we have to fork the current pages.

I like Semantic Scholar's approach, where for example my page is:

https://www.semanticscholar.org/author/Matt-Post/38842528

I don't know how the integer is selected, but we could use a similar system, say starting at 1, and moving up from there. When there is no ambiguity, the base page would redirect to /1, e.g.,

https://www.aclweb.org/anthology/people/matt-post → https://www.aclweb.org/anthology/people/matt-post/1

When there is ambiguity, the base page would then be used to hold variants with no assigned ID.
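A minimal sketch of the redirect rule described above, assuming a hypothetical in-memory mapping `ids_by_slug` from a name slug to the integer IDs assigned under it. The function name and the handling of unknown slugs are illustrative assumptions, not part of the Anthology code base.

```python
from typing import Dict, List

BASE = "https://www.aclweb.org/anthology/people"


def resolve_base_page(slug: str, ids_by_slug: Dict[str, List[int]]) -> str:
    """Return the URL that a request for the base page of `slug` ends up at."""
    ids = ids_by_slug.get(slug, [])
    if len(ids) == 1:
        # No ambiguity: the base page redirects to the single numbered page.
        return f"{BASE}/{slug}/{ids[0]}"
    # Ambiguity (or, by assumption, an unknown slug): stay on the base page,
    # which holds the variants that have no assigned ID.
    return f"{BASE}/{slug}"


print(resolve_base_page("matt-post", {"matt-post": [1]}))
print(resolve_base_page("matt-post", {"matt-post": [1, 2]}))
```

With a single Matt Post the first call yields the `/1` page; once a second ID exists, the base page stops redirecting.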

@davidweichiang
Collaborator

davidweichiang commented Nov 7, 2019

Since ACL 2020 intends to share information with future conferences, it may be desirable to commit to making all current author pages stable (so you will always be matt-post and future Matt Posts will have to get a suffix).

To handle ambiguous names, it may help to distinguish between names and people.

So, if there's only one Matt Post, then https://www.aclweb.org/anthology/names/matt-post redirects to https://www.aclweb.org/anthology/people/matt-post. But if a new Matt Post arrives on the scene, he is somehow assigned a new id, say, matt-post-nd. Then https://www.aclweb.org/anthology/names/matt-post becomes a disambiguation page that contains:

  • Links to https://www.aclweb.org/anthology/people/matt-post and https://www.aclweb.org/anthology/people/matt-post-nd.
  • Any papers written by "Matt Post" which cannot be positively assigned to either person.
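The names/people split can be sketched as follows. The `Redirect` and `Disambiguation` types, the two lookup tables, and the paper ID are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Redirect:
    target: str


@dataclass
class Disambiguation:
    people: List[str]
    unassigned_papers: List[str] = field(default_factory=list)


def name_page(
    slug: str,
    people_by_name: Dict[str, List[str]],
    unassigned_by_name: Dict[str, List[str]],
) -> Union[Redirect, Disambiguation]:
    """Decide what /anthology/names/<slug> should serve."""
    people = people_by_name.get(slug, [])
    if len(people) == 1:
        # Only one person bears this name: redirect to their page.
        return Redirect(f"/anthology/people/{people[0]}")
    # Several people share the name: link to each of them and list the
    # papers that cannot be positively assigned to any one person.
    return Disambiguation(
        people=[f"/anthology/people/{p}" for p in people],
        unassigned_papers=list(unassigned_by_name.get(slug, [])),
    )


page = name_page(
    "matt-post",
    {"matt-post": ["matt-post", "matt-post-nd"]},
    {"matt-post": ["hypothetical-paper-id"]},
)
print(page)
```

Under this scheme the existing `people/matt-post` URL never changes; only the `names/` page switches from a redirect to a disambiguation page.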

@davidweichiang
Collaborator

I also wonder whether, rather than creating a new system of IDs on top of ORCID, START, DBLP, Semantic Scholar, and Google Scholar, we should adopt one of those existing systems of IDs.

@nschneid
Contributor

nschneid commented Nov 7, 2019

In principle I like the idea of adopting ORCID since uniquely identifying people is its entire purpose. START usernames are not as clean (I've seen people with multiple START IDs), and the others are automatically mined and therefore subject to error. But what about authors who don't have an ORCID? I suspect in any event we'll need a mixture of external and internal IDs.

@akoehn
Member

akoehn commented Nov 8, 2019

This is not an easy problem. We should definitely not roll our own ID schema, and relying on Semantic Scholar seems too brittle (how long will they / their IDs be around?). ORCID seems to be the best option because it is the only ID explicitly made for this job, and I am very much in favor of future conferences collecting ORCIDs for submissions. However, it is not easy to 1) find the ORCIDs for already existing papers and 2) deal with people without an ORCID.

The proposal by @davidweichiang seems sensible to me (the existing author keeps their URL on clashes). The /names/ URL would then be linked to from the /people/ pages of people whose name is shared by multiple authors, similar to disambiguation pages on Wikipedia?

And whoever has access to people organizing conferences: please lobby for orcid, it will make our lives easier in the long run :-)

@knmnyn
Collaborator

knmnyn commented Nov 25, 2019 via email

@mjpost
Member Author

mjpost commented Nov 27, 2019

It seems ORCID is the way to go, when we have it. It's too bad that the ACL email that went out recently collected pretty much everything except ORCIDs.

@knmnyn
Collaborator

knmnyn commented Nov 27, 2019 via email

@mjpost
Member Author

mjpost commented Nov 27, 2019

It would be good to have that in START, but without it being mandatory, I don't think anyone will fill it out. Though I bet we can triangulate them with all the other information we're getting.

@knmnyn
Collaborator

knmnyn commented Nov 27, 2019 via email

@mjpost
Member Author

mjpost commented Nov 27, 2019

A good idea. Though I suspect that many people haven't even registered for an ORCID. I'm fairly trendy with these things and only recently did so myself.

@mbollmann
Member

Semantic Scholar is the main source of information in the ACL form, and that one asks for ORCID on sign-up, so that could be a place to start.

@akoehn
Member

akoehn commented Nov 28, 2019

@mbollmann I don't understand your suggestion (as in: I do not even know whether you made a suggestion).

In my opinion, we (that is probably @mjpost) should lobby for adding (ideally: required) ORCID fields into future submission processes. Without that step, there will always be additional manual work to perform matching and I don't think that is sustainable in the long run. When ORCIDs are not introduced by conferences, there is little point in introducing them here.

The point of ORCIDs is that they are clean; performing error-prone matching all the time on our end defeats their purpose.

@mbollmann
Member

I was just pitching the idea that we could seed an initial ORCID database for the Anthology via Semantic Scholar. This does not include manual or error-prone matching because:

  • All ACL2020 authors/reviewers are prompted to give their Semantic Scholar URL (both in the ACL Information Form and on the global START profile) and their Anthology URL;
  • when claiming a Semantic Scholar page they are prompted to enter their ORCID (if they have one);
  • Semantic Scholar supposedly provides a public API to fetch this information easily.

Now, I don't know how many people will have both claimed their SS page and added their ORCID, but it could be a start. Of course asking for them directly as part of the submission process should be the way to go in the future, I totally agree with you here, @akoehn.

@akoehn
Member

akoehn commented Nov 28, 2019 via email

@nschneid
Contributor

nschneid commented Nov 28, 2019

If the community wants to adopt ORCID then probably the best way is to make it a required part of the START global profile, in time for the ACL 2020 camera-ready deadline (IMO it would be too sudden a change to require it for the submission deadline).

> in our database, the ORCID is not a property of the author (as the whole point is that we do not have author entities in our data) but of each individual paper

We effectively have (imperfect) author entities through a combination of the author name strings in paper entries and the name_variants database. I assume we'd need to (a) propagate ORCIDs backwards or (b) go with a hybrid strategy that clusters by ORCID where available and continues to use the name_variants system for compatibility with legacy data (or future data from non-ACL/non-START events).
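Strategy (b) could look roughly like the sketch below. It assumes a simplified, hypothetical view of the data: each paper-author entry carries an optional ORCID, and the name_variants database is reduced to a plain dict from variant spelling to canonical name. The real files are structured differently, and the paper IDs and the "M. Post" variant are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Authorship(NamedTuple):
    paper_id: str
    name: str
    orcid: Optional[str] = None


def cluster_authorships(
    authorships: Iterable[Authorship],
    canonical_name: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group paper IDs by ORCID where available, else by canonical name."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for a in authorships:
        if a.orcid:
            # Newer data: the ORCID identifies the person directly.
            key = f"orcid:{a.orcid}"
        else:
            # Legacy data: fall back to the name-variant mapping.
            key = "name:" + canonical_name.get(a.name, a.name)
        clusters[key].append(a.paper_id)
    return dict(clusters)


records = [
    Authorship("paper-1", "Matt Post", "0000-0002-1297-6794"),
    Authorship("paper-2", "M. Post"),
    Authorship("paper-3", "Matt Post"),
]
print(cluster_authorships(records, {"M. Post": "Matt Post"}))
```

Note that the same person ends up split across an ORCID cluster and a name cluster here, which is exactly the gap that propagating ORCIDs backwards (option a) would have to close.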

If we want to be conservative about propagating ORCIDs backward, I suppose it might be possible to obtain START usernames on papers at least for recent major conferences, since START usernames are a more unique set of identifiers than the name strings (though some authors have multiple START profiles). Then these could be mapped to ORCIDs with growing coverage as more people update their global profiles for ACL 2020 and future venues.

We could also email authors on an ad hoc basis to confirm that the Anthology isn't conflating them with other authors. This would allow cleaner back-propagation of ORCIDs.

@dowobeha

I concur that ORCID is the way to go. I would be in favor of making ORCID mandatory in START.

@nschneid
Contributor

nschneid commented Jul 4, 2020

Since ACL is a time for planning, I want to revisit this thread. Can we push for mandatory ORCIDs in START, maybe in time for EMNLP camera-ready? (@mjpost, would this require discussion among the ACL Exec?)

Note that START in general (at least for workshops; I don't know about EMNLP) allows listing unregistered users as authors. So I think the policy should be that camera-ready submissions have ORCIDs for ALL authors, and if it is a registered user it would be loaded automatically from the START global profile.

@mjpost
Member Author

mjpost commented Jul 4, 2020

Good idea. A few thoughts:

  • What do you envision the author URL being?
  • How do we handle venues that won’t provide ORCIDs?
  • How do we handle ambiguities and assignments involving deceased people?

I like how author pages are guessable. One idea is to use a single guessable name ID page, e.g., anthology/people/matt-post/. This could serve as a collection place for undisambiguated names, and could also redirect to unambiguous names with IDs, e.g., anthology/people/matt-post/$ORCID. I’m not sure how we would disambiguate people who don’t have ORCIDs, though, or for whom we can’t get them.

@nschneid
Contributor

nschneid commented Jul 4, 2020

The simplest step forward might be to say that ORCIDs are attached as an extra field to papers, not Anthology author records directly, though of course any paper with an ORCID would allow us to unambiguously match against existing authors with the same ORCID on other papers (or to infer it's a new author if all existing authors by that name have papers with other ORCIDs).

The id attribute on an author name would continue to be used to disambiguate the Anthology author. Whether id is explicit or not, it would be an error for an Anthology author to have papers under multiple distinct ORCIDs.

Then we could allow manual disambiguation of past authorship by adding the ORCID for the paper. (Maybe there should be a UI for authors to do this themselves: manually verify their past papers. But if not it can be done directly in XML.) Thus any explicit ORCID in the XML would be trustworthy. Papers for which we don't have ORCIDs would continue to be assigned to semiautomatic author pages under the current system. Perhaps the verified/unverified distinction should be exposed to the user.
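The invariant that an Anthology author must not have papers under multiple distinct ORCIDs lends itself to an automated check. The sketch below assumes the data has already been flattened into hypothetical `(anthology_author_id, orcid_or_None)` pairs, one per paper-author entry; the second ORCID in the example is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def find_orcid_conflicts(
    records: Iterable[Tuple[str, Optional[str]]],
) -> Dict[str, Set[str]]:
    """Return the authors whose papers carry more than one distinct ORCID."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for author_id, orcid in records:
        if orcid is not None:
            # Papers without an ORCID are unverified and never conflict.
            seen[author_id].add(orcid)
    return {author: orcids for author, orcids in seen.items() if len(orcids) > 1}


records = [
    ("matt-post", "0000-0002-1297-6794"),
    ("matt-post", None),
    ("yang-liu", "0000-0002-1297-6794"),
    ("yang-liu", "0000-0000-0000-0000"),  # made-up second ORCID
]
print(find_orcid_conflicts(records))
```

A non-empty result would mean that one author page is conflating two people and needs manual disambiguation.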

@danielgildea
Collaborator

danielgildea commented Nov 10, 2020

How about this:

  1. We start including ORCIDs in the id attribute in the xml, when we get it from the venues.
  2. Internally, our author ID is either the ORCID (if known) or the slugification of the name.
  3. If no id attribute is present in the xml, and the author's name slugifies to the same thing as
    some other author field with an ORCID, they are considered to be the same person. This would be
    a change from the current setup where an error is generated if you use the same name with and
    without an id attribute.
  4. However, if the same slugification appears with more than one id attribute (Yang Liu), then you
    have to specify the id wherever the name appears (as you do now).

This way we can gradually add ORCIDs for people already in the database, for the vast majority of cases where the name is unambiguous. There will be a few cases where, as ORCIDs come in, we realize that existing names refer to more than one person. At that point, we will have to retroactively disambiguate by hand.

As far as author URLs go, I would say stick with anthology/people/matt-post/ except when ambiguous, in which case anthology/people/matt-post-$ORCID/.
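Rules 2 to 4 above can be sketched as a single lookup. The `slugify` below is a simplified stand-in for the Anthology's own slugification, and `ids_by_slug` (slug to the set of explicit IDs seen anywhere in the XML) is a hypothetical index, so this is an illustration rather than the actual implementation.

```python
import re
import unicodedata
from typing import Dict, Optional, Set


def slugify(name: str) -> str:
    """Simplified stand-in for the Anthology's slugification."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def resolve_author_id(
    name: str,
    explicit_id: Optional[str],
    ids_by_slug: Dict[str, Set[str]],
) -> str:
    if explicit_id is not None:
        # An id attribute (an ORCID or a hand-assigned ID) always wins.
        return explicit_id
    slug = slugify(name)
    known = ids_by_slug.get(slug, set())
    if len(known) > 1:
        # Rule 4: several IDs share this slug, so the id must be explicit.
        raise ValueError(f"'{name}' is ambiguous; specify one of {sorted(known)}")
    if len(known) == 1:
        # Rule 3: same slug as exactly one identified author, same person.
        return next(iter(known))
    # Rule 2: no ID known, fall back to the slug itself.
    return slug


index = {
    "matt-post": {"0000-0002-1297-6794"},
    "yang-liu": {"yang-liu-ict", "yang-liu-icsi"},
}
print(resolve_author_id("Matt Post", None, index))
print(resolve_author_id("David Chiang", None, index))
```

Calling `resolve_author_id("Yang Liu", None, index)` raises `ValueError`, mirroring the current requirement to spell out the id wherever an ambiguous name appears.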

@nschneid
Contributor

Would this mean overloading the id field to be sometimes ORCID (in new data), sometimes current ID for different papers from the same individual? I worry that this would be confusing for users of the data, who would expect different explicit id values to refer to different individuals. Might be better to have a separate orcid field.

@danielgildea
Collaborator

I was imagining that we would replace the current IDs with ORCIDs when we find out the ORCIDs.

@nschneid
Contributor

Would these be manually reviewed? Just want to be sure new sources of noise are distinguished from authoritative pieces of metadata.

@danielgildea
Collaborator

Yes.
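Manual review could be complemented by a cheap automated guard against typos: the last character of an ORCID iD is a check digit computed with ISO 7064 MOD 11-2. The sketch below validates the hyphenated 16-character form; the function names are illustrative, and passing the check only shows that the string is well formed, not that it belongs to the right person.

```python
import re

ORCID_RE = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the check digit of a hyphenated ORCID iD."""
    if not ORCID_RE.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


print(is_valid_orcid("0000-0002-1297-6794"))
print(is_valid_orcid("0000-0002-1297-6795"))
```

The first call uses the ORCID that appears later in this thread and passes; the second differs only in the final digit and fails.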

@mjpost
Member Author

mjpost commented Dec 4, 2020

I like this, but what about the minor change of using people/matt-post/$ORCID/ as the author URL instead? This lets us easily identify all authors with a single slug and create a disambiguation landing page, and also follows conventions used by other services, e.g., my page on Semantic Scholar.

One other thing this addresses: for authors we disambiguate manually, we can keep their ID that we choose for them. Should we ever get an ORCID for them, we can easily create a link to that as their canonical author page, so as to create link permanence.
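A small sketch of why the slug/ID layout makes landing pages easy to build: grouping by the first path component yields every author who shares a slug. The function names and the `(slug, author_id)` input pairs are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def author_url(slug: str, author_id: str) -> str:
    """Build an author page path of the form people/<slug>/<id>/."""
    return f"people/{slug}/{author_id}/"


def landing_pages(authors: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each slug's landing page to the author pages that live under it."""
    pages: Dict[str, List[str]] = defaultdict(list)
    for slug, author_id in authors:
        pages[f"people/{slug}/"].append(author_url(slug, author_id))
    return dict(pages)


authors = [
    ("matt-post", "0000-0002-1297-6794"),
    ("yang-liu", "yang-liu-ict"),
    ("yang-liu", "yang-liu-icsi"),
]
for landing, targets in landing_pages(authors).items():
    # One target: the landing page can redirect. Several: disambiguate.
    print(landing, "->", targets)
```

A hand-assigned ID and a later ORCID for the same person would simply become two entries under one slug, one of which can be marked canonical.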

danielgildea pushed a commit that referenced this issue Jan 3, 2021
Issue #623

Now generates author pages with urls in form name/id

for most people this looks like:
 people/d/david-chiang/david-chiang/

Matt Post has an ORCID in name_variants.yaml, so his page is:
 people/m/matt-post/0000-0002-1297-6794/

and then there is:
 people/y/yang-liu/yang-liu-edinburgh/
 people/y/yang-liu/yang-liu-ict/
 people/y/yang-liu/yang-liu-icsi/
 people/y/yang-liu/yang-liu-umich/

I don't know how to make the old URLs people/m/matt-post/ resolve.