Author name URLs #623

mjpost · 2019-11-05T23:00:38Z

I can't find where I commented on this, but now that ACL 2020 is collecting links to ORCID, Semantic Scholar, and Anthology pages, I'm reminded that we don't have stable author page names. For example, if another Matt Post comes along, we have to fork the current pages.

I like Semantic Scholar's approach, where for example my page is:

https://www.semanticscholar.org/author/Matt-Post/38842528

I don't know how the integer is selected, but we could use a similar system, say starting at 1, and moving up from there. When there is no ambiguity, the base page would redirect to /1, e.g.,

https://www.aclweb.org/anthology/people/matt-post → https://www.aclweb.org/anthology/people/matt-post/1

When there is ambiguity, the base page would then be used to hold variants with no assigned ID.
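To make the redirect behavior concrete, here is a minimal sketch of how requests could be resolved under this scheme. The `REGISTRY` contents, the `resolve` function, and the returned action strings are all hypothetical and only illustrate the idea; nothing like this exists in the Anthology code.

```python
from typing import Dict, List

# Hypothetical registry: name slug -> integer IDs assigned so far.
REGISTRY: Dict[str, List[int]] = {
    "matt-post": [1],
    "yang-liu": [1, 2, 3],
}


def resolve(path: str) -> str:
    """Return the action for a request to people/<slug> or people/<slug>/<n>."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "people" or parts[1] not in REGISTRY:
        return "404"
    slug = parts[1]
    ids = REGISTRY[slug]
    if len(parts) == 2:
        # Base page: redirect when unambiguous, otherwise hold the variants.
        if len(ids) == 1:
            return f"redirect:people/{slug}/{ids[0]}"
        return f"disambiguation:people/{slug}"
    if len(parts) == 3 and parts[2].isdigit() and int(parts[2]) in ids:
        return f"page:people/{slug}/{int(parts[2])}"
    return "404"
```

With this, `resolve("people/matt-post")` redirects to `people/matt-post/1`, while `resolve("people/yang-liu")` stays on the base page as a disambiguation page.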

davidweichiang · 2019-11-07T14:48:46Z

Since ACL 2020 intends to share information with future conferences, it may be desirable to commit to making all current author pages stable (so you will always be matt-post and future Matt Posts will have to get a suffix).

To handle ambiguous names, it may help to distinguish between names and people.

So, if there's only one Matt Post, then https://www.aclweb.org/anthology/names/matt-post redirects to https://www.aclweb.org/anthology/people/matt-post. But if a new Matt Post arrives on the scene, he is somehow assigned a new id, say, matt-post-nd. Then https://www.aclweb.org/anthology/names/matt-post becomes a disambiguation page that contains:

Links to https://www.aclweb.org/anthology/people/matt-post and https://www.aclweb.org/anthology/people/matt-post-nd.
Any papers written by "Matt Post" which cannot be positively assigned to either person.
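As a rough sketch of the names/people split, the function below decides what a names/ page would show. The `Paper` class, its field names, and the returned dictionary shape are assumptions made up for illustration, not the Anthology's data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Paper:
    title: str
    author_name_slug: str            # slug of the name as printed on the paper
    person_id: Optional[str] = None  # None when not yet assigned to a person


def names_page(slug: str, papers: List[Paper]) -> Dict[str, object]:
    """Describe what names/<slug> should do, given all known papers."""
    mine = [p for p in papers if p.author_name_slug == slug]
    people = sorted({p.person_id for p in mine if p.person_id is not None})
    unassigned = [p.title for p in mine if p.person_id is None]
    if len(people) == 1 and not unassigned:
        # Only one person uses this name: plain redirect.
        return {"redirect": f"people/{people[0]}"}
    # Otherwise: disambiguation page with links and unassigned papers.
    return {
        "links": [f"people/{pid}" for pid in people],
        "unassigned": unassigned,
    }
```

Papers that cannot be positively assigned simply keep `person_id=None` and show up on the disambiguation page.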

davidweichiang · 2019-11-07T14:51:21Z

I also wonder whether, rather than creating a new system of IDs on top of ORCID, START, DBLP, Semantic Scholar, and Google Scholar, we should adopt one of those existing systems of IDs.

nschneid · 2019-11-07T15:01:36Z

In principle I like the idea of adopting ORCID since uniquely identifying people is its entire purpose. START usernames are not as clean (I've seen people with multiple START IDs), and the others are automatically mined and therefore subject to error. But what about authors who don't have an ORCID? I suspect in any event we'll need a mixture of external and internal IDs.

akoehn · 2019-11-08T11:16:37Z

This is not an easy problem. We should definitely not roll our own ID scheme, and relying on Semantic Scholar seems too brittle (how long will they and their IDs be around?). ORCID seems to be the best option because it is the only ID explicitly made for this job, and I am very much in favor of future conferences collecting ORCIDs for submissions. However, it is not easy to 1) find the ORCIDs for already existing papers and 2) deal with people without an ORCID.

The proposal by @davidweichiang seems sensible to me (the existing author keeps the URL on clashes). The /names/ URL would then be linked to from the /people/ pages of people whose name is shared by multiple authors, similar to disambiguation pages on Wikipedia?

And whoever has access to people organizing conferences: please lobby for ORCID, it will make our lives easier in the long run :-)

knmnyn · 2019-11-25T14:33:50Z

Hi all: I'd also support ORCID. I had brought this up with TACL and CL before, and I understood then that MIT Press was pursuing it anyway, so the editors of both CL and TACL stopped worrying about it. I agree with Arne that we should not create our own. This is exactly why ORCID was created, in the same guise as DOIs, and it will survive any one party's demise (the verdict is not so clear with Semantic Scholar, IMHO). I also agree with Nathan that we definitely need at least an internal system.

I think we should use ORCID as the primary vehicle (and redirect folks to those IDs where possible) but also retain our own author URLs for cases where there are multiple namesakes (matt-post, matt-post-2). When and if an author mints an ORCID and reveals it to us, we permanently forward the existing namesake page to the ORCID (so matt-post gets redirected to 0000-0002-1297-6794) and we don't reuse matt-post again; the next Matt Post is matt-post-3.

Cheers, Min
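A small sketch of the bookkeeping this would need, assuming a hypothetical `AuthorRegistry` class (the class and method names are invented for illustration): suffixes are handed out once and never reused, and an internal ID can be permanently forwarded to an ORCID.

```python
from typing import Dict


class AuthorRegistry:
    """Hypothetical registry of internal author IDs and ORCID forwards."""

    def __init__(self) -> None:
        self._next_suffix: Dict[str, int] = {}  # slug -> next number to hand out
        self._forward: Dict[str, str] = {}      # internal ID -> ORCID

    def mint(self, slug: str) -> str:
        """Hand out slug, slug-2, slug-3, ... and never reuse a number."""
        n = self._next_suffix.get(slug, 1)
        self._next_suffix[slug] = n + 1
        return slug if n == 1 else f"{slug}-{n}"

    def attach_orcid(self, internal_id: str, orcid: str) -> None:
        """Permanently forward an internal ID to the author's ORCID."""
        self._forward[internal_id] = orcid

    def canonical(self, internal_id: str) -> str:
        """Return the ORCID if one is attached, else the internal ID."""
        return self._forward.get(internal_id, internal_id)
```

Because the counter only moves forward, attaching an ORCID to matt-post does not free that ID up; the third namesake still becomes matt-post-3.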

mjpost · 2019-11-27T00:33:21Z

It seems ORCID is the way to go, when we have it. It's too bad that the ACL email that went out recently collected pretty much everything except ORCIDs.

knmnyn · 2019-11-27T00:54:10Z

It’s not too late per se. I think we could encourage Rich Gerber at START to add a field to the global profile to collect ORCID. It’d just not be mandatory at this point. - M

mjpost · 2019-11-27T03:17:29Z

It would be good to have that in START, but without it being mandatory, I don't think anyone will fill it out. Though I bet we can triangulate them with all the other information we're getting.

knmnyn · 2019-11-27T03:20:24Z

Very true. If one of us writes automatic triangulation software, we could have it validate the result by sending an email to the START user.

Cheers, Min

mjpost · 2019-11-27T03:23:54Z

A good idea. Though I suspect that many people haven't even registered for an ORCID. I'm fairly trendy with these things and only recently did so myself.

mbollmann · 2019-11-27T08:08:25Z

Semantic Scholar is the main source of information in the ACL form, and that one asks for ORCID on sign-up, so that could be a place to start.

akoehn · 2019-11-28T15:59:18Z

@mbollmann I don't understand your suggestion (as in: I do not even know whether you made a suggestion).

In my opinion, we (that is probably @mjpost) should lobby for adding (ideally: required) ORCID fields into future submission processes. Without that step, there will always be additional manual work to perform matching and I don't think that is sustainable in the long run. When ORCIDs are not introduced by conferences, there is little point in introducing them here.

The point of ORCIDs is that they are clean; performing error-prone matching all the time on our end defeats their purpose.

mbollmann · 2019-11-28T18:11:52Z

I was just pitching the idea that we could seed an initial ORCID database for the Anthology via Semantic Scholar. This does not include manual or error-prone matching because:

All ACL2020 authors/reviewers are prompted to give their Semantic Scholar URL (both in the ACL Information Form and on the global START profile) and their Anthology URL;
when claiming a Semantic Scholar page they are prompted to enter their ORCID (if they have one);
Semantic Scholar supposedly provides a public API to fetch this information easily.

Now, I don't know how many people will have both claimed their SS page and added their ORCID, but it could be a start. Of course asking for them directly as part of the submission process should be the way to go in the future, I totally agree with you here, @akoehn.

akoehn · 2019-11-28T18:51:42Z

Yes, this is true. But: in our database, the ORCID is not a property of the author (as the whole point is that we do not have author entities in our data) but of each individual paper. We would therefore only obtain ORCIDs for ACL 2020 papers, and that with a lot of trouble: we would need to obtain the START username for every author, obtain the START user -> Semantic Scholar mapping, and then query Semantic Scholar. We would then need to back-propagate this information for every accepted paper. This seems to be quite a bit of work given that we probably will only be able to map a small subset of ACL 2020 submissions that way (not everyone has an ORCID, not everyone has a Semantic Scholar page, not every page is claimed, etc.).
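To illustrate why coverage shrinks at every hop, here is a sketch of the lookup chain using plain dictionaries. Both mappings are hypothetical inputs that we would first have to obtain; no real START or Semantic Scholar API is called here.

```python
from typing import Dict, Iterable, Optional


def orcid_via_semantic_scholar(
    start_user: str,
    start_to_s2: Dict[str, str],   # hypothetical: START username -> S2 author ID
    s2_to_orcid: Dict[str, str],   # hypothetical: S2 author ID -> ORCID
) -> Optional[str]:
    """Follow START user -> Semantic Scholar -> ORCID; None if any hop fails."""
    s2_id = start_to_s2.get(start_user)
    if s2_id is None:
        return None
    return s2_to_orcid.get(s2_id)


def coverage(
    start_users: Iterable[str],
    start_to_s2: Dict[str, str],
    s2_to_orcid: Dict[str, str],
) -> float:
    """Fraction of START users for whom the whole chain resolves."""
    users = list(start_users)
    if not users:
        return 0.0
    hits = sum(
        1
        for u in users
        if orcid_via_semantic_scholar(u, start_to_s2, s2_to_orcid) is not None
    )
    return hits / len(users)
```

An author drops out as soon as one mapping is missing, so the overall coverage can be no better than the weakest hop.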

nschneid · 2019-11-28T19:24:55Z

If the community wants to adopt ORCID then probably the best way is to make it a required part of the START global profile, in time for the ACL 2020 camera-ready deadline (IMO it would be too sudden a change to require it for the submission deadline).

> in our database, the ORCID is not a property of the author (as the whole point is that we do not have author entities in our data) but of each individual paper

We effectively have (imperfect) author entities through a combination of the author name strings in paper entries and the name_variants database. I assume we'd need to (a) propagate ORCIDs backwards or (b) go with a hybrid strategy that clusters by ORCID where available and continues to use the name_variants system for compatibility with legacy data (or future data from non-ACL/non-START events).

If we want to be conservative about propagating ORCIDs backward, I suppose it might be possible to obtain START usernames on papers at least for recent major conferences, since START usernames are a more unique set of identifiers than the name strings (though some authors have multiple START profiles). Then these could be mapped to ORCIDs with growing coverage as more people update their global profiles for ACL 2020 and future venues.

We could also email authors on an ad hoc basis to confirm that the Anthology isn't conflating them with other authors. This would allow cleaner back-propagation of ORCIDs.

dowobeha · 2020-01-30T17:31:10Z

I concur that ORCID is the way to go. I would be in favor of making ORCID mandatory in START.

nschneid · 2020-07-04T18:52:59Z

Since ACL is a time for planning, I want to revisit this thread. Can we push for mandatory ORCIDs in START, maybe in time for EMNLP camera-ready? (@mjpost, would this require discussion among the ACL Exec?)

Note that START in general (at least for workshops; I don't know about EMNLP) allows listing unregistered users as authors. So I think the policy should be that camera-ready submissions have ORCIDs for ALL authors, and if it is a registered user it would be loaded automatically from the START global profile.

mjpost · 2020-07-04T19:09:57Z

Good idea. A few thoughts:

What do you envision the author URL being?
How do we handle venues that won’t provide ORCIDs?
How do we handle ambiguities and assignments involving deceased people?

I like how author pages are guessable. One idea is to use a single guessable name ID page, eg anthology/people/matt-post/. This could serve as a collection place for undisambiguated names, and could also redirect to unambiguous names with IDs, eg anthology/people/matt-post/$ORCID. I’m not sure how we would disambiguate people who don’t have ORCIDs though or for whom we can’t get them.

nschneid · 2020-07-04T19:27:43Z

The simplest step forward might be to say that ORCIDs are attached as an extra field to papers, not Anthology author records directly, though of course any paper with an ORCID would allow us to unambiguously match against existing authors with the same ORCID on other papers (or to infer it's a new author if all existing authors by that name have papers with other ORCIDs).

The id attribute on an author name would continue to be used to disambiguate the Anthology author. Whether id is explicit or not, it would be an error for an Anthology author to have papers under multiple distinct ORCIDs.

Then we could allow manual disambiguation of past authorship by adding the ORCID for the paper. (Maybe there should be a UI for authors to do this themselves: manually verify their past papers. But if not it can be done directly in XML.) Thus any explicit ORCID in the XML would be trustworthy. Papers for which we don't have ORCIDs would continue to be assigned to semiautomatic author pages under the current system. Perhaps the verified/unverified distinction should be exposed to the user.
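The rule that one Anthology author must not have papers under multiple distinct ORCIDs could be checked mechanically. This is only a sketch: the `Record` tuple layout and the function name are assumptions, and the ORCID strings in the usage note are placeholders rather than real iDs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical record layout: (paper ID, Anthology author ID, ORCID or None).
Record = Tuple[str, str, Optional[str]]


def orcid_conflicts(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Return the authors whose papers carry more than one distinct ORCID."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for _paper, author, orcid in records:
        if orcid is not None:  # papers without an ORCID are simply unverified
            seen[author].add(orcid)
    return {a: sorted(o) for a, o in seen.items() if len(o) > 1}
```

For example, if `yang-liu` had one paper tagged `"ORCID-A"` and another tagged `"ORCID-B"`, the function would report `{"yang-liu": ["ORCID-A", "ORCID-B"]}`, signalling that the author page conflates two people.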

danielgildea · 2020-11-10T19:54:43Z

How about this:

We start including ORCIDs in the id attribute in the xml, when we get it from the venues.
Internally, our author ID is either the ORCID (if known) or the slugification of the name.
If no id attribute is present in the xml, and the author's name slugifies to the same thing as some other author field with an ORCID, they are considered to be the same person. This would be a change from the current setup where an error is generated if you use the same name with and without an id attribute.
However, if the same slugification appears with more than one id attribute (Yang Liu), then you have to specify the id wherever the name appears (as you do now).

This way we can gradually add ORCIDs for people already in the database, for the vast majority of cases where the name is unambiguous. There will be a few cases where, as ORCIDs come in, we realize that existing names refer to more than one person. At that point, we will have to retroactively disambiguate by hand.
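The four rules above can be sketched roughly as follows. The `slugify` here is a simplified stand-in (lowercase, whitespace to hyphens), not the Anthology's real slugification, and `resolve_authors` is a hypothetical name.

```python
from typing import Dict, List, Optional, Set, Tuple


def slugify(name: str) -> str:
    """Simplified stand-in for the real slugification."""
    return "-".join(name.lower().split())


def resolve_authors(mentions: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Map (name, id attribute or None) pairs to internal author IDs."""
    # First pass: collect every explicit id seen for each slug.
    ids_by_slug: Dict[str, Set[str]] = {}
    for name, ident in mentions:
        if ident is not None:
            ids_by_slug.setdefault(slugify(name), set()).add(ident)

    resolved: List[str] = []
    for name, ident in mentions:
        slug = slugify(name)
        known = ids_by_slug.get(slug, set())
        if ident is not None:
            resolved.append(ident)              # explicit id (e.g. ORCID) wins
        elif len(known) == 0:
            resolved.append(slug)               # no id anywhere: use the slug
        elif len(known) == 1:
            resolved.append(next(iter(known)))  # same person as the one id
        else:
            raise ValueError(f"{name!r} is ambiguous; an explicit id is required")
    return resolved
```

A bare "Matt Post" is merged with the one "Matt Post" that carries an ORCID, while a bare "Yang Liu" raises an error as soon as two different ids exist for that slug.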

As far as author URLs, I would say stick with anthology/people/matt-post/ except when ambiguous, in which case anthology/people/matt-post-$ORCID/.

nschneid · 2020-11-10T20:52:55Z

Would this mean overloading the id field to be sometimes ORCID (in new data), sometimes current ID for different papers from the same individual? I worry that this would be confusing for users of the data, who would expect different explicit id values to refer to different individuals. Might be better to have a separate orcid field.

danielgildea · 2020-11-10T21:18:36Z

I was imagining that we would replace the current IDs with ORCIDs when we find out the ORCIDs.

nschneid · 2020-11-10T21:21:52Z

Would these be manually reviewed? Just want to be sure new sources of noise are distinguished from authoritative pieces of metadata.

danielgildea · 2020-11-10T21:22:59Z

Yes.
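Alongside manual review, one cheap automatic sanity check is possible: the last character of an ORCID iD is a check digit computed with ISO 7064 MOD 11-2, so mistyped iDs can be caught before anyone looks at them. A minimal sketch follows; the function names are my own, and passing the check only means the iD is well formed, not that it belongs to the right person.

```python
import re


def orcid_check_digit(base15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the 0000-0000-0000-000X shape and the final check character."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]", orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]
```

For the iD mentioned earlier in this thread, `is_valid_orcid("0000-0002-1297-6794")` returns `True`.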

mjpost · 2020-12-04T14:59:26Z

I like this, but what about the minor change of using people/matt-post/$ORCID/ as the author URL instead? This lets us easily identify all authors with a single SLUG and create a disambiguation landing page, and also follows conventions used by other services, e.g., my page on Semantic Scholar.

One other thing this addresses: for authors we disambiguate manually, we can keep their ID that we choose for them. Should we ever get an ORCID for them, we can easily create a link to that as their canonical author page, so as to create link permanence.
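A short sketch of the slug/ID layout and of how the shared slug makes the landing page easy to build. The function names are hypothetical, and the ID part is either an ORCID or a manually chosen ID.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def author_url(slug: str, author_id: str) -> str:
    """Build people/<slug>/<id>/ where id is an ORCID or a manual ID."""
    return f"people/{slug}/{author_id}/"


def landing_pages(authors: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group author URLs under their slug's landing page, people/<slug>/."""
    pages: Dict[str, List[str]] = defaultdict(list)
    for slug, author_id in authors:
        pages[f"people/{slug}/"].append(author_url(slug, author_id))
    return {page: sorted(urls) for page, urls in pages.items()}
```

Every author sharing a slug ends up listed under the same people/<slug>/ page, which is what a disambiguation landing page needs.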

Issue #623

Now generates author pages with urls in form name/id. For most people this looks like:

people/d/david-chiang/david-chiang/

Matt Post has an ORCID in name_variants.yaml, so his page is:

people/m/matt-post/0000-0002-1297-6794/

and then there is:

people/y/yang-liu/yang-liu-edinburgh/
people/y/yang-liu/yang-liu-ict/
people/y/yang-liu/yang-liu-icsi/
people/y/yang-liu/yang-liu-umich/

I don't know how to make the old URLs people/m/matt-post/ resolve.

mjpost assigned davidweichiang, akoehn and mbollmann Nov 5, 2019

mjpost mentioned this issue Nov 10, 2020

Correction: Diacritics missing from author name though present in PDF #333

Open

danielgildea mentioned this issue Jan 3, 2021

author urls in style name/id #1179

Draft

mbollmann mentioned this issue Mar 8, 2024

Question about ACL profiles - needed for changes to OpenReview #3112

Closed

mjpost pinned this issue Jun 21, 2024
