Author name truecasing #643

Open
nschneid opened this issue Nov 12, 2019 · 15 comments
Comments

@nschneid
Contributor

nschneid commented Nov 12, 2019

Related to #638, #641: Many author names in EMNLP 2019 are all-caps or all-lowercase, presumably because that is how they appear in START. It seems impractical to fix them manually for every conference. Should there be a heuristic in the ingestion script that corrects these? For example:

  • Let "word" be a segment of the name when splitting on spaces and hyphens.
  • If the first name contains no capital letters, capitalize the first character of every word in the first name. Likewise for last name.
  • If the first name has more than one uppercase letter and no lowercase letters, lowercase all but the first letter in each word. Likewise for last name, except the last word of the last name if it is "II", "III", or "IV".

The canonical form in name_variants.yaml could serve as a whitelist for known exceptions, e.g. "Balamurali AR". Note that the above heuristics preserve mixed-case names like ChengXiang and McKinley, so these do not need whitelisting.
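
A minimal sketch of the heuristic above, for illustration only. The function names and the shape of `whitelist` (a set of `(first, last)` pairs standing in for the canonical forms in name_variants.yaml) are assumptions, not existing ingestion code.

```python
import re

SUFFIXES = {"II", "III", "IV"}  # kept verbatim at the end of a last name


def _capitalize_words(part, lower_rest):
    """Capitalize each word; words are separated by spaces or hyphens."""
    pieces = re.split(r"([ -])", part)  # capturing group keeps the separators
    out = []
    for piece in pieces:
        if not piece or piece in (" ", "-"):
            out.append(piece)
        elif lower_rest:
            out.append(piece[0].upper() + piece[1:].lower())
        else:
            out.append(piece[0].upper() + piece[1:])
    return "".join(out)


def truecase_part(part, is_last=False):
    """Apply the heuristic to a first name or a last name."""
    n_upper = sum(ch.isupper() for ch in part)
    has_lower = any(ch.islower() for ch in part)
    if n_upper == 0:
        # No capitals at all: capitalize the first character of every word.
        return _capitalize_words(part, lower_rest=False)
    if n_upper > 1 and not has_lower:
        # All caps: lowercase everything but the first letter of each word,
        # leaving a trailing II/III/IV in a last name untouched.
        words = part.split(" ")
        suffix = ""
        if is_last and len(words) > 1 and words[-1] in SUFFIXES:
            suffix = " " + words.pop()
        return _capitalize_words(" ".join(words), lower_rest=True) + suffix
    return part  # mixed case (ChengXiang, McKinley) is preserved


def truecase_name(first, last, whitelist=frozenset()):
    """whitelist is a hypothetical set of (first, last) known exceptions."""
    if (first, last) in whitelist:
        return first, last
    return truecase_part(first), truecase_part(last, is_last=True)
```

With this sketch, `truecase_name("xiaoming", "ZHANG")` gives `("Xiaoming", "Zhang")`, while `("Balamurali", "AR")` is only preserved if it is listed in `whitelist`.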

@davidweichiang
Collaborator

I think this is more important to fix than #590, because our BibTeX styles do not change case in author names. But getting the heuristics right could be tricky.

I believe that Balamurali AR is not an edge case; there are a lot of South Asian names that use initials without periods.

Also, MCKINLEY would not be correctly lowercased by this heuristic; that could be a somewhat common case.

There might be some authors who insist on having their names in all caps or all lowercase. I think I would be okay with using name_variants.yaml to record these as exceptions.

@nschneid
Contributor Author

> Also, MCKINLEY would not be correctly lowercased by this heuristic; that could be a somewhat common case.

I went through the all-caps names in EMNLP 2019, and most were Chinese surnames. I suspect it is a convention in China to write romanized surnames in all-caps.

If we're really worried about mckinley/MCKINLEY and similar, we could have an additional heuristic which matches against existing names in the database.
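
One way such a lookup could work, as a rough sketch: index the names already in the database case-insensitively, and adopt the existing mixed-case spelling only when it is unambiguous. `build_case_index` and `lookup_truecase` are hypothetical names, and the list of known names is assumed to be supplied by the caller.

```python
from collections import defaultdict


def build_case_index(known_names):
    """Map each case-folded name to the set of spellings seen so far."""
    index = defaultdict(set)
    for name in known_names:
        index[name.casefold()].add(name)
    return index


def lookup_truecase(name, index):
    """Return the known mixed-case spelling of an all-caps or all-lowercase
    name if exactly one exists; otherwise return the name unchanged."""
    if not (name.isupper() or name.islower()):
        return name  # already mixed case, nothing to do
    candidates = {
        known
        for known in index.get(name.casefold(), ())
        if not (known.isupper() or known.islower())
    }
    if len(candidates) == 1:
        return next(iter(candidates))
    return name  # unknown or ambiguous: leave for another rule or a human
```

Given an index built from `["McKinley", "Lu"]`, both `"MCKINLEY"` and `"mckinley"` would map to `"McKinley"`, and a name with no match is returned untouched.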

@davidweichiang
Collaborator

There's already some code to match against existing author names. It could be updated and improved, and that might address this problem partly.

I've suggested in the past that we might consider contacting some people and asking them to update their START profiles. ACL 2020 is asking them to do it right now anyway.

I just looked at the EMNLP 2019 list too and saw a couple of French names and an Indian name where the surname was in all caps. I agree that your heuristic is going to be 99% correct for names written in all caps.

But I think names like Balamurali AR are common enough to worry about. It won't do to put in an exception for names that are two or three letters long, because many Chinese names are also two or three letters long.

@davidweichiang
Collaborator

Would it be too specific to apply your heuristic only to names that are written in Pinyin, which is very easy to check?
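
A rough sketch of what such a check might look like. It is an approximation under stated assumptions: it accepts any toneless initial + final combination (with `v` standing in for ü), so it over-accepts some strings that are not real syllables; a lookup in a complete syllable table would be more precise. The function name is hypothetical.

```python
import re

# Approximate structure of a toneless Pinyin syllable: optional initial + final.
_INITIAL = r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?"
_FINAL = (
    r"(?:iang|iong|uang|ang|eng|ing|ong|iao|ian|uai|uan|"
    r"ai|ei|ao|ou|an|en|er|ia|ie|iu|in|ua|uo|ui|un|ue|ve|"
    r"a|o|e|i|u|v)"
)
_SYLLABLE = re.compile(_INITIAL + _FINAL)


def looks_like_pinyin_syllable(word):
    """True if the whole word has the shape of a single Pinyin syllable."""
    return _SYLLABLE.fullmatch(word.lower()) is not None
```

Under this approximation `"ZHANG"` and `"LU"` pass, while `"AR"` and `"SMITH"` do not.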

@nschneid
Contributor Author

I don't know how that is checked but it should cover most of the cases. Maybe the rest should require a manual decision to whitelist or truecase.

@nschneid
Contributor Author

And the manual decision can usually be made by checking the PDF. Even better if we could scrape the author capitalization from the PDF, but that might be too hard.

@davidweichiang
Collaborator

We do have a script that scrapes from PDF. It is not run regularly, though. And sometimes authors use all caps in the PDF too.

The Pinyin filter would be a good 90% solution; my main worry is that a language-specific rule could be perceived as discriminatory.

@nschneid
Contributor Author

nschneid commented Nov 15, 2019

Eh...it seems to me the status quo is (unintentionally) discriminatory against people whose surnames are sometimes entered in all-caps, because inconsistencies will make it harder to browse their work. And these are disproportionately people from China. So it makes sense to correct that, and ideally not in a way that hypercorrects South Asian abbreviated names.

Ideally this would be something that START would encourage authors to specify consistently in the first place ("You entered 'LU', but ACL style is to use only initial capitals within names. Did you mean 'Lu'?"). But we don't really have control over that.

@davidweichiang
Collaborator

I agree that ideally this should happen earlier than ingestion into the Anthology, because names from START also appear in the conference website, handbook, etc.

@davidweichiang
Collaborator

I tried a simpler version of these heuristics on the EMNLP 2018 authors, and it worked perfectly except for one possible false positive (the first name "cmcc"). The heuristic is:

  • If the first name is all lowercase, change it to title case (Python str.title() method).
  • If the first name is all uppercase and is either 4 or more characters long or a Pinyin syllable, change it to title case.
  • Similarly for the last name.
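
The three rules above can be sketched as follows. This is an illustration, not the script that was actually run: `fix_case` is a hypothetical name, and the Pinyin test is passed in as a caller-supplied predicate because the thread does not say how it is implemented.

```python
def fix_case(name, is_pinyin_syllable=lambda s: False):
    """Apply the simplified heuristic to one first or last name.

    is_pinyin_syllable is an assumed, caller-supplied predicate.
    """
    if name.islower():
        return name.title()
    if name.isupper() and (len(name) >= 4 or is_pinyin_syllable(name)):
        return name.title()
    return name


# Tiny stand-in predicate for demonstration only; not a real Pinyin check.
demo_pinyin = lambda s: s.lower() in {"lu", "li", "wang"}
print(fix_case("ZHANG"))            # long all-caps name is title-cased
print(fix_case("LU", demo_pinyin))  # short, but accepted by the predicate
print(fix_case("AR", demo_pinyin))  # short and not Pinyin: left alone
```

Note that `fix_case("cmcc")` returns `"Cmcc"`, which is the possible false positive mentioned above.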

@davidweichiang
Collaborator

FWIW, START does have a tool in the pub chair console for correcting case problems in both titles and authors. I don't know whether it is regularly used. It also makes some mistakes (e.g., III is converted to Iii, and di is not converted to Di even if part of a Chinese name). And presumably changes to author names are not propagated back up to the global profile.

@davidweichiang
Collaborator

Running this heuristic on the current Anthology authors yields 872 corrections. There are some false positives, though. Some seem fixable (MAXWELL III -> Maxwell Iii) but some seem tougher, especially corporate authors like ARC A3 or TIPSTER SE/CM.
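
The `Iii` case comes from `str.title()` itself, which capitalizes only the first letter of each run of letters. A possible fix, sketched here with a hypothetical helper name, is to title-case everything except a trailing Roman-numeral suffix. Corporate authors would still need a manual exception list.

```python
ROMAN_SUFFIXES = {"II", "III", "IV"}


def title_keep_suffix(name):
    """Title-case a name but keep a trailing II/III/IV in capitals."""
    words = name.split(" ")
    if len(words) > 1 and words[-1].upper() in ROMAN_SUFFIXES:
        head = " ".join(words[:-1]).title()
        return head + " " + words[-1].upper()
    return name.title()


print("MAXWELL III".title())             # Maxwell Iii
print(title_keep_suffix("MAXWELL III"))  # Maxwell III
```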

@nschneid
Contributor Author

Nice! Could we run this periodically and record the exceptions as having been manually checked?
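
For instance, a periodic run could skip anything already reviewed by hand. A minimal sketch, assuming the reviewed exceptions are available as a set of names and the heuristic is passed in as a function; `propose_changes` is a hypothetical name, not one of the existing scripts.

```python
def propose_changes(names, fix, reviewed=frozenset()):
    """Return (old, new) pairs for names the heuristic would change,
    skipping names already checked by hand."""
    changes = []
    for name in names:
        if name in reviewed:
            continue  # manually confirmed earlier; do not flag again
        fixed = fix(name)
        if fixed != name:
            changes.append((name, fixed))
    return changes
```

The resulting list can then be corrected by hand before anything is applied, and the names that were deliberately left unchanged added to `reviewed` for the next run.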

@davidweichiang
Collaborator

It would be a tedious process each time. I am hoping that START will incorporate something like this so we don't have to deal with it. But otherwise, it would make most sense, I think, to have it run automatically at ingestion time.

davidweichiang added a commit that referenced this issue Dec 17, 2019
This was done by:

- author_case.py to generate a list of potential changes
- hand correct the list of changes
- change_authors.py to apply the changes
@mjpost
Member

mjpost commented Nov 10, 2020

With commit ab92b62, ingest.py now prompts the ingestor to confirm capitalization when it discovers all-lowercase or all-uppercase names. Truecasing would be a better approach, but I think this quick fix probably captures 99% of instances.

I agree START should do this, but it also seems fixable at ingestion time, which is a longer-term solution.
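
To illustrate the idea of a confirmation prompt (this is not the code from commit ab92b62; the function names, the prompt wording, and the title-cased suggestion are all assumptions), a check at ingestion time might look roughly like this:

```python
def needs_confirmation(name):
    """True for names written entirely in lowercase or entirely in uppercase."""
    return name.islower() or name.isupper()


def confirm_name(name, ask=input):
    """Ask the ingestor about a single-case name.

    `ask` defaults to input() and can be replaced for testing.
    Empty answer or 'y' accepts the suggestion, 'n' keeps the original,
    and anything else is taken as the corrected name.
    """
    if not needs_confirmation(name):
        return name
    suggestion = name.title()
    answer = ask(f"Name '{name}' is single-case. Use '{suggestion}'? [Y/n/other] ").strip()
    if answer.lower() in ("", "y", "yes"):
        return suggestion
    if answer.lower() in ("n", "no"):
        return name
    return answer
```

Letting the ingestor type a replacement covers cases such as MCKINLEY, where neither the original nor the title-cased suggestion is right.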

najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021
This was done by:

- author_case.py to generate a list of potential changes
- hand correct the list of changes
- change_authors.py to apply the changes