Author name truecasing #643

nschneid · 2019-11-12T14:37:06Z

Related to #638, #641: Many author names in EMNLP 2019 are all-caps or all-lowercase, presumably because that is how they appear in START. It seems impractical to fix them manually for every conference. Should there be a heuristic in the ingestion script that corrects these? For example:

Let "word" be a segment of the name when splitting on spaces and hyphens.
If the first name contains no capital letters, capitalize the first character of every word in the first name. Likewise for the last name.
If the first name has more than one uppercase letter and no lowercase letters, lowercase all but the first letter in each word. Likewise for the last name, except that the last word of the last name is left unchanged if it is "II", "III", or "IV".

The canonical form in name_variants.yaml could serve as a whitelist for known exceptions, e.g. "Balamurali AR". Note that the above heuristics preserve mixed-case names like ChengXiang and McKinley, so these do not need whitelisting.
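A minimal sketch of the proposed rules, for illustration only. The function names and the shape of the exceptions set are hypothetical, and the first/last split of "Balamurali AR" is assumed; nothing here is taken from the ingestion script.

```python
import re

ROMAN_SUFFIXES = {"II", "III", "IV"}

def split_words(part):
    # Split on spaces and hyphens, keeping the separators so the name can be rejoined.
    return re.split(r"([ -])", part)

def truecase_part(part, is_last_name=False):
    words = split_words(part)
    n_upper = sum(c.isupper() for c in part)
    has_lower = any(c.islower() for c in part)
    if n_upper == 0:
        # No capitals at all: capitalize the first character of every word.
        return "".join(w[:1].upper() + w[1:] for w in words)
    if n_upper > 1 and not has_lower:
        # All caps: lowercase everything but the first letter of each word.
        fixed = [w[:1] + w[1:].lower() for w in words]
        if is_last_name and words[-1] in ROMAN_SUFFIXES:
            fixed[-1] = words[-1]  # keep "II", "III", "IV" as written
        return "".join(fixed)
    return part  # mixed case (ChengXiang, McKinley) is left alone

def truecase_name(first, last, exceptions=frozenset()):
    # `exceptions` stands in for canonical forms from name_variants.yaml (assumed format).
    if (first, last) in exceptions:
        return first, last
    return truecase_part(first), truecase_part(last, is_last_name=True)

print(truecase_name("jean-pierre", "MAXWELL III"))  # ('Jean-Pierre', 'Maxwell III')
print(truecase_name("Balamurali", "AR", {("Balamurali", "AR")}))  # ('Balamurali', 'AR')
```

Without the exception entry, "AR" would be rewritten to "Ar", which is why the whitelist matters for names written as initials.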

davidweichiang · 2019-11-12T17:34:48Z

Unlike #590, I think this is more important to fix, because our BibTeX styles do not change case in author names. But getting the heuristics right could be tricky.

I believe that Balamurali AR is not an edge case; there are a lot of South Asian names that use initials without periods.

Also, MCKINLEY would not be correctly lowercased by this heuristic; that could be a somewhat common case.

There might be some authors who insist on having their names in all caps or all lowercase. I think I would be okay with using name_variants.yaml to record these as exceptions.

nschneid · 2019-11-12T19:28:43Z

> Also, MCKINLEY would not be correctly lowercased by this heuristic; that could be a somewhat common case.

I went through the all-caps names in EMNLP 2019, and most were Chinese surnames. I suspect it is a convention in China to write romanized surnames in all-caps.

If we're really worried about mckinley/MCKINLEY and similar, we could have an additional heuristic which matches against existing names in the database.
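One way such a lookup might work, as a rough sketch: index the names already in the database by their case-folded form, and reuse a known mixed-case spelling when exactly one exists. The function names are hypothetical and this is not taken from the Anthology codebase.

```python
def build_case_index(known_names):
    """Map each case-folded name to the set of casings already seen for it."""
    index = {}
    for name in known_names:
        index.setdefault(name.casefold(), set()).add(name)
    return index

def lookup_casing(name, index):
    """Return the single known mixed-case spelling of `name`, or None if unclear."""
    candidates = {
        known
        for known in index.get(name.casefold(), ())
        if known != known.upper() and known != known.lower()
    }
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown name, or several competing spellings

index = build_case_index(["McKinley", "MCKINLEY", "Zhang"])
print(lookup_casing("MCKINLEY", index))  # McKinley
print(lookup_casing("SMITH", index))     # None
```

Returning None when several spellings compete leaves the decision to a human rather than guessing.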

davidweichiang · 2019-11-12T19:39:14Z

There's already some code to match against existing author names. It could be updated and improved, and that might address this problem partly.

I've suggested in the past that we might consider contacting some people and asking them to update their START profiles. ACL 2020 is asking them to do it right now anyway.

I just looked at the EMNLP 2019 list too and saw a couple of French names and an Indian name where the surname was in all caps. I agree that your heuristic is going to be 99% correct for names written in all caps.

But I think names like Balamurali AR are common enough to worry about. It won't do to put in an exception for names that are two or three letters long, because many Chinese names are also two or three letters long.

davidweichiang · 2019-11-15T19:49:24Z

Would it be too specific to apply your heuristic only to names that are written in Pinyin, which is very easy to check?

nschneid · 2019-11-15T19:55:42Z

I don't know how that is checked but it should cover most of the cases. Maybe the rest should require a manual decision to whitelist or truecase.
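As a rough illustration of how a Pinyin check could be done, the sketch below matches a name against "optional initial + final" with a regular expression. This is an approximation that assumes toneless Pinyin with "v" standing in for "ü", and it overgenerates: it accepts some initial/final combinations that are not real syllables. A lookup in a complete syllable table would be stricter.

```python
import re

# Approximate shape of a toneless Pinyin syllable: optional initial + final.
PINYIN_RE = re.compile(
    r"(?:zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw])?"
    r"(?:iang|iong|uang|ueng|ang|eng|ing|ong|iao|ian|uai|uan|"
    r"ai|ei|ao|ou|an|en|in|un|ia|ie|iu|ua|uo|ui|ue|er|a|o|e|i|u|v)"
)

def is_pinyin_syllable(name):
    """True if `name` looks like a single Pinyin syllable (approximate check)."""
    return PINYIN_RE.fullmatch(name.lower()) is not None

print(is_pinyin_syllable("LU"))     # True
print(is_pinyin_syllable("Zhang"))  # True
print(is_pinyin_syllable("AR"))     # False
```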

nschneid · 2019-11-15T20:02:46Z

And the manual decision can usually be made by checking the PDF. Even better if we could scrape the author capitalization from the PDF, but that might be too hard.

davidweichiang · 2019-11-15T20:16:50Z

We do have a script that scrapes from PDF. It is not run regularly, though. And sometimes authors use all caps in the PDF too.

The Pinyin filter would be a good 90% solution; my main worry is that a language specific rule could be perceived as discriminatory.

nschneid · 2019-11-15T20:32:18Z

Eh...it seems to me the status quo is (unintentionally) discriminatory against people whose surnames are sometimes entered in all-caps, because inconsistencies will make it harder to browse their work. And these are disproportionately people from China. So it makes sense to correct that, and ideally not in a way that hypercorrects South Asian abbreviated names.

Ideally this would be something that START would encourage authors to specify consistently in the first place ("You entered 'LU', but ACL style is to use only initial capitals within names. Did you mean 'Lu'?"). But we don't really have control over that.

davidweichiang · 2019-11-25T16:07:43Z

I agree that ideally this should happen earlier than ingestion into the Anthology, because names from START also appear in the conference website, handbook, etc.

davidweichiang · 2019-12-10T17:40:19Z

I tried a simpler version of these heuristics on the EMNLP 2018 authors, and it worked perfectly except for one possible false positive (the first name "cmcc"). The heuristic is:

If the first name is all lowercase, change it to title case (Python str.title() method).
If the first name is all uppercase and either is 4 or more characters long or is a Pinyin syllable, change it to title case.
Similarly for the last name.
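The simpler heuristic can be sketched as follows. The Pinyin test is passed in as a predicate; the default uses a tiny illustrative sample of syllables, which is an assumption made to keep the example self-contained and not a real syllable table.

```python
# Tiny illustrative sample only; a real check would need a full Pinyin syllable table.
SAMPLE_PINYIN = {"lu", "li", "wu", "xu", "yu", "ma", "he", "liu", "wei", "sun"}

def fix_case(part, is_pinyin=lambda s: s.lower() in SAMPLE_PINYIN):
    """Apply the heuristic to one name part (a first name or a last name)."""
    if part.islower():
        return part.title()
    if part.isupper() and (len(part) >= 4 or is_pinyin(part)):
        return part.title()
    return part

print(fix_case("LU"))        # Lu
print(fix_case("AR"))        # AR  (short, not Pinyin: left alone)
print(fix_case("McKinley"))  # McKinley
print(fix_case("cmcc"))      # Cmcc  (the possible false positive noted above)
```

Short all-caps strings that are not Pinyin syllables pass through unchanged, which is what protects initials such as "AR".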

davidweichiang · 2019-12-10T21:00:05Z

FWIW, START does have a tool in the pub chair console for correcting case problems in both titles and authors. I don't know whether it is regularly used. It also makes some mistakes (e.g., III is converted to Iii, and di is not converted to Di even if part of a Chinese name). And presumably changes to author names are not propagated back up to the global profile.

davidweichiang · 2019-12-12T01:10:20Z

Running this heuristic on the current Anthology authors yields 872 corrections. There are some false positives, though. Some seem fixable (MAXWELL III -> Maxwell Iii) but some seem tougher, especially corporate authors like ARC A3 or TIPSTER SE/CM.
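The Roman-numeral false positive comes straight from str.title(), and one possible fix is to restore a trailing "II", "III", or "IV" after title-casing. This is a sketch with a hypothetical helper name; it is meant for last names only and does nothing for corporate authors such as ARC A3.

```python
import re

# A trailing generational suffix, matched case-insensitively after title-casing.
ROMAN_SUFFIX = re.compile(r"\b(?:II|III|IV)$", re.IGNORECASE)

def title_keep_roman(last_name):
    """Title-case a last name, then restore a trailing II / III / IV."""
    titled = last_name.title()
    return ROMAN_SUFFIX.sub(lambda m: m.group(0).upper(), titled)

print("MAXWELL III".title())            # Maxwell Iii
print(title_keep_roman("MAXWELL III"))  # Maxwell III
print(title_keep_roman("LIV"))          # Liv
```

The word boundary in the pattern keeps names that merely end in those letters, such as "Liv", from being touched.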

nschneid · 2019-12-12T01:16:11Z

Nice! Could we run this periodically and record the exceptions as having been manually checked?

davidweichiang · 2019-12-12T01:38:08Z

It would be a tedious process each time. I am hoping that START will incorporate something like this so we don't have to deal with it. But otherwise, it would make most sense, I think, to have it run automatically at ingestion time.

This was done by:
- author_case.py to generate a list of potential changes
- hand correct the list of changes
- change_authors.py to apply the changes

mjpost · 2020-11-10T14:52:23Z

With commit ab92b62, ingest.py now prompts the ingestor to confirm capitalization when it discovers all-lowercase or all-uppercase names. Truecasing would be a better approach, but I think this quick fix probably captures 99% of instances.

I agree START should do this, but it also seems fixable at ingestion time, which is a longer-term solution.
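A confirmation step of this kind might look roughly like the sketch below. It is not the code from commit ab92b62; the function names and the prompt wording are hypothetical, and the `ask` parameter exists only so the prompt can be replaced in tests.

```python
def needs_case_check(name):
    """True if the name's cased letters are all lowercase or all uppercase."""
    return name.islower() or name.isupper()

def confirm_name(name, ask=input):
    """Return `name` unchanged, or the correction typed in by the person ingesting."""
    if not needs_case_check(name):
        return name
    answer = ask(f"Name '{name}' may be miscased. Type a correction, or press Enter to keep it: ")
    return answer.strip() or name

# Non-interactive usage with a stand-in for input():
print(confirm_name("LU", ask=lambda prompt: "Lu"))  # Lu
print(confirm_name("LU", ask=lambda prompt: ""))    # LU
```

Keeping a human in the loop avoids hypercorrecting names that are meant to be written as initials.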


davidweichiang mentioned this issue Nov 12, 2019

LaTeX processing is not being done on ingestion #644

Closed

davidweichiang mentioned this issue Nov 25, 2019

check for abstracts cut-and-pasted from the PDF #666

Open

davidweichiang added a commit that referenced this issue Dec 13, 2019

Auto correct author name case (#643)

115f03c

davidweichiang added a commit that referenced this issue Dec 17, 2019

Auto correct author name case (#643) (#695)

853c9ac
