Auto correct author name case (#643) #695

Merged
merged 7 commits into from
Dec 17, 2019

Conversation

davidweichiang
Collaborator

This was done by:

  • author_case.py to generate a list of potential changes
  • hand correct the list of changes
  • change_authors.py to apply the changes
  • clean_name_variants.py to remove unused entries in name_variants.yaml
  • annoyingly, manually replace a few intentionally unused entries in name_variants.yaml

@davidweichiang
Collaborator Author

I didn't understand why the build failed -- something about trailing whitespace?

@davidweichiang
Collaborator Author

@nschneid are you interested in skimming through the diffs to spot any errors?

data/xml/E87.xml Outdated
@@ -183,10 +183,10 @@
</paper>
<paper id="29">
<title>AUXILIARIES AND CLITICS IN FRENCH UCG GRAMMAR</title>
-<author id="karine-baschung"><first>K.</first><last>BASCHUNG</last></author>
+<author id="karine-baschung"><first>K.</first><last>Baschung</last></author>
<author id="gabriel-g-bes"><first>G.G.</first><last>BES</last></author>
Contributor

False negative?

data/xml/I08.xml Outdated
@@ -1002,7 +1002,7 @@
</paper>
<paper id="144">
<title>Cross Lingual Information Access System for <fixed-case>I</fixed-case>ndian Languages</title>
-<author><first>CLIA</first><last>Consortium</last></author>
+<author><first>Clia</first><last>Consortium</last></author>
Contributor

False positive: acronym

data/xml/M93.xml Outdated
@@ -103,9 +103,9 @@
</paper>
<paper id="14">
<title><fixed-case>NEC</fixed-case>: DESCRIPTION OF THE VENIEX SYSTEM AS USED FOR <fixed-case>MUC</fixed-case>-5</title>
-<author><first>Kazunori</first><last>MURAKI</last></author>
+<author><first>Kazunori</first><last>Muraki</last></author>
<author><first>Shinichi</first><last>DOI</last></author>
Contributor

False negative?

data/xml/P91.xml Outdated
@@ -195,7 +195,7 @@
</paper>
<paper id="24">
<title>EXPERIMENTS AND PROSPECTS OF EXAMPLE-BASED MACHINE TRANSLATION</title>
-<author><first>Eiichiro</first><last>SUMITA</last></author>
+<author><first>Eiichiro</first><last>Sumita</last></author>
<author><first>Hitoshi</first><last>HDA</last></author>
Contributor

Apparent OCR error: surname should be Iida

data/xml/W97.xml Outdated
@@ -143,7 +143,7 @@
<paper id="25">
<title>A Local Grammar-based Approach to Recognizing of Proper Names in <fixed-case>K</fixed-case>orean Texts</title>
<author><first>Jee-Sun</first><last>NAM</last></author>
Contributor

False negative

data/xml/C88.xml Outdated
-<author><first>Ronald M.</first><last>KAPLAN</last></author>
-<author><first>John T.</first><last>MAXWELL III</last></author>
+<author><first>Ronald M.</first><last>Kaplan</last></author>
+<author><first>John T.</first><last>Maxwell Iii</last></author>
Contributor

III

data/xml/C88.xml Outdated
<url>C88-2134</url>
</paper>
<paper id="135">
<title>A Computer Readability Formula of <fixed-case>J</fixed-case>apanese Texts for Machine Scoring</title>
-<author><last>TATEISI</last><first>Yuka</first></author>
+<author><last>Tateisi</last><first>Yuka</first></author>
<author><last>ONO</last><first>Yoshihiko</first></author>
Contributor

False negative

data/xml/C88.xml Outdated
@@ -850,22 +850,22 @@
<paper id="143">
<title>MASSIVE DISAMBIGUATION OF LARGE TEXT CORPORA WITH FLEXIBLE CATEGORIAL GRAMMAR</title>
<author><first>Ton</first><last>van der WOUDEN</last></author>
Contributor

False negative

data/xml/C92.xml Outdated
-<author><first>Fathi</first><last>DEBILI</last></author>
-<author><first>Elyes</first><last>SAMMOUDA</last></author>
+<author><first>Fathi</first><last>Debili</last></author>
+<author><first>Elyes</first><last>Sammouda</last></author>
<url>C92-2079</url>
</paper>
<paper id="80">
<title>TRANSLATION AMBIGUITY RESOLUTION BASED ON TEXT CORPORA OF SOURCE AND TARGET LANGUAGES</title>
<author><first>Shinichi</first><last>DOI</last></author>
Contributor

False negative

data/xml/C92.xml Outdated
-<author><first>Hideki</first><last>TANAKA</last></author>
-<author><first>Teruaki</first><last>AIZAWA</last></author>
+<author><first>Hideki</first><last>Tanaka</last></author>
+<author><first>Teruaki</first><last>Aizawa</last></author>
<author><first>Yeun-Bae</first><last>KIM</last></author>
Contributor

False negative

data/xml/C92.xml Outdated
@@ -1158,17 +1158,17 @@
</paper>
<paper id="192">
<title>BESOINS LEXICAUX A LA LUMIERE DE L’ANALYSE STATISTIQUE DU CORPUS DE TEXTES DU PROJET “BREF” - LE LEXIQUE “BDLEX” DU FRANCAIS ECRIT ET ORAL</title>
-<author><first>I.</first><last>FERRANE</last></author>
+<author><first>I.</first><last>Ferrane</last></author>
<author id="martine-de-calmes"><first>M.</first><last>de CALMES</last></author>
Contributor

False negative

data/xml/C98.xml Outdated
@@ -285,16 +285,16 @@
</paper>
<paper id="44">
<title>Veins Theory: A Model of Global Discourse Cohesion and Coherence</title>
-<author><first>Dan</first><last>CRISTEA</last></author>
+<author><first>Dan</first><last>Cristea</last></author>
<author><first>Nancy</first><last>IDE</last></author>
Contributor

False negative

data/xml/C98.xml Outdated
@@ -1308,8 +1308,8 @@
<paper id="204">
<title>Reactive Content Selection in the Generation of Real-time Soccer Commentary</title>
<author><first>Kumiko</first><last>TANAKA-Ishii</last></author>
Contributor

False negative

data/xml/C98.xml Outdated
@@ -1495,7 +1495,7 @@
</paper>
<paper id="234">
<title>Word Association and <fixed-case>MI</fixed-case>-Trigger-based Language Modeling</title>
-<author><first>GuoDong</first><last>ZHOU</last></author>
+<author><first>GuoDong</first><last>Zhou</last></author>
<author><first>KimTeng</first><last>LUA</last></author>
Contributor

False negative

@nschneid
Contributor

Done with my scan of the diffs. (Didn't mark all instances of repeated names like Nancy IDE.)

The main problem I saw was false negative all-caps names where coauthors were correctly truecased. Maybe there should be a heuristic that uses coauthor capitalization as a cue. Of course, I probably didn't see false negatives where no coauthor names were changed because they wouldn't show up in the diff.
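
The coauthor cue suggested here could be sketched roughly as follows. This is a hypothetical illustration, not part of author_case.py: the function name flag_by_coauthors and the exact rule (flag an all-caps last name when at least one coauthor on the same paper has a mixed-case last name) are assumptions.

```python
def is_all_caps(name: str) -> bool:
    """True if the name contains letters and every cased letter is uppercase."""
    return name.isupper()


def flag_by_coauthors(last_names):
    """Return the all-caps last names on a paper that also has at least one
    mixed-case last name. These are candidates for review, not certain errors."""
    has_mixed_case = any(
        not is_all_caps(name) and not name.islower() for name in last_names
    )
    if not has_mixed_case:
        return []
    return [name for name in last_names if is_all_caps(name)]


# Last names from the M93 hunk above, after the automatic change.
candidates = flag_by_coauthors(["Muraki", "DOI"])
```

As noted in the comment, a paper where every author is still in all caps yields no candidates under this rule, so it would not catch everything.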

Anyway, there may still be some errors but it's a huge number of fixes!

@akoehn
Member

akoehn commented Dec 13, 2019

I didn't understand why the build failed -- something about trailing whitespace?

We now have more validation, see #669. In short: use black for Python code formatting, and make sure XML and YAML are well-formed. make check tells you whether everything is fine before committing; make autofix fixes what can be fixed without user intervention. Pre-commit hooks to automatically check this are also available.

@davidweichiang
Collaborator Author

davidweichiang commented Dec 13, 2019 via email

@nschneid
Contributor

How risky would it be to recase any word that is CVC or VCV?

Hard for me to predict. How common are sequences of 3 initials without spaces or periods?
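
To make the CVC/VCV idea concrete, here is a minimal sketch of such a test. The names VOWELS, cv_pattern, and looks_like_name are hypothetical, and treating Y as a consonant is an assumption the thread does not settle.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def cv_pattern(word: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (anything else)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.upper())


def looks_like_name(word: str) -> bool:
    """True for a three-letter all-caps word shaped CVC or VCV."""
    return (
        len(word) == 3
        and word.isalpha()
        and word.isupper()
        and cv_pattern(word) in {"CVC", "VCV"}
    )
```

Under this rule IDE (VCV) and KIM (CVC) would be recased, while DOI has the pattern CVV and would be left alone.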

@davidweichiang
Collaborator Author

Unfortunately I have a weird Python setup right now that prevents me from running venv, so many of the make targets don't work for me. :( I managed to run black, but don't know how to do the trailing whitespace check. @akoehn are there any more instructions anywhere? Thanks.

@davidweichiang
Collaborator Author

I think I may have the pre-commit hooks working, but I find two things about this confusing:

  • The offending files have already been committed, so there's no way to run pre-commit checks on them; I have to make some unrelated change to the file and re-stage it in order to run the check
  • The check that GitHub runs only says that the trailing whitespace check failed; it doesn't tell me which file it failed on.

@davidweichiang
Collaborator Author

@nschneid Good question. In our current XML, these are all of them (I'm surprised there are so few):

C16.xml:      <author><first>Hayate</first><last>ISO</last></author>
C16.xml:      <author><last>LAU</last><first>Raymond</first></author>
C18.xml:      <author><first>Avinesh</first><last>PVS</last></author>
C67.xml:      <author><first>Martin</first><last>KAY</last></author>
C82.xml:      <author><first>Danilo</first><last>FUM</last></author>
C86.xml:      <author><first>Masahiro</first><last>ABE</last></author>
C88.xml:      <author><first>Naoki</first><last>ABE</last></author>
C88.xml:      <author><first>Michael B.</first><last>KAC</last></author>
C88.xml:      <author><first>Ingolf</first><last>MAX</last></author>
C88.xml:      <author><first>Tsunehisa</first><last>DOI</last></author>
C88.xml:      <author><first>Paradip</first><last>DEY</last></author>
C88.xml:      <author><last>ONO</last><first>Yoshihiko</first></author>
C88.xml:      <author><first>Tomas</first><last>VLK</last></author>
C90.xml:      <author><first>Nancy M.</first><last>IDE</last></author>
C90.xml:      <author><first>Yung-Taek</first><last>KIM</last></author>
C92.xml:      <author><first>Martin</first><last>KAY</last></author>
C92.xml:      <author><first>Shinichi</first><last>DOI</last></author>
C92.xml:      <author><first>Yeun-Bae</first><last>KIM</last></author>
C98.xml:      <author><first>Nancy</first><last>IDE</last></author>
C98.xml:      <author><first>Nancy</first><last>IDE</last></author>
C98.xml:      <author><first>Shinichi</first><last>DOI</last></author>
C98.xml:      <author><first>KimTeng</first><last>LUA</last></author>
E85.xml:      <author><first>Danilo</first><last>FUM</last></author>
E87.xml:      <author id="gabriel-g-bes"><first>G.G.</first><last>BES</last></author>
E89.xml:      <author><first>Danilo</first><last>FUM</last></author>
I11.xml:      <author><first>Avinesh</first><last>PVS</last></author>
L10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
M93.xml:      <author><first>Shinichi</first><last>DOI</last></author>
N10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
P91.xml:      <author><first>Hitoshi</first><last>HDA</last></author>
R09.xml:      <author><first>Chaitanya</first><last>GSK</last></author>
S14.xml:      <author><first>Avinesh</first> <last>PVS</last></author>
W00.xml:      <author><first>Mamiko</first><last>OKA</last></author>
W04.xml:      <author><first>Chooi-Ling</first><last>GOH</last></author>
W10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
W10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
W11.xml:      <author><first>Chaitanya</first><last>GSK</last></author>
W12.xml:      <author><first>Santosh</first><last>GSK</last></author>
W97.xml:      <author><first>Jean-David</first><last>STA</last></author>
W97.xml:      <author><first>Jee-Sun</first><last>NAM</last></author>
W98.xml:      <author><first>Nancy</first><last>IDE</last></author>
W99.xml:      <author><first>Nancy</first><last>IDE</last></author>

I believe the only names here that are initials are GSK and PVS, so maybe these names should be whitelisted and everything else lowercased. Even VLK should be lowercased to Vlk.
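
The thread does not say how the listing above was produced. One way to get a similar listing is a small line filter like the sketch below; the function name find_caps_names and its parameters are hypothetical, and it only matches unaccented ASCII capitals.

```python
import re


def find_caps_names(lines, length, tag="last"):
    """Return the lines in which the given element (default <last>) consists
    of exactly `length` uppercase ASCII letters."""
    pattern = re.compile(r"<%s>[A-Z]{%d}</%s>" % (tag, length, tag))
    return [line for line in lines if pattern.search(line)]


sample = [
    "<author><first>Nancy</first><last>IDE</last></author>",
    "<author><first>Dan</first><last>Cristea</last></author>",
    "<author><first>Balamurali</first><last>AR</last></author>",
]
three_letter = find_caps_names(sample, 3)
```

Calling it with tag="first" would cover first names as well.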

In case you're curious (of course you are), here are all the two-capital-letter names:

C80.xml:      <author><first>JB</first><last>Berthelin</last></author>
D11.xml:      <author><first>Balamurali</first><last>AR</last></author>
D15.xml:      <author><first>CJ</first> <last>Linton</last></author>
L12.xml:      <author><first>Balamurali</first><last>AR</last></author>
P11.xml:      <author><first>Balamurali</first><last>AR</last></author>
P15.xml:      <author><first>Balamurali</first> <last>AR</last></author>
S16.xml:      <author><first>Soman</first> <last>KP</last></author>
S19.xml:      <author><first>Swapna</first><last>TR</last></author>
W07.xml:      <author><first>DJ</first><last>Hovermale</last></author>
W11.xml:      <author><first>Balamurali</first><last>AR</last></author>
W11.xml:      <author><first>Sowmya</first><last>VB</last></author>
W12.xml:      <author><first>Vidya Raj</first><last>RK</last></author>
W13.xml:      <author><first>SV</first><last>Ramanan</last></author>
W13.xml:      <author><first>Raymond T.</first><last>NG</last></author>
W14.xml:      <author><first>IV</first> <last>Ramakrishnan</last></author>
W15.xml:      <author><first>Muneeb</first> <last>TH</last></author>
W17.xml:      <author><first>Soman</first> <last>KP</last></author>
W18.xml:      <author><first>JT</first><last>Wolohan</last></author>

NG should be lowercased and the rest should all be kept.
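
Taken together, the two decisions (keep GSK and PVS, recase every other three-letter name, and recase only NG among the two-letter names) amount to something like the following sketch. The names KEEP_UPPER, FORCE_TITLE, and recase_short are hypothetical and this is not the code in author_case.py.

```python
KEEP_UPPER = {"GSK", "PVS"}  # three-letter names identified above as initials
FORCE_TITLE = {"NG"}         # the one two-letter name that should be recased


def recase_short(word: str) -> str:
    """Recase an all-caps word of two or three letters.
    Anything else is returned unchanged."""
    if not word.isupper() or len(word) not in (2, 3):
        return word
    if len(word) == 3:
        return word if word in KEEP_UPPER else word.title()
    return word.title() if word in FORCE_TITLE else word
```

A whitelist keeps the rule easy to audit: each exception is visible in one place and can be extended when a new set of initials turns up.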

@nschneid
Contributor

I believe the only names here that are initials are GSK and PVS, so maybe these names should be whitelisted and everything else lowercased.

Sounds good.

NG should be lowercased and the rest should all be kept.

Yup, a special rule for NG is probably warranted.

Thanks!

@davidweichiang
Collaborator Author

I believe this is ready to merge, and I believe it eliminates all all-uppercase names and all all-lowercase names (except for one karel Oliva who seems to have wanted it that way).

@davidweichiang
Collaborator Author

The current heuristic is:

  • If there is no first name, don't touch the last name (which often contains an acronym).
  • If the entire first name or entire last name is all lowercase, titlecase it (Python str.title()).
  • Else, go word by word (splitting on space, hyphen, and period). If a word is all uppercase, then:
      • If it has two letters and appears on a list of two-letter words that could be part of Chinese or Romance names, titlecase it.
      • Else if it does not appear on a list of words that should not be titlecased (III and a couple of other exceptions), titlecase it.
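
The rules above can be sketched in a few lines of Python. This is one reading of the description, not the actual author_case.py: the two word lists are hypothetical stand-ins (only NG, III, GSK, and PVS come from this thread), and the sketch assumes that a two-letter all-caps word that is not on the two-letter list is left alone.

```python
import re

# Hypothetical stand-ins for the lists mentioned above.
TWO_LETTER_NAME_WORDS = {"NG", "LI", "DE", "LA"}
KEEP_UPPER = {"III", "GSK", "PVS"}


def fix_word(word: str) -> str:
    """Recase a single all-caps word; leave every other word unchanged."""
    if not word.isupper():
        return word
    if len(word) == 2:
        return word.title() if word in TWO_LETTER_NAME_WORDS else word
    return word if word in KEEP_UPPER else word.title()


def fix_part(part: str) -> str:
    """Recase a whole first name or a whole last name."""
    if part.islower():
        return part.title()
    # Split on space, hyphen, and period, keeping the separators in the list.
    pieces = re.split(r"([ \-.])", part)
    return "".join(fix_word(piece) for piece in pieces)


def fix_name(first: str, last: str):
    """Apply the heuristic to one author; returns (first, last)."""
    if not first:
        return first, last  # no first name: the last name is often an acronym
    return fix_part(first), fix_part(last)
```

For example, fix_name("John T.", "MAXWELL III") leaves III alone and recases only MAXWELL. A first name such as CLIA would still be recased, which is why the hand-checking step remains useful.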

There are some further improvements that could be made: for example, McKINLEY -> McKinley and MCKINLEY -> McKinley.
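
A possible shape for the Mc improvement is sketched below. The function name fix_mc is hypothetical, and the rule is deliberately narrow: it only fires when everything after the Mc prefix is uppercase.

```python
def fix_mc(word: str) -> str:
    """Recase a surname that starts with Mc/MC and is otherwise all uppercase."""
    if len(word) > 2 and word[:2].lower() == "mc" and word[2:].isupper():
        return "Mc" + word[2:].title()
    return word
```

Both McKINLEY and MCKINLEY come out as McKinley, and an already correct McKinley is returned unchanged.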

How this currently works is that author_case.py generates a list of changes, so the user can delete any bad changes. Then change_authors.py applies the approved changes.
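
As an illustration of the hand-editable change list, the sketch below parses one line of such a list. The layout is an assumption, not the documented format of author_case.py: four tab-separated fields (paper ID, field name, old value, new value), with first and last name separated by ||| inside the old and new values. The name parse_change is hypothetical.

```python
def parse_change(line: str, sep: str = "|||"):
    """Split one tab-separated change line into
    (paper_id, field, (old_first, old_last), (new_first, new_last))."""
    paper_id, field, old, new = (
        part.strip() for part in line.rstrip("\n").split("\t")
    )

    def split_name(name: str):
        first, _, last = name.partition(sep)
        return first.strip(), last.strip()

    return paper_id, field, split_name(old), split_name(new)


change = parse_change("Z99-9999\tauthor\tARAVIND K. ||| JOSHI\tAravind K. ||| Joshi")
```

Keeping one change per line is what makes the review step cheap: rejecting a change means deleting its line.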

If @mjpost wants to include this (and possibly other automatic filters) into ingestion, I'm not sure what the best way is. Make the hand-checking part of the ingestion process? Or fully automate it and log any changes made?

@akoehn
Member

akoehn commented Dec 14, 2019

@davidweichiang you can run make check to check everything, the pre-commit hooks only run make check_commit, which only checks files that have edits.

If you can't run these checks, the easiest would probably be to run something like grep ' $' data/xml/*
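
If grep is not at hand, roughly the same check can be done in Python. This is a minimal sketch with a hypothetical function name; unlike the grep pattern it also reports trailing tabs, and it returns the file and line number of each hit.

```python
from pathlib import Path


def trailing_whitespace(paths):
    """Return (path, line_number) pairs for lines ending in a space or tab."""
    hits = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for number, line in enumerate(text.splitlines(), start=1):
            if line != line.rstrip(" \t"):
                hits.append((str(path), number))
    return hits
```

It could be called as trailing_whitespace(Path("data/xml").glob("*.xml")) from the repository root.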

@davidweichiang
Collaborator Author

@akoehn would it be bad to remove the makefile dependencies on venv so that the user does make venv as a separate step?

@mjpost another possibility for the ingestion pipeline is for any automatic tools to do their thing but log the changes in the XML itself, like

<provenance>changed author last from CHIANG to Chiang</provenance>

so that it would be easy for someone to go through later and correct errors?
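
A sketch of what recording such a note might look like with Python's standard xml.etree.ElementTree follows. The provenance element is only the proposal made in this comment; placing it inside the author element and the function name recase_last are assumptions.

```python
import xml.etree.ElementTree as ET


def recase_last(author: ET.Element, new_last: str) -> None:
    """Change an author's <last> text and append a <provenance> note
    describing the change. Does nothing if there is no change to make."""
    last = author.find("last")
    if last is None or last.text == new_last:
        return
    old = last.text
    last.text = new_last
    note = ET.SubElement(author, "provenance")  # hypothetical placement
    note.text = f"changed author last from {old} to {new_last}"
```

Because the note records both the old and the new value, a later reviewer can revert a bad change without consulting the version history.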

@mjpost
Member

mjpost commented Dec 14, 2019

Will look at this first thing next week. My main question is how to integrate this into ingestion, but I haven't had a chance to look yet and try to figure it out myself.

Member

@mjpost mjpost left a comment

I added some comments.

For example:

Z99-9999 \t author \t ARAVIND K. || JOSHI \t Aravind K. Joshi

Member

Missing ||| here in the final field.

Member

Example usage would be helpful here, too, for others.

"""author_case.py

Try to correct author names that are written in all uppercase or all lowercase.

Member

Can you add example usage to the top?

@davidweichiang
Collaborator Author

Better documentation added; once we figure out whether and how to incorporate this kind of fix into the ingestion pipeline, I will try to stabilize the interface and documentation more.

@akoehn
Member

akoehn commented Dec 17, 2019

would it be bad to remove the makefile dependencies on venv so that the user does make venv as a separate step?

Yes, because it is not possible. The commands in make are run in a subshell and cannot edit the user's environment variables (this is a good thing).

Also, the current setup checks that the venv is up to date and correctly set up at every step. Without it, users could forget to set up the venv, to enable it before every make run, or to update the venv if another commit has changed the dependencies.
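
The subshell point can be illustrated by analogy in Python, where a child process likewise cannot change its parent's environment. This is only a demonstration of process isolation, not of the project's Makefile, and the variable name DEMO_VENV_ACTIVE_695 is made up for the example.

```python
import os
import subprocess
import sys

# The child process sets a variable in its own environment and then exits.
child_code = "import os; os.environ['DEMO_VENV_ACTIVE_695'] = '1'"
subprocess.run([sys.executable, "-c", child_code], check=True)

# Back in the parent, the variable was never set.
parent_sees = os.environ.get("DEMO_VENV_ACTIVE_695")
```

Activating a venv works by modifying the calling shell's environment, which is exactly what a child process cannot do.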

@davidweichiang
Collaborator Author

@akoehn Oh, that makes sense. My workaround is to change VENV to point to a dummy script.

@davidweichiang
Collaborator Author

@mjpost okay to merge?

@mjpost
Member

mjpost commented Dec 17, 2019

LGTM.

@davidweichiang davidweichiang merged commit 853c9ac into master Dec 17, 2019
@davidweichiang davidweichiang deleted the author-case branch December 17, 2019 21:29
najtin pushed a commit to ir-anthology/ir-anthology that referenced this pull request Jun 9, 2021
This was done by:

- author_case.py to generate a list of potential changes
- hand correct the list of changes
- change_authors.py to apply the changes