Auto correct author name case (#643) #695

davidweichiang · 2019-12-13T04:12:59Z

This was done by:

- running author_case.py to generate a list of potential changes
- hand-correcting the list of changes
- running change_authors.py to apply the changes
- running clean_name_variants.py to remove unused entries in name_variants.yaml
- annoyingly, manually replacing a few intentionally unused entries in name_variants.yaml

davidweichiang · 2019-12-13T04:19:38Z

I didn't understand why the build failed -- something about trailing whitespace?

davidweichiang · 2019-12-13T04:20:38Z

@nschneid are you interested in skimming through the diffs to spot any errors?

nschneid · 2019-12-13T04:47:22Z

data/xml/E87.xml

@@ -183,10 +183,10 @@
    </paper>
    <paper id="29">
      <title>AUXILIARIES AND CLITICS IN FRENCH UCG GRAMMAR</title>
-      <author id="karine-baschung"><first>K.</first><last>BASCHUNG</last></author>
+      <author id="karine-baschung"><first>K.</first><last>Baschung</last></author>
      <author id="gabriel-g-bes"><first>G.G.</first><last>BES</last></author>


False negative?

nschneid · 2019-12-13T04:48:32Z

data/xml/I08.xml

@@ -1002,7 +1002,7 @@
    </paper>
    <paper id="144">
      <title>Cross Lingual Information Access System for <fixed-case>I</fixed-case>ndian Languages</title>
-      <author><first>CLIA</first><last>Consortium</last></author>
+      <author><first>Clia</first><last>Consortium</last></author>


False positive: acronym

nschneid · 2019-12-13T04:49:35Z

data/xml/M93.xml

@@ -103,9 +103,9 @@
    </paper>
    <paper id="14">
      <title><fixed-case>NEC</fixed-case>: DESCRIPTION OF THE VENIEX SYSTEM AS USED FOR <fixed-case>MUC</fixed-case>-5</title>
-      <author><first>Kazunori</first><last>MURAKI</last></author>
+      <author><first>Kazunori</first><last>Muraki</last></author>
      <author><first>Shinichi</first><last>DOI</last></author>


False negative?

nschneid · 2019-12-13T04:52:17Z

data/xml/P91.xml

@@ -195,7 +195,7 @@
    </paper>
    <paper id="24">
      <title>EXPERIMENTS AND PROSPECTS OF EXAMPLE-BASED MACHINE TRANSLATION</title>
-      <author><first>Eiichiro</first><last>SUMITA</last></author>
+      <author><first>Eiichiro</first><last>Sumita</last></author>
      <author><first>Hitoshi</first><last>HDA</last></author>


Apparent OCR error: surname should be Iida

nschneid · 2019-12-13T04:55:17Z

data/xml/W97.xml

@@ -143,7 +143,7 @@
    <paper id="25">
      <title>A Local Grammar-based Approach to Recognizing of Proper Names in <fixed-case>K</fixed-case>orean Texts</title>
      <author><first>Jee-Sun</first><last>NAM</last></author>


False negative

nschneid · 2019-12-13T05:00:36Z

data/xml/C88.xml

-      <author><first>Ronald M.</first><last>KAPLAN</last></author>
-      <author><first>John T.</first><last>MAXWELL III</last></author>
+      <author><first>Ronald M.</first><last>Kaplan</last></author>
+      <author><first>John T.</first><last>Maxwell Iii</last></author>


nschneid · 2019-12-13T05:01:57Z

data/xml/C88.xml

      <url>C88-2134</url>
    </paper>
    <paper id="135">
      <title>A Computer Readability Formula of <fixed-case>J</fixed-case>apanese Texts for Machine Scoring</title>
-      <author><last>TATEISI</last><first>Yuka</first></author>
+      <author><last>Tateisi</last><first>Yuka</first></author>
      <author><last>ONO</last><first>Yoshihiko</first></author>


False negative

nschneid · 2019-12-13T05:02:30Z

data/xml/C88.xml

@@ -850,22 +850,22 @@
    <paper id="143">
      <title>MASSIVE DISAMBIGUATION OF LARGE TEXT CORPORA WITH FLEXIBLE CATEGORIAL GRAMMAR</title>
      <author><first>Ton</first><last>van der WOUDEN</last></author>


False negative

nschneid · 2019-12-13T05:06:22Z

data/xml/C92.xml

-      <author><first>Fathi</first><last>DEBILI</last></author>
-      <author><first>Elyes</first><last>SAMMOUDA</last></author>
+      <author><first>Fathi</first><last>Debili</last></author>
+      <author><first>Elyes</first><last>Sammouda</last></author>
      <url>C92-2079</url>
    </paper>
    <paper id="80">
      <title>TRANSLATION AMBIGUITY RESOLUTION BASED ON TEXT CORPORA OF SOURCE AND TARGET LANGUAGES</title>
      <author><first>Shinichi</first><last>DOI</last></author>


False negative

nschneid · 2019-12-13T05:06:47Z

data/xml/C92.xml

-      <author><first>Hideki</first><last>TANAKA</last></author>
-      <author><first>Teruaki</first><last>AIZAWA</last></author>
+      <author><first>Hideki</first><last>Tanaka</last></author>
+      <author><first>Teruaki</first><last>Aizawa</last></author>
      <author><first>Yeun-Bae</first><last>KIM</last></author>


False negative

nschneid · 2019-12-13T05:09:07Z

data/xml/C92.xml

@@ -1158,17 +1158,17 @@
    </paper>
    <paper id="192">
      <title>BESOINS LEXICAUX A LA LUMIERE DE L’ANALYSE STATISTIQUE DU CORPUS DE TEXTES DU PROJET “BREF” - LE LEXIQUE “BDLEX” DU FRANCAIS ECRIT ET ORAL</title>
-      <author><first>I.</first><last>FERRANE</last></author>
+      <author><first>I.</first><last>Ferrane</last></author>
      <author id="martine-de-calmes"><first>M.</first><last>de CALMES</last></author>


False negative

nschneid · 2019-12-13T05:10:07Z

data/xml/C98.xml

@@ -285,16 +285,16 @@
    </paper>
    <paper id="44">
      <title>Veins Theory: A Model of Global Discourse Cohesion and Coherence</title>
-      <author><first>Dan</first><last>CRISTEA</last></author>
+      <author><first>Dan</first><last>Cristea</last></author>
      <author><first>Nancy</first><last>IDE</last></author>


False negative

nschneid · 2019-12-13T05:11:28Z

data/xml/C98.xml

@@ -1308,8 +1308,8 @@
    <paper id="204">
      <title>Reactive Content Selection in the Generation of Real-time Soccer Commentary</title>
      <author><first>Kumiko</first><last>TANAKA-Ishii</last></author>


False negative

nschneid · 2019-12-13T05:12:08Z

data/xml/C98.xml

@@ -1495,7 +1495,7 @@
    </paper>
    <paper id="234">
      <title>Word Association and <fixed-case>MI</fixed-case>-Trigger-based Language Modeling</title>
-      <author><first>GuoDong</first><last>ZHOU</last></author>
+      <author><first>GuoDong</first><last>Zhou</last></author>
      <author><first>KimTeng</first><last>LUA</last></author>


False negative

nschneid · 2019-12-13T05:19:08Z

Done with my scan of the diffs. (Didn't mark all instances of repeated names like Nancy IDE.)

The main problem I saw was false negative all-caps names where coauthors were correctly truecased. Maybe there should be a heuristic that uses coauthor capitalization as a cue. Of course, I probably didn't see false negatives where no coauthor names were changed because they wouldn't show up in the diff.
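A minimal sketch of the coauthor cue suggested above, assuming it runs on one paper's surnames after recasing. The function name `suspicious_surnames` is hypothetical and not part of the PR's scripts; it only flags names for hand review and does not change anything.

```python
def is_all_caps(name):
    """True for names such as 'BES' or 'IDE'; single letters are ignored."""
    return len(name) > 1 and name.isupper()


def suspicious_surnames(lasts):
    """Return all-caps surnames that sit next to at least one truecased coauthor.

    If every surname on the paper is all caps, there is no cue, so nothing
    is flagged.
    """
    if all(is_all_caps(name) for name in lasts):
        return []
    return [name for name in lasts if is_all_caps(name)]
```

On the E87 example from this review, `suspicious_surnames(["Baschung", "BES"])` flags `BES`. Genuine initials such as `AR` would be flagged too, which is why the output is meant as a review list rather than an automatic fix.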

Anyway, there may still be some errors but it's a huge number of fixes!

akoehn · 2019-12-13T08:57:03Z

> I didn't understand why the build failed -- something about trailing whitespace?

We now have more validation; see #669. In short: use black for Python code formatting, and make sure XML and YAML are well-formed. make check tells you whether everything is fine before committing; make autofix fixes what can be fixed without user intervention. Pre-commit hooks that run these checks automatically are also available.

davidweichiang · 2019-12-13T11:56:49Z

Thanks!!! Seems like I need to update the heuristics to be closer to what you originally suggested (go word by word instead of looking at the whole first/last name). Additionally, three-letter words need better handling. How risky would it be to recase any word that is CVC or VCV?

nschneid · 2019-12-13T15:05:47Z

> How risky would it be to recase any word that is CVC or VCV?

Hard for me to predict. How common are sequences of 3 initials without spaces or periods?

davidweichiang · 2019-12-13T16:16:45Z

Unfortunately I have a weird Python setup right now that prevents me from running venv, so many of the make targets don't work for me. :( I managed to run black, but don't know how to do the trailing whitespace check. @akoehn are there any more instructions anywhere? Thanks.

davidweichiang · 2019-12-13T16:27:47Z

I think I may have the pre-commit hooks working, but I find two things about this confusing:

- The offending files have already been committed, so there's no way to run pre-commit checks on them; I have to make some unrelated change to the file and re-stage it in order to run the check.
- The check that GitHub runs only says that the trailing whitespace check failed; it doesn't tell me which file it failed on.

davidweichiang · 2019-12-13T16:42:36Z

@nschneid Good question. In our current XML, these are all of them (I'm surprised there are so few):

C16.xml:      <author><first>Hayate</first><last>ISO</last></author>
C16.xml:      <author><last>LAU</last><first>Raymond</first></author>
C18.xml:      <author><first>Avinesh</first><last>PVS</last></author>
C67.xml:      <author><first>Martin</first><last>KAY</last></author>
C82.xml:      <author><first>Danilo</first><last>FUM</last></author>
C86.xml:      <author><first>Masahiro</first><last>ABE</last></author>
C88.xml:      <author><first>Naoki</first><last>ABE</last></author>
C88.xml:      <author><first>Michael B.</first><last>KAC</last></author>
C88.xml:      <author><first>Ingolf</first><last>MAX</last></author>
C88.xml:      <author><first>Tsunehisa</first><last>DOI</last></author>
C88.xml:      <author><first>Paradip</first><last>DEY</last></author>
C88.xml:      <author><last>ONO</last><first>Yoshihiko</first></author>
C88.xml:      <author><first>Tomas</first><last>VLK</last></author>
C90.xml:      <author><first>Nancy M.</first><last>IDE</last></author>
C90.xml:      <author><first>Yung-Taek</first><last>KIM</last></author>
C92.xml:      <author><first>Martin</first><last>KAY</last></author>
C92.xml:      <author><first>Shinichi</first><last>DOI</last></author>
C92.xml:      <author><first>Yeun-Bae</first><last>KIM</last></author>
C98.xml:      <author><first>Nancy</first><last>IDE</last></author>
C98.xml:      <author><first>Nancy</first><last>IDE</last></author>
C98.xml:      <author><first>Shinichi</first><last>DOI</last></author>
C98.xml:      <author><first>KimTeng</first><last>LUA</last></author>
E85.xml:      <author><first>Danilo</first><last>FUM</last></author>
E87.xml:      <author id="gabriel-g-bes"><first>G.G.</first><last>BES</last></author>
E89.xml:      <author><first>Danilo</first><last>FUM</last></author>
I11.xml:      <author><first>Avinesh</first><last>PVS</last></author>
L10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
M93.xml:      <author><first>Shinichi</first><last>DOI</last></author>
N10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
P91.xml:      <author><first>Hitoshi</first><last>HDA</last></author>
R09.xml:      <author><first>Chaitanya</first><last>GSK</last></author>
S14.xml:      <author><first>Avinesh</first> <last>PVS</last></author>
W00.xml:      <author><first>Mamiko</first><last>OKA</last></author>
W04.xml:      <author><first>Chooi-Ling</first><last>GOH</last></author>
W10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
W10.xml:      <author><first>Avinesh</first><last>PVS</last></author>
W11.xml:      <author><first>Chaitanya</first><last>GSK</last></author>
W12.xml:      <author><first>Santosh</first><last>GSK</last></author>
W97.xml:      <author><first>Jean-David</first><last>STA</last></author>
W97.xml:      <author><first>Jee-Sun</first><last>NAM</last></author>
W98.xml:      <author><first>Nancy</first><last>IDE</last></author>
W99.xml:      <author><first>Nancy</first><last>IDE</last></author>

I believe the only names here that are initials are GSK and PVS, so maybe these names should be whitelisted and everything else lowercased. Even VLK should be lowercased to Vlk.

In case you're curious (of course you are), here are all the two-capital-letter names:

C80.xml:      <author><first>JB</first><last>Berthelin</last></author>
D11.xml:      <author><first>Balamurali</first><last>AR</last></author>
D15.xml:      <author><first>CJ</first> <last>Linton</last></author>
L12.xml:      <author><first>Balamurali</first><last>AR</last></author>
P11.xml:      <author><first>Balamurali</first><last>AR</last></author>
P15.xml:      <author><first>Balamurali</first> <last>AR</last></author>
S16.xml:      <author><first>Soman</first> <last>KP</last></author>
S19.xml:      <author><first>Swapna</first><last>TR</last></author>
W07.xml:      <author><first>DJ</first><last>Hovermale</last></author>
W11.xml:      <author><first>Balamurali</first><last>AR</last></author>
W11.xml:      <author><first>Sowmya</first><last>VB</last></author>
W12.xml:      <author><first>Vidya Raj</first><last>RK</last></author>
W13.xml:      <author><first>SV</first><last>Ramanan</last></author>
W13.xml:      <author><first>Raymond T.</first><last>NG</last></author>
W14.xml:      <author><first>IV</first> <last>Ramakrishnan</last></author>
W15.xml:      <author><first>Muneeb</first> <last>TH</last></author>
W17.xml:      <author><first>Soman</first> <last>KP</last></author>
W18.xml:      <author><first>JT</first><last>Wolohan</last></author>

NG should be lowercased and the rest should all be kept.
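The decisions for short names can be written down as a small rule, sketched here under the assumption that it is applied only to surnames that are entirely uppercase. The set `KEEP_AS_INITIALS` and the function name are illustrative, not taken from the PR's scripts.

```python
# Three-letter surnames identified in this thread as genuine initials.
KEEP_AS_INITIALS = {"GSK", "PVS"}


def recase_short_surname(last):
    """Apply the two- and three-letter rules discussed above."""
    if not (last.isalpha() and last.isupper()):
        return last  # only purely alphabetic, all-uppercase names are considered
    if len(last) == 3 and last not in KEEP_AS_INITIALS:
        return last.title()  # e.g. VLK -> Vlk, IDE -> Ide
    if last == "NG":
        return "Ng"  # the one two-letter name that is not a pair of initials
    return last  # other two-letter names (AR, KP, ...) and longer names are kept
```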

nschneid · 2019-12-13T20:04:55Z

> I believe the only names here that are initials are GSK and PVS, so maybe these names should be whitelisted and everything else lowercased.

Sounds good.

> NG should be lowercased and the rest should all be kept.

Yup, a special rule for NG is probably warranted.

Thanks!

davidweichiang · 2019-12-14T03:10:37Z

I believe this is ready to merge, and I believe it eliminates all all-uppercase names and all all-lowercase names (except for one karel Oliva who seems to have wanted it that way).

davidweichiang · 2019-12-14T03:28:59Z

The current heuristic is:

1. If there is no first name, don't touch the last name (which often contains an acronym).
2. If the entire first name or entire last name is all lowercase, titlecase it (Python str.title()).
3. Else, go word by word (splitting on space, hyphen, and period). If a word is all uppercase, then:
   - If it has two letters and appears on a list of two-letter words that could be part of Chinese or Romance names, titlecase it.
   - Else if it does not appear on a list of words that should not be titlecased (III and a couple of other exceptions), titlecase it.
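The rules above can be sketched roughly as follows. This is not the code in author_case.py: the two word lists are placeholders (the real lists are not shown in this thread), the function names are hypothetical, and the sketch assumes that a two-letter uppercase word that is not on the first list is left untouched.

```python
import re

# Placeholder lists, for illustration only.
TWO_LETTER_NAME_WORDS = {"NG", "LI", "DE", "LA", "DU"}
KEEP_UPPERCASE = {"III", "GSK", "PVS"}


def recase_part(part):
    """Recase one name part (first or last) word by word."""
    # Split on space, hyphen, and period, keeping the separators so the
    # part can be reassembled unchanged apart from casing.
    tokens = re.split(r"([ .\-])", part)
    out = []
    for tok in tokens:
        if tok.isalpha() and tok.isupper():
            if len(tok) == 2:
                # Assumption: two-letter words are recased only if listed.
                if tok in TWO_LETTER_NAME_WORDS:
                    tok = tok.title()
            elif tok not in KEEP_UPPERCASE:
                tok = tok.title()
        out.append(tok)
    return "".join(out)


def recase_author(first, last):
    """Return (first, last) with all-uppercase and all-lowercase parts fixed."""
    if not first:
        # No first name: the last name often contains an acronym.
        return first, last
    parts = []
    for part in (first, last):
        if part.islower():
            parts.append(part.title())
        else:
            parts.append(recase_part(part))
    return tuple(parts)
```

With these placeholder lists, `recase_author("John T.", "MAXWELL III")` keeps `III` intact, and `recase_author("Ton", "van der WOUDEN")` recases only the uppercase word.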

There are some further improvements that could be made: for example, McKINLEY -> McKinley and MCKINLEY -> McKinley.

Currently, author_case.py generates a list of proposed changes so that the user can delete any bad ones; change_authors.py then applies the approved changes.

If @mjpost wants to incorporate this (and possibly other automatic filters) into ingestion, I'm not sure what the best way is. Make the hand-checking part of the ingestion process? Or fully automate it and log any changes made?

akoehn · 2019-12-14T07:48:43Z

@davidweichiang you can run make check to check everything; the pre-commit hooks only run make check_commit, which only checks files that have edits.

If you can't run these checks, the easiest would probably be to run something like grep ' $' data/xml/*
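For anyone without grep or a working venv, a rough Python equivalent is sketched below. It is not one of the repository's checks; the function names are made up, and unlike the grep pattern it also flags trailing tabs. It prints the file and line number of each offending line.

```python
import glob


def trailing_whitespace_lines(text):
    """Return 1-based numbers of lines that end in a space or tab."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


def report(pattern="data/xml/*.xml"):
    """Print every file and line under `pattern` with trailing whitespace."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for lineno in trailing_whitespace_lines(f.read()):
                print(f"{path}:{lineno}: trailing whitespace")
```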

davidweichiang · 2019-12-14T13:56:59Z

@akoehn would it be bad to remove the makefile dependencies on venv so that the user does make venv as a separate step?

@mjpost another possibility for the ingestion pipeline is for any automatic tools to do their thing but log the changes in the XML itself, like

<provenance>changed author last from CHIANG to Chiang</provenance>

so that it would be easy for someone to go through later and correct errors?
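A sketch of what such logging could look like with Python's standard xml.etree.ElementTree. The `<provenance>` element is only the proposal above, not part of the existing schema; attaching it to the `<paper>` element and the function name `recase_last` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def recase_last(paper, author, new_last):
    """Change an author's last name and record the change on the paper."""
    last = author.find("last")
    if last is None or last.text == new_last:
        return  # nothing to change, so nothing to log
    old = last.text
    last.text = new_last
    # Hypothetical element: records what was changed so a person can
    # review or revert it later.
    note = ET.SubElement(paper, "provenance")
    note.text = f"changed author last from {old} to {new_last}"


paper = ET.fromstring(
    '<paper id="1"><author><first>David</first><last>CHIANG</last></author></paper>'
)
recase_last(paper, paper.find("author"), "Chiang")
```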

mjpost · 2019-12-14T14:22:42Z

Will look at this first thing next week. My main question is how to integrate this into ingestion, but I haven't had a chance to look yet and try to figure it out myself.

mjpost

I added some comments.

mjpost · 2019-12-16T22:13:44Z

bin/change_authors.py

+For example:
+
+Z99-9999 \t author \t ARAVIND K. || JOSHI \t Aravind K. Joshi
+


Missing ||| here in the final field.

Example usage would be helpful here, too, for others.

mjpost · 2019-12-16T22:14:13Z

bin/author_case.py

+"""author_case.py
+
+Try to correct author names that are written in all uppercase or all lowercase.
+


Can you add example usage to the top?

davidweichiang · 2019-12-17T04:15:51Z

Better documentation added; once we figure out whether and how to incorporate this kind of fix into the ingestion pipeline, I will try to stabilize the interface and documentation more.

akoehn · 2019-12-17T08:41:40Z

> would it be bad to remove the makefile dependencies on venv so that the user does make venv as a separate step?

Yes, because it is not possible. The commands in make are run in a subshell and cannot edit the user's environment variables (this is a good thing).

Also, the current setup checks that the venv is up to date and correctly set up at every step. Without it, users could forget to set up the venv, to enable it before every make run, or to update the venv if another commit has changed the dependencies.

davidweichiang · 2019-12-17T20:55:37Z

@akoehn Oh, that makes sense. My workaround is to change VENV to point to a dummy script.

davidweichiang · 2019-12-17T20:56:13Z

@mjpost okay to merge?

mjpost · 2019-12-17T21:00:26Z

LGTM.

Auto correct author name case (#643)

115f03c

nschneid reviewed Dec 13, 2019

black

66cb1bf

davidweichiang added 2 commits December 13, 2019 11:23

fixed trailing whitespace?

7c42f72

fixed trailing whitespace now?

cfa46cd

Corrected a few errors caught by @nschneid

06c506d

Improved automatic recasing of author names

b1048fa

davidweichiang requested a review from mjpost December 14, 2019 03:10

davidweichiang mentioned this pull request Dec 14, 2019

Fix some overcapitalization #590

Closed

mjpost approved these changes Dec 16, 2019

improve documentation for new scripts

d7c4b7c

davidweichiang merged commit 853c9ac into master Dec 17, 2019

davidweichiang deleted the author-case branch December 17, 2019 21:29
