Auto correct author name case (#643) #695
I didn't understand why the build failed -- something about trailing whitespace?
@nschneid are you interested in skimming through the diffs to spot any errors?
data/xml/E87.xml
@@ -183,10 +183,10 @@
</paper>
<paper id="29">
<title>AUXILIARIES AND CLITICS IN FRENCH UCG GRAMMAR</title>
<author id="karine-baschung"><first>K.</first><last>BASCHUNG</last></author>
<author id="karine-baschung"><first>K.</first><last>Baschung</last></author>
<author id="gabriel-g-bes"><first>G.G.</first><last>BES</last></author>
False negative?
data/xml/I08.xml
@@ -1002,7 +1002,7 @@
</paper>
<paper id="144">
<title>Cross Lingual Information Access System for <fixed-case>I</fixed-case>ndian Languages</title>
<author><first>CLIA</first><last>Consortium</last></author>
<author><first>Clia</first><last>Consortium</last></author>
False positive: acronym
data/xml/M93.xml
@@ -103,9 +103,9 @@
</paper>
<paper id="14">
<title><fixed-case>NEC</fixed-case>: DESCRIPTION OF THE VENIEX SYSTEM AS USED FOR <fixed-case>MUC</fixed-case>-5</title>
<author><first>Kazunori</first><last>MURAKI</last></author>
<author><first>Kazunori</first><last>Muraki</last></author>
<author><first>Shinichi</first><last>DOI</last></author>
False negative?
data/xml/P91.xml
@@ -195,7 +195,7 @@
</paper>
<paper id="24">
<title>EXPERIMENTS AND PROSPECTS OF EXAMPLE-BASED MACHINE TRANSLATION</title>
<author><first>Eiichiro</first><last>SUMITA</last></author>
<author><first>Eiichiro</first><last>Sumita</last></author>
<author><first>Hitoshi</first><last>HDA</last></author>
Apparent OCR error: surname should be Iida
data/xml/W97.xml
@@ -143,7 +143,7 @@
<paper id="25">
<title>A Local Grammar-based Approach to Recognizing of Proper Names in <fixed-case>K</fixed-case>orean Texts</title>
<author><first>Jee-Sun</first><last>NAM</last></author>
False negative
data/xml/C88.xml
<author><first>Ronald M.</first><last>KAPLAN</last></author>
<author><first>John T.</first><last>MAXWELL III</last></author>
<author><first>Ronald M.</first><last>Kaplan</last></author>
<author><first>John T.</first><last>Maxwell Iii</last></author>
III
data/xml/C88.xml
<url>C88-2134</url>
</paper>
<paper id="135">
<title>A Computer Readability Formula of <fixed-case>J</fixed-case>apanese Texts for Machine Scoring</title>
<author><last>TATEISI</last><first>Yuka</first></author>
<author><last>Tateisi</last><first>Yuka</first></author>
<author><last>ONO</last><first>Yoshihiko</first></author>
False negative
data/xml/C88.xml
@@ -850,22 +850,22 @@
<paper id="143">
<title>MASSIVE DISAMBIGUATION OF LARGE TEXT CORPORA WITH FLEXIBLE CATEGORIAL GRAMMAR</title>
<author><first>Ton</first><last>van der WOUDEN</last></author>
False negative
data/xml/C92.xml
<author><first>Fathi</first><last>DEBILI</last></author>
<author><first>Elyes</first><last>SAMMOUDA</last></author>
<author><first>Fathi</first><last>Debili</last></author>
<author><first>Elyes</first><last>Sammouda</last></author>
<url>C92-2079</url>
</paper>
<paper id="80">
<title>TRANSLATION AMBIGUITY RESOLUTION BASED ON TEXT CORPORA OF SOURCE AND TARGET LANGUAGES</title>
<author><first>Shinichi</first><last>DOI</last></author>
False negative
data/xml/C92.xml
<author><first>Hideki</first><last>TANAKA</last></author>
<author><first>Teruaki</first><last>AIZAWA</last></author>
<author><first>Hideki</first><last>Tanaka</last></author>
<author><first>Teruaki</first><last>Aizawa</last></author>
<author><first>Yeun-Bae</first><last>KIM</last></author>
False negative
data/xml/C92.xml
@@ -1158,17 +1158,17 @@
</paper>
<paper id="192">
<title>BESOINS LEXICAUX A LA LUMIERE DE L’ANALYSE STATISTIQUE DU CORPUS DE TEXTES DU PROJET “BREF” - LE LEXIQUE “BDLEX” DU FRANCAIS ECRIT ET ORAL</title>
<author><first>I.</first><last>FERRANE</last></author>
<author><first>I.</first><last>Ferrane</last></author>
<author id="martine-de-calmes"><first>M.</first><last>de CALMES</last></author>
False negative
data/xml/C98.xml
@@ -285,16 +285,16 @@
</paper>
<paper id="44">
<title>Veins Theory: A Model of Global Discourse Cohesion and Coherence</title>
<author><first>Dan</first><last>CRISTEA</last></author>
<author><first>Dan</first><last>Cristea</last></author>
<author><first>Nancy</first><last>IDE</last></author>
False negative
data/xml/C98.xml
@@ -1308,8 +1308,8 @@
<paper id="204">
<title>Reactive Content Selection in the Generation of Real-time Soccer Commentary</title>
<author><first>Kumiko</first><last>TANAKA-Ishii</last></author>
False negative
data/xml/C98.xml
@@ -1495,7 +1495,7 @@
</paper>
<paper id="234">
<title>Word Association and <fixed-case>MI</fixed-case>-Trigger-based Language Modeling</title>
<author><first>GuoDong</first><last>ZHOU</last></author>
<author><first>GuoDong</first><last>Zhou</last></author>
<author><first>KimTeng</first><last>LUA</last></author>
False negative
Done with my scan of the diffs. (Didn't mark all instances of repeated names like Nancy IDE.) The main problem I saw was false negatives: all-caps names left unchanged where coauthors were correctly truecased. Maybe there should be a heuristic that uses coauthor capitalization as a cue. Of course, I probably didn't see false negatives where no coauthor names were changed because they wouldn't show up in the diff. Anyway, there may still be some errors but it's a huge number of fixes!
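The coauthor cue suggested above could be sketched roughly as follows. This is only an illustration, not code from the PR: the function names are hypothetical, and it assumes the surnames of one paper are available as a plain list of strings after the first correction pass.

```python
def has_lowercase(name: str) -> bool:
    """True if the name contains at least one lowercase letter."""
    return any(c.islower() for c in name)


def is_all_caps(name: str) -> bool:
    """True if the name contains letters and every letter is uppercase."""
    letters = [c for c in name if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def flag_by_coauthor_cue(last_names):
    """Return all-caps surnames that sit next to a coauthor with lowercase letters.

    Hypothetical helper: if no coauthor has any lowercase letters,
    there is no cue and nothing is flagged.
    """
    if not any(has_lowercase(n) for n in last_names):
        return []
    return [n for n in last_names if is_all_caps(n)]
```

For the C98 excerpt above, `flag_by_coauthor_cue(["Cristea", "IDE"])` would flag `IDE` for review. Papers where every surname is still all-caps produce no flag, which matches the caveat that such cases never show up in the diff.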
We now have more validation, see #669. In short: use
Thanks!!! Seems like I need to update the heuristics to be closer to what you originally suggested (go word by word instead of looking at the whole first/last name).
Additionally, three-letter words need better handling. How risky would it be to recase any word that is CVC or VCV?
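To make the word-by-word idea and the CVC/VCV question concrete, here is a minimal sketch. It is not the logic in `author_case.py`; the helper names are hypothetical, and treating Y as a vowel is an assumption made only for this illustration.

```python
import re

# Assumption for this sketch: Y counts as a vowel.
VOWELS = set("AEIOUY")


def cv_pattern(word: str) -> str:
    """Map each letter to C (consonant) or V (vowel), e.g. 'NAM' -> 'CVC'."""
    return "".join("V" if c in VOWELS else "C" for c in word.upper())


def recase_word(word: str) -> str:
    """Title-case an all-caps word; leave short words alone unless CVC or VCV."""
    if not word.isupper():
        return word  # already mixed or lower case, e.g. 'van', 'Ishii'
    if len(word) <= 3 and cv_pattern(word) not in {"CVC", "VCV"}:
        return word  # could be initials or a numeral such as 'III'
    return word.capitalize()


def recase_name(name: str) -> str:
    """Apply recase_word to every run of letters, keeping spaces, hyphens, periods."""
    return re.sub(r"[^\W\d_]+", lambda m: recase_word(m.group()), name)
```

Under these assumptions `recase_name("van der WOUDEN")` gives `van der Wouden`, `recase_name("MAXWELL III")` gives `Maxwell III`, and `NAM`, `KIM`, and `IDE` are recased. Names such as `DOI` and `LUA` follow a CVV pattern, so a rule limited to CVC and VCV would still leave them untouched.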
Hard for me to predict. How common are sequences of 3 initials without spaces or periods?
Unfortunately I have a weird Python setup right now that prevents me from running
I think I may have the pre-commit hooks working, but I find two things about this confusing:
@nschneid Good question. In our current XML, these are all of them (I'm surprised there are so few):
I believe the only names here that are initials are GSK and PVS, so maybe these names should be whitelisted and everything else lowercased. Even VLK should be lowercased to Vlk. In case you're curious (of course you are), here are all the two-capital-letter names:
NG should be lowercased and the rest should all be kept.
Sounds good.
Yup, a special rule for NG is probably warranted. Thanks!
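The rule agreed on here for short all-caps names could look roughly like this. It is a sketch only: the set and function names are hypothetical, and only the names mentioned in this thread (GSK, PVS, NG) are hard-coded.

```python
# Three-letter names that the thread identifies as initials: keep as-is.
KEEP_UPPER = {"GSK", "PVS"}
# Two-letter name that the thread says should be recased.
FORCE_TITLE = {"NG"}


def recase_short_name(name: str) -> str:
    """Recase a two- or three-letter all-caps name according to the thread's rule.

    Hypothetical helper: anything that is not an all-caps alphabetic
    string of length 2 or 3 is returned unchanged.
    """
    if not (name.isalpha() and name.isupper()) or len(name) not in (2, 3):
        return name
    if name in KEEP_UPPER:
        return name
    if len(name) == 3 or name in FORCE_TITLE:
        return name.capitalize()
    return name  # every other two-letter name is kept
```

With this, `VLK` becomes `Vlk`, `NG` becomes `Ng`, and `GSK` and `PVS` are left alone.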
I believe this is ready to merge, and I believe it eliminates all all-uppercase names and all all-lowercase names (except for one karel Oliva who seems to have wanted it that way).
The current heuristic is:
There are some further improvements that could be made: for example,
How this currently works is that
If @mjpost wants to include this (and possibly other automatic filters) into ingestion, I'm not sure what the best way is. Make the hand-checking part of the ingestion process? Or fully automate it and log any changes made?
@davidweichiang you can run
If you can't run these checks, the easiest would probably be to run something like
@akoehn would it be bad to remove the makefile dependencies on venv so that the user does
@mjpost another possibility for the ingestion pipeline is for any automatic tools to do their thing but log the changes in the XML itself, like
so that it would be easy for someone to go through later and correct errors?
Will look at this first thing next week. My main question is how to integrate this into ingestion, but I haven't had a chance to look yet and try to figure it out myself.
I added some comments.
For example:

Z99-9999 \t author \t ARAVIND K. || JOSHI \t Aravind K. Joshi
Missing ||| here in the final field.
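For readers unfamiliar with the change-list format under review, the following sketch parses one such line. It rests on assumptions drawn only from the example above: four tab-separated fields (read here as paper ID, field name, original name, corrected name), with `||` separating first and last name in both name fields, as the review comment asks. The function names are hypothetical and not part of `author_case.py` or `change_authors.py`.

```python
def split_name(field: str):
    """Split 'FIRST || LAST' into (first, last); raise if the separator is missing."""
    first, sep, last = field.partition("||")
    if not sep:
        raise ValueError(f"missing '||' separator in name field: {field!r}")
    return first.strip(), last.strip()


def parse_change_line(line: str):
    """Parse one tab-separated change line into (paper_id, field, old_name, new_name)."""
    parts = [part.strip() for part in line.rstrip("\n").split("\t")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(parts)}")
    paper_id, field, old, new = parts
    return paper_id, field, split_name(old), split_name(new)
```

A line written as in the docstring example, without `||` in the final field, raises a `ValueError` under this sketch, which is one way such an omission could be caught while hand-correcting the list.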
Example usage would be helpful here, too, for others.
"""author_case.py

Try to correct author names that are written in all uppercase or all lowercase.
Can you add example usage to the top?
Better documentation added; once we figure out whether and how to incorporate this kind of fix into the ingestion pipeline, I will try to stabilize the interface and documentation more.
Yes, because it is not possible. The commands in make are run in a subshell and cannot edit the user's environment variables (this is a good thing). Also, the current setup checks that the venv is up to date and correctly set up at every step. Without it, users could forget to set up the venv, to enable it before every make run, or to update the venv if another commit has changed the dependencies.
@akoehn Oh, that makes sense. My workaround is to change
@mjpost okay to merge?
LGTM.
This was done by:
- author_case.py to generate a list of potential changes
- hand correct the list of changes
- change_authors.py to apply the changes