Correction: Diacritics missing from author name though present in PDF #333

Open
nschneid opened this issue May 13, 2019 · 30 comments · Fixed by #340
Labels
correction for corrections submitted to the anthology

Comments

@nschneid
Contributor

Ironically, Diacritics Restoration Using Neural Networks lists "Jan Hajic" on the page and in the BibTeX whereas it's spelled "Jan Hajič" in the PDF.

I see he is listed with the diacritic in some venues but not others, though going by the PDFs, Jan Hajič seems to be the preferred spelling.

Should the policy be that if an author is listed with multiple spellings differing only in diacritics, the one with the most diacritics should be applied?
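To make the proposed policy concrete, here is a minimal sketch that groups spellings differing only in combining diacritics and picks the one with the most marks. The function names are made up for illustration and are not part of the Anthology code.

```python
import unicodedata
from collections import defaultdict


def strip_marks(name: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_marks(name: str) -> int:
    """Count combining marks in the NFD form of a name."""
    decomposed = unicodedata.normalize("NFD", name)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


def most_accented(variants):
    """Map each mark-free spelling to its variant carrying the most marks."""
    groups = defaultdict(list)
    for name in variants:
        groups[strip_marks(name)].append(name)
    return {base: max(names, key=count_marks) for base, names in groups.items()}


print(most_accented(["Jan Hajic", "Jan Hajič", "Matt Post"]))
```

This only sees differences that decompose into combining marks; letters such as ł or ø have no such decomposition, and transliterations like oe for ö are out of scope.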

@nschneid nschneid added the correction for corrections submitted to the anthology label May 13, 2019
@mjpost
Member

mjpost commented May 14, 2019

The goal is generally for (a) BibTeX to reflect PDF and (b) author pages to collect all observed name variants. So this is a mistake that should be corrected in the XML.

Want to open a PR (and be added to our list of volunteers)?

@nschneid
Contributor Author

Before making a one-time change I'd like to understand the underlying problem. Could it be that his name is listed without the diacritic in START, so it is showing up that way in the metadata for many of the venues?

@davidweichiang
Collaborator

I just checked -- indeed, his name in START is just Jan Hajic.

@davidweichiang
Collaborator

Missing diacritics are a widespread problem that we hoped to sidestep by allowing name variants. I suppose one could write a scraper to try to detect them.

If we knew when the switch was made to using START names, then looking for frequent mismatches after that year would help to identify people to contact and ask them to consider updating their profile.

@davidweichiang
Collaborator

I adapted the auto_first_names.py script and am running it on L18 now. It's catching quite a few errors; not just the one @nschneid pointed out, but removing extra accents, decapitalizing an all-caps name, and flagging (but not autocorrecting, alas) a couple of misspelled names.

@davidweichiang
Collaborator

In L18 (528 papers, wow), the script made 150 changes (also wow) and printed another 100+ warnings that usually indicate a typo or missing word.

The automatic changes are easy to check, and they all look good except for a few:

INFO:L18-1066 author Tomasz Pędzimąż: changing: Pędzimąż -> Pȩdzima̧ż
INFO:L18-1495 author Anna Björk Nikulásdóttir: changing: Nikulásdóttir -> Nikulasdóttir
INFO:L18-1632 author Huda Almuzaini: changing: Almuzaini -> almuzaini

The first changes ogoneks to cedillas, I believe incorrectly.
The second one looks incorrect to me based on a Google search.
The third one lowercases the last name of someone who doesn't appear to do that regularly.
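For the first change, the two spellings look almost identical on screen, so it may help to inspect the code points. A small sketch using only the standard library (the helper name `describe` is an illustration, not something from the script):

```python
import unicodedata


def describe(text: str):
    """List the Unicode name of every code point in the NFD form of text."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", text)]


xml_e = "\u0119"  # e with ogonek, as in the XML spelling
pdf_e = "\u0229"  # e with cedilla, as extracted from the PDF

print(describe(xml_e))  # ['LATIN SMALL LETTER E', 'COMBINING OGONEK']
print(describe(pdf_e))  # ['LATIN SMALL LETTER E', 'COMBINING CEDILLA']
```

Both letters strip down to a plain "e", so a comparison that ignores diacritics treats them as the same name and cannot tell which mark is right.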

@mjpost, do you think the PDF should be followed in such cases?

@davidweichiang
Collaborator

What system does LREC use to fill metadata? Do they use START also? I'm running the script on L16 now (for #341) and seeing some PDF/XML mismatches that are the same as in L18.

For example (not an exhaustive list):

Phillippe Langlais -> Philippe Langlais
Ina Roesiger -> Ina Rösiger

So it would be nice to fix these at the source instead of on our end.

@kilian-gebhardt
Collaborator

@davidweichiang
Collaborator

davidweichiang commented May 15, 2019

@mjpost what are your thoughts about editing XML to match PDF in these cases where the PDF has less information than the current XML:

  1. XML currently has Matt Post, PDF has Matt POST
  2. XML Matt Post, PDF M. Post
  3. XML Matt Post, PDF Mat Post
  4. XML Matt J. Post, PDF Matt Post
  5. XML Matt Póst, PDF Matt Post (supposing that the accent is correct)
  6. XML Matt Post, PDF matt post
    [Edit: numbered list]
    [Edit: 6]

@mjpost
Member

mjpost commented May 15, 2019

  1. I approve on the grounds of superseding another conference's convention

  2. I like only when it is clear that initials were used because of a conference-level editorial decision (in which case we are overriding their convention with our superior one). If this were a one-off, we don't have the evidence that this wasn't the author's choice.

  3. I approve as a typo correction

  4. I dislike, because there is no evidence that this is a correction. (And in particular, I strongly dislike my name being written as Matthew, Matthew J, Matt J, etc)

  5. Is murky but I think wrong. For example, the same corrective principle might change Koehn → Köhn, which would be wrong. We could set a general rule that acknowledges typing Latin-1 characters was harder, say, 20 years ago, but I think it's more straightforward to list this as an ASCII variant.

Just to be clear, since my tone may indicate otherwise, we can discuss any of these.

@danielgildea
Collaborator

As a general rule, I would say the xml should reflect how you would want to cite the paper, and not necessarily have to match the PDF 100% of the time. On that basis, I would say that the xml should have:

  1. No full caps.
  2. Full first name if we know that the author usually uses it, and this conference/paper just didn't allow it.
  3. Typos fixed if we are absolutely sure it's a typo.
  4. Middle initial and form "Matt" vs "Matthew" etc as they appear in the pdf.
  5. Any diacritics if we are sure they are correct and are generally used by that person.

Unfortunately these rules require some research/judgment, but I think it is better to leave things the way they are in case of doubt than it is to exactly mirror the PDFs.

@akoehn
Member

akoehn commented May 15, 2019

@mjpost : Approve means that you would like to keep the XML data and not the PDF one, correct?

For example, the same corrective principle might change Koehn → Köhn, which would be wrong

As an expert in this field [ :-) ]: it depends. You cannot change an oe to ö without any evidence. However, if either the PDF or the XML actually has Köhn in it, it is probably safe to say that the umlaut is the preferred version. Case in point: Philipp Köhn spells his name Koehn in all publications and probably has it that way in Softconf as well. No algorithm would try to change it to Köhn. I try to use the umlaut, but in some cases I have to enter an ASCII-only name, so Koehn will be in some databases as well.

Ina Roesiger -> Ina Rösiger

In this case, one should go with Rösiger.

@mjpost
Member

mjpost commented May 15, 2019

@akoehn—yes, approve means I was in favor of the XML diverging from the PDF in the cases mentioned above.

I like @danielgildea's concise summary above. We should start throwing conclusions from these discussions into a wiki page that describes our approach.

Just seeing (6) above: I think capitalization falls under the Gildea Principles: we correct it to English conventions unless we have evidence the author prefers it that way (e.g., danah boyd, e e cummings).

@davidweichiang
Collaborator

I think I hear a consensus about

(3) Don't copy errors from the PDF into the XML. Note that this can occasionally be a tough call: for example, I had difficulty figuring out Elahe Khorasani vs. Elahe Khorashani.

(4-5) Assuming that neither the PDF nor the XML has an error, go with the PDF.

(1-2) Override styles (like first initials or all caps) imposed by a conference (which is rare).

But:

(1-2) There's less clarity about individual papers that use first initials or all caps -- @mjpost says go with the PDF.

So for the examples mentioned in this thread, and a couple more:

Consensus cases:

  • PDF Hajič, XML Hajic -> change to Hajič
  • PDF Philippe, XML Phillippe -> change to Philippe
  • PDF Rösiger, XML Roesiger -> change to Rösiger
  • PDF almuzaini, XML Almuzaini -> keep Almuzaini

Not sure about whether these are considered typos or not:

  • PDF Pȩdzima̧ż, XML Pędzimąż -> change to Pȩdzima̧ż
  • PDF Nikulasdóttir, XML Nikulásdóttir -> change to Nikulasdóttir
  • PDF Khorashani, XML Khorasani -> change to Khorashani

And the cases where there is difference of opinion:

  • PDF WANG, XML Wang -> change to WANG
  • PDF P., XML Philipp -> change to P.

@nschneid
Contributor Author

(1-2) There's less clarity about individual papers that use first initials or all caps -- @mjpost says go with the PDF.

I'm not sure about this one. Why would an author choose to abbreviate their name in some publications but not others? It could be that the names in the PDF all follow one convention which is inconsistent with what some of the authors normally do. I would generally prefer more information over less information, so if the name was spelled out in START but abbreviated in the PDF, I'd go with the non-abbreviated version.

@mjpost
Member

mjpost commented May 15, 2019

I agree about having more information whenever possible; I just want us to have some degree of certainty about it, so that we don't "Gilbert Keith" someone's "GK". If we have information from another source (START ID, inference about a conference convention, etc.) that suggests the author is fine with an evidenced fuller version, I'm fine with it. But part of the reason to have a strong preference for the PDF is that without that convention, one can spend endless time trying to figure out what's right in all these situations.

@davidweichiang
Collaborator

@nschneid we previously discussed first initials at length in #245; @mjpost sorry to bring it up again. In the current situation (LREC and other conferences that use START), the full names are known to be correct because they are provided by the authors, so it seems especially sad to delete information (and indeed, I didn't do it in PR #340).

I will try to summarize the above discussion in the wiki, and I will back out the changes in L18 that made some last names all-caps.

@davidweichiang
Collaborator

Do you want to further discuss how to get people to change their names in START? If not, we can close this issue.

@davidweichiang
Collaborator

I think we can pretty reliably restore accents now by scraping them from PDFs. What's the best way to use this -- to identify people to ask to update their START accounts, or just run the scraper as part of ingestion?

The scraper also changes casing and inserts/deletes spaces and hyphens. But it can only flag, not autocorrect, changes in spelling or insertion/deletion of names or initials.
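The split between "autocorrect" and "flag" could look roughly like the sketch below. It is an illustration under assumptions, not the actual logic of the scraper: `loose_key` and `reconcile` are hypothetical names, and the rule is simply that a difference which disappears once diacritics, case, spaces, and hyphens are ignored is treated as safe.

```python
import unicodedata


def loose_key(name: str) -> str:
    """Fold away diacritics, case, spaces, and hyphens."""
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_marks.casefold() if ch not in " -")


def reconcile(xml_name: str, pdf_name: str):
    """Return (action, name): keep the XML, adopt the PDF, or flag for review."""
    same = unicodedata.normalize("NFC", xml_name) == unicodedata.normalize("NFC", pdf_name)
    if same:
        return ("keep", xml_name)
    if loose_key(xml_name) == loose_key(pdf_name):
        return ("change", pdf_name)  # differs only in accents, casing, spacing
    return ("flag", xml_name)  # spelling differs; needs a human decision


print(reconcile("Jan Hajic", "Jan Hajič"))
print(reconcile("Phillippe Langlais", "Philippe Langlais"))
```

As the Almuzaini and WANG cases in this thread show, a "change" result still needs a human look, since the PDF casing is not always what should end up in the XML.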

@danielgildea
Collaborator

As far as getting people to update their names in START, it seems like there are a few things we might try:

  1. Try to get everyone's emails from START, and send emails to people with a mismatch.
  2. Ask ACL organizers to include a note in their email to authors about checking that the names in START match what people want, possibly including the authors' names from START in the email so that people can see easily how their names appear now.
  3. Provide pub chairs with a script to check names against the PDFs, so that they can edit the metadata and possibly bug the authors themselves.

Any thoughts on which of these to pursue?

@mjpost
Member

mjpost commented May 21, 2019

I think we should focus our efforts on implementing this ourselves when we generate the XML (say, in anthology_xml.py). Reasoning:

  • Authors sometimes don't add names via START userids (the interface for this is actually pretty confusing)
  • Some folks are not using START so there may be other errors there

(1) and (2) are still good ideas to reduce the amount that has to be fixed, though.

@davidweichiang
Collaborator

I agree, it would be annoying for everyone involved to email individual people. So, we have an author-name scraper (https://github.com/acl-org/acl-anthology/blob/auto_accents/bin/auto_authors.py) that could be incorporated into normalize_anth.py and run as part of ingestion.

  • Currently, it downloads the PDF by HTTP (as you may have noticed if you look at the server log for the last few days), but if part of ingestion, it should be an option to read from a local directory.
  • Improve heuristics to use some kind of minimum-edit distance like auto_name_variants.py does; it has to be really high precision, though.
  • Improve heuristics to know about some letter relationships like oe and ö.
  • Add special cases, especially for nicknames where the edit distance may be high, like Kathleen-Kathy.
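The last three items could be sketched as follows. This is a rough illustration only: the `DIGRAPHS` and `NICKNAMES` tables, the function names, and the `max_edits` threshold are assumptions for the example and are not taken from auto_authors.py or auto_name_variants.py.

```python
import unicodedata

# Hypothetical lookup tables, for illustration only.
DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue"}
NICKNAMES = {"kathy": "kathleen"}


def fold(token: str) -> str:
    """Normalize one name token for comparison (never for display)."""
    token = unicodedata.normalize("NFC", token).casefold()
    token = "".join(DIGRAPHS.get(ch, ch) for ch in token)
    decomposed = unicodedata.normalize("NFD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return NICKNAMES.get(token, token)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def likely_same(xml_name: str, pdf_name: str, max_edits: int = 1) -> bool:
    """True if the folded names differ by at most max_edits character edits."""
    xml_tokens = [fold(t) for t in xml_name.split()]
    pdf_tokens = [fold(t) for t in pdf_name.split()]
    if len(xml_tokens) != len(pdf_tokens):
        return False
    total = sum(edit_distance(x, p) for x, p in zip(xml_tokens, pdf_tokens))
    return total <= max_edits
```

With these settings `likely_same("Ina Roesiger", "Ina Rösiger")` holds, but so does `likely_same("Elahe Khorasani", "Elahe Khorashani")`, which is exactly the kind of case where precision matters and a match should only be flagged.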

@mjpost
Member

mjpost commented Jun 12, 2019

@davidweichiang, do you want a local copy of the Anthology PDFs? It's 35 GB. If you have a CLSP account we could set this up, or find another way.

@davidweichiang
Collaborator

I don’t have a CLSP account (I don’t think). But a local copy might be a good idea if we can figure out a way.

@akoehn
Member

akoehn commented Jun 17, 2019

Short cross link: #295 (comment) for a discussion of how to mirror PDFs in bulk. Should be ~5mins to implement.

@mjpost
Member

mjpost commented Jun 17, 2019

I've posted a file with checksums here [14 MB].

@davidweichiang
Collaborator

Can this file (as well as the mirroring script) become part of the repo?

@akoehn
Member

akoehn commented Jun 17, 2019

Can we discuss that further in #295 (the mirroring issue)? I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me.

@danielgildea
Collaborator

Hi,

I just ran find_name_variants.py, which finds names that slugify to the same thing. It found over 300 cases of people with essentially the same name that are currently considered to be different people in the database because they are not entered into name_variants.yaml. Most are missing accents, and some are different first/last split for multiword names. It looks like in all cases it is the same person.

I wonder if we should change the anthology code to consider any two names that slugify to the same thing to be the same person. That way these people could have one author page without us having to track down every name variant during ingestion, which we don't have any consistent process for currently.
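To illustrate what merging by slug would mean, here is a small sketch. The `slug` function below is a simplified stand-in written for this example; the Anthology's real slugify step may behave differently, and `merge_candidates` is a hypothetical name.

```python
import re
import unicodedata
from collections import defaultdict


def slug(name: str) -> str:
    """Simplified stand-in for a slugify step (an assumption, see above)."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def merge_candidates(names):
    """Group full names by slug and keep only slugs with several spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[slug(name)].add(name)
    return {s: sorted(v) for s, v in groups.items() if len(v) > 1}


print(merge_candidates(["Jan Hajič", "Jan Hajic", "Ina Rösiger", "Matt Post"]))
```

Because the slug is built from the full name string, a different first/last split of a multiword name produces the same slug, which matches the second kind of mismatch mentioned above.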

@mjpost
Member

mjpost commented Nov 10, 2020

I like that idea. We should wrap up discussion in #623 and come up with a solution.


6 participants