Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest RANLP 2019 #731

Merged
merged 17 commits into from
Jan 23, 2020
Merged

Ingest RANLP 2019 #731

merged 17 commits into from
Jan 23, 2020

Conversation

mjpost
Copy link
Member

@mjpost mjpost commented Jan 15, 2020

Also updated latex_to_unicode.py to remove \cite from abstracts (maybe @davidweichiang wants to have a look at that: 14a75ba).

@mjpost mjpost requested a review from a team January 15, 2020 15:24
@mjpost
Copy link
Member Author

mjpost commented Jan 16, 2020

The current code transforms \cite{key} to \cite key and then deletes \cite, which is not what we want.

Since we cannot (easily) algorithmically distinguish parenthetical and text citations (and people do not always use \citet and \citep correctly), I think I am in favor of just leaving them as they are. Unless you register disagreement, I’ll update the code and the PR to do that.

Noting this in #666—Softconf should also look for \cite.

@akoehn
Copy link
Member

akoehn commented Jan 16, 2020

My position on LaTeX code in abstracts: We cannot (and should not) do post-processing and guess what the authors wanted. The submission sites clearly ask for plaintext and we should handle it as such.

If the authors cannot be bothered to do the cleanup themselves, so be it. The only helpful thing would be a warning by softcfonc à la “You seem to have entered LaTeX code into this field. We expect plain text and will not perform any post-processing.” or similar (I don’t know whether there is any post-processing going on currently).

@mjpost
Copy link
Member Author

mjpost commented Jan 16, 2020

Agreed. I like things to be right but chasing this long tail of manual corrections is too tiring.

@mjpost
Copy link
Member Author

mjpost commented Jan 16, 2020

Okay, this is ready for review now I believe. I updated latex_to_unicode to (a) restore the space between control code and open brace and (b) no longer remove remaining control codes.

return ""

s = re.sub(r"\\[A-Za-z]+ |\\.", repl, s)
# Remove inserted space for remaining codes ("\code {...}" -> "\code{...}")
s = re.sub(r"(\\[A-Za-z]+) \{", r"\1{", s)

return s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here is that LaTeX commands not handled by the above are left in place instead of removed. We also restore the space that was inserted to ease handling. So \nonexistentcommand will be left as-is, as will \cite{citekey}. The view is that it cannot be our responsibility to ensure these are all correct.

@davidweichiang
Copy link
Collaborator

The rationale behind deleting all TeX commands was that things like \textsc, \underline, etc. I think really would be appropriate to delete. I agree that \cite is bad to delete, but think other commands might be better to delete. I can't remember if there were other problematic commands like \cite.

@mjpost
Copy link
Member Author

mjpost commented Jan 18, 2020

The argument for a default policy of non-deletion is that it is a mistake to have stray latex in here at all, but when we can't automatically fix it (with the current list of cases we handle), it's better to leave the mistake exposed than to silently remove it, since removing it prevents later (possibly manual) correction.

What about turning \cite{x} into CITATION(x)? This removes the LaTeX and also simplifies the regex code since I don't have to worry about carving out an exception when we remove all stray codes at the end (assuming we don't change that, since it's the current default).

@davidweichiang
Copy link
Collaborator

Is the suggestion in your second paragraph to do \cite{x} -> CITATION(X) and then to remove all remaining TeX commands? I think I prefer that. I would actually suggest \cite{x} -> (CITATION) which might read slightly better, e.g., "The sky is blue (CITATION)" rather than "The sky is blue CITATION(smith:1900)".

I remember another gray-area case: I have definitely seen \footnote used in abstracts.

@mjpost
Copy link
Member Author

mjpost commented Jan 20, 2020

This is done.

I think footnotes should just be removed, and added that, too.

@mjpost mjpost mentioned this pull request Jan 22, 2020
@@ -177,6 +177,9 @@ def latex_to_unicode(s):
# Transform errant citations into "(CITATION)"
s = re.sub(r"\\(new)?cite.? ?\{([\w:-]+?)\}", r"(CITATION)", s)

# Remove errant \footnotes
s = re.sub(r"\\footnote ?\{.*?\}", "", s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you worried about cases like \footnote{This is \emph{really} important.}?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly not :) But I independently ran into this while reprocessing EMNLP (#736), where an (obvious) common instance is \footnote{\url{...}}.

We really need to add unit tests so we know we're handling these issues correctly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, that's an extremely common case. The lazy solution would be to delete \footnote but turn its argument into normal text.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-lazy solution would be to add \footnote to the list of LaTeX commands, so that it gets parsed correctly with its argument. If you're sure this is the right way to go, I can try to make the modification. Am I allowed to push to your branch?

Something similar should probably be done for \cite and friends, although there are so many variants of this command that it would clutter up the list of commands...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please do.

@@ -177,6 +177,9 @@ def latex_to_unicode(s):
# Transform errant citations into "(CITATION)"
s = re.sub(r"\\(new)?cite.? ?\{([\w:-]+?)\}", r"(CITATION)", s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of line 153, the space after \cite is not optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I added it for robustness against future changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Could be removed)

@davidweichiang
Copy link
Collaborator

Another reason for handling \cite and \footnote outside of latex_to_unicode is that they aren't related to Unicode.

@davidweichiang
Copy link
Collaborator

Hopefully that works!

@mjpost
Copy link
Member Author

mjpost commented Jan 23, 2020

I'm merging in #736 since it has more test cases for the latex conversion. So far they have turned up the following problems:

  • \href{...}{...}
  • The use of parenthetical custom newcommand which when deleted leave empty parens (e.g., in the past (\past) becomes in the past ())

I'll look at handling these later today.

I also wonder if we should convert footnotes to parenthetical notes instead of removing, but I'm not sure that I'm going to do this.

@davidweichiang
Copy link
Collaborator

It should be straightforward to delete \href along with its first argument, leaving the second argument.

If authors are putting macros in abstracts, I'm not sure what we can do about it. At some point in the future, we might be able to scrape the abstract straight from the PDF.

@davidweichiang
Copy link
Collaborator

The above commit changes \href{http://url.com}{text} to text, and it also turns off changing curly braces to fixed-case inside abstracts.

@mjpost
Copy link
Member Author

mjpost commented Jan 23, 2020

@davidweichiang can you run this code and provide the output?

#!/usr/bin/env python3

import latexcodec, codecs
import sys

s = "the ''generative step'' into"
s = codecs.decode(s, "ulatex+utf8")

print(s)

I get the ”generative step” into with versions 1.0.7 and 2.0.0 of latexcodec.

@davidweichiang
Copy link
Collaborator

@mjpost I get the same output, which is correct, isn't it?

@mjpost
Copy link
Member Author

mjpost commented Jan 23, 2020

Shouldn't it be the “generative step” into? The output has two right curly quotes.

@davidweichiang
Copy link
Collaborator

But so does the LaTeX; it's the author's mistake.

@mjpost
Copy link
Member Author

mjpost commented Jan 23, 2020

Ah right.

@mjpost mjpost merged commit 0a6c8d0 into master Jan 23, 2020
@mjpost mjpost deleted the ingest-ranlp branch January 23, 2020 21:52
najtin pushed a commit to ir-anthology/ir-anthology that referenced this pull request Jun 9, 2021
* added RANLP and workshops
* EMNLP: fixed some abstracts, updated page numbers
* LaTeX to unicode conversion:
   * Remove footnotes and citations using LaTeX abstract syntax tree instead of raw source
   * delete \href along with its first argument; only protect case in title and booktitle, not other fields

Co-authored-by: David Chiang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants