Ingest RANLP 2019 #731
The current code transforms Since we cannot (easily) algorithmically distinguish parenthetical and text citations (and people do not always use Noting this in #666—Softconf should also look for |
My position on LaTeX code in abstracts: We cannot (and should not) do post-processing and guess what the authors wanted. The submission sites clearly ask for plain text and we should handle it as such. If the authors cannot be bothered to do the cleanup themselves, so be it. The only helpful thing would be a warning by Softconf à la “You seem to have entered LaTeX code into this field. We expect plain text and will not perform any post-processing.” or similar (I don’t know whether there is any post-processing going on currently).
Agreed. I like things to be right but chasing this long tail of manual corrections is too tiring. |
Okay, this is ready for review now I believe. I updated |
bin/latex_to_unicode.py
Outdated
        return ""

    s = re.sub(r"\\[A-Za-z]+ |\\.", repl, s)
    # Remove inserted space for remaining codes ("\code {...}" -> "\code{...}")
    s = re.sub(r"(\\[A-Za-z]+) \{", r"\1{", s)

    return s
The change here is that LaTeX commands not handled by the above are left in place instead of removed. We also remove the space that was inserted to ease handling, restoring the original form. So \nonexistentcommand will be left as-is, as will \cite{citekey}. The view is that it cannot be our responsibility to ensure these are all correct.
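To make the behavior described above concrete, here is a minimal sketch. The `KNOWN` table and the function name `convert_commands` are hypothetical stand-ins (the script's real lookup table is not shown in this thread); only the two regular expressions are taken from the diff. It assumes an earlier step has already inserted a space after each command name.

```python
import re

# Hypothetical, tiny lookup table for illustration only.
KNOWN = {"\\&": "&", "\\%": "%", "\\ss": "ß"}

def convert_commands(s):
    def repl(m):
        code = m.group(0).rstrip()
        # Unknown commands are returned unchanged instead of being deleted.
        return KNOWN.get(code, m.group(0))

    # Same pattern as in the diff: a command name plus its inserted space,
    # or a backslash followed by any single character.
    s = re.sub(r"\\[A-Za-z]+ |\\.", repl, s)
    # Remove the inserted space for commands that were left in place.
    s = re.sub(r"(\\[A-Za-z]+) \{", r"\1{", s)
    return s

print(convert_commands(r"Stra\ss e, A \& B, \cite {citekey}"))
```

With this policy a stray `\cite{citekey}` stays visible in the output, so it can still be found and corrected later.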
The rationale behind deleting all TeX commands was that things like \textsc, \underline, etc. really would be appropriate to delete, I think. I agree that \cite is bad to delete, but other commands might be better deleted. I can't remember if there were other problematic commands like \cite.
The argument for a default policy of non-deletion is that it is a mistake to have stray LaTeX in here at all, but when we can't automatically fix it (with the current list of cases we handle), it's better to leave the mistake exposed than to silently remove it, since removing it prevents later (possibly manual) correction. What about turning
Is the suggestion in your second paragraph to do I remember another gray-area case: I have definitely seen |
This is done. I think footnotes should just be removed, and added that, too. |
bin/latex_to_unicode.py
Outdated
@@ -177,6 +177,9 @@ def latex_to_unicode(s):
    # Transform errant citations into "(CITATION)"
    s = re.sub(r"\\(new)?cite.? ?\{([\w:-]+?)\}", r"(CITATION)", s)

    # Remove errant \footnotes
    s = re.sub(r"\\footnote ?\{.*?\}", "", s)
Are you worried about cases like \footnote{This is \emph{really} important.}?
Clearly not :) But I independently ran into this while reprocessing EMNLP (#736), where an (obvious) common instance is \footnote{\url{...}}.
We really need to add unit tests so we know we're handling these issues correctly.
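As an illustration of the kind of test that would catch this, the snippet below applies the footnote pattern from the diff to the nested-brace example raised earlier. The helper name `remove_footnotes_naive` is made up for this sketch. Because `.*?` stops at the first closing brace, the tail of the footnote leaks into the output.

```python
import re

def remove_footnotes_naive(s):
    # Pattern copied from the diff; the lazy .*? ends at the FIRST "}".
    return re.sub(r"\\footnote ?\{.*?\}", "", s)

flat = r"Text\footnote{A simple note.} more"
nested = r"Text\footnote{This is \emph{really} important.} more"

print(remove_footnotes_naive(flat))    # footnote removed cleanly
print(remove_footnotes_naive(nested))  # leaves " important.}" behind
```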
Oh yes, that's an extremely common case. The lazy solution would be to delete \footnote but turn its argument into normal text.
The non-lazy solution would be to add \footnote to the list of LaTeX commands, so that it gets parsed correctly with its argument. If you're sure this is the right way to go, I can try to make the modification. Am I allowed to push to your branch?
Something similar should probably be done for \cite and friends, although there are so many variants of this command that it would clutter up the list of commands...
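A rough sketch of what argument-aware handling could look like, independent of how the script actually parses LaTeX: the names `DROP`, `skip_group`, and `drop_commands` are hypothetical, and the policy table is only an example. The idea is to find the brace that really closes the argument by counting nesting depth, so nested commands inside a footnote are consumed with it. Escaped braces such as `\{` are not handled here.

```python
import re

# Hypothetical policy table: command name -> replacement text.
DROP = {"footnote": "", "cite": "", "newcite": "", "citep": "", "citet": ""}

COMMAND = re.compile(r"\\([A-Za-z]+) ?\{")

def skip_group(s, i):
    """Given s[i] == '{', return the index just past the matching '}'."""
    depth = 0
    while i < len(s):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return i  # unbalanced input: consume to the end

def drop_commands(s):
    out = []
    i = 0
    while i < len(s):
        m = COMMAND.match(s, i)
        if m and m.group(1) in DROP:
            out.append(DROP[m.group(1)])
            # m.end() - 1 is the position of the opening brace.
            i = skip_group(s, m.end() - 1)
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(drop_commands(r"Text\footnote{This is \emph{really} important.} more"))
```

Keeping the command variants in one table also limits the clutter from the many `\cite` forms, since adding a variant is a one-line change.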
Yes, please do.
bin/latex_to_unicode.py
Outdated
@@ -177,6 +177,9 @@ def latex_to_unicode(s):
    # Transform errant citations into "(CITATION)"
    s = re.sub(r"\\(new)?cite.? ?\{([\w:-]+?)\}", r"(CITATION)", s)
Because of line 153, the space after \cite is not optional.
True, I added it for robustness against future changes.
(Could be removed)
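For reference, a small check of what the citation pattern in the hunk above does and does not match. The wrapper name `mark_citations` is invented for this sketch; the pattern itself is copied from the diff. Since the space is written as ` ?`, both the spaced and unspaced forms match, which is the robustness mentioned above. Multi-key citations are not matched, because a comma is not in the key character class.

```python
import re

# Pattern copied from the diff.
CITE = re.compile(r"\\(new)?cite.? ?\{([\w:-]+?)\}")

def mark_citations(s):
    return CITE.sub("(CITATION)", s)

print(mark_citations(r"as in \cite{smith:2019}"))  # unspaced form matches
print(mark_citations(r"\newcite {a-b} show"))      # spaced form matches too
print(mark_citations(r"\cite{a,b}"))               # comma-separated keys: unchanged
```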
Another reason for handling |
Hopefully that works! |
I'm merging in #736 since it has more test cases for the latex conversion. So far they have turned up the following problems:
I'll look at handling these later today. I also wonder if we should convert footnotes to parenthetical notes instead of removing, but I'm not sure that I'm going to do this. |
It should be straightforward to delete If authors are putting macros in abstracts, I'm not sure what we can do about it. At some point in the future, we might be able to scrape the abstract straight from the PDF. |
…e and booktitle, not other fields
The above commit changes |
@davidweichiang can you run this code and provide the output?
I get |
@mjpost I get the same output, which is correct, isn't it? |
Shouldn't it be |
But so does the LaTeX; it's the author's mistake. |
Ah right. |
* added RANLP and workshops
* EMNLP: fixed some abstracts, updated page numbers
* LaTeX to unicode conversion:
  * Remove footnotes and citations using LaTeX abstract syntax tree instead of raw source
  * delete \href along with its first argument; only protect case in title and booktitle, not other fields

Co-authored-by: David Chiang <[email protected]>
Also updated latex_to_unicode.py to remove \cite from abstracts (maybe @davidweichiang wants to have a look at that: 14a75ba).