Ingest RANLP 2019 #731

mjpost · 2020-01-15T15:24:23Z

Also updated latex_to_unicode.py to remove \cite from abstracts (maybe @davidweichiang wants to have a look at that: 14a75ba).

mjpost · 2020-01-16T13:59:40Z

The current code transforms \cite{key} to \cite key and then deletes \cite, which is not what we want.

Since we cannot (easily) algorithmically distinguish parenthetical and text citations (and people do not always use \citet and \citep correctly), I think I am in favor of just leaving them as they are. Unless you register disagreement, I’ll update the code and the PR to do that.

Noting this in #666—Softconf should also look for \cite.

akoehn · 2020-01-16T14:28:20Z

My position on LaTeX code in abstracts: We cannot (and should not) do post-processing and guess what the authors wanted. The submission sites clearly ask for plaintext and we should handle it as such.

If the authors cannot be bothered to do the cleanup themselves, so be it. The only helpful thing would be a warning by softcfonc à la “You seem to have entered LaTeX code into this field. We expect plain text and will not perform any post-processing.” or similar (I don’t know whether there is any post-processing going on currently).

mjpost · 2020-01-16T14:30:23Z

Agreed. I like things to be right but chasing this long tail of manual corrections is too tiring.

mjpost · 2020-01-16T15:59:45Z

Okay, this is ready for review now I believe. I updated latex_to_unicode to (a) restore the space between control code and open brace and (b) no longer remove remaining control codes.

mjpost · 2020-01-18T16:19:54Z

bin/latex_to_unicode.py

        return ""

-    s = re.sub(r"\\[A-Za-z]+ |\\.", repl, s)
+    # Remove inserted space for remaining codes ("\code {...}" -> "\code{...}")
+    s = re.sub(r"(\\[A-Za-z]+) \{", r"\1{", s)

    return s



The change here is that LaTeX commands not handled by the above are left in place instead of removed. We also restore the space that was inserted to ease handling. So \nonexistentcommand will be left as-is, as will \cite{citekey}. The view is that it cannot be our responsibility to ensure these are all correct.

davidweichiang · 2020-01-18T16:41:58Z

The rationale behind deleting all TeX commands was that things like \textsc, \underline, etc. I think really would be appropriate to delete. I agree that \cite is bad to delete, but think other commands might be better to delete. I can't remember if there were other problematic commands like \cite.

mjpost · 2020-01-18T21:35:48Z

The argument for a default policy of non-deletion is that it is a mistake to have stray latex in here at all, but when we can't automatically fix it (with the current list of cases we handle), it's better to leave the mistake exposed than to silently remove it, since removing it prevents later (possibly manual) correction.

What about turning \cite{x} into CITATION(x)? This removes the LaTeX and also simplifies the regex code since I don't have to worry about carving out an exception when we remove all stray codes at the end (assuming we don't change that, since it's the current default).

davidweichiang · 2020-01-19T01:14:25Z

Is the suggestion in your second paragraph to do \cite{x} -> CITATION(X) and then to remove all remaining TeX commands? I think I prefer that. I would actually suggest \cite{x} -> (CITATION) which might read slightly better, e.g., "The sky is blue (CITATION)" rather than "The sky is blue CITATION(smith:1900)".

I remember another gray-area case: I have definitely seen \footnote used in abstracts.

mjpost · 2020-01-20T17:27:19Z

This is done.

I think footnotes should just be removed, and added that, too.

davidweichiang · 2020-01-23T02:26:06Z

bin/latex_to_unicode.py

@@ -177,6 +177,9 @@ def latex_to_unicode(s):
    # Transform errant citations into "(CITATION)"
    s = re.sub(r"\\(new)?cite.? ?\{([\w:-]+?)\}", r"(CITATION)", s)

+    # Remove errant \footnotes
+    s = re.sub(r"\\footnote ?\{.*?\}", "", s)


Are you worried about cases like \footnote{This is \emph{really} important.}?

Clearly not :) But I independently ran into this while reprocessing EMNLP (#736), where an (obvious) common instance is \footnote{\url{...}}.

We really need to add unit tests so we know we're handling these issues correctly.

Oh yes, that's an extremely common case. The lazy solution would be to delete \footnote but turn its argument into normal text.

The non-lazy solution would be to add \footnote to the list of LaTeX commands, so that it gets parsed correctly with its argument. If you're sure this is the right way to go, I can try to make the modification. Am I allowed to push to your branch?

Something similar should probably be done for \cite and friends, although there are so many variants of this command that it would clutter up the list of commands...

Yes, please do.

davidweichiang · 2020-01-23T02:50:36Z

bin/latex_to_unicode.py

@@ -177,6 +177,9 @@ def latex_to_unicode(s):
    # Transform errant citations into "(CITATION)"
    s = re.sub(r"\\(new)?cite.? ?\{([\w:-]+?)\}", r"(CITATION)", s)


Because of line 153, the space after \cite is not optional.

True, I added it for robustness against future changes.

(Could be removed)

…ad of raw source

davidweichiang · 2020-01-23T03:36:32Z

Another reason for handling \cite and \footnote outside of latex_to_unicode is that they aren't related to Unicode.

davidweichiang · 2020-01-23T03:44:07Z

Hopefully that works!

mjpost · 2020-01-23T15:20:34Z

I'm merging in #736 since it has more test cases for the latex conversion. So far they have turned up the following problems:

\href{...}{...}
The use of parenthetical custom newcommand which when deleted leave empty parens (e.g., in the past (\past) becomes in the past ())

I'll look at handling these later today.

I also wonder if we should convert footnotes to parenthetical notes instead of removing, but I'm not sure that I'm going to do this.

davidweichiang · 2020-01-23T15:34:17Z

It should be straightforward to delete \href along with its first argument, leaving the second argument.

If authors are putting macros in abstracts, I'm not sure what we can do about it. At some point in the future, we might be able to scrape the abstract straight from the PDF.

…e and booktitle, not other fields

davidweichiang · 2020-01-23T15:38:56Z

The above commit changes \href{http://url.com}{text} to text, and it also turns off changing curly braces to fixed-case inside abstracts.

mjpost · 2020-01-23T16:52:58Z

@davidweichiang can you run this code and provide the output?

#!/usr/bin/env python3

import latexcodec, codecs
import sys

s = "the ''generative step'' into"
s = codecs.decode(s, "ulatex+utf8")

print(s)

I get the ”generative step” into with versions 1.0.7 and 2.0.0 of latexcodec.

davidweichiang · 2020-01-23T16:54:13Z

@mjpost I get the same output, which is correct, isn't it?

mjpost · 2020-01-23T16:55:40Z

Shouldn't it be the “generative step” into? The output has two right curly quotes.

davidweichiang · 2020-01-23T16:59:36Z

But so does the LaTeX; it's the author's mistake.

mjpost · 2020-01-23T17:00:28Z

Ah right.

…ingest-ranlp

* added RANLP and workshops * EMNLP: fixed some abstracts, updated page numbers * LaTeX to unicode conversion: * Remove footnotes and citations using LaTeX abstract syntax tree instead of raw source * delete \href along with its first argument; only protect case in title and booktitle, not other fields Co-authored-by: David Chiang <[email protected]>

mjpost added 2 commits January 15, 2020 10:22

added RANLP and workshops

5d86edc

remove \cite{.*} from abstracts

14a75ba

mjpost requested a review from a team January 15, 2020 15:24

mjpost added 2 commits January 15, 2020 16:43

fixed name spelling to match PDF

d0b9f1c

removed extra space

ac4d566

leaving \cite alone

c255e91

added external DOIs

56e66a7

mjpost commented Jan 18, 2020

View reviewed changes

(CITATION)

6982ee7

mjpost added 3 commits January 20, 2020 12:27

remove footnotes

2ebe321

Merge branch 'ingest-ranlp' into ingest-emnlp19-fix

ffdcacf

updated page numbers for last EMNLP fix

f9c1401

mjpost mentioned this pull request Jan 22, 2020

Ingest emnlp19 fix #736

Closed

davidweichiang reviewed Jan 23, 2020

View reviewed changes

Remove footnotes and citations using LaTeX abstract syntax tree inste…

9b94124

…ad of raw source

davidweichiang approved these changes Jan 23, 2020

View reviewed changes

Merge branch 'ingest-ranlp' into ingest-emnlp19-fix

7733c15

delete \href along with its first argument; only protect case in titl…

43c073f

…e and booktitle, not other fields

black

674dff5

mjpost added 3 commits January 23, 2020 12:01

reprocessed EMNLP volume 1 abstracts

24f26fc

Merge branch 'ingest-ranlp' of github.com:acl-org/acl-anthology into …

3e0a60c

…ingest-ranlp

manual fixes

00ca498

mjpost merged commit 0a6c8d0 into master Jan 23, 2020

mjpost deleted the ingest-ranlp branch January 23, 2020 21:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ingest RANLP 2019 #731

Ingest RANLP 2019 #731

mjpost commented Jan 15, 2020

mjpost commented Jan 16, 2020

akoehn commented Jan 16, 2020

mjpost commented Jan 16, 2020

mjpost commented Jan 16, 2020

mjpost Jan 18, 2020

davidweichiang commented Jan 18, 2020

mjpost commented Jan 18, 2020

davidweichiang commented Jan 19, 2020

mjpost commented Jan 20, 2020

davidweichiang Jan 23, 2020

mjpost Jan 23, 2020

davidweichiang Jan 23, 2020

davidweichiang Jan 23, 2020

mjpost Jan 23, 2020

davidweichiang Jan 23, 2020

mjpost Jan 23, 2020

mjpost Jan 23, 2020

davidweichiang commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020 •

edited

Loading

davidweichiang commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020

		@@ -177,6 +177,9 @@ def latex_to_unicode(s):
		# Transform errant citations into "(CITATION)"
		s = re.sub(r"\\(new)?cite.? ?\{([\w:-]+?)\}", r"(CITATION)", s)

Ingest RANLP 2019 #731

Ingest RANLP 2019 #731

Conversation

mjpost commented Jan 15, 2020

mjpost commented Jan 16, 2020

akoehn commented Jan 16, 2020

mjpost commented Jan 16, 2020

mjpost commented Jan 16, 2020

Choose a reason for hiding this comment

davidweichiang commented Jan 18, 2020

mjpost commented Jan 18, 2020

davidweichiang commented Jan 19, 2020

mjpost commented Jan 20, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidweichiang commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020 • edited Loading

davidweichiang commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020

davidweichiang commented Jan 23, 2020

mjpost commented Jan 23, 2020

mjpost commented Jan 23, 2020 •

edited

Loading