LaTeX processing is not being done on ingestion #644

Closed
davidweichiang opened this issue Nov 12, 2019 · 12 comments


@davidweichiang
Collaborator

This came up in #628, and I think it appeared for the first time for EMNLP 2019 because we pushed some simplifying changes to START that turned off LaTeX on their end.

Is it because normalize_anth.py is not being run with the -t option?

The problem is that currently, normalize_anth.py cannot be rerun with the -t option; all kinds of errors come up. It ought to be fixable but might not be an easy fix.

@mjpost
Member

mjpost commented Nov 12, 2019

> Is it because normalize_anth.py is not being run with the -t option?

Ah, yes, this must be it!

I do not call normalize_anth.py directly; I only call it via bin/ingest.py, where I call 'process()' but pass it "xml" instead of "latex".

Can I use normalize_anth.py while reading in an XML file?

@davidweichiang
Collaborator Author

Sorry, I didn't understand the last question...

@mjpost
Member

mjpost commented Nov 12, 2019

It looks like I should change this line, passing "latex" instead of "xml". Is that correct?

[Edit: added the link]

@davidweichiang
Collaborator Author

Right, with the caveat that LaTeX processing should not be done more than once.
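
To illustrate why a second pass is risky, here is a minimal sketch. It assumes a hypothetical `toy_latex_to_text` function (a stand-in written for this example, not the code in normalize_anth.py) that unescapes a few special characters and, like LaTeX, treats an unescaped `%` as the start of a comment. The sample title is made up.

```python
import re


def toy_latex_to_text(s: str) -> str:
    """Hypothetical stand-in for a LaTeX-to-text pass; NOT normalize_anth.py."""
    # As in LaTeX, an unescaped % starts a comment: drop it and the rest.
    s = re.sub(r"(?<!\\)%.*", "", s)
    # Unescape a handful of special characters (\% -> %, \& -> &, ...).
    s = re.sub(r"\\([%&$#_])", r"\1", s)
    return s.rstrip()


raw = r"Cutting Errors by 50\% with Q\&A Models"  # made-up title
once = toy_latex_to_text(raw)    # escapes resolved correctly
twice = toy_latex_to_text(once)  # the now-bare % is read as a comment

print(once)
print(twice)
```

In this toy version the first pass yields clean text, while the second pass silently truncates the title at the `%`. Whether the real converter fails in the same way is not something this sketch shows; it only demonstrates that such a conversion need not be idempotent.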

@mjpost
Member

mjpost commented Nov 12, 2019

Ingest is called just once, so this is the perfect place for it.

@davidweichiang
Collaborator Author

OK, and do you want to do anything about EMNLP 2019?

@mjpost
Member

mjpost commented Nov 12, 2019

Yes, will add to #645 (don't merge yet).

@mjpost
Member

mjpost commented Nov 12, 2019

Did abstracts. Titles and others are trickier at this point and will require a custom script. Is it worth it for me to do that?

@davidweichiang
Collaborator Author

Maybe it's easier to eyeball all the titles.

@mjpost
Member

mjpost commented Nov 12, 2019

Just titles, though? Or anything else? (author names?)

@davidweichiang
Collaborator Author

I think START sends us author names in UTF-8.

@davidweichiang
Collaborator Author

I took a quick look at the D19 index page and only found

A Label Informative Wide & Deep Classifier for Patents and Papers
LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019
Supervised neural machine translation based on data augmentation and improved training & inference process
Finding Generalizable Evidence by Learning to Convince Q&A Models

uniblock -> uniblock

Answering Naturally : Factoid to Full length Answer Generation (not sure if this needs to be corrected)
Samvaadhana : A Telugu Dialogue System in Hospital Domain (same)
Efficiency through Auto-Sizing:Notre Dame NLP’s Submission to the WNGT 2019 Efficiency Task (missing space after colon -- yes, I'd like to correct this!)

Cherry Colin | Durrett Greg | Foster George | Haffari Reza | Khadivi Shahram | Peng Nanyun | Ren Xiang | Swayamdipta Swabha (names are all backwards)

I saw tons of author capitalization problems (#643).
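
The title problems listed above could also be flagged mechanically as a first pass before eyeballing. A rough sketch, assuming titles are available as plain strings; `flag_title` and the patterns in `CHECKS` are illustrative names made up for this example, not part of the Anthology codebase, and a flagged title still needs a human decision.

```python
import re
from typing import List

# Illustrative patterns only; each label maps to a regex that marks a
# title as worth a manual look.
CHECKS = {
    "space before colon": re.compile(r"\s:"),
    "missing space after colon": re.compile(r":(?=\S)"),
    "ampersand (check for unprocessed LaTeX)": re.compile(r"&"),
}


def flag_title(title: str) -> List[str]:
    """Return the labels of all checks that match the given title."""
    return [label for label, pattern in CHECKS.items() if pattern.search(title)]


titles = [
    "Samvaadhana : A Telugu Dialogue System in Hospital Domain",
    "Efficiency through Auto-Sizing:Notre Dame NLP’s Submission to the WNGT 2019 Efficiency Task",
    "Finding Generalizable Evidence by Learning to Convince Q&A Models",
]
for t in titles:
    print(flag_title(t), t)
```

The colon checks are deliberately crude: any colon followed directly by a non-space character is flagged, so titles containing times or ratios would need to be dismissed by hand.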

davidweichiang added a commit that referenced this issue Nov 12, 2019
call normalize correctly (closes #644)
najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021