check for abstracts cut-and-pasted from the PDF #666

Open
mjpost opened this issue Nov 19, 2019 · 5 comments
Comments

@mjpost
Member

mjpost commented Nov 19, 2019

Ingestion (perhaps via normalize_anth) should check for hyphenations that indicate an abstract has been cut-and-pasted from a PDF, e.g., "pro- vides". We could probably mostly correct this automatically, though note that there are situations where the hyphen should be retained and only the space deleted (e.g., "top- selling").

Here's an example.

@davidweichiang
Collaborator

I believe cut-and-paste could also result in strange characters like LATIN SMALL LIGATURE FI (U+FB01) but I'm not sure if I've seen that.
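A minimal sketch of how such ligatures could be detected and expanded, using only the standard `unicodedata` module. The function names `find_ligatures` and `expand_ligatures` are hypothetical, and restricting the check to U+FB00 through U+FB06 is an assumption about which characters matter here.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00..U+FB06).
# Assumption: these are the only PDF ligatures we care about.
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def find_ligatures(text):
    """Return (index, character, Unicode name) for each ligature in text."""
    return [
        (i, ch, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in LIGATURES
    ]


def expand_ligatures(text):
    """Replace each ligature with its compatibility decomposition.

    Only the ligature characters are normalized, so accented letters and
    other legitimate non-ASCII text are left exactly as entered.
    """
    return "".join(
        unicodedata.normalize("NFKD", ch) if ch in LIGATURES else ch
        for ch in text
    )
```

Normalizing the whole abstract with NFKC would also expand ligatures, but it rewrites other compatibility characters too (superscripts, full-width forms), which may not be wanted in an abstract.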

@davidweichiang
Collaborator

Maybe there should be a preventative step like printing a message in the final submission form saying 'Your abstract will appear in the Anthology exactly as entered here. If you copy and paste it from a PDF, please check it carefully, particularly for word breaks like "pro- vides".'

The second step would be detection; I guess the heuristic should be something like "look for strings of the form 'x- y' where 'xy' is a dictionary word but either 'x' or 'y' is not." If we can build a reliable detector, perhaps we could even ask Softconf to run it for us at submission time.
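A rough sketch of that detection heuristic, assuming the dictionary is supplied as a set of lowercase words (for example, loaded from a system word list). The name `find_suspect_breaks` and the letters-only regular expression are illustrative assumptions, not existing Anthology code.

```python
import re

# Matches strings of the form "x- y": letters, a hyphen, whitespace, letters.
BREAK_RE = re.compile(r"\b([A-Za-z]+)-\s+([A-Za-z]+)\b")


def find_suspect_breaks(text, dictionary):
    """Return (matched_text, joined_word) pairs that look like PDF line breaks.

    A match "x- y" is flagged when "xy" is a dictionary word but
    "x" or "y" on its own is not. `dictionary` is a set of lowercase words.
    """
    suspects = []
    for m in BREAK_RE.finditer(text):
        left, right = m.group(1), m.group(2)
        joined = (left + right).lower()
        parts_are_words = (
            left.lower() in dictionary and right.lower() in dictionary
        )
        if joined in dictionary and not parts_are_words:
            suspects.append((m.group(0), left + right))
    return suspects
```

With a dictionary containing "provides", "top", and "selling", this flags "pro- vides" but not "top- selling", since "topselling" is not a word. A detector like this only reports candidates; it does not change the text.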

If you want to automatically correct, maybe the heuristic needs to be better than that. If the LaTeX is provided, we can use the LaTeX. Otherwise, I suppose we could train a CRF to do it, but I'm not sure if that is worth the time.

@davidweichiang
Collaborator

Similar to issue #643, this affects the conference website, handbook, etc., as well. So it might be preferable to deal with this earlier than ingestion time.

@mjpost
Member Author

mjpost commented Nov 26, 2019

I'd love to see this in START, which is the only place it could be fixed by the author.

Fixing automatically at ingestion time sounds fine so long as we use the dictionary approach. Maintaining a CRF or having to dig up LaTeX sounds time-consuming on the part of the ingester.
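One way the dictionary approach might look at ingestion time, as a sketch only. It assumes a set of lowercase dictionary words is available; `fix_breaks` is a hypothetical name, and the three rules below (join, keep the hyphen, or leave alone) are one guess at a reasonable policy rather than anything already in normalize_anth.

```python
import re

# Matches strings of the form "x- y": letters, a hyphen, whitespace, letters.
BREAK_RE = re.compile(r"\b([A-Za-z]+)-\s+([A-Za-z]+)\b")

# Assumption: a suspended hyphen before a conjunction is intentional,
# as in "pre- and post-editing", so it is never rewritten.
CONJUNCTIONS = {"and", "or"}


def fix_breaks(text, dictionary):
    """Repair "x- y" breaks using a set of lowercase dictionary words."""

    def repair(m):
        left, right = m.group(1), m.group(2)
        lo_left, lo_right = left.lower(), right.lower()
        if lo_right in CONJUNCTIONS:
            return m.group(0)  # suspended hyphen: leave as entered
        parts_are_words = lo_left in dictionary and lo_right in dictionary
        if (lo_left + lo_right) in dictionary and not parts_are_words:
            return left + right  # "pro- vides" -> "provides"
        if parts_are_words:
            return left + "-" + right  # "top- selling" -> "top-selling"
        return m.group(0)  # unsure: leave unchanged for manual review

    return BREAK_RE.sub(repair, text)
```

Anything that matches neither rule is left untouched, so an incomplete dictionary results in missed fixes rather than new damage to the abstract.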

@mjpost
Member Author

mjpost commented Mar 30, 2020

This could be a good thing to add to the proposed sanity-check.py script in ACLPUB.


2 participants