Anthology mirrors #295

Closed · 1 of 3 tasks
akoehn opened this issue May 2, 2019 · 40 comments

@akoehn (Member) commented May 2, 2019

As the anthology is currently not reachable (.htaccess bug?), I want to renew this topic.
There are already two issues regarding this (#22 and #28), but both are more about sharing the old Rails application than about sharing the data.

It would be great to be able to create a mirror by rsync-ing (or similar) the underlying data that is not in this repository. IIRC, the code change needed to host a mirror under a different URL should be minimal, and 35 GB of data is small enough to host easily.
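
For instance, assuming the ACL server exposed a hypothetical rsync module named "anthology", a mirror could be kept in sync with something like rsync -av --delete rsync://aclweb.org/anthology/ ./anthology/.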

TODO:

  • add hashes to the internally linked files in the XML
  • make the host configurable in the Python scripts generating the Hugo pages and pass it from the Makefile
  • create a script to mirror papers with a checksum check
@mbollmann (Member):

FWIW, the old Anthology included a seeding script that simply downloaded everything from the aclweb.org server via HTTP. That's bound to be slower and less efficient than syncing some other way, but it would be pretty simple to recreate this functionality.

@akoehn (Member, Author) commented May 2, 2019

Yes, but with that approach one either has to re-download the whole anthology or implement some bookkeeping of what is new and what isn't (and a list of all downloadable files is needed). Updates to existing files are not propagated, deletions are not noticed, and so on.

A read-only rsync server (probably restricted to certain public keys) would be less hacky and more efficient, given, of course, that a server on which an rsync daemon can run is at hand.

I don't currently have plans for working on it; I just wanted to have an issue for it because the downtime hindered my research :-)

@mjpost (Member) commented May 2, 2019

More later, but tagging #156, since mirrors won’t work until we get rid of absolute URLs.

@akoehn (Member, Author) commented May 22, 2019

Two random points:

  • The PDF files are served without .pdf (see "Add .pdf to links?" #179 for the reason). A mirror would therefore need to perform the same redirecting (but that configuration is not publicly available, afaik). Alternatively, the generator could gain a switch to link to the files including .pdf, making rewrite rules unnecessary for mirrors.
  • The easiest solution I could come up with for performing the mirror is to host a file with all filenames and their hashes (sha512sum or similar, generated e.g. by running find . -type f | xargs sha512sum). A short script on the client side (see the sketch below) could then check the locally available files and download the missing ones. No additional infrastructure is needed on the server, and an integrity check is included.
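
A minimal, untested sketch of such a client in Python, assuming a checksum listing named sha512sums.txt (hypothetical name) in sha512sum's "digest  path" format, with files served relative to a base URL:

# Minimal mirror-client sketch. Assumptions, not the actual anthology setup:
# a sha512sums.txt listing in "digest  path" format, files under BASE_URL.
import hashlib
import os
import urllib.request

BASE_URL = "https://www.aclweb.org/anthology/"  # assumed download location

def sha512_of(path):
    # Hash the file in chunks to keep memory use constant.
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with open("sha512sums.txt") as listing:
    for line in listing:
        digest, name = line.split(maxsplit=1)
        name = name.strip().lstrip("./")
        # Download files that are missing or fail the integrity check.
        if not (os.path.isfile(name) and sha512_of(name) == digest):
            os.makedirs(os.path.dirname(name) or ".", exist_ok=True)
            urllib.request.urlretrieve(BASE_URL + name, name)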

@mjpost (Member) commented May 22, 2019

The .htaccess file with the rewrite rules is in the repo. It has www.aclweb.org hard-coded; I wonder if that could be generalized to work for mirrors. If not, a mirror setup script could just do a replacement.
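
For instance, a mirror setup script could do something like sed -i 's|www\.aclweb\.org|mirror.example.org|g' .htaccess, where mirror.example.org stands in for the mirror's own hostname.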

I like the on-demand downloading. I can generate the SHA1 hashes later.

@akoehn (Member, Author) commented May 22, 2019

Ah, I overlooked it somehow.

Another question regarding mirrors: are the additional resources (slides, data, code, etc.) also mirrored, and would that be a problem license-wise? The ACL papers and the COLING ones are licensed under a CC license, so no problem there. But I am not sure about the rest.

@knmnyn (Collaborator) commented May 23, 2019 via email

@akoehn (Member, Author) commented Jun 17, 2019

Discussion from #333:

@mjpost

I've posted a file with checksums here [14 MB].

@akoehn

I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me.

We could probably save space by encoding the checksums using binary; I don't know whether that is worth the (little) added complexity.

@mjpost (Member) commented Jun 17, 2019

Adding these is a great idea—perhaps as a “checksum” attribute on <url>? Thanks for volunteering to do this.

One request, though—can you wait till tomorrow? I need to finish #317, which is very close. It will be a minor pain to have to rebase once more if you push this in first.

@akoehn (Member, Author) commented Jun 17, 2019

@mjpost the file contains .bib files. I assume they should not be mirrored?

No worries, I will not push anything yet. If we use checksum attributes, the code will be completely different from what I was originally going to implement. I like the approach, as it makes sure every element is mirrorable (and we don't accidentally expose files such as .bib).

As the sha512sum file you posted is quite big, maybe let's settle on a shorter checksum? It is only for the client to check integrity; we do not really need resilience against attacks. sha224 would save us ~3.5 MB in the XML files and is still much more secure than we need.

@mjpost (Member) commented Jun 17, 2019

Ah, yes, the bibs are originals that aren't used. Please ignore them.

I guess I just mean: please be prepared to run this again once I have pushed the nested format. It shouldn't be difficult to update, but it may be easier if you keep in mind that you'll have to.

Yes, a shorter checksum would be great.

@akoehn (Member, Author) commented Jun 17, 2019

I wrote a short bash script that downloads the data. It does not reproduce the file structure on the ACL server, but recreates the structure as it is served (i.e. the /pdf/ directory is not used).

http://arne.chark.eu/tmp/mirror-papers.sh

Not committing it anywhere, as a real solution based on the XML will probably not share any code with it.

@davidweichiang You can try this to mirror the anthology PDFs. It should be fairly self-explanatory.

mjpost added a commit that referenced this issue Jun 21, 2019
A summary of changes:

- Introduces a nested format (closes #317)
- URLs are stored in a relative format for internal links (closes #156), which facilitates mirroring (#295)
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes #181, closes #180), including journal frontmatter (#264) and volume PDFs (closes #31)
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31)
- Punts on C69 reformatting (closes #147)

Relevant, but not completed:
- Creating PDF volumes by pasting together individual papers (#226)
- This makes it much easier to add non-paper entries such as talks (#298), to add a volume-level "publication date" (#319), and to create an RSS feed of updates (#358)
@davidweichiang (Collaborator):

Thanks. It's better than nothing, so should we add it to the repo until someone writes a better one?

@akoehn (Member, Author) commented Jun 28, 2019

Updated issue to reflect current state. Closes #348.

@davidweichiang (Collaborator):

I'm guessing that both the script above and the file of hashes are outdated. It would be great for the hashes to be autogenerated and for the script to become part of the repository.

@mjpost (Member) commented Oct 9, 2019

I could do this fairly quickly. Are we agreed that an optional sha224 attribute on <url> is the best approach?

@akoehn (Member, Author) commented Oct 10, 2019

Yes, it seems to be the safe and future-proof thing to do. The only downside: sha224 is fairly long and would add ~3 MB of data. I think that is okay, but if we only want a simple checksum, crc32 would add only about 400 KB.

The schema should be changed so that the attribute is required for relative URIs:

element url {
    (attribute hash { xsd:string { minLength = "56" maxLength = "56" pattern = "[0-9a-f]*" } },
     xsd:anyURI { pattern = "/.*" })
  | xsd:anyURI { pattern = "https?://.*" }
}? # same for revisions etc.

Completely untested, of course :-)

@mjpost (Member) commented Oct 16, 2019

I thought you suggested sha224 because it's a shorter hash, but it seems longer to me. Using this site to hash "Arne":

  • MD5: 297f7ee0aad5b818bafa6044072c898e
  • SHA1: 2d7739f42ebd62662a710577d3d9078342a69dee
  • SHA224: 44b2cea251afaffb0682a40d125ffb77e7aa09b28c77b245cba3c3c4

So my new proposal is to add an md5 attribute on <url> tags. If I can get consensus I'll do the work.

@akoehn (Member, Author) commented Oct 16, 2019

I only wrote that sha224 is shorter than sha512 :-)

The question is whether we only need a checksum (e.g. to verify that the download was not accidentally corrupted) or whether it should be a cryptographic hash (e.g. to guard against a third party trying to pass a specially crafted PDF as one of the PDFs of the anthology).

If we only need a checksum, crc32 should be sufficient and is even shorter than MD5. If we want a cryptographic hash, we should use at least sha224, as both MD5 and SHA1 have known weaknesses. Maybe just use crc32; we can still distribute the hashes out of band later for all the paranoid people out there ;-)
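
For illustration, both are a few lines with Python's standard library (the file name is hypothetical):

import hashlib
import zlib

data = open("P19-1001.pdf", "rb").read()  # hypothetical anthology PDF

crc = format(zlib.crc32(data) & 0xFFFFFFFF, "08x")  # 8 hex characters
sha = hashlib.sha224(data).hexdigest()              # 56 hex characters
print(crc, sha)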

@mjpost (Member) commented Oct 16, 2019

I can't imagine a threat model where we need to guard against PDF replacement, though it sounds like the basis of a great entry for Bruce Schneier's (now-defunct) movie-plot contests. In that case I suggest crc32.

@akoehn (Member, Author) commented May 6, 2020

Issues such as #730 might be another reason to have checksums.

@mbollmann (Member):

If @mjpost can produce an up-to-date file with CRC32 checksums, I'm happy to add them to the XML sometime this week. Unless you had already begun writing a script for that, @akoehn?

@akoehn (Member, Author) commented May 6, 2020

Everything I did is linked here. I would love to work on it, but I unfortunately have no time for the anthology with the current childcare closures :-((

If you add it, please also add the RelaxNG snippet I posted above, with the length adjusted for crc32, of course.

@mjpost (Member) commented May 6, 2020

I can reproduce this shortly. One issue: in our current model, revisions overwrite the default paper (e.g., revision two produces P19-1001v2.pdf, which also overwrites P19-1001.pdf) so that we can return the latest version by default. This will complicate checksumming, since checksums will have to be updated every time we create a revision.

There are two ways we could deal with this:

  1. Update checksums when adding revisions. I guess this wouldn't be too hard if the checksum is in the XML.
  2. No longer overwrite the main PDF, which would mean that the https://aclweb.org/anthology/P19-1001.pdf shortcut would always return the original file. Perhaps this isn't too bad, since we have the canonical page anyway.

I have always been dissatisfied with overwriting the PDF in this manner (the original is saved to "v1.pdf", of course) since it creates the potential for error (see #730) and overloads the meaning of the file name.

@akoehn (Member, Author) commented May 6, 2020

I would overwrite the main PDF, because I would expect an unversioned link to return the latest version. Another reason is that it is otherwise very hard to programmatically obtain the current version of a paper (Zotero, ebib, ...).

I agree that not overwriting the PDF would be the cleaner solution, though :-/

The cleanest solution would probably be to save all "v1" files as v1 and have a symlink pointing from the unversioned file name to the latest version. I don't know.
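
On the server that could look like ln -sf P19-1001v2.pdf P19-1001.pdf (a hypothetical example), re-pointing the symlink whenever a new revision is ingested.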

@mbollmann (Member):

+1 to all of what @akoehn said. I think overwriting is the most practical solution.

We could also make sure to update all the scripts for ingestion, adding revisions, etc., so that they automatically compute and add or update the checksum.
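
A rough sketch of such a helper, assuming CRC32 and treating the hash attribute name and the <url> matching logic as placeholders, not the actual ingestion code:

import zlib
import xml.etree.ElementTree as ET

def crc32_of(path):
    # Hex CRC32, computed in chunks; 8 lowercase hex characters.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08x")

def update_checksum(xml_path, paper_url_text, pdf_path):
    # Refresh the hash attribute of the matching <url> element.
    tree = ET.parse(xml_path)
    for url in tree.iter("url"):
        if url.text == paper_url_text:  # assumes <url> holds the paper's path/ID
            url.set("hash", crc32_of(pdf_path))
    tree.write(xml_path, encoding="UTF-8", xml_declaration=True)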

@mbollmann (Member):

But that said, the existence of the v1 version is never made explicit anywhere in the XML file. Maybe we should change that, also so that there actually is something to attach that file's checksum to.

@mjpost (Member) commented May 8, 2020

The checksums are here, if someone wants to take a shot at this:

http://cs.jhu.edu/~post/tmp/crc32.txt.gz

@mbollmann (Member):

Thanks @mjpost! What do we do with files that are currently missing (as per #264)? If we require the checksum in the RelaxNG schema, as @akoehn suggested (and I think we should), validation would fail. Should I add a dummy value (e.g. 00000000) for now, or how do we want to handle this?

Also, if there are no objections, I would go ahead and add <revision id="1" ...> entries for all papers that currently have revisions, so we can actually record the checksums of the original versions somewhere.

@akoehn (Member, Author) commented May 9, 2020 via email

@mjpost (Member) commented May 9, 2020

I like the idea of adding a <revision id="1" ...> tag.

Should we fix #264 first? It should be easy to remove <url> lines where there is no paper being linked to, since these should be generating errors anyway.

@mbollmann (Member):

@mjpost Can you share checksums for the IWPT files (and possibly EAMT, if they're going to be merged soon)? I've got everything ready to add checksums when I find a minute, but I need complete coverage, of course.

@mjpost (Member) commented May 13, 2020

Here are the IWPT checksums. I think your changes will be ready before EAMT, and I'm still debugging the process, so I'll add those checksums after merging in your changes, if that's okay.

https://cs.jhu.edu/~post/tmp/iwpt-crc32.txt

@mjpost (Member) commented May 13, 2020

@mbollmann I just updated that file with volume-level PDFs (#810); you may wish to add those, too.

@mjpost (Member) commented Nov 25, 2020

Once we have mirroring working and a permanent mirror in place, it would also be nice to set up a workflow that publishes a live version of every branch for previewing. We could define a permanent mirror with full site functionality (e.g., aclanthology.org), with live branches at, say, aclanthology.org/dev/{branch_name}.

@akoehn (Member, Author) commented Nov 27, 2020

tadaa: http://aclanthology.lst.uni-saarland.de/anthology/

Attachments are not on the mirror at the moment, but that is purely due to miscalculated disk space and can be fixed. I will post the code soon.

Two observations:

  • The aclweb.org webserver is sloooooow! It took me about 7h(?) just to download all the data, which should have been a ~1h job with my internet connection.
  • @mjpost I recall you saying that the anthology is 30 GB; it is more like 51 GB.

Re the mirror for dev: sure, that is even easier, because the papers do not need to be mirrored (and that is what this work was mostly about).

@akoehn (Member, Author) commented Nov 27, 2020

What I forgot: please click around and test whether anything is broken.

@mjpost (Member) commented Nov 27, 2020

The mirroring seems to work—but what about changing the prefix? Using both a subdomain and a top-level directory is redundant. Would this build at the root level instead of under /anthology?

@akoehn (Member, Author) commented Nov 28, 2020

@mjpost http://aclanthology.lst.uni-saarland.de/

@akoehn (Member, Author) commented Apr 19, 2021

I think we are (finally) done with this, after merging #1124.

@akoehn closed this as completed Apr 19, 2021
najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021