Anthology mirrors #295

Closed · 1 of 3 tasks
akoehn opened this issue May 2, 2019 · 40 comments

@akoehn (Member) commented May 2, 2019

As the anthology is currently not reachable (.htaccess bug?), I want to renew this topic.
There are already two issues regarding this (#22 and #28), but both are more about sharing the old Rails application than about sharing the data.

It would be great to be able to create a mirror by rsync-ing (or similar) the underlying data that is not in this repository. IIRC, the code change needed to host a mirror under a different URL should be minimal, and 35 GB of data is small enough to host easily.
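
For instance, assuming the ACL server exposed a hypothetical rsync module named "anthology", a mirror could be kept in sync with something like rsync -av --delete rsync://aclweb.org/anthology/ ./anthology/.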

TODO:

  • add hashes to the internally linked files in the XML
  • make the host configurable in the Python scripts generating the Hugo pages and pass it from the Makefile
  • create a script to mirror papers with a checksum check
@mbollmann (Member):

FWIW, the old Anthology included a seeding script that simply downloaded everything from the aclweb.org server via HTTP. That's bound to be slower and less efficient than syncing some other way, but it would be pretty simple to recreate this functionality.

@akoehn (Member, Author) commented May 2, 2019

Yes, but with that approach one either has to re-download the whole anthology or implement some bookkeeping of what is new and what isn't (and a list of all downloadable files is needed). Updates to existing files are not propagated, deletions are not noticed, and so on.

A read-only rsync server (probably restricted to certain public keys) would be less hacky and more efficient, given, of course, that a server on which an rsync daemon can run is at hand.

I don't currently have plans for working on it; I just wanted to have an issue for it because the downtime hindered my research :-)

@mjpost (Member) commented May 2, 2019

More later, but tagging #156, since mirrors won’t work until we get rid of absolute URLs.

@akoehn (Member, Author) commented May 22, 2019

Two random points:

  • The PDF files are served without .pdf (see "Add .pdf to links?" #179 for the reason). A mirror would therefore need to perform the same redirecting (but that configuration is not publicly available, afaik). Alternatively, the generator could gain a switch to link to the files including .pdf, making rewrite rules unnecessary for mirrors.
  • The easiest solution I could come up with for performing the mirror is to host a file with all filenames and their hashes (sha512sum or similar, generated e.g. by running find . -type f | xargs sha512sum). A short script on the client side (see the sketch below) could then check the locally available files and download the missing ones. No additional infrastructure is needed on the server, and an integrity check is included.
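
A minimal, untested sketch of such a client in Python, assuming a checksum listing named sha512sums.txt (hypothetical name) in sha512sum's "digest  path" format, with files served relative to a base URL:

# Minimal mirror-client sketch. Assumptions, not the actual anthology setup:
# a sha512sums.txt listing in "digest  path" format, files under BASE_URL.
import hashlib
import os
import urllib.request

BASE_URL = "https://www.aclweb.org/anthology/"  # assumed download location

def sha512_of(path):
    # Hash the file in chunks to keep memory use constant.
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with open("sha512sums.txt") as listing:
    for line in listing:
        digest, name = line.split(maxsplit=1)
        name = name.strip().lstrip("./")
        # Download files that are missing or fail the integrity check.
        if not (os.path.isfile(name) and sha512_of(name) == digest):
            os.makedirs(os.path.dirname(name) or ".", exist_ok=True)
            urllib.request.urlretrieve(BASE_URL + name, name)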

@mjpost (Member) commented May 22, 2019

The .htaccess file with the rewrite rules is in the repo. It has www.aclweb.org hard-coded; I wonder if that could be generalized to work for mirrors. If not, a mirror setup script could just do a replacement.
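
For instance, a mirror setup script could do something like sed -i 's|www\.aclweb\.org|mirror.example.org|g' .htaccess, where mirror.example.org stands in for the mirror's own hostname.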

I like the on-demand downloading. I can generate the SHA1 hashes later.

@akoehn (Member, Author) commented May 22, 2019

Ah, I overlooked it somehow.

Another question regarding mirrors: are the additional resources (slides, data, code, etc.) also mirrored, and would that be a problem license-wise? The ACL papers and the COLING ones are licensed under a CC license, so no problem there. But I am not sure about the rest.

@knmnyn (Collaborator) commented May 23, 2019 via email

@akoehn (Member, Author) commented Jun 17, 2019

Discussion from #333:

@mjpost

I've posted a file with checksums here [14 MB].

@akoehn

I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me.

We could probably save space by encoding the checksums using binary; I don't know whether that is worth the (little) added complexity.

@mjpost (Member) commented Jun 17, 2019

Adding these is a great idea—perhaps as a “checksum” attribute on <url>? Thanks for volunteering to do this.

One request, though—can you wait till tomorrow? I need to finish #317, which is very close. It will be a minor pain to have to rebase once more if you push this in first.

@akoehn (Member, Author) commented Jun 17, 2019

@mjpost the file contains .bib files. I assume they should not be mirrored?

No worries, I will not push anything yet. If we use checksum attributes, the code will be completely different from what I was originally going to implement. I like the approach, as it makes sure every element is mirrorable (and we don't accidentally expose files such as .bib).

As the sha512sum file you posted is quite big, maybe let's settle on a shorter checksum? It is only for the client to check integrity; we do not really need resilience against attacks. sha224 would save us ~3.5 MB in the XML files and is still much more secure than we need.

@mjpost (Member) commented Jun 17, 2019

Ah, yes, the bibs are originals that aren't used. Please ignore them.

I guess I just mean: please be prepared to run this again once I have pushed the nested format. It shouldn't be difficult to update, but it may be easier if you keep in mind that you'll have to.

Yes, a shorter checksum would be great.

@akoehn (Member, Author) commented Jun 17, 2019

I wrote a short bash script that downloads the data. It does not reproduce the file structure on the ACL server, but recreates the structure as it is served (i.e. the /pdf/ directory is not used).

http://arne.chark.eu/tmp/mirror-papers.sh

Not committing it anywhere, as a real solution based on the XML will probably not share any code with it.

@davidweichiang You can try this to mirror the anthology PDFs. It should be fairly self-explanatory.

mjpost added a commit that referenced this issue Jun 21, 2019
A summary of changes:

- Introduces a nested format (closes #317)
- URLs are stored in a relative format for internal links (closes #156), which facilitates mirroring (#295)
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes #181, closes #180), including journal frontmatter (#264) and volume PDFs (closes #31)
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31)
- Punts on C69 reformatting (closes #147)

Relevant, but not completed:
- Creating PDF volumes by pasting together individual papers (#226)
- This makes it much easier to add non-paper entries such as talks (#298), to add a volume-level "publication date" (#319), and to create an RSS feed of updates (#358)
@davidweichiang (Collaborator):

Thanks. It's better than nothing, so should we add it to the repo until someone writes a better one?

@akoehn (Member, Author) commented Jun 28, 2019

Updated issue to reflect current state. Closes #348.

@davidweichiang (Collaborator):

I'm guessing that both the script above and the file of hashes are outdated. It would be great for the hashes to be autogenerated and for the script to become part of the repository.

@mjpost (Member) commented Oct 9, 2019

I could do this fairly quickly. Are we agreed that an optional sha224 attribute on <url> is the best approach?

@akoehn (Member, Author) commented Oct 10, 2019

Yes, it seems to be the safe and future-proof thing to do. The only downside: sha224 is fairly long and would add ~3 MB of data. I think that is okay, but if we only want a simple checksum, crc32 would add only about 400 KB.

The schema should be changed so that the attribute is required for relative URIs:

element url {
    (attribute hash { xsd:string { minLength = "56" maxLength = "56" pattern = "[0-9a-f]*" } },
     xsd:anyURI { pattern = "/.*" })
  | xsd:anyURI { pattern = "https?://.*" }
}? # same for revisions etc.

Completely untested, of course :-)

@mjpost (Member) commented Oct 16, 2019

I thought you suggested sha224 because it's a shorter hash, but it seems longer to me. Using this site to hash "Arne":

  • MD5: 297f7ee0aad5b818bafa6044072c898e
  • SHA1: 2d7739f42ebd62662a710577d3d9078342a69dee
  • SHA224: 44b2cea251afaffb0682a40d125ffb77e7aa09b28c77b245cba3c3c4

So my new proposal is to add an md5 attribute on <url> tags. If I can get consensus I'll do the work.

@akoehn (Member, Author) commented Oct 16, 2019

I only wrote that sha224 is shorter than sha512 :-)

The question is whether we only need a checksum (e.g. to verify that the download was not accidentally corrupted) or whether it should be a cryptographic hash (e.g. to guard against a third party trying to pass a specially crafted PDF as one of the PDFs of the anthology).

If we only need a checksum, crc32 should be sufficient and is even shorter than MD5. If we want a cryptographic hash, we should use at least sha224, as both MD5 and SHA1 have known weaknesses. Maybe just use crc32; we can still distribute the hashes out of band later for all the paranoid people out there ;-)
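
For illustration, both are a few lines with Python's standard library (the file name is hypothetical):

import hashlib
import zlib

data = open("P19-1001.pdf", "rb").read()  # hypothetical anthology PDF

crc = format(zlib.crc32(data) & 0xFFFFFFFF, "08x")  # 8 hex characters
sha = hashlib.sha224(data).hexdigest()              # 56 hex characters
print(crc, sha)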

@mjpost (Member) commented Oct 16, 2019

I can't imagine a threat model where we need to guard against PDF replacement, though it sounds like the basis of a great entry for Bruce Schneier's (now-defunct) movie-plot contests. In that case I suggest crc32.

@akoehn (Member, Author) commented May 6, 2020

Issues such as #730 might be another reason to have checksums.

@mbollmann (Member):

If @mjpost can produce an up-to-date file with CRC32 checksums, I'm happy to add them to the XML sometime this week. Unless you had already begun writing a script for that, @akoehn?

@akoehn (Member, Author) commented May 6, 2020

Everything I did is linked here. I would love to work on it, but I unfortunately have no time for the anthology with the current childcare closures :-((

If you add it, please also add the RelaxNG snippet I posted above, with the length adjusted for crc32, of course.

@mjpost (Member) commented May 6, 2020

I can reproduce this shortly. One issue: in our current model, revisions overwrite the default paper (e.g., revision two produces P19-1001v2.pdf, which also overwrites P19-1001.pdf) so that we can return the latest version by default. This will complicate checksumming, since checksums will have to be updated every time we create a revision.

There are two ways we could deal with this:

  1. Update checksums when adding revisions. I guess this wouldn't be too hard if the checksum is in the XML.
  2. No longer overwrite the main PDF, which would mean that the https://aclweb.org/anthology/P19-1001.pdf shortcut would always return the original file. Perhaps this isn't too bad, since we have the canonical page anyway.

I have always been dissatisfied with overwriting the PDF in this manner (the original is saved to "v1.pdf", of course) since it creates the potential for error (see #730) and overloads the meaning of the file name.

@akoehn (Member, Author) commented May 6, 2020

I would overwrite the main PDF, because I would expect an unversioned link to return the latest version. Another reason is that it is otherwise very hard to programmatically obtain the current version of a paper (Zotero, ebib, ...).

I agree that not overwriting the PDF would be the cleaner solution, though :-/

The cleanest solution would probably be to save all "v1" files as v1 and have a symlink pointing from the unversioned file name to the latest version. I don't know.
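
On the server that could look like ln -sf P19-1001v2.pdf P19-1001.pdf (a hypothetical example), re-pointing the symlink whenever a new revision is ingested.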

@mbollmann (Member):

+1 to all of what @akoehn said. I think overwriting is the most practical solution.

We could also make sure to update all the scripts for ingestion, adding revisions, etc., so that they automatically compute and add or update the checksum.
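
A rough sketch of such a helper, assuming CRC32 and treating the hash attribute name and the <url> matching logic as placeholders, not the actual ingestion code:

import zlib
import xml.etree.ElementTree as ET

def crc32_of(path):
    # Hex CRC32, computed in chunks; 8 lowercase hex characters.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08x")

def update_checksum(xml_path, paper_url_text, pdf_path):
    # Refresh the hash attribute of the matching <url> element.
    tree = ET.parse(xml_path)
    for url in tree.iter("url"):
        if url.text == paper_url_text:  # assumes <url> holds the paper's path/ID
            url.set("hash", crc32_of(pdf_path))
    tree.write(xml_path, encoding="UTF-8", xml_declaration=True)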

@mbollmann (Member):

But that said, the existence of the v1 version is never made explicit anywhere in the XML file. Maybe we should change that, also so that there actually is something to attach that file's checksum to.

@mjpost (Member) commented May 8, 2020

The checksums are here, if someone wants to take a shot at this:

http://cs.jhu.edu/~post/tmp/crc32.txt.gz

@mbollmann (Member):

Thanks @mjpost! What do we do with files that are currently missing (as per #264)? If we require the checksum in the RelaxNG schema, as @akoehn suggested (and I think we should), validation would fail. Should I add a dummy value (e.g. 00000000) for now, or how do we want to handle this?

Also, if there are no objections, I would go ahead and add <revision id="1" ...> entries for all papers that currently have revisions, so we can actually record the checksums of the original versions somewhere.

@akoehn (Member, Author) commented May 9, 2020 via email

@mjpost (Member) commented May 9, 2020

I like the idea of adding a <revision id="1" ...> tag.

Should we fix #264 first? It should be easy to remove <url> lines where there is no paper being linked to, since these should be generating errors anyway.

@mbollmann (Member):

@mjpost Can you share checksums for the IWPT files (and possibly EAMT, if they're going to be merged soon)? I've got everything ready to add checksums when I find a minute, but I need complete coverage, of course.

@mjpost (Member) commented May 13, 2020

Here are the IWPT checksums. I think your changes will be ready before EAMT, and I'm still debugging the process, so I'll add those checksums after merging in your changes, if that's okay.

https://cs.jhu.edu/~post/tmp/iwpt-crc32.txt

@mjpost (Member) commented May 13, 2020

@mbollmann I just updated that file with volume-level PDFs (#810); you may wish to add those, too.

@mjpost (Member) commented Nov 25, 2020

Once we have mirroring working and a permanent mirror in place, it would also be nice to set up a workflow that publishes a live version of every branch for previewing. We could define a permanent mirror with full site functionality (e.g., aclanthology.org), with live branches at, say, aclanthology.org/dev/{branch_name}.

@akoehn (Member, Author) commented Nov 27, 2020

tadaa: http://aclanthology.lst.uni-saarland.de/anthology/

Attachments are not on the mirror at the moment, but that is purely due to miscalculated disk space and can be fixed. I will post the code soon.

Two observations:

  • The aclweb.org webserver is sloooooow! It took me about 7h(?) just to download all the data, which should have been a ~1h job with my internet connection.
  • @mjpost I recall you saying that the anthology is 30 GB; it is more like 51 GB.

Re the mirror for dev: sure, that is even easier, because the papers do not need to be mirrored (and that is what this work was mostly about).

@akoehn (Member, Author) commented Nov 27, 2020

What I forgot: please click around and test whether anything is broken.

@mjpost (Member) commented Nov 27, 2020

The mirroring seems to work—but what about changing the prefix? Using both a subdomain and a top-level directory is redundant. Would this build at the root level instead of under /anthology?

@akoehn (Member, Author) commented Nov 28, 2020

@mjpost http://aclanthology.lst.uni-saarland.de/

@akoehn (Member, Author) commented Apr 19, 2021

I think we are (finally) done with this, after merging #1124.

@akoehn closed this as completed Apr 19, 2021
najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021