Data Processing

This page describes how data gets into the RDF stores used by SynoSpecies. Some links in this document point to private repositories; they are provided as a convenience for those who have access to these repositories, and the referenced contents are not needed to understand this document.

From TreatmentBank to the GitHub Repository

The starting point is the set of XML documents provided by the Plazi TreatmentBank. The GoldenGate server automatically uploads these documents to the treatments-xml GitHub repository once the QC requirements are fulfilled.

From there, a GitHub action converts the documents into RDF, serializes them as Turtle (a relatively human-readable RDF format), and deposits them in the treatments-rdf repository. This takes about ten minutes.
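As an illustration of the serialization step only (not the actual GitHub action, whose implementation is not described on this page), the conversion from RDF/XML to Turtle can be sketched with the rdflib library; the file names are hypothetical:

```python
from rdflib import Graph

# Parse an RDF/XML document (hypothetical file name) ...
g = Graph()
g.parse("03A587ED3239FFD383E9FC842CD7FB02.rdf", format="xml")

# ... and re-serialize it as Turtle, the format stored in treatments-rdf.
g.serialize(destination="03A587ED3239FFD383E9FC842CD7FB02.ttl", format="turtle")
```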

Caveat
The Plazi TreatmentBank provides two RDF representations for each treatment. For example, for the treatment http://treatment.plazi.org/id/03A587ED-3239-FFD3-83E9-FC842CD7FB02 two different RDF representations can be retrieved from the following URIs:
  • http://tb.plazi.org/GgServer/rdf/03A587ED3239FFD383E9FC842CD7FB02
  • http://tb.plazi.org/GgServer/lodRdf/03A587ED3239FFD383E9FC842CD7FB02
The second representation uses URIs (IRIs) rather than literals more consistently and is thus better suited for Linked Data applications. It is also the representation returned when requesting `application/rdf+xml` on the main URI of the treatment (http://treatment.plazi.org/id/03A587ED-3239-FFD3-83E9-FC842CD7FB02), i.e. when setting the `Accept` header to that media type.
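A minimal sketch of retrieving the Linked-Data-friendly representation via this content negotiation, using Python's requests library:

```python
import requests

treatment = "http://treatment.plazi.org/id/03A587ED-3239-FFD3-83E9-FC842CD7FB02"

# Requesting application/rdf+xml on the treatment URI returns the
# second (lodRdf, more IRI-based) representation.
response = requests.get(treatment, headers={"Accept": "application/rdf+xml"})
response.raise_for_status()
print(response.text[:500])  # first few hundred characters of the RDF/XML
```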

From the GitHub Repository to the Triple Store

There are three components relevant here:

  1. AllegroGraph as the triple store containing the data and computing the answers to SPARQL queries.
  2. Turtle Hook, which imports the data into the triple store on our server.
  3. PSPS, which serves as "glue", providing web access to the triple store and allowing for complete re-imports if necessary.

Every commit to treatments-rdf triggers a webhook sent by GitHub to Turtle Hook, which then loads only the changes introduced by that commit into the triple store. This is very fast; the updated data is available on SynoSpecies within seconds.
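As a sketch of the pattern (not Turtle Hook's actual implementation), a webhook receiver can read the lists of added, modified, and removed files from GitHub's push event payload and restrict the import to the affected Turtle files; the endpoint path below is hypothetical:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/hook", methods=["POST"])  # hypothetical endpoint path
def on_push():
    payload = request.get_json()
    changed, removed = set(), set()
    # GitHub push payloads list the files touched by each commit.
    for commit in payload.get("commits", []):
        changed.update(p for p in commit["added"] + commit["modified"]
                       if p.endswith(".ttl"))
        removed.update(p for p in commit["removed"] if p.endswith(".ttl"))
    # A real hook would now update the triple store accordingly;
    # here we only log the affected files.
    print("load:", sorted(changed), "drop:", sorted(removed))
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```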

To facilitate autocomplete for search, all taxon names are regularly (every ~24h) indexed by a script running on the server. This indexing was previously coupled to each data import, but has been decoupled so as to make imports faster and distribute server load more evenly.
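A sketch of how such a script could fetch the taxon names over SPARQL; both the endpoint URL and the assumption that taxon names are typed as dwcFP:TaxonName (as in Plazi's treatment RDF) are assumptions, not details taken from this page:

```python
import requests

# Hypothetical endpoint; the real server address is not given here.
ENDPOINT = "http://localhost:10035/repositories/treatments"

# Assumes taxon names are instances of dwcFP:TaxonName.
QUERY = """
PREFIX dwcFP: <http://filteredpush.org/ontologies/oa/dwcFP#>
SELECT DISTINCT ?name WHERE {
  ?name a dwcFP:TaxonName .
}
"""

response = requests.post(
    ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
names = [b["name"]["value"] for b in response.json()["results"]["bindings"]]
print(f"{len(names)} taxon name IRIs fetched for the autocomplete index")
```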

Publication of the Data to Lindas

An ldbar-deploy-cron Docker container provides a cron job executed daily. The script gets all the triples (i.e. the quads without the graph element) from AllegroGraph, uses the Unix commands sort and uniq to remove duplicates, and uploads the data to Lindas.
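A rough Python equivalent of that pipeline; the endpoint URLs are placeholders (the actual addresses are configured in the ldbar-deploy-cron container):

```python
import requests

# Placeholder URLs; the real addresses are configured in the container.
EXPORT_URL = "http://localhost:10035/repositories/treatments/statements"
LINDAS_URL = "https://lindas.example.org/graph"  # hypothetical upload endpoint

# Export all statements as N-Triples (the quads minus the graph element).
export = requests.get(EXPORT_URL,
                      headers={"Accept": "application/n-triples"},
                      stream=True)
export.raise_for_status()

# sorted(set(...)) plays the role of `sort | uniq`: N-Triples is
# line-based, so deduplicating lines deduplicates triples.
lines = sorted(set(export.iter_lines()))
payload = b"\n".join(lines) + b"\n"

upload = requests.post(LINDAS_URL, data=payload,
                       headers={"Content-Type": "application/n-triples"})
upload.raise_for_status()
```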