DEPRECATED. See http://d2s.semanticscience.org
This is a demonstrator ETL pipeline that converts relational databases, tabular files, and XML files to RDF. A generic RDF representation, based on the input data structure, is generated, and the user designs SPARQL queries to map this generic RDF to a specific model.
- Only Docker is required to run the pipeline. Check out the Wiki if you have issues with the Docker installation.
- The following documentation focuses on Linux and macOS.
- Windows documentation can be found here.
- Modules are from the Data2Services ecosystem.
- See data2services-transform-biolink to run Data2Services transformation workflows using CWL or Argo.
Each module is a Docker container run with a few parameters (e.g. input file path, SPARQL endpoint, credentials, mapping file path). To run the pipeline:
- Build or pull the Docker images.
- Start required services (Apache Drill and GraphDB).
- Execute the Docker modules you want, providing the proper parameters.
git clone --recursive https://github.com/MaastrichtU-IDS/data2services-pipeline.git
cd data2services-pipeline
# Update all submodules
git submodule update --recursive --remote
`build.sh` is a convenience script to build or pull all Docker images, but they can also be built separately.
- You need to download the GraphDB standalone zip (register to get an email with the download URL).
- Then put the `.zip` file in the `./graphdb` folder.
Build/pull all docker images:
# Don't forget to put GraphDB zip file in the graphdb folder
./build.sh
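Once the script completes, you can verify that the images are available locally, e.g.:
# List the Data2Services images built or pulled by the script
docker images | grep -E "vemonet|graphdb"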
In a production environment, it is assumed that both the Apache Drill and GraphDB services are already running. Other RDF triple stores should also work, but have not been tested yet.
# Pull and start apache-drill
docker pull vemonet/apache-drill
docker run -dit --rm -p 8047:8047 -p 31010:31010 \
--name drill -v /data:/data:ro \
vemonet/apache-drill
# Build and start graphdb (don't forget to put the .zip file in the graphdb folder)
docker build -t graphdb ./graphdb
docker run -d --rm --name graphdb -p 7200:7200 \
-v /data/graphdb:/opt/graphdb/home \
-v /data/graphdb-import:/root/graphdb-import \
graphdb
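To quickly check that both services are up, you can query their HTTP interfaces (the GraphDB call below uses the standard Workbench REST API):
# Apache Drill web UI answers on port 8047
curl -s http://localhost:8047/ > /dev/null && echo "Drill is up"
# GraphDB REST API lists the available repositories
curl -s http://localhost:7200/rest/repositories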
- For macOS, make sure that access to the `/data` directory has been granted in the Docker configuration.
- See the Wiki to run those services using `docker-compose`.
- Check the Wiki if you need help running Docker containers (sharing volumes, linking containers).
- The directory containing the files to convert needs to be under `/data` (to comply with the Apache Drill shared volume).
- In these examples we use `/data/data2services` as the working directory containing all the files; note that it is usually shared as `/data` inside the Docker containers. One way to create it is shown below.
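A minimal setup, assuming your user should own the working directory:
# Create the shared working directory used in the examples
sudo mkdir -p /data/data2services
sudo chown -R $USER /data/data2services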
Source files can be downloaded automatically using shell scripts. See the data2services-download module for more details; a sketch of such a script is shown after the command below.
docker pull vemonet/data2services-download
docker run -it --rm -v /data/data2services:/data \
vemonet/data2services-download \
--download-datasets drugbank,hgnc,date \
--username my_login --password my_password \
--clean # to delete all files in /data/data2services
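The module runs one shell script per dataset. A minimal sketch of such a script might look as follows; the URL and file layout are assumptions, see the data2services-download README for the exact conventions:
#!/bin/bash
# Hypothetical download script for one dataset (the URL is a placeholder)
# Fetches the source file into the shared working directory
wget -N "http://example.org/dataset/latest.xml.gz" -P /data/data2services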
Use xml2rdf to convert XML files to a generic RDF based on the file structure.
docker pull vemonet/xml2rdf
docker run --rm -it -v /data:/data \
vemonet/xml2rdf \
-i "/data/data2services/myfile.xml.gz" \
-o "/data/data2services/myfile.nq.gz" \
-g "https://w3id.org/data2services/graph/xml2rdf"
We use AutoR2RML to generate the R2RML mapping file to convert relational databases (Postgres, SQLite, MariaDB) and CSV, TSV, PSV files to a generic RDF representing the input data structure. See the Wiki for other DBMSs and how to deploy databases.
docker pull vemonet/autor2rml
# For CSV, TSV, PSV files
# Apache Drill needs to be running with the name 'drill'
docker run -it --rm --link drill:drill -v /data:/data \
vemonet/autor2rml \
-j "jdbc:drill:drillbit=drill:31010" -r \
-o "/data/data2services/mapping.trig" \
-d "/data/data2services" \
-b "https://w3id.org/data2services/" \
-g "https://w3id.org/data2services/graph/autor2rml"
# For Postgres, a postgres docker container
# needs to be running with the name 'postgres'
docker run -it --rm --link postgres:postgres -v /data:/data \
vemonet/autor2rml \
-j "jdbc:postgresql://postgres:5432/my_database" -r \
-o "/data/data2services/mapping.trig" \
-u "postgres" -p "pwd" \
-b "https://w3id.org/data2services/" \
-g "https://w3id.org/data2services/graph/autor2rml"
Generate the generic RDF using R2RML and the previously generated `mapping.trig` file.
docker pull vemonet/r2rml
# Add config.properties file for R2RML in /data/data2services
connectionURL = jdbc:drill:drillbit=drill:31010
mappingFile = /data/mapping.trig
outputFile = /data/rdf_output.nq
format = NQUADS
# Run R2RML (add --link postgres:postgres for Postgres)
docker run -it --rm --link drill:drill \
-v /data/data2services:/data \
vemonet/r2rml /data/config.properties
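A quick look at the output file (shared back to the host working directory) confirms that the conversion produced statements:
# Count the generated N-Quads statements
wc -l /data/data2services/rdf_output.nq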
Finally, use RdfUpload to upload the generated RDF to GraphDB. This can also be done manually through GraphDB server imports, which is more efficient for large files.
docker pull vemonet/rdf-upload
docker run -it --rm --link graphdb:graphdb -v /data/data2services:/data \
vemonet/rdf-upload \
-m "HTTP" -if "/data" \
-url "http://graphdb:7200" \
-rep "test" \
-un "import_user" -pw "PASSWORD"
The last step is to transform the generated generic RDF to a particular data model. See the data2services-transform-repository project for examples of transformations to the BioLink model.
We will use the data2services-sparql-operations module to execute multiple SPARQL queries from a GitHub repository, using variables to define the graph URIs.
docker pull vemonet/data2services-sparql-operations
# Load UniProt organisms and Human proteins as BioLink in local endpoint
docker run -d --link graphdb:graphdb \
vemonet/data2services-sparql-operations \
-f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/uniprot" \
-ep "http://graphdb:7200/repositories/test/statements" \
-un MYUSERNAME -pw MYPASSWORD \
--var-output https://w3id.org/data2services/graph/biolink/uniprot
# Load DrugBank xml2rdf generic RDF as BioLink to remote SPARQL endpoint
docker run -d vemonet/data2services-sparql-operations \
-f "https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/insert-biolink/drugbank" \
-ep "http://graphdb.dumontierlab.com/repositories/ncats-red-kg/statements" \
-un USERNAME -pw PASSWORD \
--var-service http://localhost:7200/repositories/test \
--var-input http://data2services/graph/xml2rdf/drugbank \
--var-output https://w3id.org/data2services/graph/biolink/drugbank
- You can find examples of the SPARQL queries used for the conversion to BioLink RDF in the data2services-transform-repository.
- It is recommended to write multiple SPARQL queries with simple goals (get all drug information, get all drug-drug interactions, get gene information) rather than one complex query addressing everything, as illustrated after these notes.
- For Windows PowerShell, remove the `\` and make each `docker run` command a single line.
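As an illustration of such a simple, single-purpose query, the sketch below copies drug names from a generic graph into a BioLink graph, sent directly to the GraphDB statements endpoint. The generic-RDF predicate shown is hypothetical; the actual queries live in the data2services-transform-repository:
# Hypothetical mapping query (the generic-RDF predicate is an assumption)
curl -u USERNAME:PASSWORD \
  -H "Content-Type: application/sparql-update" \
  "http://localhost:7200/repositories/test/statements" \
  --data 'PREFIX bl: <https://w3id.org/biolink/vocab/>
INSERT { GRAPH <https://w3id.org/data2services/graph/biolink/drugbank> {
  ?drug a bl:Drug ; bl:name ?name . } }
WHERE { GRAPH <https://w3id.org/data2services/graph/xml2rdf> {
  ?drug <https://w3id.org/data2services/vocab/name> ?name . } }'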
- Docker documentation (fix known issues, run, share volumes, link containers, network)
- Run using docker-compose
- Run AutoR2RML with various DBMS
- Fix CSV, TSV, PSV files without columns
- Run on Windows
- Run using convenience scripts
- Run Postgres
- Run MariaDB
- Secure GraphDB
- BETA: RDF validation using ShEx
If you use Data2Services in a scientific publication, you are highly encouraged (not required) to cite the following paper:
Data2Services: enabling automated conversion of data to services. Vincent Emonet, Alexander Malic, Amrapali Zaveri, Andreea Grigoriu and Michel Dumontier.
Bibtex entry:
@inproceedings{Emonet2018,
author = {Emonet, Vincent and Malic, Alexander and Zaveri, Amrapali and Grigoriu, Andreea and Dumontier, Michel},
title = {Data2Services: enabling automated conversion of data to services},
booktitle = {11th Semantic Web Applications and Tools for Healthcare and Life Sciences},
year = {2018}
}