A generic pipeline for converting tabular data into RDF data cubes

Swirrl/table2qb

Build Statistical Linked-Data with CSV-on-the-Web

Create statistical linked-data by deriving CSV-on-the-Web annotations for your data tables using the RDF Data Cube Vocabulary.

Build up a knowledge graph from spreadsheets without advanced programming skills or RDF modelling knowledge.

Simply prepare CSV inputs according to the templates and table2qb will output standards-compliant CSVW or RDF.

Once you're happy with the results you can adjust the configuration to tailor the URI patterns to your heart's content.

Turn Data Tables into Data Cubes

Table2qb expects three types of CSV tables as input:

  • observations: a 'tidy data' table with one statistic per row (what the standard calls an observation)
  • components: another table defining the columns used to describe observations (what the standard calls component properties such as dimensions, measures, and attributes)
  • codelists: a further set of tables that enumerate and describe the values used in cells of the observation table (what the standard calls codes, grouped into codelists)

For example, the ONS says that:

In mid-2019, the population of the UK reached an estimated 66.8 million

This is a single observation value (66.8 million) with two dimensions (date and place) which respectively have two code values (mid-2019 and UK), a single measure (population estimate), and implicitly an attribute for the unit (people).
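
As a rough illustration of what "one statistic per row" means, the sketch below writes the ONS statement as a single tidy row. The column headers (Date, Place, Measure, Unit, Value) are assumptions chosen for this example; the templates shipped with table2qb define the headers it actually expects.

```python
import csv
import io

# Hypothetical column headers chosen for illustration; table2qb's templates
# define the headers it actually expects.
FIELDNAMES = ["Date", "Place", "Measure", "Unit", "Value"]

# The ONS statement as one tidy row: two dimensions (date and place),
# one measure, one unit attribute, and the observed value.
observation = {
    "Date": "mid-2019",
    "Place": "UK",
    "Measure": "Population estimate",
    "Unit": "People",
    "Value": "66800000",  # 66.8 million
}


def to_csv(rows):
    """Serialise a list of row dicts as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


observations_csv = to_csv([observation])
print(observations_csv)
```

Each further statistic (another year, another place) would become another row with the same columns, rather than another column.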

The regional-trade example goes into more depth. The colour-coded spreadsheet should help illustrate how the three types of table come together to describe a cube.

Each of these inputs is processed by its own pipeline, which outputs CSVW: a processed version of the CSV table along with a JSON metadata annotation that describes the translation into RDF. Optionally, you can ask table2qb to perform the translation itself and output RDF directly, ready to be loaded into a graph database and queried with SPARQL.
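
To give a feel for what a CSVW annotation is, the sketch below builds a minimal metadata document of the general shape defined by the CSV-on-the-Web standard. It is not the output of table2qb: the file name, the example.org URIs, and the column names are all assumptions made for illustration.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW metadata document describing one CSV file."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {"columns": columns},
    }


# Hypothetical column descriptions: propertyUrl names the RDF property a
# column maps to, and valueUrl is a URI template filled in from each cell.
columns = [
    {
        "name": "date",
        "titles": "Date",
        "propertyUrl": "http://example.org/def/dimension/date",
        "valueUrl": "http://example.org/def/concept/date/{date}",
    },
    {
        "name": "value",
        "titles": "Value",
        "datatype": "integer",
        "propertyUrl": "http://example.org/def/measure/population",
    },
]

metadata = csvw_metadata("observations.csv", columns)
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

A CSVW-aware processor reads the CSV file together with an annotation like this to produce RDF, which is why the CSV and JSON are output as a pair.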

Table2qb also relies on a fourth CSV table for configuration:

  • columns: this describes how the observations table should be interpreted - i.e. which components and codelists should be used for each column in the observation tables

This configuration is designed to be shared by multiple data cubes across a data collection (so that you can re-use, for example, a "Year" column without having to configure it anew each time), which encourages harmonisation and alignment of identifiers.
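
The idea of a shared column configuration can be sketched as a lookup from column title to component. This is a conceptual illustration only: the dictionary layout, the example.org URIs, and the interpret_header function are hypothetical and do not reflect the actual format of the columns table or table2qb's internals.

```python
# Hypothetical shared configuration: one entry per column title, re-used by
# every cube in the collection. The fields and URIs are illustrative only.
COLUMNS = {
    "Year": {"component": "dimension", "property": "http://example.org/def/dimension/year"},
    "Place": {"component": "dimension", "property": "http://example.org/def/dimension/place"},
    "Sex": {"component": "dimension", "property": "http://example.org/def/dimension/sex"},
    "Value": {"component": "measure", "property": "http://example.org/def/measure/count"},
}


def interpret_header(header, columns_config):
    """Return the configuration entry for each column title, in header order."""
    missing = [title for title in header if title not in columns_config]
    if missing:
        raise KeyError(f"No column configuration for: {', '.join(missing)}")
    return [columns_config[title] for title in header]


# Two different cubes whose observation tables both have a "Year" column.
population_header = ["Year", "Place", "Value"]
employment_header = ["Year", "Sex", "Value"]

population_columns = interpret_header(population_header, COLUMNS)
employment_columns = interpret_header(employment_header, COLUMNS)

# Both cubes resolve "Year" to the same property, so their identifiers align.
print(population_columns[0]["property"] == employment_columns[0]["property"])
```

Because both cubes look up "Year" in the same configuration, they end up using the same identifier for it, which is what makes the resulting cubes interoperable.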

Ultimately table2qb provides a foundation to help you build a collection of interoperable statistical linked open data.

Install table2qb

GitHub release

Download the release from https://github.com/Swirrl/table2qb/releases.

Currently the latest is 0.3.0.

Once downloaded, unzip it. The main 'table2qb' executable is in the directory ./target/table2qb-0.3.0. You can add this directory to your PATH environment variable, or run the executable using its full file path.

Clojure CLI

Clojure now distributes the clojure and clj command-line programs for running Clojure programs. To run table2qb through the clojure command, first install the Clojure CLI tools. Then create a file named deps.edn containing the following:

deps.edn

{:deps {swirrl/table2qb {:git/url "https://github.com/Swirrl/table2qb.git"
                         :sha "8c4b22778db0c160b06f2f3b0b3df064d8f8452b"}
        org.apache.logging.log4j/log4j-api {:mvn/version "2.19.0"}
        org.apache.logging.log4j/log4j-core {:mvn/version "2.19.0"}
        org.apache.logging.log4j/log4j-slf4j-impl {:mvn/version "2.19.0"}}
 :aliases
 {:table2qb
  {:main-opts ["-m" "table2qb.main"]}}}

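You can then run table2qb using the command below. Recent versions of the Clojure CLI deprecate -A for running :main-opts, so with those versions use clojure -M:table2qb instead.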

clojure -A:table2qb

More details about the Clojure CLI and the format of the deps.edn file can be found on the Clojure website.

Running table2qb

Table2qb is written in Clojure and uses tools.deps. It is recommended you use JDK 17 or later.

table2qb can be run via the clojure CLI tools through the :cli alias:

clojure -M:cli list

To get help on the available commands, type clojure -M:cli help.

To see the available pipelines (described in more detail below), type clojure -M:cli list.

To see the required command structure for one of the pipelines (for example the cube-pipeline), type clojure -M:cli describe cube-pipeline.

How to run table2qb

See using table2qb for documentation on how to generate RDF data cubes using table2qb.

Example

The ./examples/employment directory provides an example of creating a data cube from scratch with table2qb.

License

Copyright © 2018 Swirrl IT Ltd.

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

Acknowledgements

The development of table2qb was funded by Swirrl, by the UK Office for National Statistics and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 693849 (the OpenGovIntelligence project).