Mapping Template Language (MTL)

Mario Scrocca edited this page Mar 15, 2024 · 8 revisions

This page provides a description of how the Mapping Template Language (MTL) extends the Velocity Template Language (VTL) to support mappings between different data representations.

A set of example mapping templates is available in the examples folder.

Main Concepts

This documentation assumes basic knowledge of the Apache Velocity Template Engine and VTL; more information can be found in the Velocity User Guide. MTL extends VTL by introducing three variables that are bound at runtime to the template Context and are therefore accessible while specifying a mapping template:

  • $reader to access the input data
  • $functions to execute Java functions statically configured
  • $map to access other information not available in the input data

Data Frame Definition

The initial step for defining a mapping from a file A in a specific format to a file B in a different format involves reading the contents of A.

In the Mapping Template Language (MTL), input data is accessed via the Reader interface. Using a reference formulation for a specific format, a Reader extracts one or more data frames from the input data. A data frame is a flat, non-hierarchical, tabular data structure. It is encoded as a List of Maps, where each Map corresponds to a row in the data frame and its values can be accessed using the column names (i.e., the keys).
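To make the "List of Maps" encoding concrete, the following Python sketch models a data frame as a list of dictionaries. It only illustrates the shape of the data; the library itself works with Java List and Map objects, and the helper name `column` is hypothetical.

```python
# A data frame modelled as a list of rows; each row maps column names to values.
data_frame = [
    {"stopId": "645", "stopName": "International Airport"},
    {"stopId": "651", "stopName": "Conference center"},
]


def column(frame, name):
    """Collect the values of one column, row by row (hypothetical helper)."""
    return [row[name] for row in frame]


# A row is addressed by position, a cell by its column name (the key).
first_row = data_frame[0]
stop_names = column(data_frame, "stopName")
```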

The currently available implementation supports RDF, CSV, XML, JSON, and SQL inputs via dedicated Readers:

  • For RDF input files or a remote triplestore, the $reader variable is bound to an RDFReader accepting SPARQL queries to extract a data frame.
  • For CSV input files, the $reader variable is bound to a CSVReader automatically generating a data frame from the CSV.
  • For XML input files, the $reader variable is bound to an XMLReader accepting XQuery queries to extract a data frame.
  • For JSON input files, the $reader variable is bound to a JSONReader accepting multiple JsonPath queries to extract a data frame.
  • For SQL databases, the $reader variable is bound to an SQLReader accepting SQL queries to extract a data frame.

The $reader variable can be used in the template to access the Reader. Additional Reader implementations can be added to the library, including ones that support alternative reference formulations for the same format.

The $reader can be automatically bound to the input data (e.g., if the mapping-template is run via CLI), or a Reader can be instantiated at runtime from within the template. The $functions variable exposes the following methods:

  • getRDFReaderFromFile(String filename) and getRDFReaderFromString(String s): dynamically return an RDFReader from an RDF file or string
  • getRDFReaderForRepository(String address, String repositoryId, String context): dynamically returns an RDFReader for a remote triplestore
  • getXMLReaderFromFile(String filename) and getXMLReaderFromString(String s): dynamically return an XMLReader from an XML file or string
  • getJSONReaderFromFile(String filename) and getJSONReaderFromString(String s): dynamically return a JSONReader from a JSON file or string
  • getCSVReaderFromFile(String filename) and getCSVReaderFromString(String s): dynamically return a CSVReader from a CSV file or string
  • getSQLReaderFromDatabase(String driver, String url, String databaseName, String username, String password): dynamically returns an SQLReader for a remote SQL database (MySQL and Postgres are currently supported).

This approach can be used to combine data frames extracted from different data sources within the same mapping template.

Example of Data Frame Definition

Let the input A be the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<transport>
  <bus id="25">
    <route>
      <stop id="645">International Airport</stop>
      <stop id="651">Conference center</stop>
    </route>
  </bus>
</transport>

Then reading data from A to a data frame would be written as:

#set( $query = '
    for $stop in /transport/bus/route//stop
    return map {
        "stopId": $stop/@id,
        "stopName": $stop/text(),
        "busId": $stop/ancestor::bus/@id
    }')

#set( $data = $reader.getDataframe($query))

Here, #set is a VTL directive that stores a value in a variable, and variables are denoted by the prefix $. In this case, an XQuery query is stored in the $query variable.

This query is then used to obtain a data frame. The content of the DataFrame stored in the $data variable will be:

stopId   stopName                  busId
"645"    "International Airport"   "25"
"651"    "Conference center"       "25"

The data extracted from input A is represented by a data frame whose keys are those specified via XQuery and the values are the results obtained by applying the query to A.
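For readers less familiar with XQuery, the same extraction can be approximated in Python with the standard library. This is only a sketch of what the query computes, not how the XMLReader is implemented; the function name `extract_stops` is hypothetical, and the XML declaration is omitted from the embedded string for brevity.

```python
import xml.etree.ElementTree as ET

XML_INPUT = """<transport>
  <bus id="25">
    <route>
      <stop id="645">International Airport</stop>
      <stop id="651">Conference center</stop>
    </route>
  </bus>
</transport>"""


def extract_stops(xml_text):
    """Build one row per stop, mirroring /transport/bus/route//stop."""
    root = ET.fromstring(xml_text)  # root is the <transport> element
    rows = []
    for bus in root.findall("bus"):          # /transport/bus
        for route in bus.findall("route"):   # .../route
            for stop in route.iter("stop"):  # ...//stop (any depth)
                rows.append({
                    "stopId": stop.get("id"),
                    "stopName": stop.text,
                    # The enclosing bus plays the role of ancestor::bus.
                    "busId": bus.get("id"),
                })
    return rows


data = extract_stops(XML_INPUT)
```

Each dictionary in `data` corresponds to one row of the data frame shown above.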

Data Frame Manipulation

The manipulation of a data frame can be defined using different functions and the VTL directives.

Velocity Tools

To provide commonly required functionality, a subset of the Apache Velocity Tools can be used inside template files. These are:

Utility Functions

A default set of utility functions for data transformation and data frame combination is made available through the $functions variable:

  • rp(String s): if a prefix is set, removes it from the parameter string. If a prefix is not set, or the prefix is not contained in the given string it returns the string as it is.
  • setPrefix(String prefix): sets a prefix for the rp method.
  • sp(String s, String substring): returns the substring of the parameter string after the first occurrence of the parameter substring.
  • p(String s, String substring): returns the substring of the parameter string before the first occurrence of the parameter substring.
  • replace(String s, String regex, String replacement): returns a string replacing all the occurrences of the regex with the replacement provided.
  • newline(): returns a newline string.
  • hash(String s): returns a string representing the hash of the parameter.
  • checkString(String s): returns true if the string is not null and not an empty string.
  • checkList(List<T> l): returns true if the list is not null and not empty.
  • checkList(List<T> l, T o): returns true if the list is not null, not empty and contains o.
  • checkMap(Map<K,V> m): returns true if the map is not null and not empty.
  • checkMap(Map<K, V> m, K key): returns true if the map is not null, not empty and contains the key key.
  • mergeResults(List<Map<String,String>> results, List<Map<String,String>> otherResults): merges two data frames.
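The behaviour described for some of these string and check helpers can be pictured with the following Python sketch. It is an approximation written from the descriptions above, not the library's Java code. Two points are assumptions: rp removes the first occurrence of the prefix wherever it appears, and sp and p return the input unchanged when the substring is absent (this page does not specify that case). The class name `FunctionsSketch` is hypothetical.

```python
class FunctionsSketch:
    """Python approximation of a few documented $functions helpers."""

    def __init__(self):
        self.prefix = None

    def set_prefix(self, prefix):
        self.prefix = prefix

    def rp(self, s):
        # Remove the prefix only when one is set and s contains it.
        # Assumption: only the first occurrence is removed.
        if self.prefix and self.prefix in s:
            return s.replace(self.prefix, "", 1)
        return s

    @staticmethod
    def sp(s, substring):
        # Text after the first occurrence of substring.
        # Assumption: s is returned unchanged if substring is absent.
        _, found, after = s.partition(substring)
        return after if found else s

    @staticmethod
    def p(s, substring):
        # Text before the first occurrence of substring.
        # Assumption: s is returned unchanged if substring is absent.
        before, found, _ = s.partition(substring)
        return before if found else s

    @staticmethod
    def check_string(s):
        return s is not None and s != ""

    @staticmethod
    def check_map(m, *keys):
        # No key given: not null and not empty. Key given: must also contain it.
        return bool(m) and all(k in m for k in keys)


functions = FunctionsSketch()
functions.set_prefix("http://trans.example.com/")
local_id = functions.rp("http://trans.example.com/25")
```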

Custom Functions

Custom subclasses of the TemplateFunctions class may be defined and provided (e.g., using the -fun option via CLI) to modify the set of functions available in processing the template via the $functions interface. The provided class is compiled at runtime and made available through the $functions variable in the template.

The $map variable

The $map variable contains key-value pairs that can be specified independently from the declarative mapping template and are evaluated at runtime. This is useful if the same template should be run on different input data and the generated output should contain constant information that depends on the considered input but is not available in the input data.

Declarative Mapping Rules

To represent the data according to an expected data format and data model, a set of declarative mapping rules should be defined to specify how the data in the data frame should be combined to obtain the desired output. The flexibility of VTL can be leveraged to generate any textual data representation.

As an example, the previously shown snippet of a mapping can be expanded to generate a set of RDF triples from the data in the extracted data frame.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix transit: <http://vocab.org/transit/terms/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix ex: <http://trans.example.com/>.

#set( $query = '
    for $stop in /transport/bus/route//stop
    return map {
        "stopId": $stop/@id,
        "stopName": $stop/text(),
        "busId": $stop/ancestor::bus/@id
    }')
#set( $data = $reader.getDataframe($query))

#foreach($stop in $data)
ex:$stop.busId rdf:type transit:stop ;
  transit:stop "$stop.stopId"^^xsd:int ;
  rdfs:label "$stop.stopName" .
#end

At the beginning of the mapping, the RDF prefixes and corresponding URIs are declared. When this mapping is executed, everything that is not a VTL directive or reference is kept as constant text in the generated output.

At the end of the mapping, each row in the data frame is used to populate the structure of the desired RDF representation in the Turtle format. The VTL #foreach directive is used to loop over all the rows in the data frame. Values are retrieved using the map.key property access syntax.

The specified mapping results in the following RDF in Turtle format.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix transit: <http://vocab.org/transit/terms/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix ex: <http://trans.example.com/>.

ex:25 rdf:type transit:stop ;
  transit:stop "645"^^xsd:int ;
  rdfs:label "International Airport" .
ex:25 rdf:type transit:stop ;
  transit:stop "651"^^xsd:int ;
  rdfs:label "Conference center" .
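The effect of the #foreach block can be mimicked in Python as a loop that fills a fixed text pattern once per row. This is only a sketch of the rendering logic: the Velocity engine performs the actual substitution, its whitespace handling is not reproduced here, and the function name `render_turtle` is hypothetical.

```python
PREFIXES = """@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix transit: <http://vocab.org/transit/terms/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix ex: <http://trans.example.com/>."""

# The data frame extracted in the example above.
data = [
    {"stopId": "645", "stopName": "International Airport", "busId": "25"},
    {"stopId": "651", "stopName": "Conference center", "busId": "25"},
]


def render_turtle(rows):
    """Emit the constant prefixes, then one block of triples per row."""
    blocks = [PREFIXES, ""]
    for stop in rows:  # plays the role of #foreach($stop in $data)
        blocks.append(
            f'ex:{stop["busId"]} rdf:type transit:stop ;\n'
            f'  transit:stop "{stop["stopId"]}"^^xsd:int ;\n'
            f'  rdfs:label "{stop["stopName"]}" .'
        )
    return "\n".join(blocks)


output = render_turtle(data)
print(output)
```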

Performance Optimisations

  1. It is better to avoid nested loops in the template by using support data structures to access large data frames efficiently. A set of functions is made available through the $functions variable to optimise the access to data frames:
  • getMap(List<Map<String, String>> results, String key): creates a support data structure to access data frames faster. Builds a map associating each value of a specified column (key parameter) with the single row having that value. The assumption is that the value for the given column is unique across rows; otherwise, the result will be incomplete.
  • getListMap(List<Map<String, String>> results, String key): creates a support data structure to access data frames faster. Builds a map associating each value of a specified column (key parameter) with all the rows having that value.
  • getMapValue(Map<K, V> map, K key): if checkMap(map, key) is true, returns the value for key in map; otherwise returns null.
  • getListMapValue(Map<K, List<V>> listMap, K key): if checkMap(listMap, key) is true, returns the value for key in listMap; otherwise returns an empty list.
  2. Data access heavily affects the performance of the mappings. It is better to extract data from the input data source into as few data frames as possible, rather than defining several small data frames, one for each mapping rule.

  3. Very large templates may affect performance. If feasible for the specific scenario, splitting templates into multiple files and then combining the results may improve performance.
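The support structures listed under the first point can be pictured with the Python sketch below. It follows the descriptions of getMap, getListMap, getMapValue and getListMapValue, but it is not the library's implementation. Which row survives when getMap meets a duplicate value is an assumption (here the later row wins), and the `buses` data frame and its `operator` column are hypothetical.

```python
from collections import defaultdict


def get_map(results, key):
    """Index rows by a column whose values are assumed to be unique."""
    index = {}
    for row in results:
        # Assumption: on duplicates the later row wins, so earlier rows are lost.
        index[row[key]] = row
    return index


def get_list_map(results, key):
    """Group rows by the value they hold in the given column."""
    index = defaultdict(list)
    for row in results:
        index[row[key]].append(row)
    return dict(index)


def get_map_value(index, key):
    return index[key] if index and key in index else None


def get_list_map_value(index, key):
    return index[key] if index and key in index else []


stops = [
    {"stopId": "645", "stopName": "International Airport", "busId": "25"},
    {"stopId": "651", "stopName": "Conference center", "busId": "25"},
]
buses = [{"busId": "25", "operator": "Hypothetical Transit"}]  # hypothetical

# Build the index once, then look up each bus in constant time
# instead of scanning all stops inside a nested loop.
stops_by_bus = get_list_map(stops, "busId")
stop_names = {
    bus["busId"]: [s["stopName"] for s in get_list_map_value(stops_by_bus, bus["busId"])]
    for bus in buses
}
```

With the index in place, the work grows with the sum of the two data frame sizes rather than their product, which is the motivation for avoiding nested loops.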