Input Data Syntax: N Tuples

rdelbru edited this page Dec 12, 2011 · 9 revisions

SIREn extends Lucene with a new field type, 'tuples'. The field accepts structured information in a special syntax called N-Tuples, which is derived from the N-Triples syntax and is a superset of it. N-Tuples is a line-based, plain text format for encoding semi-structured data such as RDF graphs or other data formats. The content of a tuples field is an ordered list of tuples, each tuple being an ordered list of cells. The current syntax differentiates three types of cells:

  • URIs, or Uniform Resource Identifiers, are enclosed in '<' and '>';
  • Literals, or plain text, are written using double-quotes;
  • Blank nodes, or local identifiers (specific to the RDF data model), are written as '_:nodeID'.

A dot signifies the end of a tuple. In the following, we present various examples of semi-structured data encoded into N-Tuples. The possibilities are not restricted to these examples, and it is up to you to structure your data the way you want.
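To make the three cell types and the terminating dot concrete, here is a minimal sketch of a line tokenizer in Python. It is not SIREn's parser: the names `CELL` and `parse_tuple` are hypothetical, the blank node label is assumed to be alphanumeric, the optional `^^<...>` datatype tag is the one used in the tabular example further down, and escape sequences inside literals are kept as written rather than decoded.

```python
import re

# One alternative per cell type. A literal may carry an optional
# datatype tag of the form ^^<...>.
CELL = re.compile(r'''
    <(?P<uri>[^>]*)>
  | "(?P<literal>(?:[^"\\]|\\.)*)"(?:\^\^<(?P<datatype>[^>]*)>)?
  | _:(?P<bnode>[A-Za-z0-9]+)
''', re.VERBOSE)


def parse_tuple(line):
    """Split one N-Tuples line into (kind, value, datatype) cells."""
    line = line.strip()
    if not line.endswith("."):
        raise ValueError("a tuple must end with a dot")
    body = line[:-1].rstrip()
    cells, pos = [], 0
    while pos < len(body):
        m = CELL.match(body, pos)
        if m is None:
            raise ValueError("unrecognised cell at offset %d" % pos)
        if m.group("uri") is not None:
            cells.append(("uri", m.group("uri"), None))
        elif m.group("literal") is not None:
            cells.append(("literal", m.group("literal"), m.group("datatype")))
        else:
            cells.append(("bnode", m.group("bnode"), None))
        pos = m.end()
        # Skip the whitespace that separates two cells.
        while pos < len(body) and body[pos].isspace():
            pos += 1
    return cells
```

Each cell comes back as a `(kind, value, datatype)` triple, where `kind` is `"uri"`, `"literal"` or `"bnode"` and `datatype` is `None` unless the literal carries a tag.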

N-Triples

Here is a sample of a plain N-Triples document that encodes an RDF graph. The document describes itself, i.e., the FOAF file of Renaud Delbru, and the entity identified by the URI [http://renaud.delbru.fr/rdf/foaf#me].

<http://renaud.delbru.fr/rdf/foaf> <http://www.w3.org/2000/01/rdf-schema#label> "FOAF file of Renaud Delbru" .
<http://renaud.delbru.fr/rdf/foaf> <http://xmlns.com/foaf/0.1/maker> <http://renaud.delbru.fr/rdf/foaf#me> .
<http://renaud.delbru.fr/rdf/foaf#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/name> "Renaud Delbru" .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/givenname> "Renaud" .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/family_name> "Delbru" .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/homepage> <http://renaud.delbru.fr/> .

Entity-Centric

Here is a sample of an entity description using N-Tuples. Unlike the previous example, where the first cell was the identifier of an entity, here the first cell of a tuple is a predicate (or property name). The subsequent cells of a tuple are the values associated with the predicate.

As you can see, the syntax is flexible. In lines 1 and 3, we model a multi-valued predicate with a first cell representing the predicate and the following cells as its values. You can also mix different cell types (URIs, Literals) in the same tuple.

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> "A Person" .
<http://xmlns.com/foaf/0.1/name> "Renaud Delbru" .
<http://xmlns.com/foaf/0.1/knows> <http://g1o.net#me> <http://eyaloren.org/foaf.rdf#me> .
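One possible way to obtain such an entity-centric description from plain triples is to group them by subject and then by predicate. The sketch below assumes the cells are already serialized as N-Tuples strings; the function name `to_entity_centric` is hypothetical and is not part of SIREn.

```python
def to_entity_centric(triples):
    """Group (subject, predicate, object) cell strings into one list of
    predicate-first tuples per subject, preserving input order."""
    entities = {}
    for subject, predicate, obj in triples:
        entities.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return {
        subject: [" ".join([predicate] + values) + " ."
                  for predicate, values in predicates.items()]
        for subject, predicates in entities.items()
    }
```

Two triples that share a subject and a predicate end up in a single tuple with two value cells, which is how the multi-valued `knows` line above can be produced.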

Tabular Data

Instead of indexing RDF data, it is also possible to use SIREn for indexing tabular data using the N-Tuples data format. Below is a sample of tabular data in a CSV-formatted file, found at http://www.ourairports.com/data/airports.csv:

_Field names_
"id","ident","type","name","latitude_deg","longitude_deg","elevation_ft","continent","iso_country","iso_region","municipality","scheduled_service","gps_code","iata_code","local_code","home_link","wikipedia_link","keywords"

6523,"00A","heliport","Total Rf Heliport",40.07080078125,-74.9336013793945,11,"NA","US","US-PA","Bensalem","no","00A",,"00A",,,
6560,"00S","small_airport","Mc Kenzie Bridge State Airport",44.1832008362,-122.088996887,1620,"NA","US","US-OR","Mc Kenzie Bridge","no","00S",,"00S",,"http://en.wikipedia.org/wiki/McKenzie_Bridge_State_Airport",

Using a simple script, it is possible to convert each line into an N-Tuple, with values represented as URIs or Literals with the appropriate datatype tag. For example, we can treat latitude_deg and longitude_deg as doubles, and elevation_ft as an integer. The home_link and wikipedia_link values can be represented as URIs. All other values are treated as Literals by default.

"6523" "00A" "heliport" "Total Rf Heliport" "40.07080078125"^^<xsd:double> "-74.9336013793945"^^<xsd:double> "11"^^<xsd:integer> "NA" "US" "US-PA" "Bensalem" "no" "00A" "" "00A" <> <> "" .
"6560" "00S" "small_airport" "Mc Kenzie Bridge State Airport" "44.1832008362"^^<xsd:double> "-122.088996887"^^<xsd:double> "1620"^^<xsd:integer> "NA" "US" "US-OR" "Mc Kenzie Bridge" "no" "00S" "" "00S" <> <http://en.wikipedia.org/wiki/McKenzie_Bridge_State_Airport> "" .

It is worth noting that any cell can be either a Literal or a URI.
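The conversion script itself is not shown on this page, so the following is only a minimal sketch of what it could look like in Python, using the standard `csv` module. The column-to-type mapping follows the choices described above; the names `DOUBLE`, `INTEGER`, `URI`, `to_cell` and `csv_to_ntuples` are hypothetical.

```python
import csv
import io

# Column typing as described in the text; every other column is a plain Literal.
DOUBLE = {"latitude_deg", "longitude_deg"}
INTEGER = {"elevation_ft"}
URI = {"home_link", "wikipedia_link"}


def to_cell(field, value):
    """Serialize one CSV value as an N-Tuples cell."""
    if field in URI:
        return "<%s>" % value
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if field in DOUBLE:
        return '"%s"^^<xsd:double>' % escaped
    if field in INTEGER:
        return '"%s"^^<xsd:integer>' % escaped
    return '"%s"' % escaped


def csv_to_ntuples(text):
    """Yield one N-Tuples line per CSV record; the first record is the header."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield " ".join(to_cell(f, row[f]) for f in reader.fieldnames) + " ."
```

Empty CSV values are kept as empty cells (`""` or `<>`) so that every tuple has the same number of cells and a value's position still identifies its column. This sketch applies the datatype tag even when a numeric column is empty; whether that is desirable depends on how the data is queried.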