DMRS representation: named graphs #24

Open
arademaker opened this issue Jul 1, 2021 · 5 comments
Comments

@arademaker
Member

arademaker commented Jul 1, 2021

Related to #21

If a user asks for a representation that supports named graphs, we should be able to produce it. In the CLI, the representations are limited to the ones that RDFLib supports. See https://en.wikipedia.org/wiki/N-Triples#N-Quads as one format.

But a user may want to save in JSON-LD with or without one named graph per semantic representation. Moreover, in Python code, the user may need to specify whether a named graph should be used or whether all triples should go into the single default graph (more about these concepts). What alternatives do we have?

Regarding triple stores, AllegroGraph supports N-Quads and JSON-LD, both formats compatible with named graphs. More about N-Quads.

@arademaker arademaker mentioned this issue Jul 2, 2021
@yfaria yfaria mentioned this issue Jul 11, 2021
@yfaria
Contributor

yfaria commented Jul 12, 2021

An idea close to the RDF dataset that is implemented in RDFLib is the RDF store (more on the ideas behind it here).
In #26, an implementation was made using IOMemory objects; IOMemory is a subclass of the Store class with some optimizations. The idea was to put information at the levels "above" a specific semantic representation inside a graph labeled with a blank node, and information "below" that semantic representation inside its own graph. The Python module now takes the store and a global context of that store, which is a graph; it inserts the new graph of the semantic representation into the store and inserts the information about it into the global context.
The information included in the global context is basically the link between the semantic representation of a profile item and its components. Taking a subset of the output of serializing a profile to RDF, using MRSs, as N-Quads, we have

<http://example.com/291/0> <http://www.delph-in.net/schema/hasMRS> <http://example.com/291/0/mrs> _:N3745aaef7fc448469bbbb222390a6e78 .
<http://example.com/291/0/mrs> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.delph-in.net/schema/mrs#MRS> _:N3745aaef7fc448469bbbb222390a6e78 .
<http://example.com/291/0/mrs> <http://www.delph-in.net/schema/mrs#hasEP> <http://example.com/291/0/mrs#EP-0> _:N3745aaef7fc448469bbbb222390a6e78 .
<http://example.com/291/0/mrs> <http://www.delph-in.net/schema/mrs#hasEP> <http://example.com/291/0/mrs#EP-3> _:N3745aaef7fc448469bbbb222390a6e78 .
<http://example.com/291/0/mrs> <http://www.delph-in.net/schema/mrs#hasHcons> <http://example.com/291/0/mrs#hcons-1> _:N3745aaef7fc448469bbbb222390a6e78 .
<http://example.com/291/0/mrs#EP-1> <http://www.delph-in.net/schema/carg> "Abrams" <http://example.com/291/0/mrs> .
<http://example.com/291/0/mrs#EP-3> <http://www.delph-in.net/schema/mrs#body> <http://example.com/291/0/mrs#variable-h13> <http://example.com/291/0/mrs> .
<http://example.com/291/0/mrs#hcons-3> <http://www.delph-in.net/schema/mrs#lowHcons> <http://example.com/291/0/mrs#variable-h14> <http://example.com/291/0/mrs> .
_:Ne45e4b590dec48b6aaf99c570add0d68 <http://www.delph-in.net/schema/hasTop> <http://example.com/291/0/mrs#variable-h0> <http://example.com/291/0/mrs> .
_:Ne45e4b590dec48b6aaf99c570add0d68 <http://www.delph-in.net/schema/hasIndex> <http://example.com/291/0/mrs#variable-e2> <http://example.com/291/0/mrs> .

where _:N3745aaef7fc448469bbbb222390a6e78 acts as the global context, in which we put the triples that link the URI <http://example.com/291/0/mrs> to the item, give it a type, and connect it to the nodes of its parts; inside the graph of this MRS we put the triples whose subjects are mostly parts of <http://example.com/291/0/mrs>. The only exception right now is TOP and INDEX, each of which currently connects a blank node to the corresponding MRS variable.

When we serialize the RDF in a format that does not encode named graphs, such as N-Triples, the same triples are encoded but the graph names are dropped. Since we still connect the semantic representation to its parts, the file remains consistent with the earlier versions. The only thing that is not consistent is the TOP and INDEX encoding, which is not satisfactory either: we end up with many blank nodes linked to MRS variables. The problem with connecting the MRS URI directly to its TOP/INDEX is that we would get a triple whose subject is the same URI as the graph URI (in the case above, we would have <http://example.com/291/0/mrs> <http://www.delph-in.net/schema/hasIndex> <http://example.com/291/0/mrs#variable-e2> <http://example.com/291/0/mrs> .).
To handle that, we could simply accept a triple that references the URI of its own graph, or we could change the Python code to distinguish between the cases where we want to generate named graphs and the cases where we don't, as suggested before.
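The effect of a triple-only serialization can be sketched without RDFLib by treating quads as plain tuples. The data abbreviates the N-Quads above (predicates are shortened to local names), `drop_graph` is a hypothetical helper, and the last quad is the hypothetical alternative in which the MRS URI itself, rather than a blank node, is the subject of hasIndex.

```python
# Quads as (subject, predicate, object, graph) tuples.
MRS = "http://example.com/291/0/mrs"
GLOBAL = "_:N3745aaef7fc448469bbbb222390a6e78"  # blank node naming the global context

quads = [
    ("http://example.com/291/0", "hasMRS", MRS, GLOBAL),
    (MRS, "hasEP", MRS + "#EP-0", GLOBAL),
    (MRS + "#EP-1", "carg", '"Abrams"', MRS),
    # Hypothetical alternative: the subject is the same URI that names the graph.
    (MRS, "hasIndex", MRS + "#variable-e2", MRS),
]


def drop_graph(quads):
    """Keep only (s, p, o); identical triples from different graphs collapse."""
    return sorted({(s, p, o) for s, p, o, _graph in quads})


triples = drop_graph(quads)

# Statements whose subject is the name of the graph that contains them.
self_referencing = [q for q in quads if q[0] == q[3]]
```

Once the graph name is dropped, the self-referencing statement becomes an ordinary triple, so the concern only arises in quad-aware formats.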

@yfaria
Contributor

yfaria commented Jul 27, 2021

RDFLib version 6.0.0 is out and it offers a new way of solving this problem.
In #26, the proposed solution was to use an IOMemory object directly; this class, a more efficient version of Store, was renamed to Memory in the newer version.
In RDFLib, every graph is created in a Store, by default a Memory object. In fact, we have

>>> from rdflib.graph import Graph
>>> g = Graph()
>>> g.store
<rdflib.plugins.stores.memory.Memory object at 0x7fd6d3dacc10>

The newest version provides the class Dataset, an implementation of the RDF 1.1 notion of a dataset. In terms of usage, it is similar to ConjunctiveGraph, which is a graph that contains all the graphs of its store. These two are easier to use because they already have a "default graph" to which we can add triples directly.

Therefore, named graphs can be created in four different ways: creating a Store and creating a default graph for it; doing the same with a Memory object; creating a Dataset and adding graphs to it; or creating a ConjunctiveGraph and sharing its store with the DMRS graphs. This choice won't affect the PyDelphin plugin but will affect the usage of the Python package (specifically, its inputs).

@arademaker
Member Author

It is not clear what the pros and cons of each approach are, or whether you have already decided which way to go.

@arademaker
Member Author

arademaker commented Aug 18, 2021

Possibly related to the discussion above about HOW to use RDFLib to implement named graphs, we still need clarity about how information is modeled in named graphs. From the beginning, we have known that named graphs introduce one disadvantage: not all triple stores and libraries implement them. So we should, if possible, be able to produce RDF both with and without named graphs.

The simpler solution is to keep a single code path and, given the desired output format, decide whether the named graph part (the context, the fourth element of the quads) should be serialized or not. In other words, if the user asks for a format that only supports triples (e.g. Turtle), the fourth element of each quad is discarded. But there is a potential problem with that: redundancy. Below, the first quad says that the URI has a type, and this information is in the named graph named by the same URI. See #30, but it is fine. The next two quads say that a Node and a Link belong to the DMRS, which is itself the named graph where those triples are defined.

<http://wordnet.princeton.edu/pwn30/01002055-a-1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.delph-in.net/schema/dmrs#DMRS>
<http://wordnet.princeton.edu/pwn30/01002055-a-1> .

<http://wordnet.princeton.edu/pwn30/01002055-a-1>
<http://www.delph-in.net/schema/dmrs#hasNode>
<http://wordnet.princeton.edu/pwn30/01002055-a-1#node-10003>
<http://wordnet.princeton.edu/pwn30/01002055-a-1> .

<http://wordnet.princeton.edu/pwn30/01002055-a-1>
<http://www.delph-in.net/schema/dmrs#hasLink>
<http://wordnet.princeton.edu/pwn30/01002055-a-1#link-7>
<http://wordnet.princeton.edu/pwn30/01002055-a-1> .

We can live with this redundancy: if the N-Quads are loaded into a triple store, all hasNode and hasLink triples could easily be removed without losing information. Converting to Turtle with rapper:

rapper -i nquads sample.nq -o turtle sample.nq | less

This simply ignores the fourth element of the quads and gives me:

<http://wordnet.princeton.edu/pwn30/01000442-a-1>
    <http://www.delph-in.net/schema/dmrs#hasLink> , ...;
    <http://www.delph-in.net/schema/dmrs#hasNode>, ... ;
    <http://www.delph-in.net/schema/hasIndex> <http://wordnet.princeton.edu/pwn30/01000442-a-1#node-10001> ;
    <http://www.delph-in.net/schema/hasTop> <http://wordnet.princeton.edu/pwn30/01000442-a-1#node-10000> ;
    a <http://www.delph-in.net/schema/dmrs#DMRS> .

Alternatively, we should remove the hasLink and hasNode triples before serialization when the target format supports contexts. @yfaria what do you think?

@yfaria
Contributor

yfaria commented Aug 18, 2021

The redundancy in the RDF quad generation is there to keep the output compatible with applications that only support triples, without having to distinguish between producing triples and producing quads, as you pointed out; for example, the CLI application can be used to output Turtle without loss of information.
I don't think this redundancy is a problem unless the quads start to occupy a lot of space, so I think removing it would only help in case of memory issues. The redundant triples are not an issue for the interpretation of the context nor for the resulting graph, and outputting triples already works well.
If we treat it as a problem, we'd have to distinguish between the desired formats and keep some mapping from the supported formats (https://rdflib.readthedocs.io/en/stable/plugin_serializers.html, even though there could be more formats to serialize to, as stated here) to whether or not they support quads, so that we filter out those redundant triples when producing quads.
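The mapping and filter described above could be sketched as follows, with quads as plain tuples. `QUAD_FORMATS` is a hand-written, non-exhaustive assumption about which serializer names keep graph names, and `prepare` is a hypothetical helper, not an existing function of the package.

```python
# Assumption: non-exhaustive set of serializer names that keep graph names.
QUAD_FORMATS = {"nquads", "trig"}

# Predicates that only restate which graph a node or link belongs to.
REDUNDANT = {
    "http://www.delph-in.net/schema/dmrs#hasNode",
    "http://www.delph-in.net/schema/dmrs#hasLink",
}

DMRS = "http://wordnet.princeton.edu/pwn30/01002055-a-1"
quads = [
    (DMRS, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://www.delph-in.net/schema/dmrs#DMRS", DMRS),
    (DMRS, "http://www.delph-in.net/schema/dmrs#hasNode", DMRS + "#node-10003", DMRS),
    (DMRS, "http://www.delph-in.net/schema/dmrs#hasLink", DMRS + "#link-7", DMRS),
]


def prepare(quads, fmt):
    """Return the statements to serialize for the requested format."""
    if fmt in QUAD_FORMATS:
        # The graph name already tells where nodes and links belong.
        return [q for q in quads if q[1] not in REDUNDANT]
    # Triple-only format: drop the graph name, keep the membership triples.
    return [(s, p, o) for s, p, o, _graph in quads]


for_nquads = prepare(quads, "nquads")
for_turtle = prepare(quads, "turtle")
```

The cost of this design is the format table itself, which has to be kept in sync with whatever serializers are installed.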
