- SEP number: SEP-0007
- Discussion: SPARQL Errata 8, issue 89
- Author: Andy Seaborne
EXISTS
/NOT EXISTS
relies on the
"substitute" operation.
There are several problems identified with the specification material.
Substitute is also used in LATERAL
Join.
This SEP describes the issues and proposes solutions.
The SPARQL 1.1 algebra operation substitute
evaluates a graph
pattern where there are specific values for variables given by a solution mapping.
The operation is used in the evaluation of EXISTS
and NOT EXISTS
operations.
A similar operation is needed for a LATERAL
join. Both are exposing the specific values for
variables, row-by-row, from part of the query evaluation into a graph pattern.
This document lists problems that have been identified with the substitute
operation, and proposes an improved substitution evaluation
process that addresses these problems based on the concept of
Correlated Subquery as found in SQL.
The fundamental problem is that a variable cannot be simply replaced by a value
(an RDF Term) in all places in a graph pattern. There are some places where
SPARQL function forms and AS
assignments require a variable. The proposal here
is to take the variable binding from the input solution, retaining the
variable in the graph pattern, and disallowing cases that can reset the variable
binding, as is already the case elsewhere in SPARQL.
This section describes the issues identified on the
SPARQL Exists
Community group mailing list public-sparql-exists/2016Jul/0014.
- Some uses of
EXISTS
are not defined during evaluation. - Substitution happens where definitions are only for variables.
- Blank nodes substituted into BGPs act as variables.
- Substitution can flip
MINUS
to its disjoint-domain case. - Substitution affects disconnected variables.
The evaluation process in the specification is defined for graph patterns, but there are situations where the evaluation is of an unlisted algebra form.
Examples include the following:
-
FILTER EXISTS { SELECT ?y { ?y :q :c } }
-
FILTER EXISTS { VALUES ?y { 123 } }
The argument to EXISTS
is not explicitly listed as a Graph Pattern
in the table of SPARQL algebra symbols in
section 18.2
where the argument to EXISTS
is listed as a GroupGraphPattern
containing just a subquery
or just InlineData.
There are places in the SPARQL syntax and algebra where variables are allowed, but RDF terms (constant values) are not.
For example —
FILTER EXISTS { BIND ( :e AS ?z )
{ SELECT ?x { :b :p :c } }
}
Both positions AS ?z
and SELECT ?x
must be variables.
In the algebra, this affects the following:
extend
which is related to the use ofAS
in SPARQL syntax- in line data which is related to the use of
VALUES
BOUND
In the evaluation of basic graph patterns (BGPs), blank nodes are replaced by RDF terms from the graph being matched, and variables are replaced by a solution mapping from query variables to RDF terms, so that the basic graph pattern is now a subgraph of the graph being matched.
Simply substituting a blank node for a variable in the EXISTS
evaluation process does not cause the basic graph pattern to be
restricted to subgraphs containing that blank node as an RDF term,
because it is mapped by an
RDF instance mapping
before checking that the BGP after mapping is a subgraph of the
graph being queried.
Note that elsewhere in the evaluation of the SPARQL algebra, a solution mapping with a binding from variable to blank node does treat blank nodes as RDF terms. They are not mapped by an RDF instance mapping.
For example, executing this query:
SELECT ?x WHERE {
?x :p :d .
FILTER EXISTS { ?x :q :b . }
}
against the graph { _:c :p :d . :e :q :b }
,
the substitution for EXISTS
produces
BGP(_:c :q :b)
, which then
matches against :e :q :b
, because the _:c
can be mapped to :e
by
the RDF instance mapping that is part of the pattern instance mappings in
18.3.1.
Executing this query:
SELECT ?x WHERE {
?x :p :c .
FILTER EXISTS { ?x :p :c . MINUS { ?x :p :c . } }
}
on the graph { :d :p :c }
, the substitution from
18.6
ends up producing
Minus( BGP( :d :p :c ), BGP( :d :p :c ) )
which produces a non-empty result, because the two solution mappings for the
MINUS
have disjoint domains, and
18.5
then dictates that the result is not empty.
In this query:
SELECT ?x WHERE {
BIND ( :d AS ?x )
FILTER EXISTS { BIND ( :e AS ?z )
SELECT ?y WHERE { ?x :p :c } }
}
the substitution from 18.6 ends up producing
Join ( Extend( BGP(), ?z :e ),
ToMultiSet(
Project( ToList( BGP( :d :p :c ) ), { ?y } )
))
The ?x
inside the SELECT ?y
is not projected out, so it is a "different" ?x
than the outer one; changing it to another other unused name in the same query
would not normally affect the query results.
Evaluating
substitute
is
performed for a given solution mapping. For example, if a graph pattern
has one or more matches given the variable bindings of a solution mapping,
the EXISTS
operation evaluates to true
. We call this solution mapping
the current row in this description.
This section proposes an alternative mechanism. Rather than replace each
variable by the value it is bound to in the current row, this alternative
mechanism makes the whole of the current row available at any point in the
evaluation of an EXISTS
expression. It uses the current row to
restrict the binding of variables, at the points where variable bindings are
created during evaluation of EXISTS
, to be those from the current row.
It makes illegal syntactic constructs that could lead to an attempt to rebind a
variable from the current row through the AS
syntax.
Section "Addressing Issues" describes how this alternative
definition of substitute
addresses each of the issues identified above.
The proposal has the following three parts:
- Place the current row mapping variables to values (the RDF terms) so that the variables always get their values from the current row. This is the replacement for syntactic substitution in the original definition.
- Rename inner scope variables so that variables that are only used within a sub-query are not affected by the current row. This reflects the fact that in SPARQL such variables are not present in solution mappings outside their sub-query.
- Disallow syntactic forms that set variables that may already be present in the current row. SPARQL solution mappings can only have one binding for a variable, and the current row provides that binding.
Within sub-queries, variables with the same name can be used, but they do not
appear in the overall results of the query if they do not occur in the
projection of the sub-query. Such inner variables are not
in-scope when they are
not in the output of the projection part of the inner SELECT
expression.
Here, the ?s
is not mentioned in the projection of
SELECT (count(*) AS ?C)
:
SELECT * {
?s :value ?v .
FILTER EXISTS {
{ SELECT (count(*) AS ?C) {
?s :property ?w .
}
}
FILTER ( ?C < ?v )
}
}
Replacing ?s
in the sub-query with, for example,
?V1234
does not change the overall results:
SELECT * {
?s :value ?v .
FILTER EXISTS {
{ SELECT (count(*) AS ?C) {
?V1234 :property ?w .
}
}
FILTER ( ?C < ?v )
}
}
Such variables can be replaced with a variable of a different name, if that name is not used anywhere else in the query, and the same results are obtained in the sub-query. A sub-query always has a projection as its top-most algebra operator.
To preserve this, any such variables are renamed so they do not coincide with
variables from the current row being filtered by EXISTS
.
The SPARQL algebra project
operator has two components: an algebra expression,
and a set of variables for the projection.
Definition: Projection Expression Variable RemappingFor a projection algebra operation `P` with set of variables `PV`, define a partial mapping `F` from `V`, the set of all variables, to `V` where:
F(v) = v if v in PV F(v) = v1 where v is a variable mentioned in the project expression and v1 is a fresh variable F(v) = v otherwise.Define the Projection Expression Variable Remapping `PrjMap(P,PV)` to be the algebra expression `P` (and the subtree over which the projection is defined) with `F` applied to every variable of the algebra expression `P` over which `P` is evaluated.
This process is applied throughout the graph pattern of EXISTS
:
Definition: Variable RemappingFor any algebra expression `X`, define the Variable Remapping `PrjMap(X)`:
PrjMap(X) = replace all project operations project(P PV) with project(PrjMap(P,PV), PV) for each projection in X.This replacement is applied bottom-up when there are multiple project operations in the graph pattern of `EXISTS`.
Applying the renaming steps inside a sub-query does not change the solution mappings resulting from evaluating the sub-query. Remapping is only applied to variables not visible outside the sub-query. Renaming a variable in a SPARQL algebra expression causes the variable name used in bindings from evaluating the algebra expression to change. Since these are only variables that are not visible outside the sub-query, because they do not occur in the projection, the result of the sub-query is unchanged. SPARQL algebra expressions cannot access the name of a variable nor introduce a variable except by remapping.
SPARQL syntactic forms that attempt to bind a variable through the use of
AS
that might already be in a solution mapping are forbidden in
SPARQL; this is covered in the syntactic restrictions of
19.8 Grammar,
notes 12 and 13.
This proposal adds the restriction that any variables in a current
row, the set of variables
in-scope
of the expression containing EXISTS
, cannot be assigned with the extend
algebra function linked to the AS
syntax.
In addition, any use of VALUES
in the EXISTS
expression must not
use a variable in the current row.
The proposal is to retain the variables from the current row, not substitute them for RDF terms before evaluation, and also to restrict the binding of the solution to the RDF term of the current row. This occurs after renaming.
Binding for variables occurs in several places in SPARQL:
- Basic Graph Pattern Matching
- Property Path Patterns
- evaluation of algebra form
Graph(var,P)
involving a variable (from the syntaxGRAPH ?variable {…}
)
Note that other places where solution mappings add variables are in
the extend
function (connected to the AS
syntax)
and a multiset
from VALUES
syntax.
Limitations on Assignment
forbid these being variables of the current row.
Restricting the RDF Terms for a variable binding is done using inline data that is joined with the evaluation of the basic graph pattern, property path, or graph match.
Definition: Values InsertionFor solution mapping `μ`, define `Table(μ)` to be the multiset formed from `μ`.
Table(μ) = { μ } Card[μ] = 1Define the Values Insertion function `Replace(X, μ)` to replace each occurence `Y` of a Basic Graph Pattern, Property Path Expression, `Graph(Var, pattern)` in `X` with `join(Y, Table(μ))`.
The evaluation of EXISTS
is defined as:
Definition: Evaluation of ExistsLet `μ` be the current solution mapping for a filter, and `X` a graph pattern, define the Evaluation of Exists `exists(X)`
exists(X) = true if eval(D(G), Replace(PrjMap(X), μ) is a non-empty solution sequence. exists(X) = false otherwise
This section addresses each issue identified, given the proposal above.
This can be addressed by handling solution sequences as graph patterns where
needed, by adding
toMultiSet
as is done for SubSelect
in 18.2.2.6 Translate Graph Patterns
with a a correction to the text at the end of
the introductory paragraph of
Section 18.2.
query-errata-N:
"Section 18.2 Translation to the SPARQL Algebra" intro (end):
ToMultiSet can be used where a graph pattern is mentioned below because the
outcome of evaluating a graph pattern is a multiset.
Multisets of solution mappings are elements of the SPARQL algebra. Multisets
of solution mappings count as graph patterns.
Rather then replace a variable by its value in the current row, the new
mechanism makes the binding of variable to value available. The variable
remains in the graph pattern of EXISTS
and the evaluation.
By making the current row, which can include blank nodes, available, and not modifying the BGP by substitution, no blank nodes are introduced into the evaluation of the BGP. Instead, the possible solutions are restricted by the current row.
Issue 4 is addressed because variables are not removed from the domain of
MINUS
. This proposal does not preserve all uses of MINUS
expressions; the problem identified in issue 4 is considered to be a bug in the
SPARQL 1.1 specification.
Issue 5 is addressed by noting that variables inside sub-queries which are not projected can be renamed without affecting the sub-query results. Whether to preserve that invariant or allow the variables to be set by the current row is a choice point; this design preserves the independence of disconnected variables.
The proposal described in this document does not cover the use of variables from
the current row in a HAVING
clause.
Original content: https://afs.github.io/substitute.html
See also Correlation and Substitution in SPARQL.