GH-4920 SPARQLConnection.size() now uses count query #4972

JervenBolleman · 2024-05-05T20:28:25Z

GitHub issue resolved #4920

Briefly describe the changes proposed in this PR:

The SPARQLConnection.size() method should not fetch every statement in the repository; it should just send a count query instead.

PR Author Checklist (see the contributor guidelines for more details):

my pull request is self-contained
I've added tests for the changes I made
I've applied code formatting (you can use mvn process-resources to format from the command line)
I've squashed my commits where necessary
every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

JervenBolleman · 2024-05-05T20:43:13Z

@hmottestad did you see this license failure before?

hmottestad · 2024-05-06T14:04:19Z

core/repository/sparql/src/main/java/org/eclipse/rdf4j/repository/sparql/SPARQLConnection.java

+
+	String sizeAsTupleQuery(Resource... contexts) {
+		String query = COUNT_EVERYTHING;
+		if (contexts != null && isQuadMode()) {


I think contexts should never be null, but contexts[0] can probably be null if the user is asking for the default context.

RDF4J has the rdf4j:nil context and also a sesame one for addressing the default context. I believe it's a general issue with SPARQL that there is no way to address just the triples in the unnamed context. Jena also has their own: https://jena.apache.org/documentation/tdb/datasets.html#special-graph-names

Maybe we can test whether either of these returns results and, if not, fall back to the old method?

It's a problem only when the default graph in SPARQL is the union of all graphs, but this is what I've found most sensible for all applications that I've used.

JervenBolleman · 2024-05-07

I will check that a null contexts[0] works as well as rdf4j:nil or sesame:nil. For cases where the remote uses a non-union default graph, the old code would also have returned 0.
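To make the default-context discussion above concrete, here is a minimal sketch of how a client might choose the count query. It is not the code from this PR: `SizeQuerySketch`, `isDefaultGraphMarker` and `countQuery` are hypothetical names, contexts are modelled as plain IRI strings rather than RDF4J `Resource` objects, and the `VALUES`/`GRAPH` form is just one possible way to restrict the count to named graphs. A null context, rdf4j:nil and sesame:nil are all treated as "default graph" and are never written into the query; the plain pattern then counts whatever the endpoint exposes as its default graph, which may or may not be the union of all graphs.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical helper for illustration; it is not the code from this PR.
// Contexts are modelled as plain IRI strings instead of RDF4J Resource objects.
final class SizeQuerySketch {
	static final String RDF4J_NIL = "http://rdf4j.org/schema/rdf4j#nil";
	static final String SESAME_NIL = "http://www.openrdf.org/schema/sesame#nil";
	static final String COUNT_EVERYTHING = "SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o }";

	// null and the two nil IRIs all stand for the default graph.
	static boolean isDefaultGraphMarker(String context) {
		return context == null || RDF4J_NIL.equals(context) || SESAME_NIL.equals(context);
	}

	static String countQuery(String... contexts) {
		Objects.requireNonNull(contexts, "contexts array must not be null");
		Set<String> named = new LinkedHashSet<>(); // de-duplicates, keeps order
		boolean wantsDefault = false;
		for (String context : contexts) {
			if (isDefaultGraphMarker(context)) {
				wantsDefault = true;
			} else {
				named.add(context);
			}
		}
		if (named.isEmpty()) {
			// No contexts, or only default-graph markers: nothing internal goes over the wire.
			return COUNT_EVERYTHING;
		}
		if (wantsDefault) {
			// Assumption of this sketch: mixed requests are rejected rather than guessed at.
			throw new IllegalArgumentException("mixing default and named graphs is not covered by this sketch");
		}
		StringBuilder values = new StringBuilder();
		for (String iri : named) {
			values.append('<').append(iri).append("> ");
		}
		return "SELECT (COUNT(*) AS ?c) WHERE { VALUES ?g { " + values + "} GRAPH ?g { ?s ?p ?o } }";
	}
}
```

Calling `SizeQuerySketch.countQuery((String) null)` yields the same query as calling it with no arguments, which is the behavior described above for the default graph.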

hmottestad · 2024-05-06T14:05:24Z

core/repository/sparql/src/main/java/org/eclipse/rdf4j/repository/sparql/SPARQLConnection.java

+						+ "> { ?s ?p ?o}}";
+			} else if (contexts.length > 0) {
+				String graphs = Arrays.stream(contexts)
+						.filter(Resource::isIRI)


What happens with blank nodes in our current implementation? Are blank node identifiers at all stable in some way?

JervenBolleman · 2024-05-06

In SPARQL, named graphs are not allowed to be blank nodes. So it could have worked if the stars aligned, but never reliably.
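As a rough illustration of the filtering step in the hunk above, the sketch below drops blank-node graph names before any query is built. It uses plain strings instead of RDF4J `Resource` values, and it assumes blank nodes are written with the `_:` label prefix used by N-Triples and Turtle; `GraphNameFilterSketch` and its methods are hypothetical names, not part of the PR.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; graph names are plain strings here.
final class GraphNameFilterSketch {
	// Assumption: blank nodes carry the "_:" label prefix (N-Triples/Turtle style).
	static boolean isBlankNodeLabel(String name) {
		return name.startsWith("_:");
	}

	// Keeps only names that can be written as <iri> in a GRAPH clause.
	static List<String> iriGraphNames(List<String> names) {
		return names.stream()
				.filter(name -> !isBlankNodeLabel(name))
				.collect(Collectors.toList());
	}
}
```

Silently dropping a blank-node context changes what is counted, so a real implementation might prefer to log or reject such a request instead.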

hmottestad · 2024-05-06T14:06:09Z

...epository/sparql/src/test/java/org/eclipse/rdf4j/repository/sparql/SPARQLConnectionTest.java

@@ -100,6 +114,26 @@ public void testAddSingleContextHandling() throws Exception {
 		assertThat(sparqlUpdate).containsPattern(expectedAddPattern).containsPattern(expectedRemovePattern);
 	}

+	@Test
+	public void testSizeQuery() throws Exception {


Do we have any end-to-end tests that actually spin up a SPARQL endpoint and run operations against it?

JervenBolleman · 2024-05-06

Unfortunately not, as far as I am aware.

aschwarte10 · 2024-05-17

I think in the FedX federation part we spin up an embedded SPARQL server for integration testing. Maybe you can borrow some ideas from there:

https://github.com/eclipse-rdf4j/rdf4j/blob/main/tools/federation/src/test/java/org/eclipse/rdf4j/federated/server/SPARQLEmbeddedServer.java

hmottestad · 2024-05-07T08:08:00Z

@hmottestad did you see this license failure before?

I haven't seen that one before, but I have once experienced an old dependency suddenly failing the license check. I didn't figure out what was wrong last time, ended up just upgrading the dependency. Let me try to submit it to clearlydefined. Might take some time though :(

hmottestad · 2024-05-10T09:30:05Z

CQ: https://gitlab.eclipse.org/eclipsefdn/emo-team/iplab/-/issues/14675

hmottestad · 2024-05-29T14:59:48Z

core/repository/sparql/src/main/java/org/eclipse/rdf4j/repository/sparql/SPARQLConnection.java

+	 */
+	private static boolean isExposableGraphIri(Resource resource) {
+		// We use the instanceof test to avoid any issue with a null pointer.
+		return resource instanceof IRI && RDF4J.NIL != resource && SESAME.NIL != resource;


You'll have to use .equals(...); there is no guarantee that users will actually use our constants.
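The point about `.equals(...)` can be shown with a small stand-alone sketch. The `Iri` class below is a hypothetical stand-in for a value type with value-based equality, not RDF4J's `IRI` interface, and `NIL` stands in for a library constant such as `RDF4J.NIL`. A caller who builds an equal IRI from a string gets a different object, so an identity check with `==` or `!=` misses it.

```java
// Hypothetical stand-ins for illustration; not RDF4J classes.
final class NilComparisonSketch {
	// Minimal value type with value-based equality.
	static final class Iri {
		final String value;

		Iri(String value) {
			this.value = value;
		}

		@Override
		public boolean equals(Object other) {
			return other instanceof Iri && ((Iri) other).value.equals(value);
		}

		@Override
		public int hashCode() {
			return value.hashCode();
		}
	}

	// Stand-in for a library constant.
	static final Iri NIL = new Iri("http://rdf4j.org/schema/rdf4j#nil");

	// Identity check: only true for the constant object itself.
	static boolean isNilByIdentity(Object resource) {
		return resource == NIL;
	}

	// Value check: true for any equal IRI, and safely false for null.
	static boolean isNilByEquality(Object resource) {
		return NIL.equals(resource);
	}
}
```

Calling `NIL.equals(resource)` with the constant on the left also keeps the null safety that the `instanceof` test in the hunk was meant to provide.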

hmottestad · 2024-05-29T15:17:18Z

I'm still a bit unsure about this. We want to be able to correctly count the statements in the default/unnamed graph. I don't know how well the previous solution handled it, but I would assume that it was able to check whether the context is null on each statement.

JervenBolleman · 2024-05-31T15:18:08Z

I'm still a bit unsure about this. We want to be able to correctly count the statements in the default/unnamed graph. I don't know how well the previous solution handled it, but I would assume that it was able to check whether the context is null on each statement.

With respect to the default graph, the behavior of this code is the same as before.

Before:

select * where {?s ?p ?o}

After:

select (count(*) as ?c) where {?s ?p ?o}

The logic is a bit more robust when RDF4J's internal default graph IRIs are used, since these should not be sent over the wire.

GH-4920 SPARQLConnection.size() method should not fetch every stateme…

6bd8fff

…nt in the repository. Just send a count query instead. Signed-off-by: Jerven Bolleman <[email protected]>

JervenBolleman added the ⏩ performance label May 5, 2024

JervenBolleman added this to the 5.0.0 milestone May 5, 2024

JervenBolleman requested a review from hmottestad May 5, 2024 20:28

JervenBolleman self-assigned this May 5, 2024

JervenBolleman modified the milestones: 5.0.0, 4.3.12 May 5, 2024

hmottestad reviewed May 6, 2024

GH-4920 When sending the remote size query make sure we don't send null,

9650d55

or RDF4J nill

hmottestad reviewed May 29, 2024

GH-4920 Make the logic that distinguises between counting from the re…

a01fab9

…mote default graph, or a dataset clearer

hmottestad modified the milestones: 4.3.12, 4.3.13 Jun 4, 2024

hmottestad modified the milestones: 4.3.13, 5.0.2 Jul 24, 2024

hmottestad modified the milestones: 5.0.2, 5.0.3 Aug 2, 2024
