Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-4920 SPARQLConnection.size() now uses count query #4972

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

JervenBolleman
Copy link
Contributor

GitHub issue resolved #4920

Briefly describe the changes proposed in this PR:

SPARQLConnection.size() method should not fetch every statement in the repository. Just send a count query instead.


PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

…nt in the repository. Just send a count query instead.

Signed-off-by: Jerven Bolleman <[email protected]>
@JervenBolleman JervenBolleman added this to the 5.0.0 milestone May 5, 2024
@JervenBolleman JervenBolleman self-assigned this May 5, 2024
@JervenBolleman JervenBolleman modified the milestones: 5.0.0, 4.3.12 May 5, 2024
@JervenBolleman
Copy link
Contributor Author

@hmottestad did you see this license failure before?


String sizeAsTupleQuery(Resource... contexts) {
String query = COUNT_EVERYTHING;
if (contexts != null && isQuadMode()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think contexts should never be null, but contexts[0] can probably be null if the user is asking for the default context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RDF4J has the rdf4j:nil context and also a sesame one for addressing the default context. I believe it's a general issue with SPARQL that there is no way to address just the triples in the unnamed context. Jena also has their own: https://jena.apache.org/documentation/tdb/datasets.html#special-graph-names

Maybe we can test if either one of these returns results, and if not fallback to the old method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a problem only when the default graph in SPARQL is the union of all graphs, but this is what I've found most sensible for all applications that I've used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check that a null context[0] will work as well as the rdf4j:nil or semame:nil. For the cases where the remote uses a non union default graph the old case would also have returned 0.

+ "> { ?s ?p ?o}}";
} else if (contexts.length > 0) {
String graphs = Arrays.stream(contexts)
.filter(Resource::isIRI)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with blank nodes in our current implementation? Are blank node identifiers at all stable in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I SPARQL named graphs are not allowed to be blank nodes. So it could have worked if the stars aligned, but never reliably.

@@ -100,6 +114,26 @@ public void testAddSingleContextHandling() throws Exception {
assertThat(sparqlUpdate).containsPattern(expectedAddPattern).containsPattern(expectedRemovePattern);
}

@Test
public void testSizeQuery() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any end to end tests that actually spin up a SPARQL endpoint and runs operations against that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not that I am aware off

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in FedX federation part we have spin up a SPARQL embedded server for integration testing. Maybe you can borrow some ideas from there

https://github.com/eclipse-rdf4j/rdf4j/blob/main/tools/federation/src/test/java/org/eclipse/rdf4j/federated/server/SPARQLEmbeddedServer.java

@hmottestad
Copy link
Contributor

@hmottestad did you see this license failure before?

I haven't seen that one before, but I have once experienced an old dependency suddenly failing the license check. I didn't figure out what was wrong last time, ended up just upgrading the dependency. Let me try to submit it to clearlydefined. Might take some time though :(

@hmottestad
Copy link
Contributor

@hmottestad did you see this license failure before?

I haven't seen that one before, but I have once experienced an old dependency suddenly failing the license check. I didn't figure out what was wrong last time, ended up just upgrading the dependency. Let me try to submit it to clearlydefined. Might take some time though :(

CQ: https://gitlab.eclipse.org/eclipsefdn/emo-team/iplab/-/issues/14675

*/
private static boolean isExposableGraphIri(Resource resource) {
// We use the instanceof test to avoid any issue with a null pointer.
return resource instanceof IRI && RDF4J.NIL != resource && SESAME.NIL != resource;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to use .equals(...), there is no guarantee that users will actually use our constants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@hmottestad
Copy link
Contributor

I'm still a bit unsure about this. We want to be able to correctly count the statements in the default/unnamed graph. I don't know how well the previous solution handled it, but I would assume that it would be able to check for if the context is null on each statement.

@JervenBolleman
Copy link
Contributor Author

I'm still a bit unsure about this. We want to be able to correctly count the statements in the default/unnamed graph. I don't know how well the previous solution handled it, but I would assume that it would be able to check for if the context is null on each statement.

The behavior is the same between this code and previous regarding the default graph.
Before:

select * where {?s ?p ?o}

New

select (count(*) as ?c) where {?s ?p ?o}

The logic is a bit more robust in the case of using the internal default graph IRI's of RDF4j. Which should not be send over the wire.

@hmottestad hmottestad modified the milestones: 4.3.12, 4.3.13 Jun 4, 2024
@hmottestad hmottestad modified the milestones: 4.3.13, 5.0.2 Jul 24, 2024
@hmottestad hmottestad modified the milestones: 5.0.2, 5.0.3 Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SPARQLConnection size method counts client side instead of server side
3 participants