Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix sids for pids #563

Merged
merged 3 commits into from
Aug 28, 2023
Merged

fix sids for pids #563

merged 3 commits into from
Aug 28, 2023

Conversation

dpetran
Copy link
Contributor

@dpetran dpetran commented Aug 24, 2023

This is a general fix for sids used as pids.

Fixes #521

Problem
The problem here was that if a subject is transacted, and then in subsequent transactions that subject iri is used as a predicate iri, we don't recognize that that iri is a vocab iri - it doesn't have a subject id in the "pid range".

It's only a problem insofar as we end up displaying raw sids as key values in the result set of queries - the queries still work and the data is correct, it's just the way we display it that's the problem.

Approaches
So, I think there are two main approaches to solving this problem.

  1. Extend our :schema cache during a transaction
    We can trivially tell if we're going to run into this problem at flake creation time by checking if the pid is greater than the max-schema-sid number. We could then add special handling to the vocab-refresh pipeline to deal with this scenario.

  2. Look up iris during projection
    Just check the display of keys in the response set and look up the iris of any integer key we see coming through.

This PR takes approach 2, as that's the exact way VariableSelectors already work, and we already have an iri cache for each query.

Complications
While implementing I discovered that different query selectors were adding different structures to the iri cache. They all used pids as keys, but they had different structures for values (plain iri or predicate map or partial predicate map) which meant that a pid that was looked up in the VariableSelector could not be used in the SubgraphSelector, and vice-versa.

I fixed that by normalizing the cache structure to be a map of pid -> pred-spec, a map with some metadata about the predicate, including at least an :as key with the compacted iri as a value.

Future considerations
One additional thing I discovered after adding some logging to the cache misses is that the volatile we use as a cache is not thread safe, so there are cases where the same iri will be looked up multiple times in the course of projecting one result set. It may be useful in the future to benchmark and see whether using an atom for the cache would speed things up - trading potential contention on atom updates for potentially more cache hits.

Before this commit each selector used the iri-cache differently, and therefore gave
broken results if they were combined in a scenario that required iri lookups.

This normalizes the cache structure by making every cache update insert an entry in the
form of `(vswap! iri-cache assoc pid {:as <iri>})`, so that every cache read can use the
results regardless of where it was first inserted.

The other fix in this commit is adding a `dbproto/-iri` lookup to the
`response/wildcard-spec` function. Before it would return a raw pid in the response,
which is unhelpful for users.

fixes #521
Copy link
Contributor

@cap10morgan cap10morgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚄

{:as pid})]
(vswap! cache assoc pid p-spec)
p-spec)))
(go-try
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this isn't in a high frequency code path it may not matter, but generally when we use a cache for speed, we want to check the cache before spawning core/async chans which is expensive at volume.

Here you spawn a chan, then conditionally check for an exception, then check for a cached value - where the cached value check needs not a chan, nor an exception check.

You'll note the other similar patterns here where cache checking is done before spawning a chan (see line 68 below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this is in a high frequency code path. I've refactored it to only spawn the thread as a last resort.

@dpetran dpetran requested a review from a team August 24, 2023 17:24
Copy link
Contributor

@cap10morgan cap10morgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@dpetran dpetran merged commit c3c4996 into main Aug 28, 2023
5 checks passed
@dpetran dpetran deleted the fix/sids-for-pids branch August 28, 2023 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

subjects used as predicates don't have iris in query results
3 participants