-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix sids for pids #563
fix sids for pids #563
Conversation
Before this commit each selector used the iri-cache differently, and therefore gave broken results if they were combined in a scenario that required iri lookups. This normalizes the cache structure by making every cache update insert an entry in the form of `(vswap! iri-cache assoc pid {:as <iri>})`, so that every cache read can use the results regardless of where it was first inserted. The other fix in this commit is adding a `dbproto/-iri` lookup to the `response/wildcard-spec` function. Before it would return a raw pid in the response, which is unhelpful for users. fixes #521
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚄
{:as pid})] | ||
(vswap! cache assoc pid p-spec) | ||
p-spec))) | ||
(go-try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this isn't in a high frequency code path it may not matter, but generally when we use a cache for speed, we want to check the cache before spawning core/async chans which is expensive at volume.
Here you spawn a chan, then conditionally check for an exception, then check for a cached value - where the cached value check needs not a chan, nor an exception check.
You'll note the other similar patterns here where cache checking is done before spawning a chan (see line 68 below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this is in a high frequency code path. I've refactored it to only spawn the thread as a last resort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
This is a general fix for sids used as pids.
Fixes #521
Problem
The problem here was that if a subject is transacted, and then in subsequent transactions that subject iri is used as a predicate iri, we don't recognize that that iri is a vocab iri - it doesn't have a subject id in the "pid range".
It's only a problem insofar as we end up displaying raw sids as key values in the result set of queries - the queries still work and the data is correct, it's just the way we display it that's the problem.
Approaches
So, I think there are two main approaches to solving this problem.
Extend our :schema cache during a transaction
We can trivially tell if we're going to run into this problem at flake creation time by checking if the pid is greater than the max-schema-sid number. We could then add special handling to the vocab-refresh pipeline to deal with this scenario.
Look up iris during projection
Just check the display of keys in the response set and look up the iris of any integer key we see coming through.
This PR takes approach 2, as that's the exact way
VariableSelector
s already work, and we already have an iri cache for each query.Complications
While implementing I discovered that different query selectors were adding different structures to the iri cache. They all used pids as keys, but they had different structures for values (plain iri or predicate map or partial predicate map) which meant that a pid that was looked up in the VariableSelector could not be used in the SubgraphSelector, and vice-versa.
I fixed that by normalizing the cache structure to be a map of pid -> pred-spec, a map with some metadata about the predicate, including at least an
:as
key with the compacted iri as a value.Future considerations
One additional thing I discovered after adding some logging to the cache misses is that the
volatile
we use as a cache is not thread safe, so there are cases where the same iri will be looked up multiple times in the course of projecting one result set. It may be useful in the future to benchmark and see whether using an atom for the cache would speed things up - trading potential contention on atom updates for potentially more cache hits.