-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for returning commits from /history #379
Conversation
Co-authored-by: Marcela Poffald [email protected]
This allows us to properly retrieve commit flakes from the cache. Co-authored-by: Marcela Poffald [email protected]
Both the dbId subject flake and the ledger time had the @id of :f/time.
It turns out if you have a flake with a predicate of `const/$iri` (0), the datatype of the flake must be a `const/$xsd:string` (1). Otherwise the query select mechanism will try to resolve the flake object's @id, but that requires the flake object to be a subject id. These commit schema flakes had predicate: iri and datatype iri, which was causing queries that referenced them to blow up.
The problem was we were only using the `(next-sid)` result in the new issuer flake but not in the reference pointing to it. Now we use it in both flakes if we need both flakes. Co-authored-by: Marcela Poffald [email protected]
We were not creating an @id for the commit, which was inserting nils for the commit subject. Also refactored the creation of commit flakes to make each group of flakes more clear and only create new sids when we actually use them. Updated the s+p+o query test to not expect nil subjects and control for the source of non-determinism in a commit. Co-authored-by: Marcela Poffald [email protected]
Co-authored-by: Marcela Poffald [email protected]
Also refactored commit to generate flakes first and the merge them into novelty. Co-authored-by: Marcela Poffald [email protected]
Now we can rebuild the Commit document from a query. Co-authored-by: Marcela Poffald [email protected]
Changed the commit data to all share the same subject id: t. Added one missing vocabulary term for commitdata-address. Now s+p+o returns all flakes with no nils. Co-authored-by: Marcela Poffald [email protected]
# Conflicts: # dev/user.clj # src/fluree/db/api/query.cljc # src/fluree/db/json_ld/api.cljc # src/fluree/db/json_ld/vocab.cljc
The `vocab/update-with` function didn't do what I expected - it takes a db and returns a schema, and I was expecting it to return a db with the updated schema. Instead I swapped it with the `vocab/update-with*` which takes a schema and some vocab flakes and returns a new schema cache. Also added the `test-with-files` dep to the :dev deps so I can run the test from api-tests from the repl.
…e cleanup We don't need flakes of the type `[s const/$rdf:type o ...]`, now that we have a `.-dt` field on flakes
…nd asserts/retracts Co-authored-by: Daniel Petranek <[email protected]>
We don't need special iris for commit-related flakes, eg `iri-commitdata-v`. The property `v` is the same property, denoting a version, regardless of whether it appears in a commit document or elsewhere.
`asserts` should include just the actual data transacted by user
# Conflicts: # src/fluree/db/api/query.cljc # test/fluree/db/query/history_test.clj
These are so close to being the same thing that we thought we'd try them with a unified query syntax. Co-authored-by: Marcela Poffald <[email protected]>
Unified the query syntax for history and commit queries. Specifying a time range is now required, which should simplify the explanation of the query syntax. also: - removed some dead code - updated docstrings Co-authored-by: Marcela Poffald <[email protected]>
… result This updates `add-commit-details` to loop through the commit results and chunk together those results with continuous `t`s. We then retrieve commit details for the range of that chunk, rather than for each result. In cases where history results have such continuous results, we can reduce the number of times we make `tspo` range calls to retrieve commit data.
# Conflicts: # src/fluree/db/json_ld/commit.cljc # test/fluree/db/query/misc_queries_test.clj
We had to sneak the previous commit into the commit-flake-addition process in order to get the previous commit and commit data ids. We also had to update how we were constructing the results to not filter the iri flakes out of the commit flakes because we need ids in the final map. Co-authored-by: Marcela Poffald <[email protected]>
Co-authored-by: Marcela Poffald <[email protected]>
"main"] | ||
["fluree:commit:sha256:bs3bcu3rhzlidsdvg3jgvbplhzxxlevvhipxsnqvfk7em4juu2os" | ||
:f/context | ||
"fluree:memory:///contexts/b6dcf8968183239ecc7a664025f247de5b7859ac18cdeaace89aafc421eeddee"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see we get back a lot of fluree-generated commit data now if we pull the whole db after committing. This wasn't happening before because we had a caching bug that kept the commit flakes from being discoverable at all, even though they were there (see #333 (comment) for discussion).
prev-commit-flakes (into prev-commit-flakes) | ||
prev-db-flakes (into prev-db-flakes) | ||
issuer-flakes (into issuer-flakes) | ||
message-flakes (into message-flakes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to also add flakes for commit tags here. We didn't include them in this PR because we weren't clear on the status of tag-related features, but if anyone thinks they should be here, we can add them back.
src/fluree/db/query/history.cljc
Outdated
|
||
Chunks together results with consecutive `t`s to reduce number of `time-range` index traversals | ||
needed for commit retrieval." | ||
[db context history-results] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add commit details to history results, we:
- do the regular history query, which tells us which
t
s are relevant. - we then go get commit details for those
t
s -- getting commits requires atspo
range scan
That tspo
range scan only works on ranges, so if we have sparse/non-consecutive t
s to consider, right now we make range scan call for each. If we need commits for t
s 1-3 and 5, we scan:
- from t=1 to t=3
- then again from t=5 to t=5
Ideally, there's some solution where we don't do multiple passes, or where this is less expensive than it is now. For now, this Works (TM).
I did add this little optimization to at least chunk together consecutive t
s to reduce the number of calls in some cases, but I still hold out hope to do something lower-level in the future.
This optimization could still be improved a bit though: the history-flakes->json-ld
fn returns all the results in one list on the output channel. I'd like to change it to put one result at a time, so that this fn can read off the channel rather than us fully realizing the list just to go through it again.
I'm going to work on that while this is under review, but I'd welcome any pointers from people who know more about core.async
than I do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed a commit that changeshistory-flakes->json-ld
to just put one result at a time on the channel: f6f4558
The rest of the above comment still applies
:t - a map with keys :from and :to, at least one is required if :t is provided." | ||
[query] | ||
(history-query-validator query)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the enormous diff here, I wanted to move the history-specific helpers and syntax stuff into its own ns.
@@ -379,15 +416,15 @@ | |||
new-commit*) | |||
[new-commit** jld-commit] (commit-data/commit-jsonld new-commit*) | |||
signed-commit (if did | |||
(cred/generate jld-commit private (:id did)) | |||
(<? (cred/generate jld-commit private (:id did))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixed a bug where we we were writing out a stringified async channel instead of the actual signed commit, eg:
#object[clojure.core.async.impl.channels.ManyToManyChannel 0x66731af8 "clojure.core.async.impl.channels.ManyToManyChannel@66731af8"]%
This way we are not unnecessarily realizing the whole list of history results before going through them again to add commit-details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few things I think we should change before merging. There are a few places where we wrap things in go
blocks when it's not necessary. Every go block creates a channel, and every put and take off of a channel has a cost, so we should try to minimize the channels we create.
Secondly, there are a lot of places where we use <?
and go-try
where I don't think we should. There are two ways to use core.async channels. One is like an analogue of a promise, and another is like an async analogue of a seq
. The promise analogue returns a single result of an async process "later". That's what we should use <?
and go-try
for, because it makes sense to check the single contents of the channel for an exception.
The seq
analogue is different however. Higher level sequence processing falls down hard once some element of a sequence could be an exception. Higher order functions like async/map
, async/reduce
, async/transduce
, transducers attached to channels like what you get from partition-by
, map
, filter
, etc, and calls to the pipeline-*
functions don't mix when there could be an exception on a channel at any point. I think it's best to make that explicit by using go
and <!
in those cases, with the addition of an explicit error channel that we can use to "short-circuit" processing if there's any part of the asynchronous process generating the sequence that has an unrecoverable error.
Lastly, and this is probably too heavy of a lift to address in this pr, but I also think there are areas where we could improve performance here by changing the order of processing a bit. We are wasting a lot of time and space by collecting all the history flakes in a single vector in time-range
, only to immediately split that vector again multiple times (at least once for history-flakes->json-ld
, and then again for add-commit-details
). I think we can save on this memory thrashing by processing the flakes from the tspo index lazily as they are loaded by reimplementing the logic here to use more transducers and more calls to pipeline-async. That isn't the smallest lift, so I'm fine with merging this for now after the other cleanup with plans to refactor later.
src/fluree/db/query/history.cljc
Outdated
(async/pipeline-async 2 | ||
out-ch | ||
(fn [t-flakes ch] | ||
(-> (go-try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using go-try
here implies that the code inside can throw exceptions, and that those exceptions would be caught and placed on the output channel. There's nothing to read that output channel created by go-try
to handle any errors that might be on it though (this is wrapped in a pipeline-async
call, which will only read the passed in ch
argument to the async fn. Any new channels returned by the async fn will be ignored). If any of the code inside does throw errors, those errors aren't being handled.
I think we should be explicit here about exceptions. If the code inside can't throw an exception, then we should replace go-try
with go
.
If it can throw exceptions, then we should replace go-try
with (go (try* ... (catch* ...
, and put any caught errors on the error channel so they can be handled downstream.
src/fluree/db/query/history.cljc
Outdated
[{:id :ex/foo :ex/x 1 :ex/y 2}...] | ||
" | ||
[db compact cache fuel error-ch t-flakes] | ||
(go-try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're wrapping this code in a go-block (which has a cost) for no reason. There are no asynchronous puts or takes in the wrapped code until the very end. Instead of wrapping this code in a go-block that doesn't do any async stuff before it terminates, you could rewrite this function like this.
(defn t-flakes->json-ld
[db compact cache fuel error-ch t-flakes]
(let [s-flake-ch (->> t-flakes
(group-by flake/s)
vals
async/to-chan!)
s-json-ch (async/chan)]
(async/pipeline-async 2
s-json-ch
(fn [assert-flakes ch]
(-> (s-flakes->json-ld db cache compact fuel error-ch assert-flakes)
(async/pipe ch)))
s-flake-ch)
(async/into [] s-json-ch)))
That saves a channel and doesn't change the behavior of this function from the perspective of the caller.
src/fluree/db/query/history.cljc
Outdated
" | ||
[db compact cache fuel error-ch t-flakes] | ||
(go-try | ||
(let [s-flake-partitions (fn [flakes] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this function is only called once. I think it'll be clearer if we apply the code inside to the t-flakes directly on line 127.
src/fluree/db/query/history.cljc
Outdated
|
||
s-ch (s-flake-partitions t-flakes) | ||
s-out-ch (async/chan) | ||
s-json-ch (async/into [] s-out-ch)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on how you want to use the output of this function, it's usually much more flexible and performant to save the step of fully realizing an async channel and collecting its contents into a vector to the very end of your code path right before you return data to the user. That way downstream code can filter
, limit
, or even ignore the stuff on the channel entirely without us paying the penalty of loading the whole thing if we didn't actually need it all. It's also much easier to reason about, apply transducers to, and include in a pipeline-*
chain a channel-of-x's instead of a channel-of-collections-of-x's
src/fluree/db/query/history.cljc
Outdated
(-> (go-try | ||
(let [{assert-flakes true | ||
retract-flakes false} (group-by flake/op t-flakes)] | ||
{(json-ld/compact const/iri-t compact) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we explicitly compute these keys (and possibly the takes off of these channels too) in the let
binding instead of inline while creating this output map? It was a bit difficult for me to tell what was going on here, and computing this stuff in the binding gives the opportunity of implicit documentation with the binding names.
src/fluree/db/query/history.cljc
Outdated
needed for commit retrieval." | ||
[db context history-results-chan] | ||
(go-try | ||
(when-let [first-result (<? history-results-chan)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are multiple items on the history result channel (implied by the loop
below), then there should never be an exception on it. If something needs to throw an exception at some point of the process of generating history results, then that exception should be handled in a separate error channel instead of mixed in with valid results. <?
should be used for channels that will only contain a single item, and that item might be an exception.
src/fluree/db/query/history.cljc
Outdated
[db context history-results-chan] | ||
(go-try | ||
(when-let [first-result (<? history-results-chan)] | ||
(let [t-key (json-ld/compact const/iri-t context)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're compacting this iri-t constant multiple times with the same context. I think that implies that this add-commit-details
function should come higher in the chain of computing history results so you can access the data you need before it's compacted.
src/fluree/db/query/history.cljc
Outdated
[{:id :ex/foo :f/assert [{},,,} :f/retract [{},,,]]}] | ||
" | ||
[db context flakes error-ch] | ||
(go-try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you're wrapping a channel with a channel here, which will lead to very weird results. It doesn't look like there are any async puts or takes in here, so I think you can just remove this go block all together.
src/fluree/db/api/query.cljc
Outdated
;; filter flakes for history pattern | ||
(let [[pattern idx] (<? (history/history-pattern db context history)) | ||
history-error-ch (async/chan) | ||
flakes (<? (query-range/time-range db idx = pattern {:from-t from-t :to-t to-t})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're calling time-range
here directly, and then further processing the flakes it returns. The problem with this is time-range
collects its results into a single vector, and then you split that vector again and process it asynchronously for each t in the results. There's a relatively big cost in collecting all the results in memory only to split them up again for further processing.
Ideally we'd only collect results at the last moment before they are returned to the user. This would use both less time and less space. You can do this by implementing a function in terms of tree-chan that returns a channel that will contain a sequence of flake-chunks where there's one chunk for each t by passing the appropriate transducers to tree-chan
. This function would probably look a lot like time-range
without that collection step.
Then pass the channel returned by this new function with the logic in history-flakes->json-ld
without the code to partition by t to process each chunk of flakes with pipeline-async
to transform each flake chunk into a jsonld map.
Then, I would reimplement add-commit-details
as a transducer + an async function to process each jsonld map and add commit details if necessary.
src/fluree/db/query/history.cljc
Outdated
(go-try | ||
(when-let [first-result (<? history-results-chan)] | ||
(let [t-key (json-ld/compact const/iri-t context)] | ||
(loop [result first-result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this loop here does two things: 1: groups consecutive t's into chunks, and 2: processes each chunk asynchronously to add commit details. It's definitely more efficient space-wise, and probably more efficient time-wise to reimplement this function in terms of a transducer that does step 1 and a pipeline-async call that does step 2.
We removed some unnecessary go-blocks, added more explicit error handling, tried pushing the realization of the final results as high up the call stack as we could. Co-authored-by: Marcela Poffald <[email protected]>
We changed the default sort order of the history query to be in ascending order by t - earliest change first. This matches the default tspo sort order, so we no longer have to sort the commit flakes. History still needs to be sorted because sometimes it uses non-tspo indexes to satisfy the query. Moved the async into calls as high up the call stack as we could, and removed unnecessary go blocks. Co-authored-by: Marcela Poffald <[email protected]>
time-range returns a stream of flake slices, where each slice is the subset of flakes from an index leaf. Up to now we have been treating this as a promise and only taking the first result off the time-range chan. Now we treat it as a proper stream. Co-authored-by: Marcela Poffald <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on the cleanup
src/fluree/db/api/query.cljc
Outdated
[fluree.db.query.range :as query-range] | ||
[fluree.db.dbproto :as dbproto] | ||
[fluree.db.flake :as flake] | ||
[fluree.db.util.core :as util #?(:clj :refer :cljs :refer-macros) [try* catch*]] | ||
[fluree.db.util.async :as async-util :refer [<? go-try]] | ||
[fluree.db.util.async :as async-util :refer [<? go-try into?]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this into?
is used anywhere.
Related issues
Closes #333
Closes #359
Closes #342
If merged, fluree/http-api-gateway#7 will not be required.
Overview
This PR extends the
history
api with the ability to return commits for a given time range.This is done using flakes to represent commit data: we amended the flakes we were already inserting upon commit to include all the necessary parts of the commit. These are then retrieved via time-range calls and assembled into json-ld maps.
New Syntax
:commit-details
flag. Iftrue
, returns commit data, eg:-
{:history :ex/alice :commit-details true :t {:from 1 :to 5}}
: amend history results with commits for anyt
s reprsented in those results-
{:commit-details true :t {:from 1 :to 5}}
: just the commits for time ranget
t
options::latest
uses the db'st
:{:history :ex/alice :commit-details true :t {:from 1 :to :latest}}
:at
can be used instead offrom
andto
(sugar for equalfrom
andto
values):{:commit-details true :t {:at :latest}}
t
is now required for all history queries, with or without commit details. Previously, the caller could omitt
s to range over all of history. Now you would need to pass in{:from 1}
to do that.Example commit response:
A map keyed by
:f/commit
:This inlines the data we usually store in a separate file into the
data
field of the commit map (see #333 (comment) for relevant discussion).Implementation notes
Why add commit flakes instead of grabbing the commit from storage?
If we don’t want any commit-related flakes at all, then we should consider a future solution that allows us to retrieve commit data some other way (besides storage).
Future considerations: distinguishing fluree-generated data from user-relevant data
We had to do some workarounds to make sure the asserts/retracts only reflected the data asserted/retracted by the user, and omitted fluree-generated flakes that are just to make our implementation work. Because fluree-supplied flakes (such as commit flakes) and user-supplied changes live alongside each other in the db, we need to filter them out ourselves.
For now, we’ve just hard-coded (eg
commit-metadata-flake?
in thehistory
ns). This is similar to something we already do by hand when creating commit files (see the commit/generate-commit fn)However, if the distinction between internal metadata and "user" data continues to be important, we should consider a more principled way to sequester fluree-generated data.