
Feature: strings API fns #438

Closed
wants to merge 56 commits into from

Conversation

@cap10morgan (Contributor) commented Apr 4, 2023

Since our most common / important use case is JSON, let's make string keys the happy path in our API. Then we can convert from/to Clojure keywords in a CLJ SDK wrapper (that's not in this PR but would come later; either a separate PR or before this leaves "draft" status).

This allows us to always use just strings everywhere internally, including in the context (so no more :context-type or :context-str; the one and only :context will always use strings). And all of our data-handling code can just assume strings and not have to branch on things like :context-type.

The performance penalty of deep conversion from keywords to strings that this requires would only be paid by people using keywords w/ the CLJ SDK wrapper. JSON users would never pay it b/c their payloads don't need conversion. And if someone using the CLJ SDK hits a performance wall, they can get around it by using strings instead of keywords in their data too (or using the JSON API if they prefer). I think all of that is better than maintaining two contexts and dealing with two data formats internally.
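For illustration, the deep keyword-to-string conversion described above could be sketched roughly as follows (a minimal sketch using clojure.walk; `kw->str` and `stringify` are hypothetical names, not functions from this PR, and the real wrapper would also need special cases such as mapping :context to "@context"):

```clojure
(ns example.stringify
  (:require [clojure.walk :as walk]))

(defn kw->str
  "Convert a keyword to its string form, preserving the namespace
  (:ex/name => \"ex:name\") so compact IRIs survive the round trip."
  [k]
  (if-let [ns' (namespace k)]
    (str ns' ":" (name k))
    (name k)))

(defn stringify
  "Recursively replace every keyword in a nested structure with a string."
  [data]
  (walk/postwalk (fn [x] (if (keyword? x) (kw->str x) x)) data))

;; (stringify {:context {:ex "http://example.org/ns/"} :ex/name "Alice"})
;; => {"context" {"ex" "http://example.org/ns/"} "ex:name" "Alice"}
```

JSON payloads never touch this path, which is why only keyword-using CLJ SDK callers would pay the walk.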

But I might be missing something important or otherwise wrong in my evaluation of the problem and this solution. Please let me know because this is, admittedly, a big change.

The other big reason this is draft status for now is that we either a) need to convert the EDN-using tests to using strings and related changes or b) wait for the CLJ SDK to be done and point them at that for now. a) seems like the better long-term place to end up, but b) might be fine for now, depending on the amount of work left in the CLJ SDK wrapper.

There is a companion PR for http-api-gateway that I need to put up soon as well. I'll update this comment w/ a link when I do. UPDATE: Here it is

cap10morgan and others added 8 commits March 10, 2023 12:23
• Contains some fixes & new features we need (e.g. ::m/default support in :map schemas)
• ...to better optimize for JSON use cases.
• Only the f.d.json-ld.api-test and f.d.query.fql-parse-test test namespaces have been updated, so the others fail as of this commit.
• ...discovered while getting http-api-gateway ready to work with this.
• Not sure what I was thinking there...
• ...for HTTP API validation & documentation
@cap10morgan (Author):

Now that #437 has landed I'm going to merge in main and resolve the conflicts in here.

Contributor:

For all the policy stuff it looks like we're using compact IRIs defined by some internal context. I think it would be nice to either have a formalized constant that defines what the internal context is, or work with the expanded form and def some constants for all the expanded IRIs we care about.

This isn't a blocker for this PR, I just want to make a note.

@zonotope (Contributor) left a comment:

I'm still reviewing, but I'll leave the comments here that I have so far.

src/fluree/db/db/json_ld.cljc (outdated)
(log/trace "retrieve-context - default: " default-context "supplied:" supplied-context "context-type: " context-type)
(or (get-in @context-cache [context-type supplied-context])
[default-context context-cache supplied-context]
(log/debug "retrieve-context - default: " default-context "supplied:" supplied-context)
Contributor:

I think the proliferation of DEBUG level logs makes debugging harder, not easier. I would rather our default log level for stuff like this be TRACE instead of DEBUG, since our default log level in dev is DEBUG. That way we can opt in to the namespaces we care about getting logs for, instead of opting out of all the ones we don't care about to avoid drowning in a sea of irrelevant messages.

Author:

Yeah, we should formalize and document that. Because I never use TRACE and then miss some messages when others do. I would rather change the default log level and then just opt into debug in certain namespaces instead of having a mishmash of debug and trace logs everywhere that are effectively at the same level of granularity. Unless of course we define a meaningful difference between the two that we all agree to use. Otherwise trace seems superfluous.

Contributor:

The problem I have with it is that debug is enabled by default in development, so the repl can sometimes become unusable when you have a running system that you are exploring. Also, I view trace logs as literally tracing through the execution of the code, and that's more in line with what's happening here (admittedly subjective).

I would just rather not have all these messages printed to the repl by default. That makes it more difficult to actually identify the things you care about when in development.

Contributor:

I think INFO should be the default; I for one am often drowning in the DEBUG output but have been too lazy to adjust. Some sort of normalization of log levels should be done, though; I think that would make logging more useful for everyone.

Author:

I agree about not cluttering the REPL. But debug being enabled is just our dev logging config, isn't it? We can change that whenever we want. It's probably the dev/logback.xml file.

Contributor:

I think for me these specific logs are too granular, more trace-like. I would be inclined to change my logging level to omit these, but then I would lose others that I consider more valuable.

Fwiw for me debug logs are more "top-level," like at the entry or exit points of namespaces or significant code paths. I use them to help me figure out where to start looking if something has gone wrong.

These logs are ones I'd want enabled if, for example, I knew I needed to be looking in this ns and following its operations closely. To me that's more trace-level.

Author:

Yeah that all seems reasonable. I just would like to have a shared, documented definition to follow. Otherwise it's easy to get lost in "what level do I need where?"

For this PR I'm happy to remove any debug logs that seem to create more noise than they're worth.

@bplatz (Contributor), Apr 18, 2023:

DEBUG is really a customer-driven requirement first. While developing you are probably doing 1 operation a minute, but customers will easily have 1,000,000x more volume.

If there is an issue we cannot immediately resolve, our next step is to ask them to set logging to DEBUG. This needs to be done in a way they don't immediately run out of disk space, or blow up their logging software costs. Importantly it needs to be useful but also manageable for us to analyze. I think ideally a target of ~2x the data volume of INFO.

TRACE is last resort, and likely a system can only run in TRACE for a very short amount of time.

Importantly all logging should take care not to leak sensitive information (customer data). This is one of the top ways data breaches happen.

Contributor:

I said "default log level", which I now get is confusing. I did not mean the default log level as set in logback.xml. I meant the log level we should usually pick when adding a new log statement in our code.

I think Marcela and Brian said better what I was trying to get at. I view the TRACE level as a granular, function-by-function tracing through each step of code execution. That's useful sometimes, but when it's on all the time, it becomes useless. I view DEBUG as a high-level view of when specific components are engaged (like starting to search for a query with a specific where clause), which is almost always useful in development, and also useful to figure out what components are breaking in production for a customer. I view INFO as essential messages reporting progress or normal operations of the program that would be useful in production and for a customer to monitor. WARN is for something the customer should watch out for, and ERROR is for when something failed.

The log levels for the program are set in the logback.xml. We can edit that file in development when we want to change the log level for the whole program or a specific namespace when we want more or less messages. The level is debug in development and info in production (as I think it should be).

These function by function logs printing out the input and output of each function should be set to trace, in my opinion. Then, for the few cases in development or wherever that we actually want such granular logs when we're trying to pinpoint an issue in a specific function or verify that it's doing the right thing, we can change the logback.xml to set the log level to trace for just the namespace we care about at that time. Using the same level for everything removes that flexibility.
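As a sketch of the per-namespace opt-in described above (illustrative only; the actual dev/logback.xml, appender names, and namespace are assumptions, not taken from this repository):

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder><pattern>%d %-5level %logger - %msg%n</pattern></encoder>
  </appender>

  <!-- Dev default stays at DEBUG for the whole program -->
  <root level="DEBUG">
    <appender-ref ref="STDOUT"/>
  </root>

  <!-- Opt one namespace into TRACE while pinpointing an issue -->
  <logger name="fluree.db.json-ld.policy" level="TRACE"/>
</configuration>
```

Setting only the namespace under investigation to TRACE keeps the REPL usable while still surfacing the granular logs when they're wanted.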

Author:

This all seems fine but we should specify and document it so we can all be on the same page about what to use when in the future.

(-> (dbproto/-rootdb db)
(query-range/index-range :spot = [subject-id const/$rdf:type])
<?))))
(map flake/o
Contributor:

Not for this PR, but we should eventually use transducers and/or calls to the lower-level fluree.db.index/tree-chan function to transform index scans as they are retrieved, instead of after the fact, because that's more efficient.
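A rough illustration of the transducer idea (a hypothetical sketch with core.async; the real tree-chan integration would differ, and `start-scan!` and `flake->o` are stand-ins, not functions from this codebase):

```clojure
(ns example.index-xf
  (:require [clojure.core.async :as async]))

(defn object-values-chan
  "Return a channel that yields object values directly: the (map flake->o)
  transducer runs as each flake is put on the channel, instead of walking
  a fully realized result set afterwards."
  [start-scan! flake->o]
  (let [out (async/chan 32 (map flake->o))]
    ;; start-scan! stands in for the producer (e.g. an index range scan)
    ;; that puts raw flakes onto `out` and closes it when done.
    (start-scan! out)
    out))
```

The transform happens item-by-item as results stream off the index, so no intermediate collection of raw flakes is ever built.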

src/fluree/db/db/json_ld.cljc (outdated)
@@ -22,7 +22,7 @@
"returns true if json-ld value is a list object."
[v]
(and (map? v)
(= :list (-> v first key))))
(contains? v :list)))
Contributor:

yay for builtin fns!


#?(:clj (set! *warn-on-reflection* true))

(def retract-key-re (re-pattern "^.+[/:#]retract$"))

(def registry
Contributor:

How is this registry used?

Author:

You know, I think it isn't actually in this PR. It was only being used by the CLJ SDK stuff. If that doesn't end up in here I can remove this.

Contributor:

I bet you're using this in the gateway, fwiw.

Author:

Ah, yes, good catch @dpetran. I bet I am. :)

src/fluree/db/query/exec.cljc (outdated)
(if-let [code (some-> q :having parse-code)]
(assoc q :having (eval/compile code))
(if-let [code (some-> q (get "having") parse-code)]
(assoc q "having" (eval/compile code))
q))

(defn parse-analytical-query*
Contributor:

what's stopping us from leaving/converting these query keys to keywords here since we're already doing multiple conversions anyway?

Author:

I'm not sure what multiple conversions you're referring to, but this is basically the same answer as the one for "select-one".

@dpetran (Contributor) left a comment:

🥇 LGTM, though I think we should wait to merge until others have their review in.

@@ -209,6 +209,32 @@
(assoc acc k v)))
{} m))

(defn assoc-from-str-opts
Contributor:

Something about this fn gives me pause, though I'm struggling to articulate what exactly. In part, it may be the two usage types (collection of keys vs. single-element map), which don't feel intuitively related to me. I also struggled for a while to understand its purpose just from its call sites (though maybe that's just me 😅 ).

Is the alternative to this fn just manually mapping to the appropriate keywords? When would I reach for this fn in the future?

Author:

Yeah, me too!

There are two ways I look at this fn:

  1. This is a temporary stop-gap to limit the delta of this change as much as possible and will be removed when we settle where we want keywords and where we want strings and make the entire codebase conform to that.
  2. It is good to explicitly grab only the keys we want to forward on to our internal code for user-supplied maps like this. Some of them have internal-intended keys that they also look for but we wouldn't really want users to set directly. So I would want to retain that aspect in a simpler & more performant manner.

(swap! cache assoc equals-rule last-result)
last-result))))))
(log/debug "resolve-equals-rule policy:" policy)
(log/debug "resolve-equals-rule path-pids:" path-pids)
Contributor:

fwiw, the debug logs throughout policy code are what initially made me feel a bit uneasy. I certainly added log statements in a lot of the same places when I was actively working on this part of the code, but I'm not sure we should leave them in for general use. I do think it's good to know if policy code is working at all (eg, is there now an active policy), but maybe not all these intermediate steps.

Contributor:

This is why I think they should be set to TRACE. Sometimes we care about this in development; most of the time we don't. When we do care, we can change the logback.xml to TRACE just for this namespace.

@@ -339,7 +328,6 @@
:novelty novelty
:policy root-policy-map
:default-context default-context
:context-type context-type*
Contributor:

I really like the idea of not having context-type anymore!

Contributor:

Does this code eliminate our ability for developers to use a keyword-based context? Or is that feature now handled in an alternative way?

Author:

It will be handled in an alternative way (that isn't in here yet).

@bplatz (Contributor), Apr 19, 2023:

Because :context-type is now an option only used by CLJ devs using the db library after #437, and it is no longer stored on the db, you'd have to supply the option with every API call, I'd think?

This change would seem to necessitate this as the new CLJ developer experience:

(def ledger @(fluree/load conn "my/db"))
(def mydb @(fluree/db ledger))

(def mydb2 @(fluree/stage mydb [{...txn...}] {:context-type :keyword}))
(def mydb3 @(fluree/stage mydb2 [{...txn...}] {:context-type :keyword}))

(def results1 @(fluree/query mydb3 {:select ... :where ...} {:context-type :keyword}))

Or, I guess an alternative is we'd have to create a custom CLJ API that automatically merges this option into all relevant API calls, but that would also mean maintaining two APIs going forward instead of one. v1/v2 ended up going this route with a substantial amount of pain (they still don't all work the same and remain a primary origin of bugs and documentation-maintenance problems), so I was optimistic we could avoid that with v3, but maybe not.

Or there may be another alternative I'm not thinking of.

Regardless, it seems likely that this will have an impact on the developer experience, so knowing if that is the case and if so, how, is an important consideration.

Contributor:

I agree with this. We can't evaluate this pull request until we understand how it impacts the Clojure development experience and making sure there are no regressions. I don't see anything related to that here or in any other pull request.

Author:

This should all now be addressed (see further commits and comments on the PR below), but let me know if not.

(flake/o f))]
(recur r (conj acc res)))
(if (and (= 1 (count acc))
(not (#{"list" "set" :list :set} (-> context (get p-iri) (get :container)))))
Contributor:

This part of the code is still unfamiliar to me generally. Why would we see the compacted "list" or "set" down here, and still possibly also the keywords?

But we only have to look for the keyword :container?

Author:

Oh, we likely don't need all of those anymore. Good catch!

Contributor:

That's because our json-ld/expand function keywordizes "@container" and "@list" and removes the "@" prefix from "@set". This code is just meant to work with that output.

@cap10morgan (Author):

OK the initial version of the CLJ SDK is now pushed here primarily in 3e11f79.

It supports connect, create, load, stage, commit!, and query w/ keyword compact IRIs & contexts both in requests and responses.

You do not need to specify a :context-type anywhere because keyword contexts always get converted to their equivalent string contexts before being handed to the API. Any time you want to use a JSON-LD @-keyword as a Clojure keyword in your data, you have to put :foo "@foo" in your context (except for :context itself, which can't be aliased and needs special treatment anyway; :context will always get converted to "@context").

See the new fluree.sdk.clojure-test test namespace for examples of using the Clojure SDK. It's intended to be a drop-in replacement for fluree.db.json-ld.api in Clojure code that enables keywords along the lines of our existing support.

As this is a work in progress, there are some missing features and quick-and-dirty implementations here and there. I have called out most of them with ;; TODO: ... comments. But this should show enough to prove the concept and demonstrate the general approach of the implementation.

@cap10morgan (Author):

Multi-queries are now supported in the CLJ SDK too in b19e945

@cap10morgan (Author):

And history queries work in the CLJ SDK now too (from caffde5)

(api/query db)
deref
(log/debug->>val "pre-encoded query results:")
encode-results))))
Contributor:

This approach does not work. It doesn't account for string values that merely look like context prefixes (but aren't, and aren't even IRIs). In addition, it adds an immensely expensive processing step to try to reproduce results that work well today without any of that expense (unpacking every query result and running expensive regexes on every key and every value, recursively).

For example, transacting this data:

{:id     :ex/broken
 :ex/str "ex: this won't work"}

then with this query:

{:select {?s [:*]}
 :where  [[?s :id :ex/broken]]}

Produces this invalid EDN data structure (which surprisingly outputs in the repl):
[{:id :ex/broken, :ex/str :ex/ this won't work}]

While I didn't spend the time creating additional examples, I can foresee other issues, like IRIs with multiple colons that might have some intersection with context mappings. Context handling is complex; simple data transforms are not going to be able to handle it. We should let the json-ld library, which has the logic, handle expansion and contraction.

Author:

Agreed. This is one of the work-in-progress aspects of this. The solution I think is similar to what I did for query results here:

(defn keyword-compact-iri-transformer
  [context]
  ;; TODO: Replace this with a :map-with schema once this lands upstream:
  ;; https://github.com/metosin/malli/issues/881
  (let [kw-context (util/keywordize-keys context)]
    (mt/transformer
     {:encoders
      {:string (fn [s]
                 (if-let [expanded (json-ld/expand-iri s context)]
                   (json-ld/compact expanded kw-context)
                   s))}})))

...that uses the context to expand the string compact IRI to a full IRI (if valid) and then re-compact it to a keyword IRI if it succeeded.

However, this seems to be broken in our json-ld/expand-iri fn too:

(def context (json-ld/parse-context {"ex" "http://example.org/ns/"}))
=> #'user/context
context
=> {:type-key "@type", "ex" {:id "http://example.org/ns/"}}
(json-ld/expand-iri "ex: this won't work" context)
=> "http://example.org/ns/ this won't work"

Unless it's supposed to catch the non-IRI-ness of "ex: this won't work" somewhere else?
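For what it's worth, one hypothetical guard (not in this PR; `compact-iri?` is an illustrative name) would be to reject candidates that cannot be compact IRIs before ever calling expand-iri, following the JSON-LD grammar: no whitespace, the prefix must be a term defined in the context, and the suffix must not begin with "//":

```clojure
(ns example.compact-iri
  (:require [clojure.string :as str]))

(defn compact-iri?
  "Loose check that `s` could be a JSON-LD compact IRI under `context`
  (here assumed to be a plain map keyed by prefix strings). Rejects
  strings containing whitespace, strings whose prefix is not defined in
  the context, and suffixes that start with // (absolute-IRI lookalikes)."
  [s context]
  (and (not (re-find #"\s" s))
       (when-let [[_ prefix suffix] (re-matches #"([^:]+):(.*)" s)]
         (and (contains? context prefix)
              (not (str/starts-with? suffix "//"))))))

;; (compact-iri? "ex:broken" {"ex" "http://example.org/ns/"})
;; => true
;; (compact-iri? "ex: this won't work" {"ex" "http://example.org/ns/"})
;; => false
```

This only filters the obvious non-IRIs; full correctness would still require the value-expansion algorithm from the JSON-LD spec.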

Author:

I don't think this is recursively running regexes against the query results. I also don't think it's necessarily all that expensive, based on what it's doing. But I can try to run some performance comparisons between this and what we have on main if you want.

Contributor:

This is a string value, so compaction would never be attempted.

Contributor:

...that uses the context to expand the string compact IRI to a full IRI (if valid) and then re-compact it to a keyword IRI if it succeeded.

These are extraordinarily expensive operations. It is done once currently and works without issue. What problem does doing this multiple times solve?

Contributor:

I don't think this is recursively running regexes against the query results.

I'm just looking at the code. You are using Malli to recursively traverse the entire query results data structure, then expanding the already-compacted results (regexes, among other expensive logic), then re-compacting (even more expensive). Regardless, the effect is that results which previously required none of that extra work and were correct now do extra work and come back incorrect.

Author:

What problem does doing this multiple times solve?

The intent is to have one internal format regardless of the external one used in requests and responses, and to optimize for JSON. We had kept running into issues related to strings-vs-keywords / JSON-vs-EDN so this was an attempt to eliminate that category of bugs. It does perform some more transformations on EDN as a consequence, though.

My goal here was to find a safe & reliable (first) and reasonably performant (later) way to convert string compact IRIs into keyword compact IRIs. But it seems that I assumed too much about the validity checking that json-ld/expand-iri does.

So yeah, we would have to change how this works in that regard since it isn't yet safe & reliable. It doesn't appear that the full expansion logic would be too difficult to implement: https://www.w3.org/TR/json-ld-api/#value-expansion

On the performance question, one option I considered is to have an option to return keywords or not when using the CLJ SDK, and if you enable it, you're accepting the perf hit to do that transformation. If you want to maximize performance, you leave it turned off and get strings back.

But I know we're out of time to work on this. This was meant as a proof of concept for how we might do this in a way that optimized for JSON but still allowed ergonomic / idiomatic Clojure usage. Since it appears there is more work to do w/ no more time to do it, I suppose we should close this unmerged and move on for now.

Contributor:

I also don't think it's necessarily all that expensive, based on what it's doing.

Just quickly tested this and here is the benchmark for using the schema.org context and returning query results of movies.

This is just the added time to do the extra steps that are referenced as not being expensive:

  • 100 movies: adds 8 seconds
  • 1000 movies: adds 1 minute, 20 seconds

Needless to say, we should be testing a million results but you can extrapolate that would be about 2 hours for results to come back.

We should be returning 100 results in the hundreds of milliseconds, so this is extraordinarily expensive as I keep repeating.

5 participants