Making annotation collection practical #236
Replies: 8 comments 21 replies
-
I don't collect annotations unless required to (the user requested it, or there are unevaluated* keywords in a subschema, which I determine during the initial traversal of the schema (when looking for $id, $anchor, $schema etc). I only save this information for the entire schema, not for any individual subschema, so it's an all-or-nothing thing, but for schema users who are most concerned with validation results, annotations are not wanted, so it works out well enough. I like the idea of an allow-list in general, but indeed caution is needed with keywords that generate annotations mostly for internal validation use (e.g. properties for unevaluatedProperties), vs. the ones that are intended for humans (or code generation or documentation or other tooling) like title, description etc. |
Beta Was this translation helpful? Give feedback.
-
Expanding on this (because it's not really discussed above), I think that "global" or "per-evaluation" is an implementation detail. I think we define a given allow list to be per-eval, and if the implementation wants to also have a global setting, that's their choice. I don't even think we need to mention a global list. |
Beta Was this translation helpful? Give feedback.
-
I want to add a more concrete use case. Consider the FHIR JSON Schema, which is a 69,000 (yes, 69K)-line file using a ~150-branch So let's pretend it's an This is what I think we need to address. The costs incurred throughout evaluation by the presence of a single But I'm particularly concerned about the case where the needed annotations take up very little space, but the complete set of annotations might actually cause the program to run out of memory. Not to mention the cost of copying out even the basic output and searching it. |
Beta Was this translation helpful? Give feedback.
-
I'm going to try to re-focus this discussion by summarizing the idea at a higher, less implementation-oriented way. I included some implementation ideas because in the past I've been told that I'm too abstract, but perhaps this is a time for a more abstract discussion. Note that all of this is about annotation collection, not output. Presumably the output would just end up with less output units, but I don't want to constrain how the output formats might want to handle this. Current situation:
I think annotation collection is generally seen as an all-or-nothing thing, even if there's a configuration option that allows the application to choose "all" vs "nothing" for each evaluation. In a large (>10K lines, which has been seen in the wild) schema with lots of keywords that produce and/or depend on annotations, the cost of collecting all of them when only a few will be used is high. Most problematically, in many implementations this cost is incurred whether the annotations are used or not. Proposal
It does not matter to me why the application or the keywords need the annotations (e.g. applications wanting |
Beta Was this translation helpful? Give feedback.
-
In trying to implement this, I found that the biggest challenges were:
APIThe implementation of this feature must lean either to a keep list or an ignore list. I chose to have an ignore list, meaning that annotation collection occurs by default. To support both persectives, I ended up creating the following in my config options:
To support only a single annotation (e.g. options.IgnoreAllAnnotations();
options.CollectAnnotationsFrom<TitleKeyword>(); I think this is reasonable, but it could be just as reasonable to default to no annotations being collected. Annotation purposeAbove I used the word "reporting" for handling annotations. This is an important distinction from "collecting" them. If the user configures to ignore the NOTE On the topic of memory consumption, driving by annotations actually saves memory overall because I don't have to track which properties have been covered while separately managing annotations. The annotation itself tracks which properties have been covered. For the below I've ignored the // schema
{
"$id": "https://test.com/schema",
"title": "a title",
"type": "object",
"properties": {
"foo": true
},
"additionalProperties": false
}
// instance
{
"foo": 1
}
// output (Hierarchical)
{
"valid": true,
"evaluationPath": "",
"schemaLocation": "https://test.com/schema#",
"instanceLocation": "",
"annotations": {
"title": "a title",
"additionalProperties": []
},
"nested": [
{
"valid": true,
"evaluationPath": "/properties/foo",
"schemaLocation": "https://test.com/schema#/properties/foo",
"instanceLocation": "/foo"
}
]
} |
Beta Was this translation helpful? Give feedback.
-
I've been looking at the problem of annotation dependencies - how to allow some keyword's annotations to be switched off but only when those annotations are not needed by some other keyword in the schema. I'd like to propose a thought experiment. Suppose that a developer extends the Now the spec already describes a dependency relationship between "then": {
"$dynamicRef": "#meta",
"$dependencies": ["if"]
}, Doing this across the board for all keywords gives us a dependency graph that implementations can use to evaluate keywords in the correct order. (Currently, implementations must code this dependency graph by hand, in one way or another.) Given the explicit dependency graph, as an extension developer we could justifiably believe that whenever we evaluated But the current spec only requires that "unevaluatedItems": {
"$dynamicRef": "#meta",
"$dependencies": ["prefixItems", "items", "contains", "if", "then", "else", "allOf", "anyOf", "oneOf", "not"],
"$annotationDependencies": ["prefixItems", "items", "unevaluatedItems", "contains"]
}, The existence of So the way I see it, the apparent view of the spec - which is to say that not all keywords are annotation keywords - leads to a more complicated vocabulary meta-schema and less flexibility in terms of extension development. To the contrary, the idea that all keywords should be treated as annotation producing keywords leads to a more concise vocabulary meta-schema (only |
Beta Was this translation helpful? Give feedback.
-
Note that @karenetheridge has done some great work showing the performance benefits of tracking whether property/item coverage annotations are actually needed, and only collecting and maintaining the information if they are. |
Beta Was this translation helpful? Give feedback.
-
I've put some ideas in json-schema-org/json-schema-spec#1385. Among some other things, this adds requirements for annotation collection configuration. By default, all annotations are collected, but implementations are encouraged to include configuration to disable certain annotations as users desire. |
Beta Was this translation helpful? Give feedback.
-
[EDIT: You probably want to read the short version and only come back to this if you want background and/or sample implementation ideas. The implementation specifics in this comment are to help understand the idea, and do not necessarily need to be part of the proposal as accepted, if it is accepted.]
One of the biggest barriers to widespread use of annotations is that collecting and storing them potentially takes up a lot of space, which also impacts evaluation time. This cost, which is exacerbated by the use of annotations to implement
unevaluatedProperties
,unevaluatedItems
, and other keyword interactions, is present whether the annotations are used or not.In general, costly features should only incur most of their cost when they are used.
uniqueItems
is a good example of this. It is potentially extremely costly if the array is large, and the values are complex data structures. However, none of that cost is incurred unless you useuniqueItems
, and the cost is much less if you use it on, for example, an array of integers. Particularly if that array is not large.Changing static keyword dependencies (which existed before 2019-09) to some other mechanism as proposed in #204 would only slightly reduce this cost, as most of the annotations used for those interactions are also needed for dynamic dependencies (
unevaluated*
). Using lots of array item and object property applicators with large instances is now costly for anyone implementing these keywords along the lines suggested in the spec, even if theunevaluated*
keywords are never used. Some implementations may optimize this, but the spec does not really facilitate that.Annotation allow-lists
This could be solved by allowing (or requiring) the set of annotations of interest to be configured prior to evaluation, or passed as a parameter to the evaluation process. It should be easy to allow all or forbid all (regardless of which is the default - I would lean towards a default of forbid all).
There may be a use case for configuring a set of annotations to forbid instead of one to allow, although I can't think of one offhand. I know which annotations I'll need, making an allow-list easy, but I might not know all of the annotations present in the schema's vocabularies, making a forbid-list challenging to get right.
In addition to the global or per-evaluation lists, implementations would need to allow keywords to modify that list for the current dynamic scope and its subscopes. This would ensure that keywords like
unevaluated*
only incur costs when they are in use. To keep costs minimal, this feature should support allow-lists that are limited to a particular instance location or locations.Implementing "unevaluatedProperties" using annotation allow-lists
unevaluatedProperties
currently requires that it runs afterproperties
,patternProperties
,additionalProperties
, and all in-place applicators. This approach would require running additional code prior to those dependencies in order to ensure that the annotations are collected. The dependencies are the same, but now code must run both before and after, instead of just before.The before-code would put
properties
,patternProperties
,additionalProperties
, andunevaluatedProperties
on the annotation allow-list, restricted to the current instance location and dynamic scope, if they are not already allowed.In addition to what it does now, the after-code would check the dynamic scope of those keywords in the allow-list, and for any keyword where it matches the current dynamic scope, remove that annotation from the allow-list and delete the actual annotations after they have been used.
This probably requires some slight tweaking for annotations that can be used by multiple keywords: for example,
contains
in the next draft will be used by bothunevaluatedProperties
andunevaluatedItems
, so the after-code that checks and removes annotations probably needs to run at the end of the schema object's evaluation, rather than after individual keywords.This approach adds a bit of complexity to schema object processing, and a small performance cost to
unevaluated*
and similar keywords, but (aside from any cost inherent in the complexity of managing the allow-list), eliminates the cost of such keywords when they are not being used. It also eliminates the voluminous but often useless presence of internal communication annotations from the output. Of course, the application can include those annotations in the global or per-evaluation allow-list if they are deemed useful for debugging or other purposes. Annotations on those allow-lists would never be removed or forbidden during an evaluation.Impact on static keyword dependencies
Static keyword dependencies could be managed this way as well, which would require distinguishing between annotations allowed for a single dynamic scope vs allowed for all dynamic sub-scopes. These annotations would then be enabled, collected, used, and dropped within a single schema object evaluation.
However, as discussion #204 notes, such keywords can be implemented in many different ways, some of which are substantially faster and/or less complex. Doing so becomes even more appealing if the annotation in question are otherwise not needed.
Considerations for "if", "then", and "else"
In #204, we've been classifying these keywords as having a dynamic interaction as the outcome cannot be predicted until runtime. Which would suggest that we'd need to continue using allow-lists and annotations, and perhaps support flagging that this one is only for the current dynamic scope (although it seems unlikely to be a problem if it stays on for sub-scopes, as holding onto
if
boolean annotations longer than necessary is low-cost).However, while the outcome cannot be predicted, the interaction is straightforward and currently implemented as a special case anyway. While I would like for the
if
/then
/else
interaction to fit neatly into a general model (so that it sets a clear and good precedent for 3rd-party keyword design), if we take this allow-list approach it will probably be a good idea to re-think exactly what that general model should be.Beta Was this translation helpful? Give feedback.
All reactions