Making annotation collection practical #236

handrews · 2022-09-11T20:32:42Z

[EDIT: You probably want to read the short version and only come back to this if you want background and/or sample implementation ideas. The implementation specifics in this comment are to help understand the idea, and do not necessarily need to be part of the proposal as accepted, if it is accepted.]

One of the biggest barriers to widespread use of annotations is that collecting and storing them potentially takes up a lot of space, which also impacts evaluation time. This cost, which is exacerbated by the use of annotations to implement unevaluatedProperties, unevaluatedItems, and other keyword interactions, is present whether the annotations are used or not.

In general, costly features should only incur most of their cost when they are used. uniqueItems is a good example of this. It is potentially extremely costly if the array is large and the values are complex data structures. However, none of that cost is incurred unless you use uniqueItems, and the cost is much less if you use it on, for example, an array of integers, particularly if that array is not large.

Changing static keyword dependencies (which existed before 2019-09) to some other mechanism as proposed in #204 would only slightly reduce this cost, as most of the annotations used for those interactions are also needed for dynamic dependencies (unevaluated*). Using lots of array item and object property applicators with large instances is now costly for anyone implementing these keywords along the lines suggested in the spec, even if the unevaluated* keywords are never used. Some implementations may optimize this, but the spec does not really facilitate that.

Annotation allow-lists

This could be solved by allowing (or requiring) the set of annotations of interest to be configured prior to evaluation, or passed as a parameter to the evaluation process. It should be easy to allow all or forbid all (regardless of which is the default - I would lean towards a default of forbid all).

There may be a use case for configuring a set of annotations to forbid instead of one to allow, although I can't think of one offhand. I know which annotations I'll need, making an allow-list easy, but I might not know all of the annotations present in the schema's vocabularies, making a forbid-list challenging to get right.

In addition to the global or per-evaluation lists, implementations would need to allow keywords to modify that list for the current dynamic scope and its subscopes. This would ensure that keywords like unevaluated* only incur costs when they are in use. To keep costs minimal, this feature should support allow-lists that are limited to a particular instance location or locations.
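To make the idea concrete, here is a minimal sketch of such an allow-list in Python. Everything in it is an assumption made for illustration (the class name `AllowList`, the `(keyword, instance_location, value)` tuple shape, the "forbid all" default); it is not taken from the spec or from any implementation.

```python
class AllowList:
    """Hypothetical allow-list: application requests plus keyword-scoped entries."""

    def __init__(self, requested=(), allow_all=False):
        self.allow_all = allow_all        # convenience switch: collect everything
        self.requested = set(requested)   # keywords the application asked for
        self.scoped = set()               # (keyword, instance_location) pairs enabled by keywords

    def allows(self, keyword, instance_location):
        if self.allow_all or keyword in self.requested:
            return True
        return (keyword, instance_location) in self.scoped


def collect(allow_list, candidates):
    """Keep only the candidate annotations that the allow-list permits."""
    return [a for a in candidates if allow_list.allows(a[0], a[1])]


# Candidate annotations as (keyword, instance_location, value) tuples.
candidates = [
    ("title", "", "a title"),
    ("properties", "", ["a"]),
    ("foo", "/a", 1),
]

only_foo = collect(AllowList(requested={"foo"}), candidates)
nothing = collect(AllowList(), candidates)            # default here: forbid all
everything = collect(AllowList(allow_all=True), candidates)
```

In this sketch the check happens before an annotation is stored, which is where the memory saving would come from; filtering after the fact would only shrink the output.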

Implementing "unevaluatedProperties" using annotation allow-lists

unevaluatedProperties currently requires that it runs after properties, patternProperties, additionalProperties, and all in-place applicators. This approach would require running additional code prior to those dependencies in order to ensure that the annotations are collected. The dependencies are the same, but now code must run both before and after them, instead of just after.

The before-code would put properties, patternProperties, additionalProperties, and unevaluatedProperties on the annotation allow-list, restricted to the current instance location and dynamic scope, if they are not already allowed.

In addition to what it does now, the after-code would check the dynamic scope of those keywords in the allow-list, and for any keyword where it matches the current dynamic scope, remove that annotation from the allow-list and delete the actual annotations after they have been used.

This probably requires some slight tweaking for annotations that can be used by multiple keywords: for example, contains in the next draft will be used by both unevaluatedProperties and unevaluatedItems, so the after-code that checks and removes annotations probably needs to run at the end of the schema object's evaluation, rather than after individual keywords.

This approach adds a bit of complexity to schema object processing, and a small performance cost to unevaluated* and similar keywords, but (aside from any cost inherent in the complexity of managing the allow-list), eliminates the cost of such keywords when they are not being used. It also eliminates the voluminous but often useless presence of internal communication annotations from the output. Of course, the application can include those annotations in the global or per-evaluation allow-list if they are deemed useful for debugging or other purposes. Annotations on those allow-lists would never be removed or forbidden during an evaluation.
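The before-code and after-code described in this section could look roughly like the following Python sketch. The `Collector` class, its method names, and the use of a string scope identifier are all hypothetical; the cleanup is done in `after` for brevity, although, as noted above, it probably belongs at the end of the schema object's evaluation.

```python
DEPENDENCIES = ("properties", "patternProperties",
                "additionalProperties", "unevaluatedProperties")


class Collector:
    """Hypothetical annotation store guarded by a scoped allow-list."""

    def __init__(self, requested=()):
        self.requested = set(requested)  # application allow-list, never removed
        self.scoped = {}                 # (keyword, location) -> owning scope id
        self.store = {}                  # (keyword, location) -> annotation value

    def annotate(self, keyword, location, value):
        key = (keyword, location)
        if keyword in self.requested or key in self.scoped:
            self.store[key] = value      # anything else is never stored

    def before(self, scope_id, location):
        # Before-code: enable the dependencies unless they are already allowed.
        for kw in DEPENDENCIES:
            key = (kw, location)
            if kw not in self.requested and key not in self.scoped:
                self.scoped[key] = scope_id

    def after(self, scope_id, location):
        # After-code: use the annotations, then drop what this scope enabled.
        evaluated = set()
        for kw in DEPENDENCIES:
            evaluated.update(self.store.get((kw, location), ()))
        for key in [k for k, owner in self.scoped.items() if owner == scope_id]:
            del self.scoped[key]
            self.store.pop(key, None)
        return evaluated


instance = {"a": 1, "b": 2}
c = Collector()
c.before("#", "")
c.annotate("title", "", "never stored")   # not on any allow-list
c.annotate("properties", "", ["a"])       # enabled by the before-code
evaluated = c.after("#", "")
unevaluated = set(instance) - evaluated
```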

Impact on static keyword dependencies

Static keyword dependencies could be managed this way as well, which would require distinguishing between annotations allowed for a single dynamic scope vs allowed for all dynamic sub-scopes. These annotations would then be enabled, collected, used, and dropped within a single schema object evaluation.

However, as discussion #204 notes, such keywords can be implemented in many different ways, some of which are substantially faster and/or less complex. Doing so becomes even more appealing if the annotations in question are otherwise not needed.

Considerations for "if", "then", and "else"

In #204, we've been classifying these keywords as having a dynamic interaction, as the outcome cannot be predicted until runtime. That would suggest that we'd need to continue using allow-lists and annotations, and perhaps support flagging that this one is only for the current dynamic scope (although it seems unlikely to be a problem if it stays on for sub-scopes, as holding onto if boolean annotations longer than necessary is low-cost).

However, while the outcome cannot be predicted, the interaction is straightforward and currently implemented as a special case anyway. While I would like for the if/then/else interaction to fit neatly into a general model (so that it sets a clear and good precedent for 3rd-party keyword design), if we take this allow-list approach it will probably be a good idea to re-think exactly what that general model should be.

karenetheridge (Maintainer) · 2022-09-11T20:55:21Z

I don't collect annotations unless required to: either the user requested it, or there are unevaluated* keywords in a subschema, which I determine during the initial traversal of the schema (when looking for $id, $anchor, $schema, etc.).

I only save this information for the entire schema, not for any individual subschema, so it's an all-or-nothing thing, but for schema users who are most concerned with validation results, annotations are not wanted, so it works out well enough.
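A pre-scan of this kind might be sketched in Python as follows. This is an illustration only, not the code of the implementation described above: the function name is made up, and the walk is deliberately naive. It treats every nested object as a possible subschema, so a property that happens to be named "unevaluatedProperties" would produce a false positive, which costs performance but not correctness.

```python
UNEVALUATED = {"unevaluatedProperties", "unevaluatedItems"}


def needs_annotations(schema):
    """Return True if an unevaluated* keyword appears anywhere in the schema."""
    if isinstance(schema, dict):
        if UNEVALUATED & schema.keys():
            return True
        return any(needs_annotations(v) for v in schema.values())
    if isinstance(schema, list):
        return any(needs_annotations(v) for v in schema)
    return False  # booleans, strings, numbers, null


plain = {"type": "object", "properties": {"a": {"type": "string"}}}
nested = {"allOf": [{"properties": {"a": True}}, {"unevaluatedProperties": False}]}

collect_for_plain = needs_annotations(plain)
collect_for_nested = needs_annotations(nested)
```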

I like the idea of an allow-list in general, but indeed caution is needed with keywords that generate annotations mostly for internal validation use (e.g. properties for unevaluatedProperties), vs. the ones that are intended for humans (or code generation or documentation or other tooling) like title, description etc.

2 replies

handrews Sep 11, 2022
Author

> I only save this information for the entire schema, not for any individual subschema, so it's an all-or-nothing thing

Right, I forgot to acknowledge that the spec already allows for all-or-nothing. Good point on detecting unevaluated* on schema load - if we don't go with allow-lists, it might be good to suggest that somewhere. The question would be making it general. What if a vocabulary included, idk, unevaluatedPatternProperties which worked more-or-less like unevaluatedProperties but only for properties with names matching regexes? Perhaps you already do this, but your vocabulary plugin system would need a way to inform the schema load process that it has similar requirements. But yeah, thinking about how much of this could be done at schema load time, even if the granularity is more coarse, is a great idea as that might simplify some things.

gregsdennis Sep 11, 2022
Maintainer

Maybe it makes sense to have a further "property" of annotations that specifies whether it is needed for internal purposes. That way, "internal" annotations can continue to be collected (but not reported), and the allow list only affects annotations that are reported.

Implementations can optimize as they see fit to ensure annotations that aren't on the allow list or aren't internally required aren't generated. This includes pre-scanning for internals that may be necessary (e.g. to support unevaluated*).

gregsdennis (Maintainer) · 2022-09-11T21:45:51Z

> In addition to the global or per-evaluation lists

Expanding on this (because it's not really discussed above), I think that "global" or "per-evaluation" is an implementation detail. I think we define a given allow list to be per-eval, and if the implementation wants to also have a global setting, that's their choice. I don't even think we need to mention a global list.

1 reply

handrews Sep 11, 2022
Author

Good point. I did not mean to imply that the global list would be a requirement, just a possibility since many implementations support both an overall config and validation/evaluation parameters. But really it only matters in the context of one evaluation at a time.

handrews (Author) · 2022-09-12T02:02:53Z

I want to add a more concrete use case. Consider the FHIR JSON Schema, which is a 69,000 (yes, 69K)-line file using a ~150-branch oneOf. Since it's a oneOf, annotations will be discarded from all but one branch. It's draft-07 so there's no unevaluated*. I have no idea what annotations might be useful, but really I'm just using it to point out that there are really huge schemas out there.

So let's pretend it's an anyOf instead, draft 2020-12 (but still no unevaluated*), and uses a single custom annotation vocabulary fairly sparingly throughout. The memory consumption necessary to get that single, infrequently-used annotation out through the current mechanism would be ridiculous.

This is what I think we need to address. The costs incurred throughout evaluation by the presence of a single unevaluated* deep in one subschema are also relevant.

But I'm particularly concerned about the case where the needed annotations take up very little space, but the complete set of annotations might actually cause the program to run out of memory. Not to mention the cost of copying out even the basic output and searching it.

4 replies

handrews Sep 22, 2022
Author

@karenetheridge @gregsdennis any thoughts on this example which focuses on annotations intended for external consumption?

Does this seem like a compelling reason to make the sort of change I'm proposing? I feel like the discussion got a bit hung up on internal use, which is the less important side of this to me. I really want to make sure that if I have a schema tens of thousands of lines long, with title and description used everywhere, but just want to get the ten-ish occurrences of annotation foo, I can do that with minimal additional memory consumption. Which means asking for foo and not title or description.

gregsdennis Sep 22, 2022
Maintainer

I'm not sure we need to make a distinction between annotations for external or internal consumption. I think they're both valuable for an external consumer in different scenarios. Annotations are annotations.

I don't have a problem with an implementation providing functionality to omit output (or collection) of annotations it doesn't need for internal processing, but aligned with my opinion in Jason's discussion, I don't know that the specification needs to make this a requirement. It seems like a feature that the more useful implementations would have.

gregsdennis Sep 22, 2022
Maintainer

I've been trying to think of this from the point of view of not having "internal" annotations at all. If we do that, then we would need to define additionalProperties, unevaluatedProperties, etc., to operate without them. My implementation depends on them, but it also reports them in the output.

handrews Sep 23, 2022
Author

> I've been trying to think of this from the point of view of not having "internal" annotations at all.

Whether to have internal annotations or whether to replace them with some other thing is really not the point of this at all. Please take that to #204.

I would really like for there to be discussion of the collection allow list idea here. That is the point of this discussion. The only way in which internal usage is relevant to this discussion is that the allow list mechanism needs to support keywords enabling the annotations that they need, if any, even when the application did not request them. But that's a minor detail.

> I don't know that the specification needs to make this a requirement. It seems like a feature that the more useful implementations would have.

Because we need to make this feature efficient and plausible to implement in order to increase its adoption and use. I don't think we can even consider mandating annotation collection without an approach like this. Too many implementations are declining to add support for it.

Also, while I got into some implementation details, the important thing is the capabilities here. I don't care if the mechanism is exactly what I laid out. But we do need clear guidance on implementing this efficiently and aligning it with actual usage. Which, in my experience, almost never involves using every single annotation that can be collected.

handrews (Author) · 2022-09-23T01:57:50Z

I'm going to try to re-focus this discussion by summarizing the idea at a higher, less implementation-oriented level. I included some implementation ideas because in the past I've been told that I'm too abstract, but perhaps this is a time for a more abstract discussion. Note that all of this is about annotation collection, not output. Presumably the output would just end up with fewer output units, but I don't want to constrain how the output formats might want to handle this.

Current situation:

- Implementations MAY opt out of supporting annotation collection
- Performance costs are mentioned, but there is no guidance beyond "MAY opt out" for mitigation
- No guidance is provided on configuring annotation collection if it is supported
- I'm not aware of implementations that allow more complex behavior than turning it on or off as an entire feature

I think annotation collection is generally seen as an all-or-nothing thing, even if there's a configuration option that allows the application to choose "all" vs "nothing" for each evaluation.

In a large (>10K lines, which has been seen in the wild) schema with lots of keywords that produce and/or depend on annotations, the cost of collecting all of them when only a few will be used is high. Most problematically, in many implementations this cost is incurred whether the annotations are used or not.

Proposal

Implementations MAY opt out of supporting annotation collection [Note: I'd like to change this, but it's not part of this proposal]
Implementations that support annotation collection:
- MUST support applications identifying which annotation keywords are to be collected
- MUST support keywords indicating their annotation consumer needs
- MUST collect only the annotations requested by the application or required by keywords in use
- SHOULD support collecting annotations only needed by keywords in use (and not requested by the application) from the scopes from which they are needed. [I can see a legitimate performance vs code complexity trade-off here, so this probably isn't a MUST, and could even perhaps be a MAY]
- Keyword annotation consumption requirements, if supported, MUST be set based on the specific implementation of the keyword in question, as keywords can (including in 2019-09 and 2020-12) be implemented either with or without annotations and still conform to the specification
- SHOULD support enabling all annotations as a convenience
- MAY support a deny list for collecting all annotations except the ones listed [I'm not sure this even needs to be here, but it feels like it's worth mentioning the possibility]
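
As a rough illustration of the "requested by the application or required by keywords in use" rule, the set of annotations to collect could be computed before evaluation along these lines. The `CONSUMES` registry and both function names are hypothetical, the registry reflects one possible annotation-driven implementation rather than anything the spec mandates, and the schema walk is a naive over-approximation that also visits property names.

```python
# Hypothetical registry: annotations consumed by each keyword *implementation*.
CONSUMES = {
    "unevaluatedProperties": {"properties", "patternProperties",
                              "additionalProperties", "unevaluatedProperties"},
    "unevaluatedItems": {"prefixItems", "items", "contains", "unevaluatedItems"},
}


def keywords_in_use(schema):
    """Collect every object key found while walking the schema (naive)."""
    found = set()
    if isinstance(schema, dict):
        for key, value in schema.items():
            found.add(key)
            found |= keywords_in_use(value)
    elif isinstance(schema, list):
        for value in schema:
            found |= keywords_in_use(value)
    return found


def annotations_to_collect(schema, requested):
    """Application requests plus whatever the keywords in use consume."""
    needed = set(requested)
    for keyword in keywords_in_use(schema):
        needed |= CONSUMES.get(keyword, set())
    return needed


schema = {
    "title": "t",
    "properties": {"a": {"description": "d"}},
    "unevaluatedProperties": False,
}
needed = annotations_to_collect(schema, {"foo"})
```

Here title and description are never collected, even though they appear throughout the schema, because neither the application nor any keyword in use asked for them.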

It does not matter to me why the application or the keywords need the annotations (e.g. applications wanting properties annotations for debugging purposes vs wanting title annotations for use in the application UI). If the use of annotations for keyword dependencies is changed or removed, the requirements here will change or be dropped accordingly, so that's not a concern for evaluating the proposal.

12 replies

marksparkza Oct 1, 2022

Hi, @handrews, thanks for the invitation to comment.

I'm not 100% sure whether I'm understanding all the proposed clauses correctly, but rather than nitpick at the wording I've decided that the best way to understand what you're proposing - and to be able to offer meaningful feedback - is to have a crack at implementing the proposal and see what comes crawling out of the woodwork!

I'll keep you posted...

handrews Oct 1, 2022
Author

@marksparkza thanks for putting in the effort! I'll be very interested to see what you come up with. This was more of a conceptual proposal to get things started rather than an attempt to write an implementable spec, so if it doesn't seem clear enough, that's why. But if there's enough here for you to work with it's probably more interesting to see how you fill in the gaps than it would be for me to dictate the rest of it at this stage.

handrews Oct 1, 2022
Author

@jdanyow thanks! My apologies for the delay in responding. Yes, this seems very relevant to your situation. Was the memory constraint a major factor in deciding not to support annotation collection, or was that more of a time limitation or skepticism of the feature's usefulness? Would this approach make you more likely to support annotation collection in the cfworker implementation?

marksparkza Oct 3, 2022

@handrews, to add a bit of context:

An issue was raised against my implementation which to my mind resembles your proposal. In that instance I opted for output filtering but the end result was somewhat unsatisfactory - a rather obscure capability to filter annotations only for the basic output format. I avoided either controlling the collection of annotations, or filtering detailed / verbose output so as not to get into the hairy business of what to do about annotations used internally by applicator keywords.

handrews Oct 3, 2022
Author

@marksparkza it's always great to see that similar requirements and ideas have popped up independently, that's usually a sign that there's something worth pursuing.

gregsdennis (Maintainer) · 2022-11-09T21:34:17Z

In trying to implement this, I found that the biggest challenges were:

- API: What functions do I present to the user to provide the best experience for:
  - specifying only one or a few annotations to keep
  - specifying only one or a few annotations to ignore
- Annotation purpose: How do I differentiate between annotations which are required for other keywords to process correctly (my implementation is annotation-driven) and annotations which are solely for application consumption?
  - properties produces annotations that are required for processing additionalProperties and unevaluatedProperties
  - title produces annotations upon which no other keywords depend

API

The implementation of this feature must lean either to a keep list or an ignore list. I chose to have an ignore list, meaning that annotation collection occurs by default. To support both perspectives, I ended up creating the following in my config options:

- IgnoredAnnotations - a read-only list of annotations currently configured to not be produced
- IgnoreAllAnnotations() - configures so that no annotations are reported
- IgnoreAnnotationsFrom<T>() - configures so that annotations from a specific keyword (T) are not reported
- ClearIgnoredAnnotations() - configures to report all annotations (I considered "CollectAllAnnotations" but I wanted to be clear that that was the default behavior)
- CollectAnnotationsFrom<T>() - configures to report annotations from a specific keyword (T)

To support only a single annotation (e.g. title), one would need to ignore all annotations then collect only from title.

options.IgnoreAllAnnotations();
options.CollectAnnotationsFrom<TitleKeyword>();

I think this is reasonable, but it could be just as reasonable to default to no annotations being collected.
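
For readers who want to experiment with these semantics outside .NET, the same ignore-list behaviour can be approximated in Python. This is a hypothetical analogue written for this discussion, not the API of any real library; the method names mirror the options above and the exact interaction between the calls is an assumption.

```python
class AnnotationOptions:
    """Hypothetical ignore-list configuration; everything is reported by default."""

    def __init__(self):
        self._ignore_all = False
        self._ignored = set()     # keywords ignored while reporting by default
        self._collected = set()   # exceptions kept after ignore_all_annotations()

    def ignore_all_annotations(self):
        self._ignore_all = True
        self._ignored.clear()
        self._collected.clear()

    def ignore_annotations_from(self, keyword):
        self._ignored.add(keyword)
        self._collected.discard(keyword)

    def clear_ignored_annotations(self):
        self._ignore_all = False
        self._ignored.clear()
        self._collected.clear()

    def collect_annotations_from(self, keyword):
        self._ignored.discard(keyword)
        self._collected.add(keyword)

    def reports(self, keyword):
        if self._ignore_all:
            return keyword in self._collected
        return keyword not in self._ignored


# Keep only title: ignore everything, then opt back in for one keyword.
options = AnnotationOptions()
options.ignore_all_annotations()
options.collect_annotations_from("title")
```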

Annotation purpose

Above I used the word "reporting" for handling annotations. This is an important distinction from "collecting" them.

If the user configures to ignore the properties annotation, I still need to collect it, but I don't want to include it in the output. Granted, this doesn't address the memory consumption focus of this discussion, but because my implementation is annotation-driven, it's necessary.

NOTE On the topic of memory consumption, driving by annotations actually saves memory overall because I don't have to track which properties have been covered while separately managing annotations. The annotation itself tracks which properties have been covered.

In the example below I've ignored the properties annotation but not additionalProperties. Note that additionalProperties still processes correctly because an annotation was collected from properties, even though it is not in the output.

// schema
{
  "$id": "https://test.com/schema",
  "title": "a title",
  "type": "object",
  "properties": {
    "foo": true
  },
  "additionalProperties": false
}

// instance
{
  "foo": 1
}

// output (Hierarchical)
{
  "valid": true,
  "evaluationPath": "",
  "schemaLocation": "https://test.com/schema#",
  "instanceLocation": "",
  "annotations": {
    "title": "a title",
    "additionalProperties": []
  },
  "nested": [
    {
      "valid": true,
      "evaluationPath": "/properties/foo",
      "schemaLocation": "https://test.com/schema#/properties/foo",
      "instanceLocation": "/foo"
    }
  ]
}
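
The collect-versus-report distinction in this example can be sketched in a few lines of Python. This is a toy evaluator written for illustration, not the implementation described above: it handles only title, properties, and a boolean additionalProperties, treats every property subschema as true, and returns a flat result rather than the hierarchical output format.

```python
def evaluate_object(schema, instance, ignored):
    """Collect every annotation internally, but report only the non-ignored ones."""
    collected = {}
    if "title" in schema:
        collected["title"] = schema["title"]
    if "properties" in schema:
        # Assumption: each property subschema is `true`, so it always passes.
        collected["properties"] = [k for k in instance if k in schema["properties"]]

    valid = True
    if "additionalProperties" in schema:
        # Annotation-driven: read coverage from the collected annotation.
        covered = set(collected.get("properties", []))
        extra = [k for k in instance if k not in covered]
        if schema["additionalProperties"] is False and extra:
            valid = False
        else:
            collected["additionalProperties"] = extra

    if not valid:
        return {"valid": False}  # annotations are dropped on failure
    reported = {k: v for k, v in collected.items() if k not in ignored}
    return {"valid": True, "annotations": reported}


schema = {
    "title": "a title",
    "type": "object",
    "properties": {"foo": True},
    "additionalProperties": False,
}
result = evaluate_object(schema, {"foo": 1}, ignored={"properties"})
```

The reported annotations match the output above: title and additionalProperties are present, while properties was used internally and then left out.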


marksparkza · 2022-11-29T06:18:13Z

I've been looking at the problem of annotation dependencies - how to allow some keyword's annotations to be switched off but only when those annotations are not needed by some other keyword in the schema.

I'd like to propose a thought experiment.

Suppose that a developer extends the "if" keyword such that it produces an annotation consisting of the execution time of its subschema. Suppose further that another developer finds this extension useful and decides to extend "then" and "else" in the following way: if the execution time of "if" exceeds some limit, then "then" (if present) sets the validity of "if" to false and does not evaluate the instance in question, while "else" (if present) also sets the validity of "if" to false and proceeds to evaluate the instance.

Now the spec already describes a dependency relationship between if and then and between if and else. It's possible to represent the if-then dependency in a vocabulary meta-schema:

        "then": {
            "$dynamicRef": "#meta",
            "$dependencies": ["if"]
        },

Doing this across the board for all keywords gives us a dependency graph that implementations can use to evaluate keywords in the correct order. (Currently, implementations must code this dependency graph by hand, in one way or another.)
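
As a sketch of how an implementation might consume such a graph, the following Python uses the standard library's `graphlib.TopologicalSorter` to derive an evaluation order. The `DEPENDENCIES` table is a hand-written stand-in for what would be read from the hypothetical `$dependencies` meta-keyword, and its entries are illustrative rather than complete.

```python
from graphlib import TopologicalSorter

# Stand-in for data read from "$dependencies" in a vocabulary meta-schema.
DEPENDENCIES = {
    "then": ["if"],
    "else": ["if"],
    "additionalProperties": ["properties", "patternProperties"],
}


def evaluation_order(present):
    """Order the keywords present in a schema object so dependencies run first."""
    graph = {
        keyword: [dep for dep in DEPENDENCIES.get(keyword, []) if dep in present]
        for keyword in present
    }
    # TopologicalSorter expects node -> predecessors; predecessors come out first.
    return list(TopologicalSorter(graph).static_order())


order = evaluation_order({"type", "then", "if", "else"})
```

Dependencies on keywords that are absent from the schema object are simply dropped, so a lone then still gets a position in the order.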

Given the explicit dependency graph, as an extension developer we could justifiably believe that whenever we evaluated "then", we'd have access to the annotation of "if". For an annotation filtering implementation, this implies an annotation consumption dependency: "if" annotations cannot be switched off when an adjacent "then" is present.

But the current spec only requires that then be evaluated after if, not that it be able to read any hypothetical annotation produced by if. To distinguish between these two classes of dependency, in vocabulary meta-schema language, we'd have to contrive distinct meta-keywords. For example, unevaluatedItems might look like this:

        "unevaluatedItems": {
            "$dynamicRef": "#meta",
            "$dependencies": ["prefixItems", "items", "contains", "if", "then", "else", "allOf", "anyOf", "oneOf", "not"],
            "$annotationDependencies": ["prefixItems", "items", "unevaluatedItems", "contains"]
        },

The existence of $annotationDependencies and the absence thereof from the "then" keyword definition schema tells us that the if-then evaluation order dependency cannot be mistaken for an annotation consumption dependency. This would allow "if" annotations to be switched off unconditionally, regardless of any hypothetical annotation consumption by an adjacent "then", thus breaking our crafty if-then extension logic.

So the way I see it, the apparent view of the spec - which is to say that not all keywords are annotation keywords - leads to a more complicated vocabulary meta-schema and less flexibility in terms of extension development.

To the contrary, the idea that all keywords should be treated as annotation producing keywords leads to a more concise vocabulary meta-schema (only $dependencies is needed), solves the problem of determining annotation consumption dependencies with the more general keyword dependency graph, and allows more flexibility in extension development.
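
The filtering decision under the two-meta-keyword variant can be sketched as follows. The table is a stand-in for the hypothetical `$annotationDependencies` data shown earlier, and the check looks only at the keywords of a single schema object; a real implementation would presumably have to consider the whole dynamic scope.

```python
# Stand-in for data read from "$annotationDependencies" (hypothetical meta-keyword).
ANNOTATION_DEPENDENCIES = {
    "unevaluatedItems": ["prefixItems", "items", "unevaluatedItems", "contains"],
}


def can_switch_off(annotation_keyword, keywords_present):
    """True if no keyword present declares that it consumes this annotation."""
    return not any(
        annotation_keyword in ANNOTATION_DEPENDENCIES.get(keyword, ())
        for keyword in keywords_present
    )


items_needed = not can_switch_off("items", {"items", "unevaluatedItems"})
items_free = can_switch_off("items", {"items", "type"})
# "then" declares no annotation dependency, so "if" annotations can always be
# dropped, which is exactly what breaks the extension in the thought experiment.
if_free = can_switch_off("if", {"if", "then"})
```

Under the single-keyword variant, the same function would read from the general dependency table instead, and the last call would return False.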

2 replies

handrews Nov 29, 2022
Author

Very interesting! I will have to digest this more before responding to the details of your ideas here. Some meta-schema ideas along these lines (although not anywhere near as thoroughly developed) have popped up from time to time. For now let me fill you in on some past ideas on encoding behaviors in machine-readable form.

I have been leaning more towards specifying keyword behaviors and dependencies in a separate vocabulary file (identified by the URI used in $vocabulary – this is specifically why that URI is not associated with anything yet, it's being reserved for future use). There are a couple of reasons for this:

- keyword behaviors are defined by vocabularies (including redefining or extending standard keywords - a tricky case, but worth keeping in mind as you demonstrated)
- vocabularies are expected to be used across multiple, perhaps even many, meta-schemas
- it should be easy to write a custom meta-schema, basically only requiring the skills necessary to write a schema
- we can reasonably expect deeper knowledge of people writing vocabulary files

So I would not want to rely on someone who is writing a custom meta-schema to correctly indicate the dependencies every time they tweak a keyword's syntax (maybe they want to disallow boolean schemas, or require that "type" is always an array for consistency).

Also, meta-schema keywords are schema keywords, so introducing more has a different impact on the perceived stability of JSON Schema than if we introduce an experimental vocabulary file format and keep messing with it. We'd have more freedom in a vocabulary file.

We've talked about being able to bundle such a vocabulary file into a meta-schema (and meta-schemas into schemas, which you can theoretically do right now although it relies on implementations doing things in a certain order that is not actually specified in the spec). There's a valid concern around file proliferation, so I just wanted to mention that we're aware of that.

All of that said, I suspect what you're saying here would work equally well in a vocabulary file. I have only skimmed it and it's getting late-ish (and I woke up early today), so I'll dig more into the specifics of your proposal tomorrow.

gregsdennis Nov 29, 2022
Maintainer

This is somewhat related to what I tried to do in this PR. You may want to read through that to get some additional context on what Henry said above.

handrews (Author) · 2023-01-04T15:39:39Z

Note that @karenetheridge has done some great work showing the performance benefits of tracking whether property/item coverage annotations are actually needed, and only collecting and maintaining the information if they are.


gregsdennis (Maintainer) · 2023-03-03T19:48:59Z

I've put some ideas in json-schema-org/json-schema-spec#1385. Among some other things, this adds requirements for annotation collection configuration.

By default, all annotations are collected, but implementations are encouraged to include configuration to disable certain annotations as users desire.

