Vocabularies and "format" #563

Closed
handrews opened this issue Mar 8, 2018 · 15 comments

Comments

@handrews
Contributor

handrews commented Mar 8, 2018

We have numerous requests for additional formats, or ideas that could be implemented with formats (see #152, #312, json-schema-org/json-schema-vocabularies#45, json-schema-org/json-schema-vocabularies#49, #542).

There have also been a number of discussions around the implementation requirements, particularly

they SHOULD implement validation for [all] attributes defined below

which becomes more burdensome as we add more formats. One idea has been to say that you only need to implement certain subsets (e.g. you could implement the date and time formats but skip the URI/IRI formats, but you shouldn't just implement uri-reference and not uri). But there's no good way to convey support levels. Which brings us to:

Save for agreement between parties, schema authors SHALL NOT expect a peer implementation to support this keyword and/or custom format attributes.

which makes interoperability very challenging. format is really only reliably useful as a semantic annotation, not as validation.

We need a better story here, if for no other reason than to figure out how to manage the endless stream of requests for standardized formats. Currently, that's the only way to achieve any level of interoperability, so there is a high motivation to push for inclusion.


If we go with vocabulary support along the general lines of #561, I feel like this should help manage the variations. A vocabulary could include specific format values (and likewise for contentType and contentEncoding, I suppose).

I've split this out from #561 because it's not clear how it would work. Using a meta-schema, as #561 proposes, doesn't work well for format values because enum is difficult to combine. The typical allOf used to compose vocabularies produces the intersection of enums rather than the union. But anyOf, which would work, would often produce unexpected behaviors otherwise.
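To make the enum problem concrete, here is a minimal sketch in Python. It assumes a toy evaluator that understands only enum, allOf, and anyOf (it is not a real validator), and the two "vocabulary" fragments and their format names are purely illustrative.

```python
def accepts(schema, value):
    """Toy evaluator: supports only enum, allOf, and anyOf."""
    if "enum" in schema and value not in schema["enum"]:
        return False
    if "allOf" in schema and not all(accepts(s, value) for s in schema["allOf"]):
        return False
    if "anyOf" in schema and not any(accepts(s, value) for s in schema["anyOf"]):
        return False
    return True


# Hypothetical meta-schema fragments, each constraining the value of "format".
standard_formats = {"enum": ["date-time", "email", "uri"]}
extra_formats = {"enum": ["int32", "int64", "email"]}

# Composing the two fragments the usual way (allOf) versus with anyOf.
all_of = {"allOf": [standard_formats, extra_formats]}
any_of = {"anyOf": [standard_formats, extra_formats]}

candidates = ["date-time", "email", "uri", "int32", "int64"]
allowed_by_all_of = [v for v in candidates if accepts(all_of, v)]
allowed_by_any_of = [v for v in candidates if accepts(any_of, v)]
```

With allOf, only the value present in both lists survives, which is the opposite of what a vocabulary that adds formats wants. anyOf gives the union here, but it applies to every other keyword in the branches too, which is where the unexpected behavior would come from.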

Some other mechanism may be required. Therefore this is filed in the future milestone as a question. If we come up with a broadly supported proposal quickly, it can be added to draft-08, but otherwise it's fine as a follow-on in a later draft.

@ralfhandl

Formats usually serve the dual purpose of a programming-language-independent code-generation hint for the recipient and a production/validation instruction for the sender or an intermediate validator, so pulling format out of validation seems to be the right direction.

Allowing vocabularies to define specific format values would certainly reduce the pressure for adding new formats to a central list.

It would also remove the single point of reference for looking up existing formats and their meaning, unless that vocabulary mechanism is accompanied by a central "registry" or "repository" for format definitions.

The "endless stream of requests for standardized formats" just shows how important interoperability is.

@handrews
Contributor Author

handrews commented Mar 9, 2018

The "endless stream of requests for standardized formats" just shows how important interoperability is.

Agreed. I think for a central list, we should look to the IANA registry model. For now, we'll keep a small standard set in one of the major specification drafts.

But if vocabularies are identified by URIs, and we define some clear way for a vocabulary to define format values (and similar extensible value sets), then we can probably use vocabulary URIs with fragments to completely identify a format term.
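As a rough illustration of the fragment idea, assuming a format term is written as a vocabulary URI plus a fragment naming the format (the example.com URI and the function name are made up):

```python
from urllib.parse import urldefrag


def split_format_term(term_uri):
    """Split a format term URI into (vocabulary URI, format name).

    Assumes the fragment is the format name; this convention is only
    an idea floated in this thread, not anything specified.
    """
    vocabulary_uri, format_name = urldefrag(term_uri)
    if not format_name:
        raise ValueError("term URI has no fragment naming a format")
    return vocabulary_uri, format_name


# Hypothetical term URI for illustration.
vocab, name = split_format_term("https://example.com/vocab/formats#date-time")
```

An implementation could then look up the vocabulary part first and only afterwards decide whether it knows the named format.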

@ralfhandl

Sounds like a plan: 👍

@awwright
Member

The link relations IANA registry has a nice model, either:

  • expert review for new entries; or
  • anyone can define a format by a URI within a namespace they own

Either form can be used by anyone.

@reitzig

reitzig commented Nov 12, 2018

If valid values of format are not standardized, why should I ever use it? While writing the schema, I don't know which validator a consumer of the schema may use!

At the very least, the standard should specify that unimplemented formats should cause a validation error. Otherwise, consumers will get false positives, which contradicts the purpose of using schemas.

@handrews
Contributor Author

@reitzig many formats are very expensive to validate, or even impossible to validate in a guaranteed way. email is notoriously difficult (google "regular expression for email addresses"), for example.

The point of this issue is to allow people to say "Please only attempt to process this schema if you understand formats X, Y, and Z." That's not possible now, but will be in the future. That will allow those who are happy with best-effort (some people don't expect validation and just want the format to be shown as documentation anyway) to continue to use things as-is, while those who want strict conformance and fail-fast can guarantee that.
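A minimal sketch of the fail-fast side of that, assuming an implementation simply knows a set of format names it can validate. The walk below is deliberately naive (it would also pick up a "format" key inside enum or default values), and the schema and function names are illustrative only.

```python
def used_formats(schema):
    """Naively collect every string-valued "format" found in a schema."""
    found = set()
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "format" and isinstance(value, str):
                found.add(value)
            else:
                found |= used_formats(value)
    elif isinstance(schema, list):
        for item in schema:
            found |= used_formats(item)
    return found


def unsupported_formats(schema, supported):
    """Formats the schema uses that this implementation cannot validate."""
    return used_formats(schema) - set(supported)


# Illustrative schema; a strict consumer would refuse it if anything is missing.
schema = {
    "type": "object",
    "properties": {
        "created": {"type": "string", "format": "date-time"},
        "contact": {"format": "email"},
        "host": {"anyOf": [{"format": "ipv6"}, {"format": "ipv4"}]},
    },
}
missing = unsupported_formats(schema, {"date-time", "ipv4", "ipv6"})
```

A best-effort consumer would ignore `missing` and carry on; a strict one would stop before validating any instance.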

@reitzig

reitzig commented Nov 13, 2018

I can see that using format to hint to parsers/mappers what to do is useful in itself, even if the validation phase ignores it.

Still, inconsistent behaviour concerns me. I don't want tests to fail if runtime is fine, or the other way around, or some clients accept bad replies while others don't. In my mind, it therefore makes more sense to require that format not be checked during validation (except in a special mode marked as potentially incompatible), than to allow arbitrary choices in validators.

That's a very pessimistic perspective, of course. Systems that use only a single validation library are completely fine with an "intermediate" situation.

@handrews
Contributor Author

With PR #671, we now have the concept of a formal vocabulary in the spec. That's a really complex PR, so here is the TL;DR (which is still pretty long- sorry):

  • $vocabulary is a keyword that is used in meta-schemas
    • Like all meta-schema keywords, it says something about the schemas described by the meta-schema
    • Specifically, it indicates what keywords are likely to be used in those schemas, and (indirectly) what source defines their semantics
  • $vocabulary is an object whose keys are URIs identifying vocabularies, and whose values are booleans
    • Vocabularies with true are required- if not recognized, an implementation MUST refuse to process a schema described by this meta-schema
    • Vocabularies with false are optional- if not recognized, an implementation MAY process a schema described by this meta-schema anyway, and ignore any keywords it does not recognize (which is how unrecognized keywords are handled now)
  • The URIs used to identify vocabularies are currently forbidden to point to any sort of actual document
    • This is so that, based on feedback, we can come up with a useful vocabulary definition format in a future draft
    • For now, it just means that some specification somewhere else documents what is in the vocabulary in text.
      • For example, the vocabulary URI https://json-schema.org/draft-08/vocabularies/applicators is defined in the JSON Schema Core specification to mean that the schemas can use the keywords in the section of that specification titled "A Vocabulary for Applying Subschemas"
    • This is pretty much how meta-schemas work now anyway
      • Implementations just know that http://json-schema.org/draft-07/schema corresponds to everything in either the Core or Validation specification
      • There is no programmatic way to determine what that means; the implementation developer just reads the spec
      • With $vocabulary, we're just making that more modular and introducing optional vs required vocabularies
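The required vs optional rule in the list above could be sketched roughly like this. This is not taken from the PR: the function and exception names are invented, and the extra-formats URI is a placeholder.

```python
class UnsupportedVocabulary(Exception):
    """Raised when a required vocabulary is not recognized."""


def check_vocabularies(vocabulary, recognized):
    """Apply the required/optional rule to a meta-schema's $vocabulary object.

    vocabulary: dict mapping vocabulary URI -> bool (True means required).
    recognized: set of vocabulary URIs this implementation understands.
    Returns the optional vocabularies that will be ignored.
    """
    missing = sorted(uri for uri, required in vocabulary.items()
                     if required and uri not in recognized)
    if missing:
        # Required but unknown: MUST refuse to process.
        raise UnsupportedVocabulary(", ".join(missing))
    # Optional and unknown: MAY process anyway; this sketch chooses to.
    return sorted(uri for uri, required in vocabulary.items()
                  if not required and uri not in recognized)


APPLICATORS = "https://json-schema.org/draft-08/vocabularies/applicators"
EXTRA_FORMATS = "https://example.com/vocab/extra-formats"  # placeholder URI

ignored = check_vocabularies({APPLICATORS: True, EXTRA_FORMATS: False},
                             recognized={APPLICATORS})
```

Note that nothing here dereferences the URIs; they are opaque identifiers the implementation either knows or does not.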

There's more to the PR, but these are the key points for what I want to say about format and the content* keywords. Someone may also come up with a counter-proposal, but let's work with what we have for the moment.

What does this have to do with format?

My vague plans for world domination through modular vocabularies included being able to define values for keywords like format. JSON Schema itself would define some, but other vocabularies could also include format (with the same general keyword behavior) and just define additional values.

So for example, OpenAPI could define a vocabulary that just consisted of format and all of their defined formats that are not already in the JSON Schema Validation specification, and say that they are using that vocabulary as well as the standard ones. They would assign it a URI like

https://www.openapis.org/oas/3.0/json-schema-vocabularies/os-formats

And then in their specification document include that URI as the one to use with $vocabulary (assuming that OpenAPI starts supporting normal JSON Schema with meta-schemas- just roll with it, it's an example, and it's pretty likely to happen in OAS 3.1 anyway).
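For illustration only, a meta-schema using that URI might look something like the following, built here as a Python dict and serialized to JSON. The $id is a placeholder, only one standard vocabulary is listed for brevity, and marking the OpenAPI formats vocabulary as optional is an assumption of this sketch, not something OpenAPI has decided.

```python
import json

# Vocabulary URIs as written in this thread; the final URIs may differ.
APPLICATORS = "https://json-schema.org/draft-08/vocabularies/applicators"
OAS_FORMATS = "https://www.openapis.org/oas/3.0/json-schema-vocabularies/os-formats"

oas_meta_schema = {
    "$id": "https://example.com/meta/oas-schema",  # placeholder identifier
    "$vocabulary": {
        APPLICATORS: True,   # required: refuse to process if unknown
        OAS_FORMATS: False,  # optional here: unknown formats are ignored
    },
}

serialized = json.dumps(oas_meta_schema, indent=2)
```

Flipping the second value to true would turn "nice to have" into "do not process this unless you know the OpenAPI formats".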

Requiring specific formats

The vocabulary based on the standard Validation spec will continue to document format validation support as optional. Even if the vocabulary is required, format validation is not guaranteed.

However, someone else could define a new vocabulary and document it to mean "like the standard format keyword, but when the value is "ipv4" or "ipv6", validation MUST occur". For really complicated formats, such a document should say what validation is sufficient (e.g. "email" is very hard to reliably validate, usually you validate it up to some point and then hope for the best).

You could even just say "like the standard format but with all values requiring validation".
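A rough sketch of what an implementation of such a vocabulary could do, using Python's standard ipaddress module. The vocabulary URI and function names are hypothetical, and the sketch assumes the implementation tracks which vocabularies are active for the schema being processed.

```python
import ipaddress

# Hypothetical URI for a vocabulary that makes ipv4/ipv6 validation mandatory.
STRICT_IP_VOCAB = "https://example.com/vocab/strict-ip-formats"


def _parses_as(address_class, value):
    try:
        address_class(value)
    except ValueError:
        return False
    return True


STRICT_CHECKS = {
    "ipv4": lambda v: _parses_as(ipaddress.IPv4Address, v),
    "ipv6": lambda v: _parses_as(ipaddress.IPv6Address, v),
}


def check_format(value, fmt, active_vocabularies):
    """Return False only when a mandatory format check fails."""
    if not isinstance(value, str):
        return True  # these formats only apply to strings
    if STRICT_IP_VOCAB in active_vocabularies and fmt in STRICT_CHECKS:
        return STRICT_CHECKS[fmt](value)
    return True  # otherwise format is treated as an annotation only


strict = check_format("999.1.1.1", "ipv4", {STRICT_IP_VOCAB})
lenient = check_format("999.1.1.1", "ipv4", set())
```

The same instance fails under the strict vocabulary and passes without it, which is exactly the difference a schema author would be opting into.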

Adopting this approach would let us:

  • use $vocabulary as proposed in the PR, with optional vs required use
  • define vocabularies that essentially just add the requirement to validate an existing format value
  • require different levels of validation where there is no obvious simple check
  • avoid designing a formal vocabulary document that somehow describes all of this (which would probably delay draft-08 by months)

It has several drawbacks:

  • there are a lot of formats, and a lot of potential combinations of formats
  • this probably makes it relatively unlikely that lots of validators will recognize the various combinations documented as vocabularies- the vocabulary granularity was meant to produce a good balance between modularity vs a reasonable number of options for implementations to recognize
  • it's really hand-wavey, which bothers a lot of people

What to do in draft-08?

This is the best I've come up with. If folks like it, we'll put it in. If someone comes up with something better, we'll put that in.

Unless someone produces a vocabulary definition file format that wins instant universal acclaim, we will NOT be defining such a file format in draft-08. I've spent a lot of time thinking about it, and it's too hard to do within the next few weeks, and we're way past due for this draft. If we think we need that sort of solution, it will go in draft-09.

A corollary of this is that we cannot use URI fragments to indicate required support for things inside of the vocabulary (like specific keywords or specific keyword values), because that would constrain the format of the file, which is exactly what I don't want to do in this draft.

So... that's all I've got right now. Any ideas?

@gregsdennis
Member

gregsdennis commented Nov 15, 2018

  • Vocabularies with true are required- if not recognized, an implementation MUST refuse to process a schema described by this meta-schema
  • Vocabularies with false are optional- if not recognized, an implementation MAY process a schema described by this meta-schema anyway, and ignore any keywords it does not recognize (which is how unrecognized keywords are handled now)

Just to clarify, implementations should still ignore unrecognized keywords (i.e. keywords not defined by any of the vocabularies), correct?

@handrews
Contributor Author

Just to clarify, implementations should still ignore unrecognized keywords (i.e. keywords not defined by any of the vocabularies), correct?

Correct- this is covered in the PR, at least I think it is. When in doubt, the PR takes precedence, I just know few people will slog through the PR (thanks again for doing that, btw!), so I summarized it a bit here.

@reitzig

reitzig commented Nov 15, 2018

Do I understand correctly that for declaring that validation requires support for certain keywords for a certain (set of) schema(s), I have to write a new meta schema?

That strikes me as odd. The "have to" part, that is. It makes sense if I have a large number of similar schemas and the infrastructure to make a meta-schema available. However, in small use cases -- say I have a schema describing my log format which I use in automated tests -- it would be much more convenient to extend the vocabulary of a standard meta-schema in the schema.
Is that something that's on the table?

Regarding formats, a simple convention could be to have one (dummy) URL per format, for example

https://json-schema.org/draft-08/vocabularies/formats/date-time
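Sketching that convention in Python, assuming the dummy base URL above and a helper name made up for this example:

```python
# Base of the (dummy) per-format vocabulary URLs suggested above.
FORMAT_VOCAB_BASE = "https://json-schema.org/draft-08/vocabularies/formats/"


def format_vocabularies(required_formats, optional_formats=()):
    """Build a $vocabulary-style mapping with one URL per format.

    A format listed as required stays required even if it is also
    listed as optional.
    """
    vocabulary = {FORMAT_VOCAB_BASE + name: True for name in required_formats}
    for name in optional_formats:
        vocabulary.setdefault(FORMAT_VOCAB_BASE + name, False)
    return vocabulary


vocabulary = format_vocabularies(["date-time", "ipv4"], ["email"])
```

The result would sit under $vocabulary in a meta-schema, so the cost of this convention is one entry per format the author cares about.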

Furthermore, I'm thinking about a vocabulary that triggers a strict mode where we require support for all used keywords (imho a very natural choice):

https://json-schema.org/draft-08/vocabularies/all-used

(Is there something like a strict mode in the spec?)

@gregsdennis
Member

gregsdennis commented Nov 15, 2018

@reitzig unless you've declared new keywords not already defined by the draft schema, then, no, you don't need to write additional meta-schemas.

As @handrews said, if you can come up with a better, complete option that covers all of the same scenarios, please write it up (with examples). Right now, this is where we are. This issue has been open for 8 months now, and I'm sure he's been working on the idea longer than that.

Defining each format in a separate vocabulary seems inefficient. In his Requiring Specific Formats section, he states that vocabularies could redefine certain formats to make validation required. I'm sure we can figure out a way to make them validation forbidden to support the case where a supplied format is not wanted.

@handrews
Contributor Author

@reitzig Meta-schemas are how implementations are told what behavior to apply, so yes, if you want to change the behavior, you have to change/write a new meta-schema. This is unlikely to be changed because:

  1. Most people don't need to do this. Whenever we figure out how we want to handle this, there will be a brief flurry of people writing new meta-schemas and then it will be pretty rare.
  2. We've made it a lot easier to write meta-schemas with $recursiveRef, so it's just not that big of a deal anymore.

In case it was not clear from my wording above: I'm not satisfied with any current solution to this problem, including the ones I outlined earlier. I think that the correct solution involves having a real vocabulary definition format, which is not going to make the cut for draft-08. I posted this to make sure no one had a better idea before going ahead with punting this to draft-09 (technically it was never in the draft-08 milestone, but I had hoped to pull it in).

a simple convention could be to have one (dummy) URL per format

The solution I proposed would allow that.

triggers a strict mode where we require support for all used keywords

That's too magical of a behavior for something that is really a very narrow problem. Most keywords are either supported or not; format and a very small number of others are exceptions.

@handrews
Contributor Author

As we've worked with the concept of vocabularies and further tried to figure out what to do with format, it's become clear that the keyword should really be replaced by multiple vocabularies focused on specific areas.

For example, a date-time vocabulary could provide several keywords replacing the date-time, date, time, duration, etc. formats with more clarity and flexibility than format offers. This vocabulary would presumably be quite easy to implement (lots of reliable, optimized libraries understand date and time issues), so would probably become widely supported, unlike format, where supporting easy formats gets bogged down in supporting hard formats.
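As a small illustration of how little code the easy cases need, here is a sketch that checks the existing date format (an RFC 3339 full-date, YYYY-MM-DD) with only the Python standard library. The function name is made up, and this is not a keyword from any actual date-time vocabulary.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, ASCII digits only.
_FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_full_date(value):
    """Check the YYYY-MM-DD shape and that the calendar date exists.

    Limitation: datetime.date rejects year 0000, so this sketch does too.
    """
    match = _FULL_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = map(int, match.groups())
    try:
        date(year, month, day)  # raises ValueError for dates like Feb 30
    except ValueError:
        return False
    return True


leap_day_ok = is_full_date("2020-02-29")
impossible_day = is_full_date("2019-02-30")
```

The shape check is done separately because date parsers in some libraries accept more than the strict YYYY-MM-DD form.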

We're not going to drop format unless and until such vocabularies emerge and prove popular. But we're not going to spend time trying to accommodate keywords with extensible value sets either. They're a huge pain and don't offer anything you can't do with separate keyword vocabularies.

Therefore, I'm closing this as irrelevant due to a change in direction. For the current status of format, see the draft 2019-09 release notes.

@sandrina-p

This vocabulary would presumably be quite easy to implement (lots of reliable, optimized libraries understand date and time issues),

@handrews is there any library that supports date validation that you would recommend?
