Grammar and taxonomy #171

awwright · 2022-05-03T05:31:04Z

awwright
May 3, 2022
Maintainer

Somewhere in a Slack discussion last week, I posited a field of study that unifies grammar and taxonomy. Such a study would be of interest to us, because JSON Schema works in both these fields. (In https://github.com/json-schema-org/json-schema-spec/wiki/Profiles-and-Sets I talk about "profiles" and "sets" which is related, if not the same, and should also be looked at.)

JSON Schema very closely resembles a way to define a context-free grammar within the set of valid JSON documents, where adding a keyword reduces the set of valid documents (i.e. the result is the intersection of the sets that each keyword accepts). Most keywords in JSON Schema can be described as a finite state machine over bytes. (A DTD is the XML equivalent of a JSON Schema; it describes a context-free grammar of XML elements, in turn reducible to a context-free grammar of bytes!)
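To make the "keywords intersect" idea concrete, here is a minimal sketch in Python. It is a toy model, not a JSON Schema implementation: the names `KEYWORDS`, `accepts`, `accepted`, and `UNIVERSE` are made up for illustration, only three keywords are modeled, and the "set of all documents" is replaced by a small sample list.

```python
from typing import Any, Callable, Dict, List


def is_number(x: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


# Hypothetical, simplified type checks (only two types are modeled).
TYPE_CHECKS: Dict[str, Callable[[Any], bool]] = {
    "number": is_number,
    "string": lambda x: isinstance(x, str),
}

# Each keyword is a predicate over (keyword argument, instance).
# As in JSON Schema, "minimum"/"maximum" only constrain numbers and
# let every other kind of value through.
KEYWORDS: Dict[str, Callable[[Any, Any], bool]] = {
    "type": lambda arg, x: TYPE_CHECKS[arg](x),
    "minimum": lambda arg, x: not is_number(x) or x >= arg,
    "maximum": lambda arg, x: not is_number(x) or x <= arg,
}


def accepts(schema: Dict[str, Any], instance: Any) -> bool:
    # A schema accepts an instance only if every keyword does:
    # the accepted set is the intersection of the per-keyword sets.
    return all(KEYWORDS[name](arg, instance) for name, arg in schema.items())


def accepted(schema: Dict[str, Any], universe: List[Any]) -> List[Any]:
    return [x for x in universe if accepts(schema, x)]


# A small stand-in for "all valid JSON documents".
UNIVERSE = [-5, 0, 3, 10, 42, "hello", "3"]

step0 = accepted({}, UNIVERSE)
step1 = accepted({"type": "number"}, UNIVERSE)
step2 = accepted({"type": "number", "minimum": 0}, UNIVERSE)
step3 = accepted({"type": "number", "minimum": 0, "maximum": 10}, UNIVERSE)

for label, result in [("{}", step0), ("+type", step1), ("+minimum", step2), ("+maximum", step3)]:
    print(label, result)
```

The empty schema accepts everything in the sample, and each added keyword can only keep the accepted list the same or make it smaller.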

JSON Schema is also frequently used to describe taxonomies, where instances of one schema are automatically known to be instances of another schema, and where members of a more specific subclass describe more properties. However, this seems to conflict with the basic principle of JSON Schema that keywords reduce the number of valid documents instead of increasing it.

While I’m not aware of a field that’s actually the unification of grammar and taxonomy, these concepts are both very closely related to category theory. Broadly applicable in a wide variety of scientific fields, it has especially influenced functional programming (monads, etc.) and relational databases, and it has some application to automata theory. (These are fields I feel we should know well.)

One of my key findings is: whenever we parse objects for data, and those objects can be instances of multiple classes, we must permit data specific to any subclass to exist somewhere, while remaining ignored as far as the superclass is concerned. In C++, this data is appended to the instance (in memory that is allocated, but that no address offset used by the superclass points to). The analogous mechanism in JSON (and in similar key-value object languages like PHP) is that the subclass must pick a key that is unused and ignored in the superclass.

In other words, grammar and taxonomy are not contradictory at all. The set of instances of a superclass must, by necessity, be a superset of the instances of each of its subclasses (not a subset, as I had previously hypothesized).
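A small sketch of that superset relationship, in Python. Everything here is hypothetical and simplified for illustration: `conforms`, `ANIMAL`, `DOG`, and the sample objects are invented names, and a "schema" is reduced to a mapping from required keys to Python types, with all unlisted keys ignored.

```python
from typing import Any, Dict


def conforms(instance: Dict[str, Any], props: Dict[str, type]) -> bool:
    # Open-world check: every listed key must be present with the listed
    # type; keys the schema does not mention are simply ignored.
    return all(isinstance(instance.get(key), typ) for key, typ in props.items())


ANIMAL = {"name": str}
# The subclass adds a constraint on a key the superclass never looks at.
DOG = {**ANIMAL, "breed": str}

rex = {"name": "Rex", "breed": "collie"}
tom = {"name": "Tom"}
samples = [rex, tom, {"breed": "collie"}, {}]

animals = [s for s in samples if conforms(s, ANIMAL)]
dogs = [s for s in samples if conforms(s, DOG)]

print("animals:", animals)
print("dogs:", dogs)
```

The subclass schema carries more keywords, so it accepts fewer instances; every sample accepted as `DOG` is also accepted as `ANIMAL`, but not the other way around.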

An obvious objection: Doesn't this conflict with unevaluatedProperties? What if an instance uses a property that’s meaningful to a subclass? Counterexample: What if a subclass uses a property that’s already defined (and defined differently) in another subclass? If one of these is a problem, both must be a problem. Some amount of “open world assumption” must take place: when we pass around an instance as a member of the superclass, we have to assume that it might also be a member of additional subclasses, without knowing specifically which ones. And it’s not obvious to me how either “additionalProperties” or “unevaluatedProperties” approaches a solution to this.
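Both problems can be shown with a toy check, sketched below in Python. This is only an approximation of what a flat schema with "additionalProperties": false would do, not real JSON Schema evaluation; `conforms`, the `closed` flag, and the example classes are all assumptions made up for this illustration.

```python
from typing import Any, Dict


def conforms(instance: Dict[str, Any], props: Dict[str, type], closed: bool = False) -> bool:
    # Listed keys must be present with the listed type.
    typed_ok = all(isinstance(instance.get(key), typ) for key, typ in props.items())
    # With closed=True, keys the schema does not list are rejected
    # (a rough stand-in for forbidding additional properties).
    no_extras = set(instance) <= set(props)
    return typed_ok and (no_extras or not closed)


ANIMAL = {"name": str}
DOG = {"name": str, "tag": str}        # "tag" is a string here...
LIVESTOCK = {"name": str, "tag": int}  # ...and an integer here

rex = {"name": "Rex", "tag": "A-17"}

# Problem 1: a closed superclass rejects an otherwise valid subclass instance.
open_animal = conforms(rex, ANIMAL)
closed_animal = conforms(rex, ANIMAL, closed=True)

# Problem 2: two subclasses define the same key differently, so no
# instance can be a member of both.
as_dog = conforms(rex, DOG)
as_livestock = conforms(rex, LIVESTOCK)

print(open_animal, closed_animal, as_dog, as_livestock)
```

In this sketch, closing the superclass excludes the subclass instance, and the key collision between the two subclasses goes unnoticed by the superclass either way.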

If the schema authoring is “closed world” (i.e. we know the complete description of all classes/schemas at all times), then maybe a limitation on additional properties makes sense, and so would some authoring conveniences. But this API wouldn’t be extensible!

Some areas for additional study:

- Re-evaluate use cases for "unevaluatedProperties": false. Are there any situations where this is actually a good practice? Could this be useful for forward and/or backward compatibility reasons? (For example, a client may want an API call to fail if it uses a feature that's not understood by the server, or vice versa. But this couldn't work both ways by itself, for example in a peer-to-peer API where either side may be newer or older.)
- How do we evaluate compatibility in APIs and between parties in an application architecture? How can JSON Schema be used to distinguish must-understand protocol elements from may-understand protocol elements?
- In a client-server architecture, when is it appropriate for the server to shrink the acceptable value space? (Hypothesis: a server can shrink the value space of an undefined element, in the process of defining it, because well-behaved clients will not be using it; but it cannot shrink the value space of a defined element, because this would break existing clients.)
- What's the relationship between must/may-understand and the closed/open-world assumption? An open set cannot support must-understand elements, because any elements that are must-understand might not be listed. (For most protocols this is not a problem, because we can define a portion of the protocol that is a closed set of the must-understand elements of the protocol.)
- What does this mean for recursively defined elements, e.g. the meta-schema? It seems likely that any schema not using "unevaluatedProperties" can be adequately described with $ref only.
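The hypothesis about shrinking the value space can be sketched as a compatibility check. This is a rough Python illustration under strong assumptions: each API version is modeled as a mapping from element name to a finite set of allowed values, an element missing from the mapping is treated as undefined (unconstrained), and `shrunk_elements`, `V1`, `V2`, and `V3` are hypothetical names.

```python
from typing import Dict, List, Set

Version = Dict[str, Set[str]]


def shrunk_elements(old: Version, new: Version) -> List[str]:
    # An element breaks existing clients if it was already defined in the
    # old version and the new version no longer allows all of its old
    # values. Elements that are newly defined in `new` are not reported,
    # on the assumption that well-behaved clients were not using them.
    return sorted(
        name
        for name, allowed in old.items()
        if name in new and not allowed <= new[name]
    )


V1: Version = {"format": {"json", "xml"}}
# Defines a previously undefined element: allowed under the hypothesis.
V2: Version = {"format": {"json", "xml"}, "compression": {"gzip"}}
# Shrinks an already defined element: breaks clients that send "xml".
V3: Version = {"format": {"json"}, "compression": {"gzip"}}

v1_to_v2 = shrunk_elements(V1, V2)
v2_to_v3 = shrunk_elements(V2, V3)

print("V1 -> V2 breaks:", v1_to_v2)
print("V2 -> V3 breaks:", v2_to_v3)
```

A real check would have to compare schemas rather than finite value sets, which is a much harder problem; the sketch only shows which direction of change the hypothesis permits.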

The application to the specification text right now is limited; mostly I'd like to survey: who is familiar with these concepts?
