Grammar and taxonomy #171
awwright
started this conversation in
Specification
Replies: 0 comments
Somewhere in a Slack discussion last week, I posited a field of study that unifies grammar and taxonomy. Such a study would be of interest to us, because JSON Schema works in both these fields. (In https://github.com/json-schema-org/json-schema-spec/wiki/Profiles-and-Sets I talk about "profiles" and "sets", which are related, if not the same, and should also be looked at.)
JSON Schema very closely resembles a way to define a context-free grammar within the set of valid JSON documents, where adding keywords reduces the set of valid documents (i.e. each keyword intersects the valid set with its own constraint). Most keywords in JSON Schema can be described as a finite state machine of bytes. (A DTD is the XML equivalent of a JSON Schema; it describes a context-free grammar of XML elements, in turn reducible to a context-free grammar of bytes!)
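To make the "intersection" idea concrete, here is a minimal sketch in Python. It is not how any real JSON Schema validator works; it assumes a toy model in which each keyword is just a predicate, and the names `is_valid`, `accepted`, `universe`, and `schema_a`/`schema_b`/`schema_c` are made up for illustration.

```python
from typing import Any, Callable, Dict, List

# Toy model (an assumption, not the real spec): a keyword is a predicate
# over JSON values, and a schema is a dict of keyword name -> predicate.
Keyword = Callable[[Any], bool]


def is_number(value: Any) -> bool:
    # Python treats bool as an int subclass, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_valid(schema: Dict[str, Keyword], instance: Any) -> bool:
    """An instance is valid only if every keyword accepts it."""
    return all(check(instance) for check in schema.values())


# A small sample "universe" of JSON values, so the accepted sets are visible.
universe: List[Any] = [0, 3, 7, 12, "seven", None, [7]]


def accepted(schema: Dict[str, Keyword]) -> List[Any]:
    return [value for value in universe if is_valid(schema, value)]


# Like the real keywords, "minimum" and "maximum" ignore non-numbers.
schema_a = {"type": is_number}
schema_b = {**schema_a, "minimum": lambda v: not is_number(v) or v >= 5}
schema_c = {**schema_b, "maximum": lambda v: not is_number(v) or v <= 10}

print(accepted({}))        # the empty schema accepts the whole universe
print(accepted(schema_a))  # numbers only
print(accepted(schema_b))  # numbers >= 5
print(accepted(schema_c))  # numbers between 5 and 10
```

In this model, every added keyword can only keep the accepted set the same or shrink it, which is the property described above.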
JSON Schema is also frequently used to describe taxonomies, where instances of one schema are automatically known to be instances of another schema, and where members of a more specific subclass describe more properties. However, this seems to conflict with the basic principle of JSON Schema that keywords reduce the number of valid documents, instead of increasing it.
While I’m not aware of a field that’s actually the unification of grammar and taxonomy, these concepts are both very closely related to Category Theory. Broadly applicable in a wide variety of scientific fields, it has especially influenced functional programming (monads, etc.) and relational databases, and it has some application to automata theory. (These are fields I feel we should know well.)
One of my key findings is: whenever we parse objects for data, and those objects can be instances of multiple classes, we must permit data specific to any subclass to exist somewhere, but remain ignored as far as the superclass is concerned. In C++, this data is found appended to the instance (in memory that is allocated, but that no offset used by the superclass points to). The analogous mechanism in JSON (and similar key-value object languages like PHP) is that the subclass must pick a key that is unused and ignored by the superclass.
In other words, grammar and taxonomy are not contradictory at all. The set of instances of a superclass must, by necessity, be a superset of the instances of its subclasses (not a subset, as I had previously hypothesized).
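A rough sketch of that superset relationship, in Python. The `validate` function is a deliberately tiny stand-in that only understands `required` and a type-per-key form of `properties`; it and the `ANIMAL`/`DOG` schemas are hypothetical names for illustration, not anything from the specification.

```python
from typing import Any, Dict


def validate(schema: Dict[str, Any], instance: Dict[str, Any]) -> bool:
    """Toy validator: checks "required" keys and per-key Python types.

    Keys that the schema does not mention are ignored (an open schema).
    """
    for key in schema.get("required", []):
        if key not in instance:
            return False
    for key, expected_type in schema.get("properties", {}).items():
        if key in instance and not isinstance(instance[key], expected_type):
            return False
    return True


# Superclass: few constraints, so many documents are valid.
ANIMAL = {"required": ["name"], "properties": {"name": str}}

# Subclass: the same constraints plus a key the superclass never looks at.
DOG = {
    "required": ["name", "breed"],
    "properties": {"name": str, "breed": str},
}

rex = {"name": "Rex", "breed": "collie"}
generic = {"name": "Thing"}

print(validate(DOG, rex), validate(ANIMAL, rex))          # True True
print(validate(DOG, generic), validate(ANIMAL, generic))  # False True
```

Every document accepted by `DOG` is also accepted by `ANIMAL`, because `ANIMAL` ignores the `breed` key, while the reverse does not hold. Under these assumptions, the subclass adds keywords and the superclass ends up with the larger set of valid documents.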
An obvious objection: Doesn't this conflict with unevaluatedProperties? What if an instance uses a property that’s meaningful to a subclass? Counterexample: What if a subclass uses a property that’s already defined (and defined differently) in another subclass? If one of these is a problem, both of these must be a problem. Some amount of “open world assumption” must take place: when we pass around an instance as a member of the superclass, we have to assume that it might also be a member of additional subclasses, without knowing specifically which ones. And it’s not obvious to me how either “additionalProperties” or “unevaluatedProperties” approaches a solution to this.
If the schema authoring is “closed world” (i.e. we know the complete description of all classes/schemas at all times), then maybe a limitation on additional properties makes sense, and so would some authoring conveniences. But such an API wouldn’t be extensible!
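Both the objection and the counterexample can be sketched in Python. This is a simplified model, not real validator behavior: `validate` only understands `required`, a type-per-key `properties`, and `additionalProperties: False` judged against `properties` in the same schema. All schema names here are invented for illustration.

```python
from typing import Any, Dict


def validate(schema: Dict[str, Any], instance: Dict[str, Any]) -> bool:
    """Toy validator with an optional closed-world switch."""
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in instance:
            return False
    for key, value in instance.items():
        if key in props:
            if not isinstance(value, props[key]):
                return False
        elif schema.get("additionalProperties", True) is False:
            # Closed world: any key the schema does not declare is an error.
            return False
    return True


OPEN_ANIMAL = {"required": ["name"], "properties": {"name": str}}
CLOSED_ANIMAL = {**OPEN_ANIMAL, "additionalProperties": False}

# Two subclasses that picked the same key but define it differently.
DOG = {"required": ["name", "size"], "properties": {"name": str, "size": str}}
FISH = {
    "required": ["name", "size"],
    "properties": {"name": str, "size": (int, float)},
}

rex = {"name": "Rex", "size": "large"}

print(validate(OPEN_ANIMAL, rex))    # True: the superclass ignores "size"
print(validate(CLOSED_ANIMAL, rex))  # False: the subclass key is rejected
print(validate(DOG, rex))            # True
print(validate(FISH, rex))           # False: "size" means something else here
```

In this sketch the closed superclass stops accepting instances of its own subclass, and the key collision between `DOG` and `FISH` is only noticed by a party that knows both schemas, which is the closed-world case.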
Some areas for additional study:
unevaluatedProperties: false: Are there any situations where this is actually a good practice? Could this be useful for forward and/or backward compatibility reasons? (For example, a client may want an API call to fail if it uses a feature that's not understood by the server, or vice versa. But this couldn't work by itself both ways, for example in a peer-to-peer API where either side may be newer or older.)
The application to the specification text right now is limited; mostly I'd like to survey: who is familiar with these concepts?