Disable uri-reference format check in jsonsschema #2771
To spot similar cases in the future, we could potentially extend the test pipeline to install the [format] extra dependencies for |
Thanks @binste, can you check what happens if you parse the example of vega/vega-lite#5838 (comment) with jsonschema 4.13.0?
From what I understand it's not the grammar that is wrong, but the parsers that cannot handle special characters. Could you verify this? |
Running that example with jsonschema 4.13 gives the following error:
ValueError Traceback (most recent call last)
File ~/.local/lib/python3.11/site-packages/jsonschema/_format.py:135, in FormatChecker.check(self, instance, format)
134 try:
--> 135 result = func(instance)
136 except raises as e:
File ~/.local/lib/python3.11/site-packages/jsonschema/_format.py:360, in is_uri_reference(instance)
359 return True
--> 360 return rfc3987.parse(instance, rule="URI_reference")
File ~/.local/lib/python3.11/site-packages/rfc3987.py:462, in parse(string, rule)
461 if not m:
--> 462 raise ValueError('%r is not a valid %r.' % (string, rule))
463 if REGEX:
ValueError: '#/definitions/Foo<hello>' is not a valid 'URI_reference'.
The above exception was the direct cause of the following exception:
SchemaError Traceback (most recent call last)
Cell In[1], line 15
3 schema = {
4 "definitions": {
5 "Foo<hello>": {"type": "string"}
...
Failed validating 'format' in metaschema['allOf'][1]['properties']['properties']['additionalProperties']['$dynamicRef']['allOf'][0]['properties']['$ref']:
{'format': 'uri-reference', 'type': 'string'}
On schema['properties']['x']['$ref']:
'#/definitions/Foo<hello>'
The "URI_reference" format check which fails here is the one that would be disabled with this PR. In my limited understanding, URIs (as they are used as values for "$ref") should be percent-encoded according to RFC 3986, e.g. "<" becomes "%3C". |
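To make the percent-encoding point concrete, here is a minimal sketch using only Python's standard library. The choice of safe="#/" is an assumption made for illustration, so that the fragment marker and path separators stay literal; it is not necessarily how Vega-Lite produced its encoded references in vega/vega-lite#5838.

```python
from urllib.parse import quote, unquote

ref = "#/definitions/Foo<hello>"

# Keep "#" and "/" literal so the pointer structure survives (assumption
# for this sketch); "<" and ">" are percent-encoded.
encoded = quote(ref, safe="#/")
print(encoded)  # #/definitions/Foo%3Chello%3E

# Decoding restores the original reference.
decoded = unquote(encoded)
print(decoded == ref)  # True
```

The encoded form is the one the uri-reference format check should accept; the raw form with "<" and ">" is what triggers the ValueError above.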
Hi! (Sorry to hear something broke here). Just to clarify after seeing this PR cross-linked:
This isn't the case! Though I assume that's just imprecision -- but the format validation default (in 4.17.0 I presume you mean) is the same as before, it's only that validating schemas now catches cases where schemas are invalid purely from things represented by format validation in the metaschema. I.e. previously, some invalid schemas were let through, now they're not, as you're noticing (and this indeed will hopefully get even more strict in the future, as there are plenty of schemas entirely valid under the metaschema but which are still invalid according to the specification). Validating instances (i.e. data) is exactly the same, and changing that definitely wouldn't happen without deprecation for backwards compatibility purposes (heck it wouldn't happen at all, since it's not compliant behavior under the spec, but yeah). I haven't looked in depth at how you use |
Thanks @Julian for your detailed comment. Appreciated! |
Pleasure of course, and if indeed it'd help to have a closer look at any point at the changes feel free to ping me, I love altair :) |
Thanks @Julian for your input and in general for your work on python-jsonschema! Indeed, that was just an imprecision on my side, I was referring to the validation of the schema. Actually, I'd really appreciate it if you could take a look at the proposed solution, especially as I'm not too familiar with your library so there might be a more future-proof way to do this. To make it easier I stripped out the relevant parts into the code examples below so no need to go through the PR. For the previously mentioned reasons the schema below has a wrong uri-reference due to "<" and ">":
import jsonschema
schema = {
"definitions": {
"Foo<hello>": {"type": "string"}
},
"properties": {
"x": {"$ref": "#/definitions/Foo<hello>"},
},
}
spec = {"x": "x value"}
jsonschema.validate(spec, schema, cls=jsonschema.Draft7Validator)
And so it gives the following error:
...
ValueError: '#/definitions/Foo<hello>' is not a valid 'URI_reference'.
...
Btw, I think it's great that this is introduced in The proposed solution is to temporarily disable the uri-reference format check:
import jsonschema
schema = {
"definitions": {
"Foo<hello>": {"type": "string"}
},
"properties": {
"x": {"$ref": "#/definitions/Foo<hello>"},
},
}
spec = {"x": "x value"}
validator_cls = jsonschema.Draft7Validator
removed_format_checkers = []
try:
    # In older versions of jsonschema this attribute did not yet exist
    # and we do not need to disable any format checkers
    if hasattr(validator_cls, "FORMAT_CHECKER"):
        for format_name in ["uri-reference"]:
            try:
                checker = validator_cls.FORMAT_CHECKER.checkers.pop(format_name)
                removed_format_checkers.append((format_name, checker))
            except KeyError:
                # Format checks are only set by jsonschema if it can import
                # the relevant dependencies
                continue
    jsonschema.validate(spec, schema, cls=validator_cls)
finally:
    # Restore the original set of checkers as the jsonschema package
    # might also be used by other packages
    for format_name, checker in removed_format_checkers:
        validator_cls.FORMAT_CHECKER.checkers[format_name] = checker
Does this make sense to you or do you think there is an advantage in using the validator class directly so we can pass the format checks to
import copy
import jsonschema
from jsonschema import exceptions
schema = {
"definitions": {
"Foo<hello>": {"type": "string"}
},
"properties": {
"x": {"$ref": "#/definitions/Foo<hello>"},
},
}
spec = {"x": "x value"}
validator_cls = jsonschema.Draft7Validator
format_checker = copy.deepcopy(validator_cls.FORMAT_CHECKER)
format_checker.checkers.pop("uri-reference", None)
validator_cls.check_schema(schema, format_checker=format_checker)
validator = validator_cls(schema)
error = exceptions.best_match(validator.iter_errors(spec))
if error is not None:
    raise error |
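The remove-then-restore pattern from the first proposal can also be written as a context manager. This is only a sketch: without_keys is a hypothetical helper (it exists in neither Altair nor jsonschema), and a plain dict stands in for validator_cls.FORMAT_CHECKER.checkers so the example runs without jsonschema installed.

```python
from contextlib import contextmanager


@contextmanager
def without_keys(mapping, keys):
    """Temporarily remove `keys` from `mapping`, restoring them on exit."""
    removed = {}
    try:
        for key in keys:
            # Keys that are not present are skipped, mirroring the
            # `except KeyError: continue` branch in the proposal.
            if key in mapping:
                removed[key] = mapping.pop(key)
        yield mapping
    finally:
        # Restore even if the body raised, like the `finally` block above.
        mapping.update(removed)


# Stand-in for the checkers dictionary (hypothetical contents).
checkers = {"uri-reference": object(), "date": object()}

with without_keys(checkers, ["uri-reference", "not-registered"]):
    inside = sorted(checkers)

after = sorted(checkers)
print(inside)  # ['date']
print(after)   # ['date', 'uri-reference']
```

Like the original snippet, this mutates shared state while the block runs, so it is not safe if another thread validates at the same time; the deepcopy variant avoids that.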
If you want to treat the schema as valid (essentially get the same behavior you had before), the simplest is likely just:
validator = jsonschema.Draft7Validator(schema)
which doesn't validate your schema, it assumes you're saying it's valid. (either with or without format checking enabled for validating instances, if you want it, use and then using But then you don't need to muck with changing anything and essentially should be able to go on as-is until the schema is fixed. If you want to guard against other unexpected bugs in the schema though then you may want to call
errors = jsonschema.Draft7Validator(
    jsonschema.Draft7Validator.META_SCHEMA,
    format_checker=jsonschema.Draft7Validator.FORMAT_CHECKER,
).iter_errors(schema)
ones_we_care_about = [error for error in errors if error.validator_value != "uri-reference"]
if ones_we_care_about:
    the_schema_is_even_more_invalid()
But if the schema is static (i.e. if it's like a vega-lite schema they never change) there's no need to call Does that help? (And thanks for the kind words!) |
Thank you! This was very helpful, especially the differentiation between I pushed a simplified version of the fix which uses the validator class instead of
import altair as alt
import pandas as pd
source = pd.DataFrame({
'a': ['A', 'B'],
'b': [28, 55]
})
chart = alt.Chart(source).mark_bar().encode(
x=alt.X("a", unknown=2),
y='b'
)
chart
Raises:
|
Since this issue was about the vega-lite-schema I thought this issue would be moved eventually to the Vega-Lite repo, and maybe it still will, but after this fruitful discussion this PR becomes a nice improvement to the core of Altair. I see that this PR will use the Draft 7 version of the JSON schema specification ( The schema is included within Altair so it makes sense to validate the vega-lite schema during development. Then we can catch these issues before it is shipped. Maybe we can set up a test for this within GitHub Actions. While typing I remember the comment earlier on in this thread from @binste:
Once Altair is released I also see no need to validate the vega-lite-schema itself every time when using Altair. Sure, it would be necessary to test the JSON generated by Altair against the schema, but those are two different things indeed. |
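As a rough illustration of what a development-time check could look like, the sketch below walks a schema and reports "$ref" values containing characters that RFC 3986 does not allow unencoded. It is a character-level approximation, not a full URI-reference parser; it would also flag a property that merely happens to be named "$ref". The find_unencoded_refs name and the sample schema (including the "Bar" reference) are hypothetical.

```python
import re

# Characters RFC 3986 permits in a URI reference: unreserved, reserved, and "%".
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")


def find_unencoded_refs(node, path="#"):
    """Yield (location, ref) for every "$ref" string with disallowed characters."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if key == "$ref" and isinstance(value, str):
                if not _ALLOWED.fullmatch(value):
                    yield here, value
            else:
                yield from find_unencoded_refs(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_unencoded_refs(item, f"{path}/{index}")


schema = {
    "definitions": {"Foo<hello>": {"type": "string"}},
    "properties": {
        "x": {"$ref": "#/definitions/Foo<hello>"},
        "y": {"$ref": "#/definitions/Bar"},  # hypothetical, only checked for characters
    },
}

problems = list(find_unencoded_refs(schema))
print(problems)  # [('#/properties/x/$ref', '#/definitions/Foo<hello>')]
```

A test in CI could assert that the list is empty, or, while the schema still contains unencoded references, that it only contains known entries.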
Sorry, I'm being a bit lazy by not digging into this myself and providing breadcrumbs instead, but hopefully they're helpful -- as I say I love altair so if this needs more attention on my part I'm happy to help, but seems y'all are getting there. Some more breadcrumbs then :D --
Here really "ideally" 2 things should happen:
EDIT: Being slightly less lazy, indeed vega already specifies
Yes! This is the thing I usually recommend -- if you have a static schema which you think is supposed to be valid, validate it in CI, then you'll know if something's wrong, but otherwise the library proceeds as normal. If you want such a thing you can run in GitHub actions, check-jsonschema is a lovely thing (by @sirosen) which even has |
I haven't read the full thread here, but if you wanted a minimal pre-commit config to get started, I believe that this would do the trick
# .pre-commit-config.yaml , in the repo root
# using check-metaschema:
# https://check-jsonschema.readthedocs.io/en/latest/precommit_usage.html#check-metaschema
repos:
  - repo: https://github.com/python-jsonschema/check-jsonschema
    rev: 0.19.2
    hooks:
      - id: check-metaschema
        files: ^altair/vega/v5/schema/vega-schema.json$
Or, if you didn't want to get up and running with
pipx install check-jsonschema # or 'pip install' in CI
check-jsonschema --check-metaschema altair/vega/v5/schema/vega-schema.json
And I'm always happy to help folks out on the |
Great inputs, thank you all! @mattijn Resolved the conflicts. Ready to be merged from my side. |
All checks passed. Merging. Thanks again everyone! |
* Use property to dynamically determine jsonschema validator * Fix regression introduced in #2771 for older jsonschema versions which did not yet make format checker accessible on validator class * Add test
* DOC: remove unused section * Disable uri-reference format check in jsonsschema (#2771) * Disable uri-reference format check. Consistently use same validator across codebase * Remove validation in SchemaInfo as not used anywhere and it referenced the wrong jsonschema draft * Add compatibility for older jsonschema versions * Improve comments * Simplify validate_jsonschema * Replace `iteritems` with `items` due to pandas deprecation (#2683) * Add changelog entry. * Bump version. * Run black and flake8. * Pin selenium in CI. Co-authored-by: Jake VanderPlas <[email protected]> Co-authored-by: Stefan Binder <[email protected]> Co-authored-by: Joel Ostblom <[email protected]>
* include an altairfuturewarning * deprecate vega 5 wrappers and render function * deprecate vegalite 3 wrappers and render function * use AltairDeprecationWarning * fix typo in v5 warning * remove mentioning alternative for vega wrappers * Backport bug fixes for a 4.2.1 release (#2827) * DOC: remove unused section * Disable uri-reference format check in jsonsschema (#2771) * Disable uri-reference format check. Consistently use same validator across codebase * Remove validation in SchemaInfo as not used anywhere and it referenced the wrong jsonschema draft * Add compatibility for older jsonschema versions * Improve comments * Simplify validate_jsonschema * Replace `iteritems` with `items` due to pandas deprecation (#2683) * Add changelog entry. * Bump version. * Run black and flake8. * Pin selenium in CI. Co-authored-by: Jake VanderPlas <[email protected]> Co-authored-by: Stefan Binder <[email protected]> Co-authored-by: Joel Ostblom <[email protected]> * include note in releases change log Co-authored-by: Jan Tilly <[email protected]> Co-authored-by: Jake VanderPlas <[email protected]> Co-authored-by: Stefan Binder <[email protected]> Co-authored-by: Joel Ostblom <[email protected]>
Fix for #2705 and #2767
Problem
Since 4.13.0, jsonschema does format validation by default. If it is installed with the format extra dependencies, this leads to errors for some of the Vega-Lite $ref values such as #/definitions/ValueDefWithCondition<MarkPropFieldOrDatumDef,(Gradient|string|null)> because they are not proper URIs: characters such as < are not encoded. At one point, Vega-Lite did do URL encoding, see vega/vega-lite#5838, but this was reverted in vega/vega-lite#5869.
Reproduce
Install jsonschema[format]>=4.13 and run
Proposed solution
As the URIs are already not encoded in the Vega-Lite schema, we do not want to change the URIs. Therefore, we need to disable the format check introduced in jsonschema. This PR introduces a new function validate_jsonschema which serves as a drop-in replacement for all calls to jsonschema.validate. It does two things:
1. Disables the uri-reference check before calling jsonschema.validate.
2. Ensures that the same validator (Draft7Validator) is used everywhere in the Altair codebase. This was not the case before, as some calls to jsonschema.validate did not pass a validator class, which leads jsonschema to fall back on Draft202012Validator.
Tested with jsonschema versions 3, 4.16.0, and 4.17.3. Don't expect issues for versions in between as the FormatChecker class already had the checkers attribute as a dictionary in version 3: https://github.com/python-jsonschema/jsonschema/blob/v3.0.0/jsonschema/_format.py#L34
Not directly related to the issue above but I also cleaned up an old reference to draft 4 along with some unused functionality, see 4b6a2d5