Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataclasses-jsonschema types #1589

Merged

Conversation

beckjake
Copy link
Contributor

@beckjake beckjake commented Jul 5, 2019

Convert dbt to use dataclasses + hologram for many/most internal types

Something to think about in this PR:

dbt.main.handle_and_check no longer returns the exact same signature. In particular, it no longer returns a dict-like object as its first result (the second part of the result is still a bool). You can see some of the kinds of changes in tests that inspected results. What else will be impacted by this change, and is it ok?

Things to do in future PRs because this is already far too long:

  • make the deps process a bit more sane
  • I had to make tests their own node type. We should do the same for seeds and hooks, at a minimum, but probably just for everything.
  • In general, I would like to clean up the type hierarchy. The typing.Union behavior really threw me for a loop here and results in some pretty unsatisfying serialization/de-serialization behavior!
  • (hologram?) Add support back in for things that "should" be lists actually being tuples (see the bigquery snapshots test I had to change)
  • I hate the stupid IntermediateSnapshotNode thing I had to do, but that is an issue buried deep inside the fundamentals of how we do parsing, so I really am loath to fix it here
  • Snapshots and their configs require a kind of gross hack to get proper error messages out of the validator. It would be nice to at least put this gross hack into hologram itself.

The typing.Union issue is that this assert passes:

@dataclass
class Foo(JsonSchemaMixin):
  a: int

@dataclass
class Bar(Foo):
  b: int

FoosAndBars = typing.Union[Foo, Bar]
assert FoosAndBars is Foo

This behavior makes complete and total sense for the original subclass-aware behavior of typing.Union! But it messes pretty badly with hologram, since now this won't decode:

@dataclass
class Contains(JsonSchemaMixin):
  inner: typing.Union[Foo, Bar]

Contains.from_dict({'inner': {'a': 1, 'b': 2}})  # raises ValidationError

This comes up a lot with the relationship between ParsedNode and CompiledNode. I've got a workaround in contracts/graph/compiled.py, but it kind of sucks.

@beckjake beckjake force-pushed the feature/dataclasses-jsonschema-types branch 2 times, most recently from 8678189 to 3dc1aff Compare July 9, 2019 16:50
Copy link
Member

@cmcarthur cmcarthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beckjake what's the state of this rn? I looked through a bunch of the changes which look fine individually. i agree with your point that this is already huge and we should try to push further work into future PRs

Replaceable,
metaclass=abc.ABCMeta
):
_ALIASES: ClassVar[Dict[str, str]] = field(default={}, init=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you want field(default_factory=dict, ...) here

Copy link
Contributor Author

@beckjake beckjake Jul 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! This is a ClassVar, as in a static class-level attribute, so default is appropriate. It's not called on instantiation by the dataclass layer (hence init=False). Though I do think I could maybe do this instead?:

_ALIASES: ClassVar[Dict[str, str]] = {}

It seems like dataclasses correctly figure out when they're class vars that they don't get an __init__ entry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I was pretty wrong about doing the above instead! When that happens, everything looks great until you call dataclasses.replace(credentials, ...) - then python throws a TypeError complaining that it got an unexpected keyword argument _ALIASES. That seems like a bug to me.

@beckjake
Copy link
Contributor Author

@cmcarthur I'm going to mark it as "ready for review", but I'm still writing a bunch of unit tests around the various fiddly bits of going to/from dicts that I'll add (and probably make minor changes accordingly).

@beckjake beckjake marked this pull request as ready for review July 10, 2019 13:13
@beckjake beckjake force-pushed the feature/dataclasses-jsonschema-types branch 3 times, most recently from ea2ab72 to 07b63cc Compare July 10, 2019 23:35
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a massive undertaking - nice work on replacing like 1.5k LOC with 1.5k lines of tests!

I took an initial look at this today and:

  1. asked a couple of questions about things i didn't understand on my first pass
  2. responded to all (?) of your comments - thanks for calling certain things out

You said:

dbt.main.handle_and_check no longer returns the exact same signature. In particular, it no longer returns a dict-like object as its first result (the second part of the result is still a bool). You can see some of the kinds of changes in tests that inspected results. What else will be impacted by this change, and is it ok?

I know some users are running dbt via this API, so we should be really careful to document this change when this code goes live. Additionally, I think some of our own internal tooling relies on this return value, so we should be sure to update that as well. I can follow up on that off-line.

core/dbt/adapters/base/connections.py Outdated Show resolved Hide resolved
core/dbt/context/common.py Show resolved Hide resolved
core/dbt/context/parser.py Outdated Show resolved Hide resolved
core/dbt/contracts/graph/parsed.py Outdated Show resolved Hide resolved
core/dbt/contracts/graph/parsed.py Show resolved Hide resolved
@beckjake
Copy link
Contributor Author

I know some users are running dbt via this API, so we should be really careful to document this change when this code goes live. Additionally, I think some of our own internal tooling relies on this return value, so we should be sure to update that as well. I can follow up on that off-line.

Agreed on documenting! I feel ok about it since we are pretty clear (IMO) that we don't support that interface, and we usually direct people to the mostly-unchanged run_results.json.

I checked the internal tooling I know about and it only cares about the success parameter.

@cmcarthur
Copy link
Member

@beckjake the shape of this looks good although i would not be surprised if we catch a few more bugs when we test this code out. this is approved from my end.

with that, i am still curious:

  • have you been able to generate a json schema document for all of the subclasses of jsonschemamixin?

  • have you tested serializing / deserializing an entire manifest?

@beckjake beckjake force-pushed the feature/dataclasses-jsonschema-types branch 2 times, most recently from 32c94f9 to b7783bd Compare July 12, 2019 13:20
@beckjake
Copy link
Contributor Author

@beckjake the shape of this looks good although i would not be surprised if we catch a few more bugs when we test this code out. this is approved from my end.

with that, i am still curious:

* have you been able to generate a json schema document for all of the subclasses of jsonschemamixin?

Yes! Well, I've called from_dict on all of them one way or another, and that involves calling their .json_schema() classmethods. I've manually inspected quite a number of them and they all look really reasonable.

* have you tested serializing / deserializing an entire manifest?

No... serializing works great of course, we do that in the code, deserialization has some type confusion issues that we will work out in #1602. A modified form that only accepts CompiledNode in the manifest and coerces ParsedNodes accordingly might work fine, but I'd rather just do it correctly.

@drewbanin
Copy link
Contributor

@beckjake I gave this a spin locally and found that some of the schema.yml specs in our internal-analytics project had invalid/additional fields in them! While that's really cool and convenient, the warning provided by dbt was pretty noisy:

2019-07-12 15:28:46,156 (MainThread): Compilation warning: Invalid test config given in 
models/schema.yml @ sources: {'name': 'dbt_dbanin', 'freshness': {'error_after': {'count': 48, 'period':
 'hour'}}, 'tables': [{'name': 'a', 'sql_table_name': 'raw.snowplow.event'}]} - Additional properties 
are not allowed ('sql_table_name' was unexpected)

Failed validating 'additionalProperties' in schema['properties']['tables']['items']:
    {'additionalProperties': False,
     'description': 'UnparsedSourceTableDefinition(name: str, description: '
                    "str = '', tests: Union[List[Union[Dict[str, Any], "
                    'str]], NoneType] = None, columns: '
                    'Union[List[dbt.contracts.graph.unparsed.NamedTested], '
                    'NoneType] = <factory>, loaded_at_field: Union[str, '
                    'NoneType] = None, identifier: Union[str, NoneType] = '
                    'None, quoting: dbt.contracts.graph.unparsed.Quoting = '
                    '<factory>, freshness: '
                    'dbt.contracts.graph.unparsed.FreshnessThreshold = '
                    '<factory>)',
     'properties': {'columns': {'default': [],
                                'oneOf': [{'items': {'$ref': '#/definitions/NamedTested'},
                                           'type': 'array'},
                                          {'type': 'null'}]},
                    'description': {'default': '', 'type': 'string'},
                    'freshness': {'$ref': '#/definitions/FreshnessThreshold',
                                  'default': {'error_after': None,
                                              'warn_after': None}},
                    'identifier': {'oneOf': [{'type': 'string'},
                                             {'type': 'null'}]},
                    'loaded_at_field': {'oneOf': [{'type': 'string'},
                                                  {'type': 'null'}]},
                    'name': {'type': 'string'},
                    'quoting': {'$ref': '#/definitions/Quoting',
                                'default': {'database': None,
                                            'identifier': None,
                                            'schema': None}},
                    'tests': {'oneOf': [{'items': {'oneOf': [{'type': 'object'},
                                                             {'type': 'string'}]},
                                         'type': 'array'},
                                        {'type': 'null'}]}},
     'required': ['name'],
     'type': 'object'}

On instance['tables'][0]:
    {'name': 'a', 'sql_table_name': 'raw.snowplow.event'}

Is there anyway to quiet this output down? The information here is correct (and appropriate for the debug logs) but definitely way to much to show in the stdout IMO!

@beckjake
Copy link
Contributor Author

@drewbanin I made changes to both hologram and this branch that make the message much quieter outside of debug level (where it is perhaps still too noisy, but stack traces are nice to have)

Here's what a warning looks like now:

Compilation warning: Invalid test config given in models/schema.yml @ models: {'name': 'x', 'columns': [{'name': 'a', 'tests': ['unique', 'not_null']}, {'name': 'b', 'tests': [['not_null']]}]} - at path []: ['not_null'] is not of type 'object'

The at path [] is a bit of cruft in this instance, but otherwise when you have {{ config(enabled='hello') }} the error message doesn't even tell you the type error is about the "enabled" key, which seems pretty important!

@drewbanin
Copy link
Contributor

@beckjake that message looks a lot better! I think there might be a quirk around the exception handling though -- when i run dbt with an invalid source (containing an additional property) I see an exception raised by validator_error_message:

  File "/Users/drew/fishtown/dbt/core/dbt/parser/schemas.py", line 536, in <genexpr>
    self.parse_source_entry(source, path, package_name, root_dir)
  File "/Users/drew/fishtown/dbt/core/dbt/parser/schemas.py", line 242, in _filter_validate
    msg = validator_error_message(exc)
  File "/Users/drew/fishtown/dbt/core/dbt/exceptions.py", line 16, in validator_error_message
    path = "[%s]" % "][".join(map(repr, exc.relative_path))
AttributeError: 'ValidationError' object has no attribute 'relative_path'

My schema.yml looks like:

version: 2

sources:
  - name: dbt_dbanin

    table:  # Should be `tables`
      - name: a

I see the same AttributeError for a lot of different types of .yml spec errors. Any idea what's up with that?

@drewbanin
Copy link
Contributor

Ignore this ^ - I was using an outdated version of hologram

@drewbanin
Copy link
Contributor

Can you just update the changelog to note that we removed the graph from the context? This LGTM - ship it!

Most of the things that previously used manually created jsonschemas
Split tests into their own node type
Change tests to reflect that tables require a freshness block
add a lot more debug-logging on exceptions
Make things that get passed to Var() tell it about their vars
finally make .empty a property
documentation resource type is now a property, not serialized
added a Mergeable helper mixin to perform simple merges
Convert some oneOf checks into if-else chains to get better errors
Add more tests
Use "Any" as value in type defs
 - accept the warning from hologram for now, PR out to suppress it
set default values for enabled/materialized
Clean up the Parsed/Compiled type hierarchy
Allow generic snapshot definitions
remove the "graph" entry in the context
 - This improves performance on large projects significantly
Update changelog to reflect removing graph
@beckjake beckjake force-pushed the feature/dataclasses-jsonschema-types branch from 22a480b to 49f7cf8 Compare July 16, 2019 16:00
@beckjake
Copy link
Contributor Author

That rebase was kind of a monster, but when tests pass I'll merge!

@beckjake beckjake merged commit 72afd76 into dev/louisa-may-alcott Jul 16, 2019
@beckjake beckjake deleted the feature/dataclasses-jsonschema-types branch July 16, 2019 17:06
@beckjake beckjake changed the title dataclasses-jsonschema types [WIP] dataclasses-jsonschema types Jul 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants