Implement partial parsing #1600

Closed
drewbanin opened this issue Jul 11, 2019 · 0 comments

Feature

Feature description

dbt should implement "partial parsing." If a node is present in the target/manifest.json file, then dbt should compare the hash of the node from the manifest to the hash of the file on disk. If the hashes match, then dbt should populate a ParsedNode (or equivalent) object from the already-parsed artifact. This will bypass the process of parsing nodes from the filesystem on every run.
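
The check described above could be sketched roughly as follows. This is a minimal illustration, not dbt's implementation: the `checksum` and `original_file_path` keys on each cached node, the `partial_parse` helper, and the `parse_file` callback are all assumptions made for the example, and the real manifest layout may differ.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Hash the raw bytes of a file; this should be far cheaper than parsing it."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cached_nodes(manifest_path: Path) -> dict:
    """Return cached nodes keyed by source file path (hypothetical layout)."""
    if not manifest_path.exists():
        return {}
    manifest = json.loads(manifest_path.read_text())
    return {n["original_file_path"]: n for n in manifest.get("nodes", {}).values()}


def partial_parse(paths, manifest_path, parse_file):
    """Reuse cached nodes whose checksum still matches; reparse the rest."""
    cached = load_cached_nodes(manifest_path)
    nodes, reparsed = {}, []
    for path in paths:
        checksum = file_checksum(path)
        hit = cached.get(str(path))
        if hit is not None and hit.get("checksum") == checksum:
            # Unchanged file: take the already-parsed node from the artifact.
            nodes[str(path)] = hit
        else:
            # New or modified file: fall back to the slow parser.
            node = parse_file(path)
            node["checksum"] = checksum
            node["original_file_path"] = str(path)
            nodes[str(path)] = node
            reparsed.append(str(path))
    return nodes, reparsed
```

On a second run with one edited file, only that file would be handed to `parse_file`; everything else comes straight from the cached artifact.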

Assumptions this is based on:

  • parsing models, macros, etc. is very slow compared to hashing the raw contents of a file
  • dbt can readily and accurately deserialize objects from the manifest
  • this deserialized node should exactly match the version of the node dbt would build if it parsed the file from disk directly
  • dbt can identify when file diffs are "nonlocal":
    • changes to dbt_project.yml, profiles.yml, and macros can have nonlocal effects on the nature of other parsed nodes. To what extent do we need to account for this in our approach?
    • changes to --target, --profile, --vars, and environment variables can all conceivably change the nature of any parsed node in the project
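
One conservative way to handle the nonlocal inputs listed above is to fold them all into a single fingerprint and fall back to a full parse whenever it changes. The sketch below is illustrative only: `global_fingerprint` and `must_full_reparse` are hypothetical names, and it assumes dbt can enumerate the project-level files and the environment variable names a project reads.

```python
import hashlib
import json
import os
from pathlib import Path


def global_fingerprint(project_files, cli_args, env_var_names):
    """Hash every input whose change could alter any parsed node."""
    h = hashlib.sha256()
    # Project-wide files, e.g. dbt_project.yml, profiles.yml, macro files.
    for path in sorted(map(str, project_files)):
        h.update(path.encode())
        h.update(Path(path).read_bytes())
    # CLI inputs such as --target, --profile and --vars, as a plain dict.
    h.update(json.dumps(cli_args, sort_keys=True).encode())
    # Only the environment variables the project is known to read.
    for name in sorted(env_var_names):
        h.update(f"{name}={os.environ.get(name, '')}".encode())
    return h.hexdigest()


def must_full_reparse(saved_fingerprint, current_fingerprint):
    """Any difference invalidates the whole cache in this simple scheme."""
    return saved_fingerprint != current_fingerprint
```

This trades precision for safety: a one-line macro edit would force a full reparse even if only a few nodes use that macro.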

Nonlocal node diffs

dbt records the following pieces of information during parsing:

  1. ref() calls
  2. source() calls
  3. doc() calls
  4. config() calls

The big thing to be aware of here is the config() calls. It's less common for users to change the shape of their graph (i.e., select from different nodes) in response to externally provided vars. We can conceivably detect and fail when this happens - it's an acceptable constraint for the dbt graph to be "static" in nature IMO.
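
As a rough sketch of the "detect and fail" idea, the recorded ref() and source() calls of a cached node could be compared against those of a freshly rendered node. The names here are hypothetical, and the example assumes `refs` and `sources` are stored as lists of string lists (for example `[["my_model"]]`), which may not match the real node schema.

```python
class GraphShapeChanged(Exception):
    """Raised when a node's dependencies differ from the cached ones."""


def check_static_graph(cached_node: dict, fresh_node: dict) -> None:
    """Fail if the ref()/source() calls recorded for a node have changed."""
    for key in ("refs", "sources"):
        # Order of calls is irrelevant to graph shape, so compare as sets.
        before = {tuple(x) for x in cached_node.get(key, [])}
        after = {tuple(x) for x in fresh_node.get(key, [])}
        if before != after:
            node_id = cached_node.get("unique_id", "<unknown>")
            raise GraphShapeChanged(
                f"{key} of {node_id} changed: {sorted(before)} -> {sorted(after)}"
            )
```

A check like this still needs a freshly rendered node to compare against, so it fits better as a validation step (for instance in tests or an opt-in strict mode) than on the fast path.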

Configs are more problematic: it's pretty common (and frequently desirable) to switch model materializations, run certain hooks, and otherwise supply differing config values to nodes in response to externally supplied variables (or, the result of a call to some macro). Is it possible to delay config rendering until runtime? One challenge is that enabled and materialization configs (if ephemeral) affect the compiled nature of other nodes - are there any other such configs?
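
To make the "delay config rendering" question concrete, here is a small sketch that splits a node's config into values needed at parse time and values that can wait until runtime. It uses plain Python callables in place of Jinja expressions, and `PARSE_TIME_KEYS`, `split_config`, and `render_deferred` are hypothetical names. The key set covers only the two configs named above and may well be incomplete.

```python
# Configs named above as affecting the compiled nature of other nodes.
# Treated here as "must be known at parse time"; this list is an assumption.
PARSE_TIME_KEYS = {"enabled", "materialized"}


def split_config(raw_config: dict):
    """Separate configs needed to build the graph from those that can wait."""
    eager = {k: v for k, v in raw_config.items() if k in PARSE_TIME_KEYS}
    deferred = {k: v for k, v in raw_config.items() if k not in PARSE_TIME_KEYS}
    return eager, deferred


def render_deferred(deferred: dict, context: dict) -> dict:
    """Evaluate deferred values at runtime; callables stand in for Jinja."""
    return {k: (v(context) if callable(v) else v) for k, v in deferred.items()}
```

For example, a hook that depends on a runtime variable would stay unrendered in the cached node and only be resolved when the node runs:

```python
eager, deferred = split_config({
    "materialized": "table",
    "post_hook": lambda ctx: f"grant select to {ctx['role']}",
})
rendered = render_deferred(deferred, {"role": "reporter"})
```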

Order of operations:

  1. Let's try to MVP this to determine what the speedup of implementing partial parsing would look like. @beckjake I believe you already did some work on this front, but I can't seem to find it. Do you remember where that is?
  2. Clearly define the expected rules around partial parsing. Which types of file changes (or environmental factors) necessitate a full reparsing of the project?
  3. Actual implementation