Execute-only invocations #3360

Closed
jtcohen6 opened this issue May 16, 2021 · 6 comments
Labels
enhancement (New feature or request), performance

Comments

@jtcohen6
Contributor

jtcohen6 commented May 16, 2021

We're making a number of improvements to partial parsing (#3217). One of the goals we're targeting is very quick project mise-en-place (<5s) if no files have changed, regardless of project size.

A natural extension of this functionality is removing the need for a file system altogether, and supporting "execute-only" invocations which can take, as their only input, the partial-parse save-state / internal manifest (i.e. partial_parse.msgpack).

We're thinking that this functionality would:

  • Speed up deployments that require invoking dbt many times, without any changes to files
  • Support remote interactions in development

The scope of this issue is to determine whether our partial-parsing logic could reasonably support this workflow. If there are changes we need to make, we should consider making them ahead of v1.0.

Questions

  • If no file system is present, including no dbt_project.yml, how do users define profile_name + target? Would these need to be passed as flags / env vars?
  • Could the input to "execute-only" invocations be even more concise than the partial-parse save state? The files object wouldn't be needed, since we're not comparing against a file system
  • Would it be possible to separate entirely the parsing of a project from the adapter/target-specific details of its execution? This would be tricky for adapter-specific configs and target Jinja variables
  • Could this work with stateful dbt features, e.g. state:modified and --defer, by passing partial_parse.msgpack (current state) alongside manifest.json (previous state)?
  • Database state is a crucial input that exists outside of dbt. dbt handles this currently by running metadata queries to populate an adapter cache at the start of each invocation. If we're considering the use case of many "execute-only" invocations run in serial, should we think about "persisting" the adapter cache across invocations? Could this be persisted in memory (RPC server), or read from an artifact (catalog.json)? This is likely out of scope for the current issue, but I definitely want to think more about it
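
To make the first question concrete, here is a minimal sketch of resolving profile_name + target when no dbt_project.yml is available: explicit flag values win, and environment variables are the fallback. This is not dbt's implementation. The names resolve_execution_target, EXEC_ONLY_PROFILE, and EXEC_ONLY_TARGET are hypothetical and exist only for this illustration.

```python
import os
from typing import Mapping, NamedTuple, Optional


class ExecutionTarget(NamedTuple):
    profile_name: str
    target: str


def resolve_execution_target(
    flag_profile: Optional[str],
    flag_target: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> ExecutionTarget:
    """Pick profile and target from flags first, then from env vars.

    EXEC_ONLY_PROFILE and EXEC_ONLY_TARGET are hypothetical variable
    names invented for this sketch; they are not real dbt settings.
    """
    profile = flag_profile or env.get("EXEC_ONLY_PROFILE")
    target = flag_target or env.get("EXEC_ONLY_TARGET")

    # With no project file to fall back on, fail loudly if either is missing.
    missing = [
        name
        for name, value in (("profile", profile), ("target", target))
        if not value
    ]
    if missing:
        raise ValueError("cannot resolve: " + ", ".join(missing))
    return ExecutionTarget(profile, target)
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.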
jtcohen6 added the enhancement, performance, and 1.0.0 labels May 16, 2021
@gshank
Contributor

gshank commented Jun 4, 2021

In theory we might be able to load a manifest.json file to use as a manifest, since most of the issues with doing that have been hammered out with the msgpack serialization project. It would be interesting to see what the difference is between a manifest state file with and without the files dictionary. The files dictionary drives partial parsing, so a manifest state file without it could only do load-and-go.
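
One rough way to run that comparison is sketched below. It assumes the saved state can be treated as a mapping with a top-level "files" entry, which is an assumption of this example rather than a description of the real layout. JSON stands in for msgpack so the sketch needs only the standard library, and toy_state is invented sample data.

```python
import json
from typing import Any, Dict


def strip_files(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of the state without the top-level 'files' entry."""
    return {key: value for key, value in state.items() if key != "files"}


def serialized_size(state: Dict[str, Any]) -> int:
    """Size in bytes of the state when serialized (JSON here, not msgpack)."""
    return len(json.dumps(state, sort_keys=True).encode("utf-8"))


# Invented sample data; the real state has a different and much larger shape.
toy_state = {
    "nodes": {"model.demo.orders": {"name": "orders"}},
    "files": {"demo://models/orders.sql": {"checksum": "abc123"}},
}

load_and_go_state = strip_files(toy_state)
size_with_files = serialized_size(toy_state)
size_without_files = serialized_size(load_and_go_state)
```

The stripped copy can only support load-and-go, since nothing is left to compare against the file system.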

The partial_parse file and manifest.json are not necessarily current and previous state. When parsing starts they're both previous state. When parsing ends, they're both current state.

One thing we might want to do is remove all the references to absolute file paths in the various nodes and instead store that info only in the project file, as a step in the direction of reducing our ties to the file system. The 'file_id' that was introduced in the partial parsing work was a first step.
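
As a sketch of what that could look like, the function below rewrites absolute paths on node records so they are relative to a single project root stored elsewhere. The node layout and the "abs_path" key are hypothetical names chosen for this example, not dbt's actual node fields.

```python
from pathlib import PurePosixPath
from typing import Any, Dict


def relativize_node_paths(
    nodes: Dict[str, Dict[str, Any]],
    project_root: str,
    path_key: str = "abs_path",  # hypothetical field name for this sketch
) -> Dict[str, Dict[str, Any]]:
    """Return copies of the nodes with paths made relative to project_root.

    Paths that are not under project_root are left unchanged.
    """
    root = PurePosixPath(project_root)
    result: Dict[str, Dict[str, Any]] = {}
    for node_id, node in nodes.items():
        updated = dict(node)
        raw = node.get(path_key)
        if raw is not None:
            path = PurePosixPath(raw)
            if path.is_absolute():
                try:
                    updated[path_key] = str(path.relative_to(root))
                except ValueError:
                    # Not inside the project root; keep the original value.
                    pass
        result[node_id] = updated
    return result
```

With the root kept in one place, the same node records could be reused on a machine where the project lives in a different directory.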

@leahwicz
Contributor

leahwicz commented Jun 4, 2021

Goal: an even faster partial parsing (no files changed so skip file system) -> don't even bother reading the files (this is not a 1.0 issue)

If we don't have a clear picture of what client/server will look like in the future, we failed this ticket (this is a 1.0 issue)
-> Let's focus on this for now and make it a spike. Need further discussion and details here

Open Questions:

  • Options: still has to be a profile in places OR nothing in the execution place -> even in scope?

@leahwicz
Contributor

leahwicz commented Jun 7, 2021

Created issue for the spike: #3437

@jtcohen6 I'm removing the 1.0 label and adding it to the spike instead

leahwicz removed the 1.0.0 label Jun 7, 2021
@leahwicz
Contributor

leahwicz commented Jul 8, 2021

  • We would need to cache in 3-4 places for this, and it won't be easy
  • If we pre-cache for adapters, caching and handling the impact that env/etc. have on execution becomes much more complex
  • Bare min version (main manifest creation parsing) wouldn't be bad (would be like a weekish)
  • If we did everything in this ticket it would be a lot
  • Some of this would be covered in client/server
  • This should be split up into more tickets

@jtcohen6
Contributor Author

jtcohen6 commented Jul 9, 2021

> Some of this would be covered in client/server

For right now, I'm only interested in the pieces for this that are prerequisites to client/server. It's likely that we'll want to further delineate parsing and execution, but our current plan for client/server does not require the complete separation that this ticket originally envisioned.

@jtcohen6
Contributor Author

jtcohen6 commented Sep 1, 2021

I'm glad we had the conversation above; our thinking here has developed significantly since. I'm going to close this issue for the time being, but this isn't the last of "msgpack-only execution."
