
Checkpointing in dbt #3891

Closed
fclesio opened this issue Sep 16, 2021 · 8 comments · Fixed by #4017
Labels: enhancement (New feature or request), state (Stateful selection: state:modified, defer)

Comments

@fclesio

fclesio commented Sep 16, 2021

Describe the feature

A checkpointing mechanism that can be used to start from a failure point when running a dbt run statement.

Describe alternatives you've considered

As I'm running dbt run via the CLI, I'm using some shell/Python scripts to create a checkpoint file (a txt file with the name of the last executed model), so that if the execution fails, I can re-run dbt run and it will start from the last executed model.
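For concreteness, a minimal sketch of that kind of wrapper script (the model list, file name, and resume logic below are purely illustrative, not part of dbt):

```python
# Minimal sketch of the checkpoint-file workaround described above.
# The model list, checkpoint file name, and resume logic are illustrative assumptions.
import subprocess
from pathlib import Path

MODELS = ["model_1", "model_2", "model_3"]      # ordered list of models to run (hypothetical)
CHECKPOINT = Path("last_successful_model.txt")  # hypothetical checkpoint file

def remaining_models():
    """Skip everything up to and including the last recorded success."""
    if CHECKPOINT.exists():
        last = CHECKPOINT.read_text().strip()
        if last in MODELS:
            return MODELS[MODELS.index(last) + 1:]
    return MODELS

for model in remaining_models():
    result = subprocess.run(["dbt", "run", "--select", model])
    if result.returncode != 0:
        raise SystemExit(f"dbt run failed at {model}; checkpoint kept for the retry")
    CHECKPOINT.write_text(model)                # record progress after each success

CHECKPOINT.unlink(missing_ok=True)              # full run succeeded: drop the checkpoint
```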

Additional context

Using the dbt run command, if any problem occurs we need to either (a) restart the execution manually from the model that failed, or (b) re-run the full execution from scratch.

Who will this benefit?

  • Teams that have a large number of models (in my case, more than 308) and suffer from network and/or database issues that can break an execution
  • Teams that want to reduce their recovery time when an outage happens during a dbt execution: instead of re-running dbt run again or running dbt run model_1, dbt run model_n manually, they can start from the last point before the interruption

Are you interested in contributing this feature?

Yes.

@fclesio fclesio added the enhancement and triage labels Sep 16, 2021
@jtcohen6
Contributor

@fclesio Agree big-time!

The way I've been thinking about this: dbt should be able to read the artifact from a prior run (namely run_results.json), and select all nodes with a certain status. That could look like dbt run --select result:error result:skipped result:failed, or it could be as simple as --select result:error+, since skipped nodes are always the downstream children of errored/failed nodes, and a failing (as opposed to errored-out) test likely requires additional work that a simple retry won't solve.
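As a rough illustration of the artifact-based idea (assuming the run_results.json layout of results[].unique_id and results[].status), the selection logic boils down to something like:

```python
# Rough sketch: read the prior run's artifact and collect the nodes whose
# status makes them worth retrying. Statuses follow run_results.json:
# success / error / fail / skipped (plus warn / pass for tests).
import json

with open("target/run_results.json") as f:
    prior_results = json.load(f)["results"]

retry_statuses = {"error", "fail", "skipped"}
retry_nodes = [r["unique_id"] for r in prior_results if r["status"] in retry_statuses]
print(retry_nodes)  # e.g. ["model.my_project.some_failed_model", ...]
```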

In order to make this work, we'd need to teach dbt about remote results, in the same way we taught dbt about remote manifests for the sake of state: comparison. There's thinking about this over in #2465, though that's now a fairly old issue. I think I'm going to close that issue, and open narrower issues targeting the outstanding use cases: selecting based on (a) past node results and (b) source freshness comparisons.

Does that artifact-based approach make sense to you? Is there anything you're thinking of that I'm missing?

@jtcohen6 jtcohen6 removed the triage label Sep 19, 2021
@fclesio
Author

fclesio commented Sep 21, 2021

@jtcohen6 thanks for the prompt response.

The main issue that I see with the dbt run --select result:error result:skipped result:failed statement is that it assumes that the retry will be triggered manually, while from an E(T)L/EL(T) perspective the checkpoint would be used in failure situations with some kind of automatic retry.

For instance: let's say that after executing dbt run for my 308 models, my job fails during the materialization of model 97. At the time of an automatic retry, the command would still be dbt run, because the job itself does not know what failed and what did not.

From the API perspective, I see something like dbt run --checkpoint-name run_results.json --checkpoint-utilization:true, because in this way we can ensure that:

  1. Whoever needs the checkpoint will explicitly put it in the statement;

  2.1) When the checkpoint is not chosen, dbt does not need to care about the remote states;

  2.2) Whoever chooses the checkpoint will know explicitly that it carries some toll in terms of execution time, due to the remote states;

  3. For job scheduling, a single statement (i.e., dbt run --checkpoint-name run_results.json --checkpoint-utilization:true) will cover all retries, instead of a command that assumes something has already failed (like dbt run --select result:error result:skipped result:failed). In other words, instead of a command that says "if something fails, run me with these explicit terms!", we can have a statement that says "store all states over time; if something fails and you decide to retry automatically, I know from which point I need to start over again" (see the sketch below).
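A sketch of what that single-statement retry means in practice (the --checkpoint-* flags below do not exist in dbt; they are only the options proposed in this comment):

```python
# Illustration of the proposal: the scheduler issues one constant command on
# every attempt, and dbt itself decides whether to resume from the checkpoint.
# The --checkpoint-* flags are hypothetical and exist only in this proposal.
import subprocess

SCHEDULED_COMMAND = [
    "dbt", "run",
    "--checkpoint-name", "run_results.json",
    "--checkpoint-utilization", "true",
]

for attempt in range(3):  # the scheduler's own retry policy
    if subprocess.run(SCHEDULED_COMMAND).returncode == 0:
        break
```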

I do not know if it's clear or not, but I'm happy to clarify more if needed.

@sungchun12
Contributor

I prefer the dbt run --select result:error+ semantics because it's consistent with how dbt users leverage similar syntax for Slim CI. Changing the syntax to use checkpoints would require dbt users to adopt another mental model for what's being run vs. not.

I can also imagine a combination like this: dbt run --select result:error+ state:modified+ that will provide a more coherent mental model for more nuanced runs.

@fclesio
Author

fclesio commented Sep 26, 2021

Sorry for the stubbornness, @sungchun12, but considering a scenario of an automatic retry that avoids re-running dbt run from scratch, shouldn't there implicitly be an option that establishes the checkpoint?

The idea is that for every execution I should run dbt run --select result:error+ regardless of the state, i.e. whether it failed previously or not, right? And in case of a failure where an automatic retry needs to take place, nothing would need to change. Is that correct on my side?

@jtcohen6
Contributor

@fclesio Thanks for the thoughtful responses! I think there's an important distinction here:

  • Automatic retry of a failed query: When a dbt- plugin receives an error from the database that it believes to be transient, and conducive to successful retry, there should be a configuration option such that it retries the specific query. This exists already in dbt-bigquery + dbt-spark, and there's a proposal to add it on Postgres/Redshift as well: Retry on transient errors #3303. The real limitation here is the level of clarity and documentation from the database and its python client.
  • Manual retry of a failed job: A pipeline has failed for any of several reasons. A model has a syntax error, a test found null records where we didn't expect any, a source dropped a column without warning, a permission was accidentally revoked, etc. In this case, automatically re-running all the same queries would achieve little.

I took the thrust of this issue to mean the latter more than the former. If a specific model fails because its query timed out, or ran into intermittent network and/or database issues, the right answer feels like automatic retry—based on logic coded into the database adapter that checks for the timeout/transient error code. If that's more the thing you're after, I think #3303 is the right place to continue that conversation.
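For illustration, the adapter-level behavior amounts to something like the conceptual sketch below; this is not dbt's actual adapter code, and the error classification and backoff policy here are placeholder assumptions:

```python
# Conceptual sketch of query-level retry on transient errors. Not dbt's actual
# implementation: the error classification and backoff policy are assumptions.
import time

TRANSIENT_HINTS = ("connection reset", "timeout", "temporarily unavailable")

def run_with_retry(execute_query, sql, max_retries=2, backoff_seconds=5):
    for attempt in range(max_retries + 1):
        try:
            return execute_query(sql)
        except Exception as exc:  # a real adapter would catch its driver's error types
            transient = any(hint in str(exc).lower() for hint in TRANSIENT_HINTS)
            if not transient or attempt == max_retries:
                raise
            time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff
```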

I'd be happy to open a separate issue for "smart manual retry," which is the result:error+ syntax described above. I do think orchestration processes that wrap dbt Core could offer UX that makes the restart process pretty seamless: a click of a button, "Retry this job," that picks up from the failed run step with result:error+. Or better yet, to Sung's point, if you've needed to push code changes to account for an unexpected model/test/etc failure: result:error+ state:modified+.
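For example, a "Retry this job" step in an orchestrator could look roughly like the sketch below, using the experimental result: selector together with the existing --state flag (paths are illustrative, and the selector syntax may still change):

```python
# Sketch of an orchestration-level retry built on the experimental result:
# selector and the existing --state flag. Paths here are illustrative.
import shutil
import subprocess

first = subprocess.run(["dbt", "run"])
if first.returncode != 0:
    # Keep the failed run's artifacts as the comparison state for selection...
    shutil.copytree("target", "prev_run_state", dirs_exist_ok=True)
    # ...then re-run only errored nodes and everything downstream of them.
    subprocess.run(
        ["dbt", "run", "--select", "result:error+", "--state", "prev_run_state"],
        check=True,
    )
```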

@matt-winkler @sungchun12 To that end: I'd be thrilled to work with both of you on making this happen. The state: selection set a lot of the necessary groundwork here; I was able to get an experimental version of dbt run -s result:<status> --state <state-path> up and running with only a handful of code: experiment/previous-result-selection.

One question: Should result:<status> be the only result-based selection criteria, or should we anticipate others and name it result.status:<status> instead? I'm actually leaning toward the former. A year ago, I'd thought about timing (result.timing:>600), but I don't think that's really useful here—it's valuable metadata, just not as proactive selection criteria for a new run. Same goes for specific failure counts for tests (result.failures:>500); I think users should just use severity + thresholds (warn_if + error_if) to summarize up those details into specific result statuses (result:warn, result:error). Happy to hear disagreement, though!

@sungchun12
Contributor

@fclesio I don't see your response as stubborn at all! I welcome it, and you raise a valid point with retrying dbt run regardless of state, which is very much aligned with @jtcohen6's mention of #3303 for transient errors that benefit from automatic retries. Based on the latest comment, this can be a both/and solution, not either/or.

I expect the best of both worlds, even with implicit logic for transient errors. Given this paradigm already exists for specific database adapters, we shouldn't need an explicit flag like: dbt --retry-on-transient-errors run or dbt run --checkpoint-name.

For result-based selection criteria, I recommend limiting it to (result:warn, result:error). In conversations and across the public Slack, I have not heard users express a strong desire for rerunning based on timing. Today, users can do that through tags if needed.

@jtcohen6 I would love to help out and branch off your experiment branch. @fclesio @matt-winkler, I would love to work with you too to make this a reality. @fclesio, I understand if you want to focus on the transient error checkpoint logic for your specific database instead. What database are you using today for dbt?

@fclesio
Author

fclesio commented Sep 29, 2021

What database are you using today for dbt?

@sungchun12 I'm using Redshift.

@sungchun12
Contributor

@fclesio, cool! When we have a working pull request, we'd love for you to test drive it!
