[help] What happens with errors? #1310

hadley · 2024-07-29T21:16:11Z

hadley
Jul 29, 2024

Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

I've read https://books.ropensci.org/targets/debugging.html, but I'm looking for a higher-level description of how errors affect the computation of the reactive graph.

wlandau · 2024-07-30T16:59:26Z

wlandau
Jul 30, 2024
Maintainer

When a target throws an error, the default behavior is to stop the whole pipeline immediately. Other choices are available, either globally through the error argument of tar_option_set(), or on a target-by-target basis through the same argument in tar_target(). If error = "continue", then the rest of the pipeline keeps going in spite of the error, and downstream targets use whatever value was last returned from that target in a previous run. error = "null" is the same as error = "continue", but the target that errored returns a value of NULL to make sure downstream targets have at least something they can use. error = "abridge" keeps existing targets running but does not dispatch any new ones.
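As a rough illustration of the two places the error argument can go, here is a minimal `_targets.R` sketch. The functions `read_data()` and `fit_model()` are hypothetical placeholders, not part of targets.

```r
# _targets.R (sketch): `read_data()` and `fit_model()` are hypothetical
# user-defined functions, assumed to exist elsewhere in the project.
library(targets)

# Global default for every target in the pipeline.
tar_option_set(error = "null")

list(
  # Inherits error = "null" from tar_option_set().
  tar_target(data, read_data()),
  # Overrides the global setting for this one target.
  tar_target(model, fit_model(data), error = "stop")
)
```

With this setup, an error in `data` would store NULL and let the pipeline continue, while an error in `model` would stop the whole pipeline.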

0 replies

hadley · 2024-07-30T18:59:42Z

hadley
Jul 30, 2024
Author

Can you help me understand why the default behaviour is to stop running the whole pipeline rather than just the steps that are downstream of the error?

5 replies

wlandau Jul 31, 2024
Maintainer

That's an interesting idea I had not considered before. For certain types of pipelines, it would allow tar_make() to accomplish more work per run. Advanced users might appreciate having more choices like this one for the error option.

But in practice, most errors originate from upstream user-defined functions shared by many targets, and a pipeline could run away with errors on hundreds of embarrassingly parallel targets it should not be attempting. As an alternative, maybe it would be better to stop running (or avoid starting) anything downstream of any function that an errored target depends on. But this seems complicated and counterintuitive for new users to understand.

targets is already a big and daunting tool to learn, and I would prefer to avoid defaults that surprise people at first glance. The current behavior is at least easy to understand without checking the dependency graph.

hadley Jul 31, 2024
Author

My intuition would be that it's easier to stick with a model that folks probably already understand, i.e. shiny. To me, introducing a new model of reactivity feels like adding additional complexity, but maybe you don't see targets as a reactive execution framework?

wlandau Jul 31, 2024
Maintainer

Although I admire Shiny's reactivity model, the intended user experience of targets is a bit different. It focuses on project-oriented analysis workflows that would otherwise take the form of one or more R scripts or Quarto reports. The goal is to look and feel like you're writing a function-driven scripted analysis, but add Make-like automation and reproducibility. I have found this approach natural for my dissertation project in grad school and for Bayesian modeling / clinical trial simulation pipelines at work.

If I write a script for a statistical analysis, I usually expect it to stop as soon as it hits an error. I think I took this for granted as part of that intended look and feel.

But internally, targets is reactive and event-driven, both for serial computing and for distributed computing over a local network. While the pipeline waits for tasks to complete, the main local process sleeps and listens to an NNG condition variable provided by mirai/nanonext. When a target completes, it signals that condition variable over the network and wakes up the main process so the pipeline can dispatch any downstream targets that are now possible to start.
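The "dispatch whatever is now possible to start" step can be sketched in base R. This is only a sequential toy under my own assumptions (a made-up four-target graph, no workers, no NNG condition variable), not how targets is implemented internally.

```r
# Toy dependency graph: each name maps to the targets it depends on.
deps <- list(
  data = character(0),
  model = "data",
  plot = "data",
  report = c("model", "plot")
)

run_pipeline <- function(deps) {
  done <- character(0)
  # Targets with no dependencies can be dispatched right away.
  queue <- names(deps)[lengths(deps) == 0]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    # "Completion event": record the target as finished.
    done <- c(done, current)
    # On each event, look for targets whose dependencies are all done.
    waiting <- setdiff(names(deps), c(done, queue))
    is_ready <- vapply(deps[waiting], function(d) all(d %in% done), logical(1))
    queue <- c(queue, waiting[is_ready])
  }
  done
}

dispatch_order <- run_pipeline(deps)
print(dispatch_order)
```

In this toy, `report` is only dispatched once both `model` and `plot` have completed.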

hadley Jul 31, 2024
Author

I understand that the fundamental feel is different but you don't think folks are going to see the similarities between shiny (user touches UI and wants minimal recomputation to update visible outputs) and targets (user touches code/data and wants minimal recomputation to update artefacts)?

wlandau Jul 31, 2024
Maintainer

Yes, I have heard folks like Carson make that exact comparison before. And I agree, there is value in adding a new error option (error = "reactive"?) which aligns with that expectation.

But at this point, I am not sure what the default should be. Another big difference between Shiny and targets is the computational burden of a task in a typical use case. Not all targets pipelines are computationally demanding, but targets is designed for this sort of work, and it is full of use cases with heavy-duty modeling (Bayesian, geospatial, etc.) on high-performance computing systems. If, as often happens, there is a bug in a function (especially one from a package) that affects multiple targets, it is helpful to stop the whole pipeline to avoid consuming resources that will result in failure anyway. (I admit, the default error = "stop" is pretty blunt for this.) More often than not, I think Shiny operates in scenarios where it is less expensive to continue along the DAG.

hadley · 2024-07-31T20:26:13Z

hadley
Jul 31, 2024
Author

Just to make sure we're talking about the same thing, I've drawn a little diagram:

(Apologies for not matching the conventions of targets, which I'm not that familiar with, but hopefully you get the gist).

If the red circle is a failure, I'm suggesting that the other branches in A would be cancelled, and B certainly wouldn't be run, but C and D would still run. I think what you're saying is that once the red circle errors, everything stops? Is that correct?

What happens if A contains 100 targets, and the first 99 succeed and only the 100th fails? What happens to the results of the 99 jobs that ran successfully?

What happens if it fails on the 10th job, and there are 5 other jobs running in parallel at the same time. Are they all automatically stopped? What happens to their results?
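To make the two policies concrete, here is a small base R sketch. The graph is a hypothetical stand-in for the diagram (A1 to A3 are branches of A, B depends on all of them, D depends on C), and the function only answers "which targets may still start after the failure", ignoring what is already running or finished.

```r
# Hypothetical graph: each name maps to the nodes it depends on.
deps <- list(
  A1 = character(0), A2 = character(0), A3 = character(0),
  B = c("A1", "A2", "A3"),
  C = character(0),
  D = "C"
)
# Which group (think: one tar_target() call) each node belongs to.
group <- c(A1 = "A", A2 = "A", A3 = "A", B = "B", C = "C", D = "D")

# All nodes reachable downstream of `node`.
downstream <- function(node, deps) {
  found <- character(0)
  frontier <- node
  while (length(frontier) > 0) {
    is_child <- vapply(deps, function(d) any(frontier %in% d), logical(1))
    frontier <- setdiff(names(deps)[is_child], found)
    found <- c(found, frontier)
  }
  found
}

# Nodes still allowed to start after `failed` errors.
# policy = "stop": nothing else starts.
# any other policy: skip only siblings and downstream nodes.
allowed_after_error <- function(failed, deps, group, policy) {
  if (policy == "stop") {
    return(character(0))
  }
  siblings <- names(group)[group == group[[failed]]]
  setdiff(names(deps), c(failed, siblings, downstream(failed, deps)))
}

stop_result <- allowed_after_error("A2", deps, group, "stop")
proposed_result <- allowed_after_error("A2", deps, group, "proposed")
print(proposed_result)
```

Under the "stop" policy nothing further starts, while under the proposed policy only `C` and `D` remain eligible.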

6 replies

wlandau Aug 1, 2024
Maintainer

I think what you're saying is that once the red circle errors, everything stops? Is that correct?

Yes, that's how the default error = "stop" works.

What happens if A contains 100 targets, and the first 99 succeed and only the 100th fails? What happens to the results of the 99 jobs that ran successfully?

Each of the 99 successes is already stored, either locally (e.g. in _targets/objects/) or on the cloud. Each of those branches can be retrieved with tar_read(), e.g. tar_read(A, branches = seq_len(99)). A second tar_make() will skip the 99 successes but probably fail on the 100th branch, and any change to the code would invalidate all branches. To more easily work with all 100, a tar_make() with tar_option_set(error = "null") will store NULL for the 100th branch and should allow a straightforward tar_read(A) to read everything.

What happens if it fails on the 10th job, and there are 5 other jobs running in parallel at the same time. Are they all automatically stopped? What happens to their results?

For error = "stop", the whole pipeline stops, which does try to terminate any parallel workers and stop the 5 jobs currently running. Error info is saved for that 10th failed job, but nothing is recorded for the other 5.

wlandau Aug 1, 2024
Maintainer

As a first step, I just implemented a new cancel() method in crew: https://wlandau.github.io/crew/reference/crew_class_controller.html#method-crew_class_controller-cancel

shikokuchuo Aug 1, 2024

A good first step - but just as a reminder, actually stopping individual parallel tasks will require the implementation of shikokuchuo/mirai#112. This is something that is missing from R in general, not just mirai. Having said that, for crew on a controller level, it should be easy for you to either reset all daemons, or else call saisei(force = TRUE) on the 'busy' ones. This would actually terminate them.

wlandau Aug 1, 2024
Maintainer

A good first step - but just as a reminder, actually stopping individual parallel tasks will require the implementation of shikokuchuo/mirai#112. This is something that is missing from R in general, not just mirai.

Ah, so stop_mirai() just prevents dispatch (if possible) and ignores any results?

Having said that, for crew on a controller level, it should be easy for you to either reset all daemons, or else call saisei(force = TRUE) on the 'busy' ones. This would actually terminate them.

The tricky part is that at the level of crew, I do not know which tasks were sent to which workers. And in a persistent worker scenario, there is friction in taking down and restarting a whole worker.

shikokuchuo Aug 2, 2024

stop_mirai() stops everything related to that task in the current process - but in the connected remote process, the main R thread can be in the middle of executing user code, and hence that continues.

This got me thinking in any case and I figured that it's all possible, especially with the ability now to work at a lower level. So you'll be glad to know I've taken this on shikokuchuo/mirai#112 (comment) so eventually you'll have this ability to stop particular tasks. I don't have a timeline for this yet though.

MilesMcBain · 2024-08-04T04:08:50Z

MilesMcBain
Aug 4, 2024

Hello Friends,

As a long time {drake}/{targets} user I just want to throw my emphatic support behind the default error behaviour of stopping all processing.

I think it's the most ergonomic option for a pipeline in development. The main reason why this is true is that when an error happens in development there is no guarantee that only downstream targets of the error will be affected by a fix.

The root of the error could originate anywhere upstream due to faulty data or assumptions, and the resolution could affect any number of targets anywhere in the plan, since the plan code and/or data flows may need to be refactored in the fix.

If the default was changed to this "reactive" mode (in discussion) I suspect most developers would just mash Ctrl-C as soon as an error is encountered anyway for these reasons.

My intuition would be that it's easier to stick with a model that folks probably already understand, i.e. shiny. To me, introducing a new model of reactivity feels like adding additional complexity, but maybe you don't see targets as a reactive execution framework?

It's a strong assumption that people coming to {targets} would already be familiar with Shiny, or the detail of its reactive underpinnings.

But here's my take on why the two contexts are fundamentally different:

In Shiny:

When you have a failure it makes sense to keep the reactive graph alive and process it as far as you can, because this will likely result in an error message propagating somewhere onto the GUI where a developer can see it. This gives the developer insight into what part of the application the error may lie in. This knowledge is important because recreating the state of a Shiny reactive graph is hard, so knowing approximately where to place browser() calls to initiate live debugging of the app speeds things up.
Computing the reactive graph around the error is almost certainly cheap because you're a GUI App, and an expensive graph won't make for a usable GUI.

In Targets:

Interactively recreating the state of the graph for the node that had the error is trivial. You read the dependencies out of the cache. The 'workspace' concept makes this incredibly ergonomic. It's one of the reasons targets is so powerful, and accelerates development so much.
Computing the plan graph around the error may be cheap, but it probably isn't. Targets is a tool people turn to for managing large workloads.
As we've said there are no guarantees that targets that aren't direct ancestors or descendants of the node that had the error won't be affected by a potential fix. For example if an error arises due to a faulty data input, targets in a parallel processing stream that descends from the data will be invalidated when we modify the data.
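For anyone who has not used workspaces, a minimal `_targets.R` sketch of the setup might look like this. `clean_data()` and `run_analysis()` are hypothetical functions, and the target names are made up for illustration.

```r
# _targets.R (sketch): `clean_data()` and `run_analysis()` are hypothetical
# user-defined functions, assumed to exist elsewhere in the project.
library(targets)

# Save a workspace for any target that throws an error.
tar_option_set(workspace_on_error = TRUE)

list(
  tar_target(data, clean_data()),
  tar_target(analysis, run_analysis(data))
)

# Later, in an interactive session, after `analysis` has errored:
# targets::tar_workspace(analysis)  # restore the target's saved state
# run_analysis(data)                # rerun the failing command by hand
```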

It's nice to have the option to keep processing, and I think the current "null" option already gets you quite close to the "reactive" mode, at least in terms of maximising valid work done.

This is just my personal experience, but in all my years of R in prod with these frameworks I have turned on the option to keep processing just once.

I think part of this is the way I design prod pipelines:

There are 'anticipated failures' that have handling baked in. I handle the recovery of these errors such that the targets framework never sees or cares about the fact they occur.
'Unanticipated failures' are considered catastrophic and probably won't lead to valid outputs. There might be some savings made in carrying on, but in my personal experience they haven't been significant enough to motivate me to utilise that option. It's been more important to know about the failure as soon as possible.
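One way to read the anticipated/unanticipated split is sketched below in base R. The `timeout_error` condition class and `fetch_records()` are hypothetical names I made up for illustration. The wrapper recovers only from the anticipated class, so anything else still propagates as a real error.

```r
# Hypothetical classed condition for a failure we expect to see sometimes.
timeout_error <- function(message) {
  structure(
    class = c("timeout_error", "error", "condition"),
    list(message = message, call = NULL)
  )
}

# Hypothetical step with one anticipated and one unanticipated failure mode.
fetch_records <- function(source) {
  if (identical(source, "flaky")) stop(timeout_error("connection timed out"))
  if (identical(source, "broken")) stop("unexpected schema")
  data.frame(id = 1:3, source = source)
}

# Recover from the anticipated failure inside the function, so the
# pipeline framework never sees it. Other errors are not caught here.
fetch_records_safely <- function(source) {
  tryCatch(
    fetch_records(source),
    timeout_error = function(e) {
      data.frame(id = integer(0), source = character(0))
    }
  )
}

ok <- fetch_records_safely("stable")
fallback <- fetch_records_safely("flaky")
# The outer tryCatch() only stands in for the pipeline seeing the error.
unanticipated <- tryCatch(
  fetch_records_safely("broken"),
  error = function(e) conditionMessage(e)
)
```

A target that calls `fetch_records_safely()` would then only fail on the unanticipated case.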

Great discussion!

7 replies

hadley Aug 5, 2024
Author

In the example in the targets book, it would be super cool if instead of:

#> Error:
#> ! Error running targets::tar_make()
#>   Target errors: targets::tar_meta(fields = error, complete_only = TRUE)
#>   Tips: https://books.ropensci.org/targets/debugging.html
#>   Last error: missing values in object

You got:

#> Error:
#> ! Error running targets::tar_make()
#>   Target errors: targets::tar_meta(fields = error, complete_only = TRUE)
#>   Reproduce cached state with targets::tar_workspace(analysis_02de2921)

Where you could click on the code to immediately run it in your current environment.

wlandau Aug 5, 2024
Maintainer

Can you expand on this because that feels super worrying to me — if your pipeline is not capturing the dependencies between the steps how can you trust it? But I can't believe that you'd create pipelines with implicit dependencies across the graph, so I must be misunderstanding something.

When it comes to capturing dependencies, targets restricts itself to all the immutable dependencies that a typical user controls: the R command and return value of each step in the pipeline, any files/directories declared with format = "file", in-memory global objects, and in-memory functions. For the latter two, targets limits itself to scanning tar_option_get("envir") in the isolated callr process that sources _targets.R. I would say this includes any potentially faulty upstream data.

But a different example for @MilesMcBain's more general point: installed packages are not tracked by default. It would be an unmanageable Pandora's box: enormous cyclic dependency graphs, S3 methods that static code analysis can't detect, compiled code, constantly changing unexported functions, and files in inst/. Tracking packaged namespaces is an opt-in feature powered by tar_option_set(imports = c("...")). renv is almost always better for package reproducibility, and it works super well with targets.
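A minimal sketch of that opt-in, assuming a hypothetical package called `mypackage` that exports `load_raw()` and `analyze_data()`:

```r
# _targets.R (sketch): `mypackage`, `load_raw()`, and `analyze_data()`
# are hypothetical names used only for illustration.
library(targets)

tar_option_set(
  packages = "mypackage", # load the package when targets run
  imports = "mypackage"   # also track objects in its namespace
)

list(
  tar_target(raw, load_raw()),
  tar_target(result, analyze_data(raw))
)
```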

So targets does not track everything. (But it tries to be as clear as possible where it draws the line, via tar_visnetwork() etc.)

wlandau Aug 5, 2024
Maintainer

targets restricts itself to all the immutable dependencies that a typical user controls

I say "immutable" because some users want to read from a database table and then write to the same table further downstream. Even when those tables are different, we haven't found a convenient one-size-fits-all solution for remote databases in general. Workarounds range from hashing the database table in tar_change() to downloading from the database on a schedule with the help of tar_age() + cron. (c.f. https://docs.ropensci.org/tarchetypes/reference/index.html#targets-with-custom-invalidation-rules).
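A rough sketch of both workarounds with tarchetypes follows. The helpers `read_table_from_db()`, `hash_table_in_db()`, and `read_other_table_from_db()` are hypothetical and would need to be written for the database in question.

```r
# _targets.R (sketch): the three database helper functions below are
# hypothetical, assumed to be defined elsewhere in the project.
library(targets)
library(tarchetypes)

list(
  # Rerun the download whenever the value of `change` differs
  # from the previous run (here, a hash of the remote table).
  tar_change(
    table_a,
    read_table_from_db(),
    change = hash_table_in_db()
  ),
  # Rerun the download once the stored result is more than a day old.
  tar_age(
    table_b,
    read_other_table_from_db(),
    age = as.difftime(1, units = "days")
  )
)
```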

wlandau Aug 5, 2024
Maintainer

This seems like a cool feature. I wonder why workspace_on_error = TRUE is not the default?

Thanks! Workspaces have been around long enough that perhaps now is the time to make workspace_on_error = TRUE the default (and add the messages you suggested).

hadley Aug 5, 2024
Author

I don't expect targets to recursively hash every function definition; but I don't think that's what Miles was implying.

MilesMcBain · 2024-08-05T22:52:55Z

MilesMcBain
Aug 5, 2024

Can you expand on this because that feels super worrying to me — if your pipeline is not capturing the dependencies between the steps how can you trust it? But I can't believe that you'd create pipelines with implicit dependencies across the graph, so I must be misunderstanding something.

Happy to!

An important thing to keep in mind is that I am talking about a default suitable for pipelines in development. So immature, not yet established in prod etc.

For these pipelines the graph structure is not set, and will likely be revised many times. An error can easily be a thing that leads to a change in graph structure, for example by carving off a target representing a special case from the main ‘trunk’.

A specific and common example of where an error invalidates a chain of targets that are not descended from it is where the error manifests in one target, but the root cause is invalid data which is handled further up the chain, e.g. in a ‘filter on read’ type approach.

Another example is one Will alluded to, where the error is caused by a tracked function (possibly in a package) that is a dependency of multiple targets in the plan, some of which did not error.

So in the development context, where reproducing the error condition state is so fast and easy, as we have said, it feels better to just deal with problems as they arise rather than waiting for expensive targets to complete that have some chance of being rendered invalid anyway.

For more mature pipelines I think the ‘keep processing’ approach is much more viable.

1 reply

hadley Aug 5, 2024
Author

Ah, the pipelines in development vs pipelines in production distinction makes a lot of sense and is what I was missing. Given that framing, I agree that eagerly failing all jobs makes the most sense during development.

wlandau · 2024-09-10T11:50:30Z

wlandau
Sep 10, 2024
Maintainer

#1332 adds error = "trim", which includes most of the proposal from #1310 (comment).

1 reply

hadley Sep 10, 2024
Author

Nice, thanks!
