Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow actor exceptions to propagate #4232

Merged
merged 17 commits into from
Jun 14, 2021
Merged

Conversation

martindurant
Copy link
Member

@martindurant martindurant commented Nov 10, 2020

Follows on from #4225

Closes dask/dask#7626

@martindurant
Copy link
Member Author

This seems to take more code than I would have thought, to wrap and unwrap the results dict in various places.

@martindurant martindurant marked this pull request as ready for review November 10, 2020 18:11
@@ -785,6 +785,7 @@ async def get_metrics(self):
executing={
key: now - self.tasks[key].start_time
for key in self.active_threads.values()
if key in self.tasks
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is a little worrysome. Without it, you get KeyError here during heartbeat after a failed actor action, but I don't really know why. Should the "active thread" be cleaned up somewhere?

@martindurant
Copy link
Member Author

CI fails on apparently unrelated timeout in test_broken_worker_during_computation on py36 only

@jakirkham jakirkham requested a review from jrbourbeau May 15, 2021 22:44
@martindurant
Copy link
Member Author

If this still passes, I would really like to see it merged! I may offer a "please object" timeframe.

@jakirkham
Copy link
Member

@jrbourbeau would you be able to take a look or do you know who would be able to?

@martindurant
Copy link
Member Author

OK, all passed except some unrelated and presumably flaky steal test in one run.
(Note: all the post warnings are still in the very long build log)

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @martindurant!

distributed/actor.py Outdated Show resolved Hide resolved
distributed/actor.py Outdated Show resolved Hide resolved
distributed/actor.py Outdated Show resolved Hide resolved
def prop(self):
raise MyException

with cluster(nworkers=2) as (cl, w):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use cluster here instead of @gen_cluster like most of the other test in this module?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to test the sync API, which is more typical for actors. I added an async version immediately below (could remove this one, if you like).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to test the sync API, which is more typical for actors. I added an async version immediately below (could remove this one, if you like).

To be clear here, the sync api is more common for everything. We prefer the async tests because they are faster/easier on CI and allow for greater debuggability. Adding sync tests too is fine if we want to be extra careful, but in general we prefer async tests. In general if async tests I usually have confidence that sync works just as well, unless I'm explicitly writing code to handle synchronization.

What's here is great though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's here is great though.

I don't mind keeping it or not. Writing the sync version first probably reflects how I initially tested by hand. It was some time ago, so I can't remember if there was any other reason, given that the async version is effectively identical and works just fine.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @martindurant! Just a few small comments on test_actor.py, otherwise I think this is good to merge

Comment on lines 11 to 17
from distributed.utils_test import ( # noqa: F401
async_wait_for,
cluster,
cluster_fixture,
gen_cluster,
loop,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following #4888, all pytest fixtures in distributed.utils_test are globally available. So there's no need for the cluster_fixture or loop imports (and hopefully we can drop # noqa: F401 too).

distributed/tests/test_actor.py Outdated Show resolved Hide resolved
distributed/tests/test_actor.py Outdated Show resolved Hide resolved
@martindurant
Copy link
Member Author

Do we have a tracking issue for ongoing CI failures? The two here are not actor related, as far as I can tell.

@mrocklin
Copy link
Member

mrocklin commented Jun 14, 2021 via email

@jrbourbeau
Copy link
Member

There are a variety of issues, generally one per failure

Yeah, we keep track of these issues with a flaky label

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @martindurant! Will merge once CI finishes

@jrbourbeau jrbourbeau merged commit 05c5621 into dask:main Jun 14, 2021
@martindurant martindurant deleted the actor_fail branch June 14, 2021 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Actor w/ async method does not propagate exceptions, and hangs forever
4 participants