Fixed stuck BatchedSend comm #4128
Conversation
As reported in pangeo-data/pangeo#788, users were seeing tasks just not completing. After some debugging, I discovered that the scheduler had assigned the "stuck" tasks to a worker, but the worker never received the message. A bit more digging showed that

1. The message was stuck in the worker BatchedSend comm's buffer
2. The `BatchedSend.waker` event was clear (awaiting it would wait)
3. The `BatchedSend.next_deadline` was set

I couldn't determine *why*, but this state is consistent with us "missing" a deadline, i.e. the `BatchedSend.next_deadline` is set, but the `_background_send` is already `awaiting` the event (maybe with no timeout?). So I'm very shaky on the cause, but I'm hoping that this fixes the issue. Doing some more manual testing.
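To make the suspected mechanism concrete, here is a condensed, hypothetical sketch of the background-send loop. The attribute names mirror `BatchedSend`, but the body is simplified and is not the actual distributed source:

```python
import asyncio
import time


class BatchedSendSketch:
    """Simplified model of BatchedSend's buffer / waker / deadline interplay."""

    def __init__(self, interval=0.002):
        self.interval = interval
        self.buffer = []
        self.waker = asyncio.Event()
        self.next_deadline = None
        self.please_stop = False

    def send(self, msg):
        self.buffer.append(msg)
        self.waker.set()  # nudge the background loop

    async def _background_send(self, comm):
        while not self.please_stop:
            timeout = None
            if self.next_deadline is not None:
                timeout = max(0.0, self.next_deadline - time.monotonic())
            try:
                # The stuck state above corresponds to sitting here with the
                # waker cleared while next_deadline is set: if the wait began
                # with no timeout, nothing ever wakes the loop again.
                await asyncio.wait_for(self.waker.wait(), timeout)
            except asyncio.TimeoutError:
                pass
            self.waker.clear()
            if not self.buffer:
                self.next_deadline = None
                continue
            payload, self.buffer = self.buffer, []
            self.next_deadline = time.monotonic() + self.interval
            await comm.write(payload)
```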
Oh, I should have looked at the logs sooner. The scheduler pod shows an error which does sound familiar (#2519, #1704, #2506). This is with tornado 6.0.4, so this fix is unlikely to address the original issue (I think). But it may still be worthwhile? I'm not sure. We instead need a way to recover from an exception. I'm able to …
John's been looking into this a bit at #4080 (comment).
OK, so things failed in the `batched_send`, so the next send didn't trigger? We should be robust to this, which I take it this PR helps with, but I honestly wouldn't expect things to finish if the error in #4080 persists. We're not robust to dropped messages. One thing to check here is whether an older version of distributed has this problem, ideally something just before all of the serialization / bytes / memoryviews changes that went in a few months ago. Git bisect might help highlight an issue here.
(if it's easy to reproduce, that is)
I think (but I'm pretty hazy here) the rough sequence was:

```python
while not self.please_stop:
    try:
        nbytes = yield self.comm.write(
            payload, serializers=self.serializers, on_error="raise"  # this raised a BufferError
        )
        ...
    except CommClosedError as e:
        logger.info("Batched Comm Closed: %s", e)
        break
    except Exception:
        logger.exception("Error in batched write")
        break  # <------------ so we break out of the while loop

# and set self.stopped
self.stopped.set()
```
I'm less sure that it actually helps with things. I'll need to dig more. But what would help is retrying.
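For illustration, a hedged sketch of what retrying could look like. `write_with_retry` is a hypothetical helper, not part of distributed; it assumes the `BufferError` is transient, as suggested below:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def write_with_retry(comm, payload, serializers=None, retries=3, backoff=0.1):
    """Hypothetical wrapper: retry transient comm.write failures with backoff
    instead of breaking out of the background-send loop."""
    for attempt in range(1, retries + 1):
        try:
            return await comm.write(payload, serializers=serializers, on_error="raise")
        except BufferError as exc:
            # Transient: tornado's write buffer may drain, after which the
            # comm appears usable again (as observed when manually waking it).
            logger.warning("comm.write attempt %d/%d failed: %s", attempt, retries, exc)
            await asyncio.sleep(backoff * 2 ** (attempt - 1))
    raise BufferError(f"comm.write still failing after {retries} attempts")
```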
The error is transient, I think, since when I manually wake up the Comm, things finish fine. I think that tornado's internal buffer is drained and we're OK to go again? I'd consider this a WIP for now. I haven't been able to make a small reproducer yet but will give it some more time on Monday.
I gave it a try with the reproducer reported in #4080:

```python
import dask.bag as bag  # import added for completeness

b = bag.from_sequence([1] * 100_000, partition_size=1)
bb = b.map_partitions(lambda *_, **__: [b'1' * 2**20] * 5).persist()
bc = bb.repartition(1000).persist()
```

Unfortunately, this does not fix the issue. Workers still lose their connection, as reported in #4080.
Thanks @michaelnarodovitch. Would you be willing to use …
For me, there was no time where this used to work. Our production code is still working around it with aggressive retries:

```
distributed.comm.timeout=80s
distributed.scheduler.allowed-failures=999
```

Ran the reproducer with dask from `pip install git+…`. Tried with some other versions back then, as reported in #4080.
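For reference, a minimal sketch of applying workaround settings like these programmatically via `dask.config`. The key names above are given loosely, so the exact keys used here (e.g. `distributed.comm.timeouts.connect`) are an assumption about what was meant:

```python
import dask

# Assumed mapping of the workaround settings above onto config keys;
# "distributed.comm.timeout" is not an exact key, so the connect timeout
# below is a guess at the intended setting.
dask.config.set({
    "distributed.comm.timeouts.connect": "80s",
    "distributed.scheduler.allowed-failures": 999,
})
```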
Folks were digging around in the serialization code a bit this summer. If
you're willing, I would encourage you to go back about 6-12 months and see
if things worked then.
Same problem with dask/distributed 2.7.0 and dask/distributed 2.0.0.
OK. Thanks for checking
@mrocklin, at a high-level, what do you expect to happen when a `CommClosedError` is raised in `_background_send`? For other exceptions, like this `BufferError`, should we retry? Edit: Hmm, this is maybe complicated by the fact that we don't actually await things from `BatchedSend.send`.
Yes, that would make sense to me
Ah indeed. Well, maybe we can still retry there, and if it fails we close the underlying Comm, so that the next time the BatchedSend tries to write it fails and restarts the worker?
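To make that concrete, here is a hedged sketch of the idea: retry the write a couple of times, and on persistent failure abort the comm so the next write raises `CommClosedError` and the normal reconnect/restart path takes over. `write_or_abort` is a hypothetical helper, not part of distributed's API:

```python
from distributed.comm.core import CommClosedError


async def write_or_abort(comm, payload, serializers=None, retries=2):
    """Hypothetical recovery wrapper: retry, then force-close on failure."""
    last_exc = None
    for _ in range(retries):
        try:
            return await comm.write(payload, serializers=serializers, on_error="raise")
        except CommClosedError:
            raise  # comm is already closed; let the caller handle it
        except Exception as exc:  # e.g. the transient BufferError seen above
            last_exc = exc
    comm.abort()  # close the underlying connection so later writes fail fast
    raise CommClosedError("batched comm gave up after retries") from last_exc
```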
I'm going to close this for now in favor of #4135. This PR really only fixed things for the case where we missed a deadline (failed to await the waker event with a timeout).
@mrocklin I'm extra confused here, since this code hasn't been touched in a while. I'd have thought we would have seen more reports of this, but I don't recall any.