Computation deadlocks after inter worker communication error #4133
Good timing, @quasiben and I are looking into this right now! xref #4128 and linked pangeo issues. One thing I'm trying to disambiguate: connection errors like those seen in #4130 and the distributed/distributed/batched.py Lines 73 to 109 in 34fb932
BatchedSend._background_send is not robust to failures (and I think swallows exceptions). Right now I'm going through tornado issues / IOStream to construct a reproducer, and then I'll work on making BatchedSend more robust.
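To make the suspicion above concrete, here is a minimal sketch (my own illustration, not the actual BatchedSend code) of a background send loop that aborts the comm and re-raises instead of swallowing write failures; `comm` is assumed to expose `write`, `closed`, and `abort` like distributed's comm objects:

```python
# Illustrative sketch only -- not the actual BatchedSend implementation.
import asyncio
import logging

logger = logging.getLogger(__name__)

async def background_send(comm, queue: asyncio.Queue):
    """Drain `queue` and write each batch to `comm`; surface failures."""
    while not comm.closed():
        batch = await queue.get()      # next batch of messages to send
        try:
            await comm.write(batch)    # may raise, e.g. BufferError / CommClosedError
        except Exception:
            # Don't keep a comm in an unknown state: log, abort, and let the
            # layer above decide whether to reconnect and retry.
            logger.exception("Error while sending batch; aborting comm")
            comm.abort()
            raise
```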
Does this occur with a large number of workers? One thing which would be super helpful (though fairly challenging) is getting a reproducer for this issue (or these issues).
I suspected the BatchedSend myself but I was working under the impression that everything in
No, the last time I observed this we had 10 workers. However, the scheduler and one of the workers had an uptime of about a month (we always have one worker ready to compute and scale up as necessary; the scheduler and worker0 are always on). Not sure if this is connected, but it seemed worth mentioning.
Yes, I know, but so far I don't have anything to show. As I said, we've run into this a few times already and I didn't even know where to start looking, because most of the time we didn't even have an error log. I also don't believe #4130 is the only way to trigger this scenario, just the latest instance I could observe.
That sounds right to me, but I'll confirm. Right now my guess is that we have two distinct issues (the
Note that I'm just reading code and trying to follow your analysis, but I think I may have found (part of) the answer to this question, in particular why the "finally" block you quoted is never invoked: I think l. 546 in core.py (as you cited it) runs just fine without an exception, so no exception handling mechanism is triggered and this "finally" is never reached. (This is possible, e.g. if the serialized exception exceeds 2048 bytes, which would not re-trigger this same BufferError; it is also possible that this BufferError is not hit again here for other reasons, e.g. because a write from another thread finished in the meantime.)

So I'm pretty confident this is the problem: this BufferError leaves the connection in a broken internal state, but the connection is still re-used.

I think the fix is relatively straightforward: be much more aggressive about closing / cleaning up connections after failures. In case of an internal error, the worker should NOT attempt to reply with this error in-band via the (likely already broken) connection. Instead, it should treat the connection as broken beyond repair, just log the error and close the connection, and rely on re-tries from a level above this code.
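A minimal sketch of what that could look like in the handler loop (hypothetical names, not a patch against distributed.core):

```python
# Hypothetical sketch: on a low-level error such as BufferError, do not try to
# send the error back over the (possibly corrupted) comm; log and close instead,
# so the peer's retry logic can kick in.
import logging

logger = logging.getLogger(__name__)

async def run_handler(comm, handler, msg):
    try:
        result = await handler(**msg)
    except BufferError as e:
        # The stream state is unknown; an in-band error reply could hang or
        # arrive garbled on the other side.
        logger.error("Low-level comm error in handler, closing connection: %s", e)
        comm.abort()
        return
    await comm.write(result)   # only reply over a comm we still trust
```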
This improves the robustness in case of internal errors in handlers that leave `comm` in an invalid state, see also dask#4133.
If it was running just fine, it would trigger an exception on the listener side, wouldn't it? distributed/distributed/core.py Lines 679 to 684 in 7d769b8
If the write somehow blocked / never terminated, that would explain the issue. Do you have a suggestion as to why we do not see any exception logs (see line 534 in core.py)?
Not really important here, but why would the size of the exception impact this? I was still under the assumption that this buffer error was not yet well understood.
I would not expect this: The
My guess is this read blocks on the other end somewhere around here: distributed/distributed/comm/tcp.py Lines 193 to 198 in 7d769b8
I thought this was the traceback you posted?
tornado behaves differently depending on the size passed to write. The code path that potentially triggers the BufferError is only hit for writes smaller than 2048 bytes (while exceptions sent by distributed can, by default, be up to 10 kB):
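For context, the BufferError itself is plain Python behaviour: a bytearray that has an exported memoryview cannot be resized. The snippet below only reproduces that Python-level error in isolation; it is an analogy for tornado's small-write buffer handling, not tornado's actual code:

```python
# Standalone reproduction of the Python-level error, independent of tornado.
buf = bytearray(b"pending data")
view = memoryview(buf)         # e.g. a view handed off for a pending write

try:
    buf.extend(b"more data")   # resizing while the view is alive is illegal
except BufferError as e:
    print("BufferError:", e)   # "Existing exports of data: object cannot be re-sized"
finally:
    view.release()
```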
Yes, of course. I've seen this issue a few times now and there was not always an exception log. For this instance that would fit, though.
I think we have at least a partial understanding of what is going on here, enough for a reasonable fix:
The fix is to simply close the connection on such low-level connection errors, rather than attempting to send an error message over this connection. This is what #4239 does.

Note that there is still the possibility for similar error conditions not addressed by #4239: if the handler starts to send some data and then raises an exception (from the handler code, not from

One relatively straightforward option is to always close the connection after sending an "uncaught-error" message. This would prevent the other end from waiting indefinitely. Depending on whether some data has already been written to the stream, the error message might be garbage, but at least the client would not wait forever.
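Roughly, that option would look like this (sketch only; the message layout and names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

async def reply_with_error_then_close(comm, exc):
    """Best-effort error reply, then unconditionally drop the connection."""
    try:
        await comm.write({"status": "uncaught-error", "text": repr(exc)})
    except Exception:
        logger.exception("Could not send error reply")
    finally:
        # The stream may already contain partial data, so the only safe option
        # is to close it and rely on retries at a higher level.
        comm.abort()
```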
I'm curious what happens in this situation. Do operations like sending data between workers retry today on a
It depends: some operations are re-tried via distributed/distributed/worker.py Line 3224 in 7d769b8
These calls to
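For illustration, a generic retry helper in the same spirit as the code referenced above (a sketch, not distributed's actual retry utility):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def retry(coro_factory, count=3, delay=0.5):
    """Call `coro_factory()` up to `count + 1` times, retrying on OSError."""
    for attempt in range(count + 1):
        try:
            return await coro_factory()
        except OSError as exc:
            if attempt == count:
                raise
            logger.warning("Attempt %d failed (%s); retrying", attempt + 1, exc)
            await asyncio.sleep(delay)
```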
Pinging @KrishanBhasin as this sounds very similar to some issues he has been seeing.
Thanks @gforsyth. I was indeed seeing something that sounds similar to this in #3878, but have not encountered it for some time.
Following the above reference also leads to #3761 which mentions deadlocks. Might be the same thing but can't confirm. |
Close comm.stream on low-level errors, such as BufferErrors. In such cases, we do not really know what was written to / read from the underlying socket, so the only safe way forward is to immediately close the connection. See also #4133
Before I start, sincere apologies for not having a reproducible example, but the issue is a bit weird and hard to reproduce. I'm mostly asking here to see if anybody has an idea what might be going on.
What happened:
We have seen multiple times now that some individual computations are stuck in the `processing` state and are not being processed. A closer investigation revealed in all occurrences that the worker the processing task is assigned to, let's call it WorkerA, is in the process of fetching the task's dependencies from WorkerB. While fetching these, an exception seems to be triggered on WorkerB and for some reason this exception is lost. In particular, WorkerA never gets any signal that something went wrong and waits indefinitely for the dependency to arrive, while WorkerB has already forgotten about the error. This scenario ultimately leads to a deadlock where the single dependency fetch blocks the cluster, and the cluster never seems to self-heal. The last time I observed this issue, the computation was blocked for about 18 hours. After manually closing all connections, the workers retried and the graph finished successfully.

I've seen this deadlock happen multiple times already, but only during the last instance could I glimpse an error log connecting it to issue #4130, where the tornado stream seems to cause the error. Regardless of what caused the error, I believe there is an issue on the distributed side, since the error was not properly propagated, the connections were not closed, etc., suggesting to WorkerA that everything was fine.
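For reference, this is roughly how one can inspect such a stuck state from the client (illustrative only; the scheduler address is a placeholder):

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

# Which tasks does the scheduler think each worker is processing?
print(client.processing())

# What is each worker actually executing right now?  A stuck dependency fetch
# shows up as a task that stays "processing" without ever appearing here.
print(client.call_stack())
```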
Traceback
The above traceback shows that the error appeared while WorkerB was handling the request, executing the `get_data` handler, ultimately raising the exception:
distributed/distributed/core.py Lines 523 to 535 in 34fb932
In this code section we can see that the exception is caught and logged. Following the code, an error result is generated which should be sent back to WorkerA, causing the dependency fetch there to fail and be retried
distributed/distributed/core.py
Lines 544 to 554 in 34fb932
Even if this fails (debug logs are not enabled), the connection should eventually be closed/aborted and removed from the server's tracking
distributed/distributed/core.py
Lines 561 to 569 in 34fb932
which never happens. I checked on the still-running workers and the `self._comm` attribute was still populated with the seemingly broken connection.

Why was this finally block never executed? Why was the reply never received by WorkerA?
Just looking at the code, I would assume this is already implemented robustly, but it seems I'm missing something. Does anybody have a clue about what might be going on?
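For readers following along, here is a condensed paraphrase of the server-side flow described above (a simplified sketch, not the actual distributed.core code; `server.handlers` and `server._comms` stand in for the server's bookkeeping):

```python
import logging

logger = logging.getLogger(__name__)

async def handle_comm(server, comm):
    try:
        msg = await comm.read()
        op = msg.pop("op")                    # e.g. "get_data"
        handler = server.handlers[op]
        try:
            result = await handler(**msg)
        except Exception as e:
            # The exception is caught and logged (core.py lines 523-535) ...
            logger.exception("Exception while handling op %r", op)
            # ... and turned into an error reply for the other side.
            result = {"status": "uncaught-error", "text": repr(e)}
        # The (error) result should be written back (core.py lines 544-554),
        # making the dependency fetch on WorkerA fail and retry.
        await comm.write(result)
    finally:
        # core.py lines 561-569: the cleanup I expected to run -- close the
        # connection and drop it from the server's tracking.
        comm.abort()
        server._comms.pop(comm, None)
```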
Environment