Trino worker parallelism becomes 0 #15055
Comments
cc @arhimondr
I wasn't involved in trinodb/trino-hadoop-apache#39.
Thanks @arhimondr. @vinay-kl, have we verified that the change works? I'm not sure yet that PR will necessarily solve this particular issue, given the locks here are on …
cc @jitheshtr
@phd3 As long as the current thread which is inside the …
Actually yeah, it looks like there could be more to it. Let me reopen it for now.
Yes, for the other change the deadlock is happening because of two different locks. If only the method holding one lock is hanging, I'm not sure that fix helps.
For the lock contention due to slow …
@vinay-kl - Would you be able to try out the …
@jitheshtr As I'm on leave for a few weeks, I won't be able to test this. Sorry about that.
Sample reproduction by mocking a hanging …
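A minimal sketch of such a reproduction, assuming the hang is mocked with a filesystem factory that never returns. GlobalLockCache, the keys, and the sleeping factory below are simplified, hypothetical stand-ins rather than Trino's actual classes; the point is only that one hanging creation inside a cache-wide synchronized block leaves every other caller BLOCKED.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class HangingFileSystemCacheRepro
{
    // Simplified stand-in for a cache that guards both lookup and creation with a
    // single instance-wide lock, similar in spirit to the reported getInternal path.
    static class GlobalLockCache
    {
        private final Map<String, Object> cache = new HashMap<>();

        synchronized Object get(String key, Supplier<Object> factory)
        {
            // The (possibly slow) creation runs while holding the cache-wide lock.
            return cache.computeIfAbsent(key, ignored -> factory.get());
        }
    }

    public static void main(String[] args)
            throws InterruptedException
    {
        GlobalLockCache cache = new GlobalLockCache();

        // Mocked "hanging" creation: simulates storage that keeps retrying indefinitely.
        Thread hanging = new Thread(() -> cache.get("abfs://container-a", HangingFileSystemCacheRepro::sleepForever));
        hanging.setDaemon(true);
        hanging.start();
        TimeUnit.MILLISECONDS.sleep(200);

        // An unrelated caller for a different key: it never even reaches its factory,
        // because the cache-wide monitor is held by the hanging creation above.
        Thread victim = new Thread(() -> cache.get("hdfs://cluster-b", Object::new));
        victim.setDaemon(true);
        victim.start();
        TimeUnit.MILLISECONDS.sleep(200);

        System.out.println("victim thread state: " + victim.getState()); // expected: BLOCKED
    }

    private static Object sleepForever()
    {
        try {
            TimeUnit.DAYS.sleep(365);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Object();
    }
}
```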
@vinay-kl From your heap dump, there are ~400 instances of …
From a cluster stability perspective, we need to remove this contention: either by timing the call out soon enough to get out of the lock, or by avoiding the lock in the creation process altogether (which seems unnecessary but happens currently).
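A rough sketch of the second option, assuming a per-key future so that no cache-wide lock is held while an entry is created. This is a hypothetical class, not the actual Trino fix: a hanging creation only stalls callers asking for that same key.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Sketch: no global lock is held during creation, so callers of other keys
// are unaffected by a slow or hanging filesystem instantiation.
public class PerKeyCreationCache<K, V>
{
    private final ConcurrentMap<K, CompletableFuture<V>> cache = new ConcurrentHashMap<>();

    public V get(K key, Supplier<V> factory)
    {
        CompletableFuture<V> created = new CompletableFuture<>();
        CompletableFuture<V> existing = cache.putIfAbsent(key, created);
        if (existing != null) {
            // Another thread is creating (or has created) this entry; wait only for it.
            return existing.join();
        }
        try {
            // We won the race: run the possibly slow creation without holding any lock.
            V value = factory.get();
            created.complete(value);
            return value;
        }
        catch (RuntimeException e) {
            // Unblock other waiters on this key and allow a later retry.
            created.completeExceptionally(e);
            cache.remove(key, created);
            throw e;
        }
    }
}
```

Callers of the same key still pile up behind a hanging creation, so the first option (timing out the underlying storage call) would still be needed to bound how long they wait.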
There are two issues here: …
@phd3 @jitheshtr What can be done here to navigate out of this situation until the fix has been implemented? Does killing the blocked query (after identifying it from the thread dump) help? Once Azure storage (in our case) starts serving requests properly, the cluster comes back to a stable state, but that can take its own time, so I was wondering what could help mitigate in the meanwhile.
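As a diagnostic stop-gap (not a fix), something along these lines can report which threads are parked on the cache monitor and which thread currently owns it, so the owner can be mapped back to a query by its thread name. It assumes access to the worker JVM (e.g. via an agent or a JMX-exposed bean) and is shown as a plain main() for brevity; the TrinoFileSystemCache class-name filter comes from the stack traces described in this thread, everything else is a sketch.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

public class FileSystemCacheLockReport
{
    public static void main(String[] args)
    {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Dump all live threads with monitor and synchronizer information.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            boolean touchesCache = Arrays.stream(info.getStackTrace())
                    .anyMatch(frame -> frame.getClassName().contains("TrinoFileSystemCache"));
            if (!touchesCache) {
                continue;
            }
            LockInfo waitingOn = info.getLockInfo();
            if (waitingOn != null) {
                // This thread is parked at the synchronized block; the lock owner is
                // the thread that is actually stuck talking to storage.
                System.out.printf("%s is blocked on %s held by %s%n",
                        info.getThreadName(), waitingOn, info.getLockOwnerName());
            }
            else {
                // Inside the cache code but not waiting on a lock: likely the holder.
                System.out.printf("%s is inside the cache code and not waiting on a lock (state %s)%n",
                        info.getThreadName(), info.getThreadState());
            }
        }
    }
}
```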
@jitheshtr The issue is still persistent in … We observed that in the newer versions the parallelism recovers after 5-7 minutes, which is better in the latest version, whereas in …
The attached screenshot shows the comparison between the normal vs. hampered dumps.
@vinay-kl I don't see …
Yeah, I didn't see any threads where splits are waiting to create/get FileSystems either, so I suspect this is an independent issue. Is there any indication that the coordinator may be CPU/network saturated? Do you see scan tasks on workers in "RUNNING" state but just waiting on new splits? It would also be useful to get a profile for the coordinator/workers to get more insight into what may be happening: https://www.baeldung.com/java-async-profiler
@sopel39 Yes, indeed the …
@phd3 We don't see any saturation on the coordinator: CPU is well under 15-20% usage, and there are no indications of problems at the network level. There just aren't enough splits for the workers to process, even though there are many queries in running and queued state.
@vinay-kl I meant: do you have the JSON from the query execution?
@sopel39 My bad, I'll get a few JSONs and attach them.
Hello Team,
We are running into WORKER PARALLELISM becoming 0, and it stays that way for 30-40 minutes even though there are enough queries in the queue and in running state. Attaching the thread dump and images of the same:
threadDump_16Nov2.out.zip
When we looked at the thread dump, we found that one of the threads which has entered the `synchronized` block interacting with the Hadoop API (in our case Azure storage) is stuck in retries due to some unknown underlying reason, and all the other threads are waiting on this thread to release the lock so they can enter the block. `TrinoFileSystemCache.java` is the class and `getInternal` is the method. This PR seems to have solved the issue: https://github.com/trinodb/trino-hadoop-apache/pull/39 - it was already merged, but I couldn't see the change even in the latest branch.