Crash of all workflows (DuplicateSubscriberIdentifier) #3973
Dear Aiida Users,
I got the following error message, which led to the crash of all running jobs/workflows/calculations during the weekend:
kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier …
I am usually running with several daemon workers. Any idea how I could avoid such problems in the future?
(Running AiiDA 1.1.1 at the moment)
Resubmitting the jobs helped, but from time to time I encounter this problem again. It is not really reproducible, which makes it hard to fix, and it only occurs sometimes… ☺
Best regards,
Benedikt
Ok. Maybe you can report a bit about the way you have set up your installation? RabbitMQ version, whether it's on localhost, version of AiiDA (is it still 1.1.1?) and of kiwipy (just to double check), etc. Also, do you know if …
Dear Giovanni,
Here are some more details:
- RabbitMQ: 3.6.10
- kiwipy: 0.5.3
- AiiDA: still 1.1.1 (will probably update to 1.2.1 in a bit)
- Running on localhost
- Python libraries were installed using Anaconda.
The number seems to be the PK of the parent workflow.
Hope that helps a bit
Best regards
Benedikt
Sorry you're having these issues Benedikt, and thanks for reporting. I'm wracking my brain trying to think how this could happen, but I can't come up with a plausible explanation for now, so let me just clarify a little what the various systems are doing, which will put constraints on the possible 'explanation space', and hopefully we can solve it together.
As I think this through, one possibility is that a worker that was already working on … @BeZie, would you be able to find the RabbitMQ logs from around the time of that exception (have a look in …)?
@muhrin The only thing I find in the log around that time is: `closing AMQP connection <0.4871.3> (127.0.0.1:47882 -> 127.0.0.1:5672, vhost: '/', user: 'guest')`
Dear all, …
Hi Benedikt. I had a chat with Sebastiaan on Friday to try and brainstorm a possible mechanism for this failure but we couldn't think of anything other than what I'd already suggested. So, let's gather a few more details and dig deeper:
Let's see if we get any clues from all that.
Thanks already for the help :)
My pleasure :) Wow, so looking at these logs it looks like the exact scenario I described is playing out, e.g.:
Here, AiiDA has missed (presumably) two heartbeats (the default) and RabbitMQ has assumed that client to be dead. Then, 23s later, the automatic reconnection mechanism kicks in and AiiDA reconnects:
This all causes some hubbub in the daemon logs, but tellingly, soon after, AiiDA gets delivered the same task it was working on before it got disconnected, and that causes this exception.
I'm surprised because the library that communicates with RabbitMQ uses a separate thread, partly so that it can keep responding to heartbeats while other things go on. Now, because this is Python, we don't have true multithreading [1] and it's possible that your workload has blocked the interpreter for over two minutes (i.e. 2 heartbeats). So, can I ask, what kind of workload is in your workchains? Are you doing any heavy in-Python computation, heavy …
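One way to test whether a given workload can starve a background thread (and hence the heartbeat responder) is to run a small watchdog thread alongside it and log its scheduling gaps. This is a minimal, hypothetical diagnostic sketch using only the standard library; it is not part of AiiDA or kiwipy:

```python
import threading
import time


def watchdog(stop_event, max_gap=5.0):
    """Report whenever this thread is not scheduled for more than max_gap seconds.

    Large gaps while a calcfunction runs mean the main thread is holding the GIL
    (or the machine is badly overloaded), so heartbeats could be missed the same way.
    """
    last = time.monotonic()
    while not stop_event.is_set():
        stop_event.wait(1.0)  # wake up roughly once per second
        now = time.monotonic()
        if now - last > max_gap:
            print(f'watchdog starved for {now - last:.1f} s')
        last = now


stop = threading.Event()
thread = threading.Thread(target=watchdog, args=(stop,), daemon=True)
thread.start()

# ... run the suspect workload here, e.g. the body of the heavy calcfunction ...
time.sleep(10)  # placeholder for the real workload

stop.set()
thread.join()
```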
There are some "heavy" in-Python computations in the workchain as calcfunctions (taking around 15 minutes each, and they are quite memory hungry). These computations (to my knowledge) do not use multithreading; maybe numpy does it under the hood. Best,
To the best of my knowledge … @muhrin Would running out of memory give a different result? What happens if the daemon process gets … I've crashed my workchains by running out of memory (funnily, from a different process) before, but didn't look into the exact traceback because it was clear this was the root cause. If the missed heartbeat is the root cause, is there any other way to guard against a second worker trying to pick up the same process, or for the second worker to fail gracefully such that the process is not destroyed?
Thanks @greschd . I'm really quite ignorant of the conditions under which Python will or won't release the GIL, although keeping it for over 2 minutes does indeed seem a little extreme.
Regarding processes being killed (as a consequence of OOM, -9, or anything else), this is actually fine, in the sense that the next worker to pick up that task will continue it from the last checkpoint. This failure mode is a little different. It seems to happen when a worker misses the heartbeats and is presumed dead, then reconnects, and gets back the same job it is already working on (which RabbitMQ had no way to know, as the worker wasn't responding). Now, one might say that worker should just know this and carry on with what it was doing, but the reality is that another worker could have been delivered that job, and then they would both be working on it, so this isn't a robust solution.
Ultimately, there is no magic solution here. There needs to be some locking mechanism that prevents two workers from working on the same thing, and in our case that is the fact that a worker holds an RMQ task (and indeed this is exactly the kind of use case RMQ was written for). In a perfect world I would fix this by:
So @BeZie, let's confirm that what I think is happening is true. Can you set this heartbeat timeout (defined in aiida-core/aiida/manage/external/rmq.py, line 34 at 6aa6994) to a larger value,
and then report back whether the error is cleared up? (You'll need to restart your daemon for this to take effect.) A consequence of raising this is that if a worker terminates in an unclean way, your tasks will be frozen for twice that timeout before they can be executed by a worker again.
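For reference, the suggested change amounts to raising the heartbeat timeout constant in aiida/manage/external/rmq.py. The constant name, default value and comment below are quoted from memory and should be treated as an assumption rather than the literal source:

```python
# aiida/manage/external/rmq.py (sketch; the exact name and comment are assumptions)
# Per the discussion below, values beyond the server-side heartbeat have no
# effect, so the heartbeat in the RabbitMQ configuration must be raised as well.
_RMQ_HEARTBEAT_TIMEOUT = 1800  # seconds; the shipped default is reportedly 600
```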
Just for the sake of pedantry, that of course wouldn't be sufficient to know the worker can always respond, because there's also OS-level scheduling. Now, I don't know how overloaded your system needs to be for a thread to not be scheduled at all for 2 minutes, but I guess it's possible.
Just want to double check: the line comment kind of indicates that 600 (seconds) is already the max. Should I somehow adjust the RabbitMQ settings accordingly? BTW: thanks for the quick responses and for helping me out here.
Ah, yes, it helps if I read the useful comments that we put into the code :) So, you have to make the AiiDA change and also up the heartbeat in rabbitmq.conf; I think on Ubuntu this can be done by:
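Something along the following lines should do it (a sketch rather than the exact snippet originally posted; the value is just an example):

```
# /etc/rabbitmq/rabbitmq.conf (RabbitMQ 3.7+, ini-style format) - heartbeat in seconds
heartbeat = 1800
```

On RabbitMQ 3.6.x, which still uses the classic Erlang-term configuration file /etc/rabbitmq/rabbitmq.config, the equivalent would be `[{rabbit, [{heartbeat, 1800}]}].`; in either case the server needs a restart afterwards (e.g. `sudo systemctl restart rabbitmq-server`).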
Then you should be good to up it in AiiDA.
Going to test the new settings over the weekend and will let you know if I get a random crash. Some side comments.
Thanks again and best regards
Unfortunately, the same error has occurred again. I had set the heartbeat time to about 10 minutes. As a next try, I will transfer the file repository to a local disk and see if this happens again. :) [This will take a couple of days as I need to order a new HDD first... if you have any suggestions until then, please let me know] Best regards,
Short update on my side: after moving everything to a local folder, I have not had any problems yet. Not sure if I should close the issue because it, in principle, still exists. Thanks for helping and guiding me to a workaround! Best regards,
@greschd yep, I moved the file repository from a network-mounted drive to a local one.
What is the configuration for RabbitMQ? Is it running on the localhost where AiiDA is running? There is no official support for configuring these settings, but maybe you have manually changed them to connect to another server. I can't really see why the location of the file repository should influence RabbitMQ.
However, we might have been distracted by a red herring. If the file repository was temporarily unavailable due to network issues making the mount unreachable, then this would of course fail any repository operation. There are no built-in error recovery methods for this in AiiDA, because so far we assume it is local. If a calculation was writing to the repository, it would except, which would bubble up and also fail all calling workflows.
@BeZie is the exception you pasted in the OP the only exception you saw in any of the reports of the failed processes?
@sphuber RabbitMQ is running on the same machine (localhost) as AiiDA. No changes to the default RabbitMQ settings have been made except for the heartbeat (see above).
Well, again, I'm tempted to think that heartbeats could be the problem here. We know for sure that there is a problem with the connection being dropped due to heartbeat misses and then re-established, causing the duplicate key. Now, it could be that the large file transfers to a network drive were blocking for more than 10 minutes (although this does seem extreme to me). @BeZie, could you confirm the output from …
@BeZie do you have any kind of logs from your NFS that could indicate whether the error coincided with it being unavailable? Is the NFS being unavailable / overloaded something you have observed at all? In my experience, an overloaded NFS can just cause the operation to hang indefinitely, or until traffic to the NFS clears up. Maybe the RabbitMQ heartbeat being missed was indeed a secondary effect - it's still sort of weird this would occur though, because I think reads/writes to NFS should release the GIL, and thus the "heartbeat responder" thread should still be active.
Also saw this error (e.g., running …). A relevant part of the RabbitMQ log seems to be:
The output of …
The error has appeared again for me, this time due to an out-of-memory problem.
I am now moving away from memory-intensive calcfunctions to some calcjobs to prevent this from happening again. But catching this behaviour somehow would be nice anyway :). Best regards,
'Kill process ... or sacrifice child'… that sounds ominously biblical! What kind of unholy calculations are you running Benedikt??
Thanks for the report @kjappelbaum! Indeed it's very likely that the sleep is the culprit. Of course, this should work, and it highlights that even if we figure out what to do about GIL blocking we would still have an issue.
So to update you: I'm looking into ways to detect (in our communications library) that the connection has dropped and at that point (or when a reconnection occurs) clear the current set of subscribers. This would be the right™ way to do this... although I'm not sure it would help with the child sacrifice...
I'm at a virtual conference this week but I'll aim to update you next week. Sorry for all the inconvenience!
Best,
-Martin
I'm having a similar issue to @kjappelbaum I believe (also running on macOS). I've received the error … Using aiida-core=1.3.0, rabbitmq=3.8.3, the error occurred while running a WorkChain from … The relevant part of the RabbitMQ log, I think, is:
Hi @kavanase, sorry to hear you're experiencing this. Indeed, if a machine sleeps for longer than twice the heartbeat interval then the RabbitMQ server (which can be on the same machine) will think that it has died. I'm not sure it's particularly easy to prevent this; however, I think the solution I was exploring in my last post could help, i.e. the client running the workchain could (somehow) detect when it has lost the connection and accordingly clear all knowledge of subscribers (which is what is causing the exception you're seeing). I'm on holiday at the moment but should be able to pick this up in August. I'll post back when I've had a chance to play around. In the meantime, if you experience this issue for any reason other than the machine sleeping (or losing network connection, etc.), then please do post again.
Hi,
My issue is happening during the parsing of output files. I just did the timing tests and realized that the …
A few days ago I had the same problem. It was reproducible in the sense that I tried resubmitting the workchain and it always failed for this reason. The laptop was also performing another heavy Python task and was overheated. My setup was: AiiDA v1.4.0, RabbitMQ 3.7.8, macOS. I used a single daemon worker - restarting it did not help.
The problem appeared after I added a new user to RabbitMQ and set up a new AiiDA profile using it.
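A typical way to do this (a hypothetical sketch; the user name, password and profile options are placeholders, not the ones actually used) would be:

```shell
# create a dedicated RabbitMQ user and grant it access to the default vhost
sudo rabbitmqctl add_user aiida_user some_password
sudo rabbitmqctl set_permissions -p / aiida_user ".*" ".*" ".*"

# point a new AiiDA profile at that broker user (aiida-core 1.x; the remaining
# database options are prompted for interactively)
verdi setup --profile new_profile \
    --broker-username aiida_user \
    --broker-password some_password \
    --broker-host 127.0.0.1 \
    --broker-port 5672
```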
Today, after the laptop was switched off for the weekend, I do not encounter the problem anymore - the workchain works well.
Hi All, just to say that I have also experienced this today:
It was when I had 900 processes running with 10 workers, mostly VASP CalcJobs and WorkChains. I recall doing something very memory-intensive at that time (a MongoDB restore operation....) and then noticed some high usage of the Python processes as well (the problem escalating?). I have been running around 600 processes previously without any problem. Maybe I am pushing it too hard..... I am running AiiDA 1.3.0 on WSL Ubuntu 18.04; the RabbitMQ service is installed on the Windows side.
Thanks @zhubonan. So if I understand correctly, your situation was that your AiiDA workflows were doing a fairly large amount of in-Python processing and then suddenly the exception occurred? There was no sleep of the system involved? Would you mind having a look at the RabbitMQ logs around the time of (or just before) the exception? I'm wondering if there's a line that says something like …
@Tseplyaev, thanks a lot for the report! Can I just check about this new RMQ user you added: did you do this first and then start running your workchains, which would consistently crash? You were also getting this …
Here are the two files from around the time the issue first occurred. I think I am having a different issue here - the problem may not be on the AiiDA side at all. The RMQ log suggests there were multiple server restarts (unexpected - I did not manually restart the server). The first sign of the crash, 2020/10/08 19:08:16, seems to coincide with an RMQ server restart, as indicated by the "Log file opened" line. The "closing AMQP connection" line happened after the problem started to escalate, but I can see similar lines in the log earlier on, with no problems noticed in the normal operation of AiiDA.
One thing that seems weird to me in all this is that everyone is reporting duplicate broadcast subscribers: … Looking at …, I would also note that in aiida-core the broadcast subscriber is currently actually redundant, because in …
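To make the failure mode concrete, here is a small self-contained model of the registration logic (all names are hypothetical; this is not the actual kiwipy or plumpy code): a process registers a broadcast subscriber keyed by its pk, the connection drops and is re-established, and a second registration with the same key raises the duplicate-identifier error.

```python
class DuplicateSubscriberIdentifier(Exception):
    """Raised when a subscriber identifier is registered twice."""


class BroadcastRegistry:
    """Toy stand-in for the communicator's broadcast subscriber table."""

    def __init__(self):
        self._subscribers = {}

    def add_broadcast_subscriber(self, callback, identifier):
        if identifier in self._subscribers:
            raise DuplicateSubscriberIdentifier(f'Broadcast identifier {identifier!r}')
        self._subscribers[identifier] = callback

    def clear(self):
        # What a reconnection hook could do: forget all subscribers so that
        # re-registration after the connection is re-established succeeds.
        self._subscribers.clear()


registry = BroadcastRegistry()
pk = 1234  # placeholder for the workflow pk used as the identifier

registry.add_broadcast_subscriber(lambda *args: None, identifier=str(pk))
# ... heartbeat missed, RabbitMQ drops the client, the client reconnects ...
# and re-registers without clearing its old state:
registry.add_broadcast_subscriber(lambda *args: None, identifier=str(pk))  # raises
```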
In the last weeks I have been experiencing this problem again. I now use another computer (not the one I used when I reported this problem earlier). I think I was able to quasi-reproduce it - not really, but very close. The workchain I use consists of two steps: the first one is several CalcJobs calculated on the localhost, and the second one is a single CalcJob performed on a remote machine. I use a quite dumb script to submit the workchains: before each submission it checks the total number of CalcJobs running and, if this number is larger than X, it waits for Y seconds; if not, it submits a new workchain. Initially I used X=120 and Y=60 -> some of the submitted workchains crashed due to the problem under discussion. Then I changed the parameters to X=50 and Y=240. The heavy initial localhost calculations did not overlap between different workchains, the daemons were not overloaded, and I did not experience the …
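For reference, the kind of throttled submission loop described above looks roughly like this (a sketch; the workchain class, inputs and thresholds are placeholders, and the process-state filter is one reasonable way to count non-terminated CalcJobs in aiida-core 1.x):

```python
import time

from aiida import load_profile
from aiida.engine import submit
from aiida.orm import CalcJobNode, QueryBuilder

load_profile()

MAX_RUNNING = 50      # the "X" above
WAIT_SECONDS = 240    # the "Y" above


def count_active_calcjobs():
    """Count CalcJobs that have not reached a terminal state yet."""
    query = QueryBuilder()
    query.append(
        CalcJobNode,
        filters={'attributes.process_state': {'in': ['created', 'waiting', 'running']}},
    )
    return query.count()


def throttled_submit(workchain_class, all_inputs):
    """Submit one workchain per inputs dict, waiting whenever too many CalcJobs run."""
    for inputs in all_inputs:
        while count_active_calcjobs() > MAX_RUNNING:
            time.sleep(WAIT_SECONDS)
        submit(workchain_class, **inputs)
```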
I think you explained the asymmetry in seeing the exception coming from the broadcast and not the RPC subscribers. The former is handled by …
I closed this through #5715 because it may solve at least part of these cases. Since these reports are very old, it is difficult to know to what extent the fix will work. It is very likely that the bug is still present, but it will occur just less often. If someone comes across this bug in …