
Crash of all workflows (DuplicateSubscriberIdentifier) #3973

Closed
BeZie opened this issue Apr 27, 2020 · 41 comments · Fixed by #5715

Comments

@BeZie

BeZie commented Apr 27, 2020

Dear Aiida Users,

I got the following error message that led to the crash of all running jobs/workflows/calculations during the weekend.

kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier

I am usually running with several daemon threads. Any idea how I could avoid such problems in the future?

(Running aiida 1.1.1 at the moment)

  File "/path/to/python/lib/python3.7/site-packages/aiida/manage/external/rmq.py", line 187, in _continue
    result = yield super()._continue(communicator, pid, nowait, tag)
  File "/path/to/python/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/path/to/python/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/path/to/python/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/path/to/python/lib/python3.7/site-packages/plumpy/process_comms.py", line 547, in _continue
    proc = saved_state.unbundle(self._load_context)
  File "/path/to/python/lib/python3.7/site-packages/plumpy/persistence.py", line 51, in unbundle
    return Savable.load(self, load_context)
  File "/path/to/python/lib/python3.7/site-packages/plumpy/persistence.py", line 447, in load
    return load_cls.recreate_from(saved_state, load_context)
  File "/path/to/python/lib/python3.7/site-packages/plumpy/processes.py", line 236, in recreate_from
    base.call_with_super_check(process.init)
  File "/path/to/python/lib/python3.7/site-packages/plumpy/base/utils.py", line 29, in call_with_super_check
    fn(*args, **kwargs)
  File "/path/to/python/lib/python3.7/site-packages/aiida/engine/processes/process.py", line 126, in init
    super().init()
  File "/path/to/python/lib/python3.7/site-packages/plumpy/base/utils.py", line 16, in new_fn
    fn(self, *args, **kwargs)
  File "/path/to/python/lib/python3.7/site-packages/plumpy/processes.py", line 295, in init
    self.broadcast_receive, identifier=str(self.pid))
  File "/path/to/python/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 592, in add_broadcast_subscriber
    return self._run_task(coro)
  File "/path/to/python/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 677, in _run_task
    return self.tornado_to_kiwi_future(self._create_task(coro)).result(timeout=self.TASK_TIMEOUT)
  File "/path/to/python/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/path/to/python/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/path/to/python/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 646, in done
    result = done_future.result()
  File "/path/to/python/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/path/to/python/lib/python3.7/site-packages/kiwipy/futures.py", line 54, in capture_exceptions
    yield
  File "/path/to/python/lib/python3.7/site-packages/kiwipy/rmq/utils.py", line 146, in run_task
    future.set_result((yield coro()))
  File "/path/to/python/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/path/to/python/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/path/to/python/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/path/to/python/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 393, in add_broadcast_subscriber
    identifier = yield self._message_subscriber.add_broadcast_subscriber(subscriber, identifier)
  File "/path/to/python/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/path/to/python/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/path/to/python/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/path/to/python/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 146, in add_broadcast_subscriber
    raise kiwipy.DuplicateSubscriberIdentifier("Broadcast identifier '{}'".format(identifier))
  kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier '157299'

@giovannipizzi
Member

Hi, I've never seen this before. Did you manage to recover from it? Maybe this is something more related to kiwipy? Pinging @sphuber and @muhrin

@BeZie
Author

BeZie commented May 11, 2020 via email

@giovannipizzi
Member

OK. Could you report a bit about how your installation is set up? RabbitMQ version, whether it's on localhost, the version of AiiDA (still 1.1.1?) and of kiwipy (just to double check), etc.

Also, do you know if 157299 in the error message is a PK, a process ID, ...?

@BeZie
Author

BeZie commented May 11, 2020 via email

@muhrin
Contributor

muhrin commented May 11, 2020

Sorry you're having these issues Benedikt, and thanks for reporting. I'm wracking my brain trying to think how this could happen, but I can't come up with a plausible explanation for now. So let me just clarify a little what the various systems are doing, which will put constraints on the possible 'explanation space', and hopefully we can solve it together.

  • Your trace shows that a process (pk=157299) is being recreated from a bundle (as happens, e.g., when the daemon is restarted) and is trying to register itself with the communicator to listen for broadcasts (usually used to respond to a child termination event).
  • The communicator's dictionary already contains an entry with the same pk (see the sketch after this list).
  • Each daemon worker has exactly one communicator (shared between all processes occupying that worker's slots).
  • AiiDA's db uniqueness constraints should guarantee the uniqueness of pks, whilst RabbitMQ should guarantee that exactly one consumer ever has a continue task at a time.
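
The duplicate check behind the exception is essentially dictionary bookkeeping. As a purely illustrative sketch (not kiwipy's actual implementation), the pattern the traceback is guarding against looks roughly like this:

  # Illustrative sketch only -- not kiwipy's real code. It mimics the pattern of
  # registering a broadcast subscriber under a pk-based identifier and refusing
  # a second registration with the same identifier.
  class DuplicateSubscriberIdentifier(Exception):
      """Raised when the same subscriber identifier is registered twice."""

  class BroadcastRegistry:
      def __init__(self):
          self._subscribers = {}

      def add_broadcast_subscriber(self, subscriber, identifier):
          if identifier in self._subscribers:
              raise DuplicateSubscriberIdentifier(f"Broadcast identifier '{identifier}'")
          self._subscribers[identifier] = subscriber

  registry = BroadcastRegistry()
  registry.add_broadcast_subscriber(lambda *args: None, '157299')
  # Registering the same pk a second time reproduces the error reported above:
  # registry.add_broadcast_subscriber(lambda *args: None, '157299')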

As I think this through, one possibility is that a worker that was already working on 157299 lost its RabbitMQ connection, missed two heartbeats (causing RabbitMQ to requeue the message), reconnected, and then got back the same continue message. I don't think that kiwipy recognises this scenario and clears the subscriber lists (and indeed AiiDA will probably have continued working on that process if it could).

@BeZie , would you be able to find the RabbitMQ logs for around the time of that exception (have a look in /var/log/rabbitmq if on *nix). We might be able to see if there was a disconnection.

@BeZie
Author

BeZie commented May 11, 2020

@muhrin The only thing which I find in the log around that time is

closing AMQP connection <0.4871.3> (127.0.0.1:47882 -> 127.0.0.1:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

@BeZie
Author

BeZie commented May 18, 2020

Dear all,
this also happened to me after updating to the current AiiDA version, 1.2.1.

@muhrin
Contributor

muhrin commented May 18, 2020

Hi Benedikt. I had a chat with Sebastiaan on Friday to try and brainstorm a possible mechanism for this failure but we couldn't think of anything other than what I'd already suggested. So, let's gather a few more details and dig deeper:

  1. I presume you're running this all on one computer (your AiiDA install, RabbitMQ and the SQL database)?
  2. Are you shutting down/restarting or incrementing or decrementing your number of workers? (or letting your computer go to sleep)
  3. Could you attach your RabbitMQ log for a relevant period? I'm curious if you see any correlations between the exception occurring and the AMQP connection closures
  4. Could you attach your AiiDA daemon log (~/.aiida/daemon/log)?

Let's see if we get any clues from all that.

@BeZie
Author

BeZie commented May 19, 2020

  1. Yes, only the file repository is located on a network folder that is mounted with sshfs
  2. The crash has occurred during the weekend where the server was running, thus no shutting down, restart, incrementing, or decrementing of workers during this time. (16 May)
  3. and 4. See the attachments. I had to blank out some parts that are not relevant for debugging.
    rabbit.log
    aiida_filtered.log

Thanks already for the help :)

@muhrin
Contributor

muhrin commented May 19, 2020

My pleasure :)

Wow, so looking at these logs it looks like the exact scenario I described is playing out, e.g.:

=ERROR REPORT==== 16-May-2020::11:29:09 ===
closing AMQP connection <0.861.0> (127.0.0.1:58862 -> 127.0.0.1:5672):
missed heartbeats from client, timeout: 60s

Here, AiiDA has missed (presumably) two heartbeats (the default) and RabbitMQ has assumed that client to be dead. Then, 23s later, the automatic reconnection mechanism kicks in and AiiDA reconnects:

=INFO REPORT==== 16-May-2020::11:29:32 ===
accepting AMQP connection <0.16649.3> (127.0.0.1:43536 -> 127.0.0.1:5672)

This all causes some hubbub in the daemon logs but, tellingly, soon after, AiiDA is delivered the same task it was working on before it got disconnected, and that causes this exception:

05/16/2020 11:40:44 AM <15123> aiida.orm.nodes.process.workflow.workchain.WorkChainNode: [ERROR] failed to recreate the process instance in order to continue it.

I'm surprised because the library that communicates with RabbitMQ uses a separate thread, partly so that it can keep responding to heartbeats while other things go on. Now, because this is python, we don't have true multithreading [1] and it's possible that your workload has blocked the interpreter for over two minutes (i.e. 2 heartbeats).

So, can I ask, what kind of workload is in your workchains? Are you doing any heavy in-Python computation, heavy numpy tasks? Or, perhaps, calling a subprocess and blocking waiting on that? Basically anything you think could block a thread for a long time?

[1] https://realpython.com/python-gil/
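
As an aside, one generic way to keep a worker's heartbeat thread responsive is to run CPU-bound steps in a separate process, so they cannot hold the worker's GIL. Here is a minimal sketch of that pattern (this is not AiiDA's calcfunction machinery; heavy_analysis is just a placeholder):

  # Generic sketch: offload a CPU-bound step to a child process so the parent
  # interpreter (and any thread answering RabbitMQ heartbeats) stays responsive.
  from concurrent.futures import ProcessPoolExecutor

  def heavy_analysis(data):
      # stand-in for a long, GIL-holding computation
      return sum(x * x for x in data)

  def run_step(data):
      with ProcessPoolExecutor(max_workers=1) as pool:
          # the parent only waits on a future here, so its other threads keep running
          return pool.submit(heavy_analysis, data).result()

  if __name__ == '__main__':
      print(run_step(range(10_000)))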

@BeZie
Author

BeZie commented May 19, 2020

There are some "heavy" in-Python computations in the workchain as calcfunctions (taking around 15 minutes each, and they are quite memory-hungry). These computations (to my knowledge) do not use multithreading. Maybe numpy does it under the hood.

Best,
Benedikt

@greschd
Member

greschd commented May 19, 2020

To the best of my knowledge numpy releases the GIL when doing heavy computations - but of course it can't be ruled out that this is what's causing the issue.

@muhrin Would running out of memory give a different result? What happens if the daemon process gets kill -9ed, without being able to persist its processes?

I've crashed my workchains by running out of memory (funnily, from a different process) before, but didn't look into the exact traceback because it was clear this was the root cause.

If the missed heartbeat is the root cause, is there any other way to guard against a second worker trying to pick up the same process, or for the second worker to fail gracefully such that the process is not destroyed?

@muhrin
Contributor

muhrin commented May 19, 2020

Thanks @greschd. I'm really quite ignorant of the conditions under which Python will or won't release the GIL, although keeping it for over 2 minutes does, indeed, seem a little extreme.

Regarding processes being killed (as a consequence of OOM, -9, or anything else), this is actually fine, in the sense that the next worker to pick up that task will continue it from the last checkpoint.

This failure mode is a little different. It seems to happen when a worker misses the heartbeats and is presumed dead, then reconnects and gets back the same job it is already working on (which RabbitMQ had no way of knowing, as the worker wasn't responding). Now, one might say that the worker should just recognise this and carry on with what it was doing, but the reality is that another worker could have been delivered that job, and then they would both be working on it, so this isn't a robust solution.

Ultimately, there is no magic solution here. There needs to be some locking mechanism that prevents two workers from working on the same thing and in our case that is the fact that a worker holds an RMQ task (and indeed this is exactly the kind of use case RMQ was written for). In a perfect world I would fix this by:

  1. Having no GIL, so I know the worker is always able to respond, which would leave the primary reasons for missing heartbeats being things like the computer crashing, the network cable being pulled, etc.
  2. Having the worker recognise that it has lost its RabbitMQ connection and terminate that task (as it has no right to run it anymore).

So @BeZie, let's confirm that what I think is happening is true. Can you set the _RMQ_HEARTBEAT_TIMEOUT to some time (in milliseconds) that is at least half of the maximum time you would ever expect a workflow step to take:

_RMQ_HEARTBEAT_TIMEOUT = 600 # Maximum that can be set by client, with default RabbitMQ server configuration

and then report back if the error is cleared up? (you'll need to restart your daemon for this to take effect)

A consequence of setting this is that if a worker terminates in an unclean way your tasks will be frozen for twice that timeout before they can be executed by a worker again.

@greschd
Member

greschd commented May 19, 2020

Having no GIL, so I know the worker is always able to respond, which would leave the primary reasons for missing heartbeats being things like the computer crashing, the network cable being pulled, etc.

Just for the sake of pedantry, that of course wouldn't be sufficient to know the worker can always respond, because there's also OS-level scheduling. Now, I don't know how overloaded your system needs to be for a thread not to be scheduled at all for 2 minutes, but I guess it's possible.

@BeZie
Author

BeZie commented May 19, 2020

Just want to double check: the line comment kind of indicates that 600 ms is already the maximum. Should I somehow adjust the RabbitMQ settings accordingly?

BTW: thanks for the quick response times and for helping me out here.

@muhrin
Contributor

muhrin commented May 19, 2020

Ah, yes, it helps if I read the useful comments that we put into the code :)

So, you have to make the AiiDA change and up the heartbeat in rabbitmq.conf; I think on Ubuntu this can be done by:

  1. Setting HEARTBEAT=[time in seconds] in /etc/rabbitmq/rabbitmq-env.conf [1].
  2. Restarting the service: sudo systemctl restart rabbitmq-server.service, and,
  3. Checking that it has taken effect: sudo rabbitmqctl environment | grep heartbeat should show {heartbeat,x} (a scripted version of this check is sketched below).

Then you should be good to up it in AiiDA.

[1] https://www.rabbitmq.com/configure.html
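
For step 3, a small scripted version of the same check (just a convenience sketch; it assumes rabbitmqctl is on the PATH and, as in the command above, may need sudo):

  # Sketch: read the effective broker-side heartbeat (in seconds) by parsing
  # the output of `rabbitmqctl environment`, mirroring the manual grep above.
  import re
  import subprocess

  def broker_heartbeat_seconds():
      output = subprocess.run(
          ['sudo', 'rabbitmqctl', 'environment'],
          capture_output=True, text=True, check=True,
      ).stdout
      match = re.search(r'\{heartbeat,(\d+)\}', output)
      return int(match.group(1)) if match else None

  if __name__ == '__main__':
      print(broker_heartbeat_seconds())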

@BeZie
Author

BeZie commented May 20, 2020

Going to test the new settings over the weekend and will let you know if I get a random crash.

Some side comments.

  1. The most recent version of rabbitmq in Ubuntu 18.04 LTS seems to use the old config syntax. I have used the official rabbitmq repositories and updated my rabbitmq version.

  2. The max. default time for heartbeat in rabbitmq is 60s but the AiiDA comment indicates that 0.6s is the maximum. This is a little bit confusing.

Thanks again and best regards
Benedikt

@BeZie
Author

BeZie commented May 25, 2020

Unfortunately, the same error has occurred again.

I had set the heartbeat time to about 10 minutes. As a next try, I will transfer the file repository to a local disk and see if this happens again. :) [This will take a couple of days as I need to order a new HDD first... if you have any suggestions until then, please let me know]

Best regards,
Benedikt

@BeZie
Author

BeZie commented Jun 2, 2020

Update:

Short update from my side: after moving everything to a local folder, I have not had any problems yet. I am not sure if I should close the issue because, in principle, it still exists.

Thanks for helping and guiding me to a workaround!

Best regards,
Benedikt

@greschd
Member

greschd commented Jun 2, 2020

Just to clarify: You moved from a network mounted to a local drive, and this resolved the issue, correct?

@muhrin @sphuber can you think of any way that some lock file / bundle file being inaccessible could lead to this issue?

@BeZie
Author

BeZie commented Jun 2, 2020

@greschd yep I moved the file repository from a network mounted drive to a local one.

@sphuber
Contributor

sphuber commented Jun 2, 2020

What is the configuration for RabbitMQ? Is it running on the localhost where AiiDA is running? There is no official support for configuring these settings, but maybe you have manually changed them to connect to another server.

I can't really see why the location of the file repository should influence RabbitMQ. However, we might have been distracted by a red herring. If the file repository was temporarily unavailable due to network issues making the mount unreachable, then this would of course fail any repository operation. There are no built-in error recovery methods for this in AiiDA, because so far we assume the repository is local. If a calculation was writing to the repository, it would except, which would bubble up and also fail all calling workflows. @BeZie, is the exception you pasted in the OP the only exception you saw in any of the reports of the failed processes?

@BeZie
Author

BeZie commented Jun 2, 2020

@sphuber RabbitMQ is running on the same machine (localhost) as AiiDA. No changes to the default RabbitMQ settings have been made except for the heartbeat (see above).
And yes, the only exception is the one that is pasted in the OP and the logs.

@muhrin
Contributor

muhrin commented Jun 2, 2020

Well, again, I'm tempted to think that heartbeats could be a problem here. We know for sure that there is a problem where the connection gets dropped due to heartbeat misses and then re-established, causing the duplicate key. Now, it could be that the large file transfers to a network drive were blocking for more than 10 minutes (although this does seem extreme to me).

@BeZie , could you confirm the output from sudo rabbitmqctl environment | grep heartbeat (or equivalent)? There does seem to be a units mismatch somewhere in AiiDA/kiwi/RMQ to be aware of.

@greschd
Member

greschd commented Jun 2, 2020

@BeZie do you have any kinds of logs from your NFS that could indicate if the error coincided with it being unavailable?

Is the NFS being unavailable / overloaded something you have observed at all?

In my experience, an overloaded NFS can just cause the operation to hang indefinitely, or until traffic to the NFS clears up. Maybe the RabbitMQ heartbeat being missed was indeed a secondary effect - it's still sort of weird this would occur though, because I think the read/write to NFS should release the GIL, and thus the "heartbeat responder" thread should still be active.

@BeZie
Author

BeZie commented Jun 3, 2020

@muhrin the output is
{heartbeat_interval,100},
{heartbeat,300},
@greschd that may take some time. I am not sure if I can get them (I need to consult with our IT staff).

Best regards

@kjappelbaum
Contributor

I also saw this error (e.g., kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier '11545') recently, running everything on localhost as well, without any heavy computations in Python, but the computer went to sleep in between.

Running aiida-core=1.1.1, rabbitmq=3.8.3 on Mac; the error occurred when running some workchain with aiida-cp2k=1.0.0b6.

A relevant part of the rabbitmq log seems to be

2020-06-06 15:14:34.839 [warning] <0.15388.2> closing AMQP connection <0.15388.2> (127.0.0.1:49859 -> 127.0.0.1:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection
2020-06-06 16:21:52.279 [error] <0.280.0> ** Generic server aten_detector terminating
** Last message in was poll
** When Server state == {state,#Ref<0.938071647.959184898.184568>,1000,0.99,#{},#{}}
** Reason for termination ==
** {{timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,215}]},{aten_detector,handle_info,2,[{file,"src/aten_detector.erl"},{line,97}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2020-06-06 16:21:52.289 [error] <0.280.0> CRASH REPORT Process aten_detector with 0 neighbours exited with reason: {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in gen_server:call/2 line 215
2020-06-06 16:21:52.290 [error] <0.277.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.280.0> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated

The output of rabbitmqctl environment | grep heartbeat

 {heartbeat_interval,100},
 {heartbeat,60},

@BeZie
Author

BeZie commented Jun 9, 2020

The error has appeared again for me, this time due to an out-of-memory problem.

Killed process 8524 (verdi) ... kernel
Out of memory: Kill process 8524 (verdi) score 143 or sacrifice child

I am now moving away from memory-intensive calcfunctions to calcjobs to prevent this from happening again. But catching this behavior somehow would be nice anyway :).

Best regards,
Benedikt

@muhrin
Contributor

muhrin commented Jun 9, 2020 via email

@kavanase

kavanase commented Jul 24, 2020

I'm having a similar issue to @kjappelbaum I believe (also running on macOS).

I've received the error kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier '{PID}' a couple of times now. Everything running on localhost, without any heavy Python in the WorkChain. I think, similar to @kjappelbaum, the problem is that my (macOS) laptop on which AiiDA is running went to sleep while the CalcJob within the WorkChain was running on the remote computer.

Using aiida-core=1.3.0, rabbitmq=3.8.3, the error occurred while running a WorkChain from aiida-vasp.

The relevant part of the rabbitmq log I think is:

2020-07-23 22:52:36.326 [warning] <0.6382.4> closing AMQP connection <0.6382.4> (127.0.0.1:49507 -> 127.0.0.1:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection
2020-07-23 22:52:37.280 [info] <0.2616.5> accepting AMQP connection <0.2616.5> (127.0.0.1:50406 -> 127.0.0.1:5672)
2020-07-23 22:52:37.332 [info] <0.2616.5> connection <0.2616.5> (127.0.0.1:50406 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2020-07-24 02:46:25.635 [warning] <0.2616.5> closing AMQP connection <0.2616.5> (127.0.0.1:50406 -> 127.0.0.1:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection
2020-07-24 02:46:25.729 [warning] <0.10644.4> closing AMQP connection <0.10644.4> (127.0.0.1:50617 -> 127.0.0.1:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection
2020-07-24 02:46:26.577 [info] <0.6899.5> accepting AMQP connection <0.6899.5> (127.0.0.1:53530 -> 127.0.0.1:5672)
2020-07-24 02:46:26.765 [info] <0.6902.5> accepting AMQP connection <0.6902.5> (127.0.0.1:53533 -> 127.0.0.1:5672)
2020-07-24 02:46:26.767 [info] <0.6899.5> connection <0.6899.5> (127.0.0.1:53530 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2020-07-24 02:46:26.778 [info] <0.6902.5> connection <0.6902.5> (127.0.0.1:53533 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'

Same output of `rabbitmqctl environment | grep heartbeat`:

 {heartbeat_interval,100},
 {heartbeat,60},

@muhrin
Contributor

muhrin commented Jul 26, 2020

Hi @kavanase, sorry to hear you're experiencing this. Indeed, if a machine sleeps for longer than twice the heartbeat interval, then the RabbitMQ server (which can be on the same machine) will think that it has died. I'm not sure it's particularly easy to prevent this; however, I think the solution I was exploring in my last post could help, i.e. the client running the workchain could (somehow) detect when it has lost the connection and accordingly clear all knowledge of subscribers (which is what is causing the exception you're seeing).

I'm on holiday at the moment but should be able to pick this up in August. I'll post back when I've had a chance to play around.

In the meantime, if you experience this issue for any reason other than the machine sleeping (or losing network connection, etc.), then please do post again.

@ezpzbz
Member

ezpzbz commented Aug 30, 2020

Hi,
I am facing a similar issue and, after discussing with @sphuber and @giovannipizzi, I tried to get to the root of it. I increased the log level on both AiiDA and RabbitMQ to DEBUG and waited for the issue to reappear. Selected parts of the logs for a CalcJob on which the issue occurred are attached.
My system setup is:

Ubuntu: 18.04 LTS
aiida-core: 1.3.0
Rabbitmq: 3.8.7

and it does not sleep.

My issue is happening during the parsing of output files. I just did some timing tests and realized that the OUTCAR takes ~680 seconds to be parsed by the pymatgen parser, which can explain the whole issue. rabbitmq.log shows that, at the time of parsing, the heartbeat is missed for the first time at 20:47:18, then at 20:49:18, and again at 20:51:18. These are associated with the events logged for the first worker in daemon_log3.log. The task was then given to the second worker (logged in daemon_log5.log), which had the same issues. Meanwhile, however, the first worker came back, still holding the results, and registered/stored the result nodes; once the second worker wanted to do the same, we get the extra traceback errors that AiiDA correctly complained about.
For now, I am going to fix the issue in my case by changing the parser to avoid the unacceptably long parsing time. However, it seems this issue needs to be addressed for other processes that can take long but cannot be shortened/avoided.
Cheers,

daemon_log3.log
daemon_log5.log
rabbitmq.log

@Tseplyaev

A few days ago I had the same problem. It was reproducible in the sense that I tried resubmitting the workchain and it always failed for this reason. The laptop was also performing another heavy Python task and was overheating.

My setup was: AiiDA v1.4.0, RabbitMQ 3.7.8, macOS. I used a single daemon worker - restarting it did not help.
Restarting RabbitMQ did not help either.
RabbitMQ log corresponding to the time of failure (around 16:46):

2020-09-25 16:27:07.517 [info] <0.866.0> connection <0.866.0> (127.0.0.1:49787 -> 127.0.0.1:5672): user 'test' authenticated and granted access to vhost 'test'
2020-09-25 16:46:51.233 [info] <0.1029.0> accepting AMQP connection <0.1029.0> (127.0.0.1:50458 -> 127.0.0.1:5672)
2020-09-25 16:46:51.241 [info] <0.1029.0> connection <0.1029.0> (127.0.0.1:50458 -> 127.0.0.1:5672): user 'test' authenticated and granted access to vhost 'test'
2020-09-25 16:46:51.392 [info] <0.1029.0> closing AMQP connection <0.1029.0> (127.0.0.1:50458 -> 127.0.0.1:5672, vhost: 'test', user: 'test')
2020-09-28 11:47:28.472 [info] <0.7.0> Log file opened with Lager

The problem appeared after I added a new user to RabbitMQ by:

$ ./rabbitmqctl add_user test test
$ ./rabbitmqctl add_vhost test
$ ./rabbitmqctl set_permissions -p test test ".*" ".*" ".*"

and set up a new AiiDA profile:

----------------------  -----------------------------------------------------
aiidadb_backend         django
aiidadb_engine          postgresql_psycopg2
aiidadb_host            localhost
aiidadb_name            NAME
aiidadb_pass            PASS
aiidadb_port            5432
aiidadb_repository_uri  file:///Users/tsep/.aiida/repository/develop_profile
aiidadb_user            DB_USER
broker_host             127.0.0.1
broker_password         test
broker_port             5672
broker_protocol         amqp
broker_username         test
broker_virtual_host     test
default_user_email      [email protected]
options                 {}
profile_uuid            532b2c9acc6741699108fe3be50b68a2
----------------------  -----------------------------------------------------

Today, after the laptop was switched off for the weekend, I no longer encounter the problem - the workchain works well.

@zhubonan
Contributor

zhubonan commented Oct 8, 2020

Hi All, just to say that I have also experienced this today:

   raise self._exception
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 646, in done
    result = done_future.result()
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/kiwipy/futures.py", line 54, in capture_exceptions
    yield
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/kiwipy/rmq/utils.py", line 146, in run_task
    future.set_result((yield coro()))
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 393, in add_broadcast_subscriber
    identifier = yield self._message_subscriber.add_broadcast_subscriber(subscriber, identifier)
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/home/bonan/miniconda3/envs/aiida-1.0-py3/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 146, in add_broadcast_subscriber
    raise kiwipy.DuplicateSubscriberIdentifier("Broadcast identifier '{}'".format(identifier))
kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier '312454'
10/08/2020 07:08:41 PM <1237> kiwipy.rmq.tasks: [ERROR] The task excepted

It happened when I had 900 processes running with 10 workers, mostly VASP CalcJobs and WorkChains. I recall doing something very memory-intensive at that time (a mongodbrestore operation...) and then noticing some high usage by the python processes as well (the problem escalating?).

I had previously been running around 600 processes without any problem. Maybe I am pushing it too hard...

I am running AiiDA 1.3.0 on WSL Ubuntu 18.04; the RabbitMQ service is installed on the Windows side.

@muhrin
Contributor

muhrin commented Oct 9, 2020

Thanks @zhubonan. So if I understand correctly, your situation was that your AiiDA workflows were doing a fairly large amount of in-Python processing and then suddenly the exception occurred? There was no sleep of the system involved? Would you mind having a look at the RabbitMQ logs around the time of (or just before) the exception? I'm wondering if there's a line that says something like closing AMQP connection around then.

@muhrin
Contributor

muhrin commented Oct 9, 2020

@Tseplyaev, thanks a lot for the report! Can I just check about the new RMQ user you added: did you add it first and then start running the workchains that would consistently crash? And you were also getting this DuplicateSubscriberIdentifier exception in the AiiDA logs?

@zhubonan
Contributor

zhubonan commented Oct 9, 2020

error.log
rmq.log

Here are the two files from around the time the issue first occurred. I think I am having a different issue here; the problem may not be on the AiiDA side at all. The RMQ log suggests there were multiple server restarts (unexpected, as I did not manually restart the server). The first sign of the crash, 2020/10/08 19:08:16, seems to coincide with a RMQ server restart, as indicated by the "Log file opened" line.

The "closing AMQP connection" line happened after the problem started to escalate. But I can see similar lines in the log earlier on, but no problem noticed for the normal operation of AiiDA.

@chrisjsewell changed the title from "Crash of all workflows" to "Crash of all workflows (DuplicateSubscriberIdentifier)" on Feb 26, 2021
@chrisjsewell
Member

chrisjsewell commented Feb 28, 2021

One thing that seems weird to me in all this, is that everyone is reporting duplicate broadcast subscribers: DuplicateSubscriberIdentifier: Broadcast identifier, yet in the code the RPC subscriber is added first (https://github.com/aiidateam/plumpy/blob/db0bf6033aa3a9b69e0e5b30206df05135538fd7/plumpy/processes.py#L287)
So why is no one seeing: DuplicateSubscriberIdentifier: RPC identifier?

Looking at kiwipy.RmqSubscriber (https://github.com/aiidateam/kiwipy/blob/5efa84c023e14f70f721118aea27cd0e417040be/kiwipy/rmq/communicator.py#L69), it may be of note that in add_rpc_subscriber a duplicate is not detected by whether self._rpc_subscribers[identifier] already exists, but rather whether aiormq.Channel.consumer[identifier] already exists.

I would also note that in aiida-core the broadcast subscriber is actually currently redundant, because in aiida/cmdline/commands/cmd_process.py, when verdi process pause/play --all is called, it queries the database to get all running/paused processes and then makes individual RPC calls, rather than sending a broadcast. So really, we never want an issue with adding it to cause the process to except (unless, of course, it is pointing to a deeper underlying problem).

@Tseplyaev

In recent weeks I have been experiencing this problem again. I now use another computer (not the one I used when I reported this problem earlier). I think I was able to quasi-reproduce it - not exactly, but very close.

The workchain I use consists of two steps: the first is several CalcJobs calculated on the localhost, and the second is a single CalcJob performed on the remote machine.

I use a quite dumb script to submit the workchain: before each submission it checks the total number of CalcJobs running, and if this number is larger than X it waits for Y seconds; if not, it submits a new workchain.

Initially I used X=120 and Y=60 -> some of the submitted workchains crashed due to the problem in discussion.
In this case, calculations on the localhost overlapped with the submission of a new workchain, i.e. heavy calculations from two different WorkChains were performed simultaneously. In addition, the daemon workers were overloaded at some point due to the large number of processes running. I tried submitting with this setup three times; I thought that restarting the computer would help, but it didn't.

Then I changed the parameters to X=50 and Y=240. Heavy initial localhost calculations did not overlap between different workchains, the daemon workers were not overloaded, and I did not experience the DuplicateSubscriberIdentifier problem anymore. It is important to note that I did not even restart the computer between the failed and successful submissions.

@sphuber
Contributor

sphuber commented Oct 27, 2022

One thing that seems weird to me in all this, is that everyone is reporting duplicate broadcast subscribers: DuplicateSubscriberIdentifier: Broadcast identifier, yet in the code the RPC subscriber is added first (https://github.com/aiidateam/plumpy/blob/db0bf6033aa3a9b69e0e5b30206df05135538fd7/plumpy/processes.py#L287) So why is no one seeing: DuplicateSubscriberIdentifier: RPC identifier?

Looking at kiwipy.RmqSubscriber (https://github.com/aiidateam/kiwipy/blob/5efa84c023e14f70f721118aea27cd0e417040be/kiwipy/rmq/communicator.py#L69), it may be of note that in add_rpc_subscriber a duplicate is not detected by whether self._rpc_subscribers[identifier] already exists, but rather whether aiormq.Channel.consumer[identifier] already exists.

I think you have explained the asymmetry of seeing the exception coming from the broadcast and not the RPC subscribers: the former is handled by kiwipy, whereas the latter is handled by aiormq. It may very well be that there is a bug in kiwipy that doesn't properly clean up broadcast subscribers once the connection is lost. Maybe aiormq does actually reset the list of subscribers once it loses the connection, as in principle that means it should assume it no longer "owns" the tasks it was working on that are connected to those subscriptions and so should relinquish them.
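
To make the idea concrete, the kind of cleanup being discussed could look schematically like this (a generic sketch of the concept only, not kiwipy's or aiormq's actual reconnection handling):

  # Sketch of the proposed behaviour: when the connection is lost, forget all
  # subscriptions, since the worker no longer "owns" the tasks tied to them.
  # Re-registering the same pk after a reconnect then cannot collide.
  class SubscriberRegistry:
      def __init__(self):
          self._broadcast_subscribers = {}

      def add_broadcast_subscriber(self, subscriber, identifier):
          if identifier in self._broadcast_subscribers:
              raise RuntimeError(f"Broadcast identifier '{identifier}'")
          self._broadcast_subscribers[identifier] = subscriber

      def on_connection_lost(self):
          # relinquish everything tied to the dropped connection
          self._broadcast_subscribers.clear()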

@sphuber
Contributor

sphuber commented Oct 27, 2022

I closed this through #5715 because it may solve at least part of these cases. Since these reports are very old, it is difficult to know now to what extent the fix will work. It is very likely that the bug is still present, but it will just occur less often.

If someone comes across this bug in aiida-core>=2.1 please just open a new issue and reference this one as being related.
