Tests: Manually stop daemon after `verdi devel revive` test #5689
Conversation
I think this should fix the problem, although the solution is not really ideal. It is a bit of a workaround, since I still couldn't understand 100% what is going on. This solution seems to work for now, though, and since the failure is really messing with all builds, we might want to consider merging this while we investigate further to find the real root cause.
Warnings are raised when a profile is loaded that configures a RabbitMQ server with an unsupported version, or when the installed `aiida-core` code is not a released version. These warnings are not relevant for testing, so they are suppressed by setting the relevant config options. The options are set on the automatically created config in the case of the temporary test profile, as well as on the test profile that is created manually before the tests run in the GitHub Actions workflow.
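As a rough sketch of what this amounts to (the option names `warnings.development_version` and `warnings.rabbitmq_version` are assumptions and may not match the actual implementation exactly), the warnings can be switched off on the config used for the tests:

```python
# Sketch only: disable version-related warnings on the config used for testing.
# The option names below are assumptions based on the description above.
from aiida.manage.configuration import get_config

config = get_config()
config.set_option('warnings.development_version', False)  # non-released aiida-core
config.set_option('warnings.rabbitmq_version', False)     # unsupported RabbitMQ version
config.store()
```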
The `Computer` created by the `aiida_localhost` fixture configures the `core.direct` scheduler plugin, which does not support setting a maximum memory directive. Doing so leads to a warning being logged every time a job is submitted to the computer.
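For illustration, a minimal sketch of the idea (not the actual fixture code; the label and work directory are hypothetical): the computer is set up with the `core.direct` scheduler and no memory default is configured.

```python
# Sketch only: a localhost computer using the `core.direct` scheduler.
# Deliberately no `set_default_memory_per_machine` call, since the direct
# scheduler cannot honour it and would log a warning for every submitted job.
from aiida import orm

computer = orm.Computer(
    label='localhost',              # hypothetical label
    hostname='localhost',
    transport_type='core.local',
    scheduler_type='core.direct',
    workdir='/tmp/aiida_run',       # hypothetical work directory
).store()
```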
If the `submit_and_wait` fixture times out waiting for the submitted process to reach the desired state, usually there is a problem with the daemon workers. To make debugging easier, the status of the daemon as well as the content of the daemon log file are included in the exception message.
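Conceptually, the timeout branch now gathers the daemon diagnostics before raising, roughly along these lines (a sketch with an assumed helper name; the real fixture is structured differently):

```python
# Sketch only: poll a process node and include daemon diagnostics in the timeout error.
import time

from aiida.engine.daemon.client import get_daemon_client


def wait_for_process_state(node, target_state, timeout=30.0, interval=0.1):
    """Wait until ``node`` reaches ``target_state`` or raise with daemon status and log."""
    start = time.time()
    while node.process_state != target_state:
        if time.time() - start > timeout:
            client = get_daemon_client()
            with open(client.daemon_log_file, encoding='utf-8') as handle:
                daemon_log = handle.read()
            raise AssertionError(
                f'timed out waiting for {node} to reach `{target_state}`.\n'
                f'Daemon running: {client.is_daemon_running}\n'
                f'Daemon log:\n{daemon_log}'
            )
        time.sleep(interval)
```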
There was a problem where the `verdi process pause` test in `tests/cmdline/commands/test_process.py` would except because the timeout would be hit. The direct cause was that the daemon worker could not load the node from the database, which in turn was because the session was in a pending rollback state after a previous database operation had excepted. That exception seemed to be due to the daemon trying to call `CalcJob.delete_state` or `Process.delete_checkpoint` in the `on_terminated` calls. For some reason, the update statement executed for this, to remove the relevant attribute key, would match 0 rows. The suspicion is that the relevant node had already been removed from the database, probably because another test, run between the two daemon tests, had cleaned the database, so the node no longer existed but the process task somehow did. It is not quite clear exactly where the problem lies, but for now the temporary work-around is to manually stop the daemon in the first test, which apparently cleans the state such that the original exception is no longer hit and the daemon doesn't get stuck with an inconsistent session.
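In essence, the work-around just stops the daemon explicitly at the end of the `verdi devel revive` test, roughly like this (a sketch; the fixture name `started_daemon_client` and the test structure are assumptions, not the actual test code):

```python
# Sketch of the work-around only; fixture names and test structure are assumptions.
def test_revive(started_daemon_client, submit_and_wait):
    """Exercise `verdi devel revive`, then stop the daemon explicitly."""
    # ... the actual revive assertions go here ...

    # Work-around: stop the daemon so a later daemon test does not inherit a
    # worker whose database session is stuck in a pending-rollback state.
    started_daemon_client.stop_daemon(wait=True)
```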
@chrisjsewell I am merging this soon since it is blocking all other PRs. Let me know if you still want to have a look, otherwise I will go ahead.
cheers
Fixes #5687