core.direct scheduler: kill command doesn't stop the underlying jobs #6571

Open
t-reents opened this issue Sep 25, 2024 · 2 comments

@t-reents

Describe the bug

My jobs are not properly killed when running on localhost with the core.direct scheduler. For example, when I kill a PwCalculation, the corresponding CalcJobNode disappears from the verdi process list and is marked as killed. However, the underlying pw.x processes keep running, as shown by my CPU usage (and by the output of the top command).

Your environment

  • Operating system [e.g. Linux]: Linux
  • Python version [e.g. 3.7.1]: Python 3.10.12
  • aiida-core version [e.g. 1.2.1]: 2.5.1 and 2.6.2

Additional context

Initially, I thought this was related to the verdi presto command, as I only observed this behavior with my presto profiles. However, after manually creating a new computer and specifying core.slurm as the scheduler, the problem disappeared. Therefore, it really seems to be related to the core.direct scheduler.

@agoscinski
Contributor

agoscinski commented Sep 26, 2024

Can reproduce this with

from aiida import load_profile, engine, orm
load_profile()

builder = orm.load_code("bash@localhost").get_builder()
builder.x = orm.Int(2)
builder.y = orm.Int(3)
builder.metadata.options.sleep = 100000
engine.run(builder)

The underlying process is called sleep, and it persists after a kill. The Python instance in which the calcjob was started is killed, but I still see sleep in my process list.
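The same behavior can be reproduced without AiiDA. The following is a minimal sketch, assuming a POSIX system: a parent process starts a long-running child, the parent is terminated, and the child is then checked for survival. The helper names (`pid_exists`, `kill_parent_and_check_child`) are made up for this illustration and are not part of aiida-core.

```python
import os
import signal
import subprocess
import sys

# Code run by the child: it simply sleeps, like `sleep 100000` in the report.
CHILD = "import time; time.sleep(60)"

# Code run by the parent: start the child, print its PID, then sleep as well.
PARENT = (
    "import subprocess, sys, time\n"
    f"child = subprocess.Popen([sys.executable, '-c', {CHILD!r}])\n"
    "print(child.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def kill_parent_and_check_child() -> bool:
    """Terminate only the parent and report whether the child is still there."""
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT], stdout=subprocess.PIPE, text=True
    )
    child_pid = int(parent.stdout.readline())
    parent.terminate()  # sends SIGTERM to the parent only
    parent.wait()
    parent.stdout.close()
    survived = pid_exists(child_pid)
    if survived:
        os.kill(child_pid, signal.SIGTERM)  # clean up the orphaned child
    return survived


if __name__ == "__main__":
    print("child survived parent kill:", kill_parent_and_check_child())
```

Signalling a process does not signal its children, so the child has to be killed explicitly.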

@agoscinski
Contributor

agoscinski commented Sep 26, 2024

Contrary to what I initially thought, kill only kills the given process, not its child processes. So in the code above, with this pstree

-+- 97560 alexgo bash _aiidasubmit.sh
 \-+- 97561 alexgo /opt/homebrew/bin/bash
   \--- 97562 alexgo sleep 100000

the sleep process is kept alive even if we kill its parent process. One solution would be to change the kill_job function to send a kill command to all descendant processes. So instead of

    def kill_job(self, jobid: str) -> bool:
        """Kill a remote job and parse the return value of the scheduler to check if the command succeeded.

        ..note::

            On some schedulers, even if the command is accepted, it may take some seconds for the job to actually
            disappear from the queue.

        :param jobid: the job ID to be killed
        :returns: True if everything seems ok, False otherwise.
        """
        retval, stdout, stderr = self.transport.exec_command_wait(self._get_kill_command(jobid))
        return self._parse_kill_output(retval, stdout, stderr)

we do

    def kill_job(self, jobid: str) -> bool:
        """Kill a remote job and parse the return value of the scheduler to check if the command succeeded.

        ..note::

            On some schedulers, even if the command is accepted, it may take some seconds for the job to actually
            disappear from the queue.

        :param jobid: the job ID to be killed
        :returns: True if everything seems ok, False otherwise.
        """
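        import psutil

        # Collect all descendants first, since killing only the parent leaves them running.
        # Note: psutil inspects the local machine, so this only works with a local transport.
        process = psutil.Process(int(jobid))
        children = process.children(recursive=True)
        jobids = [str(child.pid) for child in children]
        jobids.append(jobid)
        retval, stdout, stderr = self.transport.exec_command_wait(self._get_kill_command(" ".join(jobids)))
        return self._parse_kill_output(retval, stdout, stderr)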

EDIT: this should be moved into the _get_kill_command function of the direct scheduler, as it would otherwise interfere with Slurm job IDs.
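Since psutil only sees processes on the machine where Python runs, a variant that also works over a remote transport could derive the descendants from `ps` output instead. Below is a rough sketch of that idea, not the actual aiida-core implementation. It assumes the text comes from something like `ps -e -o pid=,ppid=` (one "PID PPID" pair per line) and that the direct scheduler kills with a plain `kill <pids>`. Both function names are hypothetical.

```python
from collections import defaultdict


def descendant_pids(ps_output: str, root_pid: int) -> list[int]:
    """Return all descendants of root_pid, given lines of 'PID PPID' pairs."""
    children = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            pid, ppid = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # skip header-like lines
        children[ppid].append(pid)

    found = []
    seen = {root_pid}
    stack = [root_pid]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


def build_kill_command(ps_output: str, jobid: str) -> str:
    """Build one kill command covering the job's descendants and the job itself."""
    pids = descendant_pids(ps_output, int(jobid)) + [int(jobid)]
    # Assumption: a plain `kill` is used; the real scheduler command may differ.
    return "kill " + " ".join(str(pid) for pid in pids)


if __name__ == "__main__":
    # Process tree taken from the pstree output in this comment.
    sample = "97560 1\n97561 97560\n97562 97561\n123 1\n"
    print(build_kill_command(sample, "97560"))
```

Keeping this logic in the direct scheduler leaves job IDs of other schedulers such as Slurm untouched, because they are never interpreted as PIDs.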

agoscinski added a commit to agoscinski/aiida-core that referenced this issue Sep 26, 2024