
update to run_sorter_jobs() and slurm #3105

Open · wants to merge 30 commits into main
Conversation

@MarinManuel (Author)

I am trying to run spikeinterface sorters on an HPC cluster and needed to pass specific parameters to slurm/sbatch.

I added the option to pass arbitrary arguments to sbatch in the engine_kwargs.

Let me know if that causes any issues.
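For illustration, a minimal sketch of the calling convention this PR enables. `run_sorter_jobs`, engine="slurm", and the `tmp_script_folder` / `sbatch_args` keys come from this PR; the `recording` object, the sbatch values, and the job_list contents are placeholder assumptions:

from spikeinterface.sorters import run_sorter_jobs

# Each job is a dict of run_sorter() arguments; values here are placeholders
job_list = [
    dict(sorter_name="kilosort2_5", recording=recording, folder="/scratch/sorting_0"),
]

run_sorter_jobs(
    job_list,
    engine="slurm",
    engine_kwargs=dict(
        tmp_script_folder="/scratch/slurm_scripts",  # where the generated si_script_*.py files are written
        # each key is prefixed with -- and forwarded to sbatch
        sbatch_args={"cpus-per-task": 16, "mem": "32G", "gres": "gpu:1"},
    ),
)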

@alejoe91 (Member)

@MarinManuel thank you! This is great!

The SLURM deployment has been unexplored on our side, so we are very glad that it works for you.

Would you mind sharing the script that you are using? I'd love to add a How To page on deploying multiple spike sorting jobs on SLURM.

@zm711 (Collaborator) left a comment

@alejoe91 / @samuelgarcia know the launcher stuff better: is popping the best strategy, or could you just pull out the kwargs without changing the dict as you go? (See the sketch below this comment.)

Otherwise, I just added some fixes for our docstring style :)
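For context, a minimal sketch of the two strategies in that question, assuming engine_kwargs is the dict handled inside launcher.py (illustrative only, not the actual PR code):

# Strategy 1: pop() consumes entries, mutating the caller's dict as you go
tmp_script_folder = engine_kwargs.pop("tmp_script_folder", None)
sbatch_args = engine_kwargs.pop("sbatch_args", {})

# Strategy 2: get() reads entries without changing the dict
tmp_script_folder = engine_kwargs.get("tmp_script_folder", None)
sbatch_args = engine_kwargs.get("sbatch_args", {})

A side effect of popping is that the caller's dict is modified in place, which can surprise a user who reuses the same engine_kwargs across calls.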

@alejoe91 added the `sorters` (Related to sorters module) label on Jul 1, 2024
@JoeZiminski (Collaborator) left a comment

This is very cool! It really opens up the SLURM options here. I have a couple of comments now, and will try it out on our HPC ASAP!

@JoeZiminski (Collaborator) commented Jul 29, 2024

Hey @MarinManuel, sorry for the delay in testing this locally. I tried it on our cluster and it seems to be working well! I had a couple of other comments below. It might also be worth adding a test for this: I have pasted one in the dropdown below. I don't think I can push to your fork, but feel free to add it to src/spikeinterface/sorters/tests/test_launcher.py underneath test_run_sorter_jobs_slurm.

If you think of any functionality this is missing, please let me know / feel free to expand the test. This will also require adding pytest-mock to the dev dependencies in pyproject.toml in the main repo folder. It would be cool to also test the generated .py files sometime, but that's outside the scope of this PR.

The test
import tempfile  # used at the end of this test; may already be imported in test_launcher.py

def test_run_sorter_jobs_slurm_kwargs(mocker, tmp_path, job_list):
    """
    Mock `subprocess.run()` to check that engine_kwargs are
    propagated to the call as expected.
    """
    # First, mock `subprocess.run()`, set up a call to `run_sorter_jobs`
    # then check the mocked `subprocess.run()` was called with the
    # expected signature. Two jobs are passed in `jobs_list`, first
    # check the most recent call.
    mock_subprocess_run = mocker.patch(
        "spikeinterface.sorters.launcher.subprocess.run"
    )

    tmp_script_folder = tmp_path / "slurm_scripts"

    engine_kwargs = dict(
        tmp_script_folder=tmp_script_folder,
        sbatch_args={
            "cpus-per-task": 32,
            "mem": "32G",
            "gres": "gpu:1",
            "any_random_kwarg": 12322,
        }
    )
    run_sorter_jobs(
        job_list,
        engine="slurm",
        engine_kwargs=engine_kwargs,
    )

    script_0_path = f"{tmp_script_folder}/si_script_0.py"
    script_1_path = f"{tmp_script_folder}/si_script_1.py"

    expected_command = [
        "sbatch", "--cpus-per-task", "32", "--mem", "32G", "--gres", "gpu:1", "--any_random_kwarg", "12322", script_1_path
    ]
    mock_subprocess_run.assert_called_with(expected_command, capture_output=True, text=True)


    # Next, check the first call (which sets up `si_script_0.py`)
    # also has the expected arguments.
    expected_command[9] = script_0_path
    assert mock_subprocess_run.call_args_list[0].args[0] == expected_command

    # Next, check that defaults are used properly when no kwargs are
    # passed. This will default to `_default_engine_kwargs` as
    # set in `launcher.py`
    run_sorter_jobs(
        job_list,
        engine="slurm",
        engine_kwargs={"tmp_script_folder": tmp_script_folder},
    )
    expected_command = [
        "sbatch", "--cpus-per-task", "1", "--mem", "1G", script_1_path
    ]
    mock_subprocess_run.assert_called_with(expected_command, capture_output=True, text=True)

    # Finally, check that the `tmp_script_folder` is generated on the
    # fly as expected. A random foldername is generated, just check that
    # the folder to which the scripts are saved is in the `tempfile` format.
    run_sorter_jobs(
        job_list,
        engine="slurm",
        engine_kwargs=None,  # TODO: test defaults
    )
    tmp_script_folder = "_".join(tempfile.mkdtemp(prefix="spikeinterface_slurm_").split("_")[:-1])
    assert tmp_script_folder in mock_subprocess_run.call_args_list[-1].args[0][5]

Some other small things:

  • I think that the other, currently existing test (test_run_sorter_jobs_slurm) might need to be updated to the new way of passing the sbatch_args.
  • I think it is worth adding an error if any key-value pair other than tmp_script_folder or sbatch_args exists in engine_kwargs. The reason is that this PR is backward-incompatible with previous versions but will not error; it will only silently fail to pass the kwargs on to slurm (e.g. if cpus_per_task is passed to engine_kwargs directly, as it was previously, it will just be ignored). So it is good to raise an error here to let the user know about the new API (see the sketch after this list).
  • Also, might as well also assert / raise an error so that engine_kwargs is only not None when engine="slurm".
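A minimal sketch of the validation suggested in the second bullet, assuming the slurm branch of run_sorter_jobs receives engine_kwargs as a dict (the names and message are illustrative, not actual PR code):

# hypothetical constant listing the only keys the slurm engine understands
ALLOWED_SLURM_KEYS = {"tmp_script_folder", "sbatch_args"}

def _check_slurm_engine_kwargs(engine_kwargs):
    """Raise if engine_kwargs contains keys the slurm engine would silently ignore."""
    unknown = set(engine_kwargs) - ALLOWED_SLURM_KEYS
    if unknown:
        raise ValueError(
            f"Unknown engine_kwargs for engine='slurm': {sorted(unknown)}. "
            "sbatch options must now be nested under the 'sbatch_args' key."
        )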

@MarinManuel (Author)

Hi @JoeZiminski,
I tried to integrate your comments into my PR. Hopefully I did it right.
The only thing I did not do was:

Also, might as well also assert / raise an error so that engine_kwargs is only not None when engine="slurm".

I wasn't entirely sure what you meant, but I think sbatch would launch with default parameters if engine_kwargs=None, so that shouldn't be a problem?
Also, I didn't know how to write a test to check that the exception was raised correctly.
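For reference, one way to test that such an exception is raised is pytest.raises. This sketch assumes the ValueError from the unknown-key check suggested above; the match string must be adapted to the actual error message, and run_sorter_jobs is assumed to be imported in test_launcher.py:

import pytest

def test_run_sorter_jobs_slurm_rejects_unknown_kwargs(job_list):
    # an old-style key passed directly to engine_kwargs should now raise
    with pytest.raises(ValueError, match="Unknown engine_kwargs"):
        run_sorter_jobs(
            job_list,
            engine="slurm",
            engine_kwargs={"cpus-per-task": 32},
        )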

@JoeZiminski (Collaborator) commented Sep 10, 2024

Hey @MarinManuel, so sorry for the delay with this! I'm around again now; please feel free to ping me and we can merge this ASAP.

Just a couple of small things I think:

  • please see a couple of small comments on the tests
  • there are a few GitHub suggestions (e.g. from Zach) above. These can be committed by selecting 'Commit suggestion', changing the commit message (e.g. just 'docstring fix' or something is okay) and then 'Commit changes'. It's a cool GitHub feature!
  • Any other comments that are outdated or have been addressed can be resolved.

Also, might as well also assert / raise an error so that engine_kwargs is only not None when engine="slurm".

Sorry, this was the other way round: the idea is to raise in case a user who is unfamiliar with SLURM (say, using 'dask') passes engine_kwargs under the false impression that they do something for all engines. Alternatively, we could rename engine_kwargs to slurm_engine_kwargs, as they only relate to slurm. Actually this is better: maybe they can be called slurm_engine_kwargs or just slurm_kwargs?

After this I can do one last pass and test locally again, then I think it's good to go! Apologies the review took so long; this new feature is a super-useful addition.

@MarinManuel (Author)

Hi @JoeZiminski,
Please let me know how that looks.

@JoeZiminski (Collaborator) commented Sep 12, 2024

Hi @MarinManuel, I'm incredibly sorry, I misunderstood from the docstring how engine_kwargs was intended; I thought it was just for SLURM, based on the below:

old docstring
engine_kwargs : dict
        In the case of engine="slurm", possible kwargs are:
        - tmp_script_folder: str, default None
            the folder in which the job scripts are created. Default: directory created by
            the `tempfile` library
        - sbatch_args: dict
          arguments to be passed to sbatch. They will be automatically prefixed with --.
          Arguments must be in the format slurm specifies; see the
          [documentation for `sbatch`](https://slurm.schedmd.com/sbatch.html) for a list of possible arguments

It might be worth reverting that change such that all kwargs come under engine_kwargs as before. The new docstring can look as below. I'm happy to do this if you give me access to your fork, as I appreciate it will be annoying to undo something you just did due to my mistake!

New docstring
    engine_kwargs : dict
        Parameters to be passed to the underlying engine.
        Defaults are:
        * loop : None
        * joblib : n_jobs=1, backend="loky"
        * multiprocessing : max_workers=2, mp_context=None
        * dask : client=None
        * slurm : tmp_script_folder=None
        for the `slurm` entry:
            This dictionary contains arguments to be passed to sbatch.
            They will be automatically prefixed with --.
            Arguments must be in the format slurm specifies; see the
            [documentation for `sbatch`](https://slurm.schedmd.com/sbatch.html) for a list of possible arguments
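To make those defaults concrete, a hedged usage sketch (engine names and defaults are taken from the docstring above; job_list is assumed to be defined as in the tests):

# joblib engine, overriding the documented n_jobs default
run_sorter_jobs(job_list, engine="joblib", engine_kwargs={"n_jobs": 4})

# slurm engine, with sbatch options nested under "sbatch_args"
run_sorter_jobs(
    job_list,
    engine="slurm",
    engine_kwargs={
        "tmp_script_folder": "/scratch/slurm_scripts",
        "sbatch_args": {"cpus-per-task": 8, "mem": "16G"},
    },
)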

Some other very minor things that came up when manually testing, then I think it's good to go!

  • Need to add pytest-mock here and here in the pyproject.toml, as it was a new dependency added in the test.
  • The test is failing on Windows as some os commands used in run_sorter_jobs assume we are on Linux. In the end we should skip unless we are running on Linux: we need to add from platform import system at the top of test_launcher.py and @pytest.mark.skipif(system() != 'Linux', reason="Assumes we are on Linux to run SLURM") on the test (see the snippet after this list).
  • When manually testing, it was hard to figure out what was being passed to slurm by the CLI. I think we can add a quick print before the subprocess call to make this clear, e.g. adding print(f"subprocess called with command {progr}\n") here.
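Putting the second bullet together, a sketch of how the skip marker would look on the test from earlier in the thread:

from platform import system

import pytest

@pytest.mark.skipif(system() != "Linux", reason="Assumes we are on Linux to run SLURM")
def test_run_sorter_jobs_slurm_kwargs(mocker, tmp_path, job_list):
    ...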

Then all good from my end, sorry about this last-minute back and forth and thanks again for this super useful addition!

@JoeZiminski (Collaborator)

Hi @MarinManuel, thanks for this! Apologies, please resend the invite and I will fix the tests and conflicts. Thanks!
