
Fix import error when distributed.rmm.pool-size is set #6482

Merged · 4 commits merged from fix/set_distributed.rmm.pool-size into dask:main on Jun 6, 2022

Conversation

@KoyamaSohei (Contributor)

Hi,
I am trying to use dask-cuda with UCX integration for our research.

On executing the following command:

```
$ DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True DASK_DISTRIBUTED__RMM__POOL_SIZE=1GB dask-scheduler --interface ib0 --protocol="ucx"
```

an error occurred in rmm.reinitialize().

This happens when

  • numba is installed, and
  • rmm is not installed,

because the preceding code (which sets device_array) only checks whether rmm or numba is installed; when only numba is available, the name rmm is never bound, yet it is still referenced later.

This seems to be solved simply by importing rmm.
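For illustration, here is a minimal sketch of the failure mode (the function name and the device_array bodies are hypothetical; this is not the exact distributed/comm/ucx.py source):

```python
import dask
from dask.utils import parse_bytes


def init_once_sketch():
    try:
        import rmm  # binds the local name `rmm` only if the import succeeds

        def device_array(n):
            return rmm.DeviceBuffer(size=n)

    except ImportError:
        try:
            import numba.cuda  # fallback taken when rmm is missing

            def device_array(n):
                return numba.cuda.device_array((n,), dtype="u1")

        except ImportError:
            raise ImportError("rmm or numba is required for CUDA support")

    pool_size_str = dask.config.get("distributed.rmm.pool-size")
    if pool_size_str is not None:
        pool_size = parse_bytes(pool_size_str)
        # When only numba was importable, `rmm` was never assigned, so this
        # line raises: UnboundLocalError: local variable 'rmm' referenced
        # before assignment
        rmm.reinitialize(
            pool_allocator=True, managed_memory=False, initial_pool_size=pool_size
        )
```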

Stack trace:

```
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/cli/dask_scheduler.py", line 208, in main
    loop.run_sync(run)
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/cli/dask_scheduler.py", line 204, in run
    await scheduler
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/core.py", line 309, in _
    await self.start()
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/scheduler.py", line 3391, in start
    await self.listen(
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/core.py", line 455, in listen
    listener = await listen(
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/comm/core.py", line 212, in _
    await self.start()
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/comm/ucx.py", line 462, in start
    init_once()
  File "/work/NBB/skoyama/miniconda3/envs/ucx/lib/python3.9/site-packages/distributed/comm/ucx.py", line 153, in init_once
    rmm.reinitialize(
UnboundLocalError: local variable 'rmm' referenced before assignment
```

Regards,

@GPUtester (Collaborator)

Can one of the admins verify this patch?

@KoyamaSohei (Contributor, Author)

The linting failed, so I will try to fix it.

@KoyamaSohei force-pushed the fix/set_distributed.rmm.pool-size branch from 61e0539 to 21e2033 on May 31, 2022 at 22:31
@github-actions (bot) commented May 31, 2022

Unit Test Results

    15 files ±0   15 suites ±0   6h 27m 37s ⏱️ +18m 15s

 2 819 tests ±0   2 738 ✔️ ±0    80 💤 ±0  1 ❌ ±0
20 902 runs  ±0  19 904 ✔️ −1   997 💤 +1  1 ❌ ±0

For more details on these failures, see this check.

Results for commit 6575251. ± Comparison against base commit c2b28cf.

♻️ This comment has been updated with latest results.

@jrbourbeau (Member) left a comment
Thanks @KoyamaSohei! @pentschev @charlesbluca -- do either of you have a moment to look at this?

@pentschev (Member) left a comment

I would prefer to handle this differently, as per my suggestions. When the user requests an initial pool size, one must assume that is intended; otherwise we just silently do nothing, which can be difficult to debug later.

Comment on lines 133 to 139 of distributed/comm/ucx.py:

```python
pool_size_str = dask.config.get("distributed.rmm.pool-size")
if pool_size_str is not None:
    pool_size = parse_bytes(pool_size_str)
    rmm.reinitialize(
        pool_allocator=True, managed_memory=False, initial_pool_size=pool_size
    )
```
@KoyamaSohei (Contributor, Author)

Thank you, @pentschev !!!
The suggestion is very appropriate; ignoring parameters implicitly is never a good idea.

However, calling import rmm twice caused a linting error (please refer to my first commit), so I want to apply another solution.

@KoyamaSohei (Contributor, Author)

I added a warning message that is emitted when distributed.rmm.pool-size is ignored.

@pentschev (Member)

> But calling import rmm twice caused a linting error (please refer to my first commit), so I want to apply another solution.

Fair enough. GitHub won't allow me to suggest changes beyond the scope of the PR, but I would suggest a couple more things:

  1. Move pool_size_str = dask.config.get("distributed.rmm.pool-size") before the try/except block, so you don't need to read it twice;
  2. Move the warning to just after except ImportError (i.e., before the try: import numba.cuda fallback), since the warning is entirely unrelated to Numba, but the way it is currently written makes it look like it is.
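Putting both suggestions together, here is a minimal sketch of how init_once might end up (the function name and warning text are hypothetical; this is not the exact code merged in the PR):

```python
import logging

import dask
from dask.utils import parse_bytes

logger = logging.getLogger(__name__)


def init_once_sketch():
    # Suggestion 1: read the config once, before the try/except block
    pool_size_str = dask.config.get("distributed.rmm.pool-size")

    try:
        import rmm

        def device_array(n):
            return rmm.DeviceBuffer(size=n)

        if pool_size_str is not None:
            pool_size = parse_bytes(pool_size_str)
            rmm.reinitialize(
                pool_allocator=True,
                managed_memory=False,
                initial_pool_size=pool_size,
            )
    except ImportError:
        # Suggestion 2: warn here, directly under `except ImportError`, so the
        # warning is clearly about rmm being unavailable and unrelated to Numba
        if pool_size_str is not None:
            logger.warning(
                "Ignoring 'distributed.rmm.pool-size' because rmm is not installed"
            )
        try:
            import numba.cuda

            def device_array(n):
                return numba.cuda.device_array((n,), dtype="u1")

        except ImportError:
            raise ImportError("rmm or numba is required for CUDA support")
```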

@KoyamaSohei force-pushed the fix/set_distributed.rmm.pool-size branch from c4dd066 to 6575251 on June 1, 2022 at 09:56
@pentschev (Member) left a comment

LGTM now, thanks @KoyamaSohei !

@KoyamaSohei (Contributor, Author)

Thanks @pentschev for the review :)
It made the fix much better.

@quasiben (Member) commented Jun 6, 2022

Thanks all, merging in.

@quasiben merged commit 7d280fd into dask:main on Jun 6, 2022
@KoyamaSohei deleted the fix/set_distributed.rmm.pool-size branch on June 6, 2022 at 15:09