Cannot re-initialize CUDA in forked subprocess when loading model in seldon-core #9386

Closed
rlleshi opened this issue Nov 25, 2022 · 6 comments
Labels: awaiting response · bug (Something isn't working) · community help wanted (Extra attention is needed) · Stale

Comments

@rlleshi

rlleshi commented Nov 25, 2022

I am using a seldon-core microservice to serve a Faster R-CNN detection model. However, when moving the model to the desired CUDA device with torch's model.to(device) (inside init_detector), the following error is thrown:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

This is different from an existing issue where the problem was that the load() method of the microservice was not being used properly. Also, the specific suggestions in the related PyTorch thread do not work in this context.

I am attaching a minimal reproducible example in the below zip file.
cuda_error.zip

The contents are:

  • Detection.py - Seldon Python wrapper
  • Dockerfile - the Dockerfile for the environment
  • download_model.sh - script to download the detection model
  • faster_rcnn_r50_caffe_fpn_mstrain_1x_coco-person.py - model config file

First, create the image: nvidia-docker build -f Dockerfile . -t seldontest
Then run the container: nvidia-docker run -p 5000:5000 -p 9000:9000 --name seldontest -it seldontest

I am not sure what the problem is, because I've been able to deploy other models with Gunicorn on a CUDA device.
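
For reference, here is a rough Python sketch of what the attached Detection.py wrapper does; only the overall shape and the config file name come from the zip contents, while the checkpoint path and the predict() body are assumptions:

from mmdet.apis import inference_detector, init_detector


class Detection:
    def __init__(self):
        self.model = None

    def load(self):
        # init_detector builds the detector from the config, loads the checkpoint,
        # and calls model.to(device), which is where the
        # "Cannot re-initialize CUDA in forked subprocess" error is raised.
        self.model = init_detector(
            'faster_rcnn_r50_caffe_fpn_mstrain_1x_coco-person.py',
            'checkpoint.pth',  # assumed path; fetched by download_model.sh
            device='cuda:0')

    def predict(self, X, features_names=None):
        # Assumed predict logic: run inference on the incoming image(s).
        return inference_detector(self.model, X)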

@ZwwWayne
Collaborator

You can set the multiprocessing start method to spawn manually by:

# 'mp' here is Python's multiprocessing module
import multiprocessing as mp

if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method('spawn')
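
For a fuller, self-contained illustration of this workaround (assuming PyTorch and a CUDA-capable machine; the worker function is hypothetical), the start method has to be set before any process is started, and the process target must be a module-level, picklable function:

import multiprocessing as mp

import torch


def cuda_worker(device_id):
    # CUDA is initialized inside the child process, which is safe with 'spawn'.
    device = torch.device(f'cuda:{device_id}')
    print(torch.zeros(1, device=device))


if __name__ == '__main__':
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')
    p = mp.Process(target=cuda_worker, args=(0,))
    p.start()
    p.join()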

@BIGWangYuDong added the bug (Something isn't working), community help wanted (Extra attention is needed), and awaiting response labels Nov 28, 2022
@rlleshi
Author

rlleshi commented Nov 28, 2022

In that case, AttributeError: Can't pickle local object 'main.<locals>.grpc_prediction_server' is thrown.

Full stack trace:

  File "/opt/conda/bin/seldon-core-microservice", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/seldon_core/microservice.py", line 586, in main
    start_servers(
  File "/opt/conda/lib/python3.8/site-packages/seldon_core/microservice.py", line 85, in start_servers
    p2.start()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
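
A minimal sketch of why switching to 'spawn' triggers this error (the nested function below is hypothetical and only mimics seldon-core's grpc_prediction_server, which is defined inside main()): 'spawn' has to pickle the Process target, and functions defined inside another function cannot be pickled.

import multiprocessing as mp


def main():
    def grpc_prediction_server():
        # Local (nested) function: fine as a target under 'fork', but not picklable.
        print('serving')

    mp.set_start_method('spawn', force=True)
    p = mp.Process(target=grpc_prediction_server)
    # Raises AttributeError: Can't pickle local object 'main.<locals>.grpc_prediction_server'
    p.start()
    p.join()


if __name__ == '__main__':
    main()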

@rlleshi
Author

rlleshi commented Nov 28, 2022

This doesn't appear to be related to MMDetection though. It's just strange because it works with other torch models on CUDA & Gunicorn.

@rlleshi rlleshi changed the title Cannot re-initialize CUDA in forked subprocess when loading model in Gunicorn Cannot re-initialize CUDA in forked subprocess when loading model in seldon-core Nov 28, 2022
@rlleshi
Author

rlleshi commented Nov 29, 2022

@ZwwWayne would you mind checking the minimal example once just to make sure it has nothing to do with mmcv checkpoint loading?

@github-actions

github-actions bot commented Dec 7, 2022

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

@github-actions github-actions bot added the Stale label Dec 7, 2022
@github-actions

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
