Mamba inference server init_process_group error for 1 gpu #10549

Closed
SkanderBS2024 opened this issue Sep 20, 2024 · 2 comments
@SkanderBS2024

Describe the bug

I've been encountering an error when running the Megatron Mamba inference server: the failure occurs when sending a request using the example client code.
Megatron_mamba_eval

Steps/Code to reproduce bug
Execution command:

CUDA_VISIBLE_DEVICES="0" python megatron_mamba_eval.py \
    mamba_model_file=/workspace/nemo/work//Model/model.nemo \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    server=True \
    chat=True \
    share=True

Error:

[NeMo I 2024-09-20 01:04:19 text_generation_server:65] request IP: 127.0.0.1
[NeMo I 2024-09-20 01:04:19 text_generation_server:66] {"sentences": ["hello"], "tokens_to_generate": 300, "temperature": 1.0, "add_BOS": true, "top_k": 0, "top_p": 0.9, "greedy": false, "all_probs": false, "repetition_penalty": 1.2, "min_tokens_to_generate": 2}
[NeMo W 2024-09-20 01:04:19 nemo_logging:349] /opt/NeMo/nemo/collections/nlp/modules/common/text_generation_server.py:61: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:79.)
      choice = torch.cuda.LongTensor([GENERATE_NUM])

Exception on /generate [PUT]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 489, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/flask/views.py", line 110, in view
    return current_app.ensure_sync(self.dispatch_request)(**kwargs)  # type: ignore[no-any-return]
  File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 604, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/modules/common/text_generation_server.py", line 185, in put
    MegatronGenerate.send_do_generate()  # Tell other ranks we're doing generate
  File "/opt/NeMo/nemo/collections/nlp/modules/common/text_generation_server.py", line 62, in send_do_generate
    torch.distributed.broadcast(choice, 0)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2044, in broadcast
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 995, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
127.0.0.1 - - [20/Sep/2024 01:04:19] "PUT /generate HTTP/1.1" 500 -
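
For reference, a minimal workaround sketch (not a confirmed fix): initialize a single-rank default process group in the eval script before the Flask server starts, so the torch.distributed.broadcast call in send_do_generate has a default group to use. The address and port below are illustrative placeholders.

import os
import torch.distributed as dist

# Assumption: the script is launched with plain `python`, so MASTER_ADDR,
# MASTER_PORT, rank, and world size are not set by torchrun and must be
# provided manually for a single-rank group.
if not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # placeholder port
    dist.init_process_group(backend="nccl", rank=0, world_size=1)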

Server request code:

import json
import requests

batch_size = 1
port_num = 5555
headers = {"Content-Type": "application/json"}


def request_data(data):
    resp = requests.put('http://0.0.0.0:{}/generate'.format(port_num),
                        data=json.dumps(data),
                        headers=headers)
    sentences = resp.json()['sentences']
    return sentences


data = {
    "sentences": ["hello"] * batch_size,
    "tokens_to_generate": 300,
    "temperature": 1.0,
    "add_BOS": True,
    "top_k": 0,
    "top_p": 0.9,
    "greedy": False,
    "all_probs": False,
    "repetition_penalty": 1.2,
    "min_tokens_to_generate": 2,
}

sentences = request_data(data)
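
For debugging the 500 response above, a hypothetical variant of the client (not part of the original report) that checks the HTTP status before indexing into the JSON:

def request_data_debug(data):
    # Same request as above, but surface the server's error body instead of
    # failing with a KeyError when the response is a 500.
    resp = requests.put('http://0.0.0.0:{}/generate'.format(port_num),
                        data=json.dumps(data),
                        headers=headers)
    if resp.status_code != 200:
        raise RuntimeError('server returned {}: {}'.format(resp.status_code, resp.text))
    return resp.json()['sentences']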

Expected behavior

Expected to return a response from the server (generated text).

Environment overview (please complete the following information)

  • Environment location: Docker image NeMo 24.07
  • Method of NeMo install: installed from a source checkout of this version (tags/r2.0.0rc1)
  • If method of install is [Docker], provide docker pull & docker run commands used

Docker run command:

docker run --gpus all --shm-size=80g --net=host --ulimit memlock=-1 --rm -it \
    -v /ephemeral/:/workspace/megatron \
    -v /ephemeral/tmp:/tmp \
    nvcr.io/nvidia/nemo:24.07

Environment details

If an NVIDIA Docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version: Ubuntu 22.04
  • PyTorch version: the version shipped in the NeMo 24.07 image
  • Python version: 3.10

Additional context

I have 2 A100 GPUs on the machine and tried to launch with torchrun and num_devices = 2, but I get an error saying that TP*PP should match the world size, while TP and PP for my model are set to 1 (so I cannot run inference on multiple GPUs in this case?).
It's a 2B pure Mamba2 SSM model in .nemo checkpoint format.
When executing the script everything is fine and it shows that the server is running; the error appears only when a request is made, and it is related to torch.distributed (which is not needed for my single-GPU use case).
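
For illustration, a sketch of the check implied by that TP*PP error (assumption: the eval script requires the world size to equal TP * PP, with no data parallelism at inference time). The numbers are the ones from this report:

trainer_devices = 2                # torchrun launch with 2 GPUs
trainer_num_nodes = 1
tensor_model_parallel_size = 1     # TP of the saved 2B Mamba2 checkpoint
pipeline_model_parallel_size = 1   # PP of the saved checkpoint

world_size = trainer_devices * trainer_num_nodes
# Fails for world_size=2 with TP=PP=1, matching the reported error.
assert world_size == tensor_model_parallel_size * pipeline_model_parallel_size, (
    "world size ({}) must equal TP * PP ({})".format(
        world_size, tensor_model_parallel_size * pipeline_model_parallel_size))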

@SkanderBS2024 added the bug label Sep 20, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Oct 26, 2024

github-actions bot commented Nov 2, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Nov 2, 2024