
Fail to run examples/offline.py, unable to download the model to reproduce #35

William12github opened this issue Aug 6, 2024 · 13 comments

@William12github

Hi,
I am trying to reproduce the results, but I am unable to download the llama2-7b-hf model, as the logs below show:

```
root@d7b9ced7ced8:/workspace/DistServe# python3 examples/offline.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json

The above exception was the direct cause of the following exception:

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json.
Your request to access model meta-llama/Llama-2-7b-hf has been rejected by the repo's authors.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/DistServe/examples/offline.py", line 32, in <module>
    model_config=ModelConfig(
  File "/workspace/DistServe/distserve/config.py", line 177, in __init__
    self.hf_config = self._get_hf_config()
  File "/workspace/DistServe/distserve/config.py", line 192, in _get_hf_config
    raise ValueError(
ValueError: Failed to load the model config, please check the model name or path: meta-llama/Llama-2-7b-hf
```

Although I logged in successfully with huggingface-cli login, is there an alternative way to acquire the model? Thanks.

```
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful
```
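
For reference, a minimal sketch (assuming the access request for the gated meta-llama/Llama-2-7b-hf repo has actually been approved for the logged-in account; the local_dir value is just an illustration) of pre-downloading the weights with huggingface_hub and then pointing the example at the local copy:

```python
# Sketch: pre-download the gated repo using the token saved by
# `huggingface-cli login`, then pass the local path to examples/offline.py.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="./Llama-2-7b-hf",  # illustrative target directory
)
print(local_path)
```

If the access request was rejected (as the 403 above indicates), no token will help; the weights have to come from another source, as discussed below.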

@William12github

Is there an alternative way to provide the model other than downloading it online directly?

@RobertLou

You can use this website to download the model: https://modelscope.cn/my/overview

@William12github

William12github commented Aug 6, 2024

Thanks, I have downloaded it manually from meta.com and converted it into HF format using convert_llama_weights_to_hf.py, and got the following files:

root@d7b9ced7ced8:/workspace/DistServe# ls -lt ../lama2-7b-hf/
total 13163352
-rw-r--r-- 1 root root 23950 Aug 6 02:00 model.safetensors.index.json
-rw-r--r-- 1 root root 3590488816 Aug 6 02:00 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 4947390880 Aug 6 02:00 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 4938985352 Aug 6 02:00 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 659 Aug 6 02:00 config.json
-rw-r--r-- 1 root root 111 Aug 6 02:00 generation_config.json
-rw-r--r-- 1 root root 1842622 Aug 6 01:59 tokenizer.json
-rw-r--r-- 1 root root 414 Aug 6 01:59 special_tokens_map.json
-rw-r--r-- 1 root root 499723 Aug 6 01:59 tokenizer.model
-rw-r--r-- 1 root root 960 Aug 6 01:59 tokenizer_config.json

But when I started it with:

    $ python distserve/api_server/distserve_api_server.py --model ../lama2-7b-hf/

it reported another error:

(ParaWorker pid=30741) INFO 02:18:28 runtime peak memory: 12.706 GB
(ParaWorker pid=30741) INFO 02:18:28 total GPU memory: 44.521 GB
(ParaWorker pid=30741) INFO 02:18:28 kv cache size for one token: 0.50000 MB
(ParaWorker pid=30741) INFO 02:18:28 num_gpu_blocks: 3502
(ParaWorker pid=30741) INFO 02:18:28 num_cpu_blocks: 2048
(ParaWorker pid=30742) Gpt<T>::load() - ../lama2-7b-hf/decoder.embed_tokens.weight.pt not found

@RobertLou

I encountered the same problem before. It happens because DistServe needs to convert the model weights into another format first. I changed distserve/downloader/downloader.py to solve this; you can replace the corresponding part of that file. Here's my code:

        if is_local:
            if model_name_or_path[-1] == '/':
                # Path ends with '/': treat it as raw HF weights that still need
                # to be converted into DistServe's own weight format.
                allow_patterns = "*.bin"
                hf_files = os.path.join(model_name_or_path, allow_patterns)
                cache_dir = DISTSERVE_CACHE
                storage_folder = \
                    os.path.join(cache_dir,
                                 repo_folder_name(repo_id=model_name_or_path)) + '/'
                done_file = os.path.join(storage_folder, "done")
                if os.path.exists(done_file):
                    # Weights were already converted in a previous run.
                    logger.info(f"Find cached model weights in {storage_folder}.")
                    return storage_folder

                # Convert the *.bin weights, then mark the conversion as done.
                convert_weights(hf_files, storage_folder, dtype, model)
                open(done_file, 'w').close()
                return storage_folder
            else:
                # No trailing '/': assume the directory already holds converted weights.
                return model_name_or_path + '/'
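
With this change, a local path that ends with '/' is treated as raw Hugging Face weights: the *.bin shards are converted into DistServe's own format under DISTSERVE_CACHE, and a "done" marker file skips the conversion on later runs, while a path without a trailing slash is assumed to already contain converted weights. So the server is started against the raw checkout as, for example:

    python distserve/api_server/distserve_api_server.py --model ../lama2-7b-hf/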

@William12github

But I don't have any .bin files in the folder; I only got these files:

root@d7b9ced7ced8:/workspace/DistServe# ls -lt ../lama2-7b-hf/
total 13163352
-rw-r--r-- 1 root root 23950 Aug 6 02:00 model.safetensors.index.json
-rw-r--r-- 1 root root 3590488816 Aug 6 02:00 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root 4947390880 Aug 6 02:00 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 4938985352 Aug 6 02:00 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 659 Aug 6 02:00 config.json
-rw-r--r-- 1 root root 111 Aug 6 02:00 generation_config.json
-rw-r--r-- 1 root root 1842622 Aug 6 01:59 tokenizer.json
-rw-r--r-- 1 root root 414 Aug 6 01:59 special_tokens_map.json
-rw-r--r-- 1 root root 499723 Aug 6 01:59 tokenizer.model
-rw-r--r-- 1 root root 960 Aug 6 01:59 tokenizer_config.json
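
For reference, one possible workaround that avoids re-downloading (not something tried in this thread, and it assumes DistServe's convert_weights can consume arbitrary torch-saved state-dict shards matching the *.bin glob above) is to convert the safetensors shards back into .bin files locally:

```python
# Sketch: turn the model-*.safetensors shards into .bin shards so that the
# "*.bin" pattern in downloader.py picks them up.
# Assumes the `safetensors` and `torch` packages are installed.
import glob
import os

import torch
from safetensors.torch import load_file

src_dir = "../lama2-7b-hf"  # directory listed above
for shard in sorted(glob.glob(os.path.join(src_dir, "*.safetensors"))):
    state_dict = load_file(shard)                # tensors from one shard
    out = shard.replace(".safetensors", ".bin")  # e.g. model-00001-of-00003.bin
    torch.save(state_dict, out)
    print(f"wrote {out} ({len(state_dict)} tensors)")
```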

@RobertLou

https://modelscope.cn/models/shakechen/Llama-2-7b-hf/files has the *.bin files; maybe download it again from there?
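
For reference, a minimal sketch (assuming the modelscope Python package is installed and exposes snapshot_download at the top level, as recent versions do) of fetching that mirror programmatically instead of through the web page:

```python
# Sketch: download the shakechen/Llama-2-7b-hf mirror from ModelScope;
# the returned directory contains the *.bin weight shards plus the tokenizer files.
from modelscope import snapshot_download

local_path = snapshot_download("shakechen/Llama-2-7b-hf")
print(local_path)  # pass this directory (with a trailing '/') to --model
```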

@William12github

Hi Robert,
I appreciate your great help.

Is it necessary to download all the files? It will take too much time to download everything. And if I want to run examples/offline.py, how do I specify the local directory in the code?

@RobertLou

I'm not sure, but according to the code, the *.bin files are necessary. You can use --model to specify the local directory, like:

    python offline.py --model ../Llama2-7b-hf/

By the way, if you have any questions, checking the code is the fastest way: the '--model' argument in offline.py is what lets you specify the local directory.

@William12github

Hi Robert,
Thank you for your enthusiastic help and good advice!

@William12github

I am able to run examples/offline.py now and got the following result:

INFO 13:00:01 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 13:00:01 (decoding) GPU blocks: 3 / 941 (0.32%) used, (0 swapping out)
INFO 13:00:01 (decoding) 0 unaccepted, 0 waiting, 2 processing
INFO 13:00:02 (context) 0 waiting, 0 finished but unaccepted, 0 blocks occupied by on-the-fly requests
INFO 13:00:02 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 13:00:02 (decoding) GPU blocks: 2 / 941 (0.21%) used, (0 swapping out)
INFO 13:00:02 (decoding) 0 unaccepted, 0 waiting, 1 processing
INFO 13:00:03 (context) 0 waiting, 0 finished but unaccepted, 0 blocks occupied by on-the-fly requests
INFO 13:00:03 (decoding) CPU blocks: 0 / 128 (0.00%) used, (0 swapping in)
INFO 13:00:03 (decoding) GPU blocks: 3 / 941 (0.32%) used, (0 swapping out)
INFO 13:00:03 (decoding) 0 unaccepted, 0 waiting, 1 processing
Prompt: 'Life blooms like a flower. Far away or by the road. Waiting', Generated text: for the right time to blo om . Ћ
(10 tokens generated).
Prompt: 'A quick brown fox', Generated text: j umps over the lazy dog .
(8 tokens generated).
Prompt: 'Artificial intelligence is', Generated text: a hot topic in the te ch world . The term is thrown around a lot , but what does it really mean ?
(25 tokens generated).
Prompt: 'To be or not to be,', Generated text: that is the question .
(6 tokens generated).
Prompt: 'one two three four', Generated text: five six seven eight nine ten eleven eleven twelve th ir teen fifteen six teen sevent een eigh teen nin ete en twenty one twenty - one twenty - two twenty - three twenty - four twenty - five twenty - six twenty - se ven twenty - one twenty - two
(53 tokens generated).
(ParaWorker pid=5130) INFO 13:00:00 (worker context.#0) model /workspace/Llama-2-7b-hf/ loaded
(ParaWorker pid=5130) INFO 13:00:00 runtime peak memory: 12.497 GB
(ParaWorker pid=5130) INFO 13:00:00 total GPU memory: 22.059 GB
(ParaWorker pid=5130) INFO 13:00:00 kv cache size for one token: 0.50000 MB
(ParaWorker pid=5130) INFO 13:00:00 num_gpu_blocks: 941
(ParaWorker pid=5130) INFO 13:00:00 num_cpu_blocks: 128
root@5a65df8f9a43:/workspace/DistServe#

But I'm still confused:

  1. How do I measure the performance of this test?
  2. If I want to compare it with a colocated solution (both the prefill and decoding phases running on a single GPU), how do I launch that test?

@Youhe-Jiang

Hi mate, have you ever met this problem:

(ParaWorker pid=1955026) Error: Peer-to-peer access is unsupported on this platform.
(ParaWorker pid=1955026) In the current version of distserve, it is necessary to use a platform that supports GPU P2P access.
(ParaWorker pid=1955026) Exiting...

I checked P2P access; it should actually be supported...

Thank you for any help!

@William12github

William12github commented Aug 13, 2024

You can use the command below to check whether your system supports P2P:

```
$ nvidia-smi topo -p2p wr
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0    X     OK    OK    OK    OK    OK    OK    OK
GPU1    OK    X     OK    OK    OK    OK    OK    OK
GPU2    OK    OK    X     OK    OK    OK    OK    OK
GPU3    OK    OK    OK    X     OK    OK    OK    OK
GPU4    OK    OK    OK    OK    X     OK    OK    OK
GPU5    OK    OK    OK    OK    OK    X     OK    OK
GPU6    OK    OK    OK    OK    OK    OK    X     OK
GPU7    OK    OK    OK    OK    OK    OK    OK    X

Legend:

  X   = Self
  OK  = Status Ok
  CNS = Chipset not supported
  GNS = GPU not supported
  TNS = Topology not supported
  NS  = Not supported
  U   = Unknown
```
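
Alternatively, a minimal sketch (assuming PyTorch with CUDA is installed) that checks pairwise P2P access from Python:

```python
# Sketch: report which GPU pairs allow peer-to-peer access.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: {'OK' if ok else 'not supported'}")
```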

@Youhe-Jiang

Youhe-Jiang commented Aug 13, 2024 via email
