meta/llama-2-70b maximum input size (1024) differs from the LLaMA-2 maximum context size (4096 tokens) #264

jdkanu commented Mar 18, 2024

LLaMA-2 models have a maximum context size of 4096 tokens [original paper, meta llama github repo]. When prompting meta/llama-2-70b through Replicate, however, the maximum input size enforced by the deployment is, strangely, 1024 tokens, which causes an error that crashes my program.
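A minimal sketch of the kind of call that triggers this, using the replicate Python client (the prompt and the max_new_tokens value are placeholders, not my actual input, and the input parameter names are my best guess at the model's schema):

import replicate

# Placeholder prompt: in my case it is ~1240 tokens, well within LLaMA-2's 4096-token context.
long_prompt = "..."

# replicate.run returns an iterator of output chunks for language models.
output = replicate.run(
    "meta/llama-2-70b",
    input={
        "prompt": long_prompt,
        "max_new_tokens": 128,
    },
)
print("".join(output))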

[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1240) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f2d5e9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f2d5e9b41cd]
2       0x7f2d609dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f2d609dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f2e9b3f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2e9b3f2253]
5       0x7f2e9b181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2e9b181ac3]
6       0x7f2e9b212a04 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 117809844: Cannot process new request: Prompt length (1240) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f2d5e9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f2d5e9b41cd]
2       0x7f2d609dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f2d609dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f2e9b3f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2e9b3f2253]
5       0x7f2e9b181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2e9b181ac3]
6       0x7f2e9b212a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1240) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f33469b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f33469b41cd]
2       0x7f33489dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f33489dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f3472df2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3472df2253]
5       0x7f3472b81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3472b81ac3]
6       0x7f3472c12a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1240) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f86529b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f86529b41cd]
2       0x7f86549dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f86549dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f8781df2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8781df2253]
5       0x7f8781b81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8781b81ac3]
6       0x7f8781c12a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1240) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f58929b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f58929b41cd]
2       0x7f58949dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f58949dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f59bebf2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f59bebf2253]
5       0x7f59be981ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f59be981ac3]
6       0x7f59bea12a04 clone + 68
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 224, in _handle_predict_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 253, in _predict_async
    async for r in result:
  File "/src/predict.py", line 180, in predict
    output = event.json()["text_output"]
KeyError: 'text_output'

https://replicate.com/p/le6b6jtbp565nmjftq4lxsy44i
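As a side note, the KeyError at the bottom of the traceback comes from predict.py assuming every streamed event contains a text_output field. A small guard (a hypothetical helper, not the actual predict.py code) would at least surface the backend's error payload instead of crashing with a bare KeyError:

def extract_text_output(event_json: dict) -> str:
    """Return the generated text from a streamed event, or fail with the raw payload for easier debugging."""
    if "text_output" in event_json:
        return event_json["text_output"]
    # The event presumably carries the backend error (e.g. the TensorRT-LLM message above) instead of text.
    raise RuntimeError(f"Backend returned an event without text_output: {event_json}")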

Even the smaller 7B models do not return this error when called with the same prompt (same input size).

It looks like the wrong model, or at least an engine built with a 1024-token input limit, is being served for meta/llama-2-70b; LLaMA-2 should not complain about an input of just 1240 tokens. If that is the case, then I, and potentially many other customers calling meta/llama-2-70b, are paying for calls to something that is not what we asked for, not what was advertised, and that fails to return output when it should. Please correct this!
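In the meantime, the only client-side workaround I can see is to truncate prompts to fit under the 1024-token limit before calling the endpoint. A rough sketch (the Hugging Face LLaMA-2 tokenizer is an assumption on my part and may not count tokens exactly like the server, so I leave some margin below 1024):

from transformers import AutoTokenizer

# Assumption: access to a LLaMA-2 tokenizer; any LLaMA-2 variant gives roughly the same token counts.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

def truncate_to_limit(prompt: str, max_input_tokens: int = 1000) -> str:
    """Keep only the last max_input_tokens tokens of the prompt, leaving a margin under the 1024 cap."""
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) <= max_input_tokens:
        return prompt
    return tokenizer.decode(ids[-max_input_tokens:])

Obviously that is only a stopgap; the whole point of using a 4096-token model is to be able to send more than 1024 tokens.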
