Why is my batch size maxing out at 20 when my estimated max is around 60? #1331
I am running `text-generation-benchmark` with the default batch sizes of [1, 2, 4, 8, 16, 32] but I am getting CUDA OOM errors when I hit the batch size of 32. I narrowed it down and hit a max at batch size 20. This is confusing to me, as I'm running HuggingFaceH4/zephyr-7b-beta on an A100 80GB PCIe, and from my rough calculations it should be able to handle a max batch size of around 60. I am running `text-generation-launcher` with the default parameters and `text-generation-benchmark` with `--decode-length 2048` and `--sequence-length 256`. Here is my calculation for reference:

KV cache size = 2 * 2 * 32 * 4096 / 1000000000 = 0.000524 GB per token
KV cache tokens = (80 - 14) / 0.000524 = 125885.01
Max batch size = 125885.01 / 2048 = 61.47

Am I doing the calculation wrong here, or could there be something amiss with my setup?
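Spelling the same arithmetic out as a quick Python sketch (my reading of the factors, 2 tensors for K and V, 2 bytes for fp16, 32 layers, 4096 hidden size, and ~14 GB of fp16 weights, is an assumption):

```python
# Back-of-the-envelope max batch size from the KV-cache memory budget.
GPU_MEM_GB = 80          # A100 80GB
WEIGHTS_GB = 14          # assumed: ~7B params * 2 bytes (fp16)
DECODE_LENGTH = 2048     # --decode-length

# 2 tensors (K and V) * 2 bytes (fp16) * 32 layers * 4096 hidden size
kv_gb_per_token = 2 * 2 * 32 * 4096 / 1e9

kv_token_budget = (GPU_MEM_GB - WEIGHTS_GB) / kv_gb_per_token
max_batch_size = kv_token_budget / DECODE_LENGTH

print(f"KV cache per token: {kv_gb_per_token:.6f} GB")    # 0.000524
print(f"KV-cache token budget: {kv_token_budget:.2f}")    # ~125885
print(f"Estimated max batch size: {max_batch_size:.2f}")  # ~61.47
```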
-
I’m very interested in this. Of course, if you take different attention mechanisms and other optimizations into account, inference can do better than this calculation, but that shouldn’t be relevant here: the calculated max batch size is a baseline that is already much larger than what the benchmark is able to achieve. Any ideas? @OlivierDehaene
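For instance, here is a sketch of how grouped-query attention alone changes the per-token figure, and note that it moves the estimate toward more headroom, not less (the shapes, 32 layers, 128 head dim, 32 query heads, 8 KV heads, are assumptions taken from Mistral-7B, which zephyr-7b-beta is fine-tuned from):

```python
# Per-token KV-cache size under full multi-head attention (MHA)
# vs. grouped-query attention (GQA).
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2 because both a key and a value are cached per layer.
    return 2 * bytes_per_value * n_layers * n_kv_heads * head_dim

mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(f"MHA: {mha / 1e9:.6f} GB/token")  # 0.000524, the figure in the calc
print(f"GQA: {gqa / 1e9:.6f} GB/token")  # 0.000131, i.e. 4x smaller
```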
-
This is related to my issue: #1831. From my POV, there is also something odd here: not only does the math lead you to different numbers; so does the value the launcher infers. Some useful links I have used: