habana_main rebase #71 (Merged)

537 commits, merged Jul 2, 2024

Commits
80aa7e9
[Hardware][Intel] Optimize CPU backend and add more performance tips …
bigPYJ1151 Jun 13, 2024
a65634d
[Docs] Add 4th meetup slides (#5509)
WoosukKwon Jun 13, 2024
03dccc8
[Misc] Add vLLM version getter to utils (#5098)
DarkLight1337 Jun 13, 2024
3987347
[CI/Build] Simplify OpenAI server setup in tests (#5100)
DarkLight1337 Jun 13, 2024
0ce7b95
[Doc] Update LLaVA docs (#5437)
DarkLight1337 Jun 13, 2024
85657b5
[Kernel] Factor out epilogues from cutlass kernels (#5391)
tlrmchlsmth Jun 13, 2024
30299a4
[MISC] Remove FP8 warning (#5472)
comaniac Jun 13, 2024
a8fda4f
Seperate dev requirements into lint and test (#5474)
Yard1 Jun 13, 2024
6b0511a
Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478)
Yard1 Jun 13, 2024
1696efe
[misc] fix format.sh (#5511)
youkaichao Jun 13, 2024
33e3b37
[CI/Build] Disable test_fp8.py (#5508)
tlrmchlsmth Jun 13, 2024
e38042d
[Kernel] Disable CUTLASS kernels for fp8 (#5505)
tlrmchlsmth Jun 13, 2024
50eed24
Add `cuda_device_count_stateless` (#5473)
Yard1 Jun 13, 2024
cd9c0d6
[Hardware][Intel] Support CPU inference with AVX2 ISA (#5452)
DamonFool Jun 13, 2024
55d6361
[Misc] Fix arg names in quantizer script (#5507)
AllenDou Jun 14, 2024
0f0d8bc
bump version to v0.5.0.post1 (#5522)
simon-mo Jun 14, 2024
319ad7f
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs…
KuntaiDu Jun 14, 2024
d47af2b
[CI/Build] Disable LLaVA-NeXT CPU test (#5529)
DarkLight1337 Jun 14, 2024
703475f
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516)
tlrmchlsmth Jun 14, 2024
d74674b
[Misc] Fix arg names (#5524)
AllenDou Jun 14, 2024
1598568
[ Misc ] Rs/compressed tensors cleanup (#5432)
robertgshaw2-neuralmagic Jun 14, 2024
348616a
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401)
tlrmchlsmth Jun 14, 2024
48f589e
[mis] fix flaky test of test_cuda_device_count_stateless (#5546)
youkaichao Jun 14, 2024
77490c6
[Core] Remove duplicate processing in async engine (#5525)
DarkLight1337 Jun 14, 2024
d1c3d7d
[misc][distributed] fix benign error in `is_in_the_same_node` (#5512)
youkaichao Jun 14, 2024
cdab68d
[Docs] Add ZhenFund as a Sponsor (#5548)
simon-mo Jun 14, 2024
6e2527a
[Doc] Update documentation on Tensorizer (#5471)
sangstar Jun 14, 2024
e2afb03
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460)
tdoublep Jun 14, 2024
28c145e
[Bugfix] Fix typo in Pallas backend (#5558)
WoosukKwon Jun 14, 2024
f5bb85b
[Core][Distributed] improve p2p cache generation (#5528)
youkaichao Jun 14, 2024
bd7efe9
Add ccache to amd (#5555)
simon-mo Jun 15, 2024
1b8a0d7
[Core][Bugfix]: fix prefix caching for blockv2 (#5364)
leiwen83 Jun 15, 2024
0e9164b
[mypy] Enable type checking for test directory (#5017)
DarkLight1337 Jun 15, 2024
81fbb36
[CI/Build] Test both text and token IDs in batched OpenAI Completions…
DarkLight1337 Jun 15, 2024
e691918
[misc] Do not allow to use lora with chunked prefill. (#5538)
rkooo567 Jun 15, 2024
d919ecc
add gptq_marlin test for bug report https://github.com/vllm-project/v…
alexm-neuralmagic Jun 15, 2024
1c0afa1
[BugFix] Don't start a Ray cluster when not using Ray (#5570)
njhill Jun 15, 2024
3ce2c05
[Fix] Correct OpenAI batch response format (#5554)
zifeitong Jun 15, 2024
f31c1f9
Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518)
Yard1 Jun 16, 2024
4a67690
[CI][BugFix] Flip is_quant_method_supported condition (#5577)
mgoin Jun 16, 2024
f07d513
[build][misc] limit numpy version (#5582)
youkaichao Jun 16, 2024
845a3f2
[Doc] add debugging tips for crash and multi-node debugging (#5581)
youkaichao Jun 17, 2024
e2b85cf
Fix w8a8 benchmark and add Llama-3-8B (#5562)
comaniac Jun 17, 2024
9333fb8
[Model] Rename Phi3 rope scaling type (#5595)
garg-amit Jun 17, 2024
9e74d9d
Correct alignment in the seq_len diagram. (#5592)
CharlesRiggins Jun 17, 2024
890d8d9
[Kernel] `compressed-tensors` marlin 24 support (#5435)
dsikka Jun 17, 2024
1f12122
[Misc] use AutoTokenizer for benchmark serving when vLLM not installe…
zhyncs Jun 17, 2024
728c4c8
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
jikunshang Jun 17, 2024
ab66536
[CI/BUILD] Support non-AVX512 vLLM building and testing (#5574)
DamonFool Jun 17, 2024
9e4e6fe
[CI] the readability of benchmarking and prepare for dashboard (#5571)
KuntaiDu Jun 17, 2024
1b44aaf
[bugfix][distributed] fix 16 gpus local rank arrangement (#5604)
youkaichao Jun 17, 2024
e441bad
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584)
youkaichao Jun 17, 2024
a3e8a05
[Bugfix] Fix KV head calculation for MPT models when using GQA (#5142)
bfontain Jun 17, 2024
26e1188
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606)
zifeitong Jun 17, 2024
fa9e385
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of…
sroy745 Jun 18, 2024
daef218
[Model] Initialize Phi-3-vision support (#4986)
Isotr0py Jun 18, 2024
5002175
[Kernel] Add punica dimensions for Granite 13b (#5559)
joerunde Jun 18, 2024
8eadcf0
[misc][typo] fix typo (#5620)
youkaichao Jun 18, 2024
32c86e4
[Misc] Fix typo (#5618)
DarkLight1337 Jun 18, 2024
114d727
[CI] Avoid naming different metrics with the same name in performance…
KuntaiDu Jun 18, 2024
db5ec52
[bugfix][distributed] improve p2p capability test (#5612)
youkaichao Jun 18, 2024
f0cc0e6
[Misc] Remove import from transformers logging (#5625)
CatherineSue Jun 18, 2024
4ad7b53
[CI/Build][Misc] Update Pytest Marker for VLMs (#5623)
ywang96 Jun 18, 2024
13db436
[ci] Deprecate original CI template (#5624)
khluu Jun 18, 2024
7879f24
[Misc] Add OpenTelemetry support (#4687)
ronensc Jun 18, 2024
95db455
[Misc] Add channel-wise quantization support for w8a8 dynamic per tok…
dsikka Jun 18, 2024
19091ef
[ci] Setup Release pipeline and build release wheels with cache (#5610)
khluu Jun 18, 2024
07feecd
[Model] LoRA support added for command-r (#5178)
sergey-tinkoff Jun 18, 2024
8a17338
[Bugfix] Fix for inconsistent behaviour related to sampling and repet…
tdoublep Jun 18, 2024
2bd231a
[Doc] Added cerebrium as Integration option (#5553)
milo157 Jun 18, 2024
b23ce92
[Bugfix] Fix CUDA version check for mma warning suppression (#5642)
tlrmchlsmth Jun 18, 2024
6820724
[Bugfix] Fix w8a8 benchmarks for int8 case (#5643)
tlrmchlsmth Jun 19, 2024
59a1eb5
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628)
ShukantPal Jun 19, 2024
e5150f2
[Bugfix] Added test for sampling repetition penalty bug. (#5659)
tdoublep Jun 19, 2024
f758aed
[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate…
hongxiayang Jun 19, 2024
3eea748
[misc][distributed] use 127.0.0.1 for single-node (#5619)
youkaichao Jun 19, 2024
da971ec
[Model] Add FP8 kv cache for Qwen2 (#5656)
mgoin Jun 19, 2024
7d46c8d
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684)
Isotr0py Jun 19, 2024
d871453
[Misc]Add param max-model-len in benchmark_latency.py (#5629)
DearPlanet Jun 19, 2024
e9c2732
[CI/Build] Add tqdm to dependencies (#5680)
DarkLight1337 Jun 19, 2024
3ee5c4b
[ci] Add A100 queue into AWS CI template (#5648)
khluu Jun 19, 2024
afed90a
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg…
mgoin Jun 19, 2024
d571ca0
[ci][distributed] add tests for custom allreduce (#5689)
youkaichao Jun 19, 2024
7868750
[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654)
zifeitong Jun 19, 2024
e83db9e
[Doc] Update docker references (#5614)
rafvasq Jun 19, 2024
4a30d7e
[Misc] Add per channel support for static activation quantization; up…
dsikka Jun 19, 2024
949e49a
[ci] Limit num gpus if specified for A100 (#5694)
khluu Jun 19, 2024
3730a1c
[Misc] Improve conftest (#5681)
DarkLight1337 Jun 20, 2024
1b2eaac
[Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (#5703)
ywang96 Jun 20, 2024
111af1f
[Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
varun-sundar-rabindranath Jun 20, 2024
ad137cd
[Model] Port over CLIPVisionModel for VLMs (#5591)
ywang96 Jun 20, 2024
a7dcc62
[Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
varun-sundar-rabindranath Jun 20, 2024
3f3b6b2
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS ke…
tlrmchlsmth Jun 20, 2024
8065a7e
[Frontend] Add FlexibleArgumentParser to support both underscore and …
mgoin Jun 20, 2024
6c5b7af
[distributed][misc] use fork by default for mp (#5669)
youkaichao Jun 21, 2024
b12518d
[Model] MLPSpeculator speculative decoding support (#4947)
JRosenkranz Jun 21, 2024
1f56742
[Kernel] Add punica dimension for Qwen2 LoRA (#5441)
jinzhen-lin Jun 21, 2024
c35e4a3
[BugFix] Fix test_phi3v.py (#5725)
CatherineSue Jun 21, 2024
67005a0
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
jeejeelee Jun 21, 2024
d9a252b
[Core][Distributed] add shm broadcast (#5399)
youkaichao Jun 21, 2024
bd620b0
[Kernel][CPU] Add Quick `gelu` to CPU (#5717)
ywang96 Jun 21, 2024
5b15bde
[Doc] Documentation on supported hardware for quantization methods (#…
mgoin Jun 21, 2024
f1e72cc
[BugFix] exclude version 1.15.0 for modelscope (#5668)
zhyncs Jun 21, 2024
7187507
[ci][test] fix ca test in main (#5746)
youkaichao Jun 21, 2024
f5dda63
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)
rohithkrn Jun 21, 2024
cf90ae0
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616)
jikunshang Jun 22, 2024
9c62db0
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs…
DamonFool Jun 22, 2024
ff9ddbc
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_ba…
zifeitong Jun 22, 2024
0cbc1d2
[Bugfix] Fix pin_lora error in TPU executor (#5760)
WoosukKwon Jun 22, 2024
8c00f9c
[Docs][TPU] Add installation tip for TPU (#5761)
WoosukKwon Jun 22, 2024
832ea88
[core][distributed] improve shared memory broadcast (#5754)
youkaichao Jun 22, 2024
6c916ac
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
varun-sundar-rabindranath Jun 23, 2024
5d4d905
[Distributed] Add send and recv helpers (#5719)
andoorve Jun 23, 2024
edd5fe5
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requi…
Isotr0py Jun 24, 2024
c246212
[doc][faq] add warning to download models for every nodes (#5783)
youkaichao Jun 24, 2024
a2899d5
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Jun 24, 2024
fc6d4b4
post-rebase api adjustments
kzawora-intel Jun 24, 2024
126c607
Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…
kzawora-intel Jun 24, 2024
e72dc6c
[Doc] Add "Suggest edit" button to doc pages (#5789)
mgoin Jun 24, 2024
1744cc9
[Doc] Add Phi-3-medium to list of supported models (#5788)
mgoin Jun 24, 2024
ba991d5
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args…
CatherineSue Jun 24, 2024
e9de9dd
[ci] Remove aws template (#5757)
khluu Jun 25, 2024
f23871e
[Doc] Add notice about breaking changes to VLMs (#5818)
DarkLight1337 Jun 25, 2024
2ce5d66
[Speculative Decoding] Support draft model on different tensor-paral…
wooyeonlee0 Jun 25, 2024
d12bff7
add pin_lora to habana components
kzawora-intel Jun 25, 2024
43ff60b
Merge remote-tracking branch 'upstream/main' into private/kzawora/re…
kzawora-intel Jun 25, 2024
efce3c4
add WA for model loader
kzawora-intel Jun 25, 2024
c1e7589
fix api mismatches with ray
kzawora-intel Jun 25, 2024
58bd037
tensor parallel fixes
kzawora-intel Jun 25, 2024
1d6409b
workers cpu alignment fix
kzawora-intel Jun 25, 2024
7b99314
[Misc] Remove useless code in cpu_worker (#5824)
DamonFool Jun 25, 2024
952b7c4
prefill/decode metadata fixes
kzawora-intel Jun 25, 2024
67882db
[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)
Yard1 Jun 25, 2024
cf04c81
re-enable attn metadata trimming
kzawora-intel Jun 25, 2024
2b850fe
worker_use_ray fix
kzawora-intel Jun 25, 2024
c18ebfd
[doc][distributed] add both gloo and nccl tests (#5834)
youkaichao Jun 25, 2024
d9b34ba
[CI/Build] Add unit testing for FlexibleArgumentParser (#5798)
mgoin Jun 25, 2024
dd248f7
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16`…
dsikka Jun 25, 2024
bc34937
[Hardware][TPU] Refactor TPU backend (#5831)
WoosukKwon Jun 25, 2024
dd793d1
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improv…
mawong-amd Jun 25, 2024
f178e56
[Hardware][TPU] Raise errors for unsupported sampling params (#5850)
WoosukKwon Jun 25, 2024
c2a8ac7
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
tdoublep Jun 26, 2024
8207972
[Bugfix] Fix assertion in NeuronExecutor (#5841)
aws-patlange Jun 26, 2024
dda4811
[Core] Refactor Worker and ModelRunner to consolidate control plane c…
stephanie-wang Jun 26, 2024
3aa7b6c
[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832)
ywang96 Jun 26, 2024
515080a
[bugfix][distributed] fix shm broadcast when the queue size is full (…
youkaichao Jun 26, 2024
6806998
[Bugfix] Fix embedding to support 2D inputs (#5829)
WoosukKwon Jun 26, 2024
3439c5a
[Bugfix][TPU] Fix KV cache size calculation (#5860)
WoosukKwon Jun 26, 2024
6984c02
[CI/Build] Refactor image test assets (#5821)
DarkLight1337 Jun 26, 2024
5bfd1bb
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
ProExpertProg Jun 26, 2024
c54269d
[Frontend] Add tokenize/detokenize endpoints (#5054)
sasha0552 Jun 26, 2024
cbc53b6
[Hardware][TPU] Support parallel sampling & Swapping (#5855)
WoosukKwon Jun 26, 2024
f5c8628
[Bugfix][TPU] Fix CPU cache allocation (#5869)
WoosukKwon Jun 26, 2024
38a1674
Support CPU inference with VSX PowerPC ISA (#5652)
ChipKerchner Jun 26, 2024
294104c
[doc] update usage of env var to avoid conflict (#5873)
youkaichao Jun 26, 2024
b9e8425
[Misc] Add example for LLaVA-NeXT (#5879)
ywang96 Jun 27, 2024
2110557
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
njhill Jun 27, 2024
6eabc6c
[Doc] Add note about context length in Phi-3-Vision example (#5887)
DarkLight1337 Jun 27, 2024
d12af20
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted prop…
xwjiang2010 Jun 27, 2024
96354d6
[Model] Add base class for LoRA-supported models (#5018)
DarkLight1337 Jun 27, 2024
2061f0b
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888)
ywang96 Jun 27, 2024
e36df83
Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…
kzawora-intel Jun 27, 2024
e9d32d0
[CI/Build] [1/3] Reorganize entrypoints tests (#5526)
DarkLight1337 Jun 27, 2024
1fd06cc
add collective crash WA
kzawora-intel Jun 27, 2024
940f525
add comment to the weird mark_step
kzawora-intel Jun 27, 2024
98cf2ed
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896)
DarkLight1337 Jun 27, 2024
3fd02bd
[doc][misc] add note for Kubernetes users (#5916)
youkaichao Jun 27, 2024
691e29e
[BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (#5…
njhill Jun 27, 2024
365791f
[BugFix] Fix `min_tokens` behaviour for multiple eos tokens (#5849)
njhill Jun 27, 2024
736ed38
[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922)
ywang96 Jun 27, 2024
79c92c7
[Model] Add Gemma 2 (#5908)
WoosukKwon Jun 27, 2024
64e8d2a
[core][misc] remove logical block (#5882)
youkaichao Jun 27, 2024
c3dde36
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932)
divakar-amd Jun 27, 2024
f136da1
[Hardware][TPU] Optimize KV cache swapping (#5878)
WoosukKwon Jun 28, 2024
74d55c0
[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast prope…
xwjiang2010 Jun 28, 2024
0d0e3a4
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU…
Isotr0py Jun 28, 2024
5cbe8d1
[Core] Registry for processing model inputs (#5214)
DarkLight1337 Jun 28, 2024
5932634
Unmark fused_moe config json file as executable (#5960)
tlrmchlsmth Jun 28, 2024
57f09a4
[Hardware][Intel] OpenVINO vLLM backend (#5379)
ilya-lavrenov Jun 28, 2024
ec1ad00
[Bugfix] Better error message for MLPSpeculator when `num_speculative…
tdoublep Jun 28, 2024
3b752a6
[CI/Build] [2/3] Reorganize entrypoints tests (#5904)
DarkLight1337 Jun 28, 2024
b90d8cd
[Distributed] Make it clear that % should not be in tensor dict keys.…
xwjiang2010 Jun 28, 2024
b2c6202
[Spec Decode] Introduce DraftModelRunner (#5799)
comaniac Jun 28, 2024
6a2d659
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
tlrmchlsmth Jun 28, 2024
b185230
[ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Sim…
robertgshaw2-neuralmagic Jun 28, 2024
2cd402e
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP…
robertgshaw2-neuralmagic Jun 28, 2024
be0b3af
Support Deepseek-V2 (#4650)
zwd003 Jun 28, 2024
4bf35ed
[Bugfix] Only add `Attention.kv_scale` if kv cache quantization is en…
mgoin Jun 28, 2024
5d2a1a9
Unmark more files as executable (#5962)
tlrmchlsmth Jun 28, 2024
6a62cb8
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadEr…
robertgshaw2-neuralmagic Jun 28, 2024
7041de4
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for …
LiuXiaoxuanPKU Jun 28, 2024
54814fd
[Bugfix][TPU] Fix TPU sampler output (#5978)
WoosukKwon Jun 29, 2024
7f83f40
[Bugfix][TPU] Fix pad slot id (#5977)
WoosukKwon Jun 29, 2024
c4bca74
[Bugfix] fix missing last itl in openai completions benchmark (#5926)
mcalman Jun 29, 2024
906a19c
[Misc] Extend vLLM Metrics logging API (#5925)
SolitaryThinker Jun 29, 2024
ba49944
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
joerunde Jun 29, 2024
580353d
[Bugfix] Fix precisions in Gemma 1 (#5913)
WoosukKwon Jun 29, 2024
329df38
[Misc] Update Phi-3-Vision Example (#5981)
ywang96 Jun 29, 2024
51e971d
[Bugfix] Support `eos_token_id` from `config.json` (#5954)
DarkLight1337 Jun 29, 2024
7c01f70
[Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum …
Yard1 Jun 29, 2024
f7dac83
[Kernel] Raise an exception in MoE kernel if the batch size is larger…
comaniac Jun 29, 2024
8dbfcd3
[ CI/Build ] Added E2E Test For Compressed Tensors (#5839)
robertgshaw2-neuralmagic Jun 29, 2024
99397da
[CI/Build] Add TP test for vision models (#5892)
DarkLight1337 Jun 29, 2024
75aa144
[ CI/Build ] LM Eval Harness Based CI Testing (#5838)
robertgshaw2-neuralmagic Jun 29, 2024
9def106
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix…
mawong-amd Jun 29, 2024
bcc6a09
[CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989)
ywang96 Jun 30, 2024
cff6a1f
[CI/Build] Reuse code for checking output consistency (#5988)
DarkLight1337 Jun 30, 2024
9d47f64
[CI/Build] [3/3] Reorganize entrypoints tests (#5966)
DarkLight1337 Jun 30, 2024
2be6955
[ci][distributed] fix device count call
youkaichao Jun 30, 2024
c6c240a
[Frontend]: Support base64 embedding (#5935)
llmpros Jun 30, 2024
f5e73c9
[Lora] Use safetensor keys instead of adapter_config.json to find une…
rkooo567 Jun 30, 2024
deacb7e
[ CI ] Temporarily Disable Large LM-Eval Tests (#6005)
robertgshaw2-neuralmagic Jun 30, 2024
7836fdc
[Misc] Fix `get_min_capability` (#5971)
dsikka Jun 30, 2024
af9ad46
[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify …
robertgshaw2-neuralmagic Jun 30, 2024
614aa51
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007)
youkaichao Jul 1, 2024
80ca1e6
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into…
sroy745 Jul 1, 2024
7076c89
Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…
kzawora-intel Jul 1, 2024
a3ac366
Revert test changes
kzawora-intel Jul 1, 2024
85af27e
cleanup
kzawora-intel Jul 1, 2024
f856a85
llm engine cleanup
kzawora-intel Jul 1, 2024
b1f8b71
utils.py cleanup
kzawora-intel Jul 1, 2024
fb74454
custom ops refactor
kzawora-intel Jul 1, 2024
0e63941
move xops to ops
kzawora-intel Jul 1, 2024
463a8e6
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel Jul 1, 2024
0141d57
remove vllm/hpu/attn_bias.py
kzawora-intel Jul 1, 2024
52fa486
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel Jul 1, 2024
a21fe62
whitespace fix
kzawora-intel Jul 1, 2024
aaf5446
revert accidental changes in rmsnorm
kzawora-intel Jul 1, 2024
1ec95c4
Fix hpugraph hashing
kzawora-intel Jul 1, 2024
2394c41
add trim_attn_metadata comment
kzawora-intel Jul 1, 2024
98fb698
fix prompt bucketing:
kzawora-intel Jul 1, 2024
d76084c
[ CI ] Re-enable Large Model LM Eval (#6031)
robertgshaw2-neuralmagic Jul 1, 2024
4050d64
[doc][misc] remove deprecated api server in doc (#6037)
youkaichao Jul 1, 2024
bb60326
[Misc] update benchmark backend for scalellm (#6018)
zhyncs Jul 1, 2024
8893130
[doc][misc] further lower visibility of simple api server (#6041)
youkaichao Jul 1, 2024
dec6fc6
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizer…
Yard1 Jul 1, 2024
12a5995
[Bugfix] adding chunking mechanism to fused_moe to handle large input…
avshalomman Jul 1, 2024
83bdcb6
add FAQ doc under 'serving' (#5946)
llmpros Jul 1, 2024
8e0817c
[Bugfix][Doc] Fix Doc Formatting (#6048)
ywang96 Jul 1, 2024
c4059ea
[Bugfix] Add explicit `end_forward` calls to flashinfer (#6044)
Yard1 Jul 1, 2024
c87ebc3
[BugFix] Ensure worker model loop is always stopped at the right time…
njhill Jul 1, 2024
e373853
[Frontend] Relax api url assertion for openai benchmarking (#6046)
jamestwhedbee Jul 1, 2024
5460070
[Model] Changes to MLPSpeculator to support tie_weights and input_sca…
tdoublep Jul 1, 2024
3476ed0
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 defa…
alexm-neuralmagic Jul 2, 2024
2c37540
[Frontend] Add template related params to request (#5709)
danieljannai21 Jul 2, 2024
98d6682
[VLM] Remove `image_input_type` from VLM config (#5852)
xwjiang2010 Jul 2, 2024
c365082
Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…
kzawora-intel Jul 2, 2024
31354e5
[Doc] Reinstate doc dependencies (#6061)
DarkLight1337 Jul 2, 2024
aee6daf
Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…
kzawora-intel Jul 2, 2024
d99d986
guard model loader wa for hpu
kzawora-intel Jul 2, 2024
2 changes: 1 addition & 1 deletion .buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
 import os
 import zipfile

-MAX_SIZE_MB = 100
+MAX_SIZE_MB = 200


 def print_top_10_largest_files(zip_file):
4 changes: 0 additions & 4 deletions .buildkite/download-images.sh
@@ -8,10 +8,6 @@ set -o pipefail
 # aws s3 sync s3://air-example-data-2/vllm_opensource_llava/ images/
 mkdir -p images
 cd images
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_pixel_values.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign_image_features.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_pixel_values.pt
-wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom_image_features.pt
 wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign.jpg
 wget https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/cherry_blossom.jpg

11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.892
  - name: "exact_match,flexible-extract"
    value: 0.892
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.616
  - name: "exact_match,flexible-extract"
    value: 0.632
limit: 250
num_fewshot: 5
2 changes: 2 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,2 @@
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
2 changes: 2 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,2 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8.yaml
46 changes: 46 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m    - huggingface stub or local directory of the model"
    echo "  -b    - batch size to run the evaluation at"
    echo "  -l    - limit number of samples to run"
    echo "  -f    - number of fewshot samples to use"
    echo
}

while getopts "m:b:l:f:" OPT; do
  case ${OPT} in
    m )
      MODEL="$OPTARG"
      ;;
    b )
      BATCH_SIZE="$OPTARG"
      ;;
    l )
      LIMIT="$OPTARG"
      ;;
    f )
      FEWSHOT="$OPTARG"
      ;;
    \? )
      usage
      exit 1
      ;;
  esac
done

lm_eval --model hf \
  --model_args pretrained=$MODEL,parallelize=True \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
51 changes: 51 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.2

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using vllm."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m    - huggingface stub or local directory of the model"
    echo "  -b    - batch size to run the evaluation at"
    echo "  -l    - limit number of samples to run"
    echo "  -f    - number of fewshot samples to use"
    echo "  -t    - tensor parallel size to run at"
    echo
}

while getopts "m:b:l:f:t:" OPT; do
  case ${OPT} in
    m )
      MODEL="$OPTARG"
      ;;
    b )
      BATCH_SIZE="$OPTARG"
      ;;
    l )
      LIMIT="$OPTARG"
      ;;
    f )
      FEWSHOT="$OPTARG"
      ;;
    t )
      TP_SIZE="$OPTARG"
      ;;
    \? )
      usage
      exit 1
      ;;
  esac
done

lm_eval --model vllm \
  --model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
59 changes: 59 additions & 0 deletions .buildkite/lm-eval-harness/run-tests.sh
@@ -0,0 +1,59 @@
#!/bin/bash

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using vllm and compares to "
    echo "precomputed baseline (measured by HF transformers.)"
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -c    - path to the test data config (e.g. configs/models-small.txt)"
    echo "  -t    - tensor parallel size"
    echo
}

SUCCESS=0

while getopts "c:t:" OPT; do
  case ${OPT} in
    c )
      CONFIG="$OPTARG"
      ;;
    t )
      TP_SIZE="$OPTARG"
      ;;
    \? )
      usage
      exit 1
      ;;
  esac
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
    LOCAL_SUCCESS=0

    echo "=== RUNNING MODEL: $MODEL_CONFIG WITH TP SIZE: $TP_SIZE ==="

    export LM_EVAL_TEST_DATA_FILE=$PWD/configs/${MODEL_CONFIG}
    export LM_EVAL_TP_SIZE=$TP_SIZE
    pytest -s test_lm_eval_correctness.py || LOCAL_SUCCESS=$?

    if [[ $LOCAL_SUCCESS == 0 ]]; then
        echo "=== PASSED MODEL: ${MODEL_CONFIG} ==="
    else
        echo "=== FAILED MODEL: ${MODEL_CONFIG} ==="
    fi

    SUCCESS=$((SUCCESS + LOCAL_SUCCESS))

done

if [ "${SUCCESS}" -eq "0" ]; then
    exit 0
else
    exit 1
fi
54 changes: 54 additions & 0 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -0,0 +1,54 @@
"""
LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml

* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
* export LM_EVAL_TP_SIZE=4
* pytest -s test_lm_eval_correctness.py
"""

import os
from pathlib import Path

import lm_eval
import numpy
import yaml

RTOL = 0.02
TEST_DATA_FILE = os.environ.get(
    "LM_EVAL_TEST_DATA_FILE",
    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")

TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)


def launch_lm_eval(eval_config):
    model_args = f"pretrained={eval_config['model_name']}," \
                 f"tensor_parallel_size={TP_SIZE}"

    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=model_args,
        tasks=[task["name"] for task in eval_config["tasks"]],
        num_fewshot=eval_config["num_fewshot"],
        limit=eval_config["limit"],
        batch_size="auto")

    return results


def test_lm_eval_correctness():
    eval_config = yaml.safe_load(
        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))

    # Launch eval requests.
    results = launch_lm_eval(eval_config)

    # Confirm scores match ground truth.
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            ground_truth = metric["value"]
            measured_value = results["results"][task["name"]][metric["name"]]
            print(f'{task["name"]} | {metric["name"]}: '
                  f'ground_truth={ground_truth} | measured={measured_value}')
            assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
103 changes: 103 additions & 0 deletions .buildkite/nightly-benchmarks/README.md
@@ -0,0 +1,103 @@
# vLLM benchmark suite

## Introduction

This directory contains the performance benchmarking CI for vllm.
The goal is to help developers know the impact of their PRs on the performance of vllm.

This benchmark will be *triggered* upon:
- A PR being merged into vllm.
- Every commit for those PRs with `perf-benchmarks` label.

**Benchmarking Coverage**: latency, throughput and fixed-QPS serving on A100 (support for more GPUs is coming later), with different models.

**Benchmarking Duration**: about 1hr.

**For benchmarking developers**: please try your best to keep the benchmarking duration under 1.5 hours so that it won't take forever to run.


## Configuring the workload

The benchmarking workload contains three parts:
- Latency tests in `latency-tests.json`.
- Throughput tests in `throughput-tests.json`.
- Serving tests in `serving-tests.json`.

See [descriptions.md](tests/descriptions.md) for detailed descriptions.

### Latency test

Here is an example of one test inside `latency-tests.json`:

```json
[
    {
        "test_name": "latency_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    },
]
```

In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments used for `benchmark_latency.py`. Note that you should use an underscore `_` instead of a dash `-` when specifying the command line arguments; `run-benchmarks-suite.sh` converts the underscores to dashes when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15` (a sketch of this conversion follows below).

Note that the performance numbers are highly sensitive to the values of these parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script saves json results by itself, so please do not configure the `--output-json` parameter in the json file.
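
For illustration only, here is a minimal Python sketch of the kind of underscore-to-dash conversion described above. The function name `params_to_cli_args` and the handling of empty-string flags are assumptions for readability, not the actual `run-benchmarks-suite.sh` implementation:

```python
# Hypothetical sketch: turn a "parameters" dict into CLI arguments for benchmark_latency.py.
def params_to_cli_args(parameters: dict) -> str:
    args = []
    for key, value in parameters.items():
        flag = "--" + key.replace("_", "-")   # underscores become dashes
        if value == "":                       # bare flags such as "disable_log_stats": ""
            args.append(flag)
        else:
            args.append(f"{flag} {value}")
    return " ".join(args)


params = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}
print(params_to_cli_args(params))
# --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15
```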


### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed forward to `benchmark_throughput.py`.

The number produced by this test is also stable -- but note that a slight change in the parameter values might change the performance numbers by a lot.

### Serving test
We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
    {
        "test_name": "serving_llama8B_tp1_sharegpt",
        "qps_list": [1, 4, 16, "inf"],
        "server_parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
]
```

Inside this example:
- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server_parameters` attribute includes the command line arguments for the vLLM server.
- The `client_parameters` attribute includes the command line arguments for `benchmark_serving.py`.
- The `qps_list` attribute controls the list of QPS values to test. Each value is used to configure the `--request-rate` parameter of `benchmark_serving.py` (see the sketch at the end of this subsection).

The numbers from this test are less stable than the latency and throughput benchmarks (due to the randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in these numbers (e.g. a 5% change) still indicates a meaningful difference in performance.

WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
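
As an illustration only, the loop below sketches how each entry in `qps_list` could be expanded into one `benchmark_serving.py` invocation with the corresponding `--request-rate`; the command construction is an assumption for readability, not the actual harness code:

```python
# Hypothetical sketch: one benchmark_serving.py run per QPS value in qps_list.
import shlex

test = {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "client_parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "backend": "vllm",
        "dataset_name": "sharegpt",
        "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
        "num_prompts": 200,
    },
}

for qps in test["qps_list"]:
    client_args = " ".join(
        f"--{key.replace('_', '-')} {shlex.quote(str(value))}"
        for key, value in test["client_parameters"].items())
    cmd = f"python benchmark_serving.py {client_args} --request-rate {qps}"
    print(cmd)  # one serving benchmark run per QPS value
```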

## Visualizing the results
The `convert-results-json-to-markdown.py` script helps you put the benchmarking results into a markdown table by formatting [descriptions.md](tests/descriptions.md) with the real benchmarking results.
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking job.
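
As a rough illustration (not the actual `convert-results-json-to-markdown.py`; the `results/` directory layout and the field names below are assumptions), a script of this kind essentially reads the raw json artifacts and emits markdown table rows:

```python
# Hypothetical sketch: collect raw benchmark JSON results and print a markdown table.
import json
from pathlib import Path

rows = []
for result_file in Path("results").glob("*.json"):   # assumed location of raw artifacts
    data = json.loads(result_file.read_text())
    # Field names below are illustrative, not the real schema.
    rows.append((data.get("test_name", result_file.stem),
                 data.get("avg_latency", "n/a"),
                 data.get("throughput", "n/a")))

print("| Test | Avg latency (s) | Throughput (tok/s) |")
print("|------|-----------------|--------------------|")
for name, latency, throughput in rows:
    print(f"| {name} | {latency} | {throughput} |")
```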