[feat] add disable_sliding_window parameter to vllm/lmi-dist engine args #2338

Merged
@@ -62,6 +62,7 @@ class LmiDistRbProperties(Properties):
enable_chunked_prefill: Optional[bool] = False
cpu_offload_gb_per_gpu: Optional[int] = 0
enable_prefix_caching: Optional[bool] = False
disable_sliding_window: Optional[bool] = False

@model_validator(mode='after')
def validate_mpi(self):
@@ -57,6 +57,7 @@ class VllmRbProperties(Properties):
enable_chunked_prefill: Optional[bool] = False
cpu_offload_gb_per_gpu: Optional[int] = 0
enable_prefix_caching: Optional[bool] = False
disable_sliding_window: Optional[bool] = False

@field_validator('engine')
def validate_engine(cls, engine):
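For illustration, here is a minimal sketch (not the actual LMI classes) of how a Pydantic model like `VllmRbProperties` or `LmiDistRbProperties` can turn a string property such as `option.disable_sliding_window=true` into a boolean field; the class name and parsing path below are simplified assumptions.

```python
# Minimal sketch, not the actual LMI implementation: a Pydantic model with the
# same optional boolean fields, showing how a string value from
# serving.properties is coerced into a Python bool.
from typing import Optional

from pydantic import BaseModel


class RollingBatchPropertiesSketch(BaseModel):
    """Hypothetical stand-in for VllmRbProperties / LmiDistRbProperties."""
    enable_chunked_prefill: Optional[bool] = False
    cpu_offload_gb_per_gpu: Optional[int] = 0
    enable_prefix_caching: Optional[bool] = False
    disable_sliding_window: Optional[bool] = False


# Pydantic coerces common string forms ("true", "false", "1", "0") to booleans,
# so a serving.properties value of "true" becomes Python True.
props = RollingBatchPropertiesSketch(disable_sliding_window="true")
assert props.disable_sliding_window is True
```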
@@ -84,6 +84,7 @@ def __init__(self, model_id_or_path: str, properties: dict, **kwargs):
enable_chunked_prefill=self.lmi_dist_config.enable_chunked_prefill,
cpu_offload_gb=self.lmi_dist_config.cpu_offload_gb_per_gpu,
enable_prefix_caching=self.lmi_dist_config.enable_prefix_caching,
disable_sliding_window=self.lmi_dist_config.disable_sliding_window,
**engine_kwargs)

kwargs = {}
@@ -263,7 +263,9 @@ def get_engine_args_from_config(config: VllmRbProperties) -> EngineArgs:
max_logprobs=config.max_logprobs,
enable_chunked_prefill=config.enable_chunked_prefill,
cpu_offload_gb=config.cpu_offload_gb_per_gpu,
enable_prefix_caching=config.enable_prefix_caching,
disable_sliding_window=config.disable_sliding_window,
)


def get_multi_modal_data(request: Request) -> dict:
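To show how the new property reaches the backend, here is a hedged sketch of constructing vLLM `EngineArgs` with `disable_sliding_window`, mirroring what `get_engine_args_from_config` forwards above; the model id is illustrative, and the keyword set assumes the vLLM version bundled with LMI 0.30.0 exposes these arguments as the diff implies.

```python
# Illustrative sketch (assumed EngineArgs fields, per the diff above);
# this is not the full LMI config-to-engine mapping.
from vllm import EngineArgs

engine_args = EngineArgs(
    model="mistralai/Mistral-7B-v0.1",  # illustrative sliding-window model
    enable_prefix_caching=True,
    # Disabling the sliding window caps the usable context length to the
    # model's sliding window size.
    disable_sliding_window=True,
)
```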
@@ -105,6 +105,7 @@ interface SecureModeAllowList {
"option.enable_chunked_prefill",
"option.cpu_offload_gb_per_gpu",
"option.enable_prefix_caching",
"option.disable_sliding_window",
"option.enable_streaming",
"option.tgi_compat");
}
7 changes: 4 additions & 3 deletions serving/docs/lmi/user_guides/lmi-dist_user_guide.md
@@ -151,6 +151,7 @@ Here are the advanced parameters that are available when using LMI-Dist.
| option.max_lora_rank | \>= 0.27.0 | Pass Through | This config determines the maximum rank allowed for a LoRA adapter. Set this value to the maximum rank of your adapters. Setting a larger value enables more adapters at the cost of greater memory usage. | Default: `16` |
| option.lora_extra_vocab_size | \>= 0.27.0 | Pass Through | This config determines the maximum additional vocabulary that can be added through a LoRA adapter. | Default: `256` |
| option.max_cpu_loras | \>= 0.27.0 | Pass Through | This config determines the maximum number of LoRA adapters to cache in memory. All others will be evicted to disk. | Default: `None` |
| option.enable_chunked_prefill | \>= 0.29.0 | Pass Through | This config enables chunked prefill support. With chunked prefill, longer prompts are chunked and batched with decode requests to reduce inter-token latency. This option is EXPERIMENTAL and has been tested for llama and falcon models only. It does not yet work with LoRA or speculative decoding. | Default: `False` |
| option.cpu_offload_gb_per_gpu | \>= 0.29.0 | Pass Through | This config allows offloading model weights to CPU memory so that large models can run with limited GPU memory. | Default: `0` |
| option.enable_prefix_caching | \>= 0.29.0 | Pass Through | This config allows the engine to cache context memory and reuse it to speed up inference. | Default: `False` |
| option.disable_sliding_window | \>= 0.30.0 | Pass Through | This config disables sliding window attention, capping the maximum context length for inference to the model's sliding window size. | Default: `False` |
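For example, a minimal `serving.properties` enabling this option with LMI-Dist might look like the following sketch (the model id is illustrative):

```
option.rolling_batch=lmi-dist
option.model_id=mistralai/Mistral-7B-v0.1
option.disable_sliding_window=true
```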
7 changes: 4 additions & 3 deletions serving/docs/lmi/user_guides/vllm_user_guide.md
@@ -138,6 +138,7 @@ In that situation, there is nothing LMI can do until the issue is fixed in the b…
| option.max_lora_rank | \>= 0.27.0 | Pass Through | This config determines the maximum rank allowed for a LoRA adapter. Set this value to the maximum rank of your adapters. Setting a larger value enables more adapters at the cost of greater memory usage. | Default: `16` |
| option.lora_extra_vocab_size | \>= 0.27.0 | Pass Through | This config determines the maximum additional vocabulary that can be added through a LoRA adapter. | Default: `256` |
| option.max_cpu_loras | \>= 0.27.0 | Pass Through | This config determines the maximum number of LoRA adapters to cache in memory. All others will be evicted to disk. | Default: `None` |
| option.enable_chunked_prefill | \>= 0.29.0 | Pass Through | This config enables chunked prefill support. With chunked prefill, longer prompts are chunked and batched with decode requests to reduce inter-token latency. This option is EXPERIMENTAL and has been tested for llama and falcon models only. It does not yet work with LoRA or speculative decoding. | Default: `False` |
| option.cpu_offload_gb_per_gpu | \>= 0.29.0 | Pass Through | This config allows offloading model weights to CPU memory so that large models can run with limited GPU memory. | Default: `0` |
| option.enable_prefix_caching | \>= 0.29.0 | Pass Through | This config allows the engine to cache context memory and reuse it to speed up inference. | Default: `False` |
| option.disable_sliding_window | \>= 0.30.0 | Pass Through | This config disables sliding window attention, capping the maximum context length for inference to the model's sliding window size. | Default: `False` |
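As a usage sketch, the option can also be supplied through environment variables, assuming the standard LMI mapping of `option.<name>` properties to `OPTION_<NAME>` variables (the model id is illustrative):

```
HF_MODEL_ID=mistralai/Mistral-7B-v0.1
OPTION_ROLLING_BATCH=vllm
OPTION_DISABLE_SLIDING_WINDOW=true
```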