The load test is designed to simulate continuous production load and minimize the effect of model generation behavior:
- variation in generation parameters
- a continuous request stream with varying distributions and load levels
- forced generation of an exact number of output tokens for most providers (see the sketch after this list)
- a specified load test duration
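For illustration, here is a minimal sketch of forcing an exact output length against an OpenAI-compatible endpoint. `ignore_eos` is a vLLM extension (not part of the OpenAI API), and the URL and model name are placeholders:

```python
# Sketch: force exactly `max_tokens` output tokens on a vLLM
# OpenAI-compatible endpoint. `ignore_eos` keeps generation going
# past the end-of-sequence token; plain OpenAI has no equivalent.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # placeholder vLLM deployment
    json={
        "model": "my-model",   # placeholder model name
        "prompt": "Hello",
        "max_tokens": 128,     # the exact output length we want
        "ignore_eos": True,    # vLLM extension: ignore the EOS token
    },
    timeout=60,
)
print(resp.json()["usage"]["completion_tokens"])  # expected: 128
```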
Supported providers and API flavors:
- OpenAI API compatible endpoints (see the sketch after this list):
  - Fireworks.ai public or private deployments
  - vLLM
  - Anyscale Endpoints
  - OpenAI
  - Text Generation Inference (TGI) / HuggingFace Endpoints
  - Together.ai
- NVIDIA Triton server:
  - Legacy HTTP endpoints (no streaming)
  - LLM-focused endpoints (with or without streaming)
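All OpenAI-compatible providers can be exercised through the same client by swapping the base URL; a rough sketch, where every base URL and model name is an illustrative placeholder:

```python
# Sketch: one OpenAI client, several compatible backends.
# All base URLs and model names below are illustrative placeholders.
from openai import OpenAI

TARGETS = {
    "fireworks": ("https://api.fireworks.ai/inference/v1",
                  "accounts/fireworks/models/llama-v3p1-8b-instruct"),
    "vllm":      ("http://localhost:8000/v1", "my-model"),
    "openai":    ("https://api.openai.com/v1", "gpt-3.5-turbo-instruct"),
}

base_url, model = TARGETS["vllm"]
client = OpenAI(base_url=base_url, api_key="EMPTY")  # local vLLM ignores the key
resp = client.completions.create(model=model, prompt="Hello", max_tokens=16)
print(resp.choices[0].text)
```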
Captured metrics:
- Overall latency
- Number of generated tokens
- Sustained request throughput (QPS)
- Time to first token (TTFT) for streaming
- Per-token latency for streaming (see the sketch after this list)
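As a rough illustration of how the streaming metrics relate (not necessarily how the tool computes them), TTFT and per-token latency can be derived from chunk arrival times. The endpoint and model name are placeholders, and the averaging assumes roughly one token per streamed chunk:

```python
# Sketch: deriving TTFT, overall latency, and per-token latency
# from a streaming completion. Assumes an OpenAI-compatible endpoint
# (placeholder URL/model) and roughly one token per streamed chunk.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
arrivals = []
stream = client.completions.create(
    model="my-model", prompt="Hello", max_tokens=64, stream=True
)
for chunk in stream:
    arrivals.append(time.perf_counter())

ttft = arrivals[0] - start       # time to first token
overall = arrivals[-1] - start   # overall latency
per_token = (arrivals[-1] - arrivals[0]) / max(len(arrivals) - 1, 1)
print(f"TTFT={ttft:.3f}s overall={overall:.3f}s per-token={per_token * 1000:.1f}ms")
```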
The metrics summary can be exported to CSV, so multiple configurations can be scripted over and compared. The CSV file can be imported into Google Sheets/Excel or Jupyter for further analysis, as sketched below.
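For example, a short sketch of loading an exported summary into pandas in Jupyter; the file name and column names are hypothetical:

```python
# Sketch: analyzing an exported metrics summary with pandas.
# "summary.csv" and the column names are hypothetical examples.
import pandas as pd

df = pd.read_csv("summary.csv")
# Compare latency percentiles across load levels, assuming the CSV
# carries one row per run with qps / ttft_ms / latency_ms columns.
print(df.groupby("qps")[["ttft_ms", "latency_ms"]].describe())
```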
See the `llm_bench` folder for detailed usage.