Make some tests and choose an OpenAI-API-compatible local LLM server #7
It could be worth trying a Modal deployment of the LLM server as well.
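A minimal sketch of what such a deployment could look like, assuming a recent Modal SDK and vLLM's OpenAI-compatible server; the model name, GPU spec, and port are illustrative assumptions, not tested settings:

```python
# Hypothetical sketch: serving vLLM's OpenAI-compatible API from Modal.
# Model name, GPU spec, and port are assumptions, not verified settings.
import subprocess

import modal

image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")
app = modal.App("mixtral-vllm-server")  # older Modal SDKs use modal.Stub instead


@app.function(image=image, gpu="A100-80GB:2", timeout=60 * 60)
@modal.web_server(port=8000)
def serve():
    # Launch vLLM's OpenAI-compatible server; Modal proxies port 8000 publicly.
    subprocess.Popen(
        "python -m vllm.entrypoints.openai.api_server "
        "--model mistralai/Mixtral-8x7B-Instruct-v0.1 --port 8000",
        shell=True,
    )
```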
Interesting PRs for vLLM with respect to speculative decoding (vllm-project/vllm#2188) and fused MoE kernels (vllm-project/vllm#2913, vllm-project/vllm#2979).
The language model will be Mixtral. The server must support structured extraction using Pydantic. The evaluation criterion is the read/write speed of the inference server: specifically, how long it takes to read and write one million tokens on the same structured-extraction task across roughly one thousand documents processed in parallel (see the benchmark sketch after the candidate server list below). The same benchmark should also be run on Modal.
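A minimal sketch of the structured-extraction call against any OpenAI-compatible endpoint (vLLM, llama-cpp-python, Ollama all expose one); the base URL, model name, and `Invoice` schema are illustrative assumptions:

```python
# Structured extraction with Pydantic over an OpenAI-compatible API.
# Base URL, model name, and the Invoice schema are assumptions.
from openai import OpenAI
from pydantic import BaseModel


class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str


client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def extract(document: str) -> Invoice:
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[
            {
                "role": "system",
                "content": "Return only JSON matching this schema: "
                + str(Invoice.model_json_schema()),
            },
            {"role": "user", "content": document},
        ],
        temperature=0,
    )
    # Validate the raw completion with Pydantic; raises on malformed output.
    return Invoice.model_validate_json(response.choices[0].message.content)
```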
https://github.com/ollama/ollama
https://github.com/abetlen/llama-cpp-python
https://github.com/vllm-project/vllm
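A sketch of the benchmark referenced above: send roughly one thousand extraction requests concurrently and count prompt (read) and completion (written) tokens from the `usage` field. The concurrency level, model name, and document list are assumptions.

```python
# Throughput sketch: ~1000 concurrent requests, token counts from usage.
# Concurrency level, model name, and documents are assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


async def extract(document: str, semaphore: asyncio.Semaphore):
    async with semaphore:
        response = await client.chat.completions.create(
            model="mistralai/Mixtral-8x7B-Instruct-v0.1",
            messages=[{"role": "user", "content": document}],
            temperature=0,
        )
        return response.usage.prompt_tokens, response.usage.completion_tokens


async def benchmark(documents: list[str], concurrency: int = 64):
    semaphore = asyncio.Semaphore(concurrency)
    start = time.perf_counter()
    usages = await asyncio.gather(*(extract(d, semaphore) for d in documents))
    elapsed = time.perf_counter() - start
    read = sum(p for p, _ in usages)      # prompt tokens (read)
    written = sum(c for _, c in usages)   # completion tokens (written)
    total = read + written
    print(f"{total} tokens in {elapsed:.1f}s ({total / elapsed:.0f} tok/s)")


# asyncio.run(benchmark(documents))  # documents: ~1000 strings to extract from
```

Running the same harness against each candidate server (and against the Modal deployment) would give directly comparable read/write timings for the one-million-token target.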