This is the repository for the community edition of the TitanML Takeoff server, a server designed for optimized inference of large language models.
For usage information, tutorials, and examples, see the docs.
✔️ Easy deployment and streaming response
✔️ Optimized int8 quantization
✔️ Chat and playground-like interface
✔️ Support for encoder-decoder (T5 family) and decoder models
For the pro edition, including multi-GPU inference, int4 quantization, and more, contact us.
To use the inference server, use the iris launcher (you'll also need Docker installed). To install iris, run
pip install titan-iris
Then, to launch an inference server with a model, run
iris takeoff --model tiiuae/falcon-7b-instruct --device cpu --port 8000
You'll be prompted to log in. To run with GPU access, use --device cuda instead.
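For example, the equivalent GPU launch is:
iris takeoff --model tiiuae/falcon-7b-instruct --device cuda --port 8000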
To experiment with the resulting server, navigate to http://localhost:8000/demos/playground or http://localhost:8000/demos/chat. To see docs on how to query the model, navigate to http://localhost:8000/docs.
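Once the server is running, you can also query it over HTTP. The snippet below is only a sketch: the /generate endpoint and its JSON text field are assumptions, so check http://localhost:8000/docs for the actual request schema.
# Hypothetical sketch: the endpoint name and payload shape are assumed,
# not confirmed here; see http://localhost:8000/docs for the real API.
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the capital of France?"}'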
We welcome community contributions! Please contact us on our Discord if you have any questions.
To build the development environment, run the following commands:
# Access the repository
$ cd takeoff
# For dev, build the image first
$ docker build -t takeoff .
# Spin up the container
$ docker run -it -p 8000:80 --gpus all -v $HOME/.iris_cache/:/code/models/ --entrypoint /bin/bash takeoff
# Inside the container, set the model and device
export TAKEOFF_MODEL_NAME=t5-small
export TAKEOFF_DEVICE=cuda # or cpu
# This runs the CT2 conversion and then spins up the FastAPI server
$ sh run.sh
# The server will be available at http://localhost:8000
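Once run.sh is up, a quick smoke test from the host is to fetch the interactive API docs (the container maps port 80 to host port 8000):
# Should return the interactive API docs page served by the container
curl http://localhost:8000/docs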
You can then use iris takeoff --infer to test the inference.
For more details on how to use the server, check out the docs.