- Verify compatibility with Llama 2 34B once released
- Optimizations for ROCm
- Optimizations for RTX 20-series (?)
- Look into improving P40 performance
- More testing on Llama 2 models
- Flash Attention 2.0 (?)
- Find a way to eliminate `ExLlamaAttention.repeat_kv` (custom attention kernel? a sketch of the copy it performs appears after this list)
- C++ implementations of sampler functions (a Python top-p reference appears after this list)
- Optimized/batched beam search
- Allow stackable LoRAs (see the sketch after this list)
- Guidance or equivalent
- Comprehensive API server (more than `example_flask.py`)
- Controls to enable beam search
- Rewrite/refactor all the JavaScript and CSS
- Make the web UI a little prettier
- Better error handling
- LoRA controls
- Multiple chat modes with prompt templates (instruct, etc.)
- Support for other quantization methods
- Support for other LLM architectures
- Allow for backpropagation
- LoRA training features
- Soft prompt training
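
For reference on the `repeat_kv` item: in Llama-style grouped-query attention, a helper like the one below materializes extra copies of every K/V head so their count matches the query heads. This is a minimal sketch of the pattern, not the exact code in `ExLlamaAttention` (shapes assumed); the VRAM copy it performs is what a custom attention kernel could avoid by indexing `kv_head = q_head // n_rep` directly.

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # x: (batch, seq_len, n_kv_heads, head_dim) -- assumed layout.
    # Expand each K/V head n_rep times so a GQA model presents as many
    # K/V heads as query heads. The reshape materializes the copies,
    # which is exactly the allocation a fused kernel would skip.
    if n_rep == 1:
        return x
    batch, seq_len, n_kv_heads, head_dim = x.shape
    return (
        x[:, :, :, None, :]
        .expand(batch, seq_len, n_kv_heads, n_rep, head_dim)
        .reshape(batch, seq_len, n_kv_heads * n_rep, head_dim)
    )
```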
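As a reference point for the sampler port, here is a plain Python/torch top-p (nucleus) sampler; a C++ implementation would perform the same sort, cumulative sum, and masking over the final logits row without the Python overhead. The function name and signature are illustrative, not the project's API.

```python
import torch

def sample_top_p(logits: torch.Tensor, top_p: float, temperature: float = 1.0) -> int:
    # logits: 1-D tensor over the vocabulary (illustrative layout).
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens outside the smallest prefix whose mass reaches top_p.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```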
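Stacking LoRAs amounts to summing several independent low-rank updates on top of each base linear layer: y = x·Wᵀ + Σᵢ sᵢ · x·Aᵢᵀ·Bᵢᵀ. A minimal sketch, with hypothetical names and the usual (r, in) / (out, r) adapter shapes:

```python
import torch

def lora_stack_forward(x, weight, loras):
    # weight: (out_features, in_features) base layer.
    # loras: list of (lora_a, lora_b, scaling) with
    #   lora_a: (r, in_features), lora_b: (out_features, r).
    y = x @ weight.t()
    for lora_a, lora_b, scaling in loras:
        # Each adapter adds an independent rank-r correction;
        # order does not matter because the updates are summed.
        y = y + (x @ lora_a.t() @ lora_b.t()) * scaling
    return y
```

Because the updates are additive, enabling or disabling an adapter at runtime is just including or excluding it from the list.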