TODO.md

Model compatibility

  • Verify compatibility with Llama-2 34B once released

GPU compatibility (etc.)

  • Optimizations for ROCm
  • Optimizations for RTX 20-series (maybe)
  • Look into improving P40 performance

Testing

  • More testing on Llama 2 models

Optimization

  • Flash Attention 2.0 (?)
  • Find a way to eliminate ExLlamaAttention.repeat_kv (custom attention kernel?)
  • C++ implementations of sampler functions
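
For context on the `repeat_kv` item, here is a minimal sketch of the idea, assuming `repeat_kv` plays the role it usually does in grouped-query attention: copying each key/value head so the head count matches the query heads. The function bodies, the helper `kv_head_for_query`, and the head counts below are hypothetical illustrations in plain Python, not ExLlama's code. The point is that a kernel which indexes the shared head directly would likely not need the copy at all.

```python
# Illustrative only: a pure-Python stand-in for the key/value head expansion
# used in grouped-query attention. Not ExLlama's actual implementation.

def repeat_kv(kv_heads, n_rep):
    """Repeat each key/value head n_rep times so the head count matches the
    number of query heads. kv_heads is a list of per-head data."""
    if n_rep == 1:
        return list(kv_heads)
    return [head for head in kv_heads for _ in range(n_rep)]


def kv_head_for_query(q_head, n_rep):
    """Index mapping a custom kernel could use instead of copying:
    query head q reads key/value head q // n_rep."""
    return q_head // n_rep


kv = ["kv0", "kv1"]          # 2 key/value heads (hypothetical sizes)
n_q_heads = 8                # 8 query heads, so each KV head serves 4
n_rep = n_q_heads // len(kv)

# Both approaches give every query head the same key/value data.
expanded = repeat_kv(kv, n_rep)
indexed = [kv[kv_head_for_query(q, n_rep)] for q in range(n_q_heads)]
print(expanded == indexed)
```

The copy-based version allocates `n_rep` times the key/value data, while the index-based version reads the original heads in place, which is the saving a custom attention kernel would be after.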

Generation

  • Optimized/batched beam search
  • Allow stackable LoRAs
  • Guidance or equivalent

Interface

  • Comprehensive API server (more than example_flask.py)

Web UI

  • Controls to enable beam search
  • Rewrite/refactor all the JavaScript and CSS
  • Make it a little prettier
  • Better error handling
  • LoRA controls
  • Multiple chat modes with prompt templates (instruct, etc.)

??

  • Support for other quantization methods
  • Support for other LLM architectures
  • Allow for backpropagation
  • LoRA training features
  • Soft prompt training