Merge 0223 (#8)

* Remove hardcode flash-attn disable setting (lm-sys#2342) * Document turning off proxy_buffering when api is streaming (lm-sys#2337) * Simplify huggingface api example (lm-sys#2355) * Update sponsor logos (lm-sys#2367) * if LOGDIR is empty, then don't try output log to local file (lm-sys#2357) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]> * add best_of and use_beam_search for completions interface (lm-sys#2348) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]> * Extract upvote/downvote from log files (lm-sys#2369) * Revert "add best_of and use_beam_search for completions interface" (lm-sys#2370) * Improve doc (lm-sys#2371) * add best_of and use_beam_search for completions interface (lm-sys#2372) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]> * update monkey patch for llama2 (lm-sys#2379) * Make E5 adapter more restrict to reduce mismatch (lm-sys#2381) * Update UI and sponsers (lm-sys#2387) * Use fsdp api for save save (lm-sys#2390) * Release v0.2.27 * Spicyboros + airoboros 2.2 template update. 
(lm-sys#2392) Co-authored-by: Jon Durbin <[email protected]> * bugfix of openai_api_server for fastchat.serve.vllm_worker (lm-sys#2398) Co-authored-by: wuyongyu <[email protected]> * Revert "bugfix of openai_api_server for fastchat.serve.vllm_worker" (lm-sys#2400) * Revert "add best_of and use_beam_search for completions interface" (lm-sys#2401) * Release a v0.2.28 with bug fixes and more test cases * Fix model_worker error (lm-sys#2404) * Added google/flan models and fixed AutoModelForSeq2SeqLM when loading T5 compression model (lm-sys#2402) * Rename twitter to X (lm-sys#2406) * Update huggingface_api.py (lm-sys#2409) * Add support for baichuan2 models (lm-sys#2408) * Fixed character overlap issue when api streaming output (lm-sys#2431) * Support custom conversation template in multi_model_worker (lm-sys#2434) * Add Ascend NPU support (lm-sys#2422) * Add raw conversation template (lm-sys#2417) (lm-sys#2418) * Improve docs & UI (lm-sys#2436) * Fix Salesforce xgen inference (lm-sys#2350) * Add support for Phind-CodeLlama models (lm-sys#2415) (lm-sys#2416) Co-authored-by: Lianmin Zheng <[email protected]> * Add falcon 180B chat conversation template (lm-sys#2384) * Improve docs (lm-sys#2438) * add dtype and seed (lm-sys#2430) * Data cleaning scripts for dataset release (lm-sys#2440) * merge google/flan based adapters: T5Adapter, CodeT5pAdapter, FlanAdapter (lm-sys#2411) * Fix docs * Update UI (lm-sys#2446) * Add Optional SSL Support to controller.py (lm-sys#2448) * Format & Improve docs * Release v0.2.29 (lm-sys#2450) * Show terms of use as an JS alert (lm-sys#2461) * vllm worker awq quantization update (lm-sys#2463) Co-authored-by: 董晓龙 <[email protected]> * Fix falcon chat template (lm-sys#2464) * Fix chunk handling when partial chunks are returned (lm-sys#2485) * Update openai_api_server.py to add an SSL option (lm-sys#2484) * Update vllm_worker.py (lm-sys#2482) * fix typo quantization (lm-sys#2469) * fix vllm quanziation args * Update README.md (lm-sys#2492) * 
Huggingface api worker (lm-sys#2456) * Update links to lmsys-chat-1m (lm-sys#2497) * Update train code to support the new tokenizer (lm-sys#2498) * Third Party UI Example (lm-sys#2499) * Add metharme (pygmalion) conversation template (lm-sys#2500) * Optimize for proper flash attn causal handling (lm-sys#2503) * Add Mistral AI instruction template (lm-sys#2483) * Update monitor & plots (lm-sys#2506) * Release v0.2.30 (lm-sys#2507) * Fix for single turn dataset (lm-sys#2509) * replace os.getenv with os.path.expanduser because the first one doesn… (lm-sys#2515) Co-authored-by: khalil <[email protected]> * Fix arena (lm-sys#2522) * Update Dockerfile (lm-sys#2524) * add Llama2ChangAdapter (lm-sys#2510) * Add ExllamaV2 Inference Framework Support. (lm-sys#2455) * Improve docs (lm-sys#2534) * Fix warnings for new gradio versions (lm-sys#2538) * revert the gradio change; now works for 3.40 * Improve chat templates (lm-sys#2539) * Add Zephyr 7B Alpha (lm-sys#2535) * Improve Support for Mistral-Instruct (lm-sys#2547) * correct max_tokens by context_length instead of raise exception (lm-sys#2544) * Revert "Improve Support for Mistral-Instruct" (lm-sys#2552) * Fix Mistral template (lm-sys#2529) * Add additional Informations from the vllm worker (lm-sys#2550) * Make FastChat work with LMSYS-Chat-1M Code (lm-sys#2551) * Create `tags` attribute to fix `MarkupError` in rich CLI (lm-sys#2553) * move BaseModelWorker outside serve.model_worker to make it independent (lm-sys#2531) * Misc style and bug fixes (lm-sys#2559) * Fix README.md (lm-sys#2561) * release v0.2.31 (lm-sys#2563) * resolves lm-sys#2542 modify dockerfile to upgrade cuda to 12.2.0 and pydantic 1.10.13 (lm-sys#2565) * Add airoboros_v3 chat template (llama-2 format) (lm-sys#2564) * Add Xwin-LM V0.1, V0.2 support (lm-sys#2566) * Fixed model_worker generate_gate may blocked main thread (lm-sys#2540) (lm-sys#2562) * feat: add claude-v2 (lm-sys#2571) * Update vigogne template (lm-sys#2580) * Fix issue lm-sys#2568: --device 
mps led to TypeError: forward() got an unexpected keyword argument 'padding_mask'. (lm-sys#2579) * Add Mistral-7B-OpenOrca conversation_temmplate (lm-sys#2585) * docs: bit misspell comments model adapter default template name conversation (lm-sys#2594) * Update Mistral template (lm-sys#2581) * Fix <s> in mistral template * Update README.md (vicuna-v1.3 -> vicuna-1.5) (lm-sys#2592) * Update README.md to highlight chatbot arena (lm-sys#2596) * Add Lemur model (lm-sys#2584) Co-authored-by: Roberto Ugolotti <[email protected]> * add trust_remote_code=True in BaseModelAdapter (lm-sys#2583) * Openai interface add use beam search and best of 2 (lm-sys#2442) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]> * Update qwen and add pygmalion (lm-sys#2607) * feat: Support model AquilaChat2 (lm-sys#2616) * Added settings vllm (lm-sys#2599) Co-authored-by: bodza <[email protected]> Co-authored-by: bodza <[email protected]> * [Logprobs] Support logprobs=1 (lm-sys#2612) * release v0.2.32 * fix: Fix for OpenOrcaAdapter to return correct conversation template (lm-sys#2613) * Make fastchat.serve.model_worker to take debug argument (lm-sys#2628) Co-authored-by: hi-jin <[email protected]> * openchat 3.5 model support (lm-sys#2638) * xFastTransformer framework support (lm-sys#2615) * feat: support custom models vllm serving (lm-sys#2635) * kill only fastchat process (lm-sys#2641) * Update server_arch.png * Use conv.update_last_message api in mt-bench answer generation (lm-sys#2647) * Improve Azure OpenAI interface (lm-sys#2651) * Add required_temp support in jsonl format to support flexible temperature setting for gen_api_answer (lm-sys#2653) * Pin openai version < 1 (lm-sys#2658) * Remove exclude_unset parameter (lm-sys#2654) * Revert "Remove exclude_unset parameter" (lm-sys#2666) * added support for CodeGeex(2) (lm-sys#2645) * add chatglm3 conv template support in conversation.py (lm-sys#2622) * UI and model change (lm-sys#2672) Co-authored-by: 
Lianmin Zheng <[email protected]> * train_flant5: fix typo (lm-sys#2673) * Fix gpt template (lm-sys#2674) * Update README.md (lm-sys#2679) * feat: support template's stop_str as list (lm-sys#2678) * Update exllama_v2.md (lm-sys#2680) * save model under deepspeed (lm-sys#2689) * Adding SSL support for model workers and huggingface worker (lm-sys#2687) * Check the max_new_tokens <= 0 in openai api server (lm-sys#2688) * Add Microsoft/Orca-2-7b and update model support docs (lm-sys#2714) * fix tokenizer of chatglm2 (lm-sys#2711) * Template for using Deepseek code models (lm-sys#2705) * add support for Chinese-LLaMA-Alpaca (lm-sys#2700) * Make --load-8bit flag work with weights in safetensors format (lm-sys#2698) * Format code and minor bug fix (lm-sys#2716) * Bump version to v0.2.33 (lm-sys#2717) * fix tokenizer.pad_token attribute error (lm-sys#2710) * support stable-vicuna model (lm-sys#2696) * Exllama cache 8bit (lm-sys#2719) * Add Yi support (lm-sys#2723) * Add Hermes 2.5 [fixed] (lm-sys#2725) * Fix Hermes2Adapter (lm-sys#2727) * Fix YiAdapter (lm-sys#2730) * add trust_remote_code argument (lm-sys#2715) * Add revision arg to MT Bench answer generation (lm-sys#2728) * Fix MPS backend 'index out of range' error (lm-sys#2737) * add starling support (lm-sys#2738) * Add deepseek chat (lm-sys#2760) * a convenient script for spinning up the API with Model Workers (lm-sys#2790) * Prevent returning partial stop string in vllm worker (lm-sys#2780) * Update UI and new models (lm-sys#2762) * Support MetaMath (lm-sys#2748) * Use common logging code in the OpenAI API server (lm-sys#2758) Co-authored-by: Warren Francis <[email protected]> * Show how to turn on experiment tracking for fine-tuning (lm-sys#2742) Co-authored-by: Morgan McGuire <[email protected]> * Support xDAN-L1-Chat Model (lm-sys#2732) * Format code * Update the version to 0.2.34 (lm-sys#2793) * add dolphin (lm-sys#2794) * Fix tiny typo (lm-sys#2805) * Add instructions for evaluating on MT bench using vLLM 
(lm-sys#2770) * Update README.md * Add SOLAR-10.7b Instruct Model (lm-sys#2826) * Update README.md (lm-sys#2852) * fix: 'compeletion' typo (lm-sys#2847) * Add Tunnelmole as an open source alternative to ngrok and include usage instructions (lm-sys#2846) * update readme * update mt-bench readme * Add support for CatPPT (lm-sys#2840) * Add functionality to ping AI2 InferD endpoints for tulu 2 (lm-sys#2832) Co-authored-by: Sam Skjonsberg <[email protected]> * add download models from www.modelscope.cn (lm-sys#2830) Co-authored-by: mulin.lyh <[email protected]> * Fix conv_template of chinese alpaca 2 (lm-sys#2812) * add bagel model adapter (lm-sys#2814) * add root_path argument to gradio web server. (lm-sys#2807) Co-authored-by: bertls <[email protected]> * Import `accelerate` locally to avoid it as a strong dependency (lm-sys#2820) * Replace dict merge with unpacking for compatibility of 3.8 in vLLM worker (lm-sys#2824) Signed-off-by: rudeigerc <[email protected]> * Format code (lm-sys#2854) * Openai API migrate (lm-sys#2765) * fix openai api server docs * Add a16z as a sponser * Add new models (Perplexity, gemini) & Separate GPT versions (lm-sys#2856) Co-authored-by: Wei-Lin Chiang <[email protected]> * Clean error messages (lm-sys#2857) * Update docs (lm-sys#2858) * Modify doc description (lm-sys#2859) * Fix the problem of not using the decoding method corresponding to the base model in peft mode (lm-sys#2865) * update a new sota model on MT-Bench which touch an 8.8 scores. 
(lm-sys#2864) * NPU needs to be initialized when starting a new process (lm-sys#2843) * Fix the problem with "vllm + chatglm3" (lm-sys#2845) (lm-sys#2876) Co-authored-by: 姚峰 <[email protected]> * Update token spacing for mistral conversation.py (lm-sys#2872) * check if hm in models before deleting to avoid errors (lm-sys#2870) Co-authored-by: Your Name <[email protected]> * Add TinyLlama (lm-sys#2889) * Fix bug that model doesn't automatically switch peft adapter (lm-sys#2884) * Update web server commands (lm-sys#2869) * fix the tokenize process and prompt template of chatglm3 (lm-sys#2883) Co-authored-by: 章焕锭 <[email protected]> * Add `Notus` support (lm-sys#2813) Co-authored-by: alvarobartt <[email protected]> * feat: support anthropic api with api_dict (lm-sys#2879) * Update model_adapter.py (lm-sys#2895) * leaderboard code update (lm-sys#2867) * fix: change order of SEQUENCE_LENGTH_KEYS (lm-sys#2925) * fix baichuan:apply_prompt_template call args error (lm-sys#2921) Co-authored-by: Zheng Hao <[email protected]> * Fix a typo in openai_api_server.py (lm-sys#2905) * feat: use variables OPENAI_MODEL_LIST (lm-sys#2907) * Add TenyxChat-7B-v1 model (lm-sys#2901) Co-authored-by: sarath@L3 <[omitted]> * add support for iei yuan2.0 (https://huggingface.co/IEITYuan) (lm-sys#2919) * nous-hermes-2-mixtral-dpo (lm-sys#2922) * Bump the version to 0.2.35 (lm-sys#2927) * fix specify local path issue use model from www.modelscope.cn (lm-sys#2934) Co-authored-by: mulin.lyh <[email protected]> * support openai embedding for topic clustering (lm-sys#2729) * Remove duplicate API endpoint (lm-sys#2949) * Update Hermes Mixtral (lm-sys#2938) * Enablement of REST API Usage within Google Colab Free Tier (lm-sys#2940) * Create a new worker implementation for Apple MLX (lm-sys#2937) * feat: support Model Yuan2.0, a new generation Fundamental Large Language Model developed by IEIT System (lm-sys#2936) * Fix the pooling method of BGE embedding model (lm-sys#2926) * format code * SGLang 
Worker (lm-sys#2928) * Fix sglang worker (lm-sys#2953) * Update mlx_worker to be async (lm-sys#2958) * Integrate LightLLM into serve worker (lm-sys#2888) * Copy button (lm-sys#2963) * feat: train with template (lm-sys#2951) * fix content maybe a str (lm-sys#2968) * Adding download folder information in README (lm-sys#2972) * use cl100k_base as the default tiktoken encoding (lm-sys#2974) Signed-off-by: bjwswang <[email protected]> * Update README.md (lm-sys#2975) * Fix tokenizer for vllm worker (lm-sys#2984) * update yuan2.0 generation (lm-sys#2989) * fix: tokenization mismatch when training with different templates (lm-sys#2996) * fix: inconsistent tokenization by llama tokenizer (lm-sys#3006) * Fix type hint for play_a_match_single (lm-sys#3008) * code update (lm-sys#2997) * Update model_support.md (lm-sys#3016) * Update lightllm_integration.md (lm-sys#3014) * Upgrade gradio to 4.17 (lm-sys#3027) * Update MLX integration to use new generate_step function signature (lm-sys#3021) * Update readme (lm-sys#3028) * Update gradio version in `pyproject.toml` and fix a bug (lm-sys#3029) * Update gradio demo and API model providers (lm-sys#3030) * Gradio Web Server for Multimodal Models (lm-sys#2960) Co-authored-by: Lianmin Zheng <[email protected]> * Migrate the gradio server to openai v1 (lm-sys#3032) * Update version to 0.2.36 (lm-sys#3033) Co-authored-by: Wei-Lin Chiang <[email protected]> * Add llava 34b template (lm-sys#3034) * Update model support (lm-sys#3040) * Add psutil to pyproject.toml dependencies (lm-sys#3039) * Fix SGLang worker (lm-sys#3045) * Random VQA Sample button for VLM direct chat (lm-sys#3041) * Update arena.md to fix link (lm-sys#3051) * multi inference --------- Signed-off-by: Lei Wen <[email protected]> Signed-off-by: rudeigerc <[email protected]> Signed-off-by: bjwswang <[email protected]> Co-authored-by: Trangle <[email protected]> Co-authored-by: Nathan Stitt <[email protected]> Co-authored-by: Lianmin Zheng <[email protected]> Co-authored-by: 
leiwen83 <[email protected]> Co-authored-by: Lei Wen <[email protected]> Co-authored-by: Jon Durbin <[email protected]> Co-authored-by: Jon Durbin <[email protected]> Co-authored-by: Rayrtfr <[email protected]> Co-authored-by: wuyongyu <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: Jeff (Zhen) Wang <[email protected]> Co-authored-by: karshPrime <[email protected]> Co-authored-by: obitolyz <[email protected]> Co-authored-by: Shangwei Chen <[email protected]> Co-authored-by: HyungJin Ahn <[email protected]> Co-authored-by: zhangsibo1129 <[email protected]> Co-authored-by: Tobias Birchler <[email protected]> Co-authored-by: Jae-Won Chung <[email protected]> Co-authored-by: Mingdao Liu <[email protected]> Co-authored-by: Ying Sheng <[email protected]> Co-authored-by: Brandon Biggs <[email protected]> Co-authored-by: dongxiaolong <[email protected]> Co-authored-by: 董晓龙 <[email protected]> Co-authored-by: Siddartha Naidu <[email protected]> Co-authored-by: shuishu <[email protected]> Co-authored-by: Andrew Aikawa <[email protected]> Co-authored-by: Liangsheng Yin <[email protected]> Co-authored-by: enochlev <[email protected]> Co-authored-by: AlpinDale <[email protected]> Co-authored-by: Lé <[email protected]> Co-authored-by: Toshiki Kataoka <[email protected]> Co-authored-by: khalil <[email protected]> Co-authored-by: khalil <[email protected]> Co-authored-by: dubaoquan404 <[email protected]> Co-authored-by: Chang W. 
Lee <[email protected]> Co-authored-by: theScotchGame <[email protected]> Co-authored-by: lewtun <[email protected]> Co-authored-by: Stephen Horvath <[email protected]> Co-authored-by: liunux4odoo <[email protected]> Co-authored-by: Norman Mu <[email protected]> Co-authored-by: Sebastian Bodza <[email protected]> Co-authored-by: Tianle (Tim) Li <[email protected]> Co-authored-by: Wei-Lin Chiang <[email protected]> Co-authored-by: Alex <[email protected]> Co-authored-by: Jingcheng Hu <[email protected]> Co-authored-by: lvxuan <[email protected]> Co-authored-by: cOng <[email protected]> Co-authored-by: bofeng huang <[email protected]> Co-authored-by: Phil-U-U <[email protected]> Co-authored-by: Wayne Spangenberg <[email protected]> Co-authored-by: Guspan Tanadi <[email protected]> Co-authored-by: Rohan Gupta <[email protected]> Co-authored-by: ugolotti <[email protected]> Co-authored-by: Roberto Ugolotti <[email protected]> Co-authored-by: edisonwd <[email protected]> Co-authored-by: FangYin Cheng <[email protected]> Co-authored-by: bodza <[email protected]> Co-authored-by: bodza <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Srinath Janakiraman <[email protected]> Co-authored-by: Jaeheon Jeong <[email protected]> Co-authored-by: One <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: David <[email protected]> Co-authored-by: Witold Wasiczko <[email protected]> Co-authored-by: Peter Willemsen <[email protected]> Co-authored-by: ZeyuTeng96 <[email protected]> Co-authored-by: Forceless <[email protected]> Co-authored-by: Jeff <[email protected]> Co-authored-by: MrZhengXin <[email protected]> Co-authored-by: Long Nguyen <[email protected]> Co-authored-by: Elsa Granger <[email protected]> Co-authored-by: Christopher Chou <[email protected]> Co-authored-by: wangshuai09 <[email protected]> Co-authored-by: amaleshvemula <[email protected]> Co-authored-by: Zollty Tsou <[email protected]> Co-authored-by: 
xuguodong1999 <[email protected]> Co-authored-by: Michael J Kaye <[email protected]> Co-authored-by: 152334H <[email protected]> Co-authored-by: Jingsong-Yan <[email protected]> Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]> Co-authored-by: Chris Kerwell Gresla <[email protected]> Co-authored-by: pandada8 <[email protected]> Co-authored-by: Isaac Ong <[email protected]> Co-authored-by: Warren Francis <[email protected]> Co-authored-by: Warren Francis <[email protected]> Co-authored-by: Morgan McGuire <[email protected]> Co-authored-by: Morgan McGuire <[email protected]> Co-authored-by: xDAN-AI <[email protected]> Co-authored-by: Ikko Eltociear Ashimine <[email protected]> Co-authored-by: Robbie <[email protected]> Co-authored-by: Rishiraj Acharya <[email protected]> Co-authored-by: Nathan Lambert <[email protected]> Co-authored-by: Sam Skjonsberg <[email protected]> Co-authored-by: liuyhwangyh <[email protected]> Co-authored-by: mulin.lyh <[email protected]> Co-authored-by: stephanbertl <[email protected]> Co-authored-by: bertls <[email protected]> Co-authored-by: Chirag Jain <[email protected]> Co-authored-by: Yuchen Cheng <[email protected]> Co-authored-by: Shuo Yang <[email protected]> Co-authored-by: Wei-Lin Chiang <[email protected]> Co-authored-by: JQ <[email protected]> Co-authored-by: yaofeng <[email protected]> Co-authored-by: 姚峰 <[email protected]> Co-authored-by: Michael <[email protected]> Co-authored-by: Josh NE <[email protected]> Co-authored-by: Your Name <[email protected]> Co-authored-by: WHDY <[email protected]> Co-authored-by: 章焕锭 <[email protected]> Co-authored-by: Gabriel Martín Blázquez <[email protected]> Co-authored-by: alvarobartt <[email protected]> Co-authored-by: Zheng Hao <[email protected]> Co-authored-by: Ren Xuancheng <[email protected]> Co-authored-by: Sarath Shekkizhar <[email protected]> Co-authored-by: wangpengfei1013 <[email protected]> Co-authored-by: Alexandre Strube <[email protected]> Co-authored-by: Teknium 
<[email protected]> Co-authored-by: Cristian Gutiérrez <[email protected]> Co-authored-by: ali asaria <[email protected]> Co-authored-by: wulixuan <[email protected]> Co-authored-by: staoxiao <[email protected]> Co-authored-by: Zaida Zhou <[email protected]> Co-authored-by: dheeraj-326 <[email protected]> Co-authored-by: bjwswang <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Ted Li <[email protected]> Co-authored-by: Shukant Pal <[email protected]> Co-authored-by: Lisa Dunlap <[email protected]> Co-authored-by: Logan Kilpatrick <[email protected]>
shaleprotocol · Feb 24, 2024 · dac3317 · dac3317
1 parent 94421ea
commit dac3317
Showing 60 changed files with 6,308 additions and 846 deletions.
diff --git a/README.md b/README.md
@@ -16,6 +16,10 @@ We are focused to support Llama2 at scale now. If you want any other models, ple
 
 ## Dev Log
 
+### 2024-02
+
+Sync upstream changes
+
 ### 2023-09
 
 Sync upstream changes

diff --git a/docs/arena.md b/docs/arena.md
@@ -5,10 +5,11 @@ We invite the entire community to join this benchmarking effort by contributing
 ## How to add a new model
 If you want to see a specific model in the arena, you can follow the methods below.
 
-- Method 1: Hosted by LMSYS.
-  1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md#how-to-support-a-new-model).
-  2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
+### Method 1: Hosted by 3rd party API providers or yourself
+If you have a model hosted by a 3rd party API provider or yourself, please give us access to an API endpoint.
+  - We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py) for calling OpenAI models.
+  - If you have your own API protocol, please follow the [instructions](model_support.md) to add it. Contribute your code by sending a pull request.
 
-- Method 2: Hosted by 3rd party API providers or yourself.
-  1. If you have a model hosted by a 3rd party API provider or yourself, please give us an API endpoint. We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/33dca5cf12ee602455bfa9b5f4790a07829a2db7/fastchat/serve/gradio_web_server.py#L333-L358) for calling OpenAI models.
-  2. You can use FastChat's OpenAI API [server](openai_api.md) to serve your model with OpenAI-compatible APIs and provide us with the endpoint.
+### Method 2: Hosted by LMSYS
+1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md).
+2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
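As an illustration of what "OpenAI-compatible" means for Method 1 in the arena.md change above, here is a minimal sketch that builds (but does not send) a chat completion request using only the Python standard library. The base URL, API key, and model name are placeholder assumptions, and `build_chat_request` is a hypothetical helper, not part of FastChat.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message):
    """Build a POST request for an OpenAI-style /chat/completions route.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder endpoint, key, and model name (assumptions, not real values).
req = build_chat_request("http://localhost:8000/v1", "EMPTY", "my-model", "Hello")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; an endpoint that accepts this request shape and returns the usual chat completion JSON can generally be called through the same client code as the OpenAI models.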
diff --git a/docs/commands/webserver.md b/docs/commands/webserver.md
@@ -24,10 +24,13 @@ python3 -m fastchat.serve.test_message --model vicuna-13b --controller http://lo
 
 cd fastchat_logs/server0
 
+python3 -m fastchat.serve.huggingface_api_worker --model-info-file ~/elo_results/register_hf_api_models.json
+
 export OPENAI_API_KEY=
 export ANTHROPIC_API_KEY=
+export GCP_PROJECT_ID=
 
-python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms
+python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 50 --add-chatgpt --add-claude --add-palm --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms
 
 python3 backup_logs.py
 ```

diff --git a/docs/lightllm_integration.md b/docs/lightllm_integration.md
@@ -0,0 +1,18 @@
+# LightLLM Integration
+You can use [LightLLM](https://github.com/ModelTC/lightllm) as an optimized worker implementation in FastChat.
+It offers advanced continuous batching and a much higher (~10x) throughput.
+See the supported models [here](https://github.com/ModelTC/lightllm?tab=readme-ov-file#supported-model-list).
+
+## Instructions
+1. Please refer to [Get started](https://github.com/ModelTC/lightllm?tab=readme-ov-file#get-started) to install LightLLM, or use the [pre-built image](https://github.com/ModelTC/lightllm?tab=readme-ov-file#container).
+
+2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the LightLLM worker (`fastchat.serve.lightllm_worker`). All other commands such as controller, gradio web server, and OpenAI API server are kept the same. Refer to [--max_total_token_num](https://github.com/ModelTC/lightllm/blob/4a9824b6b248f4561584b8a48ae126a0c8f5b000/docs/ApiServerArgs.md?plain=1#L23) to understand how to calculate the `--max_total_token_num` argument.
+   ```
+   python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000
+   ```
+
+   If you want to use quantized weights and a quantized KV cache for inference, try:
+
+   ```
+   python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000 --mode triton_int8weight triton_int8kv
+   ```
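The `--max_total_token_num` value in the LightLLM commands above depends on how much GPU memory is left for the KV cache after the weights are loaded. The sketch below is a rough back-of-the-envelope estimate, not LightLLM's own calculation: it assumes an fp16 KV cache, standard multi-head attention, and no allowance for activations or runtime overhead. The helper name and the example numbers (an 80 GiB GPU, about 14 GiB of fp16 weights, 32 layers, hidden size 4096) are illustrative assumptions.

```python
def estimate_max_total_token_num(gpu_mem_gib, weight_gib, num_layers,
                                 hidden_size, bytes_per_value=2):
    """Rough upper bound on the number of tokens whose KV cache fits in memory.

    Hypothetical helper: ignores activations, fragmentation, and runtime overhead.
    """
    # Each token stores one key and one value vector per layer.
    kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
    free_bytes = (gpu_mem_gib - weight_gib) * 1024 ** 3
    return int(free_bytes // kv_bytes_per_token)


# Illustrative numbers for a 7B Llama-style model on an 80 GiB GPU.
estimate = estimate_max_total_token_num(
    gpu_mem_gib=80, weight_gib=14, num_layers=32, hidden_size=4096
)
print(estimate)  # 135168
```

In practice a value somewhat below this bound leaves headroom for everything the estimate ignores; the linked LightLLM argument documentation remains the authoritative reference.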
diff --git a/docs/mlx_integration.md b/docs/mlx_integration.md
@@ -0,0 +1,23 @@
+# Apple MLX Integration
+
+You can use [Apple MLX](https://github.com/ml-explore/mlx) as an optimized worker implementation in FastChat.
+
+It runs models efficiently on Apple Silicon.
+
+See the supported models [here](https://github.com/ml-explore/mlx-examples/tree/main/llms#supported-models).
+
+Note that for Apple Silicon Macs with less memory, smaller models (or quantized models) are recommended.
+
+## Instructions
+
+1. Install MLX.
+
+   ```
+   pip install "mlx-lm>=0.0.6"
+   ```
+
+2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the MLX worker (`fastchat.serve.mlx_worker`). Remember to launch a model worker after you have launched the controller ([instructions](../README.md)).
+
+   ```
+   python3 -m fastchat.serve.mlx_worker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0
+   ```
diff --git a/docs/model_support.md b/docs/model_support.md
@@ -1,15 +1,48 @@
 # Model Support
+This document describes how to support a new model in FastChat.
 
-## Supported models
+## Content
+- [Local Models](#local-models)
+- [API-Based Models](#api-based-models)
+
+## Local Models
+To support a new local model in FastChat, you need to correctly handle its prompt template and model loading.
+The goal is to make the following command run with the correct prompts.
+
+```
+python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
+```
+
+You can run this example command to learn the code logic.
+
+```
+python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.5
+```
+
+You can add `--debug` to see the actual prompt sent to the model.
+
+### Steps
+
+FastChat uses the `Conversation` class to handle prompt templates and the `BaseModelAdapter` class to handle model loading.
+
+1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one. Please also add a link to the official reference code if possible.
+2. Implement a model adapter for the new model at [fastchat/model/model_adapter.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py). You can follow existing examples and use `register_model_adapter` to add a new one.
+3. (Optional) Add the model name to the "Supported models" [section](#supported-models) below and add more information in [fastchat/model/model_registry.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_registry.py).
+
+After these steps, the new model should be compatible with most FastChat features, such as CLI, web UI, model worker, and OpenAI-compatible API server. Please do some testing with these features as well.
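The two registration steps above follow a simple registry pattern. The sketch below imitates that pattern in a self-contained way so it runs without FastChat installed; every class and function here (`ConversationSketch`, `register_conv_template_sketch`, `register_model_adapter_sketch`, `MyModelAdapter`) is a simplified stand-in written for illustration, not FastChat's actual implementation, and the template name, roles, and separator are made-up values.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ConversationSketch:
    """Stand-in for a prompt template: system text, role names, separator."""
    name: str
    system_message: str
    roles: Tuple[str, str]
    sep: str
    messages: List[Tuple[str, str]] = field(default_factory=list)

    def get_prompt(self) -> str:
        parts = [self.system_message] if self.system_message else []
        for role, text in self.messages:
            # An empty message leaves the role open for the model to complete.
            parts.append(f"{role}: {text}" if text else f"{role}:")
        return self.sep.join(parts)


conv_templates: Dict[str, ConversationSketch] = {}
model_adapters: list = []


def register_conv_template_sketch(template: ConversationSketch) -> None:
    if template.name in conv_templates:
        raise ValueError(f"{template.name} is already registered")
    conv_templates[template.name] = template


def register_model_adapter_sketch(cls) -> None:
    model_adapters.append(cls())


def get_adapter(model_path: str):
    # The first adapter whose match() accepts the path wins.
    for adapter in model_adapters:
        if adapter.match(model_path):
            return adapter
    raise ValueError(f"No adapter for {model_path}")


class MyModelAdapter:
    """Hypothetical adapter: selects a template based on the model path."""

    def match(self, model_path: str) -> bool:
        return "my-model" in model_path.lower()

    def get_default_conv_template(self, model_path: str) -> ConversationSketch:
        # Return a copy so each conversation gets its own message list.
        return copy.deepcopy(conv_templates["my-model"])


register_conv_template_sketch(ConversationSketch(
    name="my-model",
    system_message="You are a helpful assistant.",
    roles=("USER", "ASSISTANT"),
    sep="\n",
))
register_model_adapter_sketch(MyModelAdapter)

conv = get_adapter("org/My-Model-7B").get_default_conv_template("org/My-Model-7B")
conv.messages.append((conv.roles[0], "Hi"))
conv.messages.append((conv.roles[1], ""))
prompt = conv.get_prompt()
print(prompt)
```

In FastChat itself the corresponding pieces live in `fastchat/conversation.py` and `fastchat/model/model_adapter.py`; existing entries in those files are the best guide to the real signatures.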
+
+### Supported models
 
 - [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
   - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
 - Vicuna, Alpaca, LLaMA, Koala
   - example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5`
+- [allenai/tulu-2-dpo-7b](https://huggingface.co/allenai/tulu-2-dpo-7b)
 - [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
 - [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B)
 - [BAAI/AquilaChat2-34B](https://huggingface.co/BAAI/AquilaChat2-34B)
 - [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en#using-huggingface-transformers)
+- [argilla/notus-7b-v1](https://huggingface.co/argilla/notus-7b-v1)
 - [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
 - [BlinkDL/RWKV-4-Raven](https://huggingface.co/BlinkDL/rwkv-4-raven)
   - example: `python3 -m fastchat.serve.cli --model-path ~/model_weights/RWKV-4-Raven-7B-v11x-Eng99%-Other1%-20230429-ctx8192.pth`
@@ -18,13 +51,20 @@
 - [camel-ai/CAMEL-13B-Combined-Data](https://huggingface.co/camel-ai/CAMEL-13B-Combined-Data)
 - [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf)
 - [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
+- [deepseek-ai/deepseek-llm-67b-chat](https://huggingface.co/deepseek-ai/deepseek-llm-67b-chat)
+- [deepseek-ai/deepseek-coder-33b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct)
 - [FlagAlpha/Llama2-Chinese-13b-Chat](https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat)
 - [FreedomIntelligence/phoenix-inst-chat-7b](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b)
 - [FreedomIntelligence/ReaLM-7b-v1](https://huggingface.co/FreedomIntelligence/Realm-7b)
 - [h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b)
+- [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
+- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
 - [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
+- [IEITYuan/Yuan2-2B/51B/102B-hf](https://huggingface.co/IEITYuan)
 - [lcw99/polyglot-ko-12.8b-chang-instruct-chat](https://huggingface.co/lcw99/polyglot-ko-12.8b-chang-instruct-chat)
 - [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5)
+- [meta-math/MetaMath-7B-V1.0](https://huggingface.co/meta-math/MetaMath-7B-V1.0)
+- [Microsoft/Orca-2-7b](https://huggingface.co/microsoft/Orca-2-7b)
 - [mosaicml/mpt-7b-chat](https://huggingface.co/mosaicml/mpt-7b-chat)
   - example: `python3 -m fastchat.serve.cli --model-path mosaicml/mpt-7b-chat`
 - [Neutralzz/BiLLa-7B-SFT](https://huggingface.co/Neutralzz/BiLLa-7B-SFT)
@@ -34,56 +74,57 @@
 - [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
 - [openchat/openchat_3.5](https://huggingface.co/openchat/openchat_3.5)
 - [Open-Orca/Mistral-7B-OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)
-- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
+- [OpenLemur/lemur-70b-chat-v1](https://huggingface.co/OpenLemur/lemur-70b-chat-v1)
 - [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
 - [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
 - [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
+- [rishiraj/CatPPT](https://huggingface.co/rishiraj/CatPPT)
 - [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b)
 - [StabilityAI/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
+- [tenyx/TenyxChat-7B-v1](https://huggingface.co/tenyx/TenyxChat-7B-v1)
+- [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
 - [THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)
 - [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
 - [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
 - [tiiuae/falcon-180B-chat](https://huggingface.co/tiiuae/falcon-180B-chat)
 - [timdettmers/guanaco-33b-merged](https://huggingface.co/timdettmers/guanaco-33b-merged)
 - [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)
+- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
 - [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
 - [WizardLM/WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)
-- [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
-- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
 - [Xwin-LM/Xwin-LM-7B-V0.1](https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1)
-- [OpenLemur/lemur-70b-chat-v1](https://huggingface.co/OpenLemur/lemur-70b-chat-v1)
-- [allenai/tulu-2-dpo-7b](https://huggingface.co/allenai/tulu-2-dpo-7b)
-- [Microsoft/Orca-2-7b](https://huggingface.co/microsoft/Orca-2-7b)
 - Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
 - Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a
   model above.  To activate, must have `peft` in the model path.  Note: If
   loading multiple peft models, you can have them share the base model weights by
   setting the environment variable `PEFT_SHARE_BASE_WEIGHTS=true` in any model
   worker.
 
-## How to support a new model
 
-To support a new model in FastChat, you need to correctly handle its prompt template and model loading.
-The goal is to make the following command run with the correct prompts.
+## API-Based Models
+To support an API-based model, start from the existing OpenAI example.
+If the model is compatible with the OpenAI API, a configuration file is all you need; no additional code is required.
+For custom protocols, implement a streaming generator in [fastchat/serve/api_provider.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py), following the provided examples. Currently, FastChat is compatible with OpenAI, Anthropic, Google Vertex AI, Mistral, and Nvidia NGC.
 
+### Steps to Launch a WebUI with an API Model
+1. Specify the endpoint information in a JSON configuration file. For instance, create a file named `api_endpoints.json`:
+```json
+{
+  "gpt-3.5-turbo": {
+    "model_name": "gpt-3.5-turbo",
+    "api_type": "openai",
+    "api_base": "https://api.openai.com/v1",
+    "api_key": "sk-******",
+    "anony_only": false
+  }
+}
 ```
-python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
-```
-
-You can run this example command to learn the code logic.
+  - "api_type" can be one of the following: openai, anthropic, gemini, or mistral. For custom APIs, add a new type and implement it accordingly.
+  - "anony_only" indicates whether to display this model in anonymous mode only.
 
+2. Launch the Gradio web server with the argument `--register api_endpoints.json`:
 ```
-python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.5
+python3 -m fastchat.serve.gradio_web_server --controller "" --share --register api_endpoints.json
 ```
 
-You can add `--debug` to see the actual prompt sent to the model.
-
-### Steps
-
-FastChat uses the `Conversation` class to handle prompt templates and `BaseModelAdapter` class to handle model loading.
-
-1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one. Please also add a link to the official reference code if possible.
-2. Implement a model adapter for the new model at [fastchat/model/model_adapter.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py). You can follow existing examples and use `register_model_adapter` to add a new one.
-3. (Optional) add the model name to the "Supported models" [section](#supported-models) above and add more information in [fastchat/model/model_registry.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_registry.py).
-
-After these steps, the new model should be compatible with most FastChat features, such as CLI, web UI, model worker, and OpenAI-compatible API server. Please do some testing with these features as well.
+Now, you can open a browser and interact with the model.
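
Note on the `api_endpoints.json` step above: a malformed entry is easier to catch before the web server starts. The following is a minimal sketch, not part of FastChat, that checks a config against the keys and `api_type` values named in the documentation. Treating all five keys from the example as required is an assumption of this sketch, and `check_endpoints` is a hypothetical helper.

```python
import json

# api_type values named in the documentation above; custom types would need
# their own implementation in fastchat/serve/api_provider.py.
KNOWN_API_TYPES = {"openai", "anthropic", "gemini", "mistral"}
# Assumption: every key shown in the documented example is treated as required.
REQUIRED_KEYS = {"model_name", "api_type", "api_base", "api_key", "anony_only"}


def check_endpoints(config):
    """Return a list of human-readable problems found in an endpoint config."""
    problems = []
    for name, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        api_type = entry.get("api_type")
        if api_type not in KNOWN_API_TYPES:
            problems.append(f"{name}: unknown api_type {api_type!r}")
    return problems


config = {
    "gpt-3.5-turbo": {
        "model_name": "gpt-3.5-turbo",
        "api_type": "openai",
        "api_base": "https://api.openai.com/v1",
        "api_key": "sk-******",  # placeholder, as in the documented example
        "anony_only": False,
    }
}

problems = check_endpoints(config)
# Serialize with json.dumps so Python's False becomes JSON's false.
serialized = json.dumps(config, indent=2)
print(serialized if not problems else "\n".join(problems))
```

If the check passes, the printed JSON can be saved as `api_endpoints.json` and passed to `--register`.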
diff --git a/docs/openai_api.md b/docs/openai_api.md
@@ -8,6 +8,8 @@ The following OpenAI APIs are supported:
 - Completions. (Reference: https://platform.openai.com/docs/api-reference/completions)
 - Embeddings. (Reference: https://platform.openai.com/docs/api-reference/embeddings)
 
+The REST API can also be run from Google Colab, as demonstrated in the [FastChat_API_GoogleColab.ipynb](https://github.com/lm-sys/FastChat/blob/main/playground/FastChat_API_GoogleColab.ipynb) notebook in this repository. The notebook gives a practical example of using the API within the Google Colab environment.
+
 ## RESTful API Server
 First, launch the controller
 
@@ -32,29 +34,28 @@ Now, let us test the API server.
 ### OpenAI Official SDK
 The goal of `openai_api_server.py` is to implement a fully OpenAI-compatible API server, so the models can be used directly with [openai-python](https://github.com/openai/openai-python) library.
 
-First, install openai-python:
+First, install the OpenAI Python package (version 1.0 or later):
 ```bash
 pip install --upgrade openai
 ```
 
-Then, interact with model vicuna:
+Then, interact with the Vicuna model:
 ```python
 import openai
-# to get proper authentication, make sure to use a valid key that's listed in
-# the --api-keys flag. if no flag value is provided, the `api_key` will be ignored.
+
 openai.api_key = "EMPTY"
-openai.api_base = "http://localhost:8000/v1"
+openai.base_url = "http://localhost:8000/v1/"
 
 model = "vicuna-7b-v1.5"
 prompt = "Once upon a time"
 
 # create a completion
-completion = openai.Completion.create(model=model, prompt=prompt, max_tokens=64)
+completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
 # print the completion
 print(prompt + completion.choices[0].text)
 
 # create a chat completion
-completion = openai.ChatCompletion.create(
+completion = openai.chat.completions.create(
   model=model,
   messages=[{"role": "user", "content": "Hello! What is your name?"}]
 )

diff --git a/docs/third_party_ui.md b/docs/third_party_ui.md
@@ -0,0 +1,24 @@
+# Third Party UI
+If you want to use FastChat with your own UI or a third-party UI, launch the [OpenAI compatible server](openai_api.md), expose it through a tunnelling service such as Tunnelmole or ngrok, and then enter the credentials in that UI.
+
+You can find suitable UIs in third-party repositories:
+- [WongSaang's ChatGPT UI](https://github.com/WongSaang/chatgpt-ui)
+- [McKayWrigley's Chatbot UI](https://github.com/mckaywrigley/chatbot-ui)
+
+- Please note that some third-party providers only offer the standard `gpt-3.5-turbo`, `gpt-4`, etc., so you will have to add your own custom model inside the code. [Here is an example of how to create a UI with any custom model name](https://github.com/ztjhz/BetterChatGPT/pull/461).
+
+##### Using Tunnelmole
+Tunnelmole is an open-source tunnelling tool. You can find its source code on [GitHub](https://github.com/robbie-cahill/tunnelmole-client). Here's how you can use Tunnelmole:
+1. Install Tunnelmole with `curl -O https://install.tunnelmole.com/9Wtxu/install && sudo bash install`. On Windows, download [tmole.exe](https://tunnelmole.com/downloads/tmole.exe) instead. See the [README](https://github.com/robbie-cahill/tunnelmole-client) for other methods, such as `npm` or building from source.
+2. Run `tmole 7860` (replace `7860` with your listening port if it is different from 7860). The output will display two URLs: one HTTP and one HTTPS. It's best to use the HTTPS URL for better privacy and security.
+```
+➜  ~ tmole 7860
+http://bvdo5f-ip-49-183-170-144.tunnelmole.net is forwarding to localhost:7860
+https://bvdo5f-ip-49-183-170-144.tunnelmole.net is forwarding to localhost:7860
+```
+
+##### Using ngrok
+ngrok is a popular closed-source tunnelling tool. First, download and install it from [ngrok.com](https://ngrok.com/downloads). Then expose port 7860:
+```
+ngrok http 7860
+```
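
Note on the tunnelling steps above: to check that a tunnel URL reaches the OpenAI-compatible server, the request a third-party UI would send can be built by hand. This is a minimal sketch that assumes the tunnel forwards to the port where the API server listens and that the server exposes the standard `/v1/chat/completions` route. `build_chat_request` is a hypothetical helper, and the hostname is the sample one from the Tunnelmole output above.

```python
import json
import urllib.request

# Sample HTTPS URL copied from the Tunnelmole output above; replace with your own.
TUNNEL_URL = "https://bvdo5f-ip-49-183-170-144.tunnelmole.net"


def build_chat_request(base_url, model, user_message, api_key="EMPTY"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    TUNNEL_URL + "/v1", "vicuna-7b-v1.5", "Hello! What is your name?"
)
print(request.get_method(), request.full_url)

# Sending the request needs a running server and tunnel, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

The base URL passed here (the tunnel URL plus `/v1`) and the API key are the values a third-party UI would typically ask for as credentials.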