From ff9c757cef4eef82dd01b5c3fb018f37018001d8 Mon Sep 17 00:00:00 2001
From: Huanzhi Mao <huanzhimao@gmail.com>
Date: Sat, 21 Sep 2024 15:31:41 -0700
Subject: [PATCH 1/2] separate changelog

---
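Note (commentary only, not part of the commit): the changelog added below describes a `real_time_match` criterion in which a numerical execution result must fall within a percentage threshold (20%) of the expected value. A minimal sketch of that kind of check follows. The function name, signature, and zero-handling are illustrative assumptions, not the code in `eval_checker_constant.py` or the checker itself.

```python
def within_relative_threshold(actual: float, expected: float, threshold: float = 0.2) -> bool:
    """Hypothetical helper: True if `actual` lies within `threshold` (a fraction) of `expected`."""
    if expected == 0:
        # Assumption: with an expected value of zero, only an exact zero matches.
        return actual == 0
    # Compare the absolute deviation against a fraction of the expected magnitude.
    return abs(actual - expected) <= threshold * abs(expected)
```

Under these assumptions, a live API value that drifts by 10% from the recorded ground truth still passes, while a 30% drift does not.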
 .../CHANGELOG.md                              | 115 ++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 berkeley-function-call-leaderboard/CHANGELOG.md

diff --git a/berkeley-function-call-leaderboard/CHANGELOG.md b/berkeley-function-call-leaderboard/CHANGELOG.md
new file mode 100644
index 000000000..aaf2e3320
--- /dev/null
+++ b/berkeley-function-call-leaderboard/CHANGELOG.md
@@ -0,0 +1,115 @@
+# Changelog
+
+All notable changes to the Berkeley Function Calling Leaderboard will be documented in this file.
+
+- [Sept 19, 2024] [#644](https://github.com/ShishirPatil/gorilla/pull/644): BFCL V3 release:
+  - Introduce new multi-turn dataset and state-based evaluation metric
+  - Separate ast_checker and executable_checker for readability
+  - Several outdated or deprecated models will be excluded from the leaderboard and replaced with their updated successors to improve the leaderboard's overall maintainability.
+  - Switch to use vllm serve for OSS model inference
+- [Sept 13, 2024] [#638](https://github.com/ShishirPatil/gorilla/pull/638): Fix prompt formatting issue for `THUDM/glm-4-9b-chat`.
+- [Sept 12, 2024] [#635](https://github.com/ShishirPatil/gorilla/pull/635): Add new models `o1-preview-2024-09-12` and `o1-mini-2024-09-12` to the leaderboard.
+- [Sept 8, 2024] [#627](https://github.com/ShishirPatil/gorilla/pull/627): Add new model `MadeAgents/Hammer-7b` to the leaderboard.
+- [Sept 7, 2024] [#626](https://github.com/ShishirPatil/gorilla/pull/626): Fix prompt format for Llama models.
+- [Sept 4, 2024] [#623](https://github.com/ShishirPatil/gorilla/pull/623): Fix decoding issue in the `NvidiaHandler`; remove duplicate `ArcticHandler` class.
+- [August 29, 2024] [#616](https://github.com/ShishirPatil/gorilla/pull/616): Add the following new models to the leaderboard:
+  - `Salesforce/xLAM-7b-r`
+  - `Salesforce/xLAM-8x7b-r`
+  - `Salesforce/xLAM-8x22b-r`
+- [August 28, 2024] [#565](https://github.com/ShishirPatil/gorilla/pull/565), [#612](https://github.com/ShishirPatil/gorilla/pull/612): Package the BFCL pipeline for easier deployment and maintenance.
+- [August 27, 2024] [#608](https://github.com/ShishirPatil/gorilla/pull/608): Bug fix in the dataset and possible answers.
+  - simple: 16 affected
+  - multiple: 5 affected
+- [August 23, 2024] [#600](https://github.com/ShishirPatil/gorilla/pull/600): Bug fix in the dataset and possible answers.
+  - simple: 12 affected
+  - multiple: 3 affected
+  - parallel: 3 affected
+  - parallel multiple: 6 affected
+- [August 22, 2024] [#593](https://github.com/ShishirPatil/gorilla/pull/593):
+  - Move formatting instructions and function documentation to system prompt instead of user prompt in the message section. All prompting models are affected.
+  - Bug fix in the dataset and possible answers.
+    - irrelevance: 1 affected
+    - live_irrelevance: 1 affected
+    - live_simple: 1 affected
+    - live_parallel: 3 affected
+- [August 19, 2024] [#580](https://github.com/ShishirPatil/gorilla/pull/580): Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html). All CLI commands have been updated to support the new dataset.
+- [August 8, 2024] [#574](https://github.com/ShishirPatil/gorilla/pull/574): Set temperature to 0.001 for all models for consistency and reproducibility.
+- [August 7, 2024] [#571](https://github.com/ShishirPatil/gorilla/pull/571): Support parallel inference for hosted models. Users can specify the number of threads to use for parallel inference by setting the `--num-threads` flag. The default is 1, which means no parallel inference.
+- [August 6, 2024] [#569](https://github.com/ShishirPatil/gorilla/pull/569), [#570](https://github.com/ShishirPatil/gorilla/pull/570), [#573](https://github.com/ShishirPatil/gorilla/pull/573): Add the following new models to the leaderboard:
+  - `open-mistral-nemo-2407`
+  - `open-mistral-nemo-2407-FC`
+  - `open-mixtral-8x22b`
+  - `open-mixtral-8x22b-FC`
+  - `open-mixtral-8x7b`
+  - `gpt-4o-mini-2024-07-18`
+  - `gpt-4o-mini-2024-07-18-FC`
+  - `gpt-4o-2024-08-06`
+  - `gpt-4o-2024-08-06-FC`
+  - `meetkai/functionary-medium-v3.1-FC`
+  - `meetkai/functionary-small-v3.1-FC`
+  - `meetkai/functionary-small-v3.2-FC`
+- [August 5, 2024] [#568](https://github.com/ShishirPatil/gorilla/pull/568): Rephrase the question prompt for the `executable_parallel_function` category to remove potentially misleading information implying multi-turn function calls.
+- [August 4, 2024] [#557](https://github.com/ShishirPatil/gorilla/pull/557): Bug fix in the possible answers.
+  - simple: 7 affected
+  - multiple function: 3 affected
+  - parallel function: 5 affected
+  - parallel multiple function: 6 affected
+  - executable parallel function: 1 affected
+  - javascript: 3 affected
+- [July 26, 2024] [#549](https://github.com/ShishirPatil/gorilla/pull/549): Fix `js_type_converter.py` to properly handle JavaScript array values inside dictionaries.
+- [July 25, 2024] [#532](https://github.com/ShishirPatil/gorilla/pull/532), [#543](https://github.com/ShishirPatil/gorilla/pull/543), [#556](https://github.com/ShishirPatil/gorilla/pull/556), [#542](https://github.com/ShishirPatil/gorilla/pull/542): Add the following new models to the leaderboard:
+  - `Salesforce/xLAM-7b-fc-r`
+  - `Salesforce/xLAM-1b-fc-r`
+  - `yi-large-fc`
+  - `NousResearch/Hermes-2-Pro-Llama-3-8B`
+  - `NousResearch/Hermes-2-Pro-Llama-3-70B`
+  - `NousResearch/Hermes-2-Theta-Llama-3-8B`
+  - `NousResearch/Hermes-2-Theta-Llama-3-70B`
+- [July 22, 2024] [#540](https://github.com/ShishirPatil/gorilla/pull/540): Chore: Improve handling of vLLM's cleanup phase error by combining all selected test categories into one single task to submit to the vLLM server.
+- [July 21, 2024] [#538](https://github.com/ShishirPatil/gorilla/pull/538), [#545](https://github.com/ShishirPatil/gorilla/pull/545): Fix the `language_specific_pre_processing` and `convert_to_tool` functions to properly handle pre-processing for prompts and function docs in the Java and JavaScript test categories. All entries in these categories are affected.
+- [July 20, 2024] [#537](https://github.com/ShishirPatil/gorilla/pull/537): Update the generation script for locally hosted OSS models to use a single-node multi-GPU inference method (tensor parallel). Ray is no longer used.
+- [July 16, 2024] [#525](https://github.com/ShishirPatil/gorilla/pull/525), [#536](https://github.com/ShishirPatil/gorilla/pull/536): Add new model `ibm-granite/granite-20b-functioncalling` to the leaderboard.
+- [July 10, 2024] [#522](https://github.com/ShishirPatil/gorilla/pull/522): Bug fix in the evaluation dataset for Executable Parallel Multiple category. This includes updates to both prompts and function docs. 2 entries are affected.
+- [July 8, 2024] [#516](https://github.com/ShishirPatil/gorilla/pull/516): Fix double-casting issue in `model_handler` for Java and JavaScript test categories.
+- [July 7, 2024] [#504](https://github.com/ShishirPatil/gorilla/pull/504), [#505](https://github.com/ShishirPatil/gorilla/pull/505), [#506](https://github.com/ShishirPatil/gorilla/pull/506), [#508](https://github.com/ShishirPatil/gorilla/pull/508), [#512](https://github.com/ShishirPatil/gorilla/pull/512), [#517](https://github.com/ShishirPatil/gorilla/pull/517): Make BFCL user-friendly and easy to extend.
+- [July 6, 2024] [#423](https://github.com/ShishirPatil/gorilla/pull/423) and [#503](https://github.com/ShishirPatil/gorilla/pull/503): Bug fix in possible answers for the AST evaluation dataset (parallel category: 14 affected; parallel_multiple category: 25 affected).
+- [July 5, 2024] [#496](https://github.com/ShishirPatil/gorilla/pull/496): Updates to API status checks. Checking the health of executable APIs is now off by default. Further, even when triggered, unhealthy APIs will not terminate the evaluation process. Users can enable this feature by setting the `--api-sanity-check` flag or `-c` for short. The previous `--skip-api-sanity-check` or `-s` flag is now deprecated.
+- [July 3, 2024] [#489](https://github.com/ShishirPatil/gorilla/pull/489): Add new model `nvidia/nemotron-4-340b-instruct` to the leaderboard.
+- [July 2, 2024] [#474](https://github.com/ShishirPatil/gorilla/pull/474): Add new model `THUDM/glm-4-9b-chat` to the leaderboard.
+- [June 18, 2024] [#470](https://github.com/ShishirPatil/gorilla/pull/470): Add new model `firefunction-v2-FC` to the leaderboard.
+- [June 15, 2024] [#437](https://github.com/ShishirPatil/gorilla/pull/437): Fix prompting issues for `Nexusflow-Raven-v2 (FC)`.
+- [June 7, 2024] [#407](https://github.com/ShishirPatil/gorilla/pull/407), [#462](https://github.com/ShishirPatil/gorilla/pull/462): Update the AST evaluation logic to allow the use of `int` values for Python parameters expecting `float` values. This is to accommodate the Python auto-conversion feature from `int` to `float`.
+- [May 14, 2024] [#426](https://github.com/ShishirPatil/gorilla/pull/426):
+  - Add the following new models to the leaderboard:
+    - `gpt-4o-2024-05-13`
+    - `gpt-4o-2024-05-13-FC`
+    - `gemini-1.5-pro-preview-0514`
+    - `gemini-1.5-flash-preview-0514`
+  - Update price for the following models:
+    - All Gemini Series
+    - `Claude-2.1 (Prompt)` and `Claude-instant-1.2 (Prompt)`
+    - `Mistral-large` and `Mistral-Small`
+    - `GPT-3.5-Turbo-0125`
+- [May 8, 2024] [#406](https://github.com/ShishirPatil/gorilla/pull/406) and [#421](https://github.com/ShishirPatil/gorilla/pull/421): Update the `gemini_handler.py` to better handle parallel function calls for Gemini models.
+- [May 6, 2024] [#412](https://github.com/ShishirPatil/gorilla/pull/412): Bug fix in evaluation dataset for AST categories. This includes updates to both prompts and function docs.
+- [May 2, 2024] [#405](https://github.com/ShishirPatil/gorilla/pull/405): Bug fix in the possible answers for the AST Simple evaluation dataset. Prompt and function docs are not affected.
+- [April 28, 2024] [#397](https://github.com/ShishirPatil/gorilla/pull/397): Add new model `snowflake/arctic` to the leaderboard. Note that there are multiple ways to run inference on this model; we chose to do it via the Nvidia API catalog.
+- [April 27, 2024] [#390](https://github.com/ShishirPatil/gorilla/pull/390): Bug fix in the cost and latency calculation for open-source models. For consistency, both are now measured while serving the model with [vLLM](https://github.com/vllm-project/vllm) on 8 V100 GPUs. $$\text{Cost} = \text{Latency per 1000 function calls} \times \frac{\text{8xV100 Azure pay-as-you-go price per hour}}{3600}$$
+- [April 25, 2024] [#386](https://github.com/ShishirPatil/gorilla/pull/386): Add 5 new models to the leaderboard: `meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B-Instruct`, `gemini-1.5-pro-preview-0409`, `command-r-plus`, `command-r-plus-FC`.
+- [April 19, 2024] [#377](https://github.com/ShishirPatil/gorilla/pull/377):
+  - Bug fix for the evaluation dataset in the executable test categories. This includes updates to both prompts and function docs.
+  - The `evaluation_result` field has been removed to accommodate the variability in API execution results across different evaluation runs. Instead, a human-verified `ground_truth` is now included for the executable test categories. During each evaluation run, `evaluation_result` is generated anew using the `ground_truth`, and then compared against the model output.
+  - A stricter metric has been adopted when using the `structural_match` (a.k.a. type match) evaluation criterion: for `list` results, the lengths are compared; for `dict` results, the keys are matched. This is to account for the fast-changing nature of some of the real-time API results while ensuring the evaluation remains meaningful.
+  - Added another evaluation criterion, `real_time_match`, for the executable category, which is a looser form of `exact_match` specifically for numerical execution results. The execution result must be within a certain percentage threshold (20%) of the expected result to accommodate the live updates of API responses. Users can change this threshold value in `eval_checker_constant.py`.
+- [April 18, 2024] [#375](https://github.com/ShishirPatil/gorilla/pull/375): A more comprehensive API sanity check is included; the APIs that are invoked during the non-REST executable evaluation process will also be checked for their availability before running the evaluation. Also, add support for the shortcut `-s` for the `--skip-api-sanity-check` flag, based on community feedback.
+- [April 16, 2024] [#366](https://github.com/ShishirPatil/gorilla/pull/366): Switch to use Anthropic's new Tool Use Beta `tools-2024-04-04` when generating Claude 3 FC series data. `gpt-4-turbo-2024-04-09` and `gpt-4-turbo-2024-04-09-FC` are also added to the leaderboard.
+- [April 11, 2024] [#347](https://github.com/ShishirPatil/gorilla/pull/347): Add the 95th percentile latency to the leaderboard statistics. This metric is useful for understanding the latency distribution of the models, especially the worst-case scenario.
+- [April 10, 2024] [#339](https://github.com/ShishirPatil/gorilla/pull/339): Introduce REST API sanity check for the REST executable test category. It ensures that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, the evaluation process will be stopped by default as the result will be inaccurate. Users can choose to bypass this check by setting the `--skip-api-sanity-check` flag or `-s` for short.
+- [April 9, 2024] [#338](https://github.com/ShishirPatil/gorilla/pull/338): Bug fix in the evaluation datasets (including both prompts and function docs). Bug fix for possible answers as well.
+- [April 8, 2024] [#330](https://github.com/ShishirPatil/gorilla/pull/330): Fixed an oversight that was introduced in [#299](https://github.com/ShishirPatil/gorilla/pull/299). For function-calling (FC) models that cannot take the `float` type in input, when the parameter type is a `float`, the evaluation procedure will convert that type to `number` in the model input and mention in the parameter description that `This is a float type value.`. An additional field `format: float` will also be included in the model input to make the type explicit. Updated the model handlers for Claude, Mistral, and OSS to better parse the model output.
+- [April 8, 2024] [#327](https://github.com/ShishirPatil/gorilla/pull/327): Add new model `NousResearch/Hermes-2-Pro-Mistral-7B` to the leaderboard.
+- [April 3, 2024] [#309](https://github.com/ShishirPatil/gorilla/pull/309): Bug fix for evaluation dataset possible answers. Implement **string standardization** for the AST evaluation pipeline, i.e. removing white space and a subset of punctuation marks (`,./-_*^`) to make the AST evaluation more robust and accurate. Fixed AST evaluation issue for type `tuple`. Add 2 new models `meetkai/functionary-small-v2.4 (FC)`, `meetkai/functionary-medium-v2.4 (FC)` to the leaderboard.
+- [April 1, 2024] [#299](https://github.com/ShishirPatil/gorilla/pull/299): Leaderboard update with new models (`Claude-3-Haiku`, `Databrick-DBRX-Instruct`), more advanced AST evaluation procedure, and updated evaluation datasets. Cost and latency statistics during evaluation are also measured. We also released the manual that our evaluation procedure is based on, available [here](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html#metrics).
+- [Mar 11, 2024] [#254](https://github.com/ShishirPatil/gorilla/pull/254): Leaderboard update with 3 new models: `Claude-3-Opus-20240229 (Prompt)`, `Claude-3-Sonnet-20240229 (Prompt)`, and `meetkai/functionary-medium-v2.2 (FC)`.
+- [Mar 5, 2024] [#237](https://github.com/ShishirPatil/gorilla/pull/237) and [#238](https://github.com/ShishirPatil/gorilla/pull/238): Leaderboard update resulting from [#223](https://github.com/ShishirPatil/gorilla/pull/223); 3 new models: `mistral-large-2402`, `gemini-1.0-pro`, and `google/gemma-7b-it`.
+- [Feb 29, 2024] [#223](https://github.com/ShishirPatil/gorilla/pull/223): Modifications to REST evaluation.

From 10a3edc71f4779e04c28b2e43b20e3ee38ad739f Mon Sep 17 00:00:00 2001
From: Huanzhi Mao <huanzhimao@gmail.com>
Date: Sat, 21 Sep 2024 15:33:47 -0700
Subject: [PATCH 2/2] update readme; remove outdated links

---
 berkeley-function-call-leaderboard/README.md | 265 ++++++-------------
 1 file changed, 74 insertions(+), 191 deletions(-)

diff --git a/berkeley-function-call-leaderboard/README.md b/berkeley-function-call-leaderboard/README.md
index 03df68138..7e7122190 100644
--- a/berkeley-function-call-leaderboard/README.md
+++ b/berkeley-function-call-leaderboard/README.md
@@ -1,18 +1,20 @@
 # Berkeley Function Calling Leaderboard (BFCL)
 
-💡 Read more in our [Gorilla OpenFunctions Leaderboard Blog](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)
+💡 Read more in our Gorilla OpenFunctions Leaderboard Blogs:
 
-🦍 Berkeley Function Calling Leaderboard live [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard)
+- [BFCL v1 Blog Post](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)
+- [BFCL v2 (live dataset) Blog Post](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html)
+- [BFCL v3 (multi-turn) Blog Post](https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html)
 
-🦍 Berkeley Function Calling Leaderboard on Hugginface [Berkeley Function Calling Leaderboard Huggingface](https://huggingface.co/spaces/gorilla-llm/berkeley-function-calling-leaderboard)
+🦍 See the live [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html#leaderboard)
 
 ## Introduction
-We introduce the Berkeley Function Leaderboard (BFCL), the **first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions**. Unlike previous function call evaluations, BFCL accounts for various forms of function calls, diverse function calling scenarios, and their executability. Additionally, we release Gorilla-Openfunctions-v2, the most advanced open-source model to date capable of handling multiple languages, parallel function calls, and multiple function calls simultaneously. A unique debugging feature of this model is its ability to output an "Error Message" when the provided function does not suit your task.
 
-Read more about the technical details and interesting insights in our [blog post](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)!
+We introduce the Berkeley Function Calling Leaderboard (BFCL), the **first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions**. Unlike previous function call evaluations, BFCL accounts for various forms of function calls, diverse function calling scenarios, and their executability.
 
-![image](./architecture_diagram.png)
+Read more about the technical details and interesting insights in our [BFCL v1 blog post](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html), [BFCL v2 blog post](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html), and [BFCL v3 blog post](https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html).
 
+![image](./architecture_diagram.png)
 
 ## Install Dependencies
 
@@ -36,9 +38,11 @@ pip install -e .
 ### Installing Extra Dependencies for Self-Hosted Open Source Models
 
 To do LLM generation on self-hosted open source models, you need to run the following command to install the extra dependencies.
+
 ```bash
 pip install -e .[oss_eval]
 ```
+
 Note that this requires GPU supported by vLLM and it can only be installed on Linux and Windows (not Mac).
 
 ### Setting up Environment Variables
@@ -50,19 +54,20 @@ cp .env.example .env
 ```
 
 ### API Keys for Execution Evaluation Data Post-processing (Can be Skipped: Necessary for Executable Test Categories)
+
 Add your keys into the `.env` file, so that the original placeholder values in questions, params, and answers will be reset.
 
 To run the executable test categories, there are 4 API keys to include:
 
 1. RAPID-API Key: https://rapidapi.com/hub
 
-    * Yahoo Finance: https://rapidapi.com/sparior/api/yahoo-finance15
-    * Real Time Amazon Data : https://rapidapi.com/letscrape-6bRBa3QguO5/api/real-time-amazon-data
-    * Urban Dictionary: https://rapidapi.com/community/api/urban-dictionary
-    * Covid 19: https://rapidapi.com/api-sports/api/covid-193
-    * Time zone by Location: https://rapidapi.com/BertoldVdb/api/timezone-by-location
+   - Yahoo Finance: https://rapidapi.com/sparior/api/yahoo-finance15
+   - Real Time Amazon Data: https://rapidapi.com/letscrape-6bRBa3QguO5/api/real-time-amazon-data
+   - Urban Dictionary: https://rapidapi.com/community/api/urban-dictionary
+   - Covid 19: https://rapidapi.com/api-sports/api/covid-193
+   - Time zone by Location: https://rapidapi.com/BertoldVdb/api/timezone-by-location
 
-    All the Rapid APIs we use have free tier usage. You need to **subscribe** to those API providers in order to have the executable test environment setup but it will be *free of charge*!
+   All the Rapid APIs we use have a free usage tier. You need to **subscribe** to those API providers to set up the executable test environment, but it is _free of charge_!
 
 2. Exchange Rate API:https://www.exchangerate-api.com
 3. OMDB API: http://www.omdbapi.com/apikey.aspx
@@ -75,7 +80,6 @@ After you have filled in the necessary values in the `.env` file, you can run th
 python apply_function_credential_config.py
 ```
 
-
 ## Evaluating different models on the BFCL
 
 Make sure the model API keys are included in your `.env` file. Running proprietary models like GPTs, Claude, Mistral-X will require them.
@@ -101,19 +105,21 @@ If decided to run OSS model, the generation script uses vllm and therefore requi
 
 ### Generating LLM Responses
 
-Use the following command for LLM inference of the evaluation dataset with specific models. 
+Use the following command for LLM inference of the evaluation dataset with specific models.
 
 ```bash
 python openfunctions_evaluation.py --model MODEL_NAME --test-category TEST_CATEGORY --num-threads 1
 ```
-You can optionally specify the number of threads to use for *parallel inference* by setting the `--num-threads` flag to speed up inference for **hosted models**, not applicable for OSS models.
+
+You can optionally set the `--num-threads` flag to choose the number of threads used for _parallel inference_. This speeds up inference for **hosted models** only; it does not apply to OSS models.
 
 For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) section below.
 
 If no `MODEL_NAME` is provided, the model `gorilla-openfunctions-v2` will be used by default. If no `TEST_CATEGORY` is provided, all test categories will be run by default.
 
 ### Models Available
-Below is *a table of models we support* to run our leaderboard evaluation against. If the models support function calling (FC), we will follow its function calling format provided by official documentation. Otherwise, we use a consistent system message to prompt the model to generate function calls in the right format.
+
+Below is _a table of models we support_ to run our leaderboard evaluation against. If a model supports function calling (FC), we follow the function calling format provided by its official documentation. Otherwise, we use a consistent system message to prompt the model to generate function calls in the right format.
 |Model | Type |
 |---|---|
 |gorilla-openfunctions-v2 | Function Calling|
@@ -173,56 +179,52 @@ Below is *a table of models we support* to run our leaderboard evaluation agains
 
 Here {MODEL} 💻 means the model needs to be hosted locally and called by vllm, {MODEL} means the models that are called API calls. For models with a trailing `-FC`, it means that the model supports function-calling feature. You can check out the table summarizing feature supports among different models [here](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html#prompt).
 
-
 For model names with {.}, it means that the model has multiple versions. For example, we provide evaluation on three versions of GPT-4: `gpt-4-0125-preview`, `gpt-4-1106-preview`, and `gpt-4-0613`.
 
-For Mistral large and small models, we provide evaluation on both of their `Any` and `Auto` settings. More information about this can be found [here](https://docs.mistral.ai/guides/function-calling/).
-
-
 For `Gemini-1.0-pro`, you need to fill in `model_handler/gemini_handler.py` with your GCP project ID that has access to Vertex AI endpoint.
-For `Databrick-DBRX-instruct`, you need to create a Databrick Azure workspace and setup an endpoint for inference. 
-
+For `Databrick-DBRX-instruct`, you need to create a Databricks Azure workspace and set up an endpoint for inference.
 
 ### Available Test Category
+
 In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:
 
-* Available test groups:
-  * `all`: All test categories.
-    * This is the default option if no test category is provided.
-  * `multi_turn`: All multi-turn test categories.
-  * `single_turn`: All single-turn test categories.
-  * `live`: All user-contributed live test categories.
-  * `non_live`: All not-user-contributed test categories (the opposite of `live`).
-  * `ast`: Abstract Syntax Tree tests.
-  * `executable`: Executable code evaluation tests.
-  * `python`: Tests specific to Python code.
-  * `non_python`: Tests for code in languages other than Python, such as Java and JavaScript.
-  * `python_ast`: Python Abstract Syntax Tree tests.
-* Available individual test categories:
-  * `simple`: Simple function calls.
-  * `parallel`: Multiple function calls in parallel.
-  * `multiple`: Multiple function calls in sequence.
-  * `parallel_multiple`: Multiple function calls in parallel and in sequence.
-  * `java`: Java function calls.
-  * `javascript`: JavaScript function calls.
-  * `exec_simple`: Executable function calls.
-  * `exec_parallel`: Executable multiple function calls in parallel.
-  * `exec_multiple`: Executable multiple function calls in parallel.
-  * `exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
-  * `rest`: REST API function calls.
-  * `irrelevance`: Function calls with irrelevant function documentation.
-  * `live_simple`: User-contributed simple function calls.
-  * `live_multiple`: User-contributed multiple function calls in sequence.
-  * `live_parallel`: User-contributed multiple function calls in parallel.
-  * `live_parallel_multiple`: User-contributed multiple function calls in parallel and in sequence.
-  * `live_irrelevance`: User-contributed function calls with irrelevant function documentation.
-  * `live_relevance`: User-contributed function calls with relevant function documentation.
-  * `multi_turn_base`: Base entries for multi-turn function calls.
-  * `multi_turn_miss_func`: Multi-turn function calls with missing function.
-  * `multi_turn_miss_param`: Multi-turn function calls with missing parameter.
-  * `multi_turn_long_context`: Multi-turn function calls with long context.
-  * `multi_turn_composite`: Multi-turn function calls with missing function, missing parameter, and long context.
-* If no test category is provided, the script will run all available test categories. (same as `all`)
+- Available test groups:
+  - `all`: All test categories.
+    - This is the default option if no test category is provided.
+  - `multi_turn`: All multi-turn test categories.
+  - `single_turn`: All single-turn test categories.
+  - `live`: All user-contributed live test categories.
+  - `non_live`: All test categories that are not user-contributed (the opposite of `live`).
+  - `ast`: Abstract Syntax Tree tests.
+  - `executable`: Executable code evaluation tests.
+  - `python`: Tests specific to Python code.
+  - `non_python`: Tests for code in languages other than Python, such as Java and JavaScript.
+  - `python_ast`: Python Abstract Syntax Tree tests.
+- Available individual test categories:
+  - `simple`: Simple function calls.
+  - `parallel`: Multiple function calls in parallel.
+  - `multiple`: Multiple function calls in sequence.
+  - `parallel_multiple`: Multiple function calls in parallel and in sequence.
+  - `java`: Java function calls.
+  - `javascript`: JavaScript function calls.
+  - `exec_simple`: Executable function calls.
+  - `exec_parallel`: Executable multiple function calls in parallel.
+  - `exec_multiple`: Executable multiple function calls in sequence.
+  - `exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
+  - `rest`: REST API function calls.
+  - `irrelevance`: Function calls with irrelevant function documentation.
+  - `live_simple`: User-contributed simple function calls.
+  - `live_multiple`: User-contributed multiple function calls in sequence.
+  - `live_parallel`: User-contributed multiple function calls in parallel.
+  - `live_parallel_multiple`: User-contributed multiple function calls in parallel and in sequence.
+  - `live_irrelevance`: User-contributed function calls with irrelevant function documentation.
+  - `live_relevance`: User-contributed function calls with relevant function documentation.
+  - `multi_turn_base`: Base entries for multi-turn function calls.
+  - `multi_turn_miss_func`: Multi-turn function calls with missing function.
+  - `multi_turn_miss_param`: Multi-turn function calls with missing parameter.
+  - `multi_turn_long_context`: Multi-turn function calls with long context.
+  - `multi_turn_composite`: Multi-turn function calls with missing function, missing parameter, and long context.
+- If no test category is provided, the script will run all available test categories (same as `all`).
 
 > If you want to run the `all`, `non_live`, `executable` or `python` category, make sure to register your REST API keys in the `.env` file. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!
 
@@ -230,7 +232,6 @@ In the following two sections, the optional `--test-category` parameter can be u
 
 > By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include any executable categories (eg, the test name contains `exec`), the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
 
-
 ## Evaluating the LLM generations
 
 ### Running the Checker
@@ -275,142 +276,24 @@ python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-2024062
 
 Some companies have proposed some optimization strategies in their models' handler, which we (BFCL) think is unfair to other models, as those optimizations are not generalizable to all models. Therefore, we have disabled those optimizations during the evaluation process by default. You can enable those optimizations by setting the `USE_{COMPANY}_OPTIMIZATION` flag to `True` in the `.env` file.
 
-
-## Changelog
-
-* [Sept 19, 2024] [#644](https://github.com/ShishirPatil/gorilla/pull/644): BFCL V3 release:
-  * Introduce new multi-turn dataset and state-based evaluation metric
-  * Switch to use vllm serve for OSS model inference
-  * Separate ast_checker and executable_checker for readability
-  * Several outdated or deprecated models will be excluded from the leaderboard and replaced with their updated successors to improve the leaderboard's overall maintainability.
-* [Sept 13, 2024] [#638](https://github.com/ShishirPatil/gorilla/pull/638): Fix prompt formatting issue for `THUDM/glm-4-9b-chat`.
-* [Sept 12, 2024] [#635](https://github.com/ShishirPatil/gorilla/pull/635): Add new models `o1-preview-2024-09-12` and `o1-mini-2024-09-12` to the leaderboard.
-* [Sept 8, 2024] [#627](https://github.com/ShishirPatil/gorilla/pull/627) Add new model `MadeAgents/Hammer-7b` to the leaderboard.
-* [Sept 7, 2024] [#626](https://github.com/ShishirPatil/gorilla/pull/626): Fix prompt format for Llama models.
-* [Sept 4, 2024] [#623](https://github.com/ShishirPatil/gorilla/pull/623): Fix decoding issue in the `NvidiaHandler`; remove duplicate `ArcticHandler` class.
-* [August 29, 2024] [#616](https://github.com/ShishirPatil/gorilla/pull/6160): Add the following new models to the leaderboard:
-  * `Salesforce/xLAM-7b-r`
-  * `Salesforce/xLAM-8x7b-r`
-  * `Salesforce/xLAM-8x22b-r`
-* [August 28, 2024] [#565](https://github.com/ShishirPatil/gorilla/pull/565), [#612](https://github.com/ShishirPatil/gorilla/pull/612): Packagerize the BFCL pipeline for easier deployment and maintenance.
-* [August 27, 2024] [#608](https://github.com/ShishirPatil/gorilla/pull/608): Bug fix in the dataset and possible answers.
-  * simple: 16 affected
-  * multiple: 5 affected
-* [August 23, 2024] [#600](https://github.com/ShishirPatil/gorilla/pull/600): Bug fix in the dataset and possible answers.
-  * simple: 12 affected
-  * multiple: 3 affected
-  * parallel: 3 affected
-  * parallel multiple: 6 affected
-* [August 22, 2024] [#593](https://github.com/ShishirPatil/gorilla/pull/593): 
-  * Move formatting instructions and function documentation to system prompt instead of user prompt in the message section. All prompting models are affected.
-  * Bug fix in the dataset and possible answers.
-    * irrelevance: 1 affected
-    * live_irrelevance: 1 affected
-    * live_simple: 1 affected
-    * live_parallel: 3 affected
-* [August 19, 2024] [#580](https://github.com/ShishirPatil/gorilla/pull/580): Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html). All CLI commands have been updated to support the new dataset.
-* [August 8, 2024] [#574](https://github.com/ShishirPatil/gorilla/pull/574): Set temperature to 0.001 for all models for consistency and reproducibility.
-* [August 7, 2024] [#571](https://github.com/ShishirPatil/gorilla/pull/571): Support parallel inference for hosted models. User can specify the number of threads to use for parallel inference by setting the `--num-threads` flag. The default is 1, which means no parallel inference.
-* [August 6, 2024] [#569](https://github.com/ShishirPatil/gorilla/pull/569), [#570](https://github.com/ShishirPatil/gorilla/pull/570), [#573](https://github.com/ShishirPatil/gorilla/pull/573): Add the following new models to the leaderboard:
-  * `open-mistral-nemo-2407`
-  * `open-mistral-nemo-2407-FC`
-  * `open-mixtral-8x22b`
-  * `open-mixtral-8x22b-FC`
-  * `open-mixtral-8x7b`
-  * `gpt-4o-mini-2024-07-18`
-  * `gpt-4o-mini-2024-07-18-FC`
-  * `gpt-4o-2024-08-06`
-  * `gpt-4o-2024-08-06-FC`
-  * `meetkai/functionary-medium-v3.1-FC`
-  * `meetkai/functionary-small-v3.1-FC`
-  * `meetkai/functionary-small-v3.2-FC`
-* [August 5, 2024] [#568](https://github.com/ShishirPatil/gorilla/pull/568): Rephrase the question prompt for the `executable_parallel_function` category to remove potentially misleading information implying multi-turn function calls.
-* [August 4, 2024] [#557](https://github.com/ShishirPatil/gorilla/pull/557): Bug fix in the possible answers.
-  * simple: 7 affected
-  * multiple function: 3 affected
-  * parallel function: 5 affected
-  * parallel multiple function: 6 affected
-  * executable parallel function: 1 affected
-  * javascript: 3 affected
-* [July 26, 2024] [#549](https://github.com/ShishirPatil/gorilla/pull/549): Fix js_type_converter.py to properly handle JavaScript array value inside dictionary.
-* [July 25, 2024] [#532](https://github.com/ShishirPatil/gorilla/pull/532), [#543](https://github.com/ShishirPatil/gorilla/pull/543), [#556](https://github.com/ShishirPatil/gorilla/pull/556), [#542](https://github.com/ShishirPatil/gorilla/pull/542): Add the following new models to the leaderboard:
-    - `Salesforce/xLAM-7b-fc-r`
-    - `Salesforce/xLAM-1b-fc-r`
-    - `yi-large-fc`
-    - `NousResearch/Hermes-2-Pro-Llama-3-8B`
-    - `NousResearch/Hermes-2-Pro-Llama-3-70B`
-    - `NousResearch/Hermes-2-Theta-Llama-3-8B`
-    - `NousResearch/Hermes-2-Theta-Llama-3-70B`
-* [July 22, 2024] [#540](https://github.com/ShishirPatil/gorilla/pull/540): Chore: Improve handling of vLLM's cleanup phase error by combining all selected test categories into one single task to submit to the vLLM server.
-* [July 21, 2024] [#538](https://github.com/ShishirPatil/gorilla/pull/538), [#545](https://github.com/ShishirPatil/gorilla/pull/545): Fix `language_specific_pre_processing` and `convert_to_tool` function to properly handle pre-processing for prompts and function docs in Java and JavaScript test categories. All entries in these categories are affected.
-* [July 20, 2024] [#537](https://github.com/ShishirPatil/gorilla/pull/537): Update generation script for locally-hosted OSS model to use single-node multi-GPU inference method (tensor parallel). Ray is not used anymore.
-* [July 16, 2024] [#525](https://github.com/ShishirPatil/gorilla/pull/525), [#536](https://github.com/ShishirPatil/gorilla/pull/536): Add new model `ibm-granite/granite-20b-functioncalling` to the leaderboard.
-* [July 10, 2024] [#522](https://github.com/ShishirPatil/gorilla/pull/522): Bug fix in the evaluation dataset for Executable Parallel Multiple category. This includes updates to both prompts and function docs. 2 entries are affected.
-* [July 8, 2024] [#516](https://github.com/ShishirPatil/gorilla/pull/516): Fix double-casting issue in `model_handler` for Java and JavaScript test categories.
-* [July 7, 2024] [#504](https://github.com/ShishirPatil/gorilla/pull/504), [#505](https://github.com/ShishirPatil/gorilla/pull/505), [#506](https://github.com/ShishirPatil/gorilla/pull/506), [#508](https://github.com/ShishirPatil/gorilla/pull/508), [#512](https://github.com/ShishirPatil/gorilla/pull/512), [#517](https://github.com/ShishirPatil/gorilla/pull/517): Make BFCL user-friendly and easy to extend.
-* [July 6, 2024] [#423](https://github.com/ShishirPatil/gorilla/pull/423) and [#503](https://github.com/ShishirPatil/gorilla/pull/503): Bug fix in possible answers for the AST evaluation dataset (parallel category: 14 affected; parallel_multiple category: 25 affected).
-* [July 5, 2024] [#496](https://github.com/ShishirPatil/gorilla/pull/496): Updates to API status checks. Checking the health of executable APIs is now off by default. Further, even when triggered, un-healthy APIs will not terminate the evaluation process. Users can enable this feature by setting the `--api-sanity-check` flag or `-c` for short. The previous `--skip-api-sanity-check` or `-s` flag is now deprecated.
-* [July 3, 2024] [#489](https://github.com/ShishirPatil/gorilla/pull/489): Add new model `nvidia/nemotron-4-340b-instruct` to the leaderboard.
-* [July 2, 2024] [#474](https://github.com/ShishirPatil/gorilla/pull/474): Add new model `THUDM/glm-4-9b-chat` to the leaderboard.
-* [June 18, 2024] [#470](https://github.com/ShishirPatil/gorilla/pull/470): Add new model `firefunction-v2-FC` to the leaderboard.
-* [June 15, 2024] [#437](https://github.com/ShishirPatil/gorilla/pull/437): Fix prompting issues for `Nexusflow-Raven-v2 (FC)`.
-* [June 7, 2024] [#407](https://github.com/ShishirPatil/gorilla/pull/407), [#462](https://github.com/ShishirPatil/gorilla/pull/462): Update the AST evaluation logic to allow the use of `int` values for Python parameters expecting `float` values. This is to accommodate the Python auto-conversion feature from `int` to `float`.
-* [May 14, 2024] [#426](https://github.com/ShishirPatil/gorilla/pull/426):
-    - Add the following new models to the leaderboard:
-        + `gpt-4o-2024-05-13`
-        + `gpt-4o-2024-05-13-FC`
-        + `gemini-1.5-pro-preview-0514`
-        + `gemini-1.5-flash-preview-0514`
-    - Update price for the following models:
-        + All Gemini Series
-        + `Claude-2.1 (Prompt)` and `Claude-instant-1.2 (Prompt)`
-        + `Mistral-large` and `Mistral-Small`
-        + `GPT-3.5-Turbo-0125`
-* [May 8, 2024] [#406](https://github.com/ShishirPatil/gorilla/pull/406) and [#421](https://github.com/ShishirPatil/gorilla/pull/421): Update the `gemini_handler.py` to better handle parallel function calls for Gemini models.
-* [May 6, 2024] [#412](https://github.com/ShishirPatil/gorilla/pull/412): Bug fix in evaluation dataset for AST categories. This includes updates to both prompts and function docs.
-* [May 2, 2024] [#405](https://github.com/ShishirPatil/gorilla/pull/405): Bug fix in the possible answers for the AST Simple evaluation dataset. Prompt and function docs are not affected.
-* [April 28, 2024] [#397](https://github.com/ShishirPatil/gorilla/pull/397): Add new model `snowflake/arctic` to the leaderboard. Note that there are multiple ways to inference the model, and we choose to do it via Nvidia API catalog.
-* [April 27, 2024] [#390](https://github.com/ShishirPatil/gorilla/pull/390): Bug fix in cost and latency calculation for open-source models, which are now all calculated when serving the model with [vLLM](https://github.com/vllm-project/vllm) using 8 V100 GPUs for consistency. $$\text{Cost} = \text{Latency per 1000 function call} * (\text{8xV100 azure-pay-as-you-go-price per hour / 3600})$$
-* [April 25, 2024] [#386](https://github.com/ShishirPatil/gorilla/pull/386): Add 5 new models to the leaderboard: `meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B-Instruct`, `gemini-1.5-pro-preview-0409`, `command-r-plus`, `command-r-plus-FC`.
-* [April 19, 2024] [#377](https://github.com/ShishirPatil/gorilla/pull/377): 
-    - Bug fix for the evaluation dataset in the executable test categories. This includes updates to both prompts and function docs.
-    - The `evaluation_result` field has been removed to accommodate the variability in API execution results across different evaluation runs. Instead, a human-verified `ground_truth` is now included for the executable test categories. During each evaluation run, `evaluation_result` is generated anew using the `ground_truth`, and then compared against the model output. 
-    - A stricter metric has been adopted when using the `structural_match` (aka. type match) evaluation criteria ---- For `list` results, the lengths are compared; for `dict` results, the keys are matched. This is to account for the fast-changing nature of some of the real-time API results while ensuring the evaluation remains meaningful.
-    - Added another evaluation criteria `real_time_match` for the executable category, which is a looser form of `exact_match` specifically for numerical execution results. The execution result must be within a certain percentage threshold (20%) from the expected result to accommodate the live updates of API responses. User can change this threshold value in `eval_checker_constant.py`.
-* [April 18, 2024] [#375](https://github.com/ShishirPatil/gorilla/pull/375): A more comprehensive API sanity check is included; the APIs that are invoked during the non-REST executable evaluation process will also be checked for their availability before running the evaluation. Also, add support for the shortcut `-s` for the `--skip-api-sanity-check` flag, based on the community feedback.
-* [April 16, 2024] [#366](https://github.com/ShishirPatil/gorilla/pull/366): Switch to use Anthropic's new Tool Use Beta `tools-2024-04-04` when generating Claude 3 FC series data. `gpt-4-turbo-2024-04-09` and `gpt-4-turbo-2024-04-09-FC` are also added to the leaderboard.
-* [April 11, 2024] [#347](https://github.com/ShishirPatil/gorilla/pull/347): Add the 95th percentile latency to the leaderboard statistics. This metric is useful for understanding the latency distribution of the models, especially the worst-case scenario.
-* [April 10, 2024] [#339](https://github.com/ShishirPatil/gorilla/pull/339): Introduce REST API sanity check for the REST executable test category. It ensures that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, the evaluation process will be stopped by default as the result will be inaccurate. Users can choose to bypass this check by setting the `--skip-api-sanity-check` flag or `-s` for short.
-* [April 9, 2024] [#338](https://github.com/ShishirPatil/gorilla/pull/338): Bug fix in the evaluation datasets (including both prompts and function docs). Bug fix for possible answers as well.
-* [April 8, 2024] [#330](https://github.com/ShishirPatil/gorilla/pull/330): Fixed an oversight that was introduced in [#299](https://github.com/ShishirPatil/gorilla/pull/299). For function-calling (FC) models that cannot take `float` type in input, when the parameter type is a `float`, the evaluation procedure will convert that type to `number` in the model input and mention in the parameter description that `This is a float type value.`. An additional field `format: float` will also be included in the model input to make it clear about the type. Updated the model handler for Claude, Mistral, and OSS to better parse the model output.
-* [April 8, 2024] [#327](https://github.com/ShishirPatil/gorilla/pull/327): Add new model `NousResearch/Hermes-2-Pro-Mistral-7B` to the leaderboard.
-* [April 3, 2024] [#309](https://github.com/ShishirPatil/gorilla/pull/309): Bug fix for evaluation dataset possible answers. Implement **string standardization** for the AST evaluation pipeline, i.e. removing white spaces and a subset of punctuations (`,./-_*^`) to make the AST evaluation more robust and accurate. Fixed AST evaluation issue for type `tuple`. Add 2 new models `meetkai/functionary-small-v2.4 (FC)`, `meetkai/functionary-medium-v2.4 (FC)` to the leaderboard.
-* [April 1, 2024] [#299](https://github.com/ShishirPatil/gorilla/pull/299): Leaderboard update with new models (`Claude-3-Haiku`, `Databrick-DBRX-Instruct`), more advanced AST evaluation procedure, and updated evaluation datasets. Cost and latency statistics during evaluation are also measured. We also released the manual that our evaluation procedure is based on, available [here](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html#metrics).
-* [Mar 11, 2024] [#254](https://github.com/ShishirPatil/gorilla/pull/254): Leaderboard update with 3 new models: `Claude-3-Opus-20240229 (Prompt)`, `Claude-3-Sonnet-20240229 (Prompt)`, and `meetkai/functionary-medium-v2.2 (FC)`
-* [Mar 5, 2024] [#237](https://github.com/ShishirPatil/gorilla/pull/237) and [238](https://github.com/ShishirPatil/gorilla/pull/238): leaderboard update resulting from [#223](https://github.com/ShishirPatil/gorilla/pull/223); 3 new models: `mistral-large-2402`, `gemini-1.0-pro`, and `google/gemma-7b-it`.
-* [Feb 29, 2024] [#223](https://github.com/ShishirPatil/gorilla/pull/223): modifications to REST evaluation. 
-
-
 ## Contributing
 
 We welcome additions to the Function Calling Leaderboard! To add a new model, here are a few things you need to do:
 
-1. Take a look at the `model_handler/handler.py`. This is the base handler object which all handlers are inherited from. Also, free feel to take a look at the existing model handers; very likely you can re-use some of the existing code if the new model outputs in a similar format.
-2. Create your handler and define the following functions 
-    1. `__init__`: on initialization, you need to create a `self.client` object if you have an existing endpoint(e.g. `self.client = OpenAI()`) or follow `model_handler/oss_handler.py` for starting a vLLM serving.
-    2. `inference`: inference function takes in prompt, functions, as well as optional programming language parameters. It will make call to the endpoint, compile result in the desired format, as well as logging the token number and latency
-    3. `decode_ast`: decode_ast will convert the response from raw output in the format of `[{func1:{param1:val1,...}},{func2:{param2:val2,...}}]` This format will be used to check for exact matching the parameters.
-    4. `decode_execute`: deocde_execute will convert the response from raw output in the format of `"[func1(param1=val1),func2(param2=val2)]"`
-3. Modify `model_handler/handler_map.py`. This mapping contains the key as the exact model name and value as the handler object of the specific model.
-4. Modify `eval_checker/eval_runner_helper.py`:
-    - Update the `MODEL_METADATA_MAPPING` with the model display name, URL, license and company information. The key should be the same as the one in `model_handler/handler_map.py`.
-    - If your model is price-based, you should update the `INPUT_PRICE_PER_MILLION_TOKEN` and `OUTPUT_PRICE_PER_MILLION_TOKEN`. - If your model doesn't have a cost, you should add it to the `NO_COST_MODELS` list.
-    - If your model is open-source and is hosted locally, the `OSS_LATENCY` list needs to be updated with the latency for the whole batch of data generation. This information will affect the cost calculation.
-5. Raise a [Pull Request](https://github.com/ShishirPatil/gorilla/pulls) with your new Model Handler. We will run the model handler if an endpoint is established. If self-hosting is required and the model size is large, we might not be able to accommodate model hosting therefore an OpenAI compatible endpoint for evaluation is desired. 
+1. Take a look at `bfcl/model_handler/base_handler.py`. This is the base handler class that all handlers inherit from. Also, feel free to take a look at the existing model handlers; very likely you can reuse some of the existing code if the new model outputs in a similar format.
+2. Create your handler and define the following functions:
+   1. `__init__`: on initialization, you need to create a `self.client` object if you have an existing endpoint (e.g. `self.client = OpenAI()`), or follow `model_handler/oss_handler.py` to start a vLLM server.
+   2. `inference`: takes in the prompt, the functions, and optional programming language parameters. It calls the endpoint, compiles the result in the desired format, and logs the token count and latency.
+   3. `decode_ast`: converts the raw model output into the format `[{func1:{param1:val1,...}},{func2:{param2:val2,...}}]`. This format is used to check for an exact match on the parameters.
+   4. `decode_execute`: converts the raw model output into the format `"[func1(param1=val1),func2(param2=val2)]"`.
+3. Modify `bfcl/model_handler/handler_map.py`. This mapping uses the exact model name as the key and the handler object for that model as the value.
+4. Modify `bfcl/eval_checker/model_metadata.py`:
+   - Update the `MODEL_METADATA_MAPPING` with the model display name, URL, license and company information. The key should be the same as the one in `model_handler/handler_map.py`.
+   - If your model is price-based, you should update the `INPUT_PRICE_PER_MILLION_TOKEN` and `OUTPUT_PRICE_PER_MILLION_TOKEN`. If your model doesn't have a cost, you should add it to the `NO_COST_MODELS` list.
+   - If your model is open-source and is hosted locally, the `OSS_LATENCY` list needs to be updated with the latency for the whole batch of data generation. This information will affect the cost calculation.
+5. Raise a [Pull Request](https://github.com/ShishirPatil/gorilla/pulls) with your new Model Handler. We will run the model handler if an endpoint is established. If self-hosting is required and the model size is large, we might not be able to accommodate model hosting, so an OpenAI-compatible endpoint for evaluation is preferred.
 6. Feel Free to join [Gorilla Discord](https://discord.gg/grXXvj9Whz) `#leaderboard` and reach out to us for any questions or concerns about adding new models. We are happy to help you!
 
-
 All the leaderboard statistics, and data used to train the models are released under Apache 2.0.
-Gorilla is an open source effort from UC Berkeley and we welcome contributors. 
+Gorilla is an open source effort from UC Berkeley and we welcome contributors.
 Please email us your comments, criticisms, and questions. More information about the project can be found at [https://gorilla.cs.berkeley.edu/](https://gorilla.cs.berkeley.edu/)
-