# Changelog

## [Unreleased]

### Added

### Fixed

## [0.4.0]

### Added

- Added benchmark for ChatGPT-4o-latest

### Fixed

- Fixed a bug that caused `include_string` to fail when the generated code contains many double quotes. It negatively impacted test cases like `q_and_a_extractor` and `extract_julia_code`. Because this potentially changes some scores, the version is bumped. A sketch of the quoting issue follows below.
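
A minimal sketch of the quoting issue, assuming nothing about the benchmark's internals: code containing double quotes evaluates fine when the raw text is passed to `include_string` directly, but naively re-wrapping that text in an ordinary `"..."` literal breaks on the embedded quotes.

```julia
# Hypothetical illustration (not the benchmark's actual code): pass the
# model's raw output straight to include_string so embedded double
# quotes need no extra escaping.
code = """
greet(name) = "Hello, " * name * "!"
"""

m = Module()             # evaluate in a throwaway module
include_string(m, code)  # no re-quoting, so the inner "..." stays intact
@assert m.greet("Julia") == "Hello, Julia!"
```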

## [0.3.0]

### Added

- REMOVED: Comparison of several Qwen-1.5 models (removed due to poor scores caused by an invalid GGUF)
- Added benchmark evals for Google Gemini 1.0 Pro (latest version as of 17th Feb 2024)
- Added benchmark evals for Claude 3 models and MistralAI "mistral-large"
- Added benchmark for the latest OpenAI GPT-4 Turbo ("gpt-4-turbo-2024-04-09")
- Added benchmarks for several open-weights models hosted on Fireworks.ai: Qwen-72b, Mixtral 8x22b (instruct preview), and DBRX Instruct
- Added benchmarks for Llama3 models and Mistral's own Mixtral-8x22b
- Added two new test cases to the waitlist: `find_mean` and `find_median` (see `code_generation_waitlist/`)
- Added a new model from OpenAI: gpt-4o-2024-05-13
- Added benchmark for Mistral's latest model: mistral-large-2407

### Fixed

- Changed the wording from "open-source" models to "locally-hosted" models (a more accurate description)

## [0.2.0]

### Added

- Added new models (OpenAI "0125" versions, Codellama, and more)
- Added the capability to evaluate code with an AgentCodeFixer loop (set `codefixing_num_rounds > 0`; see the sketch after this list)
- Automatically set a different seed for commercial API providers (MistralAI, OpenAI) to avoid their caching mechanisms
- Re-scored all past submissions with the new methodology
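
A hypothetical invocation sketch: only the `codefixing_num_rounds` keyword comes from the entry above, the `run_benchmark` name appears elsewhere in this changelog, and everything else about the call is an assumption rather than the package's documented API.

```julia
# Hypothetical usage, not the package's exact signature: with
# codefixing_num_rounds = 0 the AgentCodeFixer loop is skipped, while
# > 0 gives failing code that many automatic repair rounds before scoring.
results = run_benchmark(; codefixing_num_rounds = 3)
```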

### Fixed

- Improved code loading and debugging via Julia's code-loading mechanism (`include_string`), which makes it easier to locate the lines that caused errors (run `evaluate(...; verbose=true)` to see which lines failed, or pass `return_debug=true` to return the debug information as a secondary output; see the sketch after this list)
- Improved error capture and scoring (e.g., imports of Base modules are now correctly recognized as "safe")
- Improved detection of parse errors (i.e., this reduces the scores of submissions that "executed" only because the parse error was not detected earlier)
- Fixed a `mkdir` bug in `run_benchmark`
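
The two call shapes from the first entry above, as a hedged sketch; `conversation` is a placeholder argument, not the package's documented signature.

```julia
# Hypothetical call shapes based on the changelog entry; only the
# keyword arguments verbose and return_debug come from the text.
score = evaluate(conversation; verbose = true)  # prints the lines that caused errors
score, debug = evaluate(conversation; return_debug = true)  # debug info as a secondary output
```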

### Removed

- The `@timeout` macro has been upstreamed to PromptingTools

### Case Studies

- Quantization effects on Yi34b and Magicoder 7b
- Effect of English vs. Chinese prompts on performance with Yi34b

## [0.1.0]

### Added

- Documentation with detailed methodology, test case definitions, and results across various data cuts
- Added ~5 samples for each model/prompt/test case combination for more robust results

### Fixed

## [pre-0.1.0]

### Fixed

- Improved code parsing to accommodate smaller models (e.g., imprecise markdown code fences around otherwise valid code, or a valid function definition followed by an invalid explanation that breaks execution). This improved scores for all models. A sketch of the idea follows below.
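
A minimal sketch of the idea, assuming a regex-based extractor (the benchmark's real parser is more involved): recover the fenced code even when the fence is imprecise, and ignore any follow-on explanation outside the fence.

````julia
# Hypothetical helper, not the benchmark's actual parser: grab the first
# fenced block even if the language tag is missing, so a trailing
# explanation outside the fence no longer breaks execution.
function extract_code(response::AbstractString)
    m = match(r"```(?:julia)?[ \t]*\n(.*?)```"s, response)
    return m === nothing ? strip(response) : strip(m.captures[1])
end

extract_code("Here is the code:\n```julia\nf(x) = x + 1\n```\nIt adds one.")
# -> "f(x) = x + 1"
````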