Leaderboard

Model Name	Overall Score	Date Tested	nutrient_management	soil_and_water	pest_management	crop_management	V1_benchmark_questions	community_questions_FBN	License Type
o1-preview	93.18%	2024-09-27	94.12%	97.5%	91.94%	95.59%	90.48%	84.0%	Proprietary
claude-3.5-sonnet	89.41%	2024-08-24	87.06%	92.5%	90.32%	89.71%	89.52%	84.0%	Proprietary
gpt-4o	88.71%	2024-08-24	87.06%	88.75%	90.32%	88.24%	91.43%	80.0%	Proprietary
norm	88.0%	2024-08-24	82.35%	90.0%	88.71%	89.71%	89.52%	88.0%	Proprietary
claude-3-opus	87.53%	2024-08-24	88.24%	86.25%	85.48%	92.65%	86.67%	84.0%	Proprietary
centeotl-api	85.65%	2024-11-14	88.24%	85.0%	88.71%	80.88%	88.57%	72.0%	Proprietary
llama-3.1-sonar-huge-128k-online	85.41%	2024-08-24	83.53%	81.25%	82.26%	89.71%	89.52%	84.0%	Proprietary
gpt-4	85.0%	2024-06-14	83.53%	83.75%	83.87%	86.76%	86.67%		Proprietary
o1-mini	84.94%	2024-09-27	85.88%	85.0%	85.48%	83.82%	87.62%	72.0%	Proprietary
llama-3.1-405b-instruct	84.0%	2024-08-24	83.53%	81.25%	83.87%	88.24%	87.62%	68.0%	Open Source
gemini-pro-1.5	83.53%	2024-08-24	83.53%	83.75%	82.26%	86.76%	85.71%	68.0%	Proprietary
hermes-3-llama-3.1-405b	82.82%	2024-08-24	81.18%	82.5%	87.1%	85.29%	83.81%	68.0%	Open Source
qwen-2-72b-instruct	82.59%	2024-08-24	82.35%	82.5%	82.26%	85.29%	85.71%	64.0%	Open Source
centeotl-api-llama	81.88%	2024-11-17	85.88%	85.0%	82.26%	79.41%	83.81%	56.0%	Proprietary
llama-3-70b-instruct	81.5%	2024-06-14	78.82%	78.75%	82.26%	83.82%	83.81%		Open Source
gpt-4o-mini	80.47%	2024-08-24	77.65%	85.0%	75.81%	82.35%	81.9%	76.0%	Proprietary
llama-3.1-70b-instruct	80.23%	2024-08-24	75.29%	81.25%	82.26%	89.71%	80.95%	60.0%	Open Source
gemini-flash-1.5	79.0%	2024-06-14	74.12%	76.25%	83.87%	85.29%	78.1%		Proprietary
mistral-large	78.12%	2024-08-24	75.29%	77.5%	82.26%	76.47%	81.9%	68.0%	Open Source
claude-3-haiku	75.25%	2024-06-14	71.76%	73.75%	79.03%	72.06%	79.05%		Proprietary
phi-3-medium-128k-instruct	74.35%	2024-08-24	70.59%	77.5%	79.03%	75.0%	73.33%	68.0%	Open Source
nous-hermes-yi-34b	74.35%	2024-08-24	70.59%	76.25%	83.87%	72.06%	74.29%	64.0%	Open Source
grok-beta	71.29%	2024-10-21	72.94%	68.75%	67.74%	67.65%	76.19%	72.0%	Proprietary
yi-34b-chat	70.75%	2024-06-14	68.24%	68.75%	79.03%	70.59%	69.52%		Open Source
phi-3-mini-128k-instruct	67.5%	2024-06-14	60.0%	71.25%	67.74%	69.12%	69.52%		Open Source
gpt-3.5-turbo	64.94%	2024-08-24	62.35%	61.25%	70.97%	72.06%	70.48%	28.0%	Proprietary
llama-3-8b-instruct:nitro	63.5%	2024-06-14	54.12%	68.75%	61.29%	72.06%	62.86%		Open Source
hermes-2-pro-llama-3-8b	62.0%	2024-06-14	57.65%	57.5%	62.9%	66.18%	65.71%		Open Source
dhenu2-in-8b-preview	61.88%	2024-11-14	52.94%	67.5%	67.74%	61.76%	66.67%	40.0%	Open Source
llama-3.1-8b-instruct	59.53%	2024-08-24	51.76%	58.75%	61.29%	66.18%	59.05%	68.0%	Open Source
mistral-7b-instruct	51.53%	2024-08-24	41.18%	50.0%	62.9%	60.29%	53.33%	32.0%	Open Source
mistral-medium	29.18%	2024-08-24	30.59%	23.75%	20.97%	41.18%	34.29%	8.0%	Open Source
mixtral-8x7b-instruct	18.35%	2024-08-24	16.47%	13.75%	17.74%	14.71%	26.67%	16.0%	Open Source

What is this?

We are benchmarking the ability for different models to give correct answers to Agronomy questions. This is a simple, 98 multiple-choice question benchmark today, and I plan to make it more complete and challenging in the future.

Why?

When building new models for agriculture, it's important to know if your model is getting better or worse. This is a simple benchmark to help us determine if we are improving the agronomic ability of new models and by how much.

Roadmap

Make it harder! These are fairly basic questions. We should add short and long answer questions (to be evaluated against example correct answers)
Add questions for international regions
Add more models to the leaderboard

Updates

2024-08-16

Thank you to , who contributed community questions!
Benchmarks have been run against the new community questions on select models.
Nous Hermes 3 405b added & benchmarked.

2024-07-24

Added Meta Llama 3.1 models
Added OpenAI GPT4o-mini

2024-06-15

Added 295 more questions to the benchmark.
Added quesiton cateogires
Re-ran with all models
Added graphs as output for visual comparison.

2024-01-17

Updated benchmark questions to remove incorrectly formed questions (for eaxample, the most missed question across all models was "e. both symptoms occur across the field and stunted roots", which is clearly not a properly formed question).
Included chat prompt templates for models that require chat templates.
Re-ran benchmark against all models after fixes in place and updated leaderboard.

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
benchmark_results		benchmark_results
.gitignore		.gitignore
README.md		README.md
main.py		main.py
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Leaderboard

What is this?

Why?

Roadmap

Updates

2024-08-16

2024-07-24

2024-06-15

2024-01-17

About

Releases

Packages

Languages

gbstox/agronomy_llm_benchmarking

Folders and files

Latest commit

History

Repository files navigation

Leaderboard

What is this?

Why?

Roadmap

Updates

2024-08-16

2024-07-24

2024-06-15

2024-01-17

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages