Model Name | Overall Score | Date Tested | nutrient_management | soil_and_water | pest_management | crop_management | V1_benchmark_questions | community_questions_FBN | License Type |
---|---|---|---|---|---|---|---|---|---|
o1-preview | 93.18% | 2024-09-27 | 94.12% | 97.5% | 91.94% | 95.59% | 90.48% | 84.0% | Proprietary |
claude-3.5-sonnet | 89.41% | 2024-08-24 | 87.06% | 92.5% | 90.32% | 89.71% | 89.52% | 84.0% | Proprietary |
gpt-4o | 88.71% | 2024-08-24 | 87.06% | 88.75% | 90.32% | 88.24% | 91.43% | 80.0% | Proprietary |
norm | 88.0% | 2024-08-24 | 82.35% | 90.0% | 88.71% | 89.71% | 89.52% | 88.0% | Proprietary |
claude-3-opus | 87.53% | 2024-08-24 | 88.24% | 86.25% | 85.48% | 92.65% | 86.67% | 84.0% | Proprietary |
centeotl-api | 85.65% | 2024-11-14 | 88.24% | 85.0% | 88.71% | 80.88% | 88.57% | 72.0% | Proprietary |
llama-3.1-sonar-huge-128k-online | 85.41% | 2024-08-24 | 83.53% | 81.25% | 82.26% | 89.71% | 89.52% | 84.0% | Proprietary |
gpt-4 | 85.0% | 2024-06-14 | 83.53% | 83.75% | 83.87% | 86.76% | 86.67% | | Proprietary |
o1-mini | 84.94% | 2024-09-27 | 85.88% | 85.0% | 85.48% | 83.82% | 87.62% | 72.0% | Proprietary |
llama-3.1-405b-instruct | 84.0% | 2024-08-24 | 83.53% | 81.25% | 83.87% | 88.24% | 87.62% | 68.0% | Open Source |
gemini-pro-1.5 | 83.53% | 2024-08-24 | 83.53% | 83.75% | 82.26% | 86.76% | 85.71% | 68.0% | Proprietary |
hermes-3-llama-3.1-405b | 82.82% | 2024-08-24 | 81.18% | 82.5% | 87.1% | 85.29% | 83.81% | 68.0% | Open Source |
qwen-2-72b-instruct | 82.59% | 2024-08-24 | 82.35% | 82.5% | 82.26% | 85.29% | 85.71% | 64.0% | Open Source |
centeotl-api-llama | 81.88% | 2024-11-17 | 85.88% | 85.0% | 82.26% | 79.41% | 83.81% | 56.0% | Proprietary |
llama-3-70b-instruct | 81.5% | 2024-06-14 | 78.82% | 78.75% | 82.26% | 83.82% | 83.81% | | Open Source |
gpt-4o-mini | 80.47% | 2024-08-24 | 77.65% | 85.0% | 75.81% | 82.35% | 81.9% | 76.0% | Proprietary |
llama-3.1-70b-instruct | 80.23% | 2024-08-24 | 75.29% | 81.25% | 82.26% | 89.71% | 80.95% | 60.0% | Open Source |
gemini-flash-1.5 | 79.0% | 2024-06-14 | 74.12% | 76.25% | 83.87% | 85.29% | 78.1% | | Proprietary |
mistral-large | 78.12% | 2024-08-24 | 75.29% | 77.5% | 82.26% | 76.47% | 81.9% | 68.0% | Open Source |
claude-3-haiku | 75.25% | 2024-06-14 | 71.76% | 73.75% | 79.03% | 72.06% | 79.05% | | Proprietary |
phi-3-medium-128k-instruct | 74.35% | 2024-08-24 | 70.59% | 77.5% | 79.03% | 75.0% | 73.33% | 68.0% | Open Source |
nous-hermes-yi-34b | 74.35% | 2024-08-24 | 70.59% | 76.25% | 83.87% | 72.06% | 74.29% | 64.0% | Open Source |
grok-beta | 71.29% | 2024-10-21 | 72.94% | 68.75% | 67.74% | 67.65% | 76.19% | 72.0% | Proprietary |
yi-34b-chat | 70.75% | 2024-06-14 | 68.24% | 68.75% | 79.03% | 70.59% | 69.52% | | Open Source |
phi-3-mini-128k-instruct | 67.5% | 2024-06-14 | 60.0% | 71.25% | 67.74% | 69.12% | 69.52% | | Open Source |
gpt-3.5-turbo | 64.94% | 2024-08-24 | 62.35% | 61.25% | 70.97% | 72.06% | 70.48% | 28.0% | Proprietary |
llama-3-8b-instruct:nitro | 63.5% | 2024-06-14 | 54.12% | 68.75% | 61.29% | 72.06% | 62.86% | | Open Source |
hermes-2-pro-llama-3-8b | 62.0% | 2024-06-14 | 57.65% | 57.5% | 62.9% | 66.18% | 65.71% | | Open Source |
dhenu2-in-8b-preview | 61.88% | 2024-11-14 | 52.94% | 67.5% | 67.74% | 61.76% | 66.67% | 40.0% | Open Source |
llama-3.1-8b-instruct | 59.53% | 2024-08-24 | 51.76% | 58.75% | 61.29% | 66.18% | 59.05% | 68.0% | Open Source |
mistral-7b-instruct | 51.53% | 2024-08-24 | 41.18% | 50.0% | 62.9% | 60.29% | 53.33% | 32.0% | Open Source |
mistral-medium | 29.18% | 2024-08-24 | 30.59% | 23.75% | 20.97% | 41.18% | 34.29% | 8.0% | Open Source |
mixtral-8x7b-instruct | 18.35% | 2024-08-24 | 16.47% | 13.75% | 17.74% | 14.71% | 26.67% | 16.0% | Open Source |
We are benchmarking the ability of different models to answer agronomy questions correctly. Today this is a simple benchmark of 98 multiple-choice questions; I plan to make it more complete and challenging in the future.
When building new models for agriculture, it's important to know whether your model is getting better or worse. This benchmark helps us determine whether we are improving the agronomic ability of new models, and by how much.
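To make the grading concrete, here is a minimal sketch of how a multiple-choice item could be scored. The file layout, field names (`id`, `correct`), and letter-matching regex are illustrative assumptions, not the repository's actual evaluation code.

```python
# Minimal sketch of grading multiple-choice answers (hypothetical field names;
# the benchmark's real question format and grading code may differ).
import json
import re

def grade_answer(model_output: str, correct_choice: str) -> bool:
    """Return True if the model's reply selects the correct choice letter."""
    # Look for a leading choice letter such as "b" or "b." in the reply.
    match = re.match(r"\s*([a-e])\b", model_output.strip().lower())
    return bool(match) and match.group(1) == correct_choice.lower()

def score_benchmark(path: str, answers: dict[str, str]) -> float:
    """Compute the fraction of questions answered correctly.

    `path` points to a JSON list of {"id", "question", "correct"} records,
    and `answers` maps question id -> raw model output (both assumptions).
    """
    with open(path) as f:
        questions = json.load(f)
    correct = sum(
        grade_answer(answers.get(q["id"], ""), q["correct"]) for q in questions
    )
    return correct / len(questions)
```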
- Make it harder! These are fairly basic questions. We should add short- and long-answer questions (to be evaluated against example correct answers).
- Add questions for international regions
- Add more models to the leaderboard
- Thank you to , who contributed community questions!
- Benchmarks have been run against the new community questions on select models.
- Nous Hermes 3 405b added & benchmarked.
- Added Meta Llama 3.1 models
- Added OpenAI GPT-4o mini
- Added 295 more questions to the benchmark.
- Added question categories
- Re-ran with all models
- Added graphs as output for visual comparison.
- Updated benchmark questions to remove incorrectly formed questions (for example, the most-missed question across all models was "e. both symptoms occur across the field and stunted roots", which is clearly not a properly formed question).
- Included chat prompt templates for models that require them (see the sketch after this list).
- Re-ran the benchmark against all models after the fixes were in place and updated the leaderboard.
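As a point of reference, formatting a question with a model's chat template typically looks like the sketch below. This is a generic Hugging Face `transformers` example with a placeholder model id and question text; it is not necessarily how this benchmark builds its prompts.

```python
# Illustrative only: render a benchmark-style question with a model's chat template.
# The model id and question text below are placeholders, not the benchmark's own.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "Answer the multiple-choice agronomy question with a single letter."},
    {"role": "user", "content": "Which nutrient deficiency is most associated with interveinal chlorosis?\na. nitrogen\nb. iron\nc. potassium"},
]

# Produce the prompt string the model expects, ending with the assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```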