Benchmark · 7,432 Species · 4 Domains

AgriTaxon

Can Large Multimodal Models Identify What They See in Agriculture?

Under Review

1,971 Crops · 178 Livestock · 3,485 Pests · 1,798 Weeds

Bridging the Seeing-Without-Naming Gap

In agriculture, correctly naming a species from its image—what we call open-ended taxonomic naming—is the starting point for intelligent agriculture, from choosing a pesticide to enforcing quarantine. Yet no existing benchmark directly tests whether Large Multimodal Models (LMMs) can perform this task. Existing agricultural datasets each cover a single domain (8–102 classes), lack links to standard databases, and evaluate only closed-set classification—leaving untested whether models can actually name what they see.

AgriTaxon fills this gap by unifying four agricultural domains—Crops, Livestock, Pests, and Weeds—into a single benchmark with 7,432 species, each linked to authoritative FAO and EPPO databases via Wikidata. We evaluate 14 LMMs under both multiple-choice (with semantically hard negatives) and open-ended naming, and introduce an LLM-as-a-Judge scoring protocol that achieves 98% agreement with domain experts.

7,432 Species · 4 Domains · 14 LMMs Evaluated

Leaderboard

Accuracy (%) on AgriTaxon for 14 LMMs, ranked by open-ended Acc within each block. Open-ended reports both EM (exact match) and Acc (with LLM-as-a-Judge alias validation). Bold = best per column.

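MC = Hard Multi-Choice, OE = Open-Ended.

| Model | Release | Hard | MC Crop | MC Live. | MC Pest | MC Weed | MC Mean | OE Crop | OE Live. | OE Pest | OE Weed | OE EM | OE Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | | | | | |
| gemini-3-pro-preview | 2025.11 | 9.0 | 83.7 | 95.5 | 72.5 | 78.1 | 82.5 | 43.6 | 49.4 | 30.0 | 53.0 | 44.0 | 51.2 |
| doubao-seed-2-0-pro | 2026.02 | 6.0 | 81.3 | 88.8 | 70.9 | 76.6 | 79.4 | 45.3 | 34.8 | 39.6 | 56.6 | 44.1 | 48.6 |
| gemini-3-flash-preview | 2025.12 | 8.7 | 84.9 | 95.1 | 72.7 | 79.1 | 83.0 | 22.4 | 37.6 | 22.9 | 50.4 | 33.4 | 48.1 |
| doubao-seed-2-0-lite | 2026.02 | 5.6 | 78.1 | 87.6 | 70.1 | 74.7 | 77.6 | 37.3 | 33.7 | 30.5 | 48.8 | 37.6 | 44.0 |
| gpt-5 | 2025.08 | 8.4 | 79.1 | 92.9 | 68.8 | 73.3 | 78.6 | 30.6 | 34.3 | 17.2 | 36.2 | 29.6 | 37.6 |
| gpt-5-mini | 2025.08 | 7.7 | 71.6 | 87.5 | 62.7 | 64.6 | 71.6 | 28.4 | 26.4 | 9.8 | 23.6 | 22.0 | 27.4 |
| claude-haiku-4-5 | 2025.10 | 4.7 | 58.1 | 74.5 | 52.9 | 56.2 | 60.4 | 12.5 | 19.7 | 4.0 | 8.6 | 11.2 | 14.7 |
| Open-Source Models | | | | | | | | | | | | | |
| kimi-k2.5 | 2026.01 | 2.1 | 75.0 | 86.0 | 65.8 | 68.0 | 73.7 | 29.8 | 31.0 | 20.0 | 39.2 | 30.0 | 38.0 |
| glm-4.6v | 2025.12 | 6.2 | 66.7 | 77.0 | 57.9 | 58.6 | 65.1 | 33.2 | 20.7 | 10.4 | 26.6 | 22.7 | 30.1 |
| qwen3-vl-235b-a22b | 2025.09 | 2.2 | 66.4 | 81.5 | 63.2 | 62.5 | 68.4 | 23.5 | 23.9 | 13.5 | 25.8 | 21.7 | 27.5 |
| qwen3.5-397b-a17b | 2026.02 | 9.2 | 70.5 | 85.4 | 65.8 | 65.8 | 71.9 | 25.3 | 22.8 | 13.8 | 24.8 | 21.7 | 26.8 |
| qwen3-vl-30b-a3b | 2025.10 | 2.2 | 59.2 | 70.1 | 55.9 | 54.6 | 59.9 | 19.9 | 24.2 | 5.3 | 14.6 | 16.0 | 22.5 |
| glm-4.6v-flashx | 2025.12 | 5.8 | 59.3 | 71.3 | 55.4 | 55.1 | 60.3 | 21.6 | 20.7 | 5.7 | 12.7 | 15.2 | 19.6 |
| qwen3.5-35b-a3b | 2026.02 | 4.2 | 67.1 | 80.9 | 61.4 | 61.7 | 67.8 | 11.3 | 17.4 | 3.9 | 12.6 | 11.3 | 17.4 |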

Hard = AgriTaxon-Hard subset (≤2 models correct, 1,052 samples). EM = exact match after normalization. Acc = LLM-as-a-Judge alias validation (98% expert agreement).
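The footnote's definitions can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the normalization rules (lowercasing, accent and punctuation stripping) and all function names are assumptions, since the page does not spell them out. The `macro_mean` helper reflects an observation from the table itself: the Mean column is consistent with an unweighted average of the four domain scores, for example (83.7 + 95.5 + 72.5 + 78.1) / 4 = 82.45, reported as 82.5.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize a species name for comparison (assumed rules, not the official ones)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, replace anything that is not a letter, digit, or space.
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    # Collapse runs of whitespace.
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """EM: prediction and gold label are identical after normalization."""
    return normalize_name(prediction) == normalize_name(gold)


def hard_subset(correct_by_model: dict[str, list[bool]], max_correct: int = 2) -> list[int]:
    """Indices of samples that at most `max_correct` models answered correctly."""
    n_samples = len(next(iter(correct_by_model.values())))
    return [
        i
        for i in range(n_samples)
        if sum(flags[i] for flags in correct_by_model.values()) <= max_correct
    ]


def macro_mean(domain_scores: list[float]) -> float:
    """Unweighted average over domains, regardless of domain size."""
    return sum(domain_scores) / len(domain_scores)
```

Under these assumptions, `exact_match("Stylosanthes  Capitata.", "stylosanthes capitata")` holds, while a common-name answer for the same species would fail EM and only be credited by the alias-aware Acc metric.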

📬 Submit to Leaderboard: if you would like your model or method to appear on this leaderboard, please contact us at zengx@nercita.org.cn with your evaluation results.

Key Findings

Explore the Embedding Space

7,432 species embedded by Qwen3-Embedding-0.6B and projected via t-SNE.


Error Type Examples

Representative examples from the 75 human-annotated Acc-error cases (Gemini 3 Pro Preview, open-ended evaluation).

Taxonomic Confusion (63%)
Ground truth: Stylosanthes capitata [Crop] → Prediction: Stylosanthes humilis
Same genus within Fabaceae. The model recognizes the correct genus from the pod morphology but confuses species-level features (capitulum density, beak shape).
Unrelated Prediction (13%)
Ground truth: Gleditsia triacanthos [Crop] → Prediction: Black walnut
Cross-family error: Fabaceae vs. Juglandaceae. Both are deciduous North American trees, but they belong to different families and are morphologically distinct.
Granularity Mismatch (7%)
Ground truth: Swiss Warmblood [Livestock] → Prediction: Horse
The model correctly identifies the animal as a horse but fails to specify the breed, producing a species-level answer where a breed-level answer is required.
Other Errors (17%)
No Answer (9%) — the model fails to produce any species name, typically for obscure taxa absent from training data.
Parsing Error (8%) — the model's response is truncated or malformed, outputting non-species text instead of a valid name.

How AgriTaxon Works

1. Authority-grounded data collection. We query Wikidata for species that carry both an authoritative database identifier (FAO Ecocrop, FAO DAD-IS, or EPPO) and a Wikimedia Commons image, forming a traceable authority chain for every label.
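A query of this shape can be sketched as a SPARQL string built in Python. This is an illustration under assumptions, not the authors' collection script: `P18` (image) and `P225` (taxon name) are standard Wikidata properties, while `P3031` is assumed to be the EPPO Code property and should be checked on wikidata.org before use. The helper name `build_species_query` is hypothetical. The taxon name is optional because livestock breeds may not carry one.

```python
import re

IMAGE_PROPERTY = "P18"        # Wikidata "image" (a Wikimedia Commons file)
TAXON_NAME_PROPERTY = "P225"  # Wikidata "taxon name"
EPPO_CODE_PROPERTY = "P3031"  # ASSUMPTION: property ID for "EPPO Code"; verify before use


def build_species_query(authority_property: str, limit: int = 100) -> str:
    """Build a SPARQL query for items with both an authority ID and an image."""
    if not re.fullmatch(r"P\d+", authority_property):
        raise ValueError(f"not a Wikidata property ID: {authority_property!r}")
    return f"""
SELECT ?item ?taxonName ?authorityId ?image WHERE {{
  ?item wdt:{authority_property} ?authorityId ;
        wdt:{IMAGE_PROPERTY} ?image .
  OPTIONAL {{ ?item wdt:{TAXON_NAME_PROPERTY} ?taxonName . }}
}}
LIMIT {limit}
""".strip()


query = build_species_query(EPPO_CODE_PROPERTY, limit=5)
print(query)
```

The resulting string could be submitted to the Wikidata Query Service; each returned row then pairs a Wikidata item with its authority identifier and image, which is the traceable chain the step describes.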

2. Cross-domain coverage. The 48,950+ EPPO entries, which mix pests, weeds, pathogens, and host plants, are classified by an LLM into pest and weed tracks. All images undergo resolution filtering (≥224 px) and visual content validation.
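The resolution filter can be illustrated with a short sketch. The page gives only the threshold (≥224px), so applying it to the shorter image side is an assumption here, as are the function names and the example file names.

```python
MIN_SIDE_PX = 224  # threshold stated on the page


def passes_resolution_filter(width: int, height: int, min_side: int = MIN_SIDE_PX) -> bool:
    """Keep an image only if its shorter side meets the threshold (assumed rule)."""
    return min(width, height) >= min_side


def filter_images(sizes: dict[str, tuple[int, int]]) -> list[str]:
    """Return the names of images that pass the filter, in input order."""
    return [name for name, (w, h) in sizes.items() if passes_resolution_filter(w, h)]


# Hypothetical file names and sizes, for illustration only.
example = {"a.jpg": (1024, 768), "b.jpg": (640, 200), "c.jpg": (224, 224)}
print(filter_images(example))
```

Under this rule a wide but short image such as 640×200 is dropped even though its longer side exceeds the threshold.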

3. Dual evaluation protocols. Multiple-choice questions use semantically hard negatives (the top-3 most similar species by text embedding); open-ended questions require the model to produce a free-form species name. An LLM-as-a-Judge protocol handles alias matching (scientific names, common names, synonyms) with 98% expert agreement.
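The hard-negative selection can be sketched as a nearest-neighbor lookup over text embeddings. The snippet below is a minimal illustration that assumes cosine similarity as the similarity measure and uses toy two-dimensional vectors; the function names, the shuffling step, and the choice of metric are assumptions, not details confirmed by the page.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_hard_negatives(target, embeddings, k=3):
    """Return the k species whose embeddings are most similar to the target's."""
    scored = [
        (cosine_similarity(embeddings[target], vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


def build_question(target, embeddings, k=3, seed=0):
    """Combine the target with its hard negatives and shuffle the options."""
    options = [target] + top_k_hard_negatives(target, embeddings, k)
    random.Random(seed).shuffle(options)
    return {"options": options, "answer_index": options.index(target)}


# Toy embeddings for illustration; real ones would come from a text embedding model.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.8, 0.3], "D": [0.0, 1.0], "E": [0.5, 0.5]}
print(top_k_hard_negatives("A", toy))
```

With the toy vectors, the three distractors for `A` are `B`, `C`, and `E`, while the dissimilar `D` is excluded. Distractors chosen this way tend to be close relatives of the target, which is what makes the multiple-choice setting harder than random sampling would.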

Potential Applications

AgriTaxon is designed to support a broad range of research directions across the multimodal AI and agricultural informatics communities.

Open-Ended Visual Recognition

A testbed for models that must produce free-form species names rather than selecting from a fixed label set.

Long-Tail Understanding

Popularity metadata enables controlled study of how accuracy degrades for rare, economically important organisms.

Retrieval-Augmented Generation

Authority-grounded labels (FAO, EPPO, Wikidata QIDs) provide natural retrieval anchors for augmenting LMMs.

Agentic Reasoning

Tool use (e.g., image cropping) significantly boosts accuracy, motivating system-level augmentation research.

Agricultural AI Deployment

Pest surveillance, quarantine enforcement, crop variety verification, and livestock breed identification.

Fine-Grained Classification

Semantically hard negatives and cross-domain coverage make AgriTaxon a challenging fine-grained visual classification (FGVC) benchmark.

Licensing & Access

AgriTaxon is publicly available and free for academic and research use.

The dataset is hosted on Hugging Face at Xin1818/AgriTaxon and can be downloaded freely without registration.

Getting Started

# Download the dataset from Hugging Face
pip install huggingface_hub
huggingface-cli download Xin1818/AgriTaxon --repo-type dataset --local-dir dataset
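The same download can be done from Python with `snapshot_download` from the `huggingface_hub` package. This is a minimal sketch: the wrapper name `download_agritaxon` is hypothetical, and the import is placed inside the function so the snippet can be loaded without triggering a download.

```python
REPO_ID = "Xin1818/AgriTaxon"  # dataset repository named on this page


def download_agritaxon(local_dir: str = "dataset") -> str:
    """Download the full dataset snapshot and return the local folder path."""
    # Imported lazily; requires `pip install huggingface_hub`.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)
```

Calling `download_agritaxon()` mirrors the CLI command above and writes the files to the `dataset` directory.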