AgriTaxon — Supplementary Material

Prompt templates, cost analysis, additional experiments, and error examples

← Back to project homepage
Contents
  1. Prompt Templates
  2. EPPO Entity Classification Prompt
  3. API Cost Breakdown
  4. Resolution Analysis
  5. Agentic Baseline: Tool & Prompt Details
  6. Error Type Examples

Prompt Templates

All prompts are designed for rigorous, reproducible evaluation. Each explicitly defines the task scope to prevent ambiguity. Raw files are available in the prompts/ directory.

System Prompt

System Prompt
You are an expert in agricultural species identification.

Track-Specific Scope Definitions

TrackScope Definition
CropAgricultural CROP (cultivated plant used for food, feed, fiber, oil, etc.)
LivestockLIVESTOCK (domesticated farm animal); label may be a species or a breed
PestAgricultural PEST (typically an arthropod such as insect/mite, or a nematode) harmful to crops
WeedWEED or INVASIVE PLANT (unwanted plant species in agricultural/managed environments)

Open-Ended Prompt Template

Open-Ended Prompt
You are doing agricultural species recognition (AgriTaxon).
Track type: {track}

Scope definition:
{scope}

Task: Identify the entity shown in the image within the scope above.

OUTPUT CONSTRAINT (EXACT-MATCH):
- You may think or write analysis before the final answer.
- Output the answer in the end of your response as:

detailed thinking and analysis ...
answer: <CANONICAL_NAME>

Multi-Choice Prompt Template

Multi-Choice Prompt
You are doing agricultural species recognition (AgriTaxon).
Track type: {track}

Scope definition:
{scope}

Task: Identify the entity shown in the image within the scope above.

Options:
A. {option_a}
B. {option_b}
C. {option_c}
D. {option_d}

OUTPUT CONSTRAINT:
- You may think or write analysis before the final answer.
- Output the answer in the end of your response as:

detailed thinking and analysis ...
answer: A/B/C/D

Design Rationale

Our prompt design ensures fair evaluation through three principles: (1) Scope-constrained canonical naming — each track defines the entity category and requires the canonical scientific or common name; (2) Structured output — the answer: format enables reliable extraction and exact-match scoring; (3) Thinking allowance — models are explicitly permitted to reason before answering, leveraging chain-of-thought capabilities.

EPPO Entity Classification Prompt

To classify EPPO database entities into pest vs. weed categories (Section 3.1 of the main paper), we use GPT-5 Mini with the following system prompt. For each entity, the Wikipedia content is provided in the user message, and the model outputs one of: pest, weed_invasive, pathogen, host_or_crop, or other.

Classification Prompt
You are an expert agricultural taxonomist. Your task is to classify organisms into one of the following categories based on their Wikipedia description.

## Classification Categories

### pest (Agricultural Pest)
An Agricultural Pest is an animal organism that causes direct damage to crops, forest trees, pasture, stored products, or other cultivated plants through feeding or other destructive behavior. Typical examples include insects, mites, nematodes, molluscs (such as snails and slugs), and in some cases vertebrates such as rodents or birds when they are recognized as agricultural pests.

### weed_invasive (Weed and Invasive Plant)
A Weed and Invasive Plant is a plant species that plays a harmful role in agricultural fields, pastures, orchards, forests, or natural ecosystems by growing where it is not desired and competing with crops or native vegetation.

### pathogen (Plant Pathogen)
A Plant Pathogen is a microorganism or microbe-like organism that primarily causes disease in plants, leading to symptoms such as leaf spots, blights, wilting, rots, galls, or other pathological changes.

### host_or_crop (Host or Crop Plant)
A Host or Crop Plant is a plant species that is primarily cultivated, managed, or studied as a useful plant in agriculture, forestry, horticulture, or landscaping.

### other (Other or Unclear)
Other or Unclear is a catch-all category for entities that cannot be reliably assigned to the above categories.

## Output Format
Respond with ONLY the category name (one of: pest, weed_invasive, pathogen, host_or_crop, other). Do not include any explanation.

API Cost Breakdown

Total benchmark cost across all 14 models: $1,554.70 USD (including $88.50 for LLM-as-a-Judge alias validation via GPT-5 Mini). Prices without official USD listing (GLM, Doubao) are converted from CNY at 1 USD = 6.91 CNY.

Model In ($/1M) Out ($/1M) MC Cost OE Cost Judge Total
Proprietary Models
gemini-3-pro-preview2.0012.00$224.0$400.1$4.8$629.0
gemini-3-flash-preview0.503.00$32.6$66.4$5.8$104.8
gpt-51.2510.00$72.1$119.8$5.8$197.7
gpt-5-mini0.252.00$20.5$20.9$6.7$48.1
claude-haiku-4-51.005.00$48.9$51.4$7.7$108.0
doubao-seed-2-0-pro0.462.31$22.1$30.3$4.6$57.0
doubao-seed-2-0-lite0.090.52$4.1$4.4$5.2$13.7
Open-Source Models
kimi-k2.50.603.00$44.2$64.4$5.9$114.6
glm-4.6v0.140.43$2.9$4.4$6.5$13.8
qwen3-vl-235b-a22b0.401.60$13.7$15.2$6.6$35.5
qwen3.5-397b-a17b0.603.60$16.6$17.2$6.6$40.4
qwen3-vl-30b-a3b0.200.80$7.7$8.6$7.3$23.6
glm-4.6v-flashx0.020.22$1.1$3.0$7.4$11.5
qwen3.5-35b-a3b0.252.00$45.1$104.3$7.6$157.0
Total Benchmark CostJudge: $88.5$1,554.7

Resolution Analysis

We examined whether image resolution correlates with recognition accuracy. Images in the 0.5–1 MP range have the lowest mean accuracy (3.24 models correct out of 14), while images above 4 MP achieve the highest (3.66 models correct). However, this effect is substantially weaker than the Wikipedia popularity effect (which shows a 2× gap), suggesting that resolution alone does not guarantee recognition success—the model must also have learned about the species during training.

Image resolution vs. recognition accuracy
Image resolution vs. recognition accuracy. Higher resolution shows modest improvement, but the effect is much weaker than species popularity.

Agentic Baseline: Tool & Prompt Details

Complete specification of the image cropping tool and prompts used in the agentic baseline. Raw files are available at prompts/agentic_*.

Tool Definition

image_crop Tool
image_crop: Crop a specific region of the image for closer inspection.

Use this tool when you need to examine fine details like:
- Leaf patterns, textures, or venation
- Insect body parts (antennae, legs, wings)
- Animal markings or features

Input: A bounding box with normalized coordinates (0-1):
  x1 (float): Left boundary   y1 (float): Top boundary
  x2 (float): Right boundary  y2 (float): Bottom boundary

Output: A cropped 448×448 view of the specified region.

Processing Algorithm

Crop Tool Processing Steps
Given bounding box [x1, y1, x2, y2] and image size (W, H):

1. Convert to pixels: px_i = x_i × W,  py_i = y_i × H
2. Compute center and size:
     cx = (px1 + px2) / 2,  cy = (py1 + py2) / 2
     s  = max(px2 - px1, py2 - py1)
3. Make square and expand: s' = s × 1.5
4. Enforce minimum size:   s' = max(s', 224)
5. Compute crop region centered at (cx, cy) with side s'
6. Shift into bounds if crop exceeds image boundaries
7. Crop and resize to 448 × 448 using LANCZOS

Agentic System Prompt

System Prompt (Agentic)
You are an expert in agricultural species identification.

You have access to a tool called `image_crop` that allows you to crop and zoom into specific regions of the image for closer inspection.

Guidelines:
1. First, examine the full image to get an overall understanding.
2. If you're uncertain or need to see fine details (leaf patterns, insect parts, animal markings), use the image_crop tool to zoom in.
3. You can use the tool up to 2 times per image.
4. After examining all necessary details, provide your final answer.

Remember: The crop tool returns a zoomed view of the specified region, allowing you to see details that might not be visible in the full image.

Agentic User Prompt

User Prompt (Agentic Multi-Choice)
You are doing agricultural species recognition (AgriTaxon).
Track type: {track}

Scope definition:
{scope}

Task: Identify the entity shown in the image within the scope above.

Options:
A. {option_a}    B. {option_b}
C. {option_c}    D. {option_d}

INSTRUCTIONS:
1. First analyze the full image. If you can confidently identify the species, provide your answer.
2. If you need to see more details (leaf texture, body parts, markings), use the image_crop tool to zoom into relevant regions.
3. After your analysis, output your final answer as:
   answer: A/B/C/D

Agent Loop Pseudocode

Algorithm: Agent-Based Evaluation with Crop Tool
Input:  Image I, question Q, options {A,B,C,D}, max_calls K=2
Output: Final answer a ∈ {A,B,C,D}

messages ← [SystemPrompt, (I, Q, options)]
tool_count ← 0

while tool_count < K do
    (response, tool_calls) ← LLM(messages, tools=[image_crop])
    if tool_calls = ∅ then
        return ExtractAnswer(response)
    end if
    for each call in tool_calls do
        bbox ← call.arguments
        I_crop ← image_crop(I, bbox)
        Append (response, call) to messages
        Append I_crop as tool result to messages
        tool_count ← tool_count + 1
    end for
end while

response ← LLM(messages, tools=∅)
return ExtractAnswer(response)

Error Type Examples

Representative examples for each of the five error types from the human-annotated 75 Acc-error cases (Gemini 3 Pro Preview, open-ended evaluation).

Taxonomic confusion example
Taxonomic Confusion (63%)
Ground truth: Stylosanthes capitata [Crop] → Prediction: Stylosanthes humilis
Same genus within Fabaceae. The model recognizes the correct genus from the pod morphology but confuses species-level features (capitulum density, beak shape).
Unrelated prediction example
Unrelated Prediction (13%)
Ground truth: Gleditsia triacanthos [Crop] → Prediction: Black walnut
Cross-family error: Fabaceae vs. Juglandaceae. Both are deciduous North American trees, but they belong to different families and are morphologically distinct.
No answer example
No Answer (9%)
Ground truth: Cydia toreuta [Pest] → Prediction: (none)
The model fails to produce any species name. Cydia toreuta is an obscure tortricid moth likely absent from the model's training data.
Parsing error example
Parsing Error (8%)
Ground truth: Licuala ramsayi [Crop] → Prediction: "answer"
The model's response is truncated or malformed, outputting the literal word "answer" instead of a species name.
Granularity mismatch example
Granularity Mismatch (7%)
Ground truth: Swiss Warmblood [Livestock] → Prediction: Horse
The model correctly identifies the animal as a horse but fails to specify the breed, producing a species-level answer where a breed-level answer is required.