Fine-grained knowledge acquisition benchmark

FIKA-Bench

From fine-grained recognition to evidence-grounded knowledge acquisition.

FIKA-Bench asks multimodal models and agents to identify unfamiliar fine-grained targets by searching, verifying, and using external evidence, rather than relying on memorized benchmark labels.

311 Samples
4 Domains
17 Subcategories
228 Fine-grained answers
100% Evidence coverage

Case Studies

Recognition vs. knowledge acquisition.

Four real FIKA-Bench cases, one from each broad domain. For each case, the closed-set model's output is condensed to its visual-only failure, while the agent view shows the retrieved result thumbnails, fetched URLs, page evidence, and the verified answer.

Results

Fine-grained knowledge acquisition remains challenging.

Overall Model Accuracy

Strict accuracy (%), sorted from highest to lowest.
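
The page does not spell out how strict accuracy is scored. One plausible reading is exact match between the predicted and reference fine-grained label after light normalization. The sketch below assumes that reading; `normalize`, `strict_accuracy`, and the example labels are hypothetical and are not taken from the benchmark's evaluation code.

```python
import unicodedata


def normalize(label: str) -> str:
    """Apply Unicode NFKC, case-fold, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", label).casefold()
    return " ".join(text.split())


def strict_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Percentage of predictions that equal the reference after normalization.

    Assumption: "strict" means exact match, so a coarser or partial label
    counts as wrong.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return 100.0 * hits / len(answers)


# Hypothetical labels, for illustration only.
example_predictions = ["Boeing 737-800", "boeing 737", "  Airbus  A320 "]
example_answers = ["boeing 737-800", "Boeing 737-800", "Airbus A320"]
example_score = strict_accuracy(example_predictions, example_answers)
```

Under this assumption the second prediction is counted as wrong because it names only the coarser family, which is the kind of near miss a strict metric is meant to penalize.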

Kimi-K2.6 (Open): 25.1
Gemini-3.1 (Closed): 20.6
OpenClaw Qwen3.5 (Agent): 19.0
Qwen3.5-397B (Open): 18.3
GPT-5-mini (Closed): 17.4
GLM-5V-Turbo (Closed): 15.8
Qwen3-VL-235B (Open): 14.8
OpenClaw Qwen3-VL (Agent): 13.8
OpenClaw MiniMax (Agent): 12.5
Fine-R1-7B (Fine): 11.6
OpenCode MiniMax (Agent): 10.3
Qwen3.5-9B (Open): 9.3
Qwen3-VL-8B (Open): 9.0
OpenCode Qwen3-VL (Agent): 8.4
VisualRFT-7B (Fine): 6.8

System types: Open (open-source), Closed (closed-source), Agent, Fine (fine-grained).
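| Rank | System | Type | Public Prod. | Public Nat. | Public Trans. | Public Cult. | Public Avg. | Real-Life Prod. | Real-Life Nat. | Real-Life Trans. | Real-Life Cult. | Real-Life Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Kimi-K2.6 | Open | 0.0 | 26.0 | 44.6 | 6.5 | 21.5 | 23.1 | 40.9 | 33.3 | 32.3 | 31.0 | 25.1 |
| 2 | Gemini-3.1-Flash-Lite | Closed | 3.7 | 20.0 | 37.5 | 1.6 | 16.9 | 25.6 | 36.4 | 29.2 | 19.4 | 26.7 | 20.6 |
| 3 | OpenClaw + Qwen3.5-397B-A17B | Agent | 7.4 | 16.0 | 25.0 | 3.2 | 13.3 | 23.1 | 40.9 | 33.3 | 22.6 | 28.4 | 19.0 |
| 4 | Qwen3.5-397B-A17B | Open | 11.1 | 12.0 | 28.6 | 3.2 | 13.8 | 20.5 | 31.8 | 33.3 | 22.6 | 25.9 | 18.3 |
| 5 | GPT-5-mini | Closed | 3.7 | 12.0 | 41.1 | 0.0 | 15.4 | 15.4 | 45.5 | 25.0 | 6.5 | 20.7 | 17.4 |
| 6 | GLM-5V-Turbo | Closed | 3.7 | 16.0 | 23.2 | 4.8 | 12.8 | 10.3 | 40.9 | 33.3 | 9.7 | 20.7 | 15.8 |
| 7 | Qwen3-VL-235B-A22B | Open | 11.1 | 10.0 | 25.0 | 0.0 | 11.3 | 12.8 | 31.8 | 20.8 | 22.6 | 20.7 | 14.8 |
| 8 | OpenClaw + Qwen3-VL-8B | Agent | 0.0 | 4.0 | 25.0 | 0.0 | 8.2 | 20.5 | 40.9 | 16.7 | 19.4 | 23.3 | 13.8 |
| 9 | OpenClaw + MiniMax-M2.7/Qwen3-VL-8B | Agent | 0.0 | 6.0 | 21.4 | 0.0 | 7.7 | 17.9 | 27.3 | 25.0 | 16.1 | 20.7 | 12.5 |
| 10 | Fine-R1-7B | Fine | 3.7 | 14.0 | 21.4 | 3.2 | 11.3 | 7.7 | 18.2 | 16.7 | 9.7 | 12.1 | 11.6 |
| 11 | OpenCode + MiniMax-M2.7/Qwen3-VL-8B | Agent | 0.0 | 12.0 | 16.1 | 0.0 | 7.7 | 7.7 | 27.3 | 20.8 | 9.7 | 14.7 | 10.3 |
| 12 | Qwen3.5-9B | Open | 0.0 | 2.0 | 17.9 | 0.0 | 5.6 | 7.7 | 45.5 | 12.5 | 6.5 | 15.5 | 9.3 |
| 13 | Qwen3-VL-8B | Open | 0.0 | 6.0 | 17.9 | 0.0 | 6.7 | 7.7 | 27.3 | 12.5 | 9.7 | 12.9 | 9.0 |
| 14 | OpenCode + Qwen3-VL-8B | Agent | 3.7 | 4.0 | 19.6 | 1.6 | 7.7 | 5.1 | 22.7 | 16.7 | 0.0 | 9.5 | 8.4 |
| 15 | VisualRFT-7B | Fine | 3.7 | 8.0 | 17.9 | 0.0 | 7.7 | 5.1 | 4.5 | 4.2 | 6.5 | 5.2 | 6.8 |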
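
The Overall column appears to be consistent with a sample-weighted mean of the Public and Real-Life averages, using the split sizes reported in the Dataset section (195 public, 116 real-life). That relationship is an inference from the numbers, not something the page states. A minimal sketch that checks it on three rows copied from the leaderboard (`weighted_overall` is a hypothetical helper):

```python
PUBLIC_SAMPLES = 195      # public-dataset samples, from the Dataset section
REAL_LIFE_SAMPLES = 116   # real-life samples, from the Dataset section

# (system, Public Avg., Real-Life Avg., reported Overall), copied from the leaderboard
ROWS = [
    ("Kimi-K2.6", 21.5, 31.0, 25.1),
    ("Gemini-3.1-Flash-Lite", 16.9, 26.7, 20.6),
    ("VisualRFT-7B", 7.7, 5.2, 6.8),
]


def weighted_overall(public_avg: float, real_life_avg: float) -> float:
    """Sample-weighted mean of the two split averages (assumed relationship)."""
    total = PUBLIC_SAMPLES + REAL_LIFE_SAMPLES
    return (PUBLIC_SAMPLES * public_avg + REAL_LIFE_SAMPLES * real_life_avg) / total


# Gap between the recomputed and the reported Overall value, per system.
gaps = {
    name: abs(weighted_overall(public_avg, real_life_avg) - overall)
    for name, public_avg, real_life_avg, overall in ROWS
}
```

Because the split averages are already rounded to one decimal, the recomputed values can differ from the reported ones by a small rounding margin rather than matching exactly.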

What makes FIKA-Bench different?

It combines visual input, fine-grained labels, external tools, and a leakage-aware closed-book filter. This targets active knowledge acquisition rather than static fine-grained classification alone.


Dataset

Evidence-grounded samples across 4 fine-grained domains.

Source split

Public datasets: 195 samples (62.70%)
Real-life collection: 116 samples (37.30%)
Languages: English and Chinese

Leakage-aware curation

  1. Filter memorized cases with closed-book model checks.
  2. Audit reverse-image-search leakage for public-source images.
  3. Keep only samples with verified evidence for the final answer.
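
The three curation steps above can be read as a sequence of filters over candidate samples. The sketch below is only an illustration of that reading: the `Candidate` record, its field names, and the `keep` rule are hypothetical and do not come from the benchmark's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical record for one candidate sample."""
    sample_id: str
    from_public_dataset: bool
    closed_book_correct: bool   # a closed-book model already names the target
    reverse_search_leak: bool   # reverse image search exposes the label
    evidence_urls: list[str] = field(default_factory=list)


def keep(candidate: Candidate) -> bool:
    """Apply the three curation steps in order."""
    # Step 1: drop cases that models answer from memory.
    if candidate.closed_book_correct:
        return False
    # Step 2: drop public-source images whose label leaks via reverse image search.
    if candidate.from_public_dataset and candidate.reverse_search_leak:
        return False
    # Step 3: require at least one verified evidence URL for the final answer.
    return len(candidate.evidence_urls) > 0


candidates = [
    Candidate("a", True, False, False, ["https://example.org/a"]),
    Candidate("b", True, True, False, ["https://example.org/b"]),   # memorized
    Candidate("c", True, False, True, ["https://example.org/c"]),   # leaked
    Candidate("d", False, False, False, []),                        # no evidence
    Candidate("e", False, False, True, ["https://example.org/e"]),
]
kept_ids = [c.sample_id for c in candidates if keep(c)]
```

In this sketch the reverse-image-search audit applies only to public-source images, matching the wording of step 2, so sample "e" is kept.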

Evidence sources

Evidence URLs: 319
Unique domains: 120
Average URLs per sample: 1.03
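
The reported average is consistent with 319 URLs over 311 samples (319 / 311 ≈ 1.03). A minimal sketch of how such statistics could be derived from a mapping of sample IDs to evidence URLs follows; the `evidence_stats` helper, the input layout, and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urlparse


def evidence_stats(evidence_by_sample: dict[str, list[str]]) -> dict[str, float]:
    """Summarize evidence URLs: totals, unique domains, average, and coverage."""
    n_samples = len(evidence_by_sample)
    urls = [url for sample_urls in evidence_by_sample.values() for url in sample_urls]
    # Treat the lower-cased network location (host) as the domain.
    domains = {urlparse(url).netloc.lower() for url in urls}
    covered = sum(1 for sample_urls in evidence_by_sample.values() if sample_urls)
    return {
        "evidence_urls": len(urls),
        "unique_domains": len(domains),
        "avg_urls_per_sample": len(urls) / n_samples if n_samples else 0.0,
        "coverage_pct": 100.0 * covered / n_samples if n_samples else 0.0,
    }


# Hypothetical input, for illustration only.
demo = {
    "s1": ["https://example.org/a", "https://example.com/b"],
    "s2": ["https://EXAMPLE.org/c"],
}
demo_stats = evidence_stats(demo)

# The figure reported on the page, recomputed from its own totals.
reported_average = round(319 / 311, 2)
```

Counting hosts rather than registered domains is a simplification; subdomains of the same site would be counted separately here.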
[Figure: FIKA-Bench taxonomy distribution. Circular chart of the four broad domains and seventeen subcategories, 311 samples in total.]

Citation

@misc{li2026fikabenchfinegrainedrecognitionfinegrained,
  title={FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition},
  author={Geng Li and Yuxin Peng},
  year={2026},
  eprint={2605.13193},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.13193}
}