AI4S LAB: The World's First "One-Stop" Digital Intelligent Life Science Research Platform. AI4S LAB deeply integrates computing power, data, models, and experiments into a closed-loop process: "theoretical prediction → experimental design → automated execution → data analysis".
The automation of scientific experimentation is hindered by the inability of LLMs to reliably handle accuracy-critical biological protocols. We introduce BioProBench (550k task instances) to expose the reasoning gap, and BioProAgent, a neuro-symbolic framework. By anchoring probabilistic planning in a deterministic Finite State Machine (FSM), our agent ensures hardware compliance and significantly outperforms GPT-4 baselines.
We present BioProBench, the first large-scale resource dedicated to procedural reasoning in biological experimental protocols, containing a BioProCorpus of nearly 27,000 protocols and over 550,000 structured instances, covering diverse subfields of biology.
Overview of BioProBench. (a) A foundational corpus (BioProCorpus) of 27,000 professionally authored protocols; (b) a structured dataset of over 550,000 instances derived from BioProCorpus, partitioned into a training set for model fine-tuning and a held-out test set; and (c) a benchmark that quantifies procedural understanding with novel, domain-specific metrics, including keyword-based content metrics and embedding-based structural metrics.
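The exact metric definitions are not given on this page. As one illustrative reading of a keyword-based content metric, the sketch below scores a generated protocol step by the fraction of reference keywords it mentions. The function names `tokenize` and `keyword_recall`, the single-word keyword assumption, and the example text are all hypothetical and are not taken from BioProBench.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_recall(generated, keywords):
    """Fraction of reference keywords that appear in the generated text.

    Assumption: keywords are single words; multi-word phrases would need
    n-gram matching, which this sketch does not attempt.
    """
    wanted = {k.lower() for k in keywords}
    if not wanted:
        return 0.0
    return len(wanted & tokenize(generated)) / len(wanted)

# Hypothetical example: three of the four reference keywords are present.
generated = "Centrifuge the lysate at 12000 g for 10 min, then discard the supernatant."
keywords = ["centrifuge", "supernatant", "vortex", "lysate"]
score = keyword_recall(generated, keywords)
print(score)
```

A recall-style score rewards covering the expected reagents and actions but says nothing about their order, which is presumably why the benchmark pairs content metrics with structural ones.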
Our benchmark leaderboard provides a comprehensive evaluation of leading LLMs using novel, domain-specific metrics, enabling fine-grained analysis of procedural reasoning performance. It highlights systematic weaknesses in models’ ability to understand, reason about, and generate scientific protocols across four task categories: PQA, ERR, ORD, and GEN.
| Model | Type | PQA (Acc) | ERR (Acc) | ORD (τ) | GEN (BLEU) |
|---|---|---|---|---|---|
| BioProAgent | Our Method | 85.08 🥇 | 81.55 🥇 | 0.891 🥇 | 16.37 🥇 |
| Closed Source Models | | | | | |
| Gemini-2.5-Pro | Proprietary | 70.27 🥈 | 64.83 🥈 | 0.810 🥈 | 7.11 |
| Claude-3.7-Sonnet | Proprietary | 63.90 | 60.93 | 0.734 | 8.38 |
| GPT-4o | Proprietary | 63.50 | 62.67 | 0.627 | 8.92 |
| Gemini-2.0-Flash | Proprietary | 63.44 | 58.67 | 0.637 | 9.18 |
| GPT-4-Turbo | Proprietary | 57.92 | 56.17 | 0.528 | 9.26 |
| o3-mini | Proprietary | 65.67 | 62.33 | 0.733 | 8.69 |
| Open Source Models | | | | | |
| DeepSeek-R1 | Open Source | 67.83 🥉 | 62.92 | 0.745 🥉 | 8.62 |
| DeepSeek-V3 | Open Source | 66.58 | 58.58 | 0.640 | 9.37 🥉 |
| QwQ-32b | Open Source | 63.67 | 63.00 🥉 | 0.705 | 8.40 |
| Qwen-2.5-72b-instruct | Open Source | 65.30 | 59.17 | 0.657 | 10.27 🥈 |
* PQA: Procedural Question Answering, ERR: Error Correction, ORD: Step Ordering, GEN: Protocol Generation.
Data extracted from BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning.
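To make the ORD column concrete, here is a minimal sketch of how a rank-correlation score for step ordering can be computed. It assumes that τ denotes Kendall's tau (the tau-a variant, with no ties) between the predicted and reference step orders; the function name `kendall_tau` and the step labels are illustrative, not part of the benchmark code.

```python
from itertools import combinations

def kendall_tau(predicted, reference):
    """Kendall's tau-a between two orderings of the same unique steps."""
    if sorted(predicted) != sorted(reference):
        raise ValueError("both orderings must contain the same steps")
    n = len(reference)
    if n < 2:
        return 1.0
    # Position of each step in the predicted ordering.
    pos = {step: i for i, step in enumerate(predicted)}
    concordant = discordant = 0
    # Visit every pair in reference order; a pair is concordant when the
    # prediction preserves that relative order.
    for a, b in combinations(reference, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

reference = ["lyse", "centrifuge", "wash", "elute"]
predicted = ["lyse", "wash", "centrifuge", "elute"]  # one adjacent swap
tau = kendall_tau(predicted, reference)
print(round(tau, 3))
```

Under this definition a perfect ordering scores 1, a fully reversed one scores -1, and the single swap above leaves 5 of 6 pairs concordant.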
Overview of BioProAgent. (a) Cognitive Memory utilizes Symbolic Grounding Φ to manage context efficiently; (b) Neural Planner π₀ is grounded in a Design-Verify-Rectify FSM Δ(σ); (c) Hierarchical Verification (Kₛ, Kₚ) acts as a safety interlock, enforcing physical compliance by deterministically triggering rectification.
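The Design-Verify-Rectify loop can be illustrated with a small deterministic state machine wrapped around a planner. This is a sketch under stated assumptions, not BioProAgent's implementation: the state names follow the caption, but `run_fsm`, `check_volume`, the 1000 µL limit, the retry budget, and the stand-in planner functions are all hypothetical.

```python
from enum import Enum, auto

class State(Enum):
    DESIGN = auto()
    VERIFY = auto()
    RECTIFY = auto()
    DONE = auto()
    FAILED = auto()

def check_volume(plan):
    """Hypothetical hardware rule: one pipette step may move at most 1000 uL."""
    return [f"step {i}: {s['volume_ul']} uL exceeds 1000 uL"
            for i, s in enumerate(plan)
            if s.get("action") == "pipette" and s.get("volume_ul", 0) > 1000]

def run_fsm(propose, rectify, checkers, max_rounds=3):
    """Drive the planner through DESIGN -> VERIFY -> (RECTIFY -> VERIFY)*."""
    state, plan, violations, rounds, trace = State.DESIGN, [], [], 0, []
    while state not in (State.DONE, State.FAILED):
        trace.append(state)
        if state is State.DESIGN:
            plan = propose()  # probabilistic planner (an LLM in practice)
            state = State.VERIFY
        elif state is State.VERIFY:
            # Deterministic checks decide the transition, not the planner.
            violations = [v for check in checkers for v in check(plan)]
            if not violations:
                state = State.DONE
            elif rounds >= max_rounds:
                state = State.FAILED
            else:
                state = State.RECTIFY
        else:  # State.RECTIFY
            plan = rectify(plan, violations)
            rounds += 1
            state = State.VERIFY
    trace.append(state)
    return state, plan, trace

# Stand-ins for the neural planner (assumptions for illustration only).
def propose():
    return [{"action": "pipette", "volume_ul": 1500}]

def split_oversized(plan, violations):
    """Split any oversized pipette step into chunks of at most 1000 uL."""
    fixed = []
    for step in plan:
        vol = step.get("volume_ul", 0)
        if step.get("action") == "pipette" and vol > 1000:
            while vol > 0:
                chunk = min(vol, 1000)
                fixed.append({**step, "volume_ul": chunk})
                vol -= chunk
        else:
            fixed.append(step)
    return fixed

final_state, final_plan, trace = run_fsm(propose, split_oversized, [check_volume])
print(final_state.name, [s["volume_ul"] for s in final_plan])
```

The point of the design is that a plan can only reach `DONE` through `VERIFY`, so a rule violation always triggers rectification (or an explicit failure) regardless of what the planner proposes.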
@article{BioProAgent2026,
title={BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning},
author={Anonymous},
year={2026}
}
@article{liu2025bioprobench,
title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
author={Liu, Yuyang and Lv, Liuzhenghao and Zhang, Xiancheng and Yuan, Li and Tian, Yonghong},
journal={arXiv preprint arXiv:2505.07889},
year={2025}
}