AI4S LAB: The World's First "One-Stop" Digital Intelligent Life Science Research Platform
AI4S LAB deeply integrates computing power, data, models, and experiments. The platform achieves a closed-loop process: "theoretical prediction → experimental design → automated execution → data analysis".
The automation of scientific experimentation is hindered by the inability of LLMs to reliably handle accuracy-critical biological protocols. We introduce BioProBench (550k task instances) to expose the reasoning gap, and BioProAgent, a neuro-symbolic framework. By anchoring probabilistic planning in a deterministic Finite State Machine (FSM), our agent ensures hardware compliance and significantly outperforms GPT-4 baselines.
We present BioProBench, the first large-scale resource dedicated to procedural reasoning in biological experimental protocols, containing a BioProCorpus of nearly 27,000 protocols and over 550,000 structured instances, covering diverse subfields of biology.
Overview of BioProBench. (a) A foundational corpus of 27,000 professionally authored protocols; (b) A structured dataset of over 550,000 instances derived from this BioProCorpus, which is partitioned into a training set to facilitate model fine-tuning and a held-out test set; and (c) A rigorous benchmark with novel, domain-specific metrics to evaluate procedural understanding, such as keyword-based content metrics and embedding-based structural metrics, to accurately quantify procedural knowledge.
Our benchmark leaderboard provides a comprehensive evaluation of leading LLMs using novel, domain-specific metrics, enabling fine-grained analysis of procedural reasoning performance. It highlights systematic weaknesses in models’ ability to understand, reason about, and generate scientific protocols across four task categories: PQA, ERR, ORD, and GEN.
| Model | Type | PQA (Acc) | ERR (Acc) | ORD (τ) | GEN (BLEU) |
|---|---|---|---|---|---|
| BioProAgent | Our Method | 85.08 🥇 | 81.55 🥇 | 0.891 🥇 | 16.37 🥇 |
| Closed Source Models | | | | | |
| gemini-3-flash-preview-nothinking | Proprietary | 73.33 | 65.08 | 0.8096 | 10.31 |
| claude-sonnet-4-5-20250929 | Proprietary | 68.02 | 63.17 | 0.7730 | 6.28 |
| gpt-5.4-2026-03-05 | Proprietary | 70.67 | 63.58 | 0.7270 | 9.20 |
| Gemini-2.5-Pro | Proprietary | 70.27 | 64.83 | 0.810 | 7.11 |
| Claude-3.7-Sonnet | Proprietary | 63.90 | 60.93 | 0.734 | 8.38 |
| GPT-4o | Proprietary | 63.50 | 62.67 | 0.627 | 8.92 |
| Gemini-2.0-Flash | Proprietary | 63.44 | 58.67 | 0.637 | 9.18 |
| GPT-4-Turbo | Proprietary | 57.92 | 56.17 | 0.528 | 9.26 |
| o3-mini | Proprietary | 65.67 | 62.33 | 0.733 | 8.69 |
| Open Source Models | | | | | |
| DeepSeek-R1 | Open Source | 67.83 🥉 | 62.92 | 0.745 🥉 | 8.62 |
| DeepSeek-V3 | Open Source | 66.58 | 58.58 | 0.640 | 9.37 🥉 |
| QwQ-32b | Open Source | 63.67 | 63.00 🥉 | 0.705 | 8.40 |
| Qwen-2.5-72b-instruct | Open Source | 65.30 | 59.17 | 0.657 | 10.27 🥈 |
* PQA: Procedural Question Answering, ERR: Error Correction, ORD: Step Ordering, GEN: Protocol Generation.
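The ORD column reports Kendall's τ, which measures how well a predicted step order agrees with the reference order. The following is a minimal sketch of the tau-a form for two orderings of the same distinct steps; the source does not state which τ variant the leaderboard uses, and the function name `kendall_tau` and the example step names are assumptions made for illustration.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall's tau-a between two orderings of the same distinct step IDs.

    Illustrative only: the exact variant used by the benchmark is not
    specified in the source.
    """
    if len(set(reference)) != len(reference) or sorted(predicted) != sorted(reference):
        raise ValueError("both orderings must contain the same distinct steps")
    n = len(reference)
    if n < 2:
        raise ValueError("need at least two steps")

    # Position of each step in the predicted ordering.
    pos = {step: i for i, step in enumerate(predicted)}

    concordant = discordant = 0
    # Visit every pair in reference order; a pair is concordant when the
    # prediction keeps the same relative order, discordant otherwise.
    for a, b in combinations(reference, 2):
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1

    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical protocol steps: the prediction swaps two adjacent steps.
reference = ["lyse", "centrifuge", "wash", "elute"]
predicted = ["lyse", "wash", "centrifuge", "elute"]
print(round(kendall_tau(predicted, reference), 3))
```

A perfectly ordered prediction scores 1, a fully reversed one scores -1, and the single swap above leaves five of six pairs in agreement, giving a value of about 0.67.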
Data extracted from BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning.
We evaluate our framework on an extended BioProBench with four specialized subsets:
Subset A: Protocol Drafting
Subset B: Code Generation
Subset C: Long-Horizon
Subset D: Error Correction.
The benchmark includes a digitized hardware registry (Ω) for 22 core synthetic biology instruments and strict API-level constraints to bridge sim-to-real deployment.
| Method | Backbone | ROUGE-L ↑ | S_sem ↑ | C_s ↑ | Time (s) ↓ |
|---|---|---|---|---|---|
| Direct | GPT-4o | 0.107 | 0.202 | 0.189 | 13.8 |
| Direct | Gemini-3-Flash | 0.130 | 0.247 | 0.322 | 12.1 |
| Direct | DeepSeek-V3 | 0.123 | 0.260 | 0.285 | 52.1 |
| Biomni | (Specialized) | 0.081 | 0.252 | 0.342 | 87.1 |
| ReAct | Gemini-3-Flash | 0.116 | 0.268 | 0.455 | 44.5 |
| Reflexion | Gemini-3-Flash | 0.118 | 0.282 | 0.439 | 148.4 |
| AutoGPT | Gemini-3-Flash | 0.116 | 0.258 | 0.429 | 119.6 |
| BioProAgent | Gemini-3-Flash | 0.147 | 0.344 | 0.591 | 71.8 |

| Method | Backbone | S_code ↑ | C_p ↑ | Acc_param ↑ |
|---|---|---|---|---|
| Direct | GPT-4o | 0.590 | 0.995 | 0.295 |
| Direct | Gemini-3-Flash | 0.576 | 0.996 | 0.287 |
| Direct | DeepSeek-V3 | 0.495 | 0.995 | 0.205 |
| Biomni | (Specialized) | N/A | N/A | N/A |
| ReAct | Gemini-3-Flash | 0.038 | 0.210 | 0.103 |
| Reflexion | Gemini-3-Flash | 0.278 | 0.534 | 0.403 |
| AutoGPT | Gemini-3-Flash | 0.540 | 0.911 | 0.468 |
| BioProAgent | Gemini-3-Flash | 0.653 | 0.956 | 0.610 |

| Method | Backbone | Succ. ↑ | Acc_param ↑ | C_p ↑ |
|---|---|---|---|---|
| ReAct | Gemini-3-Flash | 88.9% | 0.114 | 0.217 |
| Reflexion | Gemini-3-Flash | 33.3% | 0.000 | 0.000 |
| AutoGPT | Gemini-3-Flash | 66.7% | 0.409 | 0.644 |
| BioProAgent | Gemini-3-Flash | 100.0% | 0.718 | 0.950 |

| Method | Backbone | ACC_seq ↑ | C_p ↑ | Loop Rate ↓ |
|---|---|---|---|---|
| ReAct | Gemini-3-Flash | 0.0% | 0.000 | 40.0% |
| Reflexion | Gemini-3-Flash | 0.0% | 0.000 | 0.0% |
| AutoGPT | Gemini-3-Flash | 0.0% | 0.000 | 0.0% |
| BioProAgent | Gemini-3-Flash | 0.464 | 0.887 | 0.0% |
Overview of BioProAgent. (a) Cognitive Memory utilizes Symbolic Grounding Φ to manage context efficiently; (b) Neural Planner π₀ is grounded in a Design-Verify-Rectify FSM Δ(σ); (c) Hierarchical Verification (Kₛ, Kₚ) acts as a safety interlock, enforcing physical compliance by deterministically triggering rectification.
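The Design-Verify-Rectify loop can be pictured as a small state machine in which a deterministic verifier, not the planner, decides the next state. The sketch below is an assumption-laden illustration and not the authors' implementation: the state names, the `run_fsm` driver, the retry budget, and the toy pipette volume limit are all hypothetical, and plain functions stand in for the neural planner.

```python
from enum import Enum, auto


class State(Enum):
    DESIGN = auto()
    VERIFY = auto()
    RECTIFY = auto()
    DONE = auto()
    FAILED = auto()


def run_fsm(design, verify, rectify, max_rectifications=3):
    """Drive a plan through DESIGN -> VERIFY -> (RECTIFY -> VERIFY)* -> DONE.

    `design` and `rectify` stand in for the probabilistic planner;
    `verify` is deterministic and alone decides whether the plan passes.
    """
    state, plan, violations, attempts = State.DESIGN, None, [], 0
    trace = []
    while state not in (State.DONE, State.FAILED):
        trace.append(state)
        if state is State.DESIGN:
            plan = design()
            state = State.VERIFY
        elif state is State.VERIFY:
            violations = verify(plan)
            if not violations:
                state = State.DONE
            elif attempts < max_rectifications:
                state = State.RECTIFY
            else:
                state = State.FAILED
        elif state is State.RECTIFY:
            plan = rectify(plan, violations)
            attempts += 1
            state = State.VERIFY
    trace.append(state)
    return plan, trace


# Toy stand-ins: a hypothetical pipette that accepts at most 1000 uL.
MAX_VOLUME_UL = 1000


def design():
    return {"volume_ul": 1200}


def verify(plan):
    return ["volume too high"] if plan["volume_ul"] > MAX_VOLUME_UL else []


def rectify(plan, violations):
    return {**plan, "volume_ul": MAX_VOLUME_UL}


plan, trace = run_fsm(design, verify, rectify)
print(plan, [s.name for s in trace])
```

Because every transition out of VERIFY depends only on the verifier's output, an invalid plan cannot reach DONE regardless of what the planner generates; the retry budget simply bounds how long the loop may run.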
BioProAgent eliminates the trade-off between scientific reasoning and physical safety. Compared with state-of-the-art baselines, it excels in hardware compliance, long-horizon stability, and cost-efficiency.
Figure: Scientific Reasoning vs. Automation Executability. Vanilla LLMs occupy the Theoretical Zone (high logical reasoning, weak executable code), while Neural Agents (e.g., ReAct) often generate better code but fail in scientific reasoning. BioProAgent achieves Trustworthy Autonomy with superior performance in both dimensions.
Standard LLM agents operate in an open-loop manner: generating a dangerous parameter (e.g., exceeding centrifuge limits) leads to immediate execution and physical failure. BioProAgent proactively intercepts these hallucinations.
Figure: In a physical violation case (a), the Symbolic Rule Engine intercepts an unsafe speed (25,000g), forces a transition to the RECTIFY_CODE state, and regenerates code within safe limits (15,000g). In a symbol grounding error (b), the system detects an undefined resource ID ("new_plate") and guides the agent to remap it to a validated slot ("plate_1").
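The two interception cases in the figure can be approximated with a pair of deterministic checks run before any command reaches hardware. This is a minimal sketch under stated assumptions: the command dictionaries, the `check_command` function, and the `Violation` type are hypothetical, and treating 15,000 g as a centrifuge ceiling and "plate_1" as the only validated slot is an illustrative reading of the figure rather than the actual registry contents.

```python
from dataclasses import dataclass

# Hypothetical registry values: the 15,000 g limit and the "plate_1" slot
# mirror the figure; the surrounding structure is assumed for illustration.
MAX_CENTRIFUGE_G = 15_000
VALID_SLOTS = {"plate_1"}


@dataclass(frozen=True)
class Violation:
    kind: str      # "physical" or "grounding"
    message: str


def check_command(cmd):
    """Return every rule violation found in one generated command."""
    violations = []

    # Physical check: reject speeds above the registered limit.
    speed = cmd.get("speed_g")
    if speed is not None and speed > MAX_CENTRIFUGE_G:
        violations.append(
            Violation("physical", f"speed {speed} g exceeds limit {MAX_CENTRIFUGE_G} g")
        )

    # Grounding check: every resource ID must map to a validated slot.
    resource = cmd.get("resource")
    if resource is not None and resource not in VALID_SLOTS:
        violations.append(
            Violation("grounding", f"undefined resource ID {resource!r}")
        )

    return violations


unsafe = {"op": "centrifuge", "speed_g": 25_000, "resource": "plate_1"}
ungrounded = {"op": "transfer", "resource": "new_plate"}

for cmd in (unsafe, ungrounded):
    for v in check_command(cmd):
        print(v.kind, "->", v.message)
```

A non-empty violation list is what would trigger rectification; the messages can be passed back to the planner so the regenerated code targets the specific rule that failed.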
@article{liu2026bioproagent,
title = {BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning},
author = {Liu, Yuyang and Wang, Jingya and Lv, Liuzhenghao and Tian, Yonghong},
journal = {arXiv preprint arXiv:2603.00876},
year = {2026}
}
@article{liu2025bioprobench,
title = {BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
author = {Liu, Yuyang and Lv, Liuzhenghao and Zhang, Xiancheng and Yuan, Li and Tian, Yonghong},
journal = {arXiv preprint arXiv:2505.07889},
year = {2025}
}