BioProSuite: Agent & Benchmark

AI4S LAB: The World's First "One-Stop" Digital Intelligent Life Science Research Platform. AI4S LAB deeply integrates computing power, data, models, and experiments. The platform achieves a closed-loop process: "theoretical prediction → experimental design → automated execution → data analysis".


Abstract

The automation of scientific experimentation is hindered by the inability of LLMs to reliably handle accuracy-critical biological protocols. We introduce BioProBench (550k task instances) to expose the reasoning gap, and BioProAgent, a neuro-symbolic framework. By anchoring probabilistic planning in a deterministic Finite State Machine (FSM), our agent ensures hardware compliance and significantly outperforms GPT-4 baselines.

🧬 Dataset: BioProBench

We present BioProBench, the first large-scale resource dedicated to procedural reasoning in biological experimental protocols, containing a BioProCorpus of nearly 27,000 protocols and over 550,000 structured instances, covering diverse subfields of biology.

Overview of BioProBench. (a) A foundational corpus of 27,000 professionally authored protocols; (b) a structured dataset of over 550,000 instances derived from this BioProCorpus, partitioned into a training set for model fine-tuning and a held-out test set; and (c) a rigorous benchmark with novel, domain-specific metrics, such as keyword-based content metrics and embedding-based structural metrics, that quantify procedural understanding.

🏆 BioProBench Leaderboard

Our benchmark leaderboard provides a comprehensive evaluation of leading LLMs using novel, domain-specific metrics, enabling fine-grained analysis of procedural reasoning performance. It highlights systematic weaknesses in models’ ability to understand, reason about, and generate scientific protocols across four task categories: PQA, ERR, ORD, and GEN.

Main Leaderboard

Model	Type	PQA (Acc)	ERR (Acc)	ORD (τ)	GEN (BLEU)
BioProAgent	Our Method	85.08 🥇	81.55 🥇	0.891 🥇	16.37 🥇
Closed Source Models
gemini-3-flash-preview-nothinking	Proprietary	73.33	65.08	0.8096	10.31
claude-sonnet-4-5-20250929	Proprietary	68.02	63.17	0.7730	6.28
gpt-5.4-2026-03-05	Proprietary	70.67	63.58	0.7270	9.20
Gemini-2.5-Pro	Proprietary	70.27	64.83	0.810	7.11
Claude-3.7-Sonnet	Proprietary	63.90	60.93	0.734	8.38
GPT-4o	Proprietary	63.50	62.67	0.627	8.92
Gemini-2.0-Flash	Proprietary	63.44	58.67	0.637	9.18
GPT-4-Turbo	Proprietary	57.92	56.17	0.528	9.26
o3-mini	Proprietary	65.67	62.33	0.733	8.69
Open Source Models
DeepSeek-R1	Open Source	67.83 🥉	62.92	0.745 🥉	8.62
DeepSeek-V3	Open Source	66.58	58.58	0.640	9.37 🥉
QwQ-32b	Open Source	63.67	63.00 🥉	0.705	8.40
Qwen-2.5-72b-instruct	Open Source	65.30	59.17	0.657	10.27 🥈

* PQA: Procedural Question Answering, ERR: Error Correction, ORD: Step Ordering, GEN: Protocol Generation.
Data extracted from BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning.
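
The ORD column reports Kendall's τ between a predicted step order and the reference order. The following is a minimal sketch of that score, assuming the τ-a variant over unique steps with no ties; the benchmark's own implementation may differ, and `kendall_tau` and the example steps are hypothetical.

```python
from itertools import combinations

def kendall_tau(predicted, reference):
    """Kendall's tau-a between two orderings of the same unique steps."""
    if sorted(predicted) != sorted(reference):
        raise ValueError("both orderings must contain the same steps")
    n = len(reference)
    if n < 2:
        raise ValueError("need at least two steps")
    # Position of each step in the predicted ordering.
    pred_rank = {step: i for i, step in enumerate(predicted)}
    concordant = discordant = 0
    # Visit every pair in reference order; a pair is concordant when the
    # prediction keeps the same relative order.
    for a, b in combinations(reference, 2):
        if pred_rank[a] < pred_rank[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Illustrative steps only (not taken from the dataset).
reference = ["lyse cells", "centrifuge", "collect supernatant", "quantify protein"]
predicted = ["lyse cells", "collect supernatant", "centrifuge", "quantify protein"]
print(kendall_tau(predicted, reference))
```

A perfect ordering scores 1, a fully reversed one scores -1, and a single adjacent swap among four steps leaves five of six pairs concordant.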

Want to see the solution? See how BioProAgent achieves 100% Success Rate on the Agent Execution tasks by solving the reasoning gaps shown above.

🧪 Extended BioProBench Leaderboard

We evaluate our framework on an extended BioProBench with four specialized subsets: Subset A (Protocol Drafting), Subset B (Code Generation), Subset C (Long-Horizon), and Subset D (Error Correction). The benchmark includes a digitized hardware registry (Ω) for 22 core synthetic biology instruments and strict API-level constraints to bridge sim-to-real deployment.

Subset A: Protocol Drafting

Method	Backbone	ROUGE-L ↑	S_sem ↑	C_s ↑	Time (s) ↓
Direct	GPT-4o	0.107	0.202	0.189	13.8
Direct	Gemini-3-Flash	0.130	0.247	0.322	12.1
Direct	DeepSeek-V3	0.123	0.260	0.285	52.1
Biomni	(Specialized)	0.081	0.252	0.342	87.1
ReAct	Gemini-3-Flash	0.116	0.268	0.455	44.5
Reflexion	Gemini-3-Flash	0.118	0.282	0.439	148.4
AutoGPT	Gemini-3-Flash	0.116	0.258	0.429	119.6
BioProAgent	Gemini-3-Flash	0.147	0.344	0.591	71.8

Subset B: Code Generation

Method	Backbone	S_code ↑	C_p ↑	Acc_param ↑
Direct	GPT-4o	0.590	0.995	0.295
Direct	Gemini-3-Flash	0.576	0.996	0.287
Direct	DeepSeek-V3	0.495	0.995	0.205
Biomni	(Specialized)	N/A	N/A	N/A
ReAct	Gemini-3-Flash	0.038	0.210	0.103
Reflexion	Gemini-3-Flash	0.278	0.534	0.403
AutoGPT	Gemini-3-Flash	0.540	0.911	0.468
BioProAgent	Gemini-3-Flash	0.653	0.956	0.610

Subset C: Long-Horizon

Method	Backbone	Succ. ↑	Acc_param ↑	C_p ↑
ReAct	Gemini-3-Flash	88.9%	0.114	0.217
Reflexion	Gemini-3-Flash	33.3%	0.000	0.000
AutoGPT	Gemini-3-Flash	66.7%	0.409	0.644
BioProAgent	Gemini-3-Flash	100.0%	0.718	0.950

Subset D: Error Correction

Method	Backbone	ACC_seq ↑	C_p ↑	Loop Rate ↓
ReAct	Gemini-3-Flash	0.000	0.000	40.0%
Reflexion	Gemini-3-Flash	0.000	0.000	0.0%
AutoGPT	Gemini-3-Flash	0.000	0.000	0.0%
BioProAgent	Gemini-3-Flash	0.464	0.887	0.0%

🤖 Method: BioProAgent

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

State-Augmented Adaptive Planning (FSM-Constrained Script-Free Planner): Abandons rigid linear workflows and adopts a neuro-symbolic framework that constrains probabilistic planning via a deterministic Finite State Machine (FSM). The Agent leverages State-Augmented Planning to flexibly select retrieval, draft generation, or code production based on current states, addressing LLMs' limitations in handling the rigorous constraints of physical actuation in wet-lab scenarios.
Scientific Review: Incorporates a strict scientific reflection mechanism (Validator) that automatically checks for missing control groups, logical flaws, implausible parameters, and invalid machine code. This enforces a rigorous Draft-Verify-Rectify (DVR) workflow, ensuring the scientific rigor of experimental protocols.
Automation Hardware Alignment: Reads laboratory device and consumable inventories (CSV), mapping natural language steps to specific machine operations via Semantic Symbol Grounding, reducing token consumption by ~6×.
Hybrid Memory System:
- Short-Term Memory: Combines both episodic memory and working memory to maintain long-horizon protocol consistency.
- Long-Term Memory: Integrates Mem0 to recall past experimental experiences.
Human-in-the-Loop: Proactively requests user confirmation at critical decision points to ensure safety in high-risk wet-lab operations.
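
The hardware-alignment idea above can be illustrated with a small sketch: the planner only sees short symbol IDs, and full inventory records are resolved when a step is turned into a machine operation. Everything here is an assumption for illustration, including the CSV columns (`id`, `type`, `max_volume_ul`), the function names, and the shape of a step; the source does not describe the actual schema.

```python
import csv
import io

# Hypothetical inventory; the real CSV schema is not given in the source.
INVENTORY_CSV = """id,type,max_volume_ul
plate_1,96_well_plate,200
tube_rack_1,tube_rack,1500
"""

def load_registry(csv_text):
    """Parse the inventory CSV into {symbol_id: full_record}."""
    return {row["id"]: row for row in csv.DictReader(io.StringIO(csv_text))}

def planner_context(registry):
    """Compact view for the LLM prompt: symbol IDs and types only."""
    return [f"{sid}:{rec['type']}" for sid, rec in registry.items()]

def ground(step, registry):
    """Resolve a symbolic step into a machine operation with full parameters."""
    target = step["target"]
    if target not in registry:
        raise KeyError(f"undefined resource ID: {target}")
    record = registry[target]
    if step["volume_ul"] > float(record["max_volume_ul"]):
        raise ValueError(f"{step['volume_ul']} uL exceeds capacity of {target}")
    return {"op": step["action"], "labware": record, "volume_ul": step["volume_ul"]}

registry = load_registry(INVENTORY_CSV)
print(planner_context(registry))
print(ground({"action": "dispense", "target": "plate_1", "volume_ul": 100}, registry))
```

Keeping the full records out of the prompt is what would save tokens in a design like this: the context grows with the number of symbols, not with the size of each record.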

Overview of BioProAgent. (a) Cognitive Memory utilizes Symbolic Grounding Φ to manage context efficiently; (b) Neural Planner π₀ is grounded in a Design-Verify-Rectify FSM Δ(σ); (c) Hierarchical Verification (Kₛ, Kₚ) acts as a safety interlock, enforcing physical compliance by deterministically triggering rectification.
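
One way to picture how a deterministic FSM constrains a probabilistic planner is the sketch below. It is not the authors' implementation: only the `RECTIFY_CODE` state is named on this page, and the other state names, the transition table, and the fallback rule are assumptions made for illustration. The planner's proposals are scripted here in place of LLM calls.

```python
# Legal transitions; the first entry of each tuple is the default successor.
# Only RECTIFY_CODE is named in the source; the other states are assumptions.
TRANSITIONS = {
    "RETRIEVE": ("DRAFT",),
    "DRAFT": ("GENERATE_CODE", "RETRIEVE"),
    "GENERATE_CODE": ("VERIFY_CODE",),
    "VERIFY_CODE": ("DONE", "RECTIFY_CODE"),
    "RECTIFY_CODE": ("VERIFY_CODE",),
}

def next_state(state, proposal=None, check_passed=None):
    """Combine a planner proposal with the deterministic FSM rules."""
    if state == "VERIFY_CODE":
        # The verifier decides, not the planner: a failed check always
        # triggers rectification.
        return "DONE" if check_passed else "RECTIFY_CODE"
    allowed = TRANSITIONS[state]
    # Illegal or missing proposals fall back to the default successor.
    return proposal if proposal in allowed else allowed[0]

def run(proposals, check_results):
    """Replay scripted planner proposals and verifier outcomes."""
    proposals, checks = iter(proposals), iter(check_results)
    state, trace = "RETRIEVE", ["RETRIEVE"]
    while state != "DONE":
        if state == "VERIFY_CODE":
            state = next_state(state, check_passed=next(checks))
        else:
            state = next_state(state, proposal=next(proposals))
        trace.append(state)
    return trace

# The planner tries to jump from DRAFT straight to DONE; the FSM refuses.
trace = run(["DRAFT", "DONE", "VERIFY_CODE", "VERIFY_CODE"], [False, True])
print(" -> ".join(trace))
```

In this toy run the illegal jump to `DONE` is replaced by the default successor, and the first failed verification forces a pass through `RECTIFY_CODE` before the run can finish.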

📈 BioProAgent Performance

BioProAgent eliminates the trade-off between scientific reasoning and physical safety. Compared to state-of-the-art baselines, it excels in hardware compliance, long-horizon stability, and cost-efficiency:

Unmatched Physical Compliance: Achieves a 95.6% physical compliance rate on code generation (Subset B), acting as a crucial safety interlock against the hallucinations that drop ReAct agents to 21.0% compliance.
Autonomous Self-Correction: While all standard baseline agents exhibit a 0% correction rate against injected errors (Subset D), BioProAgent's FSM dynamically overwrites unsafe trajectories, restoring physical compliance to 88.7%.
Cost Efficiency: By decoupling high-dimensional data payloads via Semantic Symbol Grounding, it reduces token consumption by ~82% compared to AutoGPT, while maintaining a 100% success rate in 60-step long-horizon workflows.
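
The "~6×" and "~82%" token figures quoted on this page are two ways of stating a reduction. The short check below converts between them, assuming both describe the same comparison, which the page does not state explicitly; the helper names are hypothetical.

```python
def reduction_from_factor(factor):
    """Fraction of tokens saved when usage shrinks by `factor` times."""
    return 1 - 1 / factor

def factor_from_reduction(reduction):
    """Shrink factor implied by saving a `reduction` fraction of tokens."""
    return 1 / (1 - reduction)

print(round(reduction_from_factor(6), 3))     # 0.833
print(round(factor_from_reduction(0.82), 2))  # 5.56
```

Under that assumption the two numbers agree to within rounding: a 6× reduction saves about 83% of tokens, and an 82% saving corresponds to roughly a 5.6× reduction.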

Figure: Scientific Reasoning vs. Automation Executability. Vanilla LLMs occupy the Theoretical Zone (high logical reasoning, weak executable code), while Neural Agents (e.g., ReAct) often generate better code but fail in scientific reasoning. BioProAgent achieves Trustworthy Autonomy with superior performance in both dimensions.

🔍 Case Study: FSM-Driven Self-Correction

Standard LLM agents operate in an open-loop manner: generating a dangerous parameter (e.g., exceeding centrifuge limits) leads to immediate execution and physical failure. BioProAgent proactively intercepts these hallucinations.

Figure: In a physical violation case (a), the Symbolic Rule Engine intercepts an unsafe speed (25,000g), forcing a transition to the RECTIFY_CODE state and regenerating code within safe limits (15,000g). In a symbol grounding error (b), the system detects an undefined resource ID ("new_plate") and guides the agent to remap it to a validated slot ("plate_1").
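
The two cases in the figure can be mimicked with a small rule check followed by a repair step. This is a sketch under stated assumptions: the registry contents, the field names (`rcf_g`, `labware`), and the deterministic repair are all hypothetical stand-ins. In BioProAgent the repair is described as the agent regenerating code under guidance, not a hard-coded clamp.

```python
# Hypothetical registry entry and slots, chosen to mirror the figure.
REGISTRY = {"centrifuge_1": {"max_rcf_g": 15000}}
VALID_SLOTS = {"plate_1", "plate_2"}

def check_command(cmd):
    """Return a list of (kind, suggested_fix) violations; empty if safe."""
    violations = []
    device = REGISTRY.get(cmd["device"])
    if device is None:
        violations.append(("unknown_device", None))
    elif cmd.get("rcf_g", 0) > device["max_rcf_g"]:
        violations.append(("rcf_exceeds_limit", device["max_rcf_g"]))
    if cmd.get("labware") not in VALID_SLOTS:
        # Stand-in suggestion: the first validated slot in sorted order.
        violations.append(("undefined_resource", sorted(VALID_SLOTS)[0]))
    return violations

def rectify(cmd, violations):
    """Apply the suggested fixes (a stand-in for LLM regeneration)."""
    fixed = dict(cmd)
    for kind, fix in violations:
        if kind == "rcf_exceeds_limit":
            fixed["rcf_g"] = fix
        elif kind == "undefined_resource":
            fixed["labware"] = fix
    return fixed

def verify_and_rectify(cmd, max_rounds=3):
    """Loop check -> rectify until the command passes or rounds run out."""
    for _ in range(max_rounds):
        violations = check_command(cmd)
        if not violations:
            return cmd
        cmd = rectify(cmd, violations)
    raise RuntimeError("could not rectify command")

unsafe = {"device": "centrifuge_1", "rcf_g": 25000, "labware": "new_plate"}
print(verify_and_rectify(unsafe))
```

The point of the loop is that nothing reaches execution until `check_command` returns no violations; a command that cannot be repaired raises instead of running.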

BibTeX

@article{liu2026bioproagent,
  title   = {BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning},
  author  = {Liu, Yuyang and Wang, Jingya and Lv, Liuzhenghao and Tian, Yonghong},
  journal = {arXiv preprint arXiv:2603.00876},
  year    = {2026}
}

@article{liu2025bioprobench,
  title   = {BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author  = {Liu, Yuyang and Lv, Liuzhenghao and Zhang, Xiancheng and Yuan, Li and Tian, Yonghong},
  journal = {arXiv preprint arXiv:2505.07889},
  year    = {2025}
}
