BioProSuite

Towards Reliable Autonomous Wet-Lab Experimentation

Yuyang Liu1, Liuzhenghao Lv1, Xiancheng Zhang1, Jingya Wang1,2, Li Yuan1,2, Yonghong Tian1,2
1Peking University, 2School of AI4S
Latest News
🔬 [2026-04-01] ICML Rebuttal update: we added additional model evaluation results for gpt-5.4-2026-03-05, gemini-3-flash-preview-nothinking, and claude-sonnet-4-5-20250929 across PQA/ERR/ORD/GEN.
[2026-03-31] Data Split Update! We have officially released the Train/Test splits for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
🔥 [2026-03-18] Our BioProAgent is now live on AI4S LAB! Try it out and order wet-lab experiments here.
🎉 [2026-03-03] Our BioProAgent has been accepted by the ICLR 2026 LLA Workshop!
📄 [2026-03-01] The preprint of our BioProAgent paper is available on arXiv.
📝 [2026-01-21] Our BioProBench paper has been updated.
🚀 [2025-12] Code and dataset (v1.0) are released on GitHub.

AI4S LAB: The World's First "One-Stop" Digital Intelligent Life Science Research Platform. AI4S LAB deeply integrates computing power, data, models, and experiments, achieving a closed-loop process: "theoretical prediction → experimental design → automated execution → data analysis".

Abstract

The automation of scientific experimentation is hindered by the inability of LLMs to reliably handle accuracy-critical biological protocols. We introduce BioProBench (550k task instances) to expose the reasoning gap, and BioProAgent, a neuro-symbolic framework. By anchoring probabilistic planning in a deterministic Finite State Machine (FSM), our agent ensures hardware compliance and significantly outperforms GPT-4 baselines.

🧬 Dataset: BioProBench

We present BioProBench, the first large-scale resource dedicated to procedural reasoning in biological experimental protocols, containing a BioProCorpus of nearly 27,000 protocols and over 550,000 structured instances, covering diverse subfields of biology.

BioProBench Statistics

Overview of BioProBench. (a) A foundational corpus of 27,000 professionally authored protocols; (b) a structured dataset of over 550,000 instances derived from this BioProCorpus, partitioned into a training set for model fine-tuning and a held-out test set; and (c) a rigorous benchmark with novel, domain-specific metrics (keyword-based content metrics and embedding-based structural metrics) that quantify procedural understanding.
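As a rough illustration of how such task instances might be consumed, the sketch below loads JSONL records and filters them by split. The field names (`task`, `split`) and the file layout are hypothetical assumptions, not the released schema.

```python
import json
from io import StringIO

# Hypothetical BioProBench-style instances: one JSON record per line,
# tagged with a task type (PQA/ERR/ORD/GEN/REA) and a split name.
# These field names are illustrative, not the official schema.
SAMPLE = StringIO("\n".join([
    json.dumps({"task": "PQA", "split": "train", "question": "...", "answer": "..."}),
    json.dumps({"task": "ORD", "split": "test", "steps": ["..."], "order": [0]}),
    json.dumps({"task": "GEN", "split": "train", "title": "...", "protocol": "..."}),
]))

def load_split(fh, split):
    """Parse JSONL from a file handle and keep only the requested split."""
    records = (json.loads(line) for line in fh)
    return [r for r in records if r["split"] == split]
```

Keeping train and test in one stream with an explicit `split` field makes it hard to evaluate on training instances by accident, since every record carries its own partition label.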

🏆 BioProBench Leaderboard

Our benchmark leaderboard provides a comprehensive evaluation of leading LLMs using novel, domain-specific metrics, enabling fine-grained analysis of procedural reasoning performance. It highlights systematic weaknesses in models' ability to understand, reason about, and generate scientific protocols across four task categories: PQA, ERR, ORD, and GEN.

Model Type PQA (Acc) ERR (Acc) ORD (τ) GEN (BLEU)
BioProAgent Our Method 85.08 🥇 81.55 🥇 0.891 🥇 16.37 🥇
Closed Source Models
gemini-3-flash-preview-nothinking Proprietary 73.33 65.08 0.810 10.31
claude-sonnet-4-5-20250929 Proprietary 68.02 63.17 0.773 6.28
gpt-5.4-2026-03-05 Proprietary 70.67 63.58 0.727 9.20
Gemini-2.5-Pro Proprietary 70.27 64.83 0.810 7.11
Claude-3.7-Sonnet Proprietary 63.90 60.93 0.734 8.38
GPT-4o Proprietary 63.50 62.67 0.627 8.92
Gemini-2.0-Flash Proprietary 63.44 58.67 0.637 9.18
GPT-4-Turbo Proprietary 57.92 56.17 0.528 9.26
o3-mini Proprietary 65.67 62.33 0.733 8.69
Open Source Models
DeepSeek-R1 Open Source 67.83 🥉 62.92 0.745 🥉 8.62
DeepSeek-V3 Open Source 66.58 58.58 0.640 9.37 🥉
QwQ-32b Open Source 63.67 63.00 🥉 0.705 8.40
Qwen-2.5-72b-instruct Open Source 65.30 59.17 0.657 10.27 🥈

* PQA: Procedural Question Answering, ERR: Error Correction, ORD: Step Ordering, GEN: Protocol Generation.
Data extracted from BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning.
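The ORD column reports a rank correlation τ between predicted and gold step orders. Assuming it is Kendall's τ (the standard choice for step-ordering tasks; the benchmark's exact variant may differ), a minimal reference implementation is:

```python
from itertools import combinations

def kendall_tau(gold, pred):
    """Kendall's tau between two orderings of the same protocol steps.

    Each argument lists step IDs in order. tau = (concordant - discordant)
    / total pairs, so 1.0 means identical order and -1.0 fully reversed.
    """
    pos = {step: i for i, step in enumerate(pred)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):  # a precedes b in the gold order
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    n = len(gold)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For example, swapping only the last two steps of a four-step protocol yields τ = (5 − 1)/6 ≈ 0.667, so τ rewards mostly-correct orderings instead of scoring them as outright failures.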

Want to see the solution? See how BioProAgent closes the reasoning gaps shown above and achieves a 100% success rate on the agent execution tasks.

🧪 Extended BioProBench Leaderboard

We evaluate our framework on an extended BioProBench with four specialized subsets: Subset A (Protocol Drafting), Subset B (Code Generation), Subset C (Long-Horizon), and Subset D (Error Correction). The benchmark includes a digitized hardware registry (Ω) for 22 core synthetic biology instruments and strict API-level constraints to bridge the sim-to-real deployment gap.
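A minimal sketch of what one entry in such a hardware registry, and an API-level bound check against it, could look like. The instrument names, fields, and limits below are illustrative assumptions, not the benchmark's actual registry Ω.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instrument:
    """One registry entry: an instrument plus the parameter bounds
    its API will actually accept."""
    name: str
    param: str        # the constrained parameter, e.g. relative centrifugal force
    min_value: float
    max_value: float

# Illustrative registry entries; the limits are made up for this example.
OMEGA = {
    "centrifuge_1": Instrument("centrifuge_1", "speed_g", 100.0, 15000.0),
    "thermocycler_1": Instrument("thermocycler_1", "temp_c", 4.0, 99.0),
}

def check_call(device_id, value):
    """Return True iff the requested parameter is within registry bounds."""
    inst = OMEGA[device_id]
    return inst.min_value <= value <= inst.max_value
```

Keeping the bounds in data rather than in prompt text lets a deterministic checker veto any generated API call before it reaches hardware.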

Subset A: Protocol Drafting
Method       Backbone         ROUGE-L ↑   S_sem ↑   C_s ↑   Time (s) ↓
Direct       GPT-4o           0.107       0.202     0.189   13.8
Direct       Gemini-3-Flash   0.130       0.247     0.322   12.1
Direct       DeepSeek-V3      0.123       0.260     0.285   52.1
Biomni       (Specialized)    0.081       0.252     0.342   87.1
ReAct        Gemini-3-Flash   0.116       0.268     0.455   44.5
Reflexion    Gemini-3-Flash   0.118       0.282     0.439   148.4
AutoGPT      Gemini-3-Flash   0.116       0.258     0.429   119.6
BioProAgent  Gemini-3-Flash   0.147       0.344     0.591   71.8

Subset B: Code Generation
Method       Backbone         S_code ↑   C_p ↑   Acc_param ↑
Direct       GPT-4o           0.590      0.995   0.295
Direct       Gemini-3-Flash   0.576      0.996   0.287
Direct       DeepSeek-V3      0.495      0.995   0.205
Biomni       (Specialized)    N/A        N/A     N/A
ReAct        Gemini-3-Flash   0.038      0.210   0.103
Reflexion    Gemini-3-Flash   0.278      0.534   0.403
AutoGPT      Gemini-3-Flash   0.540      0.911   0.468
BioProAgent  Gemini-3-Flash   0.653      0.956   0.610

Subset C: Long-Horizon
Method       Backbone         Succ. ↑   Acc_param ↑   C_p ↑
ReAct        Gemini-3-Flash   88.9%     0.114         0.217
Reflexion    Gemini-3-Flash   33.3%     0.000         0.000
AutoGPT      Gemini-3-Flash   66.7%     0.409         0.644
BioProAgent  Gemini-3-Flash   100.0%    0.718         0.950

Subset D: Error Correction
Method       Backbone         ACC_seq ↑   C_p ↑   Loop Rate ↓
ReAct        Gemini-3-Flash   0.0%        0.000   40.0%
Reflexion    Gemini-3-Flash   0.0%        0.000   0.0%
AutoGPT      Gemini-3-Flash   0.0%        0.000   0.0%
BioProAgent  Gemini-3-Flash   0.464       0.887   0.0%

🤖 Method: BioProAgent

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

  • State-Augmented Adaptive Planning (FSM-Constrained Script-Free Planner): Abandons rigid linear workflows and adopts a neuro-symbolic framework that constrains probabilistic planning via a deterministic Finite State Machine (FSM). The Agent leverages State-Augmented Planning to flexibly select retrieval, draft generation, or code production based on current states, addressing LLMs' limitations in handling the rigorous constraints of physical actuation in wet-lab scenarios.
  • Scientific Review: Incorporates a strict scientific reflection mechanism (Validator) that automatically checks for missing control groups, logical flaws, parameter rationality, and machine-code validity. This enforces a rigorous Draft-Verify-Rectify (DVR) workflow, ensuring the scientific rigor of experimental protocols.
  • Automation Hardware Alignment: Reads laboratory device and consumable inventories (CSV), mapping natural language steps to specific machine operations via Semantic Symbol Grounding, reducing token consumption by ~6×.
  • Hybrid Memory System:
    • Short-Term Memory: Combines both episodic memory and working memory to maintain long-horizon protocol consistency.
    • Long-Term Memory: Integrates Mem0 to recall past experimental experiences.
  • Human-in-the-Loop: Proactively requests user confirmation at critical decision points to ensure safety in high-risk wet-lab operations.
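The Draft-Verify-Rectify loop described above can be sketched as a deterministic FSM gating a stubbed neural proposer. The state names, the toy validator, and the halving rectifier below are illustrative assumptions, not the released implementation.

```python
# Illustrative DVR loop: a probabilistic planner proposes drafts, but a
# deterministic FSM decides which state comes next, so an invalid draft
# can only transition to RECTIFY, never to EXECUTE.
DRAFT, VERIFY, RECTIFY, EXECUTE = "DRAFT", "VERIFY", "RECTIFY", "EXECUTE"

def transition(state, draft_ok):
    """Deterministic FSM Δ(σ): the only path to EXECUTE is a verified draft."""
    if state == DRAFT:
        return VERIFY
    if state == VERIFY:
        return EXECUTE if draft_ok else RECTIFY
    if state == RECTIFY:
        return VERIFY
    return EXECUTE

def run_dvr(propose, validate, max_rounds=5):
    """Drive the planner through the FSM until the draft passes validation."""
    state, draft = DRAFT, None
    trace = [state]
    for _ in range(max_rounds):
        if state in (DRAFT, RECTIFY):
            draft = propose(draft)          # neural step (stubbed here)
        state = transition(state, validate(draft))
        trace.append(state)
        if state == EXECUTE:
            break
    return draft, trace
```

For instance, a proposer that first emits an over-limit parameter and then halves it on rectification is forced through DRAFT → VERIFY → RECTIFY → VERIFY → EXECUTE, reaching the actuator only with a validated value.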
BioProAgent Architecture

Overview of BioProAgent. (a) Cognitive Memory utilizes Symbolic Grounding Φ to manage context efficiently; (b) Neural Planner π₀ is grounded in a Draft-Verify-Rectify FSM Δ(σ); (c) Hierarchical Verification (Kₛ, Kₚ) acts as a safety interlock, enforcing physical compliance by deterministically triggering rectification.


📈 BioProAgent Performance

BioProAgent eliminates the trade-off between scientific reasoning and physical safety. Compared to state-of-the-art baselines, it excels in hardware compliance, long-horizon stability, and cost-efficiency:

  • Unmatched Physical Compliance: Achieves a 95.6% physical compliance rate, acting as a crucial safety interlock against the hallucinations that drive ReAct agents down to a catastrophic 21.0% compliance rate.
  • Autonomous Self-Correction: While all standard baseline agents exhibit a 0% correction rate against injected errors, BioProAgent's FSM dynamically overwrites unsafe trajectories, restoring physical safety to 88.7%.
  • Cost Efficiency: By decoupling high-dimensional data payloads via Semantic Symbol Grounding, it reduces token consumption by ~82% compared to AutoGPT, while maintaining a 100% success rate in 60-step long-horizon workflows.
Scientific Reasoning vs. Automation Executability

Figure: Scientific Reasoning vs. Automation Executability. Vanilla LLMs occupy the Theoretical Zone (high logical reasoning, weak executable code), while Neural Agents (e.g., ReAct) often generate better code but fail in scientific reasoning. BioProAgent achieves Trustworthy Autonomy with superior performance in both dimensions.

🔍 Case Study: FSM-Driven Self-Correction

Standard LLM agents operate in an open-loop manner: generating a dangerous parameter (e.g., exceeding centrifuge limits) leads to immediate execution and physical failure. BioProAgent proactively intercepts these hallucinations.

Self-Correction Trajectories

Figure: In a physical violation case (a), the Symbolic Rule Engine intercepts an unsafe centrifugation speed (25,000 g), forcing a transition to the RECTIFY_CODE state and regenerating code within safe limits (15,000 g). In a symbol grounding error (b), the system detects an undefined resource ID ("new_plate") and guides the agent to remap it to a validated slot ("plate_1").
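Case (b) can be sketched as a lookup against the declared deck layout, remapping an undefined resource ID to the closest declared slot. The deck contents and the fuzzy-match heuristic below are illustrative assumptions, not the agent's actual grounding procedure.

```python
import difflib

# Illustrative deck layout: the resource IDs that generated code may reference.
DECK = {"plate_1": "96-well PCR plate", "tube_rack_1": "1.5 mL tubes"}

def ground_resource(resource_id, deck=DECK):
    """Return a validated resource ID, remapping undefined IDs to the
    closest declared slot (or raising if nothing is plausibly close)."""
    if resource_id in deck:
        return resource_id
    match = difflib.get_close_matches(resource_id, deck, n=1, cutoff=0.3)
    if not match:
        raise KeyError(f"unresolvable resource ID: {resource_id!r}")
    return match[0]
```

Raising on an unresolvable ID, rather than guessing, is what lets the surrounding FSM route the failure into a rectification step instead of executing against hardware.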

BibTeX

@article{liu2026bioproagent,
  title   = {BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning},
  author  = {Liu, Yuyang and Wang, Jingya and Lv, Liuzhenghao and Tian, Yonghong},
  journal = {arXiv preprint arXiv:2603.00876},
  year    = {2026}
}

@article{liu2025bioprobench,
  title   = {BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author  = {Liu, Yuyang and Lv, Liuzhenghao and Zhang, Xiancheng and Yuan, Li and Tian, Yonghong},
  journal = {arXiv preprint arXiv:2505.07889},
  year    = {2025}
}

Page template adapted from Nerfies and ChemCoTBench.
