BioPro Project

Towards Reliable Autonomous Wet-Lab Experimentation

Yuyang Liu1, Liuzhenghao Lv1, Xiancheng Zhang1, Jingya Wang1,2, Li Yuan1,2, Yonghong Tian1,2
1Peking University, 2School of AI4S
Latest News
🚀 [2025-12] Code and dataset (v1.0) are released on GitHub.

AI4S LAB: The World's First "One-Stop" Digital Intelligent Life Science Research Platform. AI4S LAB tightly integrates computing power, data, models, and experiments into a single closed loop: "theoretical prediction → experimental design → automated execution → data analysis".

Abstract

The automation of scientific experimentation is hindered by the inability of LLMs to reliably handle accuracy-critical biological protocols. We introduce BioProBench (550k task instances) to expose the reasoning gap, and BioProAgent, a neuro-symbolic framework. By anchoring probabilistic planning in a deterministic Finite State Machine (FSM), our agent ensures hardware compliance and significantly outperforms GPT-4 baselines.

🧬 Dataset: BioProBench

We present BioProBench, the first large-scale resource dedicated to procedural reasoning in biological experimental protocols, containing a BioProCorpus of nearly 27,000 protocols and over 550,000 structured instances, covering diverse subfields of biology.

BioProBench Statistics

Overview of BioProBench. (a) A foundational corpus (BioProCorpus) of nearly 27,000 professionally authored protocols; (b) a structured dataset of over 550,000 instances derived from BioProCorpus, partitioned into a training set for model fine-tuning and a held-out test set; and (c) a rigorous benchmark with novel, domain-specific metrics, such as keyword-based content metrics and embedding-based structural metrics, that quantify procedural understanding.

🏆 BioProBench Leaderboard

Our benchmark leaderboard provides a comprehensive evaluation of leading LLMs using novel, domain-specific metrics, enabling fine-grained analysis of procedural reasoning performance. It highlights systematic weaknesses in models’ ability to understand, reason about, and generate scientific protocols across four task categories: PQA, ERR, ORD, and GEN.

| Model | Type | PQA (Acc) | ERR (Acc) | ORD (τ) | GEN (BLEU) |
|---|---|---|---|---|---|
| BioProAgent | Our Method | 85.08 🥇 | 81.55 🥇 | 0.891 🥇 | 16.37 🥇 |
| Closed Source Models | | | | | |
| Gemini-2.5-Pro | Proprietary | 70.27 🥈 | 64.83 🥈 | 0.810 🥈 | 7.11 |
| Claude-3.7-Sonnet | Proprietary | 63.90 | 60.93 | 0.734 | 8.38 |
| GPT-4o | Proprietary | 63.50 | 62.67 | 0.627 | 8.92 |
| Gemini-2.0-Flash | Proprietary | 63.44 | 58.67 | 0.637 | 9.18 |
| GPT-4-Turbo | Proprietary | 57.92 | 56.17 | 0.528 | 9.26 |
| o3-mini | Proprietary | 65.67 | 62.33 | 0.733 | 8.69 |
| Open Source Models | | | | | |
| DeepSeek-R1 | Open Source | 67.83 🥉 | 62.92 | 0.745 🥉 | 8.62 |
| DeepSeek-V3 | Open Source | 66.58 | 58.58 | 0.640 | 9.37 🥉 |
| QwQ-32b | Open Source | 63.67 | 63.00 🥉 | 0.705 | 8.40 |
| Qwen-2.5-72b-instruct | Open Source | 65.30 | 59.17 | 0.657 | 10.27 🥈 |

* PQA: Procedural Question Answering, ERR: Error Correction, ORD: Step Ordering, GEN: Protocol Generation.
Data extracted from BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning.
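
To make the ORD column easier to read, here is a minimal sketch of a rank-correlation score for step ordering. It assumes that τ denotes Kendall's tau computed over pairs of steps, which the page does not state explicitly; the function name `kendall_tau` and the example steps are hypothetical and not taken from the BioProBench code.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall's tau between two orderings of the same distinct steps."""
    if sorted(predicted) != sorted(reference):
        raise ValueError("both orderings must contain the same steps")
    n = len(reference)
    if n < 2:
        raise ValueError("need at least two steps")
    # Position of every step in the predicted ordering.
    position = {step: i for i, step in enumerate(predicted)}
    concordant = discordant = 0
    # Each pair (a, b) is taken with a before b in the reference ordering.
    for a, b in combinations(reference, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical protocol steps, used only for illustration.
reference = ["lyse", "centrifuge", "wash", "elute"]
predicted = ["lyse", "wash", "centrifuge", "elute"]
tau = kendall_tau(predicted, reference)
```

With one adjacent pair swapped out of six possible pairs, `tau` is (5 - 1) / 6 ≈ 0.667. A perfect ordering scores 1.0 and a fully reversed ordering scores -1.0.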

Want to see the solution? See how BioProAgent achieves 100% Success Rate on the Agent Execution tasks by solving the reasoning gaps shown above.

🤖 Method: BioProAgent

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

  • State-Augmented Adaptive Planning (FSM-Constrained Script-Free Planner): Abandons rigid linear workflows and adopts a neuro-symbolic framework that constrains probabilistic planning via a deterministic Finite State Machine (FSM). The Agent leverages State-Augmented Planning to flexibly select retrieval, draft generation, or code production based on current states, addressing LLMs' limitations in handling the rigorous constraints of physical actuation in wet-lab scenarios.
  • Scientific Review: Incorporates a strict scientific reflection mechanism (Validator) that automatically checks for missing control groups, logical flaws, implausible parameters, and invalid machine code. This enforces a rigorous Draft-Verify-Rectify (DVR) workflow, ensuring the scientific rigor of experimental protocols.
  • Automation Hardware Alignment: Reads laboratory device and consumable inventories (CSV), mapping natural language steps to specific machine operations via Semantic Symbol Grounding, reducing token consumption by ~6×.
  • Hybrid Memory System:
    • Short-Term Memory: Combines both episodic memory and working memory to maintain long-horizon protocol consistency.
    • Long-Term Memory: Integrates Mem0 to recall past experimental experiences.
  • Human-in-the-Loop: Proactively requests user confirmation at critical decision points to ensure safety in high-risk wet-lab operations.
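
The hardware-alignment idea in the list above can be illustrated with a small sketch: parse a CSV inventory, then map a free-text step to a short device symbol. This is not BioProAgent's implementation. The CSV columns (`id`, `name`, `operation`), the device rows, and the keyword-matching rule are all assumptions made for illustration; the actual grounding method may differ substantially.

```python
import csv
import io

# Hypothetical inventory; the column names and rows are assumptions.
INVENTORY_CSV = """id,name,operation
D1,benchtop centrifuge,centrifuge
D2,thermal cycler,incubate
D3,liquid handler,transfer
"""


def load_inventory(text):
    """Parse the CSV inventory into an {operation: row} lookup."""
    return {row["operation"]: row for row in csv.DictReader(io.StringIO(text))}


def ground_step(step, inventory):
    """Map a free-text step to a (device id, operation) symbol, or None."""
    words = step.lower().replace(",", " ").split()
    for operation, row in inventory.items():
        if operation in words:
            return (row["id"], operation)
    return None  # no device in the inventory supports this step


inventory = load_inventory(INVENTORY_CSV)
grounded = ground_step("Centrifuge the lysate at 12,000 g for 5 min", inventory)
ungrounded = ground_step("Vortex the tube briefly", inventory)
```

Here `grounded` is `("D1", "centrifuge")` and `ungrounded` is `None`. Passing compact symbols such as `D1` downstream instead of full device descriptions is one plausible way a context could be kept short, and a `None` result gives a natural point at which to ask the user for confirmation.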
BioProAgent Architecture

Overview of BioProAgent. (a) Cognitive Memory utilizes Symbolic Grounding Φ to manage context efficiently; (b) Neural Planner π₀ is grounded in a Design-Verify-Rectify FSM Δ(σ); (c) Hierarchical Verification (Kₛ, Kₚ) acts as a safety interlock, enforcing physical compliance by deterministically triggering rectification.
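
As a rough illustration of how a deterministic FSM can constrain a planner and trigger rectification, the sketch below runs a toy DVR loop. It is a sketch under stated assumptions, not the authors' code: the state names, the transition table, the two validation rules, the `MAX_RPM` limit, and the hard-coded `rectify` function (standing in for the neural planner) are all hypothetical.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    VERIFY = "verify"
    RECTIFY = "rectify"
    DONE = "done"


# Deterministic transition table: proposals may only follow these edges.
TRANSITIONS = {
    State.DRAFT: {State.VERIFY},
    State.VERIFY: {State.RECTIFY, State.DONE},
    State.RECTIFY: {State.VERIFY},
    State.DONE: set(),
}

MAX_RPM = 15000  # hypothetical hardware limit


def validate(protocol):
    """Rule-based checks; returns a list of violations (empty means pass)."""
    issues = []
    if not protocol.get("has_control"):
        issues.append("missing control group")
    if protocol.get("rpm", 0) > MAX_RPM:
        issues.append("rpm exceeds hardware limit")
    return issues


def rectify(protocol, issues):
    """Stand-in for the model's repair step; fixes each flagged issue."""
    fixed = dict(protocol)
    if "missing control group" in issues:
        fixed["has_control"] = True
    if "rpm exceeds hardware limit" in issues:
        fixed["rpm"] = MAX_RPM
    return fixed


def run(protocol, max_rounds=3):
    """Drive the protocol through the FSM and return (protocol, trace)."""
    state, trace, issues, rounds = State.DRAFT, [State.DRAFT], [], 0
    while state is not State.DONE:
        if state is State.DRAFT:
            proposed = State.VERIFY
        elif state is State.VERIFY:
            issues = validate(protocol)
            # A failed check deterministically forces rectification.
            proposed = State.RECTIFY if issues else State.DONE
        else:  # State.RECTIFY
            rounds += 1
            if rounds > max_rounds:
                raise RuntimeError("protocol could not be rectified")
            protocol = rectify(protocol, issues)
            proposed = State.VERIFY
        if proposed not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {proposed}")
        state = proposed
        trace.append(state)
    return protocol, trace


draft = {"has_control": False, "rpm": 20000}
final, trace = run(draft)
```

For this draft the trace is DRAFT, VERIFY, RECTIFY, VERIFY, DONE, and `final` has a control group and an `rpm` clamped to `MAX_RPM`. The point of the design is that the protocol cannot reach DONE without passing `validate`, regardless of what the planner proposes.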


BibTeX

@article{BioProAgent2026,
  title={BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning},
  author={Anonymous},
  year={2026}
}

@article{liu2025bioprobench,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Liu, Yuyang and Lv, Liuzhenghao and Zhang, Xiancheng and Yuan, Li and Tian, Yonghong},
  journal={arXiv preprint arXiv:2505.07889},
  year={2025}
}