BioPro Project: Agent & Benchmark

Abstract

The automation of scientific experimentation is hindered by the inability of LLMs to reliably handle accuracy-critical biological protocols. We introduce BioProBench (550k task instances) to expose the reasoning gap, and BioProAgent, a neuro-symbolic framework. By anchoring probabilistic planning in a deterministic Finite State Machine (FSM), our agent ensures hardware compliance and significantly outperforms GPT-4 baselines.

🧬 Dataset: BioProBench

We present BioProBench, the first large-scale resource dedicated to procedural reasoning in biological experimental protocols, containing a BioProCorpus of nearly 27,000 protocols and over 550,000 structured instances, covering diverse subfields of biology.

Overview of BioProBench. (a) A foundational corpus of nearly 27,000 professionally authored protocols; (b) a structured dataset of over 550,000 instances derived from this BioProCorpus, partitioned into a training set for model fine-tuning and a held-out test set; and (c) a benchmark with domain-specific metrics, such as keyword-based content metrics and embedding-based structural metrics, that quantify procedural understanding.

🏆 BioProBench Leaderboard

Our benchmark leaderboard evaluates leading LLMs with domain-specific metrics, enabling fine-grained analysis of procedural reasoning performance. It highlights systematic weaknesses in models’ ability to understand, reason about, and generate scientific protocols across four task categories: PQA, ERR, ORD, and GEN.

Main Leaderboard

Model	Type	PQA (Acc)	ERR (Acc)	ORD (τ)	GEN (BLEU)
BioProAgent	Our Method	85.08 🥇	81.55 🥇	0.891 🥇	16.37 🥇
Closed Source Models
Gemini-2.5-Pro	Proprietary	70.27 🥈	64.83 🥈	0.810 🥈	7.11
Claude-3.7-Sonnet	Proprietary	63.90	60.93	0.734	8.38
GPT-4o	Proprietary	63.50	62.67	0.627	8.92
Gemini-2.0-Flash	Proprietary	63.44	58.67	0.637	9.18
GPT-4-Turbo	Proprietary	57.92	56.17	0.528	9.26
o3-mini	Proprietary	65.67	62.33	0.733	8.69
Open Source Models
DeepSeek-R1	Open Source	67.83 🥉	62.92	0.745 🥉	8.62
DeepSeek-V3	Open Source	66.58	58.58	0.640	9.37 🥉
QwQ-32b	Open Source	63.67	63.00 🥉	0.705	8.40
Qwen-2.5-72b-instruct	Open Source	65.30	59.17	0.657	10.27 🥈

* PQA: Procedural Question Answering, ERR: Error Correction, ORD: Step Ordering, GEN: Protocol Generation.
Data extracted from BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning.
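
As an illustration of the ORD column, the sketch below scores a predicted step ordering against a reference ordering. It assumes that τ denotes Kendall's tau, the usual rank correlation for orderings; the benchmark's own scoring script may differ in details, and the function name and the example steps are hypothetical.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall's tau between two orderings of the same distinct steps."""
    if sorted(predicted) != sorted(reference) or len(set(reference)) != len(reference):
        raise ValueError("orderings must contain the same distinct steps")
    n = len(reference)
    if n < 2:
        return 1.0
    # Position of each step in the predicted ordering.
    rank = {step: i for i, step in enumerate(predicted)}
    concordant = discordant = 0
    # Visit every pair in reference order; a pair is concordant when the
    # prediction keeps the same relative order.
    for a, b in combinations(reference, 2):
        if rank[a] < rank[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical example: two adjacent steps are swapped in the prediction.
reference = ["lyse cells", "centrifuge", "collect supernatant", "measure concentration"]
predicted = ["lyse cells", "collect supernatant", "centrifuge", "measure concentration"]
tau = kendall_tau(predicted, reference)  # 5 concordant, 1 discordant pair out of 6
```

A perfect ordering scores 1.0 and a fully reversed one scores -1.0, so a single adjacent swap in a four-step protocol already costs a third of the score.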

BioProAgent, described in the next section, achieves a 100% success rate on the Agent Execution tasks by addressing the reasoning gaps shown above.

🤖 Method: BioProAgent

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

State-Augmented Adaptive Planning (FSM-Constrained Script-Free Planner): Replaces rigid linear workflows with a neuro-symbolic framework that constrains probabilistic planning via a deterministic Finite State Machine (FSM). The agent uses State-Augmented Planning to select retrieval, draft generation, or code production based on the current state, addressing LLMs' limitations in handling the rigorous constraints of physical actuation in wet-lab scenarios.
Scientific Review: Incorporates a strict scientific reflection mechanism (Validator) that automatically checks for missing control groups, logical flaws, implausible parameters, and invalid machine code. This enforces a rigorous Draft-Verify-Rectify (DVR) workflow, ensuring the scientific rigor of experimental protocols.
Automation Hardware Alignment: Reads laboratory device and consumable inventories (CSV), mapping natural language steps to specific machine operations via Semantic Symbol Grounding, reducing token consumption by ~6×.
Hybrid Memory System:
- Short-Term Memory: Combines episodic memory and working memory to maintain long-horizon protocol consistency.
- Long-Term Memory: Integrates Mem0 to recall past experimental experiences.
Human-in-the-Loop: Proactively requests user confirmation at critical decision points to ensure safety in high-risk wet-lab operations.
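
The Draft-Verify-Rectify loop described in the list above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the project's implementation: the state names, event names, the toy `validate` check, and the `draft_fn` and `rectify_fn` callables (stand-ins for LLM calls) are all hypothetical. The point it shows is that the planner may propose content, but only the deterministic transition table decides what happens next.

```python
from enum import Enum, auto


class State(Enum):
    DRAFT = auto()
    VERIFY = auto()
    RECTIFY = auto()
    DONE = auto()


# Deterministic transition table (hypothetical): (state, event) -> next state.
TRANSITIONS = {
    (State.DRAFT, "drafted"): State.VERIFY,
    (State.VERIFY, "passed"): State.DONE,
    (State.VERIFY, "failed"): State.RECTIFY,
    (State.RECTIFY, "rectified"): State.VERIFY,
}


def step(state, event):
    """Apply one transition; anything outside the table is rejected."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def validate(protocol):
    """Toy validator: flags a protocol that has no control step."""
    issues = []
    if not any("control" in s.lower() for s in protocol):
        issues.append("missing control group")
    return issues


def run(draft_fn, rectify_fn, max_rounds=3):
    """Drive drafting and rectification until the validator passes."""
    state, protocol, issues, rounds = State.DRAFT, [], [], 0
    trace = [state]
    while state is not State.DONE:
        if state is State.DRAFT:
            protocol, event = draft_fn(), "drafted"
        elif state is State.VERIFY:
            issues = validate(protocol)
            event = "failed" if issues else "passed"
        else:  # State.RECTIFY
            rounds += 1
            if rounds > max_rounds:
                raise RuntimeError("could not rectify: " + "; ".join(issues))
            protocol, event = rectify_fn(protocol, issues), "rectified"
        state = step(state, event)
        trace.append(state)
    return protocol, trace


# Stand-ins for the neural planner (hypothetical).
def toy_draft():
    return ["Add 10 uL sample to well A1", "Incubate 30 min at 37 C"]


def toy_rectify(protocol, issues):
    return protocol + ["Add 10 uL buffer to well A2 as negative control"]


protocol, trace = run(toy_draft, toy_rectify)
```

Here the first draft fails verification, one rectification adds the control step, and the second verification passes. Because `DONE` is reachable only through a `passed` event, an unverified protocol cannot leave the loop.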

Overview of BioProAgent. (a) Cognitive Memory utilizes Symbolic Grounding Φ to manage context efficiently; (b) Neural Planner π₀ is grounded in a Design-Verify-Rectify FSM Δ(σ); (c) Hierarchical Verification (Kₛ, Kₚ) acts as a safety interlock, enforcing physical compliance by deterministically triggering rectification.
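
One way to picture how symbol grounding keeps the context small: the planner sees only short symbols and device names, while full inventory records are looked up deterministically when a step is resolved. The sketch below is an assumption-laden illustration, not the project's Φ; the CSV columns, the `D1`-style symbols, and the `ASPIRATE` operation are hypothetical.

```python
import csv
import io

# Hypothetical inventory; a real deployment would read this from a CSV file.
INVENTORY_CSV = """device_id,name,type,max_volume_ul
pip-300-left,P300 single-channel pipette,pipette,300
tc-block-01,Thermocycler block,thermocycler,
plate-96-a,96-well PCR plate,labware,200
"""


def build_symbol_table(csv_text):
    """Assign a short symbol (D1, D2, ...) to every inventory row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {f"D{i}": row for i, row in enumerate(rows, start=1)}


def compact_context(table):
    """Text shown to the planner: symbols and names only, no full schema."""
    return "\n".join(f"{sym}: {row['name']}" for sym, row in table.items())


def resolve(symbolic_step, table):
    """Expand a planner step such as 'ASPIRATE D1 D3' into device ids."""
    op, *symbols = symbolic_step.split()
    unknown = [s for s in symbols if s not in table]
    if unknown:
        raise KeyError(f"unknown symbols: {unknown}")
    return {"op": op, "targets": [table[s]["device_id"] for s in symbols]}


table = build_symbol_table(INVENTORY_CSV)
context = compact_context(table)
command = resolve("ASPIRATE D1 D3", table)
```

A step that names a symbol missing from the table raises an error instead of reaching the hardware, which is the kind of deterministic check the verification layer relies on.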

BibTeX

@article{BioProAgent2026,
  title={BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning},
  author={Anonymous},
  year={2026}
}

@article{liu2025bioprobench,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Liu, Yuyang and Lv, Liuzhenghao and Zhang, Xiancheng and Yuan, Li and Tian, Yonghong},
  journal={arXiv preprint arXiv:2505.07889},
  year={2025}
}
