UniBio Intelligence Wins Bronze Medal at NeurIPS 2025 CUREBench Competition

Published on December 6, 2025 • 10 min read

We are incredibly excited to share that UniBio Intelligence took home the bronze medal at the NeurIPS 2025 CUREBench competition! If you're not familiar with it, CUREBench is one of the toughest benchmarks for AI systems tackling medical reasoning. Seeing our name on that leaderboard was a huge validation for us: it demonstrates that our approach of wiring up multiple language models with a specialized vector database and deterministic tools actually works in practice.

Competition Overview: CUREBench

The CUREBench competition, part of the NeurIPS 2025 competitions track and hosted on Kaggle, evaluates AI systems on two distinct tracks:

  • Track 1 - CUREBench-Internal: Tests models' internal clinical knowledge and reasoning without external tools
  • Track 2 - CUREBench-Tools: Evaluates systems' ability to integrate external medical databases, clinical trial registries, and pharmacological resources

UniBio Intelligence team wins Bronze Medal at the NeurIPS 2025 CUREBench competition

The Journey to Bronze: From Public Leaderboard to Final Results

The NeurIPS 2025 CUREBench competition was a marathon: months of intense late-night coding, tweaking, and testing. In our previous detailed blog post, we laid out exactly how we tackled the public leaderboard phase. To recap, we reached those top rankings by relying on a few key ideas:

  • Dynamic weighted ensemble methods combining GPT-4.1, Gemini Pro, and Claude Sonnet 4.5
  • Strategic tool integration with curated MCP servers for medical databases
  • Data-driven tool selection based on usage pattern analysis
  • Retrieval-Augmented Generation (RAG) for consistent, grounded responses
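
The dynamic weighted voting step at the heart of the ensemble can be sketched in a few lines. The model names and weights below are purely illustrative, not our production configuration:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Pick the answer option with the highest total weight across models."""
    totals = defaultdict(float)
    for model, answer in predictions.items():
        totals[answer] += weights.get(model, 0.0)
    return max(totals, key=totals.get)

# Illustrative weights (e.g. derived from validation accuracy),
# not our actual production values.
weights = {"gpt-4.1": 0.38, "gemini-pro": 0.32, "claude-sonnet-4.5": 0.30}
predictions = {"gpt-4.1": "B", "gemini-pro": "B", "claude-sonnet-4.5": "C"}
print(weighted_vote(predictions, weights))  # B
```

In practice the weights were updated dynamically per question category rather than fixed globally, but the aggregation itself stays this simple.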

Presenting Our Work at NeurIPS 2025

Being on the ground in San Diego for the NeurIPS 2025 conference was an unforgettable experience. We got to stand up in front of an amazing crowd of machine learning and medical AI researchers to share what we'd built. Here’s a quick rundown of what we talked about:

Ensemble Architecture Design

How we combined multiple frontier language models using dynamic weighted voting to achieve robust performance across diverse medical questions

Tool Integration Strategy

Our data-driven approach to selecting and integrating databases via Model Context Protocol (MCP) servers

Model Consistency Analysis

Insights from analyzing the 60-70% consistency rate we observed across model responses

Production Deployment Lessons

Practical challenges and solutions for deploying AI systems, from API rate limiting to safety filter configuration

Presentation slide detailing tool integration strategy

A Global Competition: Top Teams from Around the World

The CUREBench leaderboard was packed with brilliant teams from Big Tech (with ByteDance winning the Gold medal), universities, labs, and startups spanning multiple continents.

Award ceremony at NeurIPS 2025 CUREBench competition

Top Team Locations

Interactive map showing the locations of UniBio (Bronze), the other competitor teams, the conference venue, and the organizers.

Learning from the Competition: Analyzing Top Team Approaches

Honestly, half the fun was seeing what everyone else came up with. The talks and poster sessions were full of insight.

Architectural Patterns

1. Structured Multi-Stage Workflows with Verification

Beijing Logic Intelligence Technology introduced a "Chain-of-Verification" workflow with mandatory verification checkpoints - represented by "lock icons" in their architecture - that prevented the model from proceeding without validating parameters or conducting comparative analysis. This safety-first design achieved significant gains:

  • +9.01 points from tool retrieval vs. vanilla LLM
  • +3.13 points from structured Extract-Think-Decide pipeline
  • +4.01 points from mandatory safety constraints

Their hierarchical tool use strategy prioritized professional databases (ToolUniverse) over generic web search, ensuring data quality and clinical compliance.
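
That hierarchy amounts to a simple fallback chain. In the sketch below, the source functions are hypothetical stand-ins for ToolUniverse-style database calls and a generic web search, not real APIs:

```python
def hierarchical_lookup(query, professional_sources, web_search):
    """Query curated professional databases first; fall back to generic
    web search only when none of them return evidence."""
    for source in professional_sources:
        result = source(query)
        if result is not None:
            return result, "professional"
    return web_search(query), "web"

# Hypothetical stand-ins for illustration only.
drug_labels = lambda q: "label text" if "warfarin" in q else None
trial_registry = lambda q: None
web = lambda q: "web snippet"

print(hierarchical_lookup("warfarin dosing", [drug_labels, trial_registry], web))
# ('label text', 'professional')
```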

2. Decoupled Executor-Analyst Frameworks

Team CureAgent presented a training-free architecture that separated "The Hand" (tool execution) from "The Brain" (clinical reasoning). This solved the "Context Utilization Failure" where models retrieve evidence but fail to ground diagnoses in that information:

  • TxAgent (Llama-3.1-8B) handled syntactic tool precision and navigation of 200+ tools
  • Gemini 2.5 focused purely on clinical reasoning, freed from tool burdens
  • Stratified Ensemble topology preserved evidentiary diversity through late fusion

Key insight: Monolithic models struggle to balance syntactic precision for tool execution with semantic reasoning robustness. Decoupling enabled "hot-swapping" reasoning backbones without retraining.

3. Knowledge Graph-Augmented Tool Use

MedPathAgent (RMIT University) addressed a fundamental limitation: tool-augmented reasoning leverages structured knowledge but lacks entity relationships, while KG-augmented reasoning understands relationships but lacks accurate context. Their hybrid approach:

  • Used SapBERT embeddings for entity linking to the PrimeKG biomedical knowledge graph
  • Extracted top-K reasoning paths combining node and edge relations
  • Integrated KG paths with TxAgent tool retrieval for grounded decision-making

Critical finding: MedPathAgent remained stable across increasing reasoning complexity, while pure TxAgent showed sensitivity to longer reasoning paths - demonstrating the value of relational evidence for complex queries.

Model Selection and Ensemble Strategies

Multi-Model Diversity (RAISE-ODL)

Used exponential weighted voting across diverse models: gemini-2.5-pro, kimi-k2, qwen3-max, gpt-5. Formula: w_i = score_i^α / Σ_j score_j^α

Key: Contextual shots via Q-Similarity from validation set, with 0-shot vs. 20-shot comparisons
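
Their weighting formula is easy to reproduce. Here is the normalization with an assumed α; larger α concentrates weight on the higher-scoring models:

```python
def exponential_weights(scores, alpha=2.0):
    """w_i = score_i**alpha / sum_j(score_j**alpha); weights sum to 1."""
    powered = [s ** alpha for s in scores]
    total = sum(powered)
    return [p / total for p in powered]

# Validation scores for four models (illustrative numbers, not RAISE-ODL's).
print(exponential_weights([0.80, 0.75, 0.70, 0.60], alpha=3.0))
```

With α = 0 this collapses to uniform voting; raising α interpolates toward trusting only the single best model.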

Structured Question Decomposition (DMIS Lab)

GPT-5 pipeline: Break questions into sub-questions → ToolRAG search → Tool verification → Only verified results used for final answer

Averaged 2-6 sub-questions per query; 2-4 tools per final generation
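
A minimal sketch of that decompose-search-verify loop; the callables here are toy placeholders for the actual GPT-5 and ToolRAG components:

```python
def answer_with_decomposition(question, decompose, search, verify, synthesize):
    """Keep only sub-answers that pass the verification gate, then
    synthesize the final answer from the verified evidence."""
    verified = []
    for sub_q in decompose(question):   # typically 2-6 sub-questions
        evidence = search(sub_q)        # ToolRAG-style retrieval
        if verify(sub_q, evidence):     # unverified results are dropped
            verified.append((sub_q, evidence))
    return synthesize(question, verified)

# Toy placeholders to show the control flow.
decompose = lambda q: ["dose?", "interactions?"]
search = lambda s: s + " -> evidence"
verify = lambda s, e: s == "dose?"
synthesize = lambda q, v: [sq for sq, _ in v]

print(answer_with_decomposition("warfarin question",
                                decompose, search, verify, synthesize))
# ['dose?']
```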

Open-Source Focus (Constanze Care)

Systematically evaluated 15+ open-source models as alternatives to frontier models. Best results: Llama 4 Maverick 17Bx128E (84.31%) and Kimi K2 (81.92%)

Innovative solver-judge architecture with prompt engineering for medical domain

Prompt Engineering Insights (VIM)

Discovered models frequently output "E" or "None" (invalid). Solution: Remove references, force A-D selection, add "Clinical Assistant" persona

Result: +29.04 points (34.04% → 63.40%) through constrained prompting alone
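
Part of that fix can also be enforced in post-processing. Below is our own hedged sketch of a parser that maps stray outputs back into the valid A-D space, an assumption about how one might implement it rather than VIM's actual code:

```python
import re

def extract_choice(model_output, valid="ABCD"):
    """Return the first standalone valid letter; anything else ('E',
    'None', free text) falls back to the first option rather than
    producing an invalid answer."""
    match = re.search(rf"\b([{valid}])\b", model_output)
    return match.group(1) if match else valid[0]

print(extract_choice("The answer is C."))   # C
print(extract_choice("None of the above"))  # A (fallback)
```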

Tool Integration and Retrieval Innovations

Team MedAI: DailyMed Integration

Extended ToolUniverse with DailyMed as additional medical knowledge source. Compared retrieval strategies from BM25 to dense retrievers (Qwen2-1.5B). Found that TxAgent's finetuned retriever achieved best balance of precision and recall.

Consideration: DailyMed provides broader information that may require multiple ToolUniverse function calls, creating tradeoffs between precision and coverage.

CliniTHink (PreceptorAI): Multi-Agent Clinical Workflow

Inspired by physician workflows: Analysis Phase (identify required factoids) → Planning Phase (tool category selection) → Execution Phase (retrieval + context exploration) → Answering Phase

Used hierarchical agent coordination: GPT-5 for high-reasoning tasks, Opus 4 for planning, Sonnet 4.5 for tool selection, GPT-5-mini for retrieval. Modified ToolUniverse tools based on clinician feedback.

York University: Web Search Integration

Simple but effective: GPT-5 High Reasoning Mode + Web Search Agent accessing 40+ trusted medical domains. Achieved strong performance (69.38% internal, 74.43% agentic) with straightforward architecture.

Demonstrates that architectural simplicity combined with high-quality models can compete with more complex systems.

Key Patterns and Open Challenges

Convergent Solutions Across Top Teams

  • Mandatory Verification Checkpoints: Multiple top teams independently implemented verification gates to prevent hallucinations and ensure clinical safety
  • Question Decomposition: Breaking complex queries into sub-questions emerged as a dominant strategy for structured reasoning
  • Ensemble Methods: All medal-winning teams used some form of multi-model or multi-agent architecture

Shared Open Challenges Identified

  • Context-Performance Paradox: Extending contexts beyond 12k tokens degraded accuracy from 94% → 87.9% due to noise accumulation
  • Curse of Dimensionality: Expanding tool library from 200+ → 600+ tools caused performance drop from 92.0% → 87.5% - highlighting need for hierarchical tool indexing
  • Model Consistency: Individual models showed 60-70% consistency on repeated queries, reinforcing need for ensemble validation
  • KG Path Coverage: 240/459 validation questions lacked extractable KG paths, suggesting some queries don't benefit from relational reasoning
  • Small Model Limitations: Models <70B parameters consistently lacked sufficient medical knowledge for reliable performance

Thank You to the CUREBench Organizers and Community

We can't wrap this up without a massive shoutout to the CUREBench organizers. Putting together a benchmark this rigorous (and keeping it running smoothly) is no small feat.

Resources: Poster and Presentation Materials

  • Competition Poster: Download PDF - Comprehensive overview of our methodology and results
  • Presentation Slides: Download PDF - Detailed technical presentation from the NeurIPS workshop
  • Detailed Technical Blog: Read our deep dive - Complete analysis of our approach during the competition

Join Us in Advancing AI for Drug Discovery

We didn't just build this system to win a medal. Everything we learned from the CUREBench trenches is getting baked directly into the AI platform we use for biologics discovery here at UniBio Intelligence. We're always looking to collaborate with researchers, biotech companies, and healthcare organizations.

Get in touch: Contact us at contact@unibiointelligence.com to discuss how our AI infrastructure can accelerate your drug discovery programs

Citing This Work

If you reference our CUREBench competition work or findings in your research, please cite:

@misc{ubi2025neuripsbronze,
  author = {UniBio Intelligence},
  title = {UniBio Intelligence Wins Bronze Medal at NeurIPS 2025 CUREBench Competition},
  year = {2025},
  url = {https://unibiointelligence.com/blog/neurips-2025-bronze-medal},
  note = {Accessed: 2026-05-12}
}

References

  1. CUREBench Competition (2025). "Clinical Understanding and Reasoning Evaluation Benchmark." NeurIPS 2025 Workshop. https://curebench.ai/
  2. CUREBench Competition - Kaggle (2025). "CURE-Bench: Reasoning Models for Drug Decision-Making in Precision Therapeutics." https://www.kaggle.com/competitions/cure-bench
  3. Jiang, L. Y., et al. (2024). "Large Language Model Synergy for Ensemble Learning in Medical Question Answering." Journal of Medical Internet Research, 27, e70080. https://pmc.ncbi.nlm.nih.gov/articles/PMC12337233/
  4. Gu, Y., et al. (2023). "Large Language Models Vote: Prompting for Rare Disease Identification." arXiv preprint arXiv:2308.12890. https://arxiv.org/abs/2308.12890
  5. Kim, D., et al. (2024). "Retrieval-Augmented Generation for Generative Artificial Intelligence in Medicine." npj Health Systems. https://www.nature.com/articles/s44401-024-00004-1
  6. Jin, Q., et al. (2019). "PubMedQA: A Dataset for Biomedical Research Question Answering." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Medical question answering benchmarks provide USMLE-style questions for evaluating models' medical reasoning abilities.
  7. Pal, A., et al. (2022). "MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering." Conference on Health, Inference, and Learning. https://paperswithcode.com/dataset/medmcqa
  8. Biomni Consortium (2025). "Biomni: A General-Purpose Biomedical AI Agent." bioRxiv. https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1
  9. TxAgent Team, Zitnik Lab (2025). "TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools." arXiv preprint arXiv:2503.10970. https://arxiv.org/abs/2503.10970
  10. Anthropic (2024). "Introducing the Model Context Protocol." https://www.anthropic.com/news/model-context-protocol
  11. Model Context Protocol Documentation. https://modelcontextprotocol.io
  12. UniBio Intelligence (2025). "Our Journey to the Top of CUREBench@NeurIPS 2025: A Deep Dive into Medical AI Benchmarking." https://unibiointelligence.com/blog/curebench-neurips-2025
  13. Beijing Logic Intelligence Technology (2025). "Chain-of-Verification Workflow for Medical AI Reasoning." CUREBench@NeurIPS 2025 Competition Submission. Multi-stage verification system with safety checkpoints for medical AI.
  14. DMIS Lab, Korea University (2025). "Structured Question Decomposition for Medical AI." CUREBench@NeurIPS 2025 Competition Submission. Systematic approach to breaking down complex medical queries into manageable sub-questions.
  15. CureAgent Team (Singapore/Cambridge) (2025). "Executor-Analyst Framework for Medical Reasoning." CUREBench@NeurIPS 2025 Competition Submission. Training-free dual-component architecture separating execution from analytical reasoning.
  16. RAISE-ODL, Shanghai AI Laboratory (2025). "Multi-Model Exponential Weighted Voting for Medical AI." CUREBench@NeurIPS 2025 Competition Submission. Ensemble technique combining predictions from diverse frontier models.
  17. RMIT University MedPathAgent Team (2025). "Knowledge Graph-Augmented Tool Use in Medical AI." CUREBench@NeurIPS 2025 Competition Submission. Integration of structured medical knowledge with LLM reasoning capabilities.
  18. Team MedAI, L3S Research Center (Hannover) (2025). "DailyMed Integration for Therapeutic Knowledge Retrieval." CUREBench@NeurIPS 2025 Competition Submission. Leveraging FDA drug information database for evidence-based medical reasoning.
  19. PreceptorAI (VISTEC/Mahidol University) (2025). "Clinician-Guided Multi-Agent Workflow for Medical AI." CUREBench@NeurIPS 2025 Competition Submission. Human-in-the-loop inspired architecture mimicking clinical decision-making processes.
  20. VIM Team (AIRI/Innopolis University) (2025). "Prompt Engineering and Constrained Choice Strategies for Medical Reasoning." CUREBench@NeurIPS 2025 Competition Submission. Optimization of model prompts and answer selection techniques.