AI drug discovery benchmarking: sample plates in automated laboratory storage unit

Without Better Benchmarks, AI Drug Discovery Risks Losing Its Way

Jun 4, 2026 | Health Tech

Image Source: Jesse Milns for Conscience

Written by: Contributor

On behalf of: Life Science Daily News

Artificial intelligence is rapidly reshaping drug discovery. Machine learning models are now used to predict protein structures, design molecules, prioritize targets, and optimize clinical candidates. Investment is surging, publications are multiplying, and claims of breakthrough performance are commonplace.

The stakes could hardly be higher. Bringing a new drug to market routinely takes more than a decade and billions of dollars, with most candidates failing along the way. Even modest improvements in how targets are selected or molecules are designed could translate into meaningful reductions in cost, time, and attrition, and ultimately, into patients receiving effective therapies sooner. That promise helps explain the urgency and enthusiasm surrounding AI in drug discovery.

Yet beneath this momentum lies a fundamental challenge: although benchmarking efforts exist, the field lacks the comprehensive benchmarking framework needed to distinguish real progress from hype along the entire drug development continuum. As AI-driven drug discovery grows more complex, benchmarking itself must evolve as a scientific practice with shared standards, coordinated frameworks, and long-term community stewardship.

In many areas of science and technology, benchmarks serve as the compass that keeps innovation on course. In AI-driven drug discovery, that compass exists in pockets, but not yet as a coherent system. Without objective, shared standards for evaluation and performance reporting, it remains difficult to know which methods genuinely work, which perform only under narrow conditions, and which may fail to translate to patient benefit. This is not a theoretical concern. In biomedical AI, reproducibility often lags behind other fields, with replication rates reported below 50 per cent. When results cannot be reliably reproduced, confidence erodes, and it becomes difficult to know how close to a breakthrough we really are.

Biology and drug development do not offer clean or static ground truth, which makes fair evaluation harder than in domains like image recognition or language modeling, but also more necessary. Benchmarking itself must therefore be treated as a scientific endeavor that requires rigor, transparency, and continuous refinement. Poorly designed or weakly governed benchmarks can mislead just as easily as having no benchmarks at all, allowing methodological weaknesses to persist and marketing narratives to carry more weight than evidence. In effect, as problems grow more complex, benchmarks themselves need benchmarks.

As benchmarking becomes a scientific discipline in its own right, the traditional mechanisms we rely on to assess scientific quality are struggling to keep pace. Peer review, while essential, was not designed for an era of fast-moving, highly complex AI systems trained on massive and often opaque datasets. Incremental publication alone cannot provide the comparative, system-level insight needed to evaluate competing approaches at scale.

Other communities have faced similar challenges and shown that there is a better way. Over the past two decades, community-driven “Critical Assessment” initiatives such as the Critical Assessment of protein Structure Prediction (CASP), the DREAM Challenges, and the Critical Assessment of Genome Interpretation (CAGI) have demonstrated how structured benchmarking can accelerate innovation while safeguarding rigor. These efforts provide shared datasets, clear evaluation criteria, and independent assessment of results. CASP, in particular, played a pivotal role in the breakthroughs that led to Nobel Prize-winning AlphaFold, by creating a trusted framework in which progress could be measured honestly and transparently.

Importantly, initiatives like CASP did not constrain innovation by dictating how problems should be solved. Instead, they created a neutral arena in which many different approaches could be tested against the same challenges, providing a trusted space for sustained community participation. The result was not convergence on a single method, but a clearer understanding of what worked, what did not, and why. That clarity enabled rapid iteration and collective learning, laying the groundwork for advances that might otherwise have remained isolated or unrecognized.

The broader AI ecosystem offers another instructive parallel. The rise of large language models has been fueled not only by larger models and much more data, but by a culture of benchmarking: standard datasets, public leaderboards, and transparent comparisons that make performance claims testable. While imperfect, these benchmarks create accountability and enable rapid iteration. They make it possible to see, at a glance, what has improved and what has not.

Drug discovery AI has yet to adopt this mindset at scale. Methods are proliferating faster than we can meaningfully evaluate them, often using proprietary datasets, bespoke metrics, or selective comparisons. The result is fragmentation rather than cumulative progress.

This is why a coordinated community-led approach to benchmarking AI in drug discovery is urgently needed. The Benchmarking, Evaluation, and Assessment Consortium (BEACON) has been launched to help fill this gap by bringing together industry-leading researchers, practitioners, and organizations across biomedical disciplines and sectors. The goal of the consortium is not to crown winners, but to support the community in establishing shared standards and metrics for biomedical AI; enabling open, reproducible evaluations of methods; and creating durable infrastructure so benchmarking efforts can evolve alongside the science itself. By aligning evaluation practices, such an alliance can help ensure that innovation is matched with accountability.

The timing matters. AI’s promise in drug discovery is too important and too consequential to leave unchecked. Without rigorous benchmarking, we risk wasting resources, overlooking effective approaches, and undermining trust, a fragile but essential ingredient in the responsible adoption of AI-enabled drug development. Benchmarking earns confidence by making both strengths and limitations visible, allowing us to identify what truly works, focus investment where it has the greatest impact, and shorten the path from algorithm to approved therapy.

Progress in drug discovery has always depended on more than powerful tools; it depends on the standards we set to judge them. As BEACON begins its work, the biomedical community, across academia, industry, funding, and policy, should treat benchmarking not as an afterthought, but as the foundation of responsible AI in drug discovery. Progress in AI-driven drug discovery will depend not only on better algorithms, but on the collective willingness to measure, compare, and learn together. In drug discovery, where the costs of error are measured in lost time, lost resources, and lost lives, that shared accountability is not optional.

Author Bios

Peng Fu

Peng Fu is the Chief Executive Officer of Conscience, a non-profit focused on enabling drug discovery and development in areas where open sharing and collaboration are key to advancement towards accessible treatments. Through his role at Conscience, he brings together a wide range of experience across biotech, finance, and law. He is the founder of Novatio Ventures, an investment and advisory firm focused on accelerating the commercialization of early-stage life science innovations and providing tailored financing solutions to growth-stage companies. He is also a board member and founding investor of 3io Therapeutics and Precision Proteomics, two Canadian companies developing technologies licensed from academic institutions.

Previously, Peng was Managing Director and Partner at CBC Group and LYFE Capital, two global healthcare private equity firms. He has also held senior roles at DRI Capital (now DRI Healthcare), Amgen Canada, and Teva Pharmaceuticals. Peng began his career as a life sciences attorney in private practice at Torys and Gilbert’s LLP in Toronto. He holds an MSc in biology from the University of Toronto and a JD from Queen’s University.

Estrid Jakobsen, PhD

Estrid Jakobsen is Conscience’s Communications Lead. She is passionate about science communication and helping researchers make their findings open and accessible to everyone.

Estrid has over 12 years of education and experience as a neuroscientist and holds a PhD from the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, Germany. Her time as a researcher led her to realize that the aspects of science that she thrived at and enjoyed the most were translating knowledge for the wider community and thinking about science from a broader perspective. Prior to joining Conscience, Estrid spent two years as a postdoctoral fellow at McGill University, followed by five years as Communications and Student Engagement Manager at the Quebec Bio-imaging Network.

References:

The unreasonable effectiveness of open science in AI: a replication study: https://dl.acm.org/doi/10.1609/aaai.v39i25.34818

The AI Imperative: Scaling High-Quality Peer Review in Machine Learning: https://arxiv.org/abs/2506.08134

The views expressed in this article are those of the author and do not represent the editorial position of Life Science Daily News. Contributors may have a commercial interest in the topics they write about. For more information see our Contributor Policy
