Without Better Benchmarks, AI Drug Discovery Risks Losing Its Way

Jun 4, 2026 | Health Tech

Image Source: Jesse Milns for Conscience
Written by: Contributor
On behalf of: Life Science Daily News

Artificial intelligence is rapidly reshaping drug discovery. Machine learning models are now used to predict protein structures, design molecules, prioritize targets, and optimize clinical candidates. Investment is surging, publications are multiplying, and claims of breakthrough performance are commonplace.

The stakes could hardly be higher. Bringing a new drug to market routinely takes more than a decade and billions of dollars, with most candidates failing along the way. Even modest improvements in how targets are selected or molecules are designed could translate into meaningful reductions in cost, time, and attrition, and ultimately, into patients receiving effective therapies sooner. That promise helps explain the urgency and enthusiasm surrounding AI in drug discovery.

Yet beneath this momentum lies a fundamental challenge: while benchmarking efforts do exist, the field lacks the comprehensive benchmarking framework needed to distinguish real progress from hype along the entire drug development continuum. As AI-driven drug discovery grows more complex, benchmarking itself must evolve as a scientific practice with shared standards, coordinated frameworks, and long-term community stewardship.

In many areas of science and technology, benchmarks serve as the compass that keeps innovation on course. In AI-driven drug discovery, that compass exists in pockets, but not yet as a coherent system. Without objective, shared standards for evaluation and performance reporting, it remains difficult to know which methods genuinely work, which perform only under narrow conditions, and which may fail to translate to patient benefit. This is not a theoretical concern. In biomedical AI, reproducibility often lags behind other fields, with replication rates reported below 50 per cent. When results cannot be reliably reproduced, confidence erodes, and it becomes difficult to know how close to a breakthrough we really are.

Biology and drug development do not offer clean or static ground truth, which makes fair evaluation harder than in domains like image recognition or language modeling, but also more necessary. This means that benchmarking itself must be treated as a scientific endeavor that requires rigor, transparency, and continuous refinement. Poorly designed or weakly governed benchmarks can mislead just as easily as having no benchmarks at all, allowing methodological weaknesses to persist and marketing narratives to carry more weight than evidence. In effect, as problems grow more complex, benchmarks themselves need benchmarks.

As benchmarking becomes a scientific discipline in its own right, the traditional mechanisms we rely on to assess scientific quality are struggling to keep pace. Peer review, while essential, was not designed for an era of fast-moving, highly complex AI systems trained on massive and often opaque datasets. Incremental publication alone cannot provide the comparative, system-level insight needed to evaluate competing approaches at scale.

Other communities have faced similar challenges and shown that there is a better way. Over the past two decades, community-driven “Critical Assessment” initiatives such as the Critical Assessment of protein Structure Prediction (CASP), the DREAM Challenges, and the Critical Assessment of Genome Interpretation (CAGI) have demonstrated how structured benchmarking can accelerate innovation while safeguarding rigor. These efforts provide shared datasets, clear evaluation criteria, and independent assessment of results. CASP, in particular, played a pivotal role in the breakthroughs that led to the Nobel Prize-winning AlphaFold, by creating a trusted framework in which progress could be measured honestly and transparently.

Importantly, initiatives like CASP did not constrain innovation by dictating how problems should be solved. Instead, they offered a neutral arena in which many different approaches could be tested against the same challenges, building a trusted space for sustained community participation. The result was not convergence on a single method, but a clearer understanding of what worked, what did not, and why. That clarity enabled rapid iteration and collective learning, laying the groundwork for advances that might otherwise have remained isolated or unrecognized.

The broader AI ecosystem offers another instructive parallel. The rise of large language models has been fueled not only by larger models and much more data, but by a culture of benchmarking: standard datasets, public leaderboards, and transparent comparisons that make performance claims testable. While imperfect, these benchmarks create accountability and enable rapid iteration. They make it possible to see, at a glance, what has improved and what has not.

Drug discovery AI has yet to adopt this mindset at scale. Methods are proliferating faster than we can meaningfully evaluate them, often using proprietary datasets, bespoke metrics, or selective comparisons. The result is fragmentation rather than cumulative progress.

This is why a coordinated community-led approach to benchmarking AI in drug discovery is urgently needed. The Benchmarking, Evaluation, and Assessment Consortium (BEACON) has been launched to help fill this gap by bringing together industry-leading researchers, practitioners, and organizations across biomedical disciplines and sectors. The goal of the consortium is not to crown winners, but to support the community in establishing shared standards and metrics for biomedical AI; enabling open, reproducible evaluations of methods; and creating durable infrastructure so benchmarking efforts can evolve alongside the science itself. By aligning evaluation practices, such an alliance can help ensure that innovation is matched with accountability.

The timing matters. AI’s promise in drug discovery is too consequential to leave unchecked. Without rigorous benchmarking, we risk wasting resources, overlooking effective approaches, and undermining trust, a fragile but essential ingredient in the responsible adoption of AI-enabled drug development. Benchmarking earns confidence by making both strengths and limitations visible, allowing us to identify what truly works, focus investment where it has the greatest impact, and shorten the path from algorithm to approved therapy.

Progress in drug discovery has always depended on more than powerful tools; it depends on the standards we set to judge them. As BEACON begins its work, the biomedical community, across academia, industry, funding, and policy, should treat benchmarking not as an afterthought, but as the foundation of responsible AI in drug discovery. Progress in AI-driven drug discovery will depend not only on better algorithms, but on the collective willingness to measure, compare, and learn together. In drug discovery, where the costs of error are measured in lost time, lost resources, and lost lives, that shared accountability is not optional.

 

Author Bios

Peng Fu

Peng Fu is the Chief Executive Officer of Conscience, a non-profit focused on enabling drug discovery and development in areas where open sharing and collaboration are key to advancement towards accessible treatments. Through his role at Conscience, he brings together a wide range of experience across biotech, finance, and law. He is the founder of Novatio Ventures, an investment and advisory firm focused on accelerating the commercialization of early-stage life science innovations and providing tailored financing solutions to growth-stage companies. He is also a board member and founding investor of 3io Therapeutics and Precision Proteomics, two Canadian companies developing technologies licensed from academic institutions.

Previously, Peng was Managing Director and Partner at CBC Group and LYFE Capital, two global healthcare private equity firms. He has also held senior roles at DRI Capital (now DRI Healthcare), Amgen Canada, and Teva Pharmaceuticals. Peng began his career as a life sciences attorney in private practice at Torys and Gilbert’s LLP in Toronto. He holds an MSc in biology from the University of Toronto and a JD from Queen’s University.


Estrid Jakobsen, PhD

Estrid Jakobsen is Conscience’s Communications Lead. She is passionate about science communication and helping researchers make their findings open and accessible to everyone.

Estrid has over 12 years of education and experience as a neuroscientist and holds a PhD from the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, Germany. Her time as a researcher led her to realize that the aspects of science that she thrived at and enjoyed the most were translating knowledge for the wider community and thinking about science from a broader perspective. Prior to joining Conscience, Estrid spent two years as a postdoctoral fellow at McGill University, followed by five years as Communications and Student Engagement Manager at the Quebec Bio-imaging Network.

References:

The unreasonable effectiveness of open science in AI: a replication study: https://dl.acm.org/doi/10.1609/aaai.v39i25.34818

The AI Imperative: Scaling High-Quality Peer Review in Machine Learning: https://arxiv.org/abs/2506.08134

The views expressed in this article are those of the author and do not represent the editorial position of Life Science Daily News. Contributors may have a commercial interest in the topics they write about. For more information see our Contributor Policy.
