
Introducing BenchBB and the community paper of the Protein Design Competition


Published on: 2025-04-24
TL;DR

We wrote a community paper about our Protein Design Competition, teaming up with your favourite protein designers from both rounds. 

We aimed to explore the competition data in as many ways as we could: take stock of the state of the field, see what SotAs have been established across the rounds, and find out which metric is most predictive of binding and expression.

The one thing we kept running into throughout: the lack of a standardized benchmark set for protein binder design, which makes comparisons difficult and leaves the field without consistent, high-quality data. This is why we close the paper by creating BenchBB, the Bench-tested Binder Benchmark — a curated set of 7 protein targets designed to capture diverse binder design challenges while remaining accessible enough for wide-scale lab validation.

Want to read the community paper? Check it out on bioRxiv!

Want to test your protein-design model on BenchBB? Go to benchbb.com and get started!

The post-competition paper

As you might remember, we hosted the Protein Design Competition. Briefly, we called for protein designers to create novel binders for EGFR and received over 1857 total submissions, 600 of which we then experimentally tested in our lab, validating a total of 60 novel binders! For details, check our previous blog posts. However, in those posts we could barely scratch the surface of possible data exploration. There was strong demand to do more analyses and collect all the learnings, including those our participants had already gathered throughout the rounds.

The spirit of collaboration was strong throughout the competition, so it did not seem fitting for us to act as the curators and gatekeepers of what would be included or said in such a writeup. Instead, we decided to form a consortium and launched an open call for collaboration. To our delight, several of the participants responded, and the excitement to contribute turned into a wide range of analyses, with people applying their expertise in different types of biomolecules (from antibodies to peptides), in statistics, and in the current state of computational protein design and their vision for its future. The discussions in our Slack channel were as delightful as they were interesting, and we finally distilled them into the preprint Crowdsourced Protein Design: Lessons From The Adaptyv EGFR Binder Competition. We will now summarize some of its key takeaways, but we strongly encourage you to read the full paper (honest clickbait: Table 4 might surprise you!).

Many more ways to look at the data

We took another look at the data we had already analyzed in our blog posts, to see whether we could identify patterns and trends in the combined dataset of both rounds.

While there was a sizeable increase in expression success and binder hit rate (see above), there was no statistically significant difference in the median $K_D$ across binders, nor was any one method clearly “better” at designing high-affinity binders (see below). This was somewhat surprising to us, but we ascribe it partially to the small sample size and partially to the fact that new methods are developed with a focus on finding more binders rather than the same number of better binders — the fancy term for this is satisficing — which gels with the continued popularity of the “sample-and-filter” methods we talked about in our introductory blog post. Presumably this will come with some exceptions, e.g. commercial players offering binder optimization services.

Overview of the 60 binders obtained across both competition rounds and their binding affinities. 
(A) Distribution of the binding affinity (averaged across 3 replicates) for each binder, (B) Distribution of binding affinities for each competition round, (C) Distribution of binding affinities across design categories.


To make it easier for others to choose metrics, we checked the correlation of various surrogates (including iPTM, ipAE, and ESM 2/3/C) with binding strength and found that:

  1. at least on our dataset, ipAE, iPTM, and ESM2 PLL (normalized or not) correlate only weakly with $K_D$, despite some of them being part of the competition target metric. This might be due to the satisficing trend mentioned above, or to participants diversifying their submissions to hedge their bets once they clear a threshold
  2. the good news, as Nikhil Haas (BioLM) had already noted: ESM3 and ESMC, when length-normalized, do correlate with $K_D$, at least on our dataset (see the sketch below).
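
To make this concrete, here is a minimal sketch of the kind of comparison this involves — the sequences, scores, and affinities below are placeholders, not data from the paper: take a per-design pseudo-log-likelihood, normalize it by sequence length, and rank-correlate it with the measured $K_D$ in log space.

```python
# Minimal sketch: rank-correlate a length-normalized PLL with measured K_D.
# Assumes per-design PLL scores and affinities are already available;
# the entries below are illustrative placeholders.
import numpy as np
from scipy.stats import spearmanr

designs = [
    # (sequence, total_pll, kd_molar)  -- placeholder values
    ("MKT...A", -210.4, 3.2e-8),
    ("GSH...L", -175.9, 1.1e-6),
    ("MAE...K", -240.1, 4.5e-7),
]

lengths = np.array([len(seq) for seq, _, _ in designs])
plls = np.array([pll for _, pll, _ in designs])
kds = np.array([kd for _, _, kd in designs])

norm_pll = plls / lengths   # length-normalized pseudo-log-likelihood
log_kd = np.log10(kds)      # work in log space; lower K_D = tighter binding

rho, p = spearmanr(norm_pll, log_kd)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```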

We don’t recommend rushing to blindly maximise this metric, though, because as we show later in the paper, specific antibody domains might require different metrics to inform binder design — huge shoutout to Nikhil and the team at https://biolm.ai/ for donating both the data and their valuable time to make this analysis possible. Beyond this, there’s a lot more in the paper and the supplementary, e.g. a detailed study of the specific EGFR domains and their role in the successful binders, highlights of methods used throughout the competition (including Cradle’s winning entry, which they have been further evaluating and explaining in more detail in their own series of posts), and more context on the competition and the community response. We thank all the authors for the time and resources they invested to get the paper done this way.

However, the one thing that kept being said as we did all of these analyses, and was confirmed by every insignificant p-value they yielded, is: we need more data.

Thus, in the discussion we tried to acknowledge the great advances that became obvious as we looked back on the data, but also stress the limitations and challenges the field still faces:

  1. Computational metrics are getting better, but they are not yet plug-and-play reliable, largely because of the extremely non-standardized datasets they are derived from, which often use different assays
  2. This lack of standardization extends as far as the definition of a binding hit, making it very difficult to compare results from one report to another
  3. Even if the assays and hit definitions were stable, everyone currently makes up their own slightly tweaked set of targets, with few if any targets shared across studies.

Of course, it is too easy to only complain about things, so we also suggest a first step in fixing this situation. 

Introducing BenchBB: the Bench-tested Binder Benchmark

BenchBB is a set of 7 protein targets you can test your in silico binder design method against, meant to enable rigorous, consistent, and practical evaluation of computational binder design methods. A few commonly used targets have started to emerge (we can point you to the RFdiffusion, AlphaProteo, and BindCraft papers), but we are still lacking a pre-defined minimal set to test your method against. BenchBB addresses this!

We tried to strike a balance between novel interfaces — not seen in the training sets of commonly used ML models — and challenging targets (large, with conformational changes), while also covering therapeutic relevance, prior use as benchmarks in other studies, accessibility to any lab (e.g., easy expression in E. coli), and diverse binding modalities.

Note that consistent targets address only one of the three issues we just highlighted. For our benchmark, we also have to define what a “hit” means and keep the assay stable.

For now, we propose to simply report the assay performed and the affinity of each replicate, cleanly tagging variations if ANY parameters change between assays. Ideally the assay should be made open source, but for proprietary assays, a convention of tagging the data with reporterName.AssayName.VariationID, changing the last element whenever something proprietary changes, and then making as much information as possible public via an “Assay card” is probably enough.
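
To illustrate (the field names and values here are hypothetical examples, not a finalized BenchBB schema), such a tag and its accompanying Assay card could look something like this:

```python
# Hypothetical illustration of the proposed tagging convention and "Assay card".
# Field names and values are made up for this example.
assay_tag = "adaptyv.bli_kinetics.v3"   # reporterName.AssayName.VariationID

assay_card = {
    "tag": assay_tag,
    "method": "bio-layer interferometry",      # what was measured and how
    "target": "EGFR extracellular domain",
    "replicates": 3,
    "changed_since_previous_variation": "new sensor lot",  # why VariationID was bumped
}

# A result row then carries the tag alongside the raw replicate affinities (in M):
result = {"design_id": "binder_042", "assay": assay_tag,
          "kd_replicates": [2.1e-8, 2.6e-8, 1.9e-8]}
```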

If you want to report “hit counts”, we suggest the clear-signal threshold of $K_D \leq 10$ μM introduced with BindCraft, but this is one of the things we expect to require a community consensus to evolve. As long as the raw data is available, assay variation is clearly tagged, and the targets are the same, we expect this to be a manageable source of noise.
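
As a sketch under that convention (again with made-up numbers), counting hits from the raw replicate affinities is then straightforward:

```python
# Sketch: count "hits" as designs whose mean replicate K_D clears the 10 uM threshold.
HIT_THRESHOLD_M = 10e-6  # K_D <= 10 micromolar, as suggested above

results = [
    {"design_id": "binder_042", "kd_replicates": [2.1e-8, 2.6e-8, 1.9e-8]},
    {"design_id": "binder_117", "kd_replicates": [4.0e-5, 3.1e-5, 5.5e-5]},  # no clear signal
]

def is_hit(kd_replicates):
    mean_kd = sum(kd_replicates) / len(kd_replicates)
    return mean_kd <= HIT_THRESHOLD_M

hits = [r["design_id"] for r in results if is_hit(r["kd_replicates"])]
print(f"hit rate: {len(hits)}/{len(results)}")
```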

Like many benchmarks that have pushed machine learning forward until they saturate, we hope that this set of targets will quickly become obsolete and be improved upon, but for now we are excited to start this journey with BenchBB as the first step. So let’s meet the 7 targets!

Epidermal growth factor receptor EGFR

  • This one needs no introduction - it was the catalyst that launched our entire competition. But, briefly: EGFR’s extracellular domain (~620 AA) binds EGF and TGF-α; it is frequently overexpressed or mutated in several cancers; and several therapeutic antibodies (e.g. Cetuximab) target it. PDB ID: 8HGO.
  • We have accumulated a solid data set thanks to the participants in both competition rounds. Designers can further expand it or compare their tools or results to the data we have released - this is one of the main reasons to include EGFR. 
  • Cao et al. 2022 designed 50–65 aa miniproteins that bound EGFR’s Domain I and Domain III, successfully blocking EGF-induced signaling. They reported, however, a 0.01% hit-rate. We saw in the community paper how this was significantly improved upon: almost 3% in Round 1, then 13% in Round 2. And let’s not forget about the 8.2x binding affinity improvement over Cetuximab that Cradle achieved.

Synthetic β-barrel BBF-14

  • It is a de novo designed 112-residue β-barrel protein (13.8 kDa) with an internal hydrophobic pore. PDB ID: 9HAG.
  • Serves as a stress-test for binder design on a novel, non-natural target. With BBF-14, there are no evolved binders or known epitopes – designers must rely solely on the computed structure. Thus, it “can assess generalization beyond natural interfaces”.
  • It was previously used as a target in the BindCraft paper, where one design (“binder4”) bound BBF-14 with $K_D$ ≈ 20.9 nM (SPR). BindCraft achieved a 55% hit-rate (6/11) on BBF-14.

EBV Bcl-2 homolog BHRF1

  • It is a viral anti-apoptotic protein from Epstein–Barr virus (EBV) that mimics Bcl-2, allowing infected cells to evade apoptosis. It is associated with EBV-linked cancers. PDB ID: 2WH6.
  • Our main reasons for choosing it were that it is “easily expressed in E. coli and commercially available with many antibody controls”. Additionally, it has a known hydrophobic hotspot - the BH3-binding cleft, which restricts the search space.
  • Targeted initially by Procko et al. 2014 - their de novo 86 AA minibinder (“BINDI”) could bind BHRF1 with 220 pM affinity (PDB ID: 4OYD). More recently, AlphaProteo reported an 88% experimental hit-rate, far above prior methods, yielding multiple nanomolar binders without any optimization. 


CRISPR-associated nuclease Cas9

  • This RNA-guided DNA endonuclease needs no further introduction - it is the key to CRISPR gene editing and is widely used in genome editing and all manner of genomic biotechnology applications. PDB ID: 4OO8.
  • We chose it for its “easy structural characterization via cryoEM; stable, easily expressed in E. coli, and with multiple known binding sites and conformations”. Cas9’s size and moving parts make binder design difficult, but a successful binder can act as an “off-switch” for genome editing (de novo binders regulating the enzyme’s function). The BindCraft authors note including “multi-domain nucleases, such as CRISPR-Cas9” as challenging targets.
  • Used as a target for BindCraft, where a small designed protein binder (~100 AA) bound Cas9 and inhibited its genome editing activity. Surprisingly, it also yielded a 100% hit-rate, with the best binder measuring a $K_D$ of 267 (SPR).

Interleukin-7 receptor α IL-7Rα

  • The alpha subunit of the IL-7 receptor (CD127), a 219 AA cytokine receptor critical for T-cell development. PDB ID: 3DI3.
  • Other than its therapeutic relevance (blocking IL-7/IL-7R interaction could modulate immune responses), we chose it because its “ectodomain is easily produced in human cells and has been benchmarked in multiple prior studies”.
  • RFdiffusion yielded multiple IL-7Rα binders where earlier Rosetta designs yielded almost none (from an original ~2.2% with AlphaFold selection to a reported ~34% for RFdiffusion). One designed binder showed nanomolar binding and inhibited IL-7 signaling in vitro. Cao et al. 2022 reported a 0.05% pre-AlphaFold hit-rate for de novo binders. AlphaProteo also generated strong IL-7Rα binders in one round. All these were de novo mini-proteins (~50–60 AA) that expressed well in E. coli and bound IL-7Rα with high affinity (comparable to or better than the natural IL-7:IL-7R interaction). AlphaProteo reports a 24.5% success rate, greater than the remeasured RFdiffusion one (16.8%, versus the original 34% published).

Maltose-binding protein MBP

  • It is a 42-kDa periplasmic binding protein in E. coli that binds maltose/maltodextrins as part of a sugar transport system. It is very stable and well-expressed; commonly used as an N-terminal fusion solubility tag to aid recombinant protein expression. PDB ID: 1PEB.
  • MBP’s abundance and stability make it easy to produce and assay, so it can be tested in any lab. Another reason for choosing it is that it features “a well-characterized active site allowing straightforward binder screening via elution from amylose resin”.
  • Zhou et al. 2025 employed de novo design and computational screening to create MBP binders: “6 candidate binders targeting MBP” were identified without any directed evolution. These hits were small folded proteins (≈80–100 aa) that bound MBP with low micromolar to nanomolar affinity. 

Programmed death-ligand 1 PD-L1

  • An immune checkpoint ligand (~290 AA) expressed on cancer cells and APCs. PD-L1 binds PD-1 on T cells, suppressing immune responses. PDB ID: 4Z18.
  • Gainza et al. 2023 noted that PD-L1’s surface “displays a flat interface considered to be ‘hard to drug’ by small molecules”, making it ideal for testing advanced design methods. We consider it a “de facto binder design benchmark target”.
  • RFdiffusion reported a 12.6% hit-rate. It was additionally benchmarked by AlphaProteo, MaSIF, BindCraft, and Yang et al., 2025.


Acknowledgements

Thanks to all the consortium authors for joining us throughout this journey, for their contributions to the paper, and for the in-depth discussions over the benchmark targets!

Stay tuned for more updates on BenchBB - lots of benchmarked models are coming your way!! 
