Protein Optimization 102: Lessons from the protein design competition

October 18, 2024insight

/Article

/TL;DR

-We analyzed the submissions for the first EGFR competition (results here) for you!
-Most people explored RFdiffusion for backbone generation, ProteinMPNN for obtaining sequences, with some other interesting approaches we’ve highlighted.

In our last blog post, we looked at the different ways you can optimize a protein for a given task using machine learning models.

We made the distinction between fixed and sequential model optimization and further split the latter into model-based sampling and greedy/heuristic methods. Model-based methods use a learned…well…model to explicitly sample novel candidates. This model tries to balance between exploitation and exploration using the knowledge that it incorporates from observations via training. Greedy/heuristic methods often use the fitness model to rank randomly proposed candidates, but generally don’t adapt to the observations as much as model-based methods.

In the first half of this post, we will use this taxonomy to take a look at the methods employed by designers in the first round of our EGFR competition. Different approaches come with different theoretical trade-offs, and we will see whether we can see those in the submissions’ practical performance.

In the second half, we will give you some actionable advice on how to choose an approach for the second round, summarized with two handy decision charts.

Overview of Adaptyv’s EGFR binder design competition

In our competition, creative protein engineers were tasked with designing a new binder to the extracellular domain of EGFR, a cancer-associated drug target.

We selected 200(+1) of the most promising sequences for screening. The top 100 were based on the AlphaFold2 interface pAE as a proxy for binding (Bennett et al., 2023), whereas the rest were selected across a wide range of design techniques. AlphaFold2 iPAE as a binding metric has its problems, which we will briefly discuss, but we chose it because it has been exhaustively used in the communities, including luminaries such as the Baker lab (Bennett et al., 2023; Watson et al., 2023; Zhang et al., 2024). Most successful designs (in terms of leaderboard placement) aimed to directly optimize this objective. For a brief overview of protein design competitions, check out this Nature article we have been featured in!

Design and model categories

We had requested competitors to include a brief description of their submissions’ strategy. Some wrote very brief stubs (which we still thank them for!), and some wrote detailed descriptions and blog posts (you guys are awesome!).

The figure below shows the diversity of techniques designers used, categorized into certain themes by us.

You can browse through all the categorized (tested) designs here.

Number of designs submitted and selected in the first round of the EGFR competition by design and model category.

The most popular options included:

sampling from a generative model like RFdiffusion (Watson et al., 2023), sometimes conditioned on EGFR’s structure, followed by inverse-folding with ProteinMPNN (Dauparas et al., 2022), then using AlphaFold’s predicted metrics or others for filtering (de novo + filter). We called this playing the “de novo slot machine” in our previous post. In a typical experimental setting, these approaches can be prone to relatively low hit rates, although recent models have made great leaps to address this (e.g., Google DeepMind’s AlphaProteo, Zambaldi et al., 2024, which has a 24.5% hit rate for interleukin-7 receptor-𝛼 binders).
diversifying known binders from the literature with ProteinMPNN (Dauparas et al., 2022) and subsequent selection after re-folding with AlphaFold and computing selection metrics (diversified binder + filter). This category includes partial diffusion of a known binder with RFdiffusion. In our last post’s taxonomy, this would be a fixed model optimization or local search strategy. Other designers chose a rational approach (Korendovych, 2018) for diversifying: starting from known sequences (such as EGF), they mutated regions (often not in the binding site!) using methods such as BLOSUM62 substitutions (Eddy, 2004). Diversified binder + filter was the most common design strategy.
gradient-based input space optimization via AlphaFold2 (aka hallucination) (Anishchenko et al., 2021; Goverde et al., 2023; Frank et al., 2023). For this, a random sequence is fed through AlphaFold2, then a custom loss is computed (often including terms like pLDDT, pAE, and other constraints) and its gradient with regards to the input is taken by backpropagating through the folding network. Several iterations of input optimization can achieve a structurally stable design. Feeding a random sequence and its desired binding target through AlphaFold-Multimer enables the design of binders, with losses that can take into account the interaction pLDDT and pAE, the number of interface residues, etc.
directly optimizing a starting sequence using unsupervised protein languge model directed evolution (fixed model optimization), and building custom active learning loops (sequential model optimization with greedy/heuristic sampling). We included both in the optimized binder category. This is very similar to input space optimization which hallucination is also a part of, yet we decided it should have its separate category to highlight its “resurgence” in protein design, as we will see from our winners.
strategies that mainly employed molecular dynamics, docking, or Rosetta design protocols were labelled as physics-based. Most of these still filtered the final candidates using AlphaFold2’s iPAE to ensure they made it on the leaderboard.

Submissions with missing or unclear descriptions are labelled as “not mentioned” and excluded from this analysis (398 out of 726 total submissions, with 65 of these being selected for validation).

Looking more in-depth at the models represented, we found that RFdiffusion for backbone design followed by ProteinMPNN diversification and filtering is by far the most common strategy (176 total submissions).

Here RFdiffusion is used either for target-conditioned generation or partial diffusion/scaffolding of known binder motifs. This is a completely valid choice given the iPAE objective and lack of fully characterized EGFR binder datasets: people opted to sample a large number of potential designs and simply select the top candidates. But, as Brian Naughton reports, it is often hard to get both RFdiffusion and ProteinMPNN to reliably sample high-confidence designs. For validation, this would entail screening massive designed libraries and it might not be the best choice if you’re working with a limited experimental budget.

Expression rates

Not all sequences expressed, leaving some people surprised and some disappointed. This is the reality of translating in silico designs into experimentally-valid candidates. A 73% (146/201) expression rate is about on par with Wicky and colleagues, who achieved a 74% (71/96) expression rate for their AlphaFold2 hallucinated + ProteinMPNN-designed symmetric assemblies. Goverde and colleagues compared the designs obtained from the standard ProteinMPNN for an AlphaFold2-hallucinated backbone (with the AF2Seq pipeline) with biasing ProteinMPNN’s sampling towards hydrophilic amino acids, and with their fine-tuned version on soluble proteins (SolubleMPNN): for a single redesigned protein, normal ProteinMPNN has a 0% expression rate (0/12), the biased version 75% (6/8), and SolubleMPNN 93.1% (27/29). Recently, a paper from the Baker lab (Glögl et al., 2024) achieved a 98% (94/96) expression rate for TNFR1 binders.

With this context, 73% could be considered on the lower side compared to the Baker lab. However, only half of the participants we could verify via their socials had protein design experience (defined as PhD/postdoc/professor or working in a therapeutics biotech). This combined with the wide range of methods employed and the fact that not many people optimized for expression, only binding/iPAE, makes us at Adaptyv relatively happy with the outcome.

For the second round, we are including an expression proxy in the metrics. We also recommend using SolubleMPNN as a final check before submitting your sequences!

Hit rates

Out of these, 5 were considered strong binders, with a KD between 3e-8 M to 2.3e-5 M, with 2 labelled as weak binders (KD above 1e-5 M). This yields a 2.5% total hit rate (5/201), considerably higher than past EGFR-targeting design campaigns (0.01% previously reported using a Rosetta protocol, Cao et al., 2022). We should emphasize that hit rates are highly dependent on the design target, with highly accessible, hydrophobic (Cao et al., 2022; Pacesa et al., 2024), and reduced flexibility epitope regions (Kim, Choi & Kim, 2021) often yielding better results. To ensure you get a lot of binders, we recommend you target hydrophobic surface regions or the EGFR epitope region we provided, especially if you want to succeed in the EGF neutralization assay we are doing!

iPAE as a proxy for KD

Seven binders (3 true binders, with 2 disqualified due to similarity to other therapeutics, and 2 weak) are too few data points to test the quality of iPAE as a competition proxy. Brian Naughton recently looked at 55 entries from PDBBind in his blog post, showing a Pearson’s correlation coefficient of -0.25 (p-value << 0.05). iPAE might not be an ideal score for binding affinity.

We recently asked for input from the protein design community regarding which metric should be used as a binding affinity proxy. This sparked quite an interesting conversation:

Nikhil Haas conducted an extensive analysis of the competition sequences and other data ordered from Adaptyv, looking into expression and binding correlations. He showed that ESM2 log-likelihoods correlate well with expression. You can watch his video explanation here.
Martin Pacesa says ipTM could be used for binary binding predictions.
Sergey Ovchinnikov recommends using no filters for the ranking process or, at best, looking at iPAE from AlphaFold2’s predictions with initial guess and also selecting designs that did not pass the filters.

We hope to see more filtering and optimization proxies suggested and experimented with in the second round!

The champion binders

First place 🥇- Bindcraft: Martin Pacesa and Lennart Nickel

Our undisputed winners are Martin Pacesa and Lennart Nickel, with a 4.91e-7 M KD. This was achieved with their custom AlphaFold2 hallucination pipeline called BindCraft (Pacesa et al., 2024). As opposed to others, it hallucinates the binding interface instead of the binder’s structure only. It uses four stages of optimization: first gradient-based optimization on the continuous sequence logits, then on the softmax matrix, followed by one-hot encoding without and with randomly sampled mutations. It is also highly customizable via the combined loss function: some terms aim to optimize the binding interface’s prediction confidence, the number of residues, as well as a radius of gyration for the binder (to prevent “spaghetti”-like long binding stretches) and a “helicity” loss (as AlphaFold2 hallucination is biased towards helices, this promotes more non-helical designs).

You can try it out here using Google Colab. As the authors pointed out, it is super straightforward to adapt it to your own design campaign and implement additional loss terms (see here!). For the second round, give BindCraft a try!

Second place 🥈 - Khondamir Rustamov

The second spot was taken by a binder designed using the AF2Seq protocol for backbone hallucination and SolubleMPNN for inverse-folding, as initially explored by Goverde and colleagues (see the section on expression rates). It had a KD of 4.77e-6 M and ranked 54th for iPAE. We can now see a common theme: hallucinated backbones with AlphaFold2, followed by SolubleMPNN inverse-folding!

The methods we have seens so far indirectly optimized multiple objectives and their designs are likely on the Pareto-optimal front between experimental binding, iPAE, and expression.

Third place 🥉- Adrian Tripp and Sigrid Kaltenbrunner

This binder had a KD of 2.29e-5 M, ranked 89th in the iPAE leaderboard, and had a high expression rate. Sigrid and Adrian used the ProtFlow pipeline (not to be confused with Adaptyv Bio’s ProteinFlow - which you should check out for processing protein structures!) to orchestrate their complex workflow. First, RFdiffusion tackled the binder backbone generation, targeting EGFR’s residues 18, 39, 41, 108 and 131, followed by initial filtering, then LigandMPNN inverse-folding on Rosetta-relaxed structures, and sequential folding and filtering with ESMFold and ColabFold. The ESMFold step accounted only for the binder structures and selected for pLDDT+TMscore, whereas the ColabFold step predicted the entire complex and filtered based on pLDDT, TMscore, iPAE, ipTM, and the number of hotspot contacts. The resulting designs were once again fed through the entire pipeline, for 3 cycles in total. We can see how they took the de novo design + filter approach to its limit!

Special mentions

Some approaches were not as successful, but were extensively documented and so deserve being highlighted as well.

In his blog post, Brian Naughton tried out a couple of methods, including ESM2 directed evolution with an iPAE oracle, RFdiffusion, ProteinMPNN, and Bayesian optimization. He provides the Modal commands for several of these tools here and you should definitely check this out. It is a great way to get started in protein design. And you can make use of your $30 free Modal credits as well!!

Anthony Gitter found that a de novo, language-instructed binder from ProTrek (Su et al., 2024) performed quite well, despite not ranking in the top 100. He tested the “non-biological/domain unadapted” language model Llama 3.1’s ability to design proteins, which still suggested antibody-like EGFR inhibitors. We are pretty hopeful for what language-instructed, chat-based protein design might look like in the future!

Alex Naka’s strategy was the only one that fits the definition of sequentially-optimized, model-based sampling we established in the last blog post. For his in silico oracle, he first opted for a simple AlphaFold2 implementation using Modal with PepMLM (Chen et al., 2023) as his EGFR-conditioned sequence generator. He trained a surrogate (an ensemble of 1D CNNs on one-hot encodings) on a starting dataset mined from this in-silico oracle, then continued training it during optimization. He used EvoProtGrad (Emami et al., 2023) as his surrogate-conditioned generator, iterating between scoring new candidates and then retraining. A complete active learning loop! Now we know why all his designs were at the top of the leaderboard (the ”custom active learning” category is entirely comprised of these). This is an interesting strategy you could use for the second competition round, but make sure you also co-optimize for expressibility!

The top 5 submitted designs ranked by AlphaFold2’s iPAE in the first design competition round.

We thank Brian Naughton, Anthony Gitter, and Alex Naka for their contributions, both to the competition, and especially for these highly detailed writeups. Make sure you read their posts!

We thank all other design competition participants as well - it was a great experience seeing so many creative protein engineers’ solutions and, even beyond round two, we plan to organize more competitions in the future!

Recommendations for ML based protein optimization

Main considerations for a protein engineering campaign. For simplicity, we grouped the starting data, lab budget, and risk-reward trade-off as business goals - willingness to spend on your campaign’s lab validation. ML power includes computational resources and ML experience - available expertise and resources to use ML models.

A) Questions to ask yourself before an optimization campaign

When selecting a model for your protein optimization project, ask yourself the following questions:

Starting data: How much labeled data do you have available to start with (low-N vs. large-N regime)? Do you plan to obtain more data given your model’s suggestions?
Lab budget: What is your total budget for screening? How many variants do you think you can reliably test out (consider replicates as well!)? Does your lab offer the necessary screening platforms and machines?
Risk-reward trade-off: Are you trying to be more conservative and incrementally improve an existing lead? Or are you trying to discover the next moonshot therapeutic?
Computational resources: How many computational resources (GPUs, available time to implement or train) do you have?
ML experience: What is your ML experience? Are you able to implement novel architectures, simply fine-tune a model, or are you familiar with tools like ColabFold, 310.ai, tamarind.bio?

B) The protein engineer’s personality test

The answers to these questions determine in which of a couple of profiles a protein engineer might fall into, and which ML models likely to achieve their aims during a binder optimization campaign.

Decision flow chart for your next optimization campaign.

If you only have a single known binder or no starting data, you might be forced to de novo design using ProTrek or RFdiffusion, or mine some data from the literature and maybe replace some residues in the binding site using a BLOSUM62 scheme. If you have a small starting set, you might try low-N fine-tuning techniques and carefully plan to test what your model will suggest. If you’re lucky or proficient enough in the lab, you could have a large dataset, mapping both single point and combinatorial mutations to their binding affinities. In this case, custom architectures and from-scratch training become feasible for you.

Next, when it comes to the lab resources for validation, most at-home protein engineers have a small to non-existent screening budget: they would validate at most their final best binder. This was the case for our EGFR competition, where completely in silico oracles are necessary. Alex Naka really leaned into this constraint, building a small dataset of low iPAE binders, training CNN ensembles for prediction, and finally creating a completely in silico active learning loop.

If you can afford to validate about 200 designs, we recommend sequential optimization using either explicit model-based sampling or greedy/heuristics. We prefer Bayesian optimization/Active learning with simple, uncertainty-aware surrogates (Gaussian Processes or ensembles), as argued for in our first blog post! The ALDE (Yang et al., 2024) or EVOLVEpro (Jiang et al., 2024) frameworks are great starting points. If you want to do some input optimization via AlphaFold2 backpropagation, take a look at BindCraft (Pacesa et al., 2024) - use the default settings or even implement your own design campaign-tailored loss function!

If you are fortunate enough, your validation budget could be large to limitless: your best bet would be to take the largest protein foundation model you can find and sequentially fine-tuning it on new batches of data - you can retrain any fixed model. Be as greedy as you want: generate combinatorial libraries (Wu et al., 2019) and select the top-performing variants at each step!

Another axis of consideration is the risk-reward trade-off you are willing to make. Running more iterations of active learning, exploring more of the sequence space, or validating a larger batch at each step means higher investment, but also a higher chance of discovering moonshot binders, if you think those should exist. The alternative is to simply take an existing binder and do some local search that just gets you away from patent protections while improving enough to be worth it. These areas of consideration (starting sequences, lab resources, risk-reward trade-off) ultimately depend on your business goals.

Compute is rarely, if ever the bottleneck in protein design. Most protein engineers can make do with a single GPU on their local machine. Some use Modal or AWS, with a low resource consumption. With very limited resources it’s even possible to set up an active learning loop (as seen here) and only run it for a couple hours or days. We recommend using simple surrogates (1D CNN ensemble or Gaussian Processes, specifically). Simple one-hot encodings can also suffice, In fact, they perform about as well as embeddings from a protein language model like ESM2 for fitness prediction and optimization (Shanehsazzadeh et al., 2020; Greenman, Amini & Yang, 2023; Yang et al., 2024).

←Back to Blog