Oct 1, 2025

Designer Spotlight: Navigating the multi-property maze for therapeutic peptide design - Tong Chen

TL;DR

We introduce MOG-DFM (Multi-Objective-Guided Discrete Flow Matching), a framework that steers pre-trained discrete flow matching models toward Pareto-efficient trade-offs across multiple therapeutic properties. Benchmarked on peptide binder design guided by hemolysis, non-fouling, solubility, half-life, and binding affinity, MOG-DFM consistently produces well-balanced designs and outperforms classical multi-objective optimizers such as NSGA-III and SPEA2.

Brief introduction


Peptides have long been recognized as a versatile and effective therapeutic modality. Their ability to bind specifically to targets, including proteins, antibodies, and receptors, makes them powerful tools for treating a range of diseases, particularly those involving the immune system and cancer. Unlike traditional small molecules or larger biologics, peptides offer a unique combination of specificity, flexibility, and ease of design, which has driven their growing use in therapeutic contexts such as cancer immunotherapy, enzyme inhibition, and antimicrobial applications.


You’ve probably heard about GLP-1 peptides: they’re the technology behind drugs like Ozempic and Wegovy that have transformed diabetes and obesity treatment. What you might not know is that it took decades of work to make these peptides clinically viable, meaning that they not only bind their targets effectively but also survive in the body, avoid toxicity, dissolve well, and actually work as safe medicines. Most peptide candidates fail somewhere along this pipeline, not because they can’t bind, but because they lack the right combination of therapeutic properties. Now imagine if we could design GLP-1-like peptides for any disease and have them be therapeutically viable from the very beginning. We could eliminate endless cycles of trial and error, saving significant time and resources.

Recent advances in machine learning are making this vision possible. Instead of tweaking one molecule at a time, researchers can now use generative models: algorithms that explore vast sequence spaces and propose entirely new peptide candidates. Our lab has pushed these boundaries with models for motif-specific targeting (moPPIt), fusion-breakpoint detection (SOAPIA, FusOn-pLM), post-translational modification prediction (PTM-Mamba), the design of non-canonical or cyclized peptides (PepTune), and target-binding peptides for rare-disease treatment (our recent Gumbel-Softmax flow matching framework).

However, the development of therapeutic peptides involves more than just designing a peptide that binds effectively to its target. Researchers must consider a variety of physical and biological properties, such as solubility, stability, affinity, hemolysis, and non-fouling behavior. The challenge lies in balancing these conflicting properties, as optimizing one may negatively affect another. Our PepTune framework, recently published at ICML, has demonstrated strong capabilities for exploring large sequence spaces and discovering chemically modified peptides that meet multiple criteria at once. Yet, having a model that can start with wild-type sequences and refine them into therapeutically promising candidates is also crucial for practical drug development.

In this post, we’ll break down our newest model, Multi-Objective-Guided Discrete Flow Matching (MOG-DFM), which couples discrete flow matching with a novel multi-objective guidance algorithm to generate peptide sequences that optimize multiple properties at once. This framework represents a significant step forward in the design of therapeutic peptides, providing a more efficient and robust pathway for navigating the complex landscape of peptide optimization.


Problem statement

Designing peptides that satisfy multiple, often conflicting, functional and biophysical criteria is no simple task. The difficulty arises on two intertwined fronts. First, unlike single-objective problems where there is a clear optimum, multi-objective optimization (MOO) yields a Pareto front of trade-off solutions where improving one property typically degrades another.
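To make the idea of a Pareto front concrete, here is a minimal sketch with entirely made-up property scores, showing that several non-dominated candidates can coexist, none of which is best on every objective:

```python
# Minimal illustration of a Pareto front with made-up property scores.
# Each candidate is scored on two objectives to maximize (say, binding
# affinity and solubility); the numbers are purely hypothetical.

candidates = {
    "peptide_A": (0.9, 0.3),  # binds tightly, dissolves poorly
    "peptide_B": (0.4, 0.8),  # binds weakly, dissolves well
    "peptide_C": (0.5, 0.5),  # middling on both
    "peptide_D": (0.3, 0.4),  # worse than C on both objectives
}

def dominates(a, b):
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

pareto_front = [
    name for name, score in candidates.items()
    if not any(dominates(other, score) for other in candidates.values())
]
print(pareto_front)  # ['peptide_A', 'peptide_B', 'peptide_C'] -- no single winner
```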

The second difficulty is biological: the properties themselves are not independent and the mapping from sequence to phenotype is highly entangled and nonlinear. A single amino acid change can simultaneously affect affinity, solubility, and toxicity in unpredictable ways due to epistatic interactions, conformational shifts, and context-dependent effects.

This creates a sparse and rugged feasible region where sequences satisfying multiple high-level criteria exist but are hard to find, and predictive models for these properties add further uncertainty, especially when trained on limited data.


Solution: MOG-DFM

MOG-DFM (Multi-Objective-Guided Discrete Flow Matching) offers a novel solution to this problem by combining generative sequence modeling with multi-objective optimization. It steers pre-trained discrete flow matching models toward Pareto-efficient trade-offs, generating peptide sequences that optimize multiple conflicting properties simultaneously.

MOG-DFM builds directly on discrete flow matching (DFM), which defines a generative process over sequences as a continuous-time Markov chain (CTMC) in discrete space. The base DFM model learns a "velocity field": for each position in a candidate sequence, a set of transition rates describing how likely the current token (such as an amino acid) is to switch to each alternative, so that sequences evolve step by step from random initializations toward realistic target sequences. Because DFM operates natively on discrete symbols rather than in a continuous embedding space, it is especially well suited to biological sequences.
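As a rough illustration (not the actual PepDFM implementation, and with made-up transition rates), one unguided Euler step of such a CTMC at a single sequence position might look like this:

```python
import numpy as np

# Toy sketch of one unguided Euler step of the CTMC at a single position.
# In the real model the transition rates (the "velocity field") come from the
# trained discrete flow matching network; here they are random placeholders.

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

current_token = "A"
dt = 0.01  # Euler step size along the CTMC time axis

# Hypothetical non-negative rates for switching from the current token to each alternative.
rates = {aa: float(rng.random()) for aa in AMINO_ACIDS if aa != current_token}

# Probability of jumping to each alternative in this step; stay otherwise.
jump_probs = {aa: rate * dt for aa, rate in rates.items()}
stay_prob = max(0.0, 1.0 - sum(jump_probs.values()))

tokens = [current_token] + list(jump_probs)
probs = np.array([stay_prob] + list(jump_probs.values()))
new_token = rng.choice(tokens, p=probs / probs.sum())
print(current_token, "->", new_token)
```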


MOG-DFM starts with a randomly initialized peptide. A trade-off direction is also sampled from a Das–Dennis simplex lattice to represent a particular Pareto trade-off, so different runs explore different regions of the front.

The algorithm then performs multiple sampling iterations to refine the initialized peptide. At each iteration, it picks a position in the current sequence and considers all possible candidate tokens at that position. Each candidate transition is scored by measuring its impact on the objectives, using two complementary scores. First, a rank score captures how much a candidate improves each objective relative to the other candidates. Second, a directional score measures how well the overall multi-objective improvement aligns with the trade-off direction. Both components are normalized and combined into a single scalar guidance score, which is then used to reweight the original discrete flow matching transition rates. In practice, the base velocity for a candidate transition is multiplied by an exponential of its guidance score, boosting transitions that both improve the objectives and move along the chosen trade-off direction. This produces a guided velocity field that still satisfies the CTMC validity constraints but prefers multi-objective-improving moves.
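Here is a simplified sketch of that guidance step. The improvement vectors, base rates, and weights are hypothetical placeholders, and the exact normalizations differ from the paper, but the structure is the same: rank candidates per objective, measure alignment with the sampled trade-off direction, and exponentially reweight the base transition rates.

```python
import numpy as np

# Simplified sketch of the guidance scoring described above. All numbers are
# hypothetical placeholders; the paper's exact normalizations differ.

rng = np.random.default_rng(1)
n_candidates, n_objectives = 5, 3

# Per-candidate change in each (maximized) objective if that token were accepted.
improvements = rng.normal(size=(n_candidates, n_objectives))

# Trade-off direction: non-negative weights summing to 1, e.g. one point
# of a Das-Dennis simplex lattice.
direction = np.array([0.5, 0.3, 0.2])

# Rank score: average each candidate's per-objective rank (higher = better).
ranks = improvements.argsort(axis=0).argsort(axis=0)   # 0 = worst, n-1 = best
rank_score = ranks.mean(axis=1) / (n_candidates - 1)

# Directional score: cosine alignment between improvement vector and direction.
cos = improvements @ direction / (
    np.linalg.norm(improvements, axis=1) * np.linalg.norm(direction) + 1e-8
)
dir_score = (cos + 1.0) / 2.0  # map from [-1, 1] to [0, 1]

# Combine and use the result to boost the base (unguided) transition rates.
guidance = 0.5 * rank_score + 0.5 * dir_score
base_rates = rng.random(n_candidates)
guided_rates = base_rates * np.exp(2.0 * guidance)  # favors multi-objective-improving moves
print(np.round(guided_rates, 3))
```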




To ensure each candidate transition drives the sequence toward the chosen trade-off direction, MOG-DFM applies adaptive hypercone filtering. Think of each candidate’s multi-objective improvement vector as an arrow: only arrows falling within a certain angular cone around the trade-off direction are considered “feasible”. If too many candidates are being excluded, the cone widens to admit more directions; if too few are excluded, it narrows to increase selectivity. This dynamic adjustment keeps the search from either getting stuck or wandering aimlessly. From the feasible set, the candidate with the highest combined guidance score is chosen, and the CTMC advances via an Euler-style sampling rule: the position switches to the selected token with a probability determined by the guided outgoing rates and otherwise remains unchanged, preserving stochasticity in the evolution.

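A toy sketch of the hypercone filter and the Euler-style update, assuming the guided rates and guidance scores from the previous step have already been computed (the angles, thresholds, and adaptation rule here are illustrative, not the paper's exact values):

```python
import numpy as np

# Toy sketch of adaptive hypercone filtering followed by an Euler-style update.
# Candidate improvement vectors, guided rates, and guidance scores are placeholders.

rng = np.random.default_rng(2)
direction = np.array([0.5, 0.3, 0.2])      # sampled trade-off direction
improvements = rng.normal(size=(8, 3))     # hypothetical candidate improvement vectors
guided_rates = rng.random(8)               # hypothetical guided transition rates
guidance_scores = rng.random(8)

cone_half_angle = np.pi / 4                # current cone width (radians)
target_feasible_fraction = 0.5             # aim to keep roughly half the candidates

# Angle between each candidate's improvement vector and the trade-off direction.
cos = improvements @ direction / (
    np.linalg.norm(improvements, axis=1) * np.linalg.norm(direction) + 1e-8
)
angles = np.arccos(np.clip(cos, -1.0, 1.0))
feasible = angles <= cone_half_angle

# Adapt the cone: widen if too few candidates survive, narrow if too many.
if feasible.mean() < target_feasible_fraction:
    cone_half_angle *= 1.1
else:
    cone_half_angle *= 0.9

# Pick the feasible candidate with the highest guidance score (fall back to all if none).
pool = np.where(feasible)[0] if feasible.any() else np.arange(len(guidance_scores))
best = pool[np.argmax(guidance_scores[pool])]

# Euler-style update: jump to the chosen token with probability rate * dt, else stay.
dt = 0.01
jump = rng.random() < guided_rates[best] * dt
print("selected candidate:", best, "| jumped:", bool(jump))
```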

In sum, discrete flow matching supplies the underlying generative backbone, while MOG-DFM injects multi-objective guidance through the rank and directional scores and balances exploration and exploitation via the adaptive hypercone filtering mechanism.


In silico and experimental results

MOG-DFM was benchmarked on a peptide binder design task guided simultaneously by five therapeutic properties (a toy encoding of these objectives is sketched after the list):

  • Hemolysis: A measure of toxicity, specifically the ability of a peptide to damage red blood cells. Lower values indicate safer, less toxic peptides.

  • Non-fouling: Reflects the peptide’s tendency to avoid sticking to unintended surfaces, reducing unwanted interactions and side effects.

  • Solubility: Determines how readily the peptide dissolves in biological fluids, a key factor for delivery and bioavailability.

  • Half-life: Indicates the stability of the peptide in the body. Longer half-life means the peptide persists longer, allowing for lower dosing and improved efficacy.

  • Binding affinity: Measures how tightly the peptide binds to its intended target, such as a disease-related protein or receptor.
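As a toy illustration of how these five objectives might be encoded for guidance (the property names follow the list above; the example predictor outputs and the sign-flipping convention are hypothetical placeholders, not the actual implementation):

```python
# Toy encoding of the five guidance objectives and their optimization directions.
# In practice each property has a trained predictor; here we only show how the
# "lower is better" objective can be flipped so everything is maximized.

OBJECTIVES = {
    "hemolysis":        {"direction": "minimize"},  # lower = less toxic to red blood cells
    "non_fouling":      {"direction": "maximize"},  # fewer off-target interactions
    "solubility":       {"direction": "maximize"},  # better delivery and bioavailability
    "half_life":        {"direction": "maximize"},  # hours the peptide persists in the body
    "binding_affinity": {"direction": "maximize"},  # tighter binding to the intended target
}

def to_maximization(scores: dict) -> dict:
    """Flip minimized objectives so every objective is 'higher is better' for guidance."""
    return {
        name: (-value if OBJECTIVES[name]["direction"] == "minimize" else value)
        for name, value in scores.items()
    }

# Hypothetical predictor outputs for one designed peptide.
example = {"hemolysis": 0.07, "non_fouling": 0.81, "solubility": 0.77,
           "half_life": 35.0, "binding_affinity": 7.1}
print(to_maximization(example))
```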

We performed benchmarking on different types of target proteins, including structured proteins with pre-existing binders, structured proteins without known binders, and intrinsically disordered proteins. Notably, MOG-DFM-designed peptides consistently achieve low hemolysis (0.06–0.09), high non-fouling (>0.78) and solubility (>0.74), extended half-life (28–47 h), and good affinity scores (6.4–7.6), demonstrating balanced multi-objective optimization and robustness across target proteins.

At each sampling iteration, we recorded the mean and standard deviation of the five property scores to evaluate the effectiveness of the guided generation strategy. All five properties exhibited an improving trend over iterations, with the average solubility and non-fouling scores rising markedly from around 0.3 to 0.8. The improvements in hemolysis, non-fouling, and solubility gradually converge, demonstrating MOG-DFM's efficiency in steering the generation process to the Pareto front within only 100 iterations.

To illustrate the shift in the generated distribution, we compared property score distributions of peptides of fixed length sampled unconditionally from the base model (PepDFM) versus those steered by MOG-DFM. MOG-DFM both concentrates and translates the distribution, yielding designs with uniformly improved profiles across all five objectives. This demonstrates the framework’s capacity to simultaneously optimize multiple properties rather than improving them in isolation.

MOG-DFM was also compared to four classical multi-objective optimization baselines: NSGA-III, SMS-EMOA, SPEA2, and MOPSO. Although MOG-DFM incurs longer runtimes, it consistently yields superior trade-offs: it reduces predicted hemolysis by over 10%, increases non-fouling and solubility by roughly 30–50%, and extends half-life by a factor of three to four relative to the next-best competitor, while maintaining comparable affinity. These results highlight MOG-DFM’s ability to navigate high-dimensional, conflicting property landscapes and produce peptide binders with well-balanced profiles that would be difficult to obtain via traditional optimizers.


Limitations and what’s next

While MOG-DFM demonstrates strong performance on therapeutic peptide design, it still faces several limitations that point the way to future improvements. First, the framework can become computationally intensive as sequence length or output dimensionality grows. Extending to longer proteins or other high-dimensional biological sequences will increase both the number of candidate transitions per step and the number of sampling iterations needed. Second, while MOG-DFM steers generation toward Pareto-efficient regions, it does not come with theoretical guarantees of Pareto optimality or coverage. The adaptive guidance and hypercone filtering induce positive expected improvement in the desired directions, but there is no formal assurance that the sampled set will fully represent or converge to the true Pareto front.

Looking forward, these limitations motivate two main directions: (1) scaling MOG-DFM to longer sequences, including those with non-canonical amino acids, and (2) strengthening Pareto convergence guarantees and better characterizing front coverage, potentially via uncertainty-aware or feedback-driven extensions to the guidance mechanism, to make MOG-DFM more principled and reliable.


Resources and links

  • Try out MOG-DFM here and read the preprint here.

  • Check out what the Programmable Biology Group is working on!

  • Have some novel proteins you want to test in the lab? Come talk to us — we’d like to run many more of those protein designer spotlights, so if you have a cool new hypothesis or model to test we’d love to hear from you!