The landscape of protein engineering has been transformed by the advent of machine learning tools such as AlphaFold (structure prediction), ProteinMPNN (sequence generation), and RFdiffusion (structure generation). These powerful tools all rely on structural data from the Protein Data Bank (PDB), but working with this data presents its own challenges.
Here, we present and open-source ProteinFlow, a powerful solution that aims to simplify protein data processing by providing users with a customizable pipeline to generate high-quality, unbiased datasets.
Challenges of Working with Structural Protein Data
Protein structural data in the PDB is characterized by its heterogeneity in both method and experimental conditions, often resulting in issues such as low accuracy structures, missing residues, or redundancies. This complicates the task of leveraging this data as "ground truth" in machine learning. Furthermore, these datasets are heavily biased towards a few over-represented families of proteins, which can influence downstream training tasks if the data is not weighted correctly.
The lack of a consensus in the research field for selecting filtering criteria exacerbates these issues. With a plethora of experimental parameters across the PDB, there is no standardized list of methods used across training datasets, leading to discrepancies in data and, consequently, a lack of reliable insight into the properties of machine learning models. Furthermore, benchmarking datasets used to test machine learning models are often static and updated infrequently, preventing new models and modelling modalities from being efficiently tested.
To overcome these challenges, it is crucial to standardize datasets that accommodate a variety of training tasks, with sufficient data for training purposes and safeguards against data leakage between train and test data.
ProteinFlow is designed to simplify and streamline the analysis of 3D protein structures for deep learning applications and to ensure the reliable generation of datasets. It offers a fully customizable end-to-end bioinformatic pipeline to extract, filter, annotate, and cluster data from the PDB, allowing users to design and generate a dataset according to their specific modelling task. The pipeline provides ready-to-use output compatible with machine learning frameworks such as PyTorch (support for other frameworks is in development) and extracts all levels of protein organization, including primary sequence, secondary structure, structural single chain information, and protein complex information.
- Easy Installation: ProteinFlow can be easily installed via PyPI. A Docker image is also available with all the library dependencies.
- Ready-to-Use Clustered Dataset for Deep Learning Applications: A pre-computed dataset is available with commonly used parameters and filters across the research community.
- Accessing All Levels of Protein Structure: ProteinFlow can process all hierarchical levels of protein representation from primary to quaternary structure, enabling users to produce datasets for a variety of training tasks such as protein docking or protein-multimer prediction.
- Running Your Own Pipeline: ProteinFlow offers the option to run the pipeline with your own parameters, along with a range of data filtering options.
- Preventing Data Leakage and Ensuring Reliability: Pre-processed data is clustered and split into training, validation, and test sets, ensuring the absence of data leakage.
- Feature Extraction and Sampling: The library allows for feature extraction, filtering based on chain types, and randomized sampling from sequence identity clusters.
- Accessing and Processing Data: ProteinFlow's output data class contains various types of features, such as backbone and sidechain atom coordinates, sequence information, dihedral angles and missing residues, which can be directly accessed using the pickle module or with provided classes like ProteinLoader for direct integration with machine learning frameworks.
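As a rough illustration of the pickle route mentioned in the last item, the sketch below writes a toy entry to disk and reads it back. The layout is an assumption made for this example only: one dictionary per chain, with hypothetical field names "seq", "crd_bb" and "msk", and a mask value of 1 meaning the residue was observed. The actual file layout should be checked against the ProteinFlow documentation.

```python
import os
import pickle
import tempfile

# Toy entry with an ASSUMED layout: one dict per chain, keyed by chain ID.
# The field names ("seq", "crd_bb", "msk") are hypothetical placeholders.
toy_entry = {
    "A": {
        "seq": "MKT",
        # (residues, 4 backbone atoms, xyz) as nested lists of floats
        "crd_bb": [[[0.0, 0.0, 0.0] for _ in range(4)] for _ in range(3)],
        # ASSUMPTION: 1 = residue observed, 0 = residue missing
        "msk": [1, 1, 0],
    }
}


def load_entry(path):
    """Read one pickled entry from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def observed_sequence(chain):
    """Return the chain sequence with missing residues shown as '-'."""
    return "".join(aa if seen else "-" for aa, seen in zip(chain["seq"], chain["msk"]))


# Round trip through a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_entry.pickle")
    with open(path, "wb") as handle:
        pickle.dump(toy_entry, handle)
    entry = load_entry(path)

for chain_id, chain in entry.items():
    print(chain_id, observed_sequence(chain))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.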
ProteinFlow PPI dataset 2023
We also present a curated dataset of protein structures for Protein-Protein Interaction (PPI) prediction, along with the methods used to cluster and split the data. Our dataset contains over 280,000 biounits (a biounit is defined as the smallest biological assembly in the PDB file that still contains all chains of the biological entity), making it one of the largest publicly available datasets for PPI prediction.
The dataset was clustered by sequence identity with the MMseqs2 suite at a 30% identity cutoff. This approach yielded results equivalent to using structural clusters from the CATH database, but without losing any entries due to missing classifications. A PPI graph was then constructed, where nodes represent sequence clusters and edges connect clusters whose sequences interact within a biounit. Each connected component of this graph was defined as a biounit cluster, since no biounit is shared between two connected components.
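The connected-component step can be sketched with a small union-find structure. This is a minimal illustration, not the ProteinFlow implementation: the function name and input format are hypothetical, every biounit is assumed to have at least one chain, and all chains that share a biounit are treated as interacting, which simplifies the edge definition used for the PPI graph.

```python
from collections import defaultdict


def biounit_clusters(biounit_to_seq_clusters):
    """Group biounits by connected component of the sequence-cluster graph.

    biounit_to_seq_clusters maps a biounit ID to the non-empty set of
    sequence-cluster IDs of its chains (hypothetical input format).
    """
    parent = {}

    def find(node):
        # Locate the component root, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # SIMPLIFICATION: link every sequence cluster found in the same biounit.
    for seq_clusters in biounit_to_seq_clusters.values():
        members = list(seq_clusters)
        find(members[0])
        for other in members[1:]:
            union(members[0], other)

    # A biounit belongs to the component that contains any of its chains.
    components = defaultdict(list)
    for biounit, seq_clusters in biounit_to_seq_clusters.items():
        components[find(next(iter(seq_clusters)))].append(biounit)
    return sorted(sorted(group) for group in components.values())


toy = {
    "1abc-1": {"A", "B"},
    "2def-1": {"B", "C"},
    "3ghi-1": {"D"},
    "4jkl-1": {"D"},
}
print(biounit_clusters(toy))
# [['1abc-1', '2def-1'], ['3ghi-1', '4jkl-1']]
```

In the toy input, the first two biounits share sequence cluster "B" and therefore land in the same biounit cluster, so they would always end up on the same side of a train/test split.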
The dataset was split by randomly partitioning the clusters under the constraint that the distributions of single chains, homomers, and heteromers are similar across the train, validation, and test sets. If the partition still did not meet the criteria after fifty trials, we added and removed clusters one by one until the criteria were met. The resulting train, validation, and test sets contain 90%, 5%, and 5% of the data, respectively.
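The random-partition stage of this procedure could look roughly like the sketch below. It is written under several assumptions that do not come from the source: the function names, the input format, and the `tolerance` parameter are hypothetical, the split sizes are computed over cluster counts rather than biounit counts, and the one-by-one adjustment used after fifty failed trials is not implemented (the function simply returns `None`).

```python
import random
from collections import Counter

CLASSES = ("single_chain", "homomer", "heteromer")


def class_fractions(clusters, cluster_classes):
    """Fraction of biounits of each class within a group of clusters."""
    counts = Counter()
    for cluster in clusters:
        counts.update(cluster_classes[cluster])
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in CLASSES}


def random_split(cluster_classes, fractions=(0.9, 0.05, 0.05),
                 tolerance=0.1, max_trials=50, seed=0):
    """Try random cluster partitions until class distributions match.

    cluster_classes maps a cluster ID to a Counter of biounit classes
    (hypothetical format). Returns (train, valid, test) lists of cluster
    IDs, or None if no trial satisfied the tolerance.
    """
    rng = random.Random(seed)
    clusters = sorted(cluster_classes)
    overall = class_fractions(clusters, cluster_classes)
    # SIMPLIFICATION: sizes are measured in clusters, not in biounits.
    n_train = round(fractions[0] * len(clusters))
    n_valid = round(fractions[1] * len(clusters))

    for _ in range(max_trials):
        rng.shuffle(clusters)
        parts = (
            clusters[:n_train],
            clusters[n_train:n_train + n_valid],
            clusters[n_train + n_valid:],
        )
        # Accept the partition only if every subset is close to the
        # overall class distribution for every class.
        balanced = all(
            abs(class_fractions(part, cluster_classes)[name] - overall[name]) <= tolerance
            for part in parts
            for name in CLASSES
        )
        if balanced:
            return parts
    return None  # the one-by-one adjustment step is not sketched here
```

Partitioning whole clusters, rather than individual biounits, is what keeps related sequences from appearing on both sides of the split.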
ProteinFlow is an open-source project by Adaptyv Bio, and we welcome contributions from the research community. By simplifying the process of working with structural protein data, ProteinFlow empowers researchers to harness the full potential of machine learning for protein design projects. With its customizable pipeline, reliable dataset generation, and a variety of processing options, we built ProteinFlow to be a versatile tool for researchers at the forefront of protein engineering and computational biology.
We are planning future releases to support other typical data workflows, such as clustering the dataset by protein structural similarity, and to add new entities and features, such as additional processing options for small ligands and other biopolymer entities.
If you have any questions, suggestions, or would like to contribute to the project, please feel free to reach out to us at email@example.com or check out the GitHub repo & documentation.