Fixed-backbone sequence design is the "last mile" of protein engineering. ProteinMPNN is currently the most widely used method in industry, but it has an underappreciated drawback: limited sequence diversity. Higher diversity means a larger functional search space and a better chance of success in wet-lab experiments.
GPD (Graphormer-based Protein Design) builds on the Graphormer graph neural network architecture: it represents the three-dimensional protein structure as a graph and uses a Transformer to compute attention over node features. Gaussian noise and random masks are introduced during training to improve both sequence recovery and diversity. Given a target backbone as input, the model outputs highly active and diverse amino acid sequences.
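To make the ingredients above concrete, here is a minimal sketch of turning a backbone into a graph and applying the two training-time perturbations. It is not the GPD implementation: the k-nearest-neighbour edge rule, the function names, the noise level `sigma`, the mask fraction, the `X` mask token, and the toy coordinates are all assumptions chosen for illustration.

```python
import math
import random


def add_gaussian_noise(coords, sigma, rng):
    """Return a copy of coords with N(0, sigma^2) noise added to every x, y, z."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in xyz) for xyz in coords]


def knn_edges(coords, k):
    """Connect each residue (node) to its k nearest neighbours.

    Returns directed edges as (source, target, distance) tuples.
    The k-NN rule is an assumed graph construction, not taken from the paper.
    """
    edges = []
    for i, a in enumerate(coords):
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(coords) if j != i)
        edges.extend((i, j, d) for d, j in dists[:k])
    return edges


def random_mask(sequence, mask_frac, rng, mask_token="X"):
    """Hide a random fraction of residue identities behind mask_token."""
    n_mask = round(mask_frac * len(sequence))
    hidden = set(rng.sample(range(len(sequence)), n_mask))
    return "".join(mask_token if i in hidden else aa for i, aa in enumerate(sequence))


rng = random.Random(0)

# Toy C-alpha trace: 6 residues spaced 3.8 angstroms apart along the x axis.
ca_coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
native_seq = "MKTAYI"

noisy_coords = add_gaussian_noise(ca_coords, sigma=0.1, rng=rng)
edges = knn_edges(noisy_coords, k=2)
masked_seq = random_mask(native_seq, mask_frac=0.5, rng=rng)

print(len(edges), masked_seq)
```

In a real model the node and edge features would be fed to attention layers; the sketch stops at the inputs, which is where the noise and the masks act.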

Quantitatively, on a test set of 103 single-chain proteins, GPD achieved a sequence recovery rate of 27.9% while maintaining 28% diversity. Designing 10,000 sequences of 261 residues took GPD only 0.97 hours, compared with 3.11 hours for ProteinMPNN and 55 hours for ESM-IF1. The reported gains over ProteinMPNN are a 2.2-fold increase in diversity and a 1.6-fold increase in speed, although the timings quoted here correspond to a speedup of roughly 3.2-fold (3.11 h / 0.97 h).
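The two headline metrics can be illustrated with a short sketch. The definitions used here are common conventions and are assumptions rather than the paper's exact formulas: recovery is the mean fraction of positions where a design matches the native sequence, and diversity is one minus the mean pairwise identity among designs. The sequences are made-up toy data.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sequence_recovery(designs, native):
    """Mean identity between each designed sequence and the native sequence."""
    return sum(identity(d, native) for d in designs) / len(designs)


def diversity(designs):
    """One minus the mean pairwise identity among designs (assumed definition)."""
    pairs = list(combinations(designs, 2))
    return 1.0 - sum(identity(a, b) for a, b in pairs) / len(pairs)


# Toy data: one native sequence and three hypothetical designs for its backbone.
native = "MKTAYIAK"
designs = ["MKTAYIAK", "MKSAYLAK", "MRTAFIAK"]

recovery = sequence_recovery(designs, native)
div = diversity(designs)
print(f"recovery={recovery:.3f} diversity={div:.3f}")
```

Under these definitions the two metrics pull against each other: designs that all copy the native sequence score perfect recovery and zero diversity, which is why reporting both together is informative.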
Wet-lab experiments are the final judge. The research group used GPD to design protein drugs, and the success rate in in vitro validation exceeded 50%. In the engineering of Candida antarctica lipase B (CalB), the resulting variant showed 1.7 times the catalytic activity of the wild type and exhibited strong selectivity among substrates with carbon chain lengths from C2 to C16.
For industrial enzyme engineering, sequence optimization in antibody humanization, and the development of enzyme parts for synthetic biology, that is, in any scenario that requires "matching sequences to the backbone," GPD offers a wider design space and faster iteration than existing state-of-the-art solutions.
This work was published in Briefings in Bioinformatics (2024), and the code has been open-sourced.