💡 Notes
- This list accompanies our preprint (https://www.biorxiv.org/content/10.1101/2022.08.31.505981v1). We focus on deep learning methods for protein design released from 2018 onward (mostly from 2019). The list complements Table 1 in our manuscript.
- We curated this list manually, so it may be incomplete. Please email us or open an issue if your method is missing or described incorrectly.
- We order the methods by release date (preprint date when available) and categorize them into four classes (for more details on these categories, see Figure 1 and the text of our preprint):
- 1️⃣: 'Fixed-backbone' protein design; p(sequence|structure)
- 2️⃣: Structure generation; p(structure)
- 3️⃣: Sequence generation; p(sequence) or p(sequence|sequence*)
- 4️⃣: Concomitant sequence and structure design; p(sequence and structure), which can be constrained
- Others before us have also done fantastic work assembling deep learning methods for other protein-related problems, sometimes overlapping with this list. We link these lists here:
- Sean Peldom Zhang's super comprehensive list on protein design methods using DL
- Kevin Yang's list on ML methods for protein research
- Christian Dallago & Sergey Ovchinnikov's lists on structure prediction methods and protein language models.
- Simon Dürr and Gina El Nesr's list on inverse folding
- 💥 This work was recently highlighted in Nature
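The four classes from the notes above are related by basic probability identities, which we sketch here for orientation. The symbols are our own shorthand and not notation from the preprint: $s$ is a sequence of length $L$, $x$ is a structure, and $\hat{s}$ is a designed sequence. This is only the chain rule of probability, not a claim about how any listed method is implemented.

```latex
\begin{align}
  p(s, x) &= p(x)\, p(s \mid x)
    && \text{(class 4 factorizes into class 2 and class 1)} \\
  \hat{s} &= \arg\max_{s}\; p(s \mid x)
    && \text{(fixed-backbone design as a search problem)} \\
  p(s) &= \prod_{i=1}^{L} p(s_i \mid s_1, \dots, s_{i-1})
    && \text{(autoregressive sequence generation)}
\end{align}
```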
Methods in class 1️⃣ ('fixed-backbone' protein design) attempt to solve the classical protein design problem: find an optimal sequence that adopts a predetermined 3D structure.
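As a rough illustration of this problem statement: many fixed-backbone models output, for each residue position, a probability distribution over the 20 amino acids, and the simplest way to read out a design is to take the most likely residue at each position. The sketch below assumes such a probability matrix is already available (here it is hand-made toy data, not the output of any method in the table). It also computes native sequence recovery, a commonly reported metric. The helper names `decode_argmax`, `sequence_recovery`, and `one_hot_like` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def decode_argmax(probs):
    """Pick the most probable amino acid at each position.

    probs: list of rows, one per residue position; each row holds 20
    probabilities ordered like AMINO_ACIDS.
    """
    sequence = []
    for row in probs:
        if len(row) != len(AMINO_ACIDS):
            raise ValueError("each row needs one probability per amino acid")
        best = max(range(len(row)), key=row.__getitem__)
        sequence.append(AMINO_ACIDS[best])
    return "".join(sequence)


def sequence_recovery(designed, native):
    """Fraction of positions where the design matches the native sequence."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def one_hot_like(residue, peak=0.81):
    """Toy distribution with most of the mass on one residue (illustration only)."""
    rest = (1.0 - peak) / (len(AMINO_ACIDS) - 1)
    return [peak if aa == residue else rest for aa in AMINO_ACIDS]


# Toy stand-in for a model's output on a 4-residue backbone (assumed data).
toy_probs = [one_hot_like(aa) for aa in "MKVL"]
designed = decode_argmax(toy_probs)
recovery = sequence_recovery(designed, "MKIL")
print(designed, recovery)
```

In practice, sampling from the distribution (often with a temperature) is a common alternative to the argmax readout when diverse designs are wanted.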
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
SPIN2 | FNN | ~105k | 3D structure | sequence | 1,532 X-ray structures | Paper | Code used to be here - no longer available | 2018/02 |
SPROF | CNN-LSTM | - | 3D structure | sequence | 1,532 X-ray structures | Paper | Code Web Server | 2019/08 |
Ingraham et al. | modified Transformer | >3k | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code | 2019/12 |
ProDCoNN | CNN | >28k | 3D structure | sequence | Two datasets: ID90TR: 17,044; ID30TR: 9,135 sequences/PDB pairs | Paper | Reimplementation | 2019/12 |
Anand et al. | CNN | - | 3D structure | Amino acid and side chain conformation | 53,414 CATH domain structures | Paper | Code | 2020/01 |
DenseCPD | CNN | 3M | 3D structure | sequence | 11,227 X-ray structures | Paper | Web server Reimplementation | 2020/01 |
ProteinSolver | GNN | - | 3D structure | sequence | 72,464,122 sequences/adjacency matrices pairs | Paper | Code | 2020/03 |
Norn et al. | CNN | N/A | distances, angles, and dihedrals for every pair of residues (trRosetta) | sequence | N/A | Paper | Code | 2020/07 |
GVP-GNN | GVP | - | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code | 2020/09 |
Fold2Seq | modified Transformer | - | 3D structure | sequence | 45,995 3D structures from CATH 4.2 filtered @ 100% | Paper | Code | 2021/06 |
CNN_protein_landscape | CNN | >10M | 3D structure | sequence | 16,569 PDB chains | Paper | Code | 2021/08 |
Orellana et al. | GCN | - | 3D structures | sequence | CATH 4.2 40% sequences/structures | Paper | - | 2021/11 |
ABACUS-R | Transformer | 152M | 3D structures | sequence | CATH 4.2 | Paper | Code | 2022/02 |
ESM-IF1 | GVP-Transformer | 142M | 3D structure | sequence | 16k X-ray structures + 1.2M AF2 predictions | Paper | Code | 2022/04 |
TERMinator | GNN | - | 3D structures | sequences | CATH 4.2 40% sequences/structures | Paper | - | 2022/04 |
McPartlon et al. | modified Transformer | - | 3D structures | sequences | 37k X-ray structures from BC40 | Paper | - | 2022/04 |
MIF | Structured GNN | 6.8M | 3D structure | sequence | | Paper | Code | 2022/05 |
ProteinMPNN | MPNN | 1.8M | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code Web Interface | 2022/06 |
ProDESIGN-LE | Transformer + FNN | - | 3D structure | sequence | 5,867,488 residues from PDB40 | Paper | - | 2022/07 |
TIMED | CNN | 3M | 3D structure | sequence | 32k structures from the PISCES server | Paper | Code | 2022/08 |
PiFold | GNN | - | 3D structure | sequence | - | Paper | | 2022/09 |
Methods in class 2️⃣ (structure generation) generate structures unconditionally or from a set of secondary-structure conditions.
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
64GAN | GAN | - | - | contact map (3D structure via ADMM) | 427,659 contact maps | Paper | - | 2018/12 |
Anand et al. | GAN | - | - | distance map (3D structure via CNN) | 800,000 distance maps | Paper | | 2019/03 |
RamaNet | LSTM | >2k | - | A sequence of φ and ψ angles | 607 helical structures | Paper | Code | 2019/06 |
DECO-VAE | VAE | - | Structures represented as graphs | contact graph (translatable to contact map) | >650,000 contact graphs | Paper | Upon request | 2020/04 |
SCUBA | NC-NN | ~20k | secondary structure motif | backbone | 12,465 structures | Paper | Code | 2022/02 |
Ig-VAE | VAE | - | - | protein backbone coordinates | 10,768 individual immunoglobulin domains | Paper | Code | 2022/02 |
GENESIS | VAE | - | secondary structure motif | contact map | 40,726 backbones with remodeled loops | Paper | - | 2022/03 |
ProtDiff & SMCDiff | EGNN | - | Optional: structural motif | coordinates | 4,269 PDB structures | Paper | - | 2022/06 |
Lai et al. | VAE | - | topology | protein backbone coordinates | CATH 4.2 40% sequences/structures | Paper | - | 2022/07 |
ProteinSGM | SDE + RefineNet | - | optional: masked matrices | matrices describing distance and torsional angles | 10,361 CATH 4.3 95% structures | Paper | - | 2022/07 |
FoldingDiff | Transformer | - | - | internal angles | CATH 4.2 40% structures | Paper | Code | 2022/09 |
Methods in class 3️⃣ (sequence generation) generate sequences, usually with autoregressive language models, and can sometimes be conditioned.
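To make "autoregressive" concrete: such a model generates one residue at a time, each drawn from a distribution conditioned on the residues generated so far, optionally starting from a user-supplied prompt. The sketch below is a toy illustration under that assumption. `toy_next_token_probs` is a hypothetical stand-in for a trained language model and does not correspond to any method in the table.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
END = "*"  # end-of-sequence token (assumed convention for this sketch)


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a trained language model.

    Returns p(next token | prefix) as a dict. Here it is uniform over the
    amino acids, with a stop probability that grows with the prefix length.
    """
    p_end = min(1.0, len(prefix) / 50)
    p_aa = (1.0 - p_end) / len(AMINO_ACIDS)
    probs = {aa: p_aa for aa in AMINO_ACIDS}
    probs[END] = p_end
    return probs


def sample_sequence(next_token_probs, prompt="", max_len=100, rng=None):
    """Sample residues one at a time until the END token or max_len is reached."""
    rng = rng or random.Random()
    seq = prompt
    while len(seq) < max_len:
        probs = next_token_probs(seq)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == END:
            break
        seq += token
    return seq


# Unconditional generation, and generation conditioned on a prompt.
print(sample_sequence(toy_next_token_probs, rng=random.Random(0)))
print(sample_sequence(toy_next_token_probs, prompt="MK", rng=random.Random(0)))
```

The optional `prompt` mirrors the "Optional: sequence" user input listed for many of the models below: the prompt fixes the first residues and the model continues from there.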
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
ProteinGAN | GAN | 60M | | sequence | 16,706 MDH sequences | Paper | Code | 2019/10 |
ProGen | Transformer | 1.2B | Optional: sequence or function | sequence | 280M sequences | Paper | | 2020/03 |
ProtXLnet | Transformer | 409M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2020/07 |
ProtXL | Transformer | 562M | Optional: sequence | sequence | BFD100 | Paper | | 2020/07 |
ProtElectra-Generator | Transformer | 420M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2020/07 |
ProtT5 | Transformer | 11B | Optional: sequence | sequence | BFD100 | Paper | Code | 2020/07 |
EVE | VAE | | MSA | sequence | 3,219 MSAs extracted from UniRef100 | Paper | Code | 2020/12 |
DARK3 | Transformer | 110M | Optional: sequence | sequence | 615,000 synthetic sequences | Paper | - | 2022/01 |
ReLSO | Modified transformer | 110M | sequence | sequence and predicted value for label | directed evolution datasets | Paper | Code | 2022/02 |
ProtGPT2 | Transformer | 739M | Optional: sequence | sequence | UniRef50 | Paper | Code | 2022/03 |
RITA | Transformer | 1.2B | Optional: sequence | sequence | UniRef100 | Paper | Code | 2022/05 |
Tranception | Transformer | 700M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2022/05 |
ProGen2 | Transformer | 6.4B | Optional: sequence | sequence | UniRef90+BFD30 | Paper | Code | 2022/06 |
Methods in class 4️⃣ (concomitant sequence and structure design) generate sequences and structures together, and include hallucination methods and constrained generation (inpainting).
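Hallucination is often described as starting from a random sequence and repeatedly mutating it, keeping changes that make a structure predictor more confident in its prediction. The sketch below illustrates that loop with a Metropolis-style random search. It is a minimal sketch under stated assumptions: `toy_score` is a hypothetical placeholder for a real predictor's confidence, and the methods in the table may use gradient-based or other optimizers instead.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical stand-in for a structure predictor's confidence.

    Rewards an alternating hydrophobic/polar pattern; a real pipeline would
    score the structure predicted for `seq` instead.
    """
    hydrophobic = set("AILMFVW")
    hits = sum((aa in hydrophobic) == (i % 2 == 0) for i, aa in enumerate(seq))
    return hits / len(seq)


def hallucinate(score, length=30, steps=2000, temperature=0.05, seed=0):
    """Metropolis-style search: mutate one position, accept or revert."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]  # random start
    current = score(seq)
    best_seq, best = list(seq), current
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        proposed = score(seq)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the score drop grows.
        accept = proposed >= current or (
            rng.random() < math.exp((proposed - current) / temperature)
        )
        if accept:
            current = proposed
            if current > best:
                best_seq, best = list(seq), current
        else:
            seq[pos] = old  # revert the mutation
    return "".join(best_seq), best


designed, final_score = hallucinate(toy_score)
print(designed, final_score)
```

Constrained variants can be sketched in the same loop by keeping selected positions fixed or by adding penalty terms to the score. This is again an assumption about one possible design, not a description of the listed methods.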
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
Hallucination | CNN (trRosetta) | N/A | random sequence | sequence/structure | N/A | Paper | Code | 2020/07 |
Constrained hallucination | CNN (trRosetta) | N/A | sequence/structure | sequence/structure | N/A | Paper | Code | 2020/11 |
Constrained hallucination2 | CNN (RoseTTAFold) | N/A | sequence/structure | sequence/structure | N/A | Paper | Code | 2021/11 |
RFjoint | CNN (RoseTTAFold, finetuned) | N/A | sequence/structure | sequence/structure | Finetuned with 25% PDB version 02/2020 + 75% AF2 structures | Paper | Code | 2021/11 |
Protein Diffusion | Diffusion model | - | Secondary structure motif sketches | sequence/structure | 53,414 3D structures (95% CATH 4.2 S95) | Paper | Code | 2022/05 |
Roney | AlphaFold2 | N/A | random sequence | sequence/structure | N/A | Paper | Code | 2022/06 |