Name		Name	Last commit message	Last commit date
parent directory ..
pdb_files		pdb_files
README.md		README.md
analysis.ipynb		analysis.ipynb
config.py		config.py
data.zip		data.zip
run_ad.py		run_ad.py
run_ga.py		run_ga.py
run_sd.py		run_sd.py

README.md

This folder contains the code to reproduce and analyze the RfaH multistate design benchmark results, as well as the source data to reproduce all RfaH analysis figures in the paper.

Simulation code

config.py: protein system and simulation setups required for all RfaH design simulations. The user needs to update file/folder path strings in this file according to the local setups.
run_ga.py: performs multistate RfaH sequence design with NSGA-II. The script is setup to sweep through a set of mutation rates, mutation operator setups, and objective function setups with the batch_settings_dict variable and the batch_ind command line argument.
run_ad.py: performs multistate RfaH sequence design in ProteinMPNN and computes additional metric/objective functions for the redesigned sequences; an option is provided to score the WT sequence only.
run_sd.py: performs single-state RfaH sequence design in ProteinMPNN and computes additional metric/objective functions for the redesigned sequences. The user needs to specify which RfaH state to redesign as a command line argument, although the metric/objective functions will be calculated for both states.
pdb_files/: Rosetta relaxed PDB files.

The design simulation results will be outputted as pickle files in the output/ folder. run_ga.py is setup for parallelization over an SGE job scheduler, and run_ad.py and run_sd.py are setup to be submitted as SGE array jobs (pass the array job ID to batch_ind as a command line argument).

Analysis code

analysis.ipynb: jupyter notebook used to generate all RfaH-related figures in the main text and supplementary material, except for the structure visualizations.
data/: the RfaH benchmark data; required to run the analysis notebook.
- benchmark_collated.gz.parquet: a pandas DataFrame containing all single-state and multistate simulation results. The pyarrow package is required for pandas to parse the parquet file format.
- 41467_2022_31532_MOESM3_ESM.csv: A NusG-like sequence database containing computational foldswitching predictions; retrieved from the supplementary materials of the paper Many dissimilar NusG protein domains switch between α-helix and β-sheet folds.
- idmapping_active_true_2023_12_17.fasta and porter_full_seqs_clustal_omega.fa: intermediate analysis sequence files; see the jupyter notebook for more information.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RfaH_benchmark

RfaH_benchmark

README.md

Simulation code

Analysis code

Files

RfaH_benchmark

Directory actions

More options

Directory actions

More options

Latest commit

History

RfaH_benchmark

Folders and files

parent directory

README.md

Simulation code

Analysis code