This repository contains the data and code necessary to reproduce the results from the article Prediction of protein subplastid localization and origin with PlastoGram. Sidorczuk K., Gagat P., Kała J., Nielsen H., Pietluch F., Mackiewicz P., Burdukiewicz M. Sci Rep 13, 8365 (2023). https://doi.org/10.1038/s41598-023-35296-0
PlastoGram is available as:
This repository uses the renv and targets packages to control the workflow and ensure reproducibility.
Some of the data files are too large to store on GitHub, but they can be downloaded using the links below:
- All_sequences.fasta - All downloaded sequences of proteins localizing to different plastid compartments. See the supplementary materials of the paper for the exact queries used to obtain them.
- Dataset_annotations_references.xlsx - File with UniProt annotations of the downloaded proteins, along with our curated localization and references. See the Methods section of the article for a detailed description of the manual curation procedure.
Part of the analysis requires the HMMER software. Please make sure that HMMER is installed before running the pipeline. See the HMMER documentation for installation guidelines.
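Before starting a long run, it can help to confirm from R that HMMER is reachable. The snippet below is a minimal sketch, not part of the repository: the helper `check_hmmer()` is hypothetical, and the program names (`hmmbuild`, `hmmsearch`) are standard HMMER tools chosen as an assumption, since the exact programs called by the pipeline are not listed here. It also assumes HMMER is expected on the system `PATH`.

```r
# Names of HMMER programs to look for. Which programs the pipeline
# actually calls is an assumption; adjust this vector as needed.
hmmer_tools <- c("hmmbuild", "hmmsearch")

# Hypothetical helper: reports which programs can be found on PATH.
check_hmmer <- function(tools = hmmer_tools) {
  # Sys.which() returns the full path of each program, or "" if not found.
  paths <- unname(Sys.which(tools))
  found <- setNames(!is.na(paths) & nzchar(paths), tools)
  if (!all(found)) {
    warning("HMMER programs not found on PATH: ",
            paste(tools[!found], collapse = ", "))
  }
  invisible(found)
}

check_hmmer()
```

If a warning is raised, install HMMER or add its directory to `PATH` before running the pipeline.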
To reproduce the results, clone the repo, set the paths to the directories with the data files, and run:
renv::restore()
targets::tar_make()
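Because the full pipeline may take a long time, it can be useful to see which targets are defined and which still need to be built. The following is a rough sketch that assumes it is run from the repository root after `renv::restore()`. The wrapper `pipeline_status()` is a hypothetical helper, while `tar_manifest()` and `tar_outdated()` are functions from the targets package.

```r
# Hypothetical helper: summarises the state of the pipeline without running it.
pipeline_status <- function() {
  if (!file.exists("_targets.R")) {
    stop("No _targets.R in the working directory; run this from the repository root.")
  }
  list(
    # Names of all targets declared in _targets.R.
    defined = targets::tar_manifest()$name,
    # Targets that are missing or out of date and would be rebuilt by tar_make().
    outdated = targets::tar_outdated()
  )
}

if (file.exists("_targets.R")) {
  status <- pipeline_status()
  print(status$outdated)
}
```

Once `targets::tar_make()` has finished, an individual result can be loaded with `targets::tar_read()`, passing one of the names listed in `status$defined`.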
The repository is organized as follows:
- _targets.R - reproducible pipeline for generating all data sets and processing the results,
- data - data files used during the study, e.g. for creation of the positive dataset,
- drafts - draft code used for initial exploratory analyses,
- functions - all functions used for running the pipeline and obtaining results,
- renv - renv package files,
- third-party - third-party executables used in the pipeline.
If you are interested in seeing some intermediate results but do not want to run the whole pipeline, we provide links to the most important directories and files with results obtained during the study.
- Replication 1, Replication 2, Replication 3, Replication 4, Replication 5 - prediction results of all lower-level models in 10-fold CV repeated 5 times.
- Model_architectures_envelope - directory with files describing each of the considered ensembles.
- Model_architectures_envelope_results - directory containing the prediction results of lower-level models, filtered according to each ensemble.
- Architectures_envelope_performance.csv - file with performance measures of all ensembles, calculated separately for each replication and fold.
- Architectures_envelope_mean_performance.csv - file with averaged performance measures of all ensembles.
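Once one of the CSV files has been downloaded, it can be inspected directly in R. This is a minimal sketch under stated assumptions: the local path below is hypothetical, `load_performance()` is an illustrative helper rather than a function from this repository, and because the column layout is not described here, the code only prints the structure instead of relying on specific column names.

```r
# Hypothetical local path; point it at wherever the downloaded file was saved.
perf_path <- "Architectures_envelope_mean_performance.csv"

# Illustrative helper: reads a performance table into a data frame.
load_performance <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # check.names = FALSE keeps the column names exactly as stored in the file.
  read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
}

if (file.exists(perf_path)) {
  perf <- load_performance(perf_path)
  str(perf)         # column names and types
  print(head(perf)) # first few rows
}
```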