Rayan Chikhi edited this page Mar 5, 2021 · 5 revisions

Accessing Assembly Data

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   ├─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   ├─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments
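
The layout above maps directly onto S3 prefixes. As a minimal sketch, the following Python helper builds the prefix for each of the three folders. The helper name `assembly_prefix` is hypothetical; only the bucket name and folder names are taken from the tree above.

```python
BUCKET = "lovelywater"  # read-only Serratus archive, as shown in the tree above
ASSEMBLY_FOLDERS = ("cov", "contigs", "annotation")


def assembly_prefix(folder: str) -> str:
    """Return the s3:// prefix of one assembly/ sub-folder (hypothetical helper)."""
    if folder not in ASSEMBLY_FOLDERS:
        raise ValueError(
            f"unknown folder {folder!r}; expected one of {ASSEMBLY_FOLDERS}"
        )
    return f"s3://{BUCKET}/assembly/{folder}/"


if __name__ == "__main__":
    for name in ASSEMBLY_FOLDERS:
        print(assembly_prefix(name))
```

The resulting prefix can be handed to any S3 client. Whether anonymous (unsigned) requests are accepted is not stated on this page, so treat that as something to check.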

assembly/cov:

These are the 11,120 coronavirus assemblies produced with coronaSPAdes. Contigs were filtered using either CheckV or coronaSPAdes' bgc-statistics. See the Serratus manuscript for more details.

assembly/contigs:

SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt

All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes. Only a subset of these files is present for each accession, depending on the assembler. Beware: despite its name, contigs.fa.mfc contains coronaSPAdes' scaffolds.fasta, compressed with MFCompress.
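
The naming pattern above can be expanded programmatically. This is a minimal sketch that lists the candidate file names for one accession. The function name `contigs_candidates` is hypothetical, and the exact spelling of the `[assembler]` token in real file names is not given on this page, so the caller must supply it. Not every candidate will exist for a given accession.

```python
# Suffixes copied from the assembly/contigs listing above.
CONTIGS_SUFFIXES = (
    "assembly_graph_with_scaffolds.gfa.gz",
    "bgc_statistics.txt",
    "contigs.fa.mfc",  # holds scaffolds.fasta compressed with MFCompress
    "domain_graph.dot",
    "gene_clusters.fa",
    "scaffolds.fasta.gz",
    "scaffolds.paths",
    "log",
    "txt",
)


def contigs_candidates(accession: str, assembler: str) -> list[str]:
    """List candidate file names in assembly/contigs for one accession.

    `assembler` is the token used in the file name; its exact spelling is
    an assumption left to the caller.
    """
    return [f"{accession}.{assembler}.{suffix}" for suffix in CONTIGS_SUFFIXES]
```

For example, `contigs_candidates("SRRXXXXXX", "[assembler]")` reproduces the nine names in the listing above.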

assembly/annotation:

This folder contains the annotation results of several programs applied to different inputs.

CheckV applied to scaffolds.fasta and/or gene_clusters.fa:

SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz
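
As an illustration of how a quality_summary file might be consumed, the sketch below reads a gzipped TSV and keeps rows by quality tier. It assumes the file follows CheckV's usual quality_summary layout, with a header row and a `checkv_quality` column. The function names and the default tiers are illustrative choices, not the filter used by Serratus.

```python
import csv
import gzip


def read_quality_summary(path):
    """Read a gzipped, tab-separated quality summary into a list of dict rows."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def keep_by_quality(rows, accepted=("Complete", "High-quality")):
    """Keep rows whose checkv_quality value is in `accepted`.

    The column name and the default tiers are assumptions based on
    CheckV's documented output; adjust them to the files at hand.
    """
    return [row for row in rows if row.get("checkv_quality") in accepted]
```

Rows are returned as plain dictionaries keyed by the header names, so other columns remain available for further filtering.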

serraplace (phylogenetic placement) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz

serratax (taxonomic identification) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz

The remaining files are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotating coronavirus assemblies.

SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
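
The annotation folder therefore mixes two naming patterns: `SRRXXXXXX.[assembler].…` for per-assembler results and `SRRXXXXXX.fa.…` for annotations of cov/ assemblies. A minimal sketch for telling them apart follows. The names `ParsedName` and `parse_annotation_name` are hypothetical, and the code assumes that neither the accession nor the assembler token contains a dot.

```python
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    accession: str
    assembler: Optional[str]  # None for annotations of cov/ assemblies
    kind: str                 # everything after the second dot


def parse_annotation_name(filename: str) -> ParsedName:
    """Split an assembly/annotation file name into its parts.

    Raises ValueError if the name has fewer than three dot-separated fields.
    """
    accession, second, rest = filename.split(".", 2)
    if second == "fa":
        # Pattern SRRXXXXXX.fa.<kind>: annotation of an assembly in cov/
        return ParsedName(accession, None, rest)
    # Pattern SRRXXXXXX.[assembler].<kind>
    return ParsedName(accession, second, rest)
```

Grouping parsed names by `accession` then gives all annotation products available for one run.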

See also: Accessing Serratus Data
