-
Notifications
You must be signed in to change notification settings - Fork 22
Pipeline outputs
If you ran the pipelines without a database, then the output files will be written to the output directory specified in the command that was run.
If you used a database, then the script clockwork find_data
can be run to
make a tab-delimited file of sample information and directories on disk.
The general usage, which will output sample identifiers and directories/files
on disk is:
clockwork find_data db.ini Pipeline_root/ > out.tsv
where db.ini
and Pipeline_root/
are as described in the
Pipelines page.
All files (reads, any pipeline outputs) for one isolate are stored in a
directory, specified by the isolate_directory
column.
The original and decontaminated reads are stored in the Reads/
directory.
A Reads/
directory for a sample that has had the remove contamination
pipeline run contains these files:
-
reads.original.1.1.fq.gz
,reads.original.1.2.fq.gz
- the original reads files -
reads.contam.1.1.fq.gz
,reads.contam.1.2.fq.gz
- the reads classified as contamination -
reads.remove_contam.1.1.fq.gz
,reads.remove_contam.1.2.fq.gz
- decontaminated reads -
reads.remove_contam.1.counts.tsv
- a tab-delimited file containing counts of how the reads were classified.
Example made-up counts TSV file:
Name Is_contam Reads
Human 1 5
TB 0 100000
Virus 1 10
Unmapped 0 100
Reads_kept_after_remove_contam 0 100100
Depending on the remove contamination reference used, the contents may look different.
The reference sequences are grouped - the Name
column is the name
of the group. Is_contam
value 0
means reads will not be removed,
and a value of 1
means that they are.
In this example, there were 5 human reads and 10 virus reads removed.
There were 100000 TB and 100 unmapped reads, all of which were retained.
The total kept and removed are stored in the database. This query will give you the numbers:
select site_id,subject_id,sample_id_from_lab,isolate_number_from_lab,sequence_replicate_number,dataset_name,original_total,contamination,not_contamination,unmapped,total_after_remove_contam from Seqrep join Isolate on Seqrep.isolate_id = Isolate.isolate_id join Sample on Isolate.sample_id = Sample.sample_id join Read_counts on Seqrep.seqrep_id = Read_counts.seqrep_id
Running clockwork find_data
with the option --pipeline qc
will output a tab-delimited
file with the QC output directory for each pair of fastq files. It is in the
column pipeline_directory
, and will look like
Pipeline_root/00/00/12/34/1234/Pipelines/1/qc/0.1.7/
.
Inside that directory are the results of running fastqc and samtools stats.
Key results from those two tools are also stored in the database. You can extract them using this mysql query:
select * from QC join Seqrep on QC.seqrep_id = Seqrep.seqrep_id join Isolate on Isolate.isolate_id = Seqrep.isolate_id join Sample on Isolate.sample_id = Sample.sample_id;
Note: if you want mean depth, it can be calculated using the samtools_bases_mapped_cigar
column from
the output of that query (divide it by the genome size).
Running clockwork find_data
with the option --pipeline variant_call
will output a tab-delimited
file with the variant_call
output directory for each isolate. It is in the
column pipeline_directory
, and will look like
Pipeline_root/00/00/12/34/1234/Pipelines/1/variant_call/0.8.3.ref.3
.
The final VCF file is called minos/final.vcf
. The pipeline first
calls variants using samtools and cortex (the output from those
tools are in directories with the tool names). The unfiltered calls from
samtools and cortex are input to
minos to make the final
call set.
Note: variants are not stored in the database - you need to use the VCF file for each isolate.