Pipeline outputs

Martin Hunt edited this page Jun 15, 2022 · 4 revisions

Getting output files

If you ran the pipelines without a database, then the output files will be written to the output directory specified in the command that was run.

If you used a database, then run the script clockwork find_data to make a tab-delimited file of sample identifiers and the corresponding directories/files on disk. The general usage is:

clockwork find_data db.ini Pipeline_root/ > out.tsv

where db.ini and Pipeline_root/ are as described in the Pipelines page.

All files (reads, any pipeline outputs) for one isolate are stored in a directory, specified by the isolate_directory column. The original and decontaminated reads are stored in the Reads/ directory.
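As an illustration, the TSV written by clockwork find_data can be read with standard tools. This is a hypothetical sketch: only the isolate_directory column is named on this page, so any other columns in your file may differ.

```python
import csv

# Hypothetical sketch: list each isolate's directory from the output of
# `clockwork find_data`. Only the isolate_directory column is documented
# on this page; other column names are assumptions.
def isolate_directories(tsv_path):
    with open(tsv_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [row["isolate_directory"] for row in reader]
```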

Remove contamination

A Reads/ directory for a sample that has had the remove contamination pipeline run contains these files:

  • reads.original.1.1.fq.gz, reads.original.1.2.fq.gz - the original reads files
  • reads.contam.1.1.fq.gz, reads.contam.1.2.fq.gz - the reads classified as contamination
  • reads.remove_contam.1.1.fq.gz, reads.remove_contam.1.2.fq.gz - decontaminated reads
  • reads.remove_contam.1.counts.tsv - a tab-delimited file containing counts of how the reads were classified.

Example made-up counts TSV file:

Name                            Is_contam  Reads
Human                           1          5
TB                              0          100000
Virus                           1          10
Unmapped                        0          100
Reads_kept_after_remove_contam  0          100100

Depending on the remove contamination reference used, the contents may differ. The reference sequences are grouped; the Name column is the name of the group. An Is_contam value of 0 means reads in that group are kept, and a value of 1 means they are removed. In this example, 5 human reads and 10 virus reads were removed. The 100000 TB reads and 100 unmapped reads were all retained.
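The counts file above can be tallied programmatically. A minimal sketch, assuming the three columns shown in the example (Name, Is_contam, Reads):

```python
import csv

# Sketch: tally kept vs removed reads from a remove_contam counts TSV.
# The Reads_kept_after_remove_contam summary row is skipped so the
# per-group counts are not double-counted.
def tally_counts(tsv_path):
    kept = removed = 0
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["Name"] == "Reads_kept_after_remove_contam":
                continue
            if row["Is_contam"] == "1":
                removed += int(row["Reads"])
            else:
                kept += int(row["Reads"])
    return kept, removed
```

On the example file above, this returns 100100 kept reads (matching the summary row) and 15 removed.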

The totals of kept and removed reads are stored in the database. This query will give you the numbers:

select site_id, subject_id, sample_id_from_lab, isolate_number_from_lab,
    sequence_replicate_number, dataset_name, original_total, contamination,
    not_contamination, unmapped, total_after_remove_contam
from Seqrep
join Isolate on Seqrep.isolate_id = Isolate.isolate_id
join Sample on Isolate.sample_id = Sample.sample_id
join Read_counts on Seqrep.seqrep_id = Read_counts.seqrep_id;

QC

Running clockwork find_data with the option --pipeline qc will output a tab-delimited file with the QC output directory for each pair of fastq files. It is in the column pipeline_directory, and will look like Pipeline_root/00/00/12/34/1234/Pipelines/1/qc/0.1.7/.
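The example path suggests how isolate IDs map to directories: ID 1234 appears under 00/00/12/34/1234, i.e. the ID zero-padded to 8 digits and split into four pairs. That layout is an inference from the example above, not documented behaviour, so the sketch below is hypothetical:

```python
# Hypothetical sketch: build the isolate directory path that the example
# Pipeline_root/00/00/12/34/1234 appears to follow (ID zero-padded to 8
# digits, split into pairs). This is inferred, not a documented API.
def isolate_path(pipeline_root, isolate_id):
    padded = str(isolate_id).zfill(8)
    pairs = [padded[i:i + 2] for i in range(0, 8, 2)]
    return "/".join([pipeline_root, *pairs, str(isolate_id)])
```

In practice, prefer the pipeline_directory column from clockwork find_data rather than constructing paths by hand.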

Inside that directory are the results of running fastqc and samtools stats.

Key results from those two tools are also stored in the database. You can extract them using this MySQL query:

select * from QC
join Seqrep on QC.seqrep_id = Seqrep.seqrep_id
join Isolate on Isolate.isolate_id = Seqrep.isolate_id
join Sample on Isolate.sample_id = Sample.sample_id;

Note: if you want mean depth, it can be calculated using the samtools_bases_mapped_cigar column from the output of that query (divide it by the genome size).
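That calculation is just one division. For example, with an M. tuberculosis reference of roughly 4.4 Mb (use the exact size of your own reference):

```python
# Mean depth from the QC table: samtools_bases_mapped_cigar divided by
# the genome size. E.g. 440,000,000 mapped bases over a 4,400,000 bp
# genome gives a mean depth of 100x.
def mean_depth(bases_mapped_cigar, genome_size):
    return bases_mapped_cigar / genome_size
```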

Variant call

Running clockwork find_data with the option --pipeline variant_call will output a tab-delimited file with the variant_call output directory for each isolate. It is in the column pipeline_directory, and will look like Pipeline_root/00/00/12/34/1234/Pipelines/1/variant_call/0.8.3.ref.3.

The final VCF file is called minos/final.vcf. The pipeline first calls variants using samtools and cortex (the output from each tool is in a directory named after the tool). The unfiltered calls from samtools and cortex are then input to minos to make the final call set.

Note: variants are not stored in the database - you need to use the VCF file for each isolate.
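Since the calls live only in the per-isolate VCF, you will typically parse minos/final.vcf yourself. A minimal sketch for counting records (plain-text VCF; for a .gz file, open with gzip.open instead):

```python
# Sketch: count variant records in a VCF, optionally only those whose
# FILTER column (field 7) is PASS. Header lines start with '#'.
def count_variants(vcf_path, pass_only=False):
    n = 0
    with open(vcf_path) as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if pass_only and fields[6] != "PASS":
                continue
            n += 1
    return n
```

For anything beyond simple counting, a dedicated VCF library (e.g. pysam or cyvcf2) is a better choice than hand-rolled parsing.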