
Joint cohort genotyping demonstrator pipeline

This simple demonstrator pipeline follows the basic principles of the common federation approach adopted by CINECA WP4. The goal is to demonstrate how a simple metric (in this case, allele frequency) can be computed in a federated manner, without ever collecting the raw (individual-level) data in a central location.

Step A (private step) reduces the individual-level genotypes to dataset-specific allele numbers (AN) and allele counts (AC), which are then exported and collected in a central location. Step B (meta-analysis step) then computes the final allele frequencies from the results collected in step A.
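As a minimal sketch of the arithmetic behind the two steps (not the pipeline's actual implementation), assume each dataset exports one (AC, AN) pair per variant and that the pooled frequency is the sum of the AC values divided by the sum of the AN values. The function name, the variant keys, and the counts are hypothetical and only serve as an illustration.

```python
from collections import defaultdict


def pool_allele_counts(datasets):
    """Sum per-dataset (AC, AN) pairs per variant and derive pooled frequencies.

    `datasets` is a list of dicts, each mapping a variant key to an (AC, AN) tuple.
    Returns a dict mapping each variant key to (total AC, total AN, frequency).
    """
    totals = defaultdict(lambda: [0, 0])
    for counts in datasets:
        for variant, (ac, an) in counts.items():
            totals[variant][0] += ac
            totals[variant][1] += an
    # The frequency is undefined when no alleles were observed (AN == 0).
    return {
        variant: (ac, an, ac / an if an else None)
        for variant, (ac, an) in totals.items()
    }


# Hypothetical step A exports from two datasets (illustrative numbers only).
dataset_a = {"chr17:1000:A>G": (3, 8), "chr17:2000:C>T": (1, 8)}
dataset_b = {"chr17:1000:A>G": (2, 8)}

pooled = pool_allele_counts([dataset_a, dataset_b])
print(pooled["chr17:1000:A>G"])  # (5, 16, 0.3125)
```

Only the aggregated counts cross the dataset boundary in this sketch, which is the point of the federated design. It deliberately ignores two things a real integration has to handle: variant keys must refer to the same reference genome before they can be matched, and a site missing from one dataset contributes nothing to AN here.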

The instructions below demonstrate how the pipeline can be run on two separate datasets with two different reference genomes using three different execution environments. Example input files are provided in the input-A and input-B directories. The test region used for the demonstrator is the ACE gene, with the coordinates chr17:61554422-61575741 in GRCh37 and chr17:63477061-63498373 in GRCh38.
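The two coordinate sets above describe the same gene but differ between genome builds, so positions from one dataset cannot be compared with the other without conversion. A small sketch, assuming 1-based inclusive coordinates (an assumption, not stated in the source) and using hypothetical helper names, shows how such region strings can be parsed and compared:

```python
import re

# Matches region strings of the form "chrom:start-end".
REGION_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end) with integer coordinates."""
    match = REGION_PATTERN.match(region)
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in region: {region!r}")
    return match["chrom"], start, end


def region_length(region):
    """Length in bases, assuming 1-based inclusive coordinates."""
    _, start, end = parse_region(region)
    return end - start + 1


ace_grch37 = "chr17:61554422-61575741"
ace_grch38 = "chr17:63477061-63498373"
print(region_length(ace_grch37), region_length(ace_grch38))  # 21320 21313
```

Under that assumption, both the offsets and the lengths of the two intervals differ, which illustrates why a liftover is needed before results from the two builds are combined.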

Dependency installation

The pipeline dependencies are defined in the Dockerfile and are available as the tskir/cineca-wp4-genotyping image.

For simplicity, the commands below always show Nextflow being invoked as nextflow; however, the exact invocation varies slightly between environments. Please see the separate documentation for each environment on how to run it.

Step A1, individual level data processing: GIAB

Property	Value
Dataset	GIAB
Access protocol	FTP
Number of samples	7 (4 used)
Data format	BAM
Reference genome	GRCh38
Processing environment	TESK @ CSC Rahti cloud

This example, input-A1-giab.tsv, uses FTP links to GRCh38 alignments of HG001...HG007 obtained from https://github.com/genome-in-a-bottle/giab_data_indexes.

See also the general instructions for setting up and using the TESK environment.

nextflow run -with-docker tskir/cineca-wp4-genotyping:v0.5.0 \
  stepA-calculate-frequency.nf \
  --inputData input-A/input-A1-giab.tsv \
  --referenceGenomeLink 'http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr17.fa.gz' \
  --debugDir debug_giab \
  --outputVcf result-A1-giab.vcf.gz

Step A2, individual level data processing: EGA

Property	Value
Dataset	EGA synthetic dataset
Access protocol	htsget via EGA download client
Number of samples	6 (4 used)
Data format	BAM
Reference genome	GRCh37
Processing environment	LSF @ EMBL-EBI cluster

The example, input-A2-ega.tsv, was constructed using 6 BAM files from the EGA test dataset EGAD00001003338. Details: https://github.com/EGA-archive/ega-download-client.

See also the general instructions for setting up and using the LSF environment.

nextflow run -with-singularity tskir/cineca-wp4-genotyping:v0.5.0 \
  stepA-calculate-frequency.nf \
  --inputData input-A/input-A2-ega.tsv \
  --referenceGenomeLink 'ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz' \
  --debugDir debug_ega \
  --outputVcf result-A2-ega.vcf.gz

Step B, result integration

Processing environment: local (Linux machine).

After steps A1 and A2 have been run, collect the result files (result-A1-giab.vcf.gz and result-A2-ega.vcf.gz) into the same location (the input-B directory in this example). The input file, input-B.tsv, lists the two output files from the previous steps, together with the transformations (chromosome renaming and liftover) to apply to each file.
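To illustrate the chromosome renaming transformation mentioned above, here is a minimal sketch that rewrites the CHROM column of VCF data lines. It is not the pipeline's implementation and does not reflect the actual layout of input-B.tsv. The mapping from "17" to "chr17" is an assumption based on the different naming styles of the two reference FASTA sources used in step A, and the function name and sample line are hypothetical.

```python
def rename_chromosome(vcf_line, mapping):
    """Rewrite the CHROM column of a VCF data line using `mapping`.

    Header lines (starting with '#') and unmapped chromosomes pass through unchanged.
    """
    if vcf_line.startswith("#"):
        return vcf_line
    chrom, sep, rest = vcf_line.partition("\t")
    return mapping.get(chrom, chrom) + sep + rest


# Assumed mapping: plain chromosome names to "chr"-prefixed names.
mapping = {"17": "chr17"}

# Hypothetical VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).
line = "17\t61554500\t.\tA\tG\t.\t.\tAN=8;AC=3"
print(rename_chromosome(line, mapping))
```

This sketch leaves header lines untouched, so any contig declarations in the header would still carry the old names; a complete renaming has to update those as well. Liftover between genome builds is a separate transformation and is not shown here.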

nextflow run -with-docker tskir/cineca-wp4-genotyping:v0.5.0 \
  stepB-integrate-results.nf \
  --inputData input-B/input-B.tsv \
  --inputDir `realpath input-B` \
  --targetReferenceGenomeLink 'http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr17.fa.gz' \
  --debugDir debug_integrate \
  --outputVcf result-B.vcf.gz

The resulting file, result-B.vcf.gz, contains the joint AN (allele number) and AC (allele count) values from the two datasets.
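Allele frequencies can be derived from the AN and AC values of each record. The sketch below assumes they are stored as standard VCF INFO keys, with AC holding one comma-separated value per alternate allele; the function names and INFO strings are hypothetical, and the exact INFO layout of result-B.vcf.gz is not confirmed by this document.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict; flag entries (no '=') map to True."""
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def allele_frequencies(info):
    """Return one AC/AN frequency per alternate allele, or None where AN is 0."""
    fields = parse_info(info)
    an = int(fields["AN"])
    allele_counts = [int(value) for value in fields["AC"].split(",")]
    return [ac / an if an else None for ac in allele_counts]


# Hypothetical INFO strings: one biallelic and one multiallelic site.
print(allele_frequencies("AN=16;AC=5"))    # [0.3125]
print(allele_frequencies("AN=16;AC=4,2"))  # [0.25, 0.125]
```

Returning a list keeps multiallelic sites aligned with the order of the alternate alleles in the record.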
