Add test profile and data (#33)

Add small test profile and data, hg38 asset files and code formatting.
genomic-medicine-sweden · Mar 22, 2024 · 06e7868
1 parent 1bb1e69
commit 06e7868

Showing 29 changed files with 300 additions and 142 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,7 +1,6 @@
 .gitignore
 .nextflow*
 work/
-data/
 results/
 .DS_Store
 testing/

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,16 +1,13 @@
 # fellen31/skierfe: Changelog
 
-The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
-and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+All notable changes to this project will be documented in this file.
 
-## v1.0dev - [date]
+The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
+and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
-Initial release of fellen31/skierfe, created with the [nf-core](https://nf-co.re/) template.
+<!-- insertion marker -->
+<!-- ## [0.1.0](https://github.com/fellen31/skierfe/releases/tag/0.1.0) - 2024-03-21 -->
 
-### `Added`
+### Added
 
-### `Fixed`
-
-### `Dependencies`
-
-### `Deprecated`
+- Added test data and test profile [#33](https://github.com/genomic-medicine-sweden/skierfe/pull/33)
diff --git a/README.md b/README.md
@@ -19,26 +19,31 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
 ## Pipeline summary
 
 ##### QC
+
 - FastQC ([`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
 - Aligned read QC ([`cramino`](https://github.com/wdecoster/cramino))
 - Depth information ([`mosdepth`](https://github.com/brentp/mosdepth))
 
 ##### Alignment & assembly
+
 - Align reads to reference ([`minimap2`](https://github.com/lh3/minimap2))
 - Assemble (trio-binned) haploid genomes (HiFi only) ([`hifiasm`](https://github.com/chhylp123/hifiasm))
 
 ##### Variant calling
+
 - Short variant calling & joint genotyping of SNVs ([`deepvariant`](https://github.com/google/deepvariant) + [`GLNexus`](https://github.com/dnanexus-rnd/GLnexus))
 - SV calling and joint genotyping ([`sniffles2`](https://github.com/fritzsedlazeck/Sniffles))
 - Tandem repeats ([`TRGT`](https://github.com/PacificBiosciences/trgt/tree/main))
 - Assembly based variant calls (HiFi only) ([`dipcall`](https://github.com/lh3/dipcall))
 - CNV-calling (HiFi only) ([`HiFiCNV`](https://github.com/PacificBiosciences/HiFiCNV))
 
 ##### Phasing and methylation
+
 - Phase and haplotag reads ([`whatshap`](https://github.com/whatshap/whatshap) + [`hiphase`](https://github.com/PacificBiosciences/HiPhase))
 - Methylation pileups (Revio/ONT) ([`modkit`](https://github.com/nanoporetech/modkit))
 
 ##### Annotation - SNV
+
 1. Annotate variants with database(s) of choice, i.e. [gnomAD](https://gnomad.broadinstitute.org), [CADD](https://cadd.gs.washington.edu) etc. ([`echtvar`](https://github.com/brentp/echtvar))
 2. Annotate variants ([`VEP`](https://github.com/Ensembl/ensembl-vep))
 
@@ -56,25 +61,29 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
 1. Prepare a samplesheet with input data (gzipped fastq-files):
 
 `samplesheet.csv`
+
 ```
 sample,file,family_id,paternal_id,maternal_id,sex,phenotype
 HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,1
 HG005,/path/to/HG005.fastq.gz,FAM1,HG003,HG004,2,1
 ```
 
 2. Optional inputs:
+
 - Limit SNV calling to regions in BED file (`--bed`)
 - If running dipcall, download a BED file with PAR regions ([hg38](https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed))
 - If running TRGT, download a BED file with tandem repeats ([TRGT](https://github.com/PacificBiosciences/trgt/tree/main/repeats)) matching your reference genome.
 - If running SNV annotation, download [VEP cache](https://ftp.ensembl.org/pub/release-110/variation/vep/homo_sapiens_vep_110_GRCh38.tar.gz) and prepare a samplesheet with annotation databases ([`echtvar encode`](https://github.com/brentp/echtvar)):
 - If running CNV-calling, expected CN regions for your reference genome can be downloaded from [HiFiCNV GitHub](https://github.com/PacificBiosciences/HiFiCNV/tree/main/data/excluded_regions)
 
 `snp_dbs.csv`
+
 ```
 sample,file
 gnomad,/path/to/gnomad.v3.1.2.echtvar.popmax.v2.zip
 cadd,/path/to/cadd.v1.6.hg38.zip
 ```
+
 <!---
 
 - If you want to give more samples to filter variants against, for SVs - prepare a samplesheet with .snf files from Sniffles2:
@@ -115,13 +124,13 @@ HG01125,/path/to/HG01125.g.vcf.gz
 
 To run in an offline environment, download the pipeline using [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use):
 
-   ```
-   nf-core download fellen31/skierfe -r dev
-   ```
+```
+nf-core download fellen31/skierfe -r dev
+```
 
-   > - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter` and `charliecloud` and which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
-   > - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
-   > - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
+> - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter` and `charliecloud` and which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
+> - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
+> - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
 
 > **Warning:**
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those

diff --git a/assets/expected_cn.hg38.XX.bed b/assets/expected_cn.hg38.XX.bed
@@ -0,0 +1,6 @@
+chrX	0	2781479	chrX_PAR_1	2
+chrX	2781479	155701382	chrX_uniq_1	2
+chrX	155701382	156040895	chrX_PAR_2	2
+chrY	0	2781479	chrY_PAR_1	0
+chrY	2781479	56887902	chrY_uniq_1	0
+chrY	56887902	57227415	chrY_PAR_2	0
diff --git a/assets/expected_cn.hg38.XY.bed b/assets/expected_cn.hg38.XY.bed
@@ -0,0 +1,6 @@
+chrX	0	2781479	chrX_PAR_1	2
+chrX	2781479	155701382	chrX_uniq_1	1
+chrX	155701382	156040895	chrX_PAR_2	2
+chrY	0	2781479	chrY_PAR_1	0
+chrY	2781479	56887902	chrY_uniq_1	1
+chrY	56887902	57227415	chrY_PAR_2	0
diff --git a/assets/external/cnv.excluded_regions.hg38.bed.gz b/assets/external/cnv.excluded_regions.hg38.bed.gz
diff --git a/assets/external/expected_cn.hg38.XX.bed b/assets/external/expected_cn.hg38.XX.bed
@@ -0,0 +1,6 @@
+chrX	0	2781479	chrX_PAR_1	2
+chrX	2781479	155701382	chrX_uniq_1	2
+chrX	155701382	156040895	chrX_PAR_2	2
+chrY	0	2781479	chrY_PAR_1	0
+chrY	2781479	56887902	chrY_uniq_1	0
+chrY	56887902	57227415	chrY_PAR_2	0
diff --git a/assets/external/expected_cn.hg38.XY.bed b/assets/external/expected_cn.hg38.XY.bed
@@ -0,0 +1,6 @@
+chrX	0	2781479	chrX_PAR_1	2
+chrX	2781479	155701382	chrX_uniq_1	1
+chrX	155701382	156040895	chrX_PAR_2	2
+chrY	0	2781479	chrY_PAR_1	0
+chrY	2781479	56887902	chrY_uniq_1	1
+chrY	56887902	57227415	chrY_PAR_2	0
diff --git a/assets/external/hs38.PAR.bed b/assets/external/hs38.PAR.bed
@@ -0,0 +1,2 @@
+chrX	0	2781479
+chrX	155701383	156030895
diff --git a/assets/external/pathogenic_repeats.hg38.bed b/assets/external/pathogenic_repeats.hg38.bed
@@ -0,0 +1,56 @@
+chr1	57367043	57367119	ID=DAB1;MOTIFS=AAAAT,GAAAT;STRUC=(AAAAT)n(GAAAT)n(AAAAT)n
+chr1	146228800	146228821	ID=NOTCH2NLA;MOTIFS=GCC;STRUC=(GCC)n
+chr1	149390802	149390841	ID=NOTCH2NLC;MOTIFS=GGC;STRUC=(GGC)n
+chr10	79826383	79826404	ID=NUTM2B-AS1;MOTIFS=CGG;STRUC=(CGG)n
+chr10	93702522	93702547	ID=FRA10AC1;MOTIFS=CCG;STRUC=(CCG)n
+chr11	66744821	66744850	ID=C11ORF80;MOTIFS=GCG;STRUC=(GCG)n
+chr11	119206289	119206322	ID=CBL;MOTIFS=CGG;STRUC=(CGG)n
+chr12	6936716	6936773	ID=ATN1;MOTIFS=CAG;STRUC=(CAG)n
+chr12	50505001	50505022	ID=DIP2B;MOTIFS=GGC;STRUC=(GGC)n
+chr12	111598949	111599018	ID=ATXN2;MOTIFS=GCT;STRUC=(GCT)n
+chr13	70139353	70139428	ID=ATXN8;MOTIFS=CTA,CTG;STRUC=(CTA)n(CTG)n
+chr13	99985448	99985494	ID=ZIC2;MOTIFS=GCN;STRUC=(GCN)n
+chr14	23321472	23321490	ID=PABPN1;MOTIFS=GCG;STRUC=(GCG)n
+chr14	92071009	92071042	ID=ATXN3;MOTIFS=GCT;STRUC=(GCT)n
+chr15	22786677	22786701	ID=NIPA1;MOTIFS=GCG;STRUC=(GCG)n
+chr16	17470907	17470923	ID=XYLT1;MOTIFS=GCC;STRUC=(GCC)n
+chr16	24613439	24613530	ID=TNRC6A;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n(TTTTA)n
+chr16	66490398	66490467	ID=BEAN1;MOTIFS=TGGAA,TAAAA;STRUC=(TGGAA)n(TAAAA)n
+chr16	87604287	87604329	ID=JPH3;MOTIFS=CTG;STRUC=(CTG)n
+chr18	55586155	55586227	ID=TCF4;MOTIFS=CAG;STRUC=(CAG)n
+chr19	13207858	13207897	ID=CACNA1A;MOTIFS=CTG;STRUC=(CTG)n
+chr19	14496041	14496074	ID=GIPC1;MOTIFS=CCG;STRUC=(CCG)n
+chr19	18786034	18786050	ID=COMP;MOTIFS=GTC;STRUC=(GTC)n
+chr19	45770204	45770264	ID=DMPK;MOTIFS=CAG;STRUC=(CAG)n
+chr2	96197066	96197122	ID=STARD7;MOTIFS=TGAAA,TAAAA;STRUC=(TGAAA)n(TAAAA)n
+chr2	100104799	100104824	ID=AFF3;MOTIFS=GCC;STRUC=(GCC)n
+chr2	176093058	176093104	ID=HOXD13;MOTIFS=GCN;STRUC=(GCN)n
+chr2	190880872	190880920	ID=GLS;MOTIFS=GCA;STRUC=(GCA)n
+chr20	2652733	2652775	ID=NOP56;MOTIFS=GGCCTG,CGCCTG;STRUC=(GGCCTG)n(CGCCTG)n
+chr21	43776443	43776479	ID=CSTB;MOTIFS=CGCGGGGCGGGG;STRUC=(CGCGGGGCGGGG)n
+chr22	45795354	45795424	ID=ATXN10;MOTIFS=ATTCT;STRUC=(ATTCT)n
+chr3	63912684	63912726	ID=ATXN7;MOTIFS=GCA,GCC;STRUC=(GCA)n(GCC)n
+chr3	129172576	129172732	ID=CNBP;MOTIFS=CAGG,CAGA,CA;STRUC=(CAGG)n(CAGA)n(CA)n
+chr3	138946020	138946063	ID=FOXL2;MOTIFS=NGC;STRUC=(NGC)n
+chr3	183712187	183712223	ID=YEATS2;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n
+chr4	3074876	3074966	ID=HTT;MOTIFS=CAG,CCG;STRUC=(CAG)nCAACAG(CCG)n
+chr4	39348424	39348479	ID=RFC1;MOTIFS=AAAAG,AAAGG,AAGGG,AAGAG,AGAGG,AACGG,GGGAC,AAAGGG;STRUC=<RFC1>
+chr4	41745972	41746032	ID=PHOX2B;MOTIFS=GCN;STRUC=(GCN)n
+chr4	159342526	159342617	ID=RAPGEF2;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n(TTTTA)n
+chr5	10356346	10356412	ID=MARCHF6;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n
+chr5	146878727	146878757	ID=PPP2R2B;MOTIFS=GCT;STRUC=(GCT)n
+chr6	16327633	16327723	ID=ATXN1;MOTIFS=TGC;STRUC=(TGC)n
+chr6	45422750	45422802	ID=RUNX2;MOTIFS=GCN;STRUC=(GCN)n
+chr6	170561906	170562017	ID=TBP;MOTIFS=GCA;STRUC=(GCA)n
+chr7	27199825	27199862	ID=HOXA13;MOTIFS=NGC;STRUC=(NGC)n
+chr7	55887601	55887640	ID=ZNF713;MOTIFS=CGG;STRUC=(CGG)n
+chr8	104588972	104589000	ID=LRP12;MOTIFS=CCG;STRUC=(CCG)n
+chr8	118366815	118366919	ID=SAMD12;MOTIFS=TGAAA,TAAAA;STRUC=(TGAAA)n(TAAAA)n
+chr9	27573528	27573546	ID=C9ORF72;MOTIFS=GGCCCC;STRUC=(GGCCCC)n
+chr9	69037261	69037304	ID=FXN;MOTIFS=A,GAA;STRUC=(A)n(GAA)n
+chrX	25013649	25013698	ID=ARX;MOTIFS=GCG;STRUC=(GCG)n
+chrX	67545316	67545385	ID=AR;MOTIFS=GCA;STRUC=(GCA)n
+chrX	140504316	140504362	ID=SOX3;MOTIFS=NGC;STRUC=(NGC)n
+chrX	147912050	147912110	ID=FMR1;MOTIFS=CGG;STRUC=(CGG)n
+chrX	148500631	148500691	ID=AFF2;MOTIFS=GCC;STRUC=(GCC)n
+chrX	149631763	149631782	ID=TMEM185A;MOTIFS=GCC;STRUC=(GCC)n
diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,2 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,file,family_id,paternal_id,maternal_id,sex,phenotype
+HG002_Revio,https://raw.githubusercontent.com/genomic-medicine-sweden/skierfe/dev/assets/test_data/HG002_PacBio_Revio.fastq.gz,FAM,XXX,YYY,1,1
diff --git a/assets/schema_snfs.json b/assets/schema_snfs.json
@@ -19,7 +19,6 @@
                 "pattern": "^\\S+\\.snf$",
                 "errorMessage": "SNF file must be provided, cannot contain spaces and must have extension '.snf"
             }
-
         },
         "required": ["sample", "file"]
     }

diff --git a/assets/test_data.bed b/assets/test_data.bed
@@ -0,0 +1,3 @@
+chr16	172876	173710
+chr16	176680	177522
+chrX	140502985	140505069
diff --git a/assets/test_data/HG002_ONT_UL_dorado0.4.0_sup4.1.0_5mCG_5hmCG.fastq.gz b/assets/test_data/HG002_ONT_UL_dorado0.4.0_sup4.1.0_5mCG_5hmCG.fastq.gz
diff --git a/assets/test_data/HG002_PacBio_Revio.fastq.gz b/assets/test_data/HG002_PacBio_Revio.fastq.gz
diff --git a/assets/test_data/empty.bed b/assets/test_data/empty.bed
@@ -0,0 +1 @@
+
diff --git a/assets/test_data/hg38.test.fa.gz b/assets/test_data/hg38.test.fa.gz
diff --git a/bin/split_bed_chunks.py b/bin/split_bed_chunks.py
@@ -2,58 +2,64 @@
 
 # Released under the MIT license.
 
-# Split regions in BED into n files with approximately equal region sizes. 
-# A region is never split. 13 is a good number. 
+# Split regions in BED into n files with approximately equal region sizes.
+# A region is never split. 13 is a good number.
 
 import sys
 import pandas as pd
 import string
 
+
 def contains_whitespace_other_than_tab(filepath):
-    with open(filepath, 'r') as file:
+    with open(filepath, "r") as file:
         for line_number, line in enumerate(file, start=1):
             for char_number, char in enumerate(line, start=1):
-                if char.isspace() and char != '\t' and char != '\n':
-                    print(f"Error: File contains whitespace characters other than tab at line {line_number}, position {char_number}.")
+                if char.isspace() and char != "\t" and char != "\n":
+                    print(
+                        f"Error: File contains whitespace characters other than tab at line {line_number}, position {char_number}."
+                    )
                     sys.exit(1)
 
+
 file_path = sys.argv[1]  # Replace with the path to your file
 
 contains_whitespace_other_than_tab(file_path)
 print("File does not contain whitespace characters other than tab and newline.")
 
-chromosome_data = pd.read_csv(sys.argv[1], names = ['chr', 'start', 'stop'], usecols=range(3), sep = '\t')
+chromosome_data = pd.read_csv(sys.argv[1], names=["chr", "start", "stop"], usecols=range(3), sep="\t")
 
-chromosome_data['size'] = chromosome_data['stop'] - chromosome_data['start']
+chromosome_data["size"] = chromosome_data["stop"] - chromosome_data["start"]
 
 # Number of bins
 n = int(sys.argv[2])
 
 # Sort chromosome data by size in descending order
-sorted_data = chromosome_data.sort_values(by='size', ascending=False)
+sorted_data = chromosome_data.sort_values(by="size", ascending=False)
 
 # Initialize empty bins as lists
 bins = [[] for _ in range(n)]
 
 # Allocate chromosomes to bins
 for index, row in sorted_data.iterrows():
     # Find the bin with the fewest chromosomes
-    min_bin = min(range(n), key=lambda i: sum(chrom['size'] for chrom in bins[i]))
+    min_bin = min(range(n), key=lambda i: sum(chrom["size"] for chrom in bins[i]))
 
     # Place the chromosome data in the selected bin
     bins[min_bin].append(row.to_dict())
 
 # Create a DataFrame to store the results
-result_df = pd.DataFrame({
-    'bin': [i + 1 for i in range(n) for _ in bins[i]],
-    'chr': [chromosome['chr'] for bin_chromosomes in bins for chromosome in bin_chromosomes],
-    'start': [int(chromosome['start']) for bin_chromosomes in bins for chromosome in bin_chromosomes],
-    'stop': [int(chromosome['stop']) for bin_chromosomes in bins for chromosome in bin_chromosomes],
-    'size': [int(chromosome['size']) for bin_chromosomes in bins for chromosome in bin_chromosomes]
-})
+result_df = pd.DataFrame(
+    {
+        "bin": [i + 1 for i in range(n) for _ in bins[i]],
+        "chr": [chromosome["chr"] for bin_chromosomes in bins for chromosome in bin_chromosomes],
+        "start": [int(chromosome["start"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
+        "stop": [int(chromosome["stop"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
+        "size": [int(chromosome["size"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
+    }
+)
 
 # Print the result DataFrame, ordered by size within each bin
-result_df = result_df.sort_values(by=['bin', 'size'], ascending=[True, False])
+result_df = result_df.sort_values(by=["bin", "size"], ascending=[True, False])
 
-for id, group in result_df.groupby(['bin']):
-    group[['chr', 'start', 'stop']].to_csv(f'{id}.bed', index=False, header=False, sep = '\t')
+for id, group in result_df.groupby(["bin"]):
+    group[["chr", "start", "stop"]].to_csv(f"{id}.bed", index=False, header=False, sep="\t")
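
Note on `bin/split_bed_chunks.py`: although the inline comment says the script finds "the bin with the fewest chromosomes", the `min(...)` call actually selects the bin with the smallest total region size, after sorting regions largest first. The following is a minimal, pandas-free sketch of that greedy allocation, not the script itself. The function name `split_regions` and the running `totals` list are hypothetical, introduced only for illustration; the three regions are taken from `assets/test_data.bed` in this commit.

```python
def split_regions(regions, n):
    """Greedily assign (chrom, start, stop) regions to n bins, largest first.

    Hypothetical helper for illustration; it mirrors the allocation loop
    in bin/split_bed_chunks.py but keeps running totals instead of
    re-summing each bin on every iteration.
    """
    bins = [[] for _ in range(n)]
    totals = [0] * n  # total region size currently held by each bin

    # Sort by region size (stop - start) in descending order.
    for chrom, start, stop in sorted(regions, key=lambda r: r[2] - r[1], reverse=True):
        # Pick the bin with the smallest total size; ties go to the lowest index.
        smallest = min(range(n), key=lambda i: totals[i])
        bins[smallest].append((chrom, start, stop))
        totals[smallest] += stop - start

    return bins, totals


# Regions from assets/test_data.bed, split into two bins.
regions = [
    ("chr16", 172876, 173710),
    ("chr16", 176680, 177522),
    ("chrX", 140502985, 140505069),
]
bins, totals = split_regions(regions, 2)
for number, (members, total) in enumerate(zip(bins, totals), start=1):
    print(number, total, members)
```

Because a region is never split, the bins are only approximately balanced: here the single chrX region fills one bin while both chr16 regions share the other.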
diff --git a/conf/modules/align_reads.config b/conf/modules/align_reads.config
@@ -18,7 +18,7 @@ process {
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     */
 
-     withName: FASTP {
+    withName: FASTP {
         ext.args = "--disable_adapter_trimming --disable_quality_filtering --split_by_lines ${params.split_fastq * 4}"
 
         // Not part of preprocess workflow now but makes sense to store it there

diff --git a/conf/modules/qc.config b/conf/modules/qc.config
@@ -46,7 +46,7 @@ process {
 
     withName: CRAMINO_PHASED {
         ext.args = '--karyotype --phased'
-         publishDir = [
+        publishDir = [
             path: { "${params.outdir}/qc/cramino/phased/${meta.id}" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename }