Commit

Add test profile and data (#33)
Add small test profile and data, hg38 asset files and code formatting.
fellen31 authored Mar 22, 2024
1 parent 1bb1e69 commit 06e7868
Showing 29 changed files with 300 additions and 142 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -1,7 +1,6 @@
.gitignore
.nextflow*
work/
data/
results/
.DS_Store
testing/
17 changes: 7 additions & 10 deletions CHANGELOG.md
@@ -1,16 +1,13 @@
# fellen31/skierfe: Changelog

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
All notable changes to this project will be documented in this file.

## v1.0dev - [date]
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

Initial release of fellen31/skierfe, created with the [nf-core](https://nf-co.re/) template.
<!-- insertion marker -->
<!-- ## [0.1.0](https://github.com/fellen31/skierfe/releases/tag/0.1.0) - 2024-03-21 -->

### `Added`
### Added

### `Fixed`

### `Dependencies`

### `Deprecated`
- Added test data and test profile [#33](https://github.com/genomic-medicine-sweden/skierfe/pull/33)
21 changes: 15 additions & 6 deletions README.md
@@ -19,26 +19,31 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
## Pipeline summary

##### QC

- FastQC ([`FastQC`](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
- Aligned read QC ([`cramino`](https://github.com/wdecoster/cramino))
- Depth information ([`mosdepth`](https://github.com/brentp/mosdepth))

##### Alignment & assembly

- Align reads to reference ([`minimap2`](https://github.com/lh3/minimap2))
- Assemble (trio-binned) haploid genomes (HiFi only) ([`hifiasm`](https://github.com/chhylp123/hifiasm))

##### Variant calling

- Short variant calling & joint genotyping of SNVs ([`deepvariant`](https://github.com/google/deepvariant) + [`GLNexus`](https://github.com/dnanexus-rnd/GLnexus))
- SV calling and joint genotyping ([`sniffles2`](https://github.com/fritzsedlazeck/Sniffles))
- Tandem repeats ([`TRGT`](https://github.com/PacificBiosciences/trgt/tree/main))
- Assembly based variant calls (HiFi only) ([`dipcall`](https://github.com/lh3/dipcall))
- CNV-calling (HiFi only) ([`HiFiCNV`](https://github.com/PacificBiosciences/HiFiCNV))

##### Phasing and methylation

- Phase and haplotag reads ([`whatshap`](https://github.com/whatshap/whatshap) + [`hiphase`](https://github.com/PacificBiosciences/HiPhase))
- Methylation pileups (Revio/ONT) ([`modkit`](https://github.com/nanoporetech/modkit))

##### Annotation - SNV

1. Annotate variants with database(s) of choice, e.g. [gnomAD](https://gnomad.broadinstitute.org) or [CADD](https://cadd.gs.washington.edu) ([`echtvar`](https://github.com/brentp/echtvar))
2. Annotate variants ([`VEP`](https://github.com/Ensembl/ensembl-vep))

@@ -56,25 +61,29 @@ The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool
1. Prepare a samplesheet with input data (gzipped fastq-files):

`samplesheet.csv`

```
sample,file,family_id,paternal_id,maternal_id,sex,phenotype
HG002,/path/to/HG002.fastq.gz,FAM1,HG003,HG004,1,1
HG005,/path/to/HG005.fastq.gz,FAM1,HG003,HG004,2,1
```

2. Optional inputs:

- Limit SNV calling to regions in BED file (`--bed`)
- If running dipcall, download a BED file with PAR regions ([hg38](https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed))
- If running TRGT, download a BED file with tandem repeats ([TRGT](https://github.com/PacificBiosciences/trgt/tree/main/repeats)) matching your reference genome.
- If running SNV annotation, download [VEP cache](https://ftp.ensembl.org/pub/release-110/variation/vep/homo_sapiens_vep_110_GRCh38.tar.gz) and prepare a samplesheet with annotation databases ([`echtvar encode`](https://github.com/brentp/echtvar)):
- If running CNV-calling, expected CN regions for your reference genome can be downloaded from [HiFiCNV GitHub](https://github.com/PacificBiosciences/HiFiCNV/tree/main/data/excluded_regions)

`snp_dbs.csv`

```
sample,file
gnomad,/path/to/gnomad.v3.1.2.echtvar.popmax.v2.zip
cadd,/path/to/cadd.v1.6.hg38.zip
```

<!---
- If you want to give more samples to filter variants against, for SVs - prepare a samplesheet with .snf files from Sniffles2:
@@ -115,13 +124,13 @@ HG01125,/path/to/HG01125.g.vcf.gz

To run in an offline environment, download the pipeline using [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use):

```
nf-core download fellen31/skierfe -r dev
```
```
nf-core download fellen31/skierfe -r dev
```

> - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter` and `charliecloud`, which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
> - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
> - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
> - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter` and `charliecloud`, which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
> - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
> - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.

> **Warning:**
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
6 changes: 6 additions & 0 deletions assets/expected_cn.hg38.XX.bed
@@ -0,0 +1,6 @@
chrX 0 2781479 chrX_PAR_1 2
chrX 2781479 155701382 chrX_uniq_1 2
chrX 155701382 156040895 chrX_PAR_2 2
chrY 0 2781479 chrY_PAR_1 0
chrY 2781479 56887902 chrY_uniq_1 0
chrY 56887902 57227415 chrY_PAR_2 0
6 changes: 6 additions & 0 deletions assets/expected_cn.hg38.XY.bed
@@ -0,0 +1,6 @@
chrX 0 2781479 chrX_PAR_1 2
chrX 2781479 155701382 chrX_uniq_1 1
chrX 155701382 156040895 chrX_PAR_2 2
chrY 0 2781479 chrY_PAR_1 0
chrY 2781479 56887902 chrY_uniq_1 1
chrY 56887902 57227415 chrY_PAR_2 0
Binary file added assets/external/cnv.excluded_regions.hg38.bed.gz
Binary file not shown.
6 changes: 6 additions & 0 deletions assets/external/expected_cn.hg38.XX.bed
@@ -0,0 +1,6 @@
chrX 0 2781479 chrX_PAR_1 2
chrX 2781479 155701382 chrX_uniq_1 2
chrX 155701382 156040895 chrX_PAR_2 2
chrY 0 2781479 chrY_PAR_1 0
chrY 2781479 56887902 chrY_uniq_1 0
chrY 56887902 57227415 chrY_PAR_2 0
6 changes: 6 additions & 0 deletions assets/external/expected_cn.hg38.XY.bed
@@ -0,0 +1,6 @@
chrX 0 2781479 chrX_PAR_1 2
chrX 2781479 155701382 chrX_uniq_1 1
chrX 155701382 156040895 chrX_PAR_2 2
chrY 0 2781479 chrY_PAR_1 0
chrY 2781479 56887902 chrY_uniq_1 1
chrY 56887902 57227415 chrY_PAR_2 0
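
The XX and XY expected copy-number tables above differ only in the non-PAR parts of chrX and chrY. A minimal sketch that parses both tables and lists the regions whose expected copy number differs; `parse_expected_cn` is a hypothetical helper that is not part of this commit, and splitting on whitespace assumes the real files are tab-separated five-column BED:

```python
def parse_expected_cn(text):
    """Map region name -> (chrom, start, stop, expected copy number)."""
    regions = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, start, stop, name, copy_number = line.split()
        regions[name] = (chrom, int(start), int(stop), int(copy_number))
    return regions


# Contents copied from the two files added in this commit.
XX = """chrX 0 2781479 chrX_PAR_1 2
chrX 2781479 155701382 chrX_uniq_1 2
chrX 155701382 156040895 chrX_PAR_2 2
chrY 0 2781479 chrY_PAR_1 0
chrY 2781479 56887902 chrY_uniq_1 0
chrY 56887902 57227415 chrY_PAR_2 0"""

XY = """chrX 0 2781479 chrX_PAR_1 2
chrX 2781479 155701382 chrX_uniq_1 1
chrX 155701382 156040895 chrX_PAR_2 2
chrY 0 2781479 chrY_PAR_1 0
chrY 2781479 56887902 chrY_uniq_1 1
chrY 56887902 57227415 chrY_PAR_2 0"""

xx = parse_expected_cn(XX)
xy = parse_expected_cn(XY)

# Regions whose expected copy number depends on the karyotype.
differing = sorted(name for name in xx if xx[name][3] != xy[name][3])
print(differing)
```

The PAR intervals carry the same expected copy number in both files, so only the two `*_uniq_1` regions are reported.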
2 changes: 2 additions & 0 deletions assets/external/hs38.PAR.bed
@@ -0,0 +1,2 @@
chrX 0 2781479
chrX 155701383 156030895
56 changes: 56 additions & 0 deletions assets/external/pathogenic_repeats.hg38.bed
@@ -0,0 +1,56 @@
chr1 57367043 57367119 ID=DAB1;MOTIFS=AAAAT,GAAAT;STRUC=(AAAAT)n(GAAAT)n(AAAAT)n
chr1 146228800 146228821 ID=NOTCH2NLA;MOTIFS=GCC;STRUC=(GCC)n
chr1 149390802 149390841 ID=NOTCH2NLC;MOTIFS=GGC;STRUC=(GGC)n
chr10 79826383 79826404 ID=NUTM2B-AS1;MOTIFS=CGG;STRUC=(CGG)n
chr10 93702522 93702547 ID=FRA10AC1;MOTIFS=CCG;STRUC=(CCG)n
chr11 66744821 66744850 ID=C11ORF80;MOTIFS=GCG;STRUC=(GCG)n
chr11 119206289 119206322 ID=CBL;MOTIFS=CGG;STRUC=(CGG)n
chr12 6936716 6936773 ID=ATN1;MOTIFS=CAG;STRUC=(CAG)n
chr12 50505001 50505022 ID=DIP2B;MOTIFS=GGC;STRUC=(GGC)n
chr12 111598949 111599018 ID=ATXN2;MOTIFS=GCT;STRUC=(GCT)n
chr13 70139353 70139428 ID=ATXN8;MOTIFS=CTA,CTG;STRUC=(CTA)n(CTG)n
chr13 99985448 99985494 ID=ZIC2;MOTIFS=GCN;STRUC=(GCN)n
chr14 23321472 23321490 ID=PABPN1;MOTIFS=GCG;STRUC=(GCG)n
chr14 92071009 92071042 ID=ATXN3;MOTIFS=GCT;STRUC=(GCT)n
chr15 22786677 22786701 ID=NIPA1;MOTIFS=GCG;STRUC=(GCG)n
chr16 17470907 17470923 ID=XYLT1;MOTIFS=GCC;STRUC=(GCC)n
chr16 24613439 24613530 ID=TNRC6A;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n(TTTTA)n
chr16 66490398 66490467 ID=BEAN1;MOTIFS=TGGAA,TAAAA;STRUC=(TGGAA)n(TAAAA)n
chr16 87604287 87604329 ID=JPH3;MOTIFS=CTG;STRUC=(CTG)n
chr18 55586155 55586227 ID=TCF4;MOTIFS=CAG;STRUC=(CAG)n
chr19 13207858 13207897 ID=CACNA1A;MOTIFS=CTG;STRUC=(CTG)n
chr19 14496041 14496074 ID=GIPC1;MOTIFS=CCG;STRUC=(CCG)n
chr19 18786034 18786050 ID=COMP;MOTIFS=GTC;STRUC=(GTC)n
chr19 45770204 45770264 ID=DMPK;MOTIFS=CAG;STRUC=(CAG)n
chr2 96197066 96197122 ID=STARD7;MOTIFS=TGAAA,TAAAA;STRUC=(TGAAA)n(TAAAA)n
chr2 100104799 100104824 ID=AFF3;MOTIFS=GCC;STRUC=(GCC)n
chr2 176093058 176093104 ID=HOXD13;MOTIFS=GCN;STRUC=(GCN)n
chr2 190880872 190880920 ID=GLS;MOTIFS=GCA;STRUC=(GCA)n
chr20 2652733 2652775 ID=NOP56;MOTIFS=GGCCTG,CGCCTG;STRUC=(GGCCTG)n(CGCCTG)n
chr21 43776443 43776479 ID=CSTB;MOTIFS=CGCGGGGCGGGG;STRUC=(CGCGGGGCGGGG)n
chr22 45795354 45795424 ID=ATXN10;MOTIFS=ATTCT;STRUC=(ATTCT)n
chr3 63912684 63912726 ID=ATXN7;MOTIFS=GCA,GCC;STRUC=(GCA)n(GCC)n
chr3 129172576 129172732 ID=CNBP;MOTIFS=CAGG,CAGA,CA;STRUC=(CAGG)n(CAGA)n(CA)n
chr3 138946020 138946063 ID=FOXL2;MOTIFS=NGC;STRUC=(NGC)n
chr3 183712187 183712223 ID=YEATS2;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n
chr4 3074876 3074966 ID=HTT;MOTIFS=CAG,CCG;STRUC=(CAG)nCAACAG(CCG)n
chr4 39348424 39348479 ID=RFC1;MOTIFS=AAAAG,AAAGG,AAGGG,AAGAG,AGAGG,AACGG,GGGAC,AAAGGG;STRUC=<RFC1>
chr4 41745972 41746032 ID=PHOX2B;MOTIFS=GCN;STRUC=(GCN)n
chr4 159342526 159342617 ID=RAPGEF2;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n(TTTTA)n
chr5 10356346 10356412 ID=MARCHF6;MOTIFS=TTTTA,TTTCA;STRUC=(TTTTA)n(TTTCA)n
chr5 146878727 146878757 ID=PPP2R2B;MOTIFS=GCT;STRUC=(GCT)n
chr6 16327633 16327723 ID=ATXN1;MOTIFS=TGC;STRUC=(TGC)n
chr6 45422750 45422802 ID=RUNX2;MOTIFS=GCN;STRUC=(GCN)n
chr6 170561906 170562017 ID=TBP;MOTIFS=GCA;STRUC=(GCA)n
chr7 27199825 27199862 ID=HOXA13;MOTIFS=NGC;STRUC=(NGC)n
chr7 55887601 55887640 ID=ZNF713;MOTIFS=CGG;STRUC=(CGG)n
chr8 104588972 104589000 ID=LRP12;MOTIFS=CCG;STRUC=(CCG)n
chr8 118366815 118366919 ID=SAMD12;MOTIFS=TGAAA,TAAAA;STRUC=(TGAAA)n(TAAAA)n
chr9 27573528 27573546 ID=C9ORF72;MOTIFS=GGCCCC;STRUC=(GGCCCC)n
chr9 69037261 69037304 ID=FXN;MOTIFS=A,GAA;STRUC=(A)n(GAA)n
chrX 25013649 25013698 ID=ARX;MOTIFS=GCG;STRUC=(GCG)n
chrX 67545316 67545385 ID=AR;MOTIFS=GCA;STRUC=(GCA)n
chrX 140504316 140504362 ID=SOX3;MOTIFS=NGC;STRUC=(NGC)n
chrX 147912050 147912110 ID=FMR1;MOTIFS=CGG;STRUC=(CGG)n
chrX 148500631 148500691 ID=AFF2;MOTIFS=GCC;STRUC=(GCC)n
chrX 149631763 149631782 ID=TMEM185A;MOTIFS=GCC;STRUC=(GCC)n
5 changes: 2 additions & 3 deletions assets/samplesheet.csv
@@ -1,3 +1,2 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
sample,file,family_id,paternal_id,maternal_id,sex,phenotype
HG002_Revio,https://raw.githubusercontent.com/genomic-medicine-sweden/skierfe/dev/assets/test_data/HG002_PacBio_Revio.fastq.gz,FAM,XXX,YYY,1,1
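
The samplesheet header changes here from the nf-core template columns (`sample,fastq_1,fastq_2`) to the seven pedigree-style columns the pipeline expects. A minimal sketch of reading such a sheet with the standard library and checking that all seven columns are present; `read_samplesheet` is a hypothetical helper for illustration and is not how the pipeline itself validates input, and the file path in the example row is a placeholder:

```python
import csv
import io

REQUIRED_COLUMNS = [
    "sample", "file", "family_id", "paternal_id", "maternal_id", "sex", "phenotype",
]


def read_samplesheet(handle):
    """Return the samplesheet rows as dicts, failing early on missing columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"Samplesheet is missing columns: {', '.join(missing)}")
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if not row["sample"] or not row["file"]:
            raise ValueError(f"Line {line_number}: 'sample' and 'file' must not be empty")
        rows.append(row)
    return rows


example = io.StringIO(
    "sample,file,family_id,paternal_id,maternal_id,sex,phenotype\n"
    "HG002_Revio,/path/to/HG002_PacBio_Revio.fastq.gz,FAM,XXX,YYY,1,1\n"
)
rows = read_samplesheet(example)
print(rows[0]["sample"], rows[0]["family_id"])
```

Passing a sheet with the old template header would raise a `ValueError` naming the missing columns, which is the failure mode this header change avoids.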
1 change: 0 additions & 1 deletion assets/schema_snfs.json
@@ -19,7 +19,6 @@
"pattern": "^\\S+\\.snf$",
"errorMessage": "SNF file must be provided, cannot contain spaces and must have extension '.snf'"
}

},
"required": ["sample", "file"]
}
3 changes: 3 additions & 0 deletions assets/test_data.bed
@@ -0,0 +1,3 @@
chr16 172876 173710
chr16 176680 177522
chrX 140502985 140505069
Binary file not shown.
Binary file added assets/test_data/HG002_PacBio_Revio.fastq.gz
Binary file not shown.
1 change: 1 addition & 0 deletions assets/test_data/empty.bed
@@ -0,0 +1 @@

Binary file added assets/test_data/hg38.test.fa.gz
Binary file not shown.
44 changes: 25 additions & 19 deletions bin/split_bed_chunks.py
@@ -2,58 +2,64 @@

# Released under the MIT license.

# Split regions in BED into n files with approximately equal region sizes.
# A region is never split. 13 is a good number.
# Split regions in BED into n files with approximately equal region sizes.
# A region is never split. 13 is a good number.

import sys
import pandas as pd
import string


def contains_whitespace_other_than_tab(filepath):
    with open(filepath, 'r') as file:
    with open(filepath, "r") as file:
        for line_number, line in enumerate(file, start=1):
            for char_number, char in enumerate(line, start=1):
                if char.isspace() and char != '\t' and char != '\n':
                    print(f"Error: File contains whitespace characters other than tab at line {line_number}, position {char_number}.")
                if char.isspace() and char != "\t" and char != "\n":
                    print(
                        f"Error: File contains whitespace characters other than tab at line {line_number}, position {char_number}."
                    )
                    sys.exit(1)


file_path = sys.argv[1] # Replace with the path to your file

contains_whitespace_other_than_tab(file_path)
print("File does not contain whitespace characters other than tab and newline.")

chromosome_data = pd.read_csv(sys.argv[1], names = ['chr', 'start', 'stop'], usecols=range(3), sep = '\t')
chromosome_data = pd.read_csv(sys.argv[1], names=["chr", "start", "stop"], usecols=range(3), sep="\t")

chromosome_data['size'] = chromosome_data['stop'] - chromosome_data['start']
chromosome_data["size"] = chromosome_data["stop"] - chromosome_data["start"]

# Number of bins
n = int(sys.argv[2])

# Sort chromosome data by size in descending order
sorted_data = chromosome_data.sort_values(by='size', ascending=False)
sorted_data = chromosome_data.sort_values(by="size", ascending=False)

# Initialize empty bins as lists
bins = [[] for _ in range(n)]

# Allocate chromosomes to bins
for index, row in sorted_data.iterrows():
# Find the bin with the fewest chromosomes
min_bin = min(range(n), key=lambda i: sum(chrom['size'] for chrom in bins[i]))
min_bin = min(range(n), key=lambda i: sum(chrom["size"] for chrom in bins[i]))

# Place the chromosome data in the selected bin
bins[min_bin].append(row.to_dict())

# Create a DataFrame to store the results
result_df = pd.DataFrame({
    'bin': [i + 1 for i in range(n) for _ in bins[i]],
    'chr': [chromosome['chr'] for bin_chromosomes in bins for chromosome in bin_chromosomes],
    'start': [int(chromosome['start']) for bin_chromosomes in bins for chromosome in bin_chromosomes],
    'stop': [int(chromosome['stop']) for bin_chromosomes in bins for chromosome in bin_chromosomes],
    'size': [int(chromosome['size']) for bin_chromosomes in bins for chromosome in bin_chromosomes]
})
result_df = pd.DataFrame(
    {
        "bin": [i + 1 for i in range(n) for _ in bins[i]],
        "chr": [chromosome["chr"] for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "start": [int(chromosome["start"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "stop": [int(chromosome["stop"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
        "size": [int(chromosome["size"]) for bin_chromosomes in bins for chromosome in bin_chromosomes],
    }
)

# Order the result DataFrame by bin, then by size within each bin
result_df = result_df.sort_values(by=['bin', 'size'], ascending=[True, False])
result_df = result_df.sort_values(by=["bin", "size"], ascending=[True, False])

for id, group in result_df.groupby(['bin']):
    group[['chr', 'start', 'stop']].to_csv(f'{id}.bed', index=False, header=False, sep = '\t')
for id, group in result_df.groupby(["bin"]):
    group[["chr", "start", "stop"]].to_csv(f"{id}.bed", index=False, header=False, sep="\t")
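
The allocation loop in `split_bed_chunks.py` is a greedy "largest first" partition: regions are sorted by size in descending order and each one goes to the bin whose running total is currently smallest. A minimal, pandas-free sketch of the same idea, using a heap instead of re-summing every bin on each step; `split_regions` is a hypothetical name, and the input is the three regions from `assets/test_data.bed`:

```python
import heapq


def split_regions(regions, n):
    """Greedily assign (chrom, start, stop) regions to n bins of similar total size."""
    # Largest regions first, mirroring the script's descending sort by size.
    ordered = sorted(regions, key=lambda r: r[2] - r[1], reverse=True)
    # Heap of (running total, bin index); the smallest bin is always on top,
    # and ties go to the lowest bin index.
    heap = [(0, i) for i in range(n)]
    bins = [[] for _ in range(n)]
    for chrom, start, stop in ordered:
        total, i = heapq.heappop(heap)
        bins[i].append((chrom, start, stop))
        heapq.heappush(heap, (total + (stop - start), i))
    return bins


regions = [
    ("chr16", 172876, 173710),
    ("chr16", 176680, 177522),
    ("chrX", 140502985, 140505069),
]
bins = split_regions(regions, 2)
for number, contents in enumerate(bins, start=1):
    print(number, contents)
```

With two bins, the large chrX region ends up alone and the two chr16 regions share the other bin. As in the script, a region is never split, so the bins are only approximately equal in size.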
2 changes: 1 addition & 1 deletion conf/modules/align_reads.config
@@ -18,7 +18,7 @@ process {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

withName: FASTP {
withName: FASTP {
ext.args = "--disable_adapter_trimming --disable_quality_filtering --split_by_lines ${params.split_fastq * 4}"

// Not part of preprocess workflow now but makes sense to store it there
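
The `--split_by_lines` value in the `FASTP` block multiplies `params.split_fastq` by 4 because a FASTQ record spans four lines (header, sequence, separator, quality string). A minimal sketch of that arithmetic, assuming `params.split_fastq` is intended as the number of reads per chunk (the diff does not state this); `lines_per_chunk` is a hypothetical helper, not part of the pipeline:

```python
LINES_PER_FASTQ_RECORD = 4  # header, sequence, '+' separator, quality string


def lines_per_chunk(reads_per_chunk: int) -> int:
    """Convert a per-chunk read count into the matching FASTQ line count."""
    if reads_per_chunk < 0:
        raise ValueError("reads_per_chunk must not be negative")
    return reads_per_chunk * LINES_PER_FASTQ_RECORD


# Example value only; the pipeline's default is not shown in this diff.
print(lines_per_chunk(250_000))
```

Keeping the line count a multiple of four means a chunk boundary never falls inside a record.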
2 changes: 1 addition & 1 deletion conf/modules/qc.config
@@ -46,7 +46,7 @@ process {

withName: CRAMINO_PHASED {
ext.args = '--karyotype --phased'
publishDir = [
publishDir = [
path: { "${params.outdir}/qc/cramino/phased/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }