convert to platform-agnostic pipeline #99

Closed
kelly-sovacool opened this issue Feb 6, 2024 · 15 comments · Fixed by #102

@kelly-sovacool
Member

kelly-sovacool commented Feb 6, 2024

  • switch modules to containers (Containerize major rules #42)
  • remove any biowulf-specific code from driver script
  • remove hard-coded biowulf paths from config, or document them well
  • test on FRCE to verify that it works off biowulf

development in progress here: /data/CCBR_Pipeliner/Pipelines/CHARLIE/charlie-dev-sovacool

@kelly-sovacool kelly-sovacool self-assigned this Feb 6, 2024
@kelly-sovacool kelly-sovacool added this to the 2024-02 milestone Feb 6, 2024
@kelly-sovacool kelly-sovacool modified the milestones: 2024-02, 2024-03 Feb 28, 2024
@kelly-sovacool
Member Author

kelly-sovacool commented Mar 28, 2024

test run command to modify

/data/Ziegelbauer_lab/Pipelines/circRNA/v0.10.1/charlie \
  -w=/data/Ziegelbauer_lab/circRNADetection/circRNA_daq_v0.10.x/samples_15 \
  -m=init \
  -g=hg38 \
  -v=NC_009333.1,KT899744.1,NC_006273.2 \
  -s /data/Ziegelbauer_lab/circRNADetection/circRNA_daq_v0.10.x/samples_15.tsv

@kelly-sovacool
Member Author

kelly-sovacool commented Mar 28, 2024

Created a new samples.tsv file with just 4 samples from Vishal's samples_15.tsv.

/data/Ziegelbauer_lab/Pipelines/circRNA/v0.10.1/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_v0.10.1 \
    -m=init -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/samples.tsv

Currently running on biowulf with latest release so we can compare outputs to the containerized version.

/data/Ziegelbauer_lab/Pipelines/circRNA/v0.10.1/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_v0.10.1 \
    -m=run

@kelly-sovacool kelly-sovacool modified the milestones: 2024-03, 2023-04, 2024-04 Apr 9, 2024
@kelly-sovacool
Member Author

kelly-sovacool commented Apr 10, 2024

Testing containerized version:

/data/Ziegelbauer_lab/Pipelines/circRNA/charlie-dev-sovacool/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev \
    -m=init -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/samples.tsv
/data/Ziegelbauer_lab/Pipelines/circRNA/charlie-dev-sovacool/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev \
    -m=run -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/samples.tsv

@kelly-sovacool
Member Author

kelly-sovacool commented Apr 16, 2024

/usr/bin/bash: line 32: fastq-filter: command not found

Need to add fastq-filter to the cutadapt Docker image.

Edit: fixed and renamed the container charlie_cutadapt_fqfilter

@kelly-sovacool
Member Author

create_index failed due to missing output files

MissingOutputException in rule create_index in file /vf/users/Ziegelbauer_lab/Pipelines/circRNA/charlie-dev-sovacool/workflow/rules/create_index.smk, line 4:
Job 0 completed successfully, but some output files are missing. Missing files after 120 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/NCLscan_index/AllRef.ndx
Removing output files of failed job create_index since they might be corrupted:
/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.genes.genepred_w_geneid, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/STAR_no_GTF/SA, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.transcripts.fa, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.dummy.fa, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/separate_fastas/separate_fastas.lst
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

@kelly-sovacool
Member Author

kelly-sovacool commented Apr 29, 2024

test on FRCE

/home/sovacoolkl/CHARLIE/charlie \
    -w=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/charlie_iss-99 \
    -m=init -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/samples.tsv
/home/sovacoolkl/CHARLIE/charlie \
    -w=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/charlie_iss-99 \
    -m=run -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/samples.tsv

@kelly-sovacool
Member Author

kelly-sovacool commented Apr 30, 2024

error in rule DCC

Activating singularity image /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/.snakemake/singularity/b688737477c8cf86b329e4227da72916.simg
+ '[' -d /lscratch/25273199 ']'
+ TMPDIR=/lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78
+ '[' '!' -d /lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78 ']'
+ mkdir -p /lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78
++ dirname /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/CircRNACount
+ cd /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC
+ '[' PE == PE ']'
+ DCC @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/samplesheet.txt \
    --temp /lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78/DCC --threads 4 --detect --gene \
    --bam /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/STAR2p/G1_Normal_p2.bam \
    -ss \
    --annotation /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf \
    --chrM -G --rep_file /data/CCBR_Pipeliner/db/PipeDB/charlie/fastas_gtfs/hg38.repeats.gtf \
    --refseq /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fa \
    --PE-independent \
    -mt1 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate1.txt \
    -mt2 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate2.txt
[W::hts_idx_load3] The index file is older than the data file: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/STAR2p/G1_Normal_p2.bam.csi
Traceback (most recent call last):
  File "/usr/local/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.5.0', 'console_scripts', 'DCC')()
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/main.py", line 490, in main
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/main.py", line 679, in findCircSkipJunction
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/Circ_nonCirc_Exon_Match.py", line 222, in getAdjacent
ValueError: invalid literal for int() with base 10: '3"'
[Tue Apr 30 00:44:26 2024]
Error in rule dcc:
    jobid: 0
    input: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/samplesheet.txt, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate1.txt, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate2.txt, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/STAR2p/G1_Normal_p2.bam, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf
    output: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/CircRNACount, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/CircCoordinates, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/LinearCount, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/G1_Normal.dcc.counts_table.tsv, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/G1_Normal.dcc.counts_table.tsv.filtered
    shell:

This worked with the previous charlie version. (/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_v0.10.1)

Checking for differences in input files for this rule between the two runs:

The bam files

samtools stat summaries are identical

samtools stat charlie_v0.10.1/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam > G1_Tumor_p2.bam.stat.old
samtools stat charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam > G1_Tumor_p2.bam.stat.new
diff G1_Tumor_p2.bam.stat.*
3c3
< # The command line was:  stat charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam
---
> # The command line was:  stat charlie_v0.10.1/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam

The gtf files are identical

md5sum charlie_dev/ref/ref.fixed.gtf charlie_v0.10.1/ref/ref.fixed.gtf
54dcc6005272fcda13e6c46c76ec9b3d  charlie_dev/ref/ref.fixed.gtf
54dcc6005272fcda13e6c46c76ec9b3d  charlie_v0.10.1/ref/ref.fixed.gtf
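
The same content check can be scripted when many file pairs need comparing. A minimal Python sketch, assuming plain files on disk; the function names `md5_of_file` and `same_content` are made up for illustration and are not part of CHARLIE.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large GTF/FASTA files are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

For example, `same_content("charlie_dev/ref/ref.fixed.gtf", "charlie_v0.10.1/ref/ref.fixed.gtf")` mirrors the `md5sum` comparison above.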

The chimera files are all equal

library(tidyverse)

files <- tibble(dev = c('charlie_dev/results/G1_Tumor/STAR1p/G1_Tumor_p1.Chimeric.out.junction',
                        'charlie_dev/results/G1_Tumor/STAR1p/mate1/G1_Tumor_mate1.Chimeric.out.junction',
                        'charlie_dev/results/G1_Tumor/STAR1p/mate2/G1_Tumor_mate2.Chimeric.out.junction'),
                rel = c('charlie_v0.10.1/results/G1_Tumor/STAR1p/G1_Tumor_p1.Chimeric.out.junction',
                        'charlie_v0.10.1/results/G1_Tumor/STAR1p/mate1/G1_Tumor_mate1.Chimeric.out.junction',
                        'charlie_v0.10.1/results/G1_Tumor/STAR1p/mate2/G1_Tumor_mate2.Chimeric.out.junction'),)
files %>% pmap(\(dev, rel) all_equal(read_tsv(dev), read_tsv(rel)))
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

checking DCC & python version in conda env vs Docker

release version used conda env:

. "/data/CCBR_Pipeliner/db/PipeDB/Conda/etc/profile.d/conda.sh"
conda activate DCC

now using docker:

ENV DCC_VERSION=0.5.0
RUN wget https://github.com/dieterich-lab/DCC/archive/refs/tags/v${DCC_VERSION}.tar.gz -O dcc.tar.gz && \
tar -xzvf dcc.tar.gz && \
cd DCC-${DCC_VERSION} && \
python setup.py install

Both use v0.5.0. According to the release notes, DCC 0.5.0 requires python 3.5 and no longer supports python 2.7.

I tried having the docker container install DCC via conda, but the rule still failed with the same error.

still failing...

After rebuilding the docker to install DCC 0.5.0 from conda, it still fails with the same error as before:

Activating singularity image /data/CCBR_Pipeliner/SIFS/charlie_dcc_v0.1.0.sif
+ '[' -d /lscratch/25536525 ']'
+ TMPDIR=/lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d
+ '[' '!' -d /lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d ']'
+ mkdir -p /lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d
++ dirname /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/CircRNACount
+ cd /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC
+ '[' PE == PE ']'
+ DCC @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/samplesheet.txt --temp /lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d/DCC --threads 4 --detect --gene --bam /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam -ss --annotation /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf --chrM -G --rep_file /data/CCBR_Pipeliner/db/PipeDB/charlie/fastas_gtfs/hg38.repeats.gtf --refseq /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fa --PE-independent -mt1 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/mate1.txt -mt2 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/mate2.txt
[W::hts_idx_load3] The index file is older than the data file: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam.csi
Traceback (most recent call last):
  File "/opt2/conda/envs/dcc/bin/DCC", line 10, in <module>
    sys.exit(main())
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/main.py", line 490, in main
    CircSkipfiles = findCircSkipJunction(output_coordinates, options.tmp_dir,
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/main.py", line 679, in findCircSkipJunction
    circStartAdjacentExons, circStartAdjacentExonsIv = CCEM.findcircAdjacent(circStartExons, Custom_exon_id2Iv,
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
    interval = Custom_exon_id2Iv[self.getAdjacent(ids, start=start)]
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 222, in getAdjacent
    exon_number = int(custom_exon_id.split(':')[1]) - 1
ValueError: invalid literal for int() with base 10: '1"'

On further inspection, it looks like the DCC conda env on biowulf was built with python 2.7:
/data/CCBR_Pipeliner/db/PipeDB/Conda/envs/DCC/lib/python2.7/site-packages
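
Both tracebacks end in `int()` being handed a field that still carries a trailing double quote. A minimal sketch of that failure outside DCC, using the expression shown in the traceback. The exon ID string is hypothetical (the real IDs come from DCC's own parsing of the annotation), and `parse_exon_number` is an illustrative helper, not the upstream fix.

```python
def exon_number_strict(custom_exon_id):
    """Mirror of the expression in the traceback: no quote handling."""
    return int(custom_exon_id.split(":")[1]) - 1


def parse_exon_number(custom_exon_id):
    """Same parse, but strip stray double quotes first (illustrative only)."""
    field = custom_exon_id.split(":")[1].strip('"')
    return int(field) - 1


# Hypothetical ID with a trailing quote, matching the shape in the error message.
bad_id = 'EXAMPLE_TRANSCRIPT:3"'

try:
    exon_number_strict(bad_id)
except ValueError as err:
    print(f"strict parse failed: {err}")

print(parse_exon_number(bad_id))
```

The strict version raises the same `ValueError` as the log; the tolerant version is only meant to show where the stray character enters, not how DCC should handle it.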

@kelly-sovacool
Member Author

kelly-sovacool commented Apr 30, 2024

errors on FRCE:

sbatch: error: invalid partition specified: ccr
sbatch: error: Batch job submission failed: Invalid partition name specified
sbatch: error: Invalid generic resource (gres) specification
Error submitting jobscript (exit code 1):

Will need to edit cluster.json and submit_script.sbatch accordingly
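
One way to approach this would be to rewrite the cluster settings per platform before submission. A rough Python sketch, assuming `cluster.json` maps rule names to dicts with `partition` and `gres` keys; the actual layout in CHARLIE may differ. The function name and the example values are placeholders, and `norm` is simply the partition name mentioned for FRCE elsewhere in this thread.

```python
import copy


def adapt_cluster_config(cluster, partition, keep_gres=False):
    """Return a copy of the cluster settings adjusted for another platform.

    `cluster` is assumed to look like {"rule_name": {"partition": ..., "gres": ...}}.
    """
    adapted = copy.deepcopy(cluster)
    for settings in adapted.values():
        if "partition" in settings:
            settings["partition"] = partition
        if not keep_gres:
            # Drop generic-resource requests the target scheduler may reject.
            settings.pop("gres", None)
    return adapted


# Hypothetical input; keys and values are placeholders, not CHARLIE's real file.
biowulf = {"__default__": {"partition": "ccr", "gres": "lscratch:256", "threads": 2}}
frce = adapt_cluster_config(biowulf, partition="norm")
print(frce)
```

Returning a modified copy leaves the biowulf settings untouched, so one template could serve both clusters.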

@kelly-sovacool
Member Author

kelly-sovacool commented May 3, 2024

Looks like the DCC devs are aware of the issue and fixed it in the master branch -- https://www.github.com/dieterich-lab/DCC/issues/103

Edited the docker container to use the dev version. It worked!

@kelly-sovacool
Member Author

kelly-sovacool commented May 14, 2024

First run-through on biowulf completed successfully after several bug fixes. Re-run from start to finish also completed successfully on biowulf. Test in progress on FRCE.

@kelly-sovacool
Member Author

kelly-sovacool commented May 14, 2024

more problems on FRCE:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

need to reduce threads for FRCE, but I can't find how many are available per node on the norm partition
https://ncifrederick.cancer.gov/staff/frce/documentation/slurm-partitions-features

just switched jobs that requested 56 threads to 32 for FRCE and jobs are running now

edit: found the FRCE hardware config here: https://ncifrederick.cancer.gov/staff/frce/documentation/frce-hardware-capabilities
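
Capping thread requests at what a node can offer could be done in one place rather than rule by rule. A small Python sketch, assuming a dict of per-rule thread counts; the rule names are placeholders, and 32 is just the value used above, not a documented FRCE limit.

```python
def cap_threads(threads_by_rule, max_threads):
    """Return per-rule thread counts limited to `max_threads`."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    return {rule: min(n, max_threads) for rule, n in threads_by_rule.items()}


# Placeholder rule names; only the 56 -> 32 change comes from this thread.
requested = {"rule_a": 56, "rule_b": 4}
print(cap_threads(requested, max_threads=32))
```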

@kelly-sovacool
Member Author

Currently running on FRCE with improved handling of config & cluster templates

@kelly-sovacool
Member Author

kelly-sovacool commented May 16, 2024

error on FRCE:

SystemExit in file /home/sovacoolkl/CHARLIE/workflow/rules/init.smk, line 20:
File: /mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa does not exists!
  File "/home/sovacoolkl/CHARLIE/workflow/Snakefile", line 19, in <module>
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 190, in <module>
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 29, in check_readaccess
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 20, in check_existence
SystemExit in file /home/sovacoolkl/CHARLIE/workflow/rules/init.smk, line 20:
File: /mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa does not exists!
  File "/home/sovacoolkl/CHARLIE/workflow/Snakefile", line 19, in <module>
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 190, in <module>
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 29, in check_readaccess
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 20, in check_existence

even though the file does exist 🤔

file /mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa

/mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa: ASCII text, with very long lines

Is /mnt not available on compute nodes on FRCE?

Edit: this seems to be an FRCE regression -- tried to submit a RENEE job and that failed for the same reason

/var/spool/slurmd/job37856165/slurm_script: line 4: /mnt/projects/CCBR-Pipelines/pipelines/RENEE/renee-dev-sovacool/bin/renee: No such file or directory

Submitted a help ticket
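
To tell a missing mount on a compute node apart from a genuinely missing file, a small diagnostic run both on the login node and inside a batch job can help. A sketch in Python; `describe_path` is an illustrative name, and it only reports what the current host can see.

```python
import os
import socket


def describe_path(path):
    """Report whether `path` exists here and the deepest ancestor that does."""
    report = {
        "host": socket.gethostname(),
        "path": path,
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
    }
    # Walk upward until a directory that exists on this host is found.
    current = os.path.abspath(path)
    while not os.path.exists(current) and current != os.path.dirname(current):
        current = os.path.dirname(current)
    report["deepest_existing_ancestor"] = current
    return report


if __name__ == "__main__":
    print(describe_path("/mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa"))
```

If the deepest existing ancestor is `/` inside the batch job but the full path on the login node, the mount itself is missing on the compute node rather than the file.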

@kelly-sovacool kelly-sovacool modified the milestones: 2024-04, 2024-05 May 17, 2024
@kelly-sovacool
Member Author

upgraded snakemake in the shared conda env on FRCE to v7

conda activate /mnt/projects/CCBR-Pipelines/conda/envs/snakemake
mamba install -c bioconda snakemake=7.32.4

@kelly-sovacool
Member Author

on FRCE, star_circrnafinder hangs indefinitely and gets cancelled by slurm, but actually completes successfully in < 3 hours when run interactively.
