Tune Juicer for Cheaha #2

Open

jprorama opened this issue Sep 23, 2023 · 4 comments
Is your feature request related to a problem? Please describe.
Juicer can't use the SLURM scheduler on Cheaha

Describe the solution you'd like
Run juicer.sh in a screen/tmux/byobu session on the login node and have all work submitted as jobs to the cluster.
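A sketch of what that could look like, assuming the SLURM flavor of juicer.sh and the flags documented in the Juicer wiki (genome ID, top directory, queue, restriction site); the paths and values here are illustrative, not a tested Cheaha invocation:

```bash
# On the Cheaha login node: keep the long-running driver alive in a
# detachable session (screen/byobu work the same way).
tmux new -s juicer

# Inside the session, run the SLURM flavor of juicer.sh; it stays resident
# and submits each pipeline stage to the cluster as sbatch jobs.
# Flags follow the Juicer wiki (-g genome, -d top directory, -q partition,
# -s restriction site); values are illustrative.
bash scripts/juicer.sh -g hg38 -d /scratch/$USER/juicer_run -q short -s MboI

# Detach with Ctrl-b d; reattach later with:
tmux attach -t juicer
```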

Describe alternatives you've considered
Running on a single node, but that takes too long.

Additional context
We need to be able to demonstrate successful operation of juicer.sh on Cheaha. This requires customizing the Juicer environment to use the partitions and modules available on Cheaha, as described here (a sketch of the kind of edits follows the link):

https://github.com/aidenlab/juicer/wiki/Running-Juicer-on-a-cluster
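For illustration, the kind of site-specific edits this implies near the top of the SLURM juicer.sh, assuming the upstream script's variable names (queue, long_queue, load_bwa, load_java — verify against the actual source) and Cheaha's partition limits:

```bash
# Point the queue and module variables at what Cheaha provides.
# Variable names are assumed from the upstream SLURM juicer.sh;
# confirm against the actual script before editing.
queue="short"               # Cheaha partition, 12-hour limit
queue_time="12:00:00"
long_queue="long"           # Cheaha partition, 150-hour limit
long_queue_time="150:00:00"

# Cheaha uses Lmod; module names/versions are illustrative.
load_bwa="module load BWA"
load_java="module load Java"
```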

This demonstration needs to include an example data set that can be run quickly but accurately reflects a full-scale run.

The sample data listed at the wiki link above is no longer available.

@jprorama (Author)

Proposed changes are available in pull request #1

@jprorama (Author) commented Sep 23, 2023

The Juicer forum is a potential resource for customizing the SLURM support.

https://groups.google.com/g/3d-genomics/search?q=slurm

@jprorama (Author) commented Sep 23, 2023

I've opened an issue upstream requesting correction or clarification on running Juicer with test data:

aidenlab#331

@jprorama (Author)

Just for guidance: the SLURM version of the chimeric-read de-duplication awk script splits the SAM data roughly every 1 million reads, at a known non-duplicate boundary. It checks whether any of the last 6 fields of the current ("cb") record differ from those of the prior record; if they do, the two reads cannot be duplicates of each other, so it is safe to split there. It writes those reads to a file and submits a job to process them, then repeats until all reads have been submitted for de-duplication, so the maximum time for dedup is roughly the time it takes to process 1 million records. A sketch of the boundary logic follows below.
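A minimal sketch of that split-at-safe-boundary idea, assuming whitespace-separated records with the duplicate-defining key in the last 6 fields; the real script's field positions, counters, file naming, and sbatch submission differ in detail:

```awk
#!/usr/bin/awk -f
# Sketch only: split a sorted read stream into ~1M-read chunks, but only
# at boundaries where adjacent records cannot be duplicates of each other.
BEGIN { chunk = 1000000 }  # target split size: ~1 million reads
{
    # Build a key from the last 6 fields; if it differs from the previous
    # record's key, the two reads cannot be duplicates, so the stream may
    # be split here without separating a duplicate pair across chunks.
    key = $(NF-5) FS $(NF-4) FS $(NF-3) FS $(NF-2) FS $(NF-1) FS $NF
    if (count >= chunk && key != prev) {
        close(out)
        nsplit++   # in the real pipeline, a dedup job is submitted per chunk
        count = 0
    }
    out = sprintf("msplit%d.txt", nsplit)
    print > out
    prev = key
    count++
}
```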

The code should work fine, but we will need to improve the following line in that script: it has a hard-coded email address and host name, which should be driven by parameters.

printf("#!/bin/bash -l\n#SBATCH -o %s/dup-mail.out\n#SBATCH -e %s/dup-mail.err\n#SBATCH -p %s\n#SBATCH -J %s_msplit0\n#SBATCH -d singleton\n#SBATCH -t 1440\n#SBATCH -c 1\n#SBATCH --ntasks=1\ndate;\necho %s %s %s %s | mail -r [email protected] -s \"Juicer pipeline finished successfully @ Voltron\" -t %[email protected];\ndate\n", debugdir, debugdir, queue, groupname, topDir, site, genomeID, genomePath, user) > sscriptname;
