Filtering & Annotation workflow #36

Open
cjfields opened this issue Sep 2, 2021 · 15 comments
@cjfields
Contributor

cjfields commented Sep 2, 2021

The first step in the workflow (assembly) is performed per sample and is in assembly.nf. @NeginValizadegan will work on the annotation steps for each sample assembly, with the basic steps:

  1. Filter reads to a given minimum length (default = 500 bp, the same as used for HUPAN per Jess Bourne).
  2. BLAST runs against references, using (1) GRCh38 + alts + decoys (same as the alignment), (2) Sherman-Salzberg data, and (3) others (CHM13?).
  3. RepeatMasker (see Kim's notes in the repo).
  4. QUAST.
  5. Contamination detection. We have used Kraken2 for this (code is on Gloria's branch), but we may want to check what HUPAN is doing here, which I believe is BLASTN.

Any others?
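
As a minimal sketch of step 1 (the length filter), assuming a plain FASTA input: the awk function below is a stand-in for illustration, not the project's code, and `filter_min_len` is a hypothetical name. It keeps records whose sequence is at least the given length and writes each kept sequence on a single line.

```shell
# Usage: filter_min_len <min_len> <in.fasta>
# Prints FASTA records whose total sequence length is >= min_len.
filter_min_len() {
    awk -v min="$1" '
        # Print the buffered record if it is long enough.
        function emit() { if (hdr != "" && length(seq) >= min) print hdr "\n" seq }
        /^>/ { emit(); hdr = $0; seq = ""; next }   # new header: flush previous record
        { seq = seq $0 }                            # join wrapped sequence lines
        END { emit() }                              # flush the last record
    ' "$2"
}
```

With seqkit, `seqkit seq -m 500 assembly.fasta` applies the same kind of minimum-length filter.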

@cjfields
Contributor Author

cjfields commented Sep 2, 2021

@NeginValizadegan maybe start with simple bash scripts for testing the steps, then add them to a nextflow script.

@NeginValizadegan
Collaborator

Linking commits 7374f49 and 696cbcf here.

@NeginValizadegan
Collaborator

There are some memory-related issues with the blastn step. The job was killed at 10 GB of memory and hit a bus error at 40, 100, and even 150 GB. Still troubleshooting.

@cjfields
Contributor Author

cjfields commented Sep 6, 2021

@NeginValizadegan re: the BLASTN work (and the annotation steps in general), I'm guessing you are trying to run all the annotation steps in one bash script? I'd recommend keeping it simple and testing out each step in an independent bash script; these can be independently moved into nextflow process blocks when they are working.

So for example you have the seqkit step in the annotation.sh bash script. You can try running BLASTN in a separate blastn.sh bash script, RepeatMasker in rm.sh, etc. The inputs (FASTA files) will largely be the same for all of these.
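
A standalone blastn.sh along these lines might look like the sketch below. The function name, argument order, and choice of tabular output are assumptions for illustration; the blastn options used (-query, -db, -outfmt, -out) are standard BLAST+ flags.

```shell
#!/usr/bin/env bash
# blastn.sh (hypothetical layout): run only the BLASTN step for one assembly.
set -eu

# Usage: run_blastn <query.fasta> <blast_db> <out.tsv>
run_blastn() {
    # -outfmt 6 selects tab-separated output
    blastn -query "$1" -db "$2" -outfmt 6 -out "$3"
}

# Run only when three arguments are given, so the file can also be sourced.
if [ "$#" -eq 3 ]; then
    run_blastn "$1" "$2" "$3"
fi
```

The same shape (one function, one guarded call) would carry over to rm.sh and the other steps, and each function body maps fairly directly onto a nextflow process script section later.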

@NeginValizadegan
Collaborator

NeginValizadegan commented Sep 9, 2021

@chrisfields Yes, but I set it up so that I can deactivate specific steps, so I am not running everything at once even though it all lives in one script. At the end of the script there is a main section where I can easily comment out the steps I don't want to run.
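
For readers following along, the single-script layout described here could be sketched roughly as below. The step functions are placeholders with hypothetical names that only echo their own name; the real ones would call seqkit, blastn, and so on.

```shell
#!/usr/bin/env bash
set -eu

# Placeholder steps (hypothetical names); each would wrap one tool in practice.
filter_reads()     { echo "filter_reads"; }
run_blastn()       { echo "run_blastn"; }
run_repeatmasker() { echo "run_repeatmasker"; }

# Main section: comment out a line to skip that step for this run.
main() {
    filter_reads
    # run_blastn
    run_repeatmasker
}

main
```

Here only the filter and RepeatMasker placeholders run, because the BLASTN call is commented out.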

@NeginValizadegan
Collaborator

Linking commit 627872f here. Sorry, I forgot to add #36.

@NeginValizadegan
Collaborator

Linking a3476ca here

@NeginValizadegan
Collaborator

Linking 321f764

NeginValizadegan pushed a commit that referenced this issue Oct 21, 2021
NeginValizadegan pushed a commit that referenced this issue Nov 4, 2021
1. The order of processes has changed to this:
        0. create blast databases
        1. filter out reads below a minimum length
        2. kraken
        3. blastn nt
        4. run blast GRCh38, GRCh38.p0, CHM13
        5. repeatmasker
        6. quast

2. The length-filtered fasta file will now be filtered based on a list of read ids from kraken whose classification is not any of the following:
        - Homo sapiens
        - Eukaryota
        - cellular organisms
        - unclassified
        - root

3. The file from the previous step will be used as input to blastn NT for further contamination detection.

LIST OF TOOLS USED IN THIS PIPELINE:
  1. blastn        -->  BLAST+/2.10.1
  2. seqkit        -->  seqkit/0.12.1
  3. kraken2       -->  Kraken2/2.0.8
  4. repeatmasker  -->  RepeatMasker/4.1.2
  5. quast         -->  quast/5.0.0
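
The read-id selection in point 2 of the commit message above could be sketched as follows. This is an illustration rather than the committed code: it assumes Kraken2 was run with --use-names, so that column 3 of the per-read output holds a name followed by "(taxid N)", and `kraken_candidate_ids` is a hypothetical function name.

```shell
# Usage: kraken_candidate_ids <kraken2_output.tsv>
# Prints the ids of reads whose Kraken2 label is not in the exclusion list.
kraken_candidate_ids() {
    awk -F '\t' '
        BEGIN {
            # Labels excluded from the candidate list (from the commit message)
            skip["Homo sapiens"]; skip["Eukaryota"]; skip["cellular organisms"]
            skip["unclassified"]; skip["root"]
        }
        {
            name = $3
            sub(/ \(taxid [0-9]+\)$/, "", name)   # drop the trailing "(taxid N)"
            if (!(name in skip)) print $2         # column 2 is the read id
        }
    ' "$1"
}
```

The resulting id list could then be handed to something like `seqkit grep -f ids.txt filtered.fasta` to extract the matching records before the blastn NT step.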
NeginValizadegan pushed a commit that referenced this issue Nov 6, 2021
…amination. The filtered file will be input to blast against human reference genome
@NeginValizadegan
Collaborator

Linking 1983b41 here

NeginValizadegan changed the title from "Annotation workflow" to "Filtering & Annotation workflow" Dec 3, 2021
NeginValizadegan pushed a commit that referenced this issue Dec 3, 2021
…nge the parameter inside the config file from false to true to skip cd-hit
@cjfields
Contributor Author

cjfields commented Dec 3, 2021

For example, you can do this to see the last commit: fc187b9

@NeginValizadegan
Collaborator

Linking 0f29654 here.

@NeginValizadegan
Collaborator

Linking 3c7310a here.

@NeginValizadegan
Collaborator

Linking cc16dc6 here.

@NeginValizadegan
Collaborator

Linking db039c2 and 4eecbe5 here.

@NeginValizadegan
Collaborator

Linking 4a9b3f2 here (pipeline testing).
