Index FilteredNT on Pegasus for BLAST #18

Open
HadleyKing opened this issue May 13, 2024 · 24 comments
Labels: enhancement, help wanted

Comments

@HadleyKing
Member

Apr 19, 2024, 5:09 PM

We would like to try an experiment on Pegasus with NCBI's NT db, which was about 1.2 TB the last time I checked. We have a filtered version of it, about 1.1 TB, that I would need to transfer from our HIVE resources to Pegasus. I would like to use NCBI BLAST to create an index for the NT db.
We think this process may take a week or so. I wanted to let someone know before I started it so that the job will not get killed.

Once we have the index we will move it out of Pegasus and delete the file. We need the indexed file to run some BLAST searches and compare the results with those from a smaller version of NT (80 GB) that we have prepared. This is for a paper we are working on.

@HadleyKing HadleyKing self-assigned this May 13, 2024
@HadleyKing
Member Author

On Mon, 22 Apr 2024 10:43:44 -0400, [email protected] wrote:

Forwarding Raja's email to our RT ticket system for the record and future follow-up.


Adam K. L. Wong, PhD.
High Performance Computing Specialist for Genomics
Research Technology Services, GWIT
The George Washington University
Email: [email protected]

---------- Forwarded message ---------
From: Raja Mazumder <[email protected]>
Date: Fri, Apr 19, 2024 at 5:09 PM
Subject: Re: NCBI NT
To: Charles Hadley King <[email protected]>
Cc: Kai Leung Wong <[email protected]>, Jonathon Keeney
<[email protected]>

Hi Adam,
Hope you are doing well. Once we have the index we will move it out of Pegasus and delete the file. We need the indexed file to run some BLAST searches and compare the results with those from a smaller version of NT (80 GB) that we have prepared. This is for a paper we are working on.
Many thanks,
Raja

On Fri, Apr 19, 2024, 5:03 PM Charles Hadley King <[email protected]> wrote:

Adam,
I would like to try an experiment on Pegasus with NCBI's NT db, which was about 1.2 TB the last time I checked. We have a filtered version of it, about 1.1 TB, that I would need to transfer from our HIVE resources to Pegasus. I would like to use NCBI BLAST to create an index for the NT db.
We think this process may take a week or so. I wanted to let someone know before I started it so that the job will not get killed.

Is there anything else I should do to protect this job once it is started?

Thank you,

Charles Hadley King, M.S.
Research Scientist, HIVE Lab
BioCompute Technical Lead
The George Washington University
Ross Hall
2300 Eye Street N.W.
Washington, DC 20037
Mobile: 610-613-3063
[email protected]
https://orcid.org/0000-0003-1409-4549

Hi Raja and Charles,

Thank you for contacting us about using Pegasus to run your NCBI BLAST computation!
I have created this ticket for the record and future follow-up.

@Charles, if you are going to work with huge data, please make sure to select the "defq" partition, which allows your jobs up to 14 days of runtime.

Best,
Adam

@HadleyKing HadleyKing added the enhancement and help wanted labels May 13, 2024
@penningtonea penningtonea pinned this issue May 14, 2024
@penningtonea
Collaborator

penningtonea commented May 31, 2024

New task for @penningtonea

  • Create the DB using BLAST on Pegasus
  • Move the 6 resulting files to a location Tigran can access

@penningtonea
Collaborator

From Joe:

Will use slimNT in place of filtered-NT for the time being for comparison of FASTA to BLAST. slimNT is smaller but is already on GW HPC.

@penningtonea
Collaborator

penningtonea commented May 31, 2024

Next step for NT project:

Index NT with BLAST: find the documentation for BLAST indexing online and submit the indexing as a Slurm job.

  1. Figure out whether BLAST is installed on Pegasus.

  2. Find the documentation for Slurm jobs on the GW website. Review the GitHub issue and email for all information needed for the job.

  3. Build a test DB of 10 or so FASTA files and move it to Pegasus. Run a test index with the 10-FASTA DB and see how long it takes to run and fully index.

  4. Scale up to 10, 100, 1000 and see how long it takes. Note the size of the DB (e.g. filtered NT is a set of FASTA files of about 1 TB).

  5. Move NT over to HIVE, perform several comparisons, and present the results to Merck.
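
For the scale-up test in steps 3 and 4, one possible way to cut test sets of increasing size out of a large FASTA file is sketched below. The helper name `fasta_head` and the file names in the usage comment are hypothetical, and the sketch assumes every record starts with a ">" header line.

```shell
# Write the first N records of a FASTA file to stdout.
# $1: number of records to keep, $2: path to the FASTA file.
# Assumption: every record begins with a ">" header line.
fasta_head() {
    awk -v n="$1" '/^>/ { if (++count > n) exit } count > 0 { print }' "$2"
}

# Usage (not run here; file names are placeholders):
#   for n in 10 100 1000; do
#       fasta_head "$n" filteredNT.fasta > "subset_${n}.fasta"
#       time makeblastdb -in "subset_${n}.fasta" -dbtype nucl -parse_seqids -out "subset_${n}"
#   done
```

Timing each subset separately gives a rough idea of how indexing time grows with input size before committing to the full run.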

Blockers:

  • Access to Pegasus
  • Access to SMHS_BIOC
  • Access to hivelab

@rajamazumder

BLAST command to make database: makeblastdb -in mydb.fsa -dbtype nucl -parse_seqids

@rajamazumder

>BT006946.1 Homo sapiens cytochrome c, somatic mRNA, complete cds
ATGGGTGATGTTGAGAAAGGCAAGAAGATTTTTATTATGAAGTGTTCCCAGTGCCACACCGTTGAAAAGG
GAGGCAAGCACAAGACTGGGCCAAATCTCCATGGTCTCTTTGGGCGGAAGACAGGTCAGGCCCCTGGATA
CTCTTACACAGCCGCCAATAAGAACAAAGGCATCATCTGGGGAGAGGATACACTGATGGAGTATTTGGAG
AATCCCAAGAAGTACATCCCTGGAACAAAAATGATCTTTGTCGGCATTAAGAAGAAGGAAGAAAGGGCAG
ACTTAATAGCTTATCTCAAAAAAGCTACTAATGAGTAG
>M20622.1 Rat somatic cytochrome c mRNA
ATGGGTGATGTTGAAAAAGGCAAGAAGATTTTTGTTCAAAAGTGTGCCCAGTGCCACACTGTGGAAAAAG
GAGGCAAGCATAAGACTGGACCAAACCTCCATGGTCTGTTTGGGCGGAAGACAGGCCAGGCTGCTGGATT
CTCTTACACAGATGCCAACAAGAACAAAGGTATCACCTGGGGAGAGGATACCCTGATGGAGTATTTGGAA
AATCCCAAAAAGTACATCCCTGGAACAAAAATGATCTTCGCTGGAATTAAGAAGAAGGGAGAAAGGGCAG
ACCTAATAGCTTATCTTAAAAAGGCTACTAATGAATAA
>AK088098.1 Mus musculus 2 days neonate thymus thymic cells cDNA, RIKEN full-length enriched library, clone:E430004C08 product:cytochrome c, somatic, full insert sequence
GAGAGCGCGGGACGTCTGTCTTCGAGTCCGAACGTTCGTGGTGTTGACCAGCCCGGAACGAATTAAAAAT
GGGTGATGTTGAAAAAGGCAAGAAGATTTTTGTTCAGAAGTGTGCCCAGTGCCACACTGTGGAAAAGGGA
GGCAAGCATAAGACTGGACCAAATCTCCACGGTCTGTTCGGGCGGAAGACAGGCCAGGCTGCTGGATTCT
CTTACACAGATGCCAACAAGAACAAAGGCATCACCTGGGGAGAGGATACCCTGATGGAGTATTTGGAGAA
TCCCAAAAAGTACATCCCTGGAACAAAAATGATCTTCGCTGGAATTAAGAAGGAGGGAGAAAGGGCAGAC
CTAATAGCTTATCTTAAAAAGGCTACTAATGAGTAATTCCACTGCCTTATTTATTACAAAACAAATGTCT
CATGGCTTTTAATGTACACCATAATTTAATTCACACACCAAATTCAGATCATGAATGGCTAGCAATGTTT
TTGTTGGACAGTCCTGATTTAAGTAAAACTGACTTGTCATAAAGTGGGTACGGTCTTTATTAAAGCAACA
GTTCCAGTTGTATACATGCTACCACGGCTCTCCCTTTCTCAAGATAAGATTGGACTTAATTAGCAATGTT
TTACTTTCCATAAATAGGGGCATGTCACCTCAAACCTACTAAATGGTTTTATACTTAGATTTATATAACT
GGGCATATGAATATGCTTAAACACTGGGAAAATTCTATCACTGTCTCAGAAACAAGAAGACTCAAATGTG
TTTCAGTTGTGTTCACTGGCCTCTTTCAGGTCATGGCTAACCACCAGGAGGCAACTGTCTATTCTTGACA
GTGCATTTTTAATTAGAATGTCTACATCAAGGATGTTGCCTTTACTATTGAAAGGCATTTACTTTTTTTT
TTGTATGATATCAAATAAAGAGTATTTAACACTTTTT

@penningtonea
Collaborator

penningtonea commented Jun 5, 2024

Next task:

  • Run a BLAST test using the 3-sequence DB and 1 test query file in a BLAST+ VM on a personal laptop (EDIT: not necessary now but probably good practice for later)
  • Review the documentation for Slurm jobs and try to run the test there

@rajamazumder

rajamazumder commented Jun 5, 2024

It looks like the makeblastdb command works on Keeney's computer, so @penningtonea, please run the command via Slurm. Ask Hadley or Keeney for help if you are stuck. For the BLAST search, use blastn -db databaseName -query queryFileName. Also check whether there is an -output option you can use.

Edit: spelling
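
A minimal sketch of how the makeblastdb command could be wrapped in a Slurm batch script follows. The `defq` partition and its 14-day limit come from Adam's note earlier in this thread; the job name, log file pattern, file names, and the helper `write_makeblastdb_job` are hypothetical, and the script assumes makeblastdb is already on PATH on the compute node. For the search step, the blastn option that writes results to a file is `-out`.

```shell
# Print a Slurm batch script that builds a nucleotide BLAST database.
# $1: input FASTA file, $2: output database prefix (both placeholders).
write_makeblastdb_job() {
    fasta="$1"
    db="$2"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=makeblastdb
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --time=14-00:00:00
#SBATCH --output=%x_%j.out

# Assumes makeblastdb is already on PATH on the compute node.
makeblastdb -in "$fasta" -dbtype nucl -parse_seqids -out "$db"
EOF
}

# Usage (not run here):
#   write_makeblastdb_job filteredNT.fasta filteredNT > makeblastdb.sh
#   sbatch makeblastdb.sh
```

Generating the script from a function keeps the input and output names in one place, so the same template can be reused for the small test DB and the full run.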

@kee007ney
Collaborator

Note on SCP transfers:
You have to pull from hive-login; you cannot push from Pegasus. Make sure your public key is added in the correct place and pull the file, e.g.:
scp pegasus.arc.gwu.edu:/SMHS/home/keeneyjg/sample.txt .

@penningtonea
Collaborator

penningtonea commented Jun 5, 2024

Progress update:

  1. Connected to Pegasus.
  2. Wrote a bash script to make the database and run the BLAST search as written in the comments above.
  3. Executed Slurm job ID 3121152.

The script did not work. It produced the same error message as the earlier call:
"Error: Too many positional arguments (1), the offending value: –in
Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: –in".
See screenshots below for additional information.

scontrol show job output:
Image

output file with error message:

Image
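
The offending value in the error message, `–in`, starts with an en dash (U+2013) rather than an ASCII hyphen, which is the likely reason makeblastdb reads it as a positional argument instead of an option. Dashes are often converted this way when a command is copied from a formatted page. A small sketch for finding and repairing them in a script is below; the helper names and the script name in the usage comment are hypothetical.

```shell
# List the lines of file $1 that contain an en dash, with line numbers.
find_en_dashes() {
    grep -n '–' "$1"
}

# Read text on stdin and replace every en dash with an ASCII hyphen.
normalize_dashes() {
    sed 's/–/-/g'
}

# Usage (not run here; run_blast.sh is a placeholder name):
#   find_en_dashes run_blast.sh
#   normalize_dashes < run_blast.sh > run_blast_fixed.sh
```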

QUESTION: How can I change my account from Account=watkinslab and my GroupID from MG-watkinslab(1111) to our group? I no longer need access to the watkinslab account.

@penningtonea
Collaborator

penningtonea commented Jun 6, 2024

Downloading preformatted blast will negate the need to index NT on pegasus. Log in to HIVE API and execute the following commands received from NLM to download preformatted NT:

  1. Install standalone BLAST+.
  2. Use the update_blastdb.pl command to download and extract all files needed for nt:

perl update_blastdb.pl --decompress nt
Use perl update_blastdb.pl --help to see available switches.

  3. Run BLAST with 1 query sequence to test.

@rajamazumder
Copy link

@penningtonea you still should go ahead and index filtered-NT on pegasus in parallel. It will be good training for you and also I am not 100% sure the formatted-NT will work.

@penningtonea
Collaborator

Instructions to download indexed NT from NCBI:

Hi,

Thanks for writing to us.

Make sure you have enough disk space. After installing standalone BLAST+, you can use update_blastdb.pl to download and extract all files needed for nt:

perl update_blastdb.pl --decompress nt

Run perl update_blastdb.pl --help to see the available switches.

Regards,
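
Since the email asks for a disk space check before downloading, a minimal sketch of such a check is given here. The helper names `free_kb` and `has_space` are hypothetical, and the threshold in the usage comment is an arbitrary placeholder, not a figure from NCBI.

```shell
# Print the free space, in KiB, on the filesystem holding directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if directory $1 has at least $2 KiB free.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Usage (not run here; 2 TiB is a placeholder threshold):
#   has_space "$PWD" $((2 * 1024 * 1024 * 1024)) && perl update_blastdb.pl --decompress nt
```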

@penningtonea
Collaborator

Documenting email response:
"Do not touch those files through further manual manipulation - they are ready to use as they are. Those files are tied together by an alias file (nt.nal) and you simply call the database with -db nt if you have BLASTDB variable set to point to the directory containing all those files.

If you do not, you will need to prefix the nt with direct path."
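
The two ways of addressing the database described in the email can be sketched as follows. The directory `/path/to/blastdb`, the helper names, and the query and output file arguments are placeholders; only the `BLASTDB` variable and the blastn options `-db`, `-query`, and `-out` are taken as given.

```shell
# Placeholder directory holding nt.nal and the nt volume files.
BLASTDB="${BLASTDB:-/path/to/blastdb}"
export BLASTDB

# With BLASTDB set, the database is addressed by name alone.
# $1: query FASTA file, $2: output file.
run_nt_search() {
    blastn -db nt -query "$1" -out "$2"
}

# Without BLASTDB, prefix the database name with its directory.
# $1: database directory, $2: query FASTA file, $3: output file.
run_nt_search_by_path() {
    blastn -db "$1/nt" -query "$2" -out "$3"
}

# Usage (not run here):
#   run_nt_search query.fasta results.txt
#   run_nt_search_by_path /path/to/blastdb query.fasta results.txt
```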

@penningtonea
Collaborator

penningtonea commented Jun 12, 2024

Was able to execute makeblastdb on Pegasus. Output:
image

Tested 1 query sequence with manually inserted deletions against the 3-sequence DB. Output:
image

@penningtonea
Collaborator

@penningtonea you still should go ahead and index filtered-NT on pegasus in parallel. It will be good training for you and also I am not 100% sure the formatted-NT will work.

Working on downloading and indexing filtered-NT. Afterwards will scp the db to HIVE3.

@penningtonea
Collaborator

Mix-up with the task. Will download NT on Pegasus today. Email sent to Adam Wong asking about the best way to index NT via the Slurm scheduler, given the size of the job.

@penningtonea
Collaborator

NT was downloaded and indexed on Pegasus. Directory: SMHS/home/epennington/lustre/groups/hivelab/emily/NT

Now downloading and indexing filtered-NT v7.0 on Pegasus.

@HadleyKing
Member Author

HadleyKing commented Jun 24, 2024

@penningtonea
This is what I see. I am on the login node.

[email protected]:/SMHS/groups/hivelab/filteredNT_v7.0/filteredNT.fasta

Screenshot 2024-06-24 at 10 28 25 AM

@penningtonea
Collaborator

GW HPC Pegasus is not working. I am able to log in but unable to submit or allocate jobs. Indexing filtered_nt will resume when Pegasus is back up and running as usual.

@penningtonea
Collaborator

I am moving this closed ticket to the July task list to compare it with a similar ticket.

@rajamazumder

@HadleyKing and @penningtonea I was trying to follow this ticket. Can you tell me which ticket this was moved to?

@rajamazumder rajamazumder reopened this Aug 30, 2024

4 participants