Computational Genomics Final Project - Shreyas Aiyar, Debanik Purkayastha, Ishita Tripathi
.
├── api
├── matching
├── index_assisted_dp.py # Index Assisted DP algorithm implementation
├── neighbourhood.py # Neighbourhood search algorithm implementation
├── pigeonhole_principle_approximate_match.py # L-mer filtration algorithm implementation
├── six_frame_translation.py # Six frame translation implementation
├── blast.py # Trial implementation of blast.py
├── api.py
├── client # Web app files
├── tests
├── data_generation.py # Script to generate test reads
├── kmer_dict_tests.py # Test script to compare kmer dictionary sizes and creation runtime for different k's and number of proteins.
├── test_suite.py # Test script to run and benchmark algorithms
├── scripts
├── kmers.py # Script to generate L-mer dictionary file
├── proteins.py # Script to generate protein id dictionary file
├── data # Stores supporting dictionary files in .pickle format
├── contributions.txt
├── recording # Screen recording of the web app
└── README.md
To run the web app specifically, make sure your machine has Node.js installed. After installing Node.js, make sure that yarn is accessible as a package manager.
Node.js - https://nodejs.org/en/download/
yarn - https://classic.yarnpkg.com/en/docs/install/#mac-stable
Also, to generate some supporting files, you would require the Uniprot FASTA file which can be downloaded from:
Uniprot FASTA File - ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
Place the Uniprot File in /data directory once downloaded
From root directory
cd api && pip install -r requirements.txt && flask run
From root directory
cd client && yarn install && yarn start
cd tests
python3 data_generation.py --infile ../data/uniprot_sprot.fasta --seed 1 --num_errors 4 --num_reads 5 --read_length 60 --no-gaps
cd ..
Note: Generate synthetic reads prior to this
Smaller values of L would take a while to run
cd tests
python3 test_suite.py --infile test_reads.txt --mm 4 --l 4
cd ..
cd tests
python3 kmer_dict_tests.py
cd ..
cd scripts
python3 kmers.py --infile ../data/protein_dict_num_prots_100.pickle -k 5 -n 100
cd ..
cd scripts
python3 proteins.py --infile ../data/uniprot_sprot.fasta --num_proteins 100
cd ..
cd api/matching
python3 neighborhood.py -i [PROT_SEQ] -m 1 -t 17
cd ../../
Note: The query files have sample usage given as comments at the top of the file. They can be run independently as well.