Improving data quality of taxonomic assignments in large-scale public databases

boalang/quality

Data Quality of NR database

Improving data quality of taxonomic assignments in large-scale public databases

Method

Implementation

To run the data quality analysis at large scale, we used BoaG for the computationally expensive part. We then used Python libraries and Jupyter notebooks for the postprocessing.

BoaG is a domain-specific language and infrastructure built on top of Hadoop for genomics data. Website: https://boalang.github.io/bio/

The BoaG compiler is written in Java, and the source code is available here

  • This video gives step-by-step instructions for setting up a programming environment in Eclipse for the Boa compiler. link

Step 1: Script and analysis on BoaG infrastructure

Step 2: Postprocessing in Python and Jupyter Notebooks

Dataset

Clustering information

Evaluation

Simulated dataset

Manual Analysis

Literature dataset

The following works address detecting and correcting misclassifications in rRNA sequences.

UniProt UniRef90 (clusters at 90% sequence identity)

The following are examples of misclassifications in the 90% clusters:

  • root conflict in UniRef90_I3TC36
  • cellular organisms conflict in UniRef90_I3TC36
  • superkingdom conflict in UniRef90_I3TC36
  • phylum conflict in UniRef90_I3TC36
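
One way to read the conflict levels above is as the deepest taxonomic level that all members of a cluster still share. The Python sketch below illustrates that reading only; it is not the code used in this repository. The function names, the lineage layout (a list of names from root downward), and the labeling rule are all assumptions.

```python
from typing import List, Optional, Sequence

# Assumed layout: position 0 is "root", position 1 is "cellular organisms"
# (or another top-level node), position 2 the superkingdom, position 3 the phylum.
LEVELS: List[str] = ["root", "cellular organisms", "superkingdom", "phylum"]


def first_conflict_depth(lineages: Sequence[Sequence[str]]) -> Optional[int]:
    """Return the first lineage position where members disagree, or None."""
    if len(lineages) < 2:
        return None
    # zip stops at the shortest lineage, so only shared positions are compared.
    for depth, names in enumerate(zip(*lineages)):
        if len(set(names)) > 1:
            return depth
    return None


def conflict_label(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Name the conflict after the deepest level that all members share (assumption)."""
    depth = first_conflict_depth(lineages)
    if depth is None or depth == 0 or depth > len(LEVELS):
        return None
    return LEVELS[depth - 1] + " conflict"


# Hypothetical cluster: two members that agree only down to "cellular organisms".
cluster = [
    ["root", "cellular organisms", "Bacteria", "Proteobacteria"],
    ["root", "cellular organisms", "Eukaryota", "Chordata"],
]
print(conflict_label(cluster))
```

Under this reading, a cluster whose members share nothing below root would be reported as a root conflict, and a cluster whose members agree at every compared level yields no label.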

Run time

Take a 1M sample and check for common1 and common2:

  python ~/Documents/MyGithub/docs/nr_functions/seq_clstr_conflict.py /Users/hbagheri/Downloads/nr_protein_functions/95-part-r-00000clustr-seq ../boag-job82-output.txt_converted nr_single_taxa_converted_1M > log_conf_1M
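
The README does not say how the 1M sample was drawn. As a minimal sketch, assuming a uniform random sample of lines from a file too large to load into memory, reservoir sampling is one option. The function name and parameters below are hypothetical and are not part of seq_clstr_conflict.py.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Draw up to k items uniformly at random from a stream in a single pass."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


# Small demonstration on an in-memory stream; for a real file, pass the open
# file handle as `lines` and set k=1_000_000.
demo = reservoir_sample((f"record_{n}" for n in range(100)), k=5, seed=0)
print(len(demo))
```

A fixed seed makes the sample reproducible, which helps when comparing runs of the conflict check.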
