Data Quality of NR database

Improving data quality of taxonomic assignments in large-scale public databases

Method

Implementation

To run the data quality analysis at large scale, we used BoaG for the computationally expensive part. We then used existing Python libraries and Jupyter notebooks for the postprocessing.

BoaG is a domain-specific language and infrastructure built on top of Hadoop for genomics data. Website: https://boalang.github.io/bio/

The BoaG compiler is written in Java, and the source code is available here.

This video gives step-by-step instructions for setting up the programming environment in Eclipse for the Boa compiler: link

Step 1: Script and analysis on BoaG infrastructure

Step 2: Postprocessing in Python and Jupyter Notebooks

- Lineage
- Provenance
- Construct tree with the ETE3 library
- Identifying conflicts
- Output: list of misclassified sequences
  - This file lists the conflicts. Each entry shows the sequence ID, the cluster ID, the sequence's assignment, and the top three assignments of the cluster, along with the confidence score (CS) for the proposed assignment. See the example:
    1A0Q 55656088 [('10090', 1)] [('562', 24), ('168807', 2), ('405955', 1)] CS= 0.8888888888888888
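
As a rough illustration of how an entry in this file could be read during postprocessing, the sketch below parses the example entry and recomputes a score from the listed counts. It is not taken from the repository: the field names, the regular expression, and the helper functions are hypothetical, and it assumes each entry sits on a single line. The score formula (count of the most frequent taxon divided by the sum of the listed counts, here 24 / 27) is inferred from this one example only and may differ from the actual definition.

```python
import ast
import re

# Assumed layout of one entry, based only on the example above:
# <sequence id> <cluster id> <sequence assignment> <top-3 cluster assignments> CS= <score>
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\[.*?\])\s+(\[.*\])\s+CS=\s*(\S+)$")


def parse_conflict_line(line):
    """Split one entry into its fields (field names are hypothetical)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized conflict line: {line!r}")
    seq_id, cluster_id, seq_taxa, cluster_taxa, score = match.groups()
    return {
        "sequence_id": seq_id,
        "cluster_id": cluster_id,
        # The bracketed fields are Python-style lists of (taxid, count) tuples.
        "sequence_taxa": ast.literal_eval(seq_taxa),
        "cluster_top3": ast.literal_eval(cluster_taxa),
        "confidence": float(score),
    }


def top_share(cluster_top3):
    """Assumed score: count of the most frequent taxon over the sum of listed counts."""
    counts = [count for _taxid, count in cluster_top3]
    return max(counts) / sum(counts)


example = (
    "1A0Q 55656088 [('10090', 1)] "
    "[('562', 24), ('168807', 2), ('405955', 1)] CS= 0.8888888888888888"
)
record = parse_conflict_line(example)

# The proposed assignment is taken to be the most frequent taxon in the cluster.
proposed_taxid = max(record["cluster_top3"], key=lambda pair: pair[1])[0]
print(record["sequence_id"], "->", proposed_taxid, top_share(record["cluster_top3"]))
```

In this entry the sequence is assigned to taxon 10090 (Mus musculus), while 24 of the 27 listed cluster members are assigned to taxon 562 (Escherichia coli), which is why 562 appears as the proposed assignment.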

Dataset

Clustering information

Evaluation

Simulated dataset

Link

Manual Analysis

Link

Literature dataset

This dataset draws on prior work on detecting and correcting misclassifications in rRNA sequences.

UniProt --UniRef90 (clusters at 90% sequence similarity)

The entire dataset contains 119 million sequences: https://www.uniprot.org/uniref/?query=&fil=identity:0.9

The following are examples of misclassifications in the 90% clusters:

- root conflict in UniRef90_I3TC36
- cellular organisms conflict in UniRef90_I3TC36
- superkingdom conflict in UniRef90_I3TC36
- phylum conflict in UniRef90_I3TC36
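
One way to think about the level of a conflict is to compare two lineages from the root down and report where they first split. The sketch below does this for two simplified, hand-written lineages; it is an illustration only, not the repository's method. The function name and the (rank, name) lineage format are hypothetical, and the position-by-position comparison assumes both lineages list the same ranks in the same order, which real NCBI lineages often do not.

```python
def first_divergence(lineage_a, lineage_b):
    """Return (deepest shared taxon name, rank at which the lineages split).

    Each lineage is a root-to-leaf list of (rank, name) pairs.
    The rank is None when no split is found in the compared positions.
    """
    shared = None
    for (rank_a, name_a), (_rank_b, name_b) in zip(lineage_a, lineage_b):
        if name_a != name_b:
            return shared, rank_a
        shared = name_a
    return shared, None


# Simplified, illustrative lineages (most intermediate ranks omitted).
mouse = [
    ("no rank", "root"),
    ("no rank", "cellular organisms"),
    ("superkingdom", "Eukaryota"),
    ("phylum", "Chordata"),
]
ecoli = [
    ("no rank", "root"),
    ("no rank", "cellular organisms"),
    ("superkingdom", "Bacteria"),
    ("phylum", "Proteobacteria"),
]

shared_taxon, split_rank = first_divergence(mouse, ecoli)
print(shared_taxon, split_rank)
```

Here the two lineages share "cellular organisms" and split at the superkingdom rank. For real data, lineages of different depth would need to be aligned by rank or compared on a taxonomy tree rather than by list position.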

Run time

Take a 1M sample and check for common1 and common2:

python ~/Documents/MyGithub/docs/nr_functions/seq_clstr_conflict.py /Users/hbagheri/Downloads/nr_protein_functions/95-part-r-00000clustr-seq ../boag-job82-output.txt_converted nr_single_taxa_converted_1M > log_conf_1M
