Protein Research

Summer Research at the National Institute of Standards and Technology (NIST), Summer 2016

Background

Glycosylation is the process in which a carbohydrate attaches to a protein or lipid, and it is essential to the folding and function of proteins. Therefore, defects in glycosylation can cause severe diseases and even cancers. Proteins consist of long strings of amino acids, and our research focused mainly on N-linked glycosylation sites, which have an N-X-S/T/C motif (where X is not Proline). For N-linked glycosylation, not all potential glycosylation sites become glycosylated, so this program was designed to predict which glycosylation sites are actually important based on how well the N-X-S/T/C motif is conserved across several species.

Project Details

This program data-mines the UniProt website for sequences that have experimentally proven n-linked glycosylation sites. It then expands the search and aligns all sequences of a given protein using the Clustal Omega alignment software. From there, it performs statistical analyses on the alignments to predict significant potential glycosylation sites. The data can also be used to generate other statistics, like amino acid (X) frequency, alignment of flanking regions, and mutation rates (i.e. from N to D).

The program performs several optimizations in order to quickly compile and sort large quantities of data, such as specifying which glycosylation sites already have experimentally proven data, filtering out duplicate organisms to find the "most representative" sequence, and distinguishing between N-X-S, N-X-T, and N-X-C sites. It also stores all the information retreived in an easily retrievable format to allow for quick and easy analysis.

1). info_tester.py queries the UniProt database to data-mine and parse all the sequencing data. It also generates gene tables and fasta files for the sequence information.

2). alignment_tester.py runs the Clustal Omega alignment softare, and produces a clean, "pretty printed" alignment for all the sequences of the proteins.

3). analysis_tester.py performs the statistical analyses on the sequencing data. While it is not necessarily an executable file, it has separate function calls from analysis.py (where the statistical functions are actually written) and a rough pipeline structure.

Data Sources

Protein sequencing data was taken from UniProt.

Technology

This program was written using Python 3. It uses the BioPython, tqdm, and Bioservices packages, and the Clustal Omega alignment software. In order to generate colored html files, the ansifilter program is run from the command line in order to generate the visual html alignment files.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
examples		examples
README.md		README.md
UniprotUtils.py		UniprotUtils.py
alignment_tester.py		alignment_tester.py
analysis.py		analysis.py
analysis_tester.py		analysis_tester.py
get_info.py		get_info.py
info_tester.py		info_tester.py
myUniprotIO.py		myUniprotIO.py
seq_utils.py		seq_utils.py
sequence_alignment.py		sequence_alignment.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Protein Research

Background

Project Details

Data Sources

Technology

About

Releases

Packages

Languages

ejin700/protein-research

Folders and files

Latest commit

History

Repository files navigation

Protein Research

Background

Project Details

Data Sources

Technology

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages