A novel algorithm for identifying low-complexity regions in a protein sequence.
MedLine Citation:
PMID:  17018537     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
MOTIVATION: We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats.

RESULTS: We define new complexity measures to compute the complexity of a sequence based on a given scoring matrix, such as BLOSUM 62. Our complexity measures also consider the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm called GBA to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids. Each edge connects two pairs of amino acids that can be grouped together to form a longer repeat. GBA finds short subsequences as LCR candidates by traversing this graph. It then extends them to find longer subsequences that may contain full repeats with low complexities. Extended subsequences are then post-processed to refine repeats to LCRs. Our experiments on real data show that GBA has significantly higher recall compared to existing algorithms, including 0j.py, CARD, and SEG.

AVAILABILITY: The program is available on request.
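
The graph construction described in the abstract can be illustrated with a rough sketch. This is not the GBA implementation, which is only available on request. It assumes that two residues are "similar" only when identical (the abstract uses a scoring matrix such as BLOSUM 62), that an edge links the pair (i, j) to the pair (i+1, j+1), and that candidates are the merged spans of sufficiently long paths. The names `similar`, `build_pair_graph`, `candidate_regions`, `max_gap`, and `min_len` are hypothetical, and the extension and post-processing steps are omitted.

```python
def similar(a, b):
    # Assumption: identity only. The abstract scores similarity with a
    # matrix such as BLOSUM 62; this stand-in avoids external data.
    return a == b


def build_pair_graph(seq, max_gap=10):
    """Vertices are index pairs (i, j), i < j <= i + max_gap, whose residues
    are similar. An edge joins (i, j) to (i + 1, j + 1) when both exist,
    which is one way two pairs could be grouped into a longer repeat."""
    n = len(seq)
    vertices = {
        (i, j)
        for i in range(n)
        for j in range(i + 1, min(n, i + max_gap + 1))
        if similar(seq[i], seq[j])
    }
    edges = {
        (i, j): (i + 1, j + 1)
        for (i, j) in vertices
        if (i + 1, j + 1) in vertices
    }
    return vertices, edges


def merge_spans(spans):
    # Merge overlapping or touching half-open spans (start, end).
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def candidate_regions(seq, min_len=3, max_gap=10):
    """Walk each path from its head and keep paths with at least min_len
    vertices. Each kept path yields the half-open span that covers both
    copies of the repeat; overlapping spans are merged."""
    vertices, edges = build_pair_graph(seq, max_gap)
    has_predecessor = set(edges.values())
    spans = []
    for head in vertices:
        if head in has_predecessor:
            continue  # not the first vertex of a path
        tail, length = head, 1
        while tail in edges:
            tail = edges[tail]
            length += 1
        if length >= min_len:
            spans.append((head[0], tail[1] + 1))
    return merge_spans(spans)


if __name__ == "__main__":
    print(candidate_regions("MKQQQQQQQQLT"))
```

On a toy sequence with a run of glutamines, the sketch reports the run as a single candidate span. A real implementation would replace `similar` with a matrix-based threshold and score candidates with the complexity measures the abstract mentions.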
Authors:
Xuehui Li; Tamer Kahveci
Publication Detail:
Type:  Journal Article     Date:  2006-10-02
Journal Detail:
Title:  Bioinformatics (Oxford, England)     Volume:  22     ISSN:  1367-4811     ISO Abbreviation:  Bioinformatics     Publication Date:  2006 Dec 
Date Detail:
Created Date:  2006-12-04     Completed Date:  2006-12-28     Revised Date:  2009-11-04    
Medline Journal Info:
Nlm Unique ID:  9808944     Medline TA:  Bioinformatics     Country:  England    
Other Details:
Languages:  eng     Pagination:  2980-7     Citation Subset:  IM    
Affiliation:
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA. xli@cise.ufl.edu
MeSH Terms
Descriptor/Qualifier:
Algorithms*
Amino Acid Sequence
Molecular Sequence Data
Proteins / chemistry*
Sequence Alignment / methods*
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid
Chemical
Reg. No./Substance:
0/Proteins

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

