Computational inference of grammars for larger-than-gene structures from annotated gene sequences.
MedLine Citation:
PMID:  21258064     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
MOTIVATION: Larger-than-gene structures (LGS) are DNA segments that include at least one gene and often other segments such as inverted repeats and gene promoters. Mobile genetic elements (MGE) such as integrons are LGS that play an important role in horizontal gene transfer, primarily in Gram-negative organisms. Because of the number of genes they involve, known LGS have a profound effect on virulence, antibiotic resistance and other properties of the organism. Expert-compiled grammars have been shown to be an effective computational representation of LGS, well suited to automating annotation and supporting de novo gene discovery. However, development of LGS grammars by experts is labour intensive and restricted to known LGS.
OBJECTIVES: This study uses computational grammar inference methods to automate LGS discovery. We compared the ability of six algorithms to infer LGS grammars from DNA sequences annotated with genes and other short sequences. We then compared the predictive power of the learned grammars against an expert-developed grammar for gene cassette arrays found in Class 1, 2 and 3 integrons, which are modular LGS containing up to 9 of about 240 cassette types.
RESULTS: Using a Bayesian generalization algorithm, our inferred grammar was able to predict >95% of MGE structures in a corpus of 1760 sequences obtained from GenBank (F-score 75%). Even with 100% noise added to the training and test sets, we obtained an F-score of 68%, indicating that the method is robust and has the potential to predict de novo LGS when the underlying gene features are known.
AVAILABILITY: http://www2.chi.unsw.edu.au/attacca.
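ILLUSTRATION: The abstract describes grammars over annotated sequence features and reports results as F-scores. The sketch below is only a toy illustration of those two ideas, not the authors' grammar, algorithms or data. It assumes a hypothetical two-rule grammar over made-up annotation labels (attI, gene, attC), a small invented corpus with invented truth labels, and that "F-score" means the balanced harmonic mean of precision and recall.

```python
import re

# Hypothetical toy grammar over annotation labels (NOT the paper's grammar):
#   array    -> attI cassette+
#   cassette -> gene attC
ARRAY = re.compile(r"attI( gene attC)+")


def is_array(labels):
    """Return True if the label sequence is derivable from the toy grammar."""
    return ARRAY.fullmatch(" ".join(labels)) is not None


def f_score(predicted, actual):
    """Balanced F-score (harmonic mean of precision and recall) for boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # true positives
    fp = sum(p and not a for p, a in pairs)    # false positives
    fn = sum(a and not p for p, a in pairs)    # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented annotated sequences with invented truth labels, for illustration only.
corpus = [
    (["attI", "gene", "attC"], True),
    (["attI", "gene", "attC", "gene", "attC"], True),
    (["attI", "gene"], True),    # a truncated array that the toy grammar misses
    (["gene", "attC"], False),
    (["attI", "attC"], False),
]

predicted = [is_array(labels) for labels, _ in corpus]
actual = [truth for _, truth in corpus]
score = f_score(predicted, actual)
```

On this invented corpus the grammar accepts two of the three true arrays and none of the false ones, so precision is 1, recall is 2/3 and the score is 0.8. An inferred grammar would be evaluated in the same way against held-out annotated sequences.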
Authors:
Guy Tsafnat; Jaron Schaeffer; Andrew Clayphan; Jon R Iredell; Sally R Partridge; Enrico Coiera
Publication Detail:
Type:  Comparative Study; Journal Article; Research Support, Non-U.S. Gov't     Date:  2011-01-22
Journal Detail:
Title:  Bioinformatics (Oxford, England)     Volume:  27     ISSN:  1367-4811     ISO Abbreviation:  Bioinformatics     Publication Date:  2011 Mar 
Date Detail:
Created Date:  2011-03-10     Completed Date:  2011-05-31     Revised Date:  2014-07-30    
Medline Journal Info:
Nlm Unique ID:  9808944     Medline TA:  Bioinformatics     Country:  England    
Other Details:
Languages:  eng     Pagination:  791-6     Citation Subset:  IM    
MeSH Terms
Descriptor/Qualifier:
Algorithms*
Automatic Data Processing / methods*
Bayes Theorem
DNA / genetics
Databases, Genetic
Markov Chains
Molecular Sequence Annotation
Sequence Analysis, DNA / methods*
Chemical
Reg. No./Substance:
9007-49-2/DNA

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine