Classifier assessment and feature selection for recognizing short coding sequences of human genes.
MedLine Citation:
PMID:  22401589     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
With the ever-increasing pace of genome sequencing, there is a great need for fast and accurate computational tools that automatically identify genes in the sequenced genomes. Although gene-finding algorithms have progressed considerably over the past decades, there is still room for improvement. In particular, the recognition of short exons in eukaryotes has not yet been solved satisfactorily. This article assesses various linear and kernel-based classification algorithms and selects the best combination of Z-curve features to address this problem. Eight state-of-the-art linear and kernel-based supervised pattern recognition techniques were used to identify short (21-192 bp) coding sequences of human genes. Judged by prediction accuracy, the tradeoff between sensitivity and specificity, and time consumption, the partial least squares (PLS) and kernel partial least squares (KPLS) algorithms were found to be the best linear and kernel-based classifiers, respectively. A surprising result was that, by exploiting the interpretability of the PLS and Z-curve methods, a subset of 93 Z-curve features was identified as the best feature combination. With these features, KPLS improved the average recognition accuracy by as much as 7.7% over the Fisher discriminant analysis using 189 Z-curve variables (Gao and Zhang, 2004). The code used (implemented in MATLAB and supported on Linux and MS Windows) is freely available from the following sources: (1) SVM: http://www.support-vector-machines.org/SVM_soft.html. (2) GP: http://www.gaussianprocess.org. (3) KPLS and KFDA: Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK. (4) PLS: Wise, B.M., and Gallagher, N.B. 2011. PLS-Toolbox for use with MATLAB: ver 1.5.2. Eigenvector Technologies, Manson, WA.
Supplementary Material for this article is available at www.liebertonline.com/cmb.
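As an illustration of the kind of Z-curve feature the abstract refers to, the following minimal Python sketch computes the nine phase-specific mononucleotide Z-curve components of a sequence. It is not the authors' MATLAB code. The function name `zcurve_phase_features`, the normalization by per-phase base frequencies, and the handling of non-ACGT characters are assumptions made here, and the sketch covers only a small part of the 189-variable set mentioned above.

```python
from collections import Counter


def zcurve_phase_features(seq: str) -> list[float]:
    """Return 9 phase-specific mononucleotide Z-curve components.

    For each codon position (phase 0, 1, 2) the base frequencies a, c, g, t
    are combined into the three standard Z-curve coordinates:
        x = (a + g) - (c + t)   purine vs. pyrimidine
        y = (a + c) - (g + t)   amino vs. keto
        z = (a + t) - (g + c)   weak vs. strong hydrogen bonding
    Assumption: frequencies are normalized by the number of bases in the
    phase; characters other than A, C, G, T count toward the length only.
    """
    seq = seq.upper()
    features: list[float] = []
    for phase in range(3):
        bases = seq[phase::3]          # every third base, starting at this phase
        n = len(bases)
        counts = Counter(bases)
        a, c, g, t = (counts[b] / n if n else 0.0 for b in "ACGT")
        features.extend([
            (a + g) - (c + t),
            (a + c) - (g + t),
            (a + t) - (g + c),
        ])
    return features


if __name__ == "__main__":
    # Hypothetical 9-bp example sequence, used only to show the call.
    print(zcurve_phase_features("ATGGCCAAG"))
```

A feature vector of this kind could then be passed to any of the classifiers named in the abstract; which components belong to the selected 93 is not stated in this record.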
Authors:
Kai Song; Ze Zhang; Tuo-Peng Tong; Fang Wu
Publication Detail:
Type:  Journal Article; Research Support, Non-U.S. Gov't    
Journal Detail:
Title:  Journal of computational biology : a journal of computational molecular cell biology     Volume:  19     ISSN:  1557-8666     ISO Abbreviation:  J. Comput. Biol.     Publication Date:  2012 Mar 
Date Detail:
Created Date:  2012-03-09     Completed Date:  2012-06-25     Revised Date:  2013-06-26    
Medline Journal Info:
Nlm Unique ID:  9433358     Medline TA:  J Comput Biol     Country:  United States    
Other Details:
Languages:  eng     Pagination:  251-60     Citation Subset:  IM    
Affiliation:
School of Chemical Engineering and Technology, Tianjin University, Tianjin, China. ksong@tju.edu.cn
MeSH Terms
Descriptor/Qualifier:
Algorithms
Computer Simulation
Databases, Genetic
Humans
Least-Squares Analysis
Models, Genetic
Open Reading Frames*
Sequence Analysis, DNA / methods
Support Vector Machines*
From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

