Classifier assessment and feature selection for recognizing short coding sequences of human genes.
MedLine Citation:
PMID:  22401589     Owner:  NLM     Status:  MEDLINE    
With the ever-increasing pace of genome sequencing, there is a great need for fast and accurate computational tools that automatically identify genes in the resulting genomes. Although gene-finding algorithms have improved greatly over the past decades, there is still room for improvement. In particular, recognizing short exons in eukaryotes remains an unsolved problem. This article assesses various linear and kernel-based classification algorithms and selects the best combination of Z-curve features for this task. Eight state-of-the-art linear and kernel-based supervised pattern recognition techniques were used to identify short (21-192 bp) coding sequences of human genes. Judged by prediction accuracy, the tradeoff between sensitivity and specificity, and computation time, partial least squares (PLS) and kernel partial least squares (KPLS) were found to be the best linear and kernel-based classifiers, respectively. A surprising result was that, by exploiting the interpretability of the PLS and Z-curve methods, a subset of 93 Z-curve features was found to be the best combination. With these features, KPLS improved the average recognition accuracy by as much as 7.7% over Fisher discriminant analysis using 189 Z-curve variables (Gao and Zhang, 2004). The code used is freely available from the following sources (implemented in MATLAB and supported on Linux and MS Windows): (1) SVM: (2) GP: (3) KPLS and KFDA: Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK. (4) PLS: Wise, B.M., and Gallagher, N.B. 2011. PLS-Toolbox for use with MATLAB: ver 1.5.2. Eigenvector Technologies, Manson, WA. Supplementary Material for this article is available at
Kai Song; Ze Zhang; Tuo-Peng Tong; Fang Wu
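The Z-curve features named in the abstract can be illustrated with a minimal sketch. It assumes the standard Z-curve transform of base frequencies (x = purine vs. pyrimidine, y = amino vs. keto, z = weak vs. strong hydrogen bonding) applied separately to each of the three codon positions, which yields 9 components. The function name `zcurve_phase_features` is hypothetical, and the sketch does not reproduce the 189- or 93-variable feature sets used in the article or the authors' MATLAB code.

```python
from collections import Counter


def zcurve_phase_features(seq):
    """Return 9 phase-specific mononucleotide Z-curve components.

    Illustrative only: one (x, y, z) triple per codon position,
    computed from the base frequencies at that position.
    """
    seq = seq.upper()
    features = []
    for phase in range(3):
        bases = seq[phase::3]          # every third base, starting at this codon position
        n = len(bases)
        counts = Counter(bases)
        # Frequencies of A, C, G, T at this codon position (0.0 if the position is empty)
        a, c, g, t = (counts[b] / n if n else 0.0 for b in "ACGT")
        features.extend([
            (a + g) - (c + t),  # x: purines minus pyrimidines
            (a + c) - (g + t),  # y: amino minus keto bases
            (a + t) - (g + c),  # z: weak minus strong hydrogen-bonding bases
        ])
    return features


if __name__ == "__main__":
    print(zcurve_phase_features("ATGGCCAAGTAA"))
```

A vector like this, computed for known coding and non-coding sequences, is the kind of input a linear classifier such as PLS or Fisher discriminant analysis would be trained on.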
Publication Detail:
Type:  Journal Article; Research Support, Non-U.S. Gov't    
Journal Detail:
Title:  Journal of computational biology : a journal of computational molecular cell biology     Volume:  19     ISSN:  1557-8666     ISO Abbreviation:  J. Comput. Biol.     Publication Date:  2012 Mar 
Date Detail:
Created Date:  2012-03-09     Completed Date:  2012-06-25     Revised Date:  2013-06-26    
Medline Journal Info:
Nlm Unique ID:  9433358     Medline TA:  J Comput Biol     Country:  United States    
Other Details:
Languages:  eng     Pagination:  251-60     Citation Subset:  IM    
School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.
MeSH Terms
Computer Simulation
Databases, Genetic
Least-Squares Analysis
Models, Genetic
Open Reading Frames*
Sequence Analysis, DNA / methods
Support Vector Machines*

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine
