Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?
MedLine Citation:
PMID:  23030316     Owner:  NLM     Status:  Publisher    
Prior to using a quantitative structure-activity relationship (QSAR) model for external predictions, its predictive power should be established and validated. In the absence of a true external data set, the best way to validate the predictive ability of a model is to perform statistical external validation, in which the overall data set is divided into training and test sets. Commonly, this splitting is performed using random division. Rational splitting methods instead divide data sets into training and test sets in an intelligent fashion. The purpose of this study was to determine whether rational division methods lead to more predictive models than random division. A special data splitting procedure was used to facilitate the comparison between random and rational division methods. For each toxicity end point, the overall data set was divided into a modeling set (80% of the overall set) and an external evaluation set (20% of the overall set) using random division. The modeling set was then subdivided into a training set (80% of the modeling set) and a test set (20% of the modeling set), once using rational division methods and once using random division. The Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms were used as the rational division methods. The hierarchical clustering, random forest, and k-nearest neighbor (kNN) methods were used to develop QSAR models based on the training sets. For kNN QSAR, multiple training and test sets were generated, and multiple QSAR models were built. The results of this study indicate that models based on rational division methods generate better statistical results for the test sets than models based on random division, but the predictive power of both types of models is comparable.
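The nested splitting procedure described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the descriptor matrix is random toy data, Euclidean distance is assumed, the function names `random_split` and `kennard_stone_split` are hypothetical, and only one of the three rational methods (Kennard-Stone, in its commonly described max-min form) is shown.

```python
import math
import random


def random_split(indices, frac_first, rng):
    """Randomly divide indices into two lists; the first gets frac_first of them."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = round(frac_first * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def kennard_stone_split(X, indices, frac_train):
    """Pick a training set from rows X[i] (i in indices) by max-min distance.

    Assumes Euclidean distance between descriptor vectors.
    """
    indices = list(indices)
    n_train = round(frac_train * len(indices))
    if not 2 <= n_train <= len(indices):
        raise ValueError("training set must hold between 2 and all compounds")

    # Seed the training set with the two mutually most distant compounds.
    a, b = max(
        ((i, j) for k, i in enumerate(indices) for j in indices[k + 1:]),
        key=lambda pair: math.dist(X[pair[0]], X[pair[1]]),
    )
    train = [a, b]

    # For every remaining compound, track the distance to its nearest
    # already-selected compound.
    min_dist = {
        i: min(math.dist(X[i], X[a]), math.dist(X[i], X[b]))
        for i in indices
        if i not in (a, b)
    }

    # Repeatedly add the compound that is farthest from the selected set.
    while len(train) < n_train:
        nxt = max(min_dist, key=min_dist.get)
        train.append(nxt)
        del min_dist[nxt]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], math.dist(X[i], X[nxt]))

    test = list(min_dist)  # whatever was never selected
    return train, test


rng = random.Random(0)
# Toy data: 100 "compounds" with 3 made-up descriptors each.
X = [[rng.random() for _ in range(3)] for _ in range(100)]

# Step 1: random 80/20 division into modeling and external evaluation sets.
modeling, external = random_split(range(len(X)), 0.8, rng)

# Step 2: 80/20 subdivision of the modeling set, rationally and randomly.
train_ks, test_ks = kennard_stone_split(X, modeling, 0.8)
train_rnd, test_rnd = random_split(modeling, 0.8, rng)
```

Because the external evaluation set is held out before either subdivision, models built from `train_ks` and `train_rnd` can be compared on the same untouched compounds, which is the point of the two-level design.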
Todd M Martin; Paul Harten; Douglas M Young; Eugene N Muratov; Alexander Golbraikh; Hao Zhu; Alexander Tropsha
Publication Detail:
Type:  JOURNAL ARTICLE     Date:  2012-10-3
Journal Detail:
Title:  Journal of chemical information and modeling     Volume:  -     ISSN:  1549-960X     ISO Abbreviation:  J Chem Inf Model     Publication Date:  2012 Oct 
Date Detail:
Created Date:  2012-10-3     Completed Date:  -     Revised Date:  -    
Medline Journal Info:
Nlm Unique ID:  101230060     Medline TA:  J Chem Inf Model     Country:  -    
Other Details:
Languages:  ENG     Pagination:  -     Citation Subset:  -    
Sustainable Technology Division, National Risk Management Research Laboratory, Office of Research and Development, United States Environmental Protection Agency, 26 West Martin Luther King Drive, Cincinnati, Ohio 45268, United States.
From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine