What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models.
MedLine Citation:
PMID:  15184705     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
Statistical models, such as linear or logistic regression or survival analysis, are frequently used as a means to answer scientific questions in psychosomatic research. Many who use these techniques, however, apparently fail to appreciate fully the problem of overfitting, i.e., capitalizing on the idiosyncrasies of the sample at hand. Overfitted models will fail to replicate in future samples, thus creating considerable uncertainty about the scientific merit of the finding. The present article is a nontechnical discussion of the concept of overfitting and is intended to be accessible to readers with varying levels of statistical expertise. The notion of overfitting is presented in terms of asking too much from the available data. Given a certain number of observations in a data set, there is an upper limit to the complexity of the model that can be derived with any acceptable degree of uncertainty. Complexity arises as a function of the number of degrees of freedom expended (the number of predictors, including complex terms such as interactions and nonlinear terms) against the same data set during any stage of the data analysis. Theoretical and empirical evidence--with a special focus on the results of computer simulation studies--is presented to demonstrate the practical consequences of overfitting with respect to scientific inference. Three common practices--automated variable selection, pretesting of candidate predictors, and dichotomization of continuous variables--are shown to pose a considerable risk for spurious findings in models. The dilemma between overfitting and exploring candidate confounders is also discussed. Alternative means of guarding against overfitting are discussed, including variable aggregation and the fixing of coefficients a priori. Techniques that account and correct for complexity, including shrinkage and penalization, are also introduced.
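The idea of "asking too much from the available data" can be illustrated with a small simulation. The sketch below is not taken from the article; it is a minimal illustration under stated assumptions: the sample size (30), the number of predictors (15), the number of replications (200), and the function names `r_squared` and `simulate_overfit` are all hypothetical choices made for this example. Every predictor is pure noise, so any apparent fit in the original sample reflects only the idiosyncrasies of that sample and should not carry over to a fresh sample.

```python
import numpy as np


def r_squared(y, y_hat):
    """Proportion of variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def simulate_overfit(n_obs=30, n_predictors=15, n_reps=200, seed=0):
    """Fit ordinary least squares to pure-noise predictors.

    Returns the mean R^2 in the fitting sample and the mean R^2 when the
    same fitted coefficients are applied to a new sample of equal size.
    All defaults are illustrative assumptions, not values from the article.
    """
    rng = np.random.default_rng(seed)
    fit_r2, new_r2 = [], []
    for _ in range(n_reps):
        # Outcome and predictors are independent: the true R^2 is zero.
        x_fit = rng.normal(size=(n_obs, n_predictors))
        y_fit = rng.normal(size=n_obs)
        x_new = rng.normal(size=(n_obs, n_predictors))
        y_new = rng.normal(size=n_obs)

        # Add an intercept column and estimate coefficients by least squares.
        a_fit = np.column_stack([np.ones(n_obs), x_fit])
        a_new = np.column_stack([np.ones(n_obs), x_new])
        beta, *_ = np.linalg.lstsq(a_fit, y_fit, rcond=None)

        fit_r2.append(r_squared(y_fit, a_fit @ beta))
        new_r2.append(r_squared(y_new, a_new @ beta))
    return float(np.mean(fit_r2)), float(np.mean(new_r2))


if __name__ == "__main__":
    apparent, replicated = simulate_overfit()
    print(f"mean R^2 in fitting sample: {apparent:.2f}")
    print(f"mean R^2 in new sample:     {replicated:.2f}")
```

Under these assumptions the apparent fit in the original sample is substantial even though no real relationship exists, while the fit in the new sample is worse than simply predicting the mean. Reducing `n_predictors` relative to `n_obs` narrows the gap.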
Authors:
Michael A Babyak
Publication Detail:
Type:  Journal Article    
Journal Detail:
Title:  Psychosomatic medicine     Volume:  66     ISSN:  1534-7796     ISO Abbreviation:  Psychosom Med     Publication Date:    2004 May-Jun
Date Detail:
Created Date:  2004-06-08     Completed Date:  2004-07-19     Revised Date:  2013-05-20    
Medline Journal Info:
Nlm Unique ID:  0376505     Medline TA:  Psychosom Med     Country:  United States    
Other Details:
Languages:  eng     Pagination:  411-21     Citation Subset:  IM    
Affiliation:
Duke University Medical Center, Durham, NC 27710, USA. michael.babyak@duke.edu
MeSH Terms
Descriptor/Qualifier:
Computer Simulation
Data Interpretation, Statistical
Humans
Models, Statistical
Psychophysiology / statistics & numerical data
Psychosomatic Medicine / statistics & numerical data
Regression Analysis
Research Design
Statistics as Topic / education, methods*

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

