EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis.
MedLine Citation:
PMID:  18499695     Owner:  NLM     Status:  MEDLINE    
MOTIVATION: We developed an EM-random forest (EMRF) for Haseman-Elston quantitative trait linkage analysis that accounts for marker ambiguity and weights each sib-pair according to its posterior identity-by-descent (IBD) distribution. The usual random forest (RF) variable importance (VI) index used to rank markers for variable selection is not optimal when applied to linkage data because of correlation between markers. We define new VI indices that borrow information from linked markers using the correlation structure inherent in IBD linkage data.
RESULTS: Using simulations, we find that the new VI indices in EMRF performed better than the original RF VI index and performed similarly to or better than the EM-Haseman-Elston regression LOD score for various genetic models. Moreover, tree size and the size of the marker subset evaluated at each node are important considerations in RFs.
AVAILABILITY: The source code for EMRF written in C is available at
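As a rough illustration of the weighting idea in the abstract, the sketch below runs a plain weighted Haseman-Elston regression on simulated sib-pairs. It is not the authors' EMRF code. Each pair is expanded into three pseudo-observations (IBD = 0, 1, 2), each weighted by the pair's posterior IBD probability, and the squared trait difference is regressed on the proportion of alleles shared IBD. The function names, the simulation settings, and the way the ambiguous posterior is constructed are all assumptions made for this example.

```python
import random


def weighted_slope(x, y, w):
    """Weighted least-squares slope and intercept of y regressed on x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, my - slope * mx


def expand_pairs(sq_diffs, ibd_posteriors):
    """Expand each sib-pair into weighted pseudo-observations, one per IBD state."""
    x, y, w = [], [], []
    for d, post in zip(sq_diffs, ibd_posteriors):
        for ibd, p in enumerate(post):
            if p > 0:
                x.append(ibd / 2.0)  # proportion of alleles shared IBD
                y.append(d)          # squared sib-pair trait difference
                w.append(p)          # posterior probability used as the weight
    return x, y, w


def simulate_pairs(n_pairs, seed=0, certainty=0.7):
    """Simulate toy sib-pair data at a linked marker (illustrative settings only)."""
    rng = random.Random(seed)
    prior = (0.25, 0.5, 0.25)  # prior IBD distribution for full sibs
    sq_diffs, posteriors = [], []
    for _ in range(n_pairs):
        true_ibd = rng.choices((0, 1, 2), weights=prior)[0]
        # Assumed model: the variance of the trait difference shrinks with IBD sharing.
        var_diff = 2.0 + 2.0 * (1.0 - true_ibd / 2.0)
        diff = rng.gauss(0.0, var_diff ** 0.5)
        sq_diffs.append(diff * diff)
        # Hypothetical ambiguous posterior: a mixture of the true state and the prior.
        post = tuple(
            certainty * (k == true_ibd) + (1.0 - certainty) * prior[k]
            for k in range(3)
        )
        posteriors.append(post)
    return sq_diffs, posteriors


if __name__ == "__main__":
    d, post = simulate_pairs(5000)
    x, y, w = expand_pairs(d, post)
    slope, intercept = weighted_slope(x, y, w)
    print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

In Haseman-Elston regression a negative slope is the classical signal of linkage, because pairs that share more alleles IBD at a linked locus tend to have more similar trait values. In this sketch, IBD ambiguity pulls the estimated slope toward zero.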
Sophia S F Lee; Lei Sun; Rafal Kustra; Shelley B Bull
Publication Detail:
Type:  Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't     Date:  2008-05-21
Journal Detail:
Title:  Bioinformatics (Oxford, England)     Volume:  24     ISSN:  1367-4811     ISO Abbreviation:  Bioinformatics     Publication Date:  2008 Jul 
Date Detail:
Created Date:  2008-07-08     Completed Date:  2008-09-09     Revised Date:  2013-06-05    
Medline Journal Info:
Nlm Unique ID:  9808944     Medline TA:  Bioinformatics     Country:  England    
Other Details:
Languages:  eng     Pagination:  1603-10     Citation Subset:  IM    
Department of Public Health Sciences, University of Toronto, Toronto M5T3M7, Canada.
MeSH Terms
Chromosome Mapping
Computational Biology / methods*
Data Interpretation, Statistical
Genetic Linkage*
Lod Score
Models, Genetic*
Models, Statistical
Programming Languages
Quantitative Trait, Heritable
Random Allocation
Regression Analysis

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Bioinformatics
Journal ID (publisher-id): bioinformatics
Journal ID (hwp): bioinfo
ISSN: 1367-4803
ISSN: 1460-2059
Publisher: Oxford University Press
Article Information
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received: 15 November 2007
Revision received: 16 May 2008
Accepted: 17 May 2008
Print publication date: 15 July 2008
Electronic publication date: 21 May 2008
PMC release date: 21 May 2008
Volume: 24 Issue: 14
First Page: 1603 Last Page: 1610
ID: 2638262
DOI: 10.1093/bioinformatics/btn239
Publisher Id: btn239
PubMed Id: 18499695

EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis
Sophia S. F. Lee (1, 2)
Lei Sun (1, 3)
Rafal Kustra (1)
Shelley B. Bull (1, 2)*
(1) Department of Public Health Sciences, University of Toronto, Toronto M5T 3M7; (2) Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto M5G 1X5; and (3) Genetics and Genomic Biology, The Hospital for Sick Children Research Institute, Toronto M5G 1L7, Canada
Associate Editor: Martin Bishop
Correspondence: *To whom correspondence should be addressed.


Article Categories:
  • Original Papers
    • Genetics and Population Analysis
