Correction for population stratification in random forest analysis.
MedLine Citation:
PMID:  23148107     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
BACKGROUND: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS) because it may produce spurious associations. Random forest (RF) has been increasingly applied in GWAS data analysis because it handles high-dimensional genetic data well. RF produces importance measures for single nucleotide polymorphisms (SNPs), which are helpful for feature selection. However, if PS is not appropriately corrected, RF tends to assign high importance to disease-unrelated SNPs whose allele or genotype frequencies differ among subpopulations, leading to inaccurate results.
METHODS: In this study, the authors propose to correct for the confounding effect of PS by incorporating PS information into the RF analysis. The correction procedure starts by extracting PS information from a large number of structure-inference SNPs using EIGENSTRAT or a multi-dimensional scaling clustering procedure. The phenotype and genotypes, adjusted for this PS information, are then used as the outcome and predictors in the RF analysis.
RESULTS: Extensive simulations indicate that the importance measure of the causal SNP is increased following the PS correction. In an analysis of a real dataset, the proposed correction removes the spurious association between the lactase gene and height.
CONCLUSION: The authors propose a simple method to correct for PS in RF analysis on GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.
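The adjustment step described under METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the adjustment is computed, so the sketch assumes an EIGENSTRAT-style approach in which the phenotype and each genotype column are replaced by their residuals from a linear regression on the top principal components of the structure-inference SNPs. The function names (top_pcs, residualize), the number of components, and the simulated two-subpopulation data are all hypothetical.

```python
import numpy as np

def top_pcs(structure_snps, k=2):
    """Top-k principal component scores of a samples x SNPs genotype matrix."""
    centered = structure_snps - structure_snps.mean(axis=0)
    sd = centered.std(axis=0)
    keep = sd > 0                       # drop monomorphic SNPs
    scaled = centered[:, keep] / sd[keep]
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return u[:, :k] * s[:k]

def residualize(values, pcs):
    """Residuals of `values` (1-D or 2-D) after least-squares regression on PCs."""
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

# Hypothetical simulated data: two subpopulations of 200 individuals each.
rng = np.random.default_rng(0)
pop = np.repeat([0, 1], 200)
freq = np.where(pop[:, None] == 0, 0.2, 0.8)          # allele frequency per person
structure_snps = rng.binomial(2, freq, size=(400, 200))

# A disease-unrelated SNP whose allele frequency differs between subpopulations,
# and a phenotype that differs between subpopulations for non-genetic reasons.
null_snp = rng.binomial(2, np.where(pop == 0, 0.1, 0.9))
phenotype = 2.0 * pop + rng.normal(size=400)

pcs = top_pcs(structure_snps, k=2)
adj_snp = residualize(null_snp, pcs)
adj_phenotype = residualize(phenotype, pcs)

r_before = np.corrcoef(null_snp, phenotype)[0, 1]
r_after = np.corrcoef(adj_snp, adj_phenotype)[0, 1]
print(f"correlation before adjustment: {r_before:.3f}")
print(f"correlation after adjustment:  {r_after:.3f}")
```

In a full analysis the adjusted phenotype and adjusted genotype matrix would then be supplied to a random forest implementation (for example scikit-learn's RandomForestRegressor, which is not run here) as the outcome and predictors. Because the residuals are continuous, a regression forest is the natural choice under this assumption, even when the original genotypes are coded 0/1/2.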
Authors:
Yang Zhao; Feng Chen; Rihong Zhai; Xihong Lin; Zhaoxi Wang; Li Su; David C Christiani
Publication Detail:
Type:  Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't     Date:  2012-11-12
Journal Detail:
Title:  International journal of epidemiology     Volume:  41     ISSN:  1464-3685     ISO Abbreviation:  Int J Epidemiol     Publication Date:  2012 Dec 
Date Detail:
Created Date:  2013-01-03     Completed Date:  2013-06-10     Revised Date:  2013-12-04    
Medline Journal Info:
Nlm Unique ID:  7802871     Medline TA:  Int J Epidemiol     Country:  England    
Other Details:
Languages:  eng     Pagination:  1798-806     Citation Subset:  IM    
MeSH Terms
Descriptor/Qualifier:
Confounding Factors (Epidemiology)
Data Interpretation, Statistical*
Genetics, Population / methods*
Genome-Wide Association Study / methods*
Humans
Models, Statistical
Polymorphism, Single Nucleotide*
Grant Support
ID/Acronym/Agency:
CA092824/CA/NCI NIH HHS; ES00002/ES/NIEHS NIH HHS
From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

