Correction for population stratification in random forest analysis.
MedLine Citation:
PMID:  23148107     Owner:  NLM     Status:  MEDLINE    
BACKGROUND: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS) because it can produce spurious associations. Random forest (RF) has been increasingly applied in GWAS data analysis because it handles high-dimensional genetic data well. RF produces importance measures for single nucleotide polymorphisms (SNPs), which are useful for feature selection. However, if PS is not appropriately corrected, RF tends to assign high importance to disease-unrelated SNPs whose allele or genotype frequencies differ among subpopulations, leading to inaccurate results.
METHODS: In this study, the authors propose to correct for the confounding effect of PS by incorporating PS information into the RF analysis. The correction procedure starts by extracting PS information from a large number of structure inference SNPs, using EIGENSTRAT or a multi-dimensional scaling clustering procedure. The phenotype and genotypes, adjusted for PS, are then used as the outcome and predictors in the RF analysis.
RESULTS: Extensive simulations indicate that the importance measure of the causal SNP increases following the PS correction. In an analysis of a real dataset, the proposed correction removes the spurious association between the lactase gene and height.
CONCLUSION: The authors propose a simple method to correct for PS in RF analysis on GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.
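The adjustment step described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that PS axes are obtained by a plain principal component analysis of standardized structure inference SNPs (a stand-in for EIGENSTRAT), and that "adjusted" means taking residuals from a linear regression on those axes. The function names `top_axes` and `residualize`, the simulated two-subpopulation data, and all parameter values are hypothetical. The RF step itself is omitted.

```python
import numpy as np

def top_axes(structure_genotypes, n_axes):
    """Leading principal axes of a samples x SNPs matrix (PCA via SVD).

    Assumption: a simple stand-in for EIGENSTRAT-style axes of variation.
    """
    g = structure_genotypes - structure_genotypes.mean(axis=0)
    sd = g.std(axis=0)
    keep = sd > 0                      # drop monomorphic SNPs before scaling
    g = g[:, keep] / sd[keep]
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_axes] * s[:n_axes]  # per-sample scores on each axis

def residualize(values, axes):
    """Residuals of `values` (vector or matrix) after regression on `axes`."""
    design = np.column_stack([np.ones(len(axes)), axes])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

# Hypothetical simulated data: two subpopulations of 500 samples each.
rng = np.random.default_rng(0)
pop = np.repeat([0, 1], 500)

# Structure inference SNPs whose allele frequencies drift between groups.
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, size=500), 0.05, 0.95)
freqs = np.where(pop[:, None] == 0, freq_a, freq_b)
structure_snps = rng.binomial(2, freqs).astype(float)

# A disease-unrelated SNP with different allele frequencies per group,
# and a phenotype that depends only on group membership.
null_snp = rng.binomial(2, np.where(pop == 0, 0.2, 0.7)).astype(float)
phenotype = 1.0 * pop + rng.normal(0, 1, size=pop.size)

axes = top_axes(structure_snps, n_axes=2)
adj_phenotype = residualize(phenotype, axes)
adj_snp = residualize(null_snp, axes)

# Correlation of the unrelated SNP with the phenotype, before and after.
raw_corr = np.corrcoef(null_snp, phenotype)[0, 1]
adj_corr = np.corrcoef(adj_snp, adj_phenotype)[0, 1]
print(f"raw: {raw_corr:.3f}  adjusted: {adj_corr:.3f}")
```

In this simulation the unrelated SNP is correlated with the phenotype only through subpopulation membership, and the correlation shrinks toward zero once both are residualized. Under the same assumptions, the adjusted phenotype and adjusted genotype matrix could then be passed as outcome and predictors to any RF regression implementation, for example scikit-learn's `RandomForestRegressor`.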
Yang Zhao; Feng Chen; Rihong Zhai; Xihong Lin; Zhaoxi Wang; Li Su; David C Christiani
Publication Detail:
Type:  Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't     Date:  2012-11-12
Journal Detail:
Title:  International journal of epidemiology     Volume:  41     ISSN:  1464-3685     ISO Abbreviation:  Int J Epidemiol     Publication Date:  2012 Dec 
Date Detail:
Created Date:  2013-01-03     Completed Date:  2013-06-10     Revised Date:  2013-12-04    
Medline Journal Info:
Nlm Unique ID:  7802871     Medline TA:  Int J Epidemiol     Country:  England    
Other Details:
Languages:  eng     Pagination:  1798-806     Citation Subset:  IM    
MeSH Terms
Confounding Factors (Epidemiology)
Data Interpretation, Statistical*
Genetics, Population / methods*
Genome-Wide Association Study / methods*
Models, Statistical
Polymorphism, Single Nucleotide*
From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine
