Stochastic model search with binary outcomes for genomewide association studies.  
Jump to Full Text  
MedLine Citation:

PMID: 22534080 Owner: NLM Status: Publisher 
Abstract/OtherAbstract:

ObjectiveThe spread of casecontrol genomewide association studies (GWASs) has stimulated the development of new variable selection methods and predictive models. We introduce a novel Bayesian model search algorithm, Binary Outcome Stochastic Search (BOSS), which addresses the model selection problem when the number of predictors far exceeds the number of binary responses.Materials and methodsOur method is based on a latent variable model that links the observed outcomes to the underlying genetic variables. A Markov Chain Monte Carlo approach is used for model search and to evaluate the posterior probability of each predictor.ResultsBOSS is compared with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulated benchmark. Two real case studies are also investigated: a GWAS on the genetic bases of longevity, and the type 2 diabetes study from the Wellcome Trust Case Control Consortium. Simulations show that BOSS achieves higher precisions than the reference methods while preserving good recall rates. In both experimental studies, BOSS successfully detects genetic polymorphisms previously reported to be associated with the analyzed phenotypes.DiscussionBOSS outperforms the other methods in terms of Fmeasure on simulated data. In the two real studies, BOSS successfully detects biologically relevant features, some of which are missed by univariate analysis and the three reference techniques.ConclusionThe proposed algorithm is an advance in the methodology for model selection with a large number of features. Our simulated and experimental results showed that BOSS proves effective in detecting relevant markers while providing a parsimonious model. 
Authors:

Alberto Russu; Alberto Malovini; Annibale A Puca; Riccardo Bellazzi 
Publication Detail:

Type: JOURNAL ARTICLE Date: 2012425 
Journal Detail:

Title: Journal of the American Medical Informatics Association : JAMIA Volume:  ISSN: 1527974X ISO Abbreviation:  Publication Date: 2012 Apr 
Date Detail:

Created Date: 2012426 Completed Date:  Revised Date:  
Medline Journal Info:

Nlm Unique ID: 9430800 Medline TA: J Am Med Inform Assoc Country:  
Other Details:

Languages: ENG Pagination:  Citation Subset:  
Affiliation:

Department of Industrial and Information Engineering, University of Pavia, Pavia, Italy. 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Full Text  
Journal Information Journal ID (nlmta): J Am Med Inform Assoc Journal ID (isoabbrev): J Am Med Inform Assoc Journal ID (publisherid): jamia Journal ID (hwp): amiajnl ISSN: 10675027 ISSN: 1527974X Publisher: BMJ Group, BMA House, Tavistock Square, London, WC1H 9JR 
Article Information Download PDF © 2012, Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rightslicensing/permissions. openaccess: Received Day: 1 Month: 12 Year: 2011 Accepted Day: 29 Month: 3 Year: 2012 Print publication date: Month: 6 Year: 2012 pmcrelease publication date: Month: 6 Year: 2012 Volume: 19 Issue: e1 First Page: e13 Last Page: e20 ID: 3392850 PubMed Id: 22534080 Publisher Id: amiajnl2011000741 DOI: 10.1136/amiajnl2011000741 
Stochastic model search with binary outcomes for genomewide association studies  
Alberto Russu1  
Alberto Malovini12  
Annibale A Puca34  
Riccardo Bellazzi1  
1Department of Industrial and Information Engineering, University of Pavia, Pavia, Italy 

2IRCCS Salvatore Maugeri, Pavia, Italy 

3IRCCS Multimedica, Milan, Italy 

4Istituto di Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, Segrate, Milan, Italy 

Correspondence: Correspondence to Alberto Russu, Laboratory for Biomedical Informatics ‘Mario Stefanelli’, Department of Industrial and Information Engineering, University of Pavia, Via A. Ferrata, 1, 27100 Pavia, Italy; alberto.russu@yahoo.it 
Many of the advances in genomewide association studies (GWASs) are based on the analysis of case–control data. The goal of such studies is usually to correlate a phenotype, often coded as a binary variable, with genetic markers such as single nucleotide polymorphisms (SNPs). The outcomes of GWASs are increasingly integrated with nextgeneration sequencing results for SNP validation and disease marker extraction. Therefore, GWASs are powerful tools for the identification of interesting regions at the genomewide level for further investigation by highdefinition procedures.^{1}
Regrettably, achieving the promising goals mentioned above is hindered by the technical difficulties accompanying GWASs. Recent genotyping technologies allow the extraction of millions of genetic measurements, but the number of subjects is smaller by several orders of magnitude. In the literature, this is known as a ‘p>n’ problem. The key task is to reduce the dimensionality of the problem by identifying the optimal subset of predictors that are informative with respect to the outcome variable. Widely used techniques such as univariate or stepwise regression, are recognized as insufficient to fully address the above issues. Advanced multivariate methods are therefore required in order to determine the biological truth.^{2–4}
Recently, a number of model search algorithms have been proposed, with particular emphasis on sparse linear models based on regularization methods^{5}^{, 6} and naiveBayes techniques.^{7} Furthermore, Bottolo and Richardson designed a new method for Bayesian model selection for linear regression with continuous outcomes.^{8} Unlike regularization analysis, this method performs a model search by sampling the predictors on the basis of a general and efficient Markov Chain Monte Carlo (MCMC) technique that exploits the conjugacy structure of data and parameters.
In the case of binary outcomes, however, easy analytical tractability is only partially possible, mainly because of the nonlinear link between the observed variables and the underlying predictors. Although a detailed Bayesian analysis for binary and categorical responses is available,^{9} model search techniques for binary data have only been explored in a limited number of cases.^{10–12}
In the present paper, we propose an innovative stochastic model search technique for binary outcomes. The relationship between the observed responses and the available predictors is described by a latent variable model with a probit link,^{9} rather than resorting to Gaussian approximations.^{12} Prior distributions are assigned both to the regression coefficients and the model size, therefore allowing the user to specify a prior belief on the model complexity. A computationally efficient MetropolisHastings sampling algorithm^{8} is adopted. Our method extensively explores the model space to identify the important predictors.
Denote with Y the vector of the observed individual binary responses Y_{i}, where i=1…n and n is the number of subjects. The Y_{i} are assumed to be the results of independent Bernoulli trials with success probability π_{i}. Let x_{i}=(x_{i}_{1}, …, x_{ip})^{T} be the covariate vector associated with response Y_{i}, where p is the total number of predictors. Additionally, let β=(β_{1}, …, β_{p})^{T} be the vector of regression coefficients for the p predictors.
The relationship between observed binary responses and covariates is usually modeled with a link function g, which expresses the dependency of the success probabilities π_{i} on the linear regressor x_{i}^{T}β. Taking g as the standard normal distribution Φ yields the probit model:
(1)
πi=Φ(xiTβ)=∫−∞xiTβN(0,1)du 
In order to circumvent the above issue and obtain the posterior of β, one can resort to a data augmentation approach as described by Albert and Chib.^{9} The idea is to introduce n independent latent variables Z=(Z_{1}, …, Z_{n})^{T} such that each Z_{i} ∼ N(x_{i}^{T}β, 1), letting Y_{i}=0 if Z_{i}≤0 and Y_{i}=1 if Z_{i}>0. Equation 2 summarizes the latent variable model:
(2)
Yi={0Zi≤01Zi>0Z=Xβ+ɛ 
Of course, if the Z_{i} were known, and if β were assigned a multivariate normal prior, the posterior of β would be obtained through usual Bayesian linear regression results. Although the Z_{i} are actually not known, their distribution conditional on the data Y_{i} is a truncated normal with mean x_{i}^{T}β and unit variance, the side of truncation depending on the value of Y_{i}. One could therefore resort to Gibbs sampling to iteratively sample Z and β, thus obtaining the required posteriors.
The goal of variable selection is to model the dependency of Y (or equivalently Z) on a subset of the predictors x_{1}, …, x_{p}. However, there is uncertainty about which subset to use. In this work, we follow the approach of Bottolo and Richardson^{8} and introduce a latent binary vector γ=(γ_{1}, …, γ_{p})^{T} such that γ_{j}=0 if β_{j}=0 and γ_{j}=1 if β_{j}≠0. The model space is therefore given by the 2^{p} possible combinations of the indicator variables γ_{j}. Given γ, the Gaussian linear model in equation 2 can therefore be modified as:
(3)
Z=α1+Xγβγ+ɛ 
The intercept α is assigned a flat prior, f(α)∝1, whereas the prior distribution for β_{γ} is taken as a multivariate normal:
(4)
f(βγγ,σ2)=N(0,σ2Σγ) 
Moreover, the error variance σ^{2} is equal to 1 by definition.^{9} The matrix Σ_{γ} can be expressed as τI, where τ controls the degree of coefficient shrinkage and I is the identity matrix. Other specifications of the prior covariance matrix in equation 4 are covered by Bottolo and Richardson.^{8} The prior for the indicator vector γ is defined as in Kohn et al^{13}:
(5)
f(γ)=B(pγ+a,p−pγ+b)B(a,b) 
In order to obtain γ, observe that, if Z were known, one could write its associated marginal likelihood by integrating out α and β_{γ}:
(6)
f(Zγ)=∫f(Zγ,α,βγ)f(βγγ)f(α)dβγdα 
The model specification is completed by defining the full conditional for Z. As stated in the previous paragraph, conditional on Y_{i} and β_{γ}, the densities of Z_{i} are independent truncated normals:
(7)
f(ZiYi,βγ)={TN−(xi,γTβγ,1)Yi=0TN+(xi,γTβγ,1)Yi=1 
Equations 4–7 allow devising a sampling scheme to obtain the posterior density of the latent binary vector γ, which is the main goal of the model search procedure. Our algorithm is based on an efficient Fast Scan MetropolisHastings (FSMH) sampler introduced by Bottolo and Richardson^{8} and generalized here to account for binary responses.
The key idea of our algorithm is that, if Z is assumed known, the variable selection problem for binary outcomes reduces to one for continuous outcomes. In such a case, efficient algorithms are available to sample from the conditional density of γ.^{8}^{, 12} As shown by Bottolo and Richardson,^{8} sampling γ can be achieved through the FSMH sampler using equations 5 and 6. Once a sample of γ is available, a new value of Z is then sampled from its full conditional distribution and the process is iterated.
The sampling scheme, depicted in figure 1, is implemented as follows:
 Choose a random initialization for Z (positive or negative according to its corresponding Y_{i}, see equation 7);
 Sample γ through the FSMH sampler, using the prior model size in equation 5 and the marginal likelihood of Z in equation 6;
 Given Z and γ, sample β_{γ} through the standard Bayesian linear regression. Note that, given Z, such density does not depend on Y;
 Given β_{γ} and Y, sample Z from equation 7;
 Restart from step 2 until a predetermined number of iterations is reached.
The proposed approach has been applied to a first GWAS on the genetic bases of longevity,^{14} and a second GWAS aimed at the genetic dissection of the type 2 diabetes (T2D) trait.^{15} The latter dataset has been provided by the Wellcome Trust Case Control Consortium (WTCCC) after data access approval.
Individual and markerlevel data quality control, inbred patients, and genetic outliers removal have been described by Malovini et al^{14} and the WTCCC.^{15} A preliminary feature selection has been performed based on the results from univariate Pearson's χ^{2} tests with 2 degrees of freedom comparing genotype distributions between cases and controls. Only SNPs characterized by complete genotype data, passing the significance threshold (p<0.40 for the longevity dataset and p<0.01 for the T2D dataset) and featuring at least five counts per cell have been retained. Missing values in the T2D dataset were imputed to the modal value for each SNP. The above statistical analyses have been performed with PLINK.^{16} BOSS was implemented in MATLAB 7.5.0 R2007b^{17} together with the RANDRAW routine^{18} to sample from the truncated normal density.
A simulation study was performed to assess the performances of BOSS, stepwise regression (stepwisefit function in MATLAB^{17}), and two stateoftheart methods, logistic lasso and elastic net^{11} (package glmnet in R V.2.13.1^{19}). Our benchmark featured simulated scenarios of varying complexity, according to the following characteristics of the design matrix X:
 Uncorrelated or correlated columns (pairwise correlation of 0.5);
 Percentage of explained variance (ie, proportion of variance of Z explained by the predictors).
Therefore, we simulated 2×3=6 scenarios, labeled as follows:
 A_{90}: uncorrelated X, 90% explained variance;
 A_{70}: uncorrelated X, 70% explained variance;
 A_{50}: uncorrelated X, 50% explained variance;
 B_{90}: correlated X, 90% explained variance;
 B_{70}: correlated X, 70% explained variance;
 B_{50}: correlated X, 50% explained variance.
We set n=200 and p=300, so that p>n. Synthetic datasets were generated by simulating a design matrix X according to the characteristics described above, and a vector of independent normally distributed errors ɛ with zero mean and unit variance. Only the first 10 parameters of the vector β were set to a nonzero value, so as to yield 10 true predictors. The module of such parameters was set so as to obtain the given percentage of explained variance. The 10 true parameters were assigned alternating signs. The values of Z_{i} so generated were then binarized according to their sign to obtain Y_{i} (equation 2). Each dataset featured an approximately equal number of cases and controls. For each scenario, 50 simulated datasets were generated.
BOSS was run for 6000 iterations (first 1000 of burnin). The prior model size p_{γ} was specified by E[p_{γ}]=20 and Var[p_{γ}]=100, which elicited the values a=4.52 and b=63.26. Parameter τ was fixed to 1, as done by Bottolo and Richardson^{8} and Hans et al,^{12} except in scenarios A_{50} and B_{50} where a value of 0.1 proved more suitable. In principle, one could additionally sample τ to actually optimize the algorithm performances. The prior mean and variance were chosen so as to cover a reasonable range of plausible model sizes. The resulting prior density for p_{γ} is depicted in figure 2.
Logistic lasso and elastic net were run with the ‘onestandarderror rule’ to choose the penalty parameter.^{11} The elastic net method was executed with equal lasso and ridgeregression penalties. Stepwise regression was performed using the default settings of the stepwisefit function.
The goal of the simulated benchmark was to assess the capabilities of BOSS, logistic lasso, elastic net, and stepwise regression to capture the true predictors used to generate the data, while discarding useless predictors. For this purpose, we evaluated precision, recall, and Fmeasure relative to the number of predictors correctly/incorrectly identified as true/false. In the case of BOSS, a predictor was deemed ‘true’ if its marginal posterior probability of inclusion^{8}^{, 12} was at least 50%.
Results of the comparison are reported in figure 3 for uncorrelated predictors (A_{90}, A_{70}, and A_{50}) and in figure 4 for correlated predictors (B_{90}, B_{70}, and B_{50}). The two figures show the distribution of Fmeasure values obtained with BOSS, logistic lasso, elastic net, and stepwise regression in each scenario. Average numerical values are also reported in table 1 for precision and recall.
In most scenarios considered here, BOSS outperformed the three reference methods in terms of Fmeasure. Although BOSS achieved generally lower recall rates, it attained higher values of precision and a better overall tradeoff. Larger priors on the model size and a larger number of iterations did not substantially affect the results obtained with BOSS.
Interestingly, the performances of the four methods did not degrade if predictors were correlated. Overall, logistic lasso proved superior to the traditional stepwise regression and comparable to the elastic net.
Furthermore, we assessed the ability of the four methods to correctly predict newly observed binary responses that were not used for parameter estimation. For this purpose, 50 additional datasetsperscenario were simulated. The new outcomes were then predicted using the features selected by each method. Prediction performances were evaluated in terms of correctly/incorrectly predicted positive/negative responses. A new outcome was predicted as positive (y=1) if the corresponding probability, reconstructed with the selected features, was greater than a given threshold. We then evaluated specificity, sensitivity, and prediction accuracy for thresholds from 0.05 to 0.95, with steps of 0.05. In order to obtain summary statistics, we computed the area under the receiver operating characteristic (ROC AUC) using mean specificity and sensitivity for a given threshold. Additionally, we calculated the average prediction accuracy across all thresholds.
Table 2 shows the results of our predictive analysis. In most cases, the AUC obtained with BOSS is only slightly greater than with the other three methods. In terms of accuracy, however, BOSS achieves a substantial improvement.
This section reports the results obtained by BOSS and the three reference methods on the two experimental GWASs.
A total of 410 long aged individuals (cases), 553 average living controls and 290 364 autosomal SNPs passing data quality control underwent a feature selection preprocessing phase. A subset of 5707 SNPs, selected according to the inclusion criteria reported in the Materials and methods section, were then considered for subsequent analyses. BOSS was run for 20 000 iterations (first 5000 of burnin; running time of about 4.5 days). The prior model size p_{γ} was specified by E[p_{γ}]=20 and Var[p_{γ}]=100, so as to encompass a range of plausible model sizes without being overly restrictive.
Our analysis allowed the identification of several relevant markers. In particular, the intergenic SNP rs2147556 achieved a 100% probability of inclusion, while rs10491334, mapping to CAMK4 and previously identified as strongly associated,^{14} achieved a probability of about 76%. Moreover, BOSS was able to identify rs522796 (KL gene) as mildly associated with the outcome (probability of 14%). This SNP was reported to be related to nondiabetic endstage renal disease^{20} and preterm birth.^{21} SNP rs800292, mapping to CFH and known to be associated with agerelated macular degeneration,^{22} was also identified, although with a relatively low probability of inclusion (10%). Posterior probabilities of SNPs obtained with BOSS are shown in figure 5. We limited the plot to the top 10 SNPs in order to summarize the most relevant findings only.
In order to further evaluate the results obtained by the proposed model search technique, we additionally fitted a multivariate logistic regression model on the top 10 predictors of figure 5. Results are summarized in table 3.
It is interesting to observe that the minor allele of rs800292 has a predisposing additive impact on longevity (OR=1.53): this is in accordance with the results obtained by Yang et al, who showed a significant association of rs800292 with a reduced risk for exudative agerelated macular degeneration.^{22}
As a further step, we ran logistic lasso, elastic net, and stepwise regression on the longevity data to assess possible differences with the results obtained by BOSS. The three algorithms were run with the same settings described in the Simulated benchmark section of the Results section. In our analysis, the logistic lasso included most of the features reported in figure 5, but failed to detect rs800292. The elastic net yielded a more parsimonious model, which included four SNPs (rs2147556, rs10491334, rs621011, and rs522796). Stepwise regression selected a large number of features (891): all SNPs in figure 5 were included except rs522796.
A total of 1924 T2D patients, 1458 UK Blood Service (UKBS) controls and 456 856 autosomal SNPs (mapping to chromosomes 1–22) passing data quality control^{15} underwent a features selection phase (see Materials and methods section). A total of 4102 markers were then considered for subsequent analyses. We ran BOSS for 700 iterations (first 200 of burnin; running time of about 1 day). Increasing the number of iterations did not yield appreciably different results. The prior model size p_{γ} was specified by E[p_{γ}]=4 and Var[p_{γ}]=4, in order to allow BOSS identify only the most relevant genetic features.
Our analysis detected several SNPs mapping to gene regions which have been previously associated with metabolic syndromes or other disease classes. In particular, the top ranked SNP rs7193144 is an intronic marker mapping to FTO, a gene that has been already associated with T2D and obesity conditions in several European cohorts.^{15}^{, 23–25} Moreover, rs4132670 and rs10885409 map to TCF7L2, previously reported to be associated with T2D.^{15}^{, 26}^{, 27} Last, rs11693602 is localized within the RBMS1 gene, which has been associated with T2D by Qi et al.^{28} Posterior probabilities of SNPs obtained with BOSS are shown in figure 6. We limited the plot to the top 20 SNPs in order to summarize the most relevant findings only.
Additionally, we performed further model evaluation by fitting a multivariate logistic regression on the genetic markers reported in figure 6. Results are reported in table 4.
We also performed logistic lasso, elastic net, and stepwise regression on the T2D data. All three algorithms identified more than 600 features each. However, stepwise regression and logistic lasso detected none of the four SNPs introduced above. The elastic net missed rs10885409, but captured rs7193144, rs11693602, and rs4132670.
The results obtained offer an interesting insight into our new model search technique and its performances relative to three established methods for feature selection.
In the simulated benchmark, stepwise regression captured many features, including important ones, although at the cost of reduced precision. Logistic lasso performed better than stepwise regression and comparably to the elastic net, although both techniques attained relatively low precisions (0.26–0.59; see table 1). By contrast, as far as this simulation study is concerned, BOSS was able to discover true signals even in difficult scenarios with as low as 50% of explained variance. Note that, even if the prior model size allowed for possibly large models (up to about 60 predictors, see figure 2), BOSS was able to identify a small subset of key features that correlated well with the data, while discarding unimportant predictors. Interestingly, BOSS compared favorably with the reference methods in terms of prediction performances. Our analysis suggests that the greater ability of BOSS to detect the true predictors entails increased prediction accuracy on newly observed data.
It is worth remarking that, in principle, BOSS is also capable of handling interactions between features. Although this was not extensively explored, a preliminary simulated analysis (similar to the A_{90} scenario) showed that BOSS successfully captured both the main and interaction effects between pairs of features.
In the longevity study, BOSS effectively accomplished variable selection with many possible predictors (about 5700 SNPs) and a smaller number of observations (n=963). BOSS successfully identified relevant genetic markers and yielded a parsimonious model, which can be used for predictive purposes. In particular, rs10491334 has been previously associated with longevity in the same population,^{14} while rs800292 has been previously associated with agerelated macular degeneration in a cohort of Han Chinese individuals.^{22} Moreover, rs522796 has been reported both in relation to nondiabetic endstage renal disease^{20} and preterm birth.^{21}
The results from logistic lasso, elastic net, and stepwise regression obtained on the longevity data were comparable, although neither the lasso nor the elastic net were able to detect rs800292. Stepwise regression yielded an overly parametrized model, with almost as many features (891) as subjects (963), but missed the association of rs522796.
In the T2D study, BOSS identified SNPs mapping to genes known to be associated with diabetes or other metabolic syndromes.^{15}^{, 23–28} Further, BOSS identified functionally relevant variants that would have been discarded by standard univariate association tests: for example, the top SNP detected by BOSS, rs7193144, achieved a univariate p value of 2.61×10^{−5} (χ^{2} distribution with 2 degrees of freedom), which is greater than usual significance thresholds, for example, 1×10^{−5} or less. A direct comparison of our findings with those reported by the WTCCC^{15} is only partially feasible, since our research group has no access to the WTCCC 1958 British Birth Cohort, which was included in the original analyses as the control group. Locusbased analyses will allow identification of structural correlations and/or potential interactions between the selected SNPs and other functionally relevant unobserved variants.
Interestingly, stepwise regression and logistic lasso were able to detect a large set of features, but failed to identify the four SNPs reported in the Results section. This may be explained by convergence to a local optimum for the stepwise regression, and excessive coefficient shrinkage for the lasso. The elastic net did not detect rs10885409 but found the other three markers described above.
The computational efficiency of our approach deserves one last remark. It is a known in the literature that advanced samplingbased techniques for model selection may be computationally demanding (see, for example, O'Hara and Sillanpää^{29}), especially if compared to lassobased methods.^{11} In the two experimental studies, we preliminarily filtered the genetic features to approximately the same number (about 4000 and 6000), so as to demonstrate the effectiveness of our approach while keeping the computational complexity reasonable. Nonetheless, our results showed that the detection of relevant makers was not hindered. Of note, variational Bayes approaches (see, for example, Girolami et al^{30}) could be adapted for model selection to further speed up the model search.
The algorithm presented in this paper represents a step forward in the statistical methodology for variable selection. Binary outcomes have traditionally received less attention than continuous outcomes, mainly because of their lower analytical tractability. Moreover, the ‘p>n’ problem requires advanced techniques that allow discovery of important features without yielding overparametrized models. We tackled such challenges using a Bayesian paradigm and developed a general MCMC framework for variable selection in the presence of binary responses. The exact formulation of the likelihood paves the way to generalizations such as multinomial responses that cannot be handled simply by Gaussian approximation. The results obtained from our simulated benchmark and the two experimental case studies suggest that BOSS is an effective and reliable tool for model selection when the number of features is very large.
Notes
Contributors: AR and RB conceived the main idea of the paper. AR developed the algorithm and the software implementation. AM performed the statistical analyses. All authors contributed to writing and reviewing the manuscript.
Funding:This research was supported by the European Union's Seventh Framework Programme (FP7/20072013) for the Innovative Medicine Initiative under grant agreement no. IMI/115006 (the SUMMIT consortium). This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award076113.
Competing interests: None.
Provenance and peer review: Not commissioned; externally peer reviewed.
We gratefully acknowledge Professor Antonietta Mira for insightful discussions. We also thank the anonymous reviewers and the Associate Editor for their suggestions, which helped improve the clarity and accuracy of the manuscript.
References
1.  Davey JW,Hohenlohe PA,Etter PD,et al. Genomewide genetic marker discovery and genotyping using nextgeneration sequencing. Nat Rev GenetYear: 2011;12:499–51021681211 
2.  George EI,McCulloch RE. Variable selection via Gibbs sampling. J Am Stat AssocYear: 1993;88:881–9 
3.  Fridley B. Bayesian variable and model selection methods for genetic association studies. Genet EpidemiolYear: 2009;33:27–3718618760 
4.  Alekseyenko AV,Lytkin NI,Ai J,et al. Causal graphbased analysis of genomewide association data in rheumatoid arthritis. Biol DirectYear: 2011;6:2521592391 
5.  Agakov FV,McKeigue P,Krohn J,et al. Sparse instrumental variables (SPIV) for genomewide studies. 24th Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada, December Year: 2010 
6.  Li J,Das K,Fu G,et al. The Bayesian lasso for genomewide association studies. BioinformaticsYear: 2011;27:516–2321156729 
7.  Wei W,Visweswaran S,Cooper GF. The application of naive Bayes model averaging to predict Alzheimer's disease from genomewide data. J Am Med Inform AssocYear: 2011;18:370–521672907 
8.  Bottolo L,Richardson S. Evolutionary stochastic search for Bayesian model exploration. Bayesian AnalYear: 2010;5:583–618 
9.  Albert JH,Chib S. Bayesian analysis of binary and polychotomous response data. J Am Stat AssocYear: 1993;88:669–79 
10.  Wu TT,Chen YF,Hastie T,et al. Genomewide association analysis by lasso penalized logistic regression. BioinformaticsYear: 2009;25:714–2119176549 
11.  Friedman J,Hastie T,Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat SoftwYear: 2010;33:1–2220808728 
12.  Hans C,Dobra A,West M. Shotgun stochastic search for “large p” regression. J Am Stat AssocYear: 2007;102:507–17 
13.  Kohn R,Smith M,Chan D. Nonparametric regression using linear combinations of basis functions. Stat ComputYear: 2001;11:313–22 
14.  Malovini A,Illario M,Iaccarino G,et al. Association study on longliving individuals from southern Italy identifies rs10491334 in the CAMKIV gene that regulates survival proteins. Rejuvenation ResYear: 2011;14:283–9121612516 
15.  Wellcome Trust Case Control ConsortiumGenomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. NatureYear: 2007;447:661–7817554300 
16.  Purcell S,Neale B,ToddBrown K,et al. PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum GenetYear: 2007;81:559–7517701901 
17.  MATLAB Statistics ToolboxTM User's Guide. The MathWorks, Year: 2011 
18.  BarGuy A. Randraw. http://www.mathworks.com/matlabcentral/fileexchange/7309 (accessed 7 Jul 2011). 
19.  R Development Core TeamR: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, Year: 2011 ISBN 3900051070http://www.Rproject.org/ 
20.  Bostrom MA,Hicks PJ,Lu L,et al. Association of polymorphisms in the klotho gene with severity of nondiabetic ESRD in African Americans. Nephrol Dial TransplantYear: 2010;25:3348–5520466664 
21.  Velez DR,Fortunato SJ,Thorsen P,et al. Preterm birth in Caucasians is associated with coagulation and inflammation pathway gene variants. PLoS OneYear: 2008;3:e328318818748 
22.  Yang X,Hu J,Zhang J,et al. Polymorphisms in CFH, HTRA1 and CX3CR1 confer risk to exudative agerelated macular degeneration in Han Chinese. Br J OphthalmolYear: 2010;94:1211–1420538655 
23.  Scott LJ,Mohlke KL,Bonnycastle LL,et al. A genomewide association study of type 2 diabetes in Finns detects multiple susceptibility variants. ScienceYear: 2007;316:1341–517463248 
24.  Frayling TM,Timpson NJ,Weedon MN,et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. ScienceYear: 2007;316:889–9417434869 
25.  Dina C,Meyre D,Gallina S,et al. Variation in FTO contributes to childhood obesity and severe adult obesity. Nat GenetYear: 2007;39:724–617496892 
26.  Florez JC,Jablonski KA,Bayley N,et al. TCF7L2 polymorphisms and progression to diabetes in the Diabetes Prevention Program. N Engl J MedYear: 2006;355:241–5016855264 
27.  Saxena R,Gianniny L,Burtt NP,et al. Common single nucleotide polymorphisms in TCF7L2 are reproducibly associated with type 2 diabetes and reduce the insulin response to glucose in nondiabetic individuals. DiabetesYear: 2006;55:2890–517003358 
28.  Qi L,Cornelis MC,Kraft P,et al. Genetic variants at 2q24 are associated with susceptibility to type 2 diabetes. Hum Mol GenetYear: 2010;19:2706–1520418489 
29.  O'Hara RB,Sillanpää MJ. A review of Bayesian variable selection methods: what, how and which. Bayesian AnalYear: 2009;4:85–118 
30.  Girolami M,Rogers S. Variational Bayesian multinomial probit regression with Gaussian Process priors. Neural ComputYear: 2006;18:1790–817 
Figures
Tables
Comparison of precision, recall, and Fmeasure obtained with stepwise regression (SW), elastic net (EN), logistic lasso (LL) and BOSS
Scenario  Average precision  Average recall  Average Fmeasure  
SW  EN  LL  BOSS  SW  EN  LL  BOSS  SW  EN  LL  BOSS  
A_{90}  0.29  0.26  0.33  0.93  1  0.99  0.99  1  0.44  0.40  0.49  0.96 
A_{70}  0.28  0.35  0.45  0.78  0.93  0.89  0.89  0.80  0.42  0.48  0.57  0.75 
A_{50}  0.23  0.40  0.52  0.68  0.73  0.54  0.50  0.46  0.34  0.44  0.46  0.53 
B_{90}  0.35  0.28  0.39  0.95  1  1  1  0.98  0.52  0.43  0.56  0.96 
B_{70}  0.31  0.35  0.44  0.78  0.82  0.95  0.93  0.77  0.45  0.50  0.59  0.76 
B_{50}  0.26  0.48  0.59  0.83  0.63  0.68  0.61  0.47  0.36  0.49  0.52  0.58 
Each value is the average across the 50 datasets simulated in each scenario. In terms of Fmeasure, BOSS performed better than the three reference methods (p<0.01, onesided paired t test).
Comparison of prediction performances on new data using the selected features
Scenario  ROC AUC  Average accuracy  
SW  EN  LL  BOSS  SW  EN  LL  BOSS  
A_{90}  0.86  0.92  0.93  0.96  0.60  0.66  0.71  0.85 
A_{70}  0.78  0.82*  0.82*  0.82  0.59  0.60  0.61  0.70 
A_{50}  0.64  0.65  0.66  0.67  0.54  0.53  0.53  0.57 
B_{90}  0.87  0.92  0.93  0.94  0.60  0.68  0.71  0.80 
B_{70}  0.70  0.77*  0.78*  0.77  0.54  0.57  0.58  0.64 
B_{50}  0.64  0.68*  0.67*  0.68  0.54  0.54  0.54  0.56 
Except where indicated (*), BOSS improved on the three reference methods (p<0.05, onesided paired t test).
EN, elastic net; LL, logistic lasso; ROC AUC, area under the receiver operating characteristic; SW, stepwise regression.
Summary of results from logistic regression on the top 10 predictors identified by BOSS in the longevity dataset
SNP  Chr  Position (bp)  Gene  OR (95% CI)  p Value 
rs2147556  13  72010472  0.58 (0.46 to 0.72)  1.8×10^{−6}  
rs10491334  5  110800303  CAMK4  0.58 (0.45 to 0.76)  5.3×10^{−5} 
rs621011  11  78294837  1.52 (1.24 to 1.87)  7.4×10^{−5}  
rs1570383  6  126241555  0.61 (0.47 to 0.78)  1.1×10^{−4}  
rs1158408  4  67799134  1.55 (1.21 to 1.99)  5.0×10^{−4}  
rs9613094  22  24788388  0.58 (0.44 to 0.75)  6.0×10^{−5}  
rs522796  13  32528055  KL  1.36 (1.12 to 1.67)  2.5×10^{−3} 
rs4869566  5  38165184  0.67 (0.54 to 0.84)  4.5×10^{−4}  
rs800292  1  194908856  CFH  1.53 (1.19 to 1.97)  9.7×10^{−4} 
rs7567687  2  129750796  0.69 (0.56 to 0.85)  4.1×10^{−4} 
Chr, chromosome; Position (bp), physical position on the chromosome expressed in terms of base pairs; SNP, single nucleotide polymorphism.
Summary of results from logistic regression on the top 20 predictors identified by BOSS in the T2D dataset
SNP  Chr  Position (bp)  Gene  OR (95% CI)  p Value 
rs7193144  16  53810686  FTO  1.25 (1.13 to 1.39)  2.4×10^{−5} 
rs7115136  11  98935203  0.79 (0.71 to 0.89)  4.1×10^{−5}  
rs540532  2  30954621  1.38 (1.20 to 1.59)  8.3×10^{−6}  
rs1426409  4  37173944  0.78 (0.69 to 0.88)  3.6×10^{−5}  
rs2582753  2  191216109  1.34 (1.18 to 1.52)  4.7×10^{−6}  
rs9874888  3  60016317  0.76 (0.67 to 0.85)  5.3×10^{−6}  
rs11693602  2  161224658  RBMS1  0.74 (0.65 to 0.84)  4.0×10^{−6} 
rs4788837  17  72466219  0.78 (0.70 to 0.88)  2.2×10^{−5}  
rs228787  17  42099488  1.36 (1.18 to 1.57)  1.7×10^{−5}  
rs11178602  12  71491505  1.18 (1.04 to 1.34)  1.1×10^{−2}  
rs1981496  14  106645846  0.71 (0.60 to 0.84)  1.1×10^{−4}  
rs13244625  7  2206690  1.30 (1.13 to 1.49)  1.5×10^{−4}  
rs1208611  10  38160499  1.56 (1.29 to 1.89)  6.5×10^{−6}  
rs4132670  10  114767771  TCF7L2  1.36 (1.16 to 1.59)  2.0×10^{−4} 
rs1708558  6  67049406  0.75 (0.65 to 0.87)  1.6×10^{−4}  
rs10885409  10  114808072  TCF7L2  0.90 (0.77 to 1.04)  1.6×10^{−1} 
rs4760786  12  71446789  0.86 (0.75 to 0.98)  2.5×10^{−2}  
rs2154490  21  30915962  1.26 (1.12 to 1.43)  1.7×10^{−4}  
rs11688935  2  189175905  1.26 (1.13 to 1.41)  4.4×10^{−5}  
rs2239930  17  10558733  1.26 (1.12 to 1.43)  2.0×10^{−4} 
Chr, chromosome; Position (bp), physical position on the chromosome expressed in terms of base pairs; SNP, single nucleotide polymorphism.
Article Categories:
Keywords: Feature selection, stochastic model search, binary outcomes, GWAS, pharmacokinetics, pharmacodynamics, population models, bayesian modelling, genetics, data mining, integration of clinical and molecular data. 
Previous Document: Clustering of maternal/fetal clinical conditions and outcomes and placental lesions.
Next Document: Evaluating alert fatigue over time to EHRbased clinical trial alerts: findings from a randomized co...