|iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data.|
|Jump to Full Text|
|PMID: 23142963 Owner: NLM Status: MEDLINE|
|MOTIVATION: Analyzing data from multi-platform genomics experiments combined with patients' clinical outcomes helps us understand the complex biological processes that characterize a disease, as well as how these processes relate to the development of the disease. Current data integration approaches are limited in that they do not consider the fundamental biological relationships that exist among the data obtained from different platforms. Statistical Model: We propose an integrative Bayesian analysis of genomics data (iBAG) framework for identifying important genes/biomarkers that are associated with clinical outcome. This framework uses hierarchical modeling to combine the data obtained from multiple platforms into one model.
RESULTS: We assess the performance of our methods using several synthetic and real examples. Simulations show our integrative methods to have higher power to detect disease-related genes than non-integrative methods. Using the Cancer Genome Atlas glioblastoma dataset, we apply the iBAG model to integrate gene expression and methylation data to study their associations with patient survival. Our proposed method discovers multiple methylation-regulated genes that are related to patient survival, most of which have important biological functions in other diseases but have not been previously studied in glioblastoma.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|Wenting Wang; Veerabhadran Baladandayuthapani; Jeffrey S Morris; Bradley M Broom; Ganiraju Manyam; Kim-Anh Do|
Related Documents :
|23312993 - Contribution of domestic production records, interbull estimated breeding values, and s...
3758043 - On a model of human bioenergetics. ii. maximal power and endurance.
22847713 - Three approaches to investigating functional compromise to the default mode network aft...
|Type: Journal Article; Research Support, N.I.H., Extramural; Research Support, U.S. Gov't, Non-P.H.S. Date: 2012-11-09|
|Title: Bioinformatics (Oxford, England) Volume: 29 ISSN: 1367-4811 ISO Abbreviation: Bioinformatics Publication Date: 2013 Jan|
|Created Date: 2013-01-17 Completed Date: 2013-08-05 Revised Date: 2013-09-24|
Medline Journal Info:
|Nlm Unique ID: 9808944 Medline TA: Bioinformatics Country: England|
|Languages: eng Pagination: 149-59 Citation Subset: IM|
|Department of Biostatistics, The University of Texas, MD Anderson Cancer Center, Houston, TX 77030, USA.|
|APA/MLA Format Download EndNote Download BibTex|
Brain Neoplasms / genetics*, metabolism, mortality*
Gene Expression Profiling
Genomics / methods
Glioblastoma / genetics*, metabolism, mortality*
|P30 CA016672/CA/NCI NIH HHS; P30 CA016672/CA/NCI NIH HHS; P50 CA116199/CA/NCI NIH HHS; P50 CA127001 03/CA/NCI NIH HHS; P50 CA14038802/CA/NCI NIH HHS; R01 CA107304/CA/NCI NIH HHS; R01 CA160736/CA/NCI NIH HHS; R01 CA160736/CA/NCI NIH HHS|
Journal ID (nlm-ta): Bioinformatics
Journal ID (iso-abbrev): Bioinformatics
Journal ID (publisher-id): bioinformatics
Journal ID (hwp): bioinfo
Publisher: Oxford University Press
© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: firstname.lastname@example.org
Received Day: 25 Month: 5 Year: 2012
Revision Received Day: 11 Month: 10 Year: 2012
Accepted Day: 31 Month: 10 Year: 2012
Print publication date: Day: 15 Month: 1 Year: 2013
Electronic publication date: Day: 9 Month: 11 Year: 2012
pmc-release publication date: Day: 9 Month: 11 Year: 2012
Volume: 29 Issue: 2
First Page: 149 Last Page: 159
PubMed Id: 23142963
Publisher Id: bts655
|iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data|
|Jeffrey S. Morris1|
|Bradley M. Broom2|
1Department of Biostatistics and 2Department of
Bioinformatics and Computational Biology, The University of Texas, MD Anderson Cancer
Center, Houston, TX 77030, USA
|Correspondence: *To whom correspondence should be addressed.
Associate Editor: Martin Bishop
The overarching goal of cancer genomics is to customize patient care decisions according to diverse genetic and epigenetic alterations for a tumor (Chin et al., 2011; Vogelstein and Kinzler, 1993; Weir et al., 2004). Early cancer genomics studies focused on only a single type of alteration at a time to assess these changes, e.g. high-resolution copy number profiling led to the discovery of novel oncogenes in ovarian cancer (Nanjundan et al., 2007), melanoma (Scott et al., 2009) and lung carcinoma (Bass et al., 2009). Some of these findings have already been translated into personalized cancer treatment, such as imatinib for KIT-mutated gastrointestinal stromal tumors (Handolias et al., 2010) and trastuzumab for HER2-positive breast tumors (Pegram and Slamon, 2000).
As technologies to perform comprehensive profiling of the cancer genome have progressed, different technology platforms, from basic capillary electrophoresis sequencing to advanced forms of microarrays, have been brought together on the same patient set. For example, the Cancer Genome Atlas (TCGA) is a worldwide research program that currently encompasses comprehensive genomic datasets for >20 types of cancer (http://cancergemone.nih.gov; Hudson et al., 2010). The work of TCGA is motivating approaches for integrating data outputs from different types of technology platforms to identify important biomarkers related to cancer development and progression. The key hypothesis behind these approaches is that cancer consists of hundreds of distinct molecular changes, from multiple types of genetic and epigenetic alterations to the interactions among them. Each type of alteration provides a different and complementary view of the whole genome. Hence, integrating multiple aspects of the genome and the underlying biological processes to identify novel targets is essential and has the potential to improve the clinical management of cancer.
The concept of integration is very broad (see review by Hamid et al., 2009). Such integration studies can be divided into three general groups according to the primary focus of the study (Daemen et al., 2009). The focus of the first group, called sequential integration studies, is the sequential analysis of heterogeneous data from different platforms for the purpose of understanding the biological evolution of disease as opposed to predicting clinical outcome (Fridlyand et al., 2006; Qin, 2008; Tomioka et al., 2008). In this group, data obtained on one type of platform are analyzed along with the clinical outcome data, and then a second data platform is subsequently used to clarify or confirm the results obtained from the first platform. For example, Qin (2008) showed that microRNA expression can be used to sort tumors from normal tissues, regardless of tumor type. The study then analyzed the relationship between the candidate target genes for the cancer-related microRNAs and mRNA expression and disease status.
The focus of the second group of integration studies, which we call biological integration studies, is the analysis of biological pathways and regulatory mechanisms among data obtained from different platforms, such as the relationship between gene expression and protein abundances, or the relationship between gene expression and copy number changes in patient tumor samples (Karpenko and Dai, 2010; van Wieringen et al., 2012; Waters et al., 2006). The challenge for this group of studies is that the biological annotation databases used for mapping different datasets are inconsistent. An R-package to match array comparative genomic hybridization (CGH) and gene expression microarray features for integrative analysis purposes was provided by van Wieringen et al. (2012).
The focus of the third group of integration studies, which we term model-based integration studies, is the analysis of data obtained from multiple platforms that are combined into one statistical model to identify clinically relevant genes and/or to predict clinical outcome. Instead of merging datasets or analyzing them sequentially, the data from different platforms are treated equally, and the most relevant features are selected from all available platforms (Daemen et al., 2009; Lanckriet et al., 2004). For example, Daemen et al. (2009) proposed a kernel-based approach to integrate data from multiple platforms for the classification of discrete clinical outcomes. They showed that the area under curve (AUC) based on integrated data used for predictions was significantly improved compared with the AUC based on data from a single platform. However, these studies treated each platform independently and ignored the underlying biological mechanisms among different platforms. Witten and Tibshirani (2009) developed a supervised canonical correlation model to find significant axes of correlations between multiple multivariate datasets at a global (chromosomal) level. They integrated copy number and gene expression data and identified linear combinations (canonical variables) that are related to a clinical outcome. However, they also did not take the biological mechanisms (directionality) into account, as we detail later in the text.
Our proposed method takes a different approach in modeling biological relationships among molecular features measured by different platforms, by focusing on relationships at a ‘gene-centric’ level. We first study the underlying biological mechanisms, relating the data across the different platforms. Then using this information, we partition gene expression into different (independent) units and use this to identify genes relevant to clinical outcome as modulated by these different platforms. We hypothesize (and show) that, compared with non-integrative methods, our proposed method can detect clinically relevant gene expression changes with greater power and a lower false discovery rate (FDR), in addition to obtaining results that are more biologically interpretable.
Molecular biology has shown that features identified on different platforms influence clinical outcome at different levels. For example, in TCGA studies, copy number, methylation, mutation status, mRNA expression, microRNA expression and the expression of proteins in specific signaling pathways have been measured on the same set of samples. The fundamental biological relationships among the products of these different platforms and their associations with clinical outcome are shown in Figure 1. Generally speaking, molecular features measured at the transcript level (e.g. mRNA expression) affect clinical outcome more directly than molecular features measured at the DNA/epigenetics level (e.g. copy number, methylation and mutation status). Molecular features measured at the DNA level affect clinical outcome by influencing mRNA expression (Fabiani et al., 2010; Glinsky, 2006; de Tayrac et al., 2009). Similarly, microRNAs, post-transcriptional regulators that bind to complementary sequences on target mRNAs, influence mRNA through translational repression or target degradation, which then affects clinical outcome (Tseng et al., 2011).
Conducting the proposed integrative analysis is a challenge because of the complicated biological relationships and the different intrinsic structures of various platforms. For example, molecular features measured at the DNA level regulate the mRNA expressions of the corresponding genes or nearby genes (Peng et al., 2010). In contrast, microRNAs can regulate the mRNA expression of any gene, regardless of its locus, and each microRNA molecule has multiple target genes. Another challenge underlying this analysis is the large scale of the different types of gene alterations in contrast to the limited number of patient samples for such a study. Hence, an easily implemented and efficient variable selection method is needed for such an integration analysis.
We have developed the integrative Bayesian analysis of genomics data (iBAG) model to address these challenges. The main advantages of our proposed model can be summarized as follows. The iBAG model (i) uses a hierarchical approach to model the fundamental biological relationships underlying molecular features obtained by different platforms; (ii) accounts for both the influences of different platforms and the biological relationships among the platforms in one unified model to predict patients’ clinical outcomes; (iii) can conduct high-dimensional variable selection, which adapts to analyzing hundreds of distinct molecular entity effects jointly in one model; (iv) uses a Bayesian framework, which allows the model enough flexibility to estimate the different intrinsic structures of biological relationships for different high-throughput platforms; and (v) is computationally efficient and feasible owing to its closed forms of full conditional posterior distributions for posterior sampling.
The rest of this article is organized as follows. In Section 2, we describe the iBAG model construction along with prior formulations for continuous, discrete and survival clinical outcomes. In Section 3, we introduce an innovative approach for conducting high-dimensional variable selection using Bayesian FDRs. In Section 4, we illustrate the performance of the iBAG model and use simulations to compare its performance with those of alternative approaches. In Section 5, we apply the iBAG model to integrate gene expression and methylation data for TCGA’s glioblastoma study, and evaluate the associations between those data and patients’ survival times. Finally, we provide a summary and discussion in Section 6. The technical details and additional simulation results are contained in the Supplementary Material (Section S1).
For ease of interpretation and exposition, we illustrate our methodology using two platforms at a time—DNA methylation and gene expression data. Integration across more than two platforms can be done in an analogous manner, as discussed in Section 6.
Suppose the total number of patients is N. For the nth patient, our observed datum consists of—(i) Yn, the clinical outcome of interest [e.g. survival time, tumor(sub)type], (ii) , the measures of methylation levels for J probes/sites on the whole genome, (iii) , the measures of gene expression level for K genes, and (iv) , the values of L clinical (non-genomic) factors (e.g. tumor stage, age and other demographic variables). Hence, all of the observed datasets in our study can be denoted (in matrix notation) as .
We propose the following two-component hierarchical construction for our iBAG model: a mechanistic model to infer direct effects of methylation on gene expression, and a clinical model that uses this information to predict a clinical outcome. The first component of our model assesses the underlying biological relationship between methylation and gene expression. The expression level of a gene is affected primarily by the methylation sites in the promoter region and is usually lower when its promoter is highly methylated. However, methylation is only one of the many potential factors contributing to a change in gene expression level (as shown in Fig. 1). The mechanistic model regresses the measure of gene expression for the kth gene ( ) on the methylation measures obtained within the promoter of the kth gene. To match the methylation sites to a given gene, we use the annotation files for the platforms and use those methylation sites that are encompassed within the promoter region of a given gene—thus potentially allowing multiple methylation sites to map to a particular gene. The second component of our model assesses when the expression of a particular gene affects the clinical outcome, whether this effect is modulated through methylation and/or through some other mechanisms that are independent of methylation (e.g. microRNA, copy number effects).
The parameters of the mechanistic and clinical models have the following interpretations:
- , where denotes the part of the expression changes of the kth gene expression feature that is modulated through methylation (M).
- , where is an N × 1 vector that denotes the part of the expression changes of the kth gene expression feature that is modulated by mechanisms other than methylation (e.g. microRNA, copy number effects). We assume that follows a multivariate normal distribution with mean 0 and covariance matrix for .
- , where ωjk is the ‘gene-methylation’ effect that estimates the (conditional) effect of the jth methylation feature on the kth feature identified from the gene expression data.
- , where denotes the effect of the lth clinical factor on clinical outcome Y.
- , where estimates the effect of on Y, which can be interpreted as the effect of gene expression modulated by methylation for the kth feature identified from the gene expression data. We denote this partial effect of gene expression on clinical outcome as a type M effect.
- , where measures the effect of on Y, which can be interpreted as the effect of gene expression modulated by other sources for the kth feature identified from the gene expression data. We denote this partial gene expression effect on clinical outcome as a type effect.
- is the error term that accounts for variation not explained by the observed genomic and clinical factors and which is assumed to follow a normal distribution with a common standard deviation σ.
In essence, our mechanistic model divides the gene expression levels into two components—one modulated by methylation and the other independent of methylation —and uses both of these components (jointly) in the clinical model for the prediction of relevant outcomes. Figure 2 further exemplifies the architecture of the iBAG model for integrating data from two platforms. The formal directed acyclic graphical representation is given in Supplementary Fig. S1.
There are various univariate and multivariate approaches for fitting the iBAG model, as specified above, which require some variable selection and/or sparsity to regularize the ill-posed high-dimensional problem—as both the number of genes (K) and methylation features (J) are potentially of very high dimension (on the level of thousands) as compared with the sample size (N = hundreds), and most of the genes are expected to have very weak effects on clinical outcome. We use a Bayesian penalized regression approach that not only jointly models the mechanistic and clinical components in Equation (1) but also provides a natural approach for imposing sparsity and performing variable selection via hierarchical priors.
We denote our full parameter set as and specify our prior construction for each of these parameters. To model the main constructs of interest, , we use the Bayesian formulation of the lasso (Tibshirani, 1996), which serves a dual purpose. First, similar to the lasso regression with L1 penalty, it achieves sparsity (variable selection) via non-linear shrinkage of small/weak effects toward zero. This approach has proven to be useful in identifying genomic features with large effects on clinical outcomes in various genomic studies (Hoggart et al., 2008; Li et al., 2011). Second, and more importantly, as the complete conditionals are available in closed forms, the Bayesian formation of the lasso substantially aids our Bayesian computations for large genomic datasets such as those considered here. Specifically, we can write the double exponential (lasso) prior distribution as a scale mixture of a normal distribution with an exponential mixing density (Park and Casella, 2008), which allows us to use Gibbs sampling to draw the samples from the posterior distribution.
Thus our (conditional) Bayesian lasso prior for the type M effects ( ) can be written as, and σ is the standard deviation for the random error term . Similarly, we define a Bayesian lasso prior for , conditioned on the (different) shrinkage parameter and the (same) standard deviation σ.
For , which models the gene-methylation effects in the mechanistic model, we adopt the following strategy. When the number of features matching a given gene (promoter) is lower than the sample size (N) (e.g. methylation features), we assume that ωjk follows a normal distribution if is within the kth gene promoter and ωjk = 0, otherwise. In cases where the number of probes/features exceeds N for a particular gene (e.g. microRNA features), we allow for a Bayesian lasso prior, as described earlier in the text, to achieve regularization.
For , which models the effects of clinical factors, we simply assume that the prior of each is a multivariate normal distribution with mean 0 and large variance (e.g. on the order of 106 for a variable with standard deviation = 1). For the error variance (σ2 and ), we assume an improper prior . For other hyper parameters in the hierarchical model, we assume a gamma prior on and , with mean parameter αM, and scale parameter ξM, , respectively. In our applications, we assume the values of αM, , ξM and are all equal to 1.
Our complete iBAG model can be expressed hierarchically asdenotes the K dimensional multivariate normal distribution with mean u and covariance matrix Σ.
To conduct estimation and subsequent inference, we follow a fully Bayesian analysis of the iBAG model specified above using Markov chain Monte Carlo (MCMC) approaches (Casella and George, 1992). Specifically, we iteratively draw posterior samples from the full conditional distributions of the parameter sets, as specified below.
As all the above full conditional likelihoods are available in closed form, an efficient Gibbs sampler can be used to update our posterior distributions by drawing samples sequentially from full conditional distributions for each parameter set. See details in Supplementary Material (Section S1).
The construction of the iBAG model can be easily extended to model discrete and censored outcomes using latent variable formulations (Albert and Chib, 1993). Specifically, when Y is a binary variable taking values of 0 or 1 [e.g. tumor-(sub)type], we use a probit latent-variable formulation that preserves all the conjugate constructions. We let Z be the (unobserved) latent variable that relates to Y as follows,
Then conditionally (on Z) our iBAG model for discrete responses is the same as that shown in Equation (1), with Y in the clinical model replaced by Z and parameter representations and corresponding interpretations remaining exactly the same as those for continuous outcomes.
If the clinical outcome of interest is patient survival time (with censoring), we use the accelerated failure time model with a data augmentation approach (Tanner and Wong, 1987) to impute the censored values for this study. We let denote the survival time and denote the censoring status. Still, we let denote an unobserved latent variable. Given the latent variable Z for right-censored responses, the expression of the iBAG model is similar to Equation (1), changing the response variable Y to Z.
The relationship between clinical outcome (tn,δn) and the latent variable Zn can be expressed as
The full conditionals and the MCMC sampling schemes for discrete and survival responses are provided in the Supplementary Material (Section S2).
Our posterior sampling schemes for the iBAG model result in MCMC samples for all model parameters and, of specific interest to our study, the effects of gene expression levels on clinical outcomes modulated by and independent of methylation . One key issue is to summarize this information in the MCMC samples to conduct gene selection. Typical inferential approaches, such as selection based on posterior quantiles or the maximum a posteriori (MAP) via MCMC samples, suffer from two drawbacks. First, the Bayesian lasso has excellent shrinkage properties but does not conduct natural model/variable selection, as it does not set the effects exactly equal to 0 (owing to an absolutely continuous prior). Second, such inferential methods do not allow for the natural incorporation of FDR controls that are commonly used in high-dimensional settings (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003).
We propose an alternative approach to obtaining posterior probabilities to evaluate the significance of gene expression effects that facilitates efficient FDR-based inferences. Let denote the S MCMC posterior samples for the model parameters. When the clinical outcome Y is continuous, for each MCMC sample, we compute the (conditional) MAP estimate of , conditional on all other model parameters that can be obtained by minimizing the following objective/loss function:is the l-norm, and Y is the observed continuous outcome. When the clinical outcome is discrete or censored, the MAP estimate of can be obtained similarly by replacing Y in Equation (2) with the MCMC samples for the unobserved latent variable Z. Equation (2) is similar to the penalized objective function in the frequentist lasso (Tibshirani, 1996) with two different shrinkage parameters (λM, )—however, with the key difference that it conditions on all the other model parameters, thus accounting for uncertainty. There are several algorithms available for computing the MAP estimate. We use the computationally efficient least angle regression selection algorithm (Efron et al., 2004) to compute the estimates. We denote the resulting (conditional) estimates as and for the kth gene feature. Finally, we estimate the posterior probability of significance ( ) by computing the (empirical) frequencies of the non-zero elements in the MAP estimates for each gene k as is the indicator function.
Note that, in this framework, and can be interpreted as estimates of the ‘local’ FDR or Bayesian q-values (Newton et al., 2004; Storey and Tibshirani, 2003). Thus, given a desired global FDR α, we can determine a threshold ϕα to use in flagging the set of genes as significant genes associated with the clinical outcome. The significant threshold ϕα can be determined according to the method proposed by Morris et al. (2008). Let be the combined vector of posterior probabilities for and . We then sort pi in descending order to obtain p(i). Then , where . Using this cutoff, the expected proportion of genes found to be significant that are in fact false positive genes is α, in other words, we can control the average Bayesian FDR to be α.
In this section, we examine the operating characteristics of the proposed iBAG model through synthetic numerical examples. We use two versions of the iBAG model: an iBAGunified model as specified in Section 2 and an iBAG2-stage model in which the mechanistic and clinical models in Equation (1) are fit sequentially. Specifically, the mechanistic model involves fitting K linear regressions (for each gene separately). Subsequently, both the fitted values and the residuals from the mechanistic model are used as predictors in the clinical model. We use a similar lasso framework to estimate the clinical model and select genes related to clinical outcome. In addition, we compare the performance with those of two other models—a non-integrative (non-INT) model and a single gene (SG) model. In the non-INT model, we ignore the information provided by methylation and fit only the clinical model with gene expression features (G) as multivariate explanatory variables. In the SG model, we perform a multivariate linear regression for each gene separately with all of the genomic features available (including both mRNA and methylation levels) for the gene, to conduct selection based on individual P-values.
We simulate datasets that reflect the application dataset (analyzed in Section 5) as closely as possible. We fix the total number of patients at N = 200 and vary the total number of genes (K = 400, 600, 800, 1000). We assume that J = 200 out of K genes have had methylation levels measured (the proportion in the application dataset). Given the triplet, (N, J, K), we first generate methylation data, mnj, independently from Uniform (0,1), corresponding to the beta-values of DNA methylation used in the TCGA glioblastoma study (described in Section 5). Next, we simulate the gene expression values from a mixture of two normal distributions, based on the corresponding methylation measures, i.e. (regulated by methylation) and (not regulated by methylation). In the application dataset, ∼80% of the correlations between DNA methylations and the corresponding gene expression levels range from –0.4 to –0.8. To induce explicit dependence between methylation and gene expression, we assume and vary the values (= 0.31, 0.44, 0.73) that respectively correspond to gene expression-methylation correlations . Finally, we use model (1) to generate the clinical outcomes Y by setting—(i) for and , and for all other ks; (ii) for and , and for all other ks and (iii) . In essence, we have three groups of genes: group 1 consists of genes with only a nonzero type M effect, which is the gene expression effect modulated only by methylation (genes 1–20); group 2 consists of genes with only a non-zero type effect, which is the gene expression effect independent of methylation, but modulated by other mechanisms (gene J + 1 to gene J + 20); and group 3 consists of genes with both nonzero type M and type effects, i.e. the gene expression effects modulated (partially) by both methylation and other mechanisms (gene J − 21 to gene J). In total, we investigate 12 different data combinations based on variations of (K,ρ), and we generate 10 datasets for each combination.
For the non-INT and iBAG2-stage models, we use regular lasso regression and obtain receiver operating characteristic (ROC) curves by varying the shrinkage parameter. For the SG model, we vary the cutoff of the P-value for significance to obtain the ROC curves. For the iBAGunified model, we obtain the ROC curve by varying the significance threshold for the Bayesian posterior probabilities of the gene expression effects. We fit all four models, iBAGunified, iBAG2-stage, SG and non-INT, to all the simulated datasets and obtain ROC curves to identify the true effects of gene expression for the three groups of genes. For each model, we plot the means of the ROC curves based on the 10 simulated datasets for each (K,ρ) combination. For example, in Figure 3, we plot the ROC curves for identifying the true effects of gene expressions for the three groups of genes when K = 1000 and ρ = −0.6, which most closely mimics the real dataset analyzed in Section 5. Supplementary Table S1 shows the rank of performance for the four models in identifying the three groups of genes based on the areas under the ROC curves (AUC) values. Based on the AUC values, we can conclude that the iBAGunified model outperforms the non-INT, SG and iBAG2-stage models in identifying all three groups of genes. Although the iBAG2-stage model performs slightly worse than the non-INT model in identifying genes with only type effects and genes with both type M and type effects, it has a clear advantage in identifying genes with only type M effects. The SG model performs better than the non-INT model in identifying genes with only type M effects, but it has lower AUCs in identifying the other two groups of genes. The performances of the four models in the other 11 scenarios are similar (see Supplementary Figs S2.1–S2.3 for the detection of genes in groups 1–3).
Based on the results from all 12 scenarios, the performance of the four models can be summarized as follows: (i) Our proposed iBAGunified model consistently performs the best of all three models for discovering all three groups of genes; (ii) The iBAG2-stage model performs better than the non-INT model in discovering the genes in group 1, those with effects of gene expression modulated only by methylation; (iii) In discovering the genes in groups 2 and 3, the iBAG2-stage model performs as well as the non-INT model; and (iv) Compared with the non-INT model, the SG model performs better in identifying genes in group 1, but worse in identifying genes in the other two groups.
Glioblastoma multiforme (GBM) is the most common and most aggressive type of malignant primary brain tumor in humans. The TCGA GBM dataset includes tumor samples from >500 patients with GBM, along with DNA copy number, mutation, methylation and gene expression information. Analyzing different platforms individually illuminates some of the pathobiologic features and molecular biomarkers in GBM. For example, Verhaak et al. (2010) proposed using gene expression data to develop clinically relevant molecular sub-classifications of GBM, and Noushmehr et al. (2010) used methylation levels to identify a subset of GBM tumors that harbor characteristic promoter DNA methylation alterations, referred to as the glioma CpG island methylator phenotype.
Here, we focus on integrating gene expression, methylation data and patients’ clinical features from the GBM study. The data can be downloaded directly from TCGA’s website (http://tcga-data.nci.nih.gov/tcga/tcgaHome2.jsp). The clinical outcome of interest is the overall survival time. The gene expression profile is obtained using Affymetrix Human Genome U133A Array. Level 2 data were downloaded from the TCGA website as of June 2011, and the data were normalized globally using BrainArray Custom Chip Definition Files (CDF) and the Robust Multichip Average (RMA) normalization method. Unsupervised hierarchical clustering (Pearson correlation and Ward linkage) and principal components analysis were used to search for batch effects, but no significant batch effects were observed. The DNA methylation information is obtained using the Illumina Human methylation 27 BeadChip. We directly downloaded the level 3 data from the TCGA website; there are no significant batch effects (http://bioinformatics.mdanderson.org/tcgabatcheffects/). For DNA methylation data, we use the beta value, which is a number between zero and one that measures the percentage of methylation. For the subsequent analyses, we briefly outline the data pre-processing steps here for the gene expression and methylation data. Complete details can be found in the Supplementary Material (Section S4.1). First, we filter out the under-expressed genes and the methylation features for which the beta values do not vary by patient. After this step, 7785 genes and 6890 methylation features remain. Second, we annotate the 6890 methylation features to the 7785 genes according to their positions on the chromosomes. Third, we choose the top genes based on univariate filtering, adjusting for patient age. Finally, 1000 genes (348 of them with methylation information available) remain for our analysis on 201 patients. Our goal is not only to understand methylation-based regulation of genes but also to use this information to find significant genes associated with survival times.
We randomly split the total data from 201 patients into a training dataset (data from 134 patients) and a test dataset (data from 67 patients). For the training dataset, we fit the following three models for the selected genes: (i) the non-INT model, with only gene expression information as explanatory variables, (ii) the additive (ADD) model, with both gene expression and methylation information as explanatory variables and assuming their effects on patients’ survival times are additive, (iii) the iBAGunified model for censored outcomes, which integrates both gene expression and methylation information. We include patient age as a clinical covariate for both the iBAG and non-INT models. For a fair comparison, we use a Bayesian approach to obtain estimations for all three models using double-exponential (lasso) priors. We construct the priors for the iBAGunified model as stated in Section 2.3. The priors for the non-INT model are the same as those for the iBAGunified model, except for setting to be 0. The priors for the ADD model are the same as those in the non-INT model, except for the priors of the methylation effects, which are set in a manner similarly to that of the priors for gene expression effects in the non-INT model. To check the convergence of the iBAGunified model, we run two MCMC chains with different starting values. As seen in the trace plots and the plots based on Gelman and Rubin’s convergence diagnostic statistics, for the important parameters in the iBAGunified model (Supplementary Fig. S5), the results show that the iBAGunified model converges after ∼2000 iterations.
To compare the performance of the three models, we obtain the predicted values for the test dataset using the mean estimations of the parameters from the posterior samples for all three models. We use the concordance index (C-index) to evaluate the prediction performance for the different models. The C-index can be expressed as , where for and = 0 otherwise, is the estimated survival time for patient i, and Φ is the set that consists of all pairs of i, j such that survival time ti > tj. This measure has been shown to be effective in comparing prediction performances among different models for right-censored data (Bonato et al., 2011; Harrel et al., 2001; van Wieringen et al., 2009). We calculated C-indexes for the fitted values in the training dataset and the predicted values in the test dataset for all three models. The results are summarized in Table 1. The C-indexes for the iBAGunified model are the highest for both the training (0.80) and test datasets (0.76). The C-indexes for the non-INT model are the lowest for both training (0.70) and test datasets (0.73). Although the improvements by integrating the methylation data are limited (all 95% CIs of the C-index overlapped for the three models), the iBAGunified model has the best performance in both model fitting and model prediction.
For the two models performing relatively better in prediction (iBAGunified model and ADD model), we use Gibbs sampling to obtain posterior samples for the parameters and apply the method described in Section 3 to obtain the posterior probabilities for the different types of gene expression effects. The Bayesian posterior probabilities obtained by the iBAG model for the type M and type effects are summarized in Figure 4, panels A and B, respectively. The Bayesian posterior probabilities of the methylation effects and gene expression effects by the ADD model are summarized in Figure 4, panels C and D, respectively.
Applying the iBAGunified model to the GBM training dataset, we identify 136 genes (of the total 348 genes with methylation information) as significantly modulated by at least one methylation feature. These genes are listed in Supplementary Table S3. Of 136 genes, 102 genes (76%) have a negative estimation for methylation effects. This result reflects the biologic action of methylation, which usually represses gene expression. At FDR = 0.2 (corresponding posterior probability cutoff = 0.517), we obtain 22 genes with non-zero type M effects (effects modulated by methylation) on patient survival using the iBAGunified model. These genes are listed in the top box of Table 2. We use an asterisk to show the genes significantly modulated by methylation, as identified by the iBAGunified model (within the list provided in Supplementary Table S1). We use a boldface font to show the genes positively associated with patient survival (higher expression of the gene indicates longer survival time), and a regular font to show the genes negatively associated with patient survival (higher expression of the gene indicate shorter survival time). In addition, we identify 107 genes with nonzero type effects on survival (summarized in the lower box in Table 2). CNGA3 is the only gene that overlaps between the 22 genes with type M effects and the 107 genes with type effects, which means that CNGA3 is found to have effects modulated by both methylation and other mechanisms.
Using the ADD model with the same FDR = 0.2 (corresponding posterior probability cutoff = 0.579), we obtain 22 genes that have significant methylation effects and 78 genes that have significant gene expression effects on patient survival. By comparing the gene lists derived by the iBAG and ADD models using Venn diagrams (Supplementary Figs S4.1 and S4.2), we observe that 59 of 78 genes with significant gene expression effects obtained by the ADD model overlap with the genes with non-zero type effects obtained by the iBAGunified model, and have a directional association with survival time (both are positive or negative). There are 12 common genes when comparing the genes with significant methylation effects by the ADD model and the genes with non-zero type M effects (effects modulated by methylation). Among the 10 genes with non-zero type M effects obtained only by the iBAGunified model, five genes (SARMS1, C1QA, UFD1L, CBFB and MVP) are found to be significantly modulated by methylation (see Supplementary Table S1). However, for the 10 genes with significant methylation effects obtained only by the ADD model, only two of them (ANK3 and IL11RA) are found to be significantly modulated by methylation (see Supplementary Table S1). This indicates that for the other eight genes obtained only by the ADD model, they are shown to have significant methylation effects on survival, but their gene expression levels are not changed. This result does not seem to conform to our belief that methylation affects patient survival by depressing the gene expression. The advantage of the iBAGunified model is that it can identify the genes with effects modulated by methylation, and thus the results are more biologically interpretable.
Of the 22 genes identified by effects modulated by methylation, 14 are negatively associated with survival, whereas eight genes are positively associated with survival. Functional analysis with the database for annotation, visualization and integrated discovery (DAVID, Dennis et al., 2003) revealed that some of the genes that are negatively associated with survival are regulators of transcription (SMURF2, HOXA1, RBBP4 and POLR3C) and code for plasma membrane (CAP2, GPR116, SMURF2). Detailed results and additional discussions can be found in Supplementary Table S5.1. This gene set related with negative survival is enriched for Gene Ontology terms, cell morphogenesis and neuron differentiation, suggesting a probable role in the genesis of brain tumors. On the other hand, the effects of genes associated with positive survival are mostly intra-cellular and are related to immune systems processes (C1QA, CBFB, DPP4 and SARM1), suggesting a likely function in tumor suppression (see Supplementary Table S5.2). Although no GBM studies have so far identified these 22 genes as important biomarkers of survival, two of the genes in this list are associated with other types of glioma—MVP was found to be overexpressed in ganglio-gliomas (Aronica et al., 2003). Moreover, most of the genes in this list have important biological functions in other types of cancer. For example, HOXA1 stimulates oncogenesis through the MAPK signaling pathway and the transcription factors STAT3 and STAT5B in mammary epithelial cells (Mohankumar et al., 2008). Also, CpG islands of HOXA1 are significantly hypermethylated in lung cancer (Selamat et al., 2011), breast cancer (Park et al., 2011) and gastric carcinoma (Kang et al., 2008).
In this article, we introduce an innovative model, iBAG, to integrate two different platforms of omics data and estimate their associations with clinical outcome. Different from most existing integration approaches, which focus on either finding biological relationships among different platforms or predicting patient prognosis, our iBAG model involves a hierarchical structure, which simultaneously estimates biological mechanisms and uses this information to find significant prognostic genes. Our simulation study shows that the iBAG model can simultaneously increase the power and decrease the FDR in detecting clinically relevant genes, especially for genes with expression effects modulated only by methylation. Moreover, we can categorize all clinically relevant genes into three groups according to different biological mechanisms: genes with expression effects modulated only by methylation, genes with expression effects modulated only by other mechanisms and genes with expression effects modulated by both methylation and other mechanisms. We apply the iBAG model to integrate methylation data and gene expression data from TCGA’s GBM dataset. The results show that the iBAG model outperforms the model based on data from a single layer of biological information in both determining genes important to survival and model fitting.
The main goal of the iBAG model is to (i) identify more disease-associated genes, and (ii) achieve better predictive power, by treating gene expression as the downstream event that is regulated by different mechanisms (e.g. methylation, copy number and microRNA). We choose to treat the gene expression as a downstream event regulated by different mechanisms so that the iBAG model can help us identify more disease-associated genes. There are several reasons underlying this choice. First, acknowledging that a gene’s expression can be modulated by different mechanisms (e.g. methylation, copy number and microRNA), even if these mechanisms do not have a direct effect on survival, we can still identify genes whose modulated expressions potentially impact survival. Second, as the measure of gene expression from microarray technology is usually noisy, iBAG can effectively identify, which part of the gene expression is actually modulated by various factors from other platforms, thus denoising the expression to find prognostic genes. In addition, as shown by our analysis of the GBM data, if we simply use methylation information additively to gene expression to estimate the methylation effects on patient survival, we find that many methylation effects related to survival do not significantly change the gene expression levels. The iBAG model can help us eliminate these genes and obtain results that are more biologically interpretable.
As our main goal is to identify important genes associated with patient survival, we assume that the methylation effect on gene expression and the gene expression effect on patient survival are all linear and independent. By making this assumption, the conditional posterior distributions are in closed forms, which save us on computation cost. However, this assumption may not reflect the true biological process; therefore, if the main interest is to make predictions about clinical outcome, then more general forms of functions (e.g. non-parametric functions) may need to be considered. In our study, we focus on finding purely associational relationships between genes and patients’ survival times. Independent functional experiments and datasets are needed to validate any causal relationships or implications. In addition, although our implementation is Bayesian, the fundamental idea of the integrative hierarchical modeling can be applied using frequentist approaches as well. Although we illustrate the integration of only two platforms at a time, integrating three or more platforms can also be done by following a similar framework. This will require a deeper understanding of the fundamental biological relationships among different data platforms. We leave these tasks for future consideration. The iBAG model provides a useful and intuitive framework for integrating multiple platforms to improve diagnosis and prognosis in cancer. A freely available R software for the iBAG model is available under the ‘software’ link at: http://odin.mdacc.tmc.edu/ ∼vbaladan/.
Click here for additional data file (supp_bts655_supplementary_materials_iBAG_FINAL_online.pdf)
The authors thank Virginia Mohlere and LeeAnn Chastain for editing the manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute, the National Institutes of Health or the National Science Foundation.
Funding: This work was partially supported by the Cancer Center Support Grant (CCSG P30 CA016672). K.A.D.’s research is partially supported by the MD Anderson Cancer Center SPORE grants in Brain Cancer (P50 CA127001 03), in Breast Cancer (P50 CA116199), and in Prostate Cancer (P50 CA140388 02). V.B.’s research is partially supported by National Institutes of Health grant (R01 CA160736) and NSF grant (IIS-0914861). J.S.M.’s research is partially supported by NIH grant (R01 CA107304).
Conflict of Interest: none declared.
|Albert JH,Chib S. Bayesian analysis of binary and polychotomous response dataJ. Am. Stat. Assoc.Year: 199388669679|
|Aronica E,et al. Overexpression of the human major vault protein in gangliogliomasEpilepsiaYear: 2003441166117512919388|
|Bass AJ,et al. SOX2 is an amplified lineage-survival oncogene in lung and esophageal squamous cell carcinomasNat. Genet.Year: 2009411238124219801978|
|Benjamini Y,Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testingJ. Roy. Stat. Soc. BYear: 199557289300|
|Bonato V,et al. Bayesian ensemble methods for survival prediction in gene expression dataBioinformaticsYear: 20112735936721148161|
|Casella G,George EI. Explaining the Gibbs samplerAm. Stat.Year: 199246167174|
|Chin L,et al. Making sense of cancer genomic dataGenes Dev.Year: 20112553455521406553|
|Daemen A,et al. A kernel-based integration of genome-wide data for clinical decision supportGenome Med.Year: 200913919356222|
|Dennis G Jr,et al. DAVID: database for annotation, visualization, and integrated discoveryGenome Biol.Year: 20034P312734009|
|de Tayrac M,et al. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approachBMC GenomicsYear: 2009103219154582|
|Efron B,et al. Least angle regressionAnn. Statist.Year: 200432407499|
|Fabiani E,et al. Analysis of genome-wide methylation and gene expression induced by 5-aza-2’-deoxycytidine identifies BCL2L10 as a frequent methylation target in acute myeloid leukemiaLeuk. Lymphoma.Year: 2010512275228421077739|
|Fridlyand J,et al. Breast tumor copy number aberration phenotypes and genomic instabilityBMC CancerYear: 200669616620391|
|Glinsky GV. Integration of HapMap-based SNP pattern analysis and gene expression profiling reveals common SNP profiles for cancer therapy outcome predictor genesCell CycleYear: 200652613262517172834|
|Hamid JS,et al. Data integration in genetics and genomics: methods and challengesHum. Genomics ProteomicsYear: 20091113|
|Handolias D,et al. Clinical responses observed with imatinib or sorafenib in melanoma patients expressing mutations in KITBr. J. CancerYear: 20101021219122320372153|
|Harrel FE. Regression Modeling Strategies, with Applications to Linear Models, Survival Analysis and Logistic RegressionYear: 2001New YorkSpringer|
|Hoggart CJ,et al. Simultaneous analysis of all SNPs in genome-wide and re-sequencing studiesPLoS Genet.Year: 20084e100013018654633|
|Hudson TJ,et al. International network of cancer genome projectsNatureYear: 201046499399820393554|
|Kang GH,et al. DNA methylation profiles of gastric carcinoma characterized by quantitative DNA methylation analysisLab Invest.Year: 20088816117018158559|
|Karpenko O,Dai Y. Relational database index choices for genome annotation dataBioinformatics and Biomedicine Workshops (BIBMW)Year: 2010 IEEE International Conference, pp. 264–268.|
|Lanckriet GRG,et al. A statistical framework for genomic data fusionBioinformaticsYear: 2004202626263515130933|
|Li J,et al. The Bayesian lasso for genome-wide association studiesBioinformaticsYear: 20112751652321156729|
|Mohankumar KM,et al. Transcriptional activation of signal transducer and activator of transcription (STAT) 3 and STAT5B partially mediate homeobox A1-stimulated oncogenic transformation of the immortalized human mammary epithelial cellEndocrinologyYear: 20081492219222918276758|
|Morris JS,et al. Bayesian analysis of mass spectrometry data using wavelet-based functional mixed modelsBiometricsYear: 20086447948917888041|
|Nanjundan M,et al. Amplification of MDS1/EVI1 and EVI1, located in the 3q26.2 amplicon, is associated with favorable patient prognosis in ovarian cancerCancer Res.Year: 2007673074308417409414|
|Newton MA,et al. Detecting differential gene expression with a semiparametric hierarchical mixture methodBiostatisticsYear: 2004515517615054023|
|Noushmehr H,et al. The cancer genome atlas research network, identification of a CpG island methylator phenotype that defines a distinct subgroup of gliomaCancer CellYear: 20101751052220399149|
|Park T,Casella G. The Bayesian lassoJ. Am. Stat. Assoc.Year: 2008103681686|
|Park SY,et al. Promoter CpG island hypermethylation during breast cancer progressionVirchows Arch.Year: 2011458738421120523|
|Pegram M,Slamon D. Biological rationale for HER2/neu (c-erbB2) as a target for monoclonal antibody therapySemin. Oncol.Year: 20005131911049052|
|Peng J,et al. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancerAnn. Statist.Year: 201045377|
|Qin LX. An integrative analysis of microRNA and mRNA expression-a case studyCancer Inform.Year: 2008636937919259417|
|Scott KL,et al. GOLPH3 modulates mTOR signalling and rapamycin sensitivity in cancerNatureYear: 20094591085109019553991|
|Selamat SA,et al. DNA methylation changes in atypical adenomatous hyperplasia, adenocarcinoma in situ, and lung adenocarcinomaPLoS OneYear: 20116e2144321731750|
|Storey JD,Tibshirani R. Statistical significance for genome-wide experimentsProc. Natl Acad. Sci. USAYear: 20031009440944512883005|
|Tanner M,Wong W. The calculation of posterior distributions by data augmentation (with discussion)J. Am. Stat. Assoc.Year: 198782528550|
|Tseng CW,et al. Integrative network analysis reveals active microRNAs and their functions in gastric cancerBMC Syst. Biol.Year: 201159921703006|
|Tibshirani R. Regression shrinkage and selection via the lassoJ. Roy. Stat. Soc. BYear: 199658267288|
|Tomioka N,et al. Novel risk stratification of patients with neuroblastoma by genomic signature, which is independent of molecular signatureOncogeneYear: 20082744144917637744|
|van Wieringen WN,et al. Survival prediction using gene expression data: a review and comparisonComput. Stat. Data Anal.Year: 20095315901603|
|van Wieringen WN,et al. Matching of array CGH and gene expression microarray features for the purpose of integrative genomic analysesBMC BioinformaticsYear: 2012138022559006|
|Verhaak RG,et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1Cancer CellYear: 2010179811020129251|
|Vogelstein B,Kinzler KW. The multistep nature of cancerTrends Genet.Year: 199391381418516849|
|Weir B,et al. Somatic alterations in the human cancer genomeCancer CellYear: 2004643343815542426|
|Waters KM,et al. Data merging for integrated microarray and proteomic analysisBrief Funct. Genomic Proteomic.Year: 2006526127216772273|
|Witten DM,Tibshirani R. Extensions of sparse canonical correlation analysis, with applications to genomic dataStat. Appl. Genet. Mol. Biol.Year: 2009828|
C-indexes for the three models in the training and test datasets
|non-INT model||ADD model||iBAGunified model|
|Training data||0.73 (0.02)||0.77 (0.03)||0.80 (0.03)|
|Test data||0.70 (0.03)||0.75 (0.02)||0.76 (0.03)|
Genes with significant gene expression effects obtained by iBAGunified model at FDR = 0.05 sorted according to their GeneIDs
|Type M genes||SPON2, CAP2*, POLR3C, CNGA3*, DPP4, GPR116, FKBP1A, SARM1*, RNF115*, HOXA1*, PCP4*, CYB5R2, RBBP4, SMURF2, TK1, C1QA*, UFD1L*,C2orf44*, SF3B5*,CASP4, CBFB*, MVP* LPCAT3, TRIB1, PEMT, TAB1, DCTN2, FARS2, RPP40, PNPLA6, OS9, SLC27A5, TMEM115, POLI, NXPH3, ADCY8, C16orf42*, CLTC, STX2, SEP10, E2F4, CNGA3*, AIM1, CSTA*, FCER1G*, FHIT*, ZBTB1, FRAT2, NPTXR, PISD, CCDC19, FAM50B*, ZNF544*, DKK3*, SREBF1, GRIK5, GSTM3, MNX1*, HSPA1A*, IGBP1, IL10RB, INPPL1, IPW, ITPR2, KARS, LRP3, MAP3K10, NCF2*, ATIC*, ACO1|
|Type genes||PABPC3, MRTO4, VPS28, PDE8A*, ENPP2, WBP11, PFDN2*, PHKG1, POLR2H, RC3H2, NDE1, FBXO34, ARHGEF10L, C12orf35, PPP2R2A, ADI1, GIMAP5*, AMBRA1, BIN3, UBFD1, BEX4, EPB41L5, RGS3*, ELOVL5, RPE*, RPS4X, CFB*, PLEK*, PORCN, SP3*, SP100*, GNS, STAT6*, SURF2, TACC1, HOXC4, TLE1, TOP1, UBE2V2, VDAC3, SLC39A7, KIAA1012, ADIPOR2, SLC24A6, ZNF430, NPRL3, SH3BGRL3*, ZNF528, MT4*, CSDA, RUVBL1, HERC2, DIRAS3*, EIF1AY, VAPB, RPL23, SNCAIP, KIAA0141, HS3ST2|
bts655-TF1Type M effects: effects modulated by methylation; type effects: effects modulated by other mechanisms; asterisk: genes significantly modulated by methylation; genes in bold font: Genes positively associated with patient survival; Genes in regular font: Genes negatively associated with patient survival.
Previous Document: Cognitive impairment in the preclinical stage of dementia in FTD-3 CHMP2B mutation carriers: a longi...
Next Document: Density Parameter Estimation for Finding Clusters of Homologous Proteins - Tracing Actinobacterial P...