Survival models with preclustered gene groups as covariates.  
Jump to Full Text  
MedLine Citation:

PMID: 22177110 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

BACKGROUND: An important application of high dimensional gene expression measurements is the risk prediction and the interpretation of the variables in the resulting survival models. A major problem in this context is the typically large number of genes compared to the number of observations (individuals). Feature selection procedures can generate predictive models with high prediction accuracy and at the same time low model complexity. However, interpretability of the resulting models is still limited due to little knowledge on many of the remaining selected genes. Thus, we summarize genes as gene groups defined by the hierarchically structured Gene Ontology (GO) and include these gene groups as covariates in the hazard regression models. Since expression profiles within GO groups are often heterogeneous, we present a new method to obtain subgroups with coherent patterns. We apply preclustering to genes within GO groups according to the correlation of their gene expression measurements. RESULTS: We compare Cox models for modeling disease free survival times of breast cancer patients. Besides classical clinical covariates we consider genes, GO groups and preclustered GO groups as additional genomic covariates. Survival models with preclustered gene groups as covariates have similar prediction accuracy as models built only with single genes or GO groups. CONCLUSIONS: The preclustering information enables a more detailed analysis of the biological meaning of covariates selected in the final models. Compared to models built only with single genes there is additional functional information contained in the GO annotation, and compared to models using GO groups as covariates the preclustering yields coherent representative gene expression profiles. 
Authors:

Kai Kammers; Michel Lang; Jan G Hengstler; Marcus Schmidt; Jörg Rahnenführer 
Publication Detail:

Type: Journal Article; Research Support, NonU.S. Gov't Date: 20111216 
Journal Detail:

Title: BMC bioinformatics Volume: 12 ISSN: 14712105 ISO Abbreviation: BMC Bioinformatics Publication Date: 2011 
Date Detail:

Created Date: 20120306 Completed Date: 20120618 Revised Date: 20130626 
Medline Journal Info:

Nlm Unique ID: 100965194 Medline TA: BMC Bioinformatics Country: England 
Other Details:

Languages: eng Pagination: 478 Citation Subset: IM 
Affiliation:

Department of Statistics, TU Dortmund University, Dortmund, Germany. kammers@statistik.tudortmund.de 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Breast Neoplasms
/
genetics,
mortality Databases, Genetic Female Gene Expression Profiling / methods Genomics Humans Proportional Hazards Models* Survival Analysis* Vocabulary, Controlled 
Comments/Corrections 
Full Text  
Journal Information Journal ID (nlmta): BMC Bioinformatics Journal ID (isoabbrev): BMC Bioinformatics ISSN: 14712105 Publisher: BioMed Central 
Article Information Download PDF Copyright ©2011 Kammers et al; licensee BioMed Central Ltd. openaccess: Received Day: 20 Month: 6 Year: 2011 Accepted Day: 16 Month: 12 Year: 2011 collection publication date: Year: 2011 Electronic publication date: Day: 16 Month: 12 Year: 2011 Volume: 12First Page: 478 Last Page: 478 ID: 3377939 Publisher Id: 1471210512478 PubMed Id: 22177110 DOI: 10.1186/1471210512478 
Survival models with preclustered gene groups as covariates  
Kai Kammers1  Email: kammers@statistik.tudortmund.de 
Michel Lang1  Email: lang@statistik.tudortmund.de 
Jan G Hengstler2  Email: hengstler@ifado.de 
Marcus Schmidt3  Email: marcus.schmidt@frauen.klinik.unimainz.de 
Jörg Rahnenführer1  Email: rahnenfuehrer@statistik.tudortmund.de 
1Department of Statistics, TU Dortmund University, Dortmund, Germany 

2Leibniz Research Centre for Working Environment and Human Factors (IfADo), TU Dortmund University, Dortmund, Germany 

3Department of Obstetrics and Gynecology, Mainz University, Medical School, Mainz, Germany 
We present prediction models for survival times built from high dimensional gene expression data. The challenge is to construct models that are complex enough to have high prediction accuracy but that at the same time are simple enough to allow biological interpretation. Univariate approaches use single genes as covariates in survival time models, whereas multivariate models perform dimension reduction through gene selection (see, e.g., [^{1}]). In addition, the combination of clinical data and gene expression data is a hot topic of research [^{2},^{3}] and is included in our model building procedure. Analysis of the prognostic index [^{4}] and the Brier Score [^{5},^{6}] can be used to assess the predictive performance of the models.
Here, we present models with higher interpretability by combining genes to gene groups (e.g. biological processes) and then using these groups as covariates in the survival models. The hierarchically ordered 'GO groups' (Gene Ontology) are particularly suitable [^{7}]. The Gene Ontology (GO) project provides structured, controlled vocabularies and classifications according to molecular and cellular biology. The current ontologies of the GO project are biological process, molecular function, and cellular component. These three areas are considered rather independent of each other and we make use of the biological process ontology.
A problem when relating gene groups with gene expression profiles is that the genes in each gene group may have different expression profiles: interesting subgroups may not be detected due to heterogeneous or anticorrelated expression profiles within one gene group. We propose to cluster the expression profiles of genes in every gene group and preselect relevant clusters (preclustering).
For statistical analysis, the Cox regression model [^{8}] is a wellknown method for modeling censored survival data. It can be used for identifying covariates that are significantly correlated with survival times. Due to the highdimensional nature of microarray data we cannot obtain the parameter estimates directly with the Cox log partial likelihood approach. However, we can combine the Cox model with selection and shrinkage procedures and compare the prediction performance of the obtained models. Based on these models statistical selection procedures are applied. Univariate selection and forward selection have been shown to have problematic performance in highdimensional settings. Therefore we do not show their results in this work. We focus on presenting the results for ridge regression [^{9}] and lasso regression [^{10},^{11}] as shrinkage methods. Note that lasso regression is a variable selection method as well.
In order to integrate the clinical information and microarray data in survival models properly, it is a common approach to handle the clinical covariates as unpenalized mandatory variables [^{3},^{12}]. In addition to the genomic information, clinical covariates like age, tumor size and tumor stage may be important predictors for survival times of patients. These approaches show that the combination of genomic and clinical information may also improve predictions.
Our aim is the combination of methods for survival prediction with biological a priori knowledge. On real gene expression data sets we evaluate the potential of including preclustered gene groups as covariates in survival models. Models built with gene groups alone have equal or decreased prediction accuracy since many genes are not yet annotated to their corresponding functions. However, we will show that after adding the preclustering information to the gene groups the resulting models have improved interpretability while prediction performance remains stable.
In the next chapter we introduce the methods for analyzing survival data, for preclustering genes, for model selection, and for evaluating the prediction accuracy of the resulting survival models. Then we present and discuss results on two real gene expression data sets.
We first present the notation, the Cox model and how the covariates are defined that are used in the Cox models  especially the preclustering Algorithm is presented. Then we describe the log partial likelihood concept derived for the Cox model and introduce model selection/shrinkage methods. Since most methods for dimension reduction or shrinkage require the selection of a tuning parameter that determines the amount of shrinkage, finally, we describe how to choose the tuning parameter for each method.
In the following, we assume that we have a sample size of n patients, and a (possibly rightcensored) survival time for the response. In order to cope with censored survival times data we use the Cox model, also known as proportional hazards regression model [^{8}]. Cox suggested that the risk of an event (e.g. cancer recurrence, death or any date of interest) at time t for a patient with given covariate vector x = (x_{1},..., x_{p}) is modeled as
(1)
h(tx)=h0(t)exp(β′x), 
where h_{0}(·) is an arbitrary baseline hazard function and β = (β_{1},..., β_{p}) a vector of regression coefficients. In the classical setting with n > p, the regression coefficients are estimated by maximizing the log partial likelihood
(2)
l(β)= ∑i=1nδiβ′xi log∑j∈R(ti)exp(β′xj). 
For patient i, this expression contains the possibly censored failure time t_{i}, the (noncensoring)indicator δ_{i }(equal to 1 if t_{i }is a true survival time and to 0 if it is censored) and the vector of gene (or summarized gene group) expression values x_{i}.
Further, R(t_{i}) is the risk set at time t_{i}; this is the set of all patients who have not yet failed nor been censored. The value of β′x_{i }is called prognostic index or risk score of patient i.
In the following, we assume that the data consists of two different categories of covariates
clinical covariates Z = (Z_{1 }, ..., Z_{q}): e.g. tumor size, tumor grade, age
genomic covariates X = (X_{1}, ..., X_{p}): gene expression values of single genes or combined gene expression values for gene groups.
For a detailed analysis we consider three different types of Cox models. We start with the simple model using only the clinical covariates
(3)
h(tZ)=h0(t)exp(γ′Z). 
The second model consists of p genomic covariates X = (X_{1},..., X_{p}). In our genetic regression models we use single genes, gene groups as well as preclustered gene groups as covariates. A gene group must be appropriately summarized in order to obtain one representative value for each individual (patient). We summarize the gene expression measurements from all genes belonging to one GO group or cluster via the first principle component of all genes that belong to this gene group. In the following we will consider three types of genomic models:
genes
groups
preclustered GO groups.
In the last step, we combine the genomic models with the clinical model, which can be written as
(4)
h(tX,Z)=h0(t)exp(β′X+γ′Z) 
Due to the small number of clinical covariates, the shrinkage and dimension reduction procedures will only be applied to the genomic covariates.
In order to find K homogeneous subgroups of genes within one GO group containing N genes, we use the partitioning around medoids (PAM) cluster analysis (cf. [^{13}]). The PAM procedure is based on the search for K representative genes, the medoids, among the N genes to be clustered. To achieve the goal of finding K medoids that minimize the sum of dissimilarities of the genes to their closest medoid
(5)
∑i=1Nminj=1,...,Kd(i,j), 
where d(i, j) is the dissimilarity of the ith and jth gene, the two following steps are carried out iteratively until convergence, starting with K sequentially selected genes as initial solution:
Build: Select sequentially K initial clusters and assign each gene to its closest medoid.
Swap: Minimize the objective function (5) by switching medoids with other genes of the same cluster.
To find correlated subgroups, the dissimilarity
d(i,j)=1Cor(xi,xj) 
of the ith and jth gene with the gene expressions x_{i }and x_{j }is based on their Pearson correlation. This yields small dissimilarities between positively correlated genes and large values for negatively correlated genes, respectively.
The number of clusters K for the PAM algorithm has to be chosen in advance. To find tight clusters of highly correlated genes, [^{14}] suggest using the Intra Cluster Correlation:
ICC=2n(n1)∑iCi. 
Here, the values C_{i }are the elements of the lower triangle of the correlation matrix of the N_{j }genes within a single cluster. The maximum mean ICC among the K = 2,..., N  1 possible cluster configurations corresponds to the optimal number of clusters within one GO group.
For comparing our results to those being published in the literature, we make use of the following two most established and successful shrinkage procedures: L_{1 }(lasso) and L_{2 }(ridge) penalized regression. Univariate and forward stepwise selection do not produce satisfactory results for our high dimensional settings. We have compared these two methods in our analysis, and in agreement with previous results from Boevelstad et al. [^{4},^{12}] prediction performance was always worse (data not shown). We present the methods for the model containing clinical and genomic information.
L_{1 }(lasso) and L_{2 }(ridge) penalized estimation methods shrink the estimates of the regression coefficients towards zero relative to the maximum likelihood estimates. Both methods are similar in nature, but the results of L_{1 }and L_{2 }penalization can be very different. We perform the penalization only on the highdimensional genomic covariates, the clinical covariates are handled as unpenalized mandatory variables.
The lasso shrinks the regression coefficients toward zero by penalizing the absolute values instead of their squares. The penalized log partial likelihood thus becomes l(β,γ)λ∑j=1pβj.[^{11}].
Ridge regression [^{9}] shrinks the regression coefficients by imposing a penalty on their squared values. The regression coefficients are estimated by maximizing the penalized log partial likelihood l(β,γ)λ∑j=1pβj2,where λ∑j=1pβj2 is the penalty term and l(β, γ) is given by (2). Applying an L_{2 }penalty tends to result in many small but nonzero regression coefficients, whereas penalizing with the absolute values has the effect that many regression coefficients are shrunk exactly to zero. Thus the lasso also is a variable selection method.
We applied both methods using the R package penalized [^{15}]. In both methods the tuning parameter λ controls the amount of shrinkage and is obtained again by crossvalidation.
The model complexity of the prediction methods depends on a tuning parameter λ. We use Mfold crossvalidation as proposed by [^{16}] for estimating λ. The Mfold crossvalidated log partial likelihood (CVPL) is given by
(6)
CVPL(λ)= ∑m=1Mlβ^m(λ),γ^mlmβ^m(λ),γ^m, 
where l(β, γ) denotes the log partial likelihood given in (2) and l_{m}(β, γ) the log partial likelihood when the mth fold (m = 1,..., M) is left out.
The difference of the two terms compared in the formula is that in the right term the likelihood is evaluated without the mth fold, and on the left side it is evaluated with all patients. In both cases the parameters β and γ are estimated without the mth fold. The estimate of β and γ when the mth fold is left out is denoted by β^m and γ^m. The optimal value of λ is chosen to maximize the sum of the contributions of each fold to the log partial likelihood.
Next we describe how we evaluate the prediction performance of the models. We make use of three different model evaluation criteria. The whole procedure is applied to two wellknown data sets. The basic idea is to split the data into a training set for model fitting and a test set for model evaluation, i.e. for determining the prediction performance. It is important to note that we have to consider several splits of the data into training and test sets due to the extreme dependence of the results on such a split (cf. [^{4},^{17}]).
In order to obtain a fair comparison of the prediction methods, we divide the data 100 times at random in a training and test set at the ratio of 2:1. After computing the optimal tuning parameter λ^train by 10fold crossvalidation using the training data, we estimate the regression coefficients β^train and γ^trainon the whole training data set. For each split into training data and test data, we calculate on the test set the three evaluation criteria explained in the next subsections. The results are compared with the help of boxplots and prediction error curves.
We assign patients to subgroups based on their prognosis, into one with good and one with bad prognosis. If the prognostic index β^xi+γ^zi of patient i is higher, the survival time is expected to be shorter. For this reason, a patient i in the test set is assigned to the highrisk group if its prognostic index is above the median of all prognostic indices calculated on the test set. We apply a logrank test on the two prognostic groups and use the pvalue as an evaluation criterion for the usefulness of the grouping. Boevelstad [^{4}] points out that a disadvantage of this criterion is that it does not consider the ranking of the patients within the groups and it may not be biologically meaningful.
The prognostic index β^xi+γ^zi is used as a single continuous covariate in a Cox model. We fit the model hi(txi,zi)=h0(t)expαβ^xi+γ^zi. Using the likelihood ratio test, we test the null hypothesis α = 0 versus α ≠ 0 and assess the prediction performance with the obtained pvalue. A small pvalue indicates ability of the prognostic index to discriminate between short and long survivors.
The prediction performance can also be assessed based on the (integrated) Brier Score that was introduced by [^{5}] in survival context. The consistent estimate of the expected Brier Score BS(t) is defined as a function of time t >0 by
(7)
BS(t)=1n∑i=1nŜ(tXi,Zi)2⋅1(ti≤t∧δi=1)Ĝ(ti)+(1Ŝ(tXi,Zi))2⋅1(ti>t)Ĝ(t), 
where Ŝ⋅Xi,Zi stands for the estimated survival for patient i and denotes the KaplanMeier estimate of the censoring distribution. The estimation of Ŝ⋅Xi,Zi is performed via the Breslow estimator of the cumulative baseline hazard function (see, e.g., [^{18}], Chapter 8.8). Good predictions at time t are reflected by small Brier Scores. Note that the Brier Score BS(t) is dependent on the point in time t. The integrated Brier Score IBS, given by
(8)
IBS=1t*∫ 0t*BS(t)dt, 
is a score for the average predition performance for all time points in the interval [0, t *]. In accordance with [^{6}], we calculate the IBS for the two data sets for t^{* }= 10 years due to high censoring after 10 years of survival.
For investigating the relationship between microarray gene expression data and censored survival data, we analyze two published breast cancer data sets with the methods described above. In this section, we present the results for the evaluation procedure applied to these two data sets. Standard approaches focus on single genes as covariates [^{1},^{4},^{19}]. We integrate additional biological knowledge by building models with preclustered GO groups as covariates. In order to assess the merit of this approach, we also present results for models using only genes or only GO groups as explanatory variables and combine the genomic information with the clinical data. In order to obtain a fair comparison of models with different types of genomic covariates, we only use those genes that are annotated to GO groups. We have to consider several splits of the data into training and test set due to the dependence of the results on such a split. We first present detailed results for one specific random split, then we present comprehensive results summarizing 100 random splits.
At this point we want to highlight that the proposed methods are computationally intensive. Due to the nested crossvalidation procedure for obtaining the optimal tuning parameter λ and the preclustering approach, we performed all computations on the LiDOng high performance computing cluster of TU Dortmund University with 432 nodes and up to 64 GB RAM per node. The calculation takes several weeks to accumulate all results for one high dimensional data set.
The Dutch breast cancer (DBC) data set is a subset of the original data set with 24 885 gene expression measurements from n = 295 women with breast cancer [^{20}]. After data preprocessing as proposed by [^{21}] our analysis is performed with 1 890 genes, that are annotated to at least one GO group, according to the biological process ontology. We obtained the data from the website https://www.msbi.nl/dnn/People/Houwelingen.aspx. In total, there are 5 560 GO groups to which at least one of these genes is annotated. The mean number of genes included in these GO groups is approximately 17 genes, 90% of all GO groups contain at most 30 genes. For 79 patients an event was observed. The clinical covariates are age, size, nodes and grade.
The Mainz cohort (MC) study consists of n = 200 nodenegative breast cancer patients who were treated at the Department of Obstetrics and Gynecology of the Johannes Gutenberg University Mainz between the years 1988 and 1998 [^{22}]. All patients underwent surgery and did not receive any systemic therapy in the adjuvant setting. Gene expression profiling of the patients' RNA was performed using the Affymetrix HGU133A array, containing 22 283 probe sets, and the GeneChip System. The normalization of the raw data was done using RMA from the Bioconductor package affy. The raw. cel files are deposited at the NCBI GEO data repository with accession number GSE11121. For covariates in the survival models, 17 834 genes and 8 587 GO groups are available. The mean number of genes included in these GO groups is approximately 102 genes, 90% of all GO groups contain at most 146 genes. There have been 47 events observed. The clinical data covers age at diagnosis, tumor size and grade as well as the estrogen receptor status.
We apply the model selection methods and three evaluation criteria to one specific random split of the Mainz cohort study into training and test data. Model building and evaluation are performed as explained in the evaluation procedure section. We split the 200 breast cancer patients into training set and test set, where 2/3 of the patients (in this case 133) are assigned to the training set and 1/3 (here 67) to the test set. We use the training data for estimating the tuning parameter λ^train and the regression coefficients β^traintrain and γ^train and the test data for evaluation. Table 1 shows the results for the two prediction methods, using genes, GO groups, or preclustered gene groups as covariates.
This example indicates that the predictive performance of models built with GO groups alone and of models with preclustered GO groups is comparable with classical models using only genes as covariates. The pvalues for model assessment are similar, but in addition, we have more information in the final model; annotations of preclustered GO groups can help clinicians to investigate the selected genes according to their biological function.
For illustration of the results presented in Table 1 we show KaplanMeier curves for two prognostic groups of patients derived by dividing all patients according to the median prognostic index of the patients in the test set. Here we used lasso regression for model selection and the logrank test for evaluation. We compare models with genes, GO groups, and preclustered GO groups as covariates (see Figure 1).
For all three types of genomic covariates the two prognostic groups are clearly separated on the test data, with significant differences in overall survival (p <0.02) between the highrisk group and the lowrisk group. The separation between the two groups is best when using a model containing preclustered GO groups (p = 0.0092).
We have observed high variability of the chosen tuning parameters and the parameter estimates depending on the split into training and test data. In order to quantify which covariates are consistently selected in different splits and how stable the evaluation measures are, we calculated results for 100 random splits and compared the selected genes and GO groups.
In Figures 2 (DBC) and 3 (MC), we present boxplots for the results for the two cancer data sets, after applying the evaluation procedure for lasso and ridge regression for each of the three types of genomic covariates (genes, GO groups, preclustered GO groups). Results for the clinical model are presented as a reference.
Rows of the plots correspond to two model evaluation criteria, the prognostic index and the Integrated Brier Score, and the columns correspond to two types of models: the genomic model and the genomic model with clinical covariates. Results for the logrank test are nearly the same as for the prognostic index and therefore not shown here. In each plot we show the results for the two model selection methods. The pvalues for the prognostic index are shown on the log_{10 }scale, thus a value of 2, e.g., corresponds to a pvalue of 0.01. Small values for the integrated Brier Score correspond to good prediction performance. For both evaluation criteria in all plots the horizontal line at the median indicates the reference model containing only clinical information.
The following main statements can be deduced from the plots:
Lasso regression with preclustered GO groups has the best prediction performance for the DBC data set, see the median of the pvalues across 100 splits in Figure 2. In the Mainz cohort study, we see the same result for the genomic model using the Brier Score for evaluation (see Figure 3).
Methods using GO groups or preclustered GO groups as covariates perform in general as well as models using only genes.
The prognostic index and the Brier Score yield similar results.
It is noticeable that for the MC study and prognostic index as performance measure the model using only genomic information is worse than the clinical model (Figure 3, upper left), but the clinicalgenomic model is comparable to the clinical model.
The optimal tuning parameter varies considerably between the splits. The interquartile range for the number of chosen covariates for L_{1 }regression and for all three different types of covariates ranges approximately from 5 to 12 for the Mainz cohort study and from 3 and 20 for the DBC data set (see Figure 4). There is a higher variance on the number of chosen covariates for the DBC data set. Next, we have a closer look at the run of the curves of the Brier Score over time for L_{1 }models with preclustered GO groups in comparison to the other models. Prediction error curves [^{5},^{23}] (averaged values for the Brier Score calculated at each time point for 100 splits) for models with the three different types of genomic covariates are shown in Figure 5 and 6 for the DBC data set and the MC study, respectively. The performance of the clinical model serves as reference. For both data sets, the model with preclustered GO groups has in comparison with the clinical model a better prediction performance over time. The preclustered models outperform the clinical models, starting at four years for the DBC data set and at three years for the MC study. The other two genomic models are also inferior to the preclustered models. Furthermore, we investigate which preclustered groups are most frequently selected across all 100 splits. Table 2 contains the numbers of the most frequently selected covariates, the corresponding GO groups with GO IDs [^{7}] and further information concerning the medoid gene, the cluster size and the annotation for the GO groups that are helpful for the biologist. We observe that most of the chosen cluster are subgroups of large GO groups and consist of more 100 genes. The value of the effect indicates whether a high value of the corresponding covariate has an increasing (+1) or decreasing (1) influence on patients' risk to die. For a detailed analysis of the effects the boxplots in Figure 7 show the variation of the estimated regression coefficients in the cox regression model for the most frequently chosen clusters, represented via medoid genes. First of all, the direction of the effect among all splits into training and test data is stable. From this it follows that a detected cluster has a consistent effect on patients' survival  either positive or negative. The first two clusters (from GO:0043170 and GO:0007049) shown in Table 2 are chosen in more than 80 percent of the splits into training and test data. Their parameter estimates are negative, i.e. high expression values of the included genes lead to reduced risk to die and thus to longer survival.
The typical challenge when relating survival times to gene expression measurements is a relatively small number of individuals compared to a large number of predictors. In this case the use of classical approaches is not possible. In accordance with [^{4}], the lasso regression method seems most suitable and promising: its prediction performance is slightly better compared to ridge regression and the solution is sparse [^{4}] and [^{12}]. show that ridge regression performs better than all the other methods. In our analysis, ridge regression leads in general to comparable but not better results compared to the lasso. However, an important disadvantage of this method is that it does not select variables. We observe relevant differences between highrisk and lowrisk patients, but there are too many genes or GO groups to be further investigated. The preclustering approach is beneficial concerning prediction performance in the lasso setting and leads to comparable results in the other models. However, a main benefit of preclustering is that we detect genes with similar expression patterns and that these gene subgroups are correlated with survival. In addition, we can have a detailed view on the GO groups containing the preclustered subgroups. Table 2 shows that the cluster sizes as well as the corresponding GO groups are quite large. However, in this case the selection of the top 4 clusters is quite stable. For gaining further biological insight a more detailed analysis of the composition of these clusters is required and promising.
In terms of the Brier Score, we showed that the prediction performance of models using clinical, genomic or both information is comparable. It seems that these different kind of covariates contain an overlap of information for predicting survival.
Our comparative study shows that different model selection procedures can be used to identify genes and (preclustered) GO groups related to survival outcomes and to build models for predicting survival times of future patients.
The integration of GO groups is useful, since they contain aggregated information of biological function and thus are often more informative than single genes. It is encouraging that in terms of prediction performance, our results obtained with (preclustered) GO groups as predictors are comparable to those using only genes as predictors. Thus the potentially improved interpretability makes these models with GO groups competitive. We demonstrated that this result holds true also for models using GO groups and not only genes. Our agenda in the present work was:
Constructing models with a relatively small subset of relevant covariates that are enriched with additional gene group information in terms of the Gene Ontology.
Presenting a new approach of preclustering genes from one functional group due to different expression profiles within one GO group.
Comparing prediction rules for the three types of covariates (genes, gene groups, preclustered gene groups).
Adding clinical information and comparing the results to single use of genomic data.
The next step for improving our models is to integrate more detailed information concerning the hierarchically structured gene ontology. For coping with high correlations between GO groups one can follow the approach of [^{24}] where correlations between neighboring GO groups in the GO graph are iteratively removed. Finally, in future projects, the biological interpretation of the identified gene groups will include not only the interpretation of the (preclustered) GO groups according to overall function, but it is also helpful to take a closer look at the single genes contained in these gene groups.
We make use of the R package penalized [^{15}] that provides algorithms for penalized estimation in Cox proportional hazards models. The package is freely available from http://cran.rproject.org[^{25}]. R code for model selection and evaluation is available at http://www.statistik.tudortmund.de/survivalGO.html.
KK and JR developed the ideas for the manuscript, KK and ML performed the statistical and computational analyses. MS and JGH generated and provided the Mainz cohort data, all authors read and approved the manuscript.
The work on this paper has been supported by the German Research Foundation (DFG) within the Research Training Group "Statistical Modelling", project C2 and the Collaborative Research Center SFB 876 "Providing Information by ResourceConstrained Analysis", project A3.
References
Gui J,Li H,Penalized Cox regression analysis in the highdimensional and lowsample size settings, with applications to microarray gene expression dataBioinformaticsYear: 200521133001300810.1093/bioinformatics/bti42215814556  
Boulesteix AL,Porzelius C,Daumer M,Microarraybased classification and clinical predictors: on combined classifiers and additional predictive valueBioinformaticsYear: 200824151698170610.1093/bioinformatics/btn26218544547  
Binder H,Schumacher M,Allowing for mandatory covariates in boosting estimation of sparse highdimensional survival modelsBMC BioinformaticsYear: 2008914101918182099  
Bøvelstad HM,Nygård S,Størvold HL,Aldrin M,Frigessi A,Lingjaerde OC,Predicting survival from microarray dataa comparative studyBioinformaticsYear: 200723162080208710.1093/bioinformatics/btm30517553857  
Graf E,Schmoor C,Sauerbrei W,Schumacher M,Assessment and comparison of prognostic classification schemes for survival dataStat MedYear: 19991817182529254510.1002/(SICI)10970258(19990915/30)18:17/18<2529::AIDSIM274>3.0.CO;2510474158  
Schumacher M,Binder H,Gerds T,Assessment of survival prediction models based on microarray dataBioinformaticsYear: 200723141768177410.1093/bioinformatics/btm23217485430  
Gene Ontology ConsortiumThe Gene Ontology project in 2008Nucleic Acids ResYear: 200836 DatabaseD440D44417984083  
Cox DR,Regression models and life tables (with discussion)J R Stat Soc BYear: 1972342187220  
Hoerl AE,Kennard RW,Ridge regression: biased estimation of nonorthogonal problemsTechnometricsYear: 197012556710.2307/1267351  
Tibshirani R,Regression shrinkage and selection via the lassoJ R Stat Soc BYear: 199658267288  
Tibshirani R,The lasso method for variable selection in the Cox modelStat MedYear: 199716438539510.1002/(SICI)10970258(19970228)16:4<385::AIDSIM380>3.0.CO;239044528  
Bøvelstad HM,Nygård S,Borgan O,Survival prediction from clinicogenomic modelsa comparative studyBMC BioinformaticsYear: 20091041310.1186/147121051041320003386  
Kaufman L,Rousseeuw PJ,Finding Groups in Data  An introduction to cluster analysisYear: 1995Wiley, New York  
Haan JRD,Piek E,van Schaik RC,de Vlieg J,Bauerschmidt S,Buydens LMC,Wehrens R,Integrating gene expression and GO classification for PCA by preclusteringBMC BioinformaticsYear: 20101115810.1186/147121051115820346140  
Goeman J,penalized: L1 (lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox modelYear: 2008 [R package version 0.923].  
Verweij PJ,van Houwelingen HC,Crossvalidation in survival analysisStat MedYear: 199312242305231410.1002/sim.47801224078134734  
EinDor L,Zuk O,Domany E,Thousands of samples are needed to generate a robust gene list for predicting outcome in cancerProc Natl Acad Sci USAYear: 2006103155923592810.1073/pnas.060123110316585533  
Klein JP,Moeschberger ML,Survival Analysis Techniques for Censored and Truncated DataYear: 2003Second  
HaibeKains B,Desmedt C,Sotiriou C,Bontempi G,A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all?BioinformaticsYear: 200824192200220810.1093/bioinformatics/btn37418635567  
van de Vijver MJ,He YD,van't Veer LJ,Dai H,Hart AAM,Voskuil DW,Schreiber GJ,Peterse JL,Roberts C,Marton MJ,Parrish M,Atsma D,Witteveen A,Glas A,Delahaye L,van der Velde T,Bartelink H,Rodenhuis S,Rutgers ET,Friend SH,Bernards R,A geneexpression signature as a predictor of survival in breast cancerN Engl J MedYear: 2002347251999200910.1056/NEJMoa02196712490681  
van Houwelingen HC,Bruinsma T,Hart AAM,Veer LJV,Wessels LFA,Crossvalidated Cox regression on microarray gene expression dataStat MedYear: 200625183201321610.1002/sim.235316143967  
Schmidt M,Hasenclever D,Schaeffer M,Boehm D,Cotarelo C,Steiner E,Lebrecht A,Siggelkow W,Weikel W,SchifferPetry I,Gebhard S,Pilch H,Gehrmann M,Lehr HA,Koelbl H,Hengstler JG,Schuler M,Prognostic effect of epithelial cell adhesion molecule overexpression in untreated nodenegative breast cancerClin Cancer ResYear: 200814185849585510.1158/10780432.CCR08066918794096  
Gerds TA,Schumacher M,Consistent estimation of the expected Brier score in general survival models with rightcensored event timesBiom JYear: 20064861029104010.1002/bimj.20061030117240660  
Alexa A,Rahnenführer J,Lengauer T,Improved scoring of functional groups from gene expression data by decorrelating GO graph structureBioinformaticsYear: 200622131600160710.1093/bioinformatics/btl14016606683  
R Development Core TeamR: A Language and Environment for Statistical ComputingYear: 2010R Foundation for Statistical Computing, Vienna, Austria 
Figures
Tables
One random split into training and test data for the Mainz cohort study
Method  Covariates  p_{LR}  p_{PI}  p_{IBS}  λ  sel.cov 

L_{1}  genes  0.0190  0.0017  0.1042  11.72  19 
L_{1}  GO  0.0176  0.0018  0.1103  10.75  16 
L_{1}  clustered  0.0092  0.0002  0.0830  28.53  5 


L_{2}  genes  0.0098  0.0003  0.0877  5112.08  17834 
L_{2}  GO  0.0541  0.0097  0.1022  11749.16  6530 
L_{2}  clustered  0.0690  0.0006  0.0896  96499.04  31229 
Results for the two prediction methods using (i) genes, (ii) GO groups, and (iii) preclustered GO groups. For ridge regression, nearly all covariates are kept in the model since parameter estimates are unlikely to get shrunken exactly to 0.LR= ^logrank test, PI= ^prognostic index,IBS= ^Integrated Brier Score,sel.cov= ^number of selected covariates.
Top 10 selected covariates for preclustered GOgroups according to 100 splits into training and test data
count  GO  effect  medoid  clustersize  annotation 

85  GO:0043170  1  209258_s_at  410  macromolecule metabolic process 
81  GO:0007049  1  210052_s_at  222  cell cycle 
74  GO:0050896  +1  211908_x_at  102  response to stimulus 
52  GO:0032501  +1  212195_at  310  multicellular organismal process 
40  GO:0032501  +1  210935_s_at  362  multicellular organismal process 
21  GO:0050794  +1  210417_s_at  312  regulation of cellular process 
18  GO:0043170  1  211693_at  434  macromolecule metabolic process 
18  GO:0050896  +1  204118_at  230  response to stimulus 
16  GO:0006952  +1  203535_at  27  defense response 
15  GO:0042221  1  219140_s_at  39  response to chemical stimulus 
Names are GenBank IDs for the medoid genes and GO IDs for GO groups [^{7}]. The first column corresponds to the selected number for the covariate across 100 splits into training and test data for L1 regression on the Mainz cohort study. The value of the effect indicates whether the covariate has an increasing(+1) or decreasing (1) effect on patients' risk to die.
Article Categories:

Previous Document: Diagnostic accuracy of pulmonary embolism ruleout criteria: a systematic review and metaanalysis.
Next Document: Genetic analysis of the merozoite surface protein1 block 2 allelic types in Plasmodium falciparum c...