Classification models for early detection of prostate cancer.  
Jump to Full Text  
MedLine Citation:

PMID: 18464915 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

We investigate the performance of different classification models and their ability to recognize prostate cancer in an early stage. We build ensembles of classification models in order to increase the classification performance. We measure the performance of our models in an extensive crossvalidation procedure and compare different classification models. The datasets come from clinical examinations and some of the classification models are already in use to support the urologists in their clinical work. 
Authors:

Joerg D Wichard; Henning Cammann; Carsten Stephan; Thomas Tolxdorff 
Related Documents
:

18779085  Decision manifoldsa supervised learning algorithm based on selforganization. 19424445  Objectbased classification of residential land use within accra, ghana based on quickb... 15364095  Genetic design of feature spaces for pattern classifiers. 17186805  Structured oneclass classification. 18763805  Development and application of a needle trap device for timeweighted average diffusive... 21318665  Internal migration in the ussr: 18971926. 
Publication Detail:

Type: Comparative Study; Evaluation Studies; Journal Article 
Journal Detail:

Title: Journal of biomedicine & biotechnology Volume: 2008 ISSN: 11107251 ISO Abbreviation: J. Biomed. Biotechnol. Publication Date: 2008 
Date Detail:

Created Date: 20080509 Completed Date: 20080604 Revised Date: 20091118 
Medline Journal Info:

Nlm Unique ID: 101135740 Medline TA: J Biomed Biotechnol Country: United States 
Other Details:

Languages: eng Pagination: 218097 Citation Subset: IM 
Affiliation:

Institute of Medical Informatics, Charité  Universitätsmedizin, Hindenburgdamm 30, 12200 Berlin, Germany. joergwichard@web.de 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Data Interpretation, Statistical* Decision Support Systems, Clinical* Diagnosis, ComputerAssisted / methods* Humans Male Models, Biological* Models, Statistical Prostatic Neoplasms / classification*, diagnosis* Reproducibility of Results Sensitivity and Specificity 
Comments/Corrections 
Full Text  
Journal Information Journal ID (nlmta): J Biomed Biotechnol Journal ID (publisherid): JBB ISSN: 11107243 ISSN: 11107251 Publisher: Hindawi Publishing Corporation 
Article Information Download PDF Copyright © 2008 Joerg D. Wichard et al. openaccess: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Received Day: 12 Month: 9 Year: 2007 Accepted Day: 2 Month: 1 Year: 2008 Print publication date: Year: 2008 Electronic publication date: Day: 20 Month: 4 Year: 2008 Volume: 2008Elocation ID: 218097 ID: 2366047 PubMed Id: 18464915 DOI: 10.1155/2008/218097 
Classification Models for Early Detection of Prostate Cancer  
Joerg D. Wichard^{1,2}I3*  
Henning Cammann^{1}  
Carsten Stephan^{3}  
Thomas Tolxdorff^{1}  
^{1}Institute of Medical Informatics,
Charité  Universitätsmedizin,
Hindenburgdamm 30,
12200 Berlin,
Germany 

^{2}Molecular Modelling Group, Institut für Molekulare Pharmakologie,
Robert Rössle Straße 10,
13125 Berlin,
Germany 

^{3}Department of Urology,
Charité  Universitätsmedizin,
Charitéplatz 1,
10098 Berlin,
Germany 

Correspondence: *Joerg D. Wichard: joergwichard@web.de [other] Recommended by Daniel Howard 
Prostate cancer is one of the most common types of cancer among male patients in the western world. The number of expected new cases in the USA for the year 2006 was 235,000 with 27,000 expected deaths [^{1}]. Early detection of prostate cancer improves the chances of a curative treatment and a lot of progress has been made in this field during the last decade. The early detection is considerably enhanced by the measurement of prostatespecific antigen (PSA) in conjunction with other clinically available data like age, digital rectal examination (DRE), and transrectal ultrasonography (TRUS) variables like prostate volume. We compared several classification models and analyzed their performance on the clinical dataset with an extended crossvalidation procedure. The models were linear discriminant analysis (LDA), penalized discriminant analysis (PDA) [^{2}], logistic regression [^{3}], classification and regression trees (CARTs) [^{4}], multilayer perceptron (MLP) [^{5}], support vector machines (SVMs) [^{6}, ^{7}], and nearest neighbour classifiers [^{8}]. All these models are implemented in an opensource Matlabtoolbox that is available on the internet [^{9}].
This study will help to improve the software package ProstataClass [^{10}] which was developed at Charité and currently uses an artificial neural network as classification engine. This program is successfully used in clinical practice for several years.
We had access to the clinically available data of 506 patients with 313 cases of prostate cancer (PCa) and 193 nonPCa. The data were selected from a group of 780 patients randomly. The data entry for each patient included age, PSA, the ratio of free to total prostatespecific antigen (PSARatio), TRUS, and the diagnostic finding from the DRE which was a binary variable (suspicious or nonsuspicious). Blood sampling and handling were performed as described in Stephan et al. [^{11}]. The samples were taken before any diagnostic or therapeutic procedures, and sera were stored at 80°C until analyzed. After thawing at room temperature, samples were processed within 3 hours. Prostate volume was determined by transrectal ultrasound using the prolate ellipse formula. The scatter plot of the variables under investigation is shown in Figure 1. PCa and nonPCa patients were histologically confirmed by 6–8 core prostate biopsy.
The average output of several different models f_{i}(x) is called an ensemble model:
(1)
f^(x)=∑i=1Kωifi(x), 
The central feature of the ensemble approach is the generalization ability of the resulting model. In the case of regression models (with continuous output values), it was shown that the generalization error of the ensemble is in the average case lower than the mean of the generalization error of the singleensemble members (see Krogh and Vedelsby 1995 [^{14}]). This holds in general, independent of the model class, as long as the models constituting the ensemble are diverse with respect to the hypothesis of the unknown function. In the case of (binary) classification models, the situation was not so clear because the classical biasvariance decomposition of the squared error loss in regression problems (Geman et al. [^{15}]) had to be extended to the zeroone loss function. There are several approaches dealing with this problem, see Kong and Dietterich [^{16}], Kohavi and Wolpert [^{17}], or Domingos [^{18}].
The zeroone loss function is not the only possible choice for classification problems. If we are interested in a likelihood whether a sample belongs to one class or not, we can use the error loss from regression and consider the binary classification problem as a regression problem that works on two possible outcomes. In practice, many classifiers are trained in that way.
Our ensemble approach is based on the observation that the generalization error of an ensemble model could be improved if the models on which averaging is done disagree and if their residual errors are uncorrelated [^{13}]. To see this, we have to investigate the contribution of the single model in the ensemble to the generalization error. We consider the case where we have a given dataset D = {(x_{1}, y_{1}),…, (x_{N}, y_{N})} and we want to find a function f(x) that approximates y at new observations of x. These observations are assumed to come from the same source that generated the training set D, that is, from the same (unknown) probability distribution P. It should be noted that f depends also on D. The expected generalization error Err(x, D) given a particular x and a training set D is
(2)
Err(x,D)=E[(y−f(x))2∣x,D], 
(3)
Err(x)=ED[Err(x,D)], 
(4)
Err(x)=σ2+(ED[f(x)]−E[y∣x])2 +ED[(f(x)−ED[f(x)])2]=σ2+Bias(f(x))2+Var(f(x)), 
(5)
Err(x)=σ2+Bias(f^(x))2+Var(f^(x)), 
(6)
Var(f^)=E[(f^−E[f^])2]=E[(∑i=1Kωifi)2]−(E[∑i=1Kωifi])2=∑i=1Kωi2(E[fi2]−E2[fi]) +2∑i<jωiωj(E[fifj]−E[fi]E[fj]), 
Our model selection scheme is a mixture of bagging [^{20}] and crossvalidation. Bagging or Bootstrap aggregating was proposed by Breiman [^{20}] in order to improve the classification by combining classifiers trained on randomly generated subsets of the entire training sets. We extended this approach by applying a crossvalidation scheme for model selection on each subset and after that we combine the selected models to an ensemble. In contrast to classical crossvalidation, we use random subsets as crossvalidation folds. In Kfold crossvalidation, the dataset is partitioned into K subsets. Of these K subsets, a single subset is retained as the validation data for testing the model, and the remaining K − 1 subsets are used for model training. The crossvalidation process is then repeated K times with each of the K subsets used only once as the validation data. The K results from the folds then can be averaged to produce a single estimation.
If we lack relevant problemspecific knowledge, crossvalidation methods could be used to select a classification method empirically [^{21}]. This is a common approach because it seems to be obvious that no classification method is uniformly superior, see, for example, Quinlan [^{22}] for a detailed study. It is also a common approach to select the model parameters with crossvalidation [^{23}]. The idea to combine the models from the K crossvalidation folds (stacking) was described by Wolpert [^{24}].
We suggest to train several models on each CVfold, to select the best performing model on the validation set, and to combine the selected models from the Kfolds. If we train models of one type but with different initial conditions (e.g., neural networks with different numbers of hidden neurons), then we find proper values for the free parameters of the model. We could extend that by combining models from different classes in order to increase the model diversity. We call this a heterogeneous ensemble or mixed ensemble and applied this method effectively to regression problems [^{25}] and classification tasks [^{26}].
Our model selection scheme works as follows: for the Kfold CV, the data is divided Ktimes into a training set and a test set, both sets containing randomly drawn subsets of the data without replications. The size of each test set was 25% of the entire dataset.
In every CVfold, we train several different models with a variety of model parameters (see Section 5 for an overview of the models and the related model parameters). In each fold, we select only one model to become a member of the final ensemble (namely, the best model with respect to the test set). This means that all models have to compete with each other in a fair tournament because they are trained and validated on the same dataset. The models with the lowest classification error in each CVfold are taken out and added to the final ensemble, receiving the weight ω_{i} = 1/k (see (1)). All other models in this CVfold are deleted.
We can use this model selection scheme in two ways. If we have no idea or prior knowledge, where classification or regression method should be used to cope with a specific problem, we could use this scheme in order to look for an empirical answer and to compare the performance of the different model classes. The other way is the estimation of model parameters for the different model classes described in Section 5.
In this section, we give a short overview of the model classes that we used for ensemble building. All models belong to the wellestablished collection of machinelearning algorithms for classification and regression tasks, so details can be found in the textbooks like, for instance, Hastie et al. [^{2}]. The implementation of these models in an opensource toolbox together with a more detailed description can be found in [^{9}]. The toolbox is an opensource MATLAB Toolbox which allows the integration of existing implementations of classification algorithms and it contains more then the few model classes described in the text.
The LDA is a simple but useful classifier. If we assume that the two classes K = {0, 1} have a Gaussian distribution with mean μ_{k} and they share the same covariance matrix Σ, then the linear discriminant functionδ_{k}(x), k = {0,1} is given by
(7)
δk(x)=xTΣ−1μk−12μkTΣ−1μk+log(πk), 
(8)
f(x)=arg maxk=(0,1){δk(x)}. 
Logistic regression (Log.Reg.) is a model for binomial distributed dependent variables and is used extensively in the medical and social sciences. Hastie et al. [^{2}] pointed out that the Logistic Regression model has the same form as the LDA, the only difference lies in the way, the linear coefficients are estimated. See Hosmer and Lemeshow for a detailed introduction [^{27}]. We used the binary Log.Reg. to compute the probability of the dichotomic variable y (PCa or nonPCa) from the m independent variables x:
(9)
y=11+exp(z) 
(10)
z=a0+∑i=1maixi, 
We train a multilayer feedforward neural network “MLP” with a sigmoid activation function. The weights are initialized with Gaussiandistributed random numbers having zero mean and scaled variances. The weights are trained with a gradient descend based on the Rprop algorithm [^{28}] with the improvements given in [^{29}]. The MLP works with a common weight decay with the penalty term
(11)
P(w→)=λ∑i=1N−wi21+wi2 , 
Over the last decade, SVMs have become very powerful tools in machine learning. An SVM creates a hyperplane in a feature space that separates the data into two classes with the maximum margin. The feature space can be a mapping of the original features (x, x′) into a higherdimensional space using a positive semidefinite function:
(12)
(x,x′)↦k(x,x′). 
(13)
k(x,x′)=(x⋅x′) linear=(x⋅x′+1)d polynomial =exp(−∥x−x′∥σ2) rbf 
Trees are conceptually simple but powerful tools for classification and regression. For our purpose, we use the classification and regression trees (CARTs) as described in Breiman et al. [^{4}]. The main feature of the CART algorithm is the binary decision role that is introduced at each tree node with respect to the information content of the split. In this way, the most discriminating binary splits are near the tree root building an hierarchical decision scheme. A sketch of a decision tree is shown in Figure 3. It is known that trees have a high variance, so they benefit from the ensemble approach [^{20}]. These trees ensembles are also know as random forests. The parameters of the tree models are related to splitting the tree nodes (the impurity measure and the split criterion, see [^{2}] for a detailed description).
A Knearestneighbor classifier (KNN) takes a weighted average over the labels z_{i} of those observations z_{i} in the training set that are closest to the query point x. This denotes as
(14)
f(x)=1∑wi∑zi∈Nk(x)wizi, 
We compared the model classes described above in a unified framework under fair conditions. Thus, we trained an ensemble of each model class consisting of 11 ensembles members (11 CVfolds in the training scheme described in Section 4). The performance of each ensemble model was assessed on the 20% of data (validation set), which was initially removed and never included in model training (see Figure 2). This procedure was independently repeated 20 times. This means that all modelbuilding processes, that is, the random removal of 20% of the data, the construction of a classification model ensemble on the remaining 80% of the data as outlined in Section 4, and the final test on the unseen validation data were performed each time. Finally, the mean average prediction values with respect to the validation set were calculated and are listed in Table 2. In some cases, it is useful to apply a kind of data preprocessing like balancing. If the distribution of the two classes differ in the sense, that one class is only represented with a small number of examples, we can balance the data in the training set. This can improve the convergence of several training algorithms and has also an impact to the classification error [^{35}]. We apply balancing in the way that we reduce the number of samples in the one class until we have an balanced ratio of the class labels. The ratio of the class labels in the validation set was never changed because it reflects the real data distribution. Balancing was only applied to the training data. We used three different performance measures in order to compare the different classification models. Therefore, we have to define the four possible outcomes of a classification that can be formulated in a 2 × 2 confusion matrix, as shown in Table 1. The accuracy,
(15)
Accuracy=tp+tntp+tn+fp+fn, 
(16)
Specificity=tntn+fp, 
(17)
Sensitivity=tptp+fn. 
The precision or positive predictive value (PPV) is given by
(18)
PPV=tptp+fp, 
(19)
FScore=2⋅Sensitivity⋅PPVSensitivity+PPV, 
The ROCcurve offers the opportunity to calculate the specificity at a fixed sensitivity level and vice versa. This is important because, from the clinical point of view, a high sensitivity 95% is wanted to detect all patients with PCa first. To avoid a high falsepositive rate, we computed the specificity at the level of 95% sensitivity (SPS95) from the ROCcurve as another important performance measure.
To have an impression about the correct classified nonPCa patients in this case, we computed the specificity at the level of 95% sensitivity (SPS95) from the ROCcurve. If we compare the outcome of the statistical analysis of the model performance as listed in Table 1 for the unbalanced case and in Table 2 for the balanced case, we can state that the differences between the different classifiers are marginal. Even the more sophisticated classification models (SVMs or Mixed Ensembles) could not outperform the robust linear candidates (LDA/PDA).
Tables 2 and 3 present the main results with only small differences between the classifiers. The standard deviations of the performance measures are given except for the ROCcurvebased measures (AUC and SPS95). Most papers in this field do not discuss this really complex problem and it cannot be solved in the scope of this paper, but it should be mentioned. As an example of a special solution of this problem, see the paper of Hilgers [^{36}].
We compared several classification models with respect to their ability to recognize prostate cancer in an early stage. This was done in an ensemble framework in order to estimate proper model parameters and to increase classification performance. It turned out that all models under investigation are performing very well with only marginal differences and are compareable with similar studies, like, for example, Finne et al. [^{37}], Remzi et al. [^{38}], or Zlotta et al. [^{39}]. In future research, it should be investigated whether these results are valid for other populations of patients (e.g., screening data) and other PSA test assays and whether the performance of classification could be increased by including new variables or by splitting the groups of patients into different PSA ranges.
References
1.  Jemal A,Siegel R,Ward EM,Thun MJ. Cancer facts & figures2006Atlanta, Ga, USA: Department of Epidemiology and Surveillance Research, American Cancer Society; Tech. Rep. 
2.  Hastie T,Tibshirani R,Friedman J. The Elements of Statistical Learning. 2001New York, NY, USA: Springer; Springer Series in Statistics 
3.  Bard Y. Nonlinear Parameter Estimation. 1974New York, NY, USA: Academic Press; 
4.  Breiman L,Friedman J,Olshen R,Stone C. Classification and Regression Trees. 1993Monterey, Calif, USA: Wadsworth and Brooks; 
5.  Bishop CM. Neural Networks for Pattern Recognition. 1995Oxford, UK: Oxford University Press; 
6.  Vapnik VN. The Nature of Statistical Learning Theory. 1999New York, NY, USA: Springer; 
7.  Boser BE,Guyon I,Vapnik VN. A training algorithm for optimal margin classifiersIn: Proceedings of the 5th Annual ACM Conference on Computational Learning Theory (COLT '92)July 1992Pittsburgh, Pa, USA:144–152. 
8.  Merkwirth C,Parlitz U,Lauterborn W. Fast nearestneighbor searching for nonlinear signal processingPhysical Review E 2000;62(2):2089–2097. 
9.  Wichard JD,Merkwirth C. ENTOOL—A Matlab toolbox for ensemble modeling2007 http://www.jwichard.de/entool/ 
10.  Stephan C,Cammann H,Semjonow A,et al. Multicenter evaluation of an artificial neural network to increase the prostate cancer detection rate and reduce unnecessary biopsiesClinical Chemistry 2002;48(8):1279–1287. [pmid: 12142385] 
11.  Stephan C,Klaas M,Müller C,Schnorr D,Loening SA,Jung K. Interchangeability of measurements of total and free prostatespecific antigen in serum with 5 frequently used assay combinations: an updateClinical Chemistry 2006;52(1):59–64. [pmid: 16391327] 
12.  Perrone MP,Cooper LN. Mammone RJWhen networks disagree: ensemble methods for hybrid neural networksNeural Networks for Speech and Image Processing 1993New York, NY, USA: ChapmanHall; :126–142. 
13.  Krogh A,Sollich P. Statistical mechanics of ensemble learningPhysical Review E 1997;55(1):811–825. 
14.  Krogh A,Vedelsby J. Tesauro G,Touretzky D,Leen TNeural network ensembles, cross validation, and active learningAdvances in Neural Information Processing Systems 1995;7Cambridge, Mass, USA: MIT Press; :231–238. 
15.  Geman S,Bienenstock E,Doursat R. Neural networks and the bias/variance dilemmaNeural Computation 1992;4(1):1–58. 
16.  Kong EB,Dietterich TG. Errorcorrecting output coding corrects bias and varianceIn: Proceedings of the 12th International Conference on Machine Learning (ICML '95)July 1995Tahoe City, Calif, USA:313–321. 
17.  Kohavi R,Wolpert D. Saitta LBias plus variance decomposition for zeroone loss functionsIn: Proceedings of the 13th International Conference on Machine Learning (ICML '96)July 1996Bari, ItalyMorgan Kaufmann; :275–283. 
18.  Domingos P. A unified biasvariance decomposition for zeroone and squared lossIn: Proceedings of the 17th National Conference on Artificial IntelligenceJulyAugust 2000Austin, Tex, USA:564–569. 
19.  Naftaly U,Intrator N,Horn D. Optimal ensemble averaging of neural networksNetwork: Computation in Neural Systems 1997;8(3):283–296. 
20.  Breiman L. Bagging predictorsMachine Learning 1996;24(2):123–140. 
21.  Schaffer C. Selecting a classification method by crossvalidationIn: Proceedings of the 4th International Workshop on Artificial Intelligence and StatisticsJanuary 1993Fort Lauderdale, Fla, USA:15–25. 
22.  Quinlan JR. Comparing connectionist and symbolic learning methodsComputational Learning Theory and Natural Learning Systems 1994;1Cambridge, Mass, USA: MIT Press; :445–456. 
23.  Guyon I,Elisseeff A. An introduction to variable and feature selectionJournal of Machine Learning Research 2003;3:1157–1182. 
24.  Wolpert DH. Stacked generalizationNeural Networks 1992;5:241–259. 
25.  Wichard JD,Merkwirth C,Ogorzałek M. Detecting correlation in stockmarketsPhysica A 2004;344(12):308–311. 
26.  Rothfuss A,StegerHartmann T,Heinrich N,Wichard JD. Computational prediction of the chromosomedamaging potential of chemicalsChemical Research in Toxicology 2006;19(10):1313–1319. [pmid: 17040100] 
27.  Hosmer DW,Lemeshow S. Applied Logistic Regression. 1989New York, NY, USA: John Wiley & Sons; 
28.  Riedmiller M,Braun H. A direct adaptive method for faster backpropagation learning: the RPROP algorithm In: Proceedings of the IEEE International Conference on Neural Networks, vol. 1MarchApril 1993San Francisco, Calif, USA:586–591. 
29.  Igel C,Hüsken M. Bothe H,Rojas RImproving the Rprop learning algorithmIn: Proceedings of the 2nd International ICSC Symposium on Neural Computation (NC '02)May 2000Berlin, GermanyAcademic Press; :115–121. 
30.  Merkwirth C,Wichard JD,Ogorzałek M. Stochastic gradient descent training of ensembles of DTCNN classifiers for digit recognitionIn: Proceedings of the 16th European Conference on Circuit Theory and Design (ECCTD '03), vol. 2September 2003Kraków, Poland:337–341. 
31.  Cristianini N,ShaweTaylor J. An Introduction to Support Vector Machines and Other KernelBased Learning Methods. 2000Cambridge, UK: Cambridge University Press; 
32.  Vapnik VN,Tscherwonenkis AJ. Theorie der Zeichenerkennung. 1979Berlin: Akademie; 
33.  Chang CC,Lin CJ. Libsvm—Alibrary for support vector machines2001 
34.  Bellman RE. Adaptive Control Processes. 1961Princeton, NJ, USA: Princeton University Press; 
35.  Merkwirth C,Ogorzałek M,Wichard JD. Stochastic gradient descent training of ensembles of DTCNN classifiers for digit recognitionIn: Proceedings of the 16th European Conference on Circuit Theory and Design (ECCTD '03), vol. 2September 2003Kraków, Poland:337–341. 
36.  Hilgers RA. Distributionfree confidence bounds for ROC curvesMethods of Information in Medicine 1991;30(2):96–101. [pmid: 1857255] 
37.  Finne P,Finne R,Auvinen A,et al. Predicting the outcome of prostate biopsy in screenpositive men by a multilayer perceptron networkUrology 2000;56(3):418–422. [pmid: 10962306] 
38.  Remzi M,Anagnostou T,Ravery V,et al. An artificial neural network to predict the outcome of repeat prostate biopsiesUrology 2003;62(3):456–460. [pmid: 12946746] 
39.  Zlotta AR,Remzi M,Snow PB,Schulman CC,Marberger M,Djavan B. An artificial neural network for prostate cancer staging when serum prostate specific antigen is 10 ng./ml. or lessJournal of Urology 2003;169(5):1724–1728. [pmid: 12686818] 
Article Categories:

Previous Document: A genetic screen for dihydropyridine (DHP)resistant worms reveals new residues required for DHPblo...
Next Document: Peroxisome ProliferatorActivated Receptorgamma Ligands: Potential Pharmacological Agents for Targe...