
Classification models for early detection of prostate cancer.
MedLine Citation:
PMID:  18464915     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
We investigate the performance of different classification models and their ability to recognize prostate cancer in an early stage. We build ensembles of classification models in order to increase the classification performance. We measure the performance of our models in an extensive cross-validation procedure and compare different classification models. The datasets come from clinical examinations and some of the classification models are already in use to support the urologists in their clinical work.
Authors:
Joerg D Wichard; Henning Cammann; Carsten Stephan; Thomas Tolxdorff
Publication Detail:
Type:  Comparative Study; Evaluation Studies; Journal Article    
Journal Detail:
Title:  Journal of biomedicine & biotechnology     Volume:  2008     ISSN:  1110-7251     ISO Abbreviation:  J. Biomed. Biotechnol.     Publication Date:  2008  
Date Detail:
Created Date:  2008-05-09     Completed Date:  2008-06-04     Revised Date:  2009-11-18    
Medline Journal Info:
Nlm Unique ID:  101135740     Medline TA:  J Biomed Biotechnol     Country:  United States    
Other Details:
Languages:  eng     Pagination:  218097     Citation Subset:  IM    
Affiliation:
Institute of Medical Informatics, Charité - Universitätsmedizin, Hindenburgdamm 30, 12200 Berlin, Germany. joergwichard@web.de
MeSH Terms
Descriptor/Qualifier:
Data Interpretation, Statistical*
Decision Support Systems, Clinical*
Diagnosis, Computer-Assisted / methods*
Humans
Male
Models, Biological*
Models, Statistical
Prostatic Neoplasms / classification*,  diagnosis*
Reproducibility of Results
Sensitivity and Specificity

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): J Biomed Biotechnol
Journal ID (publisher-id): JBB
ISSN: 1110-7243
ISSN: 1110-7251
Publisher: Hindawi Publishing Corporation
Article Information
Copyright © 2008 Joerg D. Wichard et al.
open-access: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received Day: 12 Month: 9 Year: 2007
Accepted Day: 2 Month: 1 Year: 2008
Print publication date: Year: 2008
Electronic publication date: Day: 20 Month: 4 Year: 2008
Volume: 2008     E-location ID: 218097
ID: 2366047
PubMed Id: 18464915
DOI: 10.1155/2008/218097

Classification Models for Early Detection of Prostate Cancer
Joerg D. Wichard1,2,3*
Henning Cammann1
Carsten Stephan3
Thomas Tolxdorff1
1Institute of Medical Informatics, Charité - Universitätsmedizin, Hindenburgdamm 30, 12200 Berlin, Germany
2Molecular Modelling Group, Institut für Molekulare Pharmakologie, Robert Rössle Straße 10, 13125 Berlin, Germany
3Department of Urology, Charité - Universitätsmedizin, Charitéplatz 1, 10098 Berlin, Germany
Correspondence: *Joerg D. Wichard: joergwichard@web.de
[other] Recommended by Daniel Howard

1. INTRODUCTION

Prostate cancer is one of the most common types of cancer among male patients in the Western world. The number of expected new cases in the USA for the year 2006 was 235,000, with 27,000 expected deaths [1]. Early detection of prostate cancer improves the chances of curative treatment, and much progress has been made in this field during the last decade. Early detection is considerably enhanced by the measurement of prostate-specific antigen (PSA) in conjunction with other clinically available data such as age, digital rectal examination (DRE), and transrectal ultrasonography (TRUS) variables such as prostate volume. We compared several classification models and analyzed their performance on the clinical dataset with an extended cross-validation procedure. The models were linear discriminant analysis (LDA), penalized discriminant analysis (PDA) [2], logistic regression [3], classification and regression trees (CARTs) [4], multilayer perceptrons (MLPs) [5], support vector machines (SVMs) [6, 7], and nearest-neighbor classifiers [8]. All these models are implemented in an open-source MATLAB toolbox that is available on the internet [9].

This study will help to improve the software package ProstataClass [10], which was developed at the Charité and currently uses an artificial neural network as its classification engine. The program has been used successfully in clinical practice for several years.


2. DATA

We had access to the clinically available data of 506 patients, with 313 cases of prostate cancer (PCa) and 193 non-PCa. The data were selected randomly from a group of 780 patients. The data entry for each patient included age, PSA, the ratio of free to total prostate-specific antigen (PSA-Ratio), TRUS, and the diagnostic finding from the DRE, which was a binary variable (suspicious or nonsuspicious). Blood sampling and handling were performed as described in Stephan et al. [11]. The samples were taken before any diagnostic or therapeutic procedures, and sera were stored at −80°C until analyzed. After thawing at room temperature, samples were processed within 3 hours. Prostate volume was determined by transrectal ultrasound using the prolate ellipse formula. A scatter plot of the variables under investigation is shown in Figure 1. PCa and non-PCa diagnoses were histologically confirmed by 6–8-core prostate biopsy.


3. ENSEMBLES

The weighted average of the outputs of several different models fi(x) is called an ensemble model:

(1)
$$\hat{f}(x) = \sum_{i=1}^{K} \omega_i f_i(x),$$
where we assume that the model weights ωi sum to one (∑i ωi = 1). There are several suggestions concerning the choice of the model weights (see Perrone and Cooper [12]), but we decided to use uniform weights ωi = 1/K for the sake of simplicity and to avoid the overfitting problems reported by Krogh and Sollich [13].
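To make the averaging in (1) concrete, here is a minimal Python/NumPy sketch (illustrative only; the authors' own implementation is the MATLAB ENTOOL toolbox [9], and the `models` objects with a `predict` method are an assumption of this sketch, not taken from the paper):

```python
import numpy as np

def ensemble_predict(models, X, weights=None):
    """Weighted ensemble average of K model outputs, as in (1)."""
    K = len(models)
    if weights is None:
        weights = np.full(K, 1.0 / K)  # uniform weights, summing to one
    outputs = np.stack([m.predict(X) for m in models])  # shape (K, n_samples)
    return weights @ outputs  # ensemble output for each sample
```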

The central feature of the ensemble approach is the generalization ability of the resulting model. In the case of regression models (with continuous output values), it was shown that the generalization error of the ensemble is, in the average case, lower than the mean of the generalization errors of the single ensemble members (see Krogh and Vedelsby [14]). This holds in general, independent of the model class, as long as the models constituting the ensemble are diverse with respect to their hypotheses about the unknown function. In the case of (binary) classification models, the situation was less clear because the classical bias-variance decomposition of the squared error loss in regression problems (Geman et al. [15]) had to be extended to the zero-one loss function. There are several approaches dealing with this problem; see Kong and Dietterich [16], Kohavi and Wolpert [17], or Domingos [18].

The zero-one loss function is not the only possible choice for classification problems. If we are interested in an estimate of the likelihood that a sample belongs to a given class, we can use the squared error loss from regression and treat the binary classification problem as a regression problem with two possible outcomes. In practice, many classifiers are trained in that way.

Our ensemble approach is based on the observation that the generalization error of an ensemble model can be improved if the averaged models disagree and if their residual errors are uncorrelated [13]. To see this, we have to investigate the contribution of the individual models in the ensemble to the generalization error. We consider the case where we have a given dataset D = {(x1, y1),…, (xN, yN)} and we want to find a function f(x) that approximates y at new observations of x. These observations are assumed to come from the same source that generated the training set D, that is, from the same (unknown) probability distribution P. It should be noted that f also depends on D. The expected generalization error Err(x, D) given a particular x and a training set D is

(2)
$$\mathrm{Err}(x, D) = E\left[(y - f(x))^2 \mid x, D\right],$$
where the expectation E[⋅] is taken with respect to the probability distribution P. We are now interested in
(3)
$$\mathrm{Err}(x) = E_D\left[\mathrm{Err}(x, D)\right],$$
where the expectation ED[⋅] is taken with respect to all possible realizations of training sets D with fixed sample size N. According to Geman et al. [15], the bias/variance decomposition of Err(x) is
(4)
$$\mathrm{Err}(x) = \sigma^2 + \left(E_D[f(x)] - E[y \mid x]\right)^2 + E_D\left[\left(f(x) - E_D[f(x)]\right)^2\right] = \sigma^2 + \mathrm{Bias}(f(x))^2 + \mathrm{Var}(f(x)),$$
where E[y∣x] is the deterministic part of the data and σ2 is the variance of y given x. Balancing the bias and variance terms is a crucial problem in model building. If we try to decrease the bias term on a specific training set, we usually increase the variance term, and vice versa. We now consider the case of an ensemble average f^(x), consisting of K individual models as defined in (1). If we put this into (4), we get
(5)
$$\mathrm{Err}(x) = \sigma^2 + \mathrm{Bias}(\hat{f}(x))^2 + \mathrm{Var}(\hat{f}(x)),$$
and we can examine the effects on bias and variance. The bias term in (5) is just the average of the biases of the individual models in the ensemble, so we should not expect a reduction of the bias term compared to single models. According to Naftaly et al. [19], the variance term of the ensemble can be decomposed in the following way:
(6)
$$\mathrm{Var}(\hat{f}) = E\left[(\hat{f} - E[\hat{f}])^2\right] = E\left[\Big(\sum_{i=1}^{K} \omega_i f_i\Big)^2\right] - \left(E\Big[\sum_{i=1}^{K} \omega_i f_i\Big]\right)^2 = \sum_{i=1}^{K} \omega_i^2 \left(E[f_i^2] - E^2[f_i]\right) + 2 \sum_{i<j} \omega_i \omega_j \left(E[f_i f_j] - E[f_i] E[f_j]\right),$$
where the expectation is taken with respect to D. The first sum in (6) marks the lower bound of the ensemble variance and is the weighted mean of the variances of the ensemble members. The second sum contains the cross terms of the ensemble members and vanishes if the models are completely uncorrelated [13]. The reduction in the variance of the ensemble is therefore related to the degree of independence of the single models [19].
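The role of the cross terms in (6) can be illustrated numerically. The following Python sketch (a toy simulation with Gaussian model outputs, not part of the original study) shows that averaging K uncorrelated models reduces the variance by roughly a factor of K, while fully correlated models give no reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_trials = 10, 100_000

# Simulate K unbiased model outputs with unit variance and correlation rho.
for rho in (0.0, 0.5, 1.0):
    cov = np.full((K, K), rho) + (1 - rho) * np.eye(K)
    f = rng.multivariate_normal(np.zeros(K), cov, size=n_trials)
    ensemble = f.mean(axis=1)  # uniform weights 1/K, as in (1)
    print(f"rho={rho:3.1f}  Var(single)={f.var():.3f}  "
          f"Var(ensemble)={ensemble.var():.3f}")
# rho=0 gives roughly Var/K; rho=1 gives no reduction,
# matching the cross terms in (6).
```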


4. CROSS-VALIDATION AND MODEL SELECTION

Our model selection scheme is a mixture of bagging [20] and cross-validation. Bagging, or bootstrap aggregating, was proposed by Breiman [20] in order to improve classification by combining classifiers trained on randomly generated subsets of the entire training set. We extended this approach by applying a cross-validation scheme for model selection on each subset, after which we combine the selected models into an ensemble. In contrast to classical cross-validation, we use random subsets as cross-validation folds. In K-fold cross-validation, the dataset is partitioned into K subsets. Of these K subsets, a single subset is retained as the validation data for testing the model, and the remaining K − 1 subsets are used for model training. The cross-validation process is then repeated K times, with each of the K subsets used exactly once as the validation data. The K results from the folds can then be averaged to produce a single estimate.

If we lack relevant problem-specific knowledge, cross-validation methods can be used to select a classification method empirically [21]. This is a common approach because no classification method appears to be uniformly superior; see, for example, Quinlan [22] for a detailed study. It is also common to select the model parameters with cross-validation [23]. The idea of combining the models from the K cross-validation folds (stacking) was described by Wolpert [24].

We suggest training several models on each CV-fold, selecting the best-performing model on the validation set, and combining the selected models from the K folds. If we train models of one type but with different initial conditions (e.g., neural networks with different numbers of hidden neurons), then we find proper values for the free parameters of the model. We can extend this by combining models from different classes in order to increase model diversity. We call this a heterogeneous ensemble or mixed ensemble and have applied this method effectively to regression problems [25] and classification tasks [26].

Our model selection scheme works as follows: for the K-fold CV, the data is divided K times into a training set and a test set, both sets containing randomly drawn subsets of the data without replacement. The size of each test set was 25% of the entire dataset.

In every CV-fold, we train several different models with a variety of model parameters (see Section 5 for an overview of the models and the related model parameters). In each fold, we select only one model to become a member of the final ensemble, namely, the best model with respect to the test set. This means that all models have to compete with each other in a fair tournament because they are trained and validated on the same dataset. The model with the lowest classification error in each CV-fold is added to the final ensemble, receiving the weight ωi = 1/K (see (1)); a sketch of this procedure follows below. All other models in the CV-fold are discarded.
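A compact Python sketch of this selection scheme is given below (scikit-learn estimators stand in for the toolbox models; the function name and helper calls are illustrative assumptions, the authors' own implementation being the MATLAB ENTOOL toolbox [9]):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def build_cv_ensemble(candidates, X, y, K=11, test_size=0.25, seed=0):
    """Per CV-fold: train all candidate models on a random 75/25 split,
    keep only the fold winner on that fold's test set (weight 1/K)."""
    ensemble = []
    for k in range(K):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + k)
        fitted = [clone(m).fit(X_tr, y_tr) for m in candidates]
        errors = [np.mean(m.predict(X_te) != y_te) for m in fitted]
        ensemble.append(fitted[int(np.argmin(errors))])  # fold winner
    return ensemble  # combine with uniform weights as in (1)
```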

We can use this model selection scheme in two ways. If we have no prior knowledge about which classification or regression method should be used for a specific problem, we can use this scheme to look for an empirical answer and to compare the performance of the different model classes. The other use is the estimation of model parameters for the different model classes described in Section 5.


5. CLASSIFICATION MODELS

In this section, we give a short overview of the model classes that we used for ensemble building. All models belong to the well-established collection of machine-learning algorithms for classification and regression tasks, so details can be found in textbooks such as Hastie et al. [2]. The implementation of these models in an open-source toolbox, together with a more detailed description, can be found in [9]. The toolbox is an open-source MATLAB toolbox that allows the integration of existing implementations of classification algorithms, and it contains more than the few model classes described in the text.

5.1. Linear discriminant analysis

The LDA is a simple but useful classifier. If we assume that the two classes k ∈ {0, 1} have Gaussian distributions with means μk and share the same covariance matrix Σ, then the linear discriminant function δk(x), k ∈ {0, 1}, is given by

(7)
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k),$$
where πk denotes the frequency of occurrence of the class labels. The predicted class labels are given by
(8)
$$f(x) = \arg\max_{k \in \{0,1\}} \{\delta_k(x)\}.$$
We also implemented two modifications: the quadratic discriminant analysis (QDA) and the PDA, as described in detail in Hastie et al. [2]. Linear methods are usually conceptually simple, robust, and fast, and, in particular in high-dimensional problems, they can be very powerful.
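As an illustration, a minimal NumPy sketch of (7) and (8) could look as follows (assuming the class means, pooled covariance, and priors have already been estimated from the training data; this is our sketch, not the toolbox code):

```python
import numpy as np

def lda_predict(X, mu, Sigma, priors):
    """Evaluate delta_k(x) from (7) for both classes and predict via (8).
    mu: (2, d) class means; Sigma: (d, d) pooled covariance; priors: (2,)."""
    Sigma_inv = np.linalg.inv(Sigma)
    deltas = np.stack([
        X @ Sigma_inv @ mu[k]                    # x^T Sigma^-1 mu_k
        - 0.5 * mu[k] @ Sigma_inv @ mu[k]        # -1/2 mu_k^T Sigma^-1 mu_k
        + np.log(priors[k])                      # + log(pi_k)
        for k in (0, 1)
    ], axis=1)                                   # shape (n_samples, 2)
    return deltas.argmax(axis=1)                 # predicted class labels
```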

5.2. Logistic regression model

Logistic regression (Log.Reg.) is a model for binomially distributed dependent variables and is used extensively in the medical and social sciences. Hastie et al. [2] pointed out that the logistic regression model has the same form as the LDA; the only difference lies in the way the linear coefficients are estimated. See Hosmer and Lemeshow [27] for a detailed introduction. We used the binary Log.Reg. to compute the probability of the dichotomous variable y (PCa or non-PCa) from the m independent variables x:

(9)
$$y = \frac{1}{1 + \exp(z)}$$
with
(10)
$$z = a_0 + \sum_{i=1}^{m} a_i x_i,$$
where the model coefficients are estimated with a second-order gradient descent (a quadratic approximation to the likelihood function). This can be a critical issue in high-dimensional problems because these calculations are time and memory consuming.
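A minimal Python sketch of the prediction step (9)-(10) follows; the coefficient estimation itself (e.g., by a second-order method such as iteratively reweighted least squares) is omitted, and the sign convention follows the paper:

```python
import numpy as np

def logistic_predict(X, a0, a):
    """Class-membership probability from (9)-(10).
    Note the paper's sign convention: y = 1 / (1 + exp(z))."""
    z = a0 + X @ a          # linear predictor (10)
    return 1.0 / (1.0 + np.exp(z))  # probability (9)
```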

5.3. Multilayer perceptron

We train a multilayer feed-forward neural network (MLP) with a sigmoid activation function. The weights are initialized with Gaussian-distributed random numbers having zero mean and scaled variances. The weights are trained with a gradient descent based on the Rprop algorithm [28], with the improvements given in [29]. The MLP uses a common weight decay with the penalty term

(11)
$$P(\vec{w}) = \lambda \sum_{i=1}^{N} \frac{w_i^2}{1 + w_i^2},$$
where w→ denotes the vector of the N weights of the MLP and λ is a small regularization parameter. The number of hidden layers, the number of neurons, and the regularization parameter are adjusted during the CV-training. We further applied the concept of an ϵ-insensitive error loss that we introduced in the context of cellular neural networks (CNNs) [30].
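The penalty (11) is straightforward to evaluate; a minimal NumPy sketch (the function name is ours, not the toolbox's):

```python
import numpy as np

def weight_decay_penalty(w, lam):
    """Regularization term P(w) of (11): lam * sum w_i^2 / (1 + w_i^2).
    Unlike plain L2 decay, the ratio saturates for large weights,
    so the penalty pushes small weights toward zero more strongly."""
    return lam * np.sum(w**2 / (1.0 + w**2))
```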

5.4. Support vector machines

Over the last decade, SVMs have become very powerful tools in machine learning. An SVM creates a hyperplane in a feature space that separates the data into two classes with the maximum margin. The feature space can be a mapping of the original features (x, x′) into a higher-dimensional space using a positive semidefinite function:

(12)
$$(x, x') \mapsto k(x, x').$$
The function k(⋅, ⋅) is called the kernel function, and the so-called kernel trick uses Mercer's condition, which states that any positive semidefinite kernel k(x, x′) can be expressed as a dot product in a high-dimensional space (see [31] for a detailed introduction). The theoretical foundations of this approach were given by Vapnik's statistical learning theory [6, 32] and later extended to the nonlinear case [7]. We use an implementation of SVMs that is based on the libsvm library provided by Chang and Lin [33] with the standard kernels:
(13)
$$k(x, x') = \begin{cases} x \cdot x' & \text{(linear)} \\ (x \cdot x' + 1)^d & \text{(polynomial)} \\ \exp\left(-\dfrac{\|x - x'\|^2}{\sigma^2}\right) & \text{(rbf)}. \end{cases}$$
The parameters of the model are, depending on the kernel type, the polynomial degree d, the width σ2 of the rbf kernel, and the cost parameter penalizing constraint violations during SVM training.
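For illustration, the three standard kernels of (13) can be written in a few lines of Python (a sketch only; the authors use the libsvm implementation [33], and the parameter defaults here are arbitrary):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                                  # (x . x')

def poly_kernel(x, xp, d=3):
    return (x @ xp + 1.0) ** d                     # (x . x' + 1)^d

def rbf_kernel(x, xp, sigma2=1.0):
    diff = x - xp
    return np.exp(-np.dot(diff, diff) / sigma2)    # exp(-||x - x'||^2 / sigma^2)
```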

5.5. Trees

Trees are conceptually simple but powerful tools for classification and regression. For our purpose, we use the classification and regression trees (CARTs) as described in Breiman et al. [4]. The main feature of the CART algorithm is the binary decision rule that is introduced at each tree node with respect to the information content of the split. In this way, the most discriminating binary splits sit near the tree root, building a hierarchical decision scheme. A sketch of a decision tree is shown in Figure 3. It is known that trees have a high variance, so they benefit from the ensemble approach [20]; such tree ensembles are also known as random forests. The parameters of the tree models are related to splitting the tree nodes (the impurity measure and the split criterion; see [2] for a detailed description).
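A minimal scikit-learn sketch of such a tree ensemble (a random forest) is shown below; the parameter names and values are scikit-learn's and purely illustrative, not the toolbox settings used in the study:

```python
from sklearn.ensemble import RandomForestClassifier

# An ensemble of high-variance CART-style trees (a random forest).
forest = RandomForestClassifier(
    n_estimators=25,
    criterion="gini",      # impurity measure for the binary splits
    max_features="sqrt",   # random feature subset at each split
    random_state=0,
)
# Usage: forest.fit(X_train, y_train); forest.predict(X_test)
```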

5.6. Nearest-neighbor classifier

A K-nearest-neighbor classifier (KNN) takes a weighted average over the labels zi of those observations in the training set that are closest to the query point x. This can be written as

(14)
$$f(x) = \frac{1}{\sum_i w_i} \sum_{z_i \in N_k(x)} w_i z_i,$$
where Nk(x) denotes the k-element neighborhood of x, defined in a given metric, and wi is a weight derived from the related distance. Common choices are the L1, L2, and L∞ metrics. The parameters of the model are the number of neighbors and the choice of the metric. KNNs offer a very intuitive approach to classification problems because they are based on the concept of similarity. This works well in low dimensions but leads to problems in high dimensions, known as the curse of dimensionality [34].
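A minimal NumPy sketch of (14) under the L2 metric follows; inverse-distance weights are one common choice, assumed here since the paper does not fix the weighting:

```python
import numpy as np

def knn_predict(X_train, z_train, x, k=5, eps=1e-12):
    """Distance-weighted k-NN average of (14) under the L2 metric.
    z_train holds the binary class labels of the training points."""
    d = np.linalg.norm(X_train - x, axis=1)      # L2 distances to query x
    idx = np.argsort(d)[:k]                      # k nearest neighbors N_k(x)
    w = 1.0 / (d[idx] + eps)                     # inverse-distance weights
    return np.sum(w * z_train[idx]) / np.sum(w)  # value in [0, 1]
```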


6. APPLICATION TO THE CLINICAL DATA

We compared the model classes described above in a unified framework under fair conditions. We trained an ensemble of each model class consisting of 11 ensemble members (11 CV-folds in the training scheme described in Section 4). The performance of each ensemble model was assessed on the 20% of the data (validation set) that was initially removed and never included in model training (see Figure 2). This procedure was independently repeated 20 times; that is, the random removal of 20% of the data, the construction of a classification-model ensemble on the remaining 80% of the data as outlined in Section 4, and the final test on the unseen validation data were performed each time. Finally, the mean prediction values with respect to the validation set were calculated and are listed in Table 2.

In some cases, it is useful to apply a kind of data preprocessing such as balancing. If the distribution of the two classes differs in the sense that one class is represented by only a small number of examples, we can balance the data in the training set. This can improve the convergence of several training algorithms and also has an impact on the classification error [35]. We apply balancing by reducing the number of samples in the larger class until the ratio of the class labels is balanced (a sketch of this step appears later in this section). The ratio of the class labels in the validation set was never changed because it reflects the real data distribution; balancing was only applied to the training data.

We used three different performance measures in order to compare the different classification models. To define them, we consider the four possible outcomes of a classification, which can be arranged in a 2 × 2 confusion matrix, as shown in Table 1. The accuracy,

(15)
$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn},$$
seems to be the canonical error measure for almost all classification problems if the dataset is balanced. Other important measures are the specificity, which quantifies how well a binary classification model correctly identifies the negative cases (non-PCa patients),
(16)
$$\text{Specificity} = \frac{tn}{tn + fp},$$
and the sensitivity, which is the proportion of true positives of all diseased cases (PCa patients) in the population,
(17)
$$\text{Sensitivity} = \frac{tp}{tp + fn}.$$
A high sensitivity is required when early diagnosis and treatment are beneficial, which is the case in PCa.

The precision or positive predictive value (PPV) is given by

(18)
$$\text{PPV} = \frac{tp}{tp + fp},$$
and is the proportion of patients with positive test results who are correctly diagnosed. The F-Score is the harmonic mean of precision and sensitivity,
(19)
$$\text{F-Score} = \frac{2 \cdot \text{Sensitivity} \cdot \text{PPV}}{\text{Sensitivity} + \text{PPV}},$$
and it is useful if the classes in the classification problem are not equally distributed. Another measure is the area under the curve (AUC), where the curve is the receiver operating characteristic (ROC) curve. The ROC curve is the graphical plot of the sensitivity versus (1 − specificity) for a binary classifier as its discrimination threshold is varied.

The ROC curve offers the opportunity to calculate the specificity at a fixed sensitivity level and vice versa. This is important because, from the clinical point of view, a high sensitivity of 95% is desired in order to detect nearly all patients with PCa. To avoid a high false-positive rate, we computed the specificity at the level of 95% sensitivity (SPS95) from the ROC curve as another important performance measure.
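SPS95 can be read directly off the ROC curve. A minimal Python sketch using scikit-learn's roc_curve is given below (an illustration under our naming, not the authors' code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def specificity_at_sensitivity(y_true, scores, level=0.95):
    """SPS95: specificity read off the ROC curve at >= 95% sensitivity."""
    fpr, tpr, _ = roc_curve(y_true, scores)  # tpr is the sensitivity
    ok = tpr >= level                        # points meeting the level
    return float(1.0 - fpr[ok].min())        # best specificity among them
```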

If we compare the outcome of the statistical analysis of the model performance, as listed in Table 2 for the unbalanced case and in Table 3 for the balanced case, we can state that the differences between the classifiers are marginal. Even the more sophisticated classification models (SVMs or mixed ensembles) could not outperform the robust linear candidates (LDA/PDA).
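For reference, the balancing used for the results of Table 3 amounts to downsampling the majority class in the training data only; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def balance_training_set(X, y, seed=0):
    """Downsample the majority class until both class labels occur
    equally often; applied to training data only, never to validation."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))            # size of the smaller class
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```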

Tables 2 and 3 present the main results, with only small differences between the classifiers. The standard deviations of the performance measures are given, except for the ROC-curve-based measures (AUC and SPS95). Estimating the variability of ROC-based measures is a genuinely complex problem; most papers in this field do not discuss it, and it cannot be solved within the scope of this paper, but it should be mentioned. For an example of a special solution to this problem, see the paper by Hilgers [36].


7. CONCLUSIONS

We compared several classification models with respect to their ability to recognize prostate cancer at an early stage. This was done in an ensemble framework in order to estimate proper model parameters and to increase the classification performance. It turned out that all models under investigation performed very well, with only marginal differences, and are comparable with similar studies, for example, Finne et al. [37], Remzi et al. [38], or Zlotta et al. [39]. In future research, it should be investigated whether these results are valid for other populations of patients (e.g., screening data) and other PSA test assays, and whether the classification performance could be increased by including new variables or by splitting the groups of patients into different PSA ranges.


References
1. Jemal A, Siegel R, Ward EM, Thun MJ. Cancer Facts & Figures 2006. Atlanta, Ga, USA: Department of Epidemiology and Surveillance Research, American Cancer Society; 2006. Tech. Rep.
2. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY, USA: Springer; 2001. (Springer Series in Statistics).
3. Bard Y. Nonlinear Parameter Estimation. New York, NY, USA: Academic Press; 1974.
4. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Monterey, Calif, USA: Wadsworth and Brooks; 1993.
5. Bishop CM. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press; 1995.
6. Vapnik VN. The Nature of Statistical Learning Theory. New York, NY, USA: Springer; 1999.
7. Boser BE, Guyon I, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Conference on Computational Learning Theory (COLT '92); July 1992; Pittsburgh, Pa, USA. pp. 144–152.
8. Merkwirth C, Parlitz U, Lauterborn W. Fast nearest-neighbor searching for nonlinear signal processing. Physical Review E. 2000;62(2):2089–2097.
9. Wichard JD, Merkwirth C. ENTOOL—a Matlab toolbox for ensemble modeling. 2007. http://www.j-wichard.de/entool/
10. Stephan C, Cammann H, Semjonow A, et al. Multicenter evaluation of an artificial neural network to increase the prostate cancer detection rate and reduce unnecessary biopsies. Clinical Chemistry. 2002;48(8):1279–1287. [PMID: 12142385]
11. Stephan C, Klaas M, Müller C, Schnorr D, Loening SA, Jung K. Interchangeability of measurements of total and free prostate-specific antigen in serum with 5 frequently used assay combinations: an update. Clinical Chemistry. 2006;52(1):59–64. [PMID: 16391327]
12. Perrone MP, Cooper LN. When networks disagree: ensemble methods for hybrid neural networks. In: Mammone RJ, ed. Neural Networks for Speech and Image Processing. New York, NY, USA: Chapman-Hall; 1993. pp. 126–142.
13. Krogh A, Sollich P. Statistical mechanics of ensemble learning. Physical Review E. 1997;55(1):811–825.
14. Krogh A, Vedelsby J. Neural network ensembles, cross validation, and active learning. In: Tesauro G, Touretzky D, Leen T, eds. Advances in Neural Information Processing Systems 7. Cambridge, Mass, USA: MIT Press; 1995. pp. 231–238.
15. Geman S, Bienenstock E, Doursat R. Neural networks and the bias/variance dilemma. Neural Computation. 1992;4(1):1–58.
16. Kong EB, Dietterich TG. Error-correcting output coding corrects bias and variance. In: Proceedings of the 12th International Conference on Machine Learning (ICML '95); July 1995; Tahoe City, Calif, USA. pp. 313–321.
17. Kohavi R, Wolpert D. Bias plus variance decomposition for zero-one loss functions. In: Saitta L, ed. Proceedings of the 13th International Conference on Machine Learning (ICML '96); July 1996; Bari, Italy. Morgan Kaufmann; pp. 275–283.
18. Domingos P. A unified bias-variance decomposition for zero-one and squared loss. In: Proceedings of the 17th National Conference on Artificial Intelligence; July-August 2000; Austin, Tex, USA. pp. 564–569.
19. Naftaly U, Intrator N, Horn D. Optimal ensemble averaging of neural networks. Network: Computation in Neural Systems. 1997;8(3):283–296.
20. Breiman L. Bagging predictors. Machine Learning. 1996;24(2):123–140.
21. Schaffer C. Selecting a classification method by cross-validation. In: Proceedings of the 4th International Workshop on Artificial Intelligence and Statistics; January 1993; Fort Lauderdale, Fla, USA. pp. 15–25.
22. Quinlan JR. Comparing connectionist and symbolic learning methods. In: Computational Learning Theory and Natural Learning Systems. Vol. 1. Cambridge, Mass, USA: MIT Press; 1994. pp. 445–456.
23. Guyon I, Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research. 2003;3:1157–1182.
24. Wolpert DH. Stacked generalization. Neural Networks. 1992;5:241–259.
25. Wichard JD, Merkwirth C, Ogorzałek M. Detecting correlation in stock markets. Physica A. 2004;344(1-2):308–311.
26. Rothfuss A, Steger-Hartmann T, Heinrich N, Wichard JD. Computational prediction of the chromosome-damaging potential of chemicals. Chemical Research in Toxicology. 2006;19(10):1313–1319. [PMID: 17040100]
27. Hosmer DW, Lemeshow S. Applied Logistic Regression. New York, NY, USA: John Wiley & Sons; 1989.
28. Riedmiller M, Braun H. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: Proceedings of the IEEE International Conference on Neural Networks; March-April 1993; San Francisco, Calif, USA. Vol. 1. pp. 586–591.
29. Igel C, Hüsken M. Improving the Rprop learning algorithm. In: Bothe H, Rojas R, eds. Proceedings of the 2nd International ICSC Symposium on Neural Computation (NC 2000); May 2000; Berlin, Germany. Academic Press; pp. 115–121.
30. Merkwirth C, Wichard JD, Ogorzałek M. Stochastic gradient descent training of ensembles of DT-CNN classifiers for digit recognition. In: Proceedings of the 16th European Conference on Circuit Theory and Design (ECCTD '03); September 2003; Kraków, Poland. Vol. 2. pp. 337–341.
31. Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press; 2000.
32. Vapnik VN, Tscherwonenkis AJ. Theorie der Zeichenerkennung. Berlin, Germany: Akademie; 1979.
33. Chang CC, Lin CJ. LIBSVM—a library for support vector machines. 2001.
34. Bellman RE. Adaptive Control Processes. Princeton, NJ, USA: Princeton University Press; 1961.
35. Merkwirth C, Ogorzałek M, Wichard JD. Stochastic gradient descent training of ensembles of DT-CNN classifiers for digit recognition. In: Proceedings of the 16th European Conference on Circuit Theory and Design (ECCTD '03); September 2003; Kraków, Poland. Vol. 2. pp. 337–341.
36. Hilgers RA. Distribution-free confidence bounds for ROC curves. Methods of Information in Medicine. 1991;30(2):96–101. [PMID: 1857255]
37. Finne P, Finne R, Auvinen A, et al. Predicting the outcome of prostate biopsy in screen-positive men by a multilayer perceptron network. Urology. 2000;56(3):418–422. [PMID: 10962306]
38. Remzi M, Anagnostou T, Ravery V, et al. An artificial neural network to predict the outcome of repeat prostate biopsies. Urology. 2003;62(3):456–460. [PMID: 12946746]
39. Zlotta AR, Remzi M, Snow PB, Schulman CC, Marberger M, Djavan B. An artificial neural network for prostate cancer staging when serum prostate specific antigen is 10 ng./ml. or less. Journal of Urology. 2003;169(5):1724–1728. [PMID: 12686818]

Article Categories:
  • Research Article

