Genomewide prediction of discrete traits using Bayesian regressions and machine learning.  
Jump to Full Text  
MedLine Citation:

PMID: 21329522 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

BACKGROUND: Genomic selection has gained much attention and the main goal is to increase the predictive accuracy and the genetic gain in livestock using dense marker information. Most methods dealing with the large p (number of covariates) small n (number of observations) problem have dealt only with continuous traits, but there are many important traits in livestock that are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance). It is necessary to evaluate alternatives to analyze discrete traits in a genomewide prediction context. METHODS: This study shows two threshold versions of Bayesian regressions (Bayes A and Bayesian LASSO) and two machine learning algorithms (boosting and random forest) to analyze discrete traits in a genomewide prediction context. These methods were evaluated using simulated and field data to predict yettobe observed records. Performances were compared based on the models' predictive ability. RESULTS: The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, differences were small and disappeared with a large number of QTL. Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas Random Forest presented the highest classification performance. Random Forest was the most consistent method in detecting resistant and susceptible animals, phi correlation was up to 81% greater than Bayesian regressions. Random Forest outperformed other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated. Boosting and Bayes A were more accurate with crossbred data. CONCLUSIONS: The results of this study suggest that the best method for genomewide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals. Among the different alternatives proposed to analyze discrete traits, machinelearning showed some advantages over Bayesian regressions. Boosting with a pseudo Huber loss function showed high accuracy, whereas Random Forest produced more consistent results and an interesting predictive ability. Nonetheless, the best method may be casedependent and a initial evaluation of different methods is recommended to deal with a particular problem. 
Authors:

Oscar GonzálezRecio; Selma Forni 
Publication Detail:

Type: Journal Article Date: 20110217 
Journal Detail:

Title: Genetics, selection, evolution : GSE Volume: 43 ISSN: 12979686 ISO Abbreviation: Genet. Sel. Evol. Publication Date: 2011 
Date Detail:

Created Date: 20110314 Completed Date: 20110510 Revised Date: 20130630 
Medline Journal Info:

Nlm Unique ID: 9114088 Medline TA: Genet Sel Evol Country: France 
Other Details:

Languages: eng Pagination: 7 Citation Subset: IM 
Affiliation:

INIA, Ctra La Coruña km 7,5, 28040 Madrid, Spain. gonzalez.oscar@inia.es 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Algorithms* Animals Artificial Intelligence Bayes Theorem Computer Simulation GenomeWide Association Study* Humans Plants / genetics Polymorphism, Single Nucleotide Quantitative Trait Loci 
Comments/Corrections 
Full Text  
Journal Information Journal ID (nlmta): Genet Sel Evol Journal ID (isoabbrev): Genet. Sel. Evol ISSN: 0999193X ISSN: 12979686 Publisher: BioMed Central 
Article Information Download PDF Copyright ©2011 GonzálezRecio and Forni; licensee BioMed Central Ltd. openaccess: Received Day: 31 Month: 5 Year: 2010 Accepted Day: 17 Month: 2 Year: 2011 collection publication date: Year: 2011 Electronic publication date: Day: 17 Month: 2 Year: 2011 Volume: 43 Issue: 1 First Page: 7 Last Page: 7 ID: 3400433 Publisher Id: 12979686437 PubMed Id: 21329522 DOI: 10.1186/12979686437 
Genomewide prediction of discrete traits using bayesian regressions and machine learning  
Oscar GonzálezRecio1  Email: gonzalez.oscar@inia.es 
Selma Forni2  Email: selma.forni@pic.com 
1INIA. Ctra La Coruña km 7.5, 28040 Madrid. Spain 

2Genus Plc, 100 Bluegrass Commons Blvd. Ste 2200. Hendersonville, TN, USA 
The availability of thousands of markers from high throughput genotyping platforms offers an exciting prospect to predict the outcome of complex traits in animal breeding using genomic information (the socalled genomic selection) and in personalized medicine. Besides production and other functional traits, genomic selection offers a novel challenge for discovering genetic variants affecting important diseases in humans, plants and livestock, and also for breeding resistant individuals to improve farm profitability.
The statistical treatment of the genetic basis of these traits is not straightforward because multiple genes, gene by gene interactions and gene by environment interactions underlie most complex traits and diseases. Capturing all marker signals is currently challenging. Besides the large p small n problem, the statistical treatment of the categorical nature of a trait may increase parameterization. So far, methods dealing with genomeassisted evaluations have focused on traits expressed or recorded in a continuous and Gaussian manner [^{1}^{}^{3}]. However, other traits (e.g. disease, survival) are generally recorded in a binary or fewclassed manner (e.g. healthy or sick, number of occurrences, status). Most methods dealing with genomeassisted evaluations may be extended in a relatively well known manner to analyze categorical traits [^{4}^{}^{6}]. A larger amount and various types of genomic information (e.g. single nucleotide polymorphisms, copy number variants or DNA sequencing) for several species are likely to be available in the future. Using this large amount of data may be highly informative, yet quite challenging for current methods from the point of view of computation efficiency. Genomewide association studies (GWAS) and genomic selection methods must be adapted to cope with these challenges.
Machinelearning is becoming more and more popular to deal with the difficulties stated above, and has been previously applied in GWAS in humans [^{7}] and livestock [^{8}^{}^{10}]. Machinelearning methods aim at improving a predictive performance measure by repeated observation of experiences. They are model specification free, and may capture hidden information from large databases. This is appealing in a genomic information context in which multiple and complex relationships between genes exist. The ensemble methods, such as Random Forest (RF) algorithms [^{11}] and boosting [^{12}], are the most appealing alternatives to analyze complex discrete traits using dense genomic markers information, and have been previously applied in GWAS for human diseases [^{13},^{14}]. They may provide a measurement of the importance of each marker on a given trait and good predictive performance. Boosting has been previously applied in a genomic selection context for regression problems using the L_{2 }loss function [^{8}]. RF and boosting do not require specification of the mode of inheritance and hence may account for nonadditive effects. Further, they are fast algorithms, even when handling a large amount of covariates and interactions, and can be applied to both classification and regression problems.
The objective of this study was to present the threshold extension of two Bayesian regression methods that are used in genomeassisted evaluations (Bayes A and Bayesian LASSO), a boosting algorithm for discrete traits, to describe more thoroughly the RF alternative to deal with discrete traits in a genomewide prediction context, and to apply them to both simulated and real data to compare their predictive ability.
Let y ={y_{i}} be a vector of phenotypes recorded in a binary fashion (0/1) from n animals genotyped for p markers X = {x_{i}}. Four different methods were applied: two linear regressions using a Bayesian framework, and two machinelearning ensemble algorithms.
A threshold version of Bayes A (TBA) model was proposed here, which is an extension of the Bayesian regression proposed by Meuwissen et al. [^{1}]. The traditional threshold model [^{4}] postulates that there is an underlying random variable, called liability (λ) that follows a continuous distribution, and that the observed dichotomy is the result of the position of the liability with respect to a fixed threshold (t):
phenotype={0 if λ <t1 if λ≥t 
The liability is taken as the response variable. The proposed modification consists of the linear regression of the single nucleotide polymorphism (SNP) coefficients on a liability variable with Gaussian distribution. The TBA can be described as follows:
λ=μ1+Xb+e 
where, λ is the underlying liability variable vector for y, μ is the population mean, 1 is a column vector (n×1) of ones; b = {b_{j}} corresponds to the vector for the regression coefficient estimates of the p markers or SNP assumed normally and independently distributed a priori as N(0, σj2), where σj2 is an unknown variance associated with marker j. The prior distribution of σj2 is assumed to be distributed as the scaled inverse chisquare σj2~υjsj2χυj−1, with υ_{j }= 4 and sj2=0.002. Elements of the incidence matrix X, of order n × p, may be set up as for different additive, dominant or epistatic models. In the more practical scenario, it takes values 1, 0 or 1 for marker genotypes aa, Aa and AA, respectively. The residuals (e) are assumed to be distributed as N(0, σe2), with residual variance σe2=1, as stated above. As in a regular threshold model, two parameters have to be set fixed (e.g. threshold and the residual variance are set to zero and one, respectively) since these parameters are not identifiable in a liability model.
This method can be solved via the Gibbs sampler described in Meuwissen et al. [^{1}], with the simple incorporation of the data augmentation algorithm to sample the individual liabilities from their corresponding truncated normal distribution as described in Tanner and Wong [^{15}]. The joint posterior distribution of the n liabilities is:
Prob(λμ,b,t)=∏i=1n{Φ[t−(μ+xib)]σe}1−yi{1−Φ[t−(μ+xib)]σe}yi 
The Bayesian LASSO described by Park and Casella [^{16}] and its version for genomic selection detailed in de los Campos et al. [^{17}] can also be extended to discrete traits [^{18}]. As stated in the previous model, the response variable is a liability response (λ) that follows a continuous distribution. The Bayesian threshold LASSO (BTL) can be solved as:
λ=μ1+Xβ+e, 
where λ is the vector of liabilities for all individuals, μ is the population mean, 1 is a column vector (n × 1) of ones; β∧ are the LASSO estimates with their respective incidence matrix X as described for model TBA. As a modeling choice, e was considered the vector of independently and identically distributed residuals, as e∼N(0,σe2). In accordance with tradition, we fixed the threshold to be 0 and the residual variance to be 1 as described for model TBA; alternate choices result in the same model.
In a fully Bayesian context, the LASSO estimates (β∧) can be interpreted as posterior modes estimates when the regression parameters have independent and identical doubleexponential priors [^{19}]. Park and Casella [^{16}] have proposed a conditional Laplace prior specification for the LASSO estimates of the form:
p(βσe2)=∏j=1pγ2σe2e−γβj/σe2, 
where σe2 is the residual variance, and γ is a parameter controlling the shrinkage of the distribution. Inferences about γ may be done in different ways [^{16}]. To follow the Bayesian specifications, a gamma prior is proposed here for γ^{2}, with known rate (r) and shape (δ) hyperparameters, as described by de los Campos et al. [^{17}]. Samples from posterior distributions of those estimates may be drawn from the Gibbs sampling algorithm described in de los Campos et al. [^{17}], with the corresponding data augmentation algorithm for liabilities, as described for TBA.
Gradient boosting may be classified as an ensemble method [^{20}]. This algorithm combines different predictors in a sequential manner with some shrinkage on them [^{12}] and performs variable selection. Gradient boosting forms a "committee" of predictors with potentially greater predictive ability than that of any of the individual predictors in the form:
y=μ+∑m=1Mvhm(y;X) 
Each predictor (h_{m}(y; X) for m ∈ (1, M)) is applied consecutively to the residual from the committee formed by the previous ones. This algorithm can be calculated using importance sampling learning ensembles as follows:
(Initialization): Given data (y, X), let the prediction of phenotypes be F_{0 }= μ, with μ being the population mean.
Then, for m in {1 to M}, with M being large, calculate the loss function (L) for (yi,Fm−1(xi)+h(yi;xi,jm))
where j_{m }is the SNP (only one SNP is selected at each iteration) that minimizes ∑i=1nL(yi,Fm−1(xi)+h(yi;xi,jm)) at iteration m, h(y_{i}; x_{i}, j_{m}) is the prediction of the observation using SNP j at the current iteration, F_{m}_{1}(x_{i}) is the updated prediction at the previous iteration and L(·) is a given loss function. The updated prediction at each iteration m may be expressed as F_{m}(x_{i}) = F_{m}_{1}(x_{i})+v·h(y_{i}; x_{i}, j_{m}) with v being some shrinkage factor that, without loss of generality, can be assumed constant and small (0<v <1), but it may be optimized to balance predictive ability and computation time.
Therefore, after the initialization, the algorithm flows as follows:
Step 1: Compute residuals as rm=y−∑i=0m−1v⋅Fm−1(xi), and fit the weak learner for each SNP j (j ∈{1,..., p}) to current residuals, where ν was set to 0.01.
Step 2: Select SNP j, where j=argminj∑i=1nL(yi,Fm−1(xi)+h(yi;xi,jm)), i.e. the SNP minimizing the loss function.
Step 3. Update predictions as F_{m}(x_{i}) = F_{m}_{1}(x_{i})+ν·h(y_{i}; x_{i}, j_{m}), (i∈{1,..., n}), where h(y_{i}; x_{i}, j_{m}) is the estimate for individual i obtained by regressing the current residual (r_{i}) at iteration m on its genotype for the SNP selected in step 2.
Step 4: Increase the iteration index m by 1, and repeat steps 24 until a convergence criterion is reached.
Here, we used ordinary least square regression as predictor h(y; X) and two different loss functions: the L_{2 }loss function (L2B), which is a quadratic error term in the form (y_{i}F_{m }(y_{i}; x_{i}, j_{m}))^{2}, and a pseudoHuber loss function (LhB) in the form log[cosh(yi−Fm(yi;xi,jm))]. The pseudo Huber loss function is a priori more appealing for discrete traits because it is continuous, differentiable, greater than or equal to the logit loss function and overcomes the disadvantage of the squared loss by becoming more linear when (y_{i}F_{m}(y_{i}; x_{i}, j_{m})) tends to infinite. The choice of the number of iterations, M, is a model comparison problem which may be overcome in many different ways [^{12},^{20}]. Here, a crossvalidation design was used as described in GonzálezRecio et al. [^{8}]. More details on the gradient boosting can be found in Freund and Schaphire [^{21}], Friedman [^{12}] and GonzálezRecio et al. [^{8}].
Random Forest can be viewed as a machine learning ensemble algorithm and was first proposed by Breiman [^{11}]. It is massively nonparametric, robust to overfitting and able to capture complex interaction structures in the data, which may alleviate the problems of analyzing genomewide data. This algorithm constructs many decision trees on bootstrapped samples of the data set, averaging each estimate to make final predictions. This strategy, called bagging [^{22}], reduces error prediction by a factor of the number of trees.
A RF algorithm aimed at genomewide prediction is described next, in a more extensive manner than the previous methods, as this is the first time that this algorithm is used in a genomic breeding value prediction context:
Let y (n × 1) be the data vector consisting of discrete observations for the outcome of a given trait, and X = {x_{i}} where x_{i }is a (p × 1) vector representing the genotype of each animal (0, 1 or 2) for p SNP, to which T decision trees are built (see classification and regression tree theory e.g. [^{20}]). Note that main SNP effects, SNP interactions, environmental factors or combinations thereof may be also included in x_{i}. This ensemble can be described as an additive expansion of the form:
y=μ+∑t=1Tctht(y;X) 
Each tree (h_{t}(y; X) for t∈(1, T)) is distinct from any other in the ensemble as it is constructed from n samples from the original data set selected at random with replacement, and at each node only a small group of SNP are randomly selected to create the splitting rule. Each tree is grown to the largest extent possible until all the terminal nodes are maximally homogeneous. Then, c_{t }is some shrinkage factor averaging the trees. The trees are independent identically distributed random vectors, each of them casting a unit vote for the most popular outcome of the disease at a given combination of SNP genotypes.
Each tree minimizes the average loss function of the bootstrapped data, and is constructed using a heuristic approach as follows:
1. First, bootstrapped samples from the whole data set are drawn with replacement so that realization (y_{i}, x_{i}) may appear several times or not at all in the bootstrapped set Ψ^{(t) }t = (1,..., T).
2. Then, draw mtry out of p SNP markers at random, and select the SNP j, j∈(1,..., mtry), where
j=argminjL(y,ht(X)), 
with L(y, h_{t}(X)) being a certain loss function. i.e. SNP j is the one that minimizes a given loss function at the current node, and is selected in this step. The algorithm takes a fresh look at the data that have arrived at each node and evaluate all possible splits. Many loss functions can be chosen (e.g. logit function, squared loss function, misclassification rate, entropy, Gini index, ...). The behavior of a given loss function may depend on the nature of the problem. The squared loss function is popular for continuous response variables, and the logit function for categorical responses.
3. Split the node in two child nodes according to SNP j genotype that one individual may or may not have (e.g. individuals with the risk allele will pass to a child node, and the remaining animals will pass to the other child node).
4. Repeat steps 23 until a minimum node size is reached (usually <5). The predicted value of the genotype x_{i }is the majority vote for the outcome at the terminal nodes (for regression problems, it is the average phenotype of the individuals in the node).
Finally, a large amount of trees are constructed repeating steps 14 to grow a random forest. The forest may be stopped when the generalization error averaged across the out of bag samples (see section below) have converged. Convergence may be visually tested but it may also be determined using traditional methods for convergence testing of Monte Carlo Markov chains.
Final predictions can be made by averaging the values predicted at each tree to obtain a probability of being susceptible. In a naïve 0 = nonsusceptible/1 = susceptible scenario, individuals with probability <0.5 may be considered as nonsusceptible. To predict observations of new individuals, their marker genotypes are passed down each tree, and the estimate of the corresponding terminal nodes is assigned to the new individual in each tree. The predictions of each tree in the RF algorithm are averaged for each animal to compute the final prediction.
There are two main aspects that can be tuned in random forest: the first one is the number of SNP or covariates sampled at random for each node (mtry). Generalized crossvalidation strategies can be used to optimize mtry. In high dimensional problems such as GWAS, Goldstein et al. [^{23}] have suggested mtry to be fixed to >0.1 p. The algorithm may speed up for smaller mtry values. Nonetheless, crossvalidation can be used to determine the best value of mtry for each trait, although at an expense of increasing computation time. Genetic background may influence the behavior of this tuning parameter. The second aspect is the criterion to select the best SNP to split the node. As commented above, different criteria may be used and the best choice may depend on the nature of the problem. Entropy theory seems the most appealing to evaluate genomic information on discrete traits (as concluded from pilot studies, results not shown). Other loss functions such as the L_{1}loss function or the misclassification rate could be implemented in an easy manner. Without loss of generality we show how to implement the entropy theory in the node splitting decision. The information gain (IG) for each covariate s drawn at random in a given node was calculated as described in Long et al. [^{9}]:
Suppose there are Nk+ individuals with genotype k (k ∈ {0, 1, 2}) at each SNP covariate x_{j }showing y = 1 (e.g. presence of disease) at such node, and Nk− individuals with the same genotype with y = 0 (e.g. absence of disease). The information gain for each covariate x_{j }can be calculated as:
IG(xj)=H(Pr(Y))−∑k=12(∑C=+,−NkCN(−∑C=+,−NkCNklog2NkCNk)) 
where Nk=Nk++Nk−, and H(Pr(y))=−∑y∈APr(y)log2Pr(y) is the entropy of the probability distribution of y, and A is the set of all states that y can take ({0,1}). The SNP covariate with the highest IG at each node is used to split the node into two new child nodes, each one containing the individuals from the parent node with the risk or the nonrisk allele, respectively.
There are two features involved in the RF algorithm that deserve further attention: the out of bag samples, and the variable importance.
The out of bag data (OOB) is an interesting feature of RF. Each tree is grown using a bootstrapped sample of the data, which leaves roughly one third of the observations out because some animals will appear more than once and others will not appear at all. The samples that do not appear are called the OOB samples. The OOB acts as a tuning/validation set at each tree and is almost identical to a nfold cross validation, removing the need for a set aside test or tune test. Tuning of parameters can be done along the RF using the OOB, and generalization error can be calculated as the error rate of the OOB [^{11},^{24}].
RF may use the OOB to provide an importance measure of predictor variables (SNP or environmental effects). The relative variable importance (VI) is estimated as follows. After each tree is constructed, the OOB are passed down the tree and the prediction accuracy of disease outcome is calculated using the chosen criterion (e.g. misclassification rate, L_{2 }loss function). Then, genotypes for the pth SNP are permuted in the OOB, and the accuracy for the permuted SNP is again calculated. The relative importance is calculated as the difference between these prediction accuracies (that of the original OOB and that of the OOB with the permuted variable). This step is repeated for each covariate (SNP) and the decrease of accuracy is averaged over all trees in the random forest. The variable importance may be expressed as a percentage of the accuracy obtained with the most important SNP, and provides insight in the level of association of the SNP with the disease. The SNP with higher VI may be of interest for prediction of trait susceptibility (e.g. disease resistance, low fertility) at low marker density, candidate gene studies or gene expression studies.
Our own java code has been developed for implementing RF for categorical or continuous traits under a genomewide prediction context, and is available upon request to the authors.
Simulated and field data sets were used for the model comparisons. Description of these data is given next.
QMSim software [^{25}] was run to simulate a population of thousands of animals genotyped for roughly 10,000 markers. First, 1000 historical generations were generated in a population with effective size decreasing from 1400 to 400 to mimic a bottleneck, in order to produce a realistic level of LD for the platform used in the simulation. At this point, 40 generations were generated to achieve a population size of 21,000 animals. Then, 20,000 females and 300 males from the last historical population were selected as founders, followed by 15 generations of selection for estimated breeding values from best linear unbiased predictions and random matings. During these generations, replacement ratio were set at 0.83 and 0.45 for males and females, respectively. A random sample of 2500 animals in generations 11 to 14 was used as training set, while the whole generation 15 was used as testing set (1500 animals). Phenotypes were simulated as a Gaussian distribution with heritability equal to 0.25. Then, the phenotype of the animals was coded as 0 or 1 depending on whether their simulated phenotype was below or above, respectively, of the population average (using only generation from 11 to 14), which creates a discrete scenario for the phenotypes.
A genome was simulated with 30 chromosomes 100 cM long. Two scenarios with different numbers of QTL were simulated. In the first, three QTL were randomly located along each chromosome with effects sampled from a gamma distribution. This generated 90 QTL affecting the trait that still segregated in the training population. A second scenario with 33 QTL per chromosome was also simulated with a total of 1000 QTL having some effect on the trait and following a traditional infinitesimal model specification.
Then, 9990 biallelic markers were uniformly distributed along the genome and coded as 0, 1 or 2, regarding the number of copies of the most frequent allele. Simulation was performed to obtain a linkage disequilibrium close to 0.33 (squared correlation of the alleles at two consecutive loci). Ten replicates were analyzed, and the mean and standard deviations are presented.
A field data set was used here to illustrate the behavior of the methods in classification problems applied to genomewide prediction of disease resistance in pigs. In this study we used one of the most important congenital diseases in pig industry as response variable: scrotal hernia (SH). Most affected individuals cannot feed effectively and consequently growth is affected [^{26}]. This leads to higher feed costs, slower throughput, lack of product uniformity and consequent loss in income. In a nucleus breeding population, such individuals cannot be considered for use as breeding stock and effectively end up as culls. Heritability estimates around 0.30 and prevalence between 1% have been reported previously for this trait [^{27},^{28}].
Data were provided by PIC North America, a Genus Plc company. The data set contained records of scrotal hernia incidence (score 0 or 1) in 2768 animals from three different lines. Animals from two purebred lines (A and B) were born in elite genetic nuclei, where environmental conditions were better controlled and risk of infections was lower. Animals from a crossbred line (C), from line A and other lines not used in this study, were born in commercial herds. Selection emphasis in line A was placed on reproduction and lean growth efficiency. Line B has been selected mainly for reproductive traits. Selection against scrotal hernia was equally emphasized in both lines A and B. The prevalence of the disease ranged between 1 and 2% in all lines. Genotypes of all animals with phenotypic records were obtained for 6742 SNP located in different genomic regions identified as candidate regions in previous studies [^{29},^{30}]. A comprehensive scan under the available marker density was performed with all chromosomes being covered. After genotype editing following Ziegler et al. [^{31}], 5302 SNP were retained, and all 923 total animals from line A, 919 from line B and 700 from line C were used. Fifty per cent of animals in the data set of each line were affected with scrotal hernia. For each individual and main effect for SNP jth, we defined two covariates xj1 and xj2, with xj1=1 if the genotype was aa (0, otherwise), and xj2=1 if the genotype was AA (0, otherwise).
Analyses within each line were performed leaving out the 15% youngest individuals, as testing set. The raw phenotype was used as dependent variable in a control case design. Note that systematic effects were not included as covariates for simplicity, although any covariate may be included in the algorithms without loss of generality. The predicted susceptibilities of animals in the testing set were the percentages of trees in a random forest that a given animal was considered as affected.
Performance of the models was based on predictive ability to correctly predict genetic susceptibility in the testing sets. The true genetic susceptibilities of individuals in the simulated data set are known. However, true genetic merits are unknown in the field data case. Therefore, predictive ability was evaluated in a different manner in the field data, as described below.
The true genetic susceptibilities were obtained from the simulations and followed a Gaussian distribution, whereas distributions of predicted susceptibilities were dependent on the model used. A Gaussian distribution was assumed for Bayesian regressions and an unknown distribution bounded between 0 and 1, representing the probability of individual i to be susceptible, for machine learning methods. Pearson's correlations were calculated between true and predicted genetic susceptibility merit for each model and simulated scenario.
In addition, the area (AUC) under the receiving operating characteristic curve was calculated for each model in each simulation. This curve is a graphical plot of the sensitivity, or true vs. false positive rate (1 − specificity) for a binary classifier system as its discrimination threshold changes [^{32}]. The AUC can be used as a model comparison criterion and can be interpreted as the probability that a given classifier assigns a higher score to a positive example than to a negative one, when the positive and negative examples are randomly picked. Individuals with a true genetic susceptibility above or below the population average were assumed positive or negative cases, respectively. Models with higher values of AUC are desirable and are considered more robust.
True genetic susceptibilities of individuals in the field data are unknown. Instead, estimated breeding values (EBV) for SH susceptibility obtained from routine genetic evaluation using the BLUP method [^{33}] were assumed as the true genetic values. Routine evaluations included 6.9 million animals in the pedigree and approximately 2.3 million records of SH. The effects of line, litter, farm, and month of birth nested into farm were included in the threshold animal model used in the analyses. This may indeed be a crude approximation because EBV were calculated under a linear model with strong assumptions of linearity, additivity, non migration or non selection, although millions of records and animals are used in these genetic evaluations and the accuracy ranged between 0.50 and 0.96 for 95% of the EBV. To minimize the issue of this approximation, animals were classified as susceptible or nonsusceptible. Nonsusceptible animals were those in the lower α percentile of the EBV distribution in each line, whereas those in the upper (1α) percentile were considered as susceptible (α ∈ {5,10,25,50}). Lower values of α selected the more extreme animals, thus a smaller approximation error is expected.
Predicted accuracy was calculated between these EBV (y) and predictions (y^) in the testing set from methods TBA, BTL, RF, L2B or LhB. The predictive accuracy was estimated using misclassification rate, the phi coefficient correlation, sensitivity and specificity.
The phi coefficient correlation is the equivalent to the Pearson's product moment correlation for binary variables. It can be calculated as
rφ=p(y^=1y=1)−p(y^=1)p(y=1)p(y^=1)p(y=1)p(y^=0)p(y=0) 
This coefficient may be not robust enough under certain circumstances such as those in which the categories are extremely uneven. Under these circumstances r_{ϕ }has a maximum absolute value determined by the distribution of y^ and y.
Sensitivity and specificity for a given classifier may be computed as
Sensitivity=number of TNnumber of TN+number of FP, 
and
Specificity=number of TPnumber of TP+number of FN 
Sensitivity measures the proportion of healthy animals that are identified as not being affected (TN = true negatives), whereas specificity measures the proportion of affected animals that are correctly identified as such (TP = true positives). Values of sensitivity and specificity closer to 1 are preferred. Specificity and sensitivity are more informative than raw rate of misclassification, as the latter does not differentiate if misclassification is on true healthy or true affected animals.
Furthermore, all animals in the respective testing sets were used to calculate the AUC statistic, described above, for each method within a line. Animals with SH were considered as positive examples, whereas animals without SH were considered negative examples. As stated before, AUC measures predictive ability and may be considered as a model comparison criterion. Higher AUC values are desirable, as mentioned above.
Table 1 shows the average predictive ability (standard deviations in parentheses) across replicates, measured as Pearson correlation, between true and predicted genomic values, and also using the AUC statistic for each model on each simulated data set. Machinelearning methods showed higher averaged accuracy in the simulated data set than Bayesian regression, although with a large standard deviation across replicates. Smaller differences between Bayesian regressions and machinelearning were found in the simulated scenario with 1000 QTL. TBA and L2B were the methods showing poorest accuracy (0.26 ± 0.10 and 0.24 ± 0.04, respectively) in the scenarios with 90 and 1000 QTL, respectively. The boosting algorithm, both L2B and LhB, achieved the highest averaged accuracy (0.370.41) in the simulated data set with a smaller number of QTL. In contrast, methods BTL and LhB showed better predictive ability in the 1000 QTL scenario, 0.35 ± 0.04 and 0.34 ± 0.06, respectively. Differences between methods within replicates were in accordance with the averages shown in Table 1, although standard deviations between methods across replicates were large. The AUC ranged between 0.610.66 for Bayesian regression and between 0.63 and 0.70 for machinelearning methods. Although similar values were found for all methods, RF showed higher and preferable classification performance according to this parameter (0.70 ± 0.07 for 90 QTL and 0.69 ± 0.04 for 1000 QTL). It is not possible to draw clear conclusions on the preferred method according to the number of QTL affecting the trait, in light of the results from the simulations. Nonetheless, there is a slightly better behavior of machinelearning on traits with a small number of genomic regions affecting the outcome of the trait. Previous studies have also shown good performance of boosting in dealing with different continuous traits in real data [^{8}].
Bayesian regression showed larger Pearson correlations than ensemble algorithms in the scenario with a larger number of QTL. Method BTL achieved the largest Pearson correlation (0.38), followed by TBA and LhB (0.33). Method RF showed the smallest Pearson correlation (0.22) in this simulated scenario and the largest AUC (0.72). This suggests that RF ranked individuals less accurately than other methods when a large number of QTL affects additively the trait, but the method is more accurate than other methods at discerning between healthy and affected individuals.
It must be pointed out that the simulated scenarios are purely additive and other more realistic scenarios with a more complex interaction between genes and biological pathways might provide different results.
The three data sets had a disease occurrence of 50%. The relative predictive importance obtained with RF for each SNP covariate xjl in each line is plotted in Figure 1. Many more SNP were identified as predictors of SH in line A than in line B and C, suggesting that many more genomic regions may be associated to SH in line A than in line B or C. Lines B and C showed few genomic regions with a large relative importance variable associated to the genetic resistance to SH. Thirty seven, four and six SNP had a larger relative variable importance than 50% in lines A, B and C, respectively. The odds ratio of SNP with VI > 50% ranged from 1.41 to 2.17 in line A, from 2.56 to 3.03 in line B and from 1.86 to 2.50 in line C, suggesting a considerable risk of being susceptible to SH of those animals carrying the unfavorable alleles. The SNP with the largest importance estimate (VI = 100%) in line C had also the maximum VI in line B, but had a VI < 21% in line A. These results suggest that the genetic variants presented in line B and C in this genomic region provide a relatively larger predictive ability of SH than genetic variants in the same genomic region in line A. The relative VI of the most important SNP in line A was lower than 2% in lines B and C, although other SNP in LD with those may have been detected in these lines. Fifty, 44 and 48 markers with VI greater than 99.5 percentile were found in lines A, B and C, respectively. Most represented chromosomes were SSC4, SSC7, SSC14 and SSC17 in line A, SSC1, SSC2, SSC6 and X chromosome in line B, and SSC8 in line C. Validation of these results and conclusions about their role in genetic or biological pathways should be performed on different populations and studies.
Tables 2, 3 and 4 show the predictive accuracy of each method within lines A, B and C, respectively. RF had an equal or better predictive accuracy in the pure lines at α = 0.05, 0.25 and 0.50, than the rest of methods used in this study. Only L2B achieved a larger phi correlation (1.00) than RF (0.75) in line B at α = 0.05, and BTL showed higher accuracy at α = 0.10 in the purebred lines. Misclassification rate and sensitivity + specificity followed similar trends. RF and L2B were the most accurate at correctly detecting the most extreme animals in lines A and B, respectively, i.e. lower misclassification, and larger r_{ϕ}, sensitivity and specificity were achieved at α = 0.05. RF and L2B achieved misclassification = 0, r_{ϕ }= 1, sensitivity = 1 and specificity = 1 at α = 0.05 in lines A and B, respectively, which means a perfect classification of the most extreme animals. At this α level, TBA and BTL showed misclassification = 17%, r_{ϕ }= 0.71 in line A and misclassification = 14%, r_{ϕ }= 0.75 in line B, and were either less sensitive or specific than RF and L2B. RF outperformed BTL at α = 0.05, 0.25 and 0.50 in lines A and B, whereas TBL achieved better predictive accuracy at α = 0.10. RF doubled the r_{ϕ }obtained with TBA at α = 0.50 in line A, and was 12% larger in Line B.
None of the methods was clearly preferred in the crossbred (line C), where similar phi correlations were found for RF, TBA and boosting, with larger robustness for LhB at α < 0.50. No differences were found between RF, TBA and LhB to correctly detect most extreme animals in the crossbred line. The Huber loss function was more robust than the squared sum of residuals at analyzing binary traits, in accordance with its resemblance with the L_{1 }loss function.
RF showed consistently larger AUC values than the other methods whichever line (Table 5), whereas a clear trend was not extracted from the AUC values of other methods. For instance, the boosting algorithms had larger AUC values (0.660.67) than Bayesian regression (0.62) in line C, but lower in line A (0.550.60 vs 0.640.65). This result also suggests that RF is less dependent on the choice of the threshold for classifying healthy and affected animals, providing larger stability to the classification.
The true genetic architecture of SH is obviously unknown and no conclusions on its relationship with the performance of the different methods can be extracted. There was no clear relationship between the preferred method and the number of relevant genomic regions identified in each line (Figure 1). The choice of the model to be used in genomewide prediction of traits like SH may depend on the interest of the breeder. For instance, detection of susceptible animals was done more accurately in line A using RF, whereas the Bayesian regressions were preferred in line B. Thus, a different method may be desired depending on the objective of the breeding program. The model with higher sensitivity would be preferred in a breeding program aiming at eradicating a given disease or trait. In a specifity+sensitivity scenario, RF was the best method at α = 0.05, 0.25 and 0.50, and it also showed the larger AUC values, regardless of the line.
Results showed that RF had the lowest risk, among methods used here, of misclassifying animals for lowmedium heritability discrete traits in all lines, although all methods had considerable misclassification risks at α = 0.50. However, in a disease resistance genomeassisted prediction context, for instance, we are mainly interested in correctly detecting the most susceptible or resistant animals (lower α values), and RF seemed to perform slightly better than the Bayesian regressions to detect susceptibility to SH in this population, mainly in line A. Note that the threshold versions presented here incorporate n liability variables to be estimated in the model, increasing the parameterization of the models, and therefore hampering their predictive ability.
Results from the analyses of the crossbred line were not conclusive, as different behaviors between methods were found for different α values. This may be explained by the larger genetic heterogeneity expected in line C which may not be captured with only 5000 markers.
A small number of animals was used in the testing set and only punctual estimates are given here. This may be important at low α levels with a smaller number of records. Uncertainty about these estimates may be reported from their posterior densities [^{34}] in the case of Bayesian methods and using bootstrap or crossvalidation strategies in the case of this version of RF [^{11}]. Uncertainties are not reported in this study because this data set aims at serving just as an example of three different models applied to discrete traits in a genomeassisted prediction context without overloading the discussion. Furthermore, the preferred model may be casespecific.
The misclassification rate and the logit function were also used as splitting criteria in RF but with poorer predictive ability (results not shown). Here, hyperparameters were set as fixed, although it is possible to assign them a prior distribution for their estimation [^{35}]. Nonetheless, a minor improvement on predictive ability is expected if the adhoc choice of the parameters is within a sensible range of values.
Two Bayesian regressions (TBA and BTL) and two machinelearning algorithms (RF and boosting) were proposed here to analyze discrete traits in a genomewide prediction context. Machinelearning performed better than Bayesian regression with a small number of QTL with pure additive effects. RF seemed to outperform other methods in the field data sets, with better classification performance within and across data sets. It is an elegant method with an interesting predictive ability for studies on discrete traits using whole genome information. It is also easily interpretable as it is based on naïve decision rules. The boosting algorithms may achieve high predictive accuracy if a casespecific loss function is used, although it may be influenced by genetic architecture. Comparison between Bayesian regressions was dependent on the data set used, although the threshold version of the Bayesian LASSO seemed to be preferred to the threshold Bayes A.
RF and boosting do not need an inheritance specification model and may account for nonadditive effects without increasing the number of covariates in the model or computing time. Results from this study showed some advantages in the use of machine learning to analyze discrete traits in genomewide prediction, although model comparisons for specific case problems are encouraged.
The authors declare that they have no competing interests.
SF participated in the statistical analyses, discussion of results, got access and edited the data, coordinated the study and helped writing the manuscript. OGR participated in the statistical analyses and editing of the data, development of the software, discussion of the results and drafted the manuscript. Both authors read and approved the final manuscript and contributed equally to this study.
Several people contributed to the accomplishment of this study. The authors express their gratitude to Dr. Alan Mileham and his team at Genus Plc for working with DNA samples and genotyping, Dr. Nader Deeb and Dr. Matthew Cleveland for designing the experiment, Dr. David McLaren for revising the manuscript and Dr. Denny Funk for supporting the partnership between institutions. Authors thank J.A. JiménezMontero for help with the simulations.
References
Meuwissen THE,Hayes BJ,Goddard ME,Prediction of total genetic value using genomewide dense marker mapsGeneticsYear: 20011571819182911290733  
Gianola D,PerezEnciso M,Toro MA,On marker assisted prediction of genetic value: beyond the ridgeGeneticsYear: 200316334736512586721  
Gianola D,Fernando RL,Stella A,GenomicAssisted prediction of genetic value with semiparametric proceduresGeneticsYear: 20061731761177610.1534/genetics.105.04951016648593  
Wright S,An analysis of variability in number of digits in an inbred strain of guinea pigsGeneticsYear: 19341950653617246735  
Gianola D,Theory and analysis of threshold charactersJ Animal SciYear: 19825410791096  
Villanueva B,Fernández J,GarcíaCortés LA,Varona L,Daetwyler HD,Toro MA,Accuracy of genomewide evaluation for disease resistance in aquaculture breeding programmesProceedings of the 9th World congress on genetics applied to livestock production: 16 August 2010; Leipzighttp://www.kongressband.de/wcgalp2010/assets/html/0325.htm  
Szymczak S,Biernacka JM,Cordell HJ,GonzálezRecio O,König IR,Zhang H,Sun YV,Machine learning in genomewide association studiesGenet EpidemiolYear: 200933S51S5710.1002/gepi.2047319924717  
GonzalezRecio O,Weigel KA,Gianola D,Naya H,Rosa GJM,L_{2}Boosting algorithm applied to high dimensional problems in genomic selectionGenet ResYear: 20109222723710.1017/S0016672310000261  
Long N,Gianola D,Rosa GJM,Weigel KA,Avendaño S,Machine learning classification procedure for selecting SNPs in genomic selection: Application to early mortality in broilersJ Animal Breed GenetYear: 200712437738910.1111/j.14390388.2007.00694.x  
Long N,Gianola D,Rosa GJM,Weigel KA,Kranis A,GonzálezRecio O,Radial basis function regression methods for predicting quantitative traits using SNP markersGenet ResYear: 20109220922510.1017/S0016672310000157  
Breiman L,Random forestMachine LearningYear: 20014553210.1023/A:1010933404324  
Friedman JH,Greedy functions approximation: a gradient boosting machineAnn StatYear: 2001291189123210.1214/aos/1013203451  
GarcíaMagariños M,LópezdeUllibarri I,Cao R,Salas A,Evaluating the ability of treebased methods and logistic regression for the detection of SNPSNP interactionAnn Hum GenetYear: 20097336036919291098  
Sun YV,Bielak LF,Peyser PA,Turner ST,Sheedy II PF,Boerwinkle E,Kardia SL,Application of machine learning algorithms to predict coronary artery calcification with a sibshipbased designGenet EpidemiolYear: 20083235036010.1002/gepi.2030918271057  
Tanner MA,Wong WH,The calculation of posterior distributions by data augmentationJ Am Stat AssocYear: 1987818286  
Park T,Casella G,The Bayesian LassoJ Am Stat AssocYear: 200810368168610.1198/016214508000000337  
de los Campos G,Naya H,Gianola D,Crossa J,Legarra A,Manfredi E,Weigel KA,Cotes JM,Predicting quantitative traits with regression models for dense molecular markers and pedigreeGeneticsYear: 200918237538510.1534/genetics.109.10150119293140  
GonzálezRecio O,Lopez de Maturana E,Vega T,Broman K,Engelman C,Detecting SNP by SNP interactions in rheumatoid arthritis using a two step approach with Machine Learning and a Bayesian Threshold LASSO modelBMC ProceedingsYear: 20093S6320018057  
Tibshirani R,Regression shrinkage and selection via the lassoJ Royal Stat Soc BYear: 199658267288  
Hastie T,Tibshirani R,Friedman JH,The elements of statistical learning. Data mining, inference and predictionYear: 2009New York, Springer  
Freund Y,Schapire RE,Saitta L, Morgan KaufmannExperiments with a new boosting algorithmproceeding of the Thirteen International conference on Machine Learning: 1996; San FranciscoYear: 1996148156  
Breiman L,Bagging predictorsMachine LearningYear: 199624123140  
Goldstein BA,Hubbard AE,Cutler A,Barcellos LF,An application of random Forest to a genomewide association data set: Methodological considerations & new findingsBMC GeneticsYear: 2010114910.1186/14712156114920546594  
Tibshirani R,Bias, variance, and prediction error for classification rulesTechnical ReportYear: 1996Statistics Department, University of Toronto  
Sargolzaei M,Schenkel FS,QMSIM: A large scale genome simulator for livestockBioinformaticsYear: 20092568068110.1093/bioinformatics/btp04519176551  
Straw B,Bates R,May G,Anatomical abnormalities in a group of finishing pigs: prevalence and pig performanceJ Swine Health ProdYear: 2009172831  
Lingaas F,Ronningen K,Epidemiological and genetical studies in Norwegian pig herds. II. Overall disease incidence and seasonal variationActa Vet ScandYear: 19913289961950855  
Vogt DW,Ellersieck MR,Heritability of susceptibility to scrotal herniation in swineAm J Vet ResYear: 199051150115032396801  
Plastow G,Sasaki S,Yu TP,Deeb N,Prall G,Siggens K,Wilson E,Practical application of DNA markers for genetic improvementProceedings of the twentyeighth National Swine Improvement Federation meeting: 2003; Des MoinesYear: 2003150154  
Hu ZL,Dracheva S,Jang W,Maglott D,Bastiaansen J,Rothschild MF,Reecy JM,A QTL resource and comparison tool for pigs: PigQTLdbMammalian GenomeYear: 20051679280010.1007/s003350050060916261421  
Ziegler A,Konik IR,Thompson JR,Biostatistical Aspects of GenomeWide Association StudiesBiom JYear: 20085012110.1002/bimj.200710398  
Green DM,Swets JM,Signal detection theory and psychophysicsYear: 1966New York: John Wiley and sons  
Henderson CR,Best linear unbiased estimation and prediction under a selection modelBiometricsYear: 19753142344710.2307/25294301174616  
Sorensen D,Gianola D,Likelihood, Bayesian and MCMC Methods in Quantitative GeneticsYear: 2002New York: Springer Verlag  
Yi N,Xu S,Bayesian LASSO for quantitative trait loci mappingGeneticsYear: 20081791045105510.1534/genetics.107.08558918505874 
Figures
[Figure ID: F1] 
Figure 1
SNP covariate relative variable importance (VI) in each line using random forest algorithm. 
Tables
Accuracy (standard error across replicates in parentheses), measured as Pearson correlation between predicted and true genomic assisted values, and area under the operating characteristic curve for different methods and number of QTL
# QTL  TBA  BTL  RF  L_{2}B  L_{h}B  

Pearson correlation  90  0.26 (0.03) 
0.33 (0.04) 
0.36 (0.04) 
0.37 (0.07) 
0.41^{1 } (0.07) 


1000  0.32 (0.16)  0.35 (0.01)  0.30 (0.03)  0.24 (0.01)  0.34 (0.02)  


AUC  90  0.61 (0.01) 
0.65 (0.02) 
0.70 (0.02) 
0.65 (0.04) 
0.69 (0.03) 
1000  0.66 (0.01)  0.66 (0.00)  0.69 (0.01)  0.63 (0.01)  0.66 (0.01) 
^{1}Higher value is desirable; the best value is in bold face; TBA = Threshold Bayes A, BTL = Bayesian Threshold LASSO, RF = Random Forest; L_{2}B = L_{2}boosting algorithm, L_{h}B = L_{h}boosting algorithm
Specificity, sensitivity, phi correlation and misclassification rate for each model at detecting different α and (1α) percentiles of extreme animals in the testing set within line A
Parameter  Method  α (number of records)  

0.05 (12)  0.10 (79)  0.25 (98)  0.50 (138)  


Specificity^{1}  TBA  1  0.71  0.58  0.56 
BTL  1  0.94  0.75  0.74  
RF  1  0.88  0.78  0.79  
L_{2}B  0.75  0.71  0.64  0.65  
L_{h}B  0.75  0.71  0.61  0.67  


Sensitivity^{1}  TBA  0.75  0.58  0.58  0.56 
BTL  0.75  0.53  0.53  0.47  
RF  1  0.52  0.52  0.46  
L_{2}B  0.75  0.48  0.48  0.51  
L_{h}B  0.50  0.45  0.45  0.42  


Phi correlation^{1}  TBA  0.71  0.24  0.16  0.13 
BTL  0.71  0.39  0.27  0.22  
RF  1  0.33  0.29  0.26  
L_{2}B  0.48  0.16  0.12  0.17  
L_{h}B  0.24  0.13  0.06  0.09  


Misclassification rate (%)^{2}  TBA  17  39  42  43 
BTL  17  38  39  40  
RF  0  41  39  38  
L_{2}B  25  47  46  42  
L_{h}B  42  49  49  46 
^{1}Higher value is desirable; the best value for each percentile is in bold face;
^{2}Lower value is desirable; the best value for each percentile is in bold face;
TBA = Threshold Bayes A, BTL = Bayesian Threshold LASSO, RF = Random Forest; L_{2}B = L_{2}boosting algorithm, L_{h}B = L_{h}boosting algorithm
Specificity, sensitivity, phi correlation and misclassification rate for each model at detecting different α and (1α) percentiles of extreme animals in the testing set within line B
Parameter  Method  α (number of records)  

0.05 (7)  0.10 (25)  0.25 (78)  0.50 (137)  


Specificity^{1}  TBA  0.75  0.86  0.74  0.75 
BTL  0.75  0.86  0.61  0.58  
RF  0.75  0.57  0.48  0.37  
L_{2}B  1  0.71  0.57  0.48  
L_{h}B  0.75  0.71  0.57  0.63  


Sensitivity^{1}  TBA  1  0.95  0.64  0.58 
BTL  1  1  0.75  0.75  
RF  1  1  0.95  0.94  
L_{2}B  1  0.72  0.56  0.64  
L_{h}B  0.67  0.78  0.73  0.69  


Phi correlation^{1}  TBA  0.75  0.80  0.34  0.34 
BTL  0.75  0.90  0.34  0.32  
RF  0.75  0.70  0.50  0.38  
L_{2}B  1  0.40  0.12  0.12  
L_{h}B  0.42  0.46  0.28  0.32  


Misclassification rate (%)^{2}  TBA  14  8  35  34 
BTL  14  4  29  32  
RF  14  12  19  31  
L_{2}B  0  28  44  43  
L_{h}B  29  24  32  36 
^{1}Higher value is desirable; the best value for each percentile is in bold face;
^{2}Lower value is desirable; the best value for each percentile is in bold face;
TBA = Threshold Bayes A, BTL = Bayesian Threshold LASSO, RF = Random Forest; L_{2}B = L_{2}boosting algorithm, L_{h}B = L_{h}boosting algorithm
Specificity, sensitivity, phi correlation and misclassification rate for each model at detecting different α and (1α) percentiles of extreme animals in the testing set within line C
Parameter  Method  α (number of records)  

0.05 (7)  0.10 (24)  0.25 (80)  0.50 (104)  


Specificity^{1}  TBA  1  0.50  0.64  0.71 
BL  0  0.25  0.61  0.71  
RF  1  0.75  0.75  0.71  
L_{2}B  1  1  0.96  0.98  
L_{h}B  1  1  0.82  0.69  


Sensitivity^{1}  TBA  0.33  0.30  0.54  0.53 
BL  0.5  0.30  0.44  0.43  
RF  0.33  0.35  0.52  0.51  
L_{2}B  0.17  0.20  0.15  0.15  
L_{h}B  0.33  0.20  0.46  0.45  


Phi correlation^{1}  TBA  0.26  0.16  0.17  0.24 
BL  0.35  0.35  0.05  0.15  
RF  0.26  0.08  0.26  0.23  
L_{2}B  0.17  0.20  0.17  0.24  
L_{h}B  0.26  0.20  0.28  0.15  


Misclassification rate (%)^{2}  TBA  57  67  43  38 
BL  57  71  50  43  
RF  57  58  40  39  
L_{2}B  71  67  56  44  
L_{h}B  57  67  41  43 
^{1}Higher value is desirable; the best value for each percentile is in bold face;
^{2}Lower value is desirable; the best value for each percentile is in bold face;
TBA = Threshold Bayes A, BTL = Bayesian Threshold LASSO, RF = Random Forest; L_{2}B = L_{2}boosting algorithm, L_{h}B = L_{h}boosting algorithm
Area under the receiver operating characteristic curve^{1 }for each model and breed line in the field pig data
TBA  BL  RF  L_{2}B  L_{h}B  

Line A  0.64  0.65  0.67  0.55  0.60 
Line B  0.70  0.69  0.73  0.60  0.72 
Line C  0.62  0.62  0.67  0.67  0.66 
TBA = Threshold Bayes A; BTL = Bayesian Threshold LASSO; RF = Random Forest; L_{2}B = L_{2}boosting algorithm; L_{h}B = L_{h}boosting algorithm
^{1}Higher value is desirable; the best value for each line is in bold face
Article Categories:

Previous Document: Breastfeeding and the risk of rotavirus diarrhea in hospitalized infants in Uganda: a matched case c...
Next Document: Purification and biochemical characterization of pancreatic phospholipase A2 from the common stingra...