Estimated breeding values and association mapping for persistency and total milk yield using natural cubic smoothing splines.  
Jump to Full Text  
MedLine Citation:

PMID: 19891765 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

BACKGROUND: For dairy producers, a reliable description of lactation curves is a valuable tool for management and selection. From a breeding and production viewpoint, milk yield persistency and total milk yield are important traits. Understanding the genetic drivers for the phenotypic variation of both these traits could provide a means for improving these traits in commercial production. METHODS: It has been shown that Natural Cubic Smoothing Splines (NCSS) can model the features of lactation curves with greater flexibility than the traditional parametric methods. NCSS were used to model the sire effect on the lactation curves of cows. The sire solutions for persistency and total milk yield were derived using NCSS and a wholegenome approach based on a hierarchical model was developed for a large association study using single nucleotide polymorphisms (SNP). RESULTS: Estimated sire breeding values (EBV) for persistency and milk yield were calculated using NCSS. Persistency EBV were correlated with peak yield but not with total milk yield. Several SNP were found to be associated with both traits and these were used to identify candidate genes for further investigation. CONCLUSION: NCSS can be used to estimate EBV for lactation persistency and total milk yield, which in turn can be used in wholegenome association studies. 
Authors:

Klara L Verbyla; Arunas P Verbyla 
Related Documents
:

16478945  Genetic parameters for various random regression models to describe total sperm cells p... 9785375  Herdlevel test performance based on uncertain estimates of individual test performance... 10722145  Introduction, persistence and fadeout of porcine reproductive and respiratory syndrome... 21671635  A computational approach to design and evaluate enzymatic reaction pathways: applicatio... 20221255  Diffusion, crowding & protein stability in a dynamic molecular model of the bacterial c... 10490945  Sparse code shrinkage: denoising of nongaussian data by maximum likelihood estimation 
Publication Detail:

Type: Journal Article; Research Support, NonU.S. Gov't Date: 20091105 
Journal Detail:

Title: Genetics, selection, evolution : GSE Volume: 41 ISSN: 12979686 ISO Abbreviation: Genet. Sel. Evol. Publication Date: 2009 
Date Detail:

Created Date: 20091201 Completed Date: 20100114 Revised Date: 20100927 
Medline Journal Info:

Nlm Unique ID: 9114088 Medline TA: Genet Sel Evol Country: France 
Other Details:

Languages: eng Pagination: 48 Citation Subset: IM 
Affiliation:

Victorian Department of Primary Industries, Bundoora, VIC, 3083, Australia. Klara.Verbyla@dpi.vic.gov.au 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Animals Breeding* Cattle / genetics*, physiology Female Lactation / genetics* Male Milk / chemistry* Models, Genetic Polymorphism, Single Nucleotide Quantitative Trait, Heritable 
Comments/Corrections 
Full Text  
Journal Information Journal ID (nlmta): Genet Sel Evol ISSN: 0999193X ISSN: 12979686 Publisher: BioMed Central 
Article Information Download PDF Copyright ?2009 Verbyla and Verbyla; licensee BioMed Central Ltd. openaccess: Received Day: 23 Month: 3 Year: 2009 Accepted Day: 5 Month: 11 Year: 2009 collection publication date: Year: 2009 Electronic publication date: Day: 5 Month: 11 Year: 2009 Volume: 41 Issue: 1 First Page: 48 Last Page: 48 ID: 2785761 Publisher Id: 129796864148 PubMed Id: 19891765 DOI: 10.1186/129796864148 
Estimated breeding values and association mapping for persistency and total milk yield using natural cubic smoothing splines  
Klara L Verbyla1  Email: Klara.Verbyla@dpi.vic.gov.au 
Arunas P Verbyla23  Email: ari.verbyla@adelaide.edu.au 
1Victorian Department of Primary Industries, Bundoora, VIC, 3083, Australia 

2School of Agriculture, Food and Wine, The University of Adelaide, Adelaide, SA 5005, Australia 

3Mathematical and Information Sciences, CSIRO, Urrbrae, SA 5064, Australia 
For dairy producers, the accurate description of lactation curves is a valuable tool for selection and management. Lactation curves provide a description of milk yield performance, which make it possible to predict total milk yield from a single or several test days early in lactation. Thus, producers can make early management decisions based on the predicted individual production. Different mathematical equations have been proposed to model lactation curves. Usually such curves are modelled using parametric models with fixed or random coefficients, for example random regression models, Wood's Lactation Curve (the commonly applied gamma equations), Wilmink's Curve and Legendre polynomials. Alternatively, mechanistic models which describe the lactation curves based on the biology of lactation have been used [^{1}]. In 1999, White and colleagues [^{2}] proposed and demonstrated that Natural Cubic Smoothing Splines (NCSS) can model the features of lactation curves with greater flexibility than the traditional parametric methods. This has been further supported by the work of Druet and colleagues [^{3}]. In addition, NCSS are particularly useful in an animal breeding setting since they can be incorporated into linear mixed models.
A lactation curve describes many important features of lactation and some of these features, namely time to peak, total milk yield and rate of decline after the peak yield, were examined in this study. The rate of decline in milk production after peak yield is the typical definition of milk yield persistency. High persistency is characterized by a slow rate of decline after peak yield, while low persistency is characterized by a high rate of decline after peak yield. Persistency has been reported to have a significant economic impact [^{4}]. Highly persistent cows or cows with a flat lactation curve are reported to be more profitable because of fewer health and reproductive problems with less energy imbalance. The links between health disorders, fertility and persistency have been investigated with varied results [^{5},^{6}].
Total milk yield is a wellknown economically important trait. However, selection for high total milk yield has been shown to have detrimental health effects [^{7}]. If an animal has a low persistency, selection for high milk yield can cause significant metabolic stress. In 2004, Muir and colleagues [^{8}] have reported that selection for increased persistency might increase total yields without increasing disease incidences or fertility problems. Subsequently, Togashi and Lin [^{9},^{10}] have investigated different selection strategies to maximize milk yield without decreasing persistency.
Although the definition of persistency is now generally agreed upon, methods of estimation still vary. In 1996, Gengler [^{11}] provided a review of many common definitions of persistency, which included ratios of an early test day or period to latelactation testday or period and measures formulated to be independent of total yield. Other reported measures are the difference between one set day for peak yield (or the estimated breeding value (EBV) at this day) early in lactation and a test day late in lactation (or EBV at this day), or the sum of the yield or EBV over this time period. Novel approaches for calculating persistency have been presented by Druet and colleagues [^{12}] and Togashi and Lyn [^{13}]. Cole and VanRaden [^{14}] and Cole and Null [^{15}] have shown that routine genetic evaluations are feasible for persistency. Some of these methods assume one set day for peak yield for all animals, which in reality is not the case. Using NCSS allows the exact estimation of a unique peak day and yield at peak for each animal.
Many QTL and association studies have been conducted for total milk yield and a few QTL studies have investigated persistency. Such studies usually involved either the use of single markers or a genome scan to establish association with a specific trait. Wholegenome approaches have been developed, for example genetic random variable elimination (GeneRaVE) [^{16},^{17}] and wholegenome average interval mapping (WGAIM) [^{18}]. Wholegenome methods allow for background genetic effects by incorporating all markers, and thus all the associations between marker and trait are estimated simultaneously.
The first objective of this paper was to demonstrate that NCSS could be used successfully to estimate sire breeding values for two important features of the lactation curve, persistency and total milk yield, for a specific set of sires in a large Australian study. The second objective was to conduct an association study for both persistency and total milk yield using the calculated EBV, genotype information in the form of 7541 single nucleotide polymorphisms (SNP) and a maternal grandsire pedigree. The overall aim was to use a wholegenome association study to establish markertrait associations.
Genotypic information was available for 383 Holstein Friesian (HF) progenytested bulls, which were selected on the basis of either high or low estimated breeding values for the Australian selection index. The index's primary emphasis is on protein production. Data on all these bulls' daughters and their contemporaries were extracted from the Australian Dairy Herd Improvement Scheme (ADHIS) database. The data set consisted of Holstein Friesian cows that calved during the period 1983 to 2006 and were in the same herd year and season as the daughters of the 383 genotyped sires. Records were removed when calving date was missing or when the test date was outside the 5 to 305 d in milk (DIM) period. Only first lactations were included since it has been demonstrated that genetic correlations for persistency between consecutive parities are high [^{19}] (> 0.85 reported between the first two parities) despite previous results disagreeing with this study (see [^{19}] for discussion of results). This data set contained over 15 millions test day records from the daughters of 38,381 sires in 6,384 herds and thus was too large for use in a single analysis. In order to provide an unbiased analysis, six random samples were selected from the full data set by randomly sampling 1,000 herds [^{14},^{20}]; each sampled herd had to contain at least 1,000 test day records. Each sample contained approximately 15,000 to 20,000 sires and 400,000 to 450,000 cows. These six subsamples were used for the estimation of the variance components in the model discussed below.
A selected data set was created and consisted of data concerning only the specific 383 sires of interest and their offspring. This data set contained 333,068 Holstein Friesian daughters with 2,311,834 records and was used to estimate the sire effect EBV for persistency and total milk yield (incorporating information based on the six subsamples). A maternalgrandsire pedigree dating back to 1940 and consisting of 2864 animals was available for the 383 sires.
A total of 9918 SNP markers were scored on the 383 sires using Parallele (Affymetrix, Santa Clara, CA). After adjusting for monomorphic SNP, missing genotypes, unknown location, minimum allele frequency (> 2.5%) and deviation of observed genotype frequencies from expected frequencies calculated from allele frequencies (Hardy Weinberg equilibrium), the number of polymorphic markers amounted to 7541 with an average of 251 SNP per chromosome (29 autosomes plus one sex chromosome). The remaining missing values in the SNP information were replaced by their expected value calculated using haplotypes of five SNP markers [^{21}].
NCSS were used to model the sire influence on lactation curves of dairy cows in the randomly sampled data and also in the selected data set. The randomly selected data sets were used to estimate variance components in the model discussed below. The six sets of estimates were averaged and all but one (as discussed later) of the variances components were fixed at their average value in the analysis of the selected data set. The aim was to reduce the bias in using the selected data by ensuring that the variance component estimates reflected those that would be obtained if the full data was analysed.
For the analysis on the selected data, the main features of the lactation curves were extracted. The sire's influence on the peak lactation milk yield and the corresponding day of peak milk yield were estimated, and for each sire, the EBV for persistency and total milk yield were subsequently computed. This constituted the first stage of analysis.
Then, the EBV for persistency and total milk yield were used in the second stage association study. Appropriate weights were calculated for the second stage analyses, reflecting the information available for each sire. A discussion of weights for twostage analysis has been presented by Smith and colleagues [^{22}] in the context of plant breeding but the methods are more widely applicable and relevant for the analyses conducted in this paper.
A mixed model was used for both the sampled and selected test day data, namely
The vector y is the Nx1 vector of testday milk yields on the cows in both the randomly sampled and the selected data sets. The fixed effects were given by X_{0 }?_{0}, and consisted of trends for the age of cow at test (a fixed effects cubic polynomial) and a fixed effect for year by season; a factor of 46 levels representing year by season interactions. The random effects in the model included herdtestday effects represented by u_{0h }(with design matrix Z_{0h}), independent effects with mean zero and variance , and the random cubic orthogonal polynomial regression coefficients for the c cows in the data are given by u_{0c }(with design matrix Z_{0c}), with mean zero and variance matrix G_{0c }? I_{c}; G_{0c }is a 4 ? 4 variance matrix (? is the Kronecker product). The random cubic regression using orthogonal polynomials was included to model cow lactation across the repeated measures of milk yield over the lactation period and it incorporates permanent environmental effects and genetic effects since the maternal grandsire pedigree was not included in the stage I model. It would have been preferable to include the pedigree in this first stage of modelling, especially if EBV were of prime interest since they would then reflect relationships between sires, but we were unable to do so due to limitations in computing power. However, the pedigree was used in the association analysis discussed and presented below. All random effects were assumed to have a normal distribution and to be mutually independent. The error term was assumed independently distributed as N(0, ?^{2}I_{N}).
The term Zg represents the sire effects on lactation over time. Thus Z is a design matrix for the sire of cow effect. The vector g is the vector of sire contributions to the lactation curves of the cows. Thus g can be partitioned into components that correspond to individual sires; that is for the 383 sires for the selected data set.
The contribution to the lactation curve of cows for the j th sire, was modelled using NCSS [^{2},^{23}], that is (j = 1, 2, ..., 383) as
where the spline is represented by a fixed linear (or straight line) component, X_{s1 }?_{js}, and a correlated random component, Z_{s1}u_{js}, to allow for nonlinear patterns in the lactation curve attributable to sires. Note that u_{js }~N(0, I_{n2}) uses the formulation of Verbyla and colleagues [^{23}], where is the variance component for the random component of the NCSS and n is the number of knotpoints for the NCSS. The same knot points were used for all sires. The full design matrices for and in (1) become respectively, X_{s }= Z(X_{s1 }? I_{383}) and Z_{s }= Z(Z_{s1 }? I_{383}) for the 383 sires in the selected data set.
Notice that the cow random coefficients and NCSS provide for the variancecovariance structure that would arise because of repeated measurements on the individual cows.
The full model is given by
and the marginal distribution of y is therefore given by
where X? = X_{0 }?_{0 }+ X_{s }?_{s }are the fixed effects, and the variance matrix H is given by
It was possible to fit this model, whereas more complex models (for example allowing for splines for each cow) were simply too large to be fitted.
The key component of the statistical model is the NCSS, one for each sire. This term formed the basis of the analysis of the milk yield characteristics that were influenced by the choice of sire. Once the mixed model (3) is fitted, the sire NCSS can be used to determine the peak milk yield, the time at which the peak occurs, milk yield persistency, and total milk yield over the full lactation.
Some basic results involving NCSS are required in order to determine peak yield, persistency and total milk yield. The first derivative is required to determine the day of peak milk yield. NCSS can then be used to find the peak milk yield value for each sire. The total milk yield is the area under the NCSS for each sire and requires integration of the NCSS.
Suppose we have a quantitative explanatory variable t with corresponding values or knotpoints T_{L }<t_{1 }? t_{2 }? ... ? t_{n }<T_{R }on an interval [T_{L}, T_{R}]. In our context, this variable is DIM, and the interval is [6,305]. Selection of the knot points t_{i }is discussed below.
Suppose that g_{j}(t_{i}) is the value of the NCSS for the jth sire at the knotpoint t_{i}, which represents one value of the vector g_{j}. To simplify the notation we drop the subscript j. Green and Silverman [^{24}] have shown that the values g_{i }= g(t_{i}) and the second derivatives ?_{i }= g"(t_{i}) at the knot points t_{i }characterize the NCSS; note that ?_{1 }= ?_{n }= 0. In fact, for t_{i }? t ? t_{i+1 }and h_{i }= h_{i+1 }t_{i},
While the terms in (4) do not match the formulation in (2), White and colleagues [^{25}] have shown the equivalence of various forms on the NCSS. Equation (2) is useful in fitting models in statistical software packages, whereas (4) is useful for postfitting calculations.
Several results are needed to develop the second stage of the analysis, namely the association study. Equation (4) can be written as
where g and ? are vectors of the g_{i }and ?_{i}, respectively, and a_{1 }and a_{2 }are known vectors explicitly defined using (4), and which are equal to zero, apart from the two indices i and i + 1. Using equation (2.4) of [^{24}], we can then write
where Q and R are known matrices given on pages 12 and 13 of [^{24}]; Q and R are functions of h_{i}. Thus any value of the function g can be found using the values at the knot points.
Using (4), the first derivative of g(t) can be shown to be (t_{i }? t ? t_{i+1})
where
and
Equation (6) is used to determine the time for maximum or peak milk yield.
Typically, there is a single maximum or peak milk yield day at which . The first step is to use the spline to determine the interval containing the peak milk yield. In most cases, the interval containing peak values has first derivatives at the knot points satisfying and where the hat indicates the estimated g; if there is no turning point in the lactation curve, the maximum will occur at the initial time point and there will not be any interval satisfying the inequalities. Once the interval containing the maximum milk yield is determined, the equation is solved and involves finding the acceptable root of the quadratic equation (6).
Estimated persistency was calculated as the difference between the milk yield at peak lactation and an end day, namely
where t_{max }and t_{end }are the time of peak milk yield and the end time (t_{end }= 305 DIM) respectively. The time period differs between sires because of differing peak lactation times t_{max}. The estimated milk yields and were calculated for each sire using (4).
Both variability of the actual time of peak yield attributable to sires and difference in persistency were examined using a fixed time (60 DIM). Relationships between peak lactation time, peak lactation value, lactation at the end of the lactation period, persistency and total milk yield were also examined.
The total milk yield for cows attributable to sires was found by calculating the area under the NCSS for each sire. The area under the curve can be found by integration,
and using (4) it is easy to show
Evaluation of (8) for each sire involves using estimates of g_{i }and ?_{i}, and using the same arguments leading to (5), can be written in terms of g at the knotpoints as
where the b vectors are functions of h_{i }as given in (8).
The association analyses are conducted in the second stage of the analysis. However, the 'data' for the second stage are estimates or predictions from stage 1 and hence have an associated error that should be carried through to the next stage of analysis. These estimates are also correlated, but to provide a simple analysis, an approximation along the lines of [^{22},^{26}] is carried out. The weights are determined as follows.
The predicted persistency involves finding and . Thus for a single sire, and using (5),
where is a known vector. The variance matrix of ?, which we denote by V, is available via the prediction error variance matrix, and the underlying spline variance matrix as outlined [^{23}].
If A_{c }is the matrix whose rows are given by , and using the ideas in [^{22},^{26}], our weights are given by
the diagonal elements of the inverse of the full variance matrix of the persistency estimates. Note that (10) ignores the error associated with estimating t_{max}.
The same argument was used to develop weights for the total milk yield estimates using (9).
We examined additive SNP marker associations for both persistency and total milk yield using the methods of Kiverii [^{16},^{17}] with a component of the method discussed by Verbyla and colleagues [^{18}]. Including the polygenic effects using the maternalgrandsire pedigree, with the resulting additive relationship matrix, was also shown to be important.
The statistical model for markertrait association was given by
where y_{m }is the vector of estimated effects for a single trait (m stands for persistency or total milk yield) from the first stage of the analysis, 1 is a vector of 'ones', ? is an overall mean effect, M_{a }is a matrix of additive SNP scores (see below) with associated size vector ?_{a}, a is a vector of (polygenic) additive random effects with distribution , where A is derived from the full maternal grandsire pedigree and e_{m }is a residual vector distributed as where W_{m }is a diagonal matrix of weights derived from the first stage of the analysis using (10). Note that W_{m }is a known matrix for this second stage of the analysis and is different for each of the two traits, persistency and total milk yield.
The additive (m_{a}) scores for a SNP with alleles A and B are given by 1 for genotype AA, 0 for genotype AB and 1 for genotype BB. Thus M_{a }contains the scores m_{a }for each SNP for each sire.
The GeneRaVE or genetic random variable elimination approach presented by Kiiveri [^{16},^{17}] was used for the analysis without the polygenic effects a. The current theory and implementation of GeneRaVE does not allow random effects to be included. Ideally the polygenic effects should be included. Indeed ignoring them would produce a biased selection since it is likely that truly nonsignificant markers would be selected because the between sire stratum of variation is omitted. However, in order to at least partially correct for the bias, a further stage of analysis is described below. Thus for selection of SNP markers, (11) became
If ?_{j }is the size of the effect of the j th SNP, the model developed in [^{21},^{22}] was
so that the size effects conditional on a variance parameter (v_{j}) follow a normal distribution and hence are random effects. The variances were assumed to follow a gamma distribution with shape parameter k and scale parameter b. This formulation leads to a complex marginal distribution for ?_{j }which is a function of ?_{j}. The dependence on the modulus leads to sparse regression variable selection by enabling estimates of size to be exactly zero. In practice, this was accomplished by setting ?_{j }equal to zero if the absolute magnitude was below 10^{6}.
To control for false positives, a 10fold crossvalidation approach was used to find optimal values for the parameters k and b. An additional scale parameter can also be optimised in the crossvalidation. This parameter scales the response so that the threshold of 10^{6 }is relative to a common scale over different traits. The crossvalidation involved subdividing the data into 10 random groups, leaving out each group in turn, and predicting the response for that group using the SNP selection process with the nine remaining groups as the data set. The minimum mean square error of prediction across all crossvalidations was used as the criterion for selecting k, b and the scale (denoted b0sc in the GeneRaVE documentation and in the results section).
In 2007, Verbyla and colleagues [^{18}] presented a method for QTL analysis using a forward selection approach with a simpler random effects model for the sizes. The variances v_{j }were assumed to be equal and nonrandom. In their approach, QTL were moved to the fixed effects part of the model since they were determined. In this paper, we used Kiiveri's [^{16},^{17}] selection approach in conjunction with the approach reported by Verbyla and colleagues [^{18}], which consists of moving the complete set of selected SNP to the fixed effects part of the model. The nonselected SNP were omitted in subsequent analyses. At this point, we were also able to include the pedigree information. Thus equation (11) was used for the final analysis, but ?_{a }was the vector of sizes only for the selected SNP and the matrix M_{a }contained the additive scores only for the selected SNP.
The significance of the selected SNP was conducted using a standard Wald statistic, namely the estimated SNP size effect divided by the corresponding standard error. Approximate pvalues were determined using a standard normal distribution. The resulting significant SNP were used with NCBI Bos taurus build Btau_4.0 to construct a list of possible candidate genes [^{27}].
The statistical model given by (3) was fitted using ASREML [^{28}] and included lactation curves attributable to the sires in the subsampled and selected (383 sires) data sets. The spline term Z_{s}u_{s }in (3) is automatically constructed by ASREML using the approach outlined in [^{23}]. In ASREML, the knot points used for the NCSS are usually the unique values of the explanatory variable and in this case it would have been each observed DIM. Typically such a dense set of knot points is not necessary. By reducing the number of knotpoints, computation and time requirements were kept reasonable. The number and their placement are often empirical, although White and colleagues [^{2}] have suggested that eight knot points is usually sufficient for modelling lactation curves. Druet and colleagues [^{3}] have used six knot points successfully. The knot points were positioned at a subset of 6, 36, 66, 96, 126, 156, 186, 231, 261 and 305 DIM. These knot points were selected empirically on the basis of the expected shape of the lactation curve. The number of knot points examined was 6, 8 and 10. Parameter estimates and predictions based on the model were used for comparison, and it was found that six knot points were sufficient for an accurate representation of the lactation curve. Interestingly, loglikelihoods varied across the number of knot points used, but the stability of parameter estimates was clear for six and eight knot points. The final knot points selected were 6, 36, 96, 156, 231, and 305 DIM.
Estimates of persistency and total milk yield were based on the lactation curves obtained using ASREML and were programmed for calculation in R [^{29}]. This included determination of the interval containing the turning point using (6), the calculation of the day at which peak lactation occurred, also using (6), and the peak milk yield using (4). This enabled the sire component of persistency using (7) to be estimated. The area under the lactation curve as given by (8) was also calculated in the R language. The R code includes the calculation of necessary weights for stage two of the analysis, namely the determination of markertrait association. The R code is available from the authors.
GeneRaVE is available as the R package RChip from Mathematical and Information Sciences at CSIRO http://www.bioinformatics.csiro.au/survival.shtml and this package was used for selection of markers. The subsequent fitting of selected markers as fixed effects using (11) was carried out using ASREML [^{28}].
The six random samples were used to estimate the variance components for the selected data set analysis. The results of these six analyses were very similar, the differences reflecting the sampling variation. The mean of the variance component over the six random samples for the herd test day was = 7.00, while the residual variance had a mean of = 4.115. To determine the cubic orthogonal polynomial random regressions covariance matrix for cows over DIM, the estimated matrices obtained from the analyses of the six random samples were averaged and this average is given in Table 1 (with estimated correlations between the components of the random regression given above the diagonal). These values ( , and the values in Table 1) were fixed in the analysis of the selected data set using only the daughters of the 383 sires and the same mixed model. However, the variance component for the spline term Z_{s}u_{s }in (3) was estimated using the selected data since the focus was on the variation among the 383 sires. The estimated variance component for the spline component was = 2.93.
In the analysis of the selected data, we found that the estimated milk yield rises to a peak for 369 of the 383 sires and then gradually declines. For the remaining 14 sires, peak yield was estimated to occur at the initial time of 6 DIM. The fitted NCSS for the impact of sire on milk yield are presented in Figure 1 for a (random) subset of 30 sires. The variation in milk yield that is attributable to sires is well illustrated in Figure 1. The estimated lactation curves in Figure 1 all display a decline in milk production postpeak. The postpeak declines vary, and hence display a varying level of persistence. Using a mathematical model for such a diversity of curves could prove to be very restrictive and may miss features found using NCSS.
Potentially, a key aspect of persistency is the timing of peak milk yield. A histogram of the time of peak yield is given in Figure 2 and illustrates the considerable variation (from about 15 to 70 DIM) across sires with a mean time of approximately 40 DIM, rather than 60 DIM which is often used to estimate persistency. Note the single sire outlier at 150 DIM for peak yield. This sire produced an extremely flat lactation curve and was highly persistent after the peak. Persistency was also calculated using the fixed time of 60 DIM for comparison purposes.
The estimated persistency values (using the actual peak) for the sire effects are presented as a histogram in Figure 3. The distribution showed some skewness to the right indicating that several sires exhibit good persistency (low values), while some sires lead to larger persistency values.
The estimated persistency values based on the estimated peak yield were plotted against the corresponding persistency using the 60 DIM milk yields as the maximum in Figure 4. There was a very strong correlation (0.97) between the two measures. Despite the strong correlation, Figure 4 shows some scatter and reranking of values. Notice also that using 60 DIM resulted in a downward bias in terms of estimated persistency (almost all values were below the y = x line presented). Hence, while the choice of peak DIM may not be totally critical, we favour using the estimated peak whenever possible. However, due to the high correlation between the two measures, the use of the 60 DIM peak yield would seem sufficient in cases where the extra complexity and computational demands cannot be justified.
The definition of persistency used in this paper is one of many possible definitions. Because the peak in milk yield varies across sires, the total time period that defines persistency varies. To examine the impact of the definition of persistency, two further analyses were conducted. First, a fixed time span of 200 days postpeak was used to define persistency. The raw sample correlation between this fixed span persistency and our original measure of persistency was 0.88 while it was 0.90 with the fixed 60 DIM. In the second analysis the original persistency was divided by the time span. The correlation in this case was 0.91 using the estimated peak and 0.99 using 60 DIM. These results suggest a level of consistency across the various definitions of persistency.
The estimated areas or total milk yields are presented in a histogram in Figure 5. The distribution may be a mixture of a number of components. There may be a genetic reason for this pattern due to the pedigree or SNP markers.
The relationships between estimated time to peak, estimated peak value, estimated final value (305 d), estimated persistency and estimated total milk yield (Area) are presented in Figure 6. Total milk yield showed little correlation with persistency. In 2009, Cole and VanRaden [^{14}] reported a similarly small correlation (0.03). It has been shown in previous studies, that the correlation found between total milk yield and persistency is highly variable and dependent on the definition of persistency with both positive and negative correlations, ranging from less than 0 to over 0.50 [^{14},^{30}].
The DIM of peak yield showed little correlation with any other variable, other than a small positive correlation with total milk yield. DIM of peak yield has been reported as correlated to persistency [^{8}], however in our study, peak yield rather than time of peak yield was highly correlated with persistency (0.53). The definition used here for persistency states that a lower value for persistency indicates a flat lactation curve and a more highly persistent cow. The positive correlation means that the higher the peak the greater the decrease in yield after the peak (low persistency). This result clearly indicates that animals with a lower peak yield are more persistent. This could be explained by a resultant reduction in metabolic stress, in agreement with the findings of Dekkers and colleagues [^{4}]. Figure 1 also shows that a lower peak generally occurs in conjunction with a more gradual decline in predicted milk production, resulting in a more persistent animal. Peak yield was also positively correlated with final milk yield (0.45) and total milk yield (0.72). A high correlation between peak yield and final milk yield has been previously reported [^{31}].
Overall our results support some previous findings, such as peak yield being directly linked to persistency. A higher peak generally means an animal will have a lower persistency. Our findings do not support a correlation between peak DIM and persistency but this may be due to the definition used for persistency here.
In the GeneRaVE analysis of persistency, the three tuning parameters were set at b = 10^{7}, k = 0 and b0sc = 0.02 after crossvalidation. All three parameters force effects to zero, b0sc being a scaling factor to help achieve a sparse solution. With these settings (which achieved a low mean squared prediction error) 51 SNP were selected for association with persistency. The selected 51 SNP were moved to the fixed effects part of the model and the remainder of the SNP were discarded. Since a maternalgrandsire pedigree was available for the 383 sires, this was incorporated in the subsequent analysis using (11) with the selected SNP. The estimate of the additive genetic variance was = 0.76, compared to an average estimated error variance of 0.42; it should be noted that for the association study fixed weights and hence estimated variances from the stage 1 analysis were used at the residual level. Since these vary across sires, an average value is presented to provide an indication of the relative size of additive genetic and residual variation. The pedigree effects have a profound impact on the significance of the selected SNP, because they ensure the appropriate error in testing for significance. The standard errors of the estimated SNP effects when the pedigree was included were two to three times larger than when the pedigree was ignored. Unfortunately, it is not currently possible to include random effects in a GeneRaVE analysis but research is underway to do so. The final 18 SNP that were significant at the 0.10 level are shown in Table 2. Figure 7 is a plot of the persistency EBV calculated using the NCSS against the predicted marker assisted breeding values (MEBV) for persistency. The MEBV was calculated using the significant SNP effects in Table 2 and polygenic effect calculated using the pedigree information. There was a strong correlation (0.95) between the EBV and MEBV but considerable variation still remains unexplained.
For total milk yield the GeneRaVE tuning parameters were set at b = 10^{7}, k = 0 and b0sc = 2.75 after crossvalidation. The last parameter reflects the different measurement scale for total milk yield in comparison to persistency. Fiftytwo SNP were selected for total milk yield using GeneRaVE. Shifting these putative SNP effects to the fixed effects part of the model and including the pedigree ( = 47, 843 compared to an average estimated error variance of 3,572) reduced the number of SNP to 18 (at the 0.10 level), which are presented in Table 2. Figure 8 is a plot of the observed (using the spline model) and predicted (using the selected SNPs and the pedigree) total milk yields. The correspondence is very good and in fact much better than for persistency (a correlation of 0.996). A single outlier corresponds to a sire with a large weight (from stage 1) and hence lower information content.
In the association mapping study carried out here, we found SNP associations for persistency and milk yield that had previously been reported, as well some newly identified regions or genes that need further analysis.
In the association analysis for persistency, two of the 18 SNP, found significant at the 0.05 significance level (Table 1) are within known genes. There are 14 SNP that appear closely associated with known genes and two other SNP closely associated with hypothetical protein producing loci. One highly significant SNP was found on BTA4 (47.8 Mbp) in the gene CFTR (cystic fibrosis transmembrane conductance receptor) involved in cystic fibrosis in humans. This gene functions as a small conductance chloride channel in epithelial membranes and its function in homeostasis and energy control makes it an ideal candidate gene for involvement in persistency [^{32}].
On BTA15, the SNP appears to associate with the uncoupling protein 3, UCP3, a mitochondrial protein carrier thought to be related to metabolic traits and obesity [^{33}]. Another gene detected in the association analysis of persistency is PAPD1, a polyA polymerase associated domain containing 1. It has been postulated that PAPD1 and UCP3 are involved with obesity and metabolism [^{34}]. Obesity is known to effect lactogenesis [^{35}]. Leptin, a protein hormone produced by adipocytes (fat cells) which has important effects in regulating body weight, metabolism and food intake, has been shown to inhibit hepatocyte growth factorinduced ductal morphogenesis of bovine mammary epithelial cells [^{36}]. It is possible that PAPD1 and UCP3 genes have a similar effect, thereby affecting persistency.
On BTA28, an SNP was significant at the 0.05 level for both persistency and milk yield analyses which suggests an association with the leucinerich repeat, immunoglobin like and transmembrane domain 1, LRIT1, gene. This region has already been shown to be involved in milk production [^{37}]. There are other significant SNP for persistency that may be associated with known or hypothetical genes and that may be causative, but these need further investigation.
For the total milk yield, the 18 significant SNP are closely associated with known or predicted genes (table 3). The SNP found on BTA1 point to regions already identified as having possible effects on milk yield [^{38}]. This analysis, like the association analysis for persistency, found many SNP in or near genes involved in various functions such as protein binding, signal transduction, receptor binding and membrane stability. The SNP on BTA16 appears to be associated with a gene coding for ATPase, H+ transporting, lysosomal 13 kDa, V1 subunit G3(ATP6V1G3). The SNP on BTA23 and BTA14, respectively, are in regions already shown to have an impact on milk yield [^{39}]. The significant SNP on BTA12, 19 and 24 were in, or close to, genes with known function, but these genes have not previously been associated with milk yield and thus need further investigation.
NCSS originally discussed in 1999 by White and colleagues [^{2}] was found very useful to model lactation curves. The methodology described in our paper continues the work of White and colleagues [^{2}] and Druet and colleagues [^{3}] and provides a flexible approach to model lactation curves. The advantage of such a representation is the ease with which important characteristics of the lactation curve such as time to peak, yield at peak, persistency and total milk yield can be determined. Not constraining the curves to have a particular parametric form is also an advantage because it is not necessary that all lactation curves follow the strict form that is implied by such functions.
In our paper, we have extended the use of NCSS for the estimation of EBV of 383 sires for persistency of lactation and total milk yield, two important characteristics of the lactation curve. Sire EBV can be found for both traits allowing the ranking of sires and hence enabling selection and management decisions to be made in practice. NCSS can be used to easily model the sire influence of all the important features of the lactation curve. Importantly, persistency can be calculated using the estimated peak rather than a fixed day across all animals. However, this may not be possible to implement in the situation of a breeding association since the computational demands and the extreme number of records may be too great.
The genomewide association study found SNP associated with persistence of milk yield and total milk yield that were close to genes of known or postulated function, part of these confirming previous results. The inclusion of the polygenic effect in the analysis was crucial in establishing significant associations. It would be possible to repeat the association study with the genotyped animals using the Illumina Bovine SNP50 chip but it would be necessary to increase the number of genotyped animals to have sufficient power to identify significant QTL.
Lastly, the use of 'sparse' selection tools [^{16},^{17}] is useful to reduce important SNP to an appropriate number. Despite the successful discovery of SNP related to milk persistence and total milk yield, the association mapping conducted here is largely exploratory and several issues still require further investigation. The first issue concerns additional fixed and random effects that are typically necessary in such an analysis. This is particularly important because pedigree information is often available and the association between genotypes is modelled using an additive relationship matrix through a random effect. Including such information can have a major impact on the association mapping, as shown here when the pedigree was included. The second issue relates to the status of the selected markers. As random effects, they will be shrunk towards zero, while if taken as fixed effects after selection, some bias is likely to occur. The degree of such bias is unknown. These issues are currently investigated by the authors and colleagues.
The authors declare that they have no competing interests.
Both authors contributed equally to all parts of the study. Both authors have read and approved the final manuscript.
We would like to acknowledge Phillip Bowman for the extraction of the data, the Australian Dairy Herd Improvement Scheme (ADHIS) for the data, Michael Goddard and the Victorian Department of Primary Industries for the support of the first author. The second author acknowledges financial support from the National Statistics Program of the Australian Grains Research and Development Corporation and the Division of Mathematical and Information Sciences, CSIRO, Australia. We thank Julian Taylor for important discussions regarding GeneRaVE, Wayne Pitchford, Ben Hayes and two anonymous reviewers for critical comments on the manuscript.
References
Grossman M,Hartz SM,Koops WJ,Persistency of lactation yield: A novel approachJ Dairy SciYear: 1999822192219710531606  
White IM,Thompson R,Brotherstone S,Genetic and environmental smoothing of lactation curves with cubic splinesJ Dairy SciYear: 19998263263810194684  
Druet T,Jaffrezic F,Boichard D,Ducrocq V,Modeling Lactation Curves and Estimation of Genetic Parameters for First Lactation TestDay Records of French Holstein CowsJ Dairy SciYear: 2003862480249012906066  
Dekkers JCM,Ten Hag JH,Weersink A,Economic aspects of persistency of lactation in dairy cattleLivest Prod SciYear: 199853237252  
Appuhamy J,Cassell BG,Dechow CD,Cole JB,Phenotypic relationships of common health disorders in dairy cows to lactation persistency estimated from daily milk weightsJ Dairy SciYear: 2007904424443417699063  
Harder B,Bennewitz J,Hinrichs D,Kalm E,Genetic parameters for health traits and their relationship to different persistency traits in German Holstein dairy cattleJ Dairy SciYear: 2006893202321216840638  
Jones WP,Hansen LB,ChesterJones H,Response of Health Care to Selection for Milk Yield of Dairy CattleJ Dairy SciYear: 199477313731527836603  
Muir BL,Fatehi J,Schaeffer LR,Genetic relationships between persistency and reproductive performance in firstlactation Canadian HolsteinsJ Dairy SciYear: 2004873029303715375065  
Togashi K,Lin CY,Genetic improvement of total milk yield and total lactation persistency of the first three lactations in dairy cattleJ Dairy SciYear: 2008912836284318565941  
Togashi K,Lin CY,Maximization of Lactation Milk Production Without Decreasing PersistencyJ Dairy SciYear: 2005882975298016027211  
Gengler N,Persistency of lactation yields: A reviewInterbull BulletinYear: 1996128796  
Druet T,Jaffrezic F,Ducrocq V,Estimation of genetic parameters for test day records of dairy traits in the first three lactationsGenet Sel EvolYear: 20053725727115823235  
Togashi K,Lin CY,Selection for milk production and persistency using eigenvectors of the random regression coefficient matrixJ Dairy SciYear: 2006894866487317106117  
Cole JB,VanRaden PM,Genetic evaluation and best prediction of lactation persistencyJ Dairy SciYear: 2006892722272816772591  
Cole JB,Null DJ,Genetic evaluation of lactation persistency for five breeds of dairy cattleJ Dairy SciYear: 2009922248225819389984  
Kiiveri H,A Bayesian approach to variable selection when the number of variables is very largeScience and Statistics: A Festschrift for Terry SpeedYear: 2003Lecture Notes  Monograph Series. Institute of Mathematical Statistics127143  
Kiiveri H,A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observationsBMC BioinformaticsYear: 200818410693  
Verbyla AP,Cullis BR,Thompson R,The analysis of QTL by simultaneous use of the full linkage mapTheor Appl GenetYear: 20071169511117952402  
Weller JI,Ezra E,Leitner G,Genetic analysis of persistency in the Israeli Holstein population by the multitrait animal modelJ Dairy SciYear: 2006892738274616772593  
Samor? AB,Groen AF,Boettcher PJ,Jamrozik J,Canavesi F,Bagnato A,Genetic Correlation Patterns Between Somatic Cell Score and Protein Yield in the Italian HolsteinFriesian PopulationJ Dairy SciYear: 2008914013402118832227  
Dai JY,Ruczinski I,LeBlanc M,Kooperberg C,Imputation methods to improve inference in SNP association studiesGenet EpidemiolYear: 20063069070216986162  
Smith AB,Cullis BR,Gilmour AR,The analysis of crop evaluation data in AustraliaAust N Z J StatYear: 200143129145  
Verbyla AP,Cullis BR,Kenward MG,Welham SJ,The analysis of designed experiments and longitudinal data by using smoothing splinesJ R Stat Soc Ser C Appl StatYear: 199948269300  
Green PJ,Silverman BW,Nonparametric Regression and Generalized Linear ModelsYear: 1994London: Chapman & Hall  
White IMS,Cullis BR,Gilmour AR,R T,Smoothing biological data with splinesProceedings of the International Biometrics conferenceYear: 1999308316  
Frensham A,Cullis B,Verbyla A,Genotype by environment variance heterogeneity in a twostage analysisBiometricsYear: 19975313731383  
NCBI Map Viewer: Bos Taurus (cattle) genome viewhttp://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9913  
Gilmour AR,Gogel BJ,Cullis BR,Thompson R,ASREML Program user manualYear: 20062Hemel Hempstead, HP1 1ES, UK: VSN International Ltd  
R Development Core TeamR: A language and environment for statistical computingYear: 2007Vienna, Austria: R Foundation for Statistical Computing  
Jakobsen JH,Madsen P,Jensen J,Pedersen J,Christensen LG,Sorensen DA,Genetic parameters for milk production and persistency for Danish Holsteins estimated in random regression models using REMLJ Dairy SciYear: 2002851607161612146494  
Rekaya R,Carabano MJ,Toro MA,Bayesian analysis of lactation curves of HolsteinFriesian cattle using a nonlinear modelJ Dairy SciYear: 2000832691270111104290  
Mekus F,Laabs U,Veeze H,Tummler B,Genes in the vicinity of CFTR modulate the cystic fibrosis phenotype in highly concordant or discordant F508del homozygous sib pairsHum GenetYear: 200311211112483292  
Sherman EL,Nkrumah JD,Murdoch BM,Li C,Wang Z,Fu A,Moore SS,Polymorphisms and haplotypes in the bovine neuropeptide Y, growth hormone receptor, ghrelin, insulinlike growth factor 2, and uncoupling proteins 2 and 3 genes and their associations with measures of growth, performance, feed efficiency, and carcass merit in beef cattleJ Anim SciYear: 20088611617785604  
Xiao Q,Wu XL,Michal JJ,Reeves JJ,Busboom JR,Thorgaard GH,Jiang Z,A novel nuclearencoded mitochondrial poly(A) polymerase PAPD1 is a potential candidate gene for the extreme obesity related phenotypes in mammalsInt J Biol SciYear: 2006217117816810331  
Rasmussen KM,Hilson JA,Kjolhede CL,Obesity may impair lactogenesis IIJ NutrYear: 20011313009S3011S11694637  
Yamaji D,Kamikawa A,Soliman MM,Ito T,Ahmed MM,Makondo K,Watanabe A,Saito M,Kimura K,Leptin inhibits hepatocyte growth factorinduced ductal morphogenesis of bovine mammary epithelial cellsJpn J Vet ResYear: 20075418318917405355  
Ashwell MS,Heyen DW,Sonstegard TS,Van Tassell CP,Da Y,VanRaden PM,Ron M,Weller JI,Lewin HA,Detection of quantitative trait loci affecting milk production, health, and reproductive traits in Holstein cattleJ Dairy SciYear: 20048746847514762090  
Nadesalingam J,Plante Y,Gibson JP,Detection of QTL for milk production on Chromosomes 1 and 6 of Holstein cattleMamm GenomeYear: 200112273111178740  
Kucerova J,Lund MS,Sorensen P,Sahana G,Guldbrandtsen B,Nielsen VH,Thomsen B,Bendixen C,Multitrait quantitative trait loci mapping for milk production traits in Danish Holstein cattleJ Dairy SciYear: 2006892245225616702292 
Figures
Tables
The estimated variances, covariances and correlations for the cubic random regression due to cows used in the analysis of the selected data
P_{0}  P_{1}  P_{2}  P_{3}  

P_{0}  6.48  0.20  0.14  0.13 


P_{1}  1.24  6.24  0.17  0.37 


P_{2}  0.83  0.97  5.34  0.06 


P_{3}  0.58  1.68  0.26  3.26 
The values were found by averaging the results from analyses of six random subsets of the full data set; orthogonal polynomials were used (and are denoted by P_{0 }to P_{3}); the diagonal values are the estimated variances, the values below the diagonal are estimated covariances, and the values above the diagonal are the estimated correlations between orthogonal polynomial components
Locations of SNP found significant for persistency
Chromosome  Location (Mbp)  Size  Z ratio  pvalue 

BTA2  13.2  0.26  2.86  0.0043 
BTA2  9.3  0.20  2.29  0.0223 
BTA3  95.5  0.22  2.11  0.0351 
BTA4  47.8  0.21  1.82  0.0683 
BTA4  52.6  0.43  3.27  0.0011 
BTA5  8.4  0.34  1.84  0.0659 
BTA5  8.2  0.29  2.55  0.0108 
BTA6  25.5  0.26  2.29  0.0219 
BTA7  84.3  0.35  3.63  0.0003 
BTA8  16.6  0.42  3.75  0.0002 
BTA10  22.1  0.25  2.54  0.0110 
BTA10  62.5  0.30  2.64  0.0082 
BTA13  35.5  0.17  1.95  0.0511 
BTA14  48.9  0.20  1.69  0.0916 
BTA15  51.8  0.31  2.99  0.0028 
BTA16  16.3  0.17  1.72  0.0856 
BTA28  32.8  0.41  3.62  0.0003 
BTAX  70.1  0.22  2.54  0.0112 
Selected additive SNP together with the chromosome, the size of the effect on the persistency, the Z ratio (estimate over standard error) and a Pvalue based on the standard normal distribution; for additive effects, the difference between the homozygotes is twice the stated size value
Locations of SNP found significant for total milk yield
Chromosome  Location (Mbp)  Size  Z ratio  pvalue 

BTA1  139.0  84.83  3.67  0.0002 
BTA6  22.1  237.30  1.67  0.0940 
BTA9  38.1  62.69  1.95  0.0510 
BTA11  98.4  162.40  1.75  0.0794 
BTA12  34.6  46.71  1.80  0.0714 
BTA14  52.8  29.66  1.68  0.0936 
BTA19  59.6  77.12  4.36  0.0000 
BTA19  14.6  40.22  3.01  0.0026 
BTA23  14.8  339.10  2.14  0.0326 
BTA23  17.2  82.13  3.59  0.0003 
BTA23  13.1  33.12  1.79  0.0728 
BTA24  9.1  467.00  2.18  0.0289 
BTA24  23.2  74.61  3.57  0.0004 
BTA26  23.6  87.41  2.36  0.0184 
BTAX  1.5  33.48  1.91  0.0559 
BTAX  45.1  289.60  2.06  0.0397 
BTAX  71.0  25.64  2.08  0.0374 
BTAX  21.1  31.69  1.81  0.0706 
Selected additive SNP together with the chromosome, the size of the effect on the total milk yield, the Z ratio (estimate over standard error) and a Pvalue based on the standard normal distribution; for additive effects, the difference between the homozygotes is twice the stated Size value
Article Categories:

Previous Document: Generation of lung epitheliallike tissue from human embryonic stem cells.
Next Document: Largescale insertional mutagenesis of a coleopteran stored grain pest, the red flour beetle Triboli...