Evaluation of regression methods when immunological measurements are constrained by detection limits.  
MedLine Citation:

PMID: 18928527 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

BACKGROUND: The statistical analysis of immunological data may be complicated because precise quantitative levels cannot always be determined. Values below a given detection limit may not be observed (nondetects), and data with nondetects are called left-censored. Since nondetects cannot be considered as missing at random, a statistician faced with data containing these nondetects must decide how to combine nondetects with detects. Until now, the common practice has been to impute each nondetect with a single value, such as half of the detection limit, and to conduct ordinary regression analysis. The first aim of this paper is to give an overview of methods for analyzing censored data, and to provide new methods for handling such data beyond ordinary linear regression. The second aim is to compare these methods by simulation studies based on real data. RESULTS: We compared six new and existing methods: deletion of nondetects, single substitution, extrapolation by regression on order statistics, multiple imputation using maximum likelihood estimation, tobit regression, and logistic regression. The deletion and extrapolation by regression on order statistics methods gave biased parameter estimates. The single substitution method underestimated variances, and logistic regression suffered loss of power. Based on simulation studies, we found that tobit regression performed well when the proportion of nondetects was less than 30%, and that taken together the multiple imputation method performed best. CONCLUSION: Based on simulation studies, the newly developed multiple imputation method performed consistently well under different scenarios with various proportions of nondetects and sample sizes, and even in the presence of heteroscedastic errors.
Authors:

Hae-Won Uh; Franca C Hartgers; Maria Yazdanbakhsh; Jeanine J Houwing-Duistermaat
Publication Detail:

Type: Comparative Study; Journal Article; Research Support, Non-U.S. Gov't Date: 2008-10-17
Journal Detail:

Title: BMC immunology Volume: 9 ISSN: 1471-2172 ISO Abbreviation: BMC Immunol. Publication Date: 2008
Date Detail:

Created Date: 2008-12-02 Completed Date: 2009-02-19 Revised Date: 2013-06-05
Medline Journal Info:

Nlm Unique ID: 100966980 Medline TA: BMC Immunol Country: England 
Other Details:

Languages: eng Pagination: 59 Citation Subset: IM 
Affiliation:

Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands. h.uh@lumc.nl 
MeSH Terms  
Descriptor/Qualifier:

Animals; Bias (Epidemiology); Data Interpretation, Statistical; Humans; Immunity, Innate; Immunologic Techniques*; Models, Theoretical*; Regression Analysis; Research Design*; Sample Size; Statistics as Topic
Journal Information Journal ID (nlmta): BMC Immunol ISSN: 1471-2172 Publisher: BioMed Central
Article Information Copyright © 2008 Uh et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Received: 17 March 2008. Accepted: 17 October 2008. Collection publication date: 2008. Electronic publication date: 17 October 2008. Volume: 9. First Page: 59. Last Page: 59. ID: 2592244. Publisher Id: 1471-2172-9-59. PubMed Id: 18928527. DOI: 10.1186/1471-2172-9-59
Hae-Won Uh 1,2  Email: h.uh@lumc.nl
Franca C Hartgers 2  Email: F.C.Hartgers@lumc.nl
Maria Yazdanbakhsh 2  Email: m.yazdanbakhsh@lumc.nl
Jeanine J Houwing-Duistermaat 1  Email: J.J.Houwing@lumc.nl
1 Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, the Netherlands
2 Department of Parasitology, Leiden University Medical Center, Leiden, the Netherlands
The number of immunological parameters that can be measured in large scale epidemiological studies has been rapidly increasing. Not all of these quantitative levels can be determined precisely. This lack of precision arises either because the signal produced by the stimulant is too small for the instrumentation to discriminate it from the background noise, or because a signal is registered but certain (laboratory) criteria that identify the substance are not met. Values that cannot be quantified are called nondetects (NDs). We assume that all NDs are below a given detection limit (DL), and therefore we are dealing with censored data. Simple solutions such as deletion of NDs and single value substitution are often used, but it is unknown to what extent these methods provide unbiased results and thus would be adequate for the analysis. In environmental studies, different approaches have yielded different parameter estimates [^{1},^{2}].
When the number of NDs is rather small, one approach to dealing with NDs is simply to drop them and apply linear regression to the remaining data. A second commonly used approach is to substitute NDs with a certain value no larger than the DL (0, DL/2 or DL) and to use linear regression [^{3},^{4}]. The validity of these approaches will depend on the number and the unknown range of NDs. A third common practice is to dichotomize the cytokine measurements based on a certain cutoff point (DL or median) and to apply logistic regression to this binary variable [^{3},^{5}]. A major drawback of this approach is that much information is lost by dichotomizing. Note also that the choice of 0, DL/2 or DL in the single value substitution, and of the threshold in the logistic regression approach, is arbitrary. An important issue is then how to decide which method is optimal for a particular data set. Moreover, more sophisticated statistical methods may be needed for analyzing this type of data.
This paper is motivated by a study on the relationship between intensity of parasite infection and cytokine measurements resulting from a whole blood assay after stimulation with lipopolysaccharide (Table 1). One of the cytokine measurements has only a small proportion of NDs (5.5%), whereas the second measurement has a relatively large proportion of NDs (66%). In addition to the presence of censored measurements, the distribution of cytokine measurements is often positively skewed. Skewed distributions in biology often closely fit the lognormal distribution and this characterization can be advantageous in the biological system when many factors act in multiplicative ways [^{6},^{7}]. Therefore, it can be assumed that the cytokine variables are normally distributed after an appropriate (log) transformation.
Considering the effort invested in collecting data, it seems worthwhile to investigate sophisticated and (maybe) time-consuming statistical methods to analyze data appropriately [^{8}]. In this paper we review several commonly used methods in immunology and more advanced methods used in other fields such as environmetrics and econometrics [^{1},^{9},^{10}]. A second goal is to evaluate the performance of these methods via simulation studies [^{2},^{11},^{12}]. The validity and precision of simple methods such as deletion and single value substitution will be studied for various scenarios, including different proportions of NDs and different error models. In addition, the utility of advanced statistical methods will be quantified.
In Table 2, the six methods for analyzing data containing NDs that were considered are summarized: removal of nondetects (DELETION), single substitution of NDs with half of the value of DL (DL/2), extrapolation by regression on order statistics (ROS), multiple imputation using maximum likelihood estimation (MI), tobit regression (TOBIT), and logistic regression (LOGIT). To study the performance of these methods, we simulated data sets of size 200, 400 and 1,000 with proportions of NDs of 10, 30, 50 and 70%.
Results for RMSE are summarized in Figure 1. The simulation settings were as follows. The plots in the first column represent the scenario with a negative effect, imitating Cytokine 1, whilst the second to fourth columns show the positive effect of malaria intensity (Cytokine 2). The columns also show the effect of different thresholds. For the plots in the first column, the DL cutoff points were relatively large: 14.7, 10, 17, and 29 pg/ml, corresponding to 10, 30, 50 and 70% of the whole data sets. In contrast, for the second to fourth columns the DL values were very close to zero, namely 0.7, 1.6, 2.7, and 4.6 pg/ml.
The three rows display the results from the three different covariates. For the quantitative covariates generated from the three-component mixture (row 1) and the two-component mixture (row 2), the simulation results were similar. Therefore, the results from the three-component mixture imitating Cytokine 1 are discussed in detail.
In Additional file 1 the results are summarized in terms of bias, root mean square error (RMSE) and coverage probability of the 95% confidence interval (CI). Entries in the table are averages of 1,000 replications. In terms of bias, at all levels of NDs and for all sample sizes, the Deletion method produced the least accurate estimates. In contrast, for small data sets containing a small proportion of NDs (size 200 and 10% NDs), the TOBIT model produced nearly unbiased estimates, while the DL/2 method also performed well. With respect to RMSE, the MI method performed best in general. Note that the ROS method produced the smallest RMSE values, although it produced relatively large biases. This indicates that ROS underestimates the variance. For visualization of the results in terms of RMSE, the performance of the two best-performing methods (TOBIT and MI) was compared for different sample sizes in Figure 2. The advantage of using the TOBIT model with a small percentage of NDs seems to disappear with increasing sample size.
In Table 3 the averaged variances of the parameter estimates are given for the TOBIT, MI, and LOGIT methods. The variances of the TOBIT model increased with increasing proportion of NDs, while the level of variance using the MI method remained stable across the different sample sizes and percentages of NDs. The LOGIT method produced smaller variances at ND proportions of 30% and 50% than at very small or very large proportions of NDs.
Regarding efficiency of the methods, the right panel of Figure 2 shows the power to detect an effect at the nominal significance level of $\alpha$ = 5% for the TOBIT, MI, and LOGIT methods. The MI method was the best at all proportions of NDs and for all sample sizes. For a small proportion of NDs, the performance of the TOBIT and MI methods was equivalent. Overall, the LOGIT method performed worst.
Additionally we compared the performance of TOBIT, DL/2 and MI approaches under heteroscedastic errors. The results are depicted in Figure 3. The RMSE of the TOBIT and DL/2 methods increased rapidly with increasing proportion of NDs. The MI approach appeared to be most robust and RMSE was below 0.1 for proportions of ND under 30%.
In the last row of Figure 1, the results for the third (microscopic) category relative to the negative category are given. In contrast to the quantitative covariates, the methods behaved differently here. RMSE did not increase as rapidly as with quantitative covariates when the proportion of NDs became large. However, the actual RMSE values were much higher. Although the ranking of the best methods did not vary much, the TOBIT model performed poorly when the sample size was small (n = 200).
Different choices yielded different results as illustrated for our motivating data as can be seen in Table 4. To determine the optimal method for this particular data set, simulation results are used as a reference. For the two cytokine responses, the residuals of simple linear regression on intensity of parasite infection using DL/2 imputation for NDs are given in Figure 4.
For Cytokine 1, considering the rather small sample size (181), the small proportion of NDs (5.5%), and the homoscedastic errors (Figure 4), the TOBIT method (with parameter estimate $\hat{\beta}$ = 0.190 and corresponding p-value = 0.00008) might be a good choice. The simple DL/2 method ($\hat{\beta}$ = 0.186) gave similar results, which confirmed the simulation results. The MI method gave the next best estimate, $\hat{\beta}$ = 0.149. Logistic regression using the median value as a cutoff point (p-value = 0.155) resulted in loss of power, and the estimate by the ROS method ($\hat{\beta}$ = 0.085) was greatly biased. Notably, the estimate by the Deletion method ($\hat{\beta}$ = 0.012) pointed in a different direction, even with this small proportion of NDs.
Next we consider Cytokine 2 as the outcome variable. As can be derived from Figure 4, heteroscedasticity of errors was indicated for Cytokine 2. Based on the simulation results, for a large number of NDs and for heteroscedastic errors the MI method (with parameter estimate $\hat{\beta}$ = 0.550 and corresponding SE($\hat{\beta}$) = 0.127, Table 4) might be preferable to the others. The DL/2 method ($\hat{\beta}$ = 0.545 and SE($\hat{\beta}$) = 0.093) yielded a similar effect estimate, but the standard error of the parameter estimate was smaller than with the MI method. Since more than half of the measurements were NDs, the median could not serve as a cutoff point and DL was used for logistic regression instead. The use of binary rather than continuous data caused loss of power, and the estimate was no longer significant. The results given by the ROS method were greatly biased. Finally, as could be expected from the simulation study, the TOBIT method overestimated the effect size ($\hat{\beta}$ = 0.792).
In the field of immunology there is great need for specialized methods of data analysis in order to improve accuracy and power. In this paper we proposed advanced methods to deal with data sets in which the DL plays a significant role. Via simulation studies we first evaluated the performance of several methods. Because NDs are not missing at random, biases can be expected when simply dropping NDs. Even with a proportion of NDs of 10%, the bias was unacceptable. For parameter estimation, substituting DL/2 for NDs was reasonable, but the variance was underestimated. Furthermore, as illustrated by our data set, the choice of the imputed value (0, DL/2, DL) remains an issue. For a large proportion of NDs, ROS appeared to yield large biases. As with the DL/2 method, the variance of the parameter estimates was underestimated. The TOBIT method appeared to be an elegant method to deal with a small proportion of NDs under the constant error assumption. If possible, the normality assumption should be checked before considering the TOBIT model (Figure 4) [^{13}]. For larger proportions of NDs (larger than 10%), MI outperforms the other methods in terms of RMSE. Since imputations are multiple, the MI method takes into account the uncertainty about the true values of the NDs. Furthermore, it is rather robust against heteroscedasticity of errors. Figure 2 showed that for large sample sizes the MI method produces more accurate estimates than the TOBIT method. Note that the MI method might be improved by using more sophisticated methods to compute the mean and the standard deviation of a truncated normal distribution [^{14}]. However, because its variances diminish as the proportion of NDs increases, the MI method should be used with care when the proportion of NDs is greater than 50%.
We also compared results from the different scenarios: (1) whether there is a positive relationship between the dependent and independent variables, (2) whether the characteristics of the covariates were changed (three- and two-component mixture, or categorized), and (3) whether the closeness of the detection limit to zero influences the results (Figure 1). In general, the type of covariates included in the model did not influence the findings. Therefore, our findings in this paper can be used as a reference. Nevertheless, careful consideration should be given to which methods are appropriate for analyzing each specific data set.
A limitation of our simulation study concerns skewed error distributions, which were not examined. We did, however, study the simple solution of dichotomizing a continuous variable. Although this is an inefficient approach, and determination of cutoff points remains arbitrary [^{5}], in some situations creating a binary outcome variable could be the most sensible option, namely when measurements can easily be categorized. The method can also be extended to more than two categories by using ordered logistic regression (the proportional odds model). Note that to reflect the natural ordering of the categories, ordered logistic regression should be preferred to multinomial logistic regression [^{15},^{16}]. An additional advantage of ordered logistic regression is that the results can be presented in one parameter. In contrast, when using multinomial logistic regression as in our simulation study, the first (or most common) level is taken as the reference category (negative level), and inference is given for each of the remaining two categories compared to the reference. Although using more than two categories might improve performance, the determination of categories remains arbitrary.
When data are very skewed and normality cannot be achieved by the usual transformation, quantile regression could be considered [^{17}]. This is an econometric regression model, in which a specified conditional quantile (or percentile) of the outcome variable is expressed as a linear function of covariates.
Put simply, the fitted lines for the 70th, 80th or 90th percentile split the population into two parts, with a proportion of 70, 80 or 90% lying below the line and a proportion of 30, 20, or 10% above the line, respectively. As with logistic regression, the choice of the quantile is arbitrary. However, quantile regression assumes no underlying distribution, and is reported to be robust against heteroscedastic errors. Good performance of the quantile (or median) regression method has been reported elsewhere [^{18}]. However, when the proportion of NDs is greater than 50%, median regression is not suitable. Also, with normally distributed data (after appropriate transformation), the improvement from using median regression would be small. Quantile regression can be computed using R, SAS, and Stata.
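The idea behind quantile regression can be illustrated in its simplest form, without covariates. The Python sketch below is an illustration only and is not part of the analyses in this paper; the function names `pinball_loss` and `best_constant_quantile` are hypothetical. It shows that minimizing the asymmetric "check" loss over a constant recovers a sample quantile; quantile regression replaces the constant by a linear function of the covariates.

```python
def pinball_loss(residual, tau):
    """Check (pinball) loss for quantile level tau, 0 < tau < 1.

    Positive residuals are weighted by tau, negative ones by (1 - tau).
    """
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def best_constant_quantile(values, tau):
    """Return the observed value that minimizes the total check loss.

    Only observed values are tried as candidates, which is enough because
    a minimizer of the check loss always exists among the data points.
    """
    def total_loss(candidate):
        return sum(pinball_loss(v - candidate, tau) for v in values)

    return min(values, key=total_loss)


if __name__ == "__main__":
    data = [float(v) for v in range(1, 11)]
    print(best_constant_quantile(data, 0.5))  # a sample median of 1..10
    print(best_constant_quantile(data, 0.9))  # a 90th percentile of 1..10
```

Because the loss depends only on the sign and size of the residuals around the chosen quantile, no distributional assumption is needed, which is in line with the robustness noted above.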
In this paper we considered a single variable restricted by NDs. When extending to multiple regression, for example with multiple cytokine measurements of the same individuals and/or related cytokine levels within some set, it is very probable that we encounter NDs in more than one covariate and with different DLs. It can be expected that a large number of correlated cytokine variables would enhance the advantage of using multiple imputation techniques [^{19}]. In fact, using information on other correlated variables such as families would improve the performance of MI. It is not the purpose of this paper to stress that the MI method should be used everywhere in the presence of a DL. Nevertheless, we showed that the search for new methods may yield a deeper understanding of the data, and that simulation studies can help to decide on the optimal method for measurement data with NDs.
We showed that dichotomization of a continuous variable generally causes loss of information, and hence loss of power. We compared several linear regression methods for dealing with data containing NDs, based on simulation studies. The TOBIT method produced the most accurate estimates with the least bias. When the amount of NDs is relatively small ($\le$ 30%) and the normality assumption is met, as for Cytokine 1 in our example data, the use of the TOBIT method is recommended. However, as reported elsewhere [^{20},^{21}], the TOBIT model is sensitive to violation of the normality assumption. Therefore, when heteroscedastic errors are suspected and/or the amount of NDs is large, robust statistical methods have to be considered. We proposed to employ a multiple imputation technique. The MI method performed consistently well under different scenarios with various proportions of NDs ($\le$ 50%) and sample sizes, and even in the presence of heteroscedastic errors.
The following linear regression model is considered for the outcome $y_i$ of subject $i$ with a covariate $x_i$:
(1) $y_i = \alpha + \beta x_i + \varepsilon_i$,
where $\varepsilon_i$ is random noise and $i = 1, \ldots, n$. The error $\varepsilon$ is assumed to be uncorrelated with $x$ and to have a mean equal to zero and a constant variance. The parameters $\alpha$ and $\beta$ denote the intercept and the average change in $y$ with $x$. The slope and intercept of the regression line can be estimated by Ordinary Least Squares (OLS). However, in immunological data the $y$s in equation (1) are only partly observed. A lower threshold or detection limit, DL, interferes with measurements of low levels, so that the recorded value $y_i^*$ is:
(2) $y_i^* = \begin{cases} y_i & \text{if } y_i > \mathrm{DL} \\ \text{ND} & \text{if } y_i \le \mathrm{DL}. \end{cases}$
Since NDs of cytokine measurements reflect levels of exposure, they cannot be considered as missing at random (MAR) [^{22}]. Therefore, deleting the lowest values is expected to produce biased results. Other types of methods to analyze these data are imputation and modelling of NDs. An overview of the available methods is given in Table 2. In environmental statistics, an approach called robust regression on order statistics (ROS) exists [^{1},^{9}]. This method is often used to compute summary statistics.
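The effect of deleting or substituting NDs can be illustrated with a small simulation. The following Python sketch is not the code used in this study (the computations reported here were done in R). The intercept, slope, uniform covariate and the function names `simulate`, `ols_slope` and `compare_simple_methods` are illustrative assumptions, and the DL is simply taken as an empirical quantile of the simulated outcomes on the analysis scale.

```python
import random


def ols_slope(x, y):
    """Slope of the ordinary least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=5000, alpha=3.0, beta=-0.5, seed=1):
    """Draw (x, y) from y = alpha + beta * x + N(0, 1) noise (assumed values)."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 4.0) for _ in range(n)]
    y = [alpha + beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


def compare_simple_methods(x, y, nd_fraction=0.3):
    """Compare OLS slopes on full data, after deletion, and after DL/2 substitution.

    nd_fraction must be below 1; the DL is the empirical nd_fraction quantile.
    """
    dl = sorted(y)[int(nd_fraction * len(y))]
    detected = [yi > dl for yi in y]  # values at or below DL become NDs
    x_del = [xi for xi, d in zip(x, detected) if d]
    y_del = [yi for yi, d in zip(y, detected) if d]
    y_sub = [yi if d else dl / 2.0 for yi, d in zip(y, detected)]
    return {
        "full": ols_slope(x, y),            # slope if nothing were censored
        "deletion": ols_slope(x_del, y_del),
        "dl_half": ols_slope(x, y_sub),
        "n_nd": detected.count(False),
    }


if __name__ == "__main__":
    x, y = simulate()
    print(compare_simple_methods(x, y))
```

In this setting the slope after deletion is pulled towards zero relative to the slope on the uncensored data, because the lowest outcomes are removed selectively.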
To reflect uncertainty about imputation, we propose to employ the multiple imputation approach as introduced by Little and Rubin [^{22},^{23}]. Based on a truncated normal distribution, we first compute the mean and the standard deviation. This can be done using the functions cenmle or ros from the R package NADA [^{24}]. Then, the values for NDs are generated randomly, m complete data sets are created, and each data set is analyzed separately. Rubin (Chapter 3, [^{25}]) gives the following rule for combining the results. With m imputations, we obtain m different point estimates $\hat{\beta}_1, \ldots, \hat{\beta}_m$ as well as standard errors $s_1, \ldots, s_m$. The pooled MI point estimate is then simply the average of the m estimates: $\bar{\beta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\beta}_j$.
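The imputation step can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the software used in this study: the mean `mu` and standard deviation `sigma` are taken as already estimated (for example by maximum likelihood), NDs are coded as `None`, and the function names `draw_below_dl` and `impute_datasets` are hypothetical. Values for NDs are drawn from the normal distribution truncated above at the DL by inverse-CDF sampling.

```python
import random
from statistics import NormalDist


def draw_below_dl(mu, sigma, dl, rng):
    """One draw from N(mu, sigma^2) restricted to values at or below dl."""
    dist = NormalDist(mu, sigma)
    p_dl = dist.cdf(dl)                # probability mass below the DL
    u = rng.uniform(0.0, p_dl)         # uniform on the censored region
    u = min(max(u, 1e-300), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return dist.inv_cdf(u)


def impute_datasets(y_obs, mu, sigma, dl, m=10, seed=0):
    """Create m completed copies of y_obs, replacing each None by a random draw."""
    rng = random.Random(seed)
    return [
        [draw_below_dl(mu, sigma, dl, rng) if v is None else v for v in y_obs]
        for _ in range(m)
    ]


if __name__ == "__main__":
    observed = [None, 2.5, None, 3.1]
    for completed in impute_datasets(observed, mu=2.0, sigma=1.0, dl=1.0, m=3):
        print(completed)
```

Each completed data set would then be analyzed separately, for instance by ordinary linear regression, giving one estimate and one standard error per imputation.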
The variance estimate associated with $\bar{\beta}$ has two components. The within-imputation variance is estimated by the average of the complete-data variances, $\bar{U} = \frac{1}{m}\sum_{j=1}^{m} s_j^2$. The between-imputation variance is the variance of the m estimates around their mean, $B = \frac{1}{m-1}\sum_{j=1}^{m}(\hat{\beta}_j - \bar{\beta})^2$. The total variance is defined by $T = \bar{U} + (1 + m^{-1})B$, and inferences are based on the approximation $\bar{\beta}/T^{1/2} \sim t_{\nu}$, where the degrees of freedom are given by $\nu = (m-1)\left[1 + \frac{\bar{U}}{(1+m^{-1})B}\right]^2$.
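The combining rules can be written compactly in code. The Python sketch below follows Rubin's rules in their standard form; the function name `pool_rubin` and the example numbers are illustrative and do not come from the data of this study.

```python
import math


def pool_rubin(estimates, std_errors):
    """Pool m point estimates and standard errors with Rubin's rules.

    Returns (pooled estimate, total variance T, degrees of freedom nu).
    """
    m = len(estimates)
    if m < 2 or len(std_errors) != m:
        raise ValueError("need at least two imputations with matching standard errors")
    beta_bar = sum(estimates) / m                       # pooled point estimate
    u_bar = sum(s * s for s in std_errors) / m          # within-imputation variance
    b = sum((e - beta_bar) ** 2 for e in estimates) / (m - 1)  # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b
    if b == 0.0:
        nu = math.inf                                   # no between-imputation variability
    else:
        nu = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2
    return beta_bar, total, nu


if __name__ == "__main__":
    beta_bar, total, nu = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
    print(beta_bar, total, nu)
```

The pooled standard error is the square root of the total variance, and a test of the slope compares the ratio of the pooled estimate to that standard error with a t distribution on the returned degrees of freedom.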
Finally, two non-imputation methods for incorporating NDs into regression models are investigated. Without adding uncertainty on the distribution of the NDs, the outcomes can be dichotomized and logistic regression can be applied. However, the relationship between the covariate and the outcome is now on a logit scale instead of a linear one. A more sophisticated approach is to use the maximum likelihood estimation (MLE) method for left-censored data, called the TOBIT model after the economist James Tobin [^{26}]. The model is written as a combination of
$y_i^* = \begin{cases} y_i & \text{if } y_i > \mathrm{DL} \quad \text{(OLS part)} \\ \mathrm{DL} & \text{otherwise} \quad \text{(probit part).} \end{cases}$
The probit part determines whether the outcome variable is below DL, and the OLS part is a truncated regression model. The TOBIT model estimates a regression model for the data above DL, and assumes that the censored data (below DL) have the same distribution of errors as the observed data. The weakness of this method is that it may be more vulnerable to violation of the assumptions about the error distribution. It has often been noted in the literature that in the presence of heteroscedasticity the Tobit estimates are inconsistent, and that there is only limited information about the direction of the bias [^{20},^{21}].
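The two parts of the TOBIT model correspond to two kinds of terms in the log-likelihood for left-censored data under normal errors: a density term for each detect and a cumulative probability term for each ND. The Python sketch below writes out this standard log-likelihood; it is an illustration, not the implementation used in this study, and the function name `tobit_loglik` and the coding of NDs as `None` are assumptions.

```python
import math


def norm_logpdf(z):
    """Log density of the standard normal distribution at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_cdf(z):
    """Cumulative distribution function of the standard normal at z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def tobit_loglik(alpha, beta, sigma, x, y, dl):
    """Log-likelihood of y = alpha + beta * x + N(0, sigma^2), left-censored at dl.

    Entries of y that are None are treated as NDs (true value at or below dl).
    """
    loglik = 0.0
    for xi, yi in zip(x, y):
        mu = alpha + beta * xi
        if yi is None:
            # ND: contributes P(y <= DL) given the covariate (probit part)
            loglik += math.log(max(norm_cdf((dl - mu) / sigma), 1e-300))
        else:
            # detect: contributes the normal density (OLS part)
            loglik += norm_logpdf((yi - mu) / sigma) - math.log(sigma)
    return loglik


if __name__ == "__main__":
    print(tobit_loglik(0.0, 0.0, 1.0, [0.0, 0.0], [None, 0.0], dl=0.0))
```

Parameter estimates are obtained by maximizing this function over the intercept, slope and error standard deviation with a numerical optimizer. Because a single constant `sigma` enters every term, the sketch also makes visible why heteroscedastic errors violate the model.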
We simulated data sets by drawing samples from a population similar to the example data in the Background section, and by allocating a proportion of observations as NDs.
For the covariate x (infection intensity) we used (1) a three-component normal mixture distribution, (2) a two-component normal mixture distribution, and (3) three classes. The three-component normal mixture distribution has means equal to 0.77, 3.35 and 4.59 and a within-component variance of 0.027. The proportions of the three components were 0.83, 0.13 and 0.04, respectively. The two-component normal mixture distribution has means equal to 0.77 and 3.69 and a within-component variance of 0.069, with proportions 0.84 and 0.16, respectively.
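Sampling from these mixture distributions can be sketched as follows. The means, proportions and within-component variances are those stated above; the Python function `sample_mixture` and the dictionary layout are illustrative assumptions rather than the code used in this study.

```python
import random

# Mixture specifications as described in the text
THREE_COMPONENT = {
    "means": (0.77, 3.35, 4.59),
    "weights": (0.83, 0.13, 0.04),
    "variance": 0.027,
}
TWO_COMPONENT = {
    "means": (0.77, 3.69),
    "weights": (0.84, 0.16),
    "variance": 0.069,
}


def sample_mixture(n, spec, seed=0):
    """Draw n values from a normal mixture with a common within-component variance."""
    rng = random.Random(seed)
    sd = spec["variance"] ** 0.5
    # First pick a component for each subject, then draw from that component
    components = rng.choices(range(len(spec["means"])), weights=spec["weights"], k=n)
    return [rng.gauss(spec["means"][c], sd) for c in components]


if __name__ == "__main__":
    x = sample_mixture(200, THREE_COMPONENT, seed=1)
    print(min(x), max(x), sum(x) / len(x))
```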
Then, based on the characteristics of Cytokine 1, outcome variables were generated using the following regression model,
(3) $y_i = 3.04 - 0.16 x_i + \varepsilon_i$,
for individual $i \in \{1, \ldots, n\}$. Based on Cytokine 2, we generated outcome variables as
$y_i = 0.66 + 0.27 x_i + \varepsilon_i$.
The errors $\varepsilon_i$ were assumed to follow a standard normal distribution.
Based on biology, the malaria parasite measurements lend themselves to being categorized in three classes: negative, submicroscopic, and microscopic. Instead of looking at the effect of malaria with continuous measurements, we considered the categorical malaria variable, say z. The dummy code $z_i = (z_{i1}, z_{i2}, z_{i3})^{\top}$ denotes a vector of malaria category indicators for the ith subject, with elements $z_{ij} = 1$ if the ith subject belongs to the jth category, and $z_{ij} = 0$ otherwise. The categorical covariate vector z was then generated following the multinomial distribution of categorized malaria status with proportions of 0.69, 0.14, and 0.17. Based on Cytokine 1, y was generated following the model:
(4) $y_i = 2.97 - 0.13 z_{i2} - 0.58 z_{i3} + \varepsilon_i$,
while based on Cytokine 2
(5) $y_i = 0.84 + 0.13 z_{i2} + 0.77 z_{i3} + \varepsilon_i$.
Here the errors $\varepsilon_i$ were assumed to follow a standard normal distribution.
We then considered data samples of size n = 200, 400 and 1,000. The proportions of NDs were set to 10%, 30%, 50% and 70%. The corresponding DL cutoff points were: (1) for imitation of Cytokine 1, 14.7, 10, 17, and 29 pg/ml, and (2) for mimicking Cytokine 2, 0.7, 1.6, 2.7, and 4.6 pg/ml.
For studying the effect of heteroscedastic errors we used the same model as in (3), but now with a variance depending on the value of x, by using $\varepsilon_i \sim N(0, x_i)$.
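The generation of outcomes and the allocation of NDs can be sketched as follows. This Python sketch rests on several assumptions: the default coefficients are those of model (3) as read here (intercept 3.04, slope -0.16), N(0, x) is interpreted as a normal distribution with variance x, and the DL is chosen as the empirical quantile that yields the target proportion of NDs. The function names `generate_outcomes` and `censor_at_proportion` are hypothetical.

```python
import random


def generate_outcomes(x, alpha=3.04, beta=-0.16, heteroscedastic=False, seed=0):
    """Generate y = alpha + beta * x + error for each covariate value.

    Errors are N(0, 1), or N(0, x) (variance x) when heteroscedastic is True.
    """
    rng = random.Random(seed)
    y = []
    for xi in x:
        # Variance x means standard deviation sqrt(x); negative x is clipped to 0
        sd = max(xi, 0.0) ** 0.5 if heteroscedastic else 1.0
        y.append(alpha + beta * xi + rng.gauss(0.0, sd))
    return y


def censor_at_proportion(y, nd_fraction):
    """Set the DL so that about nd_fraction of y are NDs; NDs are returned as None."""
    ranked = sorted(y)
    k = int(round(nd_fraction * len(y)))   # number of values to censor
    dl = ranked[k - 1] if k > 0 else float("-inf")
    return dl, [None if v <= dl else v for v in y]


if __name__ == "__main__":
    rng = random.Random(2)
    x = [rng.uniform(0.5, 4.5) for _ in range(200)]
    y = generate_outcomes(x, heteroscedastic=True, seed=2)
    dl, y_cens = censor_at_proportion(y, 0.30)
    print(dl, y_cens.count(None))
```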
In general, the accuracy of an estimate can be evaluated by its bias, which represents closeness to the true value, whereas precision measures the ability to reproduce an estimate (regardless of accuracy). Accuracy and precision can be assessed jointly by the root mean square error (RMSE), defined as:
$\mathrm{RMSE} = \sqrt{\mathrm{bias}^2 + \mathrm{Variance}}$.
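For a set of estimates obtained from repeated simulations, these quantities can be computed as in the Python sketch below. The function name `accuracy_summary` is hypothetical, and the variance is computed with divisor R (the number of replications), an assumption made here so that bias squared plus variance equals the mean squared error exactly.

```python
def accuracy_summary(estimates, true_value):
    """Return (bias, variance, RMSE) of simulated estimates of a known true value."""
    r = len(estimates)
    mean_estimate = sum(estimates) / r
    bias = mean_estimate - true_value
    # Divisor r (not r - 1) so that bias**2 + variance is the mean squared error
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / r
    rmse = (bias ** 2 + variance) ** 0.5
    return bias, variance, rmse


if __name__ == "__main__":
    print(accuracy_summary([1.0, 2.0, 3.0], true_value=1.0))
```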
Therefore, parameter estimates provided by the various methods were compared in terms of mean bias and RMSE. The coverage probability is also reported, which is the probability that the confidence interval of the estimate contains the true value. Additionally, the unbiased methods were compared for their hypothesis testing abilities in terms of power. The Wald-type statistic $\hat{\beta}/\mathrm{SE}(\hat{\beta})$ was used for testing. For a continuous outcome it approximately follows a t-distribution with n − 2 degrees of freedom, where n is the number of observations in each sample.
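Coverage probability and power can be estimated from the simulated estimates and their standard errors as in the following Python sketch. As a simplifying assumption, the t quantile with n − 2 degrees of freedom is replaced by the standard normal quantile, which is close for the sample sizes considered here (n of 200 or more). The function name `coverage_and_power` is hypothetical.

```python
from statistics import NormalDist


def coverage_and_power(estimates, std_errors, true_value, level=0.95):
    """Estimate CI coverage and the rejection rate of H0: beta = 0.

    Uses a normal approximation to the t distribution (large-sample assumption).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for level 0.95
    covered = 0
    rejected = 0
    for est, se in zip(estimates, std_errors):
        if est - z * se <= true_value <= est + z * se:
            covered += 1                          # interval contains the true value
        if abs(est / se) > z:
            rejected += 1                         # Wald-type test rejects H0
    r = len(estimates)
    return covered / r, rejected / r


if __name__ == "__main__":
    print(coverage_and_power([1.0, 3.0, 0.1], [0.5, 0.5, 0.5], true_value=1.0))
```

The rejection rate is a power estimate when the true slope differs from zero, and an estimate of the type I error rate when the true slope is zero.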
All computations were done using the programming language R [^{27}].
HWU performed simulation, analyzed the data, and drafted the manuscript. JHD participated in interpretation of statistical methods, conception and design of the study. MY and FH participated in discussion on biological issues and provided data. All authors edited and approved the written manuscript.
Additional file 1. Results of simulation studies in accuracy and precision. Results were obtained by different approaches at various proportions of nondetects (entries are averages of 1,000 repetitions). File: 14712172959S1.pdf
This work was financially supported by the Royal Netherlands Academy of Arts and Sciences (KNAW SPIN project 05PP35). We thank the anonymous reviewer for the comments.
References
1. Helsel DL. Nondetects And Data Analysis: Statistics for censored environmental data. New York: John Wiley and Sons; 2005.
2. Lubin JH, et al. Epidemiologic evaluation of measurement data in the presence of detection limits. Environ Health Perspect 2004;112:1691–1696. [pmid: 15579415]
3. Diness BR, Fisker AB, Roth A, Yazdanbakhsh M, Sartono E, Whittle H, Nante JE, Lisse IM, Ravn H, Rodrigues A, Aaby P, Benn CS. Effect of high-dose vitamin A supplementation on the immune response to Bacille Calmette-Guerin vaccine. Am J Clin Nutr 2007;86:1152–1159. [pmid: 17921396]
4. Hornung RW, Reed LD. Estimation of average concentration in the presence of nondetectable values. Appl Occup Environ Hyg 1990;5:46–51.
5. Christiansen SC, et al. Inflammatory cytokines as risk factors for a first venous thrombosis: a prospective population-based study. PLoS Medicine 2006;3:1414–1419. [doi: 10.1371/journal.pmed.0030334]
6. Koch AL. The logarithm in biology. 1. Mechanisms generating the lognormal distribution exactly. J Theoret Biol 1966;12:276–290. [pmid: 5972197] [doi: 10.1016/00225193(66)901196]
7. Limpert E, Stahel W, Abbt M. Lognormal distributions across the sciences: keys and clues. BioScience 2001;51:341–352. [doi: 10.1641/00063568(2001)051[0341:LNDATS]2.0.CO;2]
8. Genser B, Cooper PJ, Yazdanbakhsh M, Barreto ML, Rodrigues LC. A guide to modern statistical analysis of immunological data. BMC Immunology 2007;8:27. [pmid: 17963513] [doi: 10.1186/14712172827]
9. Lee L, Helsel D. Statistical analysis of water-quality data containing multiple detection limits: S-language software for regression on order statistics. Computers & Geosciences 2005;31:1241–1248. [doi: 10.1016/j.cageo.2005.03.012]
10. Heijmans-Antonissen C, Wesseldijk F, Munnikes RJM, Huygen FJPM, Meijden P van der, Hop WCJ, Hooijkaas H, Zijlstra FJ. Multiplex bead array assay for detection of 25 soluble cytokines in blister fluid of patients with complex regional pain syndrome type 1. Mediators Inflamm 2006;2006:28398. [pmid: 16864900]
11. Richardson DB, Ciampi A. Effects of exposure measurements error when an exposure variable is constrained by a lower limit. Am J Epidemiol 2002;157:355–363. [pmid: 12578806] [doi: 10.1093/aje/kwf217]
12. Schisterman EF, Vexler A, Whitcomb BW, Liu A. The limitation due to exposure detection limits for regression models. Am J Epidemiol 2006;163:374–383. [pmid: 16394206] [doi: 10.1093/aje/kwj039]
13. Arabmazar A, Schmidt P. An investigation of the robustness of the tobit estimator to non-normality. Econometrica 1982;50:1055–1063. [doi: 10.2307/1912776]
14. Eilers PHC. Ill-posed problems with counts, the composite link model and penalized likelihood. Stat Modelling 2007;7:239–254.
15. Long KZ. Vitamin A supplementation reduces the monocyte chemoattractant protein-1 intestinal immune response of Mexican children. J Nutr 2006;136:2600–2605. [pmid: 16988133]
16. Agresti A. Categorical Data Analysis. New York: John Wiley & Sons; 2006.
17. Koenker R. Quantile Regression. Cambridge University Press; 2005.
18. White IR, Koupilova I, Carpenter J. The use of regression models for medians when observed outcomes may be modified by intervention. Statistics in Medicine 2003;22:1083–1096. [pmid: 12652555] [doi: 10.1002/sim.1408]
19. Dai JY, Ruczinski I, LeBlanc M, Kooperberg C. Imputation methods to improve inference in SNP association studies. Genet Epi 2006;30:690–702. [doi: 10.1002/gepi.20180]
20. Maddala GS. Limited-Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press; 1983.
21. Austin PC, Escobar M, Kopec JA. The use of the Tobit model for analyzing measures of health status. Quality of Life Research 2000;9:901–910. [pmid: 11284209] [doi: 10.1023/A:1008938326604]
22. Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: John Wiley & Sons; 1987.
23. Hopke PK, Liu C, Rubin DB. Multiple imputation for multivariate data with missing and below-threshold measurements: time-series concentrations of pollutants in the Arctic. Biometrics 2001;57:22–33. [pmid: 11252602] [doi: 10.1111/j.0006341X.2001.00022.x]
24. NADA for R: Nondetects and Data Analysis for the R statistical computing environment.
25. Little RJA, Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons; 1987.
26. Tobin J. Estimation of relationships for limited dependent variables. Econometrica 1958;26:24–36. [doi: 10.2307/1907382]
27. The R Project for Statistical Computing.
28. Stata Statistical Software.
29. SAS Statistical Software.
Tables
Table 1. Description of cytokine data
Cytokine | Sample size | Proportion of NDs | DL (pg/ml)
1 | 181 | 5.5% | 10
2 | 173 | 66% | 5
Table 2. Methods used for comparison
Method | Description | Software | Disadvantage
Deletion | Remove NDs. | Any statistical package | Bias
DL/2 | Substitute each ND with half of the value of DL. | Any statistical package | Large RMSE for large proportion of NDs
ROS | After computing a linear regression for data versus their normalized scores, below-DL values are extrapolated under a distributional assumption. | R package NADA [^{24},^{27}] | Underestimation of variance for large proportion of NDs
MI | Estimation of mean and standard deviation by MLE. Creating 10 complete samples. Pool the results from 10 individual analyses. | R (software available on request) | Bias for small proportion of NDs
TOBIT | Parametric estimation method for incorporating NDs. | R, Stata [^{28}], SAS [^{29}] | Sensitive to heteroscedastic errors
LOGIT | Create binary dependent variable of NDs (0s) and detects (1s). | Any statistical package | Loss of information; parameter estimates are less interpretable
Table 3. Variance of estimates provided by the MI, TOBIT and LOGIT approaches at various proportions of nondetects (entries are averages of 1,000 repetitions)
% NDs | MI (n = 200) | TOBIT (n = 200) | LOGIT (n = 200) | MI (n = 400) | TOBIT (n = 400) | LOGIT (n = 400) | MI (n = 1,000) | TOBIT (n = 1,000) | LOGIT (n = 1,000)
10% | 0.0050 | 0.0044 | 0.0358 | 0.0022 | 0.0022 | 0.0164 | 0.0008 | 0.0009 | 0.0063
30% | 0.0055 | 0.0050 | 0.0185 | 0.0022 | 0.0025 | 0.0089 | 0.0007 | 0.0010 | 0.0035
50% | 0.0055 | 0.0061 | 0.0195 | 0.0021 | 0.0030 | 0.0095 | 0.0006 | 0.0012 | 0.0037
70% | 0.0048 | 0.0092 | 0.0319 | 0.0018 | 0.0043 | 0.0147 | 0.0005 | 0.0017 | 0.0056
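Table 4. Application to real data: for the LOGIT model two cutoff points were used, median for Cytokine 1 and DL for Cytokine 2
Outcome | Group | Method | $\hat{\beta}$ (slope) | SE($\hat{\beta}$) | $\hat{\beta}$/SE($\hat{\beta}$) | p-value
Cytokine 1 | Standard methods | Deletion | 0.012 | 0.042 | 0.277 | 0.782
Cytokine 1 | Standard methods | Substitution of 0 | 0.255 | 0.055 | 4.645 | < 0.0001
Cytokine 1 | Standard methods | Substitution of DL/2 | 0.186 | 0.047 | 3.951 | 0.0001
Cytokine 1 | Standard methods | Substitution of DL | 0.117 | 0.040 | 2.879 | 0.005
Cytokine 1 | Standard methods | LOGIT (Median) | 0.198 | 0.139 | 1.423 | 0.155
Cytokine 1 | Advanced methods | ROS | 0.085 | 0.039 | 2.184 | 0.030
Cytokine 1 | Advanced methods | TOBIT | 0.190 | 0.048 | 3.960 | 0.00008
Cytokine 1 | Advanced methods | MI | 0.149 | 0.046 | 3.264 | 0.0006
Cytokine 2 | Standard methods | Deletion | 0.801 | 0.129 | 6.193 | < 0.0001
Cytokine 2 | Standard methods | Substitution of 0 | 0.585 | 0.1101 | 5.311 | < 0.0001
Cytokine 2 | Standard methods | Substitution of DL/2 | 0.545 | 0.093 | 5.842 | < 0.0001
Cytokine 2 | Standard methods | Substitution of DL | 0.504 | 0.079 | 6.418 | < 0.0001
Cytokine 2 | Standard methods | LOGIT (DL) | 0.208 | 0.113 | 1.841 | 0.066
Cytokine 2 | Advanced methods | ROS | 0.497 | 0.081 | 6.135 | < 0.0001
Cytokine 2 | Advanced methods | TOBIT | 0.792 | 0.179 | 4.436 | < 0.0001
Cytokine 2 | Advanced methods | MI | 0.550 | 0.127 | 4.322 | < 0.0001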