## ABSTRACT

Data below detection limits, left-censored data, are common in environmental microbiology, and decisions in handling censored data may have implications for quantitative microbial risk assessment (QMRA). In this paper, we utilize simulated data sets informed by real-world enterovirus water data to evaluate methods for handling left-censored data. Data sets were simulated with four censoring degrees (low [10%], medium [35%], high [65%], and severe [90%]) and one real-life censoring example (97%) and were informed by enterovirus data assuming a lognormal distribution with a limit of detection (LOD) of 2.3 genome copies/liter. For each data set, five methods for handling left-censored data were applied: (i) substitution with LOD/^{−2} from known values under 97% censoring. MI method 2 was the next overall best method. For medium to severe censoring, MI method 1 may result in the least error. If unsure of the distribution, MI method 2 may be a preferred method to avoid distribution misspecification.

**IMPORTANCE** This study evaluates methods for handling data with low (10%) to severe (90%) left-censoring within an environmental microbiology context and demonstrates that some of these methods may be appropriate when using data containing concentrations below a limit of detection to estimate infection risks. Additionally, this study uses a skewed data set, which is an issue typically faced by environmental microbiologists.

## INTRODUCTION

Methodologies for handling data below limits of detection (LOD) have been a long-recognized issue in many scientific disciplines, and it is a frequent reality for environmental health and exposure scientists. Data below a LOD for which the true value is unknown are often referred to as “left-censored.” If data are above a particular value but the true value is unknown, these are referred to as “right censored.” For example, in microbiology, a left-censored data point may be one for which nothing was detected with a particular method or assay, but it is unknown whether there are truly zero organisms of interest in the sample. An example of a right-censored data point in microbiology might be a plate count of “TNTC” (too numerous to count), where the lower bound of the possible value may be known but the true value of the count is unknown.

The issue of left-censored data is familiar in environmental microbiology. Virus recovery efficiency challenges and needs for adjustments to viral concentration data to more closely reflect reality have been acknowledged (1). Additionally, more recent environmental microbiological studies have addressed that substitution is not an appropriate method for handling LOD data. In 2015, Pouillot et al. (2) used methods from survival analysis and Bayesian inference models to characterize distributions of norovirus and male-specific coliphage concentrations due to these methods' ability to handle left-censoring (2). The distributions were then used to model viral concentrations and log reductions due to various water treatment steps (2). In 2016, Vergara et al. used nondetects (left-censored values) and data analysis (NADA) methods developed by Helsel et al. in fitting distributions to norovirus and human adenovirus concentration data utilized in a quantitative microbial risk assessment (QMRA) (3, 4).

The method chosen for handling data below LOD has important implications for estimating risks within QMRA. In several sensitivity analyses for exposure models, pathogen concentration was found to be one of the biggest drivers, if not the number one driver, of infection risk (5–8). For pathogenic viruses for which there are low infectious doses relative to those of other microorganisms, the method for handling data at or below LOD may have notable impacts on predicted health outcomes. Interpreting data below the LOD within an environmental health context is particularly important, as data may be used to determine health risks associated with exposures. Substituting values below the LOD with a constant, such as the LOD, LOD/2, or LOD/

In 2014, the U.S. EPA published a document recognizing that substitution methods may not be appropriate for microbial field data, especially considering computing advances that allow for the use of more rigorous methods, such as MLE, KM, and multiple imputation (MI) (15). However, multiple options for handling censored data were discussed (15). With many available and recommended methods existing in current and previous literature surrounding LOD issues in environmental sciences, there is a need for the exploration of these methods within a QMRA context to demonstrate the potential impact of LOD methodology on estimated risk and to identify superior methods that support accurate risk assessments. Additionally, previous simulation studies comparing LOD methods often have not incorporated highly right-skewed data, which is common in environmental microbiology (16–19). The purpose of this study was to evaluate substitution, MLE, KM, and two MI methods for lognormal highly right-skewed data sets with low, medium, high, and severe censoring, as defined by the U.S. Army Public Health Command in 2015 (14), in addition to a real-life example in which 97% of samples were left-censored (20). Left-censoring in microbial data has been shown to occur at all of these censoring levels (20–23), making these censoring degrees relevant to the field of QMRA. The effects of LOD methods applied to simulated data on viral concentrations in drinking water on estimated viral doses and infection risks were then quantified and compared.

## RESULTS

For medium, high, and severe degrees of censoring, MI method 1 (which used MLE methods to estimate parameters of the lognormal distribution and imputing censored data points with values from this distribution below the LOD) estimated mean viral concentrations and predicted infection risks closest to those of the known, unmasked data sets (Tables 1 and 2). Additionally, this method resulted in the smallest root mean square errors (RMSEs) and biases in dose and infection risk for medium to severe degrees of censoring (Tables 3 and 4). MI method 2 (which assumed that all data below the LOD followed a uniform distribution and imputed values from this distribution for censored values) was the next overall best method for estimating the mean viral concentration of the data set (Table 1). MI method 2 resulted in the lowest RMSEs and biases in dose and infection risk for low censored data (Tables 3 and 4). As degrees of censoring increased, ranges of dose and infection risk biases and RMSEs increased, meaning that performance of the methods became more variable.

Biases for low to severe censoring for MI method 1 were positive, indicating that it overpredicts risk. However, with 97% censoring, MI method 1 began to underpredict some doses and infection risks. The MLE method underpredicted some doses and infection risks for all degrees of censoring. The infection risk bias with the smallest magnitude (1.14 × 10^{−4}) was observed under 90% censoring using MI method 1. The bias greatest in magnitude (9.93 × 10^{−1}) was observed for the MLE method with 97% censoring. This means that a predicted risk value of 0.01 could be incorrectly estimated to be as large as approximately 1.00 using an MLE method with 97% censored data. The MLE method had some bias values on the order of 10^{−1} for all censoring degrees, meaning that at any censoring degree, it is possible that the MLE method could result in a predicted risk that is 0.1 larger or smaller than the true infection risk. When comparing the distributions of predicted infection risks with 90% censored data sets, the MLE method predicted a maximum infection risk of 1 where the actual maximum infection risk was 0.21 (Table 2).

In the real-life example in which 97% of data were below the LOD, MI method 1 produced the lowest dose and infection risk biases. Using this method, the largest bias for dose was 3.14 viral particles, while the smallest bias was −0.058 viral particles (Table 4). For infection risk, the largest bias measured using MI method 1 was 1.17 × 10^{−2}, while the smallest bias was −2.14 × 10^{−4} (Table 4). This means that, at worst, this method may overpredict risk by 1.17 × 10^{−2} or may underpredict risk by 2.14 × 10^{−4}.

Most of the methods, aside from the MLE method, performed well in estimating the mean, minimum, and maximum infection risks. The substitution method performed well under low, medium, and high censoring (Table 2). Under severe censoring, the mean and maximum infection risks were closely estimated, but the minimum infection risks were overestimated. A similar pattern was observed for KM and MI method 2 (Table 2). MI method 1 consistently estimated an infection risk that was, at most, 1.17 × 10^{−2} from the actual value (Table 4), and minimum and maximum infection risks were closely predicted, even when the level of censoring was severe (Table 2).

All infection risks resulting from uncensored simulated data sets resulted in risks greater than the risk target, 1/10,000 (Table 2). This demonstrates that even when the 50th percentile of a distribution would result in a risk target, using the mean concentration of highly right-skewed data sets could result in risk predictions roughly 1- to 2-log_{10} larger than the risk target.

The sensitivity analysis demonstrated that 25% increases or decreases in the assumed geometric mean, geometric standard deviation, and LOD used to inform the assumed distribution for simulated data sets did not affect the conclusions that MI method 1 was overall superior in handling medium, high, and severe degrees of censoring and the real-life example of censoring while MI method 2 was superior in handling low degrees of censoring (see Tables S1 to S4 in the supplemental material).

## DISCUSSION

In a comparison of all measures of performance, MI method 1 was overall the most superior method, because it resulted in the smallest biases and RMSEs and captured distributions of infection risk well for the largest number of censoring categories (medium, high, and severe censoring) and for the real-life example. Aside from being useful in providing accurate estimations of censored data, this method slightly overpredicted the true risk for most censoring levels, making it a conservative tool in assessing risk and preferred over a method that may underestimate the true risk.

MI method 2 consistently performed the best for low censoring. However, it is possible that other methods may have been superior if a different parameter estimate, aside from the mean, had been used as a point of comparison. Within a QMRA context, means are typically used in deterministic modeling because this parameter is more conservative than the median, as it is sensitive to high exposure concentrations that may be experienced (16, 18).

The MLE method did not perform as well as other methods in these simulations. Although the MLE method has performed well in other simulation studies, other work has noted that the MLE method does not perform as well for highly skewed data, producing larger mean square errors (24). The substitution method did not perform as poorly as expected, as this method outperformed the MLE and KM methods. However, because R statistical software packages exist that make the implementation of multiple-imputation methods accessible, substitution methods should not be the first approach in handling left-censored data (25).

This study did not consider a scenario in which distribution misspecification occurs. If the assumed distribution is not the distribution by which the true data abide, this could lead to poor performance of methods that assume a particular distribution (24). If one is unsure of the distribution of the data for data sets with medium to severe degrees of censoring, MI method 2 may be more reliable than MI method 1, as distribution misspecification could result in poor performance of MI method 1. However, if one is confident that the distribution of the data has been accurately identified or if the assumption is strongly informed, then MI method 1 may be more appropriate to use. A source of uncertainty in this study was assumptions regarding microbial distribution characteristics that informed simulated data sets. Although real-life values from a microbial data set were used, future studies should evaluate these methods for other anticipated or observed microbial data set distributions.

Both MI methods are easily applicable within QMRA, as opposed to the MLE or KM methods, because they allow for the use of the full data set and are not restricted to using summary statistics for the data set. Use of the full data set is helpful if the intention is to utilize it in stochastic exposure modeling. MI method 1 could be used to fit distributions to data sets containing left-censored data so that these distributions could later be randomly sampled to represent a variety of expected concentrations. MI method 2 could be used to create a mixed distribution, where the uniform distribution would be sampled for a certain percentage of the time to represent the proportion of censored data in the data set. The uncensored data could then be sampled from discretely, or one could sample from a distribution fit to these uncensored concentrations (26). This study identifies methods for handling highly skewed, left-censored microbial concentration data and quantifies the impact of various methods on accuracy in calculating summary statistics and estimating infection risks with concentration data. Adoption of more mathematically intensive methods that offer better accuracy in handling left-censored data, such as the MI methods suggested in this study, are becoming more practical, as free software packages are available. Methods evaluated in this study can be used to predict infection risks associated with quantified microbial concentrations. There are a number of previous microbiological and QMRA studies that could have benefited from these methods in handling left-censored data due to various levels of left-censoring (21, 27, 28). In a study conducted by Herzog et al., the LOD was one of the most sensitive factors for health risk estimation in multiple modeled scenarios (29). As efforts continue to develop or modify current microbial detection methods with greater sensitivity, statistical methods, such as the ones in this study, can be utilized to bridge the current gap in addressing commonly experienced LOD issues in microbiology.

## MATERIALS AND METHODS

Simulating quantitative PCR drinking water virus concentration data.R statistical software was used for all simulations and evaluations in this study (25). A distribution that would be used to simulate the known data sets was first created by calculating an enterovirus concentration (genome copies/liter) that would be needed to result in an annual 1/10,000 infection risk, based on assumed parameters for the equations used to estimate dose and infection risk.

A lognormal distribution for virus concentrations was assumed, as this distribution has been applied to microbial water quality data, specifically for enteric viruses, and for other types of environmental concentration data (1, 21, 30). The concentration that would result in a 1/10,000 annual infection risk, the tolerable risk established by the U.S. EPA, using the following equations was set equal to the geometric mean of the lognormal distribution (3.66 × 10^{−5} genome copies/liter) used to create the “known” simulated data set (31). The geometric standard deviation of detected enterovirus in water samples (78.2 genome copies/liter) from a study conducted by Pearce-Walker in 2017 was used to represent the expected variability of viral concentrations in drinking water samples (20). The LOD for simulated data sets (2.3 genome copies/liter) was calculated using methods described by Pearce-Walker, assuming a filtered sample volume of 1,000 liters. Although not the processed sample volume utilized by Pearce-Walker (20), we used this larger sample volume, sometimes used when sampling water samples for viruses (32), in order to create a scenario in which LOD may be relatively close to infectious doses of viruses. In the case of this scenario, the LOD (2.3 genome copies/liter) was 2 orders of magnitude lower than the enterovirus infectious dose estimated by the QMRA wiki (http://qmrawiki.canr.msu.edu/index.php/Quantitative_Microbial_Risk_Assessment_(QMRA)_Wiki). However, some viruses, such as poliovirus and rotavirus, have infectious doses estimated to be as low as 1.41 and 6.17 viral particles, respectively (http://qmrawiki.canr.msu.edu/index.php/Quantitative_Microbial_Risk_Assessment_(QMRA)_Wiki). In this study, it was assumed that all detected viruses were viable, as this has been a recommended conservative approach in other enteric virus QMRA contexts (33). Equation 1 was used to estimate enterovirus exposures, equivalent to dose in this case, and has been used by the World Health Organization for microbial drinking water exposure estimates (34):
*V* is the volume of water consumed per day (liters), and *C* is the virus concentration (viral particles/liter).

It was assumed that a person drinks 2 liters of water per day, as this value has been recommended in other drinking water QMRA contexts (35). To represent an organism with a low infectious dose, the exponential dose-response model for enterovirus was used, as this has been recommended by the QMRA wiki (http://qmrawiki.canr.msu.edu/index.php/Quantitative_Microbial_Risk_Assessment_(QMRA)_Wiki):
*P*_{infect, daily} is the daily infection risk from drinking water and *k* equals 3.74 × 10^{−3}, a constant recommended by the QMRA wiki (http://qmrawiki.canr.msu.edu/index.php/Quantitative_Microbial_Risk_Assessment_(QMRA)_Wiki). The parameter *k* represents the probability of a single organism surviving and infecting the host (36). The annual infection risk was estimated as follows:
*P*_{infect, annual} is the annual infection risk.

Four degrees of censoring—low (10%), medium (35%), high (65%), and severe (90%)—within defined ranges stated by the U.S. Army Public Health Command (14) were considered. Additionally, 97% censoring was investigated as a “real-life example,” informed by an enterovirus data set in which approximately >96% (183/190) samples were censored (20). For each degree of censoring, 1,000 simulated data sets, assumed as “true” known data for our purposes, were created. Each individual data set included 100 viral concentrations, and this data set was then saved such that all data points were known, including concentrations below the theoretical LOD. A copy of this data set was then altered so that either 10%, 35%, 65%, 90%, or 97% of the concentrations were below the theoretical LOD concentration assumed in this study. Methods for handling left-censored data were applied to these censored data sets, and outcomes were compared to our “true” outcomes.

Substitution methods.Although multiple substitution values (0, LOD/

Maximum likelihood estimation and Kaplan-Meier methods.Using the NADA package in R, MLE and KM methods were used. The NADA package uses methods by Helsel (25, 38, 39). In using the cenmle and cenfit functions, inputted data were labeled as censored or uncensored. For censored values, the LOD was used as a placeholder for these values. As MLE and KM are not imputation methods, censored values were not replaced with a value. Rather, summary statistics were estimated for the entire data set, including censored concentrations. More information regarding the MLE and KM methods implemented by the NADA package can be found at https://cran.r-project.org/web/packages/NADA/NADA.pdf.

Multiple-imputation methods.Two distribution-based multiple-imputation methods were used in this study. This method involves assuming that the entire data set, including values that fall below the LOD, follows a particular distribution. This distribution is then used to impute values for censored data. This approach has been used in other left-censoring methodology studies, and its use within an environmental context has been encouraged (9).

The first multiple-imputation method (MI method 1) used MLE methods to estimate the parameters of a lognormal distribution fit to the full simulated data set, including censored concentrations. Values lower than the LOD were then imputed from this distribution for all censored values. To estimate the parameters of the lognormal distribution, the function fitdistcens from the R package fitdistRplus was used (40). This method has performed well in other simulation studies addressing environmental censored data (9).

The second multiple-imputation method (MI method 2) assumed a uniform distribution (minimum = 0, maximum = LOD) for all values less than the LOD. Left-censored values were then replaced with a number randomly selected from this uniform distribution (26).

Comparing estimated doses.RMSEs were calculated to compare estimated doses and infection risks with known values, where a lower RMSE value indicates closer estimation to the known value. This method has been utilized in other studies to evaluate methods for handling left-censored data (41, 42). Biases were also calculated to evaluate the direction of error for each LOD method. A smaller magnitude of bias indicated a closer estimation to the true value.

Sensitivity analysis.To address uncertainty in the geometric mean, geometric standard deviation, and LOD used to define the distribution for creating simulated data sets, a sensitivity analysis was conducted. The geometric mean, geometric standard deviation, and LOD were individually decreased and increased by 25%. The dose and infection risk biases and RMSEs for each method were calculated and compared to baseline values.

## ACKNOWLEDGMENTS

A.M.W. was supported by a Mel and Enid Zuckerman College of Public Health award and by the Western Alliance to Expand Student Opportunities (WAESO) Louis Stokes Alliance for Minority Participation (LSAMP) Bridge to Doctorate (BD) National Science Foundation (NSF) grant no. HRD-1608928.

## FOOTNOTES

- Received 18 May 2018.
- Accepted 8 August 2018.
- Accepted manuscript posted online 17 August 2018.
Supplemental material for this article may be found at https://doi.org/10.1128/AEM.01203-18.

## REFERENCES

- Copyright © 2018 American Society for Microbiology.