Applied and Environmental Microbiology, June 2008, p. 3490-3496, Vol. 74, No. 11
0099-2240/08/$08.00+0 doi:10.1128/AEM.02789-07
Copyright © 2008, American Society for Microbiology. All Rights Reserved.

L. Geue,4 and
B. Engel1
Central Veterinary Institute of Wageningen University and Research Centre, P.O. Box 65, 8200 AB Lelystad, The Netherlands,1 Department of Food Science, Cornell University, Ithaca, New York 14853,2 Quality Milk Production Services, Cornell University, Ithaca, New York 14850,3 Institute for Epidemiology, Federal Research Institute for Animal Health, Seestrasse 55, 16868 Wusterhausen, Germany4
Received 11 December 2007/ Accepted 23 March 2008
The required number of isolates in a sample that need to be typed in order to be 95% confident that all strains have been found can be derived from Bayesian inference (7). Bayesian sample size calculation combines prior information, based on expert opinion or pilot data sets from related studies, with data from additional typing studies to generate the posterior probability of detecting all strains that are present in a sample (1, 19). This study included data sets from the ruminant gastrointestinal tract and farm environments and allowed exploration of strain heterogeneity across bacterial species within sample types and across sample types within bacterial species. It continued the discussion initiated by Singer et al. (19) and Altekruse et al. (1) by showing that for different sources of samples different numbers of isolates need to be genotyped due to variability in the heterogeneity of bacterial populations. It demonstrated that the interplay between statisticians and microbiologists is essential for meaningful sample size estimation. Specifically, the aims of the current study were (i) to provide a single WinBUGS program code to perform all calculations with a large variety of data, (ii) to explain methods for the derivation of priors and the impact that they have on the posterior distributions, and (iii) to extend previously reported methodology (1, 19) by relaxing the assumption about equal expected relative frequencies of strains within samples.
TABLE 1. Types and numbers of samples and typing methods used to obtain bacterial isolates and typing information for assessment of within-sample strain heterogeneity
A second data set was used to compare the impact of relatively uninformative versus informative priors on outcomes of Bayesian sample size estimates. The isolates of non-type-specific E. coli, including VTEC, were obtained from fecal samples, rumen fluid, the small intestine (i.e., duodenum and ileum), and the large intestine (i.e., cecum and colon) of five sheep kept in indoor stalls. Feces samples were collected daily on days 1 to 21. Rumen fluid samples were collected 2 to 10 times for each animal at 1- to 7-day intervals using an esopharyngeal tube. One sample from the small intestine and one sample from the large intestine were collected from all animals at necropsy on day 21. Samples were stored at 4°C for a maximum of 12 h and homogenized, and 1:10, 1:100, and 1:1,000 dilutions were prepared using brain heart infusion broth (Difco). Portions (100 µl) of each dilution were plated on separate MacConkey agar plates and incubated at 37°C overnight, and E. coli was identified and counted as lactose-fermenting microorganisms using the dilution plates on which individual colonies were identifiable. For each sample five isolates of E. coli were genotyped using a PCR with enterobacterial repetitive intergenic consensus sequence primers (5).
Finally, to extend the methodology to samples in which strains were not assumed to be present at the same frequency, a set of bovine E. coli isolates was used. Seventy-six fecal samples were collected from 76 beef calves; these were the first samples in time, with 10 isolates screened per sample, from the longitudinal study reported by Geue et al. (8). For each sample, 10 isolates of E. coli (a total of 760 isolates) were screened for the presence of verotoxin 1 and verotoxin 2, for intimin, and for hemolysin using PCR tests as reported by Geue et al. and Döpfer et al. (4, 8). Of the 760 E. coli isolates, 425 (55.9%) were positive for verotoxin 1, verotoxin 2, or both verotoxins, 55 (7.2%) isolates were positive for verotoxin 1, verotoxin 2, or both verotoxins in combination with intimin, 124 (16.3%) isolates were positive for verotoxin 1, verotoxin 2, or both verotoxins in combination with hemolysin, and 52 (6.8%) isolates were positive for intimin in combination with hemolysin. These four categories are not mutually exclusive. The average numbers of screened isolates that were positive for the combinations of virulence markers are shown in Table 2.
TABLE 2. Overview of strain typing data: average numbers of isolates typed, average numbers of strains observed, assumed numbers of types per sample, values used to construct the prior distributions, and numbers of isolates required to be typed in order to identify all strains present in a sample with 95% probability
Sample size calculations. (i) Combination of prior information and relevant data.
Bayesian statistical inference was used to calculate the number of isolates (N) that must be genotyped to identify all strains present in a future sample with a high (e.g., 95%) probability. Bayesian inference comprises a combination of prior information about parameters in the model and information from relevant data (7). Here, the parameters are the unknown probabilities ($\pi_1, \pi_2, \ldots, \pi_k$) for a sample to contain either exactly one strain, two strains, or up to a maximum of, e.g., six strains (k = 6). The prior information is a summary of what is "known" about $\pi_1, \pi_2, \ldots, \pi_6$ prior to the use of the relevant data. The prior information may be obtained from data that are related to the problem but are not directly applicable (perhaps taken from the literature) or from expert opinion. A prior probability is attached to each possible value of a parameter. Consequently, the prior probability takes the form of a probability distribution. The relevant data are directly applicable to the particular bacterial species in the setting of interest. So, while the prior information contains "soft" information, gathered from related sources, the relevant data represent "hard" information, and both are represented by a statistical model.
Attention is focused on the probability p (derived from $\pi_1, \pi_2, \ldots, \pi_6$) that all strains that are present in a future sample will actually be observed. The result of the calculations is again a distribution, referred to as the posterior (distribution), which offers an up-to-date summary of the information about the parameters. A large sample from the posterior of probability p is generated by a Markov chain Monte Carlo algorithm, as implemented in the WinBUGS package (20). The median for this sample is presented as an estimate for p and the 2.5 and 97.5 percentile points as Bayesian confidence bounds (a 95% credible interval). For each value of N that is specified in the program as a possible future sample size, an estimate and interval for p are derived. The program can be run for a range of potential values for N. An appropriate choice can be made from a table or a plot (for instance, the value for N where the estimate for probability p exceeds 95%).
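The calculation described above can be illustrated with a minimal Python sketch. It is not the authors' WinBUGS program. It assumes, as a simplification, that the number of strains observed in each pilot sample equals the true number present, so that a Dirichlet prior on the probabilities of a sample containing exactly 1, 2, ..., k strains can be updated by simply adding counts. It also assumes that all strains within a sample are equally frequent. The function names, the counts, and the prior values are hypothetical.

```python
import math
import random
import statistics


def prob_all_observed(j, n_isolates):
    """P(all j equally frequent strains appear among n_isolates isolates)."""
    # Inclusion-exclusion over the set of strains that could be missed.
    return sum(
        (-1) ** i * math.comb(j, i) * (1 - i / j) ** n_isolates
        for i in range(j + 1)
    )


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet vector by normalising independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_p(prior_alpha, counts, n_isolates, n_draws=5000, seed=1):
    """Median and 95% credible interval of p for one future sample size."""
    rng = random.Random(seed)
    # Simplifying assumption: conjugate update, prior shape plus counts.
    post_alpha = [a + c for a, c in zip(prior_alpha, counts)]
    q = [prob_all_observed(j, n_isolates) for j in range(1, len(post_alpha) + 1)]
    p_draws = sorted(
        sum(pi_j * q_j for pi_j, q_j in zip(sample_dirichlet(post_alpha, rng), q))
        for _ in range(n_draws)
    )
    lower = p_draws[int(0.025 * n_draws)]
    upper = p_draws[int(0.975 * n_draws) - 1]
    return statistics.median(p_draws), lower, upper


# Hypothetical pilot data: numbers of samples with exactly 1..6 strains.
example_counts = [4, 3, 2, 1, 0, 0]
example_prior = [1.0] * 6  # flat prior, assumed for illustration only

for n in (5, 10, 20, 40):
    median, lower, upper = posterior_p(example_prior, example_counts, n)
    print(n, round(median, 3), round(lower, 3), round(upper, 3))
```

Running the loop over a range of N gives the kind of table from which the smallest N with an estimate of p above 95% can be read off.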
Choice of priors.
Let there be k different bacterial strains in a sample, where k is assumed to be known. For the probabilities $\pi_1, \pi_2, \ldots, \pi_k$ that a sample contains exactly 1, 2, ..., or k strains, we need a prior distribution for positive numbers that add up to 1. The Dirichlet distribution has this property and is a convenient distribution for use as a prior distribution. The form of this distribution depends on the values of its shape parameters ($\alpha_1, \ldots, \alpha_k$) that have to be specified.
$$(\pi_1, \ldots, \pi_k) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_k) \tag{1}$$
The values of $\alpha_1, \ldots, \alpha_k$ will be chosen such that equation 1 reasonably reflects the available prior information. A rule of thumb is that prior distribution 1 mimics the information in $m = \alpha_1 + \cdots + \alpha_k$ imaginary samples, with a proportion ($\alpha_1/m$) of the samples with exactly one strain, a proportion ($\alpha_2/m$) of the samples with exactly two strains, etc.
When little is known a priori, a prior will be chosen that expresses hardly any preference for possible values for $\pi_1, \ldots, \pi_k$ between 0 and 1. Popular choices for such a relatively uninformative prior are:
![]() | (2a) |
$$\alpha_1 = \cdots = \alpha_k = 1 \tag{2b}$$
Alternatively, when a stronger opinion is voiced about the values of $\pi_1, \ldots, \pi_k$, based on expert opinion or previously published information, a more informative prior may be chosen. For illustration, the choice of an informative prior for non-type-specific fecal E. coli is discussed below, based on data from a previously conducted experiment, as shown in Table 3. Initially, following the rule of thumb, the $\alpha$ values are chosen to be equal to the counts in Table 3, replacing 0 by 0.5. This is practically equivalent to adding the data of Table 3 to the other relevant data and using prior 2a or 2b. However, when we do not feel quite confident about the data that inform the prior (Table 3), maybe because they relate to somewhat different samples or experimental conditions, we may decide to choose the prior more cautiously. To that end, we multiply the $\alpha$ values by a factor ($\lambda$) less than 1; i.e., $\alpha_i$ is replaced by the smaller $\lambda\alpha_i$, where i = 1... k. The prior expected $\pi$ values remain the same (and equal to $\alpha_1/m, \ldots, \alpha_k/m$), but the prior distribution is wider, expressing the uncertainty about the relevance of the data that inform the prior (Table 3). The smaller the factor $\lambda$ that we use, the wider the prior is. The WinBUGS program has simple facilities to see what the prior looks like, so a suitable value for $\lambda$ may be chosen given the uncertainty about the relevance of the data in Table 3. When there is doubt about the choice of prior, several priors could be used to check their impact on the choice of N in relation to the intended value for p. In Table 2, priors are relatively uninformative, except for the prior for ovine fecal E. coli (data set with n = 50).
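The effect of down-weighting an informative prior can be checked numerically. The sketch below uses the standard closed-form mean and variance of a Dirichlet distribution to show that multiplying all shape parameters by a factor less than 1 leaves the prior expected values unchanged while widening the prior. The counts are hypothetical and are not the values of Table 3; the function names are likewise illustrative.

```python
def dirichlet_mean_sd(alpha):
    """Closed-form means and standard deviations of a Dirichlet(alpha)."""
    m = sum(alpha)
    means = [a / m for a in alpha]
    sds = [(a * (m - a) / (m * m * (m + 1))) ** 0.5 for a in alpha]
    return means, sds


def scale_prior(alpha, factor):
    """Multiply every shape parameter by a factor in (0, 1]."""
    if not 0 < factor <= 1:
        raise ValueError("factor must lie in (0, 1]")
    return [factor * a for a in alpha]


# Hypothetical pilot counts of samples with exactly 1..6 strains.
pilot_counts = [20, 15, 10, 4, 1, 0]
# Rule of thumb from the text: use the counts, replacing 0 by 0.5.
full_alpha = [c if c > 0 else 0.5 for c in pilot_counts]
cautious_alpha = scale_prior(full_alpha, 0.1)

full_means, full_sds = dirichlet_mean_sd(full_alpha)
cautious_means, cautious_sds = dirichlet_mean_sd(cautious_alpha)

for j in range(len(full_alpha)):
    print(j + 1, round(full_means[j], 3), round(full_sds[j], 3),
          round(cautious_sds[j], 3))
```

Comparing the two standard deviation columns for a few candidate factors is one simple way to judge how much weight the pilot data should carry.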
TABLE 3. Data from ovine fecal samples (n = 50) for non-type-specific E. coli: numbers of strain types observed and fractions of the numbers of strain types used to construct an informative prior distribution for the sample size calculations based on the second ovine fecal data set (n = 80) in Table 2a
The assumption that all strains are equally likely to be observed is relaxed for the special case of two strains or two groups of strains. All strains of interest are placed into one group, while the remaining strains are placed into another group. The data consist of the observed number of isolates with strains that belong to the first group per total number of isolates that are genotyped per sample. In this way, the required future sample size may be calculated for a sample containing heterogeneous populations of pathogens that will be screened, for example, for a rare strain type of interest. Technical details about the model for this special case, including information about the choice of prior distributions, are presented in the Appendix.
Presently, any dependence between data (e.g., repeated measurements for animals) is not taken into account. Technically, it is possible to include a suitable dependence structure in the model of the relevant data. However, this fine-tuning of the model and the WinBUGS program requires intimate knowledge of the data and a fair amount of statistical expertise. It is important to distinguish between an actual analysis of new experimental data, possibly by Bayesian inference, and the present calculation of the required sample size. The sample size calculations will often be based on a simplified model. The present calculations are expected to offer a reasonable indication of the order of magnitude of the required sample size. When the correlation between data is marked, the calculations result in a lower boundary for the required sample size.
FIG. 1. Probability p of finding all strains of a species present in a sample when N isolates per sample are characterized. Squares indicate L. monocytogenes, circles indicate S. uberis, and triangles indicate K. pneumoniae. Filled symbols and solid lines indicate fecal samples. Open symbols and dashed lines indicate soil samples. Cut-offs with the horizontal dotted line that marks the 95% probability p yield the numbers of required isolates (N) (e.g., N is about 6 for L. monocytogenes in soil).
FIG. 2. K. pneumoniae (a) and L. monocytogenes (b) from bovine fecal samples have different widths of the 95% credible interval (indicated by error bars) and result in different sample sizes for the 95% probability of typing all strains present in the samples.
FIG. 3. Probability p of finding all strains of VTEC and non-type-specific E. coli in small intestine (triangles), large intestine (multiplication signs), rumen (squares), and feces (diamonds) samples when N isolates per sample are typed. The dotted horizontal line indicates 95% probability. The triangles, multiplication signs, squares, and diamonds show posterior probability distributions based on expert opinion and a uniform prior. The asterisks show distributions based on an informative prior for ovine fecal E. coli derived from the analysis of an independent fecal data set (n = 50 [Table 1]).
We provide a single program in WinBUGS code that enables the user to perform all the required calculations together. In contrast to previous publications (1, 19), there is no need to resort to additional programs to, e.g., evaluate some probabilities relevant to the calculations by simulation as model input. The WinBUGS package (20) is freely available on the internet (http://www.mrc-bsu.cam.ac.uk/bugs). The WinBUGS programs, as used for the current sample size calculations, are available from D. Döpfer, together with instructions. A microbiologist provides microbial typing data and expert opinion about how many types are expected to exist in the data. The informed user of the WinBUGS program, for example a microbiologist or a statistician, has to specify prior information about model parameters based on the expert opinion or independent data sets and information about the uncertainty about this prior information. The choice of prior distributions is discussed in some detail. The relatively flat priors (priors 2a and 2b) can be used routinely, when the user has little prior information or intends to include little prior information in addition to the data that are considered directly relevant for the bacterial species and particular study. The statistician informs the microbiologist with regard to the number of isolates per sample that needs to be analyzed, based on the microbiologist's research question and expert opinion, as well as relevant data.
The process of updating the information, as illustrated for the informative prior for ovine fecal E. coli with information derived from an independent pilot data set, is iterative, and sample size information can be improved with each study that is undertaken. Updating information through consecutive studies lies at the core of Bayesian statistical inference (2, 7).
Bayesian statistical approaches are often criticized for employing expert opinion to generate prior distributions. Calculations in this study show that with the same priors derived from expert opinion (e.g., L. monocytogenes in soil versus fecal samples), different posterior distributions for the probability p of detecting all strains present can be obtained (10 versus 6 strains [Table 2]). This demonstrates that even data sets of modest size (10 samples for L. monocytogenes [Table 1]) contain information that is extracted by the Bayesian inference and reflected by the posterior distributions of p. Credible intervals for p for different data sets may vary in width, as demonstrated for K. pneumoniae versus L. monocytogenes in fecal samples (Fig. 2a and b). This difference in credible intervals demonstrates that the posterior critically depends on the amount and heterogeneity of the data. The "molecular typing walk" through the ovine gastrointestinal tract illustrates how different the numbers of E. coli isolates necessary for typing can be, where the rumen and large intestines have higher values than the small intestines. The presence of E. coli strains with different relative frequencies (e.g., specific virulotypes versus all other strains) can be detected at a certain level of confidence, as demonstrated using the bovine fecal E. coli isolates.
It is often assumed that all bacterial strains that are present in a sample are equally likely to be isolated. This may not always be true. For example, potentially pathogenic VTEC is known to comprise about 1% of all E. coli in the feces of ruminants (15). The number of isolates that needs to be tested to find at least one isolate of a given type, if it is present in a less-than-average percentage of all cases, may be far higher (14). This is particularly relevant when selective or indicator media for detection of the strain of interest are not available. For example, in one of our laboratories, a real-time PCR test for detection of VTEC in bulk tank milk is used. No selective or indicator media are available for non-O157:H7 VTEC, and the high heterogeneity of E. coli strains in bulk tank milk turns finding a VTEC isolate into the proverbial search for a needle in a haystack. In the analysis presented here, the assumption that strains are equally likely to be present in a sample is relaxed for strains that are collected in two groups (for example, different groups of genotypes, virulotypes, or other typing strategies, such as serotyping). We are presently developing an extension of the model where the assumption is relaxed further. The present study is meant to further enhance the awareness about the numbers of isolates that need to be typed in a sample for heterogeneous populations of pathogens, as initiated by Singer et al. (19) and Altekruse et al. (1).
Another improvement in the calculation of numbers of isolates that need to be typed may be to incorporate hierarchy in the data (e.g., data from field studies comprising repeated measurements per farm, animal, food processing plant, or site). To this end, the current WinBUGS program needs to be fine-tuned, which requires considerable statistical expertise and goes beyond routine use of the present WinBUGS programs.
Independent of the typing method, it is likely that there will be heterogeneity of isolates in many sample types and surveys, which makes typing of multiple isolates per sample necessary so that information is not lost. Variation across niches, species, and time implies that sample size calculations benefit from a pilot study before large-scale molecular typing studies are performed. The worst outcome of a "blind" typing and sampling strategy for heterogeneous populations of pathogens would be a failure to detect a zoonotic or bioterrorism hazard. Given the progress of automated typing of microorganisms, it is not unthinkable that multiple isolates per sample will be typed in the near future.
Details of the statistical calculations: model with two (groups of) strains.
Let all strains of interest be placed in one group, referred to as type A, while the remaining strains are placed in another group, referred to as type B. Types A and B are now the only types in the analysis. Let x be the number of isolates of type A that are observed in a sample where n isolates are genotyped (x = 0... n). Let $\pi_1$, $\pi_2$, and $\pi_3$ be the probabilities that a random sample contains only type A isolates, only type B isolates (i.e., no type A isolates), or both type A and B isolates, coded by t = 1, t = 2, and t = 3, respectively. Depending on the type of sample (t = 1, t = 2, or t = 3), x follows a binomial distribution with total n and probability 1, 0, or $\theta$, respectively. Note that we no longer assume that all strains in a sample have an equal probability of being observed, since the probability $\theta$ is not restricted to be equal to 0.5. The probability of detecting at least one isolate of each type that is present in the sample, when N isolates are genotyped, is:
$$p = \pi_1 + \pi_2 + \pi_3\,[1 - \theta^N - (1 - \theta)^N]$$
For the $\pi$ probabilities we use a Dirichlet prior distribution, as described previously, and for probability $\theta$ we use a beta prior distribution. Actually, the beta distribution for $\theta$ is the same as the Dirichlet distribution for $\theta$ and (1 – $\theta$) for k = 2. Prior 2b yields the uniform distribution for $\theta$ for the interval from 0 to 1.
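The two-group model can be explored with a small plug-in calculation. The sketch below follows from the definitions in this appendix: a sample containing only one type reveals that type with any isolate, whereas a mixed sample reveals both types unless all N isolates are of type A or all are of type B. It uses fixed, hypothetical values for the sample-type probabilities and for the type A frequency rather than posterior draws, so it indicates only an order of magnitude and is not the authors' WinBUGS program.

```python
def detection_probability(p_only_a, p_only_b, p_mixed, freq_a, n_isolates):
    """P(every type present in a sample is seen among n_isolates isolates)."""
    # In a mixed sample, detection fails only if all isolates are of one type.
    miss = freq_a ** n_isolates + (1 - freq_a) ** n_isolates
    return p_only_a + p_only_b + p_mixed * (1 - miss)


def required_isolates(p_only_a, p_only_b, p_mixed, freq_a,
                      target=0.95, max_n=100000):
    """Smallest N reaching the target probability, or None if not reached."""
    for n in range(1, max_n + 1):
        if detection_probability(p_only_a, p_only_b, p_mixed,
                                 freq_a, n) >= target:
            return n
    return None


# Hypothetical worst case: every sample is mixed (p_mixed = 1).
n_equal = required_isolates(0.0, 0.0, 1.0, freq_a=0.5)
n_rare = required_isolates(0.0, 0.0, 1.0, freq_a=0.01)
print(n_equal, n_rare)
```

With these assumed inputs, two equally frequent groups need 6 isolates, whereas a group making up 1% of isolates needs 299, which illustrates why rare types demand far larger numbers of isolates per sample.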
Published ahead of print on 31 March 2008.
Present address: Moredun Research Institute, Pentlands Science Park, Penicuik EH26 0PZ, Scotland.