Previous Article | Next Article ![]()
Applied and Environmental Microbiology, May 2005, p. 2355-2364, Vol. 71, No. 5
0099-2240/05/$08.00+0 doi:10.1128/AEM.71.5.2355-2364.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Departments of Mathematics,1 Statistics,6 Biological and Agricultural Engineering,2 Biological Sciences,3 Microbiology, Molecular Biology and Biochemistry,4 Environmental Biotechnology Institute, University of Idaho, Moscow, Idaho 838445
Received 21 September 2004/ Accepted 16 November 2004
|
|
|---|
|
|
|---|
The forces of the environment and the events of reproduction and growth are themselves stochastic in nature (3, 21, 24, 32, 36, 39, 44, 45, 51, 53), yet simple ecological models, such as the Verhulst logistic equation, result in deterministic predictions. However, smooth convergence to asymptotic results is not what is usually seen, even in rigorous experimental settings (17, 31). Hence, a more realistic alternative to population growth modeling is to confront stochastic equations with the data at hand (references 5, 14, and 17 and citations therein and references 41 and 47).
In this paper, we describe a novel application of a stochastic population model to analyze how environmental conditions influence microbial growth dynamics using an extensive experimental data set. The primary model used was the stochastic Ricker (SR) equation (27, 51). Our secondary model was a novel use and application of the SR model in an analysis of variance (ANOVA)-like format to account for differences between and within experimental conditions. We use rigorous statistical methods to characterize the effects of chemical supplements on the following aspects of bacterial growth: the rate of attaining stationary phase, the population density at which stationary phase occurs, and the variability associated with the growth process. The experimental data were time series measures of Escherichia coli growth in a basal medium amended with all 32 possible combinations of five supplements, namely, glucose, NH4Cl, HCl, EDTA, and NaCl (Fig. 1). All supplement combinations were tested in triplicate. The methods and results presented here have broad implications and can be applied to any situation in which the experimentalist wishes to determine the factors that affect growth responses of microbial isolates.
![]() View larger version (54K): [in a new window] |
FIG. 1. Plotted are the three replicated growth curves (OD x 100 as a function of time [min]) of each of the 32 design types (or treatment combinations). Each replicated growth curve is marked with a different line type. In the titles above each plot, the factors present or absent are indicated by 0 or 1, respectively, with the sequence of treatments being glucose, NH4Cl, HCl, EDTA, and NaCl.
|
|
|
|---|
Theoretical background. (i) Primary model: the SR model.
We modeled the growth dynamics of each treatment combination using a stochastic logistic growth model based on the Ricker model (51). The Ricker model has a long history in population ecology modeling (35) and has been used as a discrete version of the well-known Verhulst logistic differential equation (40). The Ricker equation expresses the one-step-ahead population size as a function of the current population size and includes a density-dependent effect:
![]() | (1) |
= a/b. The dynamic behavior of this model up to and including chaos is well known (17, 27, 40). A stochastic version of this model bridges the gap between theoretical ideas and available data sets. The stochastic formulation of the deterministic model leads to a hypothesis that explains how the departures from the deterministic predictions occur in a given time series of population growth (17). The sources of variability may come from observational error or from process error. Observational error relates to the fact that at a given time, the total number of individuals cannot be known exactly, and the experimenter's estimate is based on a sample. Process error comes from uncertainty inherent in the process of growth itself. Population dynamics theory further subdivides the process error into "demographic" and "environmental" stochasticity. Demographic stochasticity represents the variability due to random contributions of births, deaths, and migrations of individuals in the population (17, 33). Environmental stochasticity represents the effect of external factors on the individuals of the population (23). For example, in Baranyi's stochastic models (5, 6, 7), the equations' coefficients are random variables that represent the biological variability between the individual cells in the population. In the ecological theory context, this is a model for demographic stochasticity (23). Cushing et al. (17) have shown that considering both demographic and environmental stochasticity is essential to adequate population dynamics modeling but that integrating both sources of uncertainty in a single model is not an easy task.
Because we were dealing with an experiment in which a given environment was imposed on bacterial populations, we chose to model environmental stochasticity alone. The stochastic Ricker model is written as follows (27):
![]() | (2) |
Et is a random shock to the population growth rate at time t, and the Et are independent and normally distributed with mean zero and variance one. Because the current population size depends on the previous observation, this model has the Markov property and in the log scale it becomes a first-order nonlinear autoregressive model:
![]() | (3) |
Setting a equal to 0 and b equal to 0 defines a discrete-time Brownian motion process with zero drift where there is no population density feedback typical of ecological processes (11, 23, 27). When a is not equal to 0 and b is equal to 0, the model is also a discrete-time Brownian motion with added drift. If a is not equal to 0 and b is not equal to 0, the model includes (positive or negative) density dependence. In particular, the model in which a is not equal to 0 and b is <0 represents a stochastic logistic growth. Under this model, the population no longer attains a single deterministic equilibrium, as in the Ricker equation, but instead, it approaches a "cloud of points" (60), a stationary distribution which can be approximated by a gamma probability density function (20, 24) whose mean is a/b. The point a/b represents a center for return tendencies: it is the population abundance at which the average change is Nt, conditional on Nt1 being zero, thus accounting for the stationary phase. Note that the stochastic Ricker model does not include a term to account for the lag and death phases of bacterial growth in a batch culture, but as will be shown, the model serves as a very good approximation of these growth phases during the monitored period, despite the fact that the death phase is often evident. We later propose and briefly explore simple modifications to the stochastic Ricker model and other models to account for the processes occurring during these two phases of bacterial growth in batch cultures.
The problem of connecting a time series of observations with the proposed model in equation 2 requires the specification of a likelihood function. For each of the treatments, we estimated the corresponding parameters of the SR model as follows: suppose population abundances of a single culture are observed from time 0 to q. The likelihood function is a function of the unknown model parameters
= [a, b,
2]'. It is the joint probability density function for the vector of random variables X = [X1, X2, ..., Xq]' conditional on X0 = x0 and evaluated at the recorded vector of values x = [ln n0, ln n1, ..., ln nq]' = [x0, x1, ..., xq]' (25, 27). Because of the Markov property, the joint probability density function of the observed data, given our proposed model, is just the product of the individual transition density functions. The maximum-likelihood parameter estimates (MLEs) are the parameter values that make the observed data "most probable" or "most likely," i.e., they maximize the likelihood function for time series data, which is given by (27):
![]() | (4) |
2 under this model are identical to the least-squares estimates obtained by a linear regression of the observed yt on nt1, where t = 1,2, ..., q. This makes parameter estimation a straightforward task, and many commercial statistical packages can be used. We note that the confidence intervals returned by those packages are not correct in this case, because the observations at each time step are not independent of each other. Furthermore, the value b = 0 is at the edge of the set of values b < 0 for which the stochastic process Nt is ergodic, thus violating one of the regularity conditions under which a
2 approximation to the likelihood ratio test is valid. Using the
2 approximation thus yields inflated type I error rates (27). Instead, confidence intervals for the parameter estimates have to be found using the parametric bootstrap (PB) (29, 38). The PB of a stochastic model of population dynamics involves the following steps (26, 27). From a given data set, the ML estimates of the parameters are calculated for the chosen stochastic model and used to simulate many time series data sets (e.g., 2,000) of the same length as the original data set, using the same model. The ML estimates are then found for each of these data sets. Their histogram (or kernel density estimate) represents an estimate of their sampling distribution, from which various descriptive statistics can be obtained (e.g., the 2.5 and 97.5 percentiles).
(ii) Secondary model: an ANOVA variant of the SR model.
In predictive microbiology, secondary models describe the dependence of primary-model parameters on environmental factors. Here, the primary-model parameters are contained in the stochastic growth rate function (i.e., the terms in brackets in equation 2). Hence, the natural way to rigorously determine whether a certain combination of environmental factors affects the growth rate is to adopt an approach that is conceptually "ANOVA-like" and test whether there are differences in the estimated model parameters within and between experimental scenarios. Inference using this approach was greatly facilitated in this study because our data set was a perfect 25 factorial design with three replicates of each experimental treatment. If the general hypothesis that each of the 32 treatments has a distinct effect on the population dynamics of bacteria is supported by the data, then the speed of attaining stationary phase, the population size around which the cell counts vary in stationary phase, and the variability of the process (i.e., each of the SR model parameters) should differ between the 32 experimental treatments. To estimate the model parameters for a single treatment, the joint likelihood of the three replicated time series has to be considered. Because a treatment's replicates were independent, the likelihood function for a given experimental treatment was taken as the product of the individual likelihoods. Note that although the ideal situation is having a full factorial design, just like in the traditional ANOVA methods, the overall procedure could be modified to account for unbalanced designs.
Various hypotheses were contained in the general hypothesis described above, and modified multivariate techniques can be used to identify patterns in the SR model parameter space. If there is no difference between the model parameters found in, e.g., two experimental treatments, the data for these two treatments can be pooled and the SR model parameters can be estimated anew. In that fashion, the total number of possible treatment effects, combinations, and interactions can be reduced and the main factors that drive the system dynamics can be identified. The patterns of variation in the 96 data sets were summarized in order to propose a set of competing hypotheses consistent with the observations and to group the data accordingly. To do so, the model parameters for each of the 96 data sets were calculated as described before. These values were used as coordinates in a three-dimensional space. Canonical-variates analysis was used to identify patterns of variation among treatments. The design types were plotted in a two-dimensional canonical-variates space. A careful inspection of the canonical-variates plot allowed the formulation of seven hypotheses that were consistent with the data. Each hypothesis grouped all of the 32 experimental treatments in a particular way and proposed that distinct population dynamics occurred within each group. We then tested and compared the degree to which each of the seven hypotheses was in agreement with the observed patterns of variation. Confidence intervals using a PB were calculated for the model parameters under each hypothesis (see reference 27 for details).
(iii) Model selection.
Statistical theory and evidence from many biological disciplines (15) show that traditional stepwise regression methods based on a series of likelihood ratio tests may miss the best model or hypothesis consistent with the data. Also, a series of pairwise comparisons can lead to erroneous P values in likelihood ratio tests and inflated type I errors. Therefore, we relied here on the Akaike information criterion (AIC) (1) and on the Bayesian information criterion (BIC) (54) to simultaneously assess the quality of each of those hypotheses to explain the data.
The use of the AIC and BIC for hypothesis selection has a strong theoretical rooting in information theory. For a given data set, the AIC gives an estimate of the expected, relative, directed distance between the fitted model and the unknown true mechanism that generated the data (15). Thus, the decision rule for model selection using those statistics is to choose the model with the lowest AIC or BIC. For a fixed data set, adding more parameters to the model reduces that distance but further increases uncertainty in the estimation process. That trade-off between underfitting and overfitting is directly expressed in both the AIC and BIC as a term that penalizes their scores as a function of the number of estimated parameters in the model. We note that Schwarz (54) showed that if the true model is within the suite of evaluated models, then the BIC is guaranteed to find it. Hence, we used both statistics to derive the conclusions of the hypothesis selection. For the stochastic Ricker model, the AIC is equal to:
![]() | (5) |
) is the likelihood function evaluated at the MLEs and p is the number of model parameters. The BIC is calculated with:
![]() | (6) |
Finally, an evaluation of the quality of the best-fitted model was done via a residual analysis. Under the proposed model, the residuals should be normally distributed, centered around zero, and nonautocorrelated. A strong deviation from normality, if it appears, is an indicator that the current model mechanism is insufficient to explain the available data. All the calculations were done using MATLAB 6.5.1, release 13 (The MathWorks, Inc., Natick, Mass.).
|
|
|---|
Primary model: the SR model.
The SR model predictions found through a process error fit were superior to those generated by alternative models using an observation error fit, as is clearly illustrated in Fig. 2. In fact, even improved observation error models including a death phase do not fit as well as the SR model. This figure also illustrates the fact that a process error model fit is conceptually different from an observation error fit in that it naturally yields a stochastic one-step-ahead prediction of future population sizes. The advantage of this becomes evident while plotting the predictions from both types of statistical fit (Fig. 2). The residuals obtained by an observation error fit of the Ricker model were at least 1 order of magnitude larger than the process error fit residuals. A typical deterministic prediction using an observation error fit differs dramatically from the one-step-ahead predictions generated using the stochastic Ricker model parameter estimates.
![]() View larger version (19K): [in a new window] |
FIG. 2. Stochastic (Stoch.) Ricker model predictions compared to the observed data, an observation error fit of the Ricker equation, and an observation error fit of an ordinary differential equation (ODE) model that accounts for the rate of substrate consumption. In the ODE model, n is the population abundance, r is the resource variable, and max, k, µ, and are constants:
This model assumes that a population is growing according to the well-known Monod function and that individuals are dying at a rate µn.
|
2 of the stochastic logistic growth models fitted to each of the experimental growth curves are reported in Fig. 3a to c. In Fig. 3d to f, the parameters for supplement combinations 26 through 32 were omitted for clarity.
![]() View larger version (47K): [in a new window] |
FIG. 3. Box plots of the stochastic Ricker model parameter estimates as a function of the design type for treatments 1 to 32 (a, b, and c) and for designs 1 to 25 (d, e, and f). The bar inside the box indicates the median, the boxes delimit the interquartile range, and the vertical lines extend to the maximum and minimum.
|
The MLEs of the stochastic Ricker model for each of the 96 data sets were used to run a canonical-variate analysis. Fisher's discriminant function based on the estimated parameter values of the stochastic Ricker models identified the vectors in the multidimensional space on which, when the data points were orthogonally projected, the experimental designs were maximally separated (34). The test for the relative contributions of the three selected eigenvalues and the total canonical structure extracted by Fisher's discriminant function indicated that only the first two eigenvalues were significant (P values of < 0.001, < 0.001, and 0.0684, respectively). Furthermore, they cumulatively explained 87.26% of the variation in the three-dimensional space of [a,b,
2]. A closer examination of the canonical structure indicated that the extracted canonical variate number 3 was not very useful to explain the variation in the three-dimensional space. The plot of the designs in the discriminant space (Fig. 4) was used to identify possible design groupings and to formulate the following hypotheses.
![]() View larger version (10K): [in a new window] |
FIG. 4. (Left) Experimental treatments plotted in a two-dimensional space according to their mean coordinate values in canonical variates 1 and 2. (Right) Same as before, but treatment 26 is excluded for clarity. The two canonical variates represent the two axes in the three-dimensional space of a, b, and 2 along which the highest separation of the data points is obtained, i.e., these are the vectors along which most of the variability seen in the values of the model parameters occurs. Hence, two treatments for which the biochemical environment leads to similar growth patterns will appear closer when plotted in the two-canonical-variates space.
|
(ii) Hypothesis 1.
Because the variability in the data coming from design 26 was much higher than the rest (Fig. 3c), this design can be separated from the rest and constitute another treatment by itself, besides the two proposed by hypothesis 0. Hence, this hypothesis proposes to divide the designs into three groups, corresponding to three separate population dynamics types.
(iii) Hypothesis 2.
Four groups explain the data: designs 27 to 32, in which EDTA and NaCl are both present and at least one of NH4Cl and HCl is present; design 26; designs 16 and 24; and all other designs.
(iv) Hypothesis 3.
Four experimental groups defined by just two supplements explain the data. The treatments are EDTA and NaCl, and the experimental groups are as follows: EDTA and NaCl absent, EDTA present and NaCl absent, EDTA absent and NaCl present, and both present. This hypothesis suggests that there is a significant interaction between EDTA and NaCl and that the effects of that interaction drive the dynamics of the system.
(v) Hypothesis 4.
Hypothesis 4 considers the same experimental groups as in hypothesis 3 but also includes the presence or absence of glucose as a factor. Hence, a total of eight experimental groups are proposed by this hypothesis.
(vi) Hypothesis 5.
All of the supplements except glucose have an effect, and interactions exist between EDTA plus NaCl and NH4Cl plus HCl. There are a total of 16 experimental groups.
(vii) Hypothesis 6: full model.
The data at hand can be best explained by proposing 32 distinct types of population dynamics types corresponding to each of the 32 experimental treatments with three replicates each.
Model selection.
To evaluate the consistency of each hypothesis with the data, the BIC and AIC statistics were calculated from the likelihood of each hypothesis evaluated at the MLEs. To estimate the parameters under each hypothesis, we pooled the replicates of designs contained within each group proposed by the seven hypotheses. The joint likelihood function for each group was computed as the product of the individual time series likelihoods.
Both the AIC and the BIC clearly favored hypothesis 3 (Table 1), implying that a good explanation of the changes in a, b, and
2 was obtained while considering just the presence or absence of EDTA and NaCl. With the parameter estimates and confidence intervals obtained using PB (Table 2), interaction plots were drawn (Fig. 5). The interaction plots show that the order of the effect of NaCl changed as the state of the EDTA treatment changed. Thus, the effects of these two factors were not simply additive. When the data coming from the treatments with the above-mentioned interaction were excluded, then the effects of the other factors became visible (Fig. 3d to f). The BIC values of hypotheses 0, 1, 2, 4, and 5 show that even when other tentative explanations may be valid and better than the full model, the patterns of variation in the time series cannot be explained as well as with hypothesis 3. The confidence intervals obtained with a PB of hypothesis 5 showed that the interactions between NaCl plus EDTA and NH4Cl plus HCl were not significant (Fig. 5). Thus, in hypotheses 4 to 6, the treatment effects were just additive and could not generate as much variation as the one generated by the interaction between EDTA and NaCl. The second-best hypothesis was the reduced hypothesis (hypothesis 0), which considered just two groups. This hypothesis had the benefit of a reduced number of parameters to estimate. However, with just two groups it is not possible to estimate the interaction effects.
|
View this table: [in a new window] |
TABLE 1. Hypothesis selectiona
|
|
View this table: [in a new window] |
TABLE 2. Stochastic Ricker model parameter estimates and PB confidence intervals for the model specified by hypothesis 3a
|
![]() View larger version (24K): [in a new window] |
FIG. 5. Interaction plots for hypothesis 3 and hypothesis 5. (Left) Closed circles represent absence of NaCl, open circles, presence of NaCl. (Right) Open circles represent absence of both NH4Cl and HCl, closed circles, presence of both chemicals. Parametric bootstrap confidence limits are marked with crosses.
|
![]() View larger version (17K): [in a new window] |
FIG. 6. Plotted are the quantile-quantile graphs of the residuals for each of the four experimental design groups in hypothesis 3 (see "Model selection" for details). The points far from the straight lines in the lower right plot represent the individual model residuals that deviated from the model normality assumption.
|
|
|
|---|
Model selection via the AIC and BIC overcomes two major problems reported in the literature. The first one is overparameterization. In the search for more mechanistic models, many parameters are introduced, for example, to model the probability of growth as a long nonlinear function (57). Baranyi et al. (9) have already warned about the effects of overparameterization in predictive microbiology. Our results show that by using the AIC or BIC, a good compromise between the number of parameters included in the model and the quality of the predictions is reached. The second problem is stepwise regression as a model selection tool. As mentioned before, it has been repeatedly shown that stepwise regression can lead to an erroneous choice of the best available model. The information criteria used here overcome that problem by evaluating all the models at the same time, thereby allowing a proper identification of the best available model.
Process error fitting is a more realistic and mechanistic modeling approach. It is a realistic approach because it aims to explain the variability seen in time series of population abundance by proposing stepwise stochastic changes in the growth rate as a function of the experimental medium (Fig. 2). It is a mechanistic approach because it effectively explains and reproduces the observed growth patterns through proposed biological processes well rooted in first principles. Our model basically states that exponential growth and density-dependent effects are random and change as a function of the growth medium. The effects of the environmental factors were directly translated into changes in the basic growth characteristics.
Modeling the process error is not a mere fitting technique. Deterministic mathematical models (e.g., Gompertz and logistic equations) are often proposed as the true underlying "perfect" mechanism and fitted to the data via nonlinear fitting techniques (8, 13, 16, 19, 52, 59). Deviations from the deterministic predictions are accounted for as observational uncertainty. On the other hand, the SR model asserts that the process of population growth is stochastic by nature and can be modeled with a Markov process. The transition probability density function of this Markov process is used to predict trends and also as a natural tool for parameter estimation via ML. In the presence of substantial observation error, the process error model ML parameter estimates and predictions are biased, and the converse is also true (25, 28, 43, 46). However, in the context of this paper, because the microtiter plate reader is highly accurate (coefficient of variation = standard deviation/mean = 1.08%) and large populations are being sampled, observation error can be neglected. Further evidence supporting the use of the SR model was given here by the quantile residual plots (Fig. 6), in which generally the assumptions of normally distributed errors were met. However, we saw deviations from the normal model at the early stages (lag phase) and toward the end (death phase). Early deviations can possibly be additionally affected by the detectability limitations of the use of OD as a surrogate for population size (18), as well as by a lag phase.
The SR model can be modified to accurately predict more complex biological phenomena of bacterial batch cultures, such as the lag and death phases, the effect of time-varying environmental factors, and multiple-species microbial interactions. Well-known ideas in ecology can be translated into stochastic population models to explain the data. As an example, suppose that at small initial population sizes bacteria cannot grow well due to harsh environmental conditions, e.g., low pH values. This situation might change at larger population sizes, as metabolic products could increase the pH. That is, at intermediate population sizes, the population might grow much better than at small population sizes. Furthermore, a critical initial population size might exist below which bacteria cannot grow and above which they succeed. This phenomenon is well known in ecology as the Allee effect (2, 20), and stochastic population models exist in the literature that account for it (20, 21). A stochastic modeling approach to this problem could, for instance, lead to accurate estimates of that critical initial population size below which bacteria fail to grow.
To account for environmental stochasticity and the fluctuations over time in resources and in the concentrations of the chemical supplements, the stochastic Ricker model could be easily modified. To include the main effects of the chemical factors without the interactions, the growth function in the exponential term of the stochastic Ricker model could be written as follows:
![]() |
Accounting for the population lag phase (in the sense of Baranyi [5]) is also straightforward using the SR model. During lag phase, bacterial numbers seem to fluctuate randomly around the initial density. However, the case of zero growth is just a special case of the SR model. Indeed, no growth is equivalent to setting a equal to 0 and b equal to 0 in equation 2 (see "Theoretical background" above). Then, accounting for the lag phase amounts to specifying what is known in the statistics literature as a stochastic change point problem. While our model effectively accounts for the pattern of variability seen in the data, there are other aspects of growth dynamics that could be modeled in future work. For example, delay differential equation models, like the one proposed by Baker et al. (4), are useful in estimating the fraction of cells that are dividing, the degree of initial synchronization of the cells, and the initial distribution of cells in the cell cycle. It would be interesting to extend Baker's model to a stochastic differential equation and properly account for stochastic effects. However, parameter estimation via ML for stochastic differential equation models is a wide-open research area and presents many challenging statistical problems. These topics constitute materials for future research.
Finally, another of the latest concerns in predictive microbiology is the consideration of microbial interactions (37, 58). A two-species deterministic discrete Ricker model exists in the literature (40). This model can be easily transformed into a stochastic model in the form of equation 2 (22, 28). Preliminary results not presented here show that this competition model can be effectively used to analyze the seminal data set of Gause (31).
Stochastic modeling techniques are maturing, and there is no reason why theoretical and applied studies in microbiology should be deprived of such useful mathematical statistics tools and concepts. The results of this paper have broad implications for both basic and applied research and can be regarded as a different starting point to fulfill the urgent need of simple stochastic models of microbial growth.
We also thank Brian Dennis for reading the manuscript and providing many insightful comments and ideas.
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»