Previous Article | Next Article ![]()
Applied and Environmental Microbiology, August 2007, p. 4975-4983, Vol. 73, No. 15
0099-2240/07/$08.00+0 doi:10.1128/AEM.00128-07
Copyright © 2007, American Society for Microbiology. All Rights Reserved.
,
Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biology, University of Oslo, Oslo N-0316, Norway,1 Matforsk AS, Norwegian Food Research Institute, N-1430 Ås, Norway,2 Department of Mathematics, University of Oslo, Oslo N-0316, Norway,3 Hedmark University College, Hamar N-2318, Norway4
Received 18 January 2007/ Accepted 3 June 2007
|
|
|---|
|
|
|---|
When working with model microbial communities, an experimenter may select genetic markers that distinguish between the different community members at the taxonomic level of interest. A group of organisms with a shared genotype for such a marker may be said to form a genetically coherent unit (GCU). In this report, we present a novel approach to high-throughput GCU quantification from mixed microbial populations. Using direct mixed template sequencing, we describe the use of multivariate statistical analysis of complex DNA sequence electropherograms for the accurate estimation of relative genotype abundance. An outline of the approach is shown in Fig. 1.
![]() View larger version (15K): [in a new window] |
FIG. 1. Flow chart outlining the general methodology. The selection of a genetic marker and modeling approach is determined by the experimental setting. If PLSR is to be performed, the experimenter must supply a set of training data for model computation. If an MLR approach is preferred, the electropherograms for the selected genetic marker for the GCUs involved in the experiment must be supplied, preferably along with a test set of data for estimation of prediction accuracy and bias. Once the model has been computed and properly validated, it can be applied to real-time mixed population experiments.
|
We also present results from the analysis of two sets of direct community sequence data from mixed population experiments. In both cases, this analysis provided us with time series population data, thus demonstrating the usefulness of our approach. For the first set, we used the gltA gene model to monitor the colonization kinetics of different C. jejuni strains in the chicken cecum. Second, the 16S rRNA gene model was used to describe the dynamics in a 9-h anaerobic fermentation with three bacterial species commonly found in the gastrointestinal tracts of humans and animals.
|
|
|---|
Generation of mixed sequence spectra for multivariate modeling.
PCR amplification of the gltA genes was carried out using the primers glt1F (5'-CGT CTT TTT GAT TCT TTY CCT GAT AAT-3') and glt1R (5'-GAA GTW GAA GCA TTT TGY TCA TGA TCT G-3') (3). The 25-µl PCR mixture contained 1x Hot Start buffer (Finnzymes, Espoo, Finland), 200 µM of a deoxynucleoside triphosphate (dNTP) mixture, 1 U of DyNAzyme II Hot Start DNA polymerase (Finnzymes), 0.2 µM of each primer, and 1 µl of DNA. The amplification profile was an initial step of 95°C for 10 min and then 35 cycles of 95°C for 30 seconds, 50°C for 2 min, and 72°C for 30 seconds, and a final extension at 72°C for 7 min. The relatively high number of PCR cycles used was required due to the low colonization numbers of C. jejuni in the chicken cecum.
Amplified DNA fragment concentrations were estimated by agarose gel electrophoresis of the dilution series and subsequent analysis using a Typhoon 8600 variable-mode imager (Amersham Pharmacia Biotech, Little Chalfont, Buckinghamshire, United Kingdom) and the TotalLab Image Master software, version 1.11 (Amersham Pharmacia Biotech). Fragment concentrations were subsequently equilibrated by dilution.
The chain termination and labeling reactions for sequencing were carried out using the primer glt1F and BigDye v3.1 sequencing chemistry (Applied Biosystems, Foster City, CA) according to instructions supplied by the manufacturer. Sequencing was carried out with an ABI PRISM 3100 genetic analyzer (Applied Biosystems) using a standard protocol.
PCR amplification of 16S rRNA genes was carried out using a universally conserved primer set (17). The PCR mixture contained 1.25 U of AmpliTaq Gold DNA polymerase (Applied Biosystems), 200 µM of a dNTP mixture, 1 µl of template DNA with 0.2 µM of each primer, and 2.5 mM of MgCl2 in a total volume of 25 µl. The amplification profile was a 5-min activation step at 95°C, followed by 25 cycles of 95°C for 15 s, 50°C for 15 s, and 72°C for 1 min. The reaction was terminated with a 7-min elongation step at 72°C. Fragment concentrations were estimated and equilibrated as described above.
The chain termination and labeling reactions for sequencing were carried out using the universally conserved primer U515F (2) nested within the original fragments and BigDye v1.1 sequencing chemistry (Applied Biosystems) according to instructions supplied by the manufacturer. Sequencing was carried out as described above.
gltA data.
The test set data used for evaluation of the C. jejuni quantification model consisted of 18 mixtures of seven closely related strains of pathogenic Campylobacter jejuni designated G10, G12, G98, G109, G114, G125, and G147 (Fig. 2, black graph lines). The genetic region used for the analysis was an
230-nucleotide fragment of the gltA gene (corresponding to positions 1604595 to 1604823 in the C. jejuni subsp. jejuni NCTC 11168 strain [20]), which has 14 sites that are polymorphic between the strains. Since the data region is relatively poor in information, the modeling tends to be heavily influenced by random variations in the spectra. In order to counter this problem, emission readings for polymorphic sites were culled, i.e., spectral values for the point of base calling as well as three flanking values on each side (for a total of seven measurements per polymorphic site), were extracted from the raw data files, producing 98-by-4 spectral matrices. The same was done for the seven pure strain sequence spectra. Each reduced spectrum was subsequently rescaled by multiplying all values within blocks of 7 by 4 (i.e., emission readings for one polymorphic site) with the ratio between the total spectral mean and the block mean, giving equal weight to each block of variables. Furthermore, each such matrix was unfolded (i.e., column 2 is concatenated onto the end of column 1, column 3 onto this vector, etc.) and mean normalized (i.e., each vector entry is divided by the total vector mean), producing vectors of 392 measurement values.
![]() View larger version (22K): [in a new window] |
FIG. 2. Predicted versus measured proportions of the seven C. jejuni strains used in the 18 test set samples. Estimates are approximate percentages. Note, however, that predicted proportions below zero are produced, making percentage an inaccurate term.
|
16S rRNA gene data.
The data used for PLSR calibration were constructed from cloned 16S rRNA partial gene sequences from a chicken cecal library. The seven GCUs included in the experimental design were representatives of the genera Bacteroides, Clostridium, Escherichia (E. coli), Eubacterium, Faecalibacterium, Lactobacillus, and Ruminococcus (genus names were found by performing a BLAST search of the GenBank database). Eighty-seven training samples of amplified DNA fragments from the seven GCUs were mixed according to a simplex lattice design (4) in which the mixture ratios are spread evenly over the response region. In our case, this means that in the calibration samples, each component varied between 0 and 100% relative abundance (100% being the samples with only one genotype), corresponding to an 87-by-7 mixture matrix, including three central samples where all seven GCUs were mixed in equal ratios. These samples were sequenced, and spectral data corresponding to a 35-nucleotide region (positions 622 to 656 in the E. coli 16S rRNA gene), between the points of base calling for two conserved flanking residues, were extracted from the raw data files for use in subsequent numerical analysis. This region has 20 sites that are polymorphic between the seven GCUs. Even though the extracted data region was identical for each spectrum in terms of the number of nucleotides scanned, the number of emission readings varied slightly from sample to sample. The main cause of this heterogeneity is probably the highly polymorphic nature of the DNA sequences which causes slight differences in migration rates during sequencing electrophoresis. In order to remove this variation, each spectrum was subjected to a cubic spline interpolation technique. This treatment allows for the coercion of all spectra to a given length, which in this case was chosen to be 450 entries per spectral vector. Interpolation of emission spectra was carried out using the R function "spline" as implemented in the library "stats" using the "fmm" spline-fitting algorithm (6), producing 87 450-by-4 emission matrices. These matrices were unfolded, and mean normalized, and all pretreated spectra were collected in an 87-by-1,800 matrix, i.e., each of the 87 calibration samples was represented by a spectral row vector of 1,800 numeric entries.
The 16S rRNA data used as a test set consisted of a 45-point simplex lattice design with the GCUs Bacteroides, E. coli, and Faecalibacterium (i.e., 45 samples in which the three GCUs are mixed to spread evenly over a response region between 0 and 100% relative abundance), representing a subset of the training data, as well as 7 samples of all seven GCUs mixed in various ratios, for a total of 52 samples (see Fig. 5, black graph lines). The genetic region represented in the raw data was identical to the one described above, and all spectra were given the same pretreatment as that of the calibration samples. The data set was used for testing the prediction accuracy of MLR and PLSR.
![]() View larger version (23K): [in a new window] |
FIG. 5. PLSR model predictions versus measured genotype proportions for the seven GCUs in the 52-sample 16S rRNA gene test set. The estimates are approximate percentages, but note that predictions can produce values below 0 and above 100, making the term percentage somewhat inappropriate.
|
Model systems for direct sequencing application.
Two model systems were chosen for demonstrating the application of our approach. In the first system, we were interested in the colonization dynamics of C. jejuni in chickens. C. jejuni is a very important foodborne pathogen with the main reservoir in the chicken gut (28). Understanding how this bacterium colonizes chickens is of crucial importance in preventing it from entering the human food chain. The second model system was designed to obtain basic knowledge about microbial population dynamics in gut systems. Both data sets presented are part of two larger ongoing projects where the aims are to elucidate the biological questions described above.
In the first model system, chickens were infected with a cocktail of the seven C. jejuni strains at day 1. Bacterial samples were obtained from cecal scrapings at days 16 (six samples), 20 (five samples), and 23 (six samples). Total bacterial DNA was isolated according to an optimized and automated protocol (27). PCR amplification and sequencing of the gltA gene were carried out as described above.
The second model system consisted of a growth chamber containing
1.5 liters of anaerobic basal growth medium (Oxoid, Basingstoke, Hampshire, United Kingdom), continually flushed with a mixture of N2 (80%) and CO2 (20%) in order to achieve anaerobic conditions. This simple fermentor was inoculated with a mixed culture containing the three bacterial species Bacteroides uniformis (ATCC 8492), Escherichia coli (ATCC 25922), and Lactobacillus salivarius (ATCC 11741). These species are identical to the GCUs used for multivariate modeling, with respect to the relevant 16S rRNA gene and primer sequences. The first sample was collected 2 h after inoculation, and subsequent samples were collected hourly for a total of eight samples. Optical density at 600 nm (OD600) and pH were measured at each sampling point. Total DNA was isolated according to the protocol used for the first model system. PCR amplification and sequencing of 16S rRNA genes were carried out as described above.
Multivariate regression analysis of DNA electropherograms.
The simplest multivariate analytical approach to GCU quantification is MLR using the linear mixture model. This method depends on prior knowledge of the unit spectra, i.e., the pure sequence spectra of the GCUs involved in the mixtures. If the quantitative response of the sequencing apparatus is additive, i.e., a mixture of two or more DNA fragments produces a spectrum which is a linear superimposition of the unit spectra, the mixture can be modeled as a linear combination of the unit spectra of the form y = b0 + b1x1 +...+ bixi +
, where y is the mixture spectrum, x1, ..., xi are the pure unit spectra, b1, ..., bi are regression coefficients, b0 is the intercept, and
is an error term. The least-squares MLR coefficients b1, ..., bi may then be interpreted as the relative contribution of each pure spectrum to the mixture and, thus, as the relative abundances of the different genotypes.
PLSR does not depend strictly on prior knowledge of the unit spectra of the individual GCUs in a mixture. What is required, however, is a set of calibration spectra corresponding to samples of known mixture ratios. The algorithm is based on the decomposition, X = TP' + E, where X is a matrix containing the calibration spectra, T is a matrix of scores (latent variables), P is a matrix of loadings (rotation vectors), and E is a matrix of residuals. The latent variables in T are orthogonal, effectively solving the problem of near colinearity, which is common in spectral data, and the extraction of latent variables greatly reduces the dimensionality of the data. Based on the decomposition matrices and the matrix, Y, of known GCU mixture ratios, Y can be formulated as a linear function of X, such that Y = XB + E, where B is a matrix of regression coefficients and E is a matrix of error terms. The PLSR algorithm actively uses information in Y in order to extract latent variables from X that are optimal for prediction, making this an appropriate regression method for our purpose. For a detailed discussion of these multivariate regression techniques, we refer to Martens and Næs (13). A more thorough description of the statistical analysis carried out in this work (including experimental design and data preprocessing) can also be found in the supplemental material.
All computations were carried out using statistics programming interface R (R Development Core Team [2006]). R is a language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, Austria; http://www.R-project.org). PLSR was carried out using R "pls" software, with the orthogonal score algorithm for fitting multiple response models.
|
|
|---|
![]() | (1) |
j is an error term. The regression coefficients were converted to percentages and compared to the original mixture ratios (Table 1 and Fig. 2). The graphs show a good match between predicted and measured values, with an overall explained y variance of 87.4% and a root mean squared error of prediction (RMSEP) of 5.38. These estimates, however, do show that the assumption of additivity, on which the linear mixture model is based, does not exactly hold. Most prominently, the estimates of strain G109 proportions are systematically biased toward underprediction. This is reflected in the proportion of explained y variance, which is greatly reduced relative to the other GCUs, as well as the inflated standard error of prediction (Table 1). Deviation from additivity is also evident in the predicted values for the GCUs G10 and G125, which are generally overpredicted. Correlations between predicted and measured proportions show that the predictions are consistent. Since prediction errors are generally systematic, bias correction may be applied to the proportion estimates. If estimates for G10, G109, and G125 are adjusted by the addition or subtraction of their respective mean bias factors, test statistics are significantly improved, with the total explained y variance increasing to 93.5% and the mean prediction error decreasing to 3.86. These results suggested sufficient prediction accuracy for GCU quantification with experimental samples of unknown composition. Although the bias is quite clear and systematic in the present results, further validation is needed before this kind of bias correction can be carried out on a regular basis. |
View this table: [in a new window] |
TABLE 1. Explained y variance, correlations, RMSEP, and bias for the MLR proportion estimates of the seven C. jejuni strains
|
![]() View larger version (19K): [in a new window] |
FIG. 3. Proportion estimates of C. jejuni strains for 17 cecal samples from an infection experiment. The first six samples (starting at tick label 16) were collected from six different chickens at day 16 of the experiment. The next five samples were collected at day 20, and the last six samples were collected at day 23.
|
The direct MLR quantification approach was applied to the 16S rRNA test set. The 52 spectra in this set were modeled as linear combinations of the seven pure GCU unit spectra from the calibration set according to a model structurally identical to equation 1. In this case, the total explained variance was 94.1%, and the overall RMSEP was 5.61. While these are good results, the proportion predictions were less consistent than those for the C. jejuni data, and prediction errors were less systematic (see Fig. SA1 and Table SA1 in the supplemental material).
The 16S rRNA gene data consist of greatly divergent DNA sequences. This makes the electropherograms quite complex, with occasional phase shifts, causing corresponding wave tops for different subsequences to fall out of alignment. In the gltA gene data, DNA sequences are very similar, with only 14 SNPs scattered over
230 nucleotides of DNA sequence, greatly reducing the occurrence of phase shifts. This facilitates the culling of a fixed number of scans for polymorphic sites, making each such site a clearly defined data region, which again allows for rescaling over each spectrum. Such a selection of variables was not feasible for the 16S rRNA gene data, and the complexity of these spectra may aggravate the effects of any nonlinearities, making MLR less appropriate for quantification purposes.
Direct PLSR quantification.
By using PLSR to calibrate the sequencing apparatus, nonlinearities are taken into account. In order to compute PLSR predictor functions, the 87-by-7 matrix of mixture ratios from the 16S rRNA gene calibration data were modeled as a function of the 87-by-1,800 matrix of pretreated training spectra, using full cross-validation. The model converged on six partial least-squares (PLS) components (Fig. 4) and 96.0% and 96.9% explained y variance for the cross-validation and calibration, respectively. Overall, the RMSEP for the cross-validation was 4.33. The analysis produced seven predictor functions, one for each GCU included in the calibration, as shown in equation 2, as follows:
![]() | (2) |
![]() View larger version (11K): [in a new window] |
FIG. 4. R-squared (black-line) and RMSEP (red-line) curves for the PLSR cross-validation. The squared Pearson correlations are between predicted and measured y values. As more components are included in the model, RMSEP decreases and R-squared values increase precipitously until the model converges on six PLS components.
|
|
View this table: [in a new window] |
TABLE 2. Explained y variance, Pearson correlations, RMSEP, and bias of proportions predicted by the PLSR model for the seven GCUs in the 16S rRNA gene test set
|
4 h. This coincides roughly with an upsurge in L. salivarius proportions and a precipitous drop in pH, which is to be expected due to lactic acid production, which is characteristic of this GCU.
![]() View larger version (17K): [in a new window] |
FIG. 6. Proportion estimates for the nine samples from the fermentation, including the GCUs Bacteroides, E. coli, and Lactobacillus. The first time point in the diagram represents the mixed culture used for inoculation (time zero). The first fermentation sample was collected after 2 h (time 2), and subsequent samples were collected at hourly intervals. For the starter culture, pH and OD600 values have been set to those of the growth medium.
|
|
|
|---|
It should be noted that our approach is not developed for the simultaneous analysis of all members in communities with high species richness. One problem likely to be encountered when applying the described method with high-richness communities is the presence of rare community members. Our test set data contained GCUs representing as little as 3.6% of the total abundance, and these are predicted with good accuracy. However, considering that this number is of the same order as that of the estimated prediction errors (<5%), relative abundance estimates for rare GCUs should be interpreted with care. A second limitation of our approach pertaining to the analysis of complex communities is the availability of data for modeling. As species richness increases, so does the requirement for input data, which again depends upon a priori knowledge of the system in question. As such, other techniques are more suitable for exploring species-rich natural communities as a whole. Nevertheless, as we have demonstrated, our approach can be quite effective when the objective is to monitor subgroups within complex communities.
There are also certain challenges that are intrinsic to any PCR-based protocol for microbial community analysis (36). One potential source of quantitative bias is preferential DNA extraction. The choice of DNA isolation method has been shown to influence the outcome of quantitative analyses of environmental samples (11, 14), and care should be taken by investigators to minimize this source of error. In our assays, we used a very harsh mechanical treatment in order to achieve unbiased cell lysis (27). A second source of bias concerns differential amplification efficiency in mixed template PCRs. Amplification bias could be introduced by factors such as differing efficiencies in primer-template hybridization (21), differences in genome sizes and rRNA gene copy numbers (5), changing stoichiometrics during the course of the reaction (32), and stochastic effects (21). Again, the individual experimenter needs to take precautions both when performing the experiment and when interpreting the results. In this work, we used only short amplicons in order to overcome some of the challenges associated with amplification bias in mixed template PCRs. Furthermore, we used a specially selected 16S rRNA gene amplicon, the quantitative properties of which have been fairly well documented (17, 25).
It should be noted that as far as the above-mentioned bias effects are systematic and reproducible, they should have less bearing on quantitative results when the objective of the study is delineating dynamics. Even if community structure at some time point is skewed by the experimental procedure, a change in structure from that point to the next should still reflect actual community dynamics.
MLR versus PLSR for GCU quantification in mixed samples.
The overall performance of MLR using the linear mixture model for relative GCU abundance estimation suggests a fairly good predictive ability. One obvious advantage with this approach is the minimal need for information input into the modeling. Significant prediction bias can be assessed using a relatively small test set and may be taken into account when results are interpreted, as illustrated by the analysis of the C. jejuni data. This method, however, does turn out to be somewhat less consistent when the genetic data represent highly polymorphic DNA sequences, as for the 16S rRNA gene. In such a situation, MLR is probably best suited when only rough estimates are needed.
For the analysis of the more complex 16S rRNA gene data, a PLSR calibration approach proved more effective than MLR. The pretreated 16S rRNA electropherograms contain a number of variables (1,800 X variables when used for PLSR) that is much larger than the number of samples used for calibration (87 samples). In this type of data, near colinearity is often a prominent feature, making ordinary least-squares regression coefficients unstable (18). The PLSR algorithm effectively removes this problem by finding new sets of orthogonal latent variables that are linear combinations of the original predictor variables. By including only latent variables that contribute significantly to the model's explanatory power and discarding the remainder as noise, this algorithm can provide robust predictors (13). Our PLSR model converged on six PLS components. Since we were dealing with a system of seven GCUs, their proportions may vary with only 6 degrees of freedom. Thus, our model reflects the underlying dimensionality of the system. This indicates that very little noise was incorporated into the model, making our linear predictors for GCU quantification stable. It should also be noted that in general, the algorithms presented here should perform even better if the mixture data used for calibration and validation are free from errors due to pipetting and inaccuracies in estimation of DNA concentrations.
Comparison with existing technology.
Several existing technologies are well suited for investigating community structure at various levels of detail. Community fingerprinting techniques are relatively efficient, both in terms of cost and labor, and have been used extensively (9, 39, 41). The main limitation of fingerprinting methods lies in quantification, as they tend to produce patterns rather than information about specific taxa. Nevertheless, fluorescence-based fingerprinting methods can provide useful quantitative information on microbial community structure (10). Large-scale metagenomic analyses and similar environmental sequencing projects have, in several cases, provided detailed pictures of taxonomic and metabolic diversity with microbial community samples (1, 7, 29, 35). However, due to both the cost and workload associated with this type of analysis, it is currently not practical for providing high temporal or spatial resolution. Community microarrays show some promise as tools in microbial ecology (38), but working with microarrays is both relatively expensive and laborious. Furthermore, standard hybridization of array-bound oligonucleotide probes to target sequences has been shown to be poorly explained by thermodynamic parameters, making construction of quantitative community microarrays quite difficult (22). A recent development in facilitating microarray design is probe labeling according to the microsequencing principle (33). The associated workload, however, is still large compared to that of the approach described here.
Concluding remarks.
For the various reasons stated above, current techniques are, in general, not feasible alternatives for generating quantitative community data of the temporal resolution needed for deciphering microbial community dynamics and interactions. Like animals and plants, bacteria exist in communities in which interactions through phenomena, such as competition, density dependence, predation, and cooperation, occur within and among populations (15, 16, 31, 34). These are fundamental properties of any ecosystem, yet very little is known about population dynamics in microcosms. This lacuna in the existing literature is most likely related to the absence of convenient data acquisition methodologies, as well as to a historical division between the scientific disciplines of microbiology and traditional ecology. We believe that our approach, facilitating the collection of microbial community time series data, may contribute significantly to bringing microbial ecology up to speed. Moreover, microbes as model systems offer some great advantages relative to larger organisms (8). Their short generation times allow for the monitoring of potentially hundreds of generations within a week, while their small sizes facilitate easily manipulatable laboratory settings. As such, microbes can provide a convenient basis for addressing more general problems in ecology and evolution. A more profound understanding of microbial communities will also be of great value within fields such as medicine, epidemiology, and environmental science.
This work was financially supported by a research levy on certain agricultural products and by Hedmark Sparebank.
Published ahead of print on 15 June 2007. ![]()
Supplemental material for this article may be found at http://aem.asm.org/. ![]()
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»