Previous Article | Next Article ![]()
Applied and Environmental Microbiology, October 2006, p. 6578-6583, Vol. 72, No. 10
0099-2240/06/$08.00+0 doi:10.1128/AEM.00787-06
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Department of Biology, Northeastern University, Boston, Massachusetts 02115,1 Marine Science Center, Northeastern University, Nahant, Massachusetts 01908,5 Department of Statistical Science, Cornell University, Ithaca, New York 14853,2 Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany,3 Department of Environment Science, Kangwon National University, Kangwon-Do, Korea4
Received 4 April 2006/ Accepted 29 July 2006
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
Why is it so difficult to determine the local number of species (however defined)? There seem to be two principal reasons. First, the number of species, or operational taxonomic units (OTUs), appears very large even in environments that are supposed to be "simple," such as extreme environments (1). Second, the observed species frequency distribution (that is, the number of OTUs observed once, twice, three times, etc.) is almost universally characterized by a large number of rare species and a small number of very abundant species (7, 12) (see Fig. 1). These factors make a complete inventory a daunting task. To our knowledge such completeness has never been claimed for any community or environment. The real challenge, however, is not the undersampling itself but how to estimate its extent: this is central to assessing total protistan richness, first locally and eventually globally.
|
| MATERIALS AND METHODS |
|---|
|
|
|---|
Study site and sampling.
The Cariaco Basin is the world's largest truly marine body of anoxic water and represents a large natural sediment trap, collecting sinking debris and biota from surface waters. A description of this environment can be found elsewhere (27, 28, 35). For this study we sampled the area at a depth of 340 m, corresponding to the lower boundary of the redox interface.
DNA isolation, PCR amplification, cloning, and sequencing.
DNA was extracted from a single 3-liter water column sample as described previously (33), followed by PCR-aided amplification of
1,000- to 1,300-bp fragments of the 18S rRNA gene using four different primer sets: (i) E528F-Univ1391RE, (ii) E528F-Univ1492RE, (iii) Euk A-Euk B followed by a nested reaction with E528F-Univ1517, and (iv) Euk A-Euk B followed by a nested reaction with 360FE-U1492R (Table 1). The PCR protocol employed HotStart Taq DNA polymerase (QIAGEN, Valencia, CA) in all cases. The PCR products were cloned, separately for each primer set, using the TA cloning kit (Invitrogen, Carlsbad, CA) according to the manufacturer's instructions. Plasmids were either isolated from overnight cultures by using the Macherey-Nagel (Easton, PA) NucleoSpin Robot-96 plasmid kit or amplified from plate colonies by using the Templiphi 100 amplification kit (Amersham Biosciences, Piscataway, NJ) according to the manufacturer's instructions. The presence of the target insert was confirmed by PCR reamplification as described above. After sequencing, we applied the Check_Chimera command of the Ribosomal Database Project (RDP) (22), as well as neighbor-joining trees with partial sequences (partial treeing analyses [18]), to eliminate chimeric sequences (29).
|
Statistical analysis.
The basic sampling model asserts that each species independently contributes a random number of representatives, which may be zero, to the sample. This number is a Poisson-distributed random variable, and its mean for a given species (that is, the expected number of representatives of a given species in the final sample) is known as the "sampling intensity" of that species (8, 38). The sampling intensities are typically roughly proportional to the species' abundances in nature, but this relationship may not be exact, due to the intervening stages required to obtain the final species frequency counts.
The parametric and nonparametric methods differ mainly in their treatment of the sampling intensities (8, 38). The parametric approach dates (at least) to the 1940s, and its main theoretical results were obtained in the 1970s (8), although significant work remains to be done (3). Essentially, the sampling intensities are assumed to follow a parametric probability distribution, which in turn generates a (mixed Poisson) parametric model for the observed frequency counts. This model is fitted to the data by maximum likelihood (ML), which yields an estimate of species richness and (asymptotic) standard error (SE), goodness-of-fit (GOF) assessments, and related statistics. Various distributional models have been proposed (3, 8), but until recently applications have been inconsistent or imprecise due to computational impediments. Using techniques from modern statistical computing theory, we obtained algorithms that compute the ML estimates to any desired precision (3). We have applied this method to several microbial richness data sets (4, 17, 40).
The nonparametric approach dates (at least) to the 1950s, but its main theoretical development began in the 1980s. One class of methods is based on nonparametric estimation of the sample coverage, i.e., the proportion of the population represented by the species appearing in the sample (8). These procedures are computationally simple and have been implemented in several software packages (11, 30) and used extensively in microbial richness estimation (19, 20, 30). A more recent class of methods employs nonparametric maximum-likelihood estimation of the distribution of sampling intensities; these methods are computationally intensive, and software has not yet become generally available. Recently, a mathematical framework has emerged to unify the nonparametric procedures (23, 38).
At present, computationally tractable parametric models often do not fit complete OTU frequency data sets, and nonparametric procedures are typically sensitive to the maximum observed frequency in the data. Both approaches thus require that the data be split into two parts: the set of "rare" species, which has lower counts in the sample, and the set of "abundant" species, which has higher counts. The splitting point or frequency, designated
, is called the "tuning parameter" in recent statistical literature (38). We apply the desired statistical procedure only to the set of lower frequency counts (
) or rare species, and we add the observed number of "abundant" species (frequencies of >
) to obtain the final estimate. In parametric modeling, the choice of
for a given model and a given data set is based on goodness of fit as measured by a (properly adjusted) chi-square statistic and by visual inspection. We initially fit every model at every possible
, and we look for the largest acceptable
, since this means using the maximum amount of the available data in the estimate (8, 38). For the coverage-based nonparametric procedures, a default
of 10 has been recommended based on expert opinion and empirical experience (38), but it is advisable to examine the results at several
values. A major goal of theoretical research is to find procedures that will allow the maximum value for
, or that are insensitive to the choice of
, or that do not require splitting of the data (3, 8, 38).
The advantages of the parametric models versus the coverage-based nonparametric methods include (i) control of the distribution of sampling intensities, leading to asymptotic normality of the parametric ML richness estimator, centered at the true richness (given the model), versus potentially unlimited bias in the nonparametric case (owing to the possible presence of arbitrarily many infinitesimally rare species [8, 38]); (ii) quantitative and graphical assessment of parametric model fit, versus limited diagnostic criteria for nonparametric estimators; (iii) use of the maximum amount of frequency data (maximum
) by selection of an optimal parametric distribution, versus unclear or default selection of
in the nonparametric case. The countervailing disadvantage is that a particular parametric model must be selected. There are currently no convincing theoretical arguments to inform this choice, because arguments to justify, e.g., the lognormal abundance distribution are readily refuted (39), and in any case the distribution of actual abundances may not be that of the operative sampling intensities. Hence, the choice of a model must at present be empirical. A general statistical methodology exists for this (6), but those methods, such as, e.g., the Akaike information criterion (AIC), typically apply to evaluation of different models on the same data set, whereas here we have the simultaneous problem of different models and different data sets, due to different values of
. We evaluate all available models at all values of
using relatively simple criteria (described below). While this in principle allows the possibility of overfitting, we find that overall the advantages of the parametric approach outweigh the disadvantage of the requirement of model selection.
We currently use seven candidate distributions, which we list below according to the sampling intensity/frequency count distribution (see http://www.stat.cornell.edu/
bunge/).
(i) The single point mass/ordinary Poisson distribution (one parameter) assumes equal proportions of species, which is unrealistic, and indeed we have never found it to fit real data. However, it is computationally simple and fast and serves as a useful lower-bound benchmark.
(ii) The gamma/gamma-mixed Poisson or negative binomial distribution (two parameters) appears to admit only relatively low diversity and rarely fits real data; it has been used for comparison with coverage-based nonparametric analyses (8).
(iii) The inverse Gaussian/inverse Gaussian-mixed Poisson distribution (two parameters) is a special case of the generalized inverse Gaussian distribution, which has long been regarded as a potential "universal" abundance model (5, 8). The generalized inverse Gaussian distribution remains computationally intractable (although we are currently working on approximations), but the inverse Gaussian distribution sometimes fits real data well.
(iv) The lognormal/lognormal-mixed Poisson distribution (two parameters) has been the subject of considerable interest in biology and other fields (8, 39) and does fit some data sets but should not be regarded as an a priori choice (39).
(v) The Pareto or power law/Pareto or power law-mixed Poisson distribution (two parameters) is a "heavy-tailed" distribution, well known in a variety of statistical applications, but appears to be new to the species problem. It does provide a good fit in some cases but is technically intractable at certain parameter values (at which moments do not exist).
(vi) The mixture of two exponentials/mixture of two geometrics (three parameters) is a model that we have recently introduced and studied, along with its three-exponential (or three-geometric) extension (see below) (3). It represents the abundances or frequency counts as a mixture (convex combination) of two groups, with one rate of decrease prevailing toward the left-hand side of the curve and another, lower rate to the right. The resulting flexibility typically permits a higher value of
than the previous models and often fits real data well.
(vii) The mixture of three exponentials/mixture of three geometrics (five parameters) is an extension of the previous model, allowing a left, a middle, and a right-hand rate of decrease. Because of the need to estimate a larger number of parameters, this model typically yields unusably high SEs in smaller data sets, although it can perform exceptionally well in large data sets. We note that in general, mixtures of exponentials can approximate a wide class of distributions (14).
We fit each model by ML at each
. We then select a model and a value of
by considering several criteria. First, we assess goodness of fit, using a chi-square statistic computed across all cells (frequency counts) as a simple measure of discrepancy, and also using a chi-square statistic computed after concatenating cells for a minimum expected cell count of 5 in order to obtain an asymptotically correct goodness-of-fit test. (Currently we are implementing a computationally intensive goodness-of-fit assessment that avoids the necessity of concatenating cells [37], and we are adding the AIC to our computations for comparison of several models at a fixed value of
, which is sometimes of interest.) We augment this numeric analysis with careful point-by-point examination of model fit, especially to the rare frequency counts. Second, we require a reasonable SE [<(estimate/2)], since inadequate precision renders the estimate useless. Third, we seek the largest possible
. Finally, we arrive at a preferred model and value of
, along with its associated estimate of species richness and SE, and related statistics. We note that while it would be desirable to give confidence intervals (CIs) for the species richness, no CI procedure has yet been verified as mathematically valid for small samples in this problem. Asymptotically the usual Gaussian 95% CI (i.e., estimate ± 1.96 · SE) is valid, but we do not generally report it here.
We then calculate the coverage-based nonparametric estimates using SPADE software (http://chao.stat.nthu.edu.tw/). We consider the estimators ACE (abundance-based coverage estimator) and ACE1. ACE1 is a high-diversity modification of ACE and is recommended if the observed coefficient of variation of the frequency data is >0.8 (9) (http://chao.stat.nthu.edu.tw/). We computed these estimators at the default
of 10 and also at
values selected by the parametric analyses. Since this procedure does not postulate a specific underlying model for the frequency data, model selection and goodness of fit do not apply.
We note that the multiple-primer procedure falls under the purview of the basic sampling model, as follows. To minimize the biases of the rRNA approach, such as the primer bias (34), we used four different PCR primer sets and constructed four different clone libraries (as described above), all originating from a single DNA extract from a single water column sample. We then pooled the resulting sequence data to obtain OTU frequency counts. The total sampling intensity of a given species (relative to the pooled data) is assumed to be the sum of its four separate sampling intensities (relative to each primer set). (This is statistically valid because the sum of Poisson random variables is again Poisson.) This model does not use the separate frequency counts from each primer set; it is doubtful that introducing the further statistical complexity necessary to model the primer set effects separately would improve the total richness estimates, but this is a topic for future research.
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
We grouped the unique 18S rRNA sequences into OTUs, defining clusters by various degrees of sequence similarity, from 50 to 99%. The number of unique OTUs ranged from 1 to 107, respectively. We calculated the frequencies of each OTU at each level of similarity and subjected the resulting frequency counts to statistical analyses; the results are summarized in Table 2. As expected, the single point mass (ordinary Poisson), gamma (negative binomial), and mixture-of-three-exponentials distributions produced no usable estimates of protistan richness from these data. The single point mass unrealistically assumes equal OTU sizes, the gamma appears to be insufficiently flexible to accommodate the level of diversity encountered here (both gave goodness-of-fit test P values of <0.05 at all OTU cutoff levels and all
values), and the mixture of three exponentials gave unusably high SEs despite an acceptable fit (results not shown). We observed similar results in a prokaryotic richness analysis (17). The inverse Gaussian, lognormal, Pareto, and mixture-of-two-exponentials distributions produced comparable estimates with acceptable goodness of fit, SEs, and values for the tuning parameter
(Table 2). The inverse Gaussian and mixture of two exponentials appeared to be the best overall models in this study: Fig. 1 shows the corresponding frequency count distributions, i.e., the inverse Gaussian-mixed Poisson and the mixture of two geometrics, as fitted to the 97% similarity data. These give similar fits to the data observed but diverge somewhat when the fitted frequency count curves are projected to zero, to estimate the number of unobserved species. As noted above, without (at present) a convincing theoretical justification for a particular model, our choice of distribution remains empirical. The results in Table 2 show that different parametric distributions can give comparable results (when the SE is taken into account), which in turn suggests that they constitute different approximations to the true, unknown distribution.
|
= maximum observed frequency), which would seem to be ideal. However, when the model is fitted to the full data set, the (typically) long right-hand tail (a few very abundant species) "weighs down" the left-hand side of the curve, so as to underfit the numbers of rare species. For example, in the 99% OTU data, the observed counts extend up to a maximum frequency of 133. Figure 2 shows the fitted mixture-of-two-exponentials curves using
values of 133 (i.e., the full data set) and 31 (selected for best fit); the plot is truncated on the right at 31. The curves appear quite close, but the fitted curve with a
of 133 has a shallower slope to the left and hence in particular underfits the number of singletons; it predicts 52 singletons, while the fitted curve with a
of 31 predicts 54.5 (the observed number is 55). The difference (fitted curve with a
of 133 fitted curve with a
of 31) is shown below the main plot. The effect of the underfit is that the model based on the full data set returns a species richness estimate and SE that are both probably unrealistically low (279 [SE, 59] versus 398 [SE, 156] using a
of 31). In summary, at present we are not prepared to recommend fitting the mixture-of-two-exponentials model to the full data set in every case; model selection considerations (as discussed above) still apply.
|
(and hence greater use of the observed data) than the other models. When the OTUs are more inclusive and the frequency distribution less extreme, this advantage seems to diminish. At 97% sequence similarity, the frequency distribution can be modeled equally well by the mixture-of-two-exponentials and inverse Gaussian models, and thereafter the latter appears to be preferable (except in cases of very inclusive OTUs that combined the rRNA species with >80% sequence similarity, where the potentially plausible analyses involve only 5 data points [Table 2]). The models selected here also proved useful in estimating bacterial richness (17); they had not previously been used in microbial diversity research. On those few occasions when parametric models were used to estimate microbial richness, researchers relied predominantly on the lognormal distribution. Our data show that, even if a convincing a priori argument for a particular model such as the lognormal were available, it need not carry through to the empirical frequency data (17, 39). We therefore argue that a variety of models should be fitted to any given data set.
The coverage-based nonparametric statistics are also given in Table 2. For these data sets, ACE1 (the higher-diversity estimator) was preferred to ACE at every level of OTU definition, and in fact the value with ACE was on average 31% lower than that with ACE1 in this study. Selection of
in this case is not straightforward, since ACE1 and its SE vary considerably with
(Table 2); further diagnostic criteria are needed, and this is a topic for future research. As the level of OTU definition decreases and the data sets become less diverse, the parametric and nonparametric results converge. This empirical evidence, in addition to the theoretical and heuristic considerations discussed above, lead us to conclude that for estimation of species richness (99 or 98% sequence similarity), parametric models have a clear edge over nonparametric estimators. The advantage seems to diminish as the OTUs become more inclusive.
Finally, it is of considerable interest to obtain abundance estimates for individual OTUs. The total protistan abundance at the depth of our sample's origin has been reported to be around 3 x 106 cells liter1. We do not know the frequency distribution of OTUs in the environment, but as a first approximation we may assume that it is similar to the distribution of OTUs in the clone libraries (e.g., mixture of two exponentials in Table 2). Using the relative abundance of OTUs given by this model and the total number of cells in the sample, we estimated the abundances of OTUs in this sample (for OTUs defined as clusters of clones with >99% similarity). The protists in the sample can be provisionally divided into three groups based on their abundances. The group of most abundant protists consists of nine OTUs whose abundance exceeds 100 cells ml1; this group accounts for 51% of the total protistan community. Moderately abundant OTUs constitute 26% of the community and are represented by 28 OTUs with individual abundances ranging from 10 to 100 cells ml1. Rare OTUs (<10 cells ml1) represent the bulk of the protistan community (361 OTUs, or 91% of the entire richness) but contribute disproportionately little (23%) to the overall abundance. Among those, 118 OTUs seem exceedingly rare (<1 cell ml1). Therefore, the protistan diversity of the sample may be residing in species present in very small numbers.
In summary, the degree of richness of microbial eukaryotes in anoxic waters of the Cariaco Basin appears to be very high (Table 2). For OTUs defined as groups of clones similar at
99%, the estimated total richness per single 3-liter sample was 398 ± 156 (SE) OTUs (Table 2). Our clone libraries contained 107 OTUs, and thus we estimate that these libraries captured approximately one-fourth of all the species in the sample. This is probably the highest such fraction among published 18S rRNA gene libraries. Parametric models such as the inverse Gaussian and the mixture of two exponentials currently appear to be preferable tools for estimating microbial richness. It is useful to examine as many models as is practical; the best model may be different for different surveys, as well as for differently defined OTUs within one data set.
Our synthetic approach estimates that a single water column sample from anoxic waters in the Cariaco Basin contains hundreds of different protistan populations. It also suggests that even our multiple-primer approach and extensive sequencing failed to register up to 75% of species, as well as up to 50% of clades of the highest taxonomic position. Collectively, these findings point to a high degree of richness of protists in the target environment.
| ACKNOWLEDGMENTS |
|---|
This work was funded in part by U.S. National Science Foundation Microbial Observatory grant MCB-0348341 to S.S.E. and by Deutsche Forschungsgemeinschaft grant STO414/2-1 to T.S. The research was conducted using the resources of the Cornell Theory Center, which receives funding from Cornell University, New York State, federal agencies, foundations, and corporate partners.
| FOOTNOTES |
|---|
Contribution 254 of the Marine Science Center of Northeastern University, Boston, Mass. ![]()
| REFERENCES |
|---|
|
|
|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| J. Bacteriol. | Microbiol. Mol. Biol. Rev. | Eukaryot. Cell | All ASM Journals |
|---|