Bioscience Division,1 Decision Application Division,2 Los Alamos National Laboratory, Los Alamos, New Mexico 87545
Received 3 December 2001/ Accepted 22 February 2002
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
The exact survey size needed for community representation depends on the frequency distribution of species in situ and the degree of representation that is desired in the sample. For bacterial communities, the frequency distribution of species has never been measured or even approximated due to the difficulty of obtaining sufficiently large and representative samples of community diversity. In contrast, abundance distributions of plant, animal, and insect species in samples from a wide variety of communities have been intensively studied during the past 80 years (12, 34). Numerous models, both mechanistic and statistical, have been proposed to describe the observed distributions of plant, animal, and insect species (34). Mechanistic models such as the geometric distribution are typically limited to explaining the abundance distribution of species that compete for a common niche or resource base. Statistical models, on the other hand, are more appropriate for large species assemblages that are functionally and phylogenetically diverse, such as soil bacterial communities.
The most frequently used statistical model for species abundance distributions is the lognormal distribution, applied first in ecology by Preston in 1948 (24). Lognormal distributions can arise solely from the multiplicative effects of biotic and abiotic factors on the abundance of individual species (19, 21). In other words, the distribution can be a statistical phenomenon of large numbers and does not depend upon specific biological or ecological mechanisms. A lognormal distribution is therefore appealing as a null model for the distribution of bacterial species abundance. Such null models can guide experimental studies of community structure and composition. In particular, a model distribution can be used to estimate the survey size required for confident documentation of a specified fraction of community diversity.
The most common way to comprehensively survey the phylogenetic diversity present in soil bacterial communities is by PCR amplification, cloning, and sequencing of 16S rRNA genes (16S rDNA) from extracted soil DNA. However, due to the expense of this method, the survey sizes have typically been quite small, consisting of only 100 to 300 clones per soil sample (1, 15, 17, 27) (see reference 40 for a recent exception). The surveys conducted by Kuske et al. (15) for arid soil bacterial communities in Arizona, United States, are of typical size (200 clones per library). The libraries were derived from pinyon pine rhizosphere and interspace (between-tree) soil communities at a cinder field created by volcanic eruption over 900 years ago (Sunset Crater) and at a site 20 km away (Cosnino) that has sandy loam soil typical of the arid northern Arizona region. The libraries were created in 1994 to identify bacterial populations specifically involved in assisting plant colonization and growth in the hot, dry, volcanic cinder soil. Unfortunately, the data exhibited characteristic limitations of small phylogenetic surveys of large, complex communities. That is, most (93%) of the species-level groups in each library were represented by only one or two clones each, suggesting inadequate sampling and a high probability of sampling error (7).
In the present study, we analyzed the four libraries at the division level and used the species richness of the libraries and the observed sizes of the Arizona soil communities to guide the construction of theoretical lognormal models of bacterial species abundance. The models provide an important baseline for understanding the general structure of soil bacterial communities, the limitations of phylogenetic surveys as currently practiced, and the requirements for improving future surveys of bacterial species diversity. We emphasize that the theoretical models are null models. That is, the models provide the best possible description of community structure at present based on currently available data, but their accuracy on a fine scale requires experimental validation.
| MATERIALS AND METHODS |
|---|
Bacterial biomass measurement.
A 50-cm³ portion of each of the four soil samples was shipped on ice to the Soil Microbial Biomass Service, Oregon State University, Corvallis, Oreg. (now Soil Food Web Inc., Corvallis, Oreg.; www.soilfoodweb.com) for measurement of bacterial biomass. Metabolically active bacteria and total bacteria were stained with fluorescein diacetate and fluorescein isothiocyanate, respectively, and counted by using epifluorescence microscopy. Active and total cell counts were obtained for soil samples from the April 1994, September 1994, and September 1995 collections.
DNA extraction and clone libraries from soil and cinders.
Extraction of DNA and construction of 16S rRNA gene clone libraries were described previously (15). Each library contained approximately 200 clones. All clones were characterized by restriction fragment length polymorphism (RFLP) analysis (7). Clones with identical RsaI-BstUI RFLP patterns were counted as a species-level group. To evaluate the extent to which this approach underestimated the number of species-level groups in the libraries, the number of RFLP groups among a set of 221 clones from the Cosnino interspace (C0) and Sunset Crater interspace (S0) clone libraries was compared to the number of groups delineated by a criterion of ≥97% sequence similarity. Sequence similarity was assessed over the 16S rRNA gene region corresponding to Escherichia coli positions 270 to 768, excluding the variable loop region between positions 451 and 460. The 270 to 768 region was the longest sequenced region common to all 221 clones used in this comparison.
DNA sequencing.
The 16S rRNA gene (rDNA) templates for DNA sequencing reactions were amplified directly from glycerol stocks of 16S rRNA gene clones. Primers M13-20 (5'-GTAAAACGACGGCCAGT) and M13-24 (5'-AACAGCTATGACCATG) were used for PCR amplification. Amplified DNA was purified by using a Qiaquick PCR cleanup kit (Qiagen, Inc., Chatsworth, Calif.), and DNA concentrations were estimated by gel electrophoresis and ethidium bromide staining. Approximately 100 ng of 16S rDNA was used as the template in dye terminator cycle sequencing reactions (ABI Prism dye terminator cycle sequencing kit; Perkin-Elmer, Foster City, Calif.).
Primer p3MODrc (5'-GGACTACHAGGGTATCTAAT, E. coli positions 806 to 787) was used in sequencing reactions to obtain partial DNA sequences. Full-length sequences were obtained from 45 clones (see section on phylogenetic analysis, below) by using primers M13-20, M13-24, P3MOD (5'-ATTAGATACCCTDGTAGTCC, E. coli positions 787 to 806) (38), P3MODrc, and 533 forward (5'-CCAGCSGCCGCGGTAA, E. coli positions 519 to 533) (16) in sequencing reactions. Electrophoresis was performed with 4.0% polyacrylamide gels on a 373A Stretch DNA sequencer (Applied Biosystems, Inc., Foster City, Calif.). The nucleotide sequences determined in this study have been deposited in the NCBI database under accession numbers AF507374 to AF507801.
Phylogenetic analysis.
16S rDNA sequences were compared with sequences from the Ribosomal Database Project (RDP; version 7.0) (20) by using the Similarity Rank program to obtain Sab values to database sequences. RDP sequences with less than 307 nucleotides for comparison were excluded from the analysis. Clone sequences were assigned to recognized bacterial divisions (or "uncertain" status) based on the affiliation of nearest-neighbor sequences from the RDP. Full-length sequences were obtained from all clones with uncertain affiliation based on partial sequence comparisons (Sab values < 0.50) and from all clones that appeared to represent new candidate divisions. Full-length sequences were checked for chimeric artifacts by using the Chimera-Check program (20) and secondary-structure analyses. Full-length sequences were then used in bootstrapped phylogenetic analyses and either assigned to recognized divisions based on reliable branching order or assigned as "uncertain" if branching order was inconsistent and unreliable.
Lognormal model of bacterial species abundance.
The general lognormal species abundance distribution is as follows:

$$S(R) = S_0 \exp(-a^2 R^2) \tag{1}$$

where $S(R)$ is the number of species in the $R$th octave (i.e., log2 abundance class) from the modal octave, $\sigma^2$ is the variance of the distribution, $a = (0.5/\sigma^2)^{0.5}$ is the dispersion constant, and $S_0 = S_T a/\pi^{0.5}$ is the number of species in the modal octave.

Assuming that one species occupies each tail of the Gaussian distribution of log2 abundance values, it follows that 1 = S(Rmax). By using this substitution in equation 1, a set of lognormal distributions was created by solving reiteratively for σ when ST ranged from 2,000 to 20,000 and Rmax ranged from 10 to 12 (i.e., the population size of the most abundant species ranged from 1 × 10⁶ to 1.7 × 10⁷ cells [g of soil]⁻¹). The range of values for ST was obtained from studies of the renaturation kinetics of soil bacterial DNA (29, 35, 36). The range of Rmax values was chosen so that the calculated community size, NT, from a given lognormal distribution would be consistent with the range of observed NT values (epifluorescence direct counts of total cells) from the Arizona soils used in this study.
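The reiterative solution for the dispersion constant can be sketched numerically. The Python sketch below assumes that equation 1 has the standard Preston form S(R) = S0 exp(-a²R²) with S0 = ST·a/π^0.5, and solves S(Rmax) = 1 for a by bisection. The function names, the bisection bracket, and the choice of root are illustrative assumptions, not the authors' procedure.

```python
import math


def dispersion_constant(s_total, r_max, tol=1e-12):
    """Solve S_T * a / sqrt(pi) * exp(-(a * r_max)**2) = 1 for a by bisection."""

    def log_species_at_rmax(a):
        # Natural log of S(Rmax); positive means S(Rmax) > 1, negative means < 1.
        return math.log(s_total * a / math.sqrt(math.pi)) - (a * r_max) ** 2

    # As a function of a, S(Rmax) peaks at a = 1 / (sqrt(2) * r_max) and
    # decreases beyond it. Assumption: the root on the decreasing branch is
    # the meaningful one (the other root gives an almost flat distribution).
    lo = 1.0 / (math.sqrt(2.0) * r_max)
    hi = 10.0
    if log_species_at_rmax(lo) < 0.0 or log_species_at_rmax(hi) > 0.0:
        raise ValueError("no bracketed solution for these S_T and Rmax values")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_species_at_rmax(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lognormal_parameters(s_total, r_max):
    """Return (a, sigma, S0) for a model with s_total species and tail octave r_max."""
    a = dispersion_constant(s_total, r_max)
    sigma = math.sqrt(0.5) / a                 # from a = (0.5 / sigma**2) ** 0.5
    s0 = s_total * a / math.sqrt(math.pi)      # species in the modal octave
    return a, sigma, s0


a, sigma, s0 = lognormal_parameters(4000, 11)
print(f"a = {a:.4f}, sigma = {sigma:.3f} octaves, S0 = {s0:.1f} species")
```

The sketch stops at the distribution parameters. Converting octave counts into per-species cell numbers and a community size NT requires further conventions that are not specified here.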
Estimation of survey size.
Species richness sampling curves were constructed by rarefaction (i.e., simulated sampling without replacement) (13, 30). Theoretical values of species abundance were used for calculating species-level sampling curves. The theoretical values (i.e., the abundance of each theoretical species in a model community) were obtained from lognormal distribution models of communities containing approximately 10⁸ individuals total and 2,000 to 10,000 species. For each sample size calculation, 1,000 simulations of sampling without replacement were performed by using R software (a public-domain data analysis, graphics, and programming environment, available at www.r-project.org).
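Rarefaction by simulated sampling without replacement can be illustrated with a small sketch. The study used R; the Python version below is only an illustration, and the function name, the toy abundance vector, and the fixed random seed are assumptions made for the example.

```python
import bisect
import itertools
import random
import statistics


def rarefy(abundances, sample_size, n_sims=1000, seed=0):
    """Mean and S.D. of species richness in samples drawn without replacement.

    abundances[i] is the number of individuals of species i in the community.
    """
    rng = random.Random(seed)
    cumulative = list(itertools.accumulate(abundances))
    total = cumulative[-1]
    richness = []
    for _ in range(n_sims):
        # Draw distinct individual indices, then map each index to its species.
        picks = rng.sample(range(total), sample_size)
        species = {bisect.bisect_right(cumulative, p) for p in picks}
        richness.append(len(species))
    return statistics.mean(richness), statistics.stdev(richness)


# Toy community (hypothetical numbers): 7 species, 100 individuals.
toy_community = [50, 30, 10, 5, 3, 1, 1]
for n in (5, 20, 50):
    mean_s, sd_s = rarefy(toy_community, n)
    print(f"n = {n:3d}: richness = {mean_s:.2f} +/- {sd_s:.2f}")
```

Because `random.sample` draws from a `range` object, the community is never expanded into a list of individuals, so the same approach remains practical for communities on the order of 10⁸ individuals.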
Estimates of the sample sizes required for sampling a specified set of j species with 95% confidence were calculated by using the following equation:
![]() | (2) |
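One common way to formalize this calculation, offered here as an assumption rather than as the authors' equation 2, treats draws as independent (reasonable when the sample is far smaller than the community). If species i has fractional abundance p_i, the probability that all j specified species appear in a sample of n individuals is then the product over i of 1 - (1 - p_i)^n, and the required sample size is the smallest n for which this product reaches 0.95. The function names below are hypothetical.

```python
import math


def detection_probability(fractions, n):
    """Probability that every listed species appears at least once in n draws."""
    log_prob = 0.0
    for p in fractions:
        miss = (1.0 - p) ** n          # chance that this species is never drawn
        if miss >= 1.0:
            return 0.0
        log_prob += math.log1p(-miss)  # accumulate in log space for stability
    return math.exp(log_prob)


def min_sample_size(fractions, confidence=0.95):
    """Smallest n with detection_probability(fractions, n) >= confidence."""
    if any(p <= 0.0 or p > 1.0 for p in fractions):
        raise ValueError("fractional abundances must lie in (0, 1]")
    hi = 1
    while detection_probability(fractions, hi) < confidence:
        hi *= 2                        # expand until the target is reached
    lo = hi // 2                       # known to fall short (or zero)
    while hi - lo > 1:                 # binary search; probability rises with n
        mid = (lo + hi) // 2
        if detection_probability(fractions, mid) >= confidence:
            hi = mid
        else:
            lo = mid
    return hi


print(min_sample_size([0.5]))        # one species making up half the community
print(min_sample_size([0.5, 0.5]))   # two such species, both required
```

Under this formulation the requirement that every named species be present multiplies the individual detection probabilities, which is why targeting a specified set of species demands far larger samples than collecting the same number of arbitrary species.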
| RESULTS |
|---|
Division-level diversity of Arizona soil surveys.
Each Arizona soil was surveyed by constructing a 16S rRNA gene clone library. A total of 21 bacterial divisions were found among the four surveys, based on analysis of 766 clones. The affiliation of 16 of the 766 clones could not be reliably determined. The 16 full-length sequences clustered inconsistently in different divisions from one analysis to the next (data not shown) and were therefore assigned to the uncertain category, as shown in Fig. 1. Most of the clones (722 total) were affiliated with nine well-established bacterial divisions. Twenty sequences were affiliated with recently proposed candidate bacterial divisions OP3, OP4, OP10, OP11, TM6, TM7, OS-K, and WS-2. Eight clones failed to cluster closely with any previously identified bacterial division and are represented in Fig. 1 as four distinct groups provisionally named SC1, SC2, SC3, and SC4. Full-length sequences from the eight clones showed no evidence of being chimeric. Instead, the sequences appeared to represent four deeply branching bacterial lineages that have not been described previously. The depth of branching of the four lineages and the dissimilarity of the sequences to all known 16S rRNA gene sequences are consistent with criteria that have been used previously to delineate bacterial divisions (10). Thus, the sequences appear to represent four new candidate divisions.
In the first report of division-level diversity among the clones (15), the abundance of the Acidobacterium division was listed as 54% of 60 analyzed sequences and was revised later to 51% of 356 clones (7). Analysis of the full data set indicates that the division accounts for 49% of 766 clones. The abundance of the Proteobacteria has increased slightly, from 12% of 60 sequences (15), to 17% of 766 clones. Similar small changes in relative abundance of the other seven previously identified divisions occurred after including data from the recently sequenced clones.
Nearly half of the 21 divisions were found in all four libraries. The average abundance of the nine common divisions ranged from 2 to 82 clones per library. The divisions occurring in only one library were represented by one or two clones each. The low representation of the rare divisions made it impossible to interpret the unique occurrence of these divisions. For the nine common divisions, three qualitative points of interest were noted. First, at each site, the relative abundance of Proteobacteria was lower in the interspace soil survey than in the rhizosphere soil survey (Fig. 1). Second, the relative abundance of gram-positive bacteria was higher in the interspace soil survey than in the rhizosphere soil survey at each site. For both the Proteobacteria and gram-positive divisions, the differences in abundance between the Cosnino interspace and rhizosphere soils were small, whereas the differences between the Sunset Crater interspace and rhizosphere soils were larger. Third, the abundance of the Cytophaga-Flexibacter-Bacteroides group was higher in the Cosnino soil surveys than in the Sunset Crater soil surveys.
Abundance distribution of bacterial divisions.
Analysis of the abundance distribution of bacterial divisions in the clone libraries demonstrated significant differences between the Sunset Crater soil communities and the Cosnino soil communities (Fig. 2). More bacterial divisions were found in the volcanic cinder soil libraries (average, 14; combined total, 18 for rhizosphere and interspace libraries) than in the libraries from the sandy loam soil at Cosnino (average, 12; combined total, 14), suggesting a more skewed distribution of division abundance in the Cosnino soils.
Extrapolation of division-level diversity in environmental samples.
The total number of bacterial divisions in environmental samples is completely unknown but could potentially be estimated from partial survey data. Since the efficacy of different extrapolation methods is dependent on the data set, trial-and-error application of different methods is often necessary (26). We applied three different statistical methods to data from the Arizona soil surveys and a hot spring survey (10) in an attempt to estimate the total number of divisions in each environment.
Asymptotes of the division-level sampling curves in Fig. 2 were estimated by fitting the data to a two-parameter linear model of the form S(n) = Smax - BS(n)/n, where n is the number of individuals sampled, S is division-level diversity, and B is a fitted constant (26). This model is the Eadie-Hofstee transformation routinely used for estimating Vmax in enzyme kinetics. The parameters Smax and B were estimated by using a maximum-likelihood technique (26). For the Yellowstone hot spring data, the two-parameter model provided an estimate of 30 divisions maximum versus the 26 actually observed. The model fit the data from the Arizona soil samples poorly, providing estimates of Smax for the Cosnino samples that were slightly lower than the actual observed values and equal to the observed values from the Sunset Crater samples.
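The linear form S(n) = Smax - B·S(n)/n can be fitted by regressing S(n) on S(n)/n. The sketch below uses ordinary least squares, which is a simpler substitute for the maximum-likelihood technique used in the study, and the sampling-curve points are synthetic values generated for illustration only.

```python
def fit_eadie_hofstee(sample_sizes, richness):
    """Least-squares fit of S = Smax - B * (S / n); returns (Smax, B)."""
    xs = [s / n for n, s in zip(sample_sizes, richness)]
    ys = list(richness)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope  # the intercept is Smax, the slope is -B


# Synthetic curve from a hyperbola with Smax = 30 and B = 50 (illustrative values).
ns = [10, 50, 100, 200, 400]
ss = [30.0 * n / (50.0 + n) for n in ns]
smax, b = fit_eadie_hofstee(ns, ss)
print(f"Smax = {smax:.2f}, B = {b:.2f}")
```

Because S(n) appears on both sides of the regression, measurement noise affects both axes; this is one reason a likelihood-based fit may be preferred for real survey data.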
A ranked abundance distribution of the bacterial divisions detected in the four Arizona soils was plotted (Fig. 3) to evaluate the possibility of fitting the division data to a lognormal distribution (a parametric extrapolation method). Each set of data was best fit by a power function. However, the plotted data from each environment did not exhibit clear evidence of a mode. If the data were lognormally distributed and if over half of the divisions in each environment were represented in the sample, transformation of the ranked division abundance data to a log2 abundance scale would yield a normal distribution truncated to the left of the modal octave. The division abundance data from the four Arizona soils displayed no evidence of having a normal distribution with a mode (data not shown). The hot spring data displayed the possible beginning of a normal distribution but lacked a defined mode (data not shown). If the abundance distribution of bacterial divisions in the soils is in fact lognormal, the data suggest that less than half of the divisions in each soil have been documented.
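The transformation to a log2 abundance scale can be sketched as a simple binning step. The version below places a taxon with abundance n in octave floor(log2 n), which is a simplification of Preston's convention for handling counts that fall on octave boundaries; the function names and the example counts are hypothetical.

```python
from collections import Counter


def octave_histogram(abundances):
    """Count taxa per log2 class; octave k holds abundances in [2**k, 2**(k+1))."""
    # int.bit_length() - 1 equals floor(log2(n)) exactly for positive integers.
    return Counter(int(n).bit_length() - 1 for n in abundances if n > 0)


def modal_octave(histogram):
    """Octave with the most taxa (the lowest octave wins ties)."""
    return max(sorted(histogram), key=histogram.get)


# Hypothetical clone counts per division, for illustration only.
counts = [1, 2, 2, 3, 3, 4]
hist = octave_histogram(counts)
print(dict(sorted(hist.items())), "modal octave:", modal_octave(hist))
```

A histogram whose counts rise to an interior peak and then fall suggests that the mode has been sampled; counts that only decline from the lowest octave give no evidence of a mode.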
Modeling bacterial species abundance.
To obtain a null model of bacterial species abundance for the Arizona soil communities, a series of lognormal distributions were constructed. The observed bacterial community size and observed sample diversity from each Arizona soil community were used to constrain the set of theoretical distribution models. For each theoretical distribution, the community size, NT, was calculated mathematically and compared with empirical data. Likewise, the species richness in a sample size of 200 individuals was estimated from each theoretical model by simulated sampling and compared with the observed values from the Arizona clone libraries. With this approach, we identified a reasonable set of models for the Arizona soil communities.
As shown in Fig. 4, lognormal models with about 3,000 to 8,000 species and an Rmax value of 11 produced results most consistent with the observed data from the four Arizona communities. The calculated community sizes for models with an Rmax value of 10 ranged from 2.4 × 10⁷ to 7.8 × 10⁷ cells (g of soil)⁻¹ (for communities with 2,000 to 10,000 species) and from 2.5 × 10⁸ to 6.7 × 10⁸ cells (g of soil)⁻¹ for models with an Rmax value of 12. These community sizes were generally either too low or too high to be consistent with the observed community sizes (total cell counts) from the Arizona soils. The calculated community size from models with an Rmax value of 11 (and 3,000 to 8,000 species) ranged from 9.8 × 10⁷ to 1.9 × 10⁸ individuals, compared with averages of 1.0 × 10⁸ and 1.6 × 10⁸ cells (g of soil)⁻¹ observed in the Arizona soils. For the Rmax = 11 (Rmax11) models, the predicted species richness for a sample size of 200 individuals ranged from 124 ± 12 to 161 ± 11, compared with the 134 to 161 RFLP groups found previously (7) in the Arizona soil clone libraries. Given the consistency of these observations, the Rmax11 models were the most reasonable lognormal null models for the distribution of species abundance in the Arizona soils.
For the 4,000-species-community model (shown in Fig. 5A), the minimum sample size for documentation of species no. 1 with 95% confidence is 55 individuals (Fig. 6B). A sample of this size would be expected to include species no. 1 and 46 (±5) additional species. If a second sample of 55 individuals were taken, only species no. 1 of the 47 (±5) species in the first sample would be expected to recur at the 95% confidence level. There is progressively less chance that the other 46 species (ranked in order of decreasing population size) from the first sample would co-occur with species no. 1 in a second sample.
Applying this analysis to the Arizona soil clone libraries, only 6 to 8 of the 134 to 161 species-level groups in each library are predicted to be reproducible at the 95% confidence level. On a larger scale, confident documentation of the most abundant 2,000 species (the top 50%) from the 4,000-species model community would require a sample size of 285,400 individuals. This sample size is 11.4-fold larger than the sample size of 25,000 individuals needed for documenting a random set of 2,000 species. These data demonstrate that if 16S rDNA surveys are used (as currently practiced) for comparing the composition of complex soil bacterial communities, the sample sizes must be dramatically larger than the sizes commonly used at present. Furthermore, only a fraction of the species present in a survey will be reproducible. Given a suitable model of the abundance distribution of species in a community, the reproducible fraction of species in a sample can be estimated.
Model inaccuracy.
The models we constructed depend on observed community sizes (total cell counts) and observed sample diversity. To determine the impact of model inaccuracies (over- or underestimates of community size or diversity) on estimates of sample scale, sampling curves were calculated for lognormal models based on Rmax values of 10, 11, and 12 (Fig. 6B). By using these values, theoretical communities were constructed that contained the same number of species (4,000) but varied in size by a factor of 8 (NT = 4 × 10⁷, 1.2 × 10⁸, and 4 × 10⁸ cells, respectively). The required sample size for detection of the most abundant 50 species in each community was 1,639, 1,481, and 1,472 individuals, respectively. Varying the total number of species (ST) also had a minor effect on sample scale estimates. For example, for model communities containing 4,000, 6,000, or 10,000 species (Rmax = 11), the required sample sizes were 1,481, 1,759, and 2,140 individuals, respectively. The data show that for sampling the top fraction of the community, sample size increases as species diversity (ST) increases or as community size (Rmax) decreases. These relationships change, however, as progressively larger fractions of community diversity are sampled (Fig. 6B). Nonetheless, the magnitude of calculated sample sizes for surveying a specified set of species is generally similar despite small changes (potential inaccuracies) in model parameters.
| DISCUSSION |
|---|
Interpreting division-level community composition and structure.
The structure and composition of biological communities are indicators of ecological complexity, evolutionary history, and community boundaries. At the species level, community structure can reveal differences in resource availability or in resource partitioning and succession status. However, at higher taxonomic levels such as the division level, the ecological significance of community composition and structure is ambiguous. The ecological relevance of differences in division abundance is impossible to interpret unless the change can be ascribed to natural selection acting upon a phenotype shared by most or all members of a division. While this situation is true for a few bacterial divisions, many divisions (e.g., the Proteobacteria) are known for the remarkable metabolic and ecological diversity of member species. It is therefore difficult to imagine that patterns of diversity at higher taxonomic levels are strongly shaped by primary ecological mechanisms.
The division-level structure and composition of bacterial communities may simply be a signature of communities with shared colonization history (i.e., who arrived and adapted first). As shown in Fig. 2, bacterial communities from different environments (rhizosphere versus interspace) that had shared history exhibited similar patterns of division-level structure. Environments with a common geologic history and geographic proximity would be expected to experience similar demographic processes, resulting in shared patterns of diversity at the coarsest levels of phylogenetic resolution. Such patterns may be maintained, despite environmental changes, as a result of differential population responses (declines in some populations can be counteracted by increases in the others) that buffer division-level abundance.
Determining the extent of local division diversity.
One of the primary uses of bacterial community surveys is to document the scope of phylogenetic diversity in natural environments. Every survey conducted to date has documented novel lineages at the species level or at higher taxonomic levels. The Arizona soil bacteria surveys included four deeply branching lineages (provisionally named SC1, SC2, SC3, and SC4) that appear to represent novel candidate divisions. Members of the SC1 and SC4 lineages were documented previously in surveys of a marine sample (unpublished NCBI sequence AF007732) and a Wisconsin soil sample (1), respectively. The independent collection of sequences representing these lineages supports the contention that the sequences represent legitimate, deeply branching bacterial groups. Sequences closely related to the SC2 and SC3 lineages have not yet been reported in other surveys. If the division-level status of these lineages is substantiated by additional studies, these putative divisions would raise the number of confirmed and candidate divisions in the bacterial domain from about 35 (9) to 39.
The total number of bacterial divisions that exist on a local or global scale is unknown. Since surveys of terrestrial bacterial communities are typically dominated by members of the Acidobacterium, Proteobacteria, and gram-positive divisions, determining the global extent of division-level diversity within the bacterial domain will depend upon sampling rare divisions that occur in local environments. The scale of surveys required for complete documentation of division-level diversity in a local environment could be estimated if the total number and abundance distribution of divisions were known. However, our unsuccessful attempts to extrapolate the total number of bacterial divisions in each Arizona soil community based on the observed survey data demonstrated that extrapolation of division-level diversity on a local scale will require empirical data from surveys that are either significantly larger in size, well replicated, or both.
Theoretical sampling of species diversity.
Larger surveys are also required to document the extent of bacterial species diversity in nature, but we can now predict the magnitude of these surveys by use of theoretical species abundance models. Modeling the distribution of species abundance in biological communities typically involves two distinct problems: fitting the curvature of the upper portion and fitting the curvature of the lower portion of the true distribution. The upper portion of the distribution represents the most abundant, easily sampled species, while the lower portion (the most uncertain portion, prone to greatest modeling error) represents the rare species that are difficult or impossible to sample. A single model may provide a good fit for only one portion of the distribution. For example, a community may generally fit a lognormal distribution but have a long lower tail in the distribution due to an overabundance of rare species (especially true for communities with high immigration rates) (8).
Addressing the two problems depends on a researcher's needs. If parametric extrapolation of species diversity (based on partial survey data) is the goal, selecting a model that well describes both portions of the true distribution is essential. However, in our case, error in the lower tail of the distribution is tolerable because we are most concerned with predicting sampling requirements for the dominant bacterial species (presumably, the species that contribute most to ecosystem processes). Therefore, we require a reasonable model describing at least the upper portion of the bacterial species abundance distribution.
We used a lognormal distribution as the basic model for the distribution of bacterial species abundance. At present the lognormal is the best choice as a null model of nonuniform bacterial species abundance because it is a purely statistical model requiring no assumptions about demographic, ecological, or evolutionary mechanisms that might shape bacterial community structure. A uniform distribution was recently suggested as the most appropriate distribution for bacterial communities in surface soils based on data from 16S rDNA surveys (40). The Arizona soil communities were in some regards similar to several of the communities described by Zhou et al. (40). For example, the arid Arizona soils were low-carbon (0.3% organic matter), unsaturated environments. Low clone dominance was observed in the Arizona bacterial surveys, with only six to eight clones comprising the most abundant species-level group in three of the libraries. The Arizona community surveys also yielded index values (1/D = 52, 100, 104, and 107) that were intermediate in the range of index values reported by Zhou et al. (40). Nonetheless, application of uniform distribution models to the Arizona community data produced implausible results. Most importantly, sampling simulations from uniform distribution models yielded surveys in which the abundance distribution of species-level groups differed markedly from those observed in the Arizona surveys (data not shown). In contrast, lognormal distribution models (Fig. 5) produced results consistent with the Arizona soil community surveys.
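The 1/D values quoted above are read here as the reciprocal of Simpson's index, which is an interpretive assumption. Under that reading, and using the finite-sample form D = Σ n_i(n_i - 1) / [N(N - 1)], the index can be computed from the clone counts of each species-level group as in the sketch below; the function name and the example counts are hypothetical.

```python
import math


def reciprocal_simpson(counts):
    """Reciprocal Simpson index 1/D from per-group clone counts."""
    total = sum(counts)
    if total < 2:
        raise ValueError("at least two clones are needed")
    d = sum(c * (c - 1) for c in counts) / (total * (total - 1))
    # D is zero when every group is a singleton; 1/D is then unbounded.
    return math.inf if d == 0 else 1.0 / d


print(reciprocal_simpson([5, 5]))             # two equally abundant groups
print(reciprocal_simpson([8, 1, 1]))          # one dominant group
```

Higher values indicate lower dominance, so libraries in which almost every group is represented by one or two clones yield large 1/D values.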
The lognormal model generates a distribution that is concave downward in the upper portion of the curve (when plotted as in Fig. 5A), a feature characteristic of the observed abundance distributions of species from plant, animal, and insect surveys (34). The traditional approach to constructing lognormal models for biological communities has been to derive the models directly from survey data. Survey data are plotted to identify the modal octave of the lognormal distribution and to measure the standard deviation or dispersion constant. This type of approach has not been possible for bacterial communities due to the extremely small size of the surveys conducted to date. Therefore, we devised a new approach that enables the construction and partial validation of lognormal null models for bacterial communities without extensive survey data.
Lognormal models of species abundance in bacterial communities can be constructed by using only two critical parameters: an estimate of the population size of the most abundant species (this defines Rmax) and an estimate of species richness. Both estimates can be selected by trial and error. The estimates create a model community of a fixed size, NT, that can be compared to an observed community size to partially validate the model. Models with community sizes that differ greatly from observed data can be quickly rejected. To further validate a model, the species richness in a simulated sample from a model can be compared to the species richness observed in a sample of identical size from a natural community. Combined, the size of a natural community and the observed species richness in a sample from the community impose severe restrictions on the modeling space and circumscribe a small set of feasible models, as shown in Fig. 4.
By using the approach above, we identified a set of lognormal models consistent with observed data from the Arizona soil communities. The set of feasible models describes communities in which about half of the species have population sizes of between 1 × 10³ and 6 × 10⁶ cells (g of soil)⁻¹. These species comprise 99% of the bacterial biomass. The remaining 1% of biomass is distributed unequally between the remaining half of species with population sizes of less than 10³ cells (g of soil)⁻¹ (Fig. 5A and 5B). These models clearly create testable predictions. The population size of the dominant species can be confirmed experimentally, or additional surveys can be conducted to test predictions of sample diversity. The accuracy of the models, and therefore the true structure and composition of the natural communities, can thus be examined further by rational experimentation guided by concrete model predictions.
Equipped with a species abundance model, one can easily estimate the scale of surveys required for documentation of species diversity (as shown in Fig. 6A and 6B). For such predictions, the intended purpose of the survey has profound consequences on the required survey size. For example, one may wish to predict the sample size required to document a specified fraction (e.g., 10%) of the species diversity in a community or, alternatively, to predict the sample size needed for partial comparison of the species composition of two different communities. In the first case, a specified number of random species is required in a sample, whereas the identity of each species is irrelevant. For such a survey, sample size depends on additive sampling probabilities. In the second case, the identity or relative rank of species desired in the sample is paramount. In this case, the required survey size is a multiplicative function of the sampling probability of each specified species (or species rank) and is therefore much larger. In fact, as our results clearly showed, adequate documentation of even a modest number of soil bacterial species for partial comparison of community composition requires sample sizes orders of magnitude larger than those currently used.
To better illustrate the limitations of small surveys for comparison of species diversity in soil bacterial communities, we compared the species richness and species composition of simulated surveys. Five independent surveys (200 individuals each) were obtained by randomly selecting individuals from the 6,000-species model community. Species richness varied little between surveys (average, 159 species; range, 154 to 164 species; standard deviation [S.D.] = 3.6), in accord with the contention of Hughes et al. (11) that species richness, even in small surveys, has sufficient precision for use as a relative diversity index. In fact, rarefaction analysis of the model communities in Fig. 5A demonstrated over a broad range of sample sizes that species richness is reproducible (for a given sample size) with low sample-to-sample variance (data not shown).
Species composition, on the other hand, is highly variable. For example, based on the Arizona community models, 94 to 99% of the species occurring in a random sample of 200 individuals were predicted to vary between samples. This prediction was consistent with comparisons of the five simulated surveys. Only seven species were common to all five surveys (data not shown). In comparisons of four or three surveys, 11 (S.D. = 2) and 18 (S.D. = 3.5) common species were identified, respectively. In pairwise comparisons, an average of 38 common species (S.D. = 3) were identified, or in other words, paired surveys were 48% similar. The pairwise similarity of the existing clone libraries from the Arizona soils ranged from 11 to 22% (7). These observations demonstrate not only the limitations of small surveys but also, more importantly, the strength of theoretical species abundance models in guiding the interpretation of survey data.
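Comparisons of this kind reduce to set operations on the species lists of each survey. The sketch below counts the species common to all surveys and the average number shared by pairs of surveys; the function names and the tiny example surveys are hypothetical, and no particular similarity coefficient is implied.

```python
import itertools


def shared_by_all(surveys):
    """Species found in every survey (each survey is an iterable of species IDs)."""
    sets = [set(s) for s in surveys]
    return set.intersection(*sets)


def mean_pairwise_shared(surveys):
    """Average number of species shared by each pair of surveys."""
    sets = [set(s) for s in surveys]
    pairs = list(itertools.combinations(sets, 2))
    return sum(len(x & y) for x, y in pairs) / len(pairs)


# Three toy surveys listing the species IDs observed in each.
surveys = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
print(sorted(shared_by_all(surveys)))
print(mean_pairwise_shared(surveys))
```

Applied to surveys simulated from a model community, the same two functions show how quickly the shared set shrinks as more surveys are compared.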
A theoretical null model may not be an exact representation of a natural community. Since no sampling method is error free, species abundance models based on observed sample diversity may incorporate biases. PCR amplification and cloning of DNA sequences are known to introduce errors. Most of these errors distort the relative abundance of individual populations in a sample (18, 32, 33, 39), and some may inflate richness estimates up to 20% (14, 25, 37). However, we demonstrated (Fig. 6) that inaccurate modeling of community size (up to eightfold) or species diversity (up to 2.5-fold) has only minor effects on the estimates of sample sizes required for surveying the most abundant species. Consequently, we argue that theoretical models based on observed community sizes and sample diversity can effectively demonstrate the scope of sampling problems and the magnitude of surveys needed for adequate documentation of diversity in natural bacterial communities and can serve effectively as classical null models for hypothesis testing.
| ACKNOWLEDGMENTS |
|---|
We thank Joe Busch, Jody Davis, and Greg Fisher for technical assistance.
| FOOTNOTES |
|---|
| REFERENCES |
|---|