Applied and Environmental Microbiology, October 2006, p. 6773-6779, Vol. 72, No. 10
0099-2240/06/$08.00+0 doi:10.1128/AEM.00474-06
Copyright © 2006, American Society for Microbiology. All Rights Reserved.

Department of Plant Pathology, University of Wisconsin-Madison, Madison, Wisconsin 53706
Received 27 February 2006/ Accepted 31 July 2006
TABLE 1. Tools used in microbial ecology to describe and compare microbial communities
Here we present a new computer program, SONS (stands for shared OTUs and similarity), which compares the memberships and structures in communities by accounting for the abundance distributions of OTUs that are either endemic to one community or shared by two communities (5, 4, 6, 8, 26, 27). Using examples from medical microbial ecology, we compare the results of phylogeny-wide statistical hypothesis tests to similarity indices calculated at specific OTU definitions. SONS builds upon an analysis using DOTUR (20), which enables SONS to measure the fraction of OTUs shared by two communities as a function of genetic distance. SONS is a versatile and powerful tool that will complement the suite of tools used by microbial ecologists.
[Equations 1, 2, and 2a to 2j appeared as images in the source and are not reproduced here.]
These estimates can be corroborated by estimating the richness of each community individually and the richness of the two communities pooled together. The difference between the pooled richness estimate and the sum of the individual richness estimates should be similar to the results of equations 1 and 2.
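The corroboration described above can be sketched in a few lines. This is a minimal illustration, assuming the classic Chao1 form (observed richness plus f1²/2f2, with the bias-corrected form when no doubletons are present); which variant DOTUR reports is not stated here, and the function names and counts are hypothetical.

```python
def chao1(otu_counts):
    """Chao1 richness estimate from per-OTU sequence counts (assumed classic form)."""
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)  # OTUs observed once
    f2 = sum(1 for c in counts if c == 2)  # OTUs observed twice
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used here only when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2.0


def shared_richness_from_pool(s1, s2, s_pooled):
    """Shared richness implied by the individual and pooled richness estimates."""
    return s1 + s2 - s_pooled


# Hypothetical counts: five observed OTUs, two singletons, one doubleton.
example_estimate = chao1([5, 3, 1, 1, 2])
# Hypothetical richness estimates for two communities and their pool.
example_shared = shared_richness_from_pool(10, 8, 13)
```

A large disagreement between this difference and the shared richness estimate would suggest that at least one of the estimates has not stabilized with sampling.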
Estimating the fraction of OTUs shared between two communities.
Incidence-based measures of community similarity, such as the classic Jaccard (Jclas) and Sørensen (Lclas) similarity indices, calculate the ratio of shared OTUs to the total number of OTUs in individual communities:
$$J_{clas} = \frac{S_{12}}{S_1 + S_2 - S_{12}} \qquad (3)$$
$$L_{clas} = \frac{2 S_{12}}{S_1 + S_2} \qquad (4)$$
Because it is currently impractical to sample most microbial communities exhaustively, values calculated from observed OTU counts with equations 3 and 4 may underestimate the true values. In this study, we propose substituting the value from either equation 1 or equation 2 for S12 and the single-community richness estimates determined by DOTUR for S1 and S2. Throughout this study we report the Chao1 richness estimates for S1 and S2 (3) and the values of equation 1 for S12.
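As an illustration of the substitution proposed above, the sketch below computes the classic indices from richness values. It assumes the standard forms of the Jaccard and Sørensen indices; the function names and the input values are hypothetical and stand in for estimated S1, S2, and S12.

```python
def jaccard_classic(s1, s2, s12):
    """Classic Jaccard index: shared OTUs over all OTUs in either community."""
    return s12 / (s1 + s2 - s12)


def sorensen_classic(s1, s2, s12):
    """Classic Sorensen index: shared OTUs over the mean richness of the two communities."""
    return 2.0 * s12 / (s1 + s2)


# Hypothetical richness estimates for two communities and their shared fraction.
j = jaccard_classic(10, 8, 5)
l = sorensen_classic(10, 8, 5)
```

Passing estimated rather than observed richness values changes only the inputs, not the form of either index.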
Estimating the fraction of sequences that belong to shared OTUs.
Just as the Chao1 richness estimator is a function of the number of OTUs observed once or twice in a sample (3), the estimator of the fraction of sequences in shared OTUs is a function of the number of shared OTUs that are observed at least once or twice in the community being analyzed (5, 4):
[Equations 5 and 6 appeared as images in the source and are not reproduced here.]
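Using these estimators, the abundance-based Jaccard (Jabund) and Sørensen (Labund) similarity indices may be calculated (5, 4):
$$J_{abund} = \frac{UV}{U + V - UV} \qquad (7)$$
$$L_{abund} = \frac{2UV}{U + V} \qquad (8)$$
where U and V are the estimated fractions of sequences in shared OTUs for the two communities (equations 5 and 6).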
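The abundance-based indices can be sketched as follows, assuming the usual forms UV/(U + V − UV) and 2UV/(U + V), where U and V are the fractions of sequences in shared OTUs for the two communities. The helper `observed_shared_fraction` is a simplified stand-in: it uses observed fractions only and omits the correction for unseen shared OTUs that the estimators of equations 5 and 6 apply. All names and counts are hypothetical.

```python
def observed_shared_fraction(own_counts, other_counts):
    """Observed fraction of one library's sequences that fall in OTUs seen in both libraries."""
    total = sum(own_counts)
    shared = sum(x for x, y in zip(own_counts, other_counts) if x > 0 and y > 0)
    return shared / total


def jaccard_abund(u, v):
    """Abundance-based Jaccard index from the two shared-sequence fractions."""
    denom = u + v - u * v
    return u * v / denom if denom > 0 else 0.0


def sorensen_abund(u, v):
    """Abundance-based Sorensen index from the two shared-sequence fractions."""
    return 2.0 * u * v / (u + v) if (u + v) > 0 else 0.0


# Hypothetical per-OTU counts for two libraries (same OTU order in both lists).
x = [10, 5, 0, 1]
y = [4, 0, 6, 2]
u = observed_shared_fraction(x, y)  # (10 + 1) / 16
v = observed_shared_fraction(y, x)  # (4 + 2) / 12
j_abund = jaccard_abund(u, v)
l_abund = sorensen_abund(u, v)
```

Because the observed fractions ignore shared OTUs that were missed by sampling, they tend to fall below the corrected estimates when coverage is low.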
Estimating community structure similarity.
The overlap measures described by equations 7 and 8 do not account for the similarity of the relative abundances among the OTUs shared between two communities. Therefore, although they measure community overlap, they do not measure the similarity of two community structures. Yue and Clayton (26) proposed a nonparametric maximum likelihood estimator of similarity, θ, to compare community structures:
[Equation 9 appeared as an image in the source and is not reproduced here.]
that was derived by Yue and Clayton (26).
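A minimal sketch of a structure-based similarity of this kind is given below. It assumes the form in which the Yue and Clayton index is commonly written, θ = Σaᵢbᵢ / (Σaᵢ² + Σbᵢ² − Σaᵢbᵢ), with aᵢ and bᵢ the relative abundances of OTU i in the two communities; whether this matches equation 9 term for term is an assumption, and the function name and counts are hypothetical. The standard error is not computed.

```python
def theta_yc(counts_a, counts_b):
    """Yue-Clayton style similarity from per-OTU counts (same OTU order in both lists)."""
    n_a = sum(counts_a)
    n_b = sum(counts_b)
    a = [x / n_a for x in counts_a]  # relative abundances in community A
    b = [y / n_b for y in counts_b]  # relative abundances in community B
    ab = sum(p * q for p, q in zip(a, b))
    aa = sum(p * p for p in a)
    bb = sum(q * q for q in b)
    return ab / (aa + bb - ab)


# Hypothetical counts: same two OTUs, different relative abundances.
theta_example = theta_yc([1, 1], [3, 1])
```

The value is 1 when the two communities have identical relative abundances and 0 when they share no OTUs, which is the property the overlap measures of equations 7 and 8 lack.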
Sensitivity to sampling.
To determine whether the values generated by equations 1 to 9 were sensitive to sampling, we randomized the sequence order and analyzed each collector's curve. Except where noted, the collectors' curves had stabilized reasonably well by the final sampling point.
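The sampling check described above can be sketched as follows: shuffle the sequence order, then recompute a statistic on successively larger prefixes. This is an illustration only; the function names, the use of observed richness as the statistic, and the input labels are hypothetical, and any of the estimators could be passed in instead.

```python
import random


def collectors_curve(otu_labels, statistic, step=1, seed=None):
    """Return (n, statistic) pairs for the first n sequences of a shuffled order.

    otu_labels holds one OTU label per sequence.
    """
    order = list(otu_labels)
    random.Random(seed).shuffle(order)  # randomize the sequence order
    return [(n, statistic(order[:n])) for n in range(step, len(order) + 1, step)]


def observed_richness(labels):
    """Number of distinct OTUs among the sampled sequences."""
    return len(set(labels))


# Hypothetical library: nine sequences assigned to three OTUs.
labels = ["a"] * 5 + ["b"] * 3 + ["c"]
curve = collectors_curve(labels, observed_richness, seed=1)
```

A curve whose last several points change little suggests that the estimate is no longer sensitive to additional sampling; with `step` greater than 1, the final sequences are included only if `step` divides the library size.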
SONS: shared OTUs and similarity.
SONS is freely available software written in C++. As input, it uses the "list" output file from DOTUR and a file that pairs each sequence identifier with a designation of the library from which the sequence was generated. The DOTUR-derived list file can be replaced with an analogous file if sequences were assigned to OTUs by an alternative method, as long as the file formatting is retained. The order of sequence identifiers is used to construct collectors' curves of the estimates from equations 1 to 9 for each pairwise comparison and distance level used to define OTUs. The ordering of sequences can be randomized in the software. Where possible, SONS calculates the 95% confidence interval for the final estimate. A manual, example data sets, source code for compilation in Linux or Mac OSX, and the Windows executable file are available at the SONS website (http://www.plantpath.wisc.edu/fac/joh/sons.html).
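The way the two inputs combine can be illustrated with in-memory structures. The sketch below is not the SONS parser and assumes nothing about the file formats: `otus` (a list of OTUs, each a list of sequence identifiers) and `library_of` (a mapping from sequence identifier to library name) are hypothetical stand-ins for the contents of the list file and the paired identifier file.

```python
from collections import Counter


def otu_library_counts(otus, library_of):
    """For each OTU, count how many of its sequences came from each library."""
    return [Counter(library_of[seq] for seq in otu) for otu in otus]


def shared_and_endemic(table, lib_a, lib_b):
    """Count OTUs shared by two libraries and OTUs observed in only one of them."""
    shared = sum(1 for c in table if c[lib_a] > 0 and c[lib_b] > 0)
    only_a = sum(1 for c in table if c[lib_a] > 0 and c[lib_b] == 0)
    only_b = sum(1 for c in table if c[lib_a] == 0 and c[lib_b] > 0)
    return shared, only_a, only_b


# Hypothetical OTU assignments and library designations.
otus = [["s1", "s2"], ["s3"], ["s4", "s5"]]
library_of = {"s1": "B", "s2": "D", "s3": "B", "s4": "D", "s5": "D"}
table = otu_library_counts(otus, library_of)
result = shared_and_endemic(table, "B", "D")
```

The per-OTU counts in `table` are the observed quantities from which shared richness and similarity estimates are then derived.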
Data sets.
All data sets used in this study are available at the SONS website. The 16S rRNA sequences used in this study were obtained from the original authors (9, 19) or from GenBank (13). Sequence alignments were obtained either from the original author (9) or by using the greengenes website (http://greengenes.lbl.gov) (13, 19). Because not all of the 16S rRNA gene sequences from the mouse cecum study overlapped or were the same length, we considered only those sequences and nucleotides that overlapped between nucleotides 100 and 600 (Escherichia coli numbering). All alignments were imported to ARB (15), and distance matrices were generated using the Jukes-Cantor correction for multiple substitutions. DOTUR (20) was used to assign sequences to OTUs for every possible distance level (http://www.plantpath.wisc.edu/fac/joh/dotur.html). For ease of reporting results, OTUs will be designated OTUx.xx, where the subscript represents the maximum distance as a fraction of nucleotide substitutions between any two sequences within that OTU. Although these distance cutoffs are arbitrary and controversial, it is often helpful to think of OTUs that are defined by distances of less than 0.03 as corresponding to a strain-level delineation, of 0.03 corresponding to a species, of 0.05 corresponding to a genus, of 0.15 corresponding to a class, and of 0.20 to 0.30 corresponding to a phylum (20). Unless otherwise stated, all richness estimates were considered reliable when the estimate did not show sensitivity to additional sampling, and we report Chao1 richness values, although the Chao1 (3), ACE (7), and Jackknife (2) estimates were generally similar.
Community structures were compared using ∫-LIBSHUFF (22) and TreeClimber (21). ∫-LIBSHUFF (http://www.plantpath.wisc.edu/fac/joh/s-libshuff.html) uses a distance matrix as the input to determine whether one library represents a random sampling of another. A small P value for both comparisons indicates that the two libraries had significantly different structures. If the P value corresponding to the comparison of one library to another is small and the reverse comparison yields a high P value, then the community structure of the second library is considered to be a subset of the first (22, 24). TreeClimber (http://www.plantpath.wisc.edu/fac/joh/treeclimber.html) implements the parsimony test, which was originally applied in studying the population biology of sexual organisms and has since been applied to asexual organisms, to determine whether two or more communities harbor different structures (21). Phylogenetic trees for use in TreeClimber were generated by the neighbor-joining algorithm in ARB.
Human distal esophagus.
Pei et al. (19) obtained 714 overlapping 16S rRNA gene sequences in samples from each of three patients with healthy esophagi (patient B, patient C, and patient D). Their analysis focused on describing the phylogenetic compositions of the three communities and the aggregate pooled richness of the bacterial types. Our phylogeny-based analysis of the three community structures found the observed differences to be statistically significant (for ∫-LIBSHUFF, all P values were <0.0072; for TreeClimber, P was <0.0001).
Considering the small P values that we calculated using ∫-LIBSHUFF and TreeClimber, we used SONS to further characterize the differences between the three communities. Estimating the OTU0.03 richness of each patient's community and the richness shared between patients (Fig. 1) showed (i) that the membership of patient B's community was a subset of that of patient D and (ii) that patient C's community shared 70% of its membership with those of both patient B and patient D. Considering the relatively large confidence intervals for the richness estimates for the communities of patients B and D, it is possible that they had the same richness. We then calculated the Jabund between the three communities. The Jabund value between communities B and D was 0.94 (standard error [SE] = 0.10), which was not significantly different from 1.0; this result is supported by the overlapping richness estimates shown in Fig. 1. The Jabund values between communities B and C (0.75, SE = 0.13) and D and C (0.62, SE = 0.14) were large but were significantly different from 1.0. These estimates suggested that although low-abundance members might not have been shared between communities, the most abundant members were shared.
FIG. 1. Venn diagram comparing the OTU0.03 memberships found in the distal esophagi of three patients (patient B [n = 205], patient C [n = 264], and patient D [n = 245]). Below each patient's name is the Chao1 richness estimate and the 95% confidence interval for that community. We estimated the richness of the overlapping regions based on the pairwise S1,2 Chao richness estimates shared by the three communities and by pooling two communities and estimating the fraction shared with the third community. These estimates are provided on the right side of the figure. The Chao1 richness estimate of the three libraries pooled together was 125 (CI = 98 to 191), and the sum of the individual sectors in the diagram was 117.
When we used the OTU0.03 definition, the θ values for the comparisons between patients B and C, B and D, and C and D were 0.26 (SE = 0.04), 0.61 (SE = 0.08), and 0.10 (SE = 0.07), respectively. When we used the OTU0.20 definition, the same comparisons yielded θ values of 0.88 (SE = 0.04), 0.65 (SE = 0.07), and 0.68 (SE = 0.07), respectively. Although the community structures were not identical, it is interesting that the OTU0.03 community structure between patients B and D was well conserved, considering that patient B's membership was a subset of patient D's. It is also interesting that the OTU0.20 structures of the three communities were not more similar, considering that the OTU0.20 memberships completely overlapped.
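Pairwise similarity values such as these can be arranged as an unweighted pair group method with arithmetic mean (UPGMA) dendrogram by converting each similarity to a distance of 1 minus the similarity. The sketch below is a minimal pure-Python illustration applied to the three OTU0.03 values reported above; the function name, the data layout, and the choice to report the merge distance (rather than half of it) as the height are assumptions, and no such tree was drawn for these three patients in this study.

```python
def upgma(names, dist):
    """Average-linkage clustering.

    dist maps a pair of names (a tuple) to a distance.
    Returns a list of merges as (cluster_a, cluster_b, distance_at_merge).
    """
    sizes = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)  # closest pair of current clusters
        a, b = sorted(pair)
        height = d[pair]
        new = "(" + a + "," + b + ")"
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the distances from its two parts.
        for other in sizes:
            if other in (a, b):
                continue
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            d[frozenset((new, other))] = (
                (sizes[a] * d_a + sizes[b] * d_b) / (sizes[a] + sizes[b])
            )
        del d[pair]
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((a, b, height))
    return merges


# Distances (1 - similarity) from the OTU0.03 values reported in the text.
distances = {
    ("B", "C"): 1 - 0.26,
    ("B", "D"): 1 - 0.61,
    ("C", "D"): 1 - 0.10,
}
merges = upgma(["B", "C", "D"], distances)
```

With these inputs, patients B and D join first, and patient C joins the pair at the mean of its distances to B and D.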
Human gastrointestinal tract.
Eckburg et al. (9) pursued an extensive 16S rRNA gene sequencing project where they analyzed a total of 11,831 bacterial 16S rRNA sequences from six gastrointestinal tract tissue samples and one stool sample from each of three healthy patients. They found little intrapatient variation in community structures across tissue sites but found statistically significant differences in the community structures between patients and between the sequences derived from the tissue and stool samples from a common patient. We reanalyzed the data to estimate the similarities between patients and between the tissue and stool samples for individual patients.
We pooled the tissue and stool sample sequences for each patient (designated patient 70, patient 71, and patient 72) to determine whether statistically significant differences in community structures (for ∫-LIBSHUFF, all P values were <0.001; for TreeClimber, P was <0.001) could be attributed to differences in membership or to the abundance of individual members. By comparing the individual and shared richness estimates of the three patients, we identified a core membership of approximately 57 OTU0.03s that were found in all three communities (Fig. 2). Analysis of the fraction of sequences in each patient's community that belonged to a shared OTU0.03 suggested that the most abundant 16S rRNA gene sequences from patient 70 were most similar to the OTU0.03s shared between patients 71 (0.91, SE = 0.05) and 72 (0.98, SE = 0.03); however, the Jabund between patients 71 and 72 was only 0.34 (SE = 0.08). A dendrogram of the pairwise θ values for the OTU0.03 community structure showed that the community structures were most similar among samples from individual patients (Fig. 3). This complements the results of Eckburg et al. (9), who used ∫-LIBSHUFF, the parsimony test, and AMOVA to show that community structures among sites within a patient were more similar to each other than they were to structures in other patients.
FIG. 2. Venn diagram comparing the pooled OTU0.03 memberships found in the gastrointestinal tracts and stool samples of three patients (patient 70 [n = 4,392], patient 71 [n = 3,605], and patient 72 [n = 3,834]). The Chao1 richness estimate of the three libraries pooled together was 469 (CI = 425 to 544), and the sum of the individual sectors in the diagram was 433.
FIG. 3. Unweighted pair group method with arithmetic mean dendrogram comparing the pairwise θ values between the seven gastrointestinal tract tissue and stool specimens sampled from three patients. The length of the reference bar represents a distance of 0.10 (distance = 1 − θ). ">" indicates that the calculated richness value represents a minimum because the estimate did not stabilize with respect to sampling effort.
The mean θ values between the six tissue samples were 0.95, 0.94, and 0.82 for patients 70, 71, and 72, respectively. When the respective stool communities were included, the mean θ values were 0.82, 0.69, and 0.65, respectively. When we pooled the sequences from the six tissue samples and compared those pools to the respective stool samples for each patient, the Jabund values for patients 70 (0.93, SE = 0.04) and 71 (0.87, SE = 0.09) were not significantly different from 1.0; however, the Jabund value for patient 72 (0.72, SE = 0.09) was significantly different from 1.00. When we used the nonparametric richness estimators to determine the richness of the tissue and stool samples as well as the shared fraction between them, we calculated Jclas coefficients of 0.60, 0.45, and 0.58 for patients 70, 71, and 72, respectively. Overall, this analysis suggests that although there were considerable differences in the community structures of the three pairs of pooled tissue and stool samples, the most abundant OTU0.03s were generally well conserved between tissue and stool samples, even a month after sampling of the tissue.
Mouse cecum.
Ley et al. (13) investigated the effects of family relationship and a mouse allele responsible for obesity on microbial community structure. We reanalyzed this data set using DOTUR and SONS to estimate the fraction of the pooled community shared between male and female mice, the fraction shared between mice of different genotypes, and the relationship among the 19 mice by using nonparametric similarity indices.
To determine whether the male offspring harbored a community significantly different from that of the female offspring, we pooled the 16S rRNA sequences obtained from samples from the nine male (1,995 sequences) and seven female (1,394 sequences) mice. A comparison of the males' and females' community structures using TreeClimber showed that the observed differences in community structure were statistically significant (P < 0.001), and analysis with ∫-LIBSHUFF found the female community structure to be a subset of the male community structure (P = 0.9140 [male versus female] and P < 0.001 [female versus male]). Our comparison of the male and female community OTU0.03 memberships found that the richness estimates of the male (412, CI = 353 to 512) and female (428, CI = 358 to 547) pools were not significantly different and that 236 OTU0.03s were shared between them. The fraction of 16S rRNA gene sequences from samples from the male mice that belonged to shared OTU0.03s was 0.92 (SE = 0.05), and the fraction of 16S rRNA gene sequences from samples from the female mice that belonged to shared OTU0.03s was 0.95 (SE = 0.03), indicating that the most abundant OTU0.03s in each community were shared. Considering that approximately 56% of the OTU0.03s were shared between the two communities, the θ values for OTU0.03 (0.59; SE = 0.03) and OTU0.20 (0.84; SE = 0.02) community structure comparisons were relatively large.
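The reported figure of approximately 56% is consistent with applying the classic Sørensen form to the richness values given above (412, 428, and 236 shared). The short check below shows the arithmetic; reading the 56% as a Sørensen-type ratio is an assumption, and the classic Jaccard form is included only for comparison.

```python
male_richness = 412    # Chao1 estimate for the pooled male libraries
female_richness = 428  # Chao1 estimate for the pooled female libraries
shared_richness = 236  # OTU0.03s estimated to be shared

# Classic Sorensen form: shared richness over the mean richness of the two pools.
sorensen = 2 * shared_richness / (male_richness + female_richness)

# Classic Jaccard form: shared richness over the richness of the union.
jaccard = shared_richness / (male_richness + female_richness - shared_richness)
```

The Sørensen ratio rounds to 0.56, whereas the Jaccard ratio for the same inputs rounds to 0.39.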
In the original analysis, the mouse genotype was described with respect to a leptin gene, which did not induce obesity when it was homozygous dominant (+/+) or heterozygous (ob/+) but did induce obesity when it was homozygous recessive (ob/ob). We pooled the 16S rRNA sequences according to the genotypes of the offspring from which they were sampled to investigate the effect of host genotype on the bacterial community. TreeClimber analysis showed that there were significant differences among the community structures of three genotypes (P < 0.001). Our ∫-LIBSHUFF analysis showed that the pooled +/+ group's community structure was a subset of the pooled ob/+ and ob/ob groups' community structures but that the ob/+ and ob/ob groups' community structures were significantly different from each other (P = 0.95 for +/+ results versus ob/+ results; P = 0.95 for +/+ results versus ob/ob results; and all other P values were <0.001). The comparison of individual and shared richness estimates indicated that there were approximately 151 OTU0.03s in common between the three communities (Fig. 4). The calculation of Jest values between the three genotypes indicated that the ob/+ group shared a large fraction of its sequences with the +/+ (0.85; SE = 0.04) and ob/ob (0.91; SE = 0.04) groups; the Jest value between the ob/ob and +/+ groups' communities was 0.77 (SE = 0.06). The θ value for the comparison between the +/+ and ob/+ mice was 0.66 (SE = 0.04), and it was 0.37 (SE = 0.02) for the other two comparisons. In contrast to the ∫-LIBSHUFF analyses, the SONS-based analyses suggested that when comparing OTU0.03 memberships and structures, the ob/+ and ob/ob groups were more similar to each other than they were to the +/+ group.
FIG. 4. Venn diagram comparing the OTU0.03 memberships found among pooled +/+ (n = 876), ob/+ (n = 1,186), and ob/ob (n = 1,327) mice. The Chao1 richness estimate of the three libraries pooled together was 623 (CI = 535 to 761), and the sum of the individual sectors in the diagram was 554.
values between the communities of the 19 mice, and then we generated a dendrogram to describe the relationship of the communities. The overall result described by Ley et al. (13) indicating that communities obtained from related animals were more similar to each other than to communities from unrelated mice was also illustrated by our analyses. One difference that we observed was that the community structures of the related mothers (mother 1 and mother 3) did not cluster together to the exclusion of those of their offspring but clustered separately and with their offspring (Fig. 5). The conclusions drawn from this analysis must be made with caution because most of the estimates of richness and similarity coefficients were unstable with respect to sampling effort, yet this analysis is an example of the versatility of SONS and the possibility of correlating community membership and structure with host genotype or other variables.
FIG. 5. Unweighted pair group method with arithmetic mean dendrogram comparing nonparametric estimates of the abundance-based Jaccard similarity coefficients for the OTU0.03s identified among 19 mouse cecum samples. A dendrogram constructed using the estimated Sørensen similarity index had the same topology. Although the exact branching orders were not identical, in the dendrogram, the OTU0.03 community structures of the offspring generally clustered with the mother; the community structures of mice M2A-3 and M2B-1 clustered with mother 3. The community name corresponds to the mother (e.g., M1-1 is the offspring of mother 1). The length of the reference bar represents a distance of 0.10 (distance = 1 − Jabund). ">" indicates that the calculated richness value represents a minimum because the estimate did not stabilize with respect to sampling effort.
Although the phylogeny-based statistical analyses all showed significant differences in community structure, our analysis was able to describe the relative similarities of the memberships and structures of these communities for a specific OTU definition. One interesting result was evidence for core microbial communities that were shared among the microbial communities found in human gastrointestinal tracts and esophagi and in mouse ceca. At least 20% of the OTU0.03s were shared between individuals, sites, and genotypes, indicating that these shared populations may be responsible for essential functions. It also suggests that there are endemic OTU0.03s that serve as accessory populations, which are necessary to complement the different hosts' genetics and environments. SONS brings a powerful method of analysis to the growing toolbox of statistical methods for analyzing microbial communities by adding the ability to determine the taxonomic level at which community membership and structure are distinguishable.
SONS may also inform the debate surrounding whether bacteria have a biogeography (i.e., all species are not everywhere). Differences in richness between communities certainly suggest that bacteria have a biogeography, but we have also shown that even communities with similar levels of richness can have very different memberships. Although all methods have limits of detection, SONS should prove to be a useful tool in applications for studying bacterial biogeography (11), providing the flexibility to choose any OTU definition for comparison.
Other statistical methods, including ∫-LIBSHUFF, TreeClimber, UniFrac, and analysis of molecular variance, take different approaches to comparing the structures of communities. The versatility of all of these methods is limited because the user cannot set the level of taxonomic resolution that is used in the analysis and there is no indication of two communities' similarity. Alternatively, the OTU-based approaches employed in DOTUR and SONS define an OTU by a distance matrix and then determine whether there is sufficient coverage to obtain a reliable estimate of a parameter for that OTU definition by analyzing collector and rarefaction curves. Although the collector's curve of an estimate may stabilize or wander about a mean value, additional sampling improves the accuracy and precision of that estimate. Without an exhaustive sampling of two communities, it is impossible to determine those OTUs that are endemic to a community. For example, although we can say that accessory OTUs differentiate the +/+ and ob/ob groups, we cannot describe those OTUs further. Ultimately, all of these methods assume that the relative similarity of 16S rRNA gene sequences used to define an OTU is biologically informative. For a given OTU definition, two communities may not have the same membership; however, that OTU definition may not be the most biologically meaningful taxonomic resolution, since members of the same OTU may have different phenotypes.
In evaluating SONS, we selected three studies from the commensal microbial community literature because of their depth of sampling and the relatively low richness in these communities compared to that in soil. The results from SONS illustrated that most communities from similar environments share a common core community that is supplemented by an accessory community that may be determined by the host's genotype, the parent, and/or a number of other factors. Framing hypotheses that build on this statistical framework may lead to a more mature understanding of how a host shapes the formation and maintenance of its commensal communities.
We appreciate the assistance of Zhiheng Pei, Paul Eckburg, and Ruth Ley, who provided us with the 16S rRNA gene sequences and details of the experiments from their respective studies.
Present address: Department of Microbiology, University of Massachusetts-Amherst, Amherst, MA 01003.