Applied and Environmental Microbiology, April 2007, p. 2727-2734, Vol. 73, No. 8
0099-2240/07/$08.00+0 doi:10.1128/AEM.01205-06
Copyright © 2007, American Society for Microbiology. All Rights Reserved.

Hedmark University College, Hamar, Norway,1 Norwegian Food Research Institute (MATFORSK), Ås, Norway,2 Østfold Hospital Trust, Fredrikstad, Norway,3 Ullevål University Hospital, Oslo, Norway,4 Karolinska Institute, Stockholm, Sweden,5 Norwegian National Public Health Institute, Oslo, Norway,6 University of Oslo, Oslo, Norway7
Received 24 May 2006/ Accepted 4 February 2007
The challenge with phylogroup definitions has recently been addressed through direct microbial community comparison utilizing DNA sequence alignment-based phylogenetic reconstructions (12, 13, 21, 22, 24). The problem with DNA sequence alignments and phylogenetic reconstruction, however, is that large data sets cannot readily be analyzed. There are two reasons for this: there is no objective criterion for determining the correctness of an alignment, and the number of possible phylogenetic trees increases faster than exponentially with the number of taxa analyzed. There are currently nearly 300,000 16S rRNA gene sequences in public databases, and that number is estimated to double every 7 months (8). Relying on alignments would therefore not suit the future needs of microbial community analyses.
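As a rough illustration of the two growth rates mentioned above (a standard combinatorial result for fully resolved unrooted trees and a simple doubling model; the symbols $T(n)$, $N(t)$, and $t$ are introduced here for illustration only):

$$T(n) = (2n-5)!! = \prod_{k=3}^{n} (2k-5), \qquad n \ge 3,$$

so that $T(4) = 3$, $T(5) = 15$, and $T(10) = 2{,}027{,}025$ distinct topologies for $n$ labeled taxa. Under the stated doubling time, with $t$ measured in months from the present count,

$$N(t) \approx 300{,}000 \cdot 2^{t/7},$$

which would give roughly $4.8$ million sequences after 28 months if the doubling estimate continued to hold.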
We present a novel approach for comparing microbial communities that is independent of both phylogroup definitions and DNA sequence alignments. Our concept is based on describing the evolutionary relatedness between bacteria in an absolute multidimensional coordinate space, using multimer transformation in combination with principal component analyses (20). The microbial community analyses are subsequently performed by direct comparisons of frequency distribution landscapes, representing densities of taxa within an absolute coordinate space. Density distribution comparisons are very efficient with respect to computer processing (CPU) time, enabling analyses of very large clone libraries. A further benefit of our approach is that we can directly apply multivariate statistical tools, such as multivariate analyses of variance (MANOVA) and multivariate regression, to the microbial community comparisons. We have also developed a permutation-based method for determining the significance of local differences in taxon distributions within the absolute coordinate space.
It is well known that gastrointestinal (GI) bacteria have a major impact on human health (1). Particularly important is the effect of the initial colonization on the maturation of the immune system of infants (2, 7, 10, 17). The aim of our work was to evaluate our density distribution analysis as a framework for determining structures in the human microbiota. This was carried out by reanalyzing the extensive human bacterial clone library provided by Eckburg et al. (6). We present a detailed description of significant differences in the microbiota with respect to different persons and to mucosal tissues from the six major subdivisions of the colon in addition to fecal samples. We also present initial results for an in-house-generated clone library for infant fecal microbiota, showing differences for age and mode of delivery (caesarean or vaginal). Finally, we compare our density distribution approach with the commonly used tools for microbial community analyses.
Fresh stool samples were immediately frozen at −20°C. The samples were then transported to the microbial community test laboratory. Upon arrival, the samples were dissolved in 5 ml of 50 mM glucose, 25 mM Tris-HCl, and 10 mM EDTA. The samples were stored at −40°C and processed further within 1 month.
DNA purification, PCR amplification, cloning, and DNA sequencing.
Fecal suspensions were thawed on ice and slightly vortexed, and the bacteria in 500 µl of the suspension were disrupted mechanically for 40 s at maximum speed using 0.5 g 106-µm glass beads (Sigma-Aldrich, Steinheim, Germany) in a Fast-Prep bead beater (Bio 101, La Jolla, CA). DNA was then purified with the DNeasy tissue kit (QIAGEN, Hilden, Germany) following the manufacturer's recommendations, eluting the DNA in a 100-µl volume. Three independent DNA purifications were carried out for each fecal sample.
We used the primers 5' TCCTACGGGAGGCAGCAGT 3' (forward) and 5' GGACTACCAGGGTATCTAATCCTGTT 3' (reverse), targeting generally conserved 16S rRNA gene regions (16). These primers generate an amplicon of 466 bp, corresponding to the region between residues 331 and 797 when applied to the Escherichia coli 16S rRNA gene. We chose this amplicon because the quantitative properties are well documented. The relatively short amplicon is a trade-off between the robustness of the PCR and the phylogenetic information gained (19). Short sequences are also relatively resistant to the generation of chimerical amplicons. PCR amplification, cloning, and DNA sequencing were performed as previously described (19).
Clone libraries.
The human adult clone library consists of 11,831 clones. A detailed description of this library has previously been published (6). The infant clone library consists of 390 clones, and details about this library are described here. With respect to age, the library comprised 108 sequences from children aged less than 1 month and 282 sequences from 4-month-old children. With respect to mode of delivery, 218 sequences were from children delivered vaginally and 172 were from children delivered by caesarean section. This library has been deposited in the GenBank database (accession no. EF063741 to EF064130).
AIBIMM.
The sequences were transformed into multimer frequencies (n = 5) by the in-house-developed computer program PhyloMode (www.matforsk.no/web/sampro.nsf/downloadE/Microbial_community). The transformation was based on sliding a window of 5 nucleotides along a DNA sequence and counting the frequencies of the different multimers encountered. The sizes of the multimer windows were chosen as tradeoffs between detecting phylogenetic signals (homologous multimer equalities), avoiding base composition biases due to nonhomologous multimer equalities, and the requirements for computer operation time (20). The multimer frequency data were compressed using principal component analysis (PCA) as previously described for the alignment-independent bilinear multivariate modeling (AIBIMM) approach (20). The phylogenetic content in the PCA was evaluated by cross-validation, while the potential presence of chimerical sequences was determined empirically by conflicting multimer loadings.
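The sliding-window transformation described above can be sketched in a few lines. This is a minimal illustration, not the PhyloMode implementation: the function name `multimer_frequencies` is hypothetical, and skipping windows that contain ambiguous bases is an assumption of this sketch, since the text does not say how such positions are handled.

```python
from collections import Counter


def multimer_frequencies(sequence, n=5):
    """Count overlapping n-mers by sliding a window one base at a time."""
    seq = sequence.upper()
    counts = Counter()
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        # Assumption of this sketch: ignore windows with non-ACGT symbols.
        if set(window) <= set("ACGT"):
            counts[window] += 1
    return counts


# A 10-nucleotide toy sequence yields 10 - 5 + 1 = 6 overlapping pentamers.
example = multimer_frequencies("ACGTACGTAC")
```

Each sequence then becomes one row of a taxa-by-multimer count matrix (up to 4^5 = 1,024 columns for n = 5), which is the kind of table that a PCA can compress.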
AIBIMM is related to the Tetra approach previously published by Teeling et al. (23). The main difference, however, is that Tetra is designed for the detection of skewed tetranucleotide distribution in whole genomes, while AIBIMM is designed to detect phylogenetic signals in single genes (20).
A special consideration when using AIBIMM is that the sequences should have approximately the same starting and ending points. If the sequences have different lengths, then this should be corrected for by weighting using the "normalize data" option in PhyloMode, so that the weighted numbers of multimers are equal for all taxa. Clustering in the PCA plot indicates close relatedness between the taxa if the residual variance is low, while if the residual variance is high, clustering could be because the taxa are not separated within the model. Deep branches can also be difficult to resolve since perfect matches for the entire multimers are required for a phylogenetic signal. A detailed description for the phylogenetic interpretation of the AIBIMM data is given by Rudi et al. (20).
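One way to picture the length correction is to rescale each taxon's multimer counts to a common total. How the "normalize data" option in PhyloMode does this internally is not stated here, so the scheme below is an assumption; the name `normalize_counts` and the parameter `target_total` are hypothetical.

```python
def normalize_counts(counts, target_total=1.0):
    """Rescale multimer counts so that every taxon has the same weighted total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a sequence with no counted multimers")
    scale = target_total / total
    return {multimer: value * scale for multimer, value in counts.items()}


# Two toy taxa with different numbers of counted multimers (6 versus 3).
long_taxon = {"ACGTA": 4, "CGTAC": 2}
short_taxon = {"ACGTA": 2, "CGTAC": 1}
norm_long = normalize_counts(long_taxon)
norm_short = normalize_counts(short_taxon)
```

After rescaling, the two toy taxa have identical profiles even though one sequence contributed twice as many windows as the other.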
MANOVA.
Variance analyses were performed directly on the multimer frequency data by using the 50-50 MANOVA software (11; www.matforsk.no/ola). Classical MANOVA tests perform poorly in cases with several highly correlated responses, and the tests collapse when the number of responses exceeds the number of observations (which is the case for the multimer data). In 50-50 MANOVA, the dimensionality of the data is reduced by using principal component decompositions and the final tests are still based on the classical test statistics and their distributions (11).
Multivariate regression.
The covariance between the Y and X matrices (sample information and multimer frequencies, respectively) was determined by using partial least-squares (PLS) regression. Briefly, PLS regression models both the X and the Y matrices simultaneously to find the latent variables in X that best predict the latent variables in Y. The PLS regression analyses were performed using the multivariate statistics software package Unscrambler (CAMO Technologies, Inc., Woodbridge, NJ). The calibrated model was validated using random cross-validation, in which a randomly chosen 5% of the samples were left out during validation and the process was repeated 20 times (for details, see the Unscrambler user manual).
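The validation scheme can be sketched as follows. Because 20 rounds of 5% add up to the whole data set, this sketch assumes that the 20 held-out segments partition the samples; the exact procedure used by Unscrambler is not described here, and the function name `random_cv_segments` is hypothetical.

```python
import random


def random_cv_segments(n_samples, n_segments=20, seed=0):
    """Yield (training, held_out) index lists for random cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so segment sizes differ by at most one.
    segments = [indices[s::n_segments] for s in range(n_segments)]
    for held_out in segments:
        held = set(held_out)
        training = [i for i in range(n_samples) if i not in held]
        yield training, sorted(held_out)


# With 100 samples, each of the 20 rounds holds out 5 samples (5%).
splits = list(random_cv_segments(100))
```

In each round a model would be fitted on the training indices and used to predict the held-out samples; the pooled prediction errors give the cross-validated performance.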
fLAND.
Our calculations with respect to frequency landscape distribution (fLAND) analyses were performed using MATLAB (MathWorks, Natick, MA). We have also developed a computer program for making fLAND analyses more easily accessible to microbiologists. The program can be downloaded from www.matforsk.no/fLAND. The download includes a user manual, a version of the program that depends on MATLAB, and a stand-alone version that does not require MATLAB.
In the work presented here, fLANDs were obtained by counting the number of taxa within each 0.5-by-0.5 interval for the first two principal components (PCs) in a global AIBIMM model (20). The counts were then transformed to represent relative frequencies. The relative frequency for each interval, fLANDij, is given by the following equation:
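$$\mathrm{fLAND}_{ij} = \frac{n_{ij}}{\sum_{i'}\sum_{j'} n_{i'j'}}$$

where $n_{ij}$ denotes the number of taxa counted in interval $ij$ and the denominator is the total number of taxa.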
The relative frequency of samples belonging to different categories within each interval was obtained by counting the taxa belonging to each category and dividing this number by the total number of taxa assigned to the same category, using the formula described above. To find the difference fLANDdiffij among all categories from k = 1 to K in one specific interval ij, we summed the squared differences between the relative frequency for each category and the average frequency for all categories. We used the following formula:
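$$\mathrm{fLANDdiff}_{ij} = \sum_{k=1}^{K} \left( \mathrm{fLAND}_{ijk} - \overline{\mathrm{fLAND}}_{ij} \right)^{2}, \qquad \overline{\mathrm{fLAND}}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{fLAND}_{ijk}$$

where $\mathrm{fLAND}_{ijk}$ is the relative frequency of category $k$ in interval $ij$.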
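The counting and differencing steps described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' MATLAB code: the function names are hypothetical, and the choice to index each interval by the floor of its coordinates divided by the interval width is an assumption of this sketch.

```python
import math
from collections import Counter


def interval_of(pc1, pc2, width=0.5):
    """Integer index of the width-by-width interval that contains a point."""
    return (math.floor(pc1 / width), math.floor(pc2 / width))


def fland(points, width=0.5):
    """Relative frequency of taxa per interval for one set of (PC1, PC2) points."""
    counts = Counter(interval_of(x, y, width) for x, y in points)
    total = sum(counts.values())
    return {cell: c / total for cell, c in counts.items()}


def fland_diff(points_by_category, width=0.5):
    """Sum over categories of squared deviations from the mean frequency."""
    per_category = {k: fland(p, width) for k, p in points_by_category.items()}
    cells = set().union(*per_category.values())
    n_categories = len(per_category)
    diff = {}
    for cell in cells:
        freqs = [per_category[k].get(cell, 0.0) for k in per_category]
        mean = sum(freqs) / n_categories
        diff[cell] = sum((f - mean) ** 2 for f in freqs)
    return diff


# Toy data: two categories, four taxa each, spread over two intervals.
toy = {
    "A": [(0.1, 0.1), (0.2, 0.3), (0.7, 0.1), (0.6, 0.2)],
    "B": [(0.1, 0.2), (0.8, 0.1), (0.9, 0.3), (0.6, 0.4)],
}
landscape_a = fland(toy["A"])
differences = fland_diff(toy)
```

Because each category is divided by its own total, libraries of different sizes are put on the same scale before they are compared.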
FIG. 1. Global fLAND distribution (A) and the phylogenetic position of the major bacterial groups (B) in human adult microbiota. (A) A 0.5-by-0.5-interval density distribution based on AIBIMM analyses for the 11,831 taxa from the human adult clone library is shown. The major bacterial groups identified are marked. The color coding represents the natural logarithm of the densities within each segment. (B) Complete linkage dendrogram based on Euclidean distances for the first three principal components (explaining 52% of the variance). The dendrogram is based on a single taxon from each of the 579 segments. The dendrogram is collapsed to represent the same group structure as for the density distribution plot. Groups with low explained variance are red, while groups with high explained variance are black.
FIG. 2. Global fLAND distribution (A) and differences with respect to age (B) and mode of delivery (C) for human infant microbiota. (A) A 0.5-by-0.5-interval density distribution based on AIBIMM analyses for the 390 taxa from the human infant clone library is shown. The major bacterial groups identified are marked. The color coding represents the natural logarithm of the densities within each segment. Shown is a natural logarithm for the ratio between the libraries with respect to age (B) and mode of delivery (C). Segments with significant differences (P < 0.05) are marked. The following color coding was used: for age (B), green indicates an age of less than 1 month and red indicates an age of 4 months; for mode of delivery (C), green indicates delivery by caesarean section and red indicates vaginal delivery.
FIG. 3. fLAND intervals with significant (P < 0.01) differences between persons A, B, and C. The relative distribution (corrected for the differences in library size) is shown. The significance threshold was determined by using permutation testing as described in Materials and Methods. The coordinates for the interval's lower left edge are shown in parentheses as (PC1, PC2). The total number of taxa for each interval is also shown.
FIG. 4. fLAND intervals with significant (P < 0.01) differences for each person (A, B, or C) with respect to the seven sample types analyzed. The relative distribution (corrected for the differences in library size) is shown with respect to each person (A, B, or C). The significance threshold was determined by using permutation testing as described in Materials and Methods. The coordinates for the interval are shown in parentheses as (PC1, PC2). The total number of taxa for each interval is also shown.
FIG. 5. Cross-validated PLS regression coefficients for the respective libraries for each subject. The PLS regression analysis was carried out as described in Materials and Methods. Abbreviations: AC, ascending colon; C, cecum; DC, descending colon; R, rectum; SC, sigmoid colon; TC, transverse colon.
TABLE 1. Explained variance and significance by 50-50 MANOVA for the infant clone library
A lack of rigorous statistical testing has been a major obstacle in identifying biological phenomena in microbial communities. Two recent reports have addressed the issue of statistical testing within large 16S rRNA gene clone libraries from intestinal samples using alignment-based microbial community comparisons (6, 12). These reports illustrate the complexity of the statistical analyses for the phylogenetically reconstructed microbial community data that are based on relative pair-wise comparisons. Absolute distances are much easier to compare than relative distances are. Basing the microbial comparisons on an absolute coordinate space would therefore simplify the comparative analyses.
Comparison of tools for microbial community analyses.
The human adult clone library has already been extensively investigated using the DOTUR program for phylotype determinations and LIBSHUFF for microbial community comparisons (6). We used these data in comparison with our alignment-independent method. Because the RDP II database has relatively good coverage of the bacteria found in the infant intestine, we used the infant library for the RDP II comparison.
The DOTUR program identified 395 bacterial phylotypes from aligned sequence data, while we identified 579 intervals with one or more taxa in our fLAND analysis. This illustrates that our density distribution analysis gives a separation that is slightly higher than that of the phylotype determinations. The LIBSHUFF analyses showed no significant differences between mucosal libraries from the same subject, with two exceptions: the library from the ascending colon of subject A was a subset of the other libraries, and the library from the descending colon was a subset of that from the ascending colon for subject B (6). A notable difference between the fLAND and the LIBSHUFF analyses was that bacteria within the fLAND segment (73, 31) showed a significant overrepresentation (P < 0.01) in the ascending colon for subject A compared to the rest of the mucosal sites (Fig. 4). This is in contrast to the conclusion that the microbiota in the ascending colon is a subset of the other libraries.
The RDP II classifier cannot be used for libraries with a relatively large portion of bacteria that are not well characterized (see the user recommendation at the RDP II homepage, rdp.cme.msu.edu). This is certainly the case for the human adult library. For the infant library, however, the RDP II classifier gave only 1.3% unassigned strains in the bacterial domain. We therefore used the infant library for evaluating RDP II Library Compare. The major difference between fLAND analyses and RDP II Library Compare is that Library Compare did not separate the two dominating segments of Bifidobacterium; consequently, it did not detect the major structures in our data related to these groups.
A summary of commonly used approaches for microbial community comparisons is presented in Table 2. Most of the microbial community comparison tools available are based on comparing phylogenetically reconstructed data based on DNA sequence alignments using a relative distance measure (13, 21, 22). The available alignment-independent tools, on the other hand, are generally based on predefined models for known categorical groups (4). Our fLAND analysis is different from the other approaches with respect to the combination of alignment independence, phylogenetic description, and the use of absolute distances.
TABLE 2. Properties of microbial community comparison tools
We thank P. B. Eckburg and D. A. Relman for providing the human microbiota data set.
Published ahead of print on 2 March 2007.