| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Taxon Biosciences, Inc., 3152 Paradise Drive, Tiburon, California 94920,1 Department of Molecular and Cell Biology, University of California, 16 Barker Hall, Berkeley, California 94720-3202,2 The Institute for Genomic Research, J. Craig Venter Institute, 9712 Medical Center Drive, Rockville, Maryland 20850,3 Howard University, Department of Biology, 415 College Avenue, NW, Washington, D.C. 200594
Received 20 December 2006/ Accepted 20 May 2007
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
Current molecular microbial surveying methods, such as terminal restriction fragment length polymorphism (28), automated ribosomal intergenic spacer analysis (17), and denaturing gradient gel electrophoresis (30), offer fairly quick and inexpensive means to detect a few hundred of the most-abundant taxa in a sample. Comparison of the profiles created with these methods has been a valuable approach for addressing ecological questions about microbial communities, especially in studies where large numbers of samples are involved (reviewed in reference 24).
Another approach for surveying microbial communities has been through DNA sequencing of 16S rRNA gene clones (35). Although this approach provides good phylogenetic resolution, it is not the most efficient method of surveying a complex community, since the majority of the 16S rRNA gene sequence is conserved among prokaryotes (10). Thus, considerable effort is expended sequencing regions of the gene with little information content. An alternative approach, including that taken in this report, has been to focus sequencing resources on a given variable region within the 16S rRNA gene (27, 32, 33, 39). While the details of the methods that use this strategy differ, their common theme is a tradeoff of phylogenetic resolution for sequence throughput.
The need for methods that can be used for deep surveys of microbial communities is exemplified by studies showing that the majority of species are present at very low abundance (1, 5, 12, 42, 49). Moreover, the members of such communities with low abundance are disproportionately affected by environmental stress or disturbances (19, 44). Are these species with low abundance viable, and if so, do they contribute to ecosystem function in meaningful ways? To address these questions, new surveying methods are needed that can detect species with low abundance. A new method, termed SARD for serial analysis of rRNA genes, is described here that enables the detection and quantitation of rare sequences in microbial communities. In agricultural soil samples, this method indicated that, numerically, most of the DNA sequences came from prokaryotic species that were among the least abundant, revealing an unexpected dominance of rare species in the overall microbial population.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Soil sampling.
Soil samples were collected from Union Island and Victoria Island in the Sacramento River delta of California in October, 2004. These sites corresponded to locations directly above potential natural gas accumulations identified by three-dimensional seismic surveys. Wells were drilled at these sites within 4 weeks after the samples were collected, and the structures at both locations were found to contain noncommercial levels of hydrocarbons.
The sampling locations were determined with a wide area augmentation system-enabled Garmin GPS V (Olathe, KS) handheld unit. The coordinate measurements were averaged for approximately 10 min to increase accuracy. At each location, a hole was drilled to a depth of 30 cm with a Stihl BT45 gas-powered drill fitted with a 3.5-cm-diameter ship auger bit (no. 47422; Irwin). Core samples were collected at the bottom of the hole with a 2.5-cm-diameter by 30-cm AMS soil core sampler tool (American Falls, ID). The core sampler had a replaceable tip that was changed between samples. The tool was modified by adding a 1-in.-diameter by 8-in. (2.5 by 20 cm) plastic plug such that 1-in. by 4-in. (2.5 by 10 cm) core samples were collected in a plastic liner. Following collection, the core samples were capped and maintained on ice until they arrived at the laboratory for DNA extraction.
Soil DNA extractions.
Soil core samples were extruded from the sampling sleeves and were dissected to remove the outer portions that had been in contact with the liner or sampler tip. Genomic DNA was extracted from about 0.5 g soil from each sample with a FastDNA spin kit for soil in a FastPrep FP120 instrument (MP Biomedicals, Irvine, CA), except that the components for the lysis matrix were obtained from CeroGlass (Columbia, TN). The bead-beating matrix consisted of one 4-mm glass bead (GSM-40), 0.75 g 1.4- to 1.6-mm zirconium silicate beads (SLZ-15), and 1.0 g 0.07- to 0.125-mm zirconium silicate beads (BSLZ-1). The beads were prepared by acid washing with HCl and HNO3 followed by neutralization with extensive water washings and autoclaving. Prechilled samples were disrupted by bead beating in a FastPrep instrument at 6.5 m/s for 45 s. Subsequent steps were performed according to the kit instructions.
SARD profiling.
The following oligonucleotides were used in SARD profiles: TX-9, 5'-[BioTEG]-GGATTAGAWACCCBGGTAGTC-3'; 1391R, 5'-GACGGGCRGTGWGTRCA-3' (45); TX12, 5'-[Phos]CTCCAGGTCTACATCCTAGTCAGGAC[23-ddC-Q]-3'; TX13, 5'-ATAGGTCCTGACTAGGATGTAGACCTGGAG-3'; TX14, 5'-[Phos]CTCCAGACTAGCATCCGCTGACTTGA[23-ddC-Q]-3'; TX15, 5'-AATGTCAAGTCAGCGGATGCTAGTCTGGAG-3'; TX131, 5'-[BioTEG]-GTACGATTACTCGATAGTCACGGTCCTGACTAGGATGTAGACC-3'; and TX141, [BioTEG]-GGATATACTCAGGTTGCAACGGTCAAGTCAGCGGATGCTAGTC-3'.
Primers TX9 and 1391R were used to PCR amplify a 600-bp region of the 16S rRNA gene from 600 ng soil genomic DNA in a 400-µl reaction mixture volume as outlined in Fig. 1. The numbers of PCR-amplifiable 16S rRNA genes with these primers in 600 ng of soil genomic DNA were determined by quantitative PCR to be 2.8 x 107, 2.2 x 107, 7.3 x 107, and 2.4 x 107 for samples WP43, WP45, Pol-NE, and Pol-W, respectively.
|
80-bp restriction digest band, including fragments corresponding to 40 to 120 bp in size, was purified on a 10% polyacrylamide gel electrophoresis-Tris-borate-EDTA (PAGE-TBE) gel. The acrylamide pieces were fragmented by passage from a 0.5-ml microcentrifuge tube, with a hole pierced in the bottom with a 21-gauge needle, into a 1.5-ml microcentrifuge tube by centrifugation at 13,000 x g for 2 min. The excised DNA fragments were extracted from the acrylamide gel by diffusion in TEN buffer (10 mM Tris HCl, pH 8, 1 mM EDTA, 50 mM NaCl) at 55°C for 20 min. The buffer-acrylamide mixture was transferred to a Spin-X tube (no. 8163; Corning Inc., Corning, NY). The buffer containing DNA was separated from the acrylamide by centrifugation.
The
80-bp AluI fragments were immobilized by binding to Dynal M-280 streptavidin beads (Invitrogen, Carlsbad, CA) in 1x BW buffer (5 mM Tris HCl, pH 7.5, 0.5 mM EDTA, 1.0 M NaCl). The beads were washed twice in 1x BW buffer and twice in wash buffer (10 mM Tris HCl, pH 8, 10 mM MgSO4, 50 mM NaCl). During the final wash, the beads were split into two pools and the buffer was removed while the beads were positioned next to the magnetic tube holder. Each bead pool was resuspended in a ligation mixture containing T4 DNA ligase, buffer, and either of the double-strand adapters TX12/13 and TX14/15. The ligations were incubated overnight at 16°C, followed by heat inactivation at 65°C for 15 min. The ligations were washed twice in 1x BW buffer and twice in wash buffer.
SARD tags were cleaved from the streptavidin beads by incubation with 2.5 U BpmI for 2 h at 37°C. Released tags were transferred to a new tube and heat inactivated at 65°C for 20 min. The 3' overhangs were removed by incubation with 2.5 U Klenow fragment in the presence of 200 µM deoxynucloside triphosphates at 25°C for 30 min. The split pools were recombined and heat inactivated at 70°C for 30 min. The SARD tag adapters were ligated together to form ditags by adding rATP to 1 mM and 200 U of T4 DNA ligase and incubating overnight at 16°C.
The ditags were amplified in two steps, utilizing the primers TX131 and TX141, to prepare an adequate amount for concatenation. Prior to each step, the number of cycles that would lead to the greatest amount of 120-bp amplicon product without the appearance of artifact bands or smears was determined empirically in small-scale (15 µl) reaction mixtures. PCR conditions for the first step consisted of 94°C for 5 min followed by 10 cycles of 94°C for 20 s, 55°C for 30 s, and 72°C for 30 s followed by 6 to 10 cycles (depending upon the sample) of 94°C for 20 s and 72°C for 45 s. Ditags were amplified with AmpliTaq Gold DNA polymerase in three 100-µl reaction mixtures. The amplified ditags were purified on 10% PAGE-TBE gels and served as the template for the second amplification. PCR conditions for the second step included 94°C for 2 min followed by 6 to 10 cycles of 94°C for 30 s and 72°C for 1 min. The ditags were amplified with Taq DNA polymerase (New England Biolabs) in 96 100-µl reaction mixture flasks (9.6 ml total) in the second large-scale PCR step. The amplified ditags were purified on 10% PAGE-TBE gels.
The PAGE-purified large-scale ditag preparations were digested with 240 U of FokI for 4 h at 37°C to release the adapters. The 28-bp ditags were purified by two rounds of PAGE-TBE purification, first on a 12% gel and then on a 16% gel. DNA was recovered from the acrylamide gel by fragmentation and diffusion as described above, phenol-chloroform extracted, and ethanol precipitated.
The purified 28-bp ditags were concatenated by resuspending ethanol precipitates in a ligation mixture including 1,000 U of T4 DNA ligase. The ligations were incubated overnight at 16°C. The reaction mixtures were heated to 65°C for 15 min, followed by incubation on ice for 15 min as described previously (23). The ditag concatemers were ethanol precipitated and purified on 8% PAGE-TBE gels. DNA fragments corresponding to 300 to 700 bp were excised and purified by fragmentation/diffusion followed by ethanol precipitation. The resulting ditag concatemers were ligated into the HindIII site of pUC19 and transformed into chemically competent Escherichia coli DH10B cells (Invitrogen, Carlsbad, CA).
Bacterial transformants were plated on medium containing 5-bromo-4-chloro-3-indolyl-ß-D-galactopyranoside for blue/white screening to identify cells harboring plasmids with inserts. Cells from white colonies were further screened for clones with large inserts by colony PCR utilizing M13 forward and reverse primers. Amplicons were resolved on 1.5% agarose gels. Clones with insert sizes of 300 to 600 bp (20 to 40 SARD tags) were selected for sequencing.
SARD clone DNA sequencing.
Sequencing of the SARD clones was performed by using a modification of the BigDye terminator method used for microbial genome sequencing at The Institute for Genomic Research (18, 40). Briefly, the sequencing reaction used the M13 forward primer only, and a sequencing buffer containing sucrose and betaine (final concentrations, 80 mM Tris HCl, pH 9.0, 2 mM MgCl2, 2% sucrose, 0.75 M betaine) was substituted for the original sequencing buffer. Finally, the following cycling conditions were used: 98°C for 2 min, 98°C for 10 min to 50°C for 5 min to 70°C for 4 min (40 cycles).
SARD profile analysis.
A software program was written to extract the SARD tag sequences from the raw sequence files and convert the data into a tab-delimited file that could be further manipulated in a spreadsheet program such as Microsoft Excel. Any SARD tags that were too short or that were greater than 1 nucleotide too long were discarded. Tags that were 1 nucleotide too long were trimmed to the correct length and included in the analyses. These tags likely resulted from incomplete removal of the 3' overhang by the Klenow enzyme prior to the head-to-head tag ligation to form ditags. A small number of tags (<0.4%) that were either presumptive cloning artifacts derived from TX131/TX141 or were from a downstream, conserved AluI site at position 1067 (E. coli numbering) of the 16S rRNA gene were filtered out of the tag set. The program reported additional numerical features of the profiles, including, for each tag, the minimum number of nucleotide position changes necessary to equal another tag sequence. SARD tag richness and diversity estimates were made with the EstimateS software package, version 7.5 (9). To construct the rarefaction plots and Chao1 estimates, the SARD data was randomized a total of 50 times.
16S rRNA gene libraries.
Approximately 600 bp of the bacterial 16S rRNA genes were PCR amplified with the same primers used for SARD profiling (TX9 and 1391R), except that neither primer had a biotin modification and both primers possessed 5' phosphate groups to facilitate cloning. The number of PCR cycles was kept to 20 to 22 cycles to minimize amplification-dependent biases. Amplicons were purified on agarose gels with a QIAquick gel purification kit (QIAGEN, Valencia, CA).
PCR products were ligated nondirectionally into the SmaI site of pUC19 and were transformed into E. coli DH10B cells (Invitrogen, Carlsbad, CA). Transformants were plated onto LB agar-ampicillin plates containing 5-bromo-4-chloro-3-indolyl-ß-D-galactopyranoside for blue/white screening. White colonies were picked and grown in LB-ampicillin medium. Plasmid template DNA was prepared by a modified alkaline lysis method. The 16S rRNA gene nucleotide sequences of the clone inserts were determined by using BigDye terminators (Applied Biosystems, Foster City, CA) and 3.2 pmol of M13F sequencing primer as described previously (20). Sequences were analyzed on ABI 3730xl sequencers (Applied Biosystems, Foster City, CA) and trimmed to remove the vector sequence.
Phylogenetic analysis.
16S rRNA gene sequences were aligned with ClustalX, version 1.81 (41). Unrooted phylogenetic trees were created with the PHYLIP package of programs, version 3.63 (16), including DNADIST and NEIGHBOR, where the output file of one program served as the input for the next program. The evolutionary distances were computed with DNADIST using the Kimura two-parameter model. The trees were edited using RETREE. The bootstrap values for the nodes were determined from 1,000 iterative analyses with the SEQBOOT, DNADIST, NEIGHBOR, and CONSENSE programs.
Nucleotide sequence accession numbers.
Partial 16S rRNA gene sequences were deposited in the GenBank database (accession nos. EF600115 to EF600228 [Pol-NE], EF600229 to EF600332 [Pol-W], EF600333 to EF600441 [WP43], and EF600442 to EF600552 [WP45]).
SARD data accession numbers.
SARD tag sequences and abundance data were deposited in the NCBI Gene Expression Omnibus (GEO) site located at http://www.ncbi.nlm.nih.gov/geo (accession no. GSE8119).
| RESULTS |
|---|
|
|
|---|
SARD was designed to recover a 16-bp sequence tag from the fifth variable region (V5) of the bacterial 16S rRNA gene (4). The method begins by PCR amplification of a subregion of the 16S rRNA gene at positions 790 to 1391 (Fig. 1). The location where the SARD tag is recovered from the 16S rRNA gene is defined by the first occurrence of an AluI restriction site downstream of the forward primer (TX9). The forward primer is complementary to a conserved DNA sequence immediately upstream of the V5 region. In most cases, the first AluI site downstream of TX9 occurs at position 860 (E. coli numbering), which is located at the downstream junction of the V5 region and the subsequent conserved region (10). As described below, the AluI site was found to be conserved at this location (or within the V5 region) in 70 to 80 percent of 16S rRNA gene sequences examined.
The forward primer, TX9, was biotinylated to facilitate binding to a solid support following digestion with AluI. However, prior to binding to the streptavidin beads, the AluI-digested PCR products were PAGE purified to recover only those fragments where the AluI site was located within or adjacent to the 5' region. These fragments were then bound to a streptavidin solid support, which resulted in the V5 variable sequence being exposed at the free end of the bound DNA. Double-stranded DNA adapters that include recognition sites for the type IIS restriction enzyme BpmI were ligated to the bound DNA. Type IIS restriction enzymes cleave DNA at a distance from their recognition sites. Thus, cleavage of these bound DNA fragments with BpmI released the adapter sequences, including 12 bp of variable sequence from the V5 region.
In a series of subsequent enzymatic modifications, the 3' overhangs of the released sequence tags were removed with the Klenow fragment, and the resulting blunt end tags were ligated together to form ditags. Following amplification and PAGE purification, the adapter sequences were released from the ditags by cleavage with a second type IIS restriction enzyme, FokI, whose recognition site was also located within the adapter sequences. FokI digestion resulted in 24-bp ditags that possessed 4-bp AGCT 5' overhangs on each end. These ditags were purified and ligated together to form concatemers. These concatemers were cloned into the HindIII site of pUC19.
The SARD tag concatemers were free from adapter sequences, which increased the throughput of tags identified per sequencing run. The ditags that comprise the concatemers were themselves the result of head-to-head ligation of individual tags. Therefore, the SARD tags were arranged within the concatemers on alternating DNA strands separated by an AluI site every 28 bp. Each tag consisted of 12 bp of variable sequence plus the 4-bp AluI site. A software program was written to identify and extract SARD tags from the raw sequence data.
A total of 37,008 SARD tags were identified from the four soil sample genomic DNA preparations (Table 1). These tags comprised 3,127 unique tag sequences. The number of times a tag sequence is present in a sample is expected to reflect the abundance of the corresponding 16S rRNA gene sequence in the sample community. Most unique SARD tags were observed only once (singletons) or twice (doubletons) in each sample, indicating extensive tag richness in these samples and that the profiles had captured only a fraction of the tag diversity present.
|
A total of five SARD profiles were conducted from soil genomic DNA extracted from four samples. Two profiles were conducted from the same genomic DNA, prepared from the sample Pol-W, to assess the reproducibility of SARD (Fig. 2). As expected, the least-abundant tags (seen the least number of times) showed the most variability between the duplicate profiles. Since the data are plotted on a log-log scale, only those tags that were present in both SARD profiles were included. The Pearson correlation from this comparison (r2 > 0.99) and the plots on a linear scale obtained from the entire data set (data not shown) were similarly robust.
|
One approach for estimating the number of tag artifacts is through published error rate determinations. Taq polymerase misincorporates nucleotides at a rate of 8 x 106/bp/duplication (8). Therefore, under the conditions used in this study, <180 of the tags identified in this study were expected to be artifacts as a result of Taq replication errors. DNA sequencing errors, which can be discriminated during automated sequencing by using Phred scores (13, 14), were expected to have resulted in <550 tag artifacts. Taken together, the total number of tag artifacts was expected to comprise <730 (2%) of the
37,000 tags identified.
Community structure assessment.
Rank abundance plots were used to assess the SARD tag diversity of the four agricultural soil samples (Fig. 3). In this and subsequent experiments, the data from the two SARD profiles from Pol-W were combined into a single data set. The SARD profiles were found to be comprised of a large number of rare sequence tags and a small number of abundant tags. The Simpson's reciprocal diversity index values were similar for three of the four samples. The lower diversity value for the fourth sample, Pol-W, could be attributed to the presence of two abundant SARD tag sequences that were each present at levels of 16 and 17 percent of the sample. The most-abundant tag found in any of the other three samples was present at 11 percent.
|
|
In the case of the Pol-NE and Pol-W samples, the accumulation plots and Chao1 estimates from the two samples are essentially indistinguishable from one another. In contrast to the observed richness in the accumulation plots, the Chao1 richness estimates of WP43 were significantly higher than that of the WP45 sample. The difference between these samples did not become significant at the 95% confidence level until the sample size had reached about 5,000 tags. Despite similar accumulation plots in the four samples, the Chao1 richness estimates varied by nearly a factor of two between the alfalfa field (Pol-NE and Pol-W) and asparagus field (WP43 and WP45) sampling locations (Fig. 5). Importantly, the richness estimates appeared sample-size dependent and did not reach a plateau. Therefore, additional sampling would be expected to lead to higher richness estimates.
|
Two factors can limit the resolution of SARD profiles: (i) different 16S rRNA genes may share the same SARD tag sequence, and (ii), some 16S rRNA genes may not produce an informative SARD tag. To understand how these limitations affect SARD profiles, a phylogenetic tree was constructed from the WP45 16S rRNA clones and was annotated with the corresponding SARD tags (Fig. 6). Sixteen of the 53 SARD tags identified in the 16S rRNA clones occurred in more than one clone sequence. In these cases, where different 16S rRNA clones possessed the same SARD tag sequence, the 16S gene sequences were significantly similar to one another and were members of the same bacterial division, except in two cases. These were PT0015 and PT0023. The PT0015 tag was found in a Betaproteobacteria clone and an Acidobacteria clone. The PT0023 tag was likewise found in clone sequences from two divisions, Chloroflexi and WS3. Taken together, the average distance between 16S rRNA clones identified in this study that encoded the same SARD tag was 3.3%.
|
Comparison of the SARD data with the 16S rRNA clone sequences also revealed that 13 of the 14 bacterial divisions represented by 16S rRNA clones had corresponding SARD tag sequences (Fig. 6). The exception was the Planctomycetes division, where three 16S rRNA clones that did not encode an informative SARD tag were identified in this study by DNA sequencing. The Gemmimonas division also appeared underrepresented in the SARD profile, since only 1 of the 10 nonidentical 16S clone sequences possessed an informative SARD tag. To more objectively assess phylogenetic bias in the SARD method, we examined a database of 5,100 16S rRNA gene sequences (21) to determine whether different bacterial divisions would be equally represented in a SARD profile (Fig. 7). In addition to Planctomycetes and Gemmimonas, less than 50 percent of the 16S sequences from five other divisions, including Haloanaerobiales, Fibrobacteres, Bacteroidetes, Synergistes, and Deinococcus-Thermus, were predicted to be identified in a SARD profile and these divisions would therefore be underrepresented. The majority of 16S rRNA sequences from 28 other bacterial phylum-level divisions were predicted to be well represented in a SARD profile.
|
| DISCUSSION |
|---|
|
|
|---|
SARD is based on concatenating short DNA sequence tags and is similar to a method devised to measure gene expression, termed SAGE, for serial analysis of gene expression (46-48). One advantage that SAGE has over other methods of gene expression analysis, such as microarrays or reporter matrices, is that no prior DNA sequence information for the genome under study is necessary. This attribute is a critical property of any microbial community surveying method, since only a small fraction of bacterial DNA sequences are known (37). SAGE per se would not work as a microbial community profiling tool, since the method relies on the fact that essentially none of the mRNA transcripts being surveyed share sequence similarities. Thus, the location of a SAGE tag within a given mRNA is not important. For the purpose of surveying a common gene from a community of genomes, a method necessarily must target a variable region. SARD accomplishes this point by targeting a conserved AluI restriction site located immediately adjacent to the V5 variable region of the 16S gene (Fig. 1).
Methods similar to SARD that rely on concatenation of variable regions of the 16S rRNA have been developed. One of these methods, termed SARST, for serial analysis of ribosomal sequence tags (RSTs), uses PCR to amplify and concatenate the V1 region of the 16S gene (33). SARST produces variable region RSTs that are from 17 to 55 bp in length. In a recent application of SARST, nearly 13,000 RSTs were identified from arctic tundra and boreal forest samples (32). A variation on SARST that targets the V6 region has also been reported (27). In each of these methods, there is a tradeoff between RST length and phylogenetic resolution. Relative to SARST and its related methods, SARD provides less phylogenetic resolution (shorter tags) for increased throughput (more tags/sequence run) to create deeper surveys.
SARD may be further distinguished from these other methods in that SARD tags are perhaps best utilized as barcodes to facilitate quantitative analysis of the abundance and distribution of microbial taxa rather than to make phylogenetic inferences between tags that possess various degrees of sequence identity. In order to realize the most value from the longer tag sequences recovered by SARST and related methods, alignments need to be made to accommodate insertions and deletions in related sequences. While this step enables the grouping of related RSTs into operational taxonomic units to lessen the effects of artifacts, the alignment of large sets of sequence data is computationally intensive and currently limited by available bioinformatics tools. The much larger molecular survey data sets that are now possible with new sequencing technologies (39) pose a significant challenge to software normally employed to handle DNA sequence data. SARD tags, by contrast, are treated as unique sequence identifiers, resulting in much smaller-sized data files that can be manipulated and analyzed with common spreadsheet software. The presence of SARD tag artifacts in the resulting SARD profiles must be kept in mind, however, when interpreting the results.
Phylogenetic resolution of a SARD tag.
A SARD tag is comprised of 12 bp of variable sequence and 4 bp of the AluI restriction site (AGCT). The resolution of a 16-bp tag corresponds to a single base change, or 6.3% (1/16). For example, the smallest difference that could be detected when comparing two 16-mer sequences was
6%. As a broad approximation, when comparing 16S rRNA gene sequences, distance values of 3%, 5%, and 10% correspond to species-, genus-, and family-level phylogenetic resolutions, respectively (37). From a practical standpoint, the phylogenetic resolution of a SARD tag may be somewhat greater. For example, 75% (12/16) of a SARD tag is variable sequence, whereas variable regions comprise only 38% (591/1,542) of the E. coli 16S rRNA gene (10). A SARD tag can then be considered to possess more information content per unit length than a longer piece of DNA having a greater proportion of conserved sequence. Also, the information content of the 16S gene is partially redundant because of the intramolecular base pairing that occurs to produce the rRNA secondary structure. The targeted location of a SARD tag within the 16S rRNA gene is on one side of a hairpin structure that forms the V5 region, and thus, the SARD tag itself has no redundant sequence.
The notion that a SARD tag contains more information density than a longer piece of DNA that includes conserved DNA and inverted repeats was supported by the observation that nonidentical 16S rRNA clone sequences from this study that encoded the same SARD tag were, on average, only 3.3% different from one another. Therefore, at least operationally, the phylogenetic resolution of a SARD tag corresponds to somewhere between the species and genus levels.
SARD limitations.
As with all microbial surveying methods, SARD has inherent limitations and biases. Sources of bias that are intrinsic to most microbial surveying projects include the choice of the sample DNA extraction method and the choice of "universal" primers for the initial PCR amplification of the 16S rRNA gene. In addition, the differential amplification of templates, or PCR bias, can lead to distortions of the relative abundance of DNA sequences in a sample. This type of bias occurs in a small minority of 16S rRNA templates and can be lessened by decreasing the amplification cycles and increasing the template concentrations (2, 15, 29, 36), as was done in this study.
A potentially more significant bias arises from variability in the location of the first AluI site downstream of the TX9 primer. For example, the position within the 16S gene where a SARD tag will be recovered is determined by the presence of a conserved AluI restriction site within or adjacent to the V5 region. This site was found to be present in about 70 percent of sequences examined. 16S rRNA gene sequences that do not have an AluI site in this region will be excluded from a SARD profile. Some bacterial divisions will be disproportionably affected by this bias, which will lead to their being underrepresented in a SARD profile (Fig. 7). This bias highlights the need for conducting more than one type of survey method. Sequencing 16S rRNA gene clones is complementary to SARD in that the approach provides a qualitative framework of a microbial community, as well as identifying underrepresented sequences in a SARD profile.
rRNA operons are known to vary from 1 to 15 copies per genome (26). As a result, 16S-based survey data cannot be translated directly to genome equivalents or cell numbers. Nevertheless, in comparisons of 16S data from two samples to identify relative differences or ratios, the variation in rRNA operon copy number is not relevant, since it cancels out in the arithmetic. Therefore, meaningful quantitative comparisons of 16S data can be made despite rRNA operon copy number variation between species. A potentially relevant exception could occur when two different species that harbor identical 16S rRNA gene sequences (or SARD tags) and that possess different rRNA operon copy numbers are being compared in two samples. This possibility may not significantly detract from such comparative analyses, since rRNA operon copy number appears to have a phylogenetic basis (related genomes equal similar rRNA copy numbers) (25).
SARD tag abundance distribution.
The abundance class distribution of a bacterial community is an important parameter that can enable estimates to be made of certain features of the community based on relatively small samples. Such features include estimates on the amount of sampling necessary to achieve a given level of coverage, estimates of species diversity, and estimates of the total species richness present. Without knowing what type of abundance distribution a community follows, species or operational taxonomic unit richness estimates must rely on nonparametric estimators (22).
Abundance classes of bacterial communities have been suggested to follow log-normal (11, 38), power law (Zipf) (19), or other distributions (31), to name a few. The abundance class data presented here from four soil samples clearly followed a power law distribution down to a lower abundance level of 0.01%. Interestingly, the observed power law distribution model would not be consistent with a single-copy abundance class. Since the total number of PCR-amplifiable 16S genes used in the construction of the SARD libraries (determined by quantitative PCR) was
25 million, tags present in single copy would represent an abundance class of
4 x 106 percent. Such an abundance class would not fit the line in any of the plots in Fig. 4 at any tag fraction. The data therefore anticipate that either of two mutually exclusive scenarios exists: (i), the SARD tag abundance distribution for the entire community follows a power law and does not include very-low-abundance classes, or (ii), the data follow a bimodal (e.g., log normal) distribution with an inflection point that has not yet become evident with the current level of sampling. Whether the observed SARD tag distribution from these four samples translates to other taxonomic levels of resolution is not known.
SARD tag richness estimates.
Soil habitats are known to harbor significant numbers of bacterial cells and species. The classic DNA reassociation experiments by Torsvik et al. indicated that as many as 4,000 different bacterial genomes were present in a 30-g soil sample (42). More recently, a mathematical reevaluation of published soil genomic DNA reassociation data indicated that there may have been
107 prokaryotic species in a 10-g soil sample (19). Richness estimates made from sampling-based survey data, such as 16S rRNA clone libraries, have tended to be much lower and are probably the result of inadequate sampling.
In the SARD data presented here, both the accumulation curves and Chao1 richness estimates of the SARD profiles of the four soil samples were found to be sample-size dependent. Thus, additional sampling would be expected to identify more unique tags and lead to higher estimates of the total richness. Taken together with the large number of singleton tags observed, the level of sampling of these soil communities was not sufficient to estimate the total tag richness in these samples. The inadequate sampling of these profiles was significant given the size of these surveys (
104) and the level of taxonomic resolution of the SARD profiles (approximately between the genus and species level).
SARD tags do not provide a species-level resolution, and thus, one SARD tag may represent multiple different 16S sequences (species). Importantly, these sequences are usually quite closely related (Fig. 6). For some applications, this sequence consolidation aspect of SARD can be an asset. For example, to our knowledge, no soil or similarly complex bacterial community has been surveyed to completion. A method, such as SARD, that effectively reduces microbial complexity in a coherent manner and provides increased throughput may be an important tool for characterizing extraordinarily species-rich bacterial communities, such as those found in soil and sediments.
| ACKNOWLEDGMENTS |
|---|
Authors M.N.A., J.R., and D.D.-D. declare a financial interest in the subject matter as founders/stockholders of Taxon. This work was supported in part by Department of Energy Small Business Innovations Research (SBIR) grant DE-FG02-04ER84089.
| FOOTNOTES |
|---|
Published ahead of print on 25 May 2007. ![]()
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||