Previous Article | Next Article ![]()
Applied and Environmental Microbiology, May 2009, p. 3263-3270, Vol. 75, No. 10
0099-2240/09/$08.00+0 doi:10.1128/AEM.01931-08
Copyright © 2009, American Society for Microbiology. All Rights Reserved.
,
Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, North Carolina,1 Environmental Genomics Core Facility, University of South Carolina at Columbia, Columbia, South Carolina2
Received 19 August 2008/ Accepted 25 February 2009
|
|
|---|
|
|
|---|
The shorter reads produced by pyrosequencing require the choice of a particular region of the 16S rRNA gene to target for pyrosequencing as well as the choice of an algorithm to classify the taxonomy of the shorter reads. In their initial surveys of microbial diversity with pyrosequencing (12, 14, 25), Sogin and colleagues targeted the V6 variable region, in part because it is was small enough to be captured with the 100-bp reads of the pyrosequencing technology available at the time. Recently, the read length of 454 pyrosequencing machines has been increased to an average of
250 bp. This allows for more flexibility in primer design and opens up the possibility of targeting regions of the 16S rRNA gene other than V6. In recent work, Huse et al. took advantage of this new capability to compare the classifications made for the human gut microbiome with the V6 and longer V3 regions (13). Plotting the taxonomic abundance of these two sequence sets against each other yielded an excellent correlation (r2 = 0.99), suggesting that the choice of which variable region to target makes little difference. In this report, we introduce a data set examining the performance of sets of primers targeting the V1-V2, V6, and V6-V7 regions. By using a sample for which we have also generated a whole-genome shotgun sequencing run with 250 bp reads, we were able to compare the observed 16S rRNA genes in samples with and without an initial PCR step targeting the 16S rRNA gene. Our results demonstrate that experimental choices such as which region of the 16S rRNA gene to sequence and which algorithm to use to classify taxa are much more likely to affect observations of the "rare biosphere" than more commonly observed taxa.
|
|
|---|
PCR conditions.
A 454 pyrosequencing reaction requires the presence of two oligonucleotide sequences, the 454A and 454B primers, for the emulsion PCR reaction that amplifies DNA prior to pyrosequencing. In a whole-genome shotgun sequencing experiment, these primer sequences are blunt-end ligated to sheared sequence prior to emulsion PCR. If a pyrosequencing sequencing run instead uses PCR to focus on a particular gene, these primer sequences can be incorporated into the primers used in the initial PCR. Table 1 shows the full primer sequences used in our initial PCR targeting the 16S rRNA gene. Primers were ordered from IDT and were not purified by high-performance liquid chromatography.
|
View this table: [in a new window] |
TABLE 1. Primers used in initial 16S rRNA gene experimentsa
|
Shannon sequence entropy.
The aligned version of the Ribosomal Database Project ([RDP] version 9.59) was downloaded from http://rdp.cme.msu.edu/. A reference sequence was chosen (Escherichia coli J01695), and a new alignment was created in which all the columns of this alignment were numbered by the reference sequence. That is, if within the alignment the reference sequence E. coli J01695 had a gap, that column was removed from every sequence in the alignment. We then calculated the Shannon sequence entropy (Fig. 1) (23) for every column in the new alignment. This measure of conservation is defined as:
![]() |
![]() View larger version (40K): [in a new window] |
FIG. 1. Sequence conservation as a function of alignment position for the 489,840 sequences in version 9.59 of the RDP. The x axis shows the position in the alignment as numbered by the E. coli 16S rRNA gene. The y axis shows the Shannon sequence entropy (see Materials and Methods), a widely used measure of conservation in multiple sequence alignments (23). Highly conserved positions within the alignment have a sequence entropy close to zero and hence are shown toward the top of the y axis. Positions of the hypervariable regions V1-V3 and V6-V7 are derived from Chakravorty et al. (3).
|
|
View this table: [in a new window] |
TABLE 2. Sequences obtained via pyrosequencing of our 20 March 2007 aeration basin wastewater sample
|
![]() View larger version (9K): [in a new window] |
FIG. 2. Number of sequences assigned at the family classification level by the RDP classification algorithm to different sequences for the V1-V2, V6, and V6-V7 primers plotted against the number of 16S sequences assigned to the whole-genome shotgun sequence set. One has been added to each count to allow the data to be shown on a log-log plot.
|
Implementation of JGast.
A "nearest neighbor" algorithm maps a query sequence to a previously described full-length sequence in a database and then assigns the classification of the full-length sequence to the query sequence. There are several nearest-neighbor algorithms available as web servers (5, 6), but they have limitations in the number of sequences that can be uploaded, making them poorly suited to large pyrosequencing datasets. The code for the Global Alignment for Sequence Taxonomy (GAST) process (13) has been made publicly available (http://vamps.mbl.edu/resources/software.php) but requires creation of a database containing variable regions extracted from a full-length 16S rRNA alignment. As an alternative, we implemented a nearest-neighbor algorithm in Java (see File S2 in the supplemental material) that works directly on unaligned reference 16S rRNA sequences. Our implementation, which we call JGast, begins with classification of the 302,066 sequences in the Greengenes database (the file current_prokMSA_unaligned.fasta dated 16 December 2008 was downloaded from http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/on 7, January 2009) by the stand-alone RDP classification algorithm, version 2 (downloaded from http://sourceforge.net/projects/rdp-classifier/). In addition, a BLAST database was created from the current_prokMSA_unaligned.fasta sequences using formatdb (with the default parameters except for "–p f"). For each query sequence, we performed a BLASTN search against this database. We used NCBI BLAST, version 2.2.18 for Linux, with the following parameters: –p blastn, v = 250, and –e 0.000001. For each query sequence, we took the top 250 sequences with the best BLAST match (i.e., lowest e-score) to the query sequence. For each of the 250 sequences found by BLAST, we performed a pairwise global alignment with the query sequence using the Needleman-Wunsch algorithm with a match score of 2, a mismatch score of –1, and an affine gap penalty of –2 for opening a gap and a score of 0 for extending the gap. Following Huse et al., we defined the distance of our pairwise Needleman-Wunsch-aligned sequences as the number of insertions, deletions, and mismatches, divided by the ungapped length of the query sequence. In determining the region of the target sequence to use for the global alignment, we took enough of the target sequence to ensure full coverage of the query sequence. For example, consider a BLAST alignment of 100 nucleotides between a query sequence of 175 nucleotides to a target sequence of 400 nucleotides that matched from nucleotides 200 to 300 of the target sequence. There are 75 nucleotides in the query sequence that were not present in the BLAST alignment. We wish to allow for the possibility that the missing 75 nucleotides could contribute to a global alignment on either the 5' or 3' side of the target sequence. We would therefore use the Needleman-Wunsch algorithm to align the full query sequence to a substring of 125 to 375 from the target sequence (plus an additional 10 nucleotides on either side to allow for gaps). Since trailing or leading gaps are treated as a single insertion event, the presence of the extra sequence does not substantially increase the calculated distance.
Of the 250 sequences aligned to the query sequence in this way, we followed Huse et al. and considered the taxonomy of the target sequence(s) with the best percent identity (i.e., lowest distance) from the Needleman-Wunsch alignment as reference sequences. We compared the classification of these reference sequences and generated a consensus taxonomy in which two-thirds of the reference sequences agreed on the taxonomy, with at least an 80% RDP confidence score on the full-length reference sequence. If two-thirds of the sequences did not agree on a classification with an 80% RDP confidence score at a given taxonomic level (e.g., genus), we moved to the next higher taxonomic level on the tree (e.g., family).
Although JGast shares many similarities with GAST, there are some differences that could yield different results. GAST starts with RDP classifications of the SILVA database. We instead used RDP classifications of the Greengenes database in part because the Greengenes database has a longer average sequence length (1,416 bp for Greengenes versus 1,005 for SILVA). GAST uses a blast database consisting of an alignment of just the target variable regions and uses a multiple sequence alignment program (MUSCLE) (8, 9) to align the query sequence with the 100 best BLAST hits. Distance scores are calculated on this global alignment. JGast, by contrast, calculates distance scores from a pairwise Needleman-Wunsch alignment of the query sequence to each of the 250 best BLAST hits found in a BLAST database consisting of full-length sequences of the Greengenes database. Finally, GAST considers the taxonomy of all sequences in the SILVA database that contain an exact match to the reference sequence substring containing the variable region. By contrast, JGast considers only the taxonomy of sequences among the 250 best BLAST hits that have an exact match to the substring of the target sequence used for the global alignment. We do not anticipate these different choices, employed for expediency or performance or to take advantage of an existing Java code base, would make large differences in the results that we obtain. Indeed, our results strongly suggest that different classifications that arise due to different choices made during construction of algorithms will mostly be limited to the rare biosphere.
In Fig. 1 and 2, we show sequences that are classified by the RDP classification algorithm with a confidence score of at least 80% (the number of sequences assigned in this way is shown in Table 3). In Fig. 3 and 4, we follow Huse et al. and show only assignments where the percent identity of the query sequence and reference sequence in the Greengenes database is >85% and where the reference sequence(s) has a consensus classification of a given taxonomic level with an RDP confidence score of at least 80% (the number of sequences assigned in this way is shown in Table 4). In order to ensure that the differences in Fig. 3 and 4 are not simply the result of different thresholds being applied to generate classifications of different sequence sets, for the direct RDP classifications of the query sequences in these figures we took the same set of sequences for which JGast had made a classification and used classifications from the RDP algorithm, regardless of confidence score. That is, we let JGast choose the sequences based on a maximum distance to the reference sequence of 0.15 and then took RDP assignments for this sequence set even if the RDP confidence score was less than 80%.
|
View this table: [in a new window] |
TABLE 3. Number of sequences classified by RDP analysis scheme with a confidence score of >80%a
|
![]() View larger version (28K): [in a new window] |
FIG. 3. Across taxonomic levels, the results of a linear regression on log-transformed data between sequences generated by PCR targeting the 16S rRNA gene and 16S sequences culled from our whole-genome shotgun sequence set. Assignments are by the RDP classification algorithm as in Fig. 2. For each sequence set, two separate regressions were constructed, one for the rare biosphere (circles) with taxa seen 10 or fewer times in that sequence set's PCRs and one for a common biosphere (squares) with taxa seen 11 or more times in the PCR reactions (Fig. 1, gray lines). The top panels show the –log10 of the P value of the null hypothesis that the slope of the regression equals zero. The middle panels show Pearson's R values while the bottom panel shows the number of taxa for which classifications are made. Note that a significant P value (top panel) can be produced by either a negative or positive correlation.
|
![]() View larger version (9K): [in a new window] |
FIG. 4. Comparison of classifications at the family level made by JGast and the RDP classification algorithm.
|
|
View this table: [in a new window] |
TABLE 4. Number of sequences classified by JGasta
|
Nucleotide sequence accession number.
Sequences in this study are available at the NCBI short-read archive under accession number SRA001164 and as File S1 in the supplemental material.
|
|
|---|
60°C (without the 454A and 454B sequences). To allow for easier comparison to other studies, where possible we chose primers that met our criteria that were also used in previous studies. We therefore included the primers targeting the V6 region that the Sogin group used in their first paper (25) (primer 967F-1046R). We also included sets of primers (967F-1177R) that start with the 5' primer from Sogin et al. but extend further past the V7 region (967F-1177R) as well as primers that target the V1-V2 regions (27F-337R) (Table 1; Fig. 1). Once the primers were chosen, we used as a template a DNA sample taken on 20 March 2007 from the aerobic basin of the Mallard Creek Wastewater Treatment Plant for which we had also performed a 454-FLX whole-genome shotgun sequencing reaction (21). Using the three sets of primers in Table 1, we performed 30 rounds of PCR on this sample (see Materials and Methods), gel purified the resulting amplicon, and submitted the resulting DNA for 454-FLX pyrosequencing at the Environmental Genomics Core Facility in Columbia, South Carolina. Each of the three reactions was run on 1/16 of an LR70 plate. The number of sequences that was generated for each primer pair is shown in Table 2.
Differences between whole-genome results and results from the 16S rRNA gene are pronounced for infrequently observed taxa.
Although it is the basis of much of the biotechnology revolution, conventional PCR is known to have limits as a quantitative technology. Primer bias, saturation of amplicon at higher cycle numbers, and a stochasticity that is an inherent part of the PCR process can all limit the degree to which conventional PCR can be reliably used as a quantitative tool. A comparison of our whole-genome-derived 16S rRNA sequences with sequences derived from PCR targeting the 16S rRNA gene on the same DNA sample allows us to evaluate the degree of bias introduced by different PCR primers during the initial PCR step. Figure 2 shows this comparison for the family level for the V1-V2, V6, and V6-V7 primer sets (data for all taxonomic levels are given in File S3 in the supplemental material). For all primer sets, we see a rough correspondence, with a slightly better agreement for V1-V2. We note, however, that most of the differences occur in taxa that we observe 10 times or less in each PCR sequence set. If we limit our regression to sequences that we observed 11 times or more in the PCR sequence sets (Fig. 2), the correspondence between the whole-genome sequence sets and PCR sequence sets is reasonable. Figure 3 shows that, across taxonomic levels, linear regressions between the whole-genome and PCR sets involving only sequences we observed 11 times or more in the PCR sequence sets are much more positively correlated than regressions involving sequences we saw fewer than 11 times. We conclude that (i) different PCR primers yield profoundly different results for less abundant taxa and (ii) primers targeting the V1-V2 region do a somewhat better job of finding taxa that are also present in the whole-genome data set.
Differences between different analysis schemes also most significantly affect infrequently observed taxa.
In Fig. 2 and 3, assignments were performed using the RDP classification algorithm (26). A popular alternative method is to use a nearest-neighbor approach (5, 13). In this approach, a query sequence is first mapped to a full-length sequence in an existing database, and this full-length sequence is used to assign taxonomy. Huse et al. used this approach, in an implementation they call GAST, to compare pyrosequencing runs for the V3 and V6 regions (13). We implemented a broadly similar approach to GAST in Java that we call JGast (available in File S2 in the supplemental material). Figure 4 compares the JGast assignments for each taxon at the family level with the assignments taken from the RDP classification algorithm's direct assignment of taxonomy to the query sequence. We see generally reasonable agreement between the two approaches with, again, pronounced differences occurring in sequences that are observed fewer than 10 times. Figure 5 shows the results of regressions comparing JGast and RDP across taxonomic levels. Again, we see pronounced correlations when we consider a regression based only on taxa observed more than 11 times under either the RDP classification algorithm or JGast but generally very poor correlations when we consider taxa that are observed 10 or fewer times by both algorithms. We conclude that different classification approaches between different algorithms have a much more pronounced effect on the rare biosphere than on more commonly observed taxa.
![]() View larger version (29K): [in a new window] |
FIG. 5. Regressions across classification levels on log-transformed data showing the comparison between the RDP classification algorithm and JGast. Two regressions were constructed for each comparison: one for a common biosphere in which a taxon was observed 11 or more times under either JGast or RDP (squares) and one for a rare taxon in which fewer than 11 taxa were observed under both classification schemes.
|
|
|
|---|
In addition to primer bias and differences in the number of 16S sequences as causes of the poor correspondence between whole-genome and PCR results for low-abundance taxa, we suggest a third, less immediately obvious, cause: low-abundance taxa are harder to classify, and hence results for these taxa are more sensitive to choices made during data analysis, what we here call "analysis noise" (Fig. 4 and 5). This assertion that analysis noise complicates descriptions of the rare biosphere while having less of an impact on more frequently observed taxa is consistent with previous observations that sequences that are infrequently observed are not as well represented in databases as more abundant taxa (7, 25) and hence will be more difficult to classify.
In a recent paper, Huse et al. performed a comparison of different primer sets (full-length, V3, and V6). Their regressions of the abundance of taxa produced by these different primer sets produced r-squared values down to genus of
0.99 (13). These are substantially higher than the r-squared values we observed when we compared our whole-genome run to our runs targeting V1-V2, V6, and V7 (Fig. 2 and 3). We believe that these discrepancies may be explained by the following considerations. (i) The first is depth of sequencing coverage. Huse et al. generated 422,992 V3 tags, 441,894 V6 tags, and 7,215 full-length sequences. By contrast, our study consists of a much more modest number of tag sequences in the thousands and whole-genome 16S sequences in the hundreds (Table 1). Because there is more disagreement in the rare biosphere than in abundant taxa and because our regressions are on a log-log scale, we would expect a smaller r-squared value with less sequence coverage. That is, as the amount of sequencing goes up, the regression becomes more and more dominated by abundant taxa and therefore becomes tighter since primer bias primarily affects less abundant taxa. This effect is exacerbated by the fact that Huse et al. report their r-squared values based on an untransformed linear regression while we log transform our data before performing the regression. This increases the effect of the variability introduced by the rare biosphere in our analysis. (ii) Huse et al. generate their full-length sequences with PCR while we cull our 16S sequences from a whole-genome data set of 250 bp reads. This increases the analysis noise in our analysis as our taxonomic assignment of 250 bp fragments, randomly distributed throughout the 16S sequence, is certainly less accurate than the assignment of full-length sequences by Huse et al. (13). (iii) The data set of Huse et al. is from the human gut, which is likely a less complex ecosystem than our wastewater treatment plant. The additional complexity of our ecosystem likely increases the effect of the rare biosphere, further suppressing the correspondence between our whole-genome and PCR sequence sets. (iv) Finally, PCR noise is a factor. PCR is an inherently stochastic process with many possible confounding effects including saturation of amplicon product. Nonquantitative PCR is not generally thought of as a quantitative technology. It would be surprising if the results of PCR runs on complex communities produced results that were reliably reproducible in quantitative detail. As the cost of sequencing continues to drop and as large pyrosequencing data sets become increasingly trivial to obtain, we anticipate that future studies will be able to discriminate the relative importance of these causes for discrepancies in whole-genome and PCR-based surveys of complex communities.
We note that there is not an immediately obvious superior choice between the two analysis schemes we examined. The RDP classification scheme offers simplicity, very fast run times, and a straightforward bootstrap mechanism for establishing significance. Algorithms like JGast (and GAST [13] and the Greengenes classification scheme [5]) offer potentially greater sensitivity but are also more susceptible to small changes in the database. For example, a single misclassified sequence inserted into the reference database can significantly change classifications for a large number of query sequences if it serves as the nearest neighbor. The RDP classification algorithm is less sensitive to small changes in the training set. We imagine that in the future as the read length of new sequencing technologies approaches the length of the 16S rRNA genes (10), nearest-neighbor approaches, which replace a short query sequence with a full-length database sequence, may be replaced by direct classification of full-length query sequences.
In the paper that introduced the RDP classification algorithm (26), an in silico analysis suggested that subsequences covering the V1-V2 region out-performed other variable regions in reproducing full-length classifications. Our results are consistent with this finding, with the V1-V2 region slightly out-performing V6 and V6-V7 in matching our whole-genome results (Fig. 2 and 3). In addition, the V1-V2 region seemed slightly less susceptible to "algorithm noise." There are only three families which the RDP classification algorithm detects more than 10 times that are not detected by JGast (Fig. 4) while there are more such disagreements between the two approaches for V6 and V6-V7. These differences in performance between the primers sets are small, and it is reasonable to think that environments other than wastewater might yield different results. Nonetheless, all other things being equal, the V1-V2 region seems reasonable to target with the 250 bp available in the 454-FLX technology.
In their analysis of primer bias, Huse et al. conclude that "any effects of primer bias are limited to rare taxa." This analysis is largely in agreement with our results in which different primer sets produce somewhat similar results for abundant taxa while producing vast differences in the results generated for the rare biosphere. A principle argument for favoring pyrosequencing over traditional Sanger sequencing is that the much greater depth of pyrosequencing allows for investigations of the rare biosphere (25). Our results suggest that even more diversity in the rare biosphere will be uncovered when pyrosequencing experiments are repeated with different primer sets. Our results also suggest that longer sequence reads, which may be possible in sequencing technology to be introduced soon (10), will reduce analysis noise by eliminating the step in which a short query sequence is mapped to a full-length reference sequence. Such improvements in technology will be an important component of explorations of the rare biosphere while having much less impact on surveys of more abundant taxa.
Published ahead of print on 6 March 2009. ![]()
Supplemental material for this article may be found at http://aem.asm.org/. ![]()
|
|
|---|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»