Applied and Environmental Microbiology, March 2009, p. 1688-1696, Vol. 75, No. 6
0099-2240/09/$08.00+0 doi:10.1128/AEM.01210-08
Copyright © 2009, American Society for Microbiology. All Rights Reserved.
Bioinformatics Research Center (1), Department of Civil and Environmental Engineering (2), and Department of Biology (3), University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, North Carolina 28223
Received 30 May 2008/ Accepted 16 December 2008
In this paper, we apply recently developed pyrosequencing technology to probe the molecular diversity of the aerobic basin of a wastewater treatment plant in Charlotte, NC. In line with other studies of complex microbial communities (28, 32), we observed astounding levels of diversity. We found that substantial regions of the genomes of the most prevalent microbes in the wastewater treatment plant are poorly described by existing sequence databases. Our results demonstrate that despite recent technological advances that allow identification of microorganisms, the microbial population of wastewater treatment plants remains undersampled and inadequately characterized. Our results are a first step toward more complete molecular characterization of this important microbial community.
8 days) and then flows to secondary clarifiers. Clarified effluent is routed to denitrification filters and then to UV disinfection before discharge into Mallard Creek. The plant National Pollutant Discharge Elimination System permit requires the plant to meet a monthly 5-day test for carbonaceous biochemical oxygen demand of 4.2 mg/liter in the summer and 8.3 mg/liter in the winter months. Ammonia nitrogen (NH3-N) levels must be below 1 and 2 mg/liter in the summer and winter, respectively. There are no other nitrogen or phosphorus limits. The total suspended solids are limited to a maximum of 30 mg/liter, and the pH must be between 6 and 9. Fecal coliform counts must be less than 200 CFU per 100-ml sample. These limits are routinely met by the plant unless there are extreme weather events or plant upsets. Wastewater entering the secondary treatment system was monitored over a 6-month period for filtered flocculated chemical oxygen demand, a good estimator of readily biodegradable soluble organics, and the values ranged from 40 to 75 mg/liter. The ammonia nitrogen concentrations in this same flow ranged from 12 to 24 mg/liter, with the concentration varying in part due to return flow from digested sludge dewatering.
On the morning of 20 March 2007 we collected a 50-ml sample from the aeration basin using a plastic dipper. At the time of sample collection, the temperature in the aeration basin was 18.5°C and the pH was 6.5. The sample was decanted to remove as much foam as possible before the liquid was transferred to a sterile tube. DNA was extracted from the sample using a Mo Bio UltraClean Water DNA kit. The sample tube was inverted several times to maximize homogeneity, and a 10-ml aliquot was removed and pipetted onto the provided filter (0.22 µm). The filtrate was discarded, and DNA was extracted from the membrane using the manufacturer's protocol. The purity and concentration of the DNA extract were determined using a NanoDrop ND-1000 spectrophotometer. Approximately 100 µl of extracted DNA was concentrated in a vacuum centrifuge and resuspended in about 12 µl of molecular biology-grade water. The final sample concentration was 479 ng/µl as determined by a NanoDrop spectrophotometer. Preliminary analysis of the DNA using denaturing gradient gel electrophoresis indicated that there was substantial diversity in the observed bands, confirming that our DNA extraction was successful (data not shown). The sample was submitted to 454 Life Sciences for pyrosequencing on the 454-FLX platform. The methodology underlying pyrosequencing has been documented elsewhere (22).
The bioinformatics analyses used in this study are shown in Fig. S2 and described in the supplemental material.
Nucleotide sequence and quality score accession number.
Sequences and quality scores from our pyrosequencing run have been deposited in the NCBI short-read archive under accession number SRA001012.
250 bp, this threshold could be achieved with overlap of a modest number of our sequences. Despite this, only 1,154 (or approximately 0.3%) of our reads were recruited into 117 contigs greater than 500 bp long (the sequences of these contigs are available in File S1 in the supplemental material). To assign possible functions to these contigs, we used the GeneMark algorithm (4) to predict genes on our contigs and then performed a Blastp search of these predicted proteins against the Pfam database. This method produces more assignments than other approaches, including those based on profile searches (see the supplemental bioinformatics methods for details). With an E-value cutoff of 0.01, this approach found matches for 75% (88/117) of our large contigs (see File S2 in the supplemental material). Of these matches, 22% (20/88) were to hypothetical proteins and 21% (19/88) were to transposases. The prevalence of transposase gene sequences in our assembled contigs suggests that transposons are much more strongly conserved across metagenomes than other genomic regions. The prevalence of contigs for hypothetical proteins shows that the function of many of the highly conserved regions of our metagenome is poorly understood. The failure of the 454 assembly algorithm to assemble 99.7% of our sequence reads emphasizes the great diversity of the microbial community within the treatment plant. Because previous studies found a similar failure of assembly algorithms for metagenomic communities characterized by Sanger sequencing (28), as well as for simulated data sets created by sampling Sanger sequencing reads (23), we would not expect a significantly improved degree of assembly even if our sequence reads were longer.
The majority of taxa in the wastewater treatment plant cannot be classified at the genus level.
In order to discover the 16S rRNA genes within our data set, we downloaded the 16S rRNA gene FASTA DNA sequences from version 9.52 of the Ribosomal Database Project (RDP) (7) and used these sequences to create a BLAST database. Using the Blastn algorithm, we asked which of our 378,601 query sequences could be found in this RDP database with an E-value of ≤0.01 (see supplemental bioinformatics methods for details). The resulting 648 sequences (available as File S3 in the supplemental material) were run through the RDP classification algorithm (34). The RDP classifier algorithm uses Bayesian statistics to assign taxa to 16S rRNA gene sequences. The output of this algorithm includes a confidence score, which ranges from 0 to 100, that indicates the degree of confidence that can be assigned to the classification based on the results of 100 bootstrap trials (see reference 34 for more details). The recommended threshold for assignment of a taxon by the RDP algorithm is a confidence score of ≥80. Because sequence reads as short as 90 bp have been shown to be long enough to accurately characterize taxa (15, 21), we anticipated that our results would not be substantially different even if we had a read length of more than 250 bp.
The classifications of the 148 16S rRNA sequences that could be assigned to a phylum with a confidence score of ≥80 are shown in File S4 in the supplemental material and are summarized in Fig. 1. In another paper (T. J. Hamp, W. J. Jones, and A. A. Fodor, submitted for publication), we show that these classifications of 16S rRNA sequences derived from the whole-genome wastewater sequence set are well correlated with results from PCR experiments targeting the 16S rRNA gene. At the phylum level, the observed taxa are dominated by the Proteobacteria, with 70% of the classifiable taxa belonging to this category (Fig. 1, top panel). Moving from phylum to genus, fewer of the sequences can be classified with an RDP confidence score of at least 80. At the genus level, nearly 60% of the sequences cannot be classified at an RDP threshold of 80, and, among the taxa that can be classified, there is no dominant taxon (Fig. 1). These data demonstrate the extraordinary microbial diversity of activated sludge and are consistent with reports for other complex environments (14, 28, 30). We note that the inability of the RDP algorithm to classify these sequences to taxa with high confidence is not primarily because our 16S rRNA sequences have never been observed previously. Figure 2 shows that many of the sequences with RDP scores of <80 (to the left of the vertical lines) have very high levels of identity with previously described sequences. These results demonstrate that for wastewater treatment plants, as is the case for other complex ecosystems, the accumulation of 16S rRNA sequences in public databases is vastly outpacing our ability to classify these sequences and that this problem becomes more pronounced as one moves from the phylum level to the genus level. Presumably, future annotation efforts will rectify this problem.
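The rank-by-rank thresholding described above can be illustrated with a minimal sketch. It assumes each sequence's classifier output has already been parsed into a dictionary mapping rank to a (taxon, confidence) pair; that data structure, the function name, and the example taxa are hypothetical and are not taken from the RDP classifier's actual output format.

```python
RANKS = ["phylum", "class", "order", "family", "genus"]


def fraction_classified(assignments, threshold=80):
    """Return, for each rank, the fraction of sequences whose
    confidence score at that rank meets the threshold.

    `assignments` is a list with one dict per sequence, mapping a rank
    name to a (taxon, confidence) tuple (hypothetical structure).
    """
    totals = {rank: 0 for rank in RANKS}
    for per_rank in assignments:
        for rank in RANKS:
            taxon, confidence = per_rank.get(rank, (None, 0))
            if taxon is not None and confidence >= threshold:
                totals[rank] += 1
    n = len(assignments)
    return {rank: (totals[rank] / n if n else 0.0) for rank in RANKS}


# Illustrative input only; these are not values from the study.
example = [
    {"phylum": ("Proteobacteria", 100), "class": ("Betaproteobacteria", 95),
     "order": ("Burkholderiales", 90), "family": ("Comamonadaceae", 85),
     "genus": ("Acidovorax", 60)},
    {"phylum": ("Bacteroidetes", 92), "class": ("Sphingobacteria", 70)},
]
print(fraction_classified(example))
```

In this toy input, both sequences pass at the phylum level but neither passes at the genus level, which mirrors the qualitative trend reported in the text.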
FIG. 1. Pie charts showing taxonomic assignments for 148 16S rRNA sequences in our data set that could be classified to the phylum level with RDP confidence scores of ≥80. At the phylum level, Simpson's diversity index is 0.48.
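As an illustration of how a Simpson-type diversity index is computed from taxon counts, here is a minimal sketch. The caption does not state which variant of the index was used; the code assumes the common form 1 − Σ pᵢ² (where pᵢ is the proportion of sequences in taxon i), and the function name is hypothetical.

```python
def simpson_diversity(counts):
    """Simpson's diversity index in the form 1 - sum(p_i ** 2).

    `counts` holds the number of sequences assigned to each taxon.
    The choice of this variant is an assumption, not the paper's
    stated definition.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Two equally abundant taxa give an index of 0.5; a single taxon gives 0.
print(simpson_diversity([50, 50]))
print(simpson_diversity([10]))
```

Under this form, values near 0 indicate dominance by one taxon and values near 1 indicate many evenly represented taxa.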
FIG. 2. Results obtained with the RDP classification algorithm for 148 16S rRNA sequences that can be assigned at the phylum level with a confidence score of ≥80. The x axis of each graph shows the confidence in assignments as reported by the RDP classification algorithm. The y axis of each graph shows the level of identity (expressed as a percentage) between our query sequence and the best Blastn hit in the RDP database (version 9.52). The horizontal and vertical lines indicate 95% sequence identity and an RDP confidence score of 80, respectively.
0.01, we manually annotated where the corresponding RDP sequence was discovered. This was done by manual inspection of the GenBank records for these 648 sequences. The results of this annotation are shown in File S5 in the supplemental material and in Fig. 3. The x axis of Fig. 3 indicates our classification, while the y axis indicates the E-value with which the top hit from each of our query sequences matched the RDP database sequence. We found that while a large number of environments had at least one hit, if we restricted ourselves to environments with multiple hits at high stringency (i.e., low E-value), only three environments are well represented: freshwater, soil, and other wastewater studies (Fig. 3). While, of course, the low number of sequences for some of the other environments may simply reflect the low number of sequences from those environments in the RDP 16S rRNA database, there is a strikingly small number of sequences with high scores that are related to two 16S rRNA populations that are well represented in the database: marine and human. The relatively small number of human-derived 16S rRNA sequences observed is particularly interesting given the vast number of human microbes deposited in the wastewater treatment plant each day. These results show that the environment within the wastewater treatment plant exhibits strong selection pressure against the microbes that are present in human feces.
FIG. 3. Locations (as determined by manual annotation) and E-values of sequences from the 648-member pyrosequencing data set that matched the RDP 16S rRNA database at an E-value cutoff of 0.01.
As of November 2008, there were 772 complete bacterial genomes in the NCBI database (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz). In order to explore how well these known genomes are represented in the treatment plant, we used Blastn to compare our wastewater sequences to the 1,442 assembled genome and plasmid sequences from the 772 sequenced bacteria. In order to eliminate spurious hits, we required that any hit matched at least 75 nucleotides in our query sequence (see the supplemental bioinformatics methods for more details). Because our average sequence length was 250.4 bp, this is not an overly conservative criterion. Using this criterion, only 20% (73,274/378,601) of our sequences matched any of the known bacterial genomes. This result again reflects the great diversity of organisms in the wastewater treatment plant and emphasizes a key challenge for genomics; despite the considerable effort that has been expended in microbial genome projects, the great majority of our sequence reads are not found in known genomes.
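The 75-nucleotide criterion can be sketched as a simple filter over BLAST results. This sketch assumes the hits are available in the standard 12-column tab-separated BLAST output (query ID in column 1, query start and end in columns 7 and 8); the pipeline actually used is described in the supplemental material, and the function name and example lines here are hypothetical.

```python
def queries_with_long_hits(blast_lines, min_query_span=75):
    """Return the set of query IDs having at least one hit that spans
    at least `min_query_span` nucleotides of the query.

    Assumes standard 12-column tabular BLAST output (an assumption
    about the file format, not the authors' documented pipeline).
    """
    matched = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id = fields[0]
        q_start, q_end = int(fields[6]), int(fields[7])
        if abs(q_end - q_start) + 1 >= min_query_span:
            matched.add(query_id)
    return matched


# Illustrative hits only; these are not results from the study.
hits = [
    "read1\tNC_008782\t98.0\t120\t2\t0\t1\t120\t500\t619\t1e-50\t200",
    "read2\tNC_008782\t95.0\t40\t2\t0\t10\t49\t700\t739\t1e-5\t60",
]
matched = queries_with_long_hits(hits)
print(sorted(matched), len(matched) / 2)
```

Dividing the number of matched queries by the total number of reads gives the kind of percentage reported in the paragraph above.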
For the sequences that do match known genomes, we can determine how closely the sequenced genomes of cultivated organisms match the genomes present in our wastewater metagenome. We calculated for each of the 1,442 assembled sequences from the 772 finished genome projects the number of nucleotides in that genome that have a Blastn match that aligns with at least one of our wastewater sequences. Dividing this number by the total length of each assembled sequence yielded the "fraction of the genome covered." Figure 4 shows that even for the bacterium with the most well-represented assembled genome, nitroaromatic compound degrader Acidovorax sp. strain JS42 (accession number NC_008782), only 25% of its genome sequence matched our wastewater metagenome. Table 1 shows that the fraction of the genome covered is similarly poor for the 10 genomes that recruited the most reads from our wastewater metagenome.
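The "fraction of the genome covered" defined above can be computed by merging overlapping alignment intervals so that no nucleotide is counted twice. The following is a minimal sketch under the assumption that hit coordinates on the reference are 1-based and inclusive, as in BLAST output; the function name and example coordinates are hypothetical.

```python
def fraction_covered(genome_length, hit_intervals):
    """Fraction of a reference sequence covered by at least one hit.

    `hit_intervals` is a list of (start, end) positions on the
    reference, 1-based and inclusive. Minus-strand hits, where
    start > end, are normalized before merging.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    intervals = sorted((min(s, e), max(s, e)) for s, e in hit_intervals)
    covered = 0
    current = None
    for start, end in intervals:
        if current is None:
            current = [start, end]
        elif start <= current[1] + 1:
            # Overlapping or adjacent: extend the current merged interval.
            current[1] = max(current[1], end)
        else:
            covered += current[1] - current[0] + 1
            current = [start, end]
    if current is not None:
        covered += current[1] - current[0] + 1
    return covered / genome_length


# Two overlapping hits (1-20 after merging) plus one separate hit (50-60).
print(fraction_covered(100, [(1, 10), (5, 20), (60, 50)]))
```

Merging first matters because many reads can recruit to the same region; summing raw alignment lengths would overstate coverage.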
FIG. 4. Fraction covered as a function of the size of each assembled sequence for each of the 1,442 assembled plasmids and chromosomes in the NCBI database. The fraction covered is defined as the number of nucleotides in the assembled sequence that match at least one of our wastewater sequences divided by the total number of nucleotides in the assembled sequence.
TABLE 1. Top 10 assembled microbial genomes as sorted by the number of hits recruited from our wastewater metagenome
FIG. 5. Nonspecific recruitment against the Acidovorax sp. strain JS42 genome. BLAST hits with alignment lengths less than 75 nucleotides (for the 20 March run) or 250 nucleotides (for the environmental sequence database) were removed. Protein annotations are derived from the full NCBI core nucleotide report for the Acidovorax sp. strain JS42 genome (http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=121592436).
The great diversity of our wastewater metagenome caused very few contigs to be assembled. Of the sequences that were joined as contigs, a substantial fraction involved transposases. We might expect, therefore, a different pattern of recruitment around transposons. Figure 6 shows a region of the Acidovorax genome around a transposase gene sequence with a stark exception to the pattern of nonspecific recruitment. A large number of sequences from our metagenome recruited to this region with a nearly perfect match. Interestingly, a number of marine sequences from the GOS (28) also matched the region around this transposase gene, suggesting that, unlike most genomic regions, parts of this transposon are conserved across a wide environmental space.
FIG. 6. Region involving a transposase from the JS42 genome that shows an exception to the pattern of nonspecific recruitment. For visualization, a small amount of random noise was added to the y axis (as otherwise most of the hits to the transposase region would be superimposed). The red sequences matching the transposase region are from the GOS (28).
When mapped to protein space, the wastewater metagenome displays a distinct metabolic profile.
By translating our nucleotide sequences in all six frames and mapping the translated sequences to known proteins, we can generate a distinct metabolic profile for our wastewater sequences. This approach, asking which genes a microbial community is capable of producing, has been successfully used to analyze the metabolic signatures of a number of metagenomic sequence sets (10, 31). To perform this analysis, we submitted our pyrosequencing data set for annotation on the SEED platform (2, 26). Within SEED, metabolic pathways are classified in a hierarchical structure in which all of the genes required for a specific task are arranged into subsystems. At the highest level of organization, the subsystems include both catabolic and anabolic functions (for example, DNA metabolism), and at the lowest levels the subsystems are specific pathways (for example, the synthesis pathway for thymidine). Using the Blastx algorithm and an E-value cutoff of 0.001, the SEED database was able to assign 60% of our sequences. The result of assigning these sequences to functional categories is shown in Fig. 7. For comparison, Fig. 7 shows the mapping to functional categories from a recently published survey of 1,040,665 sequences from 45 microbial metagenomes collected from nine distinct biomes (10). We note that compared to the "average" profile of these nine biomes, the wastewater treatment plant has a distinct metabolic signature. For example, compared to other biomes, the wastewater treatment plant contains almost no genes coding for proteins involved in photosynthesis. We would expect this, as the primary energy source for the microbes at this treatment plant is the organic material being processed by the plant. In addition, genes involved in the degradation of aromatic compounds are represented at a much higher rate in the wastewater treatment plant than in other metagenomic systems. Again, we might expect this given the nature of household and industrial wastes present in sewage. Finally, we note that the Mallard Creek Wastewater Treatment Plant has no additional biological nutrient removal facilities to treat phosphorus. Consistent with this, the percentage of sequences assigned to genes involved in phosphorus metabolism appears to be lower than that for genes involved in nitrogen metabolism in the activated sludge (Fig. 7).
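The six-frame translation mentioned at the start of this section is performed internally by Blastx; the following minimal sketch only illustrates the idea. It assumes the standard genetic code and unambiguous A/C/G/T input, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```

Each of the six protein strings would then be compared against a protein database, since the reading frame and strand of a short metagenomic read are not known in advance.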
FIG. 7. Functional categories provided for our data set by the SEED server (http://www.theseed.org). The data for microbial genomes are averages for sequences gathered from multiple biomes (10).
Perhaps the most surprising result of our study is the pronounced conservation of transposases across widely different environments. While there is generally poor agreement between sequences from the GOS and known genomes (28) and between our wastewater metagenome and known genomes (Fig. 4 and 5), there are a few regions of conservation involving transposons (Fig. 6) where there is a pronounced match between the metagenomes and the sequenced genomes. A substantial fraction of the contigs that could be assembled from our data set involved strongly conserved transposases. It is an open question why transposons have escaped the pronounced sequence mutability that marks nearly all of the rest of bacterial genomes.
Like the results of other metagenomic projects (28, 32, 35), our results point to the extraordinary diversity of microbial communities. Patterns of nonspecific recruitment to known genomes suggest that the structures of the genomes of the most abundant organisms in the wastewater treatment plant are unknown (Fig. 4 and 5). Despite the great diversity of microbes in the treatment plant, analysis at the protein level is surprisingly tractable, with the sequences from the treatment plant displaying a distinct metabolic profile consistent with what we would expect based on the plant's function (Fig. 7). This suggests that despite the great complexity of microbial communities, next-generation sequencing technology will be a useful tool for monitoring changes in microbial processes across time and space. As treatment requirements become more stringent and monitoring expands to address a broadening group of compounds of concern, probe-free sequencing will increase the rate at which key microbial groups can be identified and selected for to optimize contaminant removal.
Published ahead of print on 29 December 2008.
Supplemental material for this article may be found at http://aem.asm.org/.