Previous Article | Next Article ![]()
Applied and Environmental Microbiology, April 2009, p. 2221-2229, Vol. 75, No. 7
0099-2240/09/$08.00+0 doi:10.1128/AEM.02118-08
Copyright © 2009, American Society for Microbiology. All Rights Reserved.
,
Department of Marine Sciences, University of Georgia, Athens, Georgia 30602,1 Department of Microbiology, University of Georgia, Athens, Georgia 306022
Received 12 September 2008/ Accepted 2 February 2009
|
|
|---|
1,019 genes and 1.8 copies of the 16S rRNA gene, suggesting that these bacteria have relatively streamlined genomes in comparison to those of cultured bacteria and bacteria from other habitats (e.g., soil or acid mine drainage). |
|
|---|
The initial GOS investigations provided an overview of intraribotype (41), protein family (57), and kinome (26) diversity in surface ocean waters, but the GOS database can be interrogated to address many other hypotheses. One application of this data set, the identification of microbial diversity, is a fundamental question in microbial ecology. The taxonomic makeup of bacterioplankton in GOS samples was provided previously by Rusch et al. (41). However, only the diversity of the most abundant ribotypes and major taxonomic groups was reported, with diversity either reported qualitatively or pooled across the entire sampling range. While subtype classification is a valuable tool for comparing certain microbial populations, biogeography- or habitat-driven questions demand a quantitative assessment of environmentally relevant ribotypes at each site in the GOS metagenome.
In this study, we provide a quantitative, categorical assessment of microbial diversity—both operational taxonomic unit (OTU) evenness and OTU richness—in surface waters of the northwest Atlantic Ocean through the eastern tropical Pacific Ocean by analyzing 16S rRNA gene homologs in the GOS metagenomic library. Based on 16S rRNA gene sequences, we compared the bacterial diversities between sampled habitats and assessed the extent of diversity that was sampled. By normalizing the number of 16S rRNA gene hits (to gene length and the number of single-copy gene hits), we also report estimates for the average number of ribosomal gene copies as well as the number of genes present in the average surface water bacterial genome.
|
|
|---|
Retrieval of gene homologs.
All Basic Local Alignment Search Tool (BLAST) searches for gene homologs in the unassembled GOS metagenome were performed through the community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA) interface (44). To maximize homolog retrieval, the entire 1,542-bp sequence of the 16S rRNA gene (rrsE, locus tag b4007) from the E. coli K-12 genome (RefSeq accession no. NC_000913) was set as the query sequence. Nucleotide sequences for 16S rRNA homologs were identified with BLASTn against "GOS: all metagenomic sequence reads (N)" (cutoff expect [E] value of
10–5; overlap length of
65 bp; similarity of
85%). Before taxonomic assignment of 16S rRNA gene homologs, the retrieved GOS sequences were trimmed to exclude any non-16S rRNA gene portion of the sequence read. Paired-read sequences, two sequence reads originating from the same insert (one forward read, one reverse read), within the 16S rRNA gene homologs were identified and then screened to verify that they were assigned into the same taxonomic bin (see method below). Then, one of the paired reads was removed from further analysis.
Single-copy gene hits, hits to genes found once per prokaryotic genome, were retrieved previously (23). Briefly, amino acid sequences for single-copy genes (atpD, gyrB, dnaK, rpoB, tufA, and recA) were identified with BLASTp against "GOS: all ORF peptides (P)" (cutoff E-value of <10–20). Hits for each single-copy gene, excluding paired reads, were normalized to the relative length of the gene compared to the length of recA (hg* = LrecA/Lg x hg), and the average number of recA-normalized single-copy gene hits was used in this analysis. In this equation, hg* is the number of recA size-normalized hits for gene g, LrecA is the length of recA in bp, Lg is the length of gene g in bp, and hg is the number of hits for gene g. In this study, the lengths of the query genes were used (for recA, LrecA is 1,062 bp; for the 16S rRNA gene, Lg is 1,542 bp).
Taxonomic assignment.
The Ribosomal Database Project II Classifier (RDP Classifier) (55), a naïve Bayesian tool based upon taxonomic classifications in Bergey's Taxonomic Outline of the Prokaryotes (15), was initially used to classify 16S rRNA gene homologs (data not shown). However, certain abundant marine bacteria, especially taxa without cultured representatives, were poorly classified by this method, as they are not represented in Bergey's classification system. Therefore, reference sequences from marine bacteria spanning the known diversity of marine microbes were compiled as reference sequences (see Table S1 in the supplemental material) (34) and analyzed using the RDP Classifier. Many of the marine reference sequences were classified correctly by RDP. However, a lack of taxonomic assignment was found for ecologically relevant marine groups, such as SAR11, the Roseobacter clade, SAR116, SAR324, SAR86, and SAR406, accounting for 48% of all retrieved hits (see Table S2 in the supplemental material). Therefore, taxonomic assignment of the GOS 16S rRNA gene homologs was made by similarity binning with the marine reference sequences using a Smith-Waterman alignment (47). If the GOS sequence had
70% overlap and
90% identity with a reference sequence, taxonomy was assigned to the level of order, where possible, according to the reference sequence with the highest similarity.
Distance-based OTU and richness comparison (DOTUR).
The partial 16S rRNA gene sequences we retrieved included sequences that were taxonomically classifiable by our methods, but because all sequences did not span the same location of the 16S rRNA gene (average read length of 822 bp, average 16S rRNA gene length of
1,500 bp), these sequences would align poorly and artificially enhance predicted diversity. Therefore, statistical assessments of observed and predicted diversity were performed on a subset of 16S rRNA homologs that was retrieved from within the 16S rRNA gene homolog pool by using methods similar to those of Schloss and Handelsman (43). The forward and reverse complements of a modified universal 16S rRNA gene probe (variable 3 [V3] region; EUB338, GCN GCC NCC CGT AGG NGT) (2, 6) were used as query sequences against the 16S rRNA gene homologs we had retrieved previously. For each matching sequence, 150 bp upstream and 150 bp downstream of the probe was obtained so that each retrieved sequence started and ended at the same approximate location. These sequences were aligned using Greengenes, a tool that aligns 16S rRNA gene sequences with a core set of 16S rRNA gene templates (9, 10), and any sequences that did not align were removed. Sequences were trimmed to the same aligned region (positions 1699 to 2257) to maximize the number of overlapping sequences. Any trimmed sequences less than 170 bp (
2/3 the length of the longest sequences, excluding gaps) were discarded to minimize effects of gaps and poor alignment on diversity assessment.
To evaluate the diversity captured in the GOS sequencing effort, we used the V3 region 16S rRNA gene sequences to create rarefaction curves for several GOS niches: the entire GOS metagenome, open-ocean habitats, coastal habitats, estuarine habitats, and a hypersaline lagoon. The open-ocean, coastal, and estuary habitats were chosen because they were the only habitats sampled multiple times (14, 18, and 3 times, respectively), and the hypersaline lagoon was chosen because it was the most highly sequenced sample (sample 33). Distance matrices for sequences within each of these GOS niches were calculated from alignments using DNAdist within BioEdit (19).
These matrices were analyzed with DOTUR (43) for distance-based OTU and richness determination. For estimates of species richness, we used Chao1—a nonparametric estimator based on mark-release-recapture techniques—which is designed to estimate the lower limit of species richness (5). While Foggo et al. (13) and Kemp and Aller (27) found that asymptotic Chao1 values were appropriate estimates of species richness for intermediate sampling frequencies, Chao1 underestimates diversity when sample sizes are small (25). Regardless, Chao1 performs like other estimators, both parametric and nonparametric, when used as a relative measure of diversity between samples (45) and is used here as a best estimate of the lower bound of species richness.
Distance matrices retrieved from DNAdist were additionally analyzed with LIBSHUFF (46), a statistical tool that examines the similarity between two 16S rRNA libraries. LIBSHUFF comparisons were made between estuarine, open-ocean, and coastal habitats.
Hit normalization.
To reconcile the effects of gene size on hit retrieval, the number of hits for each gene was normalized to the relative length of the gene compared to the length of recA (22, 23, 58) by using the equation hg* = LrecA/Lg x hg (variables are described above).
Estimating the number of genes per genome.
To approximate the number of genes (g) found in the average genome (G) of prokaryotes captured in the GOS, we used the following calculation: g/G = [(NR x LR)/Lg*]/NG, where NR is the total number of reads, LR is the average read length, Lg* is the average gene length, and NG is the total number of genomes. For NR, we used the total number of nonpaired reads at each site. For LR, we used the average GOS read length of 822 bp (41). For Lg*, we used the average gene length of prokaryotes of 924 bp (56). For NG, we assumed that the abundance of single-copy genes, genes found in nearly all bacteria with only one copy per genome, is equivalent to the number of genomes sequenced. We used the single-copy gene numbers found by Howard et al. (23) for sites GS02 to GS51 and the methods described by Howard et al. to compute the numbers of single-copy genes in sites GS00 to GS001. These numbers include only nonpaired reads.
To approximate the number of 16S rRNA genes found in the average genome (G) of prokaryotes captured in the GOS, we used the calculation hg/G = hg*/NG, where hg is the number of hits for gene g, hg* is the size-normalized number of hits for gene g, and NG is the total number of genomes. For hg*, we used the size-normalized number of nonpaired reads.
|
|
|---|
70% length overlap with a set of reference 16S rRNA gene sequences) (see Table S1 in the supplemental material) (34) and were thus removed from this analysis. After duplicate reads (from paired reads) were removed, 10,025 partial 16S rRNA gene sequences were retrieved from the GOS database, making up 0.24% of the unassembled GOS sequence reads. This is similar to the frequency of 16S rRNA gene homologs retrieved from a coastal marine shotgun metagenomic library sequenced by pyrosequencing (0.19%) (34).
Ribotypes retrieved.
Of the 10,025 partial 16S rRNA gene sequences retrieved from the GOS database, 87% were classified into taxonomic bins by Smith-Waterman alignment to a set of reference 16S rRNA gene sequences (see Table S1 in the supplemental material). Similarly, Rusch et al. (41) assigned 88% of retrieved 16S rRNA gene sequences to ribotypes available in public databases. However, fewer sequences were reported in that study (4,125 primary assemblies). While the present study queried unassembled sequence data, Rusch et al. (41) queried primary assemblies which merge mated pairs, as well as sequences that overlap at >98% identity.
Observed taxon distributions are shown by GOS habitat and by sample in Table 1 and Table S3 in the supplemental material, respectively, while biogeographies of taxa are shown by phylum and by alphaproteobacterial order in Fig. 1 and Fig. S1 in the supplemental material, respectively. The dominant ribotypes are consistent with other assessments of marine microbial diversity (8, 16, 17, 36). Alphaproteobacteria dominate the data set (43.1% of 16S rRNA gene sequences), and the majority of these are SAR11-like bacteria (30.9% of 16S rRNA gene sequences) (see Fig. S1 in the supplemental material). Gammaproteobacteria are the second most abundant taxon (17.3%), followed by Actinobacteria (8.2%), Cytophaga-Flavobacterium-Bacteroides (CFB) (6.0%), Cyanobacteria (5.3%), Betaproteobacteria (2.8%), and SAR406 (1.1%). Notably, Archaea made up only 0.3% of the total diversity in the GOS. These phylum-level abundances are generally equivalent to those found by Rusch et al. (41) (see Table S4 in the supplemental material). While this study found comparatively low proportions of Archaea, Bacteroidetes, and Firmicutes, this study was able to classify more Proteobacteria, as well as taxa not mentioned by Rusch et al. (41) (Chloroflexi, Deinococcus/Thermus, Fibrobacteres/Acidobacteria, SAR406, mitochondrion/plastid).
|
View this table: [in a new window] |
TABLE 1. Ribotype diversity for habitats sampled multiple times by GOSa
|
![]() View larger version (66K): [in a new window] |
FIG. 1. Microbial diversity (16S rRNA) for all GOS samples. Dotted lines connect taxonomic data to sample locations (filled circles). Numbers within taxonomic plots indicate the sample identification numbers (preceded by "GS" as described by Rusch et al. [41]). The border color surrounding each taxonomic plot indicates the habitat type. The habitats classified as "Other" include the following: 5, embayment; 16, coastal sea; 20, fresh water; 25, fringing reef; 30, warm seep; 31, coastal upwelling; 32, mangrove; 51, coral reef atoll. All samples were collected from the 0.1- to 0.8-µm size fraction, unless indicated otherwise by the following symbols: *, from the 0.22- to 0.8-µm size fraction; , from the 3- to 20-µm size fraction; and , from the 0.8- to 3-µm size fraction.
|
![]() View larger version (18K): [in a new window] |
FIG. 2. Diversity, evenness, and richness indices for GOS habitats, as calculated by DOTUR. (A) Diversity indices were estimated using the Shannon-Weaver index of diversity. (B) Evenness scores were estimated from Shannon-Weaver indices of diversity as well as richness. (C) Lower bounds of richness were calculated using the Chao1 estimator. Error bars indicate 95% confidence intervals.
|
1,500 bp, the partial 16S rRNA gene sequences we retrieved span different regions of the 16S rRNA gene. Analyzing those sequences would result in enhanced estimates of ribotype diversity. Therefore, from within our 10,025 partial 16S rRNA gene sequences, we obtained similarly sized sequences (
300 bp) surrounding a 16S rRNA gene probe (EUB338) targeting the V3 region for use in statistical analyses of ribotype richness (43). We identified 3,404 V3 region 16S rRNA gene sequences, equaling 34% of the hits we retrieved by the full-length 16S rRNA gene query.
The number of OTUs observed in GOS samples was approximated by rarefaction curves of the V3 region 16S rRNA gene sequences (see Fig. S3a to e in the supplemental material). For the purpose of this paper, we define a phylum as a group of organisms with
20% 16S rRNA gene sequence dissimilarity and a species as a group of organisms with
3% 16S rRNA gene sequence dissimilarity (1, 24, 48). Based on rarefaction and LIBSHUFF collection curves, ribotype diversity at the phylum level has been sampled reasonably within the major ocean habitats, but saturation has not been met for other taxonomic levels at 10%, 3%, and 0% sequence dissimilarity (see Fig. S2 and S3a to e in the supplemental material).
Estimates of taxon diversity and evenness are shown by GOS habitat and by sample in Fig. 2a and b and in Table S5 in the supplemental material, respectively. While the numbers of V3 16S rRNA sequences at many individual sites were too few to analyze statistically, the number of sequences in a given habitat allowed for a comparison of diversities between open-ocean, coastal, estuarine, and hypersaline environments. Regardless of OTU definition, the highest diversity was found in open-ocean habitats, followed by coastal habitats (Fig. 2A).
We also predicted the lower bound of microbial richness within sampled habitats by using Chao1 richness estimates (Fig. 2; also see Fig. S3f to j in the supplemental material). At the phylum level, we found that the lower bounds of richness in coastal, open-ocean, and estuarine habitats were approximately equal (Fig. 2; also see Fig. S3j in the supplemental material), with
37 phyla in open-ocean and coastal habitats (95% confidence interval, 35 to 46). There was, however, potentially less phylum richness in the hypersaline lagoon habitat than in the open-ocean and coastal habitats (lower bound of richness, 23; 95% confidence interval, 22 to 34). A similar trend was found with the estimated number of species (Fig. 2; also see Fig. S3h in the supplemental material). In the open-ocean and coastal habitats, the estimated minimum number of species is
420 (95% confidence interval, 336 to 565). The lowest bound of species richness was found in the hypersaline lagoon habitat (148 species; 95% confidence interval, 108 to 238). The minimum levels of richness of both phylum and species suggest that the hypersaline lagoon environment (GS33) contains a less complex bacterioplankton community, a conclusion also reached by Rusch et al. (41), based upon the relatively high degree of sequence assembly at this site. There is little difference in the predicted lower bounds of phylum and species numbers between open-ocean and coastal habitats.
When richness is based on unique sequences, however, the lower bounds of taxon richness within coastal and open-ocean communities differ. Populations of microbes in the open ocean have more unique sequences than those in all other habitats (Fig. 2; also see Fig. S3g in the supplemental material), suggesting more taxonomic microdiversity in the open ocean (45). Several major open-ocean taxa, such as SAR11 and Prochlorococcus marinus, have been shown to exhibit considerable intraclade diversity (12, 14, 31, 41, 53), so our observation of microdiversity within open-ocean bacteria is not as surprising as the relative lack within coastal habitats.
Average surface ocean water bacterium (0.1 to 0.8 µm).
With this large metagenomic library, it is possible to approximate average genome characteristics for surface ocean water bacteria (in the 0.1- to 0.8-µm size fraction). In a previous study, Howard et al. (23) found the abundance of single-copy genes, genes found in nearly all bacteria with only one copy per genome, for each site in the GOS (see Table S6 in the supplemental material). Based on an average GOS read length of 822 bp (41) and an average gene length of 924 bp (56) and assuming that the single-copy gene abundance is equivalent to the number of genome equivalents sequenced, we calculated the number of genes per genome at each GOS site {g/G = [(NR x LR)/Lg*]/NG} (see Table S6 in the supplemental material). By averaging across all sites, we estimate that the average genome of a marine surface water microorganism in the 0.1- to 0.8-µm size fraction is
1 Mb and contains
1,019 genes.
The scale of the GOS study also allows for an approximation of 16S rRNA gene abundance per bacterium (hg/G = hg*/NG). In this study, we found 10,025 16S rRNA gene hits within the GOS (6,904 when size normalized to recA by the equation hg* = LrecA/Lg x hg). Based on single-copy gene abundance, the average bacterium sequenced in the GOS (in the 0.1- to 0.8-µm size fraction) contains 1.8 16S rRNA genes, varying between 1.3 (GS00c, open ocean) and 2.8 (GS08, coastal). On average, coastal bacterioplankton genomes contain more 16S rRNA genes (1.9) than open-ocean or estuarine bacterioplankton (1.6 each).
|
|
|---|
Ribotype diversity.
Our analyses agree with previous assessments of microbial diversity in surface ocean waters (16, 36) and further show that ribotype diversities are distinct between coastal, estuarine, and open-ocean habitats. However, because the current GOS metagenome focuses on the free-living (0.1- to 0.8-µm) size fraction, the diversity of larger, aggregate, or particle-bound microbes, such as those found in coastal sites, was likely underestimated (33, 58). This size bias can be seen by observing the microbial diversity in samples GS01a, GS01b, and GS01c (hydrostation S) from the Sargasso Sea pilot study (52), the only site to date for which sequences from three size fractions are available. The taxon composition in the 0.1- to 0.8-µm size fraction (GS01c) is typical of that of most open-ocean water samples in the GOS. However, even with a lower sequencing effort, major differences in represented taxa are evident for the larger size fractions (GS01a, GS01b). For example, within the Alphaproteobacteria, SAR11-like sequences dominate in the smaller size fraction, but the proportions of Roseobacter clade, SAR116-, and Caulobacterales-like sequences are more abundant in the larger (see Fig. S1 in the supplemental material). Unfortunately, the significantly smaller sequencing effort for those samples prevents a statistical comparison of levels of estimated diversity found in different size classes of the same water column. Even so, this illustrates that the genetic abundance and diversity retrieved in the GOS represent those of smaller, free-living, surface ocean water microbes and do not necessarily reflect the entire surface ocean water microbial gene pool. When the remaining size fractions are sequenced, the GOS metagenome will better reflect the entire microbial gene pool in surface ocean waters.
Typical surface ocean water bacterium.
Assuming that there are, on average, 924 bp per gene, we estimate that the genome of a typical surface ocean water bacterium (sized 0.1 to 0.8 µm) is
1 Mb and contains
1,019 genes. Similarly, Raes et al. (38) estimated that a surface ocean water bacterium from within the Sargasso Sea sites (samples GS00 and GS01) has an effective genome size of 1.35 to 1.94 Mb. In contrast, a much larger genome size is estimated for sequenced, cultured, nonparasitic/symbiotic bacteria (average bacterium contains
3,000 genes [18]) and for bacteria from soil, whale fall, and acid mine drainage environments (average bacterium genome contains 3.16 to 4.74 Mb [38]).
Pelagibacter ubique, a SAR11 clade member, contains 1,354 genes (18), remarkably close to the number of genes we found in a typical surface ocean water bacterium. This is not surprising considering that SAR11 clade members made up a significant fraction of the entire GOS 16S rRNA gene inventory (31%). However, the comparatively small genome of a surface ocean water bacterium (sized 0.1 to 0.8 µm) could indicate widespread genome streamlining, similar to that found in ubiquitous marine bacteria such as Pelagibacter ubique and Prochlorococcus marinus (11, 18, 40, 49). Genome streamlining is often found in conjunction with low G·C content (18, 51). A high G·C content requires more nitrogen during DNA synthesis, so in the ocean, where nitrogen is often limited, a low G·C content could provide a potential selective advantage. Indeed, in the GOS samples (sized 0.1 to 0.8 µm), where we found a surface ocean water bacterium's genome to be relatively small, Raes et al. (37) also found the G·C content to be relatively low (
35 to 45%).
In environmental metagenomic libraries such as the GOS, the presence of eukaryotic sequences may affect estimations of bacterial genome size (38). By increasing the number of sequence reads without adding bacterium-specific sequences containing genes such as recA, the presence of eukaryotic sequences tends to inflate estimations of bacterial genome size. Raes et al. (38) found that in the two Sargasso Sea sites where larger filtrate was sequenced (samples GS01a and GS01b;
0.8-µm size fractions), the effective genome size was 4.04 to 6.02 Mb, but after correction for bacterium-specific sequences, the effective genome size dropped to 1.71 to 1.94 Mb. We found the same result in our calculations, as the largest estimated genome sizes were found in GOS samples sequenced from size fractions larger than 0.8 µm (samples GS01a, GS01b, and GS25) (see Table S6 in the supplemental material). The presence of eukaryotic sequences in the larger size fractions as well as the presence of picoeukaryotic sequences in the 0.1- to 0.8-µm size fraction (35) indicates that our estimates of average bacterial genome size, while already relatively small, may be overestimates.
As further evidence for genome streamlining, regardless of habitat, the average surface ocean water bacterium in the 0.1- to 0.8-µm size fraction contains only one to two rRNA operons. This is low in comparison to the average 3.9 rRNA operons found in cultured, sequenced prokaryotes (last checked on 14 August 2008) (29). A high number of rRNA operons is thought to allow a cell to respond quickly to transient improvements in environmental conditions (28, 30, 32, 54). This can be advantageous, especially in environments that receive pulses of nutrients. However, the tradeoff may be that maintaining multiple rRNA operons requires more energy (28). Our calculation that each genome contains one to two 16S rRNA genes implies that, while some cultured marine bacteria have a comparatively large number of rRNA operons per cell to quickly respond to the environment (e.g., Roseobacter clade members contain one to five operons per cell [32]), the average surface ocean water bacterium actually contains few, similarly to the GOS-dominant SAR11 group (P. ubique contains one operon per cell [18]). Based on our estimates of genome size and the number of 16S rRNA genes per genome, the average free-living marine bacterium may be constrained by energy conservation strategies more than by the need to adjust quickly to a changing environment.
This research was supported by the National Science Foundation (OCE-0724017).
Published ahead of print on 6 February 2009. ![]()
Supplemental material for this article may be found at http://aem.asm.org/. ![]()
Present address: Department of Marine Sciences, University of Georgia, Athens, GA 30602. ![]()
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»