Applied and Environmental Microbiology, March 2007, p. 1425-1432, Vol. 73, No. 5
0099-2240/07/$08.00+0 doi:10.1128/AEM.01647-06
Copyright © 2007, American Society for Microbiology. All Rights Reserved.

School of Electrical Engineering and Computer Science,1 Center for Integrated Biotechnology,2 Department of Veterinary Microbiology and Pathology,3 Washington State University, Pullman, Washington 99164
Received 14 July 2006/ Accepted 28 December 2006
With the availability of whole-genome sequences and the introduction of microarrays, comparative genomic hybridization (CGH) is being used to make phylogenetic inferences between bacteria (9, 10, 18, 21, 24). One common CGH method relies on a whole-genome microarray (18, 21) constructed from most of the open reading frames of one completely sequenced reference strain. Hybridization of sample strains onto the array identifies the reference genes that are present in each sample strain and those that are absent or highly divergent. Genetic relatedness is determined based on a comparison of gene presence and absence patterns (or more specifically, accessory gene presence and absence patterns) among sample strains.
Using a whole-genome microarray and the CGH approach to infer phylogenetic relationships is more advantageous than single and multilocus methods, if only because much more information is incorporated into the analysis (10). Nevertheless, there are two inherent problems with using CGH data in this manner. First, only those genes harbored by the reference strain are available for analysis; genes specific to nonsequenced strains are not included. For example, genomes for the Escherichia coli strains CFT073, K-12, O157:H7, and H3110 share only 40% of their genetic content. Therefore, construction of a microarray using any one of these might not render sufficient information to discriminate between the other strains. Second, use of a single genome as the basis of comparison may introduce bias into the analysis. In an extreme case, all target strains that share few genes with the reference strain will appear closely related by CGH even if they are actually highly divergent from each other.
One alternative approach is to incorporate genetic information from multiple strains within a single microarray. This can be done by inclusion of specific genes from multiple whole-genome sequences (if available) or by using a "mixed-genome microarray" (MGM) that incorporates randomly selected gene fragments from many strains of bacteria (2-4, 25). An MGM is constructed from a shotgun library built from a pool of isolates belonging to the same species or genera of interest. Genomic DNA (gDNA) is fragmented by sonication (25) or by use of restriction enzymes (2-4) and size fractionated to isolate 500- to 600-bp fragments. The fragments are cloned, and a randomly selected collection of clones is used to construct a glass-based microarray. Genetic comparisons are made by hybridizing gDNA from test strains and assessing signal intensity across the multiple strains. For analysis, hybridization data can be converted to binary variables (present or absent), or relative intensities can be compared. In addition to its usefulness in CGH, inclusion of genetic variation from a number of different reference strains on the MGM enables the detection of lineage- or strain-specific genes that can serve as useful molecular markers or as targets for further functional analysis (3, 4, 25). Finally, MGMs have an advantage over conventional CGH arrays because no a priori information about the genome is required to construct these microarrays.
It would be ideal if all isolates were equally represented in the library used to construct the MGM. Nevertheless, depending on the scope of the project, this may not be practical, and it also assumes that the resultant shotgun library produces a truly random selection of clones (4) whereby all strains are equally represented. Thus, the question of library bias due to unequal strain representation arises as an important issue. Can phylogenetic relationships be correctly determined with the existence of library bias or incomplete representation with respect to the target strains? In this paper, we use both experimental and computational methods to assess the applicability of the MGM in determining phylogenetic relationships among strains of bacteria. The experimental method uses an Enterococcus MGM, and the computational method uses a virtual Streptococcus microarray to simulate MGM experiments, including construction, hybridization, imaging, and analysis. We show that MGM results can be used to accurately infer phylogenetic relationships among strains. We also analyze the effects of array size and library bias on the accuracy of the MGM, and we provide an easily applied method that effectively corrects for library bias.
Sample hybridization and detection.
Seven Enterococcus strains (generously donated by Rachel Noble, University of North Carolina, Chapel Hill) were used in this analysis: E. hirae, E. gallinarum, E. dispar, E. avium, E. faecalis, E. casseliflavus, and E. faecium. Isolates were retrieved from −80°C and recovered by streaking on M-Enterococcus agar plates and incubation for 48 to 72 h at 37°C. For each strain, one colony was placed into 3 ml brain heart infusion broth and grown overnight at 37°C. Genomic DNA was extracted using a DNeasy tissue kit (QIAGEN, Valencia, CA) and quantified using spectrophotometry. A segment of 16S rRNA was PCR amplified as described by Soule et al. (25) and sequenced to verify strain identity. Genomic DNA (1 µg) was fragmented and biotinylated using nick translation (Bio-Nick kit; Invitrogen, Carlsbad, CA), and after ethanol precipitation the labeled DNA was resuspended in 80 µl hybridization buffer (4x SSC [60 mM NaCl, 0.6 mM sodium citrate; pH 7.0] and 5x Denhardt's solution [0.1% (wt/vol) Ficoll, 0.1% polyvinylpyrrolidone, 0.1% bovine serum albumin]). Slides were preblocked for 30 min at room temperature with 200 µl TNB buffer (100 mM Tris-HCl [pH 7.5], 150 mM NaCl, 0.5% blocking reagent [TSA biotin system; Perkin-Elmer, Boston, MA]). Nick-translated gDNAs (80 µl) were heat denatured (95°C for 2 min), applied to the slide, enclosed in a humidified conical tube (50 ml), and incubated overnight at 60°C. After hybridization, the slides were incubated and washed as described previously (5), with 600 µl of the appropriate reagent applied to the slide at each step. After the final washing and drying, slides were imaged with an arrayWoRxe scanner (Applied Precision, Issaquah, WA). The resulting images were stored as TIFF files with pixel values ranging from 0 to 65,535. Each strain was hybridized on two independent slides.
Images were segmented using SoftWoRx software (Applied Precision, Issaquah, WA), and median probe intensity values were exported to Microsoft (Redmond, WA) Excel.
Enterococcus MGM analysis.
Hybridization data sets (n = 14) were normalized with respect to the average intensity of the four 16S rRNA control spots on each subarray. Normalized intensities of spots from replicate slides were averaged. Data were converted to a binary format; probes having a normalized intensity of <0.5 were considered absent or highly divergent ("0"), and probes with intensity values of ≥0.5 were considered present ("1"). A Euclidean distance matrix was then calculated using
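$$d_{ij} = \sqrt{\sum_{k=1}^{n} \left(x_{ik} - x_{jk}\right)^{2}}$$
where $x_{ik}$ is the binary score of probe $k$ for strain $i$ and $n$ is the number of probes.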
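The normalization, thresholding, and distance steps described above can be sketched in Python. This is a minimal illustration, not the authors' Matlab program: the function names are hypothetical, the data are assumed to come from a single subarray with its own control spots, and the distance is the standard Euclidean distance between binary profiles.

```python
import math

def normalize(intensities, control_intensities):
    """Divide raw spot intensities by the mean of the control spots.

    Assumption: one subarray, with its 16S rRNA control spots passed in.
    """
    control_mean = sum(control_intensities) / len(control_intensities)
    return [value / control_mean for value in intensities]

def average_replicates(replicate_a, replicate_b):
    """Average normalized intensities spot by spot across two replicate slides."""
    return [(a + b) / 2 for a, b in zip(replicate_a, replicate_b)]

def binarize(normalized, threshold=0.5):
    """Score a probe 1 (present) at or above the threshold, otherwise 0."""
    return [1 if value >= threshold else 0 for value in normalized]

def euclidean_distance_matrix(profiles):
    """Return {(strain_a, strain_b): distance} for every ordered pair of strains."""
    return {
        (a, b): math.sqrt(sum((x - y) ** 2 for x, y in zip(profiles[a], profiles[b])))
        for a in profiles
        for b in profiles
    }
```

For binary profiles, the squared distance equals the number of probes on which two strains disagree, so the matrix can be passed directly to any distance-based tree builder.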
MLST analysis.
Probes with intensities greater than 60,000 for all 14 Enterococcus hybridizations were retrieved from the clone library, PCR amplified with primers T3 and T7, and then sequenced (Amplicon Express, Pullman, WA). Probe sequences were trimmed for vector contamination, and PCR primer sets were designed for each probe sequence using primer3 software (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi) (22) with default settings except for product size, which was set to >400 bp, and primer length, which was constrained to 20 bp. Primer sets (Table 1) were commercially synthesized (Invitrogen) and used to amplify corresponding sequences from each Enterococcus strain. PCR was performed in 50-µl volumes with 10 µl of gDNA template (10 ng), 1x reaction buffer (Fisher Scientific, Pittsburgh, PA), 0.2 mM (each) deoxynucleoside triphosphate, 2 mM MgCl2, 1 U Taq (Fisher Scientific), and 0.4 µM of each primer. The cycling program consisted of 2 min of initial denaturation at 95°C, followed by 30 cycles of 95°C for 30 s, annealing at 56°C for 60 s, and 72°C for 60 s, and it concluded with a final extension at 72°C for 10 min. The annealing temperature for the EPE8 product was 48.6°C. PCR products were sequenced (Amplicon Express) and submitted to GenBank. Genetic sequences corresponding to different genetic fragments were concatenated in the same order and stored in FASTA format for alignment (ClustalW1.8.3 at http://www.ebi.ac.uk/clustalw/) (7). The alignment output file (.ph) was sent to Treeview for a tree plot, and the output file (.aln) was sent to Splitstree4 (14) for bootstrap analysis.
TABLE 1. PCR primers used to generate MLST data for comparison of Enterococcus species
Tree comparison was handled carefully, because different branch arrangements of phylogenetic trees can represent the same tree; what matters is how strains are clustered rather than the ordering of branches or clusters. Here, a cluster is defined as the set of all leaves that descend from a common, nonroot node of the tree (6). Each tree was first rerooted with E. faecium serving as the tree root; all the leaves in a cluster were then collected into a set (order is not important), and all the sets (clusters) were collected into a larger set. Each tree was thus represented by a set of sets, one for each cluster of the tree. For example, for the rerooted tree shown in Fig. 1a, a complete set corresponding to the tree is as follows: {{E. gallinarum, E. avium}, {E. dispar, E. faecalis}, {E. gallinarum, E. avium, E. casseliflavus}, {E. dispar, E. faecalis, E. gallinarum, E. avium, E. casseliflavus}, {E. dispar, E. faecalis, E. gallinarum, E. avium, E. casseliflavus, E. hirae}}. To compare trees only to a limited "depth," we deleted the sets of no interest. For example, for the tree above, if we do not want to differentiate how E. gallinarum, E. avium, and E. casseliflavus are clustered, we simply delete the set {E. gallinarum, E. avium} from the tree's corresponding set. Finally, if a subsampled data set produced a set contained in the complete set corresponding to the original tree derived from the entire data set, it was scored as a match, and the percentage of matches was calculated over 1,000 subsamples. For each data subset size, the program was run for 10 iterations to find the average and standard deviation of the percentage agreement with the original tree. The custom program (available upon request) used for the calculations was implemented using Matlab 7.0 (MathWorks, Natick, MA) equipped with the Bioinformatics toolbox.
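The set-of-clusters representation can be illustrated with a short Python sketch. It is not the authors' Matlab program. Trees are assumed to be already rerooted and written as nested tuples, the names `cluster_sets` and `is_match` are hypothetical, and the topology below is inferred from the example sets listed in the text. The sketch scores a match when every retained reference cluster also appears in the subsampled tree, which is one reading of the containment rule.

```python
def cluster_sets(tree):
    """Return the leaf sets (clusters) under every nonroot internal node.

    A tree is a nested tuple; a leaf is a strain name (str).
    """
    clusters = set()

    def leaves(node, is_root):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child, False) for child in node))
        if not is_root:
            clusters.add(members)
        return members

    leaves(tree, True)
    return clusters

def is_match(subsample_tree, reference_tree, ignored=()):
    """True if all reference clusters, minus the ignored ones, recur in the subsample tree."""
    required = cluster_sets(reference_tree) - {frozenset(c) for c in ignored}
    return required <= cluster_sets(subsample_tree)

# Topology inferred from the five example sets, rooted on E. faecium.
reference = ("E. faecium",
             ("E. hirae",
              (("E. dispar", "E. faecalis"),
               (("E. gallinarum", "E. avium"), "E. casseliflavus"))))

# Same tree with branches drawn in a different order.
reordered = (((("E. casseliflavus", ("E. avium", "E. gallinarum")),
               ("E. faecalis", "E. dispar")),
              "E. hirae"),
             "E. faecium")

# Different grouping inside the three-species cluster.
variant = ("E. faecium",
           ("E. hirae",
            (("E. dispar", "E. faecalis"),
             (("E. casseliflavus", "E. avium"), "E. gallinarum"))))
```

Because clusters are stored as unordered sets, `reordered` matches `reference` even though its branches are written differently, while `variant` matches only when the set {E. gallinarum, E. avium} is passed in `ignored`.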
FIG. 1. Phylogenetic tree for seven Enterococcus strains using (a) MGM analysis or (b) MLST with E. faecium as the root. Both data sets identified the same four clusters, A, B, C, and D, marked beside nodes.
Virtual MGM simulation.
Streptococcus genome files of 15 strains belonging to five different species were downloaded from PubMed in FASTA format. MGM construction was simulated by randomly choosing n (n is the microarray size) positions with replacement in the genome sequences and collecting n gene segments 600 bp long (as probe sequences) into FASTA-format files. To construct a virtual array with equal representation of the 5 species, 800 probes were randomly chosen for each species (n = 4,000). Hybridization was simulated using stand-alone BLAST 2.2.13 (1). The FASTA-format files were used to construct local libraries, and the 15 genomes were compared with local libraries to generate BLAST report files. The BLAST report files contained queries of all genomes for each probe. Imaging was simulated using a Perl program that determined the best score among all reported hits for a given genome against a probe, and the length of the matched sequence corresponding to the best score was divided by 600. The resulting value was used as the normalized hybridization intensity of that genome for the probe. Finally, Matlab 7.0 was used to read the intensity files and calculate the distance matrices for the 15 genomes. The neighbor-joining method and Treeview were used for phylogenetic tree construction as described previously.
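The probe-sampling and imaging steps can be sketched as follows. This is a simplified illustration, not the authors' Perl or Matlab code. It does not run BLAST; it assumes the hits for one genome against one probe have already been parsed into `(score, aligned_length)` pairs. The function names are hypothetical, and returning 0.0 when there are no hits is an assumption of this sketch.

```python
import random

PROBE_LENGTH = 600  # probe size in bp, as stated in the text

def sample_probes(genome, n, rng):
    """Choose n start positions with replacement and cut 600-bp probe sequences."""
    last_start = len(genome) - PROBE_LENGTH
    starts = [rng.randint(0, last_start) for _ in range(n)]
    return [genome[start:start + PROBE_LENGTH] for start in starts]

def hybridization_intensity(hits):
    """Normalized intensity for one genome against one probe.

    hits: list of (score, aligned_length) pairs from a parsed report.
    The aligned length of the best-scoring hit is divided by the probe length.
    Assumption: a genome with no reported hit gets intensity 0.0.
    """
    if not hits:
        return 0.0
    best_score, best_length = max(hits, key=lambda hit: hit[0])
    return best_length / PROBE_LENGTH
```

A seeded generator such as `random.Random(0)` makes a simulated array reproducible, which helps when different array sizes are compared on the same genomes.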
For MLST analysis, six genes were selected, namely, 16S rRNA, cpn10, dnaK, groEL, hsp, and htpX. Gene sequences were retrieved from GenBank and were concatenated in the same order for each strain. The concatenated sequences were aligned using ClustalW, and a phylogenetic tree was constructed using Treeview as described above.
For construction of the virtual whole-genome microarray, 4,000 probes were randomly selected from a single species. To simulate unequal representation, a different proportion of probes was selected from each species. For both the equally represented and unequally represented MGM size study, 10,000 random subsets (1,000 per iteration, 10 iterations total) were generated for each desired array size, and the mean and standard deviation of percent correct identification were plotted. In addition, for the unequally represented microarray with library bias correction, 1,000 random subsets (100 per iteration, 10 iterations total) were generated, and the library bias correction method was applied to each subset to find the consensus tree among 50 randomly generated bias-corrected trees from each subset.
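The array-size study can be sketched as a resampling loop. This is a minimal sketch under stated assumptions, not the authors' program: probe subsets are drawn without replacement, and the tree building and comparison are hidden behind a caller-supplied `is_correct` function. Both `percent_correct` and `is_correct` are hypothetical names.

```python
import random
import statistics

def percent_correct(profiles, subset_size, is_correct, rng,
                    iterations=10, subsets_per_iteration=1000):
    """Mean and standard deviation of percent correct identification.

    profiles: {strain: list of probe scores}, all lists the same length.
    is_correct: callable that takes the subsetted profiles and returns True
        when the tree built from them matches the reference tree.
    Assumption: each subset draws probes without replacement.
    """
    n_probes = len(next(iter(profiles.values())))
    percentages = []
    for _ in range(iterations):
        matches = 0
        for _ in range(subsets_per_iteration):
            columns = rng.sample(range(n_probes), subset_size)
            subset = {strain: [scores[c] for c in columns]
                      for strain, scores in profiles.items()}
            if is_correct(subset):
                matches += 1
        percentages.append(100.0 * matches / subsets_per_iteration)
    return statistics.mean(percentages), statistics.stdev(percentages)
```

Reporting the mean and standard deviation over the 10 iterations, rather than pooling all 10,000 subsets, gives the error bars plotted for each array size.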
Nucleotide sequence accession numbers.
PCR product sequences determined in this work have been submitted to GenBank under the accession numbers listed in Table 1.
Enterococcus MGM versus MLST.
To compare the MGM results with a conventional MLST analysis, we first identified five probes that were positive for all seven Enterococcus species (by hybridization) and designed primer sets for the corresponding sequences (Table 1). These target sequences were then PCR amplified for each of the seven strains of Enterococcus and sequenced for MLST. We concatenated the five sequences for each species and generated a phylogenetic tree (Fig. 1b). The results closely reflect the MGM results by identifying all four clusters, A, B, C, and D, with differences related to how E. gallinarum, E. avium, and E. casseliflavus are grouped in cluster B. The bootstrapping values of all nodes were above 99.2%.
Effect of the number of probes from Enterococcus MGM.
We hypothesized that the accuracy of the MGM for phylogeny analysis depends on the number of probes used. Larger arrays provide more details on genetic differences, but the benefit is probably asymptotic and a function of the total variation between the strains being tested. For interspecies comparisons with the Enterococcus MGM, we randomly selected 10,000 probe subsets of different sizes (including 10 iterations; 1,000 for each iteration). The mean and standard deviation for the percent correct identification of either clusters A and B or clusters A, B, C, and D are shown in Fig. 2. The percent correct identification curves resemble a steep step function, which indicates that very few probes are necessary for robust cluster identification. Approximately 100 probes are sufficient to identify the two major clusters, A and B, with 100% accuracy, and on average, these two clusters could be identified with 95% accuracy using as few as 46 probes. Moreover, as expected, more probes are required to consistently identify all four clusters. Approximately 1,000 probes are necessary for 100% recovery of all 4 clusters, whereas 460 probes identify all 4 clusters in 96% of the sampling comparisons. This analysis indicates that robust comparisons between Enterococcus species can be obtained with a relatively small MGM.
FIG. 2. Percent correct identification of two clusters (A and B; solid line) or all four clusters (A, B, C, and D; dashed line) as a function of the number of microarray probes included from the Enterococcus MGM analysis. The mean and standard deviation (bars) of the percent correct identification for each probe subset size over 10 iterations, each with 1,000 runs, are shown.
TABLE 2. Enterococcus microarray hybridization patterns
MGM array versus MLST.
Using sequenced genomes, we can simulate MGM construction, hybridization, and imaging in silico. For this analysis, we used Streptococcus for the virtual simulations because of the availability of a relatively large number of sequenced strains (n = 15) and species (n = 5). We randomly chose a total of 4,000 gene segments (each 600 bp long) from the 15 Streptococcus genomes, 800 segments from each of the five species, to construct an equally represented, virtual MGM to use for virtual hybridization. In the virtual MGM (data not shown), the gray level of each spot is proportional to the normalized hybridization intensity of a target strain to that probe. Relative intensities were converted to binary scores, after which the phylogenetic tree for the 15 Streptococcus strains was constructed (Fig. 3a, with S. mutans as the tree root). This analysis correctly grouped the strains belonging to each species, forming four species clusters: J, K, L, and M. At the species level, Streptococcus agalactiae and Streptococcus pyogenes form cluster A, Streptococcus pneumoniae and Streptococcus thermophilus form cluster B, and clusters A and B form cluster C.
FIG. 3. Phylogenetic tree of 15 Streptococcus strains based on (a) an equally represented virtual Streptococcus MGM, (b) MLST, (c) a virtual whole-genome microarray composed of sequences from S. pneumoniae only, or (d) a virtual whole-genome microarray composed of sequences from S. pyogenes only. Cluster labels are shown to the left of nodes, and the same clusters are denoted by the same label.
MGM versus whole-genome microarray.
The MGM is also capable of identifying the same phylogenetic groups at the species level as a whole-genome microarray, such as that constructed from the S. pneumoniae genome (Fig. 3c) or from the S. pyogenes genome (Fig. 3d). At the strain level, the MGM results almost exactly match the whole-genome microarray result constructed using S. pyogenes as the probe reference (Fig. 3d). In contrast, the MGM clearly outperforms the whole-genome microarray constructed with S. pneumoniae as the probe reference (Fig. 3c), as can be seen from the latter's poor differentiation of S. pyogenes strains (E'' and H'' in Fig. 3c but E and H in Fig. 3a). This example illustrates how a whole-genome microarray constructed from one strain or species may not provide enough information to differentiate other strains. Furthermore, when the microarray is constructed using data from a single genome, we see exaggerated separation between the source strain or species and the other strains or species used in the comparison. This is evident from the outlying separation of S. pneumoniae and S. pyogenes in the virtual array analysis (Fig. 3c and d) and the probable overrepresentation of E. faecium in the Enterococcus array (Fig. 1a; Table 2).
When only one species is used to construct the virtual microarray (for example, S. pneumoniae; Fig. 3c), most probe intensities are classified as "1" for the reference species, while few probes appear positive for most of the other strains and species hybridizations. Thus, out of 4,000 probes selected from S. pneumoniae (Fig. 3c), only a relatively small number of probes play a role in the genetic discrimination of other strains, and this produces an exaggerated branch distance for S. pneumoniae relative to other species. This bias does not usually affect the phylogeny, because all strains not used for library construction have similar hybridization strengths. According to the neighbor-joining algorithm (23), when searching for nearest neighbors to join, the long distance between S. pneumoniae and the other strains is offset and does not affect clustering. However, this may not be true for some extreme and complicated library biases for which hybridization strengths vary among all strains.
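The offsetting effect can be made explicit with the standard neighbor-joining selection criterion. The notation below is introduced here for illustration and does not appear in the original text. For $N$ taxa with pairwise distances $d(i,j)$, the algorithm joins the pair that minimizes

$$Q(i,j) = (N-2)\,d(i,j) - \sum_{k=1}^{N} d(i,k) - \sum_{k=1}^{N} d(j,k).$$

Suppose, as an idealized case, that the reference taxon $r$ is farther from every other taxon by the same constant $c$, so that $d'(r,k) = d(r,k) + c$ for all $k \neq r$ and all other distances are unchanged. For a pair $i, j \neq r$, each of the two sums gains $c$, giving

$$Q'(i,j) = Q(i,j) - 2c.$$

For a pair that includes $r$, the first term gains $(N-2)c$, the sum over $r$ gains $(N-1)c$, and the sum over $j$ gains $c$, giving

$$Q'(r,j) = Q(r,j) + (N-2)c - (N-1)c - c = Q(r,j) - 2c.$$

Every entry shifts by the same amount, so the pair chosen for joining is unchanged and only the branch leading to $r$ is lengthened. When the extra distance is not uniform across strains, the shifts differ between pairs and the clustering can change.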
MGM library bias analysis and correction.
Although in most cases library bias does not affect phylogenetic relationships because of the way the neighbor-joining algorithm selects pairs, this is not always true, as demonstrated by the example shown in Fig. 4a. The phylogeny in Fig. 4a was obtained from virtual hybridization of 15 Streptococcus strains on a virtual MGM consisting of 1,500 probes from S. agalactiae, 1,500 probes from S. thermophilus, 420 probes from S. mutans, 420 probes from S. pyogenes, and 160 probes from S. pneumoniae. The number of bright spots for the 15 hybridizations varies considerably (Table 3). Library bias causes the formation of incorrect clusters at the species level, with clusters A' and B' differing from their counterparts in Fig. 3. However, the library bias correction algorithm compensates for the bias, and the phylogenetic tree (clusters A, B, and C) is correctly generated at the species level, as shown in Fig. 4b. The library bias correction algorithm has the advantage that it can provide a bootstrap confidence value for each node (Fig. 4b) and can also provide multiple bias-corrected trees with a high consensus frequency.
FIG. 4. Phylogenetic tree of 15 Streptococcus strains based on a virtual unequally represented MGM before and after library bias correction analysis. Panel a shows an example for which representational bias produces an incorrect phylogenetic tree (clusters A' and B', instead of A and B). Panel b shows that after library bias correction, the correct phylogeny is retrieved. Bootstrapping values are adjacent to nodes.
TABLE 3. Virtual Streptococcus microarray hybridization pattern
FIG. 5. Comparison of an unequally represented Streptococcus array before (dotted line) and after (dashed line) library bias correction with an equally represented Streptococcus array (solid line). For each subset size of the array, the mean and standard deviation of the percentage of correct identification of clusters A, B, and C are plotted as a function of the number of microarray probes.
The virtual MGM provides an effective method for analyzing an experimental MGM and, in fact, has potential as a tool for genetic analysis when sequenced genome information is available. Probes for the virtual MGM are selected randomly from sequenced genomes so that the virtual MGM consists of genetic information from multiple genomes. Rather than having to perform a cumbersome genome-wide comparison, the virtual MGM method permits a shotgun sequence comparison that provides reliable phylogeny information, as shown by this study. For cases when library bias exists, the proposed library bias correction method provides effective compensation with bootstrap confidence values.
This project was partially funded by USDA NRI contract 2002-35102-12374, by the Agricultural Animal Health Program at the College of Veterinary Medicine, Washington State University, Pullman, and by the Carl M. Hansen Foundation.
Published ahead of print on 5 January 2007.