Previous Article | Next Article ![]()
Applied and Environmental Microbiology, November 2005, p. 7504-7514, Vol. 71, No. 11
0099-2240/05/$08.00+0 doi:10.1128/AEM.71.11.7504-7514.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Matthew T. G. Holden,2
Richard A. Stabler,1,
Sarah E. Husain,1
J. Keith Vass,3
Philip D. Butcher,1
Jason Hinds,1 and
Jodi A. Lindsay4*
Bacterial Microarray Group,1 Infectious Diseases, Department of Cellular and Molecular Medicine, St. George's Hospital Medical School, London SW17 0RE,4 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA,2 Beatson Institute for Cancer Research, Garscube Estate, Bearsden G61 1BD, United Kingdom3
Received 8 February 2005/ Accepted 15 June 2005
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
Seven S. aureus isolates have been sequenced in their entirety (see Table 1). Comparisons of these genomes show a backbone of ordered conserved genes (the core genome) peppered with a range of genetic elements integrated at dozens of insertion sites (25). These mobile genetic elements (MGEs) make up approximately 15 to 20% of the genome and include bacteriophages, pathogenicity islands, integrated plasmids, staphylococcal cassette chromosomes (SCC), genomic islands, transposons, and other "islets." Many of the known or putative virulence and resistance genes carried by S. aureus are found on MGEs. Therefore, a comprehensive microarray that offers extensive coverage of the species, including MGE genes, and can rapidly identify the gene complement carried by any particular strain is a prerequisite for panspecies comparative genomic studies. These studies would enable the investigation of S. aureus variation, evolution, and epidemiology and would identify genes that are associated with pathogenic strains. Furthermore, a multistrain array would be useful for global gene expression studies with a wide range of strains.
|
Many of these studies are limited by the composition of the microarray and are often able to make only a one-way comparison to discover the genes deleted or divergent in the test strain compared to the sequenced reference strain. This restricted microarray content, governed itself by sequence availability, means that such studies are unable to detect any genes specific to the test strain that are not present in the reference strain and therefore not represented on the array. However, microarrays that offer extensive coverage of a particular species by inclusion of multiple strain-specific sequences enable a two-way comparison to the reference strain that not only identifies deleted/divergent genes in the test strain but also detects the presence of genes specific to the test strain that are absent in the particular reference strain. Clearly, the wider the scope of the genes on the array, the more informative and comprehensive the comparative genomics studies that can be accomplished.
Recently, Dunman et al. (11) developed an Affymetrix GeneChip containing oligonucleotides from six of the S. aureus genome-sequencing projects. The Affymetrix GeneChip represents a complementary yet distinct commercial technology platform based on proprietary design and construction methods (Affymetrix, Santa Clara, Calif.). Affymetrix GeneChips are based on short oligonucleotides of 20 to 25 bases that are synthesized in situ on silicon wafers through a photolithographic manufacturing process, with multiple oligonucleotides and control oligonucleotides for each gene represented. The spotted PCR product approach that we describe presents an alternative in-house solution, and any apparent advantages of one approach over the other would depend on issues such as availability, access to associated hardware, flexibility in design required, and intended application.
In this paper we describe (i) a new strategy for designing a nonredundant multistrain bacterial PCR product microarray, (ii) a novel approach to predicting the outcome of microarray hybridizations based on genome sequence analysis, (iii) validation of the observed microarray results based on the predicted results, to demonstrate the efficacy of the design and its potential application for identifying the presence or absence of genes in S. aureus isolates at a whole-genome level, and finally (iv) a method for interpreting comparative genomic microarray data that highlights MGEs, which are the most common variable between related S. aureus isolates. Furthermore, as a demonstration of the true value and practical benefit of a multistrain microarray, we prove that isolates thought to be clones of epidemic MRSA actually differ substantially in their carriage of MGEs. Such variation has not been documented before and may crucially impact on pathogenesis and treatment options.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Microarray design.
The multistrain microarray design strategy involved choosing one strain to act as a base strain and then adding genes from the other strains that are strain specific or show significant divergence from the genes in the base strain. A sequential process of strain comparisons thus builds up a gene pool that is representative of all seven strains.
Selection of strain-specific or divergent genes from the remaining six strains was performed using a cumulative procedure, outlined in Fig. 1, based on sequence comparisons using BLAST (1). An initial gene pool was generated consisting of all predicted genes from S. aureus MRSA252. A BLAST database was created, consisting of the current gene pool plus genes from the first additional strain, N315, which was then searched with the genes from N315. Strain-specific or divergent genes present in N315 were identified by comparing the quality of the BLAST hit for an N315 gene against itself to the best-quality hit against a gene in the gene pool. A relatively poor-quality hit against a gene in the gene pool would indicate an N315-specific gene requiring addition to the gene pool for representation on the array. The BLAST bit score was used as the quality measure, because it reflects the length as well as the degree of sequence similarity. The comparison of BLAST hit quality was achieved by calculating the ratio of BLAST bit scores, whereby the bit score for a gene against itself is divided by the bit score of the best hit in the gene pool. N315 genes with a bit score ratio greater than 2 were highlighted as those genes to be added to the gene pool, with any borderline cases inspected manually to verify inclusion. Any N315 genes confirmed as strain specific or significantly divergent were added to the gene pool, and then the process was repeated with the next strain. Hence the gene pool grew in size with each successive analysis of a strain due to the addition of any genes not already represented in the gene pool by a gene from a previous strain. Strains were sequentially processed (from first to last, N315, Mu50, Col, 8325, MW2, and MSSA476), followed by plasmids pN315, pMu50, and pMW2. A summary of the composition of the gene pool is given in Table 1.
|
Microarray construction.
Template DNA from each of the sequenced strains was prepared using a cesium chloride method as previously described (24) or Qiagen midi-preps with lysostaphin treatment (27). PCR primers were synthesized by MWG Biotech (Ebersberg, Germany) and supplied in 96-well format to enable high-throughput amplification of PCR products using a liquid handling and PCR amplification robot (RoboAmp 9600; MWG Biotech). Initially, standard 100-µl PCRs were performed with 10 ng DNA template, 5 U HotStar Taq DNA polymerase (Qiagen), 0.5 µM primers, 1.5 mM MgCl2, and 200 µM deoxynucleoside triphosphates. Thermocycling was performed using an initial denaturation of 95°C for 15 min and 40 cycles of 95°C for 1 min, 52°C for 1 min, and 72°C for 1 min, followed by a final extension of 72°C for 5 min. All PCR products were verified by agarose gel electrophoresis to confirm amplification of a single product of the expected size. Modification of the standard PCR conditions was undertaken for any failures or multiple products by either adjusting the annealing temperatures or adding the PCR additive Solution Q (Qiagen) until single products of the correct expected size were obtained for all reactions. Five percent of the PCR products were sequenced to provide additional validation of the products, with aliquots of all PCR products stored to enable additional confirmatory sequencing of array elements in the future if required. Aliquots (50 µl) of the PCR products were concentrated by isopropanol precipitation, resuspended in 15 µl 50% dimethyl sulfoxide in H2O, and printed in duplicate onto GAPS slides (Corning) using a MicroGrid II arraying robot (BioRobotics, United Kingdom). The printed microarrays were post-print processed according to the slide manufacturer's instructions (including rehydration, UV irradiation, chemical blocking, and denaturation steps) and then stored at room temperature in the dark prior to use.
Labeling, hybridization, and data acquisition.
Four micrograms each of chromosomal DNA from the test strain and the reference strain (MRSA252) was labeled with Cy3 and Cy5, respectively, using the method described previously (10). Microarray slides were prehybridized in 3.5x SSC (1x SSC is 0.15 M NaCl plus 0.015 M sodium citrate)-0.1% sodium dodecyl sulfate (SDS)-10 mg/ml bovine serum albumin at 65°C for 20 min before a 1-min wash in distilled water and a subsequent 1-min wash in isopropanol. Each Cy3-labeled test strain DNA was mixed with an equal amount of Cy5-labeled reference strain DNA, purified using a MiniElute kit (Qiagen), denatured, and mixed to achieve a final 45-µl hybridization solution of 4x SSC-0.3% SDS. Using two 22- by 22-mm LifterSlips (Erie Scientific), the microarray was sealed in a humidified hybridization chamber (Telechem International), hybridized overnight by immersion in a water bath at 65°C for 16 to 20 h. Slides were washed once in 400 ml 1x SSC-0.06% SDS at 65°C for 2 min and twice in 400 ml 0.06x SSC for 2 min. Slides were scanned using an Affymetrix 428 scanner, and signal data were extracted using ImaGene 5.2 (BioDiscovery). For each strain tested, two to four microarray slides were hybridized and analyzed.
Data analysis.
The ImaGene output for each array was loaded into GeneSpring 6.2 (Silicon Genetics) for normalization and analysis. For each spot, the median pixel intensity for the local background was subtracted from the median pixel intensity of the spot, and any values less than 1.0 were adjusted to 1.0. Background-subtracted pixel intensities for the test strain channel were divided by those for the reference strain channel. The resulting log ratios were normalized by applying the LOWESS intensity-dependent normalization, using 40% of the data to calculate the LOWESS fit for each point. If the value for the reference channel was less than 10, then a value of 10 was used instead. This normalization scheme counteracted any dye effects at low intensity and enabled interarray comparisons by equalizing the median ratio on each array. To account for any skew in the global normalization caused by an imbalance in the outlier populations, due to a disproportionate number of genes either deleted in or specific to a particular strain, an additional transformation was applied to the data. This additional transformation also facilitated some of the downstream analysis methods. A subset of core genes, defined as present in all strains analyzed, was determined by identifying those genes with a ratio between 0.5 and 2 on every single array. The ratio value for each gene was then divided by the median ratio of the core genes, ensuring that on all arrays the data were centered around the core genes which had a median ratio of 1. When data were combined for replicate arrays, the mean of the ratios was calculated with any genes flagged as empty by ImaGene excluded. Empty flags in ImaGene were assigned to spots with a signal intensity less than twice the local background intensity.
In order to validate the array design process, these normalized ratio data were analyzed using three commonly applied methods to determine the presence or absence/divergence in the test or reference strain. All these approaches are essentially trying to classify genes into four groups: group 1, genes present in the test strain and absent/divergent in the reference strain; group 2, genes present in the reference strain and absent/divergent in the test strain; group 3, genes present in both strains; group 4, genes present in neither strain. In practice, most of the genes falling into the last category were flagged by ImaGene as absent because the fluorescent signal intensity was less than 2 times the background signal. These genes were not included in subsequent analysis.
(i) Twofold cutoff.
An arbitrary cutoff of twofold was used to identify those genes that are specific to one of the strains. Therefore, for all strains, the upper cutoff was set at a ratio of 2 and the lower cutoff at a ratio of 0.5. Genes with a ratio greater than the upper cutoff were deemed to be specific to the test strain, genes with a ratio less than the lower cutoff were deemed to be specific to the reference strain, and genes with ratios between 0.5 and 2 were deemed to be present in both strains.
(ii) 3SD.
Rather than using a fixed-value cutoff for all arrays as above, a cutoff based on the variation in the ratio data of the core genes was determined for each strain. For each strain, the standard deviation of ratios for genes within the subset of core genes was calculated to measure variation in the data, and then the ratio cutoffs for each strain were set at 3 standard deviations (3SD) on either side of the median value. The standard deviation was calculated for each test strain independently. For N315, 3SD was calculated to correspond to an upper cutoff of 1.59-fold difference and a lower cutoff of 0.65-fold; for other strains, the corresponding cutoffs were 1.69- and 0.60-fold (Mu50), 1.55- and 0.65-fold (COL), 1.59- and 0.64-fold (8325), 1.53- and 0.66-fold (MW2), and 1.69- and 0-fold (476). Genes with a ratio greater than the upper cutoff were deemed to be specific to the test strain, genes with a ratio less than the lower cutoff were deemed to be specific to the reference strain, and genes with ratios between the cutoffs were deemed to be present in both strains.
(iii) GACK.
GACK software uses the distribution of the ratio data for each strain to classify genes based on the probability that a gene is either present or absent/divergent (19). The software was set to generate a binary output using a threshold of 50% estimated probability of being present (EPP) in order to split the data into two groups, namely, present or absent genes. At 50% EPP there is equal probability that the gene will be called present either correctly or incorrectly. The chosen threshold depends on the relative importance of false positives/negatives for the data set under analysis; a higher threshold will determine more genes to be strain specific, thus increasing the chance of identifying all the truly absent genes but at the cost of miscalling genes that are in fact present. A lower threshold will determine fewer genes to be strain specific, thus increasing the chance of identifying all the truly present genes but at the cost of miscalling genes that are in fact absent. A threshold of 50% was chosen because it should give a balance of false positives/negatives to compare with the other methods tested. However, GACK was originally designed to work for one-way comparisons, that is, to identify only genes that are deleted in the test strain. Using the software in this way would mean that any genes present in the test strain and absent/divergent in the reference strain would be incorrectly classified as present in both strains. To overcome this, the normalized ratio data for each strain were passed through GACK a second time using the inverse of the ratios.
BLAST predictions of microarray results.
To enable prediction of the expected microarray results from the sequence data, we first needed to classify genes as "present" or "absent" in each of the sequenced strains based on the sequencing projects. The sequence of every PCR product on the array was compared by BLAST (1) to the complete genome sequence plus any associated plasmid sequence for each of the seven sequenced strains. The complete genome sequence was used in this comparison, in contrast to the annotated gene sequences, to ensure that all factors contributing to the hybridization of genomic DNA were accounted for. This avoids any discrepancies simply due to differences in gene prediction for each strain and also accounts for the contribution of intergenic regions or gene remnants to the final hybridization signal. In order to incorporate the potential for multiple hits of varying qualities in the BLAST comparison, a total bit score (a measure of similarity between two sequences that takes into account both the degree and the length of sequence similarity) (1) was calculated for each PCR product against each genome by summing the bit scores of each high-scoring sequence pair with an E value less than 0.001. Using these total bit scores for each PCR product for each strain, three ratios were calculated: ratio 1, the bit score for each strain against the bit score for the genome from which the PCR product was designed; ratio 2, the bit score for the reference strain against the bit score for the genome from which the product was designed; and ratio 3, the bit score for each strain against the bit score for the reference strain. If either ratio 1 or ratio 2 was greater than or equal to a threshold value of 2 (the threshold value used to determine strain-specific genes during the design process), then ratio 3 was used to determine if the gene was predicted to be present or absent/divergent in the test and/or reference strain; presence was predicted if ratio 3 was greater than the threshold value of 2, and absence was predicted if it was lower than the threshold value. Lists were generated using these criteria, which could be used for comparison with the analyzed microarray data.
Sensitivity, specificity, PPV, and NPV.
To validate the microarrays, we used several different methods to categorize whether genes were "unique to the test strain" (found in the test strain and not in the reference strain) or "unique to the reference strain" (found in the reference strain and not in the test strain). Once we had classified the genes based on microarray data, the results were compared to the BLAST predictions based on sequence analysis. The various lists were compared using the Venn diagram function in GeneSpring, with the likelihood of chance list similarity measured using hypergeometric probabilities. The abilities of the microarray to identify genes "unique to the test strain" and "unique to the reference strain" were calculated separately. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated as follows. Sensitivity is the proportion of truly unique genes by sequence prediction that were correctly identified by the microarray. It is calculated as the number of genes unique by both microarray and sequence prediction divided by all genes unique by sequence prediction. Specificity is the proportion of genes predicted to be truly not unique by sequence prediction that were correctly called not unique by the microarray. It is calculated as the number of genes not unique by both microarray and sequence prediction divided by all genes not unique by sequence prediction. PPV is the proportion of genes unique by microarray that are truly unique by sequence prediction. It is calculated as the number of genes unique by both microarray and sequence prediction divided by all genes unique by microarray. Note that subtracting the PPV from 100 gives the level of false positives. NPV is the proportion of genes called not unique by microarray that are truly not unique by sequence prediction. It is calculated as the number of genes not unique by both microarray and sequence prediction divided by all genes not unique by microarray. Note that subtracting the NPV from 100 gives the level of false negatives. We also calculated the proportion of genes "correctly" identified by the microarray for each sequenced isolate. This was the number of genes that were called by both microarray and BLAST as either unique to the test strain, unique to the reference strain, or found in both, divided by all the present or marginal genes.
Composite genomes.
A major aim of using the microarrays for comparative genomics is to easily identify when an isolate has acquired or lost MGEs. This is because the sequencing projects have suggested that S. aureus isolates belong to a finite number of closely related lineages but that within each lineage there is substantial variation in MGE carriage (25). Due to this variation, clustering of S. aureus isolates based on microarray data for all genes (including MGEs) is inappropriate, and these genes should be identified and removed.
Since MGEs often carry resistance and virulence genes, the distribution of these elements in strains is of great interest. Furthermore, many MGEs prevent the host bacterium from acquiring related MGEs via "immunity." Thus, the identification of an MGE and the genes on it provides information about virulence and resistance gene carriage, mechanisms and possible sources of horizontal transfer, and the susceptibility of the host strain to acquisition of new genes. We constructed composite genomes of each of the sequenced strains and each of the sequenced MGEs (5 plasmids, 13 bacteriophages, 6 S. aureus pathogenicity islands [SaPI], 6 SCC, 7
genomic islands [GI
], 7 GIß, 3 conjugative elements, and 3 transposons). This was necessary because of the nonredundant nature of the microarray design, such that only one PCR product was designed if the gene was found in more than one of the sequenced isolates or MGEs. First, we identified every ORF in the sequencing projects that had >90% homology to each of the PCR products on the microarray. We then constructed a list (composite genome) of the best-match PCR products for each sequenced gene (in gene order in the genome) for each whole genome or MGE. This typically means that a composite MGE consists of PCR products taken from more than one strain of S. aureus. For example, the SaPI2(Mu50) composite genome consists of PCR products amplified from MRSA252 and N315 but not Mu50. Composite genome lists were imported into GeneSpring.
For interpretation of each experiment, the presence or absence of genes within each composite genome was determined by color based on fluorescent signal intensity ratios and visualized using the GeneSpring "Ordered List" function. Isolates related to a sequenced isolate can be compared visually using the relevant composite genome, and deletions can be identified. To identify acquisitions, each of the 50 MGE composite genomes is rapidly searched to identify any novel MGEs or fragments of MGEs.
| RESULTS |
|---|
|
|
|---|
Visualization of microarray results.
The seven sequenced strains were each cohybridized with the MRSA252 reference strain, and the results of the hybridization intensities for the test and reference strain channels are visualized as scatter plots in Fig. 2 (fully annotated microarray data have been deposited in BµG@Sbase [accession number E-BUGS-30 {http://bugs.sgul.ac.uk/E-BUGS-30}] and also Array Express [accession number E-BUGS-30]). In these scatter plots, the genes have been color coded according to BLAST sequence predictions to indicate whether they are present (red) or absent/divergent (green) in the test strain compared to the reference strain. Clearly, there is a strong association with genes predicted to be present or absent in the test strain based on sequence comparison and the corresponding signals on the microarray. As expected, the results of comparing the reference strain MRSA252 to itself show a high degree of correlation (R2 = 0.9664) (Fig. 2).
|
Calculating microarray accuracy.
To calculate microarray accuracy, it is necessary to decide what constitutes a gene that is "present" or "absent." It can be seen from the scatter plots in Fig. 2 that a straight line cannot be drawn to separate the blue and red genes or to separate the blue and green genes. This is also illustrated in Fig. 3. The "indeterminate" or "marginal" zones have a mixture of colored spots, indicating a number of genes that by microarray hybridization analysis do not precisely fall into one or the other category of present or absent/divergent. If we want to calculate how accurate the microarray is at calling every gene "present" or "absent," we need to find the best compromise and draw the "cutoff" lines. Our first step to approach this critical problem was to investigate current methods to categorize genes as present/absent by microarray.
|
|
The 3SD method we describe here was based on calculating the standard deviation from the median of the ratios of a set of core genes and multiplying by 3 to generate the cutoff points. 3SD had the advantage of increasing the specificity and reducing false positives, but it slightly decreased the sensitivity. In Fig. 2, the 3SD cutoff lines are slightly farther away from the median than the GACK cutoff lines. The 3SD method was the most successful at categorizing genes as present or absent compared to BLAST analysis; by this method, the microarrays were, on average, 91% sensitive and 94% specific, with a PPV of 82% and an NPV of 92%, for the sequenced strains (see Table S1 in the supplemental material). Expressed more simply, the microarray correctly identified >96.5% of genes present or absent/divergent in the sequenced strains compared to the reference strain MRSA252 (see Table S1 in the supplemental material).
Both GACK and 3SD produced high false-positive rates. To identify possible reasons for this, we inspected the MW2 data using the 3SD interpretation method, which called 95 genes unique to MW2 by microarray when these genes were found in both genomes by BLAST. First, we identified 16 genes that could be called unique to MW2 because of elevated copy number. They are all found in the chromosome in MRSA252 (on an integrated plasmid or on a single integrated transposon), but in MW2 they are found on a free plasmid. Since free plasmids are self-replicating and multiple copies are usually found in any bacterial cell, enhanced copy number will mean that the signal intensities on the microarray will favor MW2. Indeed, virtually all of the 16 genes had intensity ratios greater than 2 and did not fall into the "marginal" zone. Thus, the microarray was not incorrect.
An additional explanation for a low PPV is gene redundancy. The S. aureus genome has numerous well-documented examples of multiple genes (often in series) that are closely related, e.g., the superantigen gene nurseries (25). Our design strategy avoided most of these, and we note that no superantigen genes were falsely called by the microarray in this experiment. However, not every case of gene redundancy has been solved by our design strategy. For example, five putative lipoproteins, N315-0092 to -0096, were identified by BLAST as found in both MW2 and MRSA252, although MW2 carries five genes homologous to these products while MRSA252 carries only two genes homologous to these products. When we analyzed the two genomes by our BLAST method, these genes were found in both genomes (blue), although their bit score ratios bordered those found in MW2 only. The intensity ratios for four of these products are greater than 1.54 (and they fall among the red spots), and thus they contribute to the false-positive rate. Six genes of the 95 MW2 false positives fall into this category.
Finally, 11 false positives were particularly small genes, which generally give lower signal intensities. If the signal is low enough to be in the more variable region of the data set, the ratios generated from these genes become less reliable and harder to call as present/absent. If we remove the enhanced-copy-number, redundant, and small genes from our false-positive list, the PPV is recalculated from 67.1% to 75.5%. In other words, of the 253 genes called specific to MW2 by using 3SD, 62 were not MW2 specific by sequence analysis. The majority of these genes fall into the "marginal" zone with ratios between 1.54 and 2 and are difficult to categorize when a single cutoff line is chosen.
Replicates.
The use of replicates is mathematically proven to increase the accuracy of microarrays when used in gene expression studies (41). To prove this in practice for comparative genomic studies, the data generated for MW2 using four microarrays with duplicate spots analyzed separately (as if duplicate spots were two separate arrays), each of the four individual slides (averaging duplicate spots), the averages of two slides (paired according to when the microarrays were performed), and averaged data over all four microarrays (total of eight spots) were compared (full data are shown in Table S2 in the supplemental material). Data were interpreted using the 3SD method and compared to BLAST predictions. The results show that microarray results are highly reproducible, and even a single slide is typically 85 to 97% sensitive and >93% specific. Surprisingly, replicates increase the sensitivity and specificity of microarrays only marginally. The major advantage of replicates is the substantial decrease in false positives. This is indicated by PPV calculations improving from 54% (with one data set from one slide) to 75% (with eight data sets from four slides).
Composite genomes.
Most of the variation that occurs between isolates of S. aureus is due to the horizontal transfer or loss of MGEs. The seven sequencing projects revealed 5 plasmids, 13 bacteriophages, 6 SaPI, 6 SCC, 7 GI
, 7 GIß, 3 conjugative elements, and 3 transposons. Genes found in one type of MGE are generally not found in other types of MGE. However, within each MGE type, such as SaPI, recombination of SaPI within one strain and subsequent transfer of these rearranged SaPI to other strains at high frequency are probably common (25). Thus, a SaPI typically consists of a mosaic of fragments found in other SaPI. For example, the SaPI2 element from Mu50 consists of 12 genes: the first 3 are also found in SaPI2(N315), the next 4 are found in SaPI4(MRSA252), and the remaining 5 in SaPI2(N315). Yet it is also missing two genes found in SaPI2(N315). Since many MGEs (phage, SaPI, plasmids) have immune functions preventing related elements from transferring to the host bacterium (25), it can be important to identify which type of MGE is carried. Recombination presents problems for identifying MGEs, and thus composite genomes were used.
To rapidly identify conserved MGEs or MGE mosaic regions, we generated a list of the PCR products from the microarray that best corresponded to each individual sequenced gene in order. These MGE lists we call "composite genomes." Composite genomes were constructed for all 50 MGEs and for the seven sequenced chromosomal genomes. We use the composite genomes to readily display the presence or absence of colocalized genes in MGEs between strains by microarray.
Applications of multistrain composite genomes for comparative genomics.
A clinical isolate of MRSA from St George's Hospital (strain PM11) (28) was compared to the reference strain MRSA252 by microarray analysis. Both strains belong to the epidemic MRSA-16 clonal group causing 50% of MRSA infections in the United Kingdom and found throughout the world, where it is also known as ST36, USA200. Figure 4A shows a scatter plot of PM11 compared to the reference strain. Spots are colored according to whether they are found in the reference strain sequence (blue) or not (red). Clearly, a group of genes is unique to PM11 and located above the twofold cutoff line, although there are no genes found in the reference strain that are missing in PM11.
|
1(Mu50),
5(8325), SaPI2(Mu50), and SaPI2(N315). Composite genome results for pMu50,
1(Mu50), and SaPI2(N315) are presented (Fig. 4B). The sequenced plasmid from Mu50 contains a section of 14 genes that appear to be part of a transposon carrying aminoglycoside resistance genes. This section fluoresces in PM11 but not in MRSA252, suggesting that PM11 has acquired this transposon, which may or may not be on a plasmid in this strain. Both sequenced phage
1(Mu50) and
5(8325) have a common conserved mosaic region of 28 genes. At least five of these genes are strongly present in PM11 but not in MRSA252, suggesting that PM11 carries a novel phage. The sequenced SaPI2(Mu50) and SaPI2(N315) share 12 common genes, and 7 of them are found in PM11 but not MRSA252. The genes include tst (encoding toxic shock syndrome-1 toxin) and the SaPI2 integrase gene. This suggests that PM11 is carrying a novel SaPI2 element. Microarray analysis of six other EMRSA-16 isolates from St. George's Hospital Medical School (SGHMS) (28) showed that they each carried unique assemblages of known MGEs and could all be distinguished from each other (data not shown).
A second example of the use of composite genomes involved comparing the laboratory strain 8325-4 (test strain) with the sequenced strain 8325 (reference strain) (Fig. 5A). 8325-4 was originally derived from 8325 (also known as RN1) and is presumed to be missing three prophages (30). This experiment was performed differently from those discussed in the rest in this paper in that the standard MRSA252 was not used as the reference strain, and the two test strains were compared directly to each other. The data are presented as a composite genome of 8325, and they clearly show that 8325-4 is missing three prophages,
11,
12, and
13 [also known as
5(8325),
2(8325), and
3(8325), respectively]. Scatter plots revealed that this strain was not carrying any genes that are not found in 8325 (data not shown).
|
A third application of the multistrain microarray composite genome approach is in the analysis of the acquisition of methicillin resistance genes, which convert S. aureus into a pathogen that is much more difficult to treat and is the cause of significant morbidity and mortality in hospitals. These genes have probably moved horizontally into S. aureus on at least five types of SCCmec element. Until now, identification of the SCCmec element type by PCR of multiple genes has been used to categorize MRSA isolates for epidemiological and evolutionary studies in the absence of more-discriminatory methods such as microarrays. In Fig. 5B, we show that microarrays are more than capable of identifying SCCmec and other SCC types by using composite genomes (here illustrated by the SCCmecII type). From genome sequence data we know that MRSA252, N315, and Mu50 all carry SCCmecII, and on the microarray all these isolates give a positive signal for the whole SCCmecII element. MW2 is known to carry SCCmecIV, and the only similarity between these elements are the mec and ccr genes; again, this can be seen using the composite genome display of the microarray hybridization profiles for these regions (Fig. 5B). COL carries an unclassified SCCmec type most closely related to SCCmecI and does not carry any genes associated with SCCmecII apart from the mec genes themselves; this is also visualized by cohybridization signals with some genes in the composite genome display. Neither MSSA476 (which carries an SCC element without mec and without homology to SCCmecII) nor 8325 (which is mec and SCC negative) shows a microarray signal for the SCCmecII composite genome. Mosaic sections of the SCC are marked on Fig. 5B, including the mecARI genes, involved in methicillin resistance; the conserved ccrAB genes in SCCmecII and IV, which are thought to be involved in horizontal transfer; bleomycin and kanamycin resistance genes; and transposon Tn554, carrying resistance to erythromycin. Similarly, composite genomes of other SCC types provided further information in support of SCCmec type classification (data not shown).
| DISCUSSION |
|---|
|
|
|---|
The design method was developed to produce an essentially nonredundant microarray such that most genes would be represented only once; this was possible because the core genome appears highly conserved in all S. aureus strains. The advantage of using a nonredundant microarray is that a single strain can be used to provide reference DNA. The choice of reference DNA is not straightforward for a multistrain microarray. Others have tried using ratios of chromosomal preparations of all strains in order to generate a control signal for every spot on the microarray. However, this leads to increased signal intensity of core gene spots in the reference channel and complicates data analysis based on ratios. Another option is to use a mixture of PCR products generated using the same primers as those used to construct the microarray, although this approach would be technically demanding and poorly reproducible. Our results suggest that using a chromosomal preparation of a single genome was appropriate.
Interpretation of data using a multistrain microarray is complex, and many research groups have struggled to identify exactly which genes are present or absent in a strain by using microarray data for other bacterial species. The major problem is genes that fall into the "marginal" zone, that is, somewhere between the "present" and "absent" regions. Some groups have chosen to ignore these data altogether. For the first time we have been able to independently assess the methods commonly used for categorizing genes as present or absent with a substantial number of sequenced genomes. A twofold cutoff method was shown to be more conservative than the GACK or 3SD method and so may miss many important differences. The GACK and 3SD methods give greater sensitivity and specificity at the risk of a higher number of false positives. The method of choice, therefore, would depend on the importance of these factors to the experimental design. The development of better methods for determining cutoffs is warranted.
Our results show the impact of replicates on microarray results and support the use of additional replicates to minimize false positives. However, because of the cost of microarrays, the choice of replicate numbers in any large-scale comparative genomics experiments will depend on balancing the increased cost for a high number of replicates against the disadvantage of an increased false-positive rate for a low number of replicates. Depending on the intended application, a single microarray slide would provide useful and reproducible data more suited to a global comparison of a large number of strains. For a more precise determination of presence or absence on a gene-by-gene basis, a higher number of replicates would be required to achieve a more acceptable false-positive rate.
The seven sequenced strains are thought to offer reasonable coverage of human clinical strains, and they include a wide range of MGEs. However, our microarray cannot include every possible S. aureus gene. The "missing" genes are expected to be carried on MGEs. Using our composite genome approach, we can identify the presence of novel MGEs in unsequenced strains, which will enable targeted sequencing of these elements and even more comprehensive microarrays in the future.
Using three biological examples of genome variation, we have proven the value of the seven-strain microarray and its advantages over a microarray based on a single genome. Microarrays are clearly a very valuable method for discriminating between related and unrelated isolates and can be used for typing isolates in epidemiological studies, for identifying genes associated with particular groups of isolates (such as virulent versus nonvirulent isolates), and for studying the horizontal transfer of MGEs and the evolution of S. aureus.
In particular, we show here that "clonal" isolates of epidemic MRSA-16 can differ significantly in their carriage of MGEs. These MGEs can include genes encoding novel resistance or virulence factors, potentially impacting on treatment and pathogenesis. The frequency of MGE transfer and the stability of these genes are unknown, but these data support our previous studies suggesting that such rearrangements in S. aureus isolates in the hospital environment are common (27). Whether different isolates of the same clone are equally pathogenic remains to be determined. But the continuing accumulation of virulence and resistance genes in epidemic MRSA clones will lead to strains that are increasingly problematic in the hospital and community environments (25) and should be monitored. In addition, the substantial variation between clonal isolates of MRSA could be exploited for epidemiological typing. Since many hospitals are endemically affected by only a few clonal types, an enhanced typing method will allow identification and tracking of outbreaks both within and between hospitals.
In summary, the construction of a seven-strain S. aureus microarray has presented unique challenges in design, use of reference DNA, interpretation, and visualization of data. Our results suggest that we have developed suitable methods, including the concept of composite genomes, that could be used to generate and use multistrain PCR product microarrays for other bacterial species, particularly those with substantial genome plasticity.
| ADDENDUM IN PROOF |
|---|
|
|
|---|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Supplemental material for this article may be found at http://aem.asm.org/. ![]()
Present address: Department of Genetics, University of Leicester, University Road, Leicester LE1 7RH, United Kingdom. ![]()
Present address: Department of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, United Kingdom. ![]()
| REFERENCES |
|---|
|
|
|---|