Applied and Environmental Microbiology, December 2005, p. 8491-8499, Vol. 71, No. 12
0099-2240/05/$08.00+0 doi:10.1128/AEM.71.12.8491-8499.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Department of Biology,1 Department of Entomology,2 University of California, Riverside, California 92521
Received 5 February 2005/ Accepted 9 September 2005
Multilocus sequence typing (MLST) is a recently devised method for identifying strains of bacteria based solely on nucleotide sequence differences in a small number of genes (25). For MLST, each allele of a gene is given a number, and each strain characterized (for n loci) is represented by the set of n numbers defining the alleles at each locus. This defines the sequence type (ST). In contrast to the results of prior DNA-based methods, MLST sequence data are unambiguous, can be easily interpreted and replicated between labs, and are generally made available in a public database.
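As a rough illustration of the numbering scheme described above, the following Python sketch assigns allele numbers in order of first appearance at each locus and then numbers each distinct allelic profile as an ST. The function names, strain labels, and toy sequences are hypothetical and are not taken from this study or from any MLST database software.

```python
from typing import Dict, List, Tuple


def assign_alleles(seqs_by_strain: Dict[str, List[str]]) -> Dict[str, Tuple[int, ...]]:
    """Return the allelic profile (one allele number per locus) of each strain.

    Any sequence not seen before at a locus receives the next free number,
    so a single nucleotide difference is enough to create a new allele.
    """
    n_loci = len(next(iter(seqs_by_strain.values())))
    allele_ids: List[Dict[str, int]] = [{} for _ in range(n_loci)]
    profiles: Dict[str, Tuple[int, ...]] = {}
    for strain, seqs in seqs_by_strain.items():
        profile = []
        for locus, seq in enumerate(seqs):
            ids = allele_ids[locus]
            if seq not in ids:
                ids[seq] = len(ids) + 1
            profile.append(ids[seq])
        profiles[strain] = tuple(profile)
    return profiles


def assign_sts(profiles: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
    """Number each distinct allelic profile as a sequence type (ST)."""
    st_ids: Dict[Tuple[int, ...], int] = {}
    sts: Dict[str, int] = {}
    for strain, profile in profiles.items():
        if profile not in st_ids:
            st_ids[profile] = len(st_ids) + 1
        sts[strain] = st_ids[profile]
    return sts


# Hypothetical two-locus data set (real MLST fragments are hundreds of bp long).
toy = {
    "strainA": ["ACGT", "GGCA"],
    "strainB": ["ACGT", "GGCA"],
    "strainC": ["ACGA", "GGCA"],
}
profiles = assign_alleles(toy)
sts = assign_sts(profiles)
```

In this toy input, strainA and strainB share one ST, while strainC differs at a single base of the first locus and therefore receives a new allele number and a new ST.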
MLST typically has higher resolution than previous methods used to group bacterial strains. By convention, a seven-locus MLST data set is generally used, and such a set has been estimated to give a level of discrimination comparable to that of an MLEE study using 15 to 20 loci (16). A single nucleotide difference always produces a new allele in an MLST data set, while as many as eight amino acid substitutions may be required to produce a new allozyme in an MLEE data set (4). MLST also gives a comparable or even slightly better level of discrimination than PFGE (18, 24, 27, 29).
MLST is used to group closely related strains into clonal complexes. The definition of a clonal complex has been modified since the first use of MLST to maximize its effectiveness in defining meaningful groups. Initially, Enright and Spratt (10) defined a clonal complex as a group of strains having the same ST (i.e., the same allelic profile). This definition was later amended to any frequent group of strains (defined as the ancestral or "consensus" genotype) plus their single-locus variants (12, 16). Single-locus variants differ from the ancestral genotype at only one of the test loci. However, defining the ancestral type as the allelic profile or sequence type most commonly present in the clonal complex is subject to sampling bias. Feil et al. (13) used a parsimony-based approach and redefined the ancestral type as the sequence type with the most single-locus variants in the clonal complex. More recently, the definition of a clonal complex has been further relaxed, so that a clonal complex is a group in which every strain shares at least five identical alleles out of seven with at least one other genotype in the group (13) or has four alleles out of seven that are identical to alleles in a consensus or ancestral clone (8). Finally, Feil et al. (14) suggested that the definition of a clonal complex should be flexible depending upon the characteristics of the species in question.
One of the major factors influencing the nature and diversification of clonal complexes is the recombination rate. As the recombination rate increases, the phylogeny of individual descent within a bacterial species becomes increasingly randomized. Thus, a priori, we would expect the occurrence of well-defined clonal complexes to be more probable in species with low recombination rates. Indeed, the contribution of recombination relative to that of point mutation can profoundly affect variation and host adaptation within a bacterial species and is an important determinant of the evolutionary trajectory (26). This ratio inevitably varies among species, since the necessary precursor of recombination, lateral gene transfer, varies considerably (15). Recent studies that have estimated recombination in bacteria have revealed a wide range of values, from zero in Mycobacterium tuberculosis (40) to midrange values in Escherichia coli and Haemophilus influenzae (13) to high estimated values in Streptococcus pneumoniae, in which an allele is about 10 times more likely to change by recombination than by point mutation (13).
MLST data sets have been used to estimate the relative contributions of recombination and point mutations in the formation of new alleles within a clonal complex (12, 13, 15, 17, 34). Recombination can be identified by a mosaic structure in a particular gene, reflecting the different evolutionary histories of different regions of the gene. In contrast, a point mutation results in a novel allele distinguished by a single base change. To distinguish between these two sources of genetic variability, Feil et al. (15, 17) adopted the criterion that if a variant allele within a single-locus variant in a clonal complex differs at only one site from the "ancestral" allele of the complex, then it is considered a point mutation. If it differs at more than one site, it is considered to have originated from a recombination event. This classification assumes that it is unlikely that strains that are identical at several other loci have accumulated more than one mutational difference at a single remaining locus and allows measurement of the "effective" rate of recombination (i.e., recombination which results in novel alleles within a clonal complex).
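The single-site criterion described above can be sketched in Python as follows. This is only a minimal illustration that assumes the two alleles are already aligned and of equal length; the function names and example sequences are hypothetical.

```python
def count_differences(ancestral: str, variant: str) -> int:
    """Count the sites at which two aligned alleles differ."""
    if len(ancestral) != len(variant):
        raise ValueError("alleles must be aligned to the same length")
    return sum(a != b for a, b in zip(ancestral, variant))


def classify_variant(ancestral: str, variant: str) -> str:
    """Apply the criterion: one differing site is scored as a point mutation,
    more than one differing site is scored as a recombination event."""
    n_diff = count_differences(ancestral, variant)
    if n_diff == 0:
        return "identical"
    return "point mutation" if n_diff == 1 else "recombination"


# Hypothetical alleles, for illustration only.
one_site = classify_variant("ACGTACGT", "ACGTACGA")
two_sites = classify_variant("ACGTACGT", "TCGTACGA")
```

Here `one_site` is scored as a point mutation and `two_sites` as a recombination event, which mirrors the assumption that two independent mutations in one locus are unlikely among otherwise identical strains.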
In this study, we developed an MLST system for the plant pathogen Xylella fastidiosa. X. fastidiosa is a gram-negative, xylem-limited eubacterium that is closely related to the xanthomonads (30). It is transported between plant hosts by xylem-feeding insect vectors (typically leafhoppers belonging to the order Hemiptera). Different strains of the bacterium infect different species of plants throughout the Americas. These strains cause scorch diseases such as Pierce's disease (PD) in grapevine, almond leaf scorch (ALS) in almond, and oleander leaf scorch (OLS) in oleander in North America (21, 31) and citrus variegated chlorosis (CVC) in South America (5). Three distinct clades of X. fastidiosa have been identified in North America (37); these clades correspond to X. fastidiosa subsp. fastidiosa (renamed from the original subspecies, piercei) and X. fastidiosa subsp. multiplex (36) plus a third subspecies, X. fastidiosa subsp. sandyi, that so far has been found only in oleander (37). X. fastidiosa subsp. fastidiosa is found in grapevines, almond, and alfalfa, and X. fastidiosa subsp. multiplex consists of several plant host pathovars (e.g., almond, peach, plum, and oak pathovars).
In establishing an MLST system for identifying and classifying X. fastidiosa strains, we examined the effectiveness and robustness of the MLST method for detection of subspecies and plant host strains within the subspecies and for estimation of recombination rates. We were able to do this using the preexisting phylogeny of the strains (37), with which we could compare the clonal complexes. We investigated the extent to which the definition of a clonal complex influences the conclusions of an MLST study and the effect of choosing an MLST standard of 7 genes from a larger number (in our case 10 genes) on the clonal complexes identified and the estimated recombination rate.
Estimation of the recombination rate is very important for understanding bacterial evolution and the origin of new pathogenic strains. This is true for X. fastidiosa. The distinct phylogenetic clades of X. fastidiosa (37) suggest that recombination is limited; however, there is an opportunity for recombination between strains both within the insect vector, in which the X. fastidiosa strains are transported in the head and foregut (30), and in the plant host. For example, strains of both X. fastidiosa subsp. fastidiosa and X. fastidiosa subsp. multiplex have been isolated from symptomatic almond trees (1). We compared estimates of recombination obtained from the MLST data to independent estimates derived using other available methods (2, 35) and examined the extent to which reducing the size of the data set to the MLST standard of seven genes altered recombination estimates.
TABLE 1. X. fastidiosa strains used in the MLST analyses plus the outgroup CVC18
Amplification and sequence determination for 10 genes.
Ten genes were sequenced for the MLST analysis, and these genes represented a total of 9.3 kb. The genes were amplified using primers designed from four genomes using Oligo v.6 (33) and the methodology of Schuenzel et al. (37). The genes used occur in all four genomes sequenced (ALS strain Dixon [accession no. NC_002723], OLS strain Ann-1 [accession no. NC_002722], PD strain Temecula [accession no. AE009442], and CVC strain 9a5c [accession no. AE003849]). The genes were chosen for the MLST analysis based on a survey of the ALS and OLS genomes (Schuenzel et al., unpublished data). Because the X. fastidiosa genomes have diverged only between 0.5% and 3.0% (3, 41; Schuenzel et al., unpublished data), we focused on genes that have diverged at least 1.0% between the ALS and OLS genomes. We also selected genes that represented a variety of biochemical functions and were distributed around the CVC genome (Table 2). The evolutionary pattern of each gene was characterized based on its rate of change and on its ratio of nonsynonymous substitutions to synonymous substitutions (dN/dS) (37).
TABLE 2. Positions, functions, PCR primers, and lengths sequenced for the 10 X. fastidiosa genes
The similarities among allelic profiles were visualized with an unweighted pair-group method with arithmetic averages (UPGMA) dendrogram. The UPGMA dendrogram was based on the percentage of pairwise differences between the allelic profiles of the 25 strains and was constructed using START (22). The UPGMA dendrogram was compared to a maximum-likelihood tree for the same sequence data (37).
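The distance measure underlying such a dendrogram can be sketched as follows. This Python fragment only computes the proportion of loci at which two allelic profiles differ; it does not reproduce the START program, and the ST labels and profiles are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def profile_distance(p: Sequence[int], q: Sequence[int]) -> float:
    """Proportion of loci at which two allelic profiles carry different alleles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(p, q)) / len(p)


def distance_matrix(profiles: Dict[str, Sequence[int]]) -> Dict[Tuple[str, str], float]:
    """Pairwise distances for every unordered pair of profiles."""
    return {
        (s, t): profile_distance(profiles[s], profiles[t])
        for s, t in combinations(sorted(profiles), 2)
    }


# Hypothetical 10-locus profiles, not the profiles of Table 3.
toy_profiles = {
    "ST_x": (1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
    "ST_y": (1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
    "ST_z": (3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
}
dists = distance_matrix(toy_profiles)
```

A matrix of this kind could then be passed to any average-linkage clustering routine (for example, `scipy.cluster.hierarchy.linkage` with `method="average"`) to obtain a UPGMA tree. Note that the measure counts only whether alleles are identical, not how many nucleotides separate them.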
Recombination in clonal complexes.
The alleles within a clonal complex were compared to the alleles of the ancestral ST, and the number that differed by only 1 bp was used to approximate the number of point mutations (13). The number that differed at multiple sites was used as a measure of recombination events. The role of recombination relative to the role of mutation in creating clonal diversity was measured by determining the ratio of recombination to mutation (r/m ratio) per allele (where each event results in one change) and per nucleotide (where each mutation results in one change, but each recombination results in more than one change).
Rather than restrict our recombination analysis to single-locus variants of the ancestral type (which differ from the ancestral type at only one locus) (13), we expanded our sample to include STs that differ from the ancestral form at multiple loci. This modification could have led to an overestimate of the recombination rate, since it allowed for a longer evolutionary separation of sequences that increased the probability of a pair of point mutations occurring in the same gene. For this reason the occurrence of each 2-bp "recombination" was examined in detail.
In closely related clonal complexes, a single ancestral sequence type carrying all of the ancestral alleles is often difficult to identify. In such cases, we did not use a single ancestral sequence type but instead used a parsimony-based approach, in which the least derived allele at each locus was used as a basis for comparing variant alleles.
We also estimated homologous recombination using the method of Sawyer (35), as implemented in the START program (22). This method focuses on sites that exhibit silent (synonymous) polymorphism across the whole data set. For each gene, these sites are compared for all pairs of alleles. For each pair, the gene is partitioned into fragments, and a fragment is defined by the region between a site that is "discordant" (different) between the pair and either the next such discordant site or the end of the gene. The length of the fragment is determined by two methods. The length of a "condensed" fragment (in nucleotide base pairs [bp]) is the number of "concordant" sites that it contains, where a concordant site is a silent polymorphic site that is identical in the pair being compared. The length of an "uncondensed" fragment is the traditional length of a DNA sequence (i.e., the number of all nucleotide sites that it includes). If the length of the fragment is greater than expected by chance, recombination is indicated. Sawyer's test statistics are calculated by adding the squared lengths of the condensed fragments or the squared lengths of the uncondensed fragments (9, 35). Since two tests were performed, a sequential Bonferroni correction was applied, where a P value of <0.025 is necessary for the first significant result and a P value of <0.05 is necessary for the second significant result.
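The condensed-fragment statistic described above can be sketched as follows. This is a simplified Python illustration, not the START implementation: it assumes each allele has already been reduced to its states at the silent polymorphic sites (in gene order), it treats every run of concordant sites bounded by discordant sites or by the ends of the gene as a fragment, and it omits the step that assesses whether the observed value exceeds chance expectation. All names and sequences are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def condensed_fragments(a: str, b: str) -> List[int]:
    """Lengths of condensed fragments for one pair of alleles.

    `a` and `b` hold only the silent polymorphic sites. A fragment length is
    the number of concordant sites between consecutive discordant sites
    (or between a discordant site and an end of the gene).
    """
    lengths = []
    run = 0
    for x, y in zip(a, b):
        if x == y:
            run += 1
        else:
            lengths.append(run)  # a discordant site closes the current fragment
            run = 0
    lengths.append(run)  # fragment running to the end of the gene
    return lengths


def sum_of_squared_fragments(alleles: Sequence[str]) -> int:
    """Sum of squared condensed fragment lengths over all pairs of alleles."""
    return sum(
        length * length
        for a, b in combinations(alleles, 2)
        for length in condensed_fragments(a, b)
    )


# Hypothetical alleles reduced to six silent polymorphic sites.
toy_alleles = ["AAAAAA", "AAGAAA", "AAAAAT"]
statistic = sum_of_squared_fragments(toy_alleles)
```

Squaring the lengths gives extra weight to unusually long shared stretches, which is the signal expected when a segment has been transferred between otherwise divergent alleles.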
The method of Betran et al. (2), implemented in the DnaSP program (32), was also used to detect recombination events. This method is based on detecting regions of congruence between alleles in different designated subgroups (in this case, clonal complexes). The observed recombination length (L) (in nucleotides) is estimated as follows: $L = T_R - T_L + 1$, where $T_L$ and $T_R$ are the left and right site positions of the outermost informative nucleotide sites of a congruent recombination exchange, respectively.
Recombination events were also ascertained by visual inspection. Variant alleles in a clonal complex or closely related clonal complexes were compared to other alleles in the data set. If three or more changes in the variant allele were shared with an allele in a different clonal complex, then recombination was assumed to have occurred.
TABLE 3. Allelic profiles of 25 X. fastidiosa isolates divided into six clonal complexes linking isolates sharing at least 70% of their alleles
FIG. 1. Dendrogram showing the relationships between the clonal complexes based on UPGMA from the matrix of pairwise percentage differences between the allelic profiles of the 25 isolates. The clonal complexes (CC1 to CC6) were determined with the eBURST program, and 7 out of 10 alleles was the criterion for inclusion in a complex. The dotted line indicates the region demarcating the designation of six clonal complexes.
FIG. 2. Maximum-likelihood phylogeny of 26 X. fastidiosa strains based on 9,307 bp using a general time reversible model with gamma distribution and invariant sites. The numbers above and below the lines at nodes indicate maximum-likelihood bootstrap support and Bayesian posterior probabilities, respectively. The clonal complexes (CC1 to CC6) were determined with the eBURST program, and 7 out of 10 alleles was the criterion for a complex.
95% support. Relaxing the stringency defining a clonal complex to 6 shared alleles out of 10 (instead of 7 out of 10) resulted in four clonal complexes (i.e., two fewer clonal complexes) since CC3 to CC5 form a single complex (corresponding to X. fastidiosa subsp. multiplex), while using 5 shared alleles out of 10 resulted in the combination of CC6 with the multiplex complex for a total of three clonal complexes, corresponding to the three major groups shown in Fig. 2. The tendency of CC3 to CC5 and then CC3 to CC6 to collapse into single clonal complexes is apparent from both the UPGMA and maximum-likelihood trees (Fig. 1 and 2). CC3 to CC5 formed a clade with 100% support, while CC3 to CC6 received 96% support. Increasing the stringency to 8 out of 10 alleles created three singletons (ST6, ST18, and ST19) and five clonal complexes. Finally, using the criterion of Feil et al. (12) for defining a clonal complex as the ancestral type and its associated single-locus variants resulted in designation of four clonal complexes, CC1 (with ST5 and ST6 excluded), CC2 (with ST9 excluded), CC3, and CC4, plus seven singleton genotypes. The effects of gene sampling were examined by limiting the number of loci chosen to seven (the suggested standard for MLST data sets). When a criterion of five shared loci out of seven was used to define a clonal complex, depending on the choice of genes, between three and six clonal complexes were identified. This variation was due to the erratic behavior of CC3 to CC6. With some choices these complexes remained distinct; with other choices CC4 and CC5, CC4 and CC6, or CC5 and CC6 combined or CC3 to CC6 combined. Some of these reduced data sets also resulted in ST6, ST18, and ST19 being assigned as singleton sequences.
A subset of seven genes that excluded the cell surface genes rfbD and pilU could be selected, which retained 16 of the 19 STs found in all 10 genes. These seven housekeeping genes are holC, nuoL, leuA, gltT, cysG, petC, and lacF (Table 3). When the clonal complex criterion of shared alleles for five out of seven genes was used, this set of seven genes continued to identify the same six clonal complexes. The remaining housekeeping gene, nuoN, was not used in the final seven-gene MLST set because including it would have split CC6 into two singletons.
The MLST set of seven genes exhibits a rate of evolution that varies symmetrically by a factor of about 2 above and below the mean rate of 1.99 (relative to the slowest) and has an average dN/dS of 0.169, with a relatively narrow range (0.08 to 0.32), all of which are well below the criterion for positive selection of >1.00 (Table 4). There is no indication that any of these genes are subject to unusual evolutionary behavior.
TABLE 4. Rates of evolution of genes relative to each other, dN/dS ratios, and number of alleles identified at each locus
A total of 10 allelic changes were putatively assigned as recombination events, compared to 22 changes that were assigned as point mutations (Table 5). At the nucleotide level, a total of 71 base pair changes were estimated to have occurred by recombination, compared to 22 changes that occurred by point mutation (Table 5). The ratio of the contribution to diversity by recombination to the contribution to diversity by point mutation (r/m ratio) is 0.45:1 at the allelic level and 3.23:1 at the individual nucleotide level.
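The two ratios follow directly from the counts quoted above, as the short Python check below shows. The variable names are ours; the counts are those reported in the text for Table 5.

```python
# Counts reported in the text (Table 5).
recomb_allelic_changes = 10       # allelic changes assigned to recombination
point_mut_allelic_changes = 22    # allelic changes assigned to point mutation
recomb_nucleotide_changes = 71    # base pair changes attributed to recombination
point_mut_nucleotide_changes = 22 # base pair changes attributed to point mutation

# r/m per allele: each event counts once.
rm_per_allele = recomb_allelic_changes / point_mut_allelic_changes

# r/m per nucleotide: each recombination contributes all of its changed sites.
rm_per_nucleotide = recomb_nucleotide_changes / point_mut_nucleotide_changes

# Average number of nucleotide changes introduced by one recombination event.
mean_changes_per_recombination = recomb_nucleotide_changes / recomb_allelic_changes

print(round(rm_per_allele, 2), round(rm_per_nucleotide, 2), mean_changes_per_recombination)
```

The per-nucleotide ratio exceeds the per-allele ratio because each putative recombination event changes about seven sites on average, whereas each point mutation changes one.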
TABLE 5. Assignment of recombinants and point mutations for clonal groups on a per allele and per nucleotide basis
The Sawyer test (35) showed significant recombination in 1 of the 10 genes, cysG (Table 6). Two other genes, nuoL and rfbD, showed weak indications of recombination consistent across both tests (P < 0.10), and the pilU gene showed similar weak evidence (P < 0.10) based on the condensed fragment analysis but no indication of recombination based on the uncondensed analysis (Table 6).
TABLE 6. Results of Sawyer's test for the genes with indications of homologous recombination
TABLE 7. Numbers of recombination events suggested by different methods
The six clonal complexes were based on a 70% criterion for grouping STs into clonal complexes; i.e., a clonal complex was defined as a network linking STs with allelic identity at 7 or more of 10 loci. Relaxing the identity criterion to 6 of 10 loci collapsed CC3 to CC5 (ALS, OAK, and PP strains) into a single group corresponding to X. fastidiosa subsp. multiplex. The remaining complex (CC6) was composed of a pair of ALS strains that Schuenzel et al. (37) set apart from the three subspecies since they include sequences characteristic of all three taxa.
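The network definition used here, and its sensitivity to the identity criterion, can be illustrated with a small Python sketch. It links any two STs that share at least a given number of alleles and reports the connected groups. It is a simplified stand-in written for illustration and does not reproduce the eBURST program; the ST labels and profiles are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def shared_alleles(p: Sequence[int], q: Sequence[int]) -> int:
    """Number of loci at which two allelic profiles carry the same allele."""
    return sum(a == b for a, b in zip(p, q))


def clonal_complexes(profiles: Dict[str, Sequence[int]], min_shared: int) -> List[List[str]]:
    """Group STs into networks in which each ST shares at least `min_shared`
    alleles with at least one other member (single-linkage grouping)."""
    parent = {st: st for st in profiles}

    def find(x: str) -> str:
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, t in combinations(profiles, 2):
        if shared_alleles(profiles[s], profiles[t]) >= min_shared:
            parent[find(s)] = find(t)

    groups: Dict[str, List[str]] = {}
    for st in profiles:
        groups.setdefault(find(st), []).append(st)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical 10-locus profiles, not the profiles of Table 3.
toy_profiles = {
    "ST_a": (1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
    "ST_b": (1, 1, 1, 1, 1, 1, 1, 2, 2, 2),  # shares 7 alleles with ST_a
    "ST_c": (1, 1, 1, 1, 1, 1, 3, 3, 3, 3),  # shares 6 alleles with ST_a and ST_b
    "ST_d": (4, 4, 4, 4, 4, 4, 4, 4, 4, 4),  # shares none
}
strict = clonal_complexes(toy_profiles, min_shared=7)
relaxed = clonal_complexes(toy_profiles, min_shared=6)
```

With the 7-of-10 criterion the toy data give one two-member complex and two singletons; relaxing the criterion to 6 of 10 merges ST_c into the complex, which is the same kind of collapse described above for CC3 to CC5.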
The MLST approach allows each strain to be (i) defined by its allelic profile as a particular ST and (ii) grouped by a simple criterion of allelic identity among STs into clonal complexes. In the case of X. fastidiosa, it appears that 70% identity groups STs into plant host pathovars, while a 50 to 60% criterion groups STs at a broader subspecific level. The simplicity of MLST compared to a phylogenetic approach is a clear advantage in enabling communication of information when the spread of pathogenic strains is tracked and in facilitating rapid recognition of an unusual isolate. This simplicity has considerable practical value when, as is often the case, data sets involve hundreds of strains (10, 11, 23). The computational demands for analyzing such large data sets make the use of phylogenetic methods impractical.
The initial MLST analysis employed data for 10 loci. Reduction to a set of seven loci (the MLST standard) retained the same six clonal complexes when the criterion for grouping STs was kept at roughly 70% (five out of seven loci). All seven loci showed fairly homogeneous evolutionary characteristics, with dN/dS ratios typical of moderately constrained genes (Table 4).
The groups corresponding to the clonal complexes are found on both UPGMA and maximum-likelihood trees. However, the UPGMA approach for validating clonal complexes can be misleading. First, the complete set of six clonal complexes identified by the 7-out-of-10-shared-allele criterion can be recovered from the UPGMA dendrogram only in a very narrow window between allelic pairwise distances of 0.43 and 0.47 (Fig. 1). On one side of this window, CC4 and CC5 are combined, while on the other side, PD14 (ST6) would be removed from CC1 and designated a singleton. Second, the close relationship of X. fastidiosa subsp. fastidiosa (PD strains) and X. fastidiosa subsp. sandyi (OLS strains) could not be detected from the UPGMA tree because these taxa do not share any alleles. In contrast, at the nucleotide level, there are numerous synapomorphies linking the two clades, so the sequence-based maximum-likelihood method strongly recovers PD and OLS strains as sister clades. In general, support for the validity of clonal complexes should be based on a phylogenetic analysis of the sequence data. The phylogenetic approach always provides more information unless recombination rates are extremely high. Frequent recombination randomizes the phylogenetic signal; however, it also undermines the usefulness of the clonal complex since associations of alleles become fleeting.
Both the MLST analysis and the maximum-likelihood phylogeny identified one clonal complex (CC6, consisting of strains ALS12 and ALS22) that groups close to the multiplex strains but is distinct due to the presence of a number of recombination events with the other subspecies, X. fastidiosa subsp. fastidiosa and X. fastidiosa subsp. sandyi (37). Since recombination had not previously been reported in this species of bacteria, an initial estimate of the influence of recombination in X. fastidiosa was made. Despite the low number of strains used, our data suggest that X. fastidiosa is roughly one-half as likely to gain a new allele by recombination as by point mutation. However, an individual nucleotide is approximately three times more likely to change as a result of recombination than as a result of a point mutation, since in X. fastidiosa a single recombination results in, on average, about seven nucleotide changes (Table 5).
This estimate was based on classification of allelic changes due to single and multiple base differences as due to point mutation and recombination, respectively. A potential bias that diminishes the estimated role of recombination is the possibility that some single base changes may be due to recombination (13, 15, 17). On the other hand, two potential sources of error bias the estimate in favor of recombination. One of these sources is the use of long genes. We used genes whose lengths varied from 345 bp to 1,824 bp. Longer gene segments are more likely to accumulate multiple point mutations that would be interpreted as marking recombination events. However, the bias due to length is quite small. To confirm this, we applied two kinds of correction: first, the criterion for recombination in longer genes was increased proportional to the length, and second, we reanalyzed the data with the longer genes divided into shorter independent segments of about 500 bp. Neither of these corrections significantly affected the results (data not shown). A second factor that could bias the results in favor of recombination was our use of multilocus variants rather than just single-locus variants in the comparison with the estimated ancestral type. This change had the effect of increasing the time scale of the comparisons (since multilocus variants are generally older than single-locus variants, particularly if the recombination rate is low). Increasing the time scale increases the risk of multiple point mutations. This problem decreases with sample size, since intermediate (single point mutation) alleles are likely to be observed. A possible example of this effect is a putative recombination in the leuA gene identified by the MLST and DnaSP methods. This case involves two shared changes separated by a single base pair in CC3 to CC5 compared with CC1, CC2, and CC6. These two changes could represent shared mutations that accumulated since the separation of CC3 to CC6 from CC1 and CC2, combined with recombination between CC6 and CC1 or CC2, or they could represent two shared mutations accumulating in the much shorter time since the common ancestor of CC3 to CC5 (Fig. 2).
Sawyer's test has the advantage of providing a statistical test for recombination events; however, it provided significant support for only one occurrence (corresponding to one of the four recombination events ascertained from visual inspection). Part of this lack of power arises because Sawyer's test cannot detect recombination when the recombined region is larger than the gene. For example, allele 1 of pilU is the ancestral allele for the PD clonal complex. This allele has completely recombined in ALS12 and ALS22 (Table 3). Also, Sawyer's test raises the problem of multiple testing. When the goal is to estimate the overall rate of recombination, each test is not strictly independent, so the significance values should be corrected in a table-wide manner (in this case for 20 tests [10 genes x 2 types of test]). This results in no value being returned as significant and illustrates the lack of power inherent in Sawyer's test.
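The effect of moving from a per-gene to a table-wide correction can be seen in a short Python sketch of a sequential (Holm-type) Bonferroni procedure, which we assume is the kind of correction meant here because it reproduces the 0.025 and 0.05 thresholds given for two tests. The function name and the P values are hypothetical placeholders, not values from Table 6.

```python
from typing import List, Sequence


def sequential_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm-type sequential Bonferroni: the smallest P value is compared with
    alpha/m, the next with alpha/(m - 1), and so on, stopping at the first
    failure. Returns one significance flag per input P value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            significant[i] = True
        else:
            break
    return significant


# Two tests for one gene: thresholds are 0.025, then 0.05.
per_gene = sequential_bonferroni([0.02, 0.04])

# Table-wide correction over 20 tests: the first threshold drops to 0.05 / 20.
first_threshold = 0.05 / 20
table_wide = sequential_bonferroni([0.01] + [0.5] * 19)  # placeholder P values
```

A hypothetical result with P = 0.01 would fall below the per-gene threshold of 0.025 but not below the table-wide threshold of 0.0025, which is how a correction over 20 tests can leave no value significant.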
Of the methods used for identifying recombination, the DnaSP method appears to be the most reliable, since it identifies all of the recombination events identified by visual inspection, plus one more (the leuA example discussed above) (Table 7). The multilocus variant method has the advantage that it provides a direct estimate of the relative contributions of mutation and recombination; however, the method may assign additional events as recombinations that could potentially be point mutations. Additional sampling should help uncover whether the alleles that differ from other alleles at only 2 to 4 bp represent cases of multiple point mutations alone or if recombination was involved. Finally, Sawyer's test appears to be much too conservative, failing to identify recombination events identified by visual inspection.
Compared to other bacteria, X. fastidiosa appears to have low ratios of recombination to point mutation on a per allele basis (0.45:1) and on a per nucleotide basis (3.23:1). For example, Streptococcus pneumoniae (ratios, 8.9:1 and 61:1) and Neisseria meningitidis (ratios, 4.75:1 and 100:1) (13) have ratios that are shifted more than 10-fold in favor of recombination. This strong bias toward recombination being the dominant force in the generation of new alleles persists even when Escherichia coli, a proteobacterium more closely related to X. fastidiosa, is considered. Guttman and Dykhuizen (19) found a recombination rate per nucleotide that was 50 times greater than the mutation rate for E. coli. The low rate of recombination in X. fastidiosa suggests that the phylogeny (Fig. 2) is the true evolutionary history. A similar, largely clonal pattern has been documented for Pseudomonas syringae (34).
The MLST method clearly offers an excellent opportunity for strain typing and cataloguing diversity within a bacterial species. Our database of 10 genes (9.3 kb) differentiated 19 STs from 25 strains, and a subset of five genes retained the same level of sequence type diversity. Since the suggested number of loci for an MLST data set is seven (16), we suggest that the holC, nuoL, leuA, gltT, cysG, petC, and lacF genes be used as the basis of MLST typing in this species. A database for this purpose has been established at http://www.mlst.net.