Previous Article | Next Article ![]()
Applied and Environmental Microbiology, November 2006, p. 7286-7293, Vol. 72, No. 11
0099-2240/06/$08.00+0 doi:10.1128/AEM.01398-06
Copyright © 2006, American
Society for
Microbiology. All Rights Reserved.
,
,¶ and
James M. Tiedje
Center for Microbial Ecology, Michigan State University, East Lansing, Michigan
Received 16 June 2006/ Accepted 5 September 2006
|
|
|---|
|
|
|---|
Although the advantages of the MLST method over several traditional methods for strain genotyping are now well documented in the literature (4, 6), several issues remain less clear. Most importantly, it remains unknown how well the phylogenetic relationship based on eight loci, which are selected primarily based on being unlinked in the genome (e.g., to avoid hitchhiking of selection and recombination events) and offer conserved sites for PCR primer design (14) rather than on their phylogenetic robustness, approximates the real phylogenies of the strains studied. It is also largely uninvestigated how comparable the results derived by different sets of genes are and which functional classes of genes provide better phylogenetic markers. Last, it remains unclear whether a smaller number of loci, which would substantially economize on sequencing cost and time, could give results comparable to those based on eight loci. The principal reason that these issues remain largely uninvestigated is the lack of a robust and highly accurate method or measurement that can be used as a reference standard to compare the phylogenetic informativeness of every gene in the genome and hence identify the best-performing genes for phylogenetic purposes. Even if such a measurement were available, however, the exact methodology of how to perform the analysis at the whole-genome level would remain challenging.
The recent availability of genomic sequences for a number of closely related bacterial strains has made it possible for the first time to construct highly accurate phylogenetic reconstructions among the strains (3, 11). Such reconstructions may be used as a reference standard to provide new insights into the issues described previously. Toward these goals, we have analyzed four important bacterial groups that are currently best represented with genomic sequences, and these genomic sequences sample various levels of resolution within and between closely related species. Using state-of-the-art methods for phylogenetic analysis, we built whole-genome-based phylogenies for the members of a group and used them to investigate the performance of every gene in the genome for phylogenetic purposes. A much more accurate phylogenetic reconstruction was achieved by using just three of the best-performing genes identified by our analyses compared to classical MLST approaches that employ six to eight genes. Our results are expected to have important practical implications for how strain genotyping and phylogenetic analysis are performed, particularly among closely related organisms.
|
|
|---|
Conserved gene cores.
For each group, the conserved gene
core, i.e., the genes that are shared by all members of the group, was
determined using the following strategy. All annotated genes in one of
the published genomes of the group (hereafter referred to as the
reference genome for the group) were searched against the genomic
sequences of the remaining genomes of the group by using the BLASTn
(nucleotide search) algorithm, release 2.2.9
(2). The best match, when
it showed better than 50% identity over at least 70% of the length of
the gene in the reference genome, was extracted from the genomic
sequence with a custom PERL script and searched back (BLASTn) to the
reference genome's genes to identify the reciprocally
best-match-conserved (and presumably orthologous) gene set. The genes
of the reference genome that were reciprocally best match conserved in
all genomes of the group constituted the conserved gene core for the
group. The previous strategy circumvented the problem of
inconsistencies in the annotations of the published genomes and the
need for annotating the draft genomes. The BLASTn algorithm was run
with the following settings: X = 150 (drop-off value
for gapped alignment), q = 1 (penalty for
nucleotide mismatch), and F = F (filter for
repeated sequences); the rest of the parameters were at default
settings. These settings give better sensitivity with moderately
diverged sequences than default settings, which target highly identical
sequences (10). Because
the groups encompass closely related genomes (all members show
>80% average nucleotide identity among themselves), using a
lower cutoff or searching at the amino acid level (as opposed to the
nucleotide level) within a group did not substantially differentiate
the conserved gene core for the group. Genes conserved between groups
were determined using an identical strategy but searching at the amino
acid level (BLASTp) and using a cutoff of 30% amino acid identity over
at least 70% of the length of the gene, to accommodate the evolutionary
distances between the groups, which were greater than the intragroup
distances
(10).
Phylogenetic gene analysis.
For every
reference gene in the conserved gene core of a group, an alignment of
all of its orthologs within the group (therefore, the number of
orthologs equals the number of genomes in the group) was built using
ClustalW software (21).
MODELTEST version 3.7 software
(17) in combination with
PAUP (20) was used to
find the most plausible evolutionary model (out of 56 models in total)
for each gene via the Akaike information criterion test
(16,
17). The best model was
subsequently used in a maximum-likelihood (ML) analysis as implemented
in the PAUP software (20)
to build a phylogenetic tree and calculate the ML-based distances
between all pairs of genomes in a particular group. For instance, for
the E. coli group that includes 12 genomes, there are 66
nonredundant pairs of orthologs (equal to the number of pairs of E.
coli genomes) for every gene of the 2,646 core genes in all 12
E. coli genomes. A whole-genome analysis based on the
concatenated alignments of all core genes of a group was also carried
out using a strategy identical to the one described abovefor individual genes. The gene-based ML distances were compared to the
whole-genome-based ML and the average-nucleotide-identity (ANI)
distances (10) for the
same pairs of genomes using the nonparametric Kendall
correlation as implemented in the SPSS Grad Package
(18). The latter was used
because neither ML nor ANI values were normally distributed.
Average-nucleotide-identity values were calculated based on the
nucleotide level identities, as computed by the BLASTn
algorithm, of only the core genes of a group, i.e., the same genes used
in the whole-genome-based ML analysis (hereafter referred to as ANIo),
as opposed to the whole genome (hereafter referred to as ANIg to avoid
confusion) in our previous study
(10). Custom PERL scripts
were used to semiautomate the process and to parse the outputs of the
software as needed.
Statistical analyses.
The statistical
significance of the correlation between the ML distances based on
individual genes and the ANIo (or whole-genome ML) distances was
evaluated for the E. coli group by using a delete-half
jackknife approach. For every gene, a random selection of 33 pairs of
genomes (of the total 66 nonredundant pairs in the E. coli
group) was made without replacement and the ML gene-based distances for
these 33 pairs were compared to the ANIo values for the same pairs of
genomes by using the Kendall
correlation test. The procedure
was repeated 1,000 times, and the distribution of the Kendall
values was used to define the 2.5 and 97.5% quantiles, which define the
lower and upper limits of the 95% confidence interval for the Kendall
values, respectively, as shown in Fig.
2.
![]() View larger version (32K): [in a new window] |
FIG. 2. Performance
of individual genes against the whole-genome average. The
individual-gene-based distance matrix for all genome pairs in a group
was compared to the ANIo matrix for the same genome pairs by using the
nonparametric Kendall correlation test. Graphs show the
distribution of the Kendall values for all core genes within
a group (1, perfect correlation; 0, no correlation). The area that
corresponds to a significant correlation (P < 0.05) is
also designated. The E. coli (A), Salmonella (B), and
Burkholderia (C) groups are shown. Panel D shows the
distribution of the Kendall values for the genes in the
E. coli group, which were significant, as determined by the
bootstrap approach (see Materials and Methods for details). n, number
of
genes.
|
correlation test. The procedure was repeated 1,000
times, and the distribution of the Kendall
values was used to
define the 95% confidence interval for the Kendall
values, as
shown in Fig. 3. Figure S1
in the supplemental material represents a graphical description of the
methodology described above for calculating Kendall
values
for individual genes and their confidence intervals based on bootstrap
analysis.
![]() View larger version (13K): [in a new window] |
FIG. 3. Upper
and lower confidence levels for the Kendall correlation
coefficients for a given number of genes. The upper (open squares) and
the lower (solid squares) 95% confidence levels for the Kendall
correlation coefficients between the averages for the ML
distances for a given number of genes (x axis) and the ANIo
distances are shown for the E. coli group. See Materials and
Methods for details on the calculation of the confidence
intervals.
|
values along the mean gene position in a given genome was determined by
the Runs test for a significance level (P) of <0.05,
their autocorrelation by a Moran I correlogram (significance
determined by 1,000 bootstrap replicates for each distance class), and
their periodicity by a Lomb periodogram
(13). Distance classes
were made so that each consisted of an equal number of observations (to
make statistical comparisons meaningful), and the first class spanned a
distance of about 100 consecutive genes. All statistical calculations
were implemented with the R statistical package
(http://cran.r-project.org/). |
|
|---|
-proteobacteria), and Burkholderia being a member of
the ß-proteobacterial phylum. Therefore, the robustness of the
conclusions obtained for one group can be tested against that for
increasingly more-distant groups. First, an ML analysis of the concatenated conserved gene core of each group (at least 1,400 genes aligned) was performed to determine the phylogenies of the genomes in a group, and the ML-based distances among the genomes were calculated. The ML analysis employed the best evolutionary model (see Materials and Methods), and thus, it represented a particularly powerful measurement of the evolutionary distances between the genomes in a group. Interestingly, the best model for sequence evolution was the same for all groups, i.e., the time-reversible model with six categories of sites and gamma distribution of variation rates among sites (i.e., GTR+I+G). Only the values for individual variables of the model were different among groups. In contrast, the best models for individual genes were more variable among genes (data not shown but available upon request).
The whole-genome-based ML
distances between all pairs of genomes in a group were compared to the
ANIo distances for the same pairs of genomes. A perfect correspondence
between the two measurements was observed for all four groups, i.e.,
r2 values of >0.98 for all groups (Fig.
1). It also appeared that the relationship was linear for ANIo distances up
to 0.1 to 0.15 (i.e., 85 to 90% identity). Beyond this
area, multiple substitutions at the same site, which are not considered
in the ANIo measurement (but are considered in the ML analysis),
presumably made the relationship nonlinear (but equally strong). In
other words, the robustness and discriminative power of the ANIo
measurement remain equally strong below 85% identity but the absolute
evolutionary distance becomes gradually and uniformly more compressed
in the units of ANIo below the 85% identity level. In any case,
however, these results suggest that ANIo distances can be
used interchangeably with whole-genome ML distances for short
evolutionary scales, which is also consistent with our previous study
(10). Furthermore, ANIo
distances are easier to realize conceptually than ML-based distances;
therefore, for the procedures indicated in the remaining
text, we used ANIo measurement unless otherwise noted. Finally, because
the conserved gene core in a group is enriched in housekeeping genes,
which tend to show higher sequence conservation than the genome
average, the 70% DNA-DNA reassociation hybridization (DDH) standard for
species demarcation in bacteria
(19,
23) corresponded to
96% ANIo in the current analysis as opposed to 94 to 95% ANIg
in our previous study using the whole-genome sequences
(10).
![]() View larger version (22K): [in a new window] |
FIG. 1. Genetic
diversity within each of the four bacterial groups studied, based on
the ML and ANIo measurements. Each square represents a pair of genomes
from one group, colored according to the group to which the genomes
belong (see legend). Whole-genome ML distances between two genomes in
the pair (y axis) are plotted against their ANIo distances
(x axis). The gray area corresponds to the current species
cutoff for
bacteria.
|
98% ANIo among themselves) that should all
belong to the same species according to the current standard
(10,
23); the E. coli
group, similar to the Salmonella group, includes genomes that
should belong to the same species but are slightly more diverse than
the Salmonella ones, i.e., most genomes show 96 to 98% ANIo
among themselves (Fig. 1).
The Shewanella and Burkholderia groups include
genomes that belong to closely related species, i.e., showing 80 to 95%
ANIo among themselves, in addition to a few genomes that belong to the
same species (Fig. 1).
Therefore, these four groups comprise at least three different levels
of resolution among closely related organisms. No genome that showed
<80% ANIo to another member of the same group was included in
the analysis, because our focus was species-level differences, while
all members of a group show >98.5% small-subunit rRNA gene
identity among themselves (<98.5% small-subunit rRNA between
groups) (data not shown).
Evaluation of the phylogenetic robustness of every gene in the genome.
The performance of individual core
genes relative to that of the whole genome was evaluated at two
different levels: (i) distance matrix comparison, in which the
gene-based ML distances among all pairs of genomes in a group were
compared to the ANIo distances for the same pairs of genomes to
identify which genes are good predictors of whole-genome-based
distances and are thus good candidates for phylogenetic analysis within
the groups, and (ii) tree comparison, in which a maximum-likelihood
Kishino-Hasegawa (KH) test
(9), including branch
lengths, was employed to test whether the gene-based phylogeny was
statistically different from the whole-genome-based phylogeny based on
the gene alignment. The consistency index and retention index as
implemented in PAUP (20)
were also computed to test whether nonsignificant P values in
the KH test were merely the result of phylogenetic noise as opposed to
a strong phylogenetic signal. In general, there was a good correlation
between showing a strong correlation with ANIo and having a low
P value in the KH test or a high consistency index (data not
shown).
Our results revealed a very interesting trend: when the
analysis included genomes from different species, i.e., covering the
whole range of evolutionary relatedness studied here (80 to 100% ANI),
most genes in the genome showed very strong correlations with ANIo
(see, for example, the Burkholderia group in Fig.
2B). In other words, robust phylogenetic reconstruction of well-separated
(i.e., not too closely related) strains is feasible for almost any of
the core genes. When the analysis included only closely related strains
of the same species, as in the E. coli group, then many genes
in the genome showed poor correlations with ANIo (Fig.
2C). The trend is nicely
exemplified in the Salmonella group, where the removal of the
S. bongori genome, the only genome that is more distantly
related to the remaining Salmonella genomes, has a dramatic
effect on the distribution of the Kendall
correlation
coefficients for the core genes of Salmonella (Fig.
2A). The correlations
observed for individual genes appear to be independent of the genomes
used. Indeed, when a jackknife analysis, using half of
the genome pairs and 1,000 replicates, was
performed for every gene in the E. coli group, the
range of Kendall
correlation coefficients was within
a 13% difference from the correlation coefficient observed with all
genome pairs in 95% of the replicates. Last, there was weak
(R2 = 0.11 for the E. coli group)
but significant (P < 0.001) correlation between the
Kendall
correlation coefficient and the length of the gene.
Data for the top 20 genes in terms of correlation with ANIo for each
group considered as well as for selected genes that have been used for
MLST studies in any of the four groups are shown in the supplemental
material and are also available through the Ribosomal Database Project
website
(http://rdp.cme.msu.edu/).
Data for all genes within a group are available from the authors upon
request.
Using a small number of genes to predict whole-genome relatedness.
In typical MLST applications, six to
eight genes are sequenced and phylogenetic analysis of their
concatenated alignment is performed to reveal the
"average" phylogeny. To investigate how well the
average for a given number of randomly selected genes correlates with
(and thus could predict) the ANIo and what the optimum number (if any)
of genes for predicting whole-genome relatedness is, a resampling
strategy without replacement was performed for the E. coli
group (see Materials and Methods for details). Our results showed that
the average distances for even two randomly selected genes almost
always showed a significant correlation (Kendall
>
0.221, P < 0.05) with ANIo distances. This represents
the worst-case scenario, where genes with poor correlation with ANIo
were included in the selection (Fig.
3, solid squares). Conversely, if the two genes were selected to be among
the best-performing genes in terms of correlation with ANIo, then the
correlation of their average value with the ANIo gave an
r2 value of
0.81 (Fig.
3, open squares). The
analysis also showed that the higher the number of genes sampled, the
higher the confidence level for the Kendall
correlation
between their average value and the ANIo, i.e., a more robust
prediction of ANIo was more likely. The lower confidence levels
appeared to substantially increase for up to about 100 genes, after
which they increased only slightly with more genes sampled. For
instance, a random selection of 10, 100, and 300 genes gave lower
Kendall
correlation coefficients of 0.33, 0.61, and 0.7,
respectively (Fig. 3,
solid squares).
In silico MLST evaluation.
As exemplified
previously by the E. coli and Salmonella groups, the
selection of genes for MLST applications that target intraspecies
diversity could be very critical because not all genes perform well at
this level (Fig. 2), and
thus, the accuracy may be very variable (Fig.
3). To further investigate
this, we selected three of the best-performing genes for the E.
coli group according to our analysis and built a phylogeny based
on the concatenated alignment of the three genes by using full-length
gene sequences. The three genes were selected primarily based on their
correlations with ANIo (the primary criterion; see above),
while the tree-based criteria (see above) were used as additional,
secondary criteria to better refine the selection of the best genes. We
then compared this phylogeny to the phylogeny built on eight genes that
are frequently used in MLST studies for the E. coli species
(1,
8,
15) and the whole-genome
phylogeny (i.e., the reference phylogeny; all genes used and their
statistics are summarized in Table
1). Our results revealed that with as few as three genes (3,300 bases in
total), a phylogeny more congruent to the whole-genome-based phylogeny
was achieved than what was found for the classical MLST application (8
genes; 9,500 bases in total). This was evident both in terms of tree
topology, i.e., with KH test P values of 0.709 and 0.02 for
the classical MLST tree and our top-three-gene tree, respectively, and
in terms of branch length and bootstrap values (all nodes have >50%
bootstrap support in the top-three-gene tree) (Fig.
4).
|
View this table: [in a new window] |
TABLE 1. Analysis
of genes that have been used in MLST studies for E. coli (top
eight) and the best-performing genes identified in this study
(lower three)a
|
![]() View larger version (16K): [in a new window] |
FIG. 4. Improved
phylogenetic reconstruction in an MLST-like application, using only
three of the genes in the genome. Three maximum-likelihood trees are
shown, one based on the concatenated alignment of all 2,635 core genes
for the E. coli group (A), one based on the concatenated
alignment of 8 genes frequently used in MLST studies for the E.
coli group (B), and one based on the concatenated alignment of 3
of the best-performing genes according to our analysis (C). Dashed
branches designate the major differences between the trees in panels B
and C and the whole-genome tree (A). nt,
nucleotides.
|
correlation) of
ANIo values did not depend on COG classification: each COG category had
a constant proportion of genes that were good predictors of ANIo, and
this was reproducible among the four groups considered (Fig.
5). Further, the genes with better correlation with ANIo were relatively
evenly distributed throughout the genome, e.g., no systematic
clustering toward the origin, the middle, or the terminus of
replication was evident (Fig.
6, outer circle). Hence, no major trend or functional bias was evident.
Some minor functional trends were evidenced, however, and are
summarized below.
![]() View larger version (14K): [in a new window] |
FIG. 5. Functional
annotations of the genes with significant correlation with ANIo values.
The number of genes with significant correlation with ANIo (Kendall
correlation > 0.221, P < 0.05)
(y axes) is plotted against the total number of genes in a COG
functional category (x axes). Using a higher cutoff for
Kendall correlation (e.g., >0.35) does not change the
results shown. The letters on the graphs correspond to the COG
individual functional categories according to the category designations
on the COG website
(http://www.ncbi.nlm.nih.gov/COG/).
|
![]() View larger version (61K): [in a new window] |
FIG. 6. Spatial
distribution of the Kendall values for individual genes in
the E. coli genome. The inner circle represents the
G + C% skew analysis of the E. coli
strain O157 (Sakai) genome, while the outer circle represents the
distribution of the Kendall values for every core gene
(n = 2,635), centered around the average Kendall
value, which is 0.46. Kendall values are
derived from the correlation analysis between the individual-gene-based
ML distances and the ANIo distances for all pairs of E. coli
genomes (see Materials and Methods for details). Blue bars represent
genes with Kendall values that are smaller than and red bars
genes with Kendall values that are higher than the average
Kendall value, while the height of the bar is proportional to
the difference from the average Kendall value. The figure was
plotted using the GenomeViz software
(7). (A) The
individual values were plotted along the mean position of each gene,
and a local fitting algorithm was used to further reveal the major
patterns in the data. (B) The presence of significant
autocorrelation, as represented by filled circles, was determined for
each distance class by a Moran I correlogram by bootstrap
analysis (see Materials and Methods for more
detail).
|
100%. For the E.
coli group, however, this proportion was only 72%. Moreover,
several COG categories, such as the informational categories (Fig.
5A, point J), which
include the ribosomal proteins, polymerases, and tRNA synthetases,
etc., as well as the categories of hypothetical (Fig.
5A, No COG) and conserved
hypothetical (Fig. 5A,
point S) proteins, had a higher proportion of
"poor-predictor" genes. Conversely, metabolism and
cellular-process categories (for example, Fig.
5A, points P, M, and C)
had a higher proportion of "good-predictor" genes
relative to the average for all categories (Fig.
5). The best-performing
genes included, but were not limited to, dehydrogenases and transport
enzymes, for example. The Salmonella group (without the S.
bongori genome), on the other hand, which encompasses a diversity
comparable to that of the E. coli group, seemed to have a more
uniform proportion of "good-predictor" genes for every
COG category than the E. coli group (Fig.
5B). These results
indicate that there may be small but qualitatively significant
differences in terms of which genes are the best predictors of ANIo,
even among very closely related groups.
The distribution of the
Kendall
values in the genome of E. coli strain Sakai
was strikingly nonrandom (Z = 5.466,
P << 0.001; Runs test), but there was no systematic
increase or decrease detected (Fig.
6A). A visual inspection
of the distribution patterns revealed instead the existence of a
wave-like structure, which was further statistically confirmed by
analysis of the Moran I correlogram, which displayed a
succession of significant correlation coefficients in peaks and troughs
along increasing distance classes (Fig.
6B)
(13). The Moran
I correlogram also revealed that successive Kendall
values were not independent of each other within the range of 80 to 100
consecutive genes (i.e., the first distance class) (Fig.
6B), i.e., that the data
were significantly autocorrelated. Further statistical analyses were
performed to reveal the periodicity of the distribution, and using the
Lomb periodogram, we also identified the most significant period to be
about 100 genes (P < 0.01), which was very consistent
with the results obtained from the Moran I
correlogram.
Future research including more genomes and
phylogenetically different groups is needed to shed light on the
underlying mechanisms leading to the observed periodicity in the
robustness of the phylogenetic signals of individual genes. It does
appear, however, that the functionality of the genes might have an
effect on this periodicity. For instance, the major drop in Kendall
values around the six o'clock position in the circle of the
Sakai genome corresponds to that of the genes in flagella and pili
operons. These genes are known to be among the most variable and most
frequently horizontally transferred genes in the genome, and hence, it
is not surprising that they showed poor Kendall
correlation
with ANIo.
|
|
|---|
Once the highly reliable markers have been identified, they can be used to robustly predict whole-genome-level relatedness (i.e., ANI) among a large collection of uncharacterized strains of a species. This has important applications for microbiology, including metagenomic surveys, e.g., in assigning genomic fragments such as fosmid or bacterial artificial chromosome clones to specific genotypes, especially for clones with identical or nearly identical rRNA genes or for clones that lack the commonly used, universal markers, such as the ribosomal proteins, DNA polymerases, and RecA and GyrB genes. In the latter case, metabolic or even hypothetical genes may constitute reliable markers (Fig. 5). Three, in our opinion, is the minimum number of genes to use in such MLST-like applications because if there is an unanticipated horizontal gene transfer (HGT) or recombination event in one of the genes in one or a few lineages, the phylogenetic signal conflicting with the remaining two genes will uncover the HGT event. Moreover, our analysis suggests that a random selection of six to eight genes would be expected to give a statistically significant prediction of whole-genome relatedness, even in the worst-case scenario, in which the genes employed are among the worst-performing ones (Fig. 3). Therefore, the MLST data available to date should be reasonably predictive of ANIo values if there are a few reference genomic sequences available for calibrating the equation between MLST and ANIo values (i.e., it is necessary to understand how conserved, at the sequence level, the genes used in MLST are relative to the genome average). This also has important applications for demarcating species based on the current standards, using the MLST approach as opposed to the cumbersome DDH method, since the ANIo values can be predicted from MLST data (and 96% ANIo corresponds tightly to 70% DDH) (10).
It also appears that even closely related groups may show significant differences in terms of which are the very best performing genes within each group (Fig. 5). For example, the orthologs of the genes that appeared to be the best within the E. coli group were typically not among the best genes for the Salmonella group (data not shown). Nonetheless,the best genes for E. coli typically showed quite strong (but not among the strongest) correlations with the ANIo for the Salmonella group. Therefore, extrapolations from one group to another group are possible but need to be done with caution. A good understanding of the extent of genetic diversity within each group and the level of evolutionary relatedness that is targeted within each group is essential for more-accurate extrapolations due to the differences between shorter (i.e., within-species) and longer evolutionary scales (Fig. 2).
The reasons for the poor correlation of individual genes with whole-genome-based relatedness at the intraspecies level are diverse and could include a lack of time for selection to act upon sequence conservation, more-frequent horizontal gene transfer and recombination events within species, varied levels of gene divergence among lineages, and intragenomic recombination, etc. It is important to point out that since the distances between strains of the same species are smaller than the distances between strains of different species, the exact same rates or levels of the processes described above will have more-dramatic effects in the phylogeny of strains of the same species. Presumably, all these processes, in combination or alone, are affecting at least a fraction of the genes in the genome, and their relative importance is a subject for future investigations. It is tempting to speculate, however, that the lack of elapsed time as opposed to HGT and recombination may have a relatively greater impact at the whole-genome level. The fact that almost all genes in the genome yield good correlations with ANIo when more-distant genomes are included in the analysis, such as when the E. coli group is expanded by including the Salmonella genomes (data not shown), favors the greater impact of the lack-of-divergence-time hypothesis.
We
obtained very comparable results for the E. coli group, in all
aspects of our analyses, when this group was expanded with eight
unpublished genomic sequences (analytical data not shown). For
instance, the linear regressions of the Kendall
values for
all individual genes to ANIo values between the 12-genome (66
nonredundant pairs of genomes used in the correlation analysis of each
gene) and the 20-genome (190 nonredundant pairs of genomes) data sets
gave an r2 value of 0.81 (Table
1). These
results underscore the robustness of our methodology and suggest that
the findings from a small number of genomic sequences may be
universally applicable to more genomes within the same group.
The analysis performed here also showed that the ANI of the conserved genes in the genome (10) could be used as the reference standard for measuring genetic or evolutionary relatedness within species or between closely related species (Fig. 1). For measurement of relatedness, ANI is much easier to conceptualize and compute than whole-genome-based ML. The fact that ANI is a simple, robust, and pragmatic measurement for all bacteria and provides a very robust representation of their phylogenetic relationships at the species and probably up to the family level greatly magnifies its importance and potential for finer-scale systematic, diversity, and epidemiological studies.
This work was supported by the National Science Foundation (awards DEB0516252 and DEB-00755564), the DOE Genomics: GTL Program (for sequencing and the Shewanella Federation), and the Center for Microbial Ecology. K.T.K. is grateful to the Bouyoukos Fellowship Program for supporting his Ph.D. studies, and A.R. acknowledges a Swiss National Foundation Fellowship for a postdoctoral position in the laboratory of J.M.T.
Published
ahead of print on 15 September 2006. ![]()
Supplemental
material for this article may be found at
http://aem.asm.org/. ![]()
¶ K.T.K.
and A.R. contributed equally to this work. ![]()
Present
address: Celsiusstrasse 1, Max Planck Institute for Marine
Microbiology, Bremen 28359, Germany. ![]()
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»