Applied and Environmental Microbiology, September 2006, p. 5734-5741, Vol. 72, No. 9
0099-2240/06/$08.00+0 doi:10.1128/AEM.00556-06
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Cardiff School of Biosciences, Cardiff University, Main Building, Park Place, PO Box 915, Cardiff CF10 3TL, United Kingdom,1 Biostatistics and Bioinformatics Unit and Institute of Medical Genetics, Cardiff School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, United Kingdom,2 Cardiff School of Computer Science, Cardiff University, Queen's Buildings, 5 The Parade, Roath, Cardiff CF24 3AA, United Kingdom3
Received 8 March 2006/ Accepted 9 June 2006
Our previous study showed that chimeras and other anomalies are continuing to be generated and submitted without comment to the public repositories (2). The presence of such high numbers of substantial anomalies in the public domain has serious implications for future efforts to accurately estimate bacterial diversity, elucidate likely phylogenetic relationships, and form correct taxonomic identifications. Consequently, there is a requirement for effective computer programs to simplify the screening process.
A number of useful, complementary approaches already exist, with Bellerophon (13) and CHIMERA_CHECK (19) being two noteworthy examples, and in our previous paper we described a new computer program, called Pintail, for screening individual sequences for errors (2). Now, we describe another program, Mallard, which develops the Pintail algorithm further so that whole libraries of 16S rRNA gene sequences can be screened simultaneously and quickly.
We demonstrate the new program's ability to screen libraries of a range of sizes from different sources. Through a detailed analysis of submissions made to public repositories during 2005, we show that the problem of unrecognized anomalies within the public domain appears to be getting worse, highlighting the need for immediate steps to be taken, by the research community at large, to minimize further database contamination.
In Mallard, the Pintail algorithm is applied to all pairwise comparisons within a multiple alignment of size n, resulting in $(n^2 - n)/2$ separate DE values, each DE value representing a unique pairwise comparison. DE values are plotted against their corresponding mean observed percentage differences, $(\sum o_i)/m$, which can be viewed as a simple measure of the evolutionary distance between sequences Sq and Ss. The larger the DE value, the greater the likelihood that either Sq or Ss (or perhaps even both) is in some way corrupt. Thus, by plotting DE values, one can immediately see which pairwise comparisons are likely to involve an anomalous sequence, since DE values generated from reliable sequences will tend to cluster close to a DE value of zero, while DE values involving anomalous sequences will tend to appear as outliers.
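The number of pairwise comparisons grows quadratically with library size, which is worth keeping in mind when preparing large alignments. The following Python sketch is illustrative only and is not taken from the Mallard source; the function names `pair_count` and `all_pairs` are hypothetical. It counts the comparisons for an alignment of n sequences and enumerates the pairs.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unique pairwise comparisons among n aligned sequences."""
    return (n * n - n) // 2


def all_pairs(names):
    """Enumerate every unordered pair of sequence names exactly once."""
    return list(combinations(names, 2))


if __name__ == "__main__":
    # A 222-sequence library yields 24,531 comparisons.
    print(pair_count(222))
```

For the 222-sequence Verrucomicrobia library, this gives the 24,531 DE values quoted in the legend to Fig. 1.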
In Mallard, outliers are identified as those DE values that appear above one of several possible cutoff lines, specified by the user and based on DE values calculated from comparisons of error-free sequences from type strains (2). Specifically, in our earlier study, we calculated DE values from a collection of 2,007 reliable type strain sequences, and the 75, 95, 99, 99.9, and 100% quantiles of the resulting plot were determined at each 1% interval along the x axis of the DE plot (2). These quantile data give roughly straight lines when plotted on a logarithmic scale, so for this study, the quantile data were simplified to the following equations: 75% quantiles, y = 2.28 log10 x + 1.00; 95% quantiles, y = 2.64 log10 x + 1.46; 99% quantiles, y = 3.12 log10 x + 1.66; 99.9% quantiles, y = 3.27 log10 x + 2.07; 100% quantiles, y = 4.37 log10 x + 1.81. Cutoff lines, generated from these equations, are offered by the program.
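As an illustration of how these cutoff equations can be applied, the Python sketch below evaluates a chosen cutoff line at a given mean observed percentage difference x and tests whether a DE value lies above it. The slopes and intercepts are copied from the equations above; the names `CUTOFFS`, `cutoff_value`, and `is_outlier` are hypothetical, and the sketch assumes x > 0 (the logarithm is undefined otherwise). It is not the program's own implementation.

```python
import math

# (slope, intercept) for y = slope * log10(x) + intercept, keyed by quantile (%).
CUTOFFS = {
    75.0: (2.28, 1.00),
    95.0: (2.64, 1.46),
    99.0: (3.12, 1.66),
    99.9: (3.27, 2.07),
    100.0: (4.37, 1.81),
}


def cutoff_value(x: float, quantile: float = 99.9) -> float:
    """Height of the chosen cutoff line at mean percentage difference x (x > 0)."""
    slope, intercept = CUTOFFS[quantile]
    return slope * math.log10(x) + intercept


def is_outlier(de: float, x: float, quantile: float = 99.9) -> bool:
    """True if the DE value lies above the chosen cutoff line."""
    return de > cutoff_value(x, quantile)
```

For example, at x = 10 the 100% line sits at 4.37 + 1.81 = 6.18, so under this sketch a DE value of 7.0 would be treated as an outlier and a DE value of 5.0 would not.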
DE outliers are caused by one, or even both, of the sequences involved in the corresponding pairwise comparison being anomalous. To identify which are the corrupt sequences, the following procedure is applied by the program. First, each sequence in the library is scored according to the number of DE outliers it is coresponsible for. The DE outliers are then ranked, in descending order, according to distance from the cutoff line. For each DE outlier, the two sequences responsible for that outlier are identified, and if neither sequence has previously been marked as anomalous, the sequence with the highest score is marked as such (or both are marked if they have the same score). In this way, a list of anomalous sequences is generated, with those that were identified first being the most likely anomalies.
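The attribution procedure described above can be sketched in Python as follows. This is a minimal reading of the prose, not the Mallard source code: the function name `flag_anomalies` and the input layout (a list of `(sequence_a, sequence_b, excess)` tuples, where `excess` is the distance of the DE value above the cutoff line) are assumptions made for illustration.

```python
from collections import Counter


def flag_anomalies(outliers):
    """Return suspected anomalies, most likely first.

    outliers: list of (seq_a, seq_b, excess) tuples, one per DE outlier,
    where excess is the distance of that DE value above the cutoff line.
    """
    # Step 1: score each sequence by the number of outliers it takes part in.
    scores = Counter()
    for a, b, _ in outliers:
        scores[a] += 1
        scores[b] += 1

    flagged = []      # preserves the order in which sequences were marked
    marked = set()    # fast membership test
    # Step 2: visit outliers in descending order of distance from the cutoff.
    for a, b, _ in sorted(outliers, key=lambda o: o[2], reverse=True):
        # Step 3: skip the outlier if either sequence is already marked.
        if a in marked or b in marked:
            continue
        if scores[a] > scores[b]:
            chosen = [a]
        elif scores[b] > scores[a]:
            chosen = [b]
        else:
            chosen = [a, b]  # equal scores: mark both
        for name in chosen:
            marked.add(name)
            flagged.append(name)
    return flagged
```

With the hypothetical input `[("chim", "s1", 3.0), ("chim", "s2", 2.0), ("chim", "s3", 1.5), ("x", "y", 0.5)]`, the sequence `chim` is marked first because it is involved in three outliers, the next two outliers are skipped because `chim` already explains them, and `x` and `y` are both marked because their scores tie.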
Mallard was written in Java 1.4 (Java Technology) and tested on Redhat 9.0 Linux, Microsoft Windows XP, and Apple Mac OS X, version 10.2. The program, along with full instructions for use, help documentation, example files, and source code, is freely available from http://www.cardiff.ac.uk/biosi/research/biosoft/. Mallard is an open-source project and is released under the terms of the GNU General Public License (http://www.gnu.org/copyleft/gpl.html).
Analysis of 16S rRNA gene libraries.
To demonstrate Mallard's utility, a selection of publicly available 16S rRNA gene libraries was analyzed. The procedure was the same for each library. A multiple-sequence alignment was prepared for each that included the sequence Escherichia coli U00096 (as the reference sequence). An explanation for the reference sequence is included in the accompanying help documentation. Each multiple sequence alignment was passed to the Mallard program and screened for putative anomalies. Each putative anomaly identified by the program was checked with BlastN (1), in conjunction with the Pintail program (2). First, a library of Verrucomicrobia-derived sequences, to exemplify a Bacteria phylum, was considered. A total of 222 near-complete (≥1,200-base) representatives of the Verrucomicrobia, as identified by the Ribosome Database Project (RDP) (4) release 9 update 36, were downloaded, along with E. coli U00096 as a reference, as an aligned file from the website http://rdp.cme.msu.edu/.
Second, a library of Crenarchaeota-derived sequences, to represent the Archaea, was analyzed. Near-complete sequences were identified from the National Center for Biotechnology Information (NCBI) online database (http://www.ncbi.nlm.nih.gov/) using the search phrase "16S[TITL] AND Archaea[ORGN] AND Crenarchaeota[ORGN] AND 1200[SLEN]:1600[SLEN]." The resulting data set of 270 sequences was checked and then aligned, along with E. coli U00096, using ClustalW (27).
Third, a coastal-marine 156-sequence clone library (AY354711 to AY354866), previously generated by our laboratory (20), was examined. Because this library consisted of partial 16S rRNA gene sequences, it was necessary to subdivide it according to the region of the 16S rRNA gene covered so that sensible alignments were obtained. All groups were aligned, along with E. coli U00096, using ClustalW.
Finally, a selection of clone libraries representing submissions to the public repositories over the last year was analyzed. Using the "View by Publication" facility on the RDP's online hierarchy browser, all libraries submitted during 2005 were identified. Of these, libraries containing ≥100 near-complete (≥1,200-base) sequences were identified. Three libraries (with 2,062, 3,635, and 11,831 near-complete sequences) exceeded our 1,000-sequence limit and were discarded. In this way, 25 libraries were selected for analysis, the near-complete sequences of which were downloaded as RDP aligned datasets (each including E. coli U00096).
Comparison with Bellerophon.
The most widely used program for checking whole gene libraries for chimeras is currently the server-based program Bellerophon (13). Bellerophon was used to analyze the Verrucomicrobia-, Crenarchaeota-, and coastal-water-derived (20) libraries described above, using the same input files prepared for Mallard.
In addition, we considered the performances of both programs in relation to two further gene libraries. First, we considered the 18-sequence gene library of Stein et al. (26), which the Bellerophon website (http://foo.maths.uq.edu.au/~huber/bellerophon.pl) uses as an example file. Secondly, we considered the recently published gene library of Walker et al. (29), selected as an example of a library containing a mixture of near-complete and partial sequences. In the latter library, records AY911480, AY911482, AY911483, AY911485, AY911493, and AY911495, although labeled as Alphaproteobacteria in origin, were in fact found to closely resemble Acanthamoeba mitochondrial 16S ribosomal DNA and so were excluded from analysis. In addition, AY911496, an example of chloroplast 16S rRNA, was excluded.
For all comparisons, the default settings for both programs were used. The same aligned input files were used for both programs. Since Bellerophon is designed specifically to detect chimeras, we restricted our analyses to the detection of chimeric records. In all cases, chimeras were confirmed and false positives were identified by using the Pintail program (2).
FIG. 1. Mallard program screenshot, illustrating a typical analysis. In this example, the library containing 222 16S rRNA gene sequences representing the Verrucomicrobia phylum is being considered. Each sequence within the library was compared with every other sequence, generating 24,531 separate DE values that were plotted against the mean percentage differences (a simple measure of evolutionary distance). Unusually high DE values are those plotted above the superimposed dotted line, and they represent comparisons in which one (or both) of the sequences is likely to be anomalous. From these outlier DE values, a list of suspected anomalies is generated (upper left-hand panel of the screenshot). Clicking on a listed sequence record causes associated DE values to be highlighted in red in the right-hand panel. Clicking on individual plotted DE values displays the underlying Pintail plot in a separate panel (not shown), and from this information, the nature of any anomaly may be discerned.
FIG. 2. Mallard-generated DE plot in detail. (A) Reproduced DE plot of the Verrucomicrobia phylum library shown in Fig. 1 with the dotted line (the 100% cutoff line) identifying unusually high DE values (outliers), which lie above the line. Each plotted DE value represents a separate sequence comparison using the Pintail algorithm, and clicking on a plotted point within the program reveals the underlying Pintail plot. (B) The plot generated from one such comparison (between the chimera AY752110 and the error-free AF050561). The solid black line represents changes in evolutionary distance between these two sequences, when aligned, as determined from a 300-base sampling window moving 25 bases at a time along the alignment (2). The solid dark-gray line represents those evolutionary distances that one might have expected had both sequences been error free (2). The disparity between these two lines reflects the chimeric nature of AY752110. Excluding this and other chimeras identified by the program from the analysis produces the plot in panel C. DE values below the dotted cutoff line result from comparisons between error-free sequences; panel D represents a typical example, with AY212657 being compared with AB154319.
Mallard lists those sequences identified as likely causes for the observed DE outliers. For example, 13 sequences are listed in the screenshot (Fig. 1); these were judged by the program to be suspicious. Mallard identifies these records only as likely anomalies, so they need to be further checked to confirm that they have not been falsely identified. To do this, Pintail is used to check individual records, as described previously (2). In this example, 11 of the 13 sequences were confirmed to be chimeras (AY942760, AM040116, AJ617868, AJ401133, AF316731, AJ401123, AB179538, AF449257, AF351215, AJ401131, and the already considered AY752110). A further sequence (Z94005) was shown to be poorly assembled, with roughly 130 bases missing from the middle of the gene. Pintail analysis of the remaining sequence (AJ401106) failed to confirm an anomaly, so this was deemed a false positive.
Rerunning the analysis with the 12 confirmed anomalies removed generated the plot illustrated in Fig. 2C. Note how only DE values below the cutoff line remain, representing comparisons between reliable sequences only. For example, by selecting the DE value indicated in Fig. 2C, the Pintail plot illustrated in Fig. 2D is obtained. Note how in this plot the observed percentage difference between the two sequences is essentially constant along the length of the 16S rRNA gene; this is typical of comparisons between reliable sequences.
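To make the idea of a Pintail-style plot concrete, the Python sketch below computes the observed percentage difference between two aligned sequences in a 300-base window moved 25 bases at a time, the window settings given in the legend to Fig. 2. It is a simplified illustration only: it treats gap characters like any other symbol, it does not reproduce the expected-difference curve or the DE calculation of the Pintail algorithm (2), and the function names are hypothetical.

```python
def window_percent_differences(seq_a, seq_b, window=300, step=25):
    """Observed percentage difference in each sliding window of an aligned pair."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = []
    for start in range(0, len(seq_a) - window + 1, step):
        end = start + window
        # Count mismatched alignment columns inside this window.
        mismatches = sum(1 for x, y in zip(seq_a[start:end], seq_b[start:end]) if x != y)
        diffs.append(100.0 * mismatches / window)
    return diffs


def mean_percent_difference(diffs):
    """Mean of the per-window observed percentage differences."""
    if not diffs:
        raise ValueError("no windows: alignment is shorter than the window size")
    return sum(diffs) / len(diffs)
```

For a pair of reliable sequences, the returned values would be expected to stay roughly level along the gene, as in Fig. 2D, whereas a chimera compared with one of its parents would tend to show a marked shift part of the way along the alignment.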
The 100% cutoff line, as shown in Fig. 1 and 2, provides a conservative estimate of anomaly numbers: some true anomalies will be missed. Typically, more anomalies can be uncovered with lower cutoff lines, but at the cost of more false positives (Fig. 3). With the Verrucomicrobia example, dropping the cutoff line to 99.9% (Fig. 3A) revealed two further anomalies (AJ244308 and AJ401118) that were previously undetected, but also one further false positive (Fig. 3B). Dropping to 99% (Fig. 3A) identified another chimera (DQ015833), but now seven false positives were identified (Fig. 3B). Reducing the cutoff line still further failed to identify any more anomalies, but the number of false positives increased greatly (Fig. 3B). Thus, choosing a cutoff line will often be a compromise between the numbers of false positives and false negatives.
FIG. 3. Impact of cutoff line choice on correct identification of anomalies. (A) DE values from the phylum Verrucomicrobia analysis are plotted, with the five possible cutoff lines superimposed. (B) The numbers of true anomalies and false positives recorded for each cutoff line show that reducing the cutoff line allows more actual anomalies to be correctly identified as such but also leads to an increased number of falsely identified anomalies. The default cutoff line for the Mallard program is 99.9%, which provides a reasonable compromise between detecting as many anomalies as possible and producing the smallest number of false positives.
Analysis of remaining gene libraries.
An equivalent analysis of 270 near-complete sequences from the archaeal taxon Crenarchaeota revealed 21 anomalies (7.8% of the records). Of these, nine were clearly chimeric (AY882843, AY861964, AY882689, AB113633, AB113628, AY882728, AB113635, AB113631, and AB113630), seven were assembly errors with missing sequence (AF425659, U71116, U71111, U71110, X99558, AY861962, and AY861949), and five were highly degenerate (AY247896, X99559, AF425658, AF169012, and AY264344).
To demonstrate the effectiveness of Mallard in handling partial sequences, a library of 156 sequences, generated from our laboratory (20), was investigated. This library contained partial sequences ranging from 655 to 1,115 bases and four near-complete (≥1,200-base) sequences. The partial sequences fell into two groups: those located at the 5' end of the 16S rRNA gene (82 sequences) and those derived from the 3' end (70 sequences). In total, 11 anomalies (all chimeras) were found (AY354817, AY354789, AY354824, AY354794, AY354776, AY354718, AY354851, AY354749, AY354852, AY354811, and AY354804). A detailed breakdown of this analysis is included as a worked example with the Mallard program help documentation.
Finally, a selection of libraries generated by other authors over the preceding year (2005) was screened. Here, analysis was restricted to putative anomalies identified by a cutoff line of 100% only; thus, our results (Fig. 4; see Table S2 in the supplemental material) have underestimated the true anomaly numbers. All but three of the 25 libraries identified were found to contain anomalies. Mallard identified 714 putative anomalies; of these, 543 were subsequently confirmed to be anomalous, 493 of which showed clear chimeric patterns (see the supplemental material for a complete list of confirmed anomalies). The average (confirmed) anomaly content per library was 9.0%, with the highest content recorded as 45.8% (Fig. 4; see Table S2 in the supplemental material).
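The proportions implied by these counts can be worked through with a few lines of Python. The three input numbers are taken from the text above; the variable names are illustrative.

```python
putative = 714    # putative anomalies identified by Mallard (100% cutoff line)
confirmed = 543   # putative anomalies subsequently confirmed
chimeric = 493    # confirmed anomalies with clear chimeric patterns

not_confirmed = putative - confirmed             # putative anomalies not confirmed
confirmed_share = 100.0 * confirmed / putative   # % of putative anomalies confirmed
chimeric_share = 100.0 * chimeric / confirmed    # % of confirmed anomalies that were chimeric

print(f"not confirmed: {not_confirmed}")
print(f"confirmed: {confirmed_share:.1f}% of putative anomalies")
print(f"chimeric: {chimeric_share:.1f}% of confirmed anomalies")
```

That is, roughly three-quarters of the putative anomalies were confirmed, and about 90.8% of the confirmed anomalies showed chimeric patterns.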
FIG. 4. Analysis of near-complete (≥1,200-base) sequences from 25 16S rRNA gene clone libraries submitted to the public repositories during 2005 (5-12, 15, 18, 21, 25, 28, 32, 33). Gene libraries are identified by the first author surname and the RDP REFID number, with the number of near-complete sequences (library size) in parentheses. The bars indicate the number of detected anomalies (identified with the 100% cutoff line) as a percentage of library size, with those anomalies confirmed as such by further investigation and false positives shown.
Comparison with Bellerophon results.
Mallard was consistently better at correctly detecting chimeras than Bellerophon, with an average of 73.1% of known chimeras being detected per library using default settings only, in contrast to Bellerophon, where only 59.8% of chimeras were correctly identified (Table 1). Mallard was also consistently better at avoiding false positives than Bellerophon, with an average of 1.9% of library records being falsely identified as chimeric in contrast to Bellerophon's mean figure of 7.2% (Table 1). Although Mallard was consistently better at detecting chimeras, Bellerophon would sometimes detect anomalies missed by Mallard. For example, Bellerophon correctly identified the Crenarchaeota records AY882694 and AY882830 as chimeric, whereas they were missed by Mallard.
TABLE 1. Comparison of the performance of Mallard with that of Bellerophon
Like Bellerophon and most sequence comparison methods generally, Mallard uses aligned sequence data and is dependent on the quality of these alignments to arrive at the correct answer. In this study, we used a mixture of ClustalW alignments and alignments downloaded from the RDP website. Unlike ClustalW, the RDP's alignment procedure takes into account 16S rRNA secondary structure when constructing an alignment. Theoretically, this should make RDP alignments more accurate than ClustalW alignments; however, in practice we found that RDP alignments were sometimes inferior. An example is the RDP alignment for the gene library of Spear et al. (25), which successfully identified four chimeras but also generated 28 false-positive results; further investigation revealed that these false positives were caused by poor alignment. Realigning them with ClustalW resolved the problem, and the four correctly identified chimeras were identified without extra false positives. We recommend that the user pay particular attention to the quality of the alignment when using Mallard, Bellerophon, or indeed any other alignment-based method.
In our previous study (2), we estimated that, overall, around 5% of Bacteria 16S rRNA gene sequence records within the public repositories have substantial errors. In our current study, we found anomaly levels of 6.8% among Verrucomicrobia records (Bacteria) and 7.8% among Crenarchaeota records (Archaea). More significantly, however, in our survey of 16S rRNA clone gene libraries submitted during 2005, we showed that the average number of anomalies per submitted library had risen to 9.0% over the course of that year. This is very likely an underestimate. Using a 100% cutoff line alone to identify putative anomalies resulted in a conservative estimate of true anomalies, and as a result, some more subtle (and not so subtle) chimeras that we know exist were excluded from our final counts.
The submitted 2005 clone libraries varied greatly in chimera content, ranging from 0 to 45.8% of the total sequence records considered. Of the 25 libraries, only 17 are currently associated with papers, and of these, the amounts of information on how libraries were constructed and checked vary greatly (for example, only nine papers actually stated that chimera detection methods were used, preventing any conclusion as to the efficacy of existing methods based on these libraries). Consequently, it is difficult to draw any conclusions as to why such a variation in chimera content has occurred. It has been speculated that increasing the number of cycles when PCR amplifying DNA can increase the chances of chimera formation (30), although no correlation between chimera generation and cycle number could be detected in the current study. The harshness of the DNA extraction method used has also been implicated in chimera formation, but even recourse to "gentle" DNA extraction methods involving detergents or enzymes does not appear to reduce the problem (17), and certainly there is insufficient information available to draw any conclusions in this regard from the 2005 clone libraries considered.
It would appear, therefore, that chimeras within 16S rRNA gene clone libraries are inevitable, at least with current PCR methodologies. Previously, it had been estimated that up to 30% of individual PCR-generated clone libraries were likely to be chimeric (17, 30, 31). We cannot comment on how many chimeras were originally generated by the researchers considered in this study, but we note that libraries with up to 45.8% chimeras are being submitted without comment to the public repositories. Serious anomalies are polluting the public repositories to such an extent that their usefulness is being surreptitiously and progressively compromised. The effects are already being felt; for example, some putative chimeras were especially difficult to check during the current study because so many anomalies had been submitted for the taxa they supposedly represent.
This study indicates that most libraries submitted during 2005 contain misleading anomalies, and the average anomaly content per library is estimated to be 4 percentage points higher than the 5% estimated previously for the public repository overall. Moreover, our results show that the vast majority of these errors are now chimeras, the most insidious and misleading of anomalies. At least 90.8% of the anomalies considered in this study had chimeric patterns, which contrasts dramatically with the 64.3% of anomalies reported previously (2). Our previous study showed that between 1993 and 2004 a steadily increasing number of chimeras were submitted to the NCBI database (2), at least among the phyla investigated by that study. Overall, we conclude that the specific problem of chimeric 16S rRNA sequences in the public databases is at best not improving and at worst is becoming more acute. We offer our software free to the wider research community in the hope that it will complement existing methods to ensure that as few chimeras and other anomalies as possible are submitted in future.
Supplemental material for this article may be found at http://aem.asm.org/.