Applied and Environmental Microbiology, May 2008, p. 3257-3265, Vol. 74, No. 10
0099-2240/08/$08.00+0 doi:10.1128/AEM.02720-07
Copyright © 2008, American Society for Microbiology. All Rights Reserved.
Pacific Northwest National Laboratory, Richland, Washington
Received 3 December 2007/ Accepted 20 March 2008
In order to facilitate the investigation of the underlying processes that enable these respiratory specialists to thrive in such environments, the chromosome and plasmid of S. oneidensis MR-1 were sequenced, the genes and functions were predicted (12), and the data were deposited in GenBank (accession no. AE014299 and AE014300). A second version of the protein-encoding-gene calls for the chromosome was later produced (accession no. NC_004347) based on an alternative gene-calling strategy that resulted in the removal of 429 protein-encoding-gene predictions and the addition of 108 new ones (7). These predictions serve as an essential resource for hypothesis development and the interpretation of experimental results produced by numerous institutions around the world that are conducting research on this bacterium. However, as is likely true of any genome annotation, especially those for species like S. oneidensis MR-1 that were among the first to have their genomes sequenced, there are numerous errors in gene calling and inaccurate functional predictions. As a primary objective of sequencing centers is to provide researchers with rapid access to sequence data, the subsequent refinement of the annotation has generally been left to the research community.
Maintenance of the genome annotation requires continuous fine-tuning of the predicted positions of coding sequences in the genomes and of the functions ascribed to them. The process involves the manual evaluation of results from automated bioinformatics analyses (e.g., sequence comparison, the detection of conserved domains, and the prediction of protein localization), extensive mining of the literature for evidence of functions of homologs, the review of experimental results produced from high-throughput analyses (microarray and global proteomics analyses), and physiological and biochemical characterization, both of the organism sequenced and of mutants. While these activities are interdependent, our first major effort was focused on improving gene and pseudogene predictions. In this study, we report the discovery of three repeated elements that we propose to be miniature inverted-repeat transposable elements (MITEs) that can be mobilized by transposases encoded by insertion sequence (IS) elements within the MR-1 genome, the mapping of the positions of over 200 IS elements, and changes to the predicted gene and pseudogene counts.
Identification of mobile elements.
The termini of the IS elements were identified manually by employing several different strategies. First, each MR-1 transposase was assigned to a transposase family by BlastP analysis against protein sequences in the ISfinder database (http://www-is.biotoul.fr/) (25, 26). Based on information provided at the ISfinder site, the family identity could be used to predict the characteristics (e.g., the element size, the presence of direct repeats, and the expected sequences of insertion sites) of the associated IS element for each transposase. In rare instances in which an MR-1 transposase had high levels of sequence identity to an ISfinder entry, it was possible to precisely identify MR-1 IS termini by simply searching for terminal repeats that were located at positions equidistant from a transposase open reading frame (ORF) and had high levels of identity to IS element sequences deposited in ISfinder.
Artemis (24) was routinely used to record, view, and adjust assigned IS element positions within the MR-1 genome. Two types of data were typically used to approximate the positions of the IS termini. One involved identifying positions at which a flanking ORF was truncated or interrupted (as described above). The other involved using BlastN to analyze DNA sequences flanking the transposase gene for identity to sequences flanking identical transposase genes in MR-1 or similar genes in other Shewanella species. Where possible, elements encoding paralogous or orthologous transposases were aligned using T-coffee (21) and trimmed to produce a consensus IS (except in cases in which the element was disrupted). Where necessary, the IS termini were then further adjusted to include terminal inverted repeats, exclude flanking direct repeats, and conform to general characteristics expected for elements encoding transposases of the same family. A representative of each new type of IS element, except those that were degenerate, was deposited in the ISfinder database.
The termini of the MITEs were defined by aligning repeated regions to define the core conserved repeat and, where possible, by identifying positions of gene interruption or truncation. Once the terminal inverted repeats were identified, it was possible to identify additional shorter versions of these MITEs that either lacked a sequence between the inverted repeats or matched one end or the other of the cognate full-length MITE. Secondary structures of full-length MITEs were determined using Sfold, available at http://sfold.wadsworth.org/srna.pl (8).
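The terminal-inverted-repeat criterion used above for both IS elements and MITEs can be illustrated with a minimal Python sketch. It is not the procedure used in this study, which was manual and alignment based; the function names, the toy sequence, and the default thresholds are hypothetical. The sketch compares the 5' end of a candidate element with the reverse complement of its 3' end and reports how far the two agree.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def terminal_inverted_repeat_length(element: str, max_len: int = 50,
                                    max_mismatches: int = 0) -> int:
    """Length of the terminal inverted repeat of a candidate element.

    Position i at the 5' end is compared with position i of the reverse
    complement of the whole element (i.e., the 3' end read inward).
    Scanning stops once more than max_mismatches mismatches are seen.
    max_len and max_mismatches are illustrative defaults, not values
    taken from the study.
    """
    left = element.upper()
    right = reverse_complement(element)
    best = 0
    mismatches = 0
    for i in range(min(max_len, len(element) // 2)):
        if left[i] == right[i]:
            best = i + 1  # repeat extends through this matching position
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
    return best


# Toy element (hypothetical): 6-bp inverted repeats flanking a 9-bp core.
toy_element = "GGCTAA" + "TTTTTCCCC" + "TTAGCC"
tir_len = terminal_inverted_repeat_length(toy_element)
print(tir_len)
```

Allowing a nonzero `max_mismatches` would tolerate the imperfect repeats that are common in degenerate elements, at the cost of more false positives.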
Identification of candidate laterally transferred genes.
BlastP was utilized to identify best-hit matches in the nonredundant database. Matches to MR-1 were ignored, and the next best match was identified. Information regarding the phylogenetic origin of the top hit as well as sequence identity scores and protein sizes (for the query sequence and the hit) was extracted from the BlastP output. The DNA sequence of each MR-1 coding region was also analyzed using a custom Perl script to determine the percent G+C content at the third position of the codons, and then the mean and standard deviation for all genes were determined. Genes adjacent to ISSod25 and the integron integrase gene were manually analyzed in Artemis for the presence of putative attC sites bounded by conserved 5'-RYYYAAC and 3'-GTTRRRY motifs. In addition, sequences of attC sites available in the literature were compared, via BlastN analysis, to the MR-1 genome to assist in the identification of additional attC sites.
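The third-codon-position G+C calculation can be sketched as follows. The original analysis used a custom Perl script that is not reproduced here; this Python version is only an illustration, the three coding sequences are hypothetical toy data, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def gc3_percent(cds: str) -> float:
    """Percent G+C at the third position of each complete codon."""
    third_positions = cds.upper()[2::3]  # indices 2, 5, 8, ... are codon position 3
    if not third_positions:
        raise ValueError("coding sequence is shorter than one codon")
    gc = sum(base in "GC" for base in third_positions)
    return 100.0 * gc / len(third_positions)


# Hypothetical toy coding sequences (not MR-1 genes).
toy_genes = {
    "gene_a": "ATGGCCAAATAA",  # third positions G, C, A, A
    "gene_b": "ATGGCGGCCTGA",  # third positions G, G, C, A
    "gene_c": "ATGAAATTTTAA",  # third positions G, A, T, A
}

gc3_values = {name: gc3_percent(seq) for name, seq in toy_genes.items()}
gc3_mean = mean(gc3_values.values())
gc3_sd = stdev(gc3_values.values())  # sample standard deviation (assumption)
print(gc3_values, gc3_mean, gc3_sd)
```

Genes whose value departs strongly from the genome-wide mean can then be flagged as candidates for lateral transfer, in combination with the BlastP evidence described above.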
Proteome analysis.
Global analyses of proteins using the accurate mass and time (AMT) tag technique have been described in detail previously (16). The current database for S. oneidensis MR-1 was created based on the use of a modified FASTA file containing sequences for 4,198 proteins, including 146 deduced from "repaired" pseudogenes. Two hundred thirty proteins belong to 24 paralogous families containing identical or near-identical sequences. Hence, the count of 4,198 includes one representative of each of these paralogous families, making it easier to identify peptides that uniquely identify either a single protein or a group of nearly identical proteins. Also not included in this file are translations for 54 pseudogenes that are either represented by an intact paralog or are highly degenerate and five newly identified or otherwise missing genes (SO_0461, SO_4814, SO_4816, SO_4817, and SO_A0186). The AMT tag database included 1,545 data sets containing tandem mass spectrometry (MS-MS) data from a combination of linear trap quadrupole (LTQ), LTQ-Fourier transform, and LTQ Orbitrap instruments. These S. oneidensis MR-1 data sets were generated from samples collected from 33 different cultures prepared under different growth conditions. Included in this database were 88,040 peptide identifications associated with 3,579 proteins after filtering by previously defined methods (27). The SEQUEST score filters used included a minimum required discriminant score of 0.85 and a minimum required peptide length of 6 amino acids. For analyses discussed herein, only peptides which were observed at least three times were considered. At this observation count, a total of 3,118 proteins were represented by at least one unique peptide and 2,424 were represented by at least three unique peptides. 
An additional 128 proteins were also detected with at least one peptide (95 with three peptides) but are members of paralogous protein families with identical or near-identical sequences, making it impossible to distinguish from which genes they were expressed.
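The peptide-level filtering described above (a minimum peptide length of 6 residues, at least three observations, and counting only peptides unique to one protein) can be sketched in Python. This is an illustration under stated assumptions, not the AMT tag pipeline itself: the record layout, the peptide sequences, and the protein identifiers are hypothetical, and the discriminant-score filter is omitted.

```python
from collections import Counter


def unique_peptide_counts(peptides, min_observations=3, min_length=6):
    """Count unique peptides per protein after filtering.

    peptides: iterable of (sequence, set_of_protein_ids, observation_count).
    A peptide is counted only if it is long enough, was observed often
    enough, and maps to exactly one protein.
    """
    counts = Counter()
    for sequence, proteins, n_obs in peptides:
        if len(sequence) < min_length or n_obs < min_observations:
            continue
        if len(proteins) == 1:
            counts[next(iter(proteins))] += 1
    return counts


# Hypothetical toy records.
toy_peptides = [
    ("AAAAAAK", {"P1"}, 5),
    ("CCCCCCK", {"P1"}, 3),
    ("DDDDDDK", {"P1"}, 4),
    ("EEEEEEK", {"P2"}, 2),        # observed fewer than three times
    ("FFFFFFK", {"P2", "P3"}, 9),  # shared between paralogs, not unique
    ("GGGGGGK", {"P3"}, 3),
    ("HHK", {"P3"}, 10),           # shorter than six residues
]

counts = unique_peptide_counts(toy_peptides)
at_least_one = {p for p, n in counts.items() if n >= 1}
at_least_three = {p for p, n in counts.items() if n >= 3}
print(sorted(at_least_one), sorted(at_least_three))
```

Peptides shared by several proteins, such as the fifth toy record, correspond to the paralogous-family case noted above, in which the gene of origin cannot be distinguished.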
Accession numbers.
The updated gene annotations and genome sequence for the chromosome have been submitted to GenBank via the J. Craig Venter Institute under the original accession number (AE014299) assigned to this organism. The plasmid annotation updates were submitted by The Institute for Genomic Research under the original accession number (AE014300) over 1 year ago, but the annotation has changed since that time; the reader is therefore advised to use the data provided in the supplemental material to obtain current genome locations and annotations for genes described herein.
Mapping the termini of IS elements and interrupted genes.
S. oneidensis MR-1 encodes a large number of transposases, suggesting that IS element interruption would likely account for many pseudogene predictions. While only 59 transposases were noted in the original MR-1 genome publication (12), a reassessment of gene functions suggests that a total of 219 genes encode transposases, 54 of which are themselves pseudogenes and many of which are identical or nearly identical in sequence. Four of these genes (SO_0643/SO_0644 and SO_2654/SO_2655) are associated with the transposition of the two Mu prophages in MR-1 and, thus, are not considered further here as potential sources of gene disruption. The mapping of the termini of the IS elements was facilitated by the comparison of sequences that flank paralogous transposase genes in MR-1 or their orthologs in other sequenced Shewanella genomes and by the identification of the breakpoints in interrupted genes. By using this strategy, it was possible to predict the termini for all but four of the IS elements (see Table S3 in the supplemental material). Comparisons to transposase gene sequences deposited in the ISfinder database revealed that the diversity of IS elements carried by the MR-1 genome is quite broad: the more than 40 types of IS elements found most closely match 15 of the 19 IS families described in ISfinder.
All but six of these types of IS elements carry a single ORF predicted to encode the IS-mobilizing transposase. IS elements ISSod1, ISSod2, ISSod10, and ISSod15 each encode a transposase that is predicted to be activated by programmed translational frameshifting of the two ORFs found in the element, a frequently used strategy that limits the expression of transposase activity (2). ISSod9 is a class II transposon belonging to the Tn3 family and comprises four ORFs; there are two copies of ISSod9 on the MR-1 megaplasmid. Orthologs of the ISSod9 transposase occur in Shewanella sp. strain ANA-3, S. baltica OS155, and S. frigidimarina. Mapping of the conserved IS termini in each genome revealed that the five copies in strain OS155 encode only a transposase and a resolvase and therefore would be classified as an IS element, not a transposon. In the remaining strains, including MR-1, the IS element also encodes passenger proteins whose function is not related to IS element mobilization (see Fig. S1 in the supplemental material) and, hence, would be classified as a transposon. While the S. frigidimarina and Shewanella sp. strain ANA-3 transposons encode cation efflux pumps predicted to mediate resistance to Cd2+, Co2+, Zn2+, or Pb2+, the S. oneidensis MR-1 transposon encodes two functionally uncharacterized cytoplasmic proteins, one of which possesses a nucleotidyltransferase domain commonly found in kanamycin nucleotidyltransferases, suggesting that the acquired function may be related to antibiotic resistance.
The remaining IS element that comprises multiple ORFs is ISSod25, an IS91 family member which is found in five copies (one truncated) on the chromosome. This IS element encodes both a transposase and a phage integrase family protein and is flanked on the 5' side by GAAC and on the 3' side by CAAG (with two exceptions), as expected for members of this family. The element carrying SO_2035-SO_2036 (ISSod25_2) is of particular interest because it has previously been proposed to be a component of the MR-1 superintegron (9). A putative recombination site (attI) and integron integrase gene (SO_2037), whose activity was experimentally validated (9), are found immediately upstream of ISSod25_2, as well as immediately upstream of the three other full-length ISSod25 elements. Drouin et al. (9) identified three attC sites downstream of the SO_2037 integron integrase gene, one that was described as characteristic of S. oneidensis MR-1 [called attC (Son type) in reference 9], one that is similar to the VCR repeat in Vibrio cholerae [called attC (VCR-like) in reference 9] (6), and a third that resembles the 59-bp attC site associated with the aadA and aadB genes, which confer aminoglycoside resistance (22). This superintegron locus was identified by the analysis of sequences available prior to the final assembly of the genome sequence of S. oneidensis MR-1. However, in the final assembly, it is apparent that this region is no longer contiguous but is instead now split into two separate sites on the chromosome, each containing a copy of ISSod25 (Fig. 1).
FIG. 1. (A) Map of ISSod25-associated integron adapted from Drouin et al. (9); (B) map of corresponding loci in the final genome assembly; and (C) map of loci adjacent to other ISSod25 elements. IS elements encoding transposases and integrases are depicted as gray boxes with dark and light green arrows, respectively. The DUF568-associated repeat is shown in red, the integron integrase gene is shown in blue, and the truncated (SO_2035 and SO_4779) and interrupted (SO_2167) pseudogenes are shown in yellow. Each of the four full-length ISSod25 elements on the chromosome is immediately preceded by the sequence GTTGAAC, which matches the consensus GTTRRRY sequence found at the 3' end of the attI recombination site. Identical stretches of 69 nucleotides ending with GTTGAAC are found upstream of ISSod25_1, ISSod25_2, and ISSod25_4 and likely delineate the full-length attI site. attC Son, S. oneidensis-specific attC site; attC 50-be and attC 59-be, attC sites resembling the 59-bp attC site associated with the aadA and aadB genes; attC VCR, resembles a V. cholerae VCR repeat.
The algA (SO_2213) gene, which is found downstream of ISSod25_4, is not predicted to have been acquired by integrase activity and is only 28 bp downstream of the ISSod25_4-associated attC site, suggesting that the native promoter for this gene was lost as a consequence of ISSod25_4 insertion at this site. Because algA is conserved, it was possible to investigate whether there was evidence to support this hypothesis by comparing the upstream sequences of the orthologous algA genes in other Shewanella strains to sequences in MR-1. Interestingly, we found the algA promoter locus upstream of ISSod25_3, rather than separated from algA by ISSod25_4 (see Fig. S3 in the supplemental material). This observation, combined with the results of neighborhood analyses around the algA locus of Shewanella, suggests that a recombination event between sequences upstream of ISSod25_3 and downstream of ISSod25_4 occurred, resulting in the displacement of the MR-1 algA promoter from the expected site downstream of ISSod25_4 to the position upstream of ISSod25_3. Eight different peptides of AlgA have been detected, suggesting that ISSod25_4 carries a promoter capable of controlling the expression of this gene in MR-1.
Several additional small 3' fragments of ISSod25 occur in the genome at positions upstream of SO_0911, SO_4816, SO_1081, SO_1888, SO_3617, SO_3775, SO_4341, and SO_4704 and downstream of SO_3453. Again, genes found downstream of these ISSod25 fragments either are more similar to genes of species outside the Shewanella genus than to Shewanella genes or have low levels of similarity to Shewanella genes, suggesting that they may have been acquired by a mechanism similar to that of the acquisition of the genes found downstream of the full-length ISSod25 elements (see Table S4 in the supplemental material).
Other IS-like elements in the genome.
Three features characteristic of IS elements include the frequent occurrence of exact or near-exact copies throughout the genome, the presence of short terminal inverted repeats, and the interruption or truncation of genes. Identical conserved hypothetical proteins with a DUF1568 domain are encoded by 18 genes on the chromosome, suggesting that perhaps these proteins too are transposases or are associated with an IS element. Analyses of the regions that flank the genes encoding these proteins revealed that the proteins correspond to a conserved sequence that has terminal inverted repeats and, in seven instances, occurs adjacent to truncated genes. We therefore propose that this conserved sequence is a mobilizable element (ISSod41) and that either the associated DUF1568-containing proteins are involved in its mobilization or other transposases encoded by MR-1 are responsible for its mobilization. An unusual feature of this element is that it frequently occurs in pairs in the genome, sometimes even colocalized with an additional ISSod41 fragment. In addition, copies of a 59-bp sequence are found near the 5' end of ISSod41 and between the pair of attC sites that resemble the V. cholerae VCR repeat and the 59-bp site adjacent to the aadA and aadB genes, respectively (Fig. 1B). A perfect match to the conserved GTTRRRY integrase insert site is found near the 3' end of this conserved 59-bp sequence (5'-GACACCCATCCTTAATAGTGCGGTAGTTAACCTCCTACTATGCTTTGGTTAAGCATTGA; the matching sequence is GTTAAGC). While this observation may be coincidental, it does raise the possibility that this site is an attI site that can serve as a site for the capture of foreign DNA. Indeed, many of the ISSod41 elements are adjacent to genes that are not conserved in Shewanella and have values for GC usage at codon position 3 that differ from the MR-1 mean value by at least 10% (see Table S4 in the supplemental material).
These observations suggest that the potential roles of ISSod41- and ISSod25-encoded proteins in the acquisition of foreign genes into genomes warrant further study.
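The search for the degenerate GTTRRRY site can be reproduced with a short Python sketch that expands IUPAC codes (R = A or G, Y = C or T) into a regular expression. The helper name and the partial IUPAC table are illustrative; the sequence is the 59-bp repeat quoted above, assuming that the space inside the quoted sequence is a typesetting artifact.

```python
import re

# Partial IUPAC table: only the codes needed for the motifs in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def find_motif(motif: str, seq: str):
    """Return (0-based start, matched text) for every motif occurrence.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    pattern = re.compile("(?=(" + "".join(IUPAC[code] for code in motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


# 59-bp conserved sequence quoted in the text (space removed).
repeat_59 = "GACACCCATCCTTAATAGTGCGGTAGTTAACCTCCTACTATGCTTTGGTTAAGCATTGA"

hits = find_motif("GTTRRRY", repeat_59)
print(len(repeat_59), hits)  # 59 [(47, 'GTTAAGC')]
```

The single hit lies near the 3' end of the repeat, consistent with the description above. The earlier GTTAACC stretch is not a match because its sixth base is a pyrimidine.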
The analysis of the MR-1 genome for additional repeated DNA sequences revealed the presence of three elements, called MITEs, that have characteristics of the class II transposons (13); specifically, these MITEs are short (see Table S5 in the supplemental material), have no coding potential, and have the potential to form a stable RNA secondary structure (Fig. 2; see Fig. S2 in the supplemental material). A comparison of the terminal inverted repeats for these proposed MITEs with those of the MR-1 IS elements (Fig. 3) revealed that SonMITE_1, SonMITE_2, and SonMITE_3 termini are homologous to the termini of ISSod6, ISSod10, and ISSod22, respectively, suggesting that the respective transposases encoded by these IS elements may be able to mobilize the MITEs with similar terminal repeats. While no obvious gene disruptions resulting from SonMITE_2 insertion were found, representatives of both SonMITE_1 and SonMITE_3 interrupt genes. SonMITE_1 interrupts five genes (SO_0790, SO_0793, SO_1591, SO_2158, and SO_4423) once and one gene (SO_3976) three times, and SonMITE_3 interrupts one gene (SO_0911), further supporting the hypothesis that these elements can be mobilized by other transposases in MR-1. Several additional copies of SonMITE_1 overlap the 3' ends of genes, by as much as 92 bp in the case of SO_2196 (see Table S5 in the supplemental material). However, comparative analysis with other Shewanella orthologs suggested that this overlap would result in no significant loss of encoded protein, and hence, we chose not to annotate these genes as being truncated by SonMITE_1.
FIG. 2. Ensemble centroid structure for a full-length SonMITE_1 element. Structures of SonMITE_2 and SonMITE_3 are provided in Fig. S4a and b, respectively, in the supplemental material. ΔG°37, Gibbs free energy calculated at a folding temperature of 37°C.
FIG. 3. The alignment of SonMITE and IS element termini demonstrates high levels of sequence identity. Asterisks indicate identical residues.
Because some of the genes encode proteins that are identical or nearly identical to proteins produced by other genes in the cell (and hence have no unique peptides), 53 of the predicted pseudogenes could not be evaluated by this analysis. Of the remaining pseudogenes, 40 were matched to only one peptide, which is generally not considered sufficiently robust to validate protein expression (4). However, many of these single-hit peptides were observed in multiple MS-MS scans, suggesting that their parent proteins may in fact be expressed. It is also interesting that several of the peptides matched positions in the proteins that corresponded to sequences after the predicted mutation sites, suggesting that under at least some culture conditions, the cells were producing full-length proteins. If true, this observation would indicate either that the IS element had been excised from the site or that a subpopulation of cells lacking the interruption existed within the culture.
In addition to interrupting genes, IS element insertions can lead to the separation of promoters from nearby genes, thereby inactivating them. A total of 67 genes were identified as potentially being impacted by a nearby IS or MITE insertion event (see Table S8 in the supplemental material) at close proximity (50 bp or less) to the 5' gene end. High peptide counts were observed for proteins encoded by genes close to SonMITE_1, ISSod25, and ISSod10_3 elements, suggesting that these elements comprise promoters that can drive the expression of neighboring genes. The large chemotaxis operon (SO_2317-SO_2327), one of three found in MR-1 (15), is interrupted by ISSod4_17 (which, in turn, is interrupted by ISSod1_22) and ISSod4_18, which would suggest that this operon is nonfunctional. However, peptides from both the interrupted cheA_2 gene (SO_2320) and three of five of the downstream genes were detected, with peptides from CheA_2 corresponding to positions before and after the site of IS insertion (albeit only one peptide for each side, with each peptide observed only six to seven times). While these observations are not significant enough to validate the expression of this chemotaxis locus, they do suggest that this locus should not be regarded as being degenerate without further study.
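The proximity criterion (a mobile element within 50 bp of the 5' end of a gene) can be sketched in Python as a strand-aware gap calculation. How the distance was measured in the study is not stated, so the definition below (number of bases between the element and the gene start, with 1-based inclusive coordinates) is an assumption, and all feature names and coordinates are hypothetical.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def upstream_gap(gene: Feature, element: Feature) -> Optional[int]:
    """Bases separating an element from the 5' end of a gene.

    Returns None if the element is not located upstream of the gene.
    """
    if gene.strand == "+":
        gap = gene.start - element.end - 1   # element lies at lower coordinates
    else:
        gap = element.start - gene.end - 1   # element lies at higher coordinates
    return gap if gap >= 0 else None


def genes_near_elements(genes, elements, max_gap=50):
    """List (gene, element, gap) for elements within max_gap bp of a 5' gene end."""
    hits = []
    for gene in genes:
        for element in elements:
            gap = upstream_gap(gene, element)
            if gap is not None and gap <= max_gap:
                hits.append((gene.name, element.name, gap))
    return hits


# Hypothetical toy annotation.
toy_genes = [
    Feature("gene_a", 1029, 2000, "+"),
    Feature("gene_b", 3000, 4000, "-"),
    Feature("gene_c", 6000, 7000, "-"),
]
toy_elements = [
    Feature("IS_x", 100, 1000, "+"),
    Feature("IS_y", 4100, 5000, "+"),
    Feature("IS_z", 7011, 8000, "-"),
]

near = genes_near_elements(toy_genes, toy_elements)
print(near)
```

Genes flagged this way are only candidates for promoter displacement; as the proteome data above show, some remain expressed, presumably from promoters carried by the inserted element.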
As a result of this annotation refinement and additional assessment of coding potential, 685 genes were dropped from the annotation (see Table S7 in the supplemental material). Most of the genes that were dropped were small, with 399 predicted to encode polypeptides of less than 50 amino acids in length. Additional reasons to drop genes included the joining together of disrupted gene fragments (172 genes), an overlap with mobile elements (72 genes), an overlap with genes or other elements on the same or the opposite strand (111 genes), and the assessment that the genes were too close to or even overlapping a bidirectionally transcribed gene (81 genes). The majority of the remaining 249 genes were small (200 genes) and/or started with the rare TTG codon (100 genes) and, hence, were considered unlikely to encode proteins. It should be noted that 474 of these dropped genes are included in only one version (RefSeq or GenBank) of the MR-1 annotation, demonstrating the extent of difference in gene predictions that arises simply by employing different ORF-calling algorithms and cutoff criteria.
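The tally above can be checked with a few lines of Python. The counts are those given in the text; treating the four listed reasons as mutually exclusive is an assumption, although it is the reading implied by the reported remainder of 249 genes. The dictionary keys are shorthand labels, not terms from the annotation.

```python
dropped_total = 685

# Counts as reported in the text; labels are illustrative shorthand.
reasons = {
    "joined_disrupted_fragments": 172,
    "overlap_with_mobile_elements": 72,
    "overlap_with_other_genes_or_elements": 111,
    "too_close_to_bidirectionally_transcribed_gene": 81,
}

explained = sum(reasons.values())
remaining = dropped_total - explained
print(explained, remaining)  # 436 249
```

The 399 genes encoding polypeptides shorter than 50 amino acids are not a separate category in this tally; they overlap with the reasons listed and with the remaining 249 genes.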
Our analysis also revealed that the integron previously discovered in the partially assembled genome (9) is split into two different sites in the final genome assembly, each carrying a copy of ISSod25. An additional 2 full-length copies and 10 fragments of ISSod25 were also found in the genome. Surprisingly, most of these IS elements were in the 5' direction from one or more putative attC sites and adjacent to genes that had 3' attC sites and whose sequences were more characteristic of other bacterial phyla than of Shewanella species. ISSod25-like elements are found in other Shewanella species, including all three S. putrefaciens strains and all S. baltica strains except OS155. Among these strains, homologs of the MR-1 integron integrase are present only in S. putrefaciens 200, S. baltica OS185, and S. baltica OS233. An analysis of the two ISSod25-like elements in S. putrefaciens W3-18-1, which lacks the integron integrase, revealed flanking attI/attC sites as well as downstream genes having 3' attC sites. Numerous additional sites identical to the ISSod25 attC and VCR-associated attC repeats are also present in this genome, often in the 3' direction from multiple adjacent genes. These observations lend additional credence to the hypothesis that the ISSod25 integrase can mediate the integration of foreign DNA into the MR-1 chromosome. They also demonstrate that several other, if not all, sequenced Shewanella spp. have a means to use integrase-mediated capture of foreign DNA.
Proteome data provided evidence that only 40 of the pseudogenes are translated into proteins (see Table S6 in the supplemental material). In most cases, however, both the number of unique peptides observed and the maximum number of times any one peptide was observed were low. This finding suggests that few, if any, of these genes have expressed significant levels of protein under the culture conditions used to generate the proteome sample. It is possible that the absence of peptides for these pseudogenes reflects simply experimental limitations, specifically, that conditions required for their expression have not been tested or that expression levels are below the level of detection. However, the currently available evidence is more indicative of these genes' no longer being functional. Over one-third of the pseudogenes encode a transposase or recombinase, and many of these sequences are fragments of full-length copies carried elsewhere in the genome. Just under 20% of the predicted functions of the remaining pseudogenes are associated with environmental sensing, the control of gene expression, or transport. In some instances, multiple genes within a single functional subsystem are mutated, providing a clear indication that entire systems are decaying. Examples include one of the three chemotaxis gene clusters present in the MR-1 genome, genes with functions associated with the degradation of starch, genes for C4-dicarboxylate sensing and uptake and nitrite respiration, and genes encoding components of the type I pilus (Table 1).
TABLE 1. Selected degenerate functions in MR-1
Having mapped the mobile elements in S. oneidensis MR-1, we are now better poised to investigate their roles in the evolution of the MR-1 genome and to do the same with the other sequenced Shewanella genomes. Future research efforts that capitalize on the availability of sequences from related genomes to study genome evolution hold considerable promise for developing new insights into the roles of mobile elements and DNA recombination in the adaptation of organisms to their environment.
Genome sequencing efforts were funded by the DOE Office of Biological and Environmental Research (OBER). This research was supported by the DOE OBER Genomics: Genomes to Life program. Proteomics analysis was performed at the W. R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by OBER and located at Pacific Northwest National Laboratory.
Published ahead of print on 31 March 2008.
Supplemental material for this article may be found at http://aem.asm.org/.