Applied and Environmental Microbiology, October 2008, p. 6452-6456, Vol. 74, No. 20
0099-2240/08/$08.00+0 doi:10.1128/AEM.01394-08
Copyright © 2008, American Society for Microbiology. All Rights Reserved.

Thomas Junier,2 and Karl-Paul Witzel3
Environmental Microbiology Laboratory, Ecole Polytechnique Federale de Lausanne, CH-1015 Lausanne, Switzerland,1 Computational Evolutionary Genomics Group, University of Geneva, CH-1211 Geneva, Switzerland,2 Max Planck Institute for Evolutionary Biology, 24306 Ploen, Germany3
Received 21 June 2008/ Accepted 19 August 2008
Specialized software can support the design and interpretation of T-RFLP experiments at two levels: (i) digestions of reference sequences can be simulated in silico in order to find appropriate enzymes for experimental analysis, and (ii) experimental T-RFLP patterns can be associated with predicted T-RFs from sets of reference sequences in order to identify possible species in the sample. Programs available on the web, such as MICA (microbial community analysis), TAP T-RFLP from the Ribosomal Database Project, or TReFID (6, 9, 12), can be used to perform in silico digestion of 16S rRNA genes. More recently, a similar module was integrated in the phylogenetic software program ARB (10). Although programs such as ARB can handle user-defined sets of sequences from genes other than 16S rRNA genes, this requires additional steps, such as the integration and alignment of the sequences, before the simulation can be performed. To our knowledge, none of the programs available so far has been specifically designed to simulate and create T-RF data sets using arbitrary sets of DNA sequences prepared from specific targets (e.g., genes involved in any metabolic pathway) or from unpublished sequences.
An increasingly popular trend in T-RFLP analysis consists of the identification of species in the samples by associating T-RFs from experimental runs with predicted T-RFs from a set of existing sequences. However, since related organisms commonly produce T-RFs of the same length, this association can be ambiguous, requiring digestion with several enzymes to increase the confidence in the assignment (5). Therefore, automating the comparison of more-complex sets of data can contribute to the analysis and interpretation of T-RFLP data.
In this work we present the software program TRiFLe, which generates theoretical T-RFs from arbitrary sets of sequences by simulating PCR amplification and digestion with restriction enzymes. The main advantage of TRiFLe is thus that the simulation can be tailored to any desired groups of organisms, sequences from clone libraries, or specific genes. The results of the simulation can be used to design T-RFLP experiments or to compare theoretical and experimental T-RFs. The identification function included in TRiFLe allows the comparison of experimental results from several independent digestions with theoretical T-RFs from a data set of sequences. The program was validated by analyzing the diversity of ammonia- and methane-oxidizing bacterial communities in the metalimnion of Lake Kinneret (Israel) using PCR amplification, T-RFLP, and cloning of the genes amoA and pmoA.
Two different functionalities are implemented in the program. In the simulation function, the aim is to predict T-RFs from sequences, primers, and enzymes given by the user. In the identification function, the program compares results from T-RFLP experiments with a data set of T-RFs from a set of reference sequences and computes a score to predict the community composition of the sample.
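As a rough illustration of the simulation function, the digestion step can be sketched in Python as follows. This is a minimal sketch and not TRiFLe's own code: the function name `terminal_fragments` and the `ENZYMES` table are hypothetical, the amplicon is assumed to be already delimited by the primers (primer sequences included), and only palindromic recognition sites are handled.

```python
# Recognition sites and top-strand cut offsets for a few common enzymes,
# e.g. HaeIII cuts GG^CC, so the offset within the site is 2.
ENZYMES = {
    "HaeIII": ("GGCC", 2),  # GG^CC
    "MspI": ("CCGG", 1),    # C^CGG
    "AluI": ("AGCT", 2),    # AG^CT
    "TaqI": ("TCGA", 1),    # T^CGA
    "MboI": ("GATC", 0),    # ^GATC
}


def terminal_fragments(amplicon, enzyme):
    """Return (forward_trf, reverse_trf) lengths in nucleotides.

    The forward T-RF runs from the 5' end of the top strand to the first
    cut; the reverse T-RF is the labeled bottom-strand fragment, measured
    from the 3' end of the top strand back to the last cut.
    If the enzyme has no site, both values equal the amplicon length
    (an unrestricted amplicon).
    """
    site, cut = ENZYMES[enzyme]
    seq = amplicon.upper()
    first = seq.find(site)
    if first == -1:
        return len(seq), len(seq)
    last = seq.rfind(site)
    forward = first + cut
    # For a palindromic site, the bottom strand is cut (len(site) - cut)
    # bases after the start of the site in top-strand coordinates.
    reverse = len(seq) - (last + len(site) - cut)
    return forward, reverse
```

Calling `terminal_fragments(amplicon, "HaeIII")` for each reference amplicon and each enzyme gives a table of predicted T-RF lengths that can then be compared across enzymes.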
The input for the simulation of T-RFs (Fig. 1) consists of the following: (i) a FastA file containing the data set sequences, (ii) the primer sequences (entered in a text field; IUPAC ambiguity codes can be used to specify degenerate primers), (iii) the labeled primers (forward, reverse, or both), and (iv) the set of restriction enzymes. The program constructs a probabilistic model (weight matrix) from each primer and searches the reference sequences for matches to each model. The matrix is slid along the candidate sequence, and each position is scored according to the matrix. The score is expressed as a probability. If the probability is above a certain threshold, which can be set by the user (through a slider in the dialogs), the position is considered a match. Nucleotide mismatches in the candidate sequence will lower the probability score, so the threshold gives control over the number of allowed mismatches.
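The primer search described above can be illustrated with a short Python sketch. The exact weights and scoring used by TRiFLe are not given here, so the following are assumptions made only for illustration: a small probability mass (`mismatch_mass`) is spread over the bases a primer position does not allow, and the window score is the product of per-position probabilities, each normalized so that a perfect match scores 1.0. All function names are hypothetical.

```python
# IUPAC nucleotide ambiguity codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_matrix(primer, mismatch_mass=0.05):
    """Build one probability column (dict over A, C, G, T) per primer position."""
    matrix = []
    for code in primer.upper():
        allowed = IUPAC[code]
        n_other = 4 - len(allowed)
        # With code N every base is allowed, so no mass goes to mismatches.
        allowed_mass = 1.0 if n_other == 0 else 1.0 - mismatch_mass
        column = {}
        for base in "ACGT":
            if base in allowed:
                column[base] = allowed_mass / len(allowed)
            else:
                column[base] = mismatch_mass / n_other
        matrix.append(column)
    return matrix


def score_window(matrix, window):
    """Score a window; 1.0 is a perfect match, each mismatch lowers it."""
    score = 1.0
    for column, base in zip(matrix, window):
        # Bases outside A, C, G, T (e.g. N in the sequence) score 0.
        score *= column.get(base, 0.0) / max(column.values())
    return score


def find_matches(primer, sequence, threshold=0.01):
    """Slide the matrix along the sequence; keep positions scoring >= threshold."""
    matrix = primer_matrix(primer)
    sequence = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(matrix, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits
```

With these assumed weights, one mismatch at a nondegenerate position multiplies the score by about 0.018, so a threshold of 0.01 tolerates a single mismatch while a threshold of 0.5 accepts only perfect matches. A reverse primer would be searched in the same way on the reverse complement of the sequence.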
FIG. 1. Input interface of TRiFLe for simulating T-RFs (A) and graphical display of the results from the simulation (B). The different parts of the displays are indicated.
For the identification function, experimental profiles are compared with theoretical T-RF profiles generated from a set of sequences. The input data are as follows: (i) a FastA file containing the reference sequences, (ii) the primers, and (iii) a set of files containing analyzed data from a T-RFLP experiment (run files), each containing experimentally measured T-RF lengths obtained with one enzyme using one labeled primer. For the run file, the program accepts any TAB-delimited table format, and the user may define which of the columns corresponds to the experimental fragment length, allowing run files with different formats to be analyzed. Considering that experimental lengths reported by a sequencer are known to be subject to errors (5, 7), the user can correct the experimental values using the correction formula of Kaplan and Kitts (5). Although this correction was calculated for T-RFLP analysis using an ABI 310 genetic analyzer, it is so far the only such correction available, and simulations with our data sets have shown good results when T-RFLP data from other systems have been corrected (data not shown).
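Reading a run file of this kind can be sketched in Python as follows. This is an illustrative sketch, not the program's own parser: the function name `read_fragment_lengths` and its parameters are hypothetical, and the length correction is left as an optional user-supplied function because the Kaplan and Kitts formula is not reproduced here.

```python
import csv


def read_fragment_lengths(path, length_column, has_header=True, correct=None):
    """Read measured T-RF lengths from one TAB-delimited run file.

    length_column is the zero-based index of the column holding the
    fragment length; correct is an optional function applied to each
    value (for example a length correction chosen by the user).
    """
    lengths = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row if present
        for row in reader:
            try:
                value = float(row[length_column])
            except (IndexError, ValueError):
                continue  # skip blank or malformed rows
            lengths.append(correct(value) if correct else value)
    return lengths
```

Because the column index is a parameter, exports with different column layouts can be read by the same function, one file per enzyme and labeled primer.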
For the identification of the T-RFs in the experimental samples, the program displays the T-RFs that were compared (simulated and experimental), as well as the distance between them (expressed in nucleotides). Additionally, considering that the experimental lengths of amplicons that do not contain an enzyme cut (unrestricted amplicons) are usually more biased (5), TRiFLe includes an option for setting the range of fragment lengths to be included in the calculation of the distance. Since different species may produce the same T-RF length with a particular enzyme and it is not possible to accurately quantify the contribution of each of them to the peak, a particular peak can be used in more than one identification. Therefore, a larger set of enzymes can be expected to yield better identifications, since the overall distance is calculated from the combination of all the enzymes used.
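The comparison can be illustrated with a small Python sketch. The exact score computed by TRiFLe is not specified here, so this sketch assumes the simplest reading: for each reference sequence and enzyme, the distance is the absolute difference between the predicted T-RF and the nearest observed peak, and the overall distance is the sum over enzymes. The function name `identify` and the data layout are hypothetical.

```python
def identify(predicted, observed, length_range=None):
    """Rank reference sequences by their overall distance to observed peaks.

    predicted: {sequence_name: {enzyme: predicted_trf_length}}
    observed:  {enzyme: [measured peak lengths]}
    length_range: optional (min, max); predicted T-RFs outside it are
        ignored, e.g. to leave out unrestricted amplicons.
    Returns a list of (name, overall_distance, enzymes_used), sorted by
    distance and then by name.
    """
    results = []
    for name, by_enzyme in predicted.items():
        total = 0.0
        used = 0
        for enzyme, length in by_enzyme.items():
            if length_range and not (length_range[0] <= length <= length_range[1]):
                continue
            peaks = observed.get(enzyme)
            if not peaks:
                continue
            # Peaks are not consumed: the same peak may be the nearest
            # match for several reference sequences.
            total += min(abs(length - peak) for peak in peaks)
            used += 1
        if used:
            results.append((name, total, used))
    results.sort(key=lambda item: (item[1], item[0]))
    return results
```

Reporting the number of enzymes that contributed to each distance makes it easier to see when two sequences have similar scores but one of them is supported by fewer digestions.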
TABLE 1. Predicted and measured T-RFs of nifH sequences in five diazotrophic strains
FIG. 2. Validation of the identification function of TRiFLe using experimental results from a T-RFLP experiment using a water sample from the metalimnetic layer of Lake Kinneret (Israel). (A) Electropherograms of the T-RFLP analysis of pmoA and amoA PCR products digested with HaeIII, MspI, MboI, AluI, and TaqI. T-RFs from a control set of clones that were identified by the program are shown in blue and green (colors indicate different phylogenetic groups). "N.I." indicates undigested peaks that were omitted from the calculation of the distance. (B) Phylogenetic tree of the reference data set of sequences used for the identification, including 13 clones from a library prepared from the environmental sample (bold). The simulated T-RF lengths calculated with TRiFLe for each of the enzymes are given in parentheses. The phylogenetic analysis was carried out using ARB (http://magnum.mpi-bremen.de/molecol/arb/); bootstrap values are indicated by black (100%) or gray (90 to 99%). The top 20 sequences identified by the program (see Table 2) are shown in color. Sequences in blue and green correspond to clones from the control set. Sequences in red correspond to reference sequences among the top 20. b, bacterium; prot, proteobacterium; alph or alpha, alphaproteobacterium; gamm or gamma, gammaproteobacterium; str., strain.
TABLE 2. Differences between observed amoA and pmoA T-RF sizes and those predicted using the identification function of TRiFLe
We thank personnel of the Yigal Allon Kinneret Limnological Laboratory, Israel Oceanographic and Limnological Research, for their assistance during the sampling. We thank Ok-Sun Kim for testing the program and Ilonka Jäger, Tobias Lenz, Marco Pagnini, Dario Diviani, and Carlo Rivolta for their valuable comments.
Published ahead of print on 29 August 2008.
These authors contributed equally to the manuscript.