Previous Article | Next Article ![]()
Applied and Environmental Microbiology, November 2004, p. 6525-6534, Vol. 70, No. 11
0099-2240/04/$08.00+0 DOI: 10.1128/AEM.70.11.6525-6534.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
C. E. Bagwell,
J.-Z. Zhou,
T. Yan,
X. Liu,
and
C. C. Brandt
Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
Received 29 August 2003/ Accepted 28 June 2004
|
|
|---|
|
|
|---|
Cloning and sequencing of functional genes from environmental samples are powerful methods for investigating the ecology of microorganisms. These techniques have advanced our understanding of the types of microorganisms and degradation capabilities found in various habitats (6, 12, 15, 43, 51). However, relating the population data generated by these techniques to environmental characteristics, such as geochemical measurements, can be challenging. One problem is the small sample size that is typical in these studies (66). The time and difficulty of generating and characterizing clone libraries that adequately cover the microbial populations often limit the sample size. Another characteristic is the large number of measurements for each sample. Finally, the underlying relationships between the microbial populations and their environment are often complex and nonlinear.
Various statistical and mathematical tools are available for relating the distributions of microorganisms to environmental characteristics. One powerful nonlinear approach that has been used to analyze such data is artificial neural networks (ANNs) (10, 17, 39, 45, 46, 50). ANNs are interconnected layers of simple computational units that map the relationships between predictor and target vectors. A computational unit sums its inputs and computes its present state from a nonlinear activation function based on this sum. Outputs from each layer are passed onto the next layer via weights that can be optimized to reflect the strength of the connection. Adjacent layers are typically fully interconnected. Bishop (7), Haykin (25), and Jain et al. (27) provide basic introductions to ANN theory.
ANN models are more general than linear methods. Hornik et al. (26) have shown that with a sufficiently complex architecture, an ANN is capable of approximating any continuous function. These approximations can be very precise if the training set is sufficiently large (64). ANNs are also more general than other classes of nonlinear statistical methods, such as general additive models, because the form of the nonlinear function does not have to be specified. They can also generalize to new data sets and degrade gracefully in the presence of noisy data. ANNs have demonstrated some of these potential advantages when applied to microbial data in studies comparing linear methods with ANNs (10, 45). ANNs are becoming a primary modeling tool in biotechnology (2).
This paper describes the use of ANN models to analyze clone libraries for two sets of functional genes (dissimilatory sulfite reductase and nitrite reductase genes) isolated at the U.S. Department of Energy's Natural and Accelerated Bioremediation Research (NABIR) Program field site in Oak Ridge, Tenn. (55). Our general approach (Fig. 1) was to divide the clone libraries into phylogenetic groups based on sequence similarity and investigate whether the groups could be linked to the geochemistry of the samples. This process began with a reduction of the geochemical data by principal component analysis. Next, linear and nonlinear ANN models were developed to relate the reduced geochemical data to the distributions of the clone groups. We examined weight decay and leave-one-out cross-validation as methods for managing generalization error in the models. The influences of the geochemical principal components on the final model predictions were assessed using a normalized importance index that was computed by measuring the proportional increase in data misfit following effective removal of each principal component from the model. The results were used to identify the geochemical measurements that had the greatest influence on the distributions of the clone groups. The critical need for assessing generalization error and the utility of the weight decay method in constructing predictive models are discussed.
![]() View larger version (31K): [in a new window] |
FIG. 1. Data analysis strategy.
|
|
|
|---|
![]() View larger version (9K): [in a new window] |
FIG. 2. Sample locations at the contaminated field site. The uncontaminated sample was taken about 5 km to the west of this location.
|
Unique nirS, nirK, and dsrAB clones from each site were selected for further sequence analysis based on differences in their RFLP patterns. PCR products were amplified (66), and DNA sequences were determined with a BigDye Terminator kit (Applied Biosystems, Foster City, Calif.) and a 3700 DNA analyzer (Perkin-Elmer) according to the manufacturer's instructions. The sequences obtained were compared with nirS, nirK, and dsr sequences from GenBank, translated into amino acid sequences, and aligned using techniques described by Yan et al. (66). Phylogenetic and molecular evolutionary analyses were conducted using MEGA version 2.1 (36), and phylogenetic trees were constructed from distance matrices using the neighbor-joining method. Trees constructed with maximum-likelihood methods were not significantly different.
Five gene groups within the nirS and nirK groups were defined based on sequence similarity as described by Yan et al. (66). However, many of the dsrAB clones appeared to be consistent with those found in a variety of environments (16, 21, 61) but very different from those found in confirmed sulfate-reducing groups. Since the function of these outlying sequences is in some doubt, we divided the dsrAB clones into two subgroups for data analysis. The subgroup designated dsrAB1 included those clones that appear to be most similar to known sulfate reducers, such as Desulfosporosinus, Desulfococcus, and Desulfosarcina (16). The subgroup designated dsrAB2 contained much more diversity and represented sequences that were relatively dissimilar to confirmed sulfate reducers. The dsrAB1 subgroup contained four classes, and the dsrAB2 subgroup consisted of five classes.
Data preparation.
Relative to the small number of samples, the nine geochemical analytes constituted a large set of potential predictors. Principal components (PCs) analysis (PCA) was used to reduce the original geochemical variables to a smaller set of predictors. Several investigators have recently employed PCA (5, 62) to orthogonally transform input variables for predictive ANN analysis. The main advantage of this approach is that the complexity of the model is substantially reduced without sacrificing much important information. On the other hand, the PCs are surrogates for the original variables, and the relationships among the original analytes and output variables (class frequencies) could be obscured.
Since the distributions of the nitrate, sulfate, uranium, Tc-99, and nickel data were skewed, they were transformed using the function log(x + 1) to approximately normalize the data prior to PCA. The first three PCs, which cumulatively accounted for 92% of the total variance, were selected as inputs for subsequent data analyses. The preprocessing step resulted in a reduction of the number of inputs from nine to three.
Both the geochemical PCs (inputs) and gene group frequencies (outputs) were normalized to values between 0 and 1 to maximize the performance of the models (7). A sigmoid transformation {y = 1/[1 + exp(ax + b)]} was applied to the geochemical PC scores because they were centered about zero but unbounded. Nonzero gene frequencies were normalized using the function y = xb/(xb + a), which has a lower bound of zero and an upper bound of 1.
Model selection.
Models were implemented using a combination of custom-designed MATLAB functions (The MathWorks, Natick, Mass.) and the Netlab (42) toolkit for MATLAB. A standard multilayer perceptron architecture with three fully interconnected layers (input, hidden, and output) was employed. The hyperbolic tangent transform was the nonlinear activation function in the hidden layer, and the logistic function was selected as the nonlinear activation function in the output layer. All of the ANNs were trained with the scaled conjugate gradient algorithm. For comparison, the same MATLAB functions were used to train a simpler class of generalized linear models (GLMs) on all of the data sets. Each GLM was computed by employing only a single node in the hidden layer with a linear activation function. This network computed a logistic function, which is linear in the input variables with nonlinear output. GLMs offer a natural, simpler alternative to ANNs for predictive modeling.
Overfitting is a concern when using ANNs to analyze small data sets such as the one described here. We used weight decay, one of the most popular and theoretically motivated methods, to improve the generalization performance of ANNs by introducing a controlled amount of regularization to the objective function (40). The form of the modified objective function was M(w) = ED(w) +
EW(w), where the vector w includes both weights and unit biases. The term ED(w) is the measure of model-data misfit. The additional term EW(w) limits the model complexity by imposing a penalty on large weights, where large weights are associated with more complex functions. The measure of model complexity can take many forms (7), but the simplest and most frequently used formulation is EW(w) = 0.5
wi2. The hyperparameter
is a positive constant that determines the penalty for model complexity. Unit biases were needed to adapt to the output range, so only the weights received a positive value for
.
We examined the effect of weight decay on generalization performance by plotting values of
versus the corresponding validation error ED. If an ANN is overfitting, the validation ED typically begins at a relatively high value, gradually decreases to a minimum, and then starts to increase again as
increases. For a large
value, ED converges to a value that corresponds to the performance obtained by the simplest possible model, i.e., prediction of the mean output. Thus, the plot defines a smooth path from a complex ANN to the simplest model. The minimum of the weight decay function identifies the value of
that produces the lowest generalization error. For this study, the values of
selected for nirS, nirK, dsrAB1, and dsrAB2 were 1.0, 0.35, 1.0, and 0.001, respectively.
K-fold cross-validation is a well-established method of using an entire data set for both training and testing (7, 54, 58). We performed onefold, also known as leave-one-out, cross-validation in which one sample was withheld from training and used to test the model fitted to the remaining data. This procedure was repeated six times, each time withholding a different sample for testing. The final ANN models selected were those that possessed the smallest generalization error over a wide range of values for
. The final architectures (input x hidden x output nodes) selected for each data set were 3 x 4 x 4 for dsrAB1, 3 x 4 x 5 for dsrAB2, 3 x 3 x 5 for nirS, and 3 x 3 x 5 for nirK.
Importance analysis.
It is critical to assess the relative importance or salience of each input variable with respect to the model predictions. We used an importance index that is based on a sensitivity concept proposed by Moody (41). His approach evaluated the change in training error that occurs when an input is effectively removed from the network. The most unbiased method of removing the influence of an input is to substitute its mean value (computed over the entire data set) in each sample. The sensitivity index is the average change in the mean squared error (MSE) that occurs after removing the input's influence and without retraining the network. Instead of measuring the difference, we computed the ratio of corrected MSE to MSE and normalized the ratio such that the importance indices sum to unity over all input variables. The resulting index describes the importance of the specified input variable relative to that of the other input variables.
|
|
|---|
![]() View larger version (8K): [in a new window] |
FIG. 3. Loadings of the geochemical variables on (a) PC1 and PC2 and (b) PC1 and PC3. The loading measures the correlation between the PC and the variable.
|
|
View this table: [in a new window] |
TABLE 1. Percent variance explained in the entire data set by GLM and ANN models with and without weight decay
|
(103). For nirK, nirS, and dsrAB1, the validation ED decreased monotonically as
increased with an accompanying increase in the training error.
![]() View larger version (20K): [in a new window] |
FIG. 4. Mean squared training error (ED MSE) for GLM and ANN models using all and validation data for increasing (left to right) levels of weight decay ( ) for the (a) nirS, (b) nirK, (c) dsrAB1, and (d) dsrAB2 gene groups. A solid line denotes a GLM, a dashed line indicates an ANN model, the filled circles represent all data, and the open circles indicate validation data.
|
|
View this table: [in a new window] |
TABLE 2. MSE of GLM and ANN models using entire and validation data sets with and without weight decay for nirS, nirK, dsrAB1, and dsrAB2
|
![]() View larger version (19K): [in a new window] |
FIG. 5. Observed and predicted gene frequencies for (left column) training and (right column) validation of dsrAB1 with (a and b) ANN, (c and d) dsrAB2 with ANN, and (e and f) dsrAB2 with GLM with weight decay.
|
![]() View larger version (14K): [in a new window] |
FIG. 6. Importance analysis of the three principal components for each dsrAB2 gene class using the ANN model with optimized weight decay and leave-one-out cross-validation.
|
The highest importance value observed was PC1 for class D in dsrAB2 (Fig. 6), and it appeared that the influence of PC1 on this class was monotonic (Fig. 7). The highest frequency for class D was found at the stations with the lowest values for PC1 (e.g., FW-300), and the frequency in class D generally decreased with increasing values of PC1 (Fig. 7). The highest level of class D was associated with low nitrate, nickel, Tc-99, and NpOC (positive loadings on PC1) and high pH (negative loading on PC1). Class D frequencies also decreased with decreasing values of PC2 (Fig. 7), thus indicating that the class was associated with low sulfate and uranium. The importance of PC3 for this group was extremely small (Fig. 6).
![]() View larger version (43K): [in a new window] |
FIG. 7. Principal component analysis sample scores and dsrAB2 frequencies for gene classes (a) A, (b) B, (c) C, (d) D, and (e) E. The samples are identified in panel f.
|
Apparent nonlinear relationships were observed between the geochemistry and the frequencies of classes B and C (Fig. 7). The highest importance values for PC2 were found for these classes (Fig. 6). The frequencies of these two groups appeared to peak at intermediate levels of PC2 and decreased at lower and higher values. The two classes differed in that PC1 was more important than PC2 for class B, but for class C PC2 was more important. Class C was the only case for which PC2 was more important than PC1. The frequencies of the two classes differed primarily at FW-300, where class B was high and class C was relatively low (Fig. 7).
|
|
|---|
Only the first two geochemical PCs were important in predicting the frequency of different classes of dsrAB2. The loadings on PC1 indicated that pH, Tc-99, nitrate, nickel, and NpOC were important in predicting the gene frequencies for classes A, B, D, and E. However, uranium and sulfate were more important in predicting the frequencies for class C. Interestingly, Chang et al. (14) found that uranium had an influence on one dominant cluster of dsr genes at a uranium mill tailings site. Examination of the loadings of the original geochemical parameters (Fig. 3) indicated that only DO loaded heavily on PC3. Thus, it appears that the DO concentration had little influence on the gene frequencies.
We found that cross-validation and weight decay were useful methods for measuring and increasing the generalizability of the ANN models. Due to the importance of the weight decay hyperparameter, we used a systematic method to assist in selection of
. The goal was to find a region of weight decay values that minimized generalization error. For dsrAB2 this minimal region was observed when the weight decay hyperparameter was relatively small and had not resulted in a noticeable increase in training error (Fig. 4). For the other clone groups there was little improvement in the generalization error until the training error had increased substantially. The generalization error did not reach a minimum until the weight decay parameter was so high that the result was equivalent to guessing the mean for each sample and not using the geochemical data in the prediction at all. Thus, it appeared that the inferred relationships between the geochemical parameters and the dsrAB2 frequencies were much more robust than those for the other gene groups.
Modeling strategy.
Modeling can help identify what environmental parameters control microbial community structure (17, 22, 34, 39, 44). However, if there are a large number of model parameters and a small sample size, a model can often be fit to the data even in the absence of truly meaningful relationships (19). This overfitting problem has been implicitly or explicitly recognized for environmental questions (18) and addressed by using techniques such as cross-validation to test for the generality of the model (39). Almeida (2) points out that some ANN packages can succeed in relating two sets of random numbers, thus indicating the importance of cross-validation. The modeling strategy should include techniques for assessing and addressing overfitting.
Overfitting is a potential problem in the application of ANNs, other parameterized predictive models, and classifiers such as GLMs to analysis of clone libraries and expression data (18, 31). A data set from a typical study might contain a few to a few dozen samples and a much greater number of measurements per sample. An ANN trained to predict distributions of clones from environmental measurements could possess numerous parameters (weights and biases) that must be estimated from the data. Since the number of parameters can greatly exceed the number of samples, the ANN is often able to provide a close fit to the data, even though the data contain measurement errors. The ANN attempts to fit all of the data's features, including the measurement error. The "true" model, on the other hand, is assumed to be a smooth function that describes a simpler relationship among variables but ignores the measurement error in the data. The result of overfitting is that the ANN generalizes poorly to new data that were not contained in the training set.
A useful approach for addressing overfitting is to reduce the number of inputs in the model by using a technique such as principal component analysis (57). In this study, we reduced nine geochemical characteristics to three principal components which still accounted for most of the variability in the original measurements (91%). An alternative approach is to use a technique that eliminates the variables that contribute the least either before (11) or after (50) specifying the final model. However, this approach is not as powerful as a data reduction technique, such as principal component analysis.
An important concern is the potential for generating models that may not generalize well (i.e., have a large generalization error). Weight decay is one method for addressing this problem. With weight decay, an additional term is added to the error function that is proportional to the sizes of the weights associated with each factor entering the models. Early stopping is a popular alternative to weight decay that is often employed when the number of parameters/number of samples ratio is significantly greater than unity (3, 20, 47, 54, 60). Early stopping is a nonconvergent technique that terminates training before the ANN is finished fitting the training data. Sarle (53) performed computer simulations which showed that early stopping can improve generalization. However, several investigators have noted problems with this technique (13, 56). In addition, examination of early stopping results with these small data sets (data not shown) showed that the optimal stopping point is highly sensitive to the variability in the validation set. Thus, we chose to focus this analysis on weight decay.
There is always the possibility that a model fits the training data well but is useless when applied to a new data set. Thus, a tool that allows for an assessment of the ability to generalize the model is necessary. A straightforward way to estimate generalization error is to test the model with a subset of data that was excluded from training (39, 50). If the percentage withheld is too high, not enough data remain for training. Too low a percentage can result in a validation set that does not resemble the entire data set if a few outliers are included by chance or if the subset contains only data in a narrow range. Some simulation studies suggest that the test set should be in the range of 5 to 25% of the total number of samples. Examples within most of this range (e.g., 10 to 25%) can be found in the application of ANNs to microbial data (35, 44, 46).
Cross-validation is a method that uses the entire data set for both training and testing (23). For example, suppose there are six (N) samples in a data set and we select one (k) for the test set. The procedure is repeated six (N/k) times such that all the data are eventually used in mutually exclusive test sets. Combining the error estimates from all N/k iterations provides a less biased estimate of the generalization error. Alternatives to cross-validation include early stopping and various complexity measures, such as the Akaike information criterion. In our experience, the usefulness of early stopping is severely limited by small sample sizes.
To make the final connection between the site geochemistry and the gene distributions, a method is needed to identify the model inputs that have the greatest influence on the predicted values. Traditionally, sensitivity analysis has been employed for measuring importance. However, there are several different definitions of sensitivity in general usage (32, 52). Sensitivity may be defined as a local measure that assumes a large value when a small perturbation about a specific input value produces a large change in the output. Other definitions are global measures that assess sensitivity as the mean of absolute sensitivities over all samples and outputs for each input variable. Different methods given identical input data may yield radically different results because they are based on different concepts. We propose an importance method that indicates which of the inputs had the greatest impact on predicting the distributions of gene clusters at the sites. This is equivalent to substituting the mean value of an input into the model for every sample and then measuring the increase in the prediction error based on only the other inputs.
The success of relating indicators of phylogenetic or functional groupings of bacteria (based on any technique) to environmental characteristics will depend on the validity of the assumption that microorganisms within a group can also be considered an ecotype that will respond in a similar manner to different geochemical conditions. This is the basis for using phylogenetic 16S rRNA gene probes to evaluate community changes within specific groups (37). Another approach is to group the clones of functional genes according to their distribution in the samples examined and relate those groupings to the site characteristics. Thus, the groups being related to geochemistry or other characteristics can be distributionally related (they occur at a similar group of sites) but not related by sequence similarity. For example, Liu et al. (38) found relationships between nitrate and oxygen with community structure in denitrifying populations in marine sediments after grouping the clones of nirS and nirK by their distribution at the sampled sites. These types of clusters may not readily lend themselves to development and application of probes. However, approaches such as functional gene arrays (65) may be applicable in these situations as an alternative to cloning and sequencing.
Based on an examination of the literature and the results presented here, it may be uncommon for clone groups of specific functional genes to react in a similar manner to environmental characteristics, such as geochemistry. If this is true, probes for functional genes may not be as useful as those developed for 16S and other phylogenetic genes. General spatial distributions of phylogenetic groups have been observed (24) especially over wide ranges of environmental characteristics, such as the transition from fresh to salt water (51). Chang et al. (14) found by logistic regression that one cluster of dsr gene sequences (out of eight) was highly related to uranium at a mill tailings site. Also, there are examples where similar clones of functional genes have been found in a variety of dissimilar environments (4).
Conclusions.
The inherent complexity and scale will often limit our ability to understand the relationships between phylogenetic or functional gene groups and environmental characteristics. To further our understanding we are attempting to capture patterns or correspondence between gene patterns and the geochemistry to infer relationships among them. If successful, we can begin to understand, at least in simple terms, what dominant variables show strong statistical influence or coincidence with specific populations. However, we will not always measure the critical environmental characteristics, or we may not measure them on an appropriate scale. Only when a characteristic we measure exerts a strong influence can we hope to make inferences between the geochemistry and gene distributions. Also, when we are successful it does not mean that other variables are not important but only that given the measurements made, the sites sampled, and the limits of our sampling methods we could not detect the influence of other variables.
Present address: Department of Microbiology, Miami University, Oxford, OH 45056-1400. ![]()
Present address: Savannah River National Laboratory, Aiken, SC 29808. ![]()
Present address: Central South University, Changsha, Hunan, People's Republic of China. ![]()
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»