Applied and Environmental Microbiology, May 2006, p. 3468-3475, Vol. 72, No. 5
0099-2240/06/$08.00+0 doi:10.1128/AEM.72.5.3468-3475.2006
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Price Associates, One North Broadway, White Plains, New York 10601 (1); Department of Biological Sciences (2) and Environmental Health Science (3), Salisbury University, Salisbury, Maryland 21801; Maryland Department of the Environment, 1800 Washington Blvd., Baltimore, Maryland 21230-1718 (4)
Received 11 October 2005/ Accepted 27 February 2006
Statistical classification methods, including discriminant analysis, logistic regression, and cluster analysis, have been used to develop classification models for ARA data (3-5, 14). Classification trees provide an alternative statistical approach for this BST method. Classification tree analysis methods have the flexibility to accommodate complex interactions among antibiotic variables (1, 6). This article describes and discusses the use of classification trees for BST. To aid the presentation, we describe an application of classification tree modeling to data collected from the Anacostia River Watershed in Maryland. (Our model was used to develop a risk management program for bacterial contamination of the Anacostia River [8].)
TABLE 1. Anacostia Watershed library data
Classification tree method.
Classification tree methods build classification models by recursively splitting the reference library of isolates in two such that the resulting subsets, termed nodes, are increasingly homogeneous with respect to isolate sources (1, 6). (We used the Classification and Regression Trees [CART] software developed by Salford Systems to build a classification tree for the ARA data.) The first step divides the reference library of isolates into two nodes by considering every binary split defined by the 0 or 1 outcome associated with every antibiotic variable. The variable selected for the split is the one that maximizes homogeneity of isolate sources within each node. The same procedure is applied to each resulting node. The collection of nodes at the ends of branches of the tree, called terminal nodes, is a classification model. Each terminal node is characterized by a unique antibiotic resistance profile and is associated with one source category. A bacterial isolate from a water sample that has the same antibiotic resistance profile as the profile for a particular terminal node would be assigned the source associated with that terminal node.
Criteria governing node splits.
Various approaches for determining when to conclude the node-splitting process have been suggested and analyzed (1). Stopping criteria, or rules for determining when to stop splitting nodes, typically are based on a measure of source homogeneity for isolates within a node. Source homogeneity achieves its maximum when all isolates in a node have the same source. A node where an additional split would not yield a specified minimum increase in homogeneity would be a terminal node.
Hastie et al. (6) described three criteria, or indexes, that address homogeneity. As formulated, the indexes measure the complement of homogeneity, which Hastie et al. refer to as impurity. Homogeneity is maximized when impurity is minimized. The three impurity indexes are misclassification error, Gini index, and cross-entropy deviance. The optimal split of a node at any stage of model development is the one that minimizes impurity as measured by the chosen index. We relied on the Gini index for developing the Anacostia classification tree model.
The Gini index for a node labeled m is calculated as

$$G_m = \sum_{k} p_{mk}\,(1 - p_{mk}),$$

where $p_{mk}$ is the proportion of isolates from source k in node m. The minimum value taken by the Gini index is 0, which occurs when all isolates in a node are from the same source (i.e., minimum impurity corresponds to maximum homogeneity). The maximum of the Gini index occurs when the isolates in a node are equally distributed among the sources.
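As an illustration, the following minimal Python sketch computes the Gini index of a node as the sum over sources of p(1 - p) and selects the binary split with the lowest impurity. The toy profiles, the function names, and the weighting of the two child nodes by their sizes are assumptions made for this example; they are not taken from the CART software.

```python
from collections import Counter

def gini(sources):
    """Gini index of a node: sum over sources k of p_k * (1 - p_k)."""
    n = len(sources)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(sources).values())

def best_split(profiles, sources):
    """Return (variable index, impurity) for the binary split that minimizes
    the size-weighted Gini index of the two child nodes (assumed weighting)."""
    n = len(sources)
    best = None
    for j in range(len(profiles[0])):
        left = [s for p, s in zip(profiles, sources) if p[j] == 0]
        right = [s for p, s in zip(profiles, sources) if p[j] == 1]
        if not left or not right:
            continue  # this variable does not separate the node
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (j, impurity)
    return best

# Hypothetical toy library: two antibiotic variables, 0 = susceptible, 1 = resistant.
profiles = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 0)]
sources = ["human", "human", "wildlife", "wildlife", "wildlife", "human"]

print(gini(sources))                  # 0.5: two equally represented sources
print(best_split(profiles, sources))  # (0, 0.0): variable 0 separates the sources
```

Applying `best_split` again to each child node would give the recursive procedure described above; a stopping rule is still needed to decide when a node becomes terminal.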
At this time, we know of no general guidance concerning a preference for one of the impurity indexes for BST modeling. Following the general suggestion that model building is an exploratory data analysis process, experimentation with each of the indexes may be beneficial.
Source identification for terminal nodes.
The classification model, equivalently the set of rules for classifying isolates by source, is embodied in the collection of terminal nodes. Each terminal node in a tree represents isolates with a particular pattern of antibiotic resistance and assigns the source with highest probability in the node to all such isolates.
The source probabilities in a terminal node are posterior probabilities of the events: "source, given the pattern of antibiotic resistance." The posterior probability for a particular source S in terminal node T is proportional to the product of a conditional probability and a prior probability, P(S | T) ∝ P(T | S) × P(S), where P(S | T) is the posterior probability that an isolate comes from source S given that the isolate is in terminal node T, P(T | S) is the conditional probability that an isolate is in terminal node T given that it comes from source S, and P(S) is the prior probability that an isolate comes from source S. Dividing this product by its sum over all sources makes the posterior probabilities sum to 1.
For the Anacostia Watershed, whose sources we divided into four categories, the posterior probability for the ith source would be calculated as
$$P(S_i \mid T) = \frac{P(T \mid S_i)\,P(S_i)}{\sum_{j=1}^{4} P(T \mid S_j)\,P(S_j)} \qquad (1)$$
There are two obvious choices for assigning values to the prior probabilities: (i) P(Si) is proportional to the population size of the ith source in the watershed, which if estimated from the isolates in the reference library would be $N_i / \sum_j N_j$, where $N_i$ is the number of library isolates from source i, or (ii) equal values, which would be 0.25 for each of the four sources in the Anacostia Watershed. From Table 1 (complete data), the source proportions reflected by isolates in the reference library are as follows: P(Spet) = 0.220 (236/1,074), P(Shuman) = 0.372 (399/1,074), P(Slivestock) = 0.228 (245/1,074), and P(Swildlife) = 0.181 (194/1,074). If these proportions were used as prior probabilities in equation 1, the posterior source probabilities would be the proportions of the sources among the reference library isolates in that node (1), so the source with the highest posterior probability would be the majority source in the node. Subsequently, any isolate with an unknown source that would be classified by the tree into the terminal node under discussion would be assigned the majority source for that node.
However, using the source proportions among the isolates in the reference library as prior probabilities is not advisable in most circumstances. It is unlikely that the proportion of isolates in the reference library from a particular source would reflect the true proportion of potential sources of bacteria. In the Anacostia reference library, we assume that the collection of fecal samples provides adequate coverage of the sources in the watershed, but the numbers of isolates for each source are not necessarily proportional to actual source contributions.
The second obvious choice for prior probabilities, equal probabilities of 0.25 for each of the four sources in the Anacostia analysis, is a more acceptable choice. In the absence of specific reliable data on the source contributions of bacteria from the watershed, the equal prior probabilities properly reflect ignorance concerning the source contributions and, in that respect, lead to an unbiased classification model. With equal prior probabilities, the posterior probabilities would be calculated using equation 1 with each P(Sj) replaced by 0.25. The numerical example in Table 2 demonstrates how source predictions differ depending on the choice of values for the prior probabilities.
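The effect of the prior probabilities can be sketched in a few lines of Python. The library counts below are those quoted in the text; the node counts are hypothetical and are not the values of Table 2. The sketch assumes that P(T | S) is estimated as the number of isolates from source S in the node divided by the number of library isolates from source S, and it normalizes the products so that the posterior probabilities sum to 1.

```python
def posteriors(node_counts, library_counts, priors):
    """Posterior P(S_i | T) by Bayes' rule, with P(T | S_i) estimated as
    (isolates from S_i in the node) / (isolates from S_i in the library)."""
    weighted = {
        s: (node_counts.get(s, 0) / library_counts[s]) * priors[s]
        for s in library_counts
    }
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}

# Library counts quoted in the text (complete data).
library = {"pet": 236, "human": 399, "livestock": 245, "wildlife": 194}
n_total = sum(library.values())

# Hypothetical terminal node, chosen only for illustration.
node = {"pet": 1, "human": 12, "livestock": 1, "wildlife": 8}

proportional = {s: n / n_total for s, n in library.items()}
equal = {s: 0.25 for s in library}

post_prop = posteriors(node, library, proportional)
post_equal = posteriors(node, library, equal)

print(max(post_prop, key=post_prop.get))    # human
print(max(post_equal, key=post_equal.get))  # wildlife
```

With priors proportional to the library, the posterior reduces to the share of each source in the node, so the majority source (human) is assigned. With equal priors, the comparatively small number of wildlife isolates in the library gives each wildlife isolate more weight, and the assignment changes to wildlife.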
TABLE 2. Classification results for a hypothetical terminal node, showing that changing the prior probabilities from source proportions to equal changes the source assignment of the node from human to wildlife
Evaluating alternative classification tree models.
A classification model for BST is judged by the likelihood of correctly predicting the source of isolates grown from water samples containing bacteria from unknown sources. The most basic assessment of this likelihood is a resubstitution estimate of the rate of correct classification (i.e., using the model to reclassify the reference library isolates). A relatively high value of this resubstitution estimate of the correct classification rate would be expected because the model is being used to classify the same isolates that were used to build it. Although a minimum acceptable limit for the correct classification rate has not been established, the model should at least have a correct classification rate that exceeds the rate for classification by chance (i.e., random assignment of isolates to sources). Therefore, the rate of correct classification for each source should be larger than the relative frequency of isolates in the reference library from that source.
Cross-validation provides a more realistic estimate of the correct classification rate than resubstitution because the predicted source of an isolate is based on a model that did not employ the isolate for building the model. For large reference libraries, cross-validation is accomplished by excluding a random subset of the isolates from the reference library and developing the classification tree from the remaining isolates. Using the resulting model to classify the isolates that were excluded leads to an independent estimate of the correct classification rate, which would be expected to be smaller than the resubstitution estimate. For smaller reference libraries, cross-validation may be structured to provide a bootstrap estimate of the correct classification rate (see references 1, 2, and 6 for general discussions of bootstrapping). In the software we employed, the bootstrap method is referred to as K-fold cross-validation. This method randomly establishes K nonoverlapping subsets of the reference library isolates, each consisting of 1/K of the total number of isolates in the reference library. Each subset in turn is "held out," and the remainder of the reference library is used to build a tree classification model that, in turn, is used to classify the "held-out" isolates. Accumulating the results for all K subsets leads to an alternative estimate of the correct classification rate. Like the estimate from a single excluded subset, this estimate usually would be smaller than the resubstitution estimate. In our analysis, K = 10.
(In addition to using K-fold cross-validation to estimate the rate of correct classification, CART, the software we used to build a classification model, uses information from the K-fold analysis to determine the optimal classification model. For details, see references 1 and 6.)
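The K-fold procedure can be sketched as follows. This is a minimal illustration, not the CART implementation: the classifier is a deliberately simple stand-in (majority source per exact resistance profile), and all function names are hypothetical. It assumes K is at least 2 so that every training set is nonempty.

```python
import random
from collections import Counter, defaultdict

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k nonoverlapping folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_profile_lookup(profiles, sources):
    """Stand-in classifier (NOT CART): majority source per exact profile,
    falling back to the overall majority source for unseen profiles."""
    by_profile = defaultdict(Counter)
    for p, s in zip(profiles, sources):
        by_profile[p][s] += 1
    default = Counter(sources).most_common(1)[0][0]
    table = {p: c.most_common(1)[0][0] for p, c in by_profile.items()}
    return lambda p: table.get(p, default)

def cross_validated_rate(profiles, sources, k=10, seed=0):
    """Hold out each fold in turn, build the model on the remainder,
    and accumulate correct classifications of the held-out isolates."""
    correct = 0
    for fold in k_fold_indices(len(sources), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(sources)) if i not in held_out]
        predict = fit_profile_lookup(
            [profiles[i] for i in train], [sources[i] for i in train]
        )
        correct += sum(predict(profiles[i]) == sources[i] for i in fold)
    return correct / len(sources)
```

Because every isolate is held out exactly once, the accumulated count divided by the library size is the cross-validated rate of correct classification.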
Classification probability thresholds.
As described above, each terminal node is associated with the source that has the maximum posterior probability for the node. Also, an isolate that has the same antibiotic resistance profile as the node would be assigned the source associated with the node. In terminal nodes where the maximum probability for the source associated with the node is not much larger than probabilities for other sources represented in the node, classifications based on this node are likely to be uncertain. This uncertainty can be reduced by establishing a threshold for the maximum probability in the node before accepting the node as a basis for assigning isolates to one of the sources. If the maximum probability does not exceed the threshold, isolates in the terminal node would be classified as "source unknown."
For example, assume that a node contains one pet isolate, one human isolate, two livestock isolates, and six wildlife isolates and, for purposes of this example, that the posterior probability is the relative frequency of isolates in the node. (For this example, we are assuming that the prior probability distribution for sources is the distribution of sources in the reference library. See "Source identification for terminal nodes" above for a discussion of prior probabilities.) The posterior probabilities would be 10% pet, 10% human, 20% livestock, and 60% wildlife. Without any threshold, all isolates falling into this node would be classified as wildlife. With a threshold of 70%, isolates falling into this node would be classified as unknown. If the threshold was 55%, the node and all isolates in it would be classified as wildlife.
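The worked example can be reproduced with a short sketch. As in the example, the posterior probability is taken to be the relative frequency of each source in the node, and a node is accepted only if its maximum probability exceeds the threshold; the function name is hypothetical.

```python
def classify_node(node_counts, threshold=0.0):
    """Return the node's source, or "unknown" if the largest posterior
    probability (here, relative frequency) does not exceed the threshold."""
    total = sum(node_counts.values())
    source, count = max(node_counts.items(), key=lambda kv: kv[1])
    return source if count / total > threshold else "unknown"

# Node from the example: 1 pet, 1 human, 2 livestock, 6 wildlife isolates.
node = {"pet": 1, "human": 1, "livestock": 2, "wildlife": 6}

print(classify_node(node))                  # wildlife
print(classify_node(node, threshold=0.70))  # unknown
print(classify_node(node, threshold=0.55))  # wildlife
```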
A threshold probability reflects our confidence in the classification scheme. While increasing the threshold increases the correct classification rate, it also increases the number of isolates classified as unknown. The trade-off between correct classification rate and number of unknowns needs to be investigated by varying the value assigned to the probability threshold. Application of a probability threshold for the Anacostia Watershed is described in Discussion.
Initial analysis.
We used CART to build a tree classification model for four sources based on the 1,074 isolates from 135 samples with complete data in the Anacostia Watershed reference library (Table 1). The ARA data vary statistically both between samples for a particular source and within samples (i.e., across isolates grown from a particular sample). Both sources of variability must be accounted for in the modeling process, and they are important parameters for determining the number of samples and isolates needed for developing a satisfactory model. We used the 1,074 isolates for model building. The effects of both between- and within-sample variability are discussed with the results of the final model.
We chose the Gini index as the impurity criterion and implemented K-fold cross-validation with K = 10. Additionally, the minimum number of isolates in a parent node was set to 10 and the prior probabilities were set equal to 0.25, a choice that reflects the absence of information concerning the true distribution of sources in the watershed. The final tree consisted of 171 nodes, 86 of which were terminal nodes.
Figure 1 shows the first few nodes of the tree for illustration purposes. At the top is the root node, which comprises all isolates in the reference library. The first split was based on chlortetracycline at a concentration of 100 µg/ml, the split that minimized the Gini index. None of the nodes shown in Fig. 1 are terminal nodes. Each node was subsequently split, leading to a classification model consisting of 86 terminal nodes.
FIG. 1. The first few splits of the classification tree based on the Anacostia isolate library. Equations above the nodes represent the splitting criteria: particular antibiotics and whether the isolates were resistant (1) or susceptible (0). CT100, chlortetracycline, 100 µg/ml; CA5, chloramphenicol, 5 µg/ml; S60, streptomycin, 60 µg/ml; C10, cephalothin, 10 µg/ml.
TABLE 3. Resubstitution results for classification of Anacostia Watershed reference library isolates by using the tree classification model
TABLE 4. Correct classification percentages for samples based on resubstitution
TABLE 5. Classification of Anacostia Watershed library isolates with cross-validation
Model refinement with probability thresholds.
We conducted an analysis to determine whether a probability threshold would improve the classification model. We evaluated the trade-off between increasing the rate of correct classification and the percentage of isolates that would be classified as unknown by plotting these two quantities for a range of potential probability threshold values.
Figure 2 shows the results of applying different probability thresholds to isolates in the reference library. If the threshold was 50%, the percentage of isolates classified as unknown would be less than 5%, but the improvement in the rate of correct classification would be minimal. Thresholds of 70% or greater are associated with at least a 10% improvement in the correct classification rate, but the percentage of unknowns also increases. We chose 80% as the threshold. For this threshold, the percent correctly classified increased from 81% to over 90%; the percentage of unknowns for this threshold is 44%.
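The trade-off between the two quantities can be tabulated with a sketch such as the following. The terminal nodes are hypothetical and are not those of the Anacostia tree, and the posterior probability in each node is again approximated by the relative frequency of the majority source, which is a simplification of the equal-prior calculation used for the actual model.

```python
def threshold_tradeoff(nodes, thresholds):
    """For each threshold, return (threshold, percent correct among
    classified isolates, percent of isolates classified as unknown)."""
    total = sum(sum(n.values()) for n in nodes)
    rows = []
    for t in thresholds:
        classified = correct = 0
        for n in nodes:
            size, top = sum(n.values()), max(n.values())
            if top / size > t:       # node accepted at this threshold
                classified += size
                correct += top       # majority-source isolates are correct
        pct_correct = 100 * correct / classified if classified else float("nan")
        rows.append((t, pct_correct, 100 * (total - classified) / total))
    return rows

# Hypothetical terminal nodes (counts of library isolates by source).
nodes = [
    {"human": 9, "pet": 1},
    {"wildlife": 6, "livestock": 2, "pet": 1, "human": 1},
    {"livestock": 5, "human": 5},
]

for row in threshold_tradeoff(nodes, [0.0, 0.5, 0.7]):
    print(row)
```

In this toy case the percent correct rises from about 67% to 90% as the threshold increases from 0 to 0.7, while the percent unknown rises from 0% to about 67%, which mirrors the qualitative pattern described for Fig. 2.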
FIG. 2. Effects of classification probability limits on Anacostia library resubstitution.
TABLE 6. Anacostia Watershed library classified with an 80% probability threshold
TABLE 7. Classification of Anacostia River water samples with and without an 80% probability threshold
Missing data.
The Anacostia Watershed reference library consisted of 1,155 isolates, each with outcomes for 37 antibiotic-concentration combinations. Of the 1,155 isolates, 1,074 had data for all antibiotic-concentration combinations; the remaining 81 isolates were missing data for 12 antibiotic-concentration combinations. Because most statistical classification methods require complete data on all variables, either isolates with one or more missing values are omitted from statistical analysis or missing values are imputed (i.e., replaced with estimates from similar isolates with complete data). CART, the tree classification software we employed, includes a procedure similar to imputation that is based on "surrogate splitters" (1, 6). In addition, the classification tree method, due to its flexibility for analyzing interactions among variables, is capable of accommodating missing data directly. Classification tree results based on utilization of the direct approach to missing data, which is described below, are more transparent than results where imputation or surrogate splitters are employed.
To accommodate missing ARA data directly, a fixed numerical value other than 0 or 1 would be assigned to each missing data entry in the reference library. Then, the analysis treats the missing value simply as another outcome value in the tree-building process. If missing data occur at random, the replacement of missing data with a value other than 0 or 1 should not alter the result that would have been achieved if there were no missing data. However, missing data may not occur at random but may be correlated with the sources. For example, if a particular antibiotic applied to isolates from a particular source often produced a laboratory result recorded as missing, the classification analysis would indicate a correlation with that source. The correlation could be an indication that a biological characteristic of the source is responsible for an unsuccessful laboratory outcome with the antibiotic in question. Therefore, the correlation could be useful for building the classification tree. If a water sample isolate from an unknown source registered "missing data" for the same antibiotic, the likelihood that the isolate came from the source that had a high percentage of missing data in the reference library for this antibiotic would be increased.
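A minimal sketch of the direct approach follows. It recodes missing results with a fixed value other than 0 or 1 and tabulates the rate of missing results by source, which is one simple way to see whether missingness might carry information about source. The code value 2, the use of `None` for a missing laboratory result, and the toy data are assumptions made for this example.

```python
MISSING = 2  # any fixed code other than 0 (susceptible) or 1 (resistant)

def encode(profile):
    """Replace None (no laboratory result) with the MISSING code."""
    return tuple(MISSING if v is None else v for v in profile)

def missing_rate_by_source(profiles, sources, variable):
    """Fraction of isolates from each source whose result for one
    antibiotic-concentration variable is missing."""
    totals, missing = {}, {}
    for p, s in zip(profiles, sources):
        totals[s] = totals.get(s, 0) + 1
        missing[s] = missing.get(s, 0) + (encode(p)[variable] == MISSING)
    return {s: missing[s] / totals[s] for s in totals}

# Hypothetical isolates with two variables; None marks a missing result.
profiles = [(1, None), (0, None), (1, 0), (0, 1), (1, 1)]
sources = ["wildlife", "wildlife", "wildlife", "human", "human"]

print(encode(profiles[0]))                           # (1, 2)
print(missing_rate_by_source(profiles, sources, 1))
```

Once encoded, the value 2 can be offered to the tree-building step as one more outcome of the variable. Markedly different rates across sources, as in this toy data, would suggest that the missing-data pattern is correlated with source.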
The missing data in the Anacostia Watershed reference library were limited to 81 isolates and the same 12 antibiotic-concentration combinations. All isolates were tested with an initial panel of antibiotics in various concentrations. Later in the project, additional antibiotics and concentrations were added to the panel, requiring the growth and testing of frozen isolates. Missing data resulted when 81 random isolates would not regrow for reasons unrelated to source. Therefore, the pattern of missing data contained no information that could help identify isolate sources. We omitted the 81 isolates with missing data from all subsequent analyses.
Reference library representativeness and size.
Two important recurring questions concerning the validity of all BST classification modeling methods are (i) whether the reference library is representative of the watershed and (ii) whether the reference library consists of a sufficient number of samples and isolates to develop a reliable model. Neither question can be simply and completely answered. However, we provide guidance based on general statistical principles and the specifics of BST using ARA data.
Concerning representativeness, statistical principles dictate that samples should be collected in a manner that would provide coverage for the total watershed. Stratification and proportional sampling by strata could be used if there was information indicating that certain areas within the watershed were more likely as sources of fecal bacteria than other areas. Also, if information on the relative population sizes of the contributing sources (i.e., pet, human, livestock, and wildlife for the Anacostia Watershed) was available, the source categories could be used to define strata and a proportional sampling scheme for these strata could be designed. However, in any initial BST investigation such as the investigation conducted for the Anacostia Watershed, it is unlikely that the information needed to define strata would be well developed. Therefore, a sampling design based on a combination of expert scientific judgment and convenience is a practical choice.
For the Anacostia Watershed, sampling was conducted after an initial field survey designed to identify likely "major" contributors to fecal bacteria contamination. Field personnel were instructed to concentrate their scat collection efforts around these major sources.
The size of the reference library is determined by specifying the number of field samples to be collected, the number of isolates to be grown from each sample, and the number of isolates analyzed by ARA. Determining the size (i.e., number of samples and isolates) required for a reference library may follow the usual approach for setting requirements for the number of observations in a statistical sample: (i) state a well-defined quantitative objective for the statistical analysis, (ii) determine the size of the statistical error considered acceptable for an estimate of the objective, and (iii) select a probability or confidence level for ensuring that the estimation error is not exceeded. This approach, although conceptually straightforward, is complicated in application and has not been applied in the BST literature.
As an introduction to the procedure for determining the size of the reference library for classification tree modeling, we provide an example of this procedure and its complications with a more familiar statistical classification method, i.e., linear discriminant analysis. As a first step, it would be useful to know whether the means of the antibiotic-concentration combinations used to build the classification model (the variables in the discriminant function) are statistically different across the sources. For two sources and data that have an approximately normal distribution, the test is straightforward and depends explicitly on the number of isolates from each source (7). It follows that 95% confidence contours could be developed for the mean vectors for each source and inspected to determine whether these contours overlapped. If the overlap was pronounced, it is likely that the classification model would perform poorly. The widths of the confidence contours depend on the inherent variation in the data and the number of isolates from each source. As such, the number of isolates from each source could be determined to achieve a specified separation of the means for the two sources. The complications for BST using linear discriminant analysis to model ARA data are that (i) ARA data do not follow an approximately normal distribution and (ii) extensions of the two-source confidence contour analysis to multiple sources have not been fully developed. These complications affect all statistical classification methods that rely on the assumption that the data have a normal distribution.
We describe an approach for determining the reference library size for BST by using the tree classification method, but practical implementation of this approach must await further development. For example, set the well-defined quantitative objective to be the rate of correct classification as estimated by resubstitution, and require a reference library large enough to ensure no more than a 5% error (plus or minus) with 99% confidence for the estimate. Then, by considering models with different numbers of terminal nodes, the required number of isolates could be determined, which, based on a determination of the number of isolates per sample, leads to the number of samples. Although this objective can be stated simply, it has not been translated into a requirement for the size of a reference library for use with the classification tree method. Another approach for determining the required number of isolates would be to require a small error for the estimates of the maximum posterior probability in terminal nodes. Recall that the maximum posterior probability in a terminal node determines the source for that node.
A concept referred to as artificial clustering has been suggested by some researchers as a way to judge whether or not the reference library is large enough (5, 12). However, it is unlikely that the approach employing the concept of artificial clustering is informative for determining the necessary size of a reference library. As described by Harwood et al. (5), a test of the library size would be conducted by randomly assigning the isolates in the library to sources. For example, if there were four sources, each isolate would have a probability equal to 0.25 of being assigned to one of the four sources. The statistical classification procedure would then be applied to the library isolates, and the resulting model would be used to reclassify the isolates among sources. Because the isolates were assigned to sources at random (i.e., the assignment was independent of the ARA outcomes), if the statistical classification procedure is not biased, it would function as a random classification procedure. Each isolate would have a 25% chance of being assigned to each source. The expected resubstitution result would be "25% of the isolates classified into each source." Various authors have relied on empirical results showing that as the library size increases, the source classifications deviate less from the expected 25%. Therefore, the realization of large deviations from the expected 25%, a circumstance referred to as artificial clustering, is interpreted as an indication that the library is too small.
Artificial clustering as defined, implemented, and interpreted does not provide useful information about the adequacy of library size. Let N be the number of isolates in a reference library. Assume that the isolates came from four sources. Any individual random assignment of the N isolates to the four sources, which subsequently is analyzed by a statistical classification method, would be expected to classify 25% of the N isolates into each source category. A deviation from the 25% target has two possible causes. First, the statistical classification method may be biased. It is generally assumed that this is not the case. Second, the deviation could be a consequence of statistical variation, and larger deviations would be more likely for smaller reference libraries. Deviations from the 25% target can be predicted on a probability basis using the multinomial distribution, but even then they provide minimal guidance for determining if the size of the reference library is acceptable.
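The size of the deviations expected from statistical variation alone can be illustrated with a short sketch. It treats the classification of randomly labeled isolates as equivalent to random assignment with probability 0.25 per source, which is the unbiased case assumed in the text; the function names and the simulation seed are hypothetical.

```python
import math
import random

def proportion_sd(n, p=0.25):
    """Standard deviation of one category's proportion when n isolates
    are assigned at random (multinomial, category probability p)."""
    return math.sqrt(p * (1 - p) / n)

def random_assignment_proportions(n, n_sources=4, seed=0):
    """Simulate one random assignment of n isolates to n_sources sources."""
    rng = random.Random(seed)
    counts = [0] * n_sources
    for _ in range(n):
        counts[rng.randrange(n_sources)] += 1
    return [c / n for c in counts]

for n in (100, 1074, 10000):
    print(n, round(proportion_sd(n), 4), random_assignment_proportions(n))
```

For a library of 100 isolates the standard deviation of each proportion is about 0.043, and for 1,074 isolates it is about 0.013, so deviations from 25% shrink with library size regardless of how well the library supports classification.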
At the current stage of BST model development, the only way to assess the size of a reference library is through sensitivity analysis by varying the size of the library or adding more isolates from known sources and evaluating the stability of the model.
Interpretation of water sample isolate source predictions.
We used a statistical method, classification tree analysis, to build a model for classifying isolates among sources and applied the model to water sample isolates with unknown sources. The resulting distribution of isolates among sources has implications for risk management interventions designed to reduce bacterial contamination of river waters. Our analysis suggests that the human and livestock sources contribute fewer bacteria to the Anacostia River than the pet and wildlife sources (Table 7). Although we are confident that the results indicate the correct attribution for sources, we have not conducted a statistical test to determine if the percentages are statistically significantly different, and at present we have not developed the methodology for conducting such a test. We currently are investigating methods for efficiently evaluating statistical differences among predicted source percentages derived from classification tree models.