Applied and Environmental Microbiology, May 2006, p. 3468-3475, Vol. 72, No. 5
0099-2240/06/$08.00+0     doi:10.1128/AEM.72.5.3468-3475.2006
Copyright © 2006, American Society for Microbiology. All Rights Reserved.

Classification Tree Method for Bacterial Source Tracking with Antibiotic Resistance Analysis Data

Bertram Price (1)*, Elichia A. Venso (2, 3), Mark F. Frana (2), Joshua Greenberg (1), Adam Ware (1), and Lee Currey (4)

Price Associates, One North Broadway, White Plains, New York 10601 (1); Department of Biological Sciences (2), Environmental Health Science, Salisbury University, Salisbury, Maryland 21801 (3); Maryland Department of the Environment, 1800 Washington Blvd., Baltimore, Maryland 21230-1718 (4)

Received 11 October 2005/ Accepted 27 February 2006


ABSTRACT
 
Various statistical classification methods, including discriminant analysis, logistic regression, and cluster analysis, have been used with antibiotic resistance analysis (ARA) data to construct models for bacterial source tracking (BST). We applied the statistical method known as classification trees to build a model for BST for the Anacostia Watershed in Maryland. Classification trees are more flexible than classification approaches based on standard statistical methods in accommodating complex interactions among ARA variables. This article describes the use of classification trees for BST and discusses the method's principal parameters and features. Anacostia Watershed ARA data are used to illustrate the application of classification trees, and we report the BST results for the watershed.


INTRODUCTION
 
Bacterial source tracking (BST) with antibiotic resistance analysis (ARA) has been conducted using various statistical classification models to identify sources of bacterial contamination of surface waters (3-5, 9-15). Isolates are obtained from fecal samples from known sources, such as humans, pets, livestock, and wildlife, and are tested for antibiotic resistance against a panel of antibiotics at different concentrations. The isolates comprise a reference library that is used for developing a statistical classification model to predict probable bacterial sources based on antibiotic resistance profiles. The model is then applied to classify unknown water sample isolates treated with the same antibiotics. The result is a frequency distribution of water sample isolates by source, which is used to estimate the relative contributions of the sources to bacterial contamination in the watershed.

Statistical classification methods, including discriminant analysis, logistic regression, and cluster analysis, have been used to develop classification models for ARA data (3-5, 14). Classification trees provide an alternative statistical approach for this BST method. Classification tree analysis methods have the flexibility to accommodate complex interactions among antibiotic variables (1, 6). This article describes and discusses the use of classification trees for BST. To aid the presentation, we describe an application of classification tree modeling to data collected from the Anacostia River Watershed in Maryland. (Our model was used to develop a risk management program for bacterial contamination of the Anacostia River [8].)


MATERIALS AND METHODS
 
ARA data.
We use ARA data collected from the Anacostia Watershed to aid our description and discussion of the classification tree approach for BST. The Anacostia Watershed reference library was developed from 90 scat samples collected from three general sources in Maryland's Anacostia Watershed (pets, livestock, and wildlife) and 50 human source samples from sewage treatment facilities. A total of 1,155 known source enterococcal isolates were obtained from these fecal samples. Table 1 shows the distribution of samples and isolates among the four sources. Forty-two antibiotic-concentration combinations, subsequently referred to as the antibiotic variables, were used to form the ARA data. If the culture grew in the presence of a particular antibiotic, then it was considered resistant and recorded as "1." If the culture did not grow in the presence of the antibiotic, then it was considered susceptible and recorded as "0." Therefore, each observation (isolate) in the reference library is a sequence of 42 zeros and ones. Five of the 42 antibiotic variables were omitted from the statistical analysis: one because it was constant across all sources and four because they were not applied to any of the isolates from human sources. Of the 1,155 isolates, 1,074 had data for all of the remaining 37 antibiotic variables.


TABLE 1. Anacostia Watershed library data

In addition to the reference library data that were used to build and validate a classification model, 1,565 water sample isolates were grown from 71 water samples collected from the Anacostia River; 1,512 of these isolates had complete ARA data. The procedures used to develop the reference library and water sample isolate data are described in a publication from the Maryland Department of the Environment (8).

Classification tree method.
Classification tree methods build classification models by recursively splitting the reference library of isolates in two such that the resulting subsets, termed nodes, are increasingly homogeneous with respect to isolate sources (1, 6). (We used the Classification and Regression Trees [CART] software developed by Salford Systems to build a classification tree for the ARA data.) The first step divides the reference library of isolates into two nodes by considering every binary split defined by the 0 or 1 outcome associated with every antibiotic variable. A variable is selected for the split that maximizes homogeneity of isolate sources within each node. The same procedure is applied to each resulting node. The collection of nodes at the ends of branches of the tree, called terminal nodes, is a classification model. Each terminal node is characterized by a unique antibiotic resistance profile and is associated with one source category. A bacterial isolate from a water sample that has the same antibiotic resistance profile as the profile for a particular terminal node would be assigned the source associated with that terminal node.
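The split search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CART implementation used for the analysis: it scores each candidate split by the size-weighted Gini impurity of the two child nodes, ignores prior probabilities and stopping rules, and the names `gini` and `best_binary_split` and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of source labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(profiles, sources):
    """Return (variable_index, weighted_impurity) for the best 0/1 split.

    profiles: list of equal-length lists of 0/1 outcomes, one per isolate.
    sources:  list of source labels, one per isolate.
    Returns (None, parent_impurity) if no split lowers the impurity.
    """
    n = len(profiles)
    best = (None, gini(sources))
    for j in range(len(profiles[0])):
        susceptible = [s for p, s in zip(profiles, sources) if p[j] == 0]
        resistant = [s for p, s in zip(profiles, sources) if p[j] == 1]
        if not susceptible or not resistant:
            continue  # variable is constant in this node; it cannot split it
        score = (len(susceptible) * gini(susceptible)
                 + len(resistant) * gini(resistant)) / n
        if score < best[1]:
            best = (j, score)
    return best


# Hypothetical toy library: 6 isolates, 3 antibiotic variables.
profiles = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 1, 1]]
sources = ["human", "human", "human", "wildlife", "wildlife", "wildlife"]
split_var, split_score = best_binary_split(profiles, sources)
print(split_var, split_score)
```

Applying `best_binary_split` again to each child node, until a stopping rule is met, would grow the tree recursively.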

Criteria governing node splits.
Various approaches for determining when to conclude the node-splitting process have been suggested and analyzed (1). Stopping criteria, or rules for determining when to stop splitting nodes, typically are based on a measure of source homogeneity for isolates within a node. Source homogeneity achieves its maximum when all isolates in a node have the same source. A node where an additional split would not yield a specified minimum increase in homogeneity would be a terminal node.

Hastie et al. (6) described three criteria, or indexes, that address homogeneity. As formulated, the indexes measure the complement of homogeneity, which Hastie et al. refer to as impurity. Homogeneity is maximized when impurity is minimized. The three impurity indexes are misclassification error, Gini index, and cross-entropy deviance. The optimal split of a node at any stage of model development is the one that minimizes impurity as measured by the chosen index. We relied on the Gini index for developing the Anacostia classification tree model.

The Gini index for a node labeled m is calculated as $G_m = \sum_{k} \hat{p}_{mk}\,(1 - \hat{p}_{mk})$, where $\hat{p}_{mk}$ is the proportion of isolates from source k in node m. The minimum value taken by the Gini index is 0, which occurs when all isolates in a node are from the same source (i.e., minimum impurity corresponds to maximum homogeneity). The maximum of the Gini index occurs when the isolates in a node are equally distributed among the sources.

At this time, we know of no general guidance concerning a preference for one of the impurity indexes for BST modeling. Following the general suggestion that model building is an exploratory data analysis process, experimentation with each of the indexes may be beneficial.
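For readers who wish to experiment with the three indexes, the following Python sketch computes each one from the vector of source proportions in a node, using the forms given by Hastie et al. (6). The function names and the example proportions are illustrative assumptions.

```python
import math


def misclassification_error(p):
    """1 minus the largest source proportion in the node."""
    return 1.0 - max(p)


def gini_index(p):
    """Sum over sources of p_k * (1 - p_k)."""
    return sum(pk * (1.0 - pk) for pk in p)


def cross_entropy(p):
    """Minus the sum of p_k * log(p_k); zero proportions contribute nothing."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


# Hypothetical source proportions for nodes with four sources.
nodes = {
    "pure": [1.0, 0.0, 0.0, 0.0],
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "mixed": [0.1, 0.1, 0.2, 0.6],
}
for name, p in nodes.items():
    print(name, misclassification_error(p), gini_index(p), cross_entropy(p))
```

All three indexes are 0 for the pure node and reach their maximum for the uniform node; they differ in how they rank intermediate nodes, which is why the choice of index can change the selected splits.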

Source identification for terminal nodes.
The classification model, equivalently the set of rules for classifying isolates by source, is embodied in the collection of terminal nodes. Each terminal node in a tree represents isolates with a particular pattern of antibiotic resistance and assigns the source with highest probability in the node to all such isolates.

The source probabilities in a terminal node are posterior probabilities of the events "source, given the pattern of antibiotic resistance." By Bayes' rule, the posterior probability for a particular source S in terminal node T is proportional to the product of a conditional probability and a prior probability, P(S | T) ∝ P(T | S) × P(S), where P(S | T) is the posterior probability that an isolate comes from source S given that the isolate is in terminal node T, P(T | S) is the conditional probability that an isolate is in terminal node T given that it comes from source S, and P(S) is the prior probability that an isolate comes from source S. The constant of proportionality normalizes the posterior probabilities so that they sum to 1 over the sources.

For the Anacostia Watershed, whose sources we divided into four categories, the posterior probability for the ith source would be calculated as

$$P(S_i \mid T) = \frac{P(T \mid S_i)\,P(S_i)}{\sum_{j=1}^{4} P(T \mid S_j)\,P(S_j)} \qquad (1)$$
P(T | Sj) can be estimated from the isolates in the reference library as nj/Nj, where nj is the number of isolates from source j in terminal node T and Nj is the total number of isolates from source j in the reference library. P(S), the prior probability distribution of source identity, must be specified as a modeling parameter, which means that the prior probabilities have a significant role in determining source predictions in the classification tree.
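The calculation in equation 1 can be sketched in Python as follows. The library totals are the complete-data isolate counts per source reported for the Anacostia reference library; the terminal-node counts are hypothetical and chosen only to show how the assigned source can depend on the prior probabilities. The function name `posterior_probabilities` is illustrative.

```python
def posterior_probabilities(node_counts, library_totals, priors):
    """Equation 1 with P(T | S_j) estimated as n_j / N_j."""
    weighted = {
        s: (node_counts[s] / library_totals[s]) * priors[s] for s in priors
    }
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}


# N_j: complete-data isolates per source in the reference library.
library_totals = {"pet": 236, "human": 399, "livestock": 245, "wildlife": 194}
# n_j: hypothetical counts in one terminal node (not taken from the article).
node_counts = {"pet": 2, "human": 10, "livestock": 3, "wildlife": 8}

equal_priors = {s: 0.25 for s in library_totals}
n_total = sum(library_totals.values())
proportional_priors = {s: n / n_total for s, n in library_totals.items()}

post_equal = posterior_probabilities(node_counts, library_totals, equal_priors)
post_prop = posterior_probabilities(
    node_counts, library_totals, proportional_priors
)
source_equal = max(post_equal, key=post_equal.get)
source_prop = max(post_prop, key=post_prop.get)
print(source_equal, source_prop)
```

In this hypothetical node, human isolates are the most numerous, but wildlife isolates make up a larger share of their own source's library total, so equal priors assign the node to wildlife while library-proportion priors assign it to human.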

There are two obvious choices for assigning values to the prior probabilities: (i) P(Si) is proportional to the population size of the ith source in the watershed, which if estimated from the isolates in the reference library would be Ni/ΣNj, or (ii) equal values, which would be 0.25 for each of the four sources in the Anacostia Watershed. From Table 1 (complete data), the source proportions reflected by isolates in the reference library are as follows: P(Spet) = 0.220 (236/1,074), P(Shuman) = 0.372 (399/1,074), P(Slivestock) = 0.228 (245/1,074), and P(Swildlife) = 0.181 (194/1,074). If these proportions were used as prior probabilities in equation 1, the posterior source probabilities would be the proportions of the sources among the reference library isolates in that node (1), so the source with the highest posterior probability would be the majority source in the node. Subsequently, any isolate with an unknown source that would be classified by the tree into the terminal node under discussion would be assigned the majority source for that node.
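The reduction to node proportions under choice (i) follows in one step from equation 1. As a short worked derivation, with $n_j$ the number of isolates from source j in the terminal node, $N_j$ the library total for source j, and $N = \sum_j N_j$ (notation assumed here for illustration):

```latex
P(S_i \mid T)
  = \frac{\dfrac{n_i}{N_i}\cdot\dfrac{N_i}{N}}
         {\sum_{j=1}^{4} \dfrac{n_j}{N_j}\cdot\dfrac{N_j}{N}}
  = \frac{n_i}{\sum_{j=1}^{4} n_j}
```

The library totals cancel, leaving the relative frequency of source i among the isolates in the node.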

However, using the source proportions among the isolates in the reference library as prior probabilities is not advisable in most circumstances. It is unlikely that the proportion of isolates in the reference library from a particular source would reflect the true proportion of potential sources of bacteria. In the Anacostia reference library, we assume that the collection of fecal samples provides adequate coverage of the sources in the watershed, but the numbers of isolates for each source are not necessarily proportional to actual source contributions.

The second obvious choice for prior probabilities, equal probabilities of 0.25 for each of the four sources in the Anacostia analysis, is a more acceptable choice. In the absence of specific reliable data on the source contributions of bacteria from the watershed, the equal prior probabilities properly reflect ignorance concerning the source contributions and, in that respect, lead to an unbiased classification model. Under choice of the equal prior probabilities, the posterior probabilities would be calculated using equation 1 with each P(Sj) replaced by 0.25. The numerical example in Table 2 demonstrates how source predictions are different depending on the choice of values for the prior probabilities.


TABLE 2. Classification results for a hypothetical terminal node, showing that changing the prior probabilities from source proportions to equal changes the source assignment of the node from human to wildlife

Parameters governing model stability.
A classification model is stable if it consistently classifies new isolates from known sources accurately. Stability depends to a large extent on the representativeness and size of the reference library, two characteristics that are addressed in Discussion. For a given reference library, stability may be advanced by imposing size limitations on nodes that are candidates for splitting and size limitations on the terminal nodes. These limitations are intended to improve the statistical reliability of the classification model by increasing the statistical stability of terminal nodes. For example, requiring the number of isolates in a terminal node to be no less than a specified number is equivalent to imposing a statistical precision requirement on the estimates of posterior probabilities for that node.
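The link between terminal-node size and statistical precision can be illustrated with the usual binomial standard error of a proportion. This is a rough sketch under stated assumptions: isolates in a node are treated as independent, and the posterior probability is treated as a simple relative frequency, which holds only when the priors equal the library proportions. The function names are hypothetical.

```python
import math


def posterior_standard_error(p, n):
    """Approximate binomial standard error of a proportion p based on n isolates."""
    return math.sqrt(p * (1.0 - p) / n)


def min_node_size(p, max_se):
    """Smallest node size n whose binomial standard error is at most max_se."""
    return math.ceil(p * (1.0 - p) / max_se ** 2)


# Standard error of an estimated probability of 0.5 for several node sizes.
for n in (5, 10, 25, 100):
    print(n, round(posterior_standard_error(0.5, n), 3))

print(min_node_size(0.5, 0.08))
```

Because isolates grown from the same sample tend to resemble one another, the effective precision in practice is likely to be lower than this independence-based figure suggests.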

Evaluating alternative classification tree models.
A classification model for BST is judged by the likelihood of correctly predicting the source of isolates grown from water samples containing bacteria from unknown sources. The most basic assessment of this likelihood is a resubstitution estimate of the rate of correct classification (i.e., using the model to reclassify the reference library isolates). A relatively high value of this resubstitution estimate of the correct classification rate would be expected because the model is being used to classify the same isolates that were used to build it. Although a minimum acceptable limit for the correct classification rate has not been established, the model should at least have a correct classification rate that exceeds the rate for classification by chance (i.e., random assignment of isolates to sources). Therefore, the rate of correct classification for each source should be larger than the relative frequency of isolates in the reference library from that source.

Cross-validation provides a more realistic estimate of the correct classification rate than resubstitution because the predicted source of an isolate is based on a model that did not employ the isolate for building the model. For large reference libraries, cross-validation is accomplished by excluding a random subset of isolates from the reference library and developing the classification tree from the remaining isolates. Using the resulting model to classify the excluded isolates leads to an independent estimate of the correct classification rate, which would be expected to be smaller than the resubstitution estimate. For smaller reference libraries, cross-validation may be structured as a resampling procedure akin to the bootstrap (see references 1, 2, and 6 for general discussions of bootstrapping). In the software we employed, this method is referred to as K-fold cross-validation. This method randomly establishes K nonoverlapping subsets of the reference library isolates, each consisting of 1/K of the total number of isolates in the reference library. Each subset in turn is "held out," and the remainder of the reference library is used to build a tree classification model that is then used to classify the "held-out" isolates. Accumulating the results for all K subsets leads to an alternative estimate of the correct classification rate. Like the single hold-out estimate, the K-fold estimate usually would be smaller than the resubstitution estimate. In our analysis, K = 10.
(In addition to using K-fold cross-validation to estimate the rate of correct classification, CART, the software we used to build a classification model, uses information from the K-fold analysis to determine the optimal classification model. For details, see references 1 and 6.)
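The K-fold procedure can be sketched in Python as follows. This is an illustration only and does not reproduce CART's internal procedure: `fit_stub` and `predict_stub` are hypothetical placeholders standing in for a real tree builder, and the toy library is invented.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly split range(n) into k nonoverlapping folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_rate(profiles, sources, fit, predict, k=10, seed=0):
    """Fraction of held-out isolates whose source is predicted correctly."""
    correct = 0
    for held_out in k_fold_indices(len(profiles), k, seed):
        held = set(held_out)
        train = [i for i in range(len(profiles)) if i not in held]
        model = fit([profiles[i] for i in train], [sources[i] for i in train])
        correct += sum(predict(model, profiles[i]) == sources[i] for i in held_out)
    return correct / len(profiles)


def fit_stub(train_profiles, train_sources):
    """Placeholder model: majority source for each value of the first variable."""
    table = {}
    for p, s in zip(train_profiles, train_sources):
        table.setdefault(p[0], Counter())[s] += 1
    overall = Counter(train_sources).most_common(1)[0][0]
    return {v: c.most_common(1)[0][0] for v, c in table.items()}, overall


def predict_stub(model, profile):
    """Look up the first variable's value; fall back to the overall majority."""
    table, overall = model
    return table.get(profile[0], overall)


# Hypothetical, perfectly separable toy library of 40 isolates.
profiles = [[1, 0]] * 20 + [[0, 1]] * 20
sources = ["human"] * 20 + ["wildlife"] * 20
folds = k_fold_indices(len(profiles), 10)
rate = cross_validated_rate(profiles, sources, fit_stub, predict_stub, k=10)
print(rate)
```

Every isolate is held out exactly once, so the accumulated rate uses the whole library while never classifying an isolate with a model built from it.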

Classification probability thresholds.
As described above, each terminal node is associated with the source that has the maximum posterior probability for the node. Also, an isolate that has the same antibiotic resistance profile as the node would be assigned the source associated with the node. In terminal nodes where the maximum probability for the source associated with the node is not much larger than probabilities for other sources represented in the node, classifications based on this node are likely to be uncertain. This uncertainty can be reduced by establishing a threshold for the maximum probability in the node before accepting the node as a basis for assigning isolates to one of the sources. If the maximum probability does not exceed the threshold, isolates in the terminal node would be classified as "source unknown."

For example, assume that a node contains one pet isolate, one human isolate, two livestock isolates, and six wildlife isolates and, for purposes of this example, that the posterior probability is the relative frequency of isolates in the node. (For this example, we are assuming that the prior probability distribution for sources is the distribution of sources in the reference library. See "Source identification for terminal nodes" above for a discussion of prior probabilities.) The posterior probabilities would be 10% pet, 10% human, 20% livestock, and 60% wildlife. Without any threshold, all isolates falling into this node would be classified as wildlife. With a threshold of 70%, isolates falling into this node would be classified as unknown. If the threshold was 55%, the node and all isolates in it would be classified as wildlife.

A threshold probability reflects our confidence in the classification scheme. While increasing the threshold increases the correct classification rate, it also increases the number of isolates classified as unknown. The trade-off between correct classification rate and number of unknowns needs to be investigated by varying the value assigned to the probability threshold. Application of a probability threshold for the Anacostia Watershed is described in Discussion.
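The threshold rule can be written as a short Python sketch. The posterior probabilities below are those of the example node described above (10% pet, 10% human, 20% livestock, 60% wildlife); the function name `classify_with_threshold` is hypothetical.

```python
def classify_with_threshold(posteriors, threshold=0.0):
    """Return the source with maximum posterior, or "unknown" if that
    maximum does not exceed the threshold."""
    source = max(posteriors, key=posteriors.get)
    return source if posteriors[source] > threshold else "unknown"


node = {"pet": 0.10, "human": 0.10, "livestock": 0.20, "wildlife": 0.60}
print(classify_with_threshold(node))        # no threshold
print(classify_with_threshold(node, 0.70))  # 70% threshold
print(classify_with_threshold(node, 0.55))  # 55% threshold
```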


RESULTS
 
Here we present statistical results for the application of the tree classification method to data from the Anacostia Watershed.

Initial analysis.
We used CART to build a tree classification model for four sources based on the 1,074 isolates from 135 samples with complete data in the Anacostia Watershed reference library (Table 1). The ARA data vary statistically both between samples for a particular source and within samples (i.e., across isolates grown from a particular sample). Both sources of variability must be accounted for in the modeling process, and they are important parameters for determining the number of samples and isolates needed for developing a satisfactory model. We used the 1,074 isolates for model building. The effects of both between- and within-sample variability are discussed with the results of the final model.

We chose the Gini index as the impurity criterion and implemented K-fold cross-validation with K = 10. Additionally, the minimum number of isolates in a parent node was set to 10 and the prior probabilities were set equal to 0.25, a choice that reflects the absence of information concerning the true distribution of sources in the watershed. The final tree consisted of 171 nodes, 86 of which were terminal nodes.

Figure 1 shows the first few nodes of the tree for illustration purposes. At the top is the root node, which comprises all isolates in the reference library. The first split was based on chlortetracycline at a concentration of 100 µg/ml, the variable that minimized the Gini index. None of the nodes shown in Fig. 1 are terminal nodes. Each node was subsequently split, leading to a classification model consisting of 86 terminal nodes.


FIG. 1. The first few splits of the classification tree based on the Anacostia isolate library. Equations above the nodes represent the splitting criteria: particular antibiotics and whether the isolates were resistant (1) or susceptible (0). CT100, chlortetracycline, 100 µg/ml; CA5, chloramphenicol, 5 µg/ml; S60, streptomycin, 60 µg/ml; C10, cephalothin, 10 µg/ml.

Table 3 displays the model predictions for library isolates by using resubstitution. The overall correct classification rate for isolates by resubstitution is 81.0%. Table 3 also displays the distribution of predicted sources by actual source. The minimum correct classification rate by resubstitution was 75.9%, for livestock isolates.


TABLE 3. Resubstitution results for classification of Anacostia Watershed reference library isolates by using the tree classification model

Table 4 displays correct classification rates for samples by using resubstitution. These results address the effect of within-sample variability (i.e., variation across isolates). We consider a sample to be correctly classified if a minimum cutoff of correct classification is achieved for the isolates grown from that sample. For example (Table 4), a sample may be considered correctly classified if at least 60% of the isolates grown from that sample are correctly classified. As the cutoff percentage increases, the percentage of samples correctly classified declines. This pattern is a manifestation of within-sample variation across isolates. These results are discussed further in "Reference library representativeness and size" below, which discusses determination of the number of samples (and isolates per sample) for a BST field study design.
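The sample-level criterion can be sketched in Python as follows. The records are hypothetical (they are not the Table 4 data), and the function name `sample_correct_rate` is illustrative.

```python
from collections import defaultdict


def sample_correct_rate(records, cutoff):
    """Fraction of samples in which at least `cutoff` of isolates are correct.

    records: iterable of (sample_id, actual_source, predicted_source).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample_id, actual, predicted in records:
        totals[sample_id] += 1
        hits[sample_id] += int(actual == predicted)
    correct = sum(1 for s in totals if hits[s] / totals[s] >= cutoff)
    return correct / len(totals)


# Hypothetical isolates: sample A is 4/5 correct, B is 3/5, C is 2/4.
records = (
    [("A", "human", "human")] * 4 + [("A", "human", "pet")]
    + [("B", "pet", "pet")] * 3 + [("B", "pet", "wildlife")] * 2
    + [("C", "wildlife", "wildlife")] * 2 + [("C", "wildlife", "livestock")] * 2
)
for cutoff in (0.5, 0.6, 0.9):
    print(cutoff, sample_correct_rate(records, cutoff))
```

Raising the cutoff can only remove samples from the correctly classified set, which reproduces the declining pattern described above.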


TABLE 4. Correct classification percentages for samples based on resubstitution

Using 10-fold cross-validation, which provides a more accurate assessment of the model's predictive ability, the overall correct classification rate is estimated to be 72.2% (Table 5), which is 8.8 percentage points lower than the estimate obtained by resubstitution. The estimates of correct classification rates by source range from 64.1% for livestock isolates to 82.0% for human isolates.


TABLE 5. Classification of Anacostia Watershed library isolates with cross-validation

Applying the tree classification model to the 1,565 Anacostia River water isolates yielded the following distribution of sources: 468 (29.9%) pet, 222 (14.2%) human, 437 (27.9%) livestock, and 438 (28.0%) wildlife. These results were determined from analysis of all the water isolates, which represent six monitoring stations with samples collected monthly for 1 year. Therefore, the source distribution presented here does not account for the distribution of high-flow and low-flow periods, which may contribute different sources to the streams. Also, note that bacterial sources can be site specific in a watershed, given the nonconservative nature of bacterial transport. For the purpose of this analysis, all the water isolates from the six monitoring stations were used to estimate the overall watershed relative source contributions. The results based on this averaging method indicate that humans contribute the least bacterial contamination to the Anacostia River. The other sources of bacterial contamination are evenly distributed among pet animals, livestock, and wildlife.

Model refinement with probability thresholds.
We conducted an analysis to determine whether a probability threshold would improve the classification model. We evaluated the trade-off between increasing the rate of correct classification and the percentage of isolates that would be classified as unknown by plotting these two quantities for a range of potential probability threshold values.

Figure 2 shows the results of applying different probability thresholds to isolates in the reference library. If the threshold was 50%, the percentage of isolates classified as unknown would be less than 5%, but the improvement in the rate of correct classification would be minimal. Thresholds of 70% or greater are associated with at least a 10% improvement in the correct classification rate, but the percentage of unknowns also increases. We chose 80% as the threshold. For this threshold, the percent correctly classified increased from 81% to over 90%; the percentage of unknowns for this threshold is 44%.
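The trade-off analysis can be sketched in Python as follows. The records are hypothetical and do not reproduce Fig. 2; the sketch also assumes that the correct classification rate is computed among isolates that are not classified as unknown, and the function name `threshold_tradeoff` is illustrative.

```python
def threshold_tradeoff(records, thresholds):
    """For each threshold, return (threshold, percent correct among
    classified isolates, percent of isolates classified as unknown).

    records: list of (max_posterior, predicted_source, actual_source).
    """
    rows = []
    for t in thresholds:
        kept = [(pred, act) for prob, pred, act in records if prob > t]
        unknown_pct = 100.0 * (len(records) - len(kept)) / len(records)
        if kept:
            correct_pct = 100.0 * sum(p == a for p, a in kept) / len(kept)
        else:
            correct_pct = float("nan")  # nothing classified at this threshold
        rows.append((t, correct_pct, unknown_pct))
    return rows


# Hypothetical library isolates with the maximum posterior of their node.
records = [
    (0.95, "human", "human"), (0.90, "wildlife", "wildlife"),
    (0.85, "pet", "pet"), (0.75, "livestock", "livestock"),
    (0.65, "livestock", "pet"), (0.60, "wildlife", "wildlife"),
    (0.55, "pet", "livestock"), (0.45, "human", "wildlife"),
    (0.40, "wildlife", "wildlife"), (0.35, "pet", "human"),
]
rows = threshold_tradeoff(records, [0.0, 0.5, 0.8])
for row in rows:
    print(row)
```

Plotting the second and third columns against the threshold gives a curve of the kind used to choose a threshold value.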


FIG. 2. Effects of classification probability limits on Anacostia library resubstitution.

Table 6 shows how the model classified the library isolates when an 80% probability threshold was used. We observe a significant increase in the correct classification rates, both for sources individually (minimum equal to 91.9%) and overall (93.0%). Also, the number of classification errors is small compared to the resubstitution and cross-validation results (Tables 3 and 5), and there is no systematic pattern of misclassification by actual source. However, we must accept an overall rate of unknowns equal to 44.3%. The percentage of unknowns for the wildlife source is over 70%, suggesting that this source was not well differentiated by these ARA data.


TABLE 6. Anacostia Watershed library classified with an 80% probability threshold

Table 7 displays the results of applying the model, with and without the probability threshold of 80%, to water sample isolates. The threshold model classifies 500 of the 1,565 water isolates (31.9%) as unknown. The most notable change in the source prediction results occurs for livestock, which represented 27.9% of the water isolates based on the initial model but only 15.1% based on the threshold model. Also, the overall distribution of isolates among sources based on the threshold model is different from the distribution determined by the initial model. The initial model indicated that the human source, at approximately 14%, was about 50% of the other sources, which were approximately 28%. Based on the threshold model, the human and livestock sources (17.5% and 15.1%, respectively) are each the source of approximately half the percentages of the water isolates attributed to the pet and wildlife sources (34.5% and 33.0%, respectively). The threshold model, which we prefer even with its moderate rate of unknowns, suggests potentially different risk management strategies than the initial model for reducing the bacterial content of Anacostia River water.


TABLE 7. Classification of Anacostia River water samples with and without an 80% probability threshold


DISCUSSION
 
Here we discuss a variety of issues that have been raised in BST classification modeling, including missing data, cross-validation, and size and representativeness of the reference library.

Missing data.
The Anacostia Watershed reference library consisted of 1,155 isolates, each with outcomes for 37 antibiotic-concentration combinations. Of the 1,155 isolates, 1,074 had data for all antibiotic-concentration combinations; the remaining 81 isolates were missing data for 12 antibiotic-concentration combinations. Because most statistical classification methods require complete data on all variables, either isolates with one or more missing values are omitted from statistical analysis or missing values are imputed (i.e., replaced with estimates from similar isolates with complete data). CART, the tree classification software we employed, includes a procedure similar to imputation that is based on "surrogate splitters" (1, 6). In addition, the classification tree method, due to its flexibility for analyzing interactions among variables, is capable of accommodating missing data directly. Classification tree results based on utilization of the direct approach to missing data, which is described below, are more transparent than results where imputation or surrogate splitters are employed.

To accommodate missing ARA data directly, a fixed numerical value other than 0 or 1 would be assigned to each missing data entry in the reference library. Then, the analysis treats the missing value simply as another outcome value in the tree-building process. If missing data occur at random, the replacement of missing data with a value other than 0 or 1 should not alter the result that would have been achieved if there were no missing data. However, missing data may not occur at random but may be correlated with the sources. For example, if a particular antibiotic applied to isolates from a particular source often produced a laboratory result recorded as missing, the classification analysis would indicate a correlation with that source. The correlation could be an indication that a biological characteristic of the source is responsible for an unsuccessful laboratory outcome with the antibiotic in question. Therefore, the correlation could be useful for building the classification tree. If a water sample isolate from an unknown source registered "missing data" for the same antibiotic, the likelihood that the isolate came from the source that had a high percentage of missing data in the reference library for this antibiotic would be increased.
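The direct approach to missing data can be illustrated with a small Python sketch. The code value 2 and the function names are hypothetical, and the sketch assumes the recoded variable is treated as categorical, with one-versus-rest candidate splits; the article does not specify how the software treats the third value.

```python
MISSING = 2  # hypothetical code for "no result"; any fixed value other than 0 or 1


def encode_profile(raw):
    """Map a raw ARA profile (1, 0, or None for missing) to numeric codes."""
    return [MISSING if value is None else int(value) for value in raw]


def candidate_splits(profiles, j):
    """Candidate splits "variable j == v" versus "variable j != v" for each
    value v observed for variable j; none if the variable is constant."""
    values = sorted({p[j] for p in profiles})
    return [(j, v) for v in values] if len(values) > 1 else []


# Hypothetical raw profiles for three isolates and two antibiotic variables.
library = [encode_profile(raw) for raw in [[1, None], [0, 1], [1, 0]]]
print(library)
print(candidate_splits(library, 1))
```

With this coding, a split of the form "variable == 2" isolates the isolates with missing results, so any association between missingness and source can be used by the tree.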

The missing data in the Anacostia Watershed reference library were limited to 81 isolates and the same 12 antibiotic-concentration combinations. All isolates were tested with an initial panel of antibiotics in various concentrations. Later in the project, additional antibiotics and concentrations were added to the panel, requiring the growth and testing of frozen isolates. Missing data resulted when 81 random isolates would not regrow for reasons unrelated to source. Therefore, the pattern of missing data contained no information that could help identify isolate sources. We omitted the 81 isolates with missing data from all subsequent analyses.

Reference library representativeness and size.
Two important recurring questions concerning the validity of all BST classification modeling methods are (i) whether the reference library is representative of the watershed and (ii) whether the reference library consists of a sufficient number of samples and isolates to develop a reliable model. Neither question can be simply and completely answered. However, we provide guidance based on general statistical principles and the specifics of BST using ARA data.

Concerning representativeness, statistical principles dictate that samples should be collected in a manner that would provide coverage for the total watershed. Stratification and proportional sampling by strata could be used if there was information indicating that certain areas within the watershed were more likely as sources of fecal bacteria than other areas. Also, if information on the relative population sizes of the contributing sources (i.e., pet, human, livestock, and wildlife for the Anacostia Watershed) was available, the source categories could be used to define strata and a proportional sampling scheme for these strata could be designed. However, in any initial BST investigation such as the investigation conducted for the Anacostia Watershed, it is unlikely that the information needed to define strata would be well developed. Therefore, a sampling design based on a combination of expert scientific judgment and convenience is a practical choice.

For the Anacostia Watershed, sampling was conducted after an initial field survey designed to identify likely "major" contributors to fecal bacteria contamination. Field personnel were instructed to concentrate their scat collection efforts around these major sources.

The size of the reference library is determined by specifying the number of field samples to be collected, the number of isolates to be grown from each sample, and the number of isolates analyzed by ARA. Determining the size (i.e., number of samples and isolates) required for a reference library may follow the usual approach for setting requirements for the number of observations in a statistical sample: (i) state a well-defined quantitative objective for the statistical analysis, (ii) determine the size of the statistical error considered acceptable for an estimate of the objective, and (iii) select a probability or confidence level for ensuring that the estimation error is not exceeded. This approach, although conceptually straightforward, is complicated in application and has not been applied in the BST literature.

As an introduction to the procedure for determining the size of the reference library for classification tree modeling, we provide an example of this procedure and its complications with a more familiar statistical classification method, i.e., linear discriminant analysis. As a first step, it would be useful to know whether the means of the antibiotic-concentration combinations used to build the classification model (the variables in the discriminant function) are statistically different across the sources. For two sources and data that have an approximately normal distribution, the test is straightforward and depends explicitly on the number of isolates from each source (7). It follows that 95% confidence contours could be developed for the mean vectors for each source and inspected to determine whether these contours overlapped. If the overlap was pronounced, it is likely that the classification model would perform poorly. The widths of the confidence contours depend on the inherent variation in the data and the number of isolates from each source. As such, the number of isolates from each source could be determined to achieve a specified separation of the means for the two sources. The complications for BST using linear discriminant analysis to model ARA data are that (i) ARA data do not follow an approximately normal distribution and (ii) extensions of the two-source confidence contour analysis to multiple sources have not been fully developed. These complications affect all statistical classification methods that rely on the assumption that the data have a normal distribution.
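
For reference, the two-source comparison of mean vectors described above is commonly carried out with Hotelling's two-sample T-squared statistic, which assumes multivariate normality and a common covariance matrix. The notation below (n_1 and n_2 for the numbers of isolates from each source, p for the number of antibiotic-concentration variables, and the sample means and covariances) is introduced here for illustration and does not appear in the original text.

```latex
T^2 = \frac{n_1 n_2}{n_1 + n_2}
      \left(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2\right)^{\top}
      \mathbf{S}_{\text{pooled}}^{-1}
      \left(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2\right),
\qquad
\mathbf{S}_{\text{pooled}} =
      \frac{(n_1 - 1)\,\mathbf{S}_1 + (n_2 - 1)\,\mathbf{S}_2}{n_1 + n_2 - 2},

\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\,p}\; T^2 \;\sim\; F_{p,\; n_1 + n_2 - p - 1}
\quad \text{under equal population mean vectors.}
```

The factor n_1 n_2 / (n_1 + n_2) is where the number of isolates from each source enters explicitly: for a fixed separation of the means and fixed variation in the data, larger isolate counts give a larger statistic and narrower confidence contours.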

We describe an approach for determining the reference library size for BST by using the tree classification method, but practical implementation of this approach must await further development. For example, set the well-defined quantitative objective to be the rate of correct classification as estimated by resubstitution, and require a reference library large enough to ensure no more than a 5% error (plus or minus) with 99% confidence for the estimate. Then, by considering models with different numbers of terminal nodes, the required number of isolates could be determined, which, based on a determination of the number of isolates per sample, leads to the number of samples. Although this objective can be stated simply, it has not been translated into a requirement for the size of a reference library for use with the classification tree method. Another approach for determining the required number of isolates would be to require a small error for the estimates of the maximum posterior probability in terminal nodes. Recall that the maximum posterior probability in a terminal node determines the source for that node.
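
To give a rough sense of scale for the "5% error with 99% confidence" objective, the sketch below treats the rate of correct classification as a simple binomial proportion estimated from independent isolates and applies the usual normal-approximation sample size formula. These are simplifying assumptions made here, not part of the approach described above: the calculation ignores the optimism of resubstitution estimates, the number of terminal nodes, and any correlation among isolates from the same sample. The function name and the default p = 0.5 are illustrative choices.

```python
import math
from statistics import NormalDist


def isolates_for_margin(margin, confidence, p=0.5):
    """Isolates needed so a binomial proportion estimate is within +/- margin.

    Normal-approximation formula n = z^2 * p * (1 - p) / margin^2, with
    p = 0.5 as the conservative (largest-variance) default.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


# 5% margin (plus or minus) with 99% confidence, as in the example above.
n_required = isolates_for_margin(0.05, 0.99)
```

Under these assumptions the formula gives 664 isolates. A requirement that properly accounts for the tree structure would need the further development noted in the text.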

A concept referred to as artificial clustering has been suggested by some researchers as a way to judge whether the reference library is large enough (5, 12). However, it is unlikely that this concept is informative for determining the necessary size of a reference library. As described by Harwood et al. (5), a test of the library size would be conducted by randomly assigning the isolates in the library to sources. For example, if there were four sources, each isolate would have a probability equal to 0.25 of being assigned to each of the four sources. The statistical classification procedure would then be applied to the library isolates, and the resulting model would be used to reclassify the isolates among sources. Because the isolates were assigned to sources at random (i.e., the assignment was independent of the ARA outcomes), if the statistical classification procedure is not biased, it would function as a random classification procedure. Each isolate would have a 25% chance of being assigned to each source. The expected resubstitution result would be "25% of the isolates classified into each source." Various authors have relied on empirical results showing that as the library size increases, the source classifications deviate less from the expected 25%. Therefore, the realization of large deviations from the expected 25%, a circumstance referred to as artificial clustering, is interpreted as an indication that the library is too small.

Artificial clustering as defined, implemented, and interpreted does not provide useful information about the adequacy of library size. Let N be the number of isolates in a reference library. Assume that the isolates came from four sources. Any individual random assignment of the N isolates to the four sources, which subsequently is analyzed by a statistical classification method, would be expected to classify 25% of the N isolates into each source category. A deviation from the 25% target has two possible causes. First, the statistical classification method may be biased. It is generally assumed that this is not the case. Second, the deviation could be a consequence of statistical variation, and larger deviations would be more likely for smaller reference libraries. Deviations from the 25% target can be predicted on a probability basis using the multinomial distribution, but even then they provide minimal guidance for determining if the size of the reference library is acceptable.
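
The statistical-variation explanation can be illustrated with a short simulation. The sketch below assumes the classification procedure behaves as an unbiased random assignment of each isolate to one of four sources, so that the counts follow a multinomial distribution with equal probabilities. The function names, library sizes, and trial count are hypothetical and serve only to show the trend.

```python
import math
import random


def share_std(n_isolates, n_sources=4):
    """Standard deviation of the share of isolates landing in one source
    when each isolate is assigned uniformly at random (multinomial marginal)."""
    p = 1.0 / n_sources
    return math.sqrt(p * (1.0 - p) / n_isolates)


def simulate_max_deviation(n_isolates, n_sources=4, n_trials=500, seed=0):
    """Average, over trials, of the largest absolute deviation of any
    source's share from 1/n_sources under uniform random assignment."""
    rng = random.Random(seed)
    target = 1.0 / n_sources
    total = 0.0
    for _ in range(n_trials):
        counts = [0] * n_sources
        for _ in range(n_isolates):
            counts[rng.randrange(n_sources)] += 1
        total += max(abs(c / n_isolates - target) for c in counts)
    return total / n_trials


# Hypothetical library sizes, chosen only to show the trend.
for n in (50, 200, 800):
    print(n, round(share_std(n), 4), round(simulate_max_deviation(n), 4))
```

The deviations shrink roughly in proportion to the inverse square root of N regardless of how well the ARA variables separate the true sources, which is consistent with the argument that they say little about whether a library is adequate for classification.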

At the current stage of BST model development, the only way to assess the size of a reference library is through sensitivity analysis: varying the size of the library, or adding more isolates from known sources, and evaluating the stability of the resulting model.

Interpretation of water sample isolate source predictions.
We used a statistical method, classification tree analysis, to build a model for classifying isolates among sources and applied the model to water sample isolates with unknown sources. The resulting distribution of isolates among sources has implications for risk management interventions designed to reduce bacterial contamination of river waters. Our analysis suggests that the human and livestock sources contribute fewer bacteria to the Anacostia River than the pet and wildlife sources (Table 7). Although we are confident that the results indicate the correct attribution for sources, we have not conducted a statistical test to determine whether the percentages differ significantly, and at present we have not developed the methodology for conducting such a test. We are currently investigating methods for efficiently evaluating statistical differences among predicted source percentages derived from classification tree models.


FOOTNOTES
 
* Corresponding author. Mailing address: Price Associates, One North Broadway, White Plains, NY 10601. Phone: (914) 686-7975. Fax: (914) 686-7977. E-mail: bprice@priceassociatesinc.com.


REFERENCES
 
  1. Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1998. Classification and regression trees. Chapman & Hall/CRC, Boca Raton, Fla.
  2. Chernick, M. R. 1999. Bootstrap methods, a practitioner's guide. John Wiley & Sons, Inc., New York, N.Y.
  3. Graves, A. K., C. Hagedorn, A. Teetor, M. Mahal, A. M. Booth, and R. B. Reneau, Jr. 2002. Antibiotic resistance profiles to determine sources of fecal contamination in a rural Virginia watershed. J. Environ. Qual. 31:1300-1308.
  4. Hagedorn, C., S. L. Robinson, J. R. Filtz, S. M. Grubbs, T. A. Angier, and R. B. Reneau. 1999. Determining sources of fecal pollution in a rural Virginia watershed with antibiotic resistance patterns in fecal streptococci. Appl. Environ. Microbiol. 65:5522-5531.
  5. Harwood, V. J., J. Whitlock, and V. Withington. 2000. Classification of antibiotic resistance patterns of indicator bacteria by discriminant analysis: use in predicting the source of fecal contamination in subtropical waters. Appl. Environ. Microbiol. 66:3698-3704.
  6. Hastie, T., R. Tibshirani, and J. H. Friedman. 2001. The elements of statistical learning. Springer, New York, N.Y.
  7. Johnson, R. A., and D. W. Wichern. 1992. Applied multivariate statistical analysis. Prentice Hall, Upper Saddle River, N.J.
  8. Maryland Department of the Environment. 2005. Draft total maximum daily loads of fecal bacteria for the non-tidal Anacostia River Basin in Montgomery and Prince George's Counties, Maryland. [Online.] http://www.mde.state.md.us/assets/document/Anacostia_%20fc_TMDL-08-03-2005_PN(1).pdf.
  9. Parveen, S., R. L. Murphree, L. Edmiston, C. W. Kaspar, K. M. Portier, and M. L. Tamplin. 1997. Association of multiple-antibiotic-resistance profiles with point and nonpoint sources of Escherichia coli in Apalachicola Bay. Appl. Environ. Microbiol. 63:2607-2612.
  10. Simpson, J. M., J. W. Santo Domingo, and D. J. Reasoner. 2002. Microbial source tracking: state of the science. Environ. Sci. Technol. 36:5279-5288.
  11. U.S. Environmental Protection Agency. 2005. Microbial source tracking guide document, EPA/600-R-05-064, June 2005. U.S. Environmental Protection Agency, Washington, D.C.
  12. Whitlock, J. E., D. T. Jones, and V. J. Harwood. 2002. Identification of the sources of fecal coliforms in an urban watershed using antibiotic resistance analysis. Water Res. 36:4273-4282.
  13. Wiggins, B. A. 1996. Discriminant analysis of antibiotic resistance patterns in fecal streptococci, a method to differentiate human and animal sources of fecal pollution in natural waters. Appl. Environ. Microbiol. 62:3997-4002.
  14. Wiggins, B. A., R. W. Andrews, R. A. Conway, C. L. Corr, E. J. Dobratz, D. P. Dougherty, J. R. Eppard, S. R. Knupp, M. C. Limjoco, J. M. Mettenburg, J. M. Rinehardt, J. Sonsino, R. L. Torrijos, and M. E. Zimmerman. 1999. Use of antibiotic resistance analysis to identify nonpoint sources of fecal pollution. Appl. Environ. Microbiol. 65:3483-3486.
  15. Wiggins, B. A., P. W. Cash, W. S. Creamer, S. E. Dart, P. P. Garcia, T. M. Gerecke, J. Han, B. L. Henry, K. B. Hoover, E. L. Johnson, K. C. Jones, J. G. McCarty, J. A. McDonough, S. A. Mercer, M. J. Noto, H. Park, M. S. Phillips, S. M. Purner, B. M. Smith, E. N. Stevens, and A. K. Varner. 2003. Use of antibiotic resistance analysis for representativeness testing of multiwatershed libraries. Appl. Environ. Microbiol. 69:3399-3405.

