Applied and Environmental Microbiology, September 2006, p. 5915-5926, Vol. 72, No. 9
0099-2240/06/$08.00+0 doi:10.1128/AEM.02453-05
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Department of Microbiology, University of Barcelona, Avda. Diagonal 645, Barcelona, Spain,1 Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Jordi Girona 1-3, Barcelona, Spain,2 EPHRU, School of the Environment, University of Brighton, Brighton, United Kingdom,3 Laboratoire de Chimie Physique et Microbiologie pour l'Environnement (LCPME), UMR 7564 CNRS/UHP-Nancy I, Faculté de Pharmacie, 5 rue Albert Lebrun, 54000 Nancy, France,4 Water and Environmental Microbiology, SMI, Swedish Institute for Infectious Disease Control, SE 171 82 Solna, Sweden,5 State General Laboratory, Microbiological Section, Kimonos 44, 1451 Nicosia, Cyprus,6 Microbiology and Tumor Biology Center, Karolinska Institute, Box 280, S-171 77 Stockholm, Sweden7
Received 17 October 2005/ Accepted 28 June 2006
In our opinion, it is necessary to first identify tracers or combinations of tracers demonstrating high discrimination and then adapt these methods to the needs of source tracking studies. Consequently, both new conceptual and methodological approaches are needed in order to develop models for microbial source tracking. These new approaches should address, step by step, the factors that could influence the successful determination of the source of fecal pollution. These factors include the nature of the dominant fecal pollution contributions (anthropogenic or nonanthropogenic pollution), dilution, the persistence of indicators and parameters, the presence of complex mixtures from several distinct animal species, and the selection of appropriate and consistent numerical methods for the development of models.
The study described herein was initially designed to focus on five key components that were selected after analyzing the results of several previous investigations reported in recent reviews (15, 54, 55). Initially, the study focused only on the differentiation between human and nonhuman sources. Secondly, examination of highly polluted wastewaters or slurries was included because failures previously reported in the literature were often related to the use of dilute samples that give values under the threshold of the method investigated (46, 47). The third key element was to study widely different geographical areas, as clear geographical variations in results have been reported for several approaches described in the literature (50, 57, 59). The fourth key element was to include several indicators of fecal pollution from both human and nonhuman sources throughout the study, since this is needed for defining ratios between discriminant and nondiscriminant indicators and for defining the persistence of the values of fecal contaminants in the environment. Finally, the fifth key element was to identify statistical or machine learning methods to develop appropriate predictive models.
Recently, a multilaboratory study was undertaken in the following areas of Europe: northern Europe (Stockholm, Sweden), northwestern Europe (Brighton, United Kingdom), central Europe (Nancy, northeastern France), southeastern Europe (Nicosia, Cyprus), and southwestern Europe (Barcelona, Spain). During a first phase, quality control schemes were agreed upon and published (6). In order to fully acquaint all relevant laboratory personnel from the participating laboratories with the methods and the materials to be analyzed, collaborative training sessions, in which reference materials were used, were undertaken prior to an interlaboratory comparison study. Furthermore, the reference materials were used as standard samples for first-line quality control during the full study. In this first phase, nine methods achievable in all laboratories (enumeration of fecal coliforms, enterococci, Clostridium perfringens, somatic coliphages, F-specific RNA phages, and bacteriophages infecting Bacteroides fragilis RYC2056; genotyping of F-specific RNA phages; and biochemical phenotyping of fecal coliforms and enterococci by a miniaturized system) were applied to local samples in all the laboratories. The novel methods, or methods not achievable in some of the laboratories, were applied to local samples only in the laboratories that had the appropriate facilities and/or expertise. The novel methods were those related to the specific detection of Bifidobacterium species and bacteriophages infecting Bacteroides thetaiotaomicron GA17, detection of adenoviruses and enteroviruses by genomic methods (PCR and reverse transcription-PCR), genotyping of Giardia, and determination of fecal sterols. 
Following this first phase, several methods (detection of adenoviruses and enteroviruses by genomic methods, genotyping of Giardia, and analysis of antibiotic resistance profiles) were rejected either because of failures (suspected true or false negatives) or because the method gave unreliable results in some of the laboratories. Results of this first phase were reported elsewhere previously (6). Taking the results of the first phase of study into account, the second phase involved sampling of wastewaters and slurries of human and animal origin in the different geographical areas using the single and derived parameters presented in Table 1. Moreover, a number of statistical methods were tested to aid in the identification and classification of sources of fecal pollution in water based on microbial and/or chemical indicators, which have been proposed as discriminant tracers. These included discriminant analysis (17, 48, 58), the nearest-neighbor technique (maximum similarity) (10, 13, 51), and the use of artificial neural networks (9, 18). Until now, there has been no widespread consensus on when the use of each of the methods is the most appropriate, and to date, none of these methods have provided an interpretable model. Furthermore, there is no consensus on the most appropriate statistical analyses to determine the set of optimal variables for developing these predictive models (51). Other numerical methods should be assayed in order to develop predictive models for source tracking. Specifically, machine learning methods (45) have been used with a considerable degree of success in many disciplines. Their potential application to microbial source tracking should therefore be evaluated.
TABLE 1. Distribution of samples from the geographical areas sampled by the research groups participating in this study
Detection and enumeration of bacterial indicators.
Three fecal indicators were measured: fecal coliforms, enterococci, and clostridia. Standardized methods (23, 25, 28) for the enumeration of these indicators were followed. Briefly, fecal coliforms were enumerated by membrane filtration on 0.45-µm-pore-size membranes followed by incubation for 24 h on mFC agar (Difco, Detroit, Mich.) at 44.5°C according to established procedures (28). Enterococci were also enumerated by membrane filtration according to standardized protocols by incubation on m-Enterococcus agar (Difco, Detroit, Mich.) at 37°C for 48 h. Membranes were then transferred to bile esculin agar (Difco) for 1 h at 44°C to confirm the enterococcal colonies on the basis of esculin hydrolysis. Clostridia were enumerated after thermal shock of samples at 80°C for 10 min. Tenfold dilutions were then made in one-quarter-strength Ringer's solution, and 1 ml of each dilution was inoculated into 50 ml of liquid sulfite polymyxin sulfadiazine agar (Difco) followed by incubation at 44°C for 24 h.
Detection, enumeration, and typing of bacteriophages.
Somatic coliphages, F-specific RNA bacteriophages, and phages infecting Bacteroides fragilis RYC2056 were enumerated in accordance with ISO standardized methods (27, 29, 30). PFU of somatic coliphages were counted by the double-agar-layer technique using Escherichia coli strain WG5 according to ISO standard 10705-2 (29). Total numbers of F RNA and F-specific RNA bacteriophage PFU were determined using Salmonella enterica serovar Typhimurium strain WG49 (now classified as Salmonella enteritidis subsp. typhimurium) in accordance with ISO standard 10705-1 (27). PFU of bacteriophages infecting Bacteroides fragilis strain RYC2056 were determined by the double-agar-layer method according to ISO standard 10705-4 (30). Phages infecting Bacteroides thetaiotaomicron GA17 were enumerated as described elsewhere previously (49) according to ISO standard 10705-4 (30). As stated in Results, plaques obtained on GA17 in the United Kingdom were very turbid. In this case, plaques were counted by researchers more experienced in the technique, and suspected plaques were verified by subculture (enrichment followed by spot test).
Genotyping of F-specific RNA phages.
The distribution of genotypes of F-specific RNA bacteriophages was determined by plaque hybridization as previously described (52) using previously described probes (3).
Phenotyping of fecal coliforms and fecal streptococci.
From each sample, 24 fecal coliform colonies and 24 Enterococcus colonies were selected at random from selective agar plates (23, 28) containing between 30 and 100 colonies and were picked from these plates to obtain a pure culture for biochemical phenotyping. The number of bacterial isolates required in each sample for diversity analysis was previously determined by other authors (4). Biochemical phenotyping was performed using PhP-RE and PhP-RF microplates according to the manufacturer's instructions (PhP-Plate Microplates Techniques AB, Sweden) and previously described techniques (35). The basis of biochemical fingerprinting using these microplates has also been described previously (33). The biochemical profiles were calculated for each isolate as previously described (36) by using PhpWin software (PhP-Plate Microplates Techniques AB). Simpson's diversity index (Di) was used to calculate the diversity of bacterial populations in each group studied (2, 19), while similarity between populations was calculated by the population similarity coefficient (36). Calculations of diversity (Di), population similarity indices, and correlation coefficients and cluster analyses were also performed using PhpWin software (PhP-Plate Microplates Techniques) as previously described (36). In addition, the species distribution of Enterococcus species was analyzed using a previously described matrix (41) and a previously described procedure (5). The percentage of Enterococcus faecium plus Enterococcus faecalis isolates (variable FMFS) and the percentage of Enterococcus hirae isolates (variable HiR) were also calculated, because differences in their proportions in wastewaters of animal and human origin have been reported previously in other studies (34). Similarly, the percentage of E. coli within the fecal coliforms was determined by comparing isolates with E. coli PhP-Plate reference phenotypes. 
Additionally, the percentage of those fecal coliform isolates that did not demonstrate fermentation of cellobiose was also calculated. E. coli isolates are normally cellobiose negative (16), whereas other thermotolerant coliform species showing E. coli-like colonies on mFC agar are often cellobiose positive, and thus, this proportion is an estimation of the proportion of E. coli isolates among the E. coli-like isolates.
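The diversity calculation described above can be illustrated with a minimal sketch. It assumes the Hunter-Gaston form of Simpson's index, Di = 1 − Σ n_j(n_j − 1) / [N(N − 1)], where n_j is the number of isolates with phenotype j and N is the total number of isolates; the exact formula implemented in PhpWin is not stated in this section, so this form is an assumption. The function name and the phenotype labels are hypothetical.

```python
from collections import Counter


def simpson_diversity(phenotypes):
    """Simpson's diversity index in the Hunter-Gaston form (assumed form)."""
    counts = Counter(phenotypes)
    n_total = sum(counts.values())
    if n_total < 2:
        raise ValueError("at least two isolates are required")
    # Number of ordered isolate pairs that share the same phenotype.
    same_type_pairs = sum(n * (n - 1) for n in counts.values())
    return 1.0 - same_type_pairs / (n_total * (n_total - 1))


# Hypothetical sample: 24 isolates labeled by biochemical phenotype.
isolates = ["A"] * 12 + ["B"] * 6 + ["C"] * 4 + ["D", "E"]
di = simpson_diversity(isolates)
```

A sample in which every isolate shares one phenotype gives Di = 0, and a sample in which every isolate is distinct gives Di = 1.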
Bifidobacterium determinations.
Total bifidobacteria were counted on human bifido sorbitol agar as described previously by other authors (42). Yellow colonies on human bifido sorbitol agar were counted as sorbitol-fermenting bifidobacteria as described elsewhere previously (8). Additionally, the presence of Bifidobacterium dentium and Bifidobacterium adolescentis was determined by PCR amplification using specific primers of the 16S RNA genes as described elsewhere previously (7).
Determination of fecal sterols.
Sterols in wastewater with a high solids content were analyzed as previously described (37). First, separation of the solid fraction from 100-ml volumes of each sample was carried out by filtration through glass filters. The filters were then weighed and frozen at −70°C until analysis. Gas chromatography with flame ionization detection analysis of four main fecal sterols (coprostanol [5β-cholestan-3β-ol], stigmastanol or 24-ethylcoprostanol [24-ethyl-5β-cholestan-3β-ol], epicoprostanol [5β-cholestan-3α-ol], and cholestanol [5α-cholestan-3β-ol]) was then performed.
Establishment of operating principles and quality assurance.
In order to establish a set of operating principles for data quality, a training session for operators from all the participant laboratories was undertaken. Noncertified reference materials (bacterial strains and bacteriophages) were prepared and used during the training session as previously described (39). These reference materials were provided to the partners at the end of an interlaboratory exercise session in order to evaluate the implementation of the methods in participating laboratories. Moreover, these reference materials were used in routine quality control practices at the participating laboratories. Taking into account the available facilities in the different laboratories and the results of the interlaboratory exercises, the following parameters were tested in each of the five laboratories: enumeration of fecal coliform bacteria, enterococci, clostridia, somatic coliphages, F-specific RNA phages, total bifidobacteria, sorbitol-fermenting bifidobacteria, bacteriophages infecting B. fragilis RYC2056, and bacteriophages infecting B. thetaiotaomicron GA17; genotyping of F-specific RNA phages; and phenotypic characterization of fecal coliforms and enterococci. Detection of Bifidobacterium dentium and Bifidobacterium adolescentis by PCR and fecal sterol analysis of all samples were performed in the laboratories of the University of Barcelona.
Data treatment and statistical analyses.
Raw data from the analyses performed provided 26 variables, as presented in Table 2. This initial group of variables used in the statistical analyses consisted of the 20 single variables and 6 derived variables from the phenotyping of fecal coliforms and enterococci (percentage of cellobiose-negative fecal coliforms [CNFC], diversity index for fecal coliforms [DiC], diversity index for enterococci [DiE], percentage of E. coli PhP-Plate phenotypes [ECP], FMFS, and HiR). Values that were below the threshold value (lowest sensitivity) for the method were set to the threshold value.
TABLE 2. Definition of terms used for single and derived variables in the statistical and machine learning methods of this study
TABLE 3. Bacterial indicators and bacteriophage densities
TABLE 7. Concentrations of sterols in human samples and animal samples
The methods chosen were the k nearest-neighbor technique (with Euclidean distance), the linear and quadratic Bayesian classifiers (two discriminant analysis methods) (14), and the support vector machine (11). The development of predictive models was carried out using 81 of the 103 observations (hereafter referred to as the "training set") and using cross-validation, as explained below. The remaining 22 observations (the "test set") were withheld for an independent and unbiased assessment of the feasibility of the predictive models. These holdout observations presented unequivocally distinct values according to their origins (11 from waters polluted by human fecal sources and 11 from waters polluted by nonhuman fecal sources). These analyses were performed using the software package WEKA (60).
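The holdout design and the simplest of these classifiers can be sketched as follows. This is an illustration only: the study used WEKA, whereas the sketch below uses NumPy, and the data are synthetic stand-ins, not the study's measurements. The class coding (1 = human, 0 = nonhuman), the 52/51 class sizes, and all function names are assumptions; only the 81/22 split with 11 test samples per class comes from the text.

```python
import numpy as np


def standardize(train, other):
    """Scale both arrays with the mean and standard deviation of the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant variables
    return (train - mean) / std, (other - mean) / std


def one_nn_predict(x_train, y_train, x_query):
    """Give each query row the class of its Euclidean nearest training row."""
    diffs = x_query[:, None, :] - x_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return y_train[dists.argmin(axis=1)]


# Synthetic data: 103 observations, 2 variables, two well-separated classes.
rng = np.random.default_rng(0)
y = np.array([1] * 52 + [0] * 51)  # hypothetical class sizes
x = rng.normal(size=(103, 2)) + np.where(y[:, None] == 1, 10.0, 0.0)

# Withhold 11 human and 11 nonhuman observations as the test set.
human_idx = rng.permutation(np.flatnonzero(y == 1))
nonhuman_idx = rng.permutation(np.flatnonzero(y == 0))
test_idx = np.concatenate([human_idx[:11], nonhuman_idx[:11]])
train_idx = np.concatenate([human_idx[11:], nonhuman_idx[11:]])

x_train_s, x_test_s = standardize(x[train_idx], x[test_idx])
predicted = one_nn_predict(x_train_s, y[train_idx], x_test_s)
accuracy = float((predicted == y[test_idx]).mean())
```

Scaling the test set with statistics computed on the training set alone keeps the withheld observations out of model development, which is what makes the assessment independent.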
Distribution of genotypes of F-specific RNA phages.
The descriptive statistics for the distribution of genotypes of F-specific RNA bacteriophages are shown in Table 4. As described elsewhere previously (52), the method used here (53) gave a percentage of plaques (ranging from 0 to 10% in different samples) that hybridized with the probes of two different genotypes. These plaques were assigned to the genotype that showed the stronger hybridization signal. Percentages of genotypes were also calculated after excluding the counts of the plaques hybridizing with two probes. The final percentages of genotypes obtained with both approaches were similar (data not shown). The relative distributions of genotypes I, II, and IV in human samples and animal samples were significantly different. Genotypes I and IV were significantly more abundant in animal samples, and genotype II was significantly more abundant in human samples (P < 0.001). The percentages of genotype III in human and animal samples did not differ significantly, although the average percentage in human samples was slightly higher than that in animal samples. The major differences between human and animal samples were shown by genotype II. However, the percentage of genotype II in some animal samples (6%) was the highest among genotypes I to IV, and in 5% of human samples, it was the lowest among all the genotypes. The rule that genotypes II and III were higher in humans and I and IV were higher in animals held in all human samples but failed in 35% of the animal samples.
TABLE 4. Percentages of the four genotypes of F-specific RNA phages in human and animal samples
TABLE 5. Levels of Simpson's diversity index and percentages of different microorganisms in human and animal samples
0.5 for human samples and 0.5 for animal samples, there is still a 5% failure rate. However, this may be considered the best value for this variable to differentiate sources, judging by the percentage of correct sample classification achieved. The selection of other values as reference criteria resulted in more failures.
TABLE 6. Ratios between the values in log10 units of sorbitol-fermenting bifidobacteria and those of total bifidobacteria in human and animal samples
Fecal sterols.
The descriptive statistics for fecal sterol concentrations are shown in Table 7. The concentrations of 24-ethylcoprostanol, epicoprostanol, and cholestanol showed significant differences between human and animal samples, whereas coprostanol did not (P > 0.01), although it gave a P value of 0.054. Concentrations of 24-ethylcoprostanol varied the most between human and animal samples. Although this was the fecal sterol (among those analyzed) that differed the most in all cases, the high number of overlaps prevented the establishment of a reference concentration for differentiation. The criterion that the concentration of 24-ethylcoprostanol is greater than the concentration of coprostanol in animal samples, and vice versa in human samples, seems to be the most discriminant among the data reported here. The percentage of incorrectly classified samples based on this criterion was 6.5%.
Correlation, regression, and discriminant analyses.
High linear correlation was found between the derived variable SOMCPH/BTHPH (ratio of the number of somatic coliphages [SOMCPH] to the number of phages infecting B. thetaiotaomicron GA17 [BTHPH]) and the class variable (r = 0.886) and also between the derived variable FC/BTHPH (ratio of the number of fecal coliforms [FC] to the number of phages infecting B. thetaiotaomicron GA17) and the class variable (r = 0.847). The correlation between these two derived variables was very high (r = 0.912). Best-subset regression performed with the 26 initial variables indicated that subsets with as few as seven variables (number of fecal enterococci [FE], percentage of genotype II of F-specific RNA bacteriophages [FRNAPH II], FRNAPH IV, concentration of epicoprostanol or coprostanol [EPICOP], FMFS, ECP, and detection of the presence or absence of Bifidobacterium adolescentis [BA]) gave an explanatory power (85.1%) almost equal to that provided by using all the single variables (85.7%). This points to high redundancy among the available variables, and excessive redundancy in the data may reduce performance (38). However, the subsets of variables obtained may be taken as a first indication of relevance. Two-group discriminant analyses (for human or nonhuman fecal samples) using all 26 measured microbiological and chemical parameters provided a correct classification in 100% of the cases. Taken together, the microbial and chemical indicators therefore allowed a predictive classification of cases by discriminant analysis. The question remains whether the same performance can be achieved using a lower number of variables, which would decrease the number of parameters measured, reduce costs, and provide simpler models that are easier to analyze from a microbiological point of view. Specifically, we were interested in finding the smallest subset of variables that was able to provide a correct classification in 100% of the cases.
This was a difficult undertaking that could not be addressed by "generate-and-test" methods and is one of the main reasons why we expanded the toolbox to consider other statistical or machine learning methods. Before doing this, the same discriminant analyses were performed using only the subsets of variables that were considered meaningful. For instance, when only the bacterial indicators analyzed in this study were used (Table 3), 75% of nonhuman samples were correctly classified as nonhuman (25% of nonhuman samples were classified as human samples), and 3.7% of samples of human origin were misclassified as nonhuman samples (96.3% of human samples were classified as human samples). Classification using only the four genotypes of F-specific RNA phages allowed 98% of nonhuman and 85% of human samples to be classified correctly (with false-positive and false-negative rates being 0.02 and 0.14, respectively). The fecal sterols studied did not show better correct classifications, since only 38% of nonhuman samples were correctly classified, although 98% of human samples were correctly classified. Similar levels of correct classification were found for the phenotypic analysis of fecal coliforms and enterococcal populations: 79% of nonhuman and 89% of human samples were correctly determined (with false-positive and false-negative rates being 0.17 and 0.13, respectively). Finally, the classification functions developed using the results from the enumeration of the various bacteriophages (somatic coliphages, F-specific RNA phages, bacteriophages infecting B. fragilis RYC2056, and bacteriophages infecting B. thetaiotaomicron GA17) provided a correct classification (human versus nonhuman samples) in all cases.
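The false-positive and false-negative rates quoted above can be made concrete with a short sketch. It assumes that "human" is treated as the positive class, so that a false positive is a nonhuman sample classified as human and a false negative is a human sample classified as nonhuman; this reading is consistent with the figures in the text but is not stated explicitly. The function name and the sample counts are hypothetical and were chosen only to give round proportions.

```python
def error_rates(true_labels, predicted_labels, positive="human"):
    """Return (false_positive_rate, false_negative_rate) for a two-class problem."""
    pairs = list(zip(true_labels, predicted_labels))
    predicted_for_negatives = [p for t, p in pairs if t != positive]
    predicted_for_positives = [p for t, p in pairs if t == positive]
    if not predicted_for_negatives or not predicted_for_positives:
        raise ValueError("both classes must be present")
    fp_rate = sum(p == positive for p in predicted_for_negatives) / len(predicted_for_negatives)
    fn_rate = sum(p != positive for p in predicted_for_positives) / len(predicted_for_positives)
    return fp_rate, fn_rate


# Hypothetical counts: 20 nonhuman samples (5 called human) and
# 27 human samples (1 called nonhuman).
true = ["nonhuman"] * 20 + ["human"] * 27
pred = ["human"] * 5 + ["nonhuman"] * 15 + ["human"] * 26 + ["nonhuman"] * 1
fp_rate, fn_rate = error_rates(true, pred)
```

With these invented counts, 75% of nonhuman samples and about 96.3% of human samples are classified correctly, which mirrors the kind of asymmetry reported for the bacterial indicators.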
Machine learning methods.
The Relief algorithm provided a list of individual variables arranged according to their discriminatory power. The top three variables in this list were SOMCPH/BTHPH, FC/BTHPH, and FRNAPH II. The next group comprised the variables FMFS, FRNAPH II + FRNAPH III (sum of the percentages of genotypes II and III of F-specific bacteriophages), and FRNAPH I + FRNAPH IV. This outcome was used to build an optimal solution by using the four methods indicated above (Euclidean one-nearest-neighbor technique, linear Bayesian classifier, quadratic Bayesian classifier, and support vector machine). The main finding was that a set of just two variables (SOMCPH/BTHPH and SOMCPH) gave 100% correct classification of the training set with all the inductive learning methods. The two variables FC/BTHPH and FC also provided excellent results using the four methods (100%, 98.8%, 98.8%, and 100% correct classification rates, respectively). The observations can be displayed in a two-dimensional scatter plot with no loss of information (Fig. 1). It is clear that observations of samples of human and samples of nonhuman origin are neatly separated. With this information, a linear separation is feasible. These two pairs of variables also gave 100% correct classification for the 22 samples in the withheld test set. Also noteworthy is the fact that there were no apparent differences between the various geographical sites. In other words, there were no subclusters. A secondary finding was that other subsets of three or more variables also showed good discriminating ability (for example, FRNAPH I, FRNAPH II, ECP, BA, and SFBIF). However, these variables gave some incorrect classifications and lower identification rates overall (between 85 and 95%).
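How a Relief-type ranking scores variables can be shown with a minimal sketch of the basic two-class algorithm. The variant and settings used in the study are not given in this section, so the choices below (every observation visited once, Manhattan distance on range-scaled variables, a single nearest hit and nearest miss) are assumptions. The data are a tiny invented set in which the first variable separates the classes and the second is noise.

```python
import numpy as np


def relief_weights(x, y):
    """Basic two-class Relief: a variable gains weight when it differs at the
    nearest observation of the other class (miss) and loses weight when it
    differs at the nearest observation of the same class (hit).
    Requires at least two observations per class."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    span = x.max(axis=0) - x.min(axis=0)
    span[span == 0] = 1.0
    scaled = (x - x.min(axis=0)) / span  # each variable mapped to [0, 1]
    n_obs, n_vars = scaled.shape
    weights = np.zeros(n_vars)
    for i in range(n_obs):
        diff = np.abs(scaled - scaled[i])  # per-variable differences to row i
        dist = diff.sum(axis=1)            # Manhattan distance to row i
        dist[i] = np.inf                   # never select the row itself
        same_class = y == y[i]
        hit = np.where(same_class, dist, np.inf).argmin()
        miss = np.where(~same_class, dist, np.inf).argmin()
        weights += (diff[miss] - diff[hit]) / n_obs
    return weights


# Invented data: column 0 is informative, column 1 is noise.
x_demo = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y_demo = [0, 0, 0, 1, 1, 1]
weights = relief_weights(x_demo, y_demo)
ranking = np.argsort(-weights)  # variable indices, most discriminant first
```

Sorting the weights in decreasing order yields the kind of ranked list described above, with the informative variable placed first.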
FIG. 1. Distribution of training observations according to the variables SOMCPH/BTHPH and SOMCPH. Values are standardized to zero mean and unit standard deviation.
Results reported herein provide interesting information on the various conventional fecal indicators tested because of the broad spectrum of wastewaters and geographical areas tested. Fecal coliforms, enterococci, clostridia, total bifidobacteria, somatic coliphages, F-specific RNA phages, and phages infecting strain RYC2056 of B. fragilis had similar relative densities in municipal or human-derived wastewaters in the different geographical areas studied. No significant differences were observed between samples of human origin (hospital, military camp, and municipal wastewaters), regardless of the size of human communities (which ranged from a population of hundreds for hospital samples to 1.5 million for municipal wastewater samples). Consequently, wastewater samples from communities of around 100 inhabitants were shown to be representative. Also, for these indicators, geographical differences between animal samples were not evident.
With regard to the potential of the various microbial and chemical parameters studied as tracers of source pollution, there are a number of observations worthy of discussion. No differences were observed in the ratios between the values of fecal coliforms (taken as the reference value of fecal load) and those of bifidobacteria, enterococci, clostridia, and somatic coliphages in human and animal samples. Conversely, F-specific RNA phages and phages infecting B. fragilis RYC2056 showed differences, since the ratios of their numbers to numbers of fecal coliforms were clearly lower in animal samples, although the differences are not sufficient to allow source differentiation. Among the culture-based microbiological methods tested, which are independent of the characterization of the isolates, enumeration of phages infecting B. thetaiotaomicron GA17 and the ratio between numbers of total bifidobacteria and the numbers of sorbitol-fermenting bifidobacteria discriminated the most samples according to origin. No differentiated clusters were observed in the sets of values of all these nondiscriminant and discriminant indicators. Therefore, it can be concluded that the numbers of all of them are comparable in the various geographical areas studied. The only geographical difference detected was in the characteristics of the plaques of the phages detected by strain B. thetaiotaomicron GA17 in the United Kingdom. Most of these plaques were turbid and required a well-trained operator to count the phages accurately. This fact complicates the use of this method in this location. However, a recent investigation has shown that obtaining a geographically useful Bacteroides host with a performance similar to that of strain GA17 is feasible (49).
Genotypic methods (F-specific RNA genotypes and molecular detection of Bifidobacterium dentium and Bifidobacterium adolescentis) allowed differentiation but with a percentage of failures. As described previously, genotypes II and III predominated in human samples, and genotypes I and IV predominated in animal samples (12, 47, 52, 53). However, in this work, there was an unexpectedly high proportion of animal samples (33%) with high percentages of genotype III, similar to the ones in samples of human origin. There was also a small proportion of samples that gave misleading values, showing inverted percentages to those expected. Additionally, the percentage ranges of each of the genotypes or combinations of genotypes found in the different kinds of samples make the establishment of a threshold for this method difficult.
Conversely, the percentages of both Bifidobacterium dentium and Bifidobacterium adolescentis by molecular detection differed significantly in samples of human and nonhuman origin, being more common in human samples than in animal samples. Both species have been specifically associated with human intestinal microbiota. However, both species were not detected by multiplex PCR (7) in some human samples, and positive results were also observed for some animal samples. Both species were detected in water polluted by feces of human origin and not of animal origin. The detection method needs to be improved in order to detect these species at the lower densities commonly found in human samples to validate negative results in human samples. Additionally, an explanation for their presence in a percentage of animal samples should be sought as well. Although Bifidobacterium adolescentis has been described as a species that is related to humans exclusively (44), it was reported to have been found in samples from poultry (7).
Although some variables derived from phenotypic parameters are more related to nonhuman sources (percentage of E. hirae among the enterococci and percentages of E. coli Phene-Plate profiles or non-cellobiose-fermenting fecal coliforms among the total fecal coliforms) and others are more related to human fecal sources (percentage of E. faecium plus E. faecalis), these variables alone could fail to provide a correct identification of fecal source in some cases. On the other hand, phenotyping with the Phene-Plate system has previously been proven to be useful to identify specific animal species as contamination sources in surface water in Australia (1).
The relationships between the β-sterols coprostanol and 24-ethylcoprostanol were different in human and nonhuman samples, as reported elsewhere previously (37), but there was a percentage of failures that prevented the effective application of these chemical indicators as fecal source discriminators.
Two-group discriminant analysis showed that using the entire set of microbial and chemical indicators measured in this study enabled the fecal source in wastewater or slurries to be ascertained. However, testing of over 20 tracers is not feasible for routine analyses because of the high cost, the time required, and the need for staff trained in a wide variety of analytical fields (which is beyond the reach of many laboratories). Furthermore, discriminant analysis carried out using different subsets of the parameters showed some promising results. A subset of parameters consisting of the enumeration of the four bacteriophage groups was able to successfully distinguish the source of fecal pollution in the wastewaters and slurries analyzed. It was also observed that only bacteriophages infecting the host strain B. thetaiotaomicron GA17 showed a high specificity to human samples. Enumeration of all these four groups of bacteriophages provided information that complemented the enumeration of bacteriophages infecting strain GA17, achieving 100% correct classification. Conversely, the variable subgroup consisting of the enumeration of different bacterial groups, genotypes of F-specific bacteriophages, fecal sterols, or bacterial phenotypes alone did not determine the fecal source with a 100% correct classification. Again, individual variables within these subgroups (for example, sorbitol-fermenting bifidobacteria) showed great differences between water samples with human fecal contamination and those with nonhuman fecal contamination. Consequently, other combinations of the most promising tracers should be considered in order to determine the lowest number of variables needed to maintain the highest possible discrimination rate of fecal source. Note that some combinations need not include the enumeration of bacteriophages infecting B. thetaiotaomicron strain GA17, which demonstrated geographical differences with regard to the clarity of plaques. 
We were thus especially interested in finding combinations specifically including or excluding this tracer. However, to resolve the problem of geographical variation, new host strains of Bacteroides, either B. thetaiotaomicron or other species, should be obtained for each specific geographical site in order to facilitate the enumeration of this group of phages (49). Furthermore, the variation in the results of discriminant analyses across parameter sets is consistent with previous studies by other authors who assessed statistical methods for library-dependent microbial source tracking tests (48, 51, 58). Those authors also reported a high degree of variability in the correct classification rates among statistical methods and observed that no commonly used statistical technique emerges as superior. Our results are in general agreement with this observation; in addition, they suggest that some combinations of library-independent methods that showed a consistently high degree of discriminatory power could determine the origin of fecal pollution. Consequently, the use of alternative statistical methods that could determine the optimal combination of discriminant parameters, and thus facilitate the development of predictive models, is the logical next stage in the development of data analysis for fecal source tracking.
To this end, several statistical and machine learning methods were applied to the 38 single and derived microbial and chemical variables. The resulting predictive models provided 100% correct classification in distinguishing wastewaters of human origin from those of nonhuman origin. No differences were found between the various European geographical locations in this prediction model. A predictive model using the pair of variables SOMCPH/BTHPH and SOMCPH emerged as the optimal model and allowed the successful classification of fecal source in all cases (in both training and test sets). The variable SOMCPH/BTHPH accounts for the greatest part of the classification. In light of this observation, the variable BTHPH alone might suffice. This could be achieved by determining a reference level for BTHPH that differentiates human samples (values above this reference) from nonhuman samples (values below this reference). This reference level could be obtained by taking the middle value between the two closest known observations (in the training set), one of which is human and the other of which is animal. This simpler rule did in fact achieve 100% correct classification. However, this approach may be unstable because of the closeness of the values to the reference value, especially when factors such as fecal aging or dilution in waters modify the concentrations of the parameters (termed variables in statistical or machine learning analyses). A wider margin of separation is necessary in order to obtain a more robust and stable discrimination. This is accomplished by using the set of two variables SOMCPH/BTHPH and SOMCPH, in which case the margin is greater.
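The single-variable reference-level rule described above can be sketched as follows. This is a minimal illustration only: the function names and the log10 BTHPH values are hypothetical and invented for the example, and they are not taken from the study data. The sketch computes the midpoint between the two closest observations of opposite class and then shows why a narrow margin is fragile under dilution.

```python
def midpoint_threshold(human_values, nonhuman_values):
    """Return (reference level, margin) for a one-variable rule.

    The reference level is the midpoint between the lowest human value
    and the highest nonhuman value; the margin is half the gap between
    them. Assumes every human value lies above every nonhuman value.
    """
    lowest_human = min(human_values)
    highest_nonhuman = max(nonhuman_values)
    if lowest_human <= highest_nonhuman:
        raise ValueError("classes overlap; no single reference level separates them")
    threshold = (lowest_human + highest_nonhuman) / 2
    margin = (lowest_human - highest_nonhuman) / 2
    return threshold, margin


def classify(value, threshold):
    """Values above the reference level are called human, others nonhuman."""
    return "human" if value > threshold else "nonhuman"


# Hypothetical log10 BTHPH concentrations, invented for illustration only.
human_log_bthph = [4.5, 4.0, 5.0]
nonhuman_log_bthph = [1.0, 2.0, 0.5]

threshold, margin = midpoint_threshold(human_log_bthph, nonhuman_log_bthph)

# A tenfold dilution lowers a log10 concentration by 1.0. The human sample
# at 4.0 then sits on the reference level and is no longer called human.
diluted_call = classify(4.0 - 1.0, threshold)
```

In this toy case the margin equals one log10 unit, so a single tenfold dilution is enough to move a human sample across the reference level, which is the instability noted above.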
Alternative predictive models that do not use the variable BTHPH were also found. However, these models showed lower rates of correct classification and required more parameters and thus entailed higher costs and resource requirements. Furthermore, their percentages of correct classification were more dependent on the statistical method used. For instance, the pair of variables SOMCPH and FRNAPH II showed a 96% correct classification using a quadratic Bayesian classifier, and sets of three (FRNAPH I, FRNAPH II, and ECP) or four (e.g., SOMCPH, FRNAPH II, BA, and SFBIF) variables were needed to provide 100% correct classification when using the Euclidean one-nearest-neighbor classifier.
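For readers unfamiliar with the Euclidean one-nearest-neighbor classifier mentioned above, a minimal sketch is given here. The feature vectors are hypothetical values for three unnamed variables, invented purely for illustration, and the train/test split is likewise an assumption of the example and not the partition used in the study.

```python
import math


def nearest_neighbor_label(query, training):
    """Return the label of the training sample closest to `query`.

    `training` is a list of (feature_tuple, label) pairs; distance is
    the ordinary Euclidean distance between feature tuples.
    """
    closest = min(training, key=lambda item: math.dist(query, item[0]))
    return closest[1]


def correct_classification_rate(test, training):
    """Fraction of test samples whose predicted label matches the true one."""
    hits = sum(
        1 for features, label in test
        if nearest_neighbor_label(features, training) == label
    )
    return hits / len(test)


# Hypothetical three-variable samples (e.g., log10 counts), illustration only.
training_set = [
    ((1.0, 4.0, 3.0), "human"),
    ((1.5, 4.5, 3.5), "human"),
    ((4.0, 1.0, 5.0), "nonhuman"),
    ((4.5, 1.5, 5.5), "nonhuman"),
]
test_set = [
    ((1.25, 4.25, 3.0), "human"),
    ((3.75, 1.25, 5.0), "nonhuman"),
]

rate = correct_classification_rate(test_set, training_set)
```

Because the classifier depends on raw distances, variables measured on very different scales would normally be transformed or standardized first; that step is omitted here for brevity.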
In conclusion, none of the tested microbial and chemical parameters was able on its own to determine the source of fecal pollution in wastewaters and slurries of known human or nonhuman origin, and therefore, a suite of parameters was required. However, we demonstrated that there are a number of potentially good tracers showing high discriminatory capabilities, and hence, there is a need for alternative numerical approaches to the data analysis. The concentration of phages infecting certain strains of Bacteroides is the parameter showing the greatest discriminatory power. Host strains to detect and enumerate phages of Bacteroides seem to be geographically dependent, but a method for the isolation of geographically specific host strains for the enumeration of phages infecting Bacteroides has recently been published (49). Other tracers such as FRNAPH I, FRNAPH II, ECP, BA, and SFBIF also showed a good discriminatory ability when groups of three or more variables were used. Combinations of variables based on a discriminating tracer and a universal fecal indicator seem to offer the best solutions. The universal and nondiscriminant fecal indicator provides information on the fecal load of the sample at the time it is taken. The discriminant indicator (tracer) contributes to the identification of source. If both indicators have similar persistence in the environment, their combined use could be the best way of defining predictive models suitable for any environmental water sample. Such combinations may also offer advantages when samples different from the ones tested here are analyzed (such as diluted, aged, and mixed samples). Finally, the use of different statistical or machine learning methods in conjunction with algorithms for variable selection was shown to be a feasible numerical analysis for the development of predictive models for microbial source tracking in waters.
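The argument for pairing a universal indicator with a tracer of similar persistence can be illustrated numerically. The sketch below assumes, for the purposes of the example only, that dilution divides both concentrations by the same factor and that inactivation follows first-order (log-linear) decay; the starting concentrations and decay rates are hypothetical and do not come from this study.

```python
import math


def log10_after(log10_c0, dilution_factor=1.0, decay_per_day=0.0, days=0.0):
    """log10 concentration after dilution and log-linear decay.

    Assumed model: log10 C = log10 C0 - log10(dilution) - k * t,
    where k is the decay rate in log10 units per day.
    """
    return log10_c0 - math.log10(dilution_factor) - decay_per_day * days


# Hypothetical starting values (log10 units), illustration only.
log_universal = 6.0   # stands in for a universal indicator such as SOMCPH
log_tracer = 4.0      # stands in for a source-specific tracer such as BTHPH

# Same persistence for both: the log10 ratio is unchanged by dilution and aging.
ratio_fresh = log_universal - log_tracer
ratio_same_decay = (
    log10_after(log_universal, dilution_factor=100, decay_per_day=0.25, days=4)
    - log10_after(log_tracer, dilution_factor=100, decay_per_day=0.25, days=4)
)

# Different persistence: the ratio drifts as the sample ages.
ratio_diff_decay = (
    log10_after(log_universal, decay_per_day=0.25, days=4)
    - log10_after(log_tracer, decay_per_day=0.5, days=4)
)
```

Under these assumptions the ratio carries the same information in fresh, diluted, and aged samples only when the two decay rates match, which is why similar persistence of the two indicators matters for the combined model.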
The experimental approach used in this study aimed to provide a preliminary model suitable for wastewaters and slurries, which are considered the most important starting points for fecal pollution of surface waters. Any subset of methods selected for predictive models must be effective at this level of fecal pollution. Otherwise, there is no sense in applying it to surface waters or other kinds of waters with lower values for the indicators and parameters involved. Following our experimental approach, the next stage in the development of predictive models should consider additional factors such as dilution, specific types of animal sources, persistence of microbial tracers, and complex mixtures from different sources. All these factors will progressively add complexity to the models and bring them closer to "real-world" scenarios so as to provide effective and practical solutions for fecal pollution problems.
We thank the Scientific-Technical Services at the University of Barcelona for their technical support in the analysis of fecal sterols.
Supplemental material for this article may be found at http://aem.asm.org/.