Applied and Environmental Microbiology, January 2005, p. 512-518, Vol. 71, No. 1
0099-2240/05/$08.00+0 doi:10.1128/AEM.71.1.512-518.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Department of Biological Sciences, The University of Southern Mississippi, Hattiesburg, Mississippi
Received 16 April 2004/ Accepted 23 August 2004
There are several coefficients that can be used to calculate similarity, and the choice of which similarity coefficient to use can affect the outcome of the source tracking assignment. Cosine Coefficient and Pearson's Product Moment Correlation are curve-based coefficients that use both the presence or absence of DNA bands and the peak intensity of each band as variables, whereas Jaccard, Dice, Jeffrey's x, and Ochiai are band-based coefficients that consider only the presence or absence of DNA bands. Currently, there is no consensus on which coefficient results in more accurate source assignment. Pearson's Product Moment Correlation has been used to calculate similarities among repetitive extragenic palindromic (rep) PCR fingerprints (4, 12) and ribotypes (3). Cosine Coefficient has been used to calculate similarities among rep-PCR and pulsed-field gel electrophoresis fingerprints (10). The band-based coefficients Jaccard (5) and Dice (13) have been used to calculate similarities among rep-PCR fingerprints, and the latter has also been used to calculate similarities among ribotypes (7, 11) and among fingerprints generated by the amplification of the 16S-23S intergenic spacer region (13).
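The distinction between band-based and curve-based coefficients can be illustrated with a minimal sketch. It assumes that each fingerprint is available either as a 0/1 vector of band presence at already-aligned positions or as a vector of densitometric intensities sampled at the same positions; the function names and example vectors are hypothetical, and the formulas are the standard textbook definitions rather than the exact implementation of any fingerprinting package. Jeffrey's x is omitted because its formula is not given here.

```python
import math

def band_counts(x, y):
    """Return (a, b, c): bands shared, bands only in x, bands only in y."""
    a = sum(1 for i, j in zip(x, y) if i and j)
    b = sum(1 for i, j in zip(x, y) if i and not j)
    c = sum(1 for i, j in zip(x, y) if j and not i)
    return a, b, c

def jaccard(x, y):
    a, b, c = band_counts(x, y)
    return a / (a + b + c)

def dice(x, y):
    a, b, c = band_counts(x, y)
    return 2 * a / (2 * a + b + c)

def ochiai(x, y):
    a, b, c = band_counts(x, y)
    return a / math.sqrt((a + b) * (a + c))

def cosine(u, v):
    """Cosine of the angle between two intensity curves (assumed nonzero)."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Pearson correlation: cosine of the mean-centered curves."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([p - mean_u for p in u], [q - mean_v for q in v])

# Hypothetical fingerprints: band presence (band-based) and intensities (curve-based).
bands_x, bands_y = [1, 1, 0, 1, 0], [1, 0, 0, 1, 1]
curve_x, curve_y = [0.9, 0.4, 0.0, 0.7, 0.1], [0.8, 0.0, 0.1, 0.6, 0.5]

print("Jaccard:", jaccard(bands_x, bands_y))
print("Dice:   ", dice(bands_x, bands_y))
print("Ochiai: ", ochiai(bands_x, bands_y))
print("Cosine: ", cosine(curve_x, curve_y))
print("Pearson:", pearson(curve_x, curve_y))
```

The band-based functions see only which positions carry a band, so two fingerprints with the same bands but very different peak intensities score as identical, whereas the curve-based functions also respond to intensity.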
In addition to uncertainties concerning which similarity coefficient to use, there are no standards concerning acceptance of source assignments. When similarity coefficients are used, each environmental isolate is assigned a source depending on which library includes the DNA fingerprints to which its DNA fingerprint is most similar. Because each isolate is always assigned a source, without regard to the actual degree of similarity, the user must decide, using a priori assessments of the accuracy of the method, whether each source assignment is likely to be correct.
Currently, there are no published studies on the issue of reliability when source assignments are determined using similarity coefficients. Information is needed to better understand limitations of statistical analyses used for assigning sources of bacteria on the basis of DNA fingerprint patterns (1). In an attempt to evaluate the significance of source tracking on the basis of discriminant analysis, Whitlock et al. (18) suggested using the percentage of misassigned isolates in each source group calculated by Jackknife analysis as a lower limit for significance. When such an approach is used, each isolate from a collection of isolates from a known source (a library) is treated, one at a time, as an unknown and then identified by comparison to the remaining isolates. The proportion of isolates assigned to the correct source relative to the number of isolates in the library is the rate of correct assignment. Whitlock et al. (18) proposed that a source can be implicated in water pollution only when the percentage of environmental isolates assigned to it exceeds the percentage of library isolates misassigned to it by Jackknife analysis. Although this approach can be useful when libraries with good representation of the diversity of isolates in the environmental site being studied are used, it is less useful when libraries with poor representation are used or in the presence of isolates contributed by nonlibrary sources. Therefore, in computer-assisted, library-based bacterial source tracking efforts, the choice of which similarity coefficient to use for DNA fingerprint comparisons and what similarity threshold to use in deciding whether each source assignment is to be accepted are important issues that need further investigations to improve the reliability of source assignments.
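The Jackknife procedure described above amounts to a leave-one-out loop over the library. The following is a minimal sketch, assuming that each held-out isolate is assigned to the source of its single most similar remaining isolate and that `similarity` is any function returning a larger value for more similar fingerprints; the function name, data layout, and pooled overall rate are illustrative assumptions, not the procedure of a specific software package.

```python
from collections import defaultdict

def jackknife_rca(library, similarity):
    """Leave-one-out rate of correct assignment.

    library: list of (source, fingerprint) pairs from known sources.
    Returns (per-source RCA dict, RCA pooled over all isolates).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for i, (source, fingerprint) in enumerate(library):
        # Treat isolate i as an unknown and compare it to the remaining isolates.
        best_source, best_sim = None, float("-inf")
        for j, (other_source, other_fingerprint) in enumerate(library):
            if i == j:
                continue
            sim = similarity(fingerprint, other_fingerprint)
            if sim > best_sim:
                best_source, best_sim = other_source, sim
        total[source] += 1
        if best_source == source:
            correct[source] += 1
    per_source = {s: correct[s] / total[s] for s in total}
    pooled = sum(correct.values()) / sum(total.values())
    return per_source, pooled

# Hypothetical one-dimensional "fingerprints" with a toy similarity measure.
toy_library = [("cow", 1.0), ("cow", 1.1), ("human", 5.0), ("human", 5.2), ("cow", 5.1)]
toy_similarity = lambda a, b: -abs(a - b)
print(jackknife_rca(toy_library, toy_similarity))
```

Because every isolate receives its best match regardless of how good that match is, the loop never returns "unidentified", which is the limitation that a similarity threshold is meant to address.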
In the present study, six different similarity coefficients were compared in terms of their rates of correct assignment (RCAs). In addition, statistical options to improve the reliability of source assignments on the basis of the use of similarity coefficients were investigated. These options include the choice of how the similarity values are used and the effect of the use of a threshold similarity value and quality factor on improvement of the reliability of source assignments.
TABLE 1. Sources and numbers of fecal samples and bacterial isolates (before and after exclusion of clonal isolates) used in the study
Cow, chicken, and deer samples were collected from south and central Mississippi. All of the fecal samples were collected from individuals except for some of the cow and chicken samples. Four of the cow samples were composite samples from several cows at the same farm. A total of 86 of the chicken samples were litter samples from commercial chicken farms, while 27 were obtained from cloacal swabs from individual chickens. Although some cow and chicken cloacal samples were collected and transported using CultureSwabs, the majority of cow and chicken litter samples, as well as all deer samples, were collected and frozen without additives at their respective collection sites across the state. Dog fecal samples from veterinary adoption centers and humane societies in Hattiesburg and Gulfport, Mississippi, and gull fecal samples from beaches along the Mississippi Gulf coast were collected using CultureSwabs.
Bacteria isolation.
Fecal samples were streaked on mTEC (Difco) and mEI plates for the isolation of E. coli and enterococci, respectively (15). mTEC plates were incubated at 37°C for 2 to 4 h and then at 44.5°C for 18 to 24 h. Yellow colonies were picked and confirmed using standard microbiological methods. Isolates that lacked phenylalanine deaminase, that produced indole from tryptophan, that were unable to utilize sodium citrate as a sole carbon source, and that fermented glucose through a mixed-acid fermentation pathway (but not a butanediol pathway) were considered to be E. coli. mEI plates were incubated at 41°C for 24 to 36 h. Colonies that formed blue halos were presumed to be enterococci. Confirmation was performed by testing each isolate for growth at 45°C and in the presence of 6.5% sodium chloride at 37°C and for esculin hydrolysis. Among the isolates picked, 89.5 and 73.1% were confirmed to be E. coli and enterococci, respectively.
rep-PCR and BOX-PCR.
rep-PCR (8, 17) was performed using a modified method of Rademaker and DeBruijn (12). Isolates were grown at 37°C in brain heart infusion for 12 to 16 h. Cells harvested from 0.5 and 1.0 ml of broth for E. coli and enterococci, respectively, were washed twice with 0.5 ml of sterile deionized water. The resulting pellets were resuspended in sterile deionized water (0.5 and 0.25 ml for E. coli and enterococci, respectively) and stored frozen at -20°C until use as a template for PCR. DNA amplification reactions were performed with a 10-µl reaction mixture that consisted of 1 µl of cell suspension and 9 µl of PCR master mix. The BOX-PCR (9, 16) master mix contained 2 µM primer (BOX A1R [CTA CGG CAA GGC GAC GCT GAC G]), 1 mM deoxynucleoside triphosphates, 4.5 mM MgCl2, 1x buffer provided by the manufacturer of the DNA polymerase, and 0.4 units of JumpStart Taq DNA polymerase (Sigma, St. Louis, Mo.). Thermal cycling started with 2 min at 95°C followed by 35 cycles of 94°C for 3 s, 92°C for 30 s, 50°C for 1 min, and 65°C for 8 min. A final extension step was performed at 65°C for 8 min after completion of the 35 cycles. The REP-PCR (6, 9, 14) master mix contained 3 µM of each of two primers (REP 1R [III ICG ICG ICA TCI GGC] and REP 2I [ICG ICT TAT CIG GCC TAC]), 1 mM deoxynucleoside triphosphates, 2.5 mM MgCl2, 1x buffer, and 0.4 units of JumpStart Taq DNA polymerase. The thermal cycling protocol for REP-PCR was the same as that for BOX-PCR except that 40°C was used instead of 50°C for primer annealing.
Jackknife analysis.
The effect of having clonal isolates in fingerprint libraries on RCAs was examined by performing Jackknife analysis both before and after their removal by use of BOX and REP fingerprints of enterococcal and E. coli isolates. Clonal isolates were defined in the present study as isolates with identical fingerprints obtained from the same sample. Removal of clonal isolates was performed for Jackknife analysis only.
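The definition of clonal isolates given above (identical fingerprints from the same sample) maps directly onto a deduplication step. The sketch below assumes that each isolate is recorded as a hypothetical `(sample_id, source, fingerprint)` tuple and that "identical" means exact equality of the band vector; in practice identity would be judged within the band-matching tolerance of the analysis software.

```python
def remove_clonal_isolates(isolates):
    """Keep one isolate per (sample, fingerprint) combination.

    isolates: iterable of (sample_id, source, fingerprint) tuples.
    Isolates with the same fingerprint from different samples are all kept.
    """
    seen = set()
    kept = []
    for sample_id, source, fingerprint in isolates:
        key = (sample_id, tuple(fingerprint))
        if key in seen:
            continue  # same fingerprint already seen in this sample: clonal
        seen.add(key)
        kept.append((sample_id, source, fingerprint))
    return kept

# Hypothetical records: two identical chicken isolates from sample "S1".
records = [
    ("S1", "chicken", [1, 0, 1, 1]),
    ("S1", "chicken", [1, 0, 1, 1]),  # clonal duplicate, dropped
    ("S1", "chicken", [1, 1, 0, 1]),
    ("S2", "chicken", [1, 0, 1, 1]),  # same pattern, different sample, kept
]
print(len(remove_clonal_isolates(records)))
```

Leaving such duplicates in the library lets a held-out isolate match its own clone during Jackknife analysis, which is why their removal matters for that analysis in particular.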
Jackknife analysis was also used to compare the RCAs generated using six similarity coefficients (Cosine Coefficient, Pearson's Product Moment Correlation, Jaccard, Dice, Jeffrey's x, and Ochiai). The fingerprints used were produced by BOX-PCR using enterococcal isolates with clonal isolates removed. All Jackknife analyses were performed using BioNumerics version 3.0 (Applied Maths, Sint-Martens-Latem, Belgium). Pattern optimization (i.e., the percentage of pattern shift within which the software looks for the best match) was set at 5%, and band tolerance (i.e., the maximum gel migration difference for any pair of bands to be considered matching) was set at 2% with a 2% gradual tolerance increase towards the bottom of the gel.
Discriminant analysis and multivariate analysis of variance.
Enterococcal isolates from human, cow, deer, dog, chicken, and gull fecal samples were classified into groups according to their sources. Discriminant analysis was used to show the separation between these predefined groups on the basis of their BOX fingerprints. The first and second discriminants were plotted on the x and y axes, respectively, generating a two-dimensional plot showing the separation of isolates from six sources. Multivariate analysis of variance was performed accounting for the covariance structure to evaluate the significance of discriminant analysis. The P value indicated the probability of obtaining equivalent separation results among isolates of different sources due to random classification of isolates. The probability of obtaining the same level of discrimination, assuming that all isolates were obtained from a homogeneous population (i.e., the effect of grouping by source was insignificant), is indicated by the Wilkinson's likelihood for normal distribution (L). Low P and L values indicate significant discrimination by source group.
Identification libraries and the blind test.
The RCAs using each of the six similarity coefficients listed above were also determined using a blind test. First, libraries containing rep-PCR fingerprints of enterococcal and E. coli isolates from each known animal source were constructed. Each library consisted of five units, one for each known source: human, cow, deer, dog, and chicken. These libraries were then used as reference to determine the most likely animal source of new isolates in the blind test. The enterococcal fingerprint library contained 762 isolates (67 human, 141 cow, 99 deer, 103 dog, and 352 chicken). All were analyzed by BOX-PCR, but only 458 (40 human, 118 cow, 84 deer, 71 dog, and 145 chicken) were analyzed by REP-PCR. The E. coli library contained 514 isolates (65 human, 136 cow, 39 deer, 142 dog, and 132 chicken), and all were analyzed by both BOX- and REP-PCR.
Isolates used as blind samples were obtained from feces of animals in the same general population as those used to obtain isolates for rep-PCR fingerprint library construction. A total of 131 enterococcal isolates (28 human, 29 bovine, 27 deer, and 47 chicken) and 130 E. coli isolates (19 human, 43 bovine, 17 deer, and 51 chicken) were analyzed using BOX-PCR for the blind test. A total of 96 enterococcal isolates (12 human, 28 bovine, 23 deer, and 33 chicken) and the same 130 E. coli isolates were analyzed using REP-PCR for the blind test. Source assignments were made using Cosine Coefficient to calculate similarity matrices, and average RCAs were compared using both maximum and average similarity options.
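The difference between the maximum and average similarity options can be sketched as follows. This is one plausible reading, assumed here for illustration: under the maximum option an unknown is scored against each source unit by its single best match in that unit, and under the average option by its mean similarity to all members of the unit. The function name, data layout, and toy similarity measure are hypothetical.

```python
def assign_source(unknown, library, similarity, mode="maximum"):
    """Assign an unknown fingerprint to the best-scoring source unit.

    library: dict mapping source name -> list of fingerprints.
    mode: "maximum" (best single match) or "average" (mean over the unit).
    Returns (source, score).
    """
    if mode not in ("maximum", "average"):
        raise ValueError("mode must be 'maximum' or 'average'")
    scores = {}
    for source, fingerprints in library.items():
        sims = [similarity(unknown, fp) for fp in fingerprints]
        scores[source] = max(sims) if mode == "maximum" else sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy example: the cow unit is heterogeneous, the human unit is compact.
toy_library = {"cow": [1.0, 9.0], "human": [4.0, 5.0]}
toy_similarity = lambda a, b: -abs(a - b)
print(assign_source(1.2, toy_library, toy_similarity, mode="maximum"))
print(assign_source(1.2, toy_library, toy_similarity, mode="average"))
```

In the toy data the two options disagree: the unknown has a very close match in the heterogeneous cow unit, so the maximum option picks cow, while the average option is pulled toward the more compact human unit.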
Setting "similarity value" and "quality factor" thresholds.
A similarity threshold was determined for each indicator organism-fingerprinting technique combination. Cosine Coefficient was used to calculate similarity matrices. The threshold was determined by dividing the sum of the average similarity values of the correctly and the incorrectly assigned isolates by 2. In other words, the threshold was the midpoint between the average similarities of the correctly and incorrectly assigned isolates. When this method was used, the similarity threshold values were 90, 90.8, and 89.2% for enterococcal BOX, REP, and BOX-REP combined fingerprints, respectively. The combined fingerprints were generated electronically using BOX and REP fingerprints. The threshold values for E. coli BOX, REP, and combined fingerprints were 91.5, 90.1, and 87.7%, respectively.
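The midpoint rule described above is simple arithmetic and can be written as a short sketch. The function name and the example similarity values are hypothetical; the inputs are assumed to be the similarity value on which each isolate's assignment was based, together with a flag saying whether that assignment was correct.

```python
from statistics import mean

def midpoint_threshold(similarities, is_correct):
    """Midpoint between the mean similarity of correct and incorrect assignments."""
    correct = [s for s, ok in zip(similarities, is_correct) if ok]
    incorrect = [s for s, ok in zip(similarities, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect assignment")
    return (mean(correct) + mean(incorrect)) / 2

# Hypothetical values (percent similarity), not data from the study.
sims = [95.0, 93.0, 97.0, 85.0, 87.0]
flags = [True, True, True, False, False]
print(midpoint_threshold(sims, flags))  # (95.0 + 86.0) / 2 = 90.5
```

Because the threshold depends only on the two group means, it is sensitive to how many and which isolates fall in each group, so a separate value for each organism and fingerprinting technique is to be expected.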
A quality factor was also used as a threshold for determining the reliability of source assignments. A quality factor is generated by BioNumerics for each unknown as the unknown is assigned to an animal source. This value is calculated by dividing the average pairwise similarity of all fingerprints in the source group by the average pairwise similarity of the unknown with each of the library's component isolates. Assignments with a quality factor of 1.0 or less (B or better) were accepted, while those with a quality factor of more than 1.0 (C, D, or E) were considered unidentifiable.
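The quality factor calculation described above can be sketched as a ratio of two averages. The sketch assumes that the denominator is the mean similarity of the unknown to the members of the source group it was assigned to, that the group holds at least two fingerprints, and that similarities are positive percentages; the function names and toy similarity measure are hypothetical, and the mapping of numeric values to letter grades is not reproduced because it is not given here.

```python
from itertools import combinations
from statistics import mean

def quality_factor(unknown, group, similarity):
    """Ratio of within-group similarity to unknown-to-group similarity.

    Values at or below 1.0 mean the unknown fits the group at least as
    well as the group members fit each other.
    """
    within = mean(similarity(a, b) for a, b in combinations(group, 2))
    to_unknown = mean(similarity(unknown, member) for member in group)
    return within / to_unknown

def accept_assignment(qf, cutoff=1.0):
    """Accept the source assignment only if the quality factor is <= cutoff."""
    return qf <= cutoff

# Toy one-dimensional fingerprints with a percent-style similarity.
toy_similarity = lambda a, b: 100 - abs(a - b)
group = [10, 12, 14]
for unknown in (12, 40):
    qf = quality_factor(unknown, group, toy_similarity)
    print(unknown, round(qf, 3), accept_assignment(qf))
```

Unlike a fixed similarity threshold, this criterion adapts to each source group: a heterogeneous group tolerates a less similar unknown than a tightly clustered one.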
TABLE 2. Comparison of the RCAs of enterococcal isolates obtained before and after excluding clonal isolates
FIG. 1. The percentages of clonal isolates among enterococci isolated from different animal sources and the decreases in RCA that resulted from their removal.
The effect of using different similarity coefficients on RCAs.
A comparison of Jackknife RCAs of 1,020 enterococcal isolates on the basis of their BOX fingerprints indicated that the highest RCAs were obtained using curve-based similarity coefficients. The overall RCA (ORCA) was 82% with both Pearson's Product Moment Correlation Coefficient and Cosine Coefficient. When band-based coefficients were used, the ORCA was 78% for each of the four coefficients. Although the RCAs differed among animal sources with each similarity coefficient, they were less variable using curve-based coefficients. The standard deviations for the ORCAs were 11.2 and 12.0% using Pearson's Product Moment Correlation Coefficient and Cosine Coefficient, respectively, but ranged from 13.9 to 14.7% for the four band-based coefficients (Table 3). These results suggest that the use of curve-based coefficients is preferred over the use of band-based coefficients for source tracking in our study area with BOX fingerprinting. Additional data, from other study areas and with other DNA fingerprinting protocols, are needed to ascertain whether the superiority of curve-based coefficients is a general rule.
TABLE 3. Comparison of the RCAs of enterococcal isolates source assigned using six similarity coefficients
TABLE 4. Comparison of the RCAs obtained by use of average versus maximum similarity as a basis for bacterial source assignments
FIG. 2. A two-dimensional plot using discriminant analysis showing the separation of isolates on the basis of BOX fingerprints. The plot was generated by plotting the first discriminant (contributing 42% of total discrimination) on the x axis and the second discriminant (contributing 21% of total discrimination) on the y axis. The P value was 0.001 for both discriminants, while the L values were 0.0904 and 0.2166 for the first and second discriminants, respectively. Human isolates are shown in cyan, cow isolates are shown in green, deer isolates are shown in blue, dog isolates are shown in yellow, chicken isolates are shown in red, and gull isolates are shown in pink.
The effect of using a similarity value threshold on RCAs.
Classifying as unidentified those isolates that did not meet the similarity threshold requirement significantly improved the RCAs among isolates assigned a source. The ORCA among enterococci fingerprinted by BOX-PCR increased from 76 to 87% (Table 5). In addition, the percentage of isolates assigned incorrectly to a source decreased from 24 to 9%. The drawback, however, is that the proportion of isolates that were assigned to a source decreased from 100 to 69%, and 16% of the isolates assigned correctly to a source when a threshold was not used were designated as unidentifiable when a threshold was used. Similar results were obtained using enterococcal REP fingerprints, where the ORCA increased from 63 to 80% after the threshold was applied. When combined BOX-REP fingerprints were used, the RCAs before and after application of the similarity threshold were 77 and 89%, respectively. With E. coli, the ORCAs increased from 65 to 70% for isolates fingerprinted using BOX-PCR, from 65 to 82% for isolates fingerprinted using REP-PCR, and from 69 to 89% for isolates fingerprinted using combined BOX- and REP-PCR fingerprints (Table 5). However, these increases in ORCA were achieved at a cost: a total of 13 to 29% of the isolates formerly assigned to a correct source when a threshold was not used were classified as unidentifiable when a threshold was used.
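The trade-off described above can be made concrete with a short sketch that tallies assignments with and without a threshold. The record layout, the function name, and the use of "greater than or equal to" as the acceptance rule are assumptions for illustration. The final lines check the enterococcal BOX-PCR figures quoted above under the reading that all percentages refer to the full set of blind isolates: 69% assigned minus 9% incorrect leaves 60% correct, and 60/69 rounds to the reported 87%.

```python
def summarize_assignments(results, threshold=None):
    """Tally assignment outcomes as percentages of all isolates.

    results: list of (similarity, assigned_source, true_source) tuples.
    Isolates whose similarity is below the threshold count as unidentified.
    """
    n = len(results)
    kept = [r for r in results if threshold is None or r[0] >= threshold]
    correct = sum(1 for _, assigned, true in kept if assigned == true)
    return {
        "pct_assigned": 100 * len(kept) / n,
        "pct_correct": 100 * correct / n,
        "pct_incorrect": 100 * (len(kept) - correct) / n,
        # Rate of correct assignment among isolates that received a source.
        "rca_among_assigned": 100 * correct / len(kept) if kept else None,
    }

# Hypothetical blind-test records, not data from the study.
blind = [
    (95, "cow", "cow"), (93, "human", "human"), (91, "cow", "human"),
    (85, "human", "human"), (80, "dog", "cow"),
]
print(summarize_assignments(blind))
print(summarize_assignments(blind, threshold=90))

# Consistency check on the reported enterococcal BOX-PCR percentages.
pct_assigned, pct_incorrect = 69, 9
pct_correct = pct_assigned - pct_incorrect        # 60
print(round(100 * pct_correct / pct_assigned))    # 87
```

The same tally shows the cost side: the drop from 76% to 60% correctly assigned isolates corresponds to the 16% of isolates that were correct without a threshold but became unidentifiable with one.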
TABLE 5. Effect of using a similarity threshold during source assignments on the ORCA, the percentages of isolates correctly assigned, and the percentages incorrectly assigned
TABLE 6. Effect of using a similarity threshold on the RCAs of enterococcal isolates, the proportion of isolates assigned a source, and the proportion of isolates assigned to the correct source
FIG. 3. The effect of curve-based and band-based similarity coefficients on the percentages of enterococcal isolates assigned to an animal source and the percentages of isolates assigned correctly and incorrectly. Source assignments were made on the basis of their BOX-PCR fingerprints (n = 131).
TABLE 7. Effect of applying a quality factor threshold on the ORCA, the percentages of enterococcal isolates correctly assigned, and the percentages incorrectly assigned
In conclusion, results from the present study indicate that (i) the use of curve-based coefficients (e.g., Cosine Coefficient and Pearson's Product Moment Correlation) results in higher ORCAs than the use of band-based coefficients (e.g., Jaccard and Dice); (ii) the removal of clonal isolates is essential for the proper calculation of RCAs by Jackknife analysis; (iii) the use of maximum, as opposed to average, similarity yields higher ORCAs; and (iv) the application of a similarity value or a quality factor threshold for source assignment improves the ORCA, but this is achieved at the expense of the total numbers of isolates assigned a source.
Financial support that made the research possible was provided by the Mississippi Departments of Agriculture and Commerce and Environmental Quality, the U.S. Environmental Protection Agency Gulf of Mexico Program (EPA MS97449202), and the U.S. Coastal Impact Assistance Program (NOAA 17OZ2171 Project MS.R.17).