Previous Article | Next Article ![]()
Applied and Environmental Microbiology, December 2005, p. 8663-8676, Vol. 71, No. 12
0099-2240/05/$08.00+0 doi:10.1128/AEM.71.12.8663-8676.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
University of Washington, Seattle, Washington 98195, and,1 Biodetection Technologies Section, Argonne National Laboratory, Argonne, Illinois 604392
Received 21 April 2005/ Accepted 12 July 2005
|
|
|---|
|
|
|---|
Among the large variety of recent microarray technology platforms (see the review in reference 6), two formats have been most often used for microbial species identification: planar microarrays (51) and gel-pad microarrays (4, 12, 17, 20). Gel-pad microarrays, composed of oligonucleotides in a polyacrylamide gel-pad matrix attached to a glass surface, offer a number of advantages over planar microarrays because (i) they have a greater dynamic range due to the immobilization of a greater number of oligonucleotides (from 3 to 300 fmol) on the surface covered by each gel-pad (17), (ii) when used in combination with a temperature-controlled reaction chamber they can be employed to monitor arrays of probes that have different kinetics of association and dissociation (10), and (iii) when used under conditions that approximate equilibrium, thermodynamic analyses of probe-target duplexes in gel-pads have been reported to correlate to data obtained in solution (13), demonstrating a link to well-established principles of nucleic-acid chemistry (9, 29).
In conventional applications, fluorescent dye-labeled target nucleic acids hybridize to a short complementary oligonucleotide probe immobilized in a gel-pad, which yield stable duplexes under appropriate hybridization conditions. As the temperature of a microarray reaction chamber is increased, bound nucleic acid dissociates from probes, and there is an overall decrease in retained fluorescent dye (as inferred from signal intensity). Studies using synthesized targets (i.e., oligonucleotides) and fragmented nucleic acid extracted from microbes in environmental samples have demonstrated that it is possible to distinguish between target and nontarget sequences that differ by a single internal mismatched base-pair (12, 47, 48). This level of discrimination is needed to resolve variants of highly conserved genes (e.g., those encoding the rRNAs). Numerous studies have been conducted using gel-pad microarrays (8, 11, 12, 20, 23, 24, 26, 45, 47, 48, 49) because melting profiles of probe-target duplexes are thought to offer better discrimination between target and nontarget sequences than planar microarrays, which typically depend on signal intensity (SI) values obtained following hybridization and wash conditions adjusted to an appropriate (average) stringency. (The term "melting profile," used throughout this article, refers to nonequilibrium dissociation curves.) Based on these studies, we proposed that nonequilibrium dissociation analyses, in concert with advanced statistical approaches, could be used to develop a diagnostic tool for identifying microorganisms in complex communities such as that found in the human oral cavity.
The focus of this study was to evaluate the performance of gel-pad microarrays under established conditions, which includes the same nucleic acid sample preparation protocols and concentrations, and image analysis software. The evaluation employed a set of DNA probes complementary to the rRNAs of major groups and species of microbes known to inhabit the human oral cavity. Preliminary screening of numerous melting profiles revealed inconsistencies in signal intensity values and discrepancies in the form of melting profiles, even for identical probes on the same microarray hybridized to the same target. However, our working hypothesis is that the general form of a melting profile in gel-pad microarrays (within the dynamic range of the detection system) should be sigmoid-shaped and not change from one duplex to the next, and that observed differences are due to experimental variations. Modeling the observed melting profiles using a conventional approach (i.e., fitting to a theoretical curve) (29) was not possible since melting profiles in gel-pad microarrays cannot be considered as reflecting an equilibrium process (see Discussion). Additionally, we tried to infer a function that best describes melting profiles. This function was able to explain only a small fraction of melting profiles.
Since no underlying model for interpreting gel-pad melting profiles exists, we sought to determine the sources of the inconsistencies by training artificial neural networks (NNs) to recognize patterns in the form of melting profiles. Artificial NNs can be used to recognize patterns in data such as the variability and shape of melting profiles and to classify melting profiles (e.g., ideal or poor) based on their melt characteristics (e.g., initial signal intensity, area under the curve). NNs are implemented as computer programs and consist of networks of neurons that receive information from inputs or other neurons, make independent computations, and pass their outputs to other neurons in the network (1, 3, 42). Once an NN is properly trained, the optimized weighting factors can be used to generate a model that provides information on the relationships among (input) variables such as melt characteristics and different types of melting profiles (outputs) such as those typical of perfectly matched duplexes versus those of duplexes containing multiple mismatches.
The objectives of this study were: (i) to develop and evaluate a software tool for calculating the variability and shape of melting profiles and (ii) to determine the main sources of inconsistencies that affect interpretation of melting profiles.
We report the development of a melting profile performance (MPP) calculator that correctly determined that
18% of perfectly matched duplexes yielded highly variable (i.e., poor) melting profiles. Further experiments revealed that the main sources of inconsistencies contributing to the poor melting profiles were: the placement of the grid frames and the background subtraction method used by the image analysis system. Moreover, duplexes with low binding constants had highly variable melting profiles because SI values approached the detection limit of the system.
|
|
|---|
|
View this table: [in a new window] |
TABLE 1. Organisms testeda
|
Oligonucleotide array fabrication.
Oligonucleotides were designed by using the probe design function of ARB software (http://www.arb-home.de) (27). The specificity of the probe for the target was checked with the probe check function in the ARB software, the BLAST search (2) at the National Center for Biotechnology Information, and the Probe Match program in Ribosomal Database Project II (28). Self-complementarities were also examined by using Ribosomal Database Project II. Oligonucleotide probes, ranging in length from 13 to 25 nucleotides (Table S1 in the supplemental material), were synthesized with an amino linker at the 3'-end and fabricated at Argonne National Laboratory (52). The microarray matrix containing polyacrylamide gel pads (100 by 100 by 20 µm) spaced 200 µm apart from each other and fixed to a glass slide, was manufactured by photopolymerization procedure (52). A total of 3 nl of 1 mM amino-oligonucleotide solution was applied to each gel element containing aldehyde groups (45) which were designed and implemented by a robot arrayer (52). A total of 186 oligonucleotide probes were immobilized with the aldehyde group of the activated gel pad on the microarrays as described previously (40).
Hybridization and washing protocols.
Hybridizations were carried out at room temperature (20°C) for 12 h in 40 µl of hybridization buffer containing 5 to 10 µg of each target RNA, 0.9 M NaCl, 20 mM Tris-HCl (pH 8.0), and 40% formamide. Following overnight hybridization, the microarray was washed three times at room temperature with a washing buffer consisting of 20 mM Tris-HCl (pH 8.0), 5 mM EDTA, 4 mM NaCl and 1% wt/vol Tween 20. After the final wash, 200 µl of washing buffer was added to the imaging chamber (Grace BioLabs, Bend, OR) for image and melting profile capture.
Image and melting profile capture.
To generate melting profiles, the microarray was fixed on a thermotable mounted on the stage of a custom-designed epifluorescence microscope (State Optical Institute, St. Petersburg, Russia) and connected with a thermoelectric temperature controller (LFI-3735; Wavelength Electronics, Inc. Bozeman, MT) and a water bath (Cole Parmer Instruments Co., Chicago, IL). The microscope was equipped with fluorescence filters (Omega Optical, Brattleboro, VT) and a cooled charge-coupled device camera (Princeton Instruments, Trenton, NJ) and manipulated with a program that allows image acquisition, processing, and analysis (13). Melting profiles for all probe-target duplexes were monitored and recorded at 2°C intervals between 20 and 70°C by increasing the temperature at a rate of 1°C per min. The melting profile experiments were performed in triplicate and repeated on different days. We visually inspected each microarray for gross artifacts prior to recording melting profiles. In total, 13 microarrays were used in this study (n = 104 experiments) and each microarray was reused multiple times. Analysis of variance of initial signal intensities and MPP scores of universal probes revealed no significant differences by microarray lot or the number of times the microarray was reused.
To assess the precision and accuracy of the image acquisition system, an alternative approach was employed by creating a custom-designed module in V++ (Roper Scientific, Germany). The module controlled the camera, the microscope, and the Peltier element, and recorded the image at each temperature. An additional custom-designed program was created in C++ to convert the stack of images to SI values for each gel-pad on the microarray.
In silico prediction calculator.
The in silico Prediction calculator (http://noble.ce.washington.edu/programpage.jsp) evaluates probe and target sequence using lexicographical matching to yield information on the type of probe-target duplex (i.e., perfectly matched, P; terminal mismatch, T; internal mismatch, I; more than two mismatches, N). The position and type of the mismatch (e.g., A, T, G, and C) was determined for the duplex structure yielding the best match (highest hybridization score). The hybridization score of a probe to a target sequence was determined by counting the number of correctly base-paired nucleotides for a probe at every possible position in a target sequence, starting at the 5'-end of a target sequence, and moving the probe along the target sequence, one nucleotide position at a time until the end of the sequence was reached.
Td calculator.
The Td calculator was designed to automatically calculate the experimentally determined Td and slope for each probe-target duplex by using data obtained from the image acquisition, processing, and analysis software. A Web based interface for this software is available at http://noble.ce.washington.edu/programpage.jsp. The interface contains a Readme documentation describing how to use the software. The documentation contains links to demonstration files (that can be submitted to the calculator) and specifies required formatting for input files.
Since calculating the Td using a simple curve fitting yielded inconsistent results (presumably due to factors affecting the melting profiles, see Discussion), a new version of the Td calculator (47) was designed to determine the temperature corresponding to the maximum slope of the signal intensities, which was presumably related to the transition from duplex to random coil. Multiple regressions lines, consisting of various number of points, were used to determine the maximum slope. To maximize the number of points used for regression analyses, normalized SIs, i.e., Xnorm = (Xobs min)/(max min) in the range of 0.75 to 0.55 arbitrary units were used as the initial start points for regression lines, and a lag of 0 to 5 points was used to vary the endpoints, and thus the length of the regression lines. The minimum number of points used to calculate a regression line was 5. The program considered all possible slopes and calculated the mean slope and Td for all slopes meeting the following two criteria: (i) the calculated Td was within the mean temperature (±1.5°C) of normalized SI values between 0.35 to 0.65, and (ii) the calculated Td fell within a temperature range based on the normalized SI values of 0.55 and 0.45. The default values for the Td calculator were experimentally determined.
For each melting profile, the number of regression lines meeting these criteria, the initial SI, the Td, and the slope of SI values and temperature (dSI/dT), the area under the curve before min/max normalization, and the normalized area under the curve (i.e., after min/max normalization and zeroing) was recorded.
Data sets used for statistical analyses.
Melting profile characteristics and corresponding in silico predictions of each gel-pad experiment was merged to a single data set. Data set 1 consisted of melting profiles obtained from 22 known target sequences (Table 1), 186 probes targeting 16S and 23S rRNA (Table S1 in the supplemental material), and 81 hybridizations and melting runs, and included 1,017 perfectly matched (P) probe target duplexes, 110 probe target duplexes containing a terminal mismatch, 2,269 duplexes containing one or two internal mismatches (I), and 12,188 duplexes containing more than two mismatches (N). Data set 2 consisted of melting profiles obtained from 10 known target sequences (Table 1), 186 probe targeting rRNA genes (Supplementary Table S1), and 23 hybridizations and melting runs, and included 276 perfectly matched (P) probe target duplexes, 21 probe target duplexes containing a terminal mismatch, 604 duplexes containing one or two internal mismatches (I), and 4,559 duplexes containing more than two mismatches (N).
NN software.
An artificial NN package was developed for this project (Neuroet) (38, 46) and is available at the web site http://noble.ce.washington.edu/Neuroet. Unless otherwise specified, the following settings were used for training NNs: input and output scaling was set to standard linear (0, 1); the logistic transfer function was used for hidden neurons and pure linear transfer function was used for output neurons; 80% of the data was used for training, 10% was used for testing, and 10% was used for validating the NN; and, conjugate gradient error minimization was used as the training method.
The architectures of all NNs were optimized prior to conducting analyses by adjusting the number of hidden neurons (1 to 8) and identifying the architecture that provided the best predictive model. Comparison of different predictive models was conducted by computing their median Akaike's Information Criterion corrected (AICc) value (35), determining the probability that one model was better than another, and calculating the corresponding evidence ratio. The AICc value was used (rather than R-squared) because it balances the complexity of an NN model (i.e., the number of data records and the number of variables [i.e., input variables and number of hidden neurons]) with how well the NN predicts outputs. The model yielding the lowest AICc score contained the optimal number of hidden neurons. AICc was calculated using the following equation:
![]() | (1) |
The probability (Pmodel) that one NN model was likely to be more correct than another was determined using the following equation:
![]() | (2) |
![]() | (3) |
Identifying the most important inputs for predicting outputs.
The relative contributions of inputs to predicting outputs were determined by using the Measure Importance of Inputs procedure in the Neuroet package (38, 46). Fifteen NN models were generated for each input variable (e.g., SI value). Models that fell into the lower 25th percentile based on their ranked SS were removed from the analysis because they were considered fixed in local error minima. The AICc scores of the remaining 11 models were averaged. The procedure was then repeated for each input variable. The AICc scores of the inputs were ranked by their value. The probability score and evidence ratio of the ranked inputs (discussed above) were calculated to determine the probability that one input (or a combination of inputs), was better than another.
Developing the Quality and Shape calculator.
Melting profiles of 4800 probe-target duplexes from data set 1 were manually scored for their Quality and Shape values by comparing the profiles to predetermined standards as shown in Fig. 1. The Quality scoring was based on the disjointedness of adjacent points in the melting profile. A "smooth" melting profile was scored 0 (Fig. 1A and 1C) while a melting profile that had one to several disjointed points was scored 0.25 to 1.0 (Fig. 1B, 1D-F), respectively, depending on the amount of noise. The Shape scoring was based on the overall shape of the melting profile: an ideal shape was assigned a score of 0, a questionable shape was assigned a score of 0.5, and a random or not-interpretable shape was assigned a score of 1. The final data set used for training and testing the NN consisted of a subset of 1500 records (from the 4,800 melting profiles; i.e., 500 ideal, 500 uncertain, and 500 poor melting profiles) and 500 records that were generated with random numbers having a range of 0 to 1.
![]() View larger version (25K): [in a new window] |
FIG. 1. Example criteria used to classify melting profiles. A to E, Actual melting profiles; F, Random profile. Quality criterion: A and C have low scores because they are smooth; B and D to F have disjointed points as indicated by the arrows and have high scores. Shape criterion: A and B have low scores because they have ideal shapes.
|
Quantifying the effects of noise on Quality and Shape scores.
To determine the effects of noise on Quality and Shape scores, 10 melting profiles that were considered ideal by visual interpretation were selected from a single microarray experiment. Fixed amounts of randomly generated noise were then added to each of the ten profiles at levels of approximately 0, 0.1, 0.5, 1.0, 5.0, 10, and 50%. The Quality and Shape scores were computed for each melting profile.
Multivariate statistical analysis.
Principal-component analysis (PCA) was employed to examine the distribution of melt characteristics relative to duplex types (e.g., P, I, or N) and to construct ordination plots.
Thermodynamic calculations.
The thermodynamic properties (
G0) of each perfect-match duplex were calculated using OligoAnal (31). With exception to temperature and oligonucleotide length, we used all default parameters. For our calculations, the temperature was set to 20°C and oligonucleotide length was adjusted accordingly.
|
|
|---|
|
View this table: [in a new window] |
TABLE 2. R-squared values of observed versus predicted scores
|
![]() View larger version (25K): [in a new window] |
FIG. 2. Importance of NN inputs to predict Quality (top panel) and Shape (bottom panel) scores relative to an idealized melting profile. The relative importance of each input was based on rank order and their corresponding Pmodel and E ratio values (see text). Shaded areas (and solid black circles) indicate inputs that were statistically more important than the other inputs (white circles). Note that for Shape scores (lower panel), NN inputs 26 and 27 were also found to be important for predicting Shape scores. Inputs 26 and 27 correspond to Quality 1 and 2 scores, respectively.
|
Equations defining the relationship between SI values and Quality and Shape scores were extracted from the trained NNs and incorporated into an application on the web. The Melt Quality and Shape Calculator and documentation files are available at http://noble.ce.washington.edu/programpage.jsp.
Melting profile performance (MPP) calculator.
Although the Melt Quality and Shape Calculator provides information on variability and shape of melting profiles, it does not consider other melt characteristics such as the area under the curve or initial SI values. To account for these other melt characteristics, we developed a calculator that predicts melting profile performance using melt characteristics shown in Table 3 as inputs and using information from PCA as a guide to classify ideal from poor melting profiles. Note that ideal and poor terms are not the same as high and low scores for Quality and Shape. A balanced data set was constructed from data set 1 and analyzed by PCA. The balanced data set consisted of equal number (n = 976) of perfectly matched (P) duplexes, duplexes containing internal mismatches (I), and duplexes containing more than two mismatches (N) extracted from data set 1. The following characteristics were used for PCA: (i) initial signal intensity, (ii) true signal intensity area (i.e., the area under the curve without normalization), (iii) normalized signal intensity area (i.e., area under the curve with min/max normalization), (iv) number of regression lines used to estimate the dissociation temperature, (v) Quality scores 1 and 2, and (vi) Shape scores 1 and 2.
|
View this table: [in a new window] |
TABLE 3. Correlation coefficients of variables relative to PCA axes (n = 2,928)
|
![]() View larger version (18K): [in a new window] |
FIG. 3. Ordination plots produced by PCA of melting profile variables. P, perfectly matched probe target duplexes; I, duplexes containing an internal mismatch; N, duplexes containing more than two mismatches.
|
The optimal number of hidden neurons was determined by training NNs (multiple times) using one to eight hidden neurons, and identifying the architecture yielding the lowest AICc score. The lowest AICc score occurred for NNs having five hidden neurons, thus the optimal architecture used to train the NN was eight inputs (one for each variable), five hidden neurons, and one output neuron (i.e., one for the MPP score). Eighty percent of the data were used to train the NN, 10% of the data were used to test the NN and the remaining 10% were used to validate the NN. The final R-squared values of train, test, and validation data sets were close to 1 (e.g., 0.96), indicating considerable similarity. We determined the importance of melting profile characteristics to predict MPP scores and found that the initial SI value, the number of regression lines used to calculate the Td, and the Shape1 score provided the best combination of characteristics to predict the MPP score, accounting for approximately 90% of the variability (data not shown). The weights and biases extracted from the NN were used to calculate the MPP score from melting profile characteristics.
The relationship between the predicted MPP score and positions on the ordination plot for the balanced data set are shown in Fig. 4 (upper panel). Approximately 48.5% of the balanced data set was classified as ideal melting profiles, 5.7% could not be classified, and the remaining 45.6% was classified as poor (Fig. 4, lower panel).
![]() View larger version (21K): [in a new window] |
FIG. 4. Melting profile performance scores by position on the ordination plot. Top panel, two principal components; bottom panel, absolute duplex performance scores (predicted by MPP calculator) relative to PCA1. Blue dots represent melting profiles with duplex performance scores of >0.75; red dots represent profiles with scores between 0.25 and 0.75; black dots represent profiles with scores of <0.25.
|
|
View this table: [in a new window] |
TABLE 4. Classification of melting profile performance by duplex type
|
![]() View larger version (32K): [in a new window] |
FIG. 5. Interpretation of melting profiles for perfectly matched probe-target duplexes by the MPP calculator. Top panel, a probe (Univ 1390) that tends to yield high initial signal intensity values; lower panel, a probe (S-P-Grpos-1200-a-A-13) that tends to yield low initial SI values. Ideal profiles, blue lines; uncertain profiles, red lines; and poor profiles, black lines. Not all melting profiles are shown for clarity.
|
![]() View larger version (30K): [in a new window] |
FIG. 6. Distribution of initial SI values of perfectly matched probes from data set 1 by MPP type. Initial SI values of all perfectly matched melting profiles are shown as shaded background (i.e., including uncertain profiles). The range, mean ± standard deviations (gray) of initial signal intensities for probes 438 (S-P-Grpos-1200-a-A-13, G0 = 21.8 kcal/mol) and 62 (Univ 1390, G0 = 33.8 kcal/mol) are presented in the horizontal bars above.
|
G020 for all perfect-match duplexes. Figure 7 shows the relationship between initial SI values and
G020 for duplexes that were replicated at least 40 times. Approximately 83% of the variability in the data was explained by mean initial SI values and
G0. Note that probe 438 had a higher
G0 (21.8 kcal/mol) than probe 62 (33.8 kcal/mol), which supports the notion that duplexes with high binding constants (negative
G020) tend to have ideal melting profiles.
![]() View larger version (17K): [in a new window] |
FIG. 7. Relationship between initial SI values of perfectly matched duplexes and their corresponding binding constants ( G020) for solution. SI values were determined by the Fotin et al. (13) method. Mean and standard deviation (n > 40) of SI values are shown for each duplex. Probes in order from lowest to highest G020 are (number, name): 63, Univ 907; 64, Eub 927; 65, Eub 338; 390, Eub 336; 74, S-P-Grpos-1192-a-A-22; 62, Univ 1390; 75, S-P-Grpos-1199-a-A-15; and 438, S-P-Grpos-1200-a-A-13.
|
![]() View larger version (30K): [in a new window] |
FIG. 8. Distribution of initial SI values of probes having more than two mismatches to target sequences from data set 1 by MPP type. Initial SI values of melting profiles of all probes with more than two mismatches are shown as shaded background (i.e., including uncertain profiles).
|
Effects of image processing on melting profiles.
Given the significant number of perfectly matched duplexes yielding poor melting profiles, we investigated the effects of image processing on melt profiles by comparing three background subtraction methods. Raw image stacks (n = 120 images) were collected from thermal dissociations (20°C to 70°C) of three different oligonucleotide duplexes. Each image in a stack represented the SI value of a probe-target duplex collected at one temperature. To simulate subtle variations in placement of the frame used to collect averaged image SI values from the gel-pad, 20 different frame placements of the same image were produced by randomly displacing the x and y coordinates of the frame 20 times (i.e., a change of ±4 pixels representing a variation of 0 to 6.7% total gel-pad area). This displacement never removed the gel pad from the inner frame. All three background subtraction methods were applied to the same 20 modified image stacks.
The Fotin et al. (13) background subtraction method involves subtracting background using the average SI values immediately surrounding the frame and can be defined by the equation:
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() View larger version (46K): [in a new window] |
FIG. 9. Color-enhanced image of a portion of a gel-pad microarray showing the inner and outer grids framing the gel pads used by the Fotin et al. (13) image processing software. Gel-pads, green; frames, pink; background, blue. Panel B is a magnified image of single gel-pad from panel A.
|
![]() View larger version (43K): [in a new window] |
FIG. 10. Composite melting profiles derived from the analysis of 20 displaced image stacks from a single gel-pad microarray experiment. Placement of the original stack image was displaced in the by and y coordinates of the frame (see text for details). Panels A, C, and E represent stack images processed by using different background subtraction methods: in-out/out (equation 4) (13), the in-out (equation 5) (52), and the in method (equation 6) (this study), respectively. Panels B, D, and F represent the normalized melting profiles (used to calculate the dissociation temperature) from panels A, C, and E, respectively. Gray boxes indicate the range of Td values.
|
Data sets 1 and 2 were produced using Fotin et al. (13) method. Therefore, one can expect the background subtraction method and placement of the window framing the gel pad to have a substantial effect on the quality of these datasets. Unfortunately, we could not regenerate the datasets because the image acquisition software (13) did not store the original images.
Interestingly, most of the RNA profiles observed with our method (equation 6) were linear rather than sigmoid. When melting profiles were calculated in terms of the Fotin et al. (13) (equation 4) method, they turned out to be curved, indicating that the method alters the form of melting profiles to approximate sigmoid melting profiles in solution. To more thoroughly examine the extreme effects of equation 4 on the shape of melting profiles, in silico simulations were performed using two straight lines that approximated the values of Iinner and Bouter along the temperature course of a duplex melt. The simulations produced curved melting profiles resembling sigmoid melting profiles in solution. Moreover, increasing the slope of Bouter (to simulate the diffusion of labeled targets into solution), substantially increased the curvature of the melt, indicating that target size might have significant affects on the form of melting profiles calculated by this method.
Effects of diffusion on melting profiles.
The motion of large RNA fragments in the gel pad and the aqueous solution is determined by diffusion, which in turn, is governed by the size of the molecule. The effect of target fragment size on melting profiles was investigated to determine if diffusion was a major determinant in the observed melting profiles since diffusion affects Bouter in equations 4 and 5. The alternative image processing and background subtraction method (equation 6) was used for these experiments. Diffusion could be a major determinant if various-sized targets hybridizing to identical probes yielded different melting profiles. In this case, short synthetic targets (i.e., 18- to 22-nucleotide oligonucleotides) should diffuse more rapidly from the gel-pad into solution during the temperature course of the experiment than native fragmented 16S rRNA (approximately 100 to 150 nucleotides). Alternatively, if observed melting profiles were similar, we could conclude that melting profiles were independent of target fragment size. To compensate for possible dye effects (due to differences in the labeling of target oligonucleotides and native RNA), we compared the differences in Td between perfect-match and mismatched duplexes (Fig. 11) labeled with the same dye.
![]() View larger version (23K): [in a new window] |
FIG. 11. Differences in melting profiles obtained using synthetic (i.e., oligonucleotide) (panels A and C) and native rRNA target (panels B and D). Images were obtained using equation 6. Panels A and B represent probes 62 (Univ 1390, perfect match, PM) and 399 (Univ 1390-c13, single internal mismatch, MM), respectively. Panels C and D represent probes 63 (Univ 907, PM) and 401 (Univ 907-c9, MM), respectively.
|
It is important to recognize that initial SI values for perfect-match and mismatched duplexes were significantly different (data not shown). Scaling (in order to compare Tds) of the melting profiles, as currently practiced, masked the difference between perfect-match and mismatch duplexes when RNA was used as the target. Therefore, for long RNA fragments, melting may not provide additional power for discriminating perfect-match and mismatched duplexes.
|
|
|---|
In this study, NNs were trained to recognize patterns in gel-pad microarray data (e.g., the variability and shape of melting profiles) and to determine the variable (i.e., input), or combinations of variables (i.e., inputs), contributing to observed patterns in the data. The NN program, Neuroet (38) was employed because, in contrast to available NN packages (free and commercial), it allowed us (i) to automatically determine the optimal number of hidden neurons using an iterative approach, and (ii) to easily extract equations from trained NNs in order to link together several NNs that were used to recognize various aspects of melting profiles (e.g., Quality, Shape, and MPP scores), and (iii) to identify the input or combinations of inputs which are important for making predictions.
Before training the NN, the optimal architecture must be determined (5, 30). We optimized the number of hidden neurons (i.e., architecture) for training NNs and compared the predictions of test and validation data sets to ensure that the architecture and training conditions lead to accurate and reliable predictions. The automated procedure in Neuroet that calculated the number of hidden neurons substantially sped up (
100 times) the process of determining the optimum architecture of the NN because the user was not required to manually compare the performance of different NNs (e.g., R-square values of observed versus predicted outputs of test and validation data sets). Rather, the best NN model was selected based on AICc scores, with the lowest score containing the optimum number of hidden neurons for data analyses. Information on the predictions for testing and validation data (Table 2) demonstrated that the final NNs were not under- or overtrained and that the predictions were quite accurate-even for melting profiles not used for training. Equations extracted from these NNs were used to build the MPP calculator.
The MPP calculator consisted of equations extracted from four trained NNs (i.e., NNs that predicted Quality 1, Quality 2, Shape 1, and Shape 2 scores). They were linked together in the following fashion: (i) the predicted outputs from equations predicting the Quality scores were fed into the inputs of the equations predicting the Shape scores of melting profiles, and (ii) the predictions of Quality and Shape scores, as well as other melt characteristics, were used as inputs to an equation that predicted MPP scores. Our approach of linking together different characteristics of melting profiles into one calculator was successful since we were able to effectively discriminate between ideal and poor melt profiles as demonstrated in Fig. 4 and 5. Moreover, we were able to account for most (90%) of the variability in the data.
Just knowing that an NN accurately recognizes specific patterns in complex data does not provide explanatory insight into the contributions of input variables to the prediction process. For this reason, we used the Measuring Importance of Inputs procedure in the Neuroet package. This procedure yielded consistent results when repeated numerous times (n = 5), indicating that the approach was statistically robust and in agreement with the rigorous testing conducted in a similar study (38). To our knowledge, this is the first study to demonstrate the utility of this procedure using biological data. The finding that SI values at the beginning of the melt were important to predict Quality and Shape scores was anticipated since SI values are highest at the beginning of the melt and more variable along the temperature course as target sequences dissociate, and SI values approach the detection limit of the system (Fig. 5). The finding that SI values at the beginning, middle, and the end of the melting profile were used to predict Quality scores suggests that, in addition to considering the disjointedness of adjacent SI values at the beginning of the melt, the NN also considered SI values at the maximum slope of the melt, and at the end of the melt.
The MPP calculator provided a way to consistently classify melting profiles based on the change of SI values with temperature. Key findings obtained by analyzing the aggregated data with the MPP calculator were that: (i) approximately 18% of the perfect-match duplexes yielded poor melting profiles, (ii) approximately 20% of duplexes with two or more mismatches were classified as ideal and (iii) these results were similar for two independent data sets. Visual evaluation of these profiles confirmed that indeed they were classified correctly, in contrast to our expectations. Gel-pad microarray technology was developed to serve as an analytic method that would be able to identify specific nucleic acids in a sample. It turns out that the method itself imposes very high complications for interpreting its results (see Discussion about overlapping processes). Hence, these findings identified systematic problems intrinsic to interpretation of melting profiles and gel-pad technology. The observation that certain mismatched duplex types produced ideal melts was anticipated (41) since these duplexes presumably have high binding constants and their SI values were within the dynamic range of the camera. It was for this reason that we investigated the effects of image processing, binding constants, and diffusion on melting profiles.
Interpretation of melting profiles.
Classical DNA melting experiments using spectrophotometric analysis have shown that a sigmoid melting profile is precisely determined by the temperature course of the binding constant (9, 29). The hybridization process in a gel-pad microarray might be regarded as an equilibrium process (given adequate time for hybridization). The equilibrium process can be described by the Law of Mass Action (15) and expressed by the following equation:
![]() | (7) |
G0 is the change in the standard Gibbs free energy, R is the universal gas constant, T is the absolute temperature, and Kp is the binding constant of the nucleic acid strands. According to equation 7, the binding constant exponentially increases as
G° becomes more negative-therefore, strong binding of a probe to a target (i.e., high Kp) indicates high duplex stability (i.e., negative
G°). Binding constants (and therefore the
G0s) are sequence-dependent (9) since duplexes formed from different perfect-match probes yielded different initial SI values (Fig. 7). These findings are consistent with the notion that differences in initial SI values of duplexes (determined by using equation 4) are mainly (
83%) attributable to differences in the binding constants. In contrast, the melting process is not an equilibrium process because prior to the melting of duplexes, the microarray was washed with a stringent buffer that removes all nonhybridized material. Hence, duplexes in gel-pads are not in equilibrium with the original single-stranded nucleic acids. The dissolution process is determined by the rate with which duplexes dissociate, the diffusion of the target within the pad, and the diffusion of the target into the solution outside of the pad. Therefore, the observed (nonequilibrium) melting profile in a gel-pad is an overlap of the true melting of duplexes, the diffusion of the target into solution, and the temperature-dependence of the fluorescent dye used in these experiments (33). We demonstrated the affects of target size on melting profiles by comparing the melting profiles of short synthetic targets (oligonucleotides) to those of native rRNA fragmented using the currently established protocol (Materials and Methods). While this and a previous study (48) demonstrated that duplexes formed using synthetic targets in the range of 20 to 40 nucleotides allowed the effective discrimination of perfectly matched and mismatched duplexes (Fig. 11), fragmented native rRNA targets did not. Hence, discrimination of perfectly matched from single mismatched duplexes (12) with rRNA fragmented using the current protocol (producing fragments in the range of 100 to 150 nucleotides) appears to be problematic presumably because of reduced diffusivity (a function of size and possibly secondary structure).
The effects of diffusion are especially relevant to our study since the observed melting profiles were analyzed by image acquisition software that used the Fotin et al. (13) background subtraction method. Fotin et al. (13) originally proposed this method, which is just a division of the local background corrected SI value over its local background, because it partially compensated for the temperature dependence of the fluorescence dye and variations in the intensity of the exciting light. Moreover, this method has been widely used in numerous studies (8, 10, 12, 20, 24, 45, 47, 48, 49). The space immediately surrounding the gel-pad is the region most affected by the diffusion of the dissociated target into solution, which can be observed by the significant increase in SI values of the background (i.e., Bouter) during the temperature course of the experiment (data not shown). For this reason, minor differences (i.e., 6 to 7%) in placement of the rigid grid that frames gel-pads relative to the background (i.e., Bouter) significantly influence the form of melting profiles and their initial SI values produced by this method (Fig. 10).
The existing image analysis software was not sufficiently flexible to allow ideal placement of the grid to all pads on a microarray (centering the inner window on each gel pad). For instance, placement of the grid to pads in one region (e.g., corners of the microarray) was always associated with different placement of the grid to pads in other regions of the microarray (data not shown). Therefore, diffusion of the target and placement of the grid frames were major sources of experimental variation in this study (as is likely true of others) (8, 12, 24, 26, 47, 48) that affect our ability to interpret microarray data.
Although some erratic melting profiles may be attributed to the Fotin et al. (13) background subtraction method, uncertain or poor melting profiles might also be due to probes with low binding constants. For example, more than 60% of the targets that were complementary to probe 438 (S-P-Grpos-1200-a-A-13) were classified as having poor melting profiles, because probe 438 has a low binding constant (
G0 = 21.8 kcal/mol., see Fig. 7) to complementary targets as clearly shown in Fig. 5. Fluorescence originating from duplexes with low binding constants presumably approaches the detection limit of the system. These results also demonstrate that the concentration of fragmented target sequences and the binding constant of the probes, combined with the dynamic range of the camera, can sometimes confound our ability to detect fluorescent signals of melting profiles.
Our working hypothesis that the general form of a melting profile in gel-pad microarrays (within the dynamic range of the detection system) should be sigmoid-shaped and not change from one duplex to the next, and that observed differences are due to experimental variations is supported because we found ideal melting profiles for various perfect-match and mismatched duplexes. Therefore, melting of duplexes that have various types and compositions have the same general form of melting profiles. Deviations from the general form can be attributed to experimental artifacts (e.g., Fotins equation, diffusion of the target, SI values approaching the detection threshold of the camera, etc.).
In accordance with equation 7, the binding constant should exponentially decrease (and
G0 become more positive) for duplexes with mismatched base pairs, making the duplexes unstable. Table 4 supports this general trend: duplexes with 2 or more mismatched base pairs yielded a lower number of ideal melting profiles and a higher number of poor melting profiles. Most duplexes containing mismatches were classified as having poor melting profiles because they have low binding constants, and consequently, were below the detection limit of the system.
Some duplexes with mismatches (
18%) consistently yielded ideal melt profiles (Table 4). The MPP calculator classifies melting profiles (i.e., ideal, uncertain, or poor) based on subjective judgment about the form of melting profiles. Smooth sigmoid shape and absence of erratic glitches were the primary criteria for classifying a profile as ideal as shown by the majority of melting profiles from perfect-match duplexes. This finding indicates that the simple observation of melting profiles was not enough to interpret gel-pad microarray data, as used in a previous study (12). Indeed, some duplexes with multiple mismatches behaved the same way as perfect-match duplexes: reasonable Td and ideal shape. Hence, a melting profile, by itself, does not provide useful information for distinguishing between specific and nonspecific hybridizations. These results are very problematic for the application of gel-pad microarray technology to environmental samples of unknown composition (both sequence- and concentration-wise). Perhaps development of a robust physical model that considers the effects of diffusion, dissociation kinetics, thermodynamics of hybridization, sequence dependencies, target concentration, and rate of temperature increase would drastically improve the situation.
In summary, the NN approach outlined in this study was especially useful for examining data affected by overlapping factors (e.g., placement of the grid frames, background subtraction method) which were not obvious at the beginning of the study. The analysis of 1,293 perfect-match duplexes by the MPP calculator revealed that
18% were correctly classified as having poor melting profiles. Visual examination of all of these profiles showed that they indeed looked poor having a linear shape and very low SI values. Presumably, this result was due to image processing artifacts (see above) and/or low binding constants of the probes, despite the fact that they were perfect-match duplexes.
Towards developing a diagnostic tool for microbial identification.
To address the problems associated with the image acquisition and processing, we have developed a new image analysis system that captures the images of all gel pads on a microarray as they change with temperature. The images are stored as image stacks. Image processing software has also been developed that allows users: (i) to place grid frames, (ii) to convert image stacks to SI values, and (iii) to implement a variety of background subtraction methods including those in equations 4, 5, and 6. We have used this software to analyze melting profiles as shown in Fig. 10 and 11. Unfortunately, data sets 1 and 2 could not be reanalyzed because the original image analysis software did not save images. Our new image acquisition and processing system does save images.
Although single-base-pair discrimination is possible with gel-pad microarrays using oligonucleotide targets, this level of resolution might not be generally possible with native rRNA unless the target rRNA is fragmented to a very short length, similar to that of oligonucleotides used in this and other studies (47, 48). As shown in Fig. 11, we were not able to clearly distinguish between perfect-match and mismatched duplexes with a small study set of probes examined with the current fragmentation protocol, which produces fragments of 100 to 150 nucleotides in length, and the modified image analysis software. Also, the secondary structure of RNA may play a critical role in discrimination. Although fragmenting rRNA to shorter lengths might help resolve single nucleotide mismatches, one has to be aware that extensive fragmentation by any currently existing hydrolytic method using metal ions or Brønsted acids and bases induces bias towards the fragments involved in secondary structure (helices) (32). This bias will affect quantification of microorganisms from environmental samples since some of the informative single stranded portion of rRNA will be destroyed by hydrolysis.
Application of the MPP calculator and subsequent experiments revealed that the main variables affecting the form of melting profiles using the current gel-pad microarray are: the placement of the grid frames and the background subtraction method which are both used by the image analysis system, and duplexes with low binding constants. A key variable affecting the use of dissociation analysis to improve mismatch discrimination, relative to more conventional use of an average hybridization stringency, was shown to be dependent on target molecule size. Although single mismatch discrimination has been demonstrated for certain probes using oligonucleotide targets with the existing technology, an assessment of the more general utility of the gel-pad format for dissociation analysis will require reevaluation of both the image processing and fragmentation/labeling protocols.
This work was supported by grant 1U01DE014955-01 from NIH/NIDCR to H.S. and P.A.N. and grant R-82945801 from EPA-CEER-GOM to P.A.N. Funding through DARPA provided the gel pad microarrays.
Supplemental material for this article may be found at http://aem.asm.org/. ![]()
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»