Applied and Environmental Microbiology, July 2007, p. 4631-4638, Vol. 73, No. 14
0099-2240/07/$08.00+0 doi:10.1128/AEM.00144-07
Copyright © 2007, American Society for Microbiology. All Rights Reserved.
Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois 60208
Received 19 January 2007/ Accepted 15 May 2007
Once a microarray platform has been chosen, the next choice to be made is which of the available programs to use for microarray oligomer probe design. As stated by Li and Stormo (13), "Empirically, the optimum probe for a gene would be the one with minimum hybridization free energy for that gene (under the appropriate hybridization conditions) and maximum hybridization free energy for all other genes in the hybridizing pool. Unfortunately, those energies depend on knowledge that is not computable from the sequence alone, at least not currently." We are not aware of any established method for predicting the melting temperature (Tm) of the duplex formed by a surface-immobilized probe and the corresponding labeled mRNA or reverse-transcribed cDNA (its target). As a consequence, probe design programs (2, 4, 14, 15, 20, 28, 30-32, 40, 41, 43) use different criteria for selecting the best set of probes for a given set of parameters, such as G+C content, the percentage of sequence identity or similarity, and Tms, assuming that the probe and its target are both in solution. Each criterion by itself yields an "optimal" set of probes, but to capture the best possible probe set, several different criteria are used by the various design programs to rate each probe. Unfortunately, there are fewer reports on the performance of whole-genome microarrays created by using these programs than there are programs themselves. Given the different optimality criteria used by different programs, the best way to assess the quality of the probe design outcomes of these programs is to test them experimentally.
This paper presents a general two-part strategy for developing high-quality microarrays for any sequenced organism. The first part consists of the in silico creation of a library of optimal probes for each target sequence and the selection of a first set of probes to be experimentally tested. The second part includes the experimental evaluation of the performance of the previously selected probes by using two mRNA pools corresponding to different strains of Clostridium acetobutylicum. The careful selection of the mRNA pools allowed for the estimation of the minimum intensity that a target has to achieve to be considered to be expressed. Finally, we compared the results obtained by using the newly designed array to those obtained by using our previously existing (1) cDNA microarray platform.
Probe design software.
Several oligomer design programs were tested (2, 4, 14, 15, 20, 28, 30-32, 40, 41, 43), and six programs (CommOligo [14], ROSO [28], YODA [20], ArrayOligoSelector [2], OligoWiz 2.0 [41], and PICKY [4]) were selected. The selection was based on three considerations: these programs apply more criteria (e.g., the level of sequence identity, the number of contiguous matches, Tm, and the level of free binding energy) to the selection of each probe than the other programs do, their algorithms are documented in greater detail, and, in our assessment, they are easier to use.
Probe design parameters.
All programs were set to return as many 60-mer probes as possible with a maximum similarity to any nontarget sequence on the genome of 75 to 85%. This relatively high similarity level was chosen to attain maximal genome coverage at the expense of allowing some (low) level of cross hybridization. The targets to be covered were 3,916 of the 4,024 annotated C. acetobutylicum ATCC 824 (19) sequences identified as CACXXXX or CAPXXXX, where X's represent numbers in the gene designations. The remaining 108 sequences (rRNAs and tRNAs), together with the intergenic regions separating the annotated sequences, were used as background sequences in the programs featuring that option. We use the term background sequences to describe those sequences for which a probe is not designed but against which all probes will be tested so as to avoid cross hybridization. All other parameters were set to the program defaults.
Computational probe selection.
The potential of each probe for cross-hybridization was estimated by finding the minimum difference in Tm between probe homodimers (i.e., the probes and their intended targets) and the corresponding probe heterodimers (i.e., the probes and likely nonspecific matches found in the C. acetobutylicum ATCC 824 genome) (Fig. 1). For each probe, the set of nonspecific matches contained the first four hits returned by FASTA (22, 23). Tm calculations were done using Hybrid 2.5 (16) as described in reference 32.
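The minimum-difference criterion described above can be illustrated with a short Python sketch. It assumes that the Tm values of the homodimer and of each heterodimer have already been computed elsewhere (e.g., with Hybrid); the function names, the dictionary layout, and the numeric values are hypothetical, and ranking probes by decreasing minimum Tm difference is our reading of the criterion rather than a transcription of Fig. 1.

```python
from typing import Dict, List


def min_delta_tm(homodimer_tm: float, heterodimer_tms: List[float]) -> float:
    """Smallest Tm gap between a probe's homodimer and any of its heterodimers."""
    if not heterodimer_tms:
        raise ValueError("at least one nonspecific match is required")
    return min(homodimer_tm - tm for tm in heterodimer_tms)


def select_probes(probes: Dict[str, dict], n_desired: int) -> List[str]:
    """Order the probes of one target by decreasing minimum delta-Tm; keep the top n.

    Assumption: a larger minimum delta-Tm means a lower potential for
    cross-hybridization, so those probes are preferred.
    """
    ranked = sorted(
        probes,
        key=lambda name: min_delta_tm(probes[name]["homo"], probes[name]["hetero"]),
        reverse=True,
    )
    return ranked[:n_desired]


# Hypothetical Tm values (degrees C) for three probes of one target,
# each with four nonspecific matches (the first four FASTA hits).
example = {
    "probe_1": {"homo": 80.0, "hetero": [60.0, 55.0, 50.0, 48.0]},
    "probe_2": {"homo": 78.0, "hetero": [70.0, 40.0, 35.0, 30.0]},
    "probe_3": {"homo": 82.0, "hetero": [58.0, 57.0, 50.0, 45.0]},
}
print(select_probes(example, 2))
```

In this toy example, probe_2 has the best average separation from its nonspecific matches but is penalized by a single close match, which is the behavior the minimum-difference criterion is meant to capture.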
FIG. 1. Flow diagram for designing a library of probes for all the target sequences of the C. acetobutylicum genome and selecting several probes per target to be experimentally tested. A target sequence is any sequence in the genome for which a probe has to be designed. The total number of target sequences is represented by n_t. A background sequence is any sequence in the genome for which a probe will not be designed. Subscript i indicates a particular target sequence, and subscript j indicates a particular probe; thus, probe_ij is the jth probe designed for the ith target sequence, and the total number of probes per target is denoted by np_i. The total number of selected nonspecific matches for each probe is denoted by ns_ij, and subscript k is used to denote a particular nonspecific match for probe_ij. Homodimer_ij is the dimer formed by probe_ij and its complementary sequence (i.e., its target), whereas heterodimer_ijk is the dimer formed by probe_ij and the complementary sequence of its kth nonspecific match. The difference in Tm between a homodimer and a heterodimer in a pair is represented by ΔTm. The number of desired probes per target to be tested is represented by np_i.
RNA isolation and labeling.
Cell samples were treated as described previously (1) and stored at −85°C. Prior to RNA isolation, cells were washed in 1 ml of SET buffer (1), centrifuged at 5,000 x g for 10 min at 4°C, and processed as described previously (1) but with the following modifications. Proteinase K (4.55 U/ml; Roche Applied Science, Indianapolis, IN) was added to the buffer, and the mixture was incubated for 6 min, followed by 4 min of subjection to a continuous vortex with glass beads (Sigma, St. Louis, MO) at room temperature; the RNeasy mini kit was used according to the instructions of the manufacturer (QIAGEN, Valencia, CA), and genomic DNA contamination was minimized by incubating buffer RW1 (1) at room temperature for 4 min; isolated RNA was eluted in 30 to 40 µl of RNase-free water. RNA samples for microarray hybridizations were labeled with the cyanine dye Cy3 or Cy5 (GE Healthcare Bio-Sciences, Piscataway, NJ) by using an indirect labeling protocol (1). Two mRNA pools were used for all experiments: pool A was created by mixing equal amounts of mRNA samples from wild-type flask cultures sampled at OD600 of 1.09, 1.8, 2.6, and 2.0, whereas pool B was composed of equal amounts of mRNA samples from strain M5 flask cultures sampled at OD600 of 0.454, 0.868, 1.36, 2.40, 3.20, and 4.20. The integrity of the mRNA was tested using a Bioanalyzer 2100 (Agilent, Palo Alto, CA).
Microarray hybridization, scanning, spot quantitation, and intensity normalization.
Spotted cDNA arrays were hybridized as described previously (1). After hybridizing different amounts of labeled material on a total of 10 design II arrays (see "ChIP-on-chip-capable probes" in Results), we determined that the best results were obtained using 0.75 µg of labeled material (data not shown), and this amount was used for all subsequent hybridizations of oligomer arrays. All oligomer arrays were hybridized and washed per Agilent's recommendations except that incubation was at 65°C for 17 h. Scanning was performed as described previously (1). Spot intensities were quantitated using Agilent's Feature Extraction software version 8.5 for the first set of experiments. Normalization and averaging of slide values were carried out as described previously (1) except that intensity ratios (calculated by comparing results for M5 and wild-type strains) and the mean intensities of probes corresponding to the same target were calculated after normalization.
Experimental probe selection.
The final selection of DNA microarray probes was carried out by analyzing the intensities, minus the background, of the probes coming from the in silico procedure diagrammed in Fig. 1 (design I; see "Computational design and selection of probes"), together with those of an additional set of probes (design II). This second set of probes was chosen to evaluate the potential of the probes for chromatin immunoprecipitation (ChIP)-on-chip applications (see "ChIP-on-chip-capable probes" in Results). Probe performance was evaluated by hybridizing a total of eight slides (four for design I and four for design II). For each design, two pairs of slides were employed in a dye swap configuration (5, 12) to account for gene-specific dye bias and technical replication effects. By using the procedure detailed in Fig. 2, each mRNA pool (A and B) was used to contribute a probe to the final array design (design III). To do so, the median of the ranks of all experimentally tested probes for a given target and mRNA pool was calculated, and the probe with the rank closest to the median was selected.
FIG. 2. Flow chart detailing the process of selecting two probes per target by using two-color microarrays. Two different mRNA pools (A and B) representing two different conditions or phenotypes are used to maximize the number of targets expressed. Subscript s is used to refer to an mRNA pool, n_t represents the total number of target sequences, subscript i is used to denote one of the n_t target sequences, and subscript j indicates a particular probe. To account for target-specific dye bias, a dye swap configuration is needed, and to account for technical replication variability, several slides are required. We represent the total number of arrays hybridized as N, and subscript z is used to refer to a particular array. We use ri_ijsz to indicate the ranked intensity, minus the background, of the jth probe against the ith target as measured on the zth slide on the channel containing the sth mRNA pool. Intensities, minus the background, were sorted in increasing order; a rank of zero was assigned to the first member of the sorted list, whereas a rank of 100 was assigned to the last member of the list, and the ranks of the remaining members of the list were proportional to their ordinals on the sorted list.
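The ranking and median-based selection summarized in Fig. 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in this work: the function names and the toy intensities are hypothetical, ties in intensity are broken by input order, and when two probes are equally close to the median the first one listed is returned.

```python
from statistics import median
from typing import Dict, List


def percentile_ranks(intensities: Dict[str, float]) -> Dict[str, float]:
    """Map background-subtracted intensities from one array channel to ranks in [0, 100].

    The lowest intensity receives rank 0, the highest receives rank 100, and the
    remaining ranks are proportional to the position in the sorted list.
    """
    ordered = sorted(intensities, key=intensities.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.0}
    return {name: 100.0 * i / (n - 1) for i, name in enumerate(ordered)}


def mean_rank(ranks_per_array: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each probe's rank over the arrays that carried the same mRNA pool."""
    names = ranks_per_array[0].keys()
    n_arrays = len(ranks_per_array)
    return {name: sum(r[name] for r in ranks_per_array) / n_arrays for name in names}


def closest_to_median(avg_rank: Dict[str, float], target_probes: List[str]) -> str:
    """Return the probe of one target whose averaged rank is closest to the median rank."""
    med = median(avg_rank[p] for p in target_probes)
    return min(target_probes, key=lambda p: abs(avg_rank[p] - med))


# Toy data: five probes on one channel; probes a, c, and e belong to the same target.
channel = {"a": 10.0, "b": 500.0, "c": 120.0, "d": 40.0, "e": 900.0}
ranks = percentile_ranks(channel)
print(ranks)
print(closest_to_median(ranks, ["a", "c", "e"]))
```

Running the selection once per mRNA pool yields one candidate probe per pool for each target, which mirrors the way each pool contributes a probe to the final design.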
FIG. 3. Most relevant properties of the library of probes generated in the first step of our microarray design. (A) Distribution of the numbers of probes per target. An average of 32 different probes per target was obtained. (B) Tm distribution. As different programs use different methods and/or sets of constants to calculate the Tm of a probe, all of the Tms were recalculated using Hybrid 2.5 (16) as described in reference 32. (C) G+C content distribution.
ChIP-on-chip-capable probes.
At the time of the creation of this 60-mer array, we considered the possibility of a hybrid design capable of transcriptional profiling and ChIP-on-chip DNA array (3, 27) applications by employing probes targeting the region closest to the translational start site of each gene instead of the gene's promoter region. The detailed calculations supporting the feasibility of such an approach will be presented elsewhere, along with the experimental data. In the context of this work, suffice it to say that any probes located more than 500 bp from the beginning of the target sequence and those located past the half point of the target sequence were discarded and that the first three probes in this restricted sorted list were selected for experimental testing. We refer to this set of probes as design II.
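The positional filter that defines design II can be expressed as a short Python sketch. Several details are assumptions made only for illustration: the position of a probe is taken to be the offset of its first base from the start of the target, the sort key (`score`) stands in for whatever criterion ordered the probe list, and the `Probe` type and function name are hypothetical.

```python
from typing import List, NamedTuple


class Probe(NamedTuple):
    name: str
    start: int    # assumed: offset (bp) of the probe's first base from the target start
    score: float  # hypothetical sort key; larger is assumed to be better


def design_ii_candidates(
    probes: List[Probe],
    target_length: int,
    max_offset: int = 500,
    n_keep: int = 3,
) -> List[Probe]:
    """Discard probes beyond 500 bp or past the half point, then keep the first three."""
    kept = [
        p for p in probes
        if p.start <= max_offset and p.start <= target_length / 2
    ]
    kept.sort(key=lambda p: p.score, reverse=True)
    return kept[:n_keep]


# Toy example: a 600-bp target, so the half point (300 bp) is the binding limit.
library = [
    Probe("p1", 10, 5.0),
    Probe("p2", 350, 9.0),  # past the half point, discarded
    Probe("p3", 200, 7.0),
    Probe("p4", 100, 6.0),
    Probe("p5", 50, 1.0),
]
print([p.name for p in design_ii_candidates(library, target_length=600)])
```

For short targets the half-point rule is the stricter of the two conditions, whereas for targets longer than 1,000 bp the 500-bp rule takes over.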
Probes common between designs I and II.
The two sets of probes are not mutually exclusive. In fact, designs I and II share 6,745 probes corresponding to 3,119 targets. A total of 797 of the 3,916 targets do not have a probe that is common between the two designs. Moreover, every target has at least one probe, while 99.5% of the targets are represented by two probes or more in each design.
Degree of probe replication (designs I and II).
One of our self-imposed limitations was the use of the 22K Agilent array format. These arrays could accommodate up to 21,495 user-designed 60-mers. This limitation prevented us from having two features (spots) for each of the three previously selected probes per target, as this scenario would require 23,496 features (23,496 = 3,916 [targets] x 3 [probes/target] x 2 [features/probe]). Design I contains 9,842 probes in duplicate and 1,811 single probes, whereas design II contains 9,848 probes in duplicate and 1,799 single probes. For either design, the probes which were represented by a single feature were chosen randomly.
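The feature counts quoted above can be checked with a few lines of Python. The numbers are taken directly from the text; the variable and function names are ours and serve only this illustration.

```python
CAPACITY_22K = 21_495      # user-designed 60-mers available on the 22K format
TARGETS = 3_916
PROBES_PER_TARGET = 3


def features_used(duplicated: int, single: int) -> int:
    """Features occupied when some probes are printed twice and the rest once."""
    return 2 * duplicated + single


# Printing every selected probe in duplicate would exceed the capacity.
full_duplication = TARGETS * PROBES_PER_TARGET * 2

design_i = features_used(duplicated=9_842, single=1_811)
design_ii = features_used(duplicated=9_848, single=1_799)

print(full_duplication, full_duplication - CAPACITY_22K)
print(design_i, design_ii)
```

Both designs fill the array exactly, and full duplication of three probes per target would have required 2,001 more features than the format provides.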
Experimental selection of probes.
The general procedure that we devised for the selection of the final probe set based on experimental data is depicted in Fig. 2. To maximize the number of expressed targets, we used two mRNA pools, one coming from the wild-type strain and another from strain M5. The intensities, minus the background, from two pairs of slides per design hybridized in a dye swap configuration were ranked, and those from the same probe and the same mRNA pool were averaged as described in Materials and Methods. Among all probes for each target, the median rank per mRNA pool was calculated, and the probe with the rank closest to this median was chosen. According to this procedure, each mRNA pool independently provides a probe candidate. In some cases, these probes may be the same, and in the absence of other information, we suggest the selection of the probe with the rank second closest to the higher of the two medians as the second probe for that target (Fig. 2). In our case, given the availability of an alternative microarray platform (1), we selected the probe with the rank second closest to the median of the mRNA pool with the highest-ranked intensity for that target in our cDNA arrays. The probes selected according to this procedure are referred to as design III.
Contribution of designs I and II to design III.
Table 1 shows the contributions of each program to designs I, II, and III and reveals that the majority of probes were designed by CommOligo (14) and OligoWiz (41). Fifty percent of the probes in design III are common between designs I and II, and the remaining 50% are equally distributed between design I and design II.
TABLE 1. Number of oligomers generated by each program
Assessing the level of nonspecific hybridization.
Although every probe has been designed to ensure the minimum amount of cross hybridization with any other sequence occurring in the C. acetobutylicum genome, the cumulative effect of the cross hybridization of all labeled cDNA elements may not be negligible. By following the approach used previously for our cDNA microarrays (1), we obtained an estimation of the level of cross hybridization by using the signal intensities coming from the labeled cDNA obtained from mRNA pool B. This pool is made up of mRNAs from strain M5 (6, 7), which has lost the 178-gene-long pSOL1 plasmid. By studying the distribution of the intensities of the 321 probes directed towards these missing pSOL1 targets, we could assess the level of cross hybridization that may be expected for any probe when its target is not expressed. Figure 4 shows that under the experimental conditions used, around 95% of the probes for pSOL1 targets exhibited an intensity, minus the background, of 50 U or less when their target transcripts were not present. We will refer to this value of 50 U as the threshold of expression.
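One way to turn the distribution of absent-target intensities into a threshold is a nearest-rank percentile, sketched below in Python. This is an assumption about how such a cutoff could be computed, not a description of how the 50-U value was obtained in this work; the function names and the toy intensities are hypothetical.

```python
import math
from typing import Sequence


def expression_threshold(absent_target_intensities: Sequence[float],
                         coverage: float = 0.95) -> float:
    """Nearest-rank percentile of intensities measured for probes whose targets are absent.

    Returns the smallest observed intensity such that at least `coverage` of the
    absent-target probes have an intensity at or below it.
    """
    if not absent_target_intensities:
        raise ValueError("at least one intensity is required")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(absent_target_intensities)
    k = math.ceil(coverage * len(ordered))
    return ordered[k - 1]


def is_expressed(intensity: float, threshold: float) -> bool:
    """A target is called expressed only if its intensity exceeds the threshold."""
    return intensity > threshold


# Toy background-subtracted intensities for 20 probes whose targets are absent.
toy = [float(v) for v in range(1, 21)]
cutoff = expression_threshold(toy)
print(cutoff, is_expressed(25.0, cutoff), is_expressed(10.0, cutoff))
```

With the real data, the input would be the M5-channel intensities of the probes directed against the pSOL1 targets.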
FIG. 4. Distribution of intensities, minus the background, of the array probes on the M5 channel for pSOL1 genes (solid bars) and chromosomal genes (open bars) when 1 µg of labeled cDNA was used. Intensities are expressed in units.
FIG. 5. Reproducibility of expression ratios measured by the duplicate probes of the final array (design III). All duplicate probes are shown regardless of their mean intensity values. The regression line between ratios has a slope of 0.9418, an intercept (x = 0) of 0.0022, and an R² value of 0.8881.
FIG. 6. Consistency between our previous cDNA platform and the probes from our final array (design III). The three outer rings represent the chromosomal genes, whereas the three inner rings represent the pSOL1 genes. For each set of rings, the central ring shows the ratio measured using the cDNA array whereas the other two rings present the ratios obtained using the two different probes in the oligomer array. Gray segments indicate probes (either cDNA or oligomer) with intensities below the mean intensity cutoffs of 300 U for cDNA probes and 50 U for oligomer probes. White segments on the cDNA rings indicate open reading frames not previously covered in our array. For those targets with only one probe on the array, the corresponding segment in either the external or internal ring is white. Ratios were calculated as the M5 value divided by the wild-type value; saturated red indicates a ratio of 3 or greater, black indicates a ratio of 1, and saturated green indicates a ratio of 1/3 or smaller. Quantitative data for this figure can be found in the supplemental material.
FIG. 7. Percentages of similarity between the probes from designs I and II and their four nonspecific matches. The percentage of similarity between each probe and each one of its four highest-scoring nonspecific matches returned by FASTA was calculated by using the rigorous Needleman-Wunsch global alignment algorithm as implemented in EMBOSS (29). Despite allowing the probe generation programs a maximum similarity of up to 80%, the bulk of the probes presented a similarity to their nonspecific matches of 70% or less.
Estimation of the level of cross hybridization.
One of the most promising applications of the large amount of data generated by high-throughput methods is the generation of new knowledge by using data-mining techniques. When dealing with microarray data, these techniques require the use of ratios corresponding to genes that are truly expressed. An indication of the minimum observed intensity of a probe when its target sequence is truly expressed can be obtained by spiking a selected set of targets (25) or by obtaining an experimental measure of the intensity that can be attributed to nonspecific cross hybridization. As in the previous study (1), we chose the second option by using an mRNA pool from a strain (M5) resulting from a significant deletion event (the loss of the 178-gene-long pSOL1 megaplasmid) and then measuring the intensities of the probes corresponding to the deleted targets (Fig. 4). We then used this information to calculate a threshold value above which it can be safely assumed that a gene is expressed and that its ratio contains meaningful information. Although this strategy may seem specific for C. acetobutylicum, similar strategies can be devised for other organisms whenever it is clear that a group of genes is expressed under some culture conditions but not under other conditions. Examples of such groups of genes would include genes related to motility and chemotaxis or to the catabolism of unusual substrates. In the former case, one would compare signal intensities from two mRNA pools, one from the motile and the other from the nonmotile stage of a culture, and a value for the expression threshold could be estimated. In general, a threshold of expression can be calculated whenever clearly distinguishable phenotypic traits or metabolic pathways are uniquely and robustly related to the expression of a relatively large set of genes. For a discussion about the use of cDNA to calculate the expression threshold, see the supplemental material.
Use of more than one probe per target.
Our strategy of creating a library of microarray probes allows us to increase the number of probes per gene as the price per feature decreases, without the need to generate an entirely new set of probes. For instance, Agilent's 44,000-element array allows the user to specify the contents of up to 42,034 features or, equivalently, 14,011 probes printed in triplicate. In our case, and after discounting the 126 features needed for targets with one (four targets), two (seven targets), or three (eight targets [CAC0624, CAC1087, CAC1112, CAC1288, CAC1704, CAC2811, CAC3175 and CAP0170]) probes, we could print 2,278 targets in quadruplicate and the shortest 1,619 targets in triplicate, leaving only one unoccupied feature {42,033 = (4 [targets] x 1 [probe/target] x 3 [features/probe] + 7 [targets] x 2 [probes/target] x 3 [features/probe] + 8 [targets] x 3 [probes/target] x 3 [features/probe]) + 2,278 (targets) x 4 (probes/target) x 3 (features/probe) + 1,619 (targets) x 3 (probes/target) x 3 (features/probe)}. Use of more than one probe per target would make it possible to quickly check the consistency of the expression results for each and every gene based on the principle that properly designed probes targeting the same gene should yield similar expression ratios (Fig. 5). As the number of microarray expression data sets capturing different phenotypes increases, probes reporting conflicting results can be singled out, and once the true level of expression of the target is determined through quantitative reverse transcription-PCR, the best probes can be identified and the array design can be revised. Moreover, this method would make it possible to revisit and refine any previously obtained transcriptional data instead of simply writing off inconsistent results for a particular target.
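The feature budget for the 44,000-element format can be verified with the following Python snippet. All counts come from the text; the variable names and the grouping of targets into (number of targets, probes per target) pairs are ours.

```python
CAPACITY_44K = 42_034   # user-specified features on the 44,000-element array
REPLICATES = 3          # features printed per probe

# (number of targets, probes per target)
groups = [
    (4, 1),       # targets with a single probe in the library
    (7, 2),       # targets with two probes
    (8, 3),       # targets with three probes
    (2_278, 4),   # targets represented by four probes
    (1_619, 3),   # shortest targets, represented by three probes
]

total_targets = sum(t for t, _ in groups)
small_library_features = sum(t * p * REPLICATES for t, p in groups[:3])
total_features = sum(t * p * REPLICATES for t, p in groups)

print(total_targets, small_library_features, total_features,
      CAPACITY_44K - total_features)
```

The groups account for all 3,916 targets and occupy 42,033 features, one fewer than the number available.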
Published ahead of print on 25 May 2007.
Supplemental material for this article may be found at http://aem.asm.org/.