PBIL

ROSO
a friend for your design

BBE contribution to PBIL in Lyon, France

Abstract
References for energy values of nearest neighbor model used by ROSO
Methodology for a multiple run use of ROSO

Abstract

Until recently, gene regulation and function were analyzed through step-by-step studies of a single gene. With the increasing amount of genomic sequence available, microarray technology now makes it possible to study the expression of hundreds of thousands of genes in a single experiment, and it appears to be a powerful tool for a more global investigation of biological questions. To take advantage of this emerging technology, microarray users need to define an appropriate probe design strategy very carefully. Either PCR-amplified fragments (> 100 bases) or synthetic oligonucleotides (25-70 bases) can be used as probes. Compared to long PCR probes, oligonucleotide probes make it possible to target a specific region of the transcript, thus minimizing non-specific hybridization events. However, choosing an optimal oligonucleotide probe set is a time-consuming multicriteria problem that requires a complete bioinformatic analysis of the genes. We have therefore developed the ROSO software. ROSO is a French acronym for "Recherche et Optimisation de Sondes Oligonucléotidiques" (search and optimization of oligonucleotide probes for microarrays).

ROSO requires two kinds of input files in FASTA format:
  • an interest file, which contains the sequences of all the genes to be spotted on the chip;
  • an optional external file, which contains the sequences of any genes that are not to be spotted. It lets the user explicitly avoid cross-hybridization with known genes.
    As an example, if the interest file contains a subset of the CDSs of one organism, the other CDSs, as well as pseudogenes and intergenic regions, might be added to the external file.

ROSO lets users set the following parameters before starting the probe selection:
  • probe size and orientation (either reverse complementary or identical),
  • the number of probes per gene (overlapping or not),
  • absence of sequences AAAA, TTTT, CCCC and GGGG within the probes,
  • target and ionic concentrations,
  • hybridization temperature,
  • threshold values for secondary structures,
  • probe localization on genes.


    The optimization process can be briefly described as follows:
    (1) The interest file is reduced in three ways.
    First of all, ROSO removes any putatively identical genes, defined here as any pair showing more than 98 % homology over more than 100 bases, or 95 % for EST sequences. For each group of identical genes, only the longest is kept in the interest file.
    Secondly, the user can restrict probe localization to a range of n nucleotides from the 3' or 5' end of the genes.
    Finally, probes containing an undetermined nucleotide (N) are eliminated, and the user can choose whether or not to accept sequences with short runs of four identical nucleotides (AAAA, TTTT, CCCC and GGGG). Such repeated sequences might facilitate cross-hybridization and are difficult to synthesize.
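
The last filter of step (1) can be illustrated with a minimal sketch. It is not ROSO's own code: the function name `candidate_probes` and its parameters are hypothetical, and the sketch only assumes that every window of the chosen probe size is a candidate.

```python
import re

# Runs of four identical bases, as listed in the text (AAAA, TTTT, CCCC, GGGG).
HOMOPOLYMER_RUN = re.compile(r"A{4}|T{4}|C{4}|G{4}")


def candidate_probes(gene, probe_size, allow_runs=False):
    """Yield (start, probe) for each window of probe_size bases that passes the filters.

    Hypothetical helper: windows containing an undetermined base (N) are always
    dropped; windows containing a four-base run are dropped unless allow_runs is True.
    """
    gene = gene.upper()
    for start in range(len(gene) - probe_size + 1):
        probe = gene[start:start + probe_size]
        if "N" in probe:
            continue  # undetermined nucleotide
        if not allow_runs and HOMOPOLYMER_RUN.search(probe):
            continue  # AAAA, TTTT, CCCC or GGGG present
        yield start, probe
```

Setting `allow_runs=True` corresponds to the user accepting the short repetitions.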

    (2) Putative cross-hybridizations are checked with the BLAST program by comparing each gene of the interest file against the interest file itself and, if one was provided, the external file. BLAST parameters were estimated on simulated data sets so as to detect a minimal identity of 70 % over 20 bases.
    This phase ends with the computation of the highest homology found for each putative probe, which ranges from 60 % (specific area) to 100 % (strictly homologous area).

    (3) The software then searches for stable secondary structures (hairpin and homoduplex) in the probe list. Such structures act as a barrier to hybridization between the probe and its target. Each secondary structure conformation is analyzed and the corresponding free energy is computed. All the probes that may fold into stable secondary structures (i.e. whose free energy is more negative than a given threshold) are removed from the probe list.

    (4) The melting temperature (Tm) of all the remaining probes is estimated using the nearest-neighbor thermodynamic model. The algorithm then iteratively selects the probes (at least one per gene) so as to keep Tm variability minimal across the whole probe set.
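
Step (4) can be sketched as follows, under stated assumptions. The nearest-neighbor values are the unified parameters of SantaLucia (1998), the reference cited below for Tm calculation. The function names, the default concentrations, the two-state Tm formula with a Ct/4 term and the sodium entropy correction are assumptions of this sketch, not ROSO's documented settings. The selection routine is a simple greedy heuristic standing in for ROSO's iterative selection, whose exact rule is not described here.

```python
import math

R = 1.987  # gas constant in cal/(mol*K)

# Unified nearest-neighbor parameters (SantaLucia 1998):
# dinucleotide read 5'->3' -> (dH in kcal/mol, dS in cal/(mol*K)).
NN = {
    "AA": (-7.9, -22.2), "TT": (-7.9, -22.2),
    "AT": (-7.2, -20.4), "TA": (-7.2, -21.3),
    "CA": (-8.5, -22.7), "TG": (-8.5, -22.7),
    "GT": (-8.4, -22.4), "AC": (-8.4, -22.4),
    "CT": (-7.8, -21.0), "AG": (-7.8, -21.0),
    "GA": (-8.2, -22.2), "TC": (-8.2, -22.2),
    "CG": (-10.6, -27.2), "GC": (-9.8, -24.4),
    "GG": (-8.0, -19.9), "CC": (-8.0, -19.9),
}


def nn_tm(probe, strand_conc=1e-6, na_conc=1.0):
    """Estimate Tm in degrees Celsius for a probe made of A, C, G and T only.

    Assumptions: two-state melting of a non-self-complementary duplex,
    total strand concentration strand_conc (mol/L), sodium concentration na_conc (mol/L).
    """
    probe = probe.upper()
    dh, ds = 0.0, 0.0
    for end in (probe[0], probe[-1]):  # initiation terms for each terminal pair
        h, s = (0.1, -2.8) if end in "GC" else (2.3, 4.1)
        dh, ds = dh + h, ds + s
    for i in range(len(probe) - 1):  # sum over stacked neighbor pairs
        h, s = NN[probe[i:i + 2]]
        dh, ds = dh + h, ds + s
    ds += 0.368 * (len(probe) - 1) * math.log(na_conc)  # salt correction on entropy
    return dh * 1000.0 / (ds + R * math.log(strand_conc / 4.0)) - 273.15


def select_homogeneous(tm_by_gene, max_iter=20):
    """Pick one probe per gene, greedily, so that the chosen Tm values stay close together.

    tm_by_gene maps gene -> {probe_id: Tm}. The target Tm starts at the mean of all
    candidates and is re-centered on the chosen probes until the choice is stable.
    """
    all_tms = [tm for probes in tm_by_gene.values() for tm in probes.values()]
    target = sum(all_tms) / len(all_tms)
    chosen = {}
    for _ in range(max_iter):
        new = {gene: min(probes, key=lambda pid: abs(probes[pid] - target))
               for gene, probes in tm_by_gene.items()}
        if new == chosen:
            break
        chosen = new
        target = sum(tm_by_gene[g][p] for g, p in chosen.items()) / len(chosen)
    return chosen
```

A probe attached to a chip does not behave exactly like an oligonucleotide in solution, so the absolute Tm values from such a formula are best read as a way to compare probes with each other.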

    (5) The final step is the selection of the optimal probe set. It is based on four stability criteria:
  • G+C content (preferably between 40 and 65 %),
  • first and last bases (preferably a G or a C),
  • repetitions (avoid GGG or CCC strings),
  • and free energy (GC clamp at both ends).
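
As an illustration of step (5), the sketch below counts how many of the first three preferences a probe satisfies. The scoring scheme and the name `stability_score` are assumptions made for this example; the text does not say how ROSO weights the criteria, and the free-energy criterion is left out because its calculation is not detailed here.

```python
import re


def stability_score(probe):
    """Return the number of preferences met by a probe (0 to 3).

    Hypothetical scoring: one point each for G+C content between 40 and 65 %,
    a G or C at both ends, and the absence of GGG or CCC strings.
    """
    probe = probe.upper()
    gc = 100.0 * sum(base in "GC" for base in probe) / len(probe)
    checks = [
        40.0 <= gc <= 65.0,                       # G+C content in the preferred range
        probe[0] in "GC" and probe[-1] in "GC",   # first and last bases are G or C
        re.search(r"GGG|CCC", probe) is None,     # no GGG or CCC string
    ]
    return sum(checks)
```

Among the candidates remaining for a gene, the best one under this scoring could then be taken with `max(candidates, key=stability_score)`.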


    The originality of our optimization process lies mainly in the separation between the specificity analysis (using BLAST) and the probe selection. BLAST finds word matches of a minimal size before extending them. Running BLAST with oligonucleotide probes against genes therefore loses information, compared to running BLAST between whole genes, whenever a word matches outside a potential probe.
    Moreover, since microarrays involve the parallel hybridization of many probes, the thermodynamic properties need to be homogeneous across the entire probe set. The structure of ROSO isolates the time-consuming BLAST step, which makes it possible to perform an efficient iterative selection of the probes in order to ensure thermodynamic uniformity. This progressive refinement of the probe selection also makes it possible to introduce different localization criteria or secondary structure thresholds. Models of such multi-criteria searches are proposed in: Methodology for a multiple run use of ROSO.



    References for energy values of nearest neighbor model used by ROSO

    Energy values of nearest-neighbor model used for calculation of melting temperature

    - SantaLucia, J. (1998). A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences of the USA, 95: 1460-1465.

    Energy values of nearest-neighbor model used for calculation of secondary structures

    - Breslauer, K., Frank, R., Blöcker, H. and Marky, L. (1986). Predicting DNA duplex stability from the base sequence. Proceedings of the National Academy of Sciences of the USA, 83: 3746-3750.
    - Freier, S., Kierzek, R., Jaeger, J., Sugimoto, N., Caruthers, M., Neilson, T. and Turner, D. (1986). Improved free-energy parameters for predictions of RNA duplex stability. Proceedings of the National Academy of Sciences of the USA, 83: 9373-9377.
    - Groebe, D. and Uhlenbeck, O. (1988). Characterization of RNA hairpin loop stability. Nucleic Acids Research, 16: 11725-11735.


    Guideline for a "multiple run" use of ROSO

    Our optimization process takes different criteria into account through successive queries (Tm, secondary structures, location on the gene and homology).
    Faced with "problematic genes" (paralogs, gene families, alternative transcripts...), the user may then obtain different probes depending on the relative importance given to these criteria.

    The process described below requires five runs, which are described in detail further on. More complex processes might, however, be developed.
    First, the user has to determine the GC content of the data set. Depending on this content, different threshold values for the hairpin stability calculation are proposed.

    Do not forget to save all your result files on your computer after every run!

    Run description

    Run 1 Estimate the GC content of the studied genome and choose the appropriate secondary structure thresholds. Run ROSO as you would for a single run, with your own other settings (research settings, probes settings, solution concentrations, hybridization temperature and selected area research).
    Run 2 Save the results of run 1. Concatenate the two files roso_no.blast (the genes without probes, in FASTA format) and roso_sup.blast (the genes whose probes show putative cross-hybridization, in FASTA format) to create the "run2" file. Then run ROSO with this file as the interest file and with the appropriate secondary structure thresholds (see below).
    Runs 3 to 5 Proceed as in run 2 to create the "run3", "run4" and "run5" files. Then run ROSO with these files and the new secondary structure thresholds (see below).
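
The concatenation step can be done with any text tool. As one possible sketch, the hypothetical helper below joins roso_no.blast and roso_sup.blast from a saved result folder into the interest file of the next run. The folder layout and the function name are assumptions; only the two file names come from the description above.

```python
from pathlib import Path


def build_next_interest_file(result_dir, out_name):
    """Concatenate roso_no.blast and roso_sup.blast into result_dir/out_name.

    Hypothetical helper: assumes both files were saved in result_dir after the
    previous run. A newline is added between the files if the first lacks one,
    so that the FASTA records do not run together.
    """
    result_dir = Path(result_dir)
    parts = []
    for name in ("roso_no.blast", "roso_sup.blast"):
        text = (result_dir / name).read_text()
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    out_path = result_dir / out_name
    out_path.write_text("".join(parts))
    return out_path
```

For example, `build_next_interest_file("run1_results", "run2")` would write the "run2" file next to the saved results of run 1, where "run1_results" is a placeholder folder name.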

    Proposed Scheme

    Run   Hairpin (kcal/mol)     Homoduplex (kcal/mol)   Size of the probes
          GC<55%    GC>=55%      GC<55%    GC>=55%
    1      0         -4           -6        -10          initial size
    2     -2         -8           -8        -14          initial size
    3     -4        -12          -10        -20          initial size
    4     -4        -12          -10        -20          initial size - 10 nucleotides
    5     -4        -12          -10        -20          initial size - 20 nucleotides
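
The proposed scheme can also be written as a small lookup, which may help when scripting the five runs. This is only a sketch: the names `SCHEME`, `gc_percent` and `run_settings` are hypothetical, and the GC content is computed here as a plain G+C percentage over all the sequences of the data set, which is an assumption about how it is estimated.

```python
# Per run: (hairpin, homoduplex) thresholds in kcal/mol for GC < 55 % ("low")
# and GC >= 55 % ("high"), plus the number of nucleotides removed from the
# initial probe size. Values copied from the proposed scheme.
SCHEME = {
    1: {"low": (0, -6), "high": (-4, -10), "shorten": 0},
    2: {"low": (-2, -8), "high": (-8, -14), "shorten": 0},
    3: {"low": (-4, -10), "high": (-12, -20), "shorten": 0},
    4: {"low": (-4, -10), "high": (-12, -20), "shorten": 10},
    5: {"low": (-4, -10), "high": (-12, -20), "shorten": 20},
}


def gc_percent(sequences):
    """Return the G+C percentage over all bases of the given sequences."""
    total = sum(len(seq) for seq in sequences)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return 100.0 * gc / total


def run_settings(run, gc, initial_size):
    """Return the thresholds and probe size proposed for a given run (1 to 5)."""
    row = SCHEME[run]
    hairpin, homoduplex = row["high"] if gc >= 55.0 else row["low"]
    return {
        "hairpin": hairpin,
        "homoduplex": homoduplex,
        "probe_size": initial_size - row["shorten"],
    }
```

For a data set at 50 % GC and an initial probe size of 50, `run_settings(1, 50.0, 50)` gives a hairpin threshold of 0, a homoduplex threshold of -6 and a probe size of 50.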


    INRA Laboratory of Functional Biology, Insects and Interactions
    Copyright ROSO 2004
    INSA Lyon