Team home page

Associated Team HELIX-USP

Report for 2007

During this third year of the associated team, except for the participation of Élise Billoir and Sandrine Charles to a conference in Brazil, we privileged longer visits from Brazilian researchers and students to France. A grouped visit of French researchers and students is planned for early 2008, around March, to coincide with the defense at Recife, Pernambuco, of one Brazilian PhD student, Paulo G. Fonseca, who has been since September 2005 in Lyon. This trip should also inaugurate a new project with Brazil and Chile, funded by STIC AmSud and which was accepted in July of this year (more detail on it below).

Work on genome rearrangements
We continued studying the theoretical aspect of the sorting by reversals problem, that is, the problem of finding an optimal sequence of reversals which transforms one genome into another. When only genomes without gene duplications are considered, there are efficient algorithms to find an optimal solution to this problem. However, it becomes much more complicated when duplications are allowed. To this more complex version, David Sankoff proposed the exemplar model (D. Sankoff, Genome Rearrangement with gene families. Bioinformatics, 15(11):909-917, 1999), which consists in searching, for each family of duplicated genes, an exemplar representative on each genome. In biological terms, the exemplar gene may correspond to the original copy of the gene, which later originated all the other copies. Following the parsimony principle, the exemplar choices should be made so as to minimise the reversal distance between the two simpler versions of both genomes, composed only by the exemplar genes. An alternative to the exemplar is the multigene family model, which consists in maximising the number of paired genes among a family. Again, the gene pairs should be chosen in order to minimise the reversal distance between the genomes. Both exemplar and multigene approaches were proven to lead to NP-hard problems.

In collaboration with the Brazilian researchers Said S. Adi, Cristina G. Fernandes, Carlos E. Ferreira, Fábio V. Martinez, Marco A. Stefanes, Christian Tjandraatmadja and Yoshiko Wakabayashi, another approach was proposed to the sorting by reversals problem with duplicated genes, called the Repetition-free Longest Common Subsequence (RFLCS). Given two strings, the RFLCS is the problem of finding the longest subsequence that contains no repeated symbols. In a paper accepted at the conference Latin-American Algorithms, Graphs and Optimization Symposium (LAGOS), 2007 ([9]), the RFLCS was proved, by polynomial reduction from a particular version of max 2-SAT, to be NP-hard even when the number of occurrences of a symbol in each genome is at most equal to 2. The paper presents also three simple approximation algorithms and an integer linear programming formulation (IP) for the RFLCS that were implemented by the Brazilians, together with some computational results obtained with such implementations. Variants of the RFLCS were also the topic of two papers, independent of the above one and that appeared, respectively, at approximately the same time and slightly later than ours, and that are authored by Guillaume Fertin, Stéphane Vialette and colleagues.

Work on sequence analysis
Work on sequence analysis, notably alignment and long repeats detection in whole chromosomes or genomes, continues with Alair Pereira do Lago who is visiting again the French group, this time for a longer period, from September 20 to December 13, 2007. Since the last report, the paper that had been submitted ([6]) with as further co-authors Pierre Peterlongo (ex-PhD student of the French coordinator and currently postdoc in SYMBIOSE) and a long-term Italian collaborator, Nadia Pisanti, has been accepted ([6]). A second paper, that was indicated as in preparation in the last report ([7]), is about to be submitted (with also a student of Alair, Gustavo Akio Tominaga Sacomoto, as participant) after a delay due to a desire to further improve the performance of the new algorithm, called Tuiuiu (Tuiuiu is the familiar Brazilian name of a bird). The previous work ([6]) had already led to an algorithm, Ed'Nimbus, that may be used through the web at this address.

Alair's visit to France, and the arrival in July 2007 of his ex-PhD student, Augusto Fernandes Vellozo, as a postdoc for at least one year with the French team, will allow us to both continue this work, extending the filter into a fully-fledged multiple alignment algorithm, and to expand it to other collaborations that were already briefly mentioned in the previous report. These concern mainly two PhD students in the French team, Claire Lemaitre and Marc Deloger. Claire works with the problem of the precise identification of breakpoints (points along a genome that were broken following a genomic rearrangement such as an inversion, transposition, translocation, duplication, deletion, fusion, or fission) and analysis of the sequences around the breakpoints. This is motivated by trying to find whether such regions have characteristics that would help understand the mechanisms underlying each different type of rearrangement. Marc on his turn is interested in systematically detecting all types of repeats, among which transposable elements (sequences of DNA that can move around to different positions within the genome of a single cell by a process called transposition) and satellites (highly repetitive DNA; satellites are also called tandem repeats and thus represent repeats that appear contiguously along a genome), and in examining their possible participation in the regulation of genes at the global level of a genome.

Work on metabolic networks
The collaboration between Cristina Gomes Fernandes, Vincent Lacroix and Marie-France Sagot on motifs in metabolic networks had previously led to two publications in 2005 and 2006 ([4], [8]). A third paper is now in preparation presenting a software called MOTUS, which implements the definitions and algorithms discussed in the previous papers as well as several major extensions. Previous work concerned exclusively motif search. Extensions include: 1. the possibility to extract motifs from a graph (only the size of the motif is given), 2. a score reflecting the statistical significance of a motif, 3. a clustering method to group the occurrences that share the same topology, and 4. a viewer to explore the results of the extraction. The development of this software involved also the participation of several additional people: Odile Rogier, Ludovic Cottret and Fabien Jourdan (UMR 1089 Xénobiotiques INRA-ENVT in Toulouse). The software is already accessible through a website as a web service (http://pbil.univ-lyon1.fr/software/motus/) and will also be released as a downloadable version with command-line options. The website is accessible using the login/password : baobab/baobab.

Another way of investigating the metabolic network of an organism consists in determining which metabolites it is able to produce and those it needs to find in its environment. This question is particularly crucial in the case of symbiosis, in particular endosymbiosis. Endosymbionts are organisms that live within the tissues of a host, either in the intracellular space or extracellularly.

In terms of the endosymbionts' metabolism, two main (related) questions are of interest: 1. what are the metabolites that an endosymbiont is capable of producing and 2. which metabolites does the endosymbiont require from its environment (the host) to produce one or a set of metabolites (called target(s)) that the host in turn may need in order to survive but is unable anymore to produce on its own. An example of the latter is the case of aphids that share a mutually beneficial relationship with the bacterium Buchnera aphidicola. The aphids in symbiosis with Buchnera rely on the bacteria to synthetise some essential amino acids that it cannot obtain on its own (they are absent in the phloem sap that the aphids eat). On its turn, Buchnera lacks, for instance, the genes to make the proteins that are needed for membrane construction, and thus depends on the aphid for shelter and protection against harsh environments and predators.

One classical way to answer to the first question (what are the metabolites that an endosymbiont is capable of producing?) is to determine, for each such metabolite, whether the organism is capable of synthetising the enzymes that catalyse the reactions involved in the reference pathways that have the metabolite as one of their end product. However, the reference is in general to one specific organism, namely E. coli. The pathway used for the synthesis of the metabolite may be different in another organism. Furthermore, alternative paths for obtaining the metabolite may exist that cross over different such reference pathways. A so-called "network expansion method" was developed by Handorf et al. (T. Handorf, O. Ebenhoh and R. Heinrich R., Expanding metabolic networks: scopes of compounds, robustness, and evolution. J Mol Evol., 61(4):498-512, 2005) that tries to address the problem of which metabolites in general an organism is capable of producing by identifying the subnetwork that may be reconstructed from the reference one starting from the metabolites the organism can obtain from its environment (input metabolites) and using all the reactions known (from genomic analysis) to be available in the organism. In the case of endocytobiotic bacteria, that is of bacteria permanently living in the cells of another organism, this method however can not in general be applied because it is very difficult to determine the set of nutrients the endocytobiot may obtain from its environment. This comes from the fact that endocytobiots can not be cultured outside the host and there is therefore no possibility to control the nutritional medium. This leads then to the second question: which metabolites does the endosymbiont require from its environment (in this case, the host) to produce one or a set of metabolites (called target(s)) that the host in turn may need in order to survive but is unable anymore to produce on its own? The input metabolites are called in this case "precursors". Identifying all such precursors, in fact minimal sets thereof, remains, as far as we know, a largely open question. The only current method for finding precursors available in the literature (P. R. Romero and P. Karp. Nutrient-related analysis of pathway/genome databases. in Pac Symp Biocomput, pages 471-482, 2001) has for objective solving inconsistencies in the reconstructed network of E. coli. The definition of a precursor in this case is thus limited to the metabolites that are not produced by any reaction. We have started addressing this problem with Fábio Martinez and shall be developing it further during his visit to France planned for the period from December 16, 2007 to February 10, 2008. Fábio will be joined by his student, Paulo Vieira Milreu, from January 6, 2008, to February 10. Other topics concerning metabolic networks (such as network comparison) will be addressed during this visit.

Work on genetic networks
A topic of research being carried out in the scope of the associated team as part of the PhD work of Paulo G. Fonseca (in France since Sept. 2005), co-supervised by a Brazilian professor, Katia Guimarães from the UFPE (Universidade Federal de Pernambuco) and ArcoIris' French coordinator, concerns the identification of transcription regulation modules, i.e. groups of co-regulated genes and their regulators. One important distinction of this work in relation to what is available in the literature is that we propose to identify modules that are evolutionarily conserved. From the biological point of view, the approach is supported by three main premises:

co-regulated genes are bound by common regulatory proteins (transcription factors-TFs) and so they must present common sequence patterns (motifs) in their regulatory regions, which correspond to the binding sites of those TFs,
co-regulated genes respond coordinately to certain environmental or growth conditions, and so they must be co-expressed under those conditions, and
since transcription modules are supposedly responsible for important biological functions, they are more subject to selective pressure and therefore they must be evolutionary conserved.

We thus define the concept of transcriptional regulation metamodules (TRMMs) as groups of genes sharing regulatory motifs and displaying coherent context-specific expression behaviour consistently across species. From the methodological perspective, we note that the incompleteness and elevated noise levels of currently available data impose severe limitations on the reliability of the conclusions that can be drawn through the analysis of one data type in isolation. Therefore we propose to analyse heterogeneous experimental data concerning several species simultaneously, with emphasis on genomic sequence and gene expression data.

This work has been presented in the form of a seminar, during a visit to the Institute of Mathematics and Statistics of the University of São Paulo (Brazil), in December 2006. At the occasion, we discussed this and related problems which might be the subject of further collaboration specially with Junior Barrera and Roberto Junior from the USP, who also do research in the field of expression data analysis. Earlier this year, during their visit to HELIX at Lyon, we had the opportunity to deepen our previous discussion, ending up with one key idea concerning the potential application of techniques previously used in the domain of Computer Vision (a longstanding field of work of these Brazilian collaborators) to gene co-expression data analysis in the presence of high noise levels. We are currently pursuing this track. This work is also benefiting from discussions with Christian Gautier.

A PhD thesis proposition including the bibliographic review, problem formalisation and expected results has been produced and defended earlier this year by Paulo G. Fonseca at Recife (as is the custom in Brazil). The PhD defence itself is expected for early 2008. Junior Barrera has been invited to sit in the PhD committee and has already confirmed his participation. A stand-alone Java application with graphical user interface is being developed as part of the work.

Also during our December 2006 visit to Brazil, Christian Gautier and Emmanuel Prestat gave a talk on "Some bioinformatics tools for studying biodiversity" at Campo Grande, and while at the USP, discussed with Junior Barrera and Roberto Marcondes of their work on pharmacogenomics and the use of Bayesian networks to represent knowledge and extract possible correlations from data coming from both transcriptomics and the genome of patients. In turn, Junior and Roberto visited the French team in June 2007 and presented their work on probabilistic genetic networks to model and analyse genetic networks.

Work on ecotoxicology
The visit of Élise Billoir from the French team to Aloysio da Silva Ferrão-Filho (Fiocruz, Rio de Janeiro) in December 2006 was very productive. After the seminar (on the topic of "Dynamic energy budgets theory applied to ecotoxicology" she presented to the group of Aloysio and discussions she had with them, a collaboration started between the two and Élise's PhD supervisor, Sandrine Charles. In 2007, the exchanges between these two subgroups intensified, leading to the writing of an abstract ([10]) that was submitted and accepted at the II Conference on Computational and Mathematical Population Dynamics (CMPD2) that took place in Campinas. Both Élise and Sandrine travelled to Brazil for the conference (trip funded with other money from other projects of the French team). They are currently writing a paper, to be submitted to the proceedings of the CMPD2 conference that will be published as a special issue of Journal of Theoretical Biology. It is also planned that in 2007 or early 2008, Aloysio will be visiting the French team.

Submission of a STIC AmSud project
A STIC AmSud project entitled "Models and Algorithms for Integrative Biology" was submitted with, in Brazil, the USP (coordinator, Roberto Marcondes), Campo Grande (coordinator, Said S. Adi) and Chile (coordinator, Eric Goles). This 2-year project was accepted, and granted for the first year 10000 euros by the INRIA for the French part, 5000 euros by the FAPESP for the USP, 8200 euros by the CAPES for Campo Grande, and 8500 by the Conicyt for Chile.

Reminder of references for 2005 and 2006

[4] V. Lacroix, C. G. Fernandes, and M.-F. Sagot, Reaction motifs in metabolic networks, Proceedings of WABI 2005 (Workshop on Algorithms in BioInformatics) (R. Casadio and G. Myers, eds.), Lecture Notes in BioInformatics (LNBI), topical subseries of Lecture Notes in Computer Siences, vol. 3692, Springer-Verlag, 2005, pp. 178-191.
[5] R. Tavares, S. S. Adi, P. Blayo, and M.-F. Sagot, Utopia: an exact generic core algorithm for gene prediction using homology, submitted.
[6] P. Peterlongo, N. Pisanti, F. Boyer, A. P. do Lago, M.-F. Sagot, Lossless filter for multiple repetitions, submitted.
[7] P. Peterlongo, N. Pisanti, A. P. do Lago, M.-F. Sagot, Lossless Filter for Long Multiple Repetitions with Edit Distance, to be submitted (a preliminary version appears as a Technical Report from the University of Pisa).
[8] V. Lacroix, C. G. Fernandes, M.-F. Sagot, Motif Search in Graphs: Application to Metabolic Networks, IEEE/ACM Trans. on Comp. Biology and Bioinformatics, 3:360-368, 2006.

References for 2007

[6] P. Peterlongo, N. Pisanti, F. Boyer, A. P. do Lago, M.-F. Sagot, Lossless filter for multiple repetitions, J. Discrete Algorithms, in press, 2007.
[9] S. S. Adi, M. D. V. Braga, C. G. Fernandes, C. E. Ferreira, F. V. Martinez, M.-F. Sagot, M. A. Stefanes, C. Tjandraatmadja, Y. Wakabayashi. Repetition-free longest common subsequence, accepted at Latin-American Algorithms, Graphs and Optimization Symposium (LAGOS), in press, 2007.
[10] E. Billoir, A. da Silva Ferrão-Filho, M. L. Delignette-Muller and S. Charles, The ecotoxicology of the zooplankton-cyanobacteria interaction. presented at II Conference on Computational and Mathematical Population Dynamics (CMPD2), 2007; to be submitted to J. Theoretical Biology.