Team home page

Associated Team HELIX-USP

Purpose

Main motivation

The research proposed will focus mainly on computational biology. The technological advances in biological related areas of recent years has allowed a huge amount of information to be produced. It is one of the main current challenges for biologists to be able to extract knowledge from all this data, and a challenge also for the computer scientists and mathematicians to participate with the biologists in this process. The analysis of this data involves, among many other issues, the formalization of problems and the search for efficient algorithms to solve them. This is a continuous process in the sense that the solutions produced by the algorithms will usually lead to new research that has for objective refining the models, and will therefore result in new problems and/or the development of increasingly more sophisticated algorithms. The main goal of the association between the HELIX and USP teams is to work on such types of questions: the formalization of problems coming from biology and the search for efficient algorithms to solve these problems.

Broad lines of investigation

Living organisms are extraordinarily complex systems which recent technological advances have enabled to start studying in a completely new fashion, at scales never contemplated before. Although mathematical models have since a long time been attempted for many biological phenomena, usually this was at a relatively small specialized level. Even then, the data was often missing to test, revise and refine such models. This data is now coming so fast that it is the models and their accuracy that are lagging increasingly behind. The coverage of the data is also extensive enough already that one may start considering to model larger and wider covering systems. This is totally essential if one hopes to get a truly good understanding of how complex organisms function and reproduce.

Although the speed of arrival of new data, and their overall size have often been mentioned as the main bottlenecks to the mathematical and computational modelling and treatment of such new information, the real bottleneck is situated elsewhere. It is in the relative lack of sophistication of currently existing computational tools for dealing even sometimes with small biological datasets. The mathematical and algorithmical theory is often not yet there that would enable to fully satisfactorily treat the complexity of living organisms.

The association between the HELIX and the USP teams aims precisely at addressing this point. The two teams gather people with interest in mainly three areas of computer science: computational biology, combinatorial optimization and graph theory. These areas are related to each other, as many problems in the first two are modelled and formalized as problems in graph theory. Also, many problems in computational biology are in fact optimization problems. Sharing knowledge within these areas is therefore an important step towards solving important practical problems, as well as towards developing the background to find more efficient solutions for such problems. We want to use the theoretical skills of both teams to improve some of the mathematical and algorithmical theory needed to get a better understanding of evolution and of the preservation of structures in living systems. We intend to work on four main topics: sequence analysis and models for molecular evolution (Part 1 of this project), large-scale comparative genomics and genome dynamics (Part 2), phylogeny (Part 3), and biochemical networks (Part 4). Each of these topics is already the object of research that is being conducted in one of the teams or both, either each team individually or in collaboration.

Part 1: Sequence analysis and models for molecular evolution

This part is basically concerned with identifying motifs of different types and sizes in molecular sequences. This identification requires also to have good models of the evolution of such sequences and this is therefore a third topic that appears as background to the other two.

The following subtopics will more particularly interest us in the context of an HELIX-USP collaboration:

1.1: Improving or investigating the various models for motifs that have been developed over the years. The motifs here include: complex motifs, that is motifs which may be composed of various parts separated by more or less constrained distances; footprint motifs, that is motifs that explicitely take a (known) phylogeny into account when trying to infer motifs from a set of orthologous sequences; finally, motifs that try to combine both the pattern and position weight matrix models for representing motifs.
1.2: Systematically investigating new approaches for motif inference, in particular by exploring probabilistic methods and general machine learning techniques, and more specifically analysing the feasibility of non supervised motifs learning using biclustering techniques adapted to motifs.
1.3: Extending the types of motifs that have been considered to treat long motifs (which may require filtering techniques specially adapted to the problem -- genes may be seen as very long motifs, split in the case of eukaryotes), motifs under an edit distance model, parameterized motifs (motifs which match under general functions), motifs appearing in permuted forms.
1.4: Systematically investigating new indexes for old and new motif models, in particular indexes for approximate motifs search and inference and for complex motifs.
1.5: Analysing various exact or approximate formulations of motif clustering, from the most elegant ones (such as bases of motifs already undergoing investigation in one of the teams) to heuristical clustering methods.
1.6: Improving the algorithms for multiple sequence alignment that underly motif search or inference methods, in particular, exploring new cost functions such as normalized edit distance (where the alignment quality of the sequences is measured by considering the average of the quality of the local alignment of the sequences).
1.7: Exploiting conservation differences, that is, differences in evolutionary constraints, to identify gene regulatory sequences. Algorithms will be considered for scanning sequences from different organisms for conserved equivalent segments and within these segments, to identify hyper conserved and hyper variable regions.

Part 2: Large-scale comparative genomics and genome dynamics

It has since long been known that genomes are not static. The work of Barbara Clintock in the late 40's showing that genes could jump spontaneously from one site to another was a first clear sign of this. ``Jumping genes'' were called transposable elements by Clintock. Genes may also get duplicated. There are strong indications that the duplication may sometimes affect whole chromosomes or even genomes or, inversely, only pieces of a gene, in particular exons. It has thus been shown that, during evolution, DNA segments coding for modules or domains in proteins have been duplicated and rearranged through what has been called intronic recombination. By shuffling modules between genes, protein families have thus evolved. Genomic segments can be reversed, in general through ectopic recombination, or deleted. Chromosomes in multi-chromosomal organisms may undergo fusion or fission, or exchange genetic material with another chromosome (through homologous recombination), usually at their ends (translocation) or internally. Genetic material may also be transferred across sub-species or species (lateral transfer), thus leading to the insertion of new elements in a genome. Parts of a genome may be amplified, through, for instance, slippage resulting in the multiplication of the copies of a tandem repeat.

Although much is known about the dynamic behaviour of genomes, much more remains to be discovered about the forces and exact mechanisms behind such dynamics, its function and extend, the frequency of each type of rearrangement, and the impact genomic reorganizations may have on gene expression and genome development.

The main topics in which the association between the researchers of HELIX and USP can yield significant breakthroughs are:

2.1: Algorithms and complexity analysis for calculating a rearrangement distance between two or more genomes under various models. Classical methods of DNA sequence comparison assumed that sequences may only mutate by operations that act on individual nucleotides, i.e., substitutions, insertions, and deletions. More recently, additional studies considered large scale genome rearrangement events such as inversions, transpositions and translocation. We aim to broaden the theory of genome rearrangement in several directions, and to tighten the contact between the theoretical analysis and the real data that are gradually becoming available. The key topics we shall study are algorithms for sorting by signed reversals, length-sensitive sorting by reversal, sorting by transpositions, handling duplicated genes, handling missing genes, handling multiple genomes.
2.2: Modelling, detection and analysis of ``segments conserved by rearrangements''. ``Segments conserved by rearrangements'' mean parts of a chromosome which are relativily stable under large-scale evolutionary events. Looking for such segments is a difficult task for two reasons. The first one is that the conservation is not exact. Some rearrangements preserve the function associated with a segment provided there are ``not many'' of them. How precisely to define the number and type that should be allowed, and therefore which definition(s) to adopt for a conserved segment (possibly there will be more than one depending on the biological question) remains very much an open problem. Our first task will thus be to derive models that are satisfying both mathematically and biologically. The second difficulty of the problem is that such models may be hard to compute.
2.3: Study of breakpoint regions of the genome. This consists in analysing the regions where rearrangements have broken the genome, and trying to find some characterics that may enable to classify them according to the type of rearrangement that gave them origin. The characteristics sought could be the motifs or repeats such regions may contain, or some other features still to be determined. We intend to build methods for detecting such regions as accurately as possible (this is the counterpart of the conserved segments mentioned just above). Then by studying the distribution and length of these regions, we shall try to evaluate the reality of the ``fragile regions'' model, which asserts that there are evolution hotspots in the genome. This theory is under discussion in the scientific community, and still lacks efficient theoretical bases. We shall combine gene homology data and global genomic alignment data and provide reliable tools and analyses based on previous studies on rearrangements.
2.4: We have interest also in a particular subproblem of the previous one, namely the problem of alignments with inversions. Sequence alignments are broadly studied for biological sequence comparison but considering only biological events such as mutations, insertions and deletions. Other biological events such as inversions are not automatically detected by the usual alignment algorithms. Some alternative strategies have been considered in the attempt to include inversions and other types of rearrangements. We plan to improve further on some initial results we have already concerning this topic, and possibly to generalize them to other types of alignments and objective functions.
2.5: Repetitions, recombinations and rearrangements. The objective is to design algorithms for identifying various types of repeats, comparing various alleles of a tandem repeat and studying the relationship between repeats and recombination. We shall investigate new models, algorithms and indexes for identifying various types of repeats in a sequence. The work will start by attempting a typology of the various types of repeats that may be found in biological sequences. Mathematical models and efficient algorithms for their detection will then be investigated for some of these repeats. A tandem repeat has a history and any two individuals may have different tandem repeat sequences at the same genomic location. For various scientific reasons, biologists are interested in tracing back the history of a tandem repeat and in comparing different alleles of a tandem repeat. In the proposed work, we shall focus on combinatorial and algorithmic aspects of the duplication phenomenon. The more complicated case where recombination is present between the copies of a tandem repeat will also be addressed.

Part 3: Phylogeny

Although rearrangements are now generally known to be a major force of the evolution of organisms, they have been surprisingly little used to study such evolution, in particular to infer phylogenies. Phylogenies have instead continued to be most often derived from the point mutation information of one or more genes. Although it is well known, and is indeed a current topic of investigation, that the ``evolutionary story'' told by a gene often contradicts the ``story'' told by another gene, or the ``story'' obtained by using some more general information from an organism (e.g. a set of genes taken together, or a set of morphological characters), such evolutionary stories continue to be represented in a hierarchical manner, that is using trees as models instead of graphs.

This difficult problem will be addressed in a step-wise manner by considering the following subproblems.

3.1: Revisiting the problem of a phylogenetic tree reconstruction by examining two different variants that have several applications in the computation of a distance/dissimilarity between two trees. One variant models phylogenetic trees as Steiner trees with terminal leaves and seeks for a minimum-cost Steiner tree where the terminals appear only as leaves of the tree. A version of this problem that seem particularly relevant for our purposes is one where we are given also a permutation of the terminals and we want a minimum-cost Steiner tree with terminal leaves that respects the given permutation. By this we mean that, if in the permutation terminals r₁, r₂, r₃ and r₄ appear in this order, then the paths in the tree between r₁ and r₃ and between r₂ and r₄ intersect. The second variant concerns the maximum homeomorphic agreement subtree problem. This consists of the following: given a collection of rooted trees with its leaves labelled by elements of a set A, find a subset of A of maximum cardinality such that the subtrees of the given trees ``induced'' by the leaves with labels in this subset coincide, except for possible edge subdivisions.
3.2: Computing a recombination distance between two phylogenetic trees. In our preliminary theoretical work with this topic, the distance we explored is based on the MAF (``Maximum Agreement Forest'') between two trees. We shall continue this work by examining the relation between the MAF and another possibly more appropriate recombination distance between trees: the SPR (``Subtree Prune and Regraft'') distance. This is the minimum number of subtree cuts and regrafts one must do in order to transform one tree into another. We shall also work on elaborating new algorithms for the MAF and other distances, either with a better ratio than those established so far, including by us, or parameterized.
3.3: Exploring the network (graph) nature of the evolution of organisms by combining optimization methods and graph algorithms for inferring what are called reticulate phylogenies.

Part 4: Biochemical networks

It is now commonly accepted that the functioning and development of a living organism is controlled by the networks of interactions between its genes, proteins, and small molecules. Studying such networks and their underlying complexity is the main objective of this part. This objective hides a second one, no less crucial, which is to greatly improve the mathematical and algorithmical theory needed to accurately model, and then explore and analyse highly complex living systems. Biochemical networks may represent protein-protein interactions, the metabolism of an organism, its system of gene expression regulation, or even, mixed networks that contain information coming from various of the previous sources.

The amount and spread of the data now becoming available enable us also to introduce an evolutionary perspective into the study of living organisms, and in particular of biochemical networks. Evolution is a general underlying principle of life that allows us to compare and decipher the meaning and function of structure, the modification of biochemical pathways and networks, the preservation and variation of cell signalling systems, and so on. It thus serves to study the fundamental aspects of life, taking advantage for this of the exploratory and comparative possibilities provided, in particular, by the availability of an increasing number of whole sequences and datasets from different genomes.

The main topic for which the intensive collaboration between HELIX and USP can be profitable is the search of pathways, motifs and modules in different kinds of biochemical networks, that is the search of what is functional, and therefore preserved by evolution. In particular, we shall be concerned with the following two main topics.

4.1: Motifs and modules in biochemical networks. No fully satisfying or complete definition of motifs and modules in biochemical networks exist and most of the work will consist in exploring the various which may be considered (topological or other) and the algorithmical complexity of such definitions. For each, efficient data structures, filters and algorithms for both searching known motifs and for inferring new ones in large networks will be developed. The definitions will of course vary depending on the type of biochemical network that is considered.

The question of the statistical significance of the motifs identified will be of primary importance. This question is still open. An answer to it may depend on the definition of a random graph that is appropriate to the biological problem at hand, a definition of a motif occurrence in such a network, and how to calculate the probability of such motifs. The possibility to transpose questions and results on motif statistics in a random sequence to motifs statistics in a random network will be examined. This will be a more exploratory research activity.

Finally, modules and motifs will also be essential instruments for studying and understanding the evolution of networks. Indeed metabolic pathways have already been used to infer phylogenies but the topological aspects of such pathways are only very partially and indirectly taken into account. This is another area that will be developed, possibly in conjunction with other types of information.
4.2: Reconstruction, alignment and simulation of metabolic pathways. Metabolic pathways reflect the sum of an organism's chemical reactions, and their elucidation is key to the understanding of cellular processes as a whole. Such pathways can be represented as labeled graphs and networks of processes, thus making them amenable to algorithmic analyses of several kinds. Our objective is to combine methods for computational analysis and simulation of these structures with experimental work that reveals the (kinetic and other) parameters that are required to characterize the behavior of these systems in order to allow life science researchers to better understand how metabolic pathways function.

In our work, we aim to provide researchers with systematic and predictive means to do their work. These include the ability to compare metabolisms both of a variety of organisms as well as of similar processes within the same organism, the provision of tools and methods to do both static and dynamic analyses of pathways, and the ability to reconstruct complex pathways from their constituents. Note that some of the methods developed in this context are applicable also to other cases, such as regulatory networks, or protein-protein interaction networks.