Associated Team HELIX-USP
Purpose
Main motivation
The research proposed will focus mainly on computational biology. The
technological advances in biological related areas of recent years has
allowed a huge amount of information to be produced. It is one of the
main current challenges for biologists to be able to extract knowledge
from all this data, and a challenge also for the computer scientists
and mathematicians to participate with the biologists in this
process. The analysis of this data involves, among many other issues,
the formalization of problems and the search for efficient algorithms
to solve them. This is a continuous process in the sense that the
solutions produced by the algorithms will usually lead to new research
that has for objective refining the models, and will therefore result
in new problems and/or the development of increasingly more
sophisticated algorithms. The main goal of the association between the
HELIX and USP teams is to work on such types of questions: the
formalization of problems coming from biology and the search for
efficient algorithms to solve these problems.
Broad lines of investigation
Living organisms are extraordinarily complex systems which recent
technological advances have enabled to start studying in a completely
new fashion, at scales never contemplated before. Although
mathematical models have since a long time been attempted for many
biological phenomena, usually this was at a relatively small
specialized level. Even then, the data was often missing to test,
revise and refine such models. This data is now coming so fast that it
is the models and their accuracy that are lagging increasingly
behind. The coverage of the data is also extensive enough already that
one may start considering to model larger and wider covering
systems. This is totally essential if one hopes to get a truly good
understanding of how complex organisms function and reproduce.
Although the speed of arrival of new data, and their overall size have
often been mentioned as the main bottlenecks to the mathematical and
computational modelling and treatment of such new information, the
real bottleneck is situated elsewhere. It is in the relative lack of
sophistication of currently existing computational tools for dealing
even sometimes with small biological datasets. The mathematical and
algorithmical theory is often not yet there that would enable to fully
satisfactorily treat the complexity of living organisms.
The association between the HELIX and the USP teams aims precisely at
addressing this point. The two teams gather people with interest in
mainly three areas of computer science: computational biology,
combinatorial optimization and graph theory. These areas are related
to each other, as many problems in the first two are modelled and
formalized as problems in graph theory. Also, many problems in
computational biology are in fact optimization problems. Sharing
knowledge within these areas is therefore an important step towards
solving important practical problems, as well as towards developing
the background to find more efficient solutions for such problems. We
want to use the theoretical skills of both teams to improve some of
the mathematical and algorithmical theory needed to get a better
understanding of evolution and of the preservation of structures in
living systems. We intend to work on four main topics: sequence
analysis and models for molecular evolution (Part 1 of this project),
large-scale comparative genomics and genome dynamics (Part 2),
phylogeny (Part 3), and biochemical networks (Part 4). Each of these
topics is already the object of research that is being conducted in
one of the teams or both, either each team individually or in
collaboration.
Part 1: Sequence analysis and models for molecular
evolution
This part is basically concerned with identifying motifs of different
types and sizes in molecular sequences. This identification requires
also to have good models of the evolution of such sequences and this
is therefore a third topic that appears as background to the other
two.
The following subtopics will more particularly interest us in the
context of an HELIX-USP collaboration:
-
1.1
- Improving or investigating the various models for motifs that
have been developed over the years. The motifs here include: complex
motifs, that is motifs which may be composed of various parts
separated by more or less constrained distances; footprint motifs,
that is motifs that explicitely take a (known) phylogeny into account
when trying to infer motifs from a set of orthologous sequences;
finally, motifs that try to combine both the pattern and position
weight matrix models for representing motifs.
- 1.2
- Systematically investigating new approaches for motif
inference, in particular by exploring probabilistic methods and
general machine learning techniques, and more specifically analysing
the feasibility of non supervised motifs learning using biclustering
techniques adapted to motifs.
- 1.3
- Extending the types of motifs that have been considered
to treat long motifs (which may require filtering techniques specially
adapted to the problem -- genes may be seen as very long motifs, split
in the case of eukaryotes), motifs under an edit distance model,
parameterized motifs (motifs which match under general functions),
motifs appearing in permuted forms.
- 1.4
- Systematically investigating new indexes for old and new
motif models, in particular indexes for approximate motifs search and
inference and for complex motifs.
- 1.5
- Analysing various exact or approximate formulations of
motif clustering, from the most elegant ones (such as bases of motifs
already undergoing investigation in one of the teams) to heuristical
clustering methods.
- 1.6
- Improving the algorithms for multiple sequence alignment that
underly motif search or inference methods, in particular, exploring
new cost functions such as normalized edit distance (where the
alignment quality of the sequences is measured by considering the
average of the quality of the local alignment of the sequences).
- 1.7
- Exploiting conservation differences, that is, differences
in evolutionary constraints, to identify gene regulatory
sequences. Algorithms will be considered for scanning sequences from
different organisms for conserved equivalent segments and within these
segments, to identify hyper conserved and hyper variable regions.
Part 2: Large-scale comparative genomics and genome
dynamics
It has since long been known that genomes are not static. The work of
Barbara Clintock in the late 40's showing that genes could jump
spontaneously from one site to another was a first clear sign of
this. ``Jumping genes'' were called transposable elements by
Clintock. Genes may also get duplicated. There are strong indications
that the duplication may sometimes affect whole chromosomes or even
genomes or, inversely, only pieces of a gene, in particular exons. It
has thus been shown that, during evolution, DNA segments coding for
modules or domains in proteins have been duplicated and rearranged
through what has been called intronic recombination. By shuffling
modules between genes, protein families have thus evolved. Genomic
segments can be reversed, in general through ectopic recombination, or
deleted. Chromosomes in multi-chromosomal organisms may undergo fusion
or fission, or exchange genetic material with another chromosome
(through homologous recombination), usually at their ends
(translocation) or internally. Genetic material may also be
transferred across sub-species or species (lateral transfer), thus
leading to the insertion of new elements in a genome. Parts of a
genome may be amplified, through, for instance, slippage resulting in
the multiplication of the copies of a tandem repeat.
Although much is known about the dynamic behaviour of genomes, much
more remains to be discovered about the forces and exact mechanisms
behind such dynamics, its function and extend, the frequency of each
type of rearrangement, and the impact genomic reorganizations may have
on gene expression and genome development.
The main topics in which the association between the researchers of
HELIX and USP can yield significant breakthroughs are:
-
2.1
- Algorithms and complexity analysis for calculating
a rearrangement distance between two or more genomes under various
models. Classical methods of DNA sequence comparison assumed that
sequences may only mutate by operations that act on individual
nucleotides, i.e., substitutions, insertions, and
deletions. More recently, additional studies considered large scale
genome rearrangement events such as inversions, transpositions and
translocation. We aim to broaden the theory of genome rearrangement
in several directions, and to tighten the contact between the
theoretical analysis and the real data that are gradually becoming
available. The key topics we shall study are algorithms for sorting by
signed reversals, length-sensitive sorting by reversal, sorting by
transpositions, handling duplicated genes, handling missing genes,
handling multiple genomes.
- 2.2
- Modelling, detection and analysis of ``segments conserved
by rearrangements''. ``Segments conserved by rearrangements'' mean
parts of a chromosome which are relativily stable under large-scale
evolutionary events. Looking for such segments is a difficult task
for two reasons. The first one is that the conservation is not
exact. Some rearrangements preserve the function associated with a
segment provided there are ``not many'' of them. How precisely to
define the number and type that should be allowed, and therefore which
definition(s) to adopt for a conserved segment (possibly there will be
more than one depending on the biological question) remains very much
an open problem. Our first task will thus be to derive models that are
satisfying both mathematically and biologically. The second difficulty
of the problem is that such models may be hard to compute.
- 2.3
- Study of breakpoint regions of the genome. This consists in
analysing the regions where rearrangements have broken the genome, and
trying to find some characterics that may enable to classify them
according to the type of rearrangement that gave them origin. The
characteristics sought could be the motifs or repeats such regions may
contain, or some other features still to be determined. We intend to
build methods for detecting such regions as accurately as possible
(this is the counterpart of the conserved segments mentioned just
above). Then by studying the distribution and length of these regions,
we shall try to evaluate the reality of the ``fragile regions'' model,
which asserts that there are evolution hotspots in the genome. This
theory is under discussion in the scientific community, and still
lacks efficient theoretical bases. We shall combine gene homology data
and global genomic alignment data and provide reliable tools and
analyses based on previous studies on rearrangements.
- 2.4
- We have interest also in a particular subproblem of the
previous one, namely the problem of alignments with inversions.
Sequence alignments are broadly studied for biological sequence
comparison but considering only biological events such as mutations,
insertions and deletions. Other biological events such as inversions
are not automatically detected by the usual alignment algorithms. Some
alternative strategies have been considered in the attempt to include
inversions and other types of rearrangements. We plan to improve
further on some initial results we have already concerning this topic,
and possibly to generalize them to other types of alignments and
objective functions.
- 2.5
- Repetitions, recombinations and rearrangements. The
objective is to design algorithms for identifying various types of
repeats, comparing various alleles of a tandem repeat and studying the
relationship between repeats and recombination. We shall investigate
new models, algorithms and indexes for identifying various types of
repeats in a sequence. The work will start by attempting a typology of
the various types of repeats that may be found in biological
sequences. Mathematical models and efficient algorithms for their
detection will then be investigated for some of these repeats. A
tandem repeat has a history and any two individuals may have different
tandem repeat sequences at the same genomic location. For various
scientific reasons, biologists are interested in tracing back the
history of a tandem repeat and in comparing different alleles of a
tandem repeat. In the proposed work, we shall focus on combinatorial
and algorithmic aspects of the duplication phenomenon. The more
complicated case where recombination is present between the copies of
a tandem repeat will also be addressed.
Part 3: Phylogeny
Although rearrangements are now generally known to be a major force of
the evolution of organisms, they have been surprisingly little used to
study such evolution, in particular to infer phylogenies. Phylogenies
have instead continued to be most often derived from the point
mutation information of one or more genes. Although it is well known,
and is indeed a current topic of investigation, that the
``evolutionary story'' told by a gene often contradicts the ``story''
told by another gene, or the ``story'' obtained by using some more
general information from an organism (e.g. a set of genes taken
together, or a set of morphological characters), such evolutionary
stories continue to be represented in a hierarchical manner, that is
using trees as models instead of graphs.
This difficult problem will be addressed in a step-wise manner by
considering the following subproblems.
- 3.1
- Revisiting the problem of a phylogenetic tree
reconstruction by examining two different variants that have several
applications in the computation of a distance/dissimilarity between
two trees. One variant models phylogenetic trees as Steiner trees with
terminal leaves and seeks for a minimum-cost Steiner tree where the
terminals appear only as leaves of the tree. A version of this problem
that seem particularly relevant for our purposes is one where we are
given also a permutation of the terminals and we want a minimum-cost
Steiner tree with terminal leaves that respects the given
permutation. By this we mean that, if in the permutation terminals
r1, r2, r3 and r4 appear in this order, then the paths in
the tree between r1 and r3 and between r2 and r4
intersect. The second variant concerns the maximum homeomorphic
agreement subtree problem. This consists of the following: given a
collection of rooted trees with its leaves labelled by elements of a
set A, find a subset of A of maximum cardinality such that the
subtrees of the given trees ``induced'' by the leaves with labels in
this subset coincide, except for possible edge subdivisions.
- 3.2
- Computing a recombination distance between two phylogenetic
trees. In our preliminary theoretical work with this topic, the
distance we explored is based on the MAF (``Maximum Agreement
Forest'') between two trees. We shall continue this work by examining
the relation between the MAF and another possibly more appropriate
recombination distance between trees: the SPR (``Subtree Prune and
Regraft'') distance. This is the minimum number of subtree cuts and
regrafts one must do in order to transform one tree into another. We
shall also work on elaborating new algorithms for the MAF and other
distances, either with a better ratio than those established so far,
including by us, or parameterized.
- 3.3
- Exploring the network (graph) nature of the evolution of
organisms by combining optimization methods and graph algorithms for
inferring what are called reticulate phylogenies.
Part 4: Biochemical networks
It is now commonly accepted that the functioning and development of a
living organism is controlled by the networks of interactions between
its genes, proteins, and small molecules. Studying such networks and
their underlying complexity is the main objective of this part. This
objective hides a second one, no less crucial, which is to greatly
improve the mathematical and algorithmical theory needed to accurately
model, and then explore and analyse highly complex living
systems. Biochemical networks may represent protein-protein
interactions, the metabolism of an organism, its system of gene
expression regulation, or even, mixed networks that contain
information coming from various of the previous sources.
The amount and spread of the data now becoming available enable us
also to introduce an evolutionary perspective into the study of living
organisms, and in particular of biochemical networks. Evolution is a
general underlying principle of life that allows us to compare and
decipher the meaning and function of structure, the modification of
biochemical pathways and networks, the preservation and variation of
cell signalling systems, and so on. It thus serves to study the
fundamental aspects of life, taking advantage for this of the
exploratory and comparative possibilities provided, in particular, by
the availability of an increasing number of whole sequences and
datasets from different genomes.
The main topic for which the intensive collaboration between HELIX and
USP can be profitable is the search of pathways, motifs and modules in
different kinds of biochemical networks, that is the search of what is
functional, and therefore preserved by evolution. In particular, we
shall be concerned with the following two main topics.
- 4.1
- Motifs and modules in biochemical networks. No fully
satisfying or complete definition of motifs and modules in biochemical
networks exist and most of the work will consist in exploring the
various which may be considered (topological or other) and the
algorithmical complexity of such definitions. For each, efficient data
structures, filters and algorithms for both searching known motifs and
for inferring new ones in large networks will be developed. The
definitions will of course vary depending on the type of biochemical
network that is considered.
The question of the statistical significance of the motifs identified
will be of primary importance. This question is still open. An answer
to it may depend on the definition of a random graph that is
appropriate to the biological problem at hand, a definition of a motif
occurrence in such a network, and how to calculate the probability of
such motifs. The possibility to transpose questions and results on
motif statistics in a random sequence to motifs statistics in a random
network will be examined. This will be a more exploratory research
activity.
Finally, modules and motifs will also be essential instruments for
studying and understanding the evolution of networks. Indeed metabolic
pathways have already been used to infer phylogenies but the
topological aspects of such pathways are only very partially and
indirectly taken into account. This is another area that will be
developed, possibly in conjunction with other types of information.
- 4.2
- Reconstruction, alignment and simulation of metabolic
pathways. Metabolic pathways reflect the sum of an organism's chemical
reactions, and their elucidation is key to the understanding of
cellular processes as a whole. Such pathways can be represented as
labeled graphs and networks of processes, thus making them amenable to
algorithmic analyses of several kinds. Our objective is to combine
methods for computational analysis and simulation of these structures
with experimental work that reveals the (kinetic and other) parameters
that are required to characterize the behavior of these systems in
order to allow life science researchers to better understand how
metabolic pathways function.
In our work, we aim to provide researchers with systematic and
predictive means to do their work. These include the ability to
compare metabolisms both of a variety of organisms as well as of
similar processes within the same organism, the provision of tools and
methods to do both static and dynamic analyses of pathways, and the
ability to reconstruct complex pathways from their constituents. Note
that some of the methods developed in this context are applicable also
to other cases, such as regulatory networks, or protein-protein
interaction networks.