Purpose
Objectives and background
Historically, the field of computational biology, at least from the
more algorithmic point of view, started with the analysis of
sequences, isolated sequences for many years, and then whole genomes
when the sequencing projects started accelerating. Other topics of
interest in the early years of the discipline addressed issues that
were either sequence-related (for instance, molecular phylogeny) or
were related to the analysis of other elementary biological objects
(such as protein or RNA structures).
More recently, the concern with gaining a deeper understanding of how all
such elementary biological objects interact in the general context of
a genome, cell or organism has led to the development of whole new
areas of investigation by computational biologists. The study of
relations, although not altogether new to the field, has thus been in
full bloom in the last few years. Such relations concern every
element in a cell or organism. They may even concern extra-cellular
elements, or elements that belong to the environment of an organism as
some may have an influence on the inner functioning of a living
system. Investigating relations therefore requires studying what has
been called integrative biology. This is the general objective of the
present project.
In order to reach this objective, the project will rely on the mixed and complementary
expertise (in mathematical modelling and formal analysis, data analysis, discrete
and stochastic algorithmics, combinatorics, computational and evolutionary biology)
of the various groups involved, some of which have a long history of collaborations
(the French coordinator is an ex-student of the University of São Paulo) while other
collaborations are more recent (between the UFMS and France) or completely new (between
Chile and both France and Brazil).
Introduction
A lack of sufficient or sufficiently clean data has so far
hindered the full growth of the areas of investigation within or
related to integrative biology. Worse, however, the lack of good models
has slowed down the development of radically new ways of
considering such relations, and thus of considering biology at the
global level of an organism, in a way similar to the revolution
that the sequencing of genomes brought to our view of
genetics and molecular evolution.
The data come in many different
forms, but three main categories may be distinguished: 1. sequence
data, 2. expression data (this includes both transcriptomic and
proteomic data), and 3. biochemical data (such as provided by the
lists of reactions, enzymes and compounds composing the metabolism of
an organism). This diversity requires expertise in different types of
techniques such as text algorithmics, combinatorial and statistical
data analysis methods, graph theory modelling and algorithmics, in order
to adequately deal with the data available.
Like many other issues in computational biology, obtaining good models
is a particularly difficult problem both because we lack general
knowledge of the biological processes at play and so have no clear
idea of where to look, or even what we are supposed to look for,
and because we have at our disposal an often vast collection of
partial and very specialised knowledge that may strongly bias our
search for new information, or even prevent us completely from finding
anything new.
The search for good models - it is very unlikely that a unique
one will exist - must therefore involve all the steps from
1. biological theory of what could be the important forces or
constraints to consider to 2. the development of mathematical models
that lead to 3. (theoretical) considerations of algorithmic
complexity which may either 4. feed back into the biological
formulations of the problems considered or 4'. lead to the elaboration
of efficient algorithms, and then 5. to the application of such
algorithms to data in a specific or systematic way, 6. analysis of the
results obtained and 7. feedback to either the initial theory or the
mathematical models derived from it. Upstream of this, the problem of
obtaining the elementary pieces of information on what relations
actually exist must be addressed.
In the context of this project, we shall concentrate our attention on
a somewhat restricted view of the functioning of a cell or
organism, the one provided by biochemical and evolutionary networks,
and by the relation between the two.
This study will be conducted through the development of diverse
models and methods of analysis (statistical, combinatorial, etc.).
As much as possible, the models and methods will be systematically
explored and confronted in an agnostic spirit with the final
objective of tentatively addressing the questions above. Methodological
development and a better understanding of biological
phenomena will thus represent two processes that proceed together,
side by side. The expected results first concern each such process
taken separately: 1. getting at better mathematical models and
algorithms for analysing the networks; and 2. answering both
specific biological questions (hypothesis tests) and more general
ones (exhaustive exploration of available data).
Concerning the latter, we shall try to address essentially three
general biological questions with a stress put on the first
two, the third being a longer term goal: 1. are there regularities,
structural and functional, in the
diversity that is observed, regularities that could provide
evidence of a deeper organisation of living organisms; 2. can
we identify these regularities in a systematic fashion and thus
manage to distinguish an order in the complex network of the
observed interactions; finally, 3. how has this network been set
up in the course of evolution, to accomplish what functions, and
could it have evolved in a different way?
By regularity, we mean the conservation of some elements, at the
level either of the genome, or of the network of molecular
interactions inside a cell. This conservation can be observed
within the same organism (we then speak of approximate repetitions
of parts of a genome or of a network of interactions), or among
different organisms.
The simultaneous
confrontation of our modelling attempts with biological data
should also, above all, allow us to get a better grasp of the
structure of living systems: is this structure simple or
simplifiable into some general principles, or is life made
essentially of exceptions?
Genome dynamics
It has long been known that genomes are not static. The work of
Barbara McClintock in the late 1940s showing that genes could jump
spontaneously from one site to another was a first clear sign of
this. Jumping genes were called transposable elements by
McClintock. Genes may also get duplicated. There is now undisputed evidence
that the duplication may sometimes affect whole chromosomes or even
genomes or, inversely, only pieces of a gene, in particular exons. It
has thus been shown that, during evolution, DNA segments coding for
modules or domains in proteins have been duplicated and rearranged
through what has been called intronic recombination. By shuffling
modules between genes, protein families have thus evolved. Genomic
segments can be reversed, in general through ectopic recombination, or
deleted. Chromosomes in multi-chromosomal organisms may undergo fusion
or fission, or exchange genetic material with another chromosome,
usually at their
ends (translocation) or internally. Genetic material may also be
transferred across sub-species or species (lateral transfer), thus
leading to the insertion of new elements in a genome. Parts of a
genome may be amplified, through, for instance, slippage resulting in
the multiplication of the copies of a tandem repeat.
Although much is known about the dynamic behaviour of genomes, much
more remains to be discovered about the forces and exact mechanisms
behind such dynamics, its function and extent, the frequency of each
type of rearrangement, and the impact genomic reorganisations may have
on gene expression and genome development.
The main topics we shall address in this project are the following.
In all cases, we shall use data provided by our biologist collaborators.
This concerns mostly eukaryotic genomes.
- Algorithms and complexity analysis for calculating
a rearrangement distance between two or more genomes under various
models. Classical methods of DNA sequence comparison assumed that
sequences may only mutate by operations that act on individual
nucleotides, i.e., substitutions, insertions, and
deletions. More recently, additional studies considered large scale
genome rearrangement events such as inversions, transpositions and
translocations. We aim to broaden the theory of genome rearrangement
in several directions, and to tighten the contact between the
theoretical analysis and the real data that are gradually becoming
available, or that have been available for a long time but that have
been little used, such as cytogenetics data. The key topics we
shall study are algorithms for sorting by
signed reversals, length-sensitive sorting by reversal, sorting by
transpositions, handling duplicated genes, handling missing genes,
handling multiple genomes, preserving segments conserved by rearrangements
(parts of a chromosome which are relatively stable under large-scale
evolutionary events) while sorting.
- Study of breakpoint regions of a genome. This consists in
identifying as precisely as possible and analysing the regions where
rearrangements have broken the genome, and trying to
find some characteristics that may enable us to classify these regions
according to the type of rearrangement that gave rise to them. The
characteristics sought could be motifs or repeats such regions may
contain, or some other features still to be determined. We intend to
build methods for detecting such regions as accurately as possible.
Then by studying the distribution and length of these regions,
we shall try to evaluate the reality of the fragile regions model,
which asserts that there are evolution hotspots in the genome. This
theory is under discussion in the scientific community, and still
lacks clear theoretical bases. We shall combine gene homology data and
global genomic alignment data and provide reliable tools and analyses
based on previous studies on rearrangements.
- We are also interested in a particular subproblem of the
previous one, namely the problem of performing alignments with rearrangements.
Sequence alignments have been widely studied for biological sequence
comparison, but they usually consider only events such as substitutions,
insertions and deletions. Other biological events such as inversions or
duplications
are not automatically detected by the usual alignment algorithms. Some
alternative strategies have been considered in the attempt to include
inversions and other types of rearrangements. We plan to improve
further on some initial results that have already been published in
the community concerning this topic, and to generalise them
to other types of alignments and objective functions.
- Repetitions, recombinations and rearrangements. The
objective is to design algorithms for identifying various types of
repeats (transposons, satellites, etc.) and studying the relationship between repeats and
recombination on the one hand, and repeats and regulation at the level
of a whole genome on the other. We shall investigate new models,
algorithms and indexes for identifying various types of repeats in a
sequence. The work will start by attempting a typology of the various
types of repeats that may be found in biological
sequences. Mathematical models and efficient algorithms for their
detection will then be investigated for some of these repeats.
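
As an illustration of the rearrangement models mentioned in the first topic above, the following minimal Python sketch represents a genome as a signed permutation of gene identifiers, applies a signed reversal (inversion), and counts breakpoints relative to the identity order. The function names and the toy genome are hypothetical and chosen for this example only; this is not the method that will be developed in the project. The bound used at the end relies on the standard observation that one reversal changes at most two adjacencies.

```python
from typing import List


def reverse_segment(perm: List[int], i: int, j: int) -> List[int]:
    """Apply a signed reversal to perm[i..j] (0-based, inclusive).

    The segment is read in the opposite direction and every gene in it
    changes orientation (sign).
    """
    return perm[:i] + [-g for g in reversed(perm[i:j + 1])] + perm[j + 1:]


def count_breakpoints(perm: List[int]) -> int:
    """Count adjacencies that differ from the identity +1, +2, ..., +n.

    The permutation is framed by 0 and n+1; a consecutive pair (a, b)
    is an adjacency of the identity exactly when b - a == 1.
    """
    n = len(perm)
    framed = [0] + perm + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


def reversal_lower_bound(perm: List[int]) -> int:
    """Lower bound on the reversal distance: ceil(breakpoints / 2).

    Assumption: each reversal removes at most two breakpoints.
    """
    return (count_breakpoints(perm) + 1) // 2


# Hypothetical toy genome: genes 2 and 3 are inverted as a block.
toy_genome = [1, -3, -2, 4]
sorted_genome = reverse_segment(toy_genome, 1, 2)
```

On this toy genome, `count_breakpoints(toy_genome)` gives 2 and a single reversal of positions 1 to 2 restores the identity `[1, 2, 3, 4]`, so the lower bound of 1 is tight here. In general the bound is not tight, which is one reason why exact algorithms for sorting by signed reversals require more elaborate structures.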
Genetic and biochemical networks
It is now commonly accepted that the functioning and development of a
living organism are controlled by the networks of interactions between
its genes, proteins, and small molecules. Studying such networks and
their underlying complexity is the main objective of this part. This
objective hides a second one, no less crucial, which is to greatly
improve the mathematical and algorithmic theory needed to accurately
model, and then explore and analyse highly complex living
systems. Biochemical networks may represent protein-protein
interactions, the metabolism of an organism, its system of gene
expression regulation, or even mixed networks that contain
information coming from several of the previous sources.
The amount and spread of the data now becoming available enable us
also to introduce an evolutionary perspective into the study of living
organisms, and in particular of biochemical networks. Evolution is a
general underlying principle of life that allows us to compare and
decipher the meaning and function of structure, the modification of
biochemical pathways and networks, the preservation and variation of
cell signalling systems, and so on. It thus serves to study the
fundamental aspects of life, taking advantage for this of the
exploratory and comparative possibilities provided, in particular, by
the availability of an increasing number of whole sequences and
interaction datasets from different genomes.
We shall be concerned with the following main topics.
- Motifs and modules in biochemical networks. Modules are in
general considered to be parts of a network that function in relative
independence from other parts, while motifs are small patterns of
interactions that are repeatedly found in the network. No fully
satisfying or complete definition of motifs and modules in biochemical
networks exists, and most of the work will consist in exploring the
various definitions which may be considered (topological or other) and the
algorithmic complexity of such definitions. For each, efficient data
structures, filters and algorithms both for searching for known motifs and
for inferring new ones in large networks will be developed. The
definitions will of course vary depending on the type of biochemical
network that is considered. The applications will be mostly to
E. coli and Yeast for which the cleanest datasets exist in
the literature. In the case of Yeast, we shall use also data stored
in the Yeastract database maintained by collaborators (Ana Teresa
Freitas and Arlindo Oliveira) at the Instituto
Superior Técnico, Lisbon, Portugal and validated by expert biologists
from the group of Isabel Sá-Correia from the same institute
(http://www.yeastract.com/).
The question of the statistical significance of the motifs identified
will be of primary importance. This question is still open. An answer
to it may depend on the definition of a random graph that is
appropriate to the biological problem at hand, a definition of a motif
occurrence in such a network, and how to calculate the probability of
such motifs. This will be a more exploratory research
activity that will be conducted in collaboration with statisticians
(Sophie Schbath from the INRA and Stéphane Robin from the INA-PG,
France) external to this project.
- Comparison and alignment of networks.
Metabolic networks reflect the sum of an organism's chemical
reactions, and their elucidation is key to the understanding of
cellular processes as a whole. Such networks can be represented as
labeled graphs and networks of processes, thus making them amenable to
algorithmic analyses of several kinds. Our objective is to combine
methods for computational analysis and simulation of these structures
with experimental work that reveals the (kinetic and other) parameters
that are required to characterise the behaviour of these systems in
order to allow life science researchers to better understand how
metabolic networks function.
We aim to provide researchers with systematic and
predictive means to do their work. These include the ability to
compare the metabolisms of a variety of organisms as well as
similar processes within the same organism, the provision of tools and
methods to do both static and dynamic analyses of networks,
and the ability to reconstruct complex subnetworks from their constituents.
Our main application will be to the metabolic networks of
symbionts (symbionts are organisms living in symbiosis, that is,
in a close association in a community). We shall collaborate for this
with a laboratory of experimental biologists from the INSA at Lyon
(team of Hubert Charles).
In the long run, we plan to extend the study to
other cases, such as genetic or mixed networks.
- Co-evolution of genomes and of metabolic/genetic networks.
Preliminary studies have started revealing the links existing between
operons (groups of closely located genes controlled as a unit to produce messenger RNA) in
prokaryotes and the set of reactions that are close in the metabolic
network of an organism. Besides the fact that such studies have not
yet been conducted in a systematic manner, operons probably do not exist
in eukaryotic organisms. The question remains open whether there is
nevertheless a link between the genomic organisation and the metabolism
in such organisms also, maybe related to regulons (collections of genes
under regulation by the same regulatory protein - operons are a subclass
of regulons). We wish to address this problem in the present
project. In general, gene regulation
is doubtless an important element for establishing a possible link
between genome organisation and metabolism. The introduction of information
pertaining to regulation will be one of our concerns also when studying
the evolution of biological networks. More precisely, the type of general
questions to which we would like to provide some initial answers is:
- how does an organism acquire new functions, does it happen by
the duplication of a set of genes and then evolution (specialisation)
of their expression mode?
- is the organisation of a genome optimised for one (or more) metabolic function(s) (synthesis of essential molecules, production of energy) and to
what extent is the metabolism constrained by the organisation of the genome?
- what are the links between genome organisation and gene network,
is there a compromise between organisation and disorganisation
(where by organisation, we mean the optimisation of the capacities of an
organism and by disorganisation the capacity of an organism to evolve in
a changing environment)?
- given a genome, what metabolism or gene network can we have and inversely, given a
metabolism or gene network, what genome(s) can we have?
- Exploring the relation between gene expression and some
environmental factors.
The main goal of a recently developed field of research called
pharmacogenomics is to predict the effect drugs
may have based on the genomic information of a patient. Using
microarray technologies, this may help improve both diagnosis and
the dosage regimen to be adopted. A few research facilities are
now ready to simultaneously screen the transcriptome and the genome
of patients in order to reveal possible correlations
with drug effects, and to consider adapting the medical protocols
with this new information. However, it is not easy to extract
knowledge from the amount of data this entails. To try to address this
issue, we propose to develop new methods that use Bayesian networks.
Bayesian networks make it possible to represent causal relations (through
a directed acyclic graph, or DAG),
and then to estimate state probabilities. This method has already
been applied to microarray data. Our contribution will be to
append clinical information to the network and to specify a model
that takes pharmacogenomic constraints into account. This should
enable more accurate predictions. It will also be
interesting to compare the information on the different diseases
(by comparing the corresponding Bayesian networks), which
may reflect some cancer mechanisms. This represents an original
approach, which could then be applied to other biological systems.
We shall use here data provided by the medical groups inside
the French laboratory partner of this project.
- Modelling and analysing genetic networks. The life of an
organism depends on many metabolic pathways that are regulated by gene
expression networks. The mechanism of
pathway regulation involves a complex system with many forward and
feedback signals. These signals are RNA, produced by gene expression,
and protein complexes, produced by the interaction of proteins built by
translation of mRNA. Protein complexes act as feedback signals that
control gene transcription. Forward signals, in the form of enzymes,
control metabolic pathways. In such networks, the expression of
each gene depends both on its own expression and on the expression
levels of other genes at previous time instants. This complex network
of interactions can thus be modelled by a dynamical system. Finite
dynamical systems, discrete in time and finite in range, can model the
behaviour of gene expression networks. In such a model, each transcript
is represented by a variable that takes the expression value of that
transcript. All these variables, taken collectively, are the
components of a vector called the state of the system. Each component
(i.e. transcript) of the state vector has an associated function that
calculates its next value (i.e. expression value) from the state at
previous time instants. These functions are the components of a
function vector, called transition function, which defines the
transition from one state to the next and represents the gene
regulation mechanisms. Gene expression networks are modelled by
stochastic processes. The stochastic transition function defines a
particular family of Markov chains called probabilistic genetic
networks (PGNs). Building upon some initial work from one of the
partners in this proposal, we propose, in
interaction with some of the other partners, to extend this model by
considering the equivalence between linear combinations of
inputs. This should improve the results obtained, since the estimation
errors will diminish and the hypothesis is quite consistent with
observed gene dynamics. In addition, this model will make it possible to distinguish
between inhibitory and excitatory signals. The proposed approach
explores pattern recognition techniques such as feature selection
to identify the network connections using entropy and
mutual information.
- Towards inferring genetic networks through an integrated
approach.
Recent studies have demonstrated that biological networks display nonrandom
characteristics, among which we highlight the modular architecture. We are
interested in the modular organisation of transcriptional regulation networks
(TRN), which model the interactions between genes and proteins that control
their expression at the transcriptional level. Understanding the mechanisms
of transcriptional regulation is crucial to explaining the morphological and
functional diversity of cells. We propose to address the problem of identifying
transcriptional regulation modules, i.e. groups of coregulated genes and their
regulators. One important distinction of our work is that we propose to identify
modules that are evolutionarily conserved. From the biological point of view,
the proposed approach is supported by three main premises:
- coregulated
genes are bound by common regulatory proteins (transcription factors - TFs)
and so they must present common sequence patterns (motifs) in their regulatory
regions, which correspond to the binding sites of those TFs;
- coregulated
genes respond in a coordinated way to certain environmental or growth conditions,
and so they must be coexpressed under those conditions;
- since
transcriptional modules may involve genes that interact at the protein level,
they are subject to more evolutionary constraints and therefore
they may be evolutionarily conserved.
We thus define transcriptional regulation
metamodules (TRMMs) as groups of genes sharing regulatory motifs and displaying
coherent context-specific expression behaviour consistently across species. From
the methodological perspective, we note that the incompleteness and elevated
noise levels of currently available data impose severe limitations on the
reliability of the conclusions that can be drawn through the analysis of one
data type in isolation. Therefore we propose to analyse heterogeneous
experimental data concerning several species simultaneously, with emphasis on
genomic sequence and gene expression data.
- Integrating genetic regulation and metabolism.
A considerable effort has been
dedicated to the elucidation of the mechanisms that determine the process of
cellular metabolism.
This process is essential for the cell since it allows the production of
energy, precursors or macro-molecules. On the other hand, although the
understanding of the cellular metabolism is very important, it is also
essential to understand how the cell is able to adapt in response to changing
environments such as nutrient excess, starvation and other stresses.
In those cases, the processes that govern the adaptation of the
cell take place at the genetic network level, which in some cases operates on time
scales different from those of the metabolic network. This poses an important
mathematical challenge. However, the expression of genes produces enzymes
which are used in the metabolic network. Therefore, this expression determines
the terminal velocities that the metabolic network fluxes can reach. On the
other hand, internal metabolite concentrations can regulate the expression of
the genes of the metabolic network, influencing in this way the genetic
regulation network. There is a substantial amount of data that allows the
identification of the underlying regulatory network of particular cells.
Theoretical modelling, together with simulations and computational approaches,
provides an extremely useful framework for integrating data and gaining
insights into the dynamics and functional properties of such networks: fixed
points, transient times, the best iteration procedure to get the right
attractors, etc.
Over the last few years several
mathematical models of cellular processes have been developed,
among which is a unique discrete model of gene regulation of
metabolic fluxes in E. coli. The
model represents both genetic regulation and metabolism in an
integrated form, which is how they function biologically. This is one of the
first attempts to include both of these networks in one model.
Given an initial network which admits a finite number of
fixed points (the interesting attractors for genetic networks), we shall try to
- determine families of operators (new local iteration modes) acting on the
network such that, for the new network, the set of fixed points is the same;
- study the dynamics of such operators and apply this to build
specific networks related to genetic-metabolic interaction;
- develop the notion of a filter, which is the iterated
application of local operators that changes the network without changing
the fixed points.
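
To make the notion of fixed points under different iteration modes concrete, the following minimal Python sketch enumerates the fixed points of a small Boolean network under two update schemes: synchronous (all components updated at once) and sequential (components updated one at a time in a given order). The three local rules, their names and the update order are hypothetical and serve only as an illustration; they are not taken from the E. coli model mentioned above, and exhaustive enumeration is only feasible for very small networks.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]
LocalRule = Callable[[State], int]

# Hypothetical 3-gene Boolean network (values are 0 or 1):
#   x0' = x1 AND NOT x2,   x1' = x0,   x2' = x0 AND x2
RULES: List[LocalRule] = [
    lambda s: s[1] & (1 - s[2]),
    lambda s: s[0],
    lambda s: s[0] & s[2],
]


def synchronous_step(state: State, rules: Sequence[LocalRule]) -> State:
    """Update every component at once from the same previous state."""
    return tuple(rule(state) for rule in rules)


def sequential_step(state: State, rules: Sequence[LocalRule],
                    order: Sequence[int]) -> State:
    """Update components one by one; later updates see earlier ones."""
    current = list(state)
    for i in order:
        current[i] = rules[i](tuple(current))
    return tuple(current)


def fixed_points(step: Callable[[State], State], n: int) -> List[State]:
    """Exhaustively list the states s with step(s) == s (2**n candidates)."""
    return [s for s in product((0, 1), repeat=n) if step(s) == s]


sync_fixed = fixed_points(lambda s: synchronous_step(s, RULES), len(RULES))
seq_fixed = fixed_points(
    lambda s: sequential_step(s, RULES, order=(2, 0, 1)), len(RULES)
)
```

Both lists contain the same two states, `(0, 0, 0)` and `(1, 1, 0)`: a state left unchanged by every local rule is left unchanged by any order of application, and conversely. The transient behaviour and the other attractors, by contrast, generally depend on the iteration mode, which is where the choice of operators matters.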