Team home page

Purpose

Objectives and background

Historically, the field of computational biology, at least from the more algorithmic point of view, started with the analysis of sequences, isolated sequences for many years, and then whole genomes when the sequencing projects started accelerating. Other topics of interest in the early years of the discipline addressed issues that were either sequence-related (for instance, molecular phylogeny) or were related to the analysis of other elementary biological objects (such as protein or RNA structures).

More recently, the concern with gaining a deeper understanding of how all such elementary biological objects interact in the general context of a genome, cell or organism has led to the development of whole new areas of investigation by computational biologists. The study of relations, although not altogether new to the field, has thus been in full bloom in the last few years. Such relations concern every element in a cell or organism. They may even concern extra-cellular elements, or elements that belong to the environment of an organism, as some may have an influence on the inner functioning of a living system. Investigating relations therefore requires studying what has been called integrative biology. This is the general objective of the present project.

In order to reach this objective, the project will rely on the mixed and complementary expertise (in mathematical modelling and formal analysis, data analysis, discrete and stochastic algorithmics, combinatorics, computational and evolutionary biology) of the various groups involved, some of which have a long history of collaborations (the French coordinator is an ex-student of the University of São Paulo) while other collaborations are more recent (between the UFMS and France) or completely new (between Chile and both France and Brazil).

Introduction

The lack of sufficient or sufficiently clean data has so far hindered the full growth of the areas of investigation within or related to integrative biology. Worse, however, the lack of good models has slowed down the development of radically new ways of considering such relations, and thus of considering biology at the global level of an organism, in a way similar to the revolution that genome sequencing brought to our view of genetics and molecular evolution.

The data come in many different forms, but three main categories may be distinguished: 1. sequence data, 2. expression data (this includes both transcriptomic and proteomic data), and 3. biochemical data (such as provided by the lists of reactions, enzymes and compounds composing the metabolism of an organism). Dealing adequately with this diversity requires expertise in different types of techniques, such as text algorithmics, combinatorial and statistical data analysis methods, and graph theory modelling and algorithmics.

Like many other issues in computational biology, obtaining good models is a particularly difficult problem, both because we lack general knowledge of the biological processes at play and so have no clear idea of where to look, or even what we are supposed to look for, and because we have at our disposal an often vast collection of partial and very specialised knowledge that may strongly bias our search for new information, or even prevent us completely from finding anything new. The search for good models - it is very unlikely that a unique one will exist - must therefore involve all the steps from 1. biological theory of what could be the important forces or constraints to consider to 2. the development of mathematical models that lead to 3. (theoretical) considerations of algorithmic complexity which may either 4. feed back into the biological formulations of the problems considered or 4'. lead to the elaboration of efficient algorithms, and then 5. to the application of such algorithms to data in a specific or systematic way, 6. analysis of the results obtained and 7. feedback to either the initial theory or the mathematical models derived from it. Upstream of this, the problem of obtaining the elementary pieces of information on what relations actually exist must be addressed.

In the context of this project, we shall concentrate our attention on a somewhat restricted view of the functioning of a cell or organism, the one provided by biochemical and evolutionary networks, and by the relation between the two.

This study will be conducted through the development of diverse models and methods of analysis (statistical, combinatorial, etc.). As much as possible, the models and methods will be systematically explored and confronted in an agnostic spirit with the final objective of tentatively addressing the questions above. Methodological development and a better understanding of biological phenomena will thus represent two processes that proceed together, side by side. The expected results first concern each such process taken separately: 1. getting at better mathematical models and algorithms for analysing the networks; and 2. answering both specific biological questions (hypothesis tests) and more general ones (exhaustive exploration of available data).

Concerning the latter, we shall try to address essentially three general biological questions, with the stress put on the first two, the third being a longer term goal: 1. are there regularities, structural and functional, in the diversity that is observed, regularities that could provide evidence of a deeper organisation of living organisms; 2. can we identify these regularities in a systematic fashion and thus manage to distinguish an order in the complex network of the observed interactions; finally, 3. how has this network been set up in the course of evolution, to accomplish what functions, and could it have evolved in a different way? By regularity, we mean the conservation of some elements, at the level either of the genome, or of the network of molecular interactions inside a cell. This conservation can be observed within the same organism (we then speak of approximate repetitions of parts of a genome or of a network of interactions), or among different organisms.

The simultaneous confrontation of our modelling attempts with biological data should also, above all, allow us to get a better grasp of the structure of living systems: is this structure simple or simplifiable into some general principles, or is life made essentially of exceptions?

Genome dynamics

It has long been known that genomes are not static. The work of Barbara McClintock in the late 1940s showing that genes could jump spontaneously from one site to another was a first clear sign of this. Jumping genes were called transposable elements by McClintock. Genes may also get duplicated. There is now undisputed evidence that the duplication may sometimes affect whole chromosomes or even genomes or, conversely, only pieces of a gene, in particular exons. It has thus been shown that, during evolution, DNA segments coding for modules or domains in proteins have been duplicated and rearranged through what has been called intronic recombination. By shuffling modules between genes, protein families have thus evolved. Genomic segments can be reversed, in general through ectopic recombination, or deleted. Chromosomes in multi-chromosomal organisms may undergo fusion or fission, or exchange genetic material with another chromosome, usually at their ends (translocation) or internally. Genetic material may also be transferred across sub-species or species (lateral transfer), thus leading to the insertion of new elements in a genome. Parts of a genome may be amplified, through, for instance, slippage resulting in the multiplication of the copies of a tandem repeat.

Although much is known about the dynamic behaviour of genomes, much more remains to be discovered about the forces and exact mechanisms behind such dynamics, its function and extent, the frequency of each type of rearrangement, and the impact genomic reorganisations may have on gene expression and genome development.

The main topics we shall address in this project are the following. In all cases, we shall use data provided by our biologist collaborators. These data concern mostly eukaryotic genomes.

Algorithms and complexity analysis for calculating a rearrangement distance between two or more genomes under various models. Classical methods of DNA sequence comparison assumed that sequences may only mutate by operations that act on individual nucleotides, i.e., substitutions, insertions, and deletions. More recently, additional studies considered large scale genome rearrangement events such as inversions, transpositions and translocations. We aim to broaden the theory of genome rearrangement in several directions, and to tighten the contact between the theoretical analysis and the real data that are gradually becoming available, or that have been available for a long time but have been little used, such as cytogenetics data. The key topics we shall study are algorithms for sorting by signed reversals, length-sensitive sorting by reversals, sorting by transpositions, handling duplicated genes, handling missing genes, handling multiple genomes, and preserving, while sorting, segments conserved by rearrangements (parts of a chromosome which are relatively stable under large-scale evolutionary events).
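To make the notion of a signed reversal concrete, here is a minimal sketch, assuming a genome is encoded as a list of signed integers (a signed permutation of 1..n) and compared with the identity. The function names `breakpoints` and `reversal` are ours, chosen for illustration; this is not one of the algorithms the project will develop. The number of breakpoints only gives a lower bound on the reversal distance, since one reversal removes at most two breakpoints.

```python
def breakpoints(perm):
    """Count breakpoints of a signed permutation of 1..n relative to the identity."""
    n = len(perm)
    extended = [0] + list(perm) + [n + 1]
    # An adjacency (a, b) is conserved when b == a + 1; this also covers
    # inverted blocks such as (-3, -2), because -2 == -3 + 1.
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)


def reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip the sign of each gene."""
    return perm[:i] + [-g for g in reversed(perm[i:j + 1])] + perm[j + 1:]


genome = [1, -3, -2, 4]           # hypothetical gene order with one inverted block
print(breakpoints(genome))        # 2
sorted_genome = reversal(genome, 1, 2)
print(sorted_genome)              # [1, 2, 3, 4]
print(breakpoints(sorted_genome)) # 0
```

In this toy case a single reversal removes both breakpoints, so the lower bound of one reversal is reached.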

Study of breakpoint regions of a genome. This consists in identifying as precisely as possible, and then analysing, the regions where rearrangements have broken the genome, and in trying to find characteristics that may make it possible to classify these regions according to the type of rearrangement that gave rise to them. The characteristics sought could be motifs or repeats such regions may contain, or some other features still to be determined. We intend to build methods for detecting such regions as accurately as possible. Then, by studying the distribution and length of these regions, we shall try to evaluate the validity of the fragile regions model, which asserts that there are evolutionary hotspots in the genome. This theory is under discussion in the scientific community, and still lacks a clear theoretical basis. We shall combine gene homology data and global genomic alignment data and provide reliable tools and analyses based on previous studies on rearrangements.

We are also interested in a particular subproblem of the previous one, namely the problem of performing alignments with rearrangements. Sequence alignments are widely studied for biological sequence comparison, but they consider only biological events such as mutations, insertions and deletions. Other biological events such as inversions or duplications are not automatically detected by the usual alignment algorithms. Some alternative strategies have been considered in an attempt to include inversions and other types of rearrangements. We plan to improve further on some initial results that have already been published in the community concerning this topic, and to generalise them to other types of alignments and objective functions.

Repetitions, recombinations and rearrangements. The objective is to design algorithms for identifying various types of repeats (transposons, satellites, etc.) and studying the relationship between repeats and recombination on the one hand, and repeats and regulation at the level of a whole genome on the other. We shall investigate new models, algorithms and indexes for identifying various types of repeats in a sequence. The work will start by attempting a typology of the various types of repeats that may be found in biological sequences. Mathematical models and efficient algorithms for their detection will then be investigated for some of these repeats.
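As a point of reference for the simplest kind of repeat, the following sketch indexes every substring of length k and reports those occurring at least twice. It is a minimal illustration under the assumption that repeats are exact and of fixed length; the function name `exact_repeats` and the toy sequence are ours. The repeats targeted by the project (approximate, tandem, dispersed) require richer models and indexes than this.

```python
from collections import defaultdict


def exact_repeats(sequence, k):
    """Return {k-mer: [start positions]} for every k-mer occurring at least twice."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return {kmer: positions for kmer, positions in index.items() if len(positions) > 1}


repeats = exact_repeats("ACGTACGTTT", 4)   # toy sequence, for illustration only
print(repeats)                             # {'ACGT': [0, 4]}
```

The index takes time and space linear in the sequence length for a fixed k, which is why k-mer indexing is a common first filter before more expensive approximate matching.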

Genetic and biochemical networks

It is now commonly accepted that the functioning and development of a living organism are controlled by the networks of interactions between its genes, proteins, and small molecules. Studying such networks and their underlying complexity is the main objective of this part. This objective hides a second one, no less crucial, which is to greatly improve the mathematical and algorithmic theory needed to accurately model, and then explore and analyse, highly complex living systems. Biochemical networks may represent protein-protein interactions, the metabolism of an organism, its system of gene expression regulation, or even mixed networks that contain information coming from several of the previous sources.

The amount and spread of the data now becoming available enable us also to introduce an evolutionary perspective into the study of living organisms, and in particular of biochemical networks. Evolution is a general underlying principle of life that allows us to compare and decipher the meaning and function of structure, the modification of biochemical pathways and networks, the preservation and variation of cell signalling systems, and so on. It thus serves to study the fundamental aspects of life, taking advantage for this of the exploratory and comparative possibilities provided, in particular, by the availability of an increasing number of whole sequences and interaction datasets from different genomes.

We shall be concerned with the following main topics.

Motifs and modules in biochemical networks. Modules are in general considered to be parts of a network that function in relative independence from other parts, while motifs are small patterns of interactions that are repeatedly found in the network. No fully satisfying or complete definition of motifs and modules in biochemical networks exists, and most of the work will consist in exploring the various definitions which may be considered (topological or other) and the algorithmic complexity they entail. For each, efficient data structures, filters and algorithms both for searching for known motifs and for inferring new ones in large networks will be developed. The definitions will of course vary depending on the type of biochemical network that is considered. The applications will be mostly to E. coli and yeast, for which the cleanest datasets exist in the literature. In the case of yeast, we shall also use data stored in the Yeastract database maintained by collaborators (Ana Teresa Freitas and Arlindo Oliveira) at the Instituto Superior Técnico, Lisbon, Portugal and curated by biologists from the group of Isabel Sá-Correia from the same institute (http://www.yeastract.com/).

The question of the statistical significance of the motifs identified will be of primary importance. This question is still open. An answer to it may depend on the definition of a random graph that is appropriate to the biological problem at hand, on the definition of a motif occurrence in such a network, and on how to calculate the probability of such motifs. This will be a more exploratory research activity that will be conducted in collaboration with statisticians (Sophie Schbath from the INRA and Stéphane Robin from the INA-PG, France) external to this project.
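The sketch below illustrates one possible way to phrase the significance question, under assumptions that are ours and not the project's: the motif is a feed-forward loop in the purely topological sense (edges a->b, b->c and a->c), and the null model draws directed graphs uniformly among those with the same number of nodes and edges. Both choices are exactly what the paragraph above leaves open; a null model preserving node degrees, for instance, could give a different answer. All function names are hypothetical.

```python
import random
from itertools import permutations


def count_ffl(edges):
    """Count feed-forward loops: a->b, b->c and a->c with a, b, c distinct."""
    edge_set = set(edges)
    successors = {}
    for a, b in edge_set:
        successors.setdefault(a, set()).add(b)
    count = 0
    for a, b in edge_set:
        if a == b:
            continue
        for c in successors.get(b, ()):
            if c != a and c != b and (a, c) in edge_set:
                count += 1
    return count


def random_digraph(n, m, rng):
    """Draw m distinct directed edges without self-loops, uniformly at random."""
    all_edges = list(permutations(range(n), 2))
    return rng.sample(all_edges, m)


def empirical_p_value(edges, n, trials=1000, seed=0):
    """Estimate P(random count >= observed count) under the uniform null model."""
    rng = random.Random(seed)
    observed = count_ffl(edges)
    m = len(set(edges))
    hits = sum(count_ffl(random_digraph(n, m, rng)) >= observed for _ in range(trials))
    # Add-one correction so that the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


network = [(0, 1), (1, 2), (0, 2), (2, 3)]   # toy network, for illustration only
observed, p_value = empirical_p_value(network, n=4)
print(observed, p_value)
```

Simulation of this kind is simple but costly on large networks, which is one reason why analytical results on motif counts in random graphs are of interest.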

Comparison and alignment of networks. Metabolic networks reflect the sum of an organism's chemical reactions, and their elucidation is key to the understanding of cellular processes as a whole. Such networks can be represented as labeled graphs and networks of processes, thus making them amenable to algorithmic analyses of several kinds. Our objective is to combine methods for computational analysis and simulation of these structures with experimental work that reveals the (kinetic and other) parameters that are required to characterise the behaviour of these systems, in order to allow life science researchers to better understand how metabolic networks function.

We aim to provide researchers with systematic and predictive means to do their work. These include the ability to compare the metabolisms of a variety of organisms as well as similar processes within the same organism, the provision of tools and methods to do both static and dynamic analyses of networks, and the ability to reconstruct complex subnetworks from their constituents. Our main application will be to the metabolic networks of symbionts (symbionts are organisms living in symbiosis, that is, in a close association in a community). We shall collaborate for this with a laboratory of experimental biologists from the INSA at Lyon (team of Hubert Charles).

In the long term, we plan to extend the study to other cases, such as genetic or mixed networks.

Co-evolution of genomes and of metabolic/genetic networks. Preliminary studies have started revealing the links existing between operons (groups of closely located genes controlled as a unit to produce messenger RNA) in prokaryotes and the sets of reactions that are close in the metabolic network of an organism. Besides the fact that such studies have not yet been conducted in a systematic manner, operons probably do not exist in eukaryotic organisms. It remains an open question whether there is nevertheless a link between genomic organisation and metabolism in such organisms as well, perhaps related to regulons (collections of genes under regulation by the same regulatory protein - operons are a subclass of regulons). We wish to address this problem in the present project. In general, gene regulation is doubtless an important element for establishing a possible link between genome organisation and metabolism. The introduction of information pertaining to regulation will also be one of our concerns when studying the evolution of biological networks. More precisely, the general questions to which we would like to provide some initial answers are:
1. how does an organism acquire new functions, does it happen by the duplication of a set of genes and then evolution (specialisation) of their expression mode?
2. is the organisation of a genome optimised for one (or more) metabolic function(s) (synthesis of essential molecules, production of energy) and to what extent is the metabolism constrained by the organisation of the genome?
3. what are the links between genome organisation and gene network, is there a compromise between organisation and disorganisation (where by organisation, we mean the optimisation of the capacities of an organism and by disorganisation the capacity of an organism to evolve in a changing environment)?
4. given a genome, what metabolism or gene network can we have and inversely, given a metabolism or gene network, what genome(s) can we have?

Exploring the relation between gene expression and some environmental factors. The main goal of a recently developed field of research called pharmacogenomics is to predict the effect drugs may have based on the genomic information of a patient. Using microarray technologies, this may help improve both diagnosis and the dosage policy to be adopted. A few research structures are now ready to simultaneously screen the transcriptome and the genome of patients in order to reveal possible correlations with drug effects, and to consider adapting the medical protocols in the light of this new information. However, it is not easy to extract knowledge from the amount of data this entails. To try to address this issue, we propose to develop new methods that use Bayesian networks. Bayesian networks make it possible to represent causal relations (through a directed acyclic graph, or DAG), and then to estimate state probabilities. This method has already been applied to microarray data. Our contribution will be to append clinical information to the network and to specify a model that takes pharmacogenomic constraints into account. This should enable more accurate predictions. It will also be interesting to compare the information on the different diseases (by comparing the corresponding Bayesian networks), which may reflect some cancer mechanisms. This represents an original approach, which could then be applied to other biological systems. We shall use here data provided by the medical groups inside the French laboratory partner of this project.
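To show what estimating state probabilities over a DAG means in the smallest possible case, here is a sketch of a three-node Bayesian network with edges genotype -> expression, genotype -> response and expression -> response. The structure, the variable names and every probability below are invented for illustration and carry no biological or clinical meaning; inference is done by brute-force enumeration, which is only feasible for very small networks.

```python
# Hypothetical conditional probability tables (illustrative numbers only).
P_VARIANT = 0.3                         # P(G = 1): patient carries the variant
P_HIGH_EXPR = {0: 0.2, 1: 0.7}          # P(E = 1 | G): gene highly expressed
P_RESPONSE = {(0, 0): 0.1, (0, 1): 0.5,
              (1, 0): 0.3, (1, 1): 0.8} # P(R = 1 | G, E): patient responds to drug


def joint(g, e, r):
    """Joint probability P(G=g, E=e, R=r), factorised along the DAG."""
    pg = P_VARIANT if g else 1 - P_VARIANT
    pe = P_HIGH_EXPR[g] if e else 1 - P_HIGH_EXPR[g]
    pr = P_RESPONSE[(g, e)] if r else 1 - P_RESPONSE[(g, e)]
    return pg * pe * pr


def posterior_variant(r):
    """P(G = 1 | R = r), obtained by summing out the expression variable."""
    numerator = sum(joint(1, e, r) for e in (0, 1))
    evidence = sum(joint(g, e, r) for g in (0, 1) for e in (0, 1))
    return numerator / evidence


print(round(posterior_variant(1), 4))   # 0.6075
```

With these made-up tables, observing a response raises the probability of the variant from the prior 0.3 to about 0.61. Appending clinical variables, as proposed above, would amount to adding nodes and tables to such a factorisation.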

Modelling and analysing genetic networks. The life of an organism depends on many metabolic pathways that are regulated by gene expression networks. The mechanism of pathway regulation involves a complex system with many forward and feedback signals. These signals are RNA, produced by gene expression, and protein complexes, produced by the interaction of proteins built by translation of mRNA. Protein complexes act as feedback signals that control gene transcription. Forward signals, in the form of enzymes, control metabolic pathways. In such networks, the expression of each gene depends both on its own expression and on the expression levels of other genes at previous time instants. This complex network of interactions can thus be modelled by a dynamical system. Finite dynamical systems, discrete in time and finite in range, can model the behaviour of gene expression networks. In such a model, each transcript is represented by a variable that takes the expression value of that transcript. All these variables, taken collectively, are the components of a vector called the state of the system. Each component (i.e. transcript) of the state vector has an associated function that calculates its next value (i.e. expression value) from the state at previous time instants. These functions are the components of a function vector, called the transition function, which defines the transition from one state to the next and represents the gene regulation mechanisms. Gene expression networks are modelled here as stochastic processes: the transition function is stochastic, and the resulting model belongs to a particular family of Markov chains called probabilistic genetic networks (PGNs). Building upon some initial work from one of the partners in this proposal, we propose, in interaction with some of the other partners, to extend this model by considering the equivalence between linear combinations of inputs. This should improve the results obtained, since the estimation errors will diminish and the hypothesis is quite consistent with observed gene dynamics. Besides, this model will make it possible to distinguish between inhibitory and excitatory signals. The proposed approach explores pattern recognition techniques such as feature selection to identify the network connections using entropy and mutual information.
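The feature selection step can be illustrated by the following sketch, which scores each gene at time t by its empirical mutual information with a target gene at time t+1 and keeps the best-scoring one. This is a simplified illustration and not the PGN estimation procedure itself: it assumes binary expression states, a single predictor per target, and a toy time series that we constructed so that g2 copies g1 with a delay of one step. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


def best_predictor(series, target):
    """Score each gene at time t against the target gene at time t + 1."""
    future = series[target][1:]
    scores = {gene: mutual_information(values[:-1], future)
              for gene, values in series.items()}
    return max(scores, key=scores.get), scores


# Toy discretised time series (0 = low, 1 = high), for illustration only.
series = {
    "g1": [0, 1, 0, 1, 1, 0, 0, 1],
    "g2": [0, 0, 1, 0, 1, 1, 0, 0],   # g2 at t + 1 equals g1 at t
    "g3": [1, 1, 1, 0, 0, 1, 0, 1],
}
best, scores = best_predictor(series, "g2")
print(best)   # g1
```

Because g2 is fully determined by the previous value of g1, the score of g1 equals the entropy of g2, which is the largest value the mutual information can take here. With real data the estimates are noisy, so short series call for some correction or significance assessment.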

Towards inferring genetic networks through an integrated approach. Recent studies have demonstrated that biological networks display nonrandom characteristics, among which we highlight the modular architecture. We are interested in the modular organisation of transcriptional regulation networks (TRN), which model the interactions between genes and proteins that control their expression at the transcriptional level. Understanding the mechanisms of transcriptional regulation is crucial to explaining the morphological and functional diversity of cells. We propose to address the problem of identifying transcriptional regulation modules, i.e. groups of coregulated genes and their regulators. One important distinction of our work is that we propose to identify modules that are evolutionarily conserved. From the biological point of view, the proposed approach is supported by three main premises:
1. coregulated genes are bound by common regulatory proteins (transcription factors - TFs) and so they must present common sequence patterns (motifs) in their regulatory regions, which correspond to the binding sites of those TFs;
2. coregulated genes respond coordinatedly to certain environmental or growth conditions, and so they must be coexpressed under those conditions;
3. since transcriptional modules may involve genes that interact at the protein level, they are subject to more evolutionary constraints and therefore they may be evolutionarily conserved.
We thus define transcriptional regulation metamodules (TRMMs) as groups of genes sharing regulatory motifs and displaying coherent context-specific expression behaviour consistently across species. From the methodological perspective, we note that the incompleteness and elevated noise levels of currently available data impose severe limitations on the reliability of the conclusions that can be drawn through the analysis of one data type in isolation. Therefore we propose to analyse heterogeneous experimental data concerning several species simultaneously, with emphasis on genomic sequence and gene expression data.

Integrating genetic regulation and metabolism. A considerable effort has been dedicated to the elucidation of the mechanisms that determine the process of cellular metabolism. This process is essential for the cell since it allows the production of energy, precursors or macro-molecules. On the other hand, although the understanding of the cellular metabolism is very important, it is also essential to understand how the cell is able to adapt in response to changing environments such as nutrient excess, starvation and other stresses. In those cases, the processes that govern the adaptation of the cell take place at the genetic network level, which in some cases operates on time scales different from those of the metabolic network. This poses an important mathematical challenge. However, the expression of genes produces enzymes which are used in the metabolic network. Therefore, this expression determines the terminal velocities that the metabolic network fluxes can reach. On the other hand, internal metabolite concentrations can regulate the expression of the genes of the metabolic network, influencing in this way the genetic regulation network. There is a substantial amount of data that allows the identification of the underlying regulatory network of particular cells. Theoretical modelling, together with simulations and computational approaches, provides an extremely useful framework for integrating data and gaining insights into the dynamics and functional properties of such networks: fixed points, transient times, the best iteration procedure to get the right attractors, etc. Over the last few years several mathematical models of cellular processes have been developed, among which a unique discrete model of gene regulation of metabolic fluxes in E. coli. The model represents both genetic regulation and metabolism in an integrated form, which is how they function biologically. This is one of the first attempts to include both of these networks in one model.
Given an initial network which admits a finite number of fixed points (the interesting attractors for genetic networks), we shall try to
1. determine families of operators (new local iteration modes) acting on the network such that, for the new network, the set of fixed points is the same;
2. study the dynamics of such operators and apply this to build specific networks related to genetic-metabolic interaction;
3. develop the notion of filter which is the iterated application of local operators that changes the network without changing the fixed points.
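The idea of changing the iteration mode without changing the fixed points can be illustrated on a toy Boolean network. In the sketch below, the three update rules are hypothetical and chosen only for illustration; we compare the parallel mode (all genes updated at once) with a sequential mode (genes updated one at a time, each seeing the values already updated). This is a standard example of two iteration modes sharing their fixed points and is not the family of operators the project intends to construct.

```python
from itertools import product

# Hypothetical 3-gene Boolean network: each rule maps the full state
# to the next value of one gene.
RULES = [
    lambda s: s[1],          # gene 0 copies gene 1
    lambda s: s[0],          # gene 1 copies gene 0
    lambda s: s[0] & s[1],   # gene 2 is on when genes 0 and 1 are both on
]


def parallel_step(state):
    """Update all genes simultaneously from the current state."""
    return tuple(rule(state) for rule in RULES)


def sequential_step(state, order=(0, 1, 2)):
    """Update genes one at a time; later updates see earlier ones."""
    current = list(state)
    for i in order:
        current[i] = RULES[i](tuple(current))
    return tuple(current)


def fixed_points(step, n=3):
    """Enumerate all states left unchanged by one application of step."""
    return {s for s in product((0, 1), repeat=n) if step(s) == s}


print(sorted(fixed_points(parallel_step)))     # [(0, 0, 0), (1, 1, 1)]
print(sorted(fixed_points(sequential_step)))   # [(0, 0, 0), (1, 1, 1)]
print(parallel_step((0, 1, 0)))                # (1, 0, 0)
print(sequential_step((0, 1, 0)))              # (1, 1, 1)
```

Both modes have the same two fixed points, but the rest of the dynamics differs: in parallel mode the state (0, 1, 0) moves to (1, 0, 0) and then enters a cycle of length two, whereas in sequential mode it reaches the fixed point (1, 1, 1) in one step. Brute-force enumeration is exponential in the number of genes and is used here only because the example is tiny.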