High-throughput computing - GriPPS
Genomic data acquisition programs, such as full-genome sequencing projects, are producing ever greater amounts of data. Analyzing these raw biological data requires very large computing resources. Functional sites and signatures of proteins are very useful for analyzing these data or for correlating different kinds of existing biological data. Methods based on them are applied, for example, to identify and characterize the potential functions of newly sequenced proteins, or to cluster the sequences contained in international databanks into protein families.
The sites and signatures of proteins can be expressed using the syntax defined by the PROSITE databank, that is, written as a "protein regular expression". Searching for such a site in a sequence can be done by requiring an exact match between the searched pattern and the region found in the sequence. Most of the time, this kind of analysis is quite fast. However, in order to identify sites that do not match perfectly but are still biologically relevant, the user can accept a certain level of error between the searched pattern and the matching region. Analyses of this kind can be very resource-consuming.
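The difference between the two search modes can be illustrated with a minimal Python sketch. It is not the GriPPS implementation: the function names are hypothetical, only a simplified subset of the PROSITE syntax is handled (single residues, `x`, `[...]`, `{...}` and repeat counts, without the `<` and `>` anchors), and "error" is assumed here to mean a number of mismatching positions in a fixed-length pattern. The example pattern is the PROSITE N-glycosylation site, N-{P}-[ST]-{P}.

```python
import re

# One pattern element: x, a residue, an allowed set [..] or a forbidden set {..},
# optionally followed by a repeat count such as (2) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def parse_pattern(pattern):
    """Split a PROSITE-style pattern into (char_class, min_rep, max_rep) tuples."""
    elements = []
    for token in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(token)
        if m is None:
            raise ValueError(f"unsupported pattern element: {token!r}")
        core, lo, hi = m.groups()
        if core == "x":
            char_class = "[A-Z]"                  # any residue
        elif core.startswith("{"):
            char_class = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            char_class = core                     # a residue or an allowed set
        lo = int(lo) if lo else 1
        hi = int(hi) if hi else lo
        elements.append((char_class, lo, hi))
    return elements


def to_regex(pattern):
    """Translate the pattern into an ordinary regular expression string."""
    return "".join(f"{c}{{{lo},{hi}}}" for c, lo, hi in parse_pattern(pattern))


def find_exact(pattern, sequence):
    """Return (start, matched_region) for every exact, possibly overlapping, match."""
    regex = re.compile("(?=(" + to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


def find_with_mismatches(pattern, sequence, max_mismatches):
    """Slide a window over the sequence and accept up to max_mismatches bad positions.

    Assumption of this sketch: the pattern has a fixed length (no variable repeats).
    """
    positions = []
    for char_class, lo, hi in parse_pattern(pattern):
        if lo != hi:
            raise ValueError("variable-length repeats are not supported in this sketch")
        positions.extend([re.compile(char_class)] * lo)
    width = len(positions)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            1 for test, residue in zip(positions, window) if not test.fullmatch(residue)
        )
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


sequence = "MKNGSAANPTQ"  # toy sequence, made up for this example
print(find_exact("N-{P}-[ST]-{P}.", sequence))               # [(2, 'NGSA')]
print(find_with_mismatches("N-{P}-[ST]-{P}.", sequence, 1))  # [(2, 'NGSA', 0), (7, 'NPTQ', 1)]
```

The exact search delegates to a single regular expression scan, whereas the tolerant search has to score every window of the sequence. This gives an idea of why accepting errors is more expensive, even in this simplified form.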
In some cases, laboratories cannot afford such large analyses because they lack sufficient computing and storage resources, skilled staff, or technical expertise. Grid computing may be a viable solution to the needs of the genomic research field: it can give scientists transparent access to large computational and data management resources.
The Grid Protein Pattern Scanning (GriPPS) project, funded by the French ACI GRID 2002 program, aims to develop and adapt these bioinformatics algorithms so that they can exploit the underlying grid infrastructure. Models of these algorithms will be devised to predict their behavior on a grid platform, and proposals will be written for adapting other bioinformatics algorithms to the grid. Within this context, we propose to study such algorithms in order to identify the constraints related to the biological applications and to determine their granularity and the parallelization schemes that can be applied to them. This will lead us to examine classical problems in the field of grid computing, such as job scheduling, resource allocation and discovery, and network quality of service, as applied to our specific needs.
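As an illustration of what granularity means for this kind of application, the following Python sketch shows one possible data-parallel scheme: the databank is cut into chunks of sequences and each chunk is scanned independently. This is only a sketch under stated assumptions, not the scheme chosen by the project. The function names, the toy databank and the chunk size are hypothetical, and a local thread pool stands in for grid worker nodes, so it shows the decomposition rather than any real speed-up.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Cut the databank into consecutive chunks; chunk_size sets the task granularity."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def scan_chunk(regex, chunk):
    """Scan one chunk of (identifier, sequence) pairs; return the identifiers that match."""
    compiled = re.compile(regex)
    return [seq_id for seq_id, seq in chunk if compiled.search(seq)]


def scan_databank(regex, databank, chunk_size, workers=4):
    """Scan all chunks independently and merge the partial results in databank order."""
    chunks = split_into_chunks(databank, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The thread pool is a stand-in for remote grid nodes (assumption of this sketch).
        partial_results = list(pool.map(lambda chunk: scan_chunk(regex, chunk), chunks))
    return [seq_id for partial in partial_results for seq_id in partial]


# Toy databank, made up for this example.
databank = [("seq1", "MKNGSAANPTQ"), ("seq2", "AAAAAA"), ("seq3", "QQNATLL")]
print(scan_databank("N[^P][ST][^P]", databank, chunk_size=2))  # ['seq1', 'seq3']
```

Small chunks give many short tasks that are easy to balance across resources but carry more scheduling and data transfer overhead, while large chunks do the opposite. Choosing this size is one concrete instance of the granularity question raised above.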