A Coq-certified language for the provenance of data from bioinformatics analyses

by Rébecca Zucchini

PhD thesis project in Computer Science

Supervised by Véronique Benzaken, Sarah Cohen-Boulakia, and Chantal Keller.

Thesis in preparation at Université Paris-Saclay, within the doctoral school Sciences et technologies de l'information et de la communication, in partnership with the Laboratoire de Recherche en Informatique (laboratory), the VALS team (Vérification d'Algorithmes, Langages et Systèmes), and the Faculté des sciences d'Orsay (referent), since 01-09-2019.


  • Abstract

    Understanding biological phenomena (for example, the role of a set of genes in the development of a disease) requires collecting very large amounts of data, which must then be compared, cross-referenced, and combined with other datasets. These data can be analysed through scientific workflows chaining together numerous bioinformatics tools and analysis scripts. Given the volumes of data and the multitude of available bioinformatics tools, the reproducibility of these analyses becomes a problem, and ensuring it requires precise knowledge of the provenance of the data. Capturing this provenance can mean storing all the intermediate data generated during an analysis (infeasible in most use cases), or being able to replay an analysis identically, which is difficult in a context where tool versions change quickly. The goal of this thesis is to formally guarantee, by means of tools certified with proof assistants, the provenance of the data involved in an analysis, and thereby to guarantee reproducibility properties of bioinformatics analyses. To this end, we will follow a skeptical approach: workflows will be instrumented to produce traces (certificates) which, once checked by a certified tool, guarantee the reproducibility of the results. This requires designing certificates that are as concise as possible while still guaranteeing reproducibility, and that are robust to changes in the workflows. These certificates will be based on the provenance annotations proposed in the scientific literature of the field to guarantee the origin of data produced by analyses.
    As a first step, we will focus on data transformations and their provenance for languages based on relational algebra (with extensions), which are mostly implemented by means of the SQL language extended with additional features. To capture the essence of these transformations (more abstractly than their implementation), a high-level specification will be defined and proved for the various operators of the algebra. These results will then be extended to more expressive data transformation processes, as encountered in real bioinformatics data analyses. Scientific workflows will be instrumented to produce a trace whose reproducibility can be guaranteed by the Coq-certified tool. A formal workflow specification language will be defined, making it possible to establish properties of workflows and, in particular, to prove the equivalence of (parts of) workflows in order to guarantee reproducibility. The detailed subject of this thesis is available at: https://www.lri.fr/~keller/provenance-certifiee.pdf

  • Translated title

    A Coq-certified language for the provenance of data generated by bioinformatics analyses


  • Translated abstract

    Societal, economic and industrial context

    Understanding the mechanisms of the living (cell activity, genes linked to a disease) today depends heavily on advances in multiple fields, including mathematics and computer science. In particular, genome sequencing has recently undergone a real technological revolution: while 12 years were needed to sequence the first human genome, at an estimated cost of $10,000 per megabase (genome fragment), in 2019 a single machine can sequence several hundred human genomes in a week for $0.03 per megabase. As a result, the amount of sequencing data generated doubles every 5 months. These raw data alone are not enough, however, to understand the mechanisms of the living: further analyses need to be carried out, crossing several datasets and involving a great many bioinformatics tools. Given the growing amount of raw data produced, the number of available analysis tools, and the complexity of the analyses to be carried out, it has become very difficult to reproduce a bioinformatics analysis. This lack of reproducibility can have particularly serious consequences, especially when preclinical studies are at stake, since these analyses are the basis of future therapies; a recent study estimated the cost of the lack of reproducibility of preclinical studies at billions of dollars a year. A series of initiatives has therefore emerged, notably among major scientific publishers such as Nature and Science, to encourage authors of scientific articles to always provide the precise provenance of their data. Two major problems then arise. First, the volume of data needed to ensure reproducibility is substantial when it comes to storing all the data generated and consumed at each of the many stages of an analysis, representing up to several tens of terabytes per analysis; the cost of storing all of this data, and its environmental impact, are considerable.
    Only raw (input) data are therefore often kept. The second problem is being able to replay the analysis from the stored data in order to reproduce the result; one must then face the rapid evolution of tools and the difficulty of determining whether two (versions of) tools are close enough to ensure the reproducibility of an analysis. The originality of this thesis is to ensure the reproducibility of bioinformatics analyses by means of certification.

    Scientific context

    Data analysis processes (also called scientific workflows) can be described in multiple languages. We will begin this work with languages based on relational algebra (with extensions), which are mostly implemented by means of the SQL language extended with additional features. In the relational setting, many models have been defined over the last fifteen years to represent the origin of data. A unifying framework has been introduced in which data are stored in relations (tables) and each row carries an annotation (its provenance). When data are combined during an analysis (selection of data, combination (join) of data from several tables), the provenance of the result is represented as a combination of the initial provenances. More formally, all existing provenance frameworks can be represented uniformly by means of semirings. The aim of this thesis is to propose a semantic framework to specify bioinformatics analyses and to define their traces, in order to guarantee, through certification in Coq of the data transformation steps, the correctness of the provenance of the data they produce. The approach will then be validated on a collection of analysis processes over biological data (scientific workflows).
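    The semiring structure mentioned above can be written down directly in Coq. The following is a minimal, hypothetical sketch (record and field names are illustrative, not taken from the thesis) of a commutative semiring of annotations K, bundling the carrier's operations with the laws they must satisfy:

```coq
(* Illustrative sketch: a commutative semiring of provenance
   annotations. Later fields may refer to earlier ones, so the
   laws are stated over the record's own operations. *)
Record semiring (K : Type) : Type := {
  zero : K;
  one  : K;
  add  : K -> K -> K;
  mul  : K -> K -> K;
  add_comm  : forall a b, add a b = add b a;
  add_assoc : forall a b c, add a (add b c) = add (add a b) c;
  add_zero  : forall a, add zero a = a;
  mul_comm  : forall a b, mul a b = mul b a;
  mul_assoc : forall a b c, mul a (mul b c) = mul (mul a b) c;
  mul_one   : forall a, mul one a = a;
  mul_distr : forall a b c, mul a (add b c) = add (mul a b) (mul a c);
  mul_zero  : forall a, mul zero a = zero
}.
```

    Instantiating K with the booleans (or, and) recovers set semantics, while the natural numbers (+, ×) recover bag semantics, where an annotation is a tuple's multiplicity.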
    The objective is then to tackle the two problems introduced above: (i) propose lightweight execution traces, less bulky than the full data associated with an analysis result, and (ii) provide a mechanized way to determine whether a new variant of an analysis still ensures reproducibility. Beyond providing strong guarantees, formalization by means of a proof assistant requires a level of abstraction that will bring a new perspective to the field of the provenance of biological data and bioinformatics analyses. This will take concrete form, in particular, in the definition of a high-level specification of the provenance operators and an associated query language (first year); a formal language for specifying more expressive data analysis processes (scientific workflows) will be studied in a second phase (second and third years).

    Approach

    A well-established approach for obtaining strong correctness guarantees is to rely on proof assistants such as Coq to produce deep specifications, leading ultimately to code that is correct by construction. Starting from the theoretical work on provenance semirings from the database community on the one hand, and from existing Coq formalizations of the relational model, the SQL language, and polynomial rings on the other, we propose to certify, by means of the Coq proof assistant, the results on provenance algebras and to extend them to account for richer data transformations. As a first step, the candidate will work on the formalization of an algebra of annotated relations, the K-relations, and on the generalization of the semantics of relational algebra to them. In this relational algebra, tuples are annotated by elements of a commutative semiring K; the operators of the algebra combine these annotations through the operations of K, in a process similar to abstract interpretation.
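    The K-relations described above admit a very direct functional reading: a relation annotated in K maps each tuple to its annotation. The following hypothetical Coq sketch (the names and the elision of the finite-support condition are our own simplifications) shows how union, join over a common schema, and selection combine annotations through the semiring operations, in the spirit of the annotated semantics discussed above:

```coq
(* Illustrative sketch of K-relations: a relation over tuples T,
   annotated in a semiring K, is a function T -> K (finiteness of
   support elided). *)
Section KRelations.
  Variables (T K : Type) (kzero : K) (kadd kmul : K -> K -> K).

  Definition krel := T -> K.

  (* Union: a tuple's annotation is the sum of its annotations. *)
  Definition kunion (r s : krel) : krel :=
    fun t => kadd (r t) (s t).

  (* Join over a common schema: annotations multiply. *)
  Definition kjoin (r s : krel) : krel :=
    fun t => kmul (r t) (s t).

  (* Selection by a Boolean predicate p: tuples failing p are
     annotated with 0, i.e. absent. *)
  Definition kselect (p : T -> bool) (r : krel) : krel :=
    fun t => if p t then r t else kzero.
End KRelations.
```

    With these definitions, properties of the annotated operators (commutativity of union, distributivity of join over union) reduce to the semiring laws on K.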
    This work will be complemented by the definition of a query language extending SQL with provenance annotations, semantically equivalent to the algebra described above. This language will make it possible to obtain provenance information for simple data transformations. To capture the essence of these transformations (more abstractly than their implementation), a high-level specification will be defined and proved for the various operators of the algebra. In the remainder of the thesis, this work will be extended to more expressive data transformations, as encountered in real bioinformatics data analyses. To that end, scientific workflows will be instrumented to produce a trace whose reproducibility will be guaranteed by the Coq-certified tool: this certification process will follow a skeptical approach, and will therefore be independent of the tools appearing in the workflows. To verify these traces, it will be necessary to define a formal workflow specification language, allowing properties to be established on workflows and the equivalence of (parts of) workflows to be proved, as required to ensure reproducibility.
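    As a minimal illustration of what such a workflow specification language might look like, the following hypothetical Coq sketch (all names and types are our own, not the thesis's design) models a workflow as a sequence of data transformation steps and defines workflow equivalence as equality of input/output behaviour, the property invoked above to justify reproducibility:

```coq
(* Illustrative sketch: workflows as sequences of steps over an
   abstract data type, and semantic equivalence between them. *)
Section Workflows.
  Variable data : Type.

  Inductive workflow : Type :=
    | wid   : workflow                                 (* empty workflow *)
    | wstep : (data -> data) -> workflow -> workflow.  (* one tool, then the rest *)

  (* Running a workflow threads the data through each step. *)
  Fixpoint run (w : workflow) (d : data) : data :=
    match w with
    | wid        => d
    | wstep f w' => run w' (f d)
    end.

  (* Two (parts of) workflows are equivalent when they denote the
     same function on data. *)
  Definition equiv (w1 w2 : workflow) : Prop :=
    forall d, run w1 d = run w2 d.
End Workflows.
```

    Proving that replacing a tool by a new version preserves `equiv` is then one possible formal reading of "the new variant of an analysis still ensures its reproducibility".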