Thesis in progress

Representations for large-scale statistical learning in genomics

Author: Romain Menegaux
Supervisor: Jean-Philippe Vert
Type: Thesis project
Discipline(s): Bioinformatics
Date: Enrolled in the doctoral program on 1 October 2017
Institution(s): Université Paris sciences et lettres
Doctoral school(s): Ecole doctorale Ingénierie des Systèmes, Matériaux, Mécanique, Énergétique (Paris)
Research partner(s): Laboratory: Centre de Bio-informatique
Enrolling institution: Université de Recherche Paris Sciences et Lettres (2015-2019)

Abstract

The cost of DNA sequencing has fallen by a factor of 100,000 over the last 10 years. Sequencing is now so cheap that it has become a routine technique for characterizing the genomic content of biological samples, with numerous applications in health, food, and energy. A typical DNA sequencing experiment outputs billions of short sequences, called reads, of length 100 to 300 over the {A,C,G,T} alphabet. These reads are then processed and analyzed automatically to extract biological information, such as the presence of particular bacterial species in a sample or of a specific mutation in a cancer. As sequencing throughput continues to increase rapidly, computation is quickly becoming the major bottleneck in many applications. The goal of this PhD project is to advance the state of the art and propose new solutions for efficiently storing and analyzing the billions of reads produced by each experiment.