Representations for large-scale statistical learning in genomics
Author: | Romain Menegaux |
Supervisor: | Jean-Philippe Vert |
Type: | Thesis project |
Discipline(s): | Bioinformatics |
Date: | Doctoral enrollment on 1 October 2017 |
Institution(s): | Université Paris sciences et lettres |
Doctoral school(s): | Ecole doctorale Ingénierie des Systèmes, Matériaux, Mécanique, Énergétique (Paris) |
Research partner(s): | Laboratory: Centre de Bio-informatique |
Institution handling enrollment: Université de Recherche Paris Sciences et Lettres (2015-2019) |
Keywords
Free keywords
Abstract
The cost of DNA sequencing has been divided by 100,000 in the last 10 years. It is now so cheap that it has quickly become a routine technique for characterizing the genomic content of biological samples, with numerous applications in health, food, and energy. The output of a typical DNA sequencing experiment is a set of billions of short sequences, called reads, each 100 to 300 letters long over the {A,C,G,T} alphabet. These billions of reads are then processed and analyzed automatically by computers to extract biological information, such as the presence of particular bacterial species in a sample or of a specific mutation in a cancer. As the throughput of DNA sequencing continues to increase rapidly, computation is quickly becoming the major bottleneck in many applications that rely on it. The goal of this PhD project is to advance the state of the art and propose new solutions for efficiently storing and analyzing the billions of reads produced by each experiment.