PhD thesis project in Computer Science, Data, and AI
Under the supervision of Fabian Suchanek and Antoine Amarilli.
Thesis in preparation at the Institut polytechnique de Paris, within the Ecole Doctorale de l'Institut Polytechnique de Paris, in partnership with LTCI - Laboratoire de Traitement et Communication de l'Information (laboratory), DIG – Data, Intelligence and Graphs (research team), and Télécom Paris (institution where the thesis is prepared), since 15-09-2017.
Datalog is a declarative logical language for expressing deduction rules on databases. The language consists of rules and ground facts. An example of a rule is divide(a,b) :- divide(a,c), divide(c,b), which states that, for all a, b, and c, whenever a divides c and c divides b, then a divides b. The ground fact divide(2, 4) indicates that 2 divides 4. While Datalog programs can be written by hand, they can also be extracted or learned automatically from data sources: this is an efficient and scalable way to help database systems deduce missing information. One challenge of automated approaches, however, is that they often lead to errors and contradictions in the program or its derivations. This motivates the topic of this PhD proposal: developing automated approaches to find and correct mistakes in Datalog programs, which we call automated Datalog debugging. // For more details, see the attached PDF
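To make the semantics of such a rule concrete, here is a minimal sketch (in Python, with names of our own choosing) of bottom-up evaluation: the transitivity rule from the example is applied to a set of divide/2 ground facts until no new fact can be derived.

```python
# Bottom-up evaluation of the rule from the text:
#   divide(a, b) :- divide(a, c), divide(c, b)
# Illustrative sketch only; the representation of facts as pairs is our own.

def evaluate(facts):
    """Saturate a set of divide/2 facts under the transitivity rule."""
    derived = set(facts)
    while True:
        new = {(a, b)
               for (a, c) in derived
               for (c2, b) in derived
               if c == c2 and (a, b) not in derived}
        if not new:
            return derived
        derived |= new

facts = {(2, 4), (4, 8)}
print(sorted(evaluate(facts)))  # → [(2, 4), (2, 8), (4, 8)]
```

The fact (2, 8) is derived by one application of the rule to the ground facts (2, 4) and (4, 8); a second pass derives nothing new, so evaluation stops.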
Debugging Datalog programs
We take as input a Datalog program and a gold standard specifying a set of positive examples (facts that should be derived) and negative examples (facts that should not be derived). Whenever a positive example is not derived, or a negative example is derived, we want to determine the minimal sets of rules and ground facts that must be modified to fix the problem, for a notion of minimality that depends on the use case. Of course, it can happen that the same rule is both necessary to derive some positive examples and responsible for the derivation of some negative examples. When this happens, there are several possibilities:

• We can loosen our correctness and completeness requirements, and accept that not all positive examples are derived, or that some negative examples may be derived. In this case, multiple tradeoffs are possible, depending on whether we want to optimize precision or recall. One interesting question in this context would be to show lower bounds on the size of any Datalog program that perfectly discriminates between the positive and negative examples.

• We can adopt heuristics and define notions of usefulness for rules, to identify the rules that it would be most beneficial to repair. For instance, one naive such measure would be the number of positive examples that a rule derives, minus the number of negative examples it derives. A natural question is how to refine this metric, and how efficiently it could be computed, using provenance notions.
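One possible reading of the naive usefulness measure above can be sketched as follows: score a rule by the positive examples it contributes minus the negative examples it contributes, estimated here by leave-one-out evaluation. This is a hypothetical formulation with names of our own choosing; the proposal does not fix a concrete definition, and provenance-based approaches could compute such scores without re-evaluation.

```python
# Hypothetical sketch of a rule-usefulness score: positives lost minus
# negatives lost when the rule is removed from the program.

def fixpoint(facts, rules):
    """Saturate `facts` under `rules`; each rule maps a fact set to new facts."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(derived) - derived
            if new:
                derived |= new
                changed = True
    return derived

def usefulness(rule, rules, facts, positives, negatives):
    """Positive examples lost minus negative examples lost without `rule`."""
    with_rule = fixpoint(facts, rules)
    without = fixpoint(facts, [r for r in rules if r is not rule])
    pos_gain = len(positives & with_rule) - len(positives & without)
    neg_gain = len(negatives & with_rule) - len(negatives & without)
    return pos_gain - neg_gain

# Example: the transitivity rule over divide/2 facts, one wrong ground fact.
transitivity = lambda f: {(a, b) for (a, c) in f for (c2, b) in f if c == c2}
facts = {(2, 4), (4, 8), (3, 5)}  # (3, 5) is an erroneous ground fact
positives = {(2, 8)}
negatives = {(3, 8)}
print(usefulness(transitivity, [transitivity], facts, positives, negatives))  # → 1
```

Here the rule derives one positive example and no negative example, so its score is 1; a rule that mostly fired on erroneous facts would score negatively and be flagged for repair.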