Lagrangian Based Approaches for Lexicalized Tree Adjoining Grammar Parsing

by Caio Corro

Doctoral thesis in Computer Science

Supervised by Adeline Nazarenko and Joseph Le Roux.

Defended on 26-06-2018

at Sorbonne Paris Cité, within the doctoral school École doctorale Galilée (Villetaneuse, Seine-Saint-Denis), in partnership with Laboratoire informatique de Paris-Nord (Villetaneuse, Seine-Saint-Denis) (laboratory) and Université Paris 13 (institution where the thesis was prepared).

The jury was chaired by Isabelle Demnard-Tellier.

The jury consisted of Adeline Nazarenko, André Martins, Roberto Wolfer Calvo, Benoît Crabbé, Mathieu Lacroix.

The reviewers were Alexis Nasr, Leo Liberti.

  • Translated title

    Approches fondées sur la relaxation Lagrangienne pour l'analyse syntaxique avec grammaires d'arbres adjoints


  • Abstract (translated from French)

    In recent years, methods from combinatorial optimization have been successfully applied to hard algorithmic problems in Natural Language Processing (NLP). We follow this methodology for parsing with Lexicalized Tree Adjoining Grammars. More precisely, a parsing problem is first reduced to a subgraph selection problem. We then formulate the latter as an Integer Linear Program. Many algorithms have been proposed for such formulations. We focus on Lagrangian relaxation, which has received much attention from the NLP community. The distinctive feature of our method is that our algorithms solve generic problems and can therefore be tested on different data.


  • Abstract

    In linguistics and Natural Language Processing (NLP), syntax is the study of the structure of sentences in a given language. Two approaches have mainly been considered to describe them: dependency structures and phrase-structures. A dependency links a pair of words together with its relation type, whereas a phrase-structure describes a sentence by means of a hierarchy of word sets called constituents. In this thesis, we focus on phrase-structure parsing, that is, the computation of the constituency structure of a given sentence. Context-Free Grammars (CFGs) have been widely adopted by the NLP community due to their simplicity and the low complexity of their parsing algorithms. However, CFGs are too limited to describe all phenomena observed in natural language structures. Therefore, Lexicalized Tree Adjoining Grammars (LTAGs) have been widely studied as a plausible alternative, among others. They are more expressive than CFGs but can also be parsed in polynomial time. Unfortunately, the best known algorithm has an O(n^7) time complexity, with n the length of the input sentence. Thus, in practice most algorithms are based on greedy methods which require fairly strong independence assumptions. The main approach in the literature, called supertagging, filters the search space in a pre-processing step while ignoring long-distance relationships, one of the main motivations for LTAGs.

    In the past years, combinatorial optimization techniques have been successfully applied to computationally challenging NLP tasks. We follow this line of work in the case of LTAG parsing. More precisely, in our setting, a given NLP problem is reduced to a subgraph selection problem. As such, it has a generic form which may interest other research communities. Then we formulate the generic graph problem as an Integer Linear Program. Integer Linear Programming has been widely studied and many optimization methods exist. We focus on Lagrangian relaxation, which previously received much attention from the NLP community. Interestingly, the proposed algorithms can be parametrized to fit a range of different data without impacting efficiency.

    Our first contribution is a novel pipeline for LTAG parsing. Contrary to the supertagging approach, we propose a pre-processing step which takes into account relationships between words: well-nested dependency parsing with 2-bounded block degree. An algorithm with an O(n^7) time complexity has been proposed for this problem in the literature, which is similar to the standard LTAG parser complexity. In order to tackle the complexity challenge, we show that it can be reduced to a subgraph selection problem which can be expressed via a generic ILP. With our algorithm, the well-nested constraint can easily be toggled off and the block degree bound can be changed. Thus, as an example, it can be used for parsing problems related to other lexicalized grammars. We experiment on several problems, showing the efficiency and usefulness of our method.

    Our second contribution is a novel approach for discontinuous constituent parsing. We introduce a variant of LTAG for this task. Parsing is then equivalent to the joint tagging and non-projective dependency parsing problem. We show that it can be reduced to the Generalized Maximum Spanning Arborescence problem, which has been previously studied in the combinatorial optimization literature. A novel resolution algorithm based on Lagrangian relaxation is proposed. We experiment on two standard discontinuous constituent datasets and obtain state-of-the-art results alongside competitive decoding speed.
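
The abstract names Lagrangian relaxation without showing its mechanics. The sketch below is not the thesis's algorithm; it is a minimal toy illustration of the general idea, assuming two independent scorers `f` and `g` (hypothetical names and data) that must agree on a single label. The agreement constraint is moved into the objective with one multiplier per label, and the multipliers are updated by subgradient steps with an assumed step size of 1/t.

```python
def agree_by_lagrangian_relaxation(f, g, max_iters=100):
    """Maximize f[y] + g[z] subject to y == z by relaxing the equality.

    f, g: lists of scores, one per candidate label (hypothetical toy data).
    Returns (label, certified). certified is True when both subproblems
    picked the same label, which proves the label is optimal.
    """
    n = len(f)
    u = [0.0] * n  # one Lagrange multiplier per label
    y = 0
    for t in range(1, max_iters + 1):
        # With the constraint relaxed, each subproblem is solved on its own.
        y = max(range(n), key=lambda k: f[k] + u[k])
        z = max(range(n), key=lambda k: g[k] - u[k])
        if y == z:
            # Agreement: the relaxed solution is feasible, hence optimal.
            return y, True
        # Subgradient step on the dual: penalize y for f, reward z for f.
        step = 1.0 / t  # assumed decreasing step size
        u[y] -= step
        u[z] += step
    # No agreement within the budget: return a guess without certificate.
    return y, False


if __name__ == "__main__":
    f = [3.0, 1.0, 0.0]
    g = [0.0, 2.5, 0.0]
    label, certified = agree_by_lagrangian_relaxation(f, g)
    print(label, certified)
```

When the two subproblems agree, the returned label also maximizes `f[k] + g[k]` over all labels, so the flag acts as an optimality certificate. When they never agree within `max_iters`, the function only returns a heuristic answer.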


It is available in the library of the institution where it was defended.

Consult in library

The defense version exists

Where is this thesis located?

  • Library: Université Paris 13 (Villetaneuse, Seine-Saint-Denis). Bibliothèque universitaire.
See in Sudoc, the union catalogue of higher education and research libraries.