2020-02-03T11:33:05Z
2020-03-18T17:41:45
Lagrangian Based Approaches for Lexicalized Tree Adjoining Grammar Parsing
2018
2018-06-26
Electronic Thesis or Dissertation
text
Text
electronic
In recent years, methods from combinatorial optimization have been successfully applied to difficult algorithmic problems in Natural Language Processing (NLP). We follow this methodology in the context of parsing with Lexicalized Tree Adjoining Grammars. More precisely, a parsing problem is first reduced to a subgraph selection problem. We then formulate the latter as an Integer Linear Program. Many algorithms have been proposed for such formulations. We focus on Lagrangian relaxation, which has received much attention from the NLP community. The distinctive feature of our method is that our algorithms solve general problems and can therefore be tested on different data.
In linguistics and Natural Language Processing (NLP), syntax is the study of the structure of sentences in a given language. Two main approaches have been considered to describe sentence structure: dependency structures and phrase structures. A dependency links a pair of words together with a relation type, whereas a phrase structure describes a sentence by means of a hierarchy of word sets called constituents. In this thesis, we focus on phrase-structure parsing, that is, the computation of the constituency structure of a given sentence. Context-Free Grammars (CFGs) have been widely adopted by the NLP community due to their simplicity and the low complexity of their parsing algorithms. However, CFGs are too limited to describe all phenomena observed in natural language structures. Therefore, Lexicalized Tree Adjoining Grammars (LTAGs) have been widely studied as a plausible alternative, among others. They are more expressive than CFGs but can still be parsed in polynomial time. Unfortunately, the best known algorithm has an O(n^7) time complexity, with n the length of the input sentence. Thus, in practice, most algorithms are based on greedy methods which require fairly strong independence assumptions. The main approach in the literature, called supertagging, filters the search space in a pre-processing step while ignoring long-distance relationships, one of the main motivations for LTAGs. In the past years, combinatorial optimization techniques have been successfully applied to computationally challenging NLP tasks. We follow this line of work in the case of LTAG parsing. More precisely, in our setting, a given NLP problem is reduced to a subgraph selection problem. As such, it has a generic form which may interest other research communities. Then we formulate the generic graph problem as an Integer Linear Program. Integer Linear Programming has been widely studied and many optimization methods exist. We focus on Lagrangian relaxation, which previously received much attention from the NLP community.
Interestingly, the proposed algorithms can be parametrized to fit a range of different data without impacting efficiency. Our first contribution is a novel pipeline for LTAG parsing. Contrary to the supertagging approach, we propose a pre-processing step which takes into account relationships between words: well-nested dependency parsing with 2-bounded block degree. An algorithm with an O(n^7) time complexity has been proposed for this problem in the literature, which is similar to the standard LTAG parser complexity. In order to tackle the complexity challenge, we show that it can be reduced to a subgraph selection problem which can be expressed via a generic ILP. With our algorithm, the well-nested constraint can easily be toggled off and the block degree bound can be changed. Thus, as an example, it can be used for parsing problems related to other lexicalized grammars. We experiment on several problems showing the efficiency and usefulness of our method. Our second contribution is a novel approach for discontinuous constituent parsing. We introduce a variant of LTAG for this task. Parsing is then equivalent to the joint tagging and non-projective dependency parsing problem. We show that it can be reduced to the Generalized Maximum Spanning Arborescence problem, which has been previously studied in the combinatorial optimization literature. A novel resolution algorithm based on Lagrangian relaxation is proposed. We experiment on two standard discontinuous constituent datasets and obtain state-of-the-art results alongside competitive decoding speed.
Programming languages -- Syntax
Tree adjoining grammar
Relaxation methods (Mathematics)
Syntactic parsing
Lagrangian relaxation
Generalized Maximum Spanning Arborescence
Corro, Caio
Nazarenko, Adeline
Le Roux, Joseph
Sorbonne Paris Cité
École doctorale Galilée (Villetaneuse, Seine-Saint-Denis)
Laboratoire informatique de Paris-Nord (Villetaneuse, Seine-Saint-Denis)
Université Paris 13
http://www.theses.fr/2018USPCD051/document