Thesis defended

Author: Ahmed El Sayed
Supervisor: Djamel Abdelkader Zighed
Type: Doctoral thesis
Discipline(s): Computer science
Date: Defended in 2008
Institution(s): Lyon 2

Abstract

This dissertation focuses on two key issues in text mining: unsupervised learning and knowledge acquisition. Despite their relative maturity, both still present major challenges. First, for unsupervised learning, a well-known, unresolved challenge is to perform clustering with minimal input parameters. One natural way to achieve this is to involve validity indices in the clustering process. Although of great interest, validity indices have not been explored extensively in the literature, especially for high-dimensional data such as text. Hence, we make three main contributions: (1) an extensive experimental comparison of eight validity indices; (2) a context-aware method that enhances the use of validity indices as stopping criteria; (3) I-CBC, an incremental version of the CBC (Clustering By Committee) algorithm. These contributions were validated in two real-world applications: document clustering and word clustering. Second, for knowledge acquisition, we face major issues related to ontology learning from text: the low recall of the pattern-based approach, the low precision of the distributional approach, context dependency, and ontology evolution. We therefore propose a new framework for taxonomy learning from text. The proposal is a hybrid approach with the following advantages over other approaches: (1) the ability to capture relations in text more “flexibly”; (2) concepts that better reflect the context of the target corpus; (3) more reliable decisions during the learning process; (4) evolution of the learned taxonomy without any manual effort, once it is incorporated into the core of an information retrieval system.