Sujet de thèse SANS FINANCEMENT

Construction automatisée d’un lexique sémantique compositionnel

PhD proposal WITHOUT GRANT

Automated construction of a compositional semantic lexicon

Sujet de thèse sans financement (seuls les étudiants disposant de leur propre financement peuvent choisir ces sujets :
étudiants étrangers avec une bourse de leur pays, normaliens, etc.)

PhD proposal without grant (only students with their own funding can apply for such PhD proposals, e.g. foreign students with a grant from their home country)

http://www.lirmm.fr/~retore/thesisproposals.html

Advisors: Richard Moot, Christian Retoré

Prerequisite / prérequis: advanced notions in logic / de solides connaissances en logique

    L’analyse sémantique de texte, vue comme une fonction qui associe à chaque phrase une ou plusieurs formules logiques, requiert un lexique sémantique riche qui fournit pour chaque mot plusieurs lambda-termes typés décrivant sa structure argumentale (grosso modo « qui fait quoi, à qui, etc. ») [LCG]. Lorsqu’on souhaite prendre en compte le sens lexical, c’est-à-dire lorsqu’on souhaite déterminer le sens de chaque mot dans son contexte, les lambda-termes sont encore plus complexes ; la sorte des entités doit être précisée (par un type) ainsi que les coercitions qui peuvent changer le sens d’un mot en fonction de son contexte. Toutes ces informations sont représentées sous la forme de lambda-termes typés, ce qui garantit que les représentations du sens sont effectivement calculables. Ces informations lexicales permettent de trouver que l’un des sens possibles de « Fred a commencé un livre » est « Fred a commencé à lire un livre » (et c’est même le sens préféré, à moins que Fred ne soit une romancière comme Fred Vargas) [MGL].

    Les grammaires peuvent s’acquérir à partir de corpus annotés, et nous disposons effectivement d’une grammaire du français couplée à un analyseur [Grail], qui produit des formules logiques représentant le sens des phrases analysées, mais qui pour le moment n’incluent pas de traitement du sens lexical. L’objectif de cette thèse est de construire un lexique sémantique riche, c’est-à-dire d’acquérir les lambda-termes typés et de les intégrer dans le calcul des formules logiques. Plutôt que d’acquérir ces termes à partir de corpus annotés, nous suggérons de convertir en lambda-termes typés les informations contenues dans des réseaux lexicaux existants comme celui de JeuxDeMots [JdM].

    Un second objectif sera de raffiner le lexique pour un domaine spécifique dont la terminologie est bien définie, comme par exemple l’un des débats de Dialoguea, la plateforme de débats en ligne écrits de Jean Sallantin. Cela permettra de tester que le système est capable de produire certaines inférences comme celles mentionnées ci-dessus (« commencer à lire un livre » se déduit de « commencer un livre ») ainsi que des inférences spécifiques au débat considéré. Par exemple, dans le contexte des débats en ligne, on pourra vérifier si la paraphrase du texte initial ou d’une intervention précédente, par laquelle toute intervention commence, est correcte ou si le sens des mots est tellement modifié que la paraphrase est clairement erronée. Étant donné la finesse de ce type d’inférence et l’indécidabilité des logiques usuelles, une partie du travail sera de délimiter un type de problèmes suffisamment intéressants mais néanmoins décidables, au moins dans la plupart des cas rencontrés couramment.

____________

The semantic analysis of texts, viewed as the mapping of a sentence to one or several logical formulae, requires a rich semantic lexicon that associates each word with several typed lambda terms describing its argument structure (indicating, roughly speaking, “who does what to whom”) [LCG]. When one wishes to take lexical meaning into account, i.e. to determine the sense of each word in a given context, the lambda terms in the semantic lexicon are even more complex: the sort of each entity must be specified (as a type), as well as the coercions that may change the meaning of a word given its context. All this information is represented as typed lambda terms, which ensures that the semantic representations are computable. Using this lexical information, we can infer that one possible meaning of a sentence such as “Stephen started a book” can be paraphrased as “Stephen started reading a book” (in many contexts, this is even the preferred meaning, unless Stephen is a novelist like Stephen King) [MGL].
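To make the role of sorts and coercions concrete, here is a minimal Python sketch of the “start a book” example. It is an illustration under simplifying assumptions, not the formalism of [MGL]: the sort names (`human`, `event`, `phys_obj`), the classes `Coercion`, `Noun`, `Verb` and the function `readings` are all hypothetical, quantification over “a book” is ignored, and logical forms are rendered as plain strings.

```python
from dataclasses import dataclass

# Hypothetical base sorts; the proposal does not fix a sort inventory.
HUMAN, EVENT, PHYS_OBJ = "human", "event", "phys_obj"


@dataclass(frozen=True)
class Coercion:
    """An optional lexical term mapping an entity of one sort to another sort."""
    name: str
    source: str
    target: str


@dataclass(frozen=True)
class Noun:
    word: str
    sort: str
    coercions: tuple = ()  # coercions made available by this word


@dataclass(frozen=True)
class Verb:
    word: str
    subject_sort: str
    object_sort: str


def readings(verb, subj, obj):
    """Return the logical forms (as strings) that pass sort checking.

    The object is used directly when its sort matches the verb's
    expectation; otherwise each coercion leading to the expected sort
    yields one reading.
    """
    if subj.sort != verb.subject_sort:
        return []
    forms = []
    if obj.sort == verb.object_sort:
        forms.append(f"{verb.word}({subj.word}, {obj.word})")
    for c in obj.coercions:
        if c.source == obj.sort and c.target == verb.object_sort:
            forms.append(f"{verb.word}({subj.word}, {c.name}({obj.word}))")
    return forms


# "start" expects an event as its object; "book" denotes a physical object
# but offers two coercions to events (assumed here: reading and writing).
START = Verb("start", HUMAN, EVENT)
STEPHEN = Noun("stephen", HUMAN)
BOOK = Noun(
    "book",
    PHYS_OBJ,
    (Coercion("read", PHYS_OBJ, EVENT), Coercion("write", PHYS_OBJ, EVENT)),
)

print(readings(START, STEPHEN, BOOK))
```

In this sketch the direct application fails the sort check, and the two surviving readings correspond to “started reading a book” and “started writing a book”. Ranking them (reading as the default, writing when Stephen is known to be a novelist) would need additional information that the sketch does not model.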

Grammars can be acquired automatically from annotated corpora, and we already have a large French categorial grammar integrated with a wide-coverage parser [Grail], which provides semantic representations as logical formulae, though without incorporating lexical meaning. The objective of the PhD is to construct a rich semantic lexicon, i.e. to acquire the typed lambda terms and to integrate them into the automatically obtained logical formulae. Rather than inferring the lambda terms from annotated corpora, we suggest converting the information contained in an existing lexical network, such as JeuxDeMots [JdM], into typed lambda terms.
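The conversion step could be sketched as follows, again in Python. Everything here is assumed for illustration: the triple format `(source, relation, target, weight)`, the relation names `is_a`, `telic` and `agentive`, the weights, and the threshold `min_weight` are hypothetical and do not reflect the actual relation inventory or export format of JeuxDeMots. Coercions are assumed to target a sort called `event`.

```python
from collections import defaultdict

# Hypothetical weighted triples from a lexical network.
TRIPLES = [
    ("book", "is_a", "phys_obj", 120),
    ("book", "telic", "read", 90),      # what a book is typically for
    ("book", "agentive", "write", 40),  # how a book comes into being
    ("book", "telic", "burn", 3),       # low weight: filtered out below
]

# Relations assumed to license a coercion from an object to an event.
COERCION_RELATIONS = {"telic", "agentive"}


def build_lexicon(triples, min_weight=10):
    """Group triples by word, keeping a sort and weight-ranked coercions."""
    raw = defaultdict(lambda: {"sort": None, "coercions": []})
    for source, relation, target, weight in triples:
        if weight < min_weight:
            continue  # ignore weakly supported relations
        if relation == "is_a":
            raw[source]["sort"] = target
        elif relation in COERCION_RELATIONS:
            raw[source]["coercions"].append((weight, target))
    return {
        word: {
            "sort": entry["sort"],
            # strongest relation first
            "coercions": [t for _, t in sorted(entry["coercions"], reverse=True)],
        }
        for word, entry in raw.items()
    }


def to_lambda_terms(word, entry):
    """Render a lexicon entry as typed lambda terms (plain strings)."""
    sort = entry["sort"]
    terms = [f"{word} : {sort}"]
    for c in entry["coercions"]:
        terms.append(f"(λx:{sort}. {c}(x)) : {sort} → event")
    return terms


lexicon = build_lexicon(TRIPLES)
for term in to_lambda_terms("book", lexicon["book"]):
    print(term)
```

Ordering coercions by weight is one simple way to carry over a preference between readings from the network. Whether network weights are a reliable proxy for preferred meanings is an open question, not something the sketch settles.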

A second objective will be to refine the lexicon for application to a specific domain with a well-defined terminology, e.g. one or several of the online written debates hosted on Jean Sallantin's Dialoguea platform, and to evaluate the resulting system by testing it on some types of entailments from this domain. These include entailments of the type sketched above (from “start a book” infer “start reading a book”) but also domain-specific entailments. For example, in the context of online debates, entailment allows us to determine (at least in some clear cases) whether one debater adequately paraphrases another or whether what seems to be a paraphrase crucially shifts the meaning of some terms. Given the subtlety of many of these inferences and the undecidability of first-order logic, part of the work consists of carving out a class of problems that is both interesting and decidable in most of the commonly encountered cases.

_____________

[Grail] Richard Moot. Wide-Coverage French Syntax and Semantics using Grail. Proceedings of Traitement Automatique des Langues Naturelles (TALN), 2010.

[JdM] Chamberlain, J., Fort, K., Kruschwitz, U., Lafourcade, M. & Poesio, M. Using Games to Create Language Resources: Successes and Limitations of the Approach. Theory and Applications of Natural Language Processing. Gurevych, Iryna; Kim, Jungi (Eds.), Springer, ISBN 978-3-642-35084-9, 2013.

[LCG] Richard Moot and Christian Retoré. The Logic of Categorial Grammars: A Deductive Account of Natural Language Syntax and Semantics. LNCS 6850, Springer, 2012.

[MGL] Christian Retoré. The Montagovian Generative Lexicon ΛTy_n: a Type Theoretical Framework for Natural Language Semantics. In R. Matthes & A. Schubert (Eds.), TYPES 2013 Post-proceedings, LIPIcs, 2014. http://dx.doi.org/10.4230/LIPIcs.TYPES.2013.202