Inductive Improvement of Part-of-Speech Tagging and its Effect on a Terminology of Molecular Biology
Ahmed Amrani, Mathieu Roche, Yves Kodratoff, Oriane Matte-Tailliez
Abstract
In the context of Part-of-Speech (PoS) tagging of specialized corpora, we propose an inductive approach that focuses on the most ‘important’ PoS tags, because mistaking them can lead to a total misunderstanding of the text. After standard tagging of a biological corpus with Brill’s tagger, we observed persistent errors that are very hard to deal with. As an application, we studied two cases of a different nature: first, confusion between past participle, adjective, and preterit for verbs ending in ‘ed’; second, confusion between plural nouns and third-person singular present verbs. Using a user-friendly interface, an expert corrected the examples. From these well-annotated examples, we then induced rules using a propositional rule induction algorithm. Experimental validation showed an improvement in tagging precision. The relevance of the terminology of the field under consideration, here molecular biology, improves greatly as the number of these tagging errors decreases.
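To make the idea of inducing correction rules from expert-corrected examples concrete, the following is a minimal sketch, not the algorithm used in this work. It assumes Penn Treebank-style tags (VBN, VBD, JJ), a single contextual attribute (the tag of the preceding word), and a simple majority vote per context in place of a full propositional rule learner. The example data and the names `EXAMPLES`, `induce_rules`, and `apply_rules` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical expert-corrected examples for words ending in "ed":
# (tag of the preceding word, word, tag chosen by the expert).
# Assumed Penn Treebank-style tags: VBN = past participle,
# VBD = preterit, JJ = adjective.
EXAMPLES = [
    ("VBZ", "activated", "VBN"),
    ("VBD", "expressed", "VBN"),
    ("NN", "activated", "VBD"),
    ("NNS", "showed", "VBD"),
    ("DT", "purified", "JJ"),
    ("DT", "activated", "JJ"),
    ("VBZ", "purified", "VBN"),
]


def induce_rules(examples):
    """Induce one rule per context: previous tag -> majority corrected tag."""
    counts = defaultdict(Counter)
    for prev_tag, _word, corrected_tag in examples:
        counts[prev_tag][corrected_tag] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}


def apply_rules(rules, prev_tag, word, current_tag):
    """Retag an 'ed' word when a rule covers its context; otherwise keep the tag."""
    if word.endswith("ed") and prev_tag in rules:
        return rules[prev_tag]
    return current_tag


RULES = induce_rules(EXAMPLES)
```

In this sketch, a word ending in ‘ed’ that follows a determiner would be retagged as an adjective, while words in contexts not covered by any induced rule keep the tag assigned by the initial tagger.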