Semi-automatic construction of a taxonomy for a biological field
Oriane Matte-Tailliez, Mathieu Roche, Yves Kodratoff, Michel Termier
Abstract
Given the abundance and richness of scientific publications in biology, it has become impossible to read all the texts published in this field. To address this problem, i.e., to extract relevant information from the texts, automatic methods must be investigated. We investigated whether a conceptual classification dedicated to a particular biological field can help build effective generic extraction patterns. Our biological problem is the characterization of DNA-binding proteins of Saccharomyces cerevisiae. The object of this work is the construction of a hierarchy of concepts, i.e., a taxonomy related to this field, from a textual corpus. A concept is an entity that gathers words sharing the same meaning.
This work consists of three main steps. The first step is the constitution of a "clean" corpus. A specific query enabled us to obtain a corpus of some 6000 abstracts from the Medline bibliographic database. We then homogenised these textual data to obtain the cleaned corpus.
The second step automatically extracts the terms relevant to the field. To extract the terminology, we first labelled each word with a grammatical tag. This tagging is carried out by Brill's tagger (Brill, 1994), which we adapted to the biology corpus. From this tagged corpus, we extracted the most relevant terms for the field. To that end, we used the association measure of Jacquemin (Jacquemin, 1997), which computes the degree of independence between the two words that make up the candidate term. To favour the terms of the field, we devised several heuristics that serve as parameters for this measure.
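As an illustration of this kind of term extraction, the following minimal sketch ranks adjacent word pairs from a tagged corpus. It does not reproduce the measure of Jacquemin (1997) or our heuristics: pointwise mutual information is used here as a stand-in association score, and the function name, the tag patterns, and the `min_count` threshold are assumptions made for the example.

```python
import math
from collections import Counter


def score_candidate_terms(tagged_sentences,
                          patterns=(("JJ", "NN"), ("NN", "NN")),
                          min_count=2):
    """Rank two-word candidate terms by pointwise mutual information (PMI).

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    patterns: tag pairs accepted as candidate terms (an assumption here).
    min_count: hypothetical frequency threshold for keeping a candidate.
    """
    unigrams = Counter()    # occurrences of each word
    candidates = Counter()  # occurrences of each pair matching a pattern
    n_pairs = 0             # total number of adjacent word pairs

    for sentence in tagged_sentences:
        unigrams.update(word.lower() for word, _ in sentence)
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            n_pairs += 1
            if (t1, t2) in patterns:
                candidates[(w1.lower(), w2.lower())] += 1

    n_words = sum(unigrams.values())
    scores = {}
    for (w1, w2), count in candidates.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        # PMI is 0 when the two words are independent and grows with
        # the strength of their association.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this sketch, a field-specific heuristic could be added as an extra weight on the score, for example to favour pairs containing a word already known to belong to the field.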
Then, the cleaned corpus, together with its terminology, is used as the basis for the third step, which consists in building a taxonomy. To that end, an expert of the field is asked to use two main methods. The first method examines the list of terms and determines whether or not they belong to concepts of the field. Of the 1447 extracted terms, 1013 were evaluated as instances of concepts, which represents an accuracy of 70%. The second method for developing a taxonomy is based on the hypothesis that semantics is reflected in syntax. Xerox's Shallow Parser enabled us to find all the grammatical relations of the corpus. This set of syntactic relations is the main input of Rowan, a software tool developed by our team, which the expert of the field used as a helper in building a more complete taxonomy. One of the main functions of Rowan is the visualization of all the words, their grammatical roles (noun-noun relations, verb-noun relations, etc.) and the concepts found in the sentences. This visualization of the context eases the development of a taxonomy.
Rowan also uses an analytical method to link syntactic relations and existing concepts. This induction step is based on the assumption that two different words sharing a large number of syntactic relations should be seen as instances of the same concept.
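The idea behind this induction step can be sketched as follows. This is not Rowan's implementation, which is not described here: the triple format, the Jaccard overlap criterion, both thresholds, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def relation_profiles(relations):
    """Map each word to the set of syntactic relations it takes part in.

    relations: iterable of (word, relation_type, other_word) triples,
    a hypothetical simplification of a shallow parser's output.
    """
    profiles = defaultdict(set)
    for word, relation_type, other_word in relations:
        profiles[word].add((relation_type, other_word))
    return profiles


def propose_same_concept(relations, min_shared=2, min_jaccard=0.5):
    """Propose word pairs that share many syntactic relations.

    Returns (word1, word2, shared_count, jaccard) tuples, best first.
    The proposals are meant to be validated by the expert of the field.
    """
    profiles = relation_profiles(relations)
    proposals = []
    for w1, w2 in combinations(sorted(profiles), 2):
        shared = profiles[w1] & profiles[w2]
        union = profiles[w1] | profiles[w2]
        jaccard = len(shared) / len(union)
        if len(shared) >= min_shared and jaccard >= min_jaccard:
            proposals.append((w1, w2, len(shared), jaccard))
    # Highest overlap first, then highest number of shared relations.
    return sorted(proposals, key=lambda p: (-p[3], -p[2]))


# Toy relations, made up for the example.
toy_relations = [
    ("gal4", "SUBJ", "bind"),
    ("gal4", "SUBJ", "activate"),
    ("gcn4", "SUBJ", "bind"),
    ("gcn4", "SUBJ", "activate"),
    ("promoter", "OBJ", "bind"),
]
print(propose_same_concept(toy_relations))
```

On this toy input, only the two words that appear as subjects of the same verbs are proposed as instances of one concept; the third word shares no relation with them and is left out.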
This methodology enabled us to take into account around 200 concepts for our field. This semi-automatic approach should make it possible to process other biology corpora and to build their specific taxonomies.