Large document catalogues are moving from an era of database-style metadata management, based on formats specific to library and information science, to a Web era based on standard Semantic Web languages (RDF/RDFS, OWL). This shift brings many advantages (better document availability, greater capacity for data exchange, new search and usage services for documents), but it also raises important questions about the quality of document databases.
This project aims to develop mechanisms to:

  • describe the quality of an existing document database;
  • maintain a given level of quality by controlling updates on such databases;
  • improve the quality of a database;
  • exploit these databases according to their level of quality (e.g. when searching for documents or combining databases).

Representing data with Semantic Web standards makes a Knowledge Representation approach to this problem possible. On the one hand, this approach gives a logical semantics to the notion of quality; on the other, it allows reasoning mechanisms to be used to address the various problems involved. The approach is rooted in (i) the formalization of the knowledge found in document catalogues, (ii) the development of a quality model for the problem of identifying individual entities (named entities), (iii) the definition of a trust model suitable for reconciling and fusing information from different sources, and (iv) the discovery of characteristics that identify entities and their manipulation by different techniques (logical, numerical, probabilistic, etc.).
A large part of the project is devoted to evaluating the proposed approach through experiments conducted on suitable benchmarks, and to developing demonstrators adapted to the two document database owners involved in the project.
The consortium brings together five complementary partners: two major national players in document cataloguing and three computer science research groups. The Bibliographic Agency for Higher Education (ABES) and the Institut National de l'Audiovisuel (INA) manage very large document databases and are heavily involved, at both the national and international level, in the exposure, standardization, interconnection and use of their metadata. The teams from LIG, LIRMM and LRI involved in this project have strong expertise in databases, knowledge representation and the Semantic Web. Furthermore, numerous research connections exist between the project partners. The skills of the scientific partners and the links forged between them in earlier joint projects are very important to the success of this multidisciplinary project, which involves both library and information science and information technology, and which should have an impact not only on the field of document databases but also on the Web of Data ("Linked Data").
