Uncertainty-Sensitive Reasoning for Inferring sameAs Facts in Linked Data (ECAI 2016)

Mustafa Al-Bakri, Manuel Atencia, Jérôme David, Steffen Lalande, Marie-Christine Rousset.
Proceedings of ECAI 2016 (22nd European Conference on Artificial Intelligence)

Discovering whether or not two URIs described in Linked Data—in the same or different RDF datasets—refer to the same real-world entity is crucial for building applications that exploit the cross-referencing of open data. A major challenge in data interlinking is to design tools that effectively deal with incomplete and noisy data, and exploit uncertain knowledge. In this paper, we model data interlinking as a reasoning problem with uncertainty. We introduce a probabilistic framework for modelling and reasoning over uncertain RDF facts and rules that is based on the semantics of probabilistic Datalog. We have designed an algorithm, ProbFR, based on this framework. Experiments on real-world datasets have shown the usefulness and effectiveness of our approach for data linkage and disambiguation.
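The probabilistic semantics underlying ProbFR can be illustrated with a minimal sketch (hypothetical helper names; probabilistic Datalog tracks event expressions to handle dependent derivations correctly, whereas this sketch assumes independence throughout): a single rule derivation succeeds with the product of its rule and premise probabilities, and several independent derivations of the same sameAs fact combine by noisy-or.

```python
def derivation_prob(premise_probs, rule_prob):
    # a single derivation fires when the rule and all its premises hold
    p = rule_prob
    for q in premise_probs:
        p *= q
    return p

def combine_derivations(probs):
    # noisy-or: the fact holds unless every independent derivation fails
    p_fail = 1.0
    for p in probs:
        p_fail *= 1.0 - p
    return 1.0 - p_fail

# two independent derivations of the same sameAs fact
d1 = derivation_prob([0.9, 0.8], 1.0)   # certain rule, uncertain premises
d2 = derivation_prob([0.7], 0.6)        # uncertain rule
print(combine_derivations([d1, d2]))
```

The fact's final weight exceeds that of either derivation alone, reflecting that corroborating evidence strengthens an uncertain conclusion.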

Fusion de données redondantes : une approche explicative (EGC 2016)

Fatiha Saïs, Rallou Thomopoulos.
Proceedings of EGC 2016 (16ième Journées Francophones Extraction et Gestion des Connaissances)

Within the framework of the ANR Qualinca project, we address the processing of redundant data. In this paper we assume that this redundancy has already been established by a prior data-linking step. The question addressed is the following: how can a single representation be produced by merging the identified "duplicates"? More specifically, how can we decide, for each property of the data item under consideration, which value to choose among those appearing in the "duplicates" to be merged? What method should be adopted so that the result can later be traced and explained in a way that is transparent and understandable to the user? To this end, we rely on a multi-criteria decision and argumentation approach.

Towards a Dual Process Cognitive Model for Argument Evaluation (SUM 2015)

Pierre Bisquert, Madalina Croitoru, Florence Dupin de Saint-Cyr.
Proceedings of SUM 2015 (9th Conference on Scalable Uncertainty Management)

In this paper we are interested in the computational and formal analysis of the persuasive impact that an argument can produce on a human agent. We propose a dual process cognitive computational model based on the highly influential work of Kahneman and investigate its reasoning mechanisms in the context of argument evaluation. This formal model is a first attempt to take greater account of human reasoning and a first step towards a better understanding of persuasion processes and human argumentative strategies, which is crucial in the domain of collective decision making.

Inferring same-as facts from Linked Data: an iterative import-by-query approach (AAAI 2015)

Mustafa Al-Bakri, Manuel Atencia, Steffen Lalande, Marie-Christine Rousset.
Proceedings of AAAI 2015 (29th Conference on Artificial Intelligence)

In this paper we model the problem of data linkage in Linked Data as a reasoning problem on possibly decentralized data. We describe a novel import-by-query algorithm that alternates steps of sub-query rewriting and of tailored querying of the Linked Data cloud in order to import data as specific as possible for inferring or contradicting given target same-as facts. Experiments conducted on a real-world dataset have demonstrated the feasibility of this approach and its usefulness in practice for data linkage and disambiguation.

C-SAKey : une approche de découverte de clés conditionnelles dans des données RDF (IC 2015)

Nathalie Pernelle, Danai Symeonidou, Fatiha Saïs
IC 2015, Rennes, France, 2015

Exploiting identity links between RDF resources allows applications to combine data from different sources. Data-linking approaches largely rely on the existence of possibly composite keys. Since such keys are rarely available, recent approaches have addressed the automatic discovery of keys from RDF data. However, in some domains the ontology classes are very general, and few keys are valid for the whole set of instances of a class. In the C-SAKey approach, we therefore propose to detect conditional keys that apply only to a subset of the instances of a class. We carried out a first experiment on an INA dataset showing that the keys discovered by our approach can indeed vary according to the conditions expressed in the key. Keywords: Data integration, Identity links, Data linking, Keys, RDF, OWL

Linked Data Annotation and Fusion driven by Data Quality Evaluation (EGC 2015)

Ioanna Giannopoulou, Fatiha Saïs, Rallou Thomopoulos
In EGC 2015, vol. RNTI-E-28, pp.257-262

In this work, we are interested in exploring the problem of data fusion, starting from reconciled datasets whose objects are linked with semantic sameAs relations. We attempt to merge the often conflicting information of these reconciled objects in order to obtain unified representations that contain only the best-quality information.

Uncertainty-Sensitive Reasoning over the Web of Data (PhD Thesis)

Mustafa Al-Bakri
PhD Thesis, University of Grenoble, 2014

In this thesis we investigate several approaches that help users find useful and trustworthy information in the Web of Data using Semantic Web technologies. For this purpose, we tackle two research issues: data linkage in Linked Data and trust in semantic P2P networks. We model the problem of data linkage in Linked Data as a reasoning problem on possibly decentralized data. We describe a novel import-by-query algorithm that alternates steps of sub-query rewriting and of tailored querying of the Linked Data cloud in order to import data as specific as possible for inferring or contradicting given target same-as facts. Experiments conducted on real-world datasets have demonstrated the feasibility of this approach and its usefulness in practice for data linkage and disambiguation. Furthermore, we propose an adaptation of this approach that takes into account possibly uncertain data and knowledge, resulting in the inference of same-as and different-from links carrying weights. In this adaptation we model uncertainty as probability values. Our experiments have shown that the adapted approach scales to large datasets and produces meaningful probabilistic weights.

Concerning trust, we introduce a trust mechanism for guiding the query-answering process in semantic P2P networks. Peers in semantic P2P networks organize their information using separate ontologies and rely on alignments between their ontologies for translating queries. Trust in such a setting is subjective, and estimates the probability that a peer will provide satisfactory answers for specific queries in future interactions. In order to compute trust, the mechanism exploits the information provided by alignments, along with information drawn from past interactions with peers. The calculated trust values are refined over time using Bayesian inference as more queries are sent and answers received.
For the evaluation of our mechanism, we built a semantic P2P bookmarking system (TrustMe) in which we can vary different quantitative and qualitative parameters. The results show the convergence of trust, and highlight the gain in the quality of peers' answers (measured with precision and recall) when the process of query answering is guided by our trust mechanism.

Automatic key discovery for Data Linking (PhD Thesis)

Danai Symeonidou
PhD Thesis, University Paris XI, 2014

In recent years, the Web of Data has grown significantly and now contains a huge number of RDF triples. Integrating data described in different RDF datasets and creating semantic links among them has become one of the most important goals of RDF applications. These links express semantic correspondences between ontology entities or data. Among the different kinds of semantic links that can be established, identity links express that different resources refer to the same real-world entity. By comparing the number of resources published on the Web with the number of identity links, one can observe that the goal of building a Web of Data is still not accomplished. Several data linking approaches infer identity links using keys. Nevertheless, in most datasets published on the Web, the keys are not available, and it can be difficult, even for an expert, to declare them. The aim of this thesis is to study the problem of automatic key discovery in RDF data and to propose new efficient approaches to tackle this problem. Data published on the Web are usually created automatically, and thus may contain erroneous information or duplicates, or may be incomplete. Therefore, we focus on developing key discovery approaches that can handle datasets with numerous, incomplete or erroneous pieces of information. Our objective is to discover as many keys as possible, even ones that are valid only in subparts of the data.

We first introduce KD2R, an approach that allows the automatic discovery of composite keys in RDF datasets that may conform to different schemas. KD2R is able to treat datasets that may be incomplete and for which the Unique Name Assumption is fulfilled. To deal with the incompleteness of data, KD2R proposes two heuristics that offer different interpretations for the absence of data. KD2R uses pruning techniques to reduce the search space. However, this approach is overwhelmed by the huge amount of data found on the Web.
Thus, we present our second approach, SAKey, which is able to scale to very large datasets by using effective filtering and pruning techniques. Moreover, SAKey is capable of discovering keys in datasets where erroneous data or duplicates may exist. More precisely, the notion of almost keys is proposed to describe sets of properties that fail to be keys because of only a few exceptions.

Partitioning semantics for entity resolution and link repairs in bibliographic knowledge bases (PhD Thesis)

Léa Guizol
PhD Thesis, University Montpellier 2, 2014

We propose a qualitative entity resolution approach to repair links in a bibliographic knowledge base. Our research question is: "How to detect and repair erroneous links in a bibliographic knowledge base using qualitative methods?". The proposed approach is decomposed into two major parts. The first contribution is a partitioning semantics based on symbolic criteria, used to detect erroneous links. The second is a repair algorithm that restores link quality. We implemented our approach, carried out qualitative and quantitative evaluations of the partitioning semantics, and proved certain properties of the repair algorithm.

Data interlinking through robust linkkey extraction (ECAI 2014)

Manuel Atencia, Jérôme David, Jérôme Euzenat
21st European Conference on Artificial Intelligence (ECAI), Aug 2014, Praha, Czech Republic. IOS Press, pp. 15-20, 2014

Links are important for the publication of RDF data on the web. Yet, establishing links between datasets is not an easy task. We develop an approach for that purpose which extracts weak linkkeys. Linkkeys extend the notion of a key to the case of different datasets. They are made of a set of pairs of properties belonging to two different classes. A weak linkkey holds between two classes if any resources having common values for all of these properties are the same resource. An algorithm is proposed to generate a small set of candidate linkkeys. Depending on whether some of the links, valid or invalid, are known, we define supervised and unsupervised measures for selecting the appropriate linkkeys. The supervised measures approximate precision and recall, while the unsupervised measures are the ratio of pairs of entities a linkkey covers (coverage) and the ratio of entities from the same dataset it identifies (discrimination). We have experimented with these techniques on two datasets, showing the accuracy and robustness of both approaches.
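The unsupervised measures can be sketched as follows (a simplified, hypothetical formulation assuming single-valued properties and exact value equality; the measures defined in the paper differ in detail):

```python
def link_set(d1, d2, linkkey):
    # a pair of resources is linked when their values coincide on every
    # property pair (p, q) of the candidate linkkey
    return {(a, b) for a, r in d1.items() for b, s in d2.items()
            if all(p in r and q in s and r[p] == s[q] for p, q in linkkey)}

def coverage(d1, d2, links):
    # ratio of entities involved in at least one link
    covered = {a for a, _ in links} | {b for _, b in links}
    return len(covered) / (len(d1) + len(d2))

def discrimination(links):
    # close to 1.0 when the links form a one-to-one mapping
    if not links:
        return 0.0
    left = {a for a, _ in links}
    right = {b for _, b in links}
    return min(len(left), len(right)) / len(links)

d1 = {"x1": {"name": "Ann"}, "x2": {"name": "Bob"}}
d2 = {"y1": {"label": "Ann"}, "y2": {"label": "Eve"}}
L = link_set(d1, d2, [("name", "label")])
print(coverage(d1, d2, L), discrimination(L))
```

A candidate linkkey with high coverage but low discrimination links many entities ambiguously, so both ratios are needed to rank candidates in the absence of reference links.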

SAKEY: Scalable Almost KEY Discovery in RDF Data (ISWC 2014)

Danai Symeonidou, Vincent Armant, Nathalie Pernelle, Fatiha Saïs
13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. LNCS n°8796 (pp 33-49)

Exploiting identity links among RDF resources allows applications to efficiently integrate data. Keys can be very useful to discover these identity links. A set of properties is considered as a key when its values uniquely identify resources. However, these keys are usually not available. The approaches that attempt to automatically discover keys can easily be overwhelmed by the size of the data and require clean data. We present SAKey, an approach that discovers keys in RDF data in an efficient way. To prune the search space, SAKey exploits characteristics of the data that are dynamically detected during the process. Furthermore, our approach can discover keys in datasets where erroneous data or duplicates exist (i.e., almost keys). The approach has been evaluated on different synthetic and real datasets. The results show both the relevance of almost keys and the efficiency of discovering them.
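The notion of an almost key can be sketched as follows (a deliberately simplified, hypothetical rendering assuming single-valued properties and exact value equality; SAKey's actual definition handles multi-valued properties and shared values):

```python
from collections import defaultdict

def exception_set(dataset, props):
    # resources sharing the same combination of values on props
    # with at least one other resource
    groups = defaultdict(set)
    for res, desc in dataset.items():
        if all(p in desc for p in props):
            groups[tuple(desc[p] for p in props)].add(res)
    exceptions = set()
    for group in groups.values():
        if len(group) > 1:
            exceptions |= group
    return exceptions

def is_almost_key(dataset, props, n):
    # props is an n-almost key if at most n resources violate uniqueness
    return len(exception_set(dataset, props)) <= n

people = {
    "p1": {"name": "Ann", "city": "Lyon"},
    "p2": {"name": "Ann", "city": "Nice"},
    "p3": {"name": "Bob", "city": "Lyon"},
}
print(is_almost_key(people, ("name",), 0))          # "name" alone is not a key
print(is_almost_key(people, ("name",), 2))          # but it is a 2-almost key
print(is_almost_key(people, ("name", "city"), 0))   # "name"+"city" is a key
```

Tolerating a bounded number of exceptions is what lets key discovery survive the duplicates and errors that real Web data inevitably contains.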

Logical Detection of Invalid SameAs Statements in RDF Data (EKAW 2014)

Laura Papaleo, Nathalie Pernelle, Fatiha Saïs, Cyril Dumont
19th International Conference, EKAW 2014, Linköping, Sweden, November 24-28, 2014. LNCS n°8876 (pp 373-384)

In recent years, thanks to the standardization of Semantic Web technologies, we have experienced an unprecedented production of data published online as Linked Data. In this context, when a typed link is instantiated between two different resources referring to the same real-world entity, the use of owl:sameAs is generally predominant. However, recent research discussions have revealed issues in the use of owl:sameAs. Problems arise both when a sameAs link is erroneously discovered by an automatic data linking tool, and when users declare it while meaning something less 'strict' than the semantics defined by OWL. In this work, we discuss this issue further and present a method for logically detecting invalid sameAs statements under specific circumstances. We report our experimental results, obtained on OAEI datasets, which show that the approach is promising.
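One such circumstance can be sketched as follows (a hypothetical simplification: if two resources asserted to be the same hold distinct values for a property declared functional, and distinct literal values are taken to denote distinct things, the sameAs statement is contradictory):

```python
def invalid_sameas(sameas_pairs, descriptions, functional_props):
    # a sameAs link is logically invalid when the two resources carry
    # different values for a functional property (one value per entity)
    invalid = set()
    for a, b in sameas_pairs:
        for p in functional_props:
            va = descriptions.get(a, {}).get(p)
            vb = descriptions.get(b, {}).get(p)
            if va is not None and vb is not None and va != vb:
                invalid.add((a, b))
    return invalid

descs = {
    "r1": {"birthYear": 1954},
    "r2": {"birthYear": 1954},
    "r3": {"birthYear": 1971},
}
sameas = [("r1", "r2"), ("r1", "r3")]
print(invalid_sameas(sameas, descs, ["birthYear"]))  # the r1-r3 link is suspect
```

Only links exhibiting an explicit logical conflict are flagged; agreement on functional properties (as for r1 and r2) is consistent with, but does not prove, identity.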

An analysis of the SUDOC bibliographic knowledge base from a link validity viewpoint (IPMU 2014)

Léa Guizol, Olivier Rousseaux, Madalina Croitoru, Yann Nicolas, Aline Le Provost
15th International Conference, IPMU 2014, Montpellier, France, July 15-19, 2014, Proceedings of Information Processing and Management of Uncertainty in Knowledge-Based Systems. Volume 443 of the series Communications in Computer and Information Science pp 204-213.

With the aim of evaluating and improving link quality in bibliographic knowledge bases, we develop a decision support system based on partitioning semantics. The novelty of our approach consists in using symbolic-value criteria for partitioning, together with suitable partitioning semantics. In this paper we evaluate and compare the above-mentioned semantics on a real qualitative sample. This sample is drawn from the catalogue of French university libraries (SUDOC), a bibliographic knowledge base maintained by the University Bibliographic Agency (ABES).

Defining key semantics for RDF datasets: experiments and evaluations (ICCS 2014)

Manuel Atencia, Michel Chein, Madalina Croitoru, Jérôme David, Michel Leclère, Nathalie Pernelle, Fatiha Saïs, Francois Scharffe, Danai Symeonidou
21st International Conference on Conceptual Structures, ICCS 2014, Iaşi, Romania, July 27-30, 2014. LNCS n°8577 (pp 65-78)

Many techniques have recently been proposed to automate the linkage of RDF datasets. Predicate selection is the step of the linkage process that consists in selecting the smallest set of relevant predicates needed to enable instance comparison. We call such sets of predicates keys, by analogy with the notion of keys in relational databases. We formally explain the different assumptions behind two existing key semantics. We then evaluate the keys experimentally by studying how discovered keys could help dataset interlinking or cleaning. We discuss the experimental results and show that the two different semantics lead to comparable results on the studied datasets.

Investigating the quality of a bibliographic knowledge base using partitioning semantics (FUZZ-IEEE 2014)

Léa Guizol, Madalina Croitoru
2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 948 - 955

With the aim of evaluating and improving link quality in bibliographic knowledge bases, we develop a decision support system based on partitioning semantics. Two such semantics have been proposed, the novelty of this approach consisting in using symbolic-value criteria for partitioning. In this paper we investigate the limits of those partitioning semantics: how the characteristics of the input (objects and criteria) influence characteristics of the result, namely its correctness and the execution time.

What can FCA do for database linkkey extraction?

Manuel Atencia, Jérôme David, Jérôme Euzenat
3rd ECAI workshop on What can FCA do for Artificial Intelligence? (FCA4AI), Aug 2014, Praha, Czech Republic. pp.85-92, 2014

Links between heterogeneous data sets may be found by using a generalisation of keys in databases, called linkkeys, which apply across data sets. This paper considers the question of characterising such keys in terms of formal concept analysis. This question is natural because the space of candidate keys is an ordered structure obtained by reduction of the space of keys and that of data set partitions. Classical techniques for generating functional dependencies in formal concept analysis indeed apply for finding candidate keys. They can be adapted in order to find database candidate linkkeys. The question of their extensibility to the RDF context would be worth investigating.

Définition de la sémantique des clés dans le Web sémantique : un point de vue théorique (IC 2014, in French)

Michel Chein, Madalina Croitoru, Michel Leclère, Nathalie Pernelle, Fatiha Saïs, Danai Symeonidou
25es Journées francophones d'Ingénierie des Connaissances, Clermont-Ferrand, 12-16 May 2014. pp. 225-236

Many approaches have been defined to enable the automatic linking of RDF data sources published on the Web. Some of these approaches are based on selecting the smallest sets of properties relevant for comparing two data items. These sets form keys, and this notion is similar to the keys defined for relational databases. In this paper, we propose to explore different key semantics that can be used in the Semantic Web setting.

An Automatic Key Discovery Approach for Data Linking (J. Web Sem.)

Nathalie Pernelle, Fatiha Saïs, Danai Symeonidou
Journal of Web Semantics vol. 23, pp 16--30, 2013

In the context of Linked Data, different kinds of semantic links can be established between data. However when data sources are huge, detecting such links manually is not feasible. One of the most important types of links, the identity link, expresses that different identifiers refer to the same real world entity. Some automatic data linking approaches use keys to infer identity links, nevertheless this kind of knowledge is rarely available. In this work we propose KD2R, an approach which allows the automatic discovery of composite keys in RDF data sources that may conform to different schemas. We only consider data sources for which the Unique Name Assumption is fulfilled. The obtained keys are correct with respect to the RDF data sources in which they are discovered. The proposed algorithm is scalable since it allows the key discovery without having to scan all the data. KD2R has been tested on real datasets of the international contest OAEI 2010 and on data sets available on the web of data, and has obtained promising results.

Trust in networks of ontologies and alignments (Knowledge and Information Systems Journal)

Manuel Atencia, Mustafa Al-Bakri, Marie-Christine Rousset
Knowledge and Information Systems, February 2015, Volume 42, Issue 2, pp 353-379 (First online 2013).

In this paper, we introduce a mechanism of trust adapted to semantic peer-to-peer networks in which every peer is free to organize its local resources as instances of classes of its own ontology. Peers use their ontologies to query other peers, and alignments between peers' ontologies make it possible to reformulate queries from one local peer's vocabulary to another. Alignments are typically the result of manual or (semi-)automatic ontology matching. However, resulting alignments may be unsound and/or incomplete, and therefore, query reformulation based on alignments may lead to unsatisfactory answers. Trust can assist peers to select the peers in the network that are better suited to answer their queries. In our model, the trust that a peer has toward another peer depends on a specific query, and it represents the probability that the latter peer will provide a satisfactory answer to the query. In order to compute trust, we perform Bayesian inference that exploits ontologies, alignments and user feedback. We have implemented our method and conducted an evaluation. Experimental results show that trust values converge as more queries are sent and answers received. Furthermore, when query answering is guided by trust, the quality of peers' answers, measured with precision and recall, is improved.
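The Bayesian update at the heart of such a mechanism can be sketched with a Beta-Bernoulli model (a minimal sketch with hypothetical names; the paper's mechanism additionally conditions trust on the query and derives its prior from the alignments):

```python
class QueryTrust:
    # Beta-Bernoulli model: trust is the expected probability that the
    # peer answers a given query satisfactorily
    def __init__(self, alpha=1.0, beta=1.0):  # uniform prior
        self.alpha = alpha
        self.beta = beta

    def observe(self, satisfactory):
        # Bayesian update after one answer is judged by the asking peer
        if satisfactory:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def value(self):
        # posterior mean of the Beta distribution
        return self.alpha / (self.alpha + self.beta)

t = QueryTrust()
for outcome in (True, True, False, True):
    t.observe(outcome)
print(round(t.value(), 3))  # → 0.667
```

As observations accumulate, the posterior mean converges toward the peer's true answer quality, which is the convergence behaviour the experiments measure.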

SudocAD: A Knowledge-Based System for the Author Linkage Problem (KSE 2013)

Michel Chein, Michel Leclère, Yann Nicolas
Proceedings of the Fifth International Conference KSE 2013, Volume 244 of the series Advances in Intelligent Systems and Computing pp 65-83

SudocAD is a system addressing the author linkage problem in a bibliographic database context. Given a bibliographic database E and a (new) bibliographic record d, with r an identifier of an author in E and r' an identifier of an author in d: do r and r' refer to the same author? The system, which is a prototype, has been evaluated in a real situation. Compared to results given by expert librarians, the results of SudocAD are interesting enough to plan a transformation of the prototype into a production system. SudocAD is based on a method combining numerical and knowledge-based techniques. This method is abstractly defined, and even though SudocAD is devoted to the author linkage problem, the method could be adapted to other kinds of linkage problems, especially in the Semantic Web context.

Aggregation Semantics for Link Validity (SGAI 2013)

Léa Guizol, Madalina Croitoru, Michel Leclère
The Thirty-third SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence. Research and Development in Intelligent Systems XXX. Springer 2013 (pp 359-372).

In this paper we address the problem of link repair in bibliographic knowledge bases. In the context of the SudocAD project, a decision support system (DSS) is being developed, aiming to assist librarians when adding new bibliographic records. The DSS makes the assumption that existing data in the system contains no linkage errors. We lift this assumption and detail a method that allows for link validation. Our method is based on two partitioning semantics which are formally introduced and evaluated on a sample of real data.

Discovering Keys in RDF/OWL Datasets with KD2R (Workshop on Open Data 2013)

Danai Symeonidou, Nathalie Pernelle, Fatiha Saïs
Proceedings of the 2nd International Workshop on Open Data. Article No. 9, ACM New York, NY, USA, 2013

KD2R allows the automatic discovery of composite key constraints in RDF data sources that conform to a given ontology. We consider data sources for which the Unique Name Assumption is fulfilled. KD2R performs this discovery without having to scan all the data: the proposed system looks for maximal non-keys and derives minimal keys from this set of non-keys. KD2R has been tested on several datasets available on the Web of Data and has obtained promising results when the discovered keys are used to link data. In the demo, we will demonstrate the functionality of our tool and show on several datasets that the keys can be used in a data-linking tool.
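The derivation step can be sketched as a hitting-set computation (a brute-force sketch under stated assumptions, not KD2R's optimized algorithm): a property set is a key exactly when it is not contained in any maximal non-key, i.e. when it intersects the complement of every maximal non-key.

```python
from itertools import combinations

def minimal_keys(properties, max_non_keys):
    # minimal keys = minimal property sets hitting the complement of
    # every maximal non-key (smallest candidates enumerated first)
    complements = [set(properties) - set(nk) for nk in max_non_keys]
    keys = []
    for size in range(1, len(properties) + 1):
        for cand in combinations(sorted(properties), size):
            c = set(cand)
            if all(c & comp for comp in complements) and \
               not any(set(k) <= c for k in keys):
                keys.append(cand)
    return keys

props = {"name", "city", "phone"}
non_keys = [{"name", "city"}]          # name+city fails to identify resources
print(minimal_keys(props, non_keys))   # → [('phone',)]
```

Since there are far fewer maximal non-keys than candidate keys in practice, finding the non-keys first and inverting them is what avoids scanning all the data.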

On Link Validity in Bibliographic Knowledge Bases (IPMU 2012)

Léa Guizol, Madalina Croitoru, Michel Leclère
14th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU 2012, Catania, Italy, July 9-13, 2012. Volume 297 of the series Communications in Computer and Information Science (pp 380-389)

The Entity Resolution problem has been widely addressed in the literature. In its simplest version, the problem takes as input a knowledge base composed of records describing real-world entities and outputs the sets of records judged to correspond to the same real-world entity. More elaborate versions take into account links amongst records, representing relationships between the entities they describe. However, none of the approaches in the literature questions the validity of certain links between records. In this paper we highlight this new aspect of "link validity" in knowledge bases and show how Entity Resolution approaches should take this aspect into consideration.