# Research

## Gradual Patterns

Gradual patterns are rules such as *the older, the higher the salary*. They have been extensively studied for command-based systems but have received little attention with regard to their automatic extraction. The problem has recently started to be investigated, but many challenges remain. First, mining such patterns is very difficult because the algorithms are very time- and memory-consuming. We thus study (i) how to optimize the memory storage of the data (using binary representations and computations) and (ii) how to distribute both the data and the computations in order to reduce the runtime and to handle very large databases that cannot currently be processed. Second, as usual in data mining, the number of patterns extracted by current solutions is too large to be easily handled by end-users. We thus study (i) how to define relevant quality measures, (ii) how to visualize such rules, and (iii) how to compute concise nuclei of rules (cf. closed patterns, Galois connections). Third, many semantics can be associated with such rules, which we aim to study. This research is carried out in the framework of two PhD theses, on data mining and medical applications and on concise representations (together with S. Ben Yahia from Tunis). It is also carried out with the LIP6 (B. Bouchon-Meunier, MJ. Lesot, M. Rifqi) to study how ranking measures (e.g. Kendall, Spearman) can be used in this framework, and with EFREI (N. Sicard) and LIG (A. Termier) to study how these very time- and memory-consuming algorithms can be distributed.
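As a rough illustration of what a gradual pattern measures, the Python sketch below scores the rule *the older, the higher the salary* by the fraction of pairs of records that are ordered the same way on both attributes. This pair-based, Kendall-style score is only one possible definition of support. The function names and the toy data are hypothetical, and the code is not the implementation used in the work described above. Pairwise orderings are stored as bits of an integer, so combining attributes reduces to a bitwise AND, in the spirit of the binary representations mentioned in the text.

```python
from itertools import permutations

def increasing_pairs(values):
    """Bitmap over ordered pairs (i, j), i != j: a bit is set when values[i] < values[j]."""
    bits = 0
    for k, (i, j) in enumerate(permutations(range(len(values)), 2)):
        if values[i] < values[j]:
            bits |= 1 << k
    return bits

def gradual_support(*columns):
    """Fraction of record pairs that increase together on every given column."""
    n = len(columns[0])
    combined = increasing_pairs(columns[0])
    for column in columns[1:]:
        combined &= increasing_pairs(column)  # keep pairs ordered alike on all columns
    total_pairs = n * (n - 1) // 2
    return bin(combined).count("1") / total_pairs

# Hypothetical toy data: one (age, salary) record per person.
ages = [22, 28, 35, 41, 50]
salaries = [1200, 1800, 1700, 2500, 3000]
support = gradual_support(ages, salaries)
print(support)
```

With this toy data, every pair of persons respects the rule except the pair aged 28 and 35, whose salaries go the other way.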

**Key Words**: Gradual Patterns, Numerical Data, Concise Representations, Rankings, Parallel Data Mining.

**PhD Students:** L. Di Jorio (2007-2010), co-advisor: M. Teisseire; S. Ayouni (2009-2012), co-advisors: P. Poncelet and S. Ben Yahia

## Stream Mining

We investigate approaches for summarizing complex data streams. Storing all the data of a stream is not feasible because of its sheer volume. Two approaches can be used to keep information about the data already processed: sampling and aggregation. We address the second one, aggregation. This solution is closely related to the concepts of data cubes and OLAP, which have been studied for about 10 years.
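The aggregation idea can be sketched as follows in Python: instead of storing raw events, each event is folded into a cell of a small cube indexed by a time window and by dimension values. The dimensions (`product`, `city`), the fixed window size, and the function name are assumptions made for this example only. It is a simplified illustration, not the summarization structure developed in this work.

```python
from collections import defaultdict

def aggregate_stream(events, window_size):
    """Sum the measure per (window index, product, city) cell; raw events are not kept."""
    cube = defaultdict(float)
    for timestamp, product, city, amount in events:
        window = timestamp // window_size  # index of the time window the event falls in
        cube[(window, product, city)] += amount
    return dict(cube)

# Hypothetical stream of (timestamp, product, city, amount) events.
events = [
    (0, "tv", "Paris", 300.0),
    (5, "tv", "Paris", 200.0),
    (7, "dvd", "Lyon", 50.0),
    (12, "tv", "Paris", 100.0),
]
cube = aggregate_stream(events, window_size=10)
print(cube)
```

The memory footprint then depends on the number of distinct cells rather than on the number of events, at the cost of losing the detail inside each window.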

**PhD Student**: Y. Pitarch (2008-2011). Co-advisor: P. Poncelet

**Key Words**: Stream Mining, OLAP Mining.

## Fuzzy Sequential Patterns

Most real-world databases consist of historical and numerical data, such as sensor, scientific or demographic data. In this context, algorithms that extract sequential patterns are well suited to the temporal aspect of the data but cannot process numerical information. The data are therefore pre-processed into a binary representation, which leads to a loss of information. Algorithms have been proposed to process numerical data using intervals, and fuzzy intervals in particular, but none of these methods is satisfactory. We therefore fully define the concepts related to fuzzy sequential pattern mining. Using different levels of fuzzification, we propose three methods to mine fuzzy sequential patterns and detail the resulting algorithms (SPEEDYFUZZY, MINIFUZZY and TOTALLYFUZZY). Finally, we assess them through experiments showing the robustness and relevance of this work.
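To make the notion of fuzzy interval concrete, here is a minimal Python sketch in which a numerical value belongs to several overlapping intervals with a degree between 0 and 1, rather than to a single crisp interval. The trapezoidal shapes, the interval bounds, and the two counting strategies (a thresholded count and a sum of degrees) are generic illustrations chosen for this example. They are not claimed to reproduce SPEEDYFUZZY, MINIFUZZY or TOTALLYFUZZY.

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoidal fuzzy interval with a <= b <= c <= d."""
    if b <= x <= c:
        return 1.0  # core of the interval
    if x <= a or x >= d:
        return 0.0  # outside the support
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)  # falling edge

# Hypothetical fuzzy partition of a numerical attribute.
FUZZY_SETS = {
    "low": (0, 0, 10, 20),
    "medium": (10, 20, 30, 40),
    "high": (30, 40, 50, 50),
}

def fuzzify(x):
    """Degrees of x in every fuzzy set of the partition."""
    return {name: trapezoid(x, *bounds) for name, bounds in FUZZY_SETS.items()}

readings = [5, 18, 25, 38]
medium_degrees = [fuzzify(x)["medium"] for x in readings]

# Two generic ways of counting how many readings are "medium".
thresholded_count = sum(1 for degree in medium_degrees if degree >= 0.5)
summed_degrees = sum(medium_degrees)
print(fuzzify(18), thresholded_count, summed_degrees)
```

A crisp discretization would assign the reading 18 to a single interval, whereas here it is partly "low" and mostly "medium", which is the information a binary pre-processing step loses.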

**Key Words**: Sequential Patterns, Numerical Data, Fuzzy Intervals.

**PhD Student:** C. Fiot (defended Sept. 2007). Co-advisor: M. Teisseire

## Multidimensional Sequential Patterns

Mining sequential patterns aims at discovering correlations between events over time, such as *A customer who bought a TV and a DVD player at the same time later bought a recorder*. However, although many works have dealt with sequential pattern mining, none of them considers, in the general case, frequent sequential patterns involving several dimensions, which lead to rules like *A customer who bought a surfboard together with a bag in NY later bought a wetsuit in SF*. We propose a novel approach, called M2SP, to mine multidimensional sequential patterns. The main originality of our proposal is that we obtain not only intra-pattern sequences but also inter-pattern sequences. Moreover, we consider generalized multidimensional sequential patterns, called wild-carded patterns, in which some of the dimension values may not be instantiated. Experiments have shown the scalability of our approach. Closed multidimensional sequential patterns have also been defined. Finally, we have studied how to extract exceptions and outliers from multidimensional historized databases.
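The surfboard example can be turned into a small Python sketch that checks whether a customer's history supports a multidimensional sequential pattern, with `*` standing for a dimension value left uninstantiated. The data layout (items as `(product, city)` tuples grouped by date), the wildcard symbol, and the function names are assumptions made for this example. The code only tests whether one sequence supports a pattern and does not reproduce the M2SP mining algorithm.

```python
WILDCARD = "*"

def item_matches(pattern_item, data_item):
    """A pattern item matches when each dimension is equal or left as a wildcard."""
    return all(p == WILDCARD or p == d for p, d in zip(pattern_item, data_item))

def itemset_matches(pattern_itemset, data_itemset):
    """Every pattern item must be matched by some item recorded at the same date."""
    return all(
        any(item_matches(p, d) for d in data_itemset) for p in pattern_itemset
    )

def supports(data_sequence, pattern):
    """True when the pattern itemsets are matched, in order, at successive dates."""
    position = 0
    for pattern_itemset in pattern:
        while position < len(data_sequence):
            data_itemset = data_sequence[position]
            position += 1
            if itemset_matches(pattern_itemset, data_itemset):
                break
        else:
            return False  # ran out of dates before matching this itemset
    return True

# Hypothetical histories: one list of (product, city) items per date.
customer_1 = [[("surfboard", "NY"), ("bag", "NY")], [("wetsuit", "SF")]]
customer_2 = [[("surfboard", "LA")], [("wetsuit", "SF")]]

exact = [[("surfboard", "NY"), ("bag", "NY")], [("wetsuit", "SF")]]
wildcarded = [[("surfboard", WILDCARD)], [("wetsuit", WILDCARD)]]

print(supports(customer_1, exact), supports(customer_2, exact))
print(supports(customer_2, wildcarded))
```

The wild-carded pattern is supported by both customers, which shows how leaving a dimension uninstantiated can make a pattern frequent when no fully instantiated pattern is.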

**Key Words**: Data Mining, Sequential Patterns, Multidimensional Rules.

**PhD Student**: M. Plantevit (to be defended in mid-July 2008). Co-advisor: M. Teisseire

## Unexpected Sequential Patterns

We study how to extract unexpected sequential patterns. To this end, we formally define beliefs in the framework of sequential patterns and propose algorithms to mine unexpected sequences and rules. Applications include Web log analysis and text mining.

**PhD Student**: H. Li (to be defended in 2009). Co-advisor: P. Poncelet

## Statistics and Sequential Patterns

This work addresses the probabilistic and statistical study of sequential patterns and its applications to large databases.

**PhD Student**: C. Low-Kam (to be defended in 2010). Advisors: A. Mas & M. Teisseire

## Tree Mining

Tree Mining aims at automatically extracting frequent subtrees from large tree databases. In this work, we have addressed the problem of defining an efficient representation of the trees so as to save memory space when dealing with huge amounts of data. We have defined a binary way to represent the database, and have softened our methods in order to extract more relevant results.

**PhD Student**: F. Del Razo Lopez (defended in July 2007). Co-advisors: P. Poncelet & M. Teisseire

**Key Words**: Tree Mining, Fuzzy Tree Mining.

## Text Mining

Text categorization is a well-known task based essentially on statistical approaches using neural networks, Support Vector Machines and other machine learning algorithms. Texts are generally considered as bags of words without any order. Although these approaches have proven to be efficient, they do not provide users with comprehensible and reusable rules about their data. Such rules are, however, very important for users to describe trends in the data they have to analyze. In this framework, an association-rule based approach has been proposed by Bing Liu (CBA). We propose to extend this approach by using sequential patterns in the SPaC method (Sequential Patterns for Classification) for text categorization. Taking order into account allows us to represent the succession of words through a document without complex and time-consuming representations and treatments such as those performed in natural language and grammatical methods. The original method we propose consists of mining sequential patterns in order to build a classifier. We show experimentally that our proposal is relevant and compares well with other methods. In particular, our method outperforms CBA and provides better results than SVM on some corpora.
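The difference between a bag of words and an ordered pattern can be sketched in a few lines of Python. A rule here is a word sequence that must appear in the document in the same order, possibly with other words in between, associated with a category and a confidence. The rules, the confidence values, and the scoring scheme (summing confidences per category) are hypothetical and chosen for this example only. This is not the SPaC classifier itself.

```python
def contains_subsequence(words, pattern):
    """True when the pattern words occur in `words` in the same order (gaps allowed)."""
    remaining = iter(words)
    # `word in remaining` advances the iterator, so order is enforced.
    return all(word in remaining for word in pattern)

def classify(text, rules, default=None):
    """Return the category whose matching rules have the highest total confidence."""
    words = text.lower().split()
    scores = {}
    for pattern, label, confidence in rules:
        if contains_subsequence(words, pattern):
            scores[label] = scores.get(label, 0.0) + confidence
    if not scores:
        return default
    return max(scores, key=scores.get)

# Hypothetical rules: (ordered word pattern, category, confidence).
rules = [
    (("match", "goal"), "sport", 0.9),
    (("market", "shares"), "finance", 0.8),
    (("goal", "match"), "sport", 0.4),
]
label = classify("the match ended with a late goal", rules)
print(label)
```

Each matching rule remains readable by the user (a word sequence and a category), which is the kind of reusable rule that purely statistical classifiers do not expose.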

**Key Words**: Text Mining, Categorization, Sequential Patterns, SPaC.

## Knowledge Discovery from Fuzzy Multidimensional Databases

Multidimensional databases and OLAP tools provide an efficient framework for data mining and lead to the so-called OLAP Mining architecture. In addition, real-world data are often imperfect, being either uncertain or imprecise. Moreover, the use of fuzzy set theory in data mining systems makes the discovered knowledge easier to understand when numerical attributes are considered, and it leads to more generalizable rules. Our work thus aims at defining an approach to perform OLAP-based mining using fuzzy multidimensional databases and fuzzy data mining algorithms. We propose an extension of multidimensional databases in order to handle imperfect information and flexible multidimensional queries. We also integrate fuzzy multidimensional databases with data mining algorithms. In particular, we propose methods to automatically mine blocks of homogeneous values from multidimensional databases and we study empty cells from a semantic point of view. A general architecture is provided, which uses fuzzy multidimensional databases as a support for knowledge discovery.

**Key Words**: Databases, OLAP, Data Mining, Fuzzy Logic, Summaries.

## Semantic Web

We aim at providing tools for large-scale data mediation so that users can query data without any knowledge of how the data are distributed or how heterogeneous they are. Research problems are addressed by combining methods from databases with methods from artificial intelligence and data mining. In particular, we study data mining and classification methods in order to group data sources according to their semantics. We also study the construction and evolution of mediation schemas. For this purpose, we mine frequent subtrees from large tree databases, such as XML databases. In order to soften the mining process, we propose fuzzy ways to consider tree inclusion.

**Key Words:** Databases, XML, P2P, Semantic Web, Data Mining, Fuzzy Logic.

Last update on 04/06/2014