Mastodons achievements

Recommendation for big data sharing

We consider the problem of big data sharing among professional communities, e.g. scientific communities, which have their own datasets and documents in a private cloud and wish to share some of them in a personalized and controlled way. Our approach to help community members locate useful information is to exploit recommendations based on explicit personalization (e.g. using the friendship networks of scientists) over multiple private clouds in a P2P fashion. We can then develop new decentralized, scalable recommendation protocols.

In [Servajean 2013], we investigate profile diversity for P2P search and recommendation of scientific documents. In scientific domains, endorsements from different communities are important indicators of the broad focus of scientific documents and should be accounted for in search and recommendation. To do so, we introduce profile diversity, a novel idea in searching scientific documents. Traditional content diversity has been thoroughly studied in centralized search and advertising, database queries, and recommendations, and addresses the problem of returning relevant but too-similar documents. We argue that content diversity alone does not suffice for finding documents endorsed by different scientific communities and that profile diversity is needed to avoid returning popular but too-focused documents. We validated our proposal using a self-built benchmark based on INRA content and the TREC09 benchmark.

This work is part of Maximilien Servajean's PhD thesis, partially funded by Labex Numev.
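
To make the idea of profile diversity more concrete, the following is a minimal sketch of a greedy re-ranking that balances relevance against the diversity of the communities endorsing each document. It is only an illustration, not the scoring function of [Servajean 2013]: the Jaccard distance over sets of endorsing communities, the `trade_off` parameter, and the function names are all assumptions made for this example.

```python
def jaccard_distance(a, b):
    """Distance between two sets of endorsing communities (0 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def diversify(candidates, k, trade_off=0.5):
    """Greedily pick k documents, balancing relevance and profile diversity.

    candidates: list of (doc_id, relevance, set_of_endorsing_communities).
    trade_off:  weight of relevance versus diversity (hypothetical parameter,
                1.0 means relevance only).
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def score(doc):
            _, relevance, communities = doc
            if not selected:
                return relevance
            # Diversity = distance to the closest already-selected document.
            diversity = min(
                jaccard_distance(communities, chosen[2]) for chosen in selected
            )
            return trade_off * relevance + (1 - trade_off) * diversity

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _, _ in selected]


# Hypothetical documents: two popular but narrowly endorsed, one broadly endorsed.
docs = [
    ("d1", 0.90, {"genetics"}),
    ("d2", 0.85, {"genetics"}),
    ("d3", 0.60, {"ecology", "agronomy"}),
]
top_diverse = diversify(docs, k=2)
top_relevant = diversify(docs, k=2, trade_off=1.0)
```

With relevance only, the two documents endorsed by the same community are returned; with the diversity term, the second slot goes to the document endorsed by other communities.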

Phenotypic Data Analysis

The goal here was to investigate data mining techniques for phenotyping and to open a dialogue between our research teams (the Zenith team of Inria for computer science and data mining, and the LEPSE laboratory for phenotyping).

This work follows a two-step approach. First, the data acquired by phenoArc are not always clean, which is a problem for the experts since such data may lead to inappropriate results. Our goal was to propose an automatic detection of “potentially wrong data” in order to raise alarms for the experts and save them time when cleaning their data.
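
As an illustration of what such an alarm could look like, here is a minimal sketch of a Hampel-style filter that flags measurements deviating strongly from the median of their local window. The text above does not say which detection technique was used, so this method, the function name, and the `window`, `threshold` and `min_scale` parameters are assumptions chosen for the example.

```python
from statistics import median


def flag_suspect_points(series, window=5, threshold=3.0, min_scale=1e-6):
    """Return the indices of values that look inconsistent with their neighbours.

    A value is flagged when its distance to the local median exceeds
    `threshold` times a robust spread estimate (scaled median absolute
    deviation). All parameters are hypothetical defaults.
    """
    half = window // 2
    suspects = []
    for i, value in enumerate(series):
        # Centered window, truncated at the series boundaries.
        local = series[max(0, i - half): i + half + 1]
        local_median = median(local)
        mad = median([abs(x - local_median) for x in local])
        # 1.4826 makes the MAD comparable to a standard deviation for
        # Gaussian noise; min_scale avoids a zero spread on flat windows.
        scale = max(1.4826 * mad, min_scale)
        if abs(value - local_median) > threshold * scale:
            suspects.append(i)
    return suspects


# Hypothetical growth measurements with one implausible reading at index 4.
measurements = [10.0, 10.5, 11.1, 11.4, 50.0, 12.3, 12.9, 13.4, 14.0]
alarms = flag_suspect_points(measurements)
```

The function only reports indices, which fits the stated purpose of raising alarms while leaving the decision to remove or correct a value to the experts.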

The second contribution is about time series clustering. Here, the goal is to work on the whole set of clean data (after applying the anomaly detection techniques described above) in order to obtain clusters of time series. For the experts, this analysis is important, since they want to identify characteristics that explain rapid or slow plant growth in different environments. We have implemented and tested various techniques for this step, including DBSCAN and HAC (for which multiple distance measures and merging criteria have been tested).

This work is described in the Master’s thesis report of Irina Alles.
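
For readers unfamiliar with HAC (hierarchical agglomerative clustering), the following is a minimal pure-Python sketch applied to equal-length time series. It is not the implementation from the report: the Euclidean distance and average-linkage merging criterion are only one possible combination, and the function names and sample data are assumptions for this example. The distance and the merging criterion are passed as parameters so that other choices can be swapped in.

```python
import math


def euclidean(a, b):
    """Point-wise Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, series, distance):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(distance(series[i], series[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(series, n_clusters, distance=euclidean, linkage=average_linkage):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a sorted list of series indices.
    """
    clusters = [[i] for i in range(len(series))]
    target = max(n_clusters, 1)
    while len(clusters) > target:
        best = None
        # Exhaustive search for the closest pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b], series, distance)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical growth curves: two fast-growing and two slow-growing plants.
growth = [
    [1.0, 3.0, 6.0, 10.0],
    [1.0, 3.2, 6.1, 9.8],
    [1.0, 1.2, 1.5, 1.7],
    [1.0, 1.1, 1.4, 1.8],
]
groups = hac(growth, n_clusters=2)
```

On this toy input the fast-growing and slow-growing curves end up in separate clusters. The exhaustive pair search is easy to read but recomputes every linkage at each merge, so a library implementation would be preferable on real datasets.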