COLLOQUE MASTODONS Indexing Scientific Big Data,

Paris, France, January 15th 2014

Big Data Recommendation and Analysis in Mastodons (CNRS INS2I) for Plant Phenotyping

Large-scale Scientific Data Sharing and Analysis – application to plant phenotyping data

Project leaders: Eric Rivals (DR CNRS) and Esther Pacitti (Professor UM2, LIRMM)

Mastodons: Scientific Data Management

Phenodata is a CNRS INS2I Mastodons project (2012-2014).

Recent progress in agronomy, bioinformatics, physics and environmental science results in the generation of overwhelming amounts of experimental data produced through observation and simulation. New observational instruments (e.g. satellites, sensors, the Large Hadron Collider) and simulation tools create a huge data overload. For example, climate modeling data are growing so fast that collections are expected to reach hundreds of exabytes by 2020. Such data must be processed, i.e. cleaned, transformed and analyzed, in order to draw conclusions, prove scientific theories and produce knowledge. The goal of scientific data management is to make scientific data easier for scientists of different disciplines and institutions to access, reproduce, and share.

The datasets generated this way are complex, in particular because of the heterogeneous methods used to produce them, the uncertainty of the captured data and, above all, their inherently multi-scale nature (spatial and temporal scales). This results in data with hundreds of attributes, dimensions or descriptors. Processing and analyzing such massive sets of complex data is therefore a major challenge, calling for solutions that combine new data management techniques with large-scale parallelism in cluster, grid or cloud environments.

Furthermore, current scientific issues require integrated datasets and involve scientists from different disciplines (e.g. biologists, soil scientists, and geologists working on an environmental project), in some cases from different organizations located in different countries. But because each discipline or organization tends to produce and manage its own data, in specific formats and with its own processes, it is increasingly difficult to share distributed data.

This raises two major challenges for data management. The first is sharing these datasets among scientists of different disciplines who want to collaborate; the second is data analysis.

This project also involves the Mab team of LIRMM and their partners. Here we present the scientific activities on plant phenotyping carried out by the Zenith team with our partners. The two teams share the same events.