ZENITH: Scientific Data Management
Modern sciences such as agronomy, bioinformatics, and environmental science must deal with overwhelming amounts of experimental data. Such data must be processed (cleaned, transformed, analyzed) in all kinds of ways in order to draw new conclusions, prove scientific theories and produce knowledge. However, constant progress in scientific observational instruments and simulation tools creates a huge data overload. For example, climate modeling data are growing so fast that collections are expected to reach hundreds of exabytes by 2020. Scientific data are also very complex, in particular because of the heterogeneous methods used for producing data, the uncertainty of captured data, the inherently multi-scale nature of many sciences and the growing use of imaging, resulting in data with hundreds of attributes, dimensions or descriptors. Processing and analyzing such massive sets of complex data is therefore a major challenge, since solutions must combine new data management techniques with large-scale parallelism in cluster, grid or cloud environments.
Scientific data management is now on the agenda of a very active research community composed of scientists from different disciplines and data management researchers. For instance, the SciDB organization is building an open-source database system for scientific data analytics.
The three main challenges of scientific data management can be summarized as:
- scale (big data, big applications);
- complexity (uncertain, multi-scale data with lots of dimensions);
- heterogeneity (in particular, data semantics heterogeneity).
The overall goal of Zenith is to address these challenges by proposing innovative solutions with significant advantages in terms of scalability, functionality, ease of use, and performance. We plan to design and validate our solutions by working closely with scientific application partners. To further validate our solutions and extend the scope of our results, we also want to foster industrial collaborations, even in non-scientific applications, provided that they exhibit similar challenges.
- Reza Akbarinia, Research Fellow INRIA
- Alexis Joly, Research Fellow INRIA
- Antoine Liutkus, Research Fellow INRIA
- Jean-Christophe Lombardo, Research Engineer INRIA
- Florent Masseglia, Research Fellow INRIA
- Esther Pacitti, Professor UM
- Didier Parigot, Research Fellow INRIA
- Patrick Valduriez, Research Director INRIA
- Antoine Affouard, CDD Engineer-Technician
- Christophe Botella, Doctoral
- Gaetan Heidsieck, Doctoral
- Boyan Kolev, CDD Researcher
- Oleksandra Levchenko, CDD Engineer-Technician
- Valentin Leveau, CDD Engineer-Technician
- Ji Liu, Postdoctoral Researcher
- Titouan Lorieul, Doctoral
- Sakina Mahboubi, Doctoral
- Khadidja Meguelati, Doctoral
- Dennis Shasha, INVML
- Djamel Yagoubi, Doctoral
Our approach is to capitalize on the principles of distributed data management. In particular, we plan to exploit: high-level languages as the basis for data independence and automatic optimization; data semantics (taxonomies, folksonomies, ontologies, …) to improve information retrieval and automate data integration; declarative languages (algebra, calculus) to manipulate data and workflows, with user-defined functions; and user (social) profiles and relationships between participants to support recommendation. Furthermore, we will exploit highly distributed environments, in particular P2P for data sharing between participants and parallel processing to scale up in the cloud. To reflect our approach, we organize our research program in three complementary research themes:
- Data and metadata management. This theme addresses the problems of managing and integrating data and metadata with uncertainty, in particular, uncertain entity resolution and distributed probabilistic query processing.
- Data and process sharing. This theme addresses the problems of sharing scientific data and processes in highly distributed and parallel environments, in particular, social-based P2P data sharing, recommendation and scientific workflow management.
- Scalable data analysis. Given the gap between the growth of computing power and that of data production, our ability to analyze these data is inevitably at stake. This theme addresses the scalability problem by investigating new data mining and content-based retrieval techniques that exploit parallelism in the cloud.
R. Akbarinia, P. Valduriez, G. Verger, Efficient Evaluation of SUM Queries Over Probabilistic Data, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No 4, p. 764-775, 2013.
M. El Dick, E. Pacitti, R. Akbarinia, B. Kemme, Building a Peer-to-Peer Content Distribution Network with High Performance, Scalability and Robustness, Information Systems, Vol. 36, No 2, p. 222-247, 2011.
P. Letessier, O. Buisson, A. Joly, N. Boujemaa, Scalable Mining of Small Visual Objects, ACM Multimedia Conf., 2012.
E. Ogasawara, D. De Oliveira, P. Valduriez, J. Dias, F. Porto, M. Mattoso, An Algebraic Approach for Data-Centric Scientific Workflows, Proceedings of VLDB, Vol. 4, No 11, p. 1328-1339, 2011.
F. Petitjean, F. Masseglia, P. Gançarski, G. Forestier, Discovering Significant Evolution Patterns from Satellite Image Time Series, International Journal of Neural Systems, Vol. 21, No 6, p. 475-489, 2011.
Last update on 06/11/2017