MASTODONS Indexing Scientific Big Data

15th January 2014

CNRS, 3, rue Michel Ange, PARIS


The remarkable growth of data produced and stored in all kinds of domains, from trade and life sciences to social networks and research, certainly represents a major change in human activities. The management of big data is at the heart of the MASTODONS project call. Inevitably, computer-based solutions are needed to organize, manage, maintain and exploit this data, but the size of the data makes current solutions impractical. The design of novel computational approaches must take up the challenge of scalability, that is, remaining highly efficient as the data grows. A key to this goal is how the data is stored and organized in memory or on disk, and how it is pre-processed to compute additional data structures that index it. Once computed, the index is used repeatedly to query the data almost instantaneously, and thus serves to mine novel information from huge and originally unstructured data sets.
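
As a purely illustrative sketch of the "pre-process once, query many times" idea, the following Python snippet builds a tiny inverted index over a few toy strings. The function names `build_index` and `query` and the sample documents are assumptions made for this example; they are not taken from any of the talks or projects in the program.

```python
from collections import defaultdict


def build_index(documents):
    """Pre-process the documents once: map each word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def query(index, word):
    """Answer a lookup from the index alone, without rescanning the documents."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical toy collection, for illustration only.
documents = [
    "genomes are long texts",
    "images and texts",
    "indexing genomes",
]
index = build_index(documents)
print(query(index, "texts"))    # [0, 1]
print(query(index, "genomes"))  # [0, 2]
```

The construction cost is paid once; each later query is a dictionary lookup whose cost does not depend on the total size of the collection.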

This forum gives us the opportunity to survey state-of-the-art developments in indexing strategies and their applications to mining huge multimedia collections, such as texts, images, or genomes. Invited speakers will describe aspects of indexing related to ontologies, database management, scientific applications, machine learning, algorithms and data structures. Our goal is to identify and discuss the current challenges that computational sciences must take up, as well as future research directions.


  • Registration 9h30 - 10h00
  • Welcome address 10h00 - 10h15
  • Succinct Data Structures: From Theory to Practice 10h15 - 11h15
    Simon Gog
    Research Fellow, Melbourne School of Engineering. The University of Melbourne

    Succinct data structures use space close to that of the compressed representation of the underlying objects, yet provide operations with the same time complexity as their uncompressed counterparts. Since the early 90s, more and more succinct versions of data structures have been proposed, ranging from bitvectors to complex information retrieval (IR) systems. Despite their attractive theoretical properties, examples of practical use in systems remain rare. One reason is that an efficient implementation of complex structures, such as a compressed text index, requires a profound knowledge not only of data compression and data structures but also of modern hardware. In this seminar, I will give an introduction to succinct data structures and present the C++ template library SDSL, which represents the state of the art in the field. This library facilitates easy composition of compression and indexing tools, which can be used to operate on large data sets in areas such as Bioinformatics, IR, and Natural Language Processing. One recent example of the adoption of these techniques in industry is the social graph of Facebook.

  • Indexing and Processing Big Data 11h15 - 12h00
    Patrick Valduriez
    Zenith team, INRIA and LIRMM, Montpellier

    Big data has become a buzzword, referring to massive amounts of data that are very hard to handle with traditional data management tools. In particular, the ability to produce high-value information and knowledge from big data makes it critical for many applications such as decision support, forecasting, business intelligence, research, and (data-intensive) science. Indexing, processing and analyzing massive, possibly complex data is a major challenge, since solutions must combine new data management techniques (to deal with new kinds of data) with large-scale parallelism in cluster, grid or cloud environments. Parallel data processing has long been exploited in the context of distributed and parallel database systems for highly structured data. But big data encompasses different data formats (documents, sequences, graphs, arrays, …) that require significant extensions to traditional data indexing and processing techniques. In this talk

  • Large-scale multimedia data: a short overview of applications and of analysis and search techniques 13h30 - 14h20
    Laurent Amsaleg

    The volume of multimedia information is growing at an unprecedented rate. This growth affects both professional producers and archivists of multimedia documents and the general public. On the professional side, television, cinema and radio have all gone fully digital; INA must store more than a hundred broadcast channels every day, 24 hours a day, and its digital archive grows by several hundred thousand hours per year. On the consumer side, social networks drive the production and storage of phenomenal quantities of multimedia data. For example, Facebook alone manages more than 1,000 billion photos, and this stock grows by more than 3 billion images each month. Research on the automatic analysis of multimedia documents has made great progress in recent years: we can now describe documents automatically in better ways (descriptions that are more robust, more compact and more discriminative), enabling fine-grained recognition or classification; we can index better, enabling fast searches despite the scale; and, to some extent and for a few applications, we can better bridge the semantic gap, notably by exploiting multimodality. This talk will give a quick overview of applications built on large collections of multimedia documents. Some indexing techniques for searching these collections will be mentioned.

  • On the need for indexing large data sets in scientific applications 14h20 - 14h50
    Mohand-Said Hacid
    Professor, LIRIS, Université Claude Bernard Lyon 1

    Access time is an important consideration when querying and analyzing large data sets. Several techniques (indexing, view materialization, …) have emerged that make it possible to quickly locate fragments of data on disk (or on a cluster) or to anticipate costly computations. We focus here on the need for indexing in scientific domains. We will discuss the requirements and existing approaches. Then, we will consider these issues in the framework of the Amadeus, GAIA and PetaSky projects. We will also present the first results of our experiments.
  • Break
  • A Hitchhiker's guide to ontology 15h10 - 16h00
    Fabian Suchanek
    Télécom ParisTech

    In this talk, I will present recent work of my group in the area of knowledge bases. It covers four areas of research around ontologies and knowledge bases. The first area is the construction of the YAGO knowledge base. YAGO now includes time and space information, and has grown into a larger project at the Max Planck Institute for Informatics. The second area is the alignment of knowledge bases. This includes the alignment of classes, instances, and relations across knowledge bases. The third area is rule mining. Our project finds semantic correlations in the form of Horn rules in the knowledge base. I will also talk about watermarking approaches to trace the provenance of ontological data. Finally, I will show applications of the knowledge base for mining news corpora.
  • ARESOS Project: Reconstruction, Analyse et Accès aux Données dans les Grands Réseaux Socio-Sémantiques 16h00 - 16h30
    Patrick Gallinari
    Professor, LIP6, Université Pierre et Marie Curie, Paris

    Prof. Gallinari will present the challenges of mining and analysing large social network data. Keywords include social information retrieval, social tagging and interaction, collaborative recommendation, and phylomemy.