Mastodons International Workshop on
"Big Data Management and Crowd Sourcing towards Scientific Data"
Monday 30th June 2014, in Montpellier, 95 rue de la Galéra
IBC & LIRMM (UM2, CNRS-Mastodons), INRIA-UCSB associated team Bigdatanet
Organisation: Esther Pacitti
LIRMM reception desk: +33 (0)4 67 41 85 85
In the context of the Mastodons project in Montpellier, we are addressing problems related to the management and analysis of big scientific data, in particular biology data such as those produced by next generation sequencing tools and plant phenotyping platforms. The objective of the workshop is to discuss emerging solutions for big data management with world-class scientists.
- 10h30, 95 rue de la Galéra, Montpellier, room 127 at Institut de Biologie Computationnelle
Emerging Technologies for Big Data Management and Analytics
Divy Agrawal, Professor of Computer Science, University of California at Santa Barbara
During the past decade, Google has been instrumental in establishing the broader research agenda in the context of Big Data management and Big Data analytics. With the advent of BigTable and related technologies (Google File System, Chubby Lock Service, and the Paxos-based distributed consensus protocol), Google initiated a data management revolution called NoSQL that has taken both data management researchers and practitioners by storm. Numerous offerings, both proprietary and open source, are now available that essentially mimic Google’s approach to managing Big Data. Similarly, Google’s MapReduce paradigm has resulted in the abandonment of established data analytics paradigms, both within Google and in the broader commercial arena.
While academics and practitioners are enamored with Google’s Big Data technologies that are almost a decade old, Google is continuing to define the future research agenda in the context of Big Data. In particular, Google has recognized the deficiencies of the NoSQL approach to data management, especially in the context of data-centric products and services. In the recent past, Google has revealed a flurry of next-generation Big Data management technologies that provide stronger consistency guarantees, similar to those of traditional database management solutions. Notable examples are Megastore, Spanner, and a distributed database management solution called F1. What is noteworthy is that all these technologies are inherently designed to be scalable and are multi-homed (i.e., they can withstand large-scale datacenter outages). In the same vein, in the context of Big Data analytics, Google has developed key technologies such as Dremel, Photon, PowerDrill, and MillWheel. Dremel is a system that enables interactive analysis (as opposed to batched analysis using MapReduce) of Web-scale datasets. Photon is a system that enables fault-tolerant and scalable joining of continuous data streams (e.g., query logs with advertising click logs). PowerDrill is an analytic engine that is capable of processing trillions of cells with a single mouse click. Finally, MillWheel is a system that has been developed for fault-tolerant stream processing at Internet scale.
In this presentation, we will introduce and summarize these point solutions, drawing on the recent research papers published by the research and engineering teams at Google. The goal of this undertaking is to underscore that there is more to Big Data management and analytics than just BigTable and MapReduce, especially in the broader research and development context of Big Data.
- 13h30, Lecture room "Saint Priest", on Saint Priest Campus
Cloud & big data: the perfect marriage?
Patrick Valduriez, Zenith team, INRIA and LIRMM, Montpellier
Abstract: Cloud and big data technologies are now converging to promise cost-effective delivery of big data services. But is it a perfect marriage?
The ability to produce high-value information and knowledge from big data makes it critical for many applications such as decision support, forecasting, business intelligence, research, and (data-intensive) science. Processing and analyzing massive, possibly complex data must combine new data management techniques (to deal with new kinds of data) with large-scale parallelism, typically in cluster environments.
Cloud computing, on the other hand, encompasses on-demand, reliable services provided over the Internet with easy access to virtually infinite computing, storage and networking resources. Through simple Web interfaces, users can outsource complex tasks, such as data storage, system administration, or application deployment, to very large data centers operated by cloud providers.
Although cloud and big data have different goals (big data aims at added value and operational performance while the cloud targets flexibility and reduced cost), they can well help each other by (1) encouraging organizations to outsource more and more strategic internal data to the cloud and (2) getting value out of it (for instance, by integrating it with external data) through big data analytics.
However, a perfect marriage is yet to come. The current cloud data management solutions have traded data consistency for scalability and performance, thus requiring tremendous programming effort and expertise to develop data-intensive cloud applications with correct semantics. Furthermore, they have specialized in different kinds of data (structured, unstructured, documents, graphs), which makes data integration and analysis very hard. Finally, it is often the case that useful data span multisite clouds or even multiple, heterogeneous clouds that do not interoperate.
In this talk, I will review current cloud and big data technologies and discuss these issues. I will also point out current directions of research, in particular in the context of the CoherentPaaS FP7 ICT project, in which we are engaged.
- Coffee Break
- 14h45, Lecture room "Saint Priest", on Saint Priest Campus
Consistent, Elastic and Fault-Tolerant Management of Big Data in the Cloud
Amr EL ABBADI, Professor of Computer Science, University of California at Santa Barbara
Abstract: The advent of cloud computing and the increasing demands of Big Data applications have led to the development of global, large-scale computing systems which require data management across multiple data centers. Initially, key-value stores were proposed to provide single-row-level operations with eventual consistency. Although key-value stores provide high availability, they are not ideal for applications that require consistent data views. More recently, there has been a gradual shift to provide transactions with strong consistency to simplify application development. In this talk, we will start by analyzing the need for, and the revival of, SQL database management systems in large cloud settings. We will discuss several state-of-the-art systems, which provide transactional guarantees on collections of data items, thus supporting complex queries, while attempting to ensure scalability, elasticity and fault-tolerance. Of particular interest are applications which require geo-replicated data management. We will therefore discuss different approaches for replicating data in multi-datacenter environments. Throughout the talk, principles will be illustrated using concrete systems and protocols.
Discussion and Perspectives on Scientific Big Data Management