CIEL talk: R. Akbarinia (ZENITH) – Monday, March 15, 2021
Parallel Techniques for Big Data Analytics
Abstract: Nowadays, we are witnessing the production of large volumes of data in many application domains, such as social networks, medical monitoring, weather forecasting, biology, agronomy, and earth monitoring. Analyzing this data would help us extract hidden knowledge about events that have happened or will happen in the future. However, traditional data analytics techniques are not efficient for analyzing such data volumes. A promising solution for improving the performance of data analytics is to take advantage of the computing power of distributed systems and parallel frameworks such as Spark.

In this talk, I present the parallel and distributed techniques that we developed in the Zenith team to deal with two main data analytics problems: 1) similarity search in big time series datasets; 2) maximally informative k-itemsets mining.

Fast and accurate similarity search over time series is very important for many applications such as fraud detection in finance, earthquake prediction, and plant monitoring. Index construction is one of the most popular techniques for improving the performance of similarity queries, and it has been successfully used in a variety of settings and applications. In our research activities, we took advantage of parallel and distributed frameworks such as Spark and developed efficient solutions for parallel construction of tree-based and grid-based indexes over large time series datasets. We also developed efficient algorithms for parallel similarity search over distributed time series datasets using indexes.

The second problem addressed is maximally informative k-itemsets mining (miki for short), which is one of the fundamental building blocks for exploring informative patterns in databases. Efficient miki mining has a high impact on various tasks such as supervised learning, unsupervised learning, or information retrieval, to cite a few.
A typical application is the discovery of discriminative sets of features, based on joint entropy, which allows distinguishing between different categories of objects. With massive amounts of data, however, miki mining is very challenging because of the high number of entropy computations. An efficient miki mining solution should scale up with the increase in the size of the itemsets, calling for cutting-edge parallel algorithms and high-performance computation of miki. We developed such a parallel solution, which makes the discovery of miki from a very large database (up to terabytes of data) simple and effective.
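
To make the notion of a maximally informative k-itemset concrete, here is a minimal single-machine sketch in Python. It is not the parallel solution presented in the talk. It assumes the usual definition in which the informativeness of an itemset is the joint entropy of the presence/absence pattern of its items across transactions, and it finds the best itemset by exhaustive enumeration. The function names (`joint_entropy`, `brute_force_miki`) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence pattern of `itemset`.

    Assumption: each transaction is a set of items, and each item is
    treated as a binary feature (present or absent).
    """
    n = len(transactions)
    # Project every transaction onto the itemset as a tuple of booleans,
    # then count how often each projection occurs.
    patterns = Counter(
        tuple(item in t for item in itemset) for t in transactions
    )
    return -sum((c / n) * log2(c / n) for c in patterns.values())


def brute_force_miki(transactions, k):
    """Return the k-itemset with the highest joint entropy.

    Exhaustive search: one entropy computation per candidate itemset,
    which is the cost that grows quickly with the number of items and k.
    """
    items = sorted(set().union(*transactions))
    return max(
        combinations(items, k),
        key=lambda s: joint_entropy(transactions, s),
    )


# Toy database (hypothetical): "d" occurs everywhere and carries no
# information, while "a" and "b" split the transactions evenly.
toy = [{"a", "b", "d"}, {"a", "d"}, {"b", "d"}, {"d"}]
print(brute_force_miki(toy, 2))
```

On this toy database the call prints `('a', 'b')`: the pair splits the four transactions into four equally likely patterns (2 bits of joint entropy), whereas any pair containing `d` reaches only 1 bit. The number of candidates grows combinatorially with the number of items and with k, which is why the abstract points to parallel algorithms for large databases.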