How to survive the Data Deluge: Cloud computing for large scale data analysis
  Gianmarco De Francisci Morales
  IMT Institute for Advanced Studies Lucca
  CSE PhD XXIV Cycle
  8 Mar 2010
  Monday, 8 March 2010
Outline
  Part 1: Introduction
    What, Why and History
  Part 2: State of the art
    Current technologies and research
  Part 3: Proposal
    Ideas for future improvements
How would you sort...
  ... 1GB of data?
  ... 100GB of data?
  ... 10TB of data?
  Scale matters!
  Because More Isn't Just More, More Is Different
The Data Deluge
  The world is drowning in data
  Web 2.0 (user generated content)
  Scientific experiments
    Physics (particle accelerators)
    Astronomy (satellite images)
    Biology (genomic maps)
  Sensors (GPS, RFID)
The “Big Data” problem
  “Data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time”
    Jacobs, CACM 2009
  Requirements for a large scale data analysis system
    Scalability (scale free)
    Cost effectiveness (autonomic)
    Fault tolerance (highly available)
Methodology evolution
  DBMS are the most common tool for data analysis
    ’60s CODASYL
    ’70s Relational DBMS
    ’80s & ’90s Parallel DBMS
  Not much has happened since the ’70s
  The fundamental model and code are still the same
Parallel DBMS
  A solution for performance problems
  Scale-out on shared nothing architectures using dataflow operators and horizontal partitioning
  Problems:
    Not enough flexibility and ease of use
    Limited fault-tolerance and scalability
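Horizontal partitioning on a shared nothing architecture can be illustrated with a small sketch. This is only an assumption-laden toy, not how any particular parallel DBMS is implemented: the names `node_for`, `hash_partition`, `local_sum` and `parallel_group_sum` are hypothetical, the "nodes" are plain Python lists, and a CRC32 hash stands in for whatever partitioning function a real system would use.

```python
import zlib
from collections import Counter


def node_for(key, num_nodes):
    """Pick a node for a key with a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def hash_partition(rows, key_field, num_nodes):
    """Horizontally partition rows so that equal keys land on the same node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[node_for(row[key_field], num_nodes)].append(row)
    return partitions


def local_sum(partition, key_field, value_field):
    """Aggregation operator that each node runs on its own partition only."""
    totals = Counter()
    for row in partition:
        totals[row[key_field]] += row[value_field]
    return totals


def parallel_group_sum(rows, key_field, value_field, num_nodes):
    """Simulate GROUP BY key, SUM(value) over num_nodes shared-nothing nodes."""
    result = {}
    for partition in hash_partition(rows, key_field, num_nodes):
        # Key sets of different partitions are disjoint, so a plain
        # dictionary update is enough to merge the per-node results.
        result.update(local_sum(partition, key_field, value_field))
    return result
```

Because every key is owned by exactly one node, the per-node results need no further coordination, which is what makes this kind of operator scale out.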
Cloud Computing
  Convergence of parallel computing, virtualization and service oriented architectures
  Focus on being scale-free, fault tolerant, cost effective and easy to use
  Buzzword!
  Distributed system
    Scalability
    Location, replication and failure transparency
Software stacks
                        Google     Yahoo         Microsoft         Others
  High Level Languages  Sawzall    Pig/Latin     DryadLINQ, SCOPE  Hive, Cascading
  Computation           MapReduce  Hadoop        Dryad             -
  Data Abstraction      BigTable   HBase, PNUTS  -                 Cassandra, Voldemort
  Distributed Data      GFS        HDFS          Cosmos            Dynamo
  Coordination          Chubby     Zookeeper     -                 -
Comparison with PDBMS
  CAP Theorem
  BASE vs ACID
  Computing on large data vs Handling large data
  OLAP vs OLTP
  User Defined Functions vs Select-Project-Join
  Nested vs Flat data model
Comparison with PDBMS
  "MapReduce, a major step backwards"
    DeWitt, Stonebraker
  "If the only tool you have is a hammer, you tend to see every problem as a nail"
    Abraham Maslow
  SQL and Relational Model are not the answer
A new computation model for rack-based computing
  Goal: I/O cost characterization
  Issues:
    only theoretical analysis
    no existing reference system
  Future: best algorithms for the model, model adaptation to real systems
  F. Afrati and J. Ullman. Unpublished
A model of computation for MapReduce
  Goal: theoretical computability characterization of MapReduce algorithms
  Result: algorithmic design technique for MapReduce
  Future: develop algorithms in this class, find relationships with other classes
  H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010
Google’s MapReduce programming model - revisited
  Goal: functional style reverse engineering of MapReduce
  How: top down functional analysis
  Result: simplified and rationalized MapReduce model with runnable functional specification
  R. Lammel. In Science of Computer Programming, 2007
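To give a feel for what a runnable functional specification of MapReduce can look like, here is a minimal single-process sketch in Python. It is not the specification from the paper: `map_reduce`, `word_count_mapper` and `word_count_reducer` are hypothetical names, and distribution, partitioning and fault tolerance are deliberately left out.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(mapper, reducer, records):
    """Apply mapper to every (key, value) record, group by key, then reduce."""
    # Map phase: each input record yields zero or more intermediate pairs.
    intermediate = chain.from_iterable(mapper(k, v) for k, v in records)

    # Shuffle phase: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: fold each group of values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(_doc_id, text):
    """Emit (word, 1) for every word in a document."""
    return [(word, 1) for word in text.split()]


def word_count_reducer(_word, counts):
    """Sum the partial counts of one word."""
    return sum(counts)
```

For example, `map_reduce(word_count_mapper, word_count_reducer, [("d1", "a b a"), ("d2", "b c")])` counts words across the two documents.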
Map-Reduce-Merge: Simplified relational data processing on large clusters
  Goal: implement relational operators efficiently
  How: new final phase that merges 2 key-value lists
  Issues: very low level and hard to use, needs integration into a high level language
  H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD 2007
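The idea of a final phase that merges two key-value lists can be sketched as a sort-merge equi-join over two reducer outputs. This is an illustration under simplifying assumptions, not the API from the paper: the function name `merge` and its `combine` parameter are hypothetical, both inputs are assumed to be sorted by key, and each key is assumed to appear at most once per list (as it would in a reducer's output).

```python
def merge(left, right, combine):
    """Join two key-sorted lists of (key, value) pairs on equal keys.

    Assumes each key occurs at most once per list.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            # Matching keys: emit one merged record and advance both sides.
            out.append((left_key, combine(left_value, right_value)))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # No partner on the right for this key; skip it.
        else:
            j += 1  # No partner on the left for this key; skip it.
    return out
```

Changing what happens in the two "skip" branches would turn this inner join into an outer join, which hints at why such a phase can express several relational operators.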
HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads
  Goal: advantages of both DB and MapReduce
  How: integrate a DBMS (PostgreSQL) in Hadoop, Hive as interface
  Issues: it is better to reuse principles than artifacts
  A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin. In VLDB, 2009
Interactive analysis of web-scale data
  Goal: speed up general queries for big data
  How: pre-computed templates to fill at run-time
  Future:
    which templates are useful for interactive analysis?
    help the user to formulate templates (sampling?)
  C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, B. Reed. In CIDR, 2009
MapReduce Online
  Goal: speed up turnaround of MapReduce jobs
  How: operator pipelining, online aggregation
  Issues:
    limited inter-job pipelining (data only)
    inter-job aggregation problematic (scratch data)
  T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009
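Online aggregation means reporting a running estimate of the final answer while the job is still consuming its input. The sketch below shows the idea for a sum. It is not the implementation from the technical report: `online_sum_estimates` is a hypothetical name, the total number of records is assumed to be known in advance, and records are assumed to arrive in random order so that the scaled partial sum is a sensible estimate.

```python
def online_sum_estimates(batches, total_records):
    """Yield an estimate of the final sum after each processed batch.

    Assumes total_records is known and batches arrive in random order.
    """
    seen = 0
    partial = 0.0
    for batch in batches:
        partial += sum(batch)
        seen += len(batch)
        if seen == 0:
            continue  # Nothing processed yet, so no estimate can be given.
        # Scale the partial sum by the inverse of the fraction seen so far.
        yield partial * total_records / seen
```

The last value yielded equals the exact sum once every record has been seen; earlier values are approximations the user can act on without waiting for the job to finish.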
Research Problems
  Cloud Computing is batch oriented
  High level languages push for efficient relational operators
  Not clear which algorithms and problems are best for these systems
  Research efforts are “erratic”, no common research agenda yet
  Early stage of development, more effort needed
Research Questions
  How to design novel algorithms for large scale data analysis?
  How to support these algorithms on cloud computing systems?
  Is it possible to carry out online data analysis on such systems?
Methodology
  Top down: start by studying existing algorithms, extract a representative workload
  Identify weaknesses in existing systems
  Use principles of database research to fill the gaps
  Evaluate contributions from both theoretical and experimental points of view
Some Ideas
  Sampling and result estimation
    A good enough result is often acceptable
  Semantic clues
    Leverage properties of M/R functions (distributivity, associativity, commutativity)
    Properties of the input may speed up the computation
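One way the algebraic properties of a reduce function can be leveraged is local pre-aggregation: when the function is associative and commutative (as addition is), partial results can be computed on each mapper and merged in any order without changing the answer. The sketch below is a hypothetical illustration; `combine_locally` and `reduce_partials` are made-up names and the mapper outputs are plain lists.

```python
from collections import Counter


def combine_locally(pairs):
    """Pre-aggregate (key, count) pairs on one mapper before any data is sent."""
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return local


def reduce_partials(partials):
    """Merge per-mapper partial sums; the order of the partials is irrelevant."""
    total = Counter()
    for partial in partials:
        for key, count in partial.items():
            total[key] += count
    return dict(total)
```

With this scheme each mapper ships at most one pair per distinct key, however many raw pairs it produced.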
Thesis Goals
  Build and evaluate a toolbox of algorithms for large scale data analysis on cloud computing systems
  Design extensions to existing programming paradigms in order to support these algorithms
  Develop methods to speed up these algorithms to support online processing