Slide 1

How to survive the Data Deluge: Cloud computing for large scale data analysis
Gianmarco De Francisci Morales
IMT Institute for Advanced Studies Lucca
CSE PhD XXIV Cycle
8 Mar 2010

Slide 2

Outline
Part 1: Introduction (What, Why and History)
Part 2: State of the art (current technologies and research)
Part 3: Proposal (ideas for future improvements)

Slide 3

Part 1: Introduction

Slide 4

How would you sort...

Slide 5

How would you sort...
... 1GB of data?

Slide 6

How would you sort...
... 1GB of data?
... 100GB of data?

Slide 7

How would you sort...
... 1GB of data?
... 100GB of data?
... 10TB of data?

Slide 8

How would you sort...
... 1GB of data?
... 100GB of data?
... 10TB of data?
Scale matters! Because More Isn't Just More, More Is Different.
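
Each jump in scale forces a change of technique, not just of constant factors: 1GB fits in RAM, so any in-memory sort will do; 100GB fits on one disk but not in memory, calling for an external sort; 10TB no longer fits on one machine at all. As a concrete illustration of the middle case, here is a minimal sketch of an external merge sort (the file name and chunk size are illustrative assumptions, not from the slides):

```python
import heapq, os, tempfile

# Minimal external merge sort sketch, for data that fits on one disk but
# not in RAM. Phase 1 sorts memory-sized chunks into temporary "runs";
# phase 2 streams a k-way merge of the runs to the output file.

def external_sort(path, chunk_lines=1_000_000):
    runs = []
    with open(path) as f:
        while True:
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
            chunk.sort()
            with tempfile.NamedTemporaryFile("w", delete=False, suffix=".run") as run:
                run.writelines(chunk)
                runs.append(run.name)
    files = [open(r) for r in runs]
    with open(path + ".sorted", "w") as out:
        out.writelines(heapq.merge(*files))  # streams; never holds all data in RAM
    for f in files:
        f.close()
    for r in runs:
        os.remove(r)

external_sort("input.txt")  # hypothetical input file
```

At 10TB even this breaks down: the data must be partitioned across machines and sorted in parallel, which is exactly the regime the rest of the talk addresses.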

Slide 9

The Petabyte Age

Slide 10

The Data Deluge
The world is drowning in data:
- Web 2.0 (user generated content)
- Scientific experiments: Physics (particle accelerators), Astronomy (satellite images), Biology (genomic maps)
- Sensors (GPS, RFID)

Slide 11

“Data is not information, information is not knowledge, knowledge is not wisdom.”
Clifford Stoll

Slide 12

The “Big Data” problem
“Data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time” (Jacobs, CACM 2009)
Requirements for a large scale data analysis system:
- Scalability (scale free)
- Cost effectiveness (autonomic)
- Fault tolerance (highly available)

Slide 13

Methodology evolution
DBMSs are the most common tool for data analysis:
- ’60s: CODASYL
- ’70s: Relational DBMS
- ’80s & ’90s: Parallel DBMS
Not much has happened since the ’70s: the fundamental model and code are still the same.

Slide 14

Relational DBMS evolution
Yesterday: Relational model, OLTP and SQL
Today:
- Different markets (OLTP, OLAP, Stream, etc.)
- Stored Procedures & User Defined Functions
- High performance requirements

Slide 15

Parallel DBMS
A solution for performance problems: scale out on shared-nothing architectures using dataflow operators and horizontal partitioning.
Problems:
- Not enough flexibility and ease of use
- Limited fault tolerance and scalability

Slide 16

Is parallel wrong?
Parallel computing is dead:
Amdahl’s law: SpUp(N) = 1 / ((1 - P) + P/N), where P is the parallelizable fraction of the work; the serial part caps the speedup at 1/(1 - P)
Long live parallel computing:
Gustafson’s law: SpUp(N) = (1 - P) + P*N, for a workload that scales with the number of processors N
Why parallelism is here to stay: physical limits, manycore, money
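
To see how differently the two laws behave, here is a minimal sketch (plain Python; the 95% parallel fraction and 100 processors are illustrative values, not from the slides):

```python
def amdahl(p, n):
    """Fixed problem size: the serial fraction (1 - p) bounds the speedup."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson(p, n):
    """Problem size scaled with n: speedup grows almost linearly with n."""
    return (1.0 - p) + p * n

p, n = 0.95, 100
print(f"Amdahl:    {amdahl(p, n):.1f}x")     # ~16.8x, capped at 1/(1-p) = 20x
print(f"Gustafson: {gustafson(p, n):.2f}x")  # 95.05x
```

The same hardware looks hopeless under Amdahl's fixed-size assumption and attractive under Gustafson's scaled-size one, which is the natural regime for data analysis: there is always more data available to fill more machines.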

Slide 17

Cloud Computing
Convergence of parallel computing, virtualization and service oriented architectures.
Focus on being scale-free, fault tolerant, cost effective and easy to use.
A buzzword! Underneath, a distributed system:
- Scalability
- Location, replication and failure transparency

Slide 18

Data Intensive Cloud Computing
- I/O bound problems
- Move computing near the data
- Simple, scale-agnostic programming interface (see the sketch below)
- Shared nothing architecture
- Commodity hardware
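
A minimal single-process sketch of what "scale-agnostic" means in practice (this simulates MapReduce-style semantics in plain Python; it is not any real framework's API): the user supplies only a map and a reduce function, while partitioning, distribution and fault tolerance are left to the runtime, so the same two functions apply unchanged whether the input is megabytes or terabytes.

```python
from collections import defaultdict

def map_fn(record):
    # One input record in, zero or more (key, value) pairs out.
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    # All values for one key in, final (key, result) pairs out.
    yield key, sum(values)

def run(records):
    """Single-machine stand-in for the framework: group map output by key,
    then apply reduce per key. A real runtime does the same across a cluster."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(kv for key, values in groups.items()
                for kv in reduce_fn(key, values))

print(run(["the data deluge", "the petabyte age"]))
# {'the': 2, 'data': 1, 'deluge': 1, 'petabyte': 1, 'age': 1}
```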

Slide 19

Part 2: State of the art

Slide 20

Who is involved?

Slide 21

Architecture
The layers of the software stack:
- High Level Languages
- Computation
- Data Abstraction
- Distributed Data
- Coordination

Slide 22

Software stacks

Layer                  Google      Yahoo          Microsoft           Others
High Level Languages   Sawzall     Pig/Latin      DryadLINQ, SCOPE    Hive, Cascading
Computation            MapReduce   Hadoop         Dryad
Data Abstraction       BigTable    HBase, PNUTS                       Cassandra, Voldemort
Distributed Data       GFS         HDFS           Cosmos              Dynamo
Coordination           Chubby      Zookeeper

Slide 23

Comparison with PDBMS
- CAP Theorem
- BASE vs ACID
- Computing on large data vs handling large data
- OLAP vs OLTP
- User Defined Functions vs Select-Project-Join
- Nested vs flat data model

Slide 24

Comparison with PDBMS
“MapReduce, a major step backwards” (DeWitt, Stonebraker)
"If the only tool you have is a hammer, you tend to see every problem as a nail" (Abraham Maslow)
SQL and the Relational Model are not the answer.

Slide 25

Research directions
- Computational Models: I/O cost, Computability, Functional Programming
- Paradigm Enrichments: Map-Reduce-Merge, HadoopDB
- Online Analytics: Templates, MapReduce Online

Slide 26

A new computation model for rack-based computing
Goal: I/O cost characterization
Issues: only theoretical analysis, no existing reference system
Future: best algorithms for the model, adaptation of the model to real systems
F. Afrati and J. Ullman. Unpublished.

Slide 27

A model of computation for MapReduce
Goal: theoretical computability characterization of MapReduce algorithms
Result: an algorithmic design technique for MapReduce
Future: develop algorithms in this class, find relationships with other classes
H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010.

Slide 28

Google’s MapReduce programming model - revisited
Goal: functional-style reverse engineering of MapReduce
How: top-down functional analysis
Result: simplified and rationalized MapReduce model with a runnable functional specification
R. Lämmel. In Science of Computer Programming, 2007.

Slide 29

Map-Reduce-Merge: simplified relational data processing on large clusters
Goal: implement relational operators efficiently
How: a new final phase that merges two key-value lists (see the sketch below)
Issues: very low level and hard to use, needs integration into a high level language
H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD, 2007.
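
To make the merge idea concrete, here is a minimal single-machine sketch (the semantics are paraphrased for illustration; this is not the paper's actual API) in which the outputs of two reduce phases, both sorted by key, are merged into an equi-join:

```python
def merge(left, right):
    """Sort-merge two (key, value) lists, both sorted by key with unique keys,
    emitting a joined record for each matching key (an equi-join)."""
    i, j, out = 0, 0, []
    while i < len(left) and j < len(right):
        (lk, lv), (rk, rv) = left[i], right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((lk, (lv, rv)))  # matching key: combine both sides
            i, j = i + 1, j + 1
    return out

employees = [("alice", "eng"), ("bob", "sales")]  # output of one MapReduce job
salaries = [("alice", 100), ("carol", 80)]        # output of another
print(merge(employees, salaries))                 # [('alice', ('eng', 100))]
```

Other relational operators (unions, intersections, other join variants) follow the same pattern by changing what the merge step emits.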

Slide 30

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Goal: get the advantages of both DBMS and MapReduce
How: integrate a DBMS (PostgreSQL) into Hadoop, with Hive as the interface
Issues: it would be better to reuse principles than artifacts
A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz and A. Rasin. In VLDB, 2009.

Slide 31

Interactive analysis of web-scale data
Goal: speed up general queries over big data
How: pre-computed templates filled in at run time
Future: which templates are useful for interactive analysis? Help the user formulate templates (sampling?)
C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira and B. Reed. In CIDR, 2009.

Slide 32

MapReduce Online
Goal: speed up the turnaround of MapReduce jobs
How: operator pipelining, online aggregation (see the sketch below)
Issues: limited inter-job pipelining (data only), inter-job aggregation is problematic (scratch data)
T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy and R. Sears. Technical report, University of California, Berkeley, 2009.
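
The online-aggregation half of the idea can be sketched in a few lines (illustrative only; the real system snapshots reducer state inside Hadoop, which this plain-Python stand-in does not model): rather than reporting a result only after every map task finishes, the aggregate is periodically snapshotted over the inputs seen so far, giving the user an early estimate that refines over time.

```python
import random

def online_mean(stream, snapshot_every=1000):
    """Yield (records_seen, running_mean) snapshots while consuming the stream."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        if count % snapshot_every == 0:
            yield count, total / count  # early answer on partial data

data = (random.gauss(50, 10) for _ in range(5000))
for seen, estimate in online_mean(data):
    print(f"after {seen} records: mean ~ {estimate:.2f}")
```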

Slide 33

Part 3: Proposal

Slide 34

Research Problems
- Cloud Computing is batch oriented
- High level languages push for efficient relational operators
- Not clear which algorithms and problems are best for these systems
- Research efforts are “erratic”: no common research agenda yet
- Early stage of development, more effort needed

Slide 35

Research Questions
- How to design novel algorithms for large scale data analysis?
- How to support these algorithms on cloud computing systems?
- Is it possible to carry out online data analysis on such systems?

Slide 36

Methodology
- Top down: start by studying existing algorithms and extract a representative workload
- Identify weaknesses in existing systems
- Use principles of database research to fill the gaps
- Evaluate contributions from both theoretical and experimental points of view

Slide 37

Some Ideas
- Sampling and result estimation: a good enough result is often acceptable
- Semantic clues: leverage properties of M/R functions (distributivity, associativity, commutativity); properties of the input may speed up the computation (see the sketch below)
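
One concrete instance of such a semantic clue (a hypothetical example, not from the slides): if the reduce function is associative and commutative, partial aggregates can be computed on the map side, in a combiner, so far less data crosses the network during the shuffle.

```python
from collections import defaultdict

def combine(pairs):
    """Pre-aggregate (key, count) pairs locally before the shuffle.
    Correct only because integer addition is associative and commutative."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return sorted(acc.items())

mapper_output = [("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1)]
print(combine(mapper_output))  # [('cat', 3), ('dog', 1)]: 2 pairs shuffled instead of 4
```

A reduce that is not associative and commutative (say, a median) cannot be pushed map-side this way, which is exactly why detecting these properties matters.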

Slide 38

Thesis Goals
- Build and evaluate a toolbox of algorithms for large scale data analysis on cloud computing systems
- Design extensions to existing programming paradigms in order to support these algorithms
- Develop methods to speed up these algorithms to support online processing

Slide 39

Questions?