How to survive the Data Deluge: Cloud computing for large scale data analysis

Gianmarco De Francisci Morales
IMT Institute for Advanced Studies Lucca, CSE PhD XXIV Cycle
Monday, 8 March 2010

Outline
Part 1: Introduction. What, why and history.
Part 2: State of the art. Current technologies and research.
Part 3: Proposal. Ideas for future improvements.

Part 1: Introduction

How would you sort...
... 1GB of data?
... 100GB of data?
... 10TB of data?
Scale matters! Because More Isn't Just More, More Is Different

The Petabyte Age

The Data Deluge
The world is drowning in data:
Web 2.0 (user generated content)
Scientific experiments: physics (particle accelerators), astronomy (satellite images), biology (genomic maps)
Sensors (GPS, RFID)

“Data is not information, information is not knowledge, knowledge is not wisdom.”
Clifford Stoll

The “Big Data” problem
“Data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time” (Jacobs, CACM 2009)
Requirements for a large scale data analysis system:
Scalability (scale free)
Cost effectiveness (autonomic)
Fault tolerance (highly available)

Methodology evolution
DBMS are the most common tool for data analysis:
'60s: CODASYL
'70s: Relational DBMS
'80s and '90s: Parallel DBMS
Not much has happened since the '70s: the fundamental model and code are still the same.

Relational DBMS evolution
Yesterday: relational model, OLTP and SQL
Today: different markets (OLTP, OLAP, stream, etc.); stored procedures and user defined functions; high performance requirements

Parallel DBMS
A solution for performance problems
Scale-out on shared nothing architectures using dataflow operators and horizontal partitioning
Problems: not enough flexibility and ease of use; limited fault tolerance and scalability
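
Horizontal partitioning on a shared nothing architecture can be illustrated with a minimal sketch. The names used here (`NUM_NODES`, `node_for`, `partition_rows`, the sample rows) are hypothetical and do not come from the talk, and hash partitioning is only one possible scheme, assumed for illustration.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size, chosen only for the example


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a partitioning key to a node index with a stable hash."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_rows(rows, key_field, num_nodes=NUM_NODES):
    """Split rows horizontally: every row goes, whole, to exactly one node."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[node_for(row[key_field], num_nodes)].append(row)
    return partitions


def parallel_count(partitions):
    """Each node counts its local rows; the partial counts are then summed."""
    return sum(len(local_rows) for local_rows in partitions.values())


rows = [{"user": f"user{i}", "clicks": i % 7} for i in range(1000)]
partitions = partition_rows(rows, "user")
total = parallel_count(partitions)
```

Because no row is shared between nodes, the per-node work needs no coordination until the final sum, which is the property that lets such systems scale out.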

Is parallel wrong?
Parallel computing is dead. Amdahl's law: SpUp(N) = 1 / ((1-Pa) + Pa/N)
Long live parallel computing. Gustafson's law: SpUp(N) = PG*N + (1-PG)
Physical limits
Manycore
Money
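
The two laws can be compared numerically with a short sketch. It reads Pa and PG as the parallelizable fraction of the work and N as the number of processors; the function names and the sample fraction of 0.95 are assumptions made for the example.

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Amdahl's law: fixed problem size, speedup is capped by the serial part."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def gustafson_speedup(parallel_fraction: float, n: int) -> float:
    """Gustafson's law: the problem grows with N, so speedup scales almost linearly."""
    return parallel_fraction * n + (1.0 - parallel_fraction)


# Hypothetical workload in which 95% of the work can run in parallel.
P = 0.95
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(P, n), 2), round(gustafson_speedup(P, n), 2))
```

Under Amdahl's law the speedup can never exceed 1 / (1 - P), which is 20 for P = 0.95, no matter how many processors are added. Under Gustafson's law it keeps growing with N, because the data set is assumed to grow with the machine. That second assumption is the one that matches large scale data analysis.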

Cloud Computing
Convergence of parallel computing, virtualization and service oriented architectures
Focus on being scale-free, fault tolerant, cost effective and easy to use
Buzzword!
Distributed system: scalability; location, replication and failure transparency

Data Intensive Cloud Computing
I/O bound problems
Move computing near data
Simple, scale-agnostic programming interface
Shared nothing architecture
Commodity hardware

Part 2: State of the art

Who is involved?

Architecture (diagram of layers): High Level Languages, Computation, Data Abstraction, Distributed Data, Coordination

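Software stacks (by layer; columns: Google, Yahoo, Microsoft, Others)
High Level Languages: Sawzall (Google); Pig/Latin (Yahoo); DryadLINQ, SCOPE (Microsoft); Hive, Cascading (Others)
Computation: MapReduce (Google); Hadoop (Yahoo); Dryad (Microsoft)
Data Abstraction: BigTable (Google); HBase, PNUTS (Yahoo); Cassandra, Voldemort (Others)
Distributed Data: GFS (Google); HDFS (Yahoo); Cosmos (Microsoft); Dynamo (Others)
Coordination: Chubby (Google); Zookeeper (Yahoo)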

Comparison with PDBMS
CAP Theorem
BASE vs ACID
Computing on large data vs handling large data
OLAP vs OLTP
User defined functions vs select-project-join
Nested vs flat data model

Comparison with PDBMS
MapReduce, a major step backwards (DeWitt, Stonebraker)
"If the only tool you have is a hammer, you tend to see every problem as a nail" (Abraham Maslow)
SQL and Relational Model are not the answer

Research directions
Computational models: I/O cost, computability, functional programming
Paradigm enrichments: Map-Reduce-Merge, HadoopDB
Online analytics: templates, MapReduce Online

A new computation model for rack-based computing
Goal: I/O cost characterization
Issues: only theoretical analysis; no existing reference system
Future: best algorithms for the model; model adaptation to real systems
F. Afrati and J. Ullman. Unpublished

A model of computation for MapReduce
Goal: theoretical computability characterization of MapReduce algorithms
Result: algorithmic design technique for MapReduce
Future: develop algorithms in this class; find relationships with other classes
H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010

Google's MapReduce programming model - revisited
Goal: functional style reverse engineering of MapReduce
How: top down functional analysis
Result: simplified and rationalized MapReduce model with runnable functional specification
R. Lämmel. In Science of Computer Programming, 2007
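
To make the functional reading of MapReduce concrete, here is a minimal single-process sketch. It is not the specification from the paper: `map_reduce`, `word_count_mapper` and `word_count_reducer` are hypothetical names, and the grouping step stands in for the distributed shuffle of a real system.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    # Map phase: each record may emit any number of (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    return {key: reducer(key, values) for key, values in intermediate.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)


lines = ["more is different", "more data more problems"]
counts = map_reduce(lines, word_count_mapper, word_count_reducer)
```

The user supplies only the two functions. Since records are mapped independently and keys are reduced independently, the same program can in principle run on one machine or on thousands.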

Map-Reduce-Merge: simplified relational data processing on large clusters
Goal: implement relational operators efficiently
How: new final phase that merges 2 key-value lists
Issues: very low level and hard to use; needs integration into a high level language
H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD, 2007
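
The idea of a final phase that merges two key-value lists can be sketched as a sort-merge join. This is an illustration under simplifying assumptions (both lists sorted by key, at most one value per key, as after a reduce step), not the interface defined in the paper. `merge_join` and the sample data are hypothetical.

```python
def merge_join(left, right):
    """Join two key-sorted lists of (key, value) pairs on equal keys.

    Assumes keys are unique within each list, as they are after a reduce step.
    """
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            joined.append((left_key, (left_value, right_value)))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # key only present on the left: skipped (inner join)
        else:
            j += 1  # key only present on the right: skipped (inner join)
    return joined


# Two reduced outputs from separate, hypothetical MapReduce jobs.
headcount_by_dept = sorted({"sales": 3, "eng": 5, "ops": 2}.items())
budget_by_dept = sorted({"eng": 100, "sales": 40, "legal": 10}.items())

joined = merge_join(headcount_by_dept, budget_by_dept)
```

Each list is scanned once, so the merge costs time linear in the combined size of the two inputs. This is what makes a relational join practical as an extra phase after two reduce outputs.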

HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Goal: advantages of both DB and MapReduce
How: integrate a DBMS (PostgreSQL) in Hadoop, with Hive as interface
Issues: it is better to reuse principles than artifacts
A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin. In VLDB, 2009

Interactive analysis of web-scale data
Goal: speed up general queries for big data
How: pre-computed templates to fill at run-time
Future: which templates are useful for interactive analysis? How to help the user formulate templates (sampling?)
C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, B. Reed. In CIDR, 2009

MapReduce Online
Goal: speed up turnaround of MapReduce jobs
How: operator pipelining, online aggregation
Issues: limited inter-job pipelining (data only); inter-job aggregation problematic (scratch data)
T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009
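
Online aggregation in general means reporting a running estimate that tightens as more input is consumed. The sketch below shows that idea for a mean, using Welford's incremental update. It is not the MapReduce Online implementation. `online_mean` and `report_every` are hypothetical names, and the error bound is only meaningful if rows arrive in random order.

```python
import math
import random


def online_mean(stream, report_every=1000):
    """Yield (rows_seen, running_mean, half_width) every `report_every` rows.

    half_width is an approximate 95% confidence bound, valid only under the
    assumption that the stream is in random order.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean (Welford)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            variance = m2 / (n - 1) if n > 1 else 0.0
            half_width = 1.96 * math.sqrt(variance / n)
            yield n, mean, half_width


random.seed(0)  # fixed seed so the synthetic data is reproducible
data = [random.uniform(0, 100) for _ in range(10_000)]
estimates = list(online_mean(data))
```

A user watching `estimates` sees a usable answer after the first thousand rows and can stop the job as soon as the bound is tight enough, instead of waiting for the full batch to finish.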

Part 3: Proposal

Research Problems
Cloud Computing is batch oriented
High level languages push for efficient relational operators
Not clear which algorithms and problems are best for these systems
Research efforts are “erratic”, no common research agenda yet
Early stage of development, more effort needed

Research Questions
How to design novel algorithms for large scale data analysis?
How to support these algorithms on cloud computing systems?
Is it possible to carry out online data analysis on such systems?

Methodology
Top down: start by studying existing algorithms, extract a representative workload
Identify weaknesses in existing systems
Use principles of database research to fill the gaps
Evaluate contributions from both a theoretical and an experimental point of view

Some Ideas
Sampling and result estimation: a good enough result is often acceptable
Semantic clues: leverage properties of M/R functions (distributivity, associativity, commutativity); properties of the input may speed up the computation
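
One way the algebraic properties of a reduce function can be exploited is partial aggregation: if the function is associative and commutative, each partition can be reduced locally before the partial results are combined. The sketch below assumes this reading of the slide. `reduce_with_combiner` and `add_pairs` are hypothetical names.

```python
import operator
from functools import reduce


def reduce_with_combiner(partitions, op):
    """Reduce each partition locally, then reduce the partial results.

    Gives the same answer as a single global reduce only when `op` is
    associative and commutative.
    """
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


values = list(range(1, 101))
partitions = [values[i::4] for i in range(4)]  # four hypothetical nodes

# Sum is associative and commutative, so local pre-aggregation is safe.
total_direct = reduce(operator.add, values)
total_combined = reduce_with_combiner(partitions, operator.add)


def add_pairs(a, b):
    """Combine two (sum, count) pairs; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


# A mean cannot be combined directly, but (sum, count) pairs can.
pair_partitions = [[(v, 1) for v in part] for part in partitions]
total, count = reduce_with_combiner(pair_partitions, add_pairs)
mean = total / count
```

Only one partial result per partition has to cross the network instead of every value. Functions without these properties, such as a plain average, first need to be rewritten over a richer intermediate value, as the (sum, count) pairs do here.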

Thesis Goals
Build and evaluate a toolbox of algorithms for large scale data analysis on cloud computing systems
Design extensions to existing programming paradigms in order to support these algorithms
Develop methods to speed up these algorithms to support online processing

Questions?
