How to survive the Data Deluge: Petabyte Scale Cloud Computing


Presentation of my Ph.D. proposal at ISTI-CNR

Transcript

  1. How to survive the Data Deluge: Petabyte scale Cloud Computing

    Gianmarco De Francisci Morales, IMT Institute for Advanced Studies Lucca, CSE PhD XXIV Cycle, 18 Jan 2010 (Monday, 18 January 2010)
  2. Outline

    • Part 1: Introduction
      • What, Why, and History
    • Part 2: Technology overview
      • Current systems and comparison
    • Part 3: Research directions
      • Ideas for future improvements
  3. Part 1: Introduction

  4. How would you sort...

    • ... 1 GB of data?
    • ... 100 GB of data?
    • ... 10 TB of data?
    • Scale matters!
    • Because More Isn't Just More, More Is Different
  5. The Petabyte Age

  6. What is scalability?

    • The ability of a system to accept increased volume without impacting profits
    • Scale-free systems
    • Scale-up vs. scale-out
    • Types of parallel architectures:
      • Shared memory, shared disk, shared nothing
  7. What if you need...

    • ... to store and analyze 10 TB of data per day?
    • Parallelism is a must, but not enough
    • Usual approaches fail at this scale because of secondary effects:
      • Operational costs
      • Faults
  8. What is fault tolerance?

    • The system operates properly in spite of the failure of some of its components
    • High availability
    • A real-world need:
      • Software has bugs
      • Hardware fails
  9. Why data?

    • The world is drowning in data: the Data Deluge
    • Data sources:
      • Web 2.0 (user-generated content)
      • Scientific experiments: physics (particle accelerators), astronomy (satellite images), biology (genomic maps)
    • Can you think of others?
  10. “Data is not information, information is not knowledge, knowledge is not wisdom.” (Clifford Stoll)
  11. DBMS evolution

    • ’60s: CODASYL
    • ’70s: relational DBMS
    • ’80s: object-oriented DBMS (back to navigation)
    • ’80s & ’90s: parallel DBMS
    • Not much has happened since the ’70s
      • The fundamental model and the code lines are still the same
  12. DBMS yesterday

    • Business transaction processing (OLTP)
    • Relational model
    • SQL
  13. DBMS today

    • Different markets (OLTP, OLAP, stream, etc.)
    • Stored procedures & user-defined functions
    • Parallel DBMSs (Teradata, Vertica, etc.)
    • Not enough flexibility
    • Limited fault tolerance and scalability
  14. Why cloud?

    • Parallel computing is dead
      • Amdahl’s law: Speedup(N) = 1 / ((1 − Pa) + Pa/N)
    • Long live parallel computing
      • Gustafson’s law: Speedup(N) = PG·N + (1 − PG)
    • Physical limits
    • Manycore
    • Money
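The contrast between the two laws can be seen numerically. A minimal sketch (the parameter names Pa and PG follow the slide; the helper functions are mine, for illustration only):

```python
def amdahl(n, pa):
    """Amdahl's law: speedup on n processors for a FIXED problem size,
    where pa is the parallelizable fraction of the work."""
    return 1.0 / ((1.0 - pa) + pa / n)

def gustafson(n, pg):
    """Gustafson's law: speedup when the problem size GROWS with n,
    where pg is the parallel fraction of the scaled workload."""
    return pg * n + (1.0 - pg)

# With 95% parallel work, Amdahl's speedup is bounded by 1/(1 - 0.95) = 20x
# no matter how many nodes we add, while Gustafson's grows with cluster size.
for n in (10, 100, 1000):
    print(f"n={n}: Amdahl {amdahl(n, 0.95):.1f}x, Gustafson {gustafson(n, 0.95):.1f}x")
```

This is the "long live parallel computing" argument in miniature: cloud workloads scale the data with the cluster, so Gustafson's regime, not Amdahl's, is the relevant one.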
  15. Parallel computing evolution

    • Parallel (single machine)
    • Cluster (intra-site)
    • Grid (inter-site)
    • Cloud (scale-free)
    • What’s next?
  16. Parallel computing yesterday

    • CPU-bound problems
    • Tightly coupled
    • Use of MPI or PVM
    • Move data among computing nodes
    • Use of NAS/SAN
      • Expensive and does not scale (shared disk)
  17. Parallel computing today

    • I/O-bound problems (often)
    • Move computation near the data
    • Focus on scalability and fault tolerance
    • Simple!
      • Shared-nothing architecture on commodity hardware
      • Data streaming
  18. Wrap-up

    • Main motivations:
      • Scalability
      • Money
    • Focus on BIG data
      • BIG = you need to stop and think because of its size
    • Common issues with parallel DBMSs (load balancing, data skew)
  19. Part 2: Technology overview

  20. What is cloud computing?

    • Did anyone notice I skipped the definition?
    • Buzzword!
    • IaaS (EC2, S3)
    • PaaS (App Engine, Azure Services Platform)
    • SaaS (Salesforce, OnLive, virtually any web app)
    • A scale-free computing architecture
  21. Who is involved?

  22. Software stacks

    Layer                 | Google    | Yahoo        | Microsoft        | Others
    ----------------------+-----------+--------------+------------------+---------------------
    High-level languages  | Sawzall   | Pig/Latin    | DryadLINQ, Scope | Hive, Cascading
    Computation           | MapReduce | Hadoop       | Dryad            |
    Data abstraction      | BigTable  | HBase, PNUTS |                  | Cassandra, Voldemort
    Distributed data      | GFS       | HDFS         | Cosmos           | CloudStore, Dynamo
    Coordination          | Chubby    | Zookeeper    |                  |
  23. Comparison with PDBMS

    • CAP theorem
    • BASE vs. ACID
    • Computing on large data vs. handling large data
    • OLAP vs. OLTP
    • User-defined functions vs. select-project-join
    • Nested vs. flat data model
  24. Comparison with PDBMS

    • Start small (no upfront schema, flexible, agile), grow big (optimize common patterns)
    • “MapReduce: a major step backwards” (DeWitt, Stonebraker)
    • “If the only tool you have is a hammer, you tend to see every problem as a nail” (Abraham Maslow)
    • SQL and the relational model are not the answer
  25. Wrap-up

    • A lot of hype, but also a lot of activity
    • Industry is leading the trend and has the cutting-edge software
    • Different approaches
      • Most focus on MapReduce
    • Shift toward higher-level abstractions
  26. Wrap-up

    • The NoSQL movement:
      • No relational model
      • No ACID
      • No joins
  27. Part 3: Research directions

  28.
    • Extensions
    • Models
    • High-velocity analytics
    • Hybrid systems
    • Optimizations
  29. Extensions

    • Map-Reduce-Merge: simplified relational data processing on large clusters. H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD, 2007.
    • Goal: implement relational operators efficiently
    • How: a new final phase that merges two key-value lists
    • Issues: very low-level and hard to use; needs integration into a high-level language
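The merge idea can be illustrated with a toy sketch: given the sorted (key, value) outputs of two MapReduce jobs, a merger walks both lists and joins records that share a key (an equi-join). This is an assumption-laden illustration of the concept, not the paper's actual interface:

```python
def merge(left, right):
    """Toy sketch of a Map-Reduce-Merge final phase: an equi-join of two
    key-sorted (key, value) lists produced by two reduce stages.
    Illustrative only; not the API from the Yang et al. paper."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        k1, v1 = left[i]
        k2, _ = right[j]
        if k1 < k2:
            i += 1
        elif k1 > k2:
            j += 1
        else:
            # Emit a joined record for every right-side value with this key.
            jj = j
            while jj < len(right) and right[jj][0] == k1:
                out.append((k1, (v1, right[jj][1])))
                jj += 1
            i += 1
    return out

# e.g. joining employees with departments on a department id:
emp = [(1, "alice"), (1, "bob"), (2, "carol")]
dept = [(1, "sales"), (3, "hr")]
joined = merge(emp, dept)
```

The point of the slide stands out even in the toy: expressing a join this way is tedious and low-level, which is why integration into a higher-level language is needed.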
  30. Models

    • A new computation model for rack-based computing. F. Afrati and J. Ullman. Unpublished.
    • Goal: I/O cost characterization
    • Issues: only a theoretical analysis; no existing reference system
    • Future: best algorithms for the model; adaptation of the model to real systems
  31. Models

    • A model of computation for MapReduce. H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010.
    • Goal: theoretical computability characterization of MapReduce algorithms
    • Result: an algorithmic design technique for MapReduce
    • Future: develop algorithms in this class; find relationships with other classes
  32. High-velocity analytics

    • Interactive analysis of web-scale data. C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, and B. Reed. In CIDR, 2009.
    • Goal: speed up general queries over big data
    • How: pre-computed templates filled in at run time
    • Future: which templates are useful for interactive use? Help the user formulate templates (sampling?)
  33. High-velocity analytics

    • MapReduce online. T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009.
    • Goal: speed up the turnaround of MapReduce jobs
    • How: operator pipelining, online aggregation
    • Issues: limited inter-job pipelining (data only); inter-job aggregation is problematic (scratch data)
  34. Hybrid systems

    • HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. In VLDB, 2009.
    • Goal: the advantages of both DBs and MapReduce
    • How: integrate a DBMS (PostgreSQL) into Hadoop, with Hive as the interface
    • Issues: better to reuse principles than technology
  35. Optimizations

    • The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. J. Lin. In LSDS-IR, 2009.
    • Goal: study data-distribution effects on MapReduce, with parallel queries/pairwise similarity as a case study
    • How: balance the input data (split long posting lists)
    • Issues: very specific to the problem/algorithm
  36. Other ideas

    • Sampling and result estimation
      • A good-enough result is often acceptable
    • Semantic clues
      • Leverage properties of the map/reduce functions (associativity, commutativity)
      • Properties of the input may speed up the computation
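To make the associativity/commutativity point concrete: when a reduce function is associative and commutative (like the sum in word count), a framework can apply it as a local combiner before the shuffle, cutting network traffic. A minimal in-memory sketch, with function names of my own choosing rather than any framework's API:

```python
from collections import Counter

def map_phase(doc):
    # One (word, 1) pair per occurrence, as in classic word count.
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # Local pre-aggregation on each node: valid only because addition is
    # associative and commutative, so partial sums can be summed again later.
    c = Counter()
    for k, v in pairs:
        c[k] += v
    return list(c.items())

def reduce_phase(partials):
    # Merge the per-node partial counts into the final result.
    total = Counter()
    for part in partials:
        for k, v in part:
            total[k] += v
    return dict(total)

docs = ["to be or not to be", "be here now"]
counts = reduce_phase(combine(map_phase(d)) for d in docs)
```

Without the combiner, every single (word, 1) pair would cross the network; with it, each node ships at most one pair per distinct word, which is exactly the kind of semantic clue the slide suggests exploiting.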
  37. Wrap-up

    • A new and active field
    • Many opportunities for research
    • At the crossroads of distributed systems and databases
    • Answer the plea not to “reinvent the wheel”
  38. How to survive the Data Deluge: Petabyte scale Cloud Computing

    • Integrate DB principles into cloud systems
    • Enable interactive and approximate analytics
    • Evolve beyond the MapReduce paradigm
  39. Questions?