How to survive the Data Deluge: Petabyte Scale Cloud Computing


Presentation of my Ph.D. proposal at ISTI-CNR

Transcript

  1. How to survive the Data Deluge: Petabyte scale Cloud Computing

    Gianmarco De Francisci Morales, IMT Institute for Advanced Studies Lucca, CSE PhD XXIV Cycle, 18 Jan 2010 (Monday, 18 January 2010)
  2. Outline

    • Part 1: Introduction
      • What, Why, and History
    • Part 2: Technology overview
      • Current systems and comparison
    • Part 3: Research directions
      • Ideas for future improvements
  3. Part 1: Introduction

  4. How would you sort...

    • ... 1 GB of data?
    • ... 100 GB of data?
    • ... 10 TB of data?
    • Scale matters!
    • Because More Isn't Just More, More Is Different
  5. The Petabyte Age

  6. What is scalability?

    • The ability of a system to accept increased volume without impacting profits
    • Scale-free systems
    • Scale-up vs. scale-out
    • Types of parallel architectures:
      • Shared memory, shared disk, shared nothing
  7. What if you need...

    • ... to store and analyze 10 TB of data per day?
    • Parallelism is a must, but not enough
    • Usual approaches fail at this scale because of secondary effects:
      • Operational costs
      • Faults
  8. What is fault tolerance?

    • The system operates properly in spite of the failure of some of its components
    • High availability
    • A real-world need:
      • Software has bugs
      • Hardware fails
  9. Why data?

    • The world is drowning in data: the Data Deluge
    • Data sources:
      • Web 2.0 (user-generated content)
      • Scientific experiments: physics (particle accelerators), astronomy (satellite images), biology (genomic maps)
    • Can you think of others?
  10. “Data is not information, information is not knowledge, knowledge is not wisdom.” (Clifford Stoll)
  11. DBMS evolution

    • ’60s: CODASYL
    • ’70s: relational DBMS
    • ’80s: object-oriented DBMS (back to navigation)
    • ’80s & ’90s: parallel DBMS
    • Not much has happened since the ’70s
      • The fundamental model and the code lines are still the same
  12. DBMS yesterday

    • Business transaction processing (OLTP)
    • Relational model
    • SQL
  13. DBMS today

    • Different markets (OLTP, OLAP, stream, etc.)
    • Stored procedures & user-defined functions
    • Parallel DBMSs (Teradata, Vertica, etc.)
    • Not enough flexibility
    • Limited fault tolerance and scalability
  14. Why cloud?

    • Parallel computing is dead
      • Amdahl’s law: Speedup(N) = 1 / ((1 − Pa) + Pa/N)
    • Long live parallel computing
      • Gustafson’s law: Speedup(N) = PG·N + (1 − PG)
    • Physical limits
    • Manycore
    • Money
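The contrast between the two laws can be seen numerically. A minimal sketch (the parameter names Pa and PG follow the slide; the helper functions are mine, for illustration only):

```python
def amdahl(n, pa):
    """Amdahl's law: speedup on n processors for a FIXED problem size,
    where pa is the parallelizable fraction of the work."""
    return 1.0 / ((1.0 - pa) + pa / n)

def gustafson(n, pg):
    """Gustafson's law: speedup when the problem size GROWS with n,
    where pg is the parallel fraction of the scaled workload."""
    return pg * n + (1.0 - pg)

# With 95% parallel work, Amdahl's speedup is bounded by 1/(1 - 0.95) = 20x
# no matter how many nodes we add, while Gustafson's grows with cluster size.
for n in (10, 100, 1000):
    print(f"n={n}: Amdahl {amdahl(n, 0.95):.1f}x, Gustafson {gustafson(n, 0.95):.1f}x")
```

This is the "long live parallel computing" argument in miniature: cloud workloads scale the data with the cluster, so Gustafson's regime, not Amdahl's, is the relevant one.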
  15. Parallel computing evolution

    • Parallel (single machine)
    • Cluster (intra-site)
    • Grid (inter-site)
    • Cloud (scale-free)
    • What’s next?
  16. Parallel computing yesterday

    • CPU-bound problems
    • Tightly coupled
    • Use of MPI or PVM
    • Move data among computing nodes
    • Use of NAS/SAN
      • Expensive and does not scale (shared disk)
  17. Parallel computing today

    • I/O-bound problems (often)
    • Move computation near the data
    • Focus on scalability and fault tolerance
    • Simple!
      • Shared-nothing architecture on commodity hardware
      • Data streaming
  18. Wrap-up

    • Main motivations:
      • Scalability
      • Money
    • Focus on BIG data
      • BIG = you need to stop and think because of its size
    • Common issues with parallel DBMSs (load balancing, data skew)
  19. Part 2: Technology overview

  20. What is cloud computing?

    • Did anyone notice I skipped the definition?
    • Buzzword!
    • IaaS (EC2, S3)
    • PaaS (App Engine, Azure Services Platform)
    • SaaS (Salesforce, OnLive, virtually any web app)
    • A scale-free computing architecture
  21. Who is involved?

  22. Software stacks

    Layer                 | Google    | Yahoo        | Microsoft        | Others
    ----------------------+-----------+--------------+------------------+---------------------
    High-level languages  | Sawzall   | Pig/Latin    | DryadLINQ, Scope | Hive, Cascading
    Computation           | MapReduce | Hadoop       | Dryad            |
    Data abstraction      | BigTable  | HBase, PNUTS |                  | Cassandra, Voldemort
    Distributed data      | GFS       | HDFS         | Cosmos           | CloudStore, Dynamo
    Coordination          | Chubby    | Zookeeper    |                  |
  23. Comparison with PDBMS

    • CAP theorem
    • BASE vs. ACID
    • Computing on large data vs. handling large data
    • OLAP vs. OLTP
    • User-defined functions vs. select-project-join
    • Nested vs. flat data model
  24. Comparison with PDBMS

    • Start small (no upfront schema, flexible, agile), grow big (optimize common patterns)
    • “MapReduce: a major step backwards” (DeWitt, Stonebraker)
    • “If the only tool you have is a hammer, you tend to see every problem as a nail” (Abraham Maslow)
    • SQL and the relational model are not the answer
  25. Wrap-up

    • A lot of hype, but also a lot of activity
    • Industry is leading the trend and has the cutting-edge software
    • Different approaches
      • Most focus on MapReduce
    • Shift toward higher-level abstractions
  26. Wrap-up

    • The NoSQL movement:
      • No relational model
      • No ACID
      • No joins
  27. Part 3: Research directions

  28.
    • Extensions
    • Models
    • High-velocity analytics
    • Hybrid systems
    • Optimizations
  29. Extensions

    • Map-Reduce-Merge: simplified relational data processing on large clusters. H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD, 2007.
    • Goal: implement relational operators efficiently
    • How: a new final phase that merges two key-value lists
    • Issues: very low-level and hard to use; needs integration into a high-level language
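The merge idea can be illustrated with a toy sketch: given the sorted (key, value) outputs of two MapReduce jobs, a merger walks both lists and joins records that share a key (an equi-join). This is an assumption-laden illustration of the concept, not the paper's actual interface:

```python
def merge(left, right):
    """Toy sketch of a Map-Reduce-Merge final phase: an equi-join of two
    key-sorted (key, value) lists produced by two reduce stages.
    Illustrative only; not the API from the Yang et al. paper."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        k1, v1 = left[i]
        k2, _ = right[j]
        if k1 < k2:
            i += 1
        elif k1 > k2:
            j += 1
        else:
            # Emit a joined record for every right-side value with this key.
            jj = j
            while jj < len(right) and right[jj][0] == k1:
                out.append((k1, (v1, right[jj][1])))
                jj += 1
            i += 1
    return out

# e.g. joining employees with departments on a department id:
emp = [(1, "alice"), (1, "bob"), (2, "carol")]
dept = [(1, "sales"), (3, "hr")]
joined = merge(emp, dept)
```

The point of the slide stands out even in the toy: expressing a join this way is tedious and low-level, which is why integration into a higher-level language is needed.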
  30. Models

    • A new computation model for rack-based computing. F. Afrati and J. Ullman. Unpublished.
    • Goal: I/O cost characterization
    • Issues: only a theoretical analysis; no existing reference system
    • Future: best algorithms for the model; adaptation of the model to real systems
  31. Models

    • A model of computation for MapReduce. H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010.
    • Goal: theoretical computability characterization of MapReduce algorithms
    • Result: an algorithmic design technique for MapReduce
    • Future: develop algorithms in this class; find relationships with other classes
  32. High-velocity analytics

    • Interactive analysis of web-scale data. C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, and B. Reed. In CIDR, 2009.
    • Goal: speed up general queries over big data
    • How: pre-computed templates filled in at run time
    • Future: which templates are useful for interactive use? Help the user formulate templates (sampling?)
  33. High-velocity analytics

    • MapReduce online. T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009.
    • Goal: speed up the turnaround of MapReduce jobs
    • How: operator pipelining, online aggregation
    • Issues: limited inter-job pipelining (data only); inter-job aggregation is problematic (scratch data)
  34. Hybrid systems

    • HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. In VLDB, 2009.
    • Goal: the advantages of both DBs and MapReduce
    • How: integrate a DBMS (PostgreSQL) into Hadoop, with Hive as the interface
    • Issues: better to reuse principles than technology
  35. Optimizations

    • The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. J. Lin. In LSDS-IR, 2009.
    • Goal: study data-distribution effects on MapReduce, with parallel queries/pairwise similarity as a case study
    • How: balance the input data (split long posting lists)
    • Issues: very specific to the problem/algorithm
  36. Other ideas

    • Sampling and result estimation
      • A good-enough result is often acceptable
    • Semantic clues
      • Leverage properties of the map/reduce functions (associativity, commutativity)
      • Properties of the input may speed up the computation
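To make the associativity/commutativity point concrete: when a reduce function is associative and commutative (like the sum in word count), a framework can apply it as a local combiner before the shuffle, cutting network traffic. A minimal in-memory sketch, with function names of my own choosing rather than any framework's API:

```python
from collections import Counter

def map_phase(doc):
    # One (word, 1) pair per occurrence, as in classic word count.
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # Local pre-aggregation on each node: valid only because addition is
    # associative and commutative, so partial sums can be summed again later.
    c = Counter()
    for k, v in pairs:
        c[k] += v
    return list(c.items())

def reduce_phase(partials):
    # Merge the per-node partial counts into the final result.
    total = Counter()
    for part in partials:
        for k, v in part:
            total[k] += v
    return dict(total)

docs = ["to be or not to be", "be here now"]
counts = reduce_phase(combine(map_phase(d)) for d in docs)
```

Without the combiner, every single (word, 1) pair would cross the network; with it, each node ships at most one pair per distinct word, which is exactly the kind of semantic clue the slide suggests exploiting.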
  37. Wrap-up

    • A new and active field
    • Many opportunities for research
    • At the crossroads of distributed systems and databases
    • Answer the plea not to “reinvent the wheel”
  38. How to survive the Data Deluge: Petabyte scale Cloud Computing

    • Integrate DB principles into cloud systems
    • Enable interactive and approximate analytics
    • Evolve beyond the MapReduce paradigm
  39. Questions?