How to survive the Data Deluge: Cloud computing for large scale data analysis

Presentation of my Ph.D. proposal at Yahoo! Research

Transcript

  1. How to survive the Data Deluge: Cloud computing for large scale data analysis

    Gianmarco De Francisci Morales
    IMT Institute for Advanced Studies Lucca
    CSE PhD XXIV Cycle
    8 Mar 2010
    Monday 8 March 2010
  2. Outline

    Part 1: Introduction (What, Why and History)
    Part 2: State of the art (Current technologies and research)
    Part 3: Proposal (Ideas for future improvements)
  3. How would you sort...

    ... 1GB of data?
    ... 100GB of data?
  4. How would you sort...

    ... 1GB of data?
    ... 100GB of data?
    ... 10TB of data?
  5. How would you sort...

    ... 1GB of data?
    ... 100GB of data?
    ... 10TB of data?
    Scale matters!
    Because More Isn't Just More, More Is Different
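One common answer once the data no longer fits in memory is an external merge sort: sort memory-sized chunks, spill each sorted run to disk, then merge the runs. The slide does not name a technique, so this is only an illustrative sketch; `external_sort` and `chunk_size` are hypothetical names, and the input is assumed to be a stream of integers.

```python
import heapq
import tempfile
from itertools import islice


def external_sort(values, chunk_size):
    """Yield the integers in `values` in sorted order, holding at most
    `chunk_size` of them in memory while building each sorted run."""
    runs = []
    stream = iter(values)
    while True:
        # Phase 1: sort one memory-sized chunk and spill it to a temp file.
        chunk = sorted(islice(stream, chunk_size))
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{value}\n" for value in chunk)
        run.seek(0)
        runs.append(run)
    try:
        # Phase 2: k-way merge of the sorted runs, read back line by line.
        readers = [(int(line) for line in run) for run in runs]
        yield from heapq.merge(*readers)
    finally:
        for run in runs:
            run.close()
```

At the 10TB end of the question the same two phases are typically spread over many machines rather than many files on one disk, which is where the rest of the talk picks up.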
  6. The Data Deluge

    The world is drowning in data
    Web 2.0 (user generated content)
    Scientific experiments: Physics (particle accelerators), Astronomy (satellite images), Biology (genomic maps)
    Sensors (GPS, RFID)
  7. “Data is not information, information is not knowledge, knowledge is not wisdom.”

    Clifford Stoll
  8. The “Big Data” problem

    “Data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time” (Jacobs, CACM 2009)
    Requirements for a large scale data analysis system:
    Scalability (scale free)
    Cost effectiveness (autonomic)
    Fault tolerance (highly available)
  9. Methodology evolution

    DBMS are the most common tool for data analysis
    ‘60s: CODASYL
    ‘70s: Relational DBMS
    ‘80s & ‘90s: Parallel DBMS
    Not much has happened since the ‘70s
    The fundamental model and code are still the same
  10. Relational DBMS evolution

    Yesterday: Relational model & OLTP & SQL
    Today:
    Different markets (OLTP, OLAP, Stream, etc.)
    Stored Procedures & User Defined Functions
    High performance requirements
  11. Parallel DBMS

    A solution for performance problems
    Scale-out on shared nothing architectures using dataflow operators and horizontal partitioning
    Problems:
    Not enough flexibility and ease of use
    Limited fault-tolerance and scalability
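A minimal sketch of horizontal partitioning on a shared-nothing layout, assuming hash partitioning on a key column (the slide does not say which partitioning scheme is meant). `partition_rows` and `parallel_sum` are hypothetical helpers, and plain lists stand in for nodes.

```python
import zlib


def partition_rows(rows, key, num_nodes):
    """Assign each row to one node by hashing its partitioning key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        node = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions


def parallel_sum(partitions, column):
    """Each node aggregates only its own rows; a coordinator adds the partials."""
    partial_sums = [sum(row[column] for row in part) for part in partitions]
    return sum(partial_sums)
```

Because no row lives on two nodes, each partial sum can be computed without any communication, and only the small partial results travel to the coordinator.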
  12. Is parallel wrong?

    Parallel computing is dead
    Amdahl’s law: SpUp(N) = 1 / ((1 - Pa) + Pa/N)
    Long live parallel computing
    Gustafson’s law: SpUp(N) = PG*N + (1 - PG)
    Physical limits
    Manycore
    Money
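The two formulas on this slide can be compared numerically. The sketch below reads Pa and PG as the fraction of the work that can run in parallel, which is the usual interpretation but is not spelled out on the slide; the function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: fixed problem size, so the serial part bounds the speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def gustafson_speedup(parallel_fraction, n):
    """Gustafson's law: the problem grows with n, so speedup scales almost linearly."""
    return parallel_fraction * n + (1.0 - parallel_fraction)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2), round(gustafson_speedup(0.95, n), 2))
```

With 95% parallel work, Amdahl's speedup can never exceed 1 / 0.05 = 20 however many processors are added, while Gustafson's estimate keeps growing with N because the workload is assumed to grow too.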
  13. Cloud Computing

    Convergence of parallel computing, virtualization and service oriented architectures
    Focus on being scale-free, fault tolerant, cost effective and easy to use
    Buzzword!
    Distributed system
    Scalability
    Location, replication and failure transparency
  14. Data Intensive Cloud Computing

    I/O bound problems
    Move computing near data
    Simple, scale-agnostic programming interface
    Shared nothing architecture
    Commodity hardware
  15. Software stacks

    High Level Languages: Sawzall (Google); Pig/Latin (Yahoo); DryadLINQ, SCOPE (Microsoft); Hive, Cascading (Others)
    Computation: MapReduce (Google); Hadoop (Yahoo); Dryad (Microsoft)
    Data Abstraction: BigTable (Google); HBase, PNUTS (Yahoo); Cassandra, Voldemort (Others)
    Distributed Data: GFS (Google); HDFS (Yahoo); Cosmos (Microsoft); Dynamo (Others)
    Coordination: Chubby (Google); Zookeeper (Yahoo)
  16. Comparison with PDBMS

    CAP Theorem
    BASE vs ACID
    Computing on large data vs Handling large data
    OLAP vs OLTP
    User Defined Functions vs Select-Project-Join
    Nested vs Flat data model
  17. Comparison with PDBMS

    “MapReduce, a major step backwards” (DeWitt, Stonebraker)
    “If the only tool you have is a hammer, you tend to see every problem as a nail” (Abraham Maslow)
    SQL and Relational Model are not the answer
  18. Research directions

    Computational Models: I/O cost, Computability, Functional Programming
    Paradigm Enrichments: MapReduceMerge, HadoopDB
    Online Analytics: Templates, MapReduce Online
  19. A new computation model for rack-based computing

    Goal: I/O cost characterization
    Issues: only theoretical analysis, no existing reference system
    Future: best algorithms for the model, model adaptation to real systems
    F. Afrati and J. Ullman. Unpublished
  20. A model of computation for MapReduce

    Goal: theoretical computability characterization of MapReduce algorithms
    Result: algorithmic design technique for MapReduce
    Future: develop algorithms in this class, find relationships with other classes
    H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010
  21. Google’s MapReduce programming model - revisited

    Goal: functional style reverse engineering of MapReduce
    How: top down functional analysis
    Result: simplified and rationalized MapReduce model with runnable functional specification
    R. Lämmel. In Science of Computer Programming, 2007
  22. Map-Reduce-Merge: Simplified relational data processing on large clusters

    Goal: implement relational operators efficiently
    How: new final phase that merges 2 key-value lists
    Issues: very low level and hard to use, needs integration into a high level language
    H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD 2007
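The idea of a final phase that merges two key-value lists can be sketched as a sort-merge join over the outputs of two independent map/reduce passes. This is a single-process illustration under stated assumptions, not the paper's API: `map_reduce` and `merge` are hypothetical helpers, and each reduced list is assumed to hold one value per key, sorted by key.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory map/reduce pass; returns (key, value) pairs sorted by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return sorted((key, reducer(key, values)) for key, values in groups.items())


def merge(left, right, merger):
    """Merge two key-sorted (key, value) lists, emitting one pair per shared key."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            merged.append((left_key, merger(left_value, right_value)))
            i += 1
            j += 1
    return merged


# Two datasets keyed by department: sales amounts and staff names.
sales = [("toys", 10), ("books", 5), ("toys", 7)]
staff = [("toys", "ann"), ("books", "bob"), ("toys", "cy"), ("games", "dee")]

totals = map_reduce(sales, lambda r: [(r[0], r[1])], lambda k, vs: sum(vs))
headcount = map_reduce(staff, lambda r: [(r[0], 1)], lambda k, vs: sum(vs))
joined = merge(totals, headcount, lambda total, heads: (total, heads))
```

Here `joined` pairs each department's sales total with its headcount, which is the kind of equi-join that is awkward to express with map and reduce alone. The `merger` callback also shows why the slide calls the approach low level: the user writes the join logic by hand.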
  23. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads

    Goal: advantages of both DB and MapReduce
    How: integrate a DBMS (PostgreSQL) in Hadoop, Hive as interface
    Issues: it would be better to reuse principles than artifacts
    A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin. In VLDB, 2009
  24. Interactive analysis of web-scale data

    Goal: speed up general queries for big data
    How: pre-computed templates to fill at run-time
    Future: which templates are useful for interactive analysis? How to help the user formulate templates (sampling?)
    C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, B. Reed. In CIDR, 2009
  25. MapReduce Online

    Goal: speed up turnaround of MapReduce jobs
    How: operator pipelining, online aggregation
    Issues:
    limited inter-job pipelining (data only)
    inter-job aggregation problematic (scratch data)
    T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. Technical report, University of California, Berkeley, 2009
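Online aggregation can be illustrated with a running estimate that is refined as more input arrives. The sketch below is a simplification and not the system's actual mechanism: it assumes the total record count is known up front and that chunks arrive in roughly random order, so the partial sum can be scaled up; `online_sum` is a hypothetical name.

```python
def online_sum(chunks, total_records):
    """Yield (progress, estimated_total) after each processed chunk."""
    seen = 0
    partial = 0
    for chunk in chunks:
        if not chunk:
            continue  # nothing new to fold into the estimate
        partial += sum(chunk)
        seen += len(chunk)
        # Scale the partial sum by the inverse of the fraction processed so far.
        estimate = partial * total_records / seen
        yield seen / total_records, estimate


# Three chunks of a six-record input; the estimate converges to the exact sum.
snapshots = list(online_sum([[2, 4], [6, 8], [10, 12]], total_records=6))
```

Each snapshot is available long before the job finishes, and the last one equals the exact answer. If the chunk order is correlated with the values, as in this toy input, early estimates are biased, which is one reason estimation quality matters for this line of work.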
  26. Research Problems

    Cloud Computing is batch oriented
    High level languages push for efficient relational operators
    Not clear which algorithms and problems are best for these systems
    Research efforts are “erratic”, no common research agenda yet
    Early stage of development, more effort needed
  27. Research Questions

    How to design novel algorithms for large scale data analysis?
    How to support these algorithms on cloud computing systems?
    Is it possible to carry out online data analysis on such systems?
  28. Methodology

    Top down: start by studying existing algorithms, extract a representative workload
    Identify weaknesses in existing systems
    Use principles of database research to fill the gaps
    Evaluate contributions from both theoretical and experimental points of view
  29. Some Ideas

    Sampling and result estimation: a good enough result is often acceptable
    Semantic clues:
    Leverage properties of M/R functions (distributivity, associativity, commutativity)
    Properties of the input may speed up the computation
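One well-known way to exploit associativity and commutativity of a reduce function is local pre-aggregation, often called a combiner. The slide does not say this is the intended use, so the sketch below is an assumption; word counting and the names `map_with_combiner` and `reduce_counts` are illustrative only.

```python
from collections import Counter
from functools import reduce


def map_with_combiner(lines):
    """Map one input split and pre-aggregate its counts locally."""
    local = Counter()
    for line in lines:
        for word in line.split():
            local[word] += 1
    return local


def reduce_counts(partials):
    """Add up partial counts; the order and grouping of additions do not matter."""
    return reduce(lambda left, right: left + right, partials, Counter())


splits = [["a b a"], ["b c"], ["a"]]
partials = [map_with_combiner(split) for split in splits]
forward = reduce_counts(partials)
backward = reduce_counts(reversed(partials))
```

Because addition is associative and commutative, `forward` and `backward` are identical, so each mapper can ship one small `Counter` instead of one pair per word occurrence. A function without these properties, such as a median, could not be pre-aggregated this way.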
  30. Thesis Goals

    Build and evaluate a toolbox of algorithms for large scale data analysis on cloud computing systems
    Design extensions to existing programming paradigms in order to support these algorithms
    Develop methods to speed up these algorithms to support online processing