How to survive the Data Deluge: Cloud computing for large scale data analysis

Presentation of my Ph.D. proposal at Yahoo! Research

Transcript

  1. How to survive the
    Data Deluge:
    Cloud computing for
    large scale data analysis
    Gianmarco De Francisci Morales
    IMT Institute for Advanced Studies Lucca
    CSE PhD XXIV Cycle
    8 Mar 2010

  2. Outline
    Part 1: Introduction
    What, Why and History
    Part 2: State of the art
    Current technologies and research
    Part 3: Proposal
    Ideas for future improvements

  3. Part 1
    Introduction

  4. How would you sort...

  5. How would you sort...
    ... 1GB of data?

  6. How would you sort...
    ... 1GB of data?
    ... 100GB of data?

  7. How would you sort...
    ... 1GB of data?
    ... 100GB of data?
    ... 10TB of data?

  8. How would you sort...
    ... 1GB of data?
    ... 100GB of data?
    ... 10TB of data?
    Scale matters!
    Because More Isn't Just More,
    More Is Different
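To make "more is different" concrete: at 1GB the data fits in memory and a plain in-memory sort suffices; at 100GB a single machine has to spill sorted runs to disk and merge them; at 10TB the sort must be spread over a cluster. The following is a minimal single-machine sketch of the middle case (an external merge sort), written for this transcript rather than taken from the deck.

```python
# Minimal sketch (not from the deck): external merge sort for data that does
# not fit in memory. Lines are assumed to be newline-terminated strings.
import heapq
import tempfile

def _spill(sorted_lines):
    """Write one sorted run to a temporary file and rewind it."""
    run = tempfile.TemporaryFile("w+")
    run.writelines(sorted_lines)
    run.seek(0)
    return run

def external_sort(lines, max_in_memory=1_000_000):
    """Sort an iterable of lines using bounded memory."""
    runs, buffer = [], []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_in_memory:
            runs.append(_spill(sorted(buffer)))
            buffer = []
    if buffer:
        runs.append(_spill(sorted(buffer)))
    return heapq.merge(*runs)   # lazy k-way merge of the sorted runs

# Usage (file name is illustrative):
# for line in external_sort(open("huge_file.txt")): ...
```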

  9. The Petabyte Age

  10. The Data Deluge
    The world is drowning in data
    Web 2.0 (user generated content)
    Scientific experiments
    Physics (particle accelerators)
    Astronomy (satellite images)
    Biology (genomic maps)
    Sensors (GPS, RFID)

  11. “Data is not information, information is not
    knowledge, knowledge is not wisdom.”
    Clifford Stoll

  12. The “Big Data” problem
    “Data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time”
    Jacobs, CACM 2009
    Requirements for a large scale data analysis system
    Scalability (scale free)
    Cost effectiveness (autonomic)
    Fault tolerance (highly available)

  13. Methodology evolution
    DBMS are the most common tool for data analysis
    ‘60s CODASYL
    ‘70s Relational DBMS
    ‘80s & ‘90s Parallel DBMS
    Not much has happened since the ‘70s
    The fundamental model and code are still the same

  14. Relational DBMS evolution
    Yesterday:
    Relational model & OLTP & SQL
    Today:
    Different markets (OLTP, OLAP, Stream, etc.)
    Stored Procedures & User Defined Functions
    High performance requirements

  15. Parallel DBMS
    A solution for performance problems
    Scale-out on shared nothing architectures using
    dataflow operators and horizontal partitioning
    Problems:
    Not enough flexibility and ease of use
    Limited fault-tolerance and scalability
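As a toy illustration of the horizontal partitioning mentioned above (my sketch; the node count and sample rows are made up): each tuple is routed to one shared-nothing node by hashing its key, so every node holds a slice of the table and can run dataflow operators on it locally.

```python
# Toy sketch of hash-based horizontal partitioning across shared-nothing nodes.
# NUM_NODES and the sample rows are illustrative only.
NUM_NODES = 4

def node_for(key):
    """Route a tuple to a node by hashing its partitioning key."""
    return hash(key) % NUM_NODES

partitions = {n: [] for n in range(NUM_NODES)}
for row in [("alice", 10), ("bob", 25), ("carol", 7), ("dave", 31)]:
    partitions[node_for(row[0])].append(row)

# Each node now scans, filters and partially aggregates only its own slice;
# a dataflow layer above combines the per-node results.
```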

  16. Is parallel wrong?
    Parallel computing is dead
    Amdahl’s law: SpUp(N) = 1 / ((1 - P_A) + P_A/N)
    Long live parallel computing
    Gustafson’s law: SpUp(N) = P_G*N + (1 - P_G)
    Physical limits
    Manycore
    Money
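A worked example (my numbers, not from the slide) shows why the two laws point in opposite directions for a 95% parallel workload on 100 processors:

```python
# Contrast Amdahl's and Gustafson's laws for a 95% parallel workload, N = 100.
P, N = 0.95, 100

amdahl = 1 / ((1 - P) + P / N)   # fixed problem size: the serial 5% dominates
gustafson = P * N + (1 - P)      # problem size grows with N: speedup stays near N

print(f"Amdahl:    {amdahl:.2f}x")     # ~16.8x
print(f"Gustafson: {gustafson:.2f}x")  # ~95.1x
```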

  17. Cloud Computing
    Convergence of parallel computing, virtualization and
    service oriented architectures
    Focus on being scale-free, fault tolerant,
    cost effective and easy to use
    Buzzword!
    Distributed system
    Scalability
    Location, replication and failure transparency

  18. Data Intensive
    Cloud Computing
    I/O bound problems
    Move computing near data
    Simple, scale-agnostic programming interface
    Shared nothing architecture
    Commodity hardware
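A minimal sketch of what "simple, scale-agnostic programming interface" means in practice (the local driver below is only a stand-in for a real framework such as MapReduce): the programmer writes a map and a reduce function, and the same two functions run unchanged whether they are scheduled on one machine or on thousands of commodity nodes.

```python
# Sketch of a scale-agnostic interface: user code is just map_fn() and
# reduce_fn(); run_local() is a toy stand-in for the distributed framework.
from collections import defaultdict

def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1                      # emit (word, 1) for every occurrence

def reduce_fn(word, counts):
    yield word, sum(counts)                # total occurrences of one word

def run_local(records):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    for k in groups:
        yield from reduce_fn(k, groups[k])

print(dict(run_local([(1, "the data deluge"), (2, "the petabyte age")])))
# {'the': 2, 'data': 1, 'deluge': 1, 'petabyte': 1, 'age': 1}
```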

  19. Part 2
    State of the art

  20. Who is involved?

  21. Architecture
    [Layer diagram: High Level Languages, Computation, Data Abstraction, Distributed Data, Coordination]

  22. Software stacks
                           Google      Yahoo          Microsoft          Others
    High Level Languages   Sawzall     Pig Latin      DryadLINQ, SCOPE   Hive, Cascading
    Computation            MapReduce   Hadoop         Dryad
    Data Abstraction       BigTable    HBase, PNUTS                      Cassandra, Voldemort
    Distributed Data       GFS         HDFS           Cosmos             Dynamo
    Coordination           Chubby      Zookeeper

  23. Comparison with PDBMS
    CAP Theorem
    BASE vs ACID
    Computing on large data vs Handling large data
    OLAP vs OLTP
    User Defined Functions vs Select-Project-Join
    Nested vs Flat data model

  24. Comparison with PDBMS
    MapReduce: a major step backwards (DeWitt, Stonebraker)
    "If the only tool you have is a hammer,
    you tend to see every problem as a nail"
    Abraham Maslow
    SQL and Relational Model are not the answer

  25. Research directions
    Computational Models: I/O cost, Computability, Functional
    Programming Paradigm Enrichments: Map-Reduce-Merge, HadoopDB
    Online Analytics: Templates, MapReduce Online

  26. A new computation model
    for rack-based computing
    Goal: I/O cost characterization
    Issues: only theoretical analysis
    no existing reference system
    Future: best algorithms for the model,
    model adaptation to real systems
    F. Afrati and J. Ullman. Unpublished

  27. A model of computation for
    MapReduce
    Goal: theoretical computability characterization
    of MapReduce algorithms
    Result: algorithmic design technique for
    MapReduce
    Future: develop algorithms in this class,
    find relationships with other classes
    H. Karloff, S. Suri, and S. Vassilvitskii. In SODA, 2010

  28. Google’s MapReduce
    programming model - revisited
    Goal: functional style reverse engineering of
    MapReduce
    How: top down functional analysis
    Result: simplified and rationalized MapReduce
    model with runnable functional specification
    R. Lämmel. In Science of Computer Programming, 2007
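As a rough rendering of the functional view this paper develops (the code and names are mine; the paper's specification is in Haskell and far more careful), MapReduce can be read as "map per record, then group by key, then reduce per key":

```python
# Simplified functional reading of MapReduce: mapPerRecord, groupByKey,
# reducePerKey. An informal sketch, not the paper's specification.
from itertools import groupby

def map_reduce(m, r, records):
    intermediate = [kv for rec in records for kv in m(rec)]        # map per record
    by_key = groupby(sorted(intermediate, key=lambda kv: kv[0]),   # group by key
                     key=lambda kv: kv[0])
    return [r(k, [v for _, v in group]) for k, group in by_key]    # reduce per key

print(map_reduce(lambda rec: [(w, 1) for w in rec.split()],
                 lambda k, vs: (k, sum(vs)),
                 ["a rose is a rose"]))
# [('a', 2), ('is', 1), ('rose', 2)]
```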

  29. Map-Reduce-Merge
    Simplified relational data processing on large clusters
    Goal: implement relational operators efficiently
    How: new final phase that merges 2 key-value lists
    Issues: very low level and hard to use,
    needs integration into a high level language
    H. Yang, A. Dasdan, R. Hsiao, and D. Parker. In SIGMOD, 2007
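A heavily simplified sketch of the idea behind the extra Merge phase (the real Map-Reduce-Merge API is much lower level and more configurable): given the key/value outputs of two separate reduce phases, combining them on equal keys is enough to express a relational equi-join.

```python
# Simplified sketch of a Merge phase joining the outputs of two reduce
# phases on equal keys (an equi-join). Sample data is illustrative.
def merge(reduced_a, reduced_b):
    index = {}
    for k, v in reduced_b:                 # index one side by key
        index.setdefault(k, []).append(v)
    for k, va in reduced_a:                # probe with the other side
        for vb in index.get(k, []):
            yield k, (va, vb)

employees = [(1, "alice"), (2, "bob")]             # key = department id
departments = [(1, "search"), (2, "storage")]
print(list(merge(employees, departments)))
# [(1, ('alice', 'search')), (2, ('bob', 'storage'))]
```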

  30. HadoopDB
    An architectural hybrid of MapReduce and DBMS
    technologies for analytical workloads.
    Goal: advantages of both DB and MapReduce
    How: integrate a DBMS (PostgreSQL) in Hadoop,
    Hive as interface
    Issues: reusing principles would be better than reusing artifacts
    A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin.
    In VLDB, 2009

  31. Interactive analysis of
    web-scale data
    Goal: speed up general queries for big data
    How: pre-computed templates to fill at run-time
    Future: which templates are useful for interactive analysis?
    help the user to formulate templates (sampling?)
    C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, B. Reed. In CIDR, 2009

  32. MapReduce Online
    Goal: speed up turnaround of MapReduce jobs
    How: operator pipelining, online aggregation
    Issues: limited inter-job pipelining (data only)
    inter-job aggregation problematic (scratch data)
    T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears.
    Technical report, University of California, Berkeley, 2009
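To illustrate the online-aggregation side of this work (a generic sketch, not the paper's mechanism): rather than reporting only after all input has been consumed, the job periodically publishes an estimate based on the fraction of the data seen so far.

```python
# Generic sketch of online aggregation: publish early estimates while the
# data is still streaming in. The snapshot interval is arbitrary.
def online_mean(stream, snapshot_every=2500):
    total, count = 0.0, 0
    for x in stream:
        total += x
        count += 1
        if count % snapshot_every == 0:
            yield count, total / count     # early, approximate answer
    if count % snapshot_every:
        yield count, total / count         # final answer, if not just emitted

for seen, estimate in online_mean(range(10_000)):
    print(f"after {seen} records: mean = {estimate:.1f}")
```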

  33. Part 3
    Proposal

  34. Research Problems
    Cloud Computing is batch oriented
    High level languages push for efficient relational
    operators
    Not clear which algorithms and problems are best for
    these systems
    Research efforts are “erratic”, no common research
    agenda yet
    Early stage of development, more effort needed

  35. Research Questions
    How to design novel algorithms for large scale data
    analysis?
    How to support these algorithms on cloud computing
    systems?
    Is it possible to carry out online data analysis on such
    systems?

  36. Methodology
    Top down: start by studying existing algorithms,
    extract a representative workload
    Identify weaknesses in existing systems
    Use principles of database research to fill the gaps
    Evaluate contributions from both theoretical and
    experimental point of view

  37. Some Ideas
    Sampling and result estimation
    A good enough result is often acceptable
    Semantic clues
    Leverage properties of M/R functions
    (distributivity, associativity, commutativity)
    Properties of the input may speed up the
    computation
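One way to act on those semantic clues (a sketch under my own assumptions, using word count as the workload): when the reduce function is associative and commutative, each mapper can pre-aggregate its output with a combiner, cutting down the data shuffled across the network.

```python
# Sketch of exploiting associativity/commutativity: a combiner pre-aggregates
# on the map side, so only one (word, partial_count) pair per distinct word
# crosses the network instead of one pair per occurrence.
from collections import Counter

def map_with_combiner(text):
    return Counter(text.split())            # local partial counts

def final_reduce(partials):
    total = Counter()
    for partial in partials:
        total.update(partial)                # valid because + is associative and commutative
    return total

shards = ["big data big clusters", "big data deluge"]
print(final_reduce(map_with_combiner(s) for s in shards))
# Counter({'big': 3, 'data': 2, 'clusters': 1, 'deluge': 1})
```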

  38. Thesis Goals
    Build and evaluate a toolbox of algorithms for large
    scale data analysis on cloud computing systems
    Design extensions to existing programming paradigms
    in order to support these algorithms
    Develop methods to speed up these algorithms to
    support online processing

  39. Questions?