
Making sense of big data

The outline of the presentation given at the webmardi of February 7th, 2012:

Life without MySQL: Manipulating data sets that break the single machine barrier
Understanding the role of Hadoop & Map/Reduce
Handling large amounts of storage and computing instances
Applications: Logs, Open Data, Streams

Pierre-Yves Ritschard

February 11, 2012


Transcript

  1. Big Data?
     • what do we mean by big?
     • why is it becoming an issue now?
     • a shift in internet usage: from consumption to sharing
     • cost of storage
  2. Interesting Numbers
     • 400% growth of generated data per year, 5% of global IT spending growth
     • 4 billion pieces of content shared every day
     • 1 billion tweets sent out every 3 days
     • 1 petabyte processed by Google every 72 minutes
  3. Challenges
     • Extracting meaningful results from such large amounts of data
     • Presenting relevant data to users
     • It’s not a new problem
     • Remember life before Google Analytics? (and what it would mean to do the same on your own)
  4. Traditional Approaches
     • Huge Systems
     • Oracle BI
     • Still bound to a single processing unit
     • Huge upfront costs: the value of the result must be much higher than the cost of producing it
  5. Who needs a new approach?
     • Web applications
     • Logs
     • Event Processing
     • Open Data Applications
     • Everywhere the cost of infrastructure matters
  6. The Google Way
     • Search is the canonical big data use case
     • Needed to use off-the-shelf hardware
     • A tool suite:
       • Storage: GFS
       • Database: Bigtable
       • Processing: Map/Reduce
  7. The Open Source Ecosystem
     • Not settled yet
     • Hadoop: Backed by Yahoo
     • A lot of other players
  8. Category Theory: what we do with data
     • Simple Operations
       • counting
     • Relational Algebra (see the sketch below)
       • Projection: “SELECT a,b,c”
       • Selection: “WHERE a > 1”
       • Joins: “LEFT JOIN … ON”
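A minimal sketch (not from the deck) of what these relational-algebra operations look like over plain Python dictionaries; the column names a, b, c mirror the SQL fragments on the slide:

    rows = [
        {"a": 1, "b": "x", "c": 10},
        {"a": 2, "b": "y", "c": 20},
        {"a": 3, "b": "z", "c": 30},
    ]

    # Projection: "SELECT a,b" -- keep only some columns of every row
    def project(rows, cols):
        return [{k: r[k] for k in cols} for r in rows]

    # Selection: "WHERE a > 1" -- keep only the rows matching a predicate
    def select(rows, predicate):
        return [r for r in rows if predicate(r)]

    # Join: "LEFT JOIN ... ON key" -- merge rows that share a key
    def left_join(left, right, key):
        index = {r[key]: r for r in right}
        return [{**l, **index.get(l[key], {})} for l in left]

    print(select(project(rows, ["a", "b"]), lambda r: r["a"] > 1))
    # [{'a': 2, 'b': 'y'}, {'a': 3, 'b': 'z'}]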
  9. A focus on map/reduce
     • One of the more established big data tools
     • Primarily for batch jobs, analytics
     • Addresses category theory on huge amounts of data
     • Doesn’t solve every problem
  10. Map/Reduce principles
     • Breaks category theory into map and reduce operations
     • Map phase: operations that can run in parallel: projection, selection
     • Operations in the reduce phase: joins, counts
     • Just like cooking! (see the sketch below)
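A minimal illustration (not the deck's code) of how the operations above split into two phases: the map side does per-record selection and projection, the reduce side sees all values for one key and can count them, and the shuffle/sort step in between is what a framework like Hadoop provides:

    from itertools import groupby
    from operator import itemgetter

    records = [
        {"a": 1, "b": "x"}, {"a": 2, "b": "y"},
        {"a": 3, "b": "y"}, {"a": 0, "b": "x"},
    ]

    def mapper(record):
        # Map phase: runs independently on each record, so it parallelizes.
        if record["a"] > 0:            # selection: WHERE a > 0
            yield (record["b"], 1)     # projection: keep only column b

    def reducer(key, values):
        # Reduce phase: all values for a given key arrive together.
        yield (key, sum(values))

    # The shuffle/sort a framework would perform between the two phases:
    pairs = sorted(kv for r in records for kv in mapper(r))
    for key, group in groupby(pairs, key=itemgetter(0)):
        for result in reducer(key, (v for _, v in group)):
            print(result)              # ('x', 1) then ('y', 2)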
  11. Practical Overview (I)
     • http://xkcd.com/1007
     • we can feed off of Google’s ngram data
     • simple projection and selection
     • let’s see some code
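The demo itself is not in the transcript; below is a hedged sketch of the kind of Hadoop Streaming mapper the slide alludes to. It assumes the public Google Books ngram dumps in tab-separated form (ngram, year, match_count, volume_count); adjust the field indexes if your copy differs. The reduce phase would be a per-key sum, as in the previous sketch.

    #!/usr/bin/env python
    # mapper.py -- projection and selection over ngram lines read from stdin,
    # emitting tab-separated (ngram, match_count) pairs for the reducer.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                   # skip malformed lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        # Selection: only years from 1950 on; projection: drop the other columns.
        if year.isdigit() and int(year) >= 1950:
            print("%s\t%s" % (ngram, match_count))

Locally this can be smoke-tested with a plain pipe (cat sample.tsv | python mapper.py | sort), and the same script can be handed to the hadoop-streaming jar via its -mapper option.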
  12. Quick Poll: Next Time?
     • Infrastructure automation
     • Event Streaming with Apache Kafka
     • The CAP Theorem and Cassandra
     • Cluster ordering with ZooKeeper
     • Hadoop workshop