
Making sense of big data

Outline of the presentation given at the webmardi of February 7, 2012:

Life without MySQL: Manipulating data sets that break the single machine barrier
Understanding the role of Hadoop & Map/Reduce
Handling large amounts of storage and computing instances
Applications: Logs, Open Data, Streams

Pierre-Yves Ritschard

February 11, 2012

Transcript

  1. Big Data? • what do we mean by big? • why is it becoming an issue now? • a shift in internet usage: from consumption to sharing • cost of storage Saturday, February 11, 12
  2. Interesting Numbers • 400% growth of generated data per year, 5% of global IT spending growth • 4 billion pieces of content shared every day • 1 billion tweets sent out every 3 days • 1 petabyte processed by Google every 72 minutes
  3. Challenges • Extracting meaningful results from such large amounts of data • Presenting relevant data to users • It’s not a new problem • Remember life before Google Analytics? (and what it would mean to do the same on your own)
  4. Traditional Approaches • Huge Systems • Oracle BI • Still bound to a single processing unit • Huge upfront costs: the value of the result must be much higher than the cost of producing it
  5. Who needs a new approach • Web applications • Logs • Event Processing • Open Data Applications • Everywhere cost of infrastructure matters
  6. The Google Way • Search is the canonical big data use case • Needed to use off-the-shelf hardware • A tool suite: • Storage: GFS • Database: Bigtable • Processing: Map/Reduce
  7. The Open Source Ecosystem • Not settled yet • Hadoop: Backed by Yahoo • A lot of other players
  8. Category Theory: what we do with data • Simple Operations • counting • Relational Algebra • Projection: “SELECT a,b,c” • Selection: “WHERE a > 1” • Joins: “LEFT JOIN on”
  9. A focus on map/reduce • One of the more established big data tools • Primarily for batch jobs, analytics • Addresses category theory on huge amounts of data • Doesn’t solve every problem
  10. Map/Reduce principles • Breaks category theory into map and reduce operations • Map phase: operations that can run in parallel: projection, selection • Operations in the reduce phase: joins, counts • Just like cooking!
  11. Practical Overview (I) • http://xkcd.com/1007 • we can feed off of Google’s ngram data • simple projection and selection • let’s see some code
  12. Quick Poll: Next Time? • Infrastructure automation • Event Streaming with Apache Kafka • The CAP Theorem and Cassandra • Cluster ordering with ZooKeeper • Hadoop workshop
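
The code shown during the talk is not part of this transcript. As a rough illustration of slides 10 and 11, here is a minimal single-process sketch in Python of a map/reduce job over ngram-style records: the map phase does selection and projection, and the reduce phase does the counting. Everything in it is an assumption made for illustration: the sample rows are made up, the tab-separated `ngram`, `year`, `match_count` layout is a simplification, and the names `map_row`, `shuffle`, `reduce_counts` and `run_job` are hypothetical, not Hadoop APIs.

```python
from collections import defaultdict
from itertools import chain

# Made-up rows in an assumed tab-separated layout: ngram, year, match_count.
SAMPLE_ROWS = [
    "hadoop\t2007\t12",
    "hadoop\t2008\t40",
    "mysql\t1999\t7",
    "mysql\t2008\t55",
]


def map_row(line, min_year=2000):
    """Map phase: each row is handled independently, so rows can be split across machines."""
    ngram, year, count = line.split("\t")
    if int(year) >= min_year:      # selection: WHERE year >= min_year
        yield ngram, int(count)    # projection: SELECT ngram, match_count


def shuffle(pairs):
    """Group mapped values by key, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: aggregate all values that share a key."""
    return key, sum(values)


def run_job(rows, min_year=2000):
    """Run map, shuffle and reduce in a single process (no cluster involved)."""
    mapped = chain.from_iterable(map_row(row, min_year) for row in rows)
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(SAMPLE_ROWS))
```

On a real cluster the framework would run the map function on many machines at once and route each key to a reducer; the in-memory `shuffle` here only stands in for that step.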