Slide 1

Slide 1 text

Making Sense of Big Data
Saturday, February 11, 12

Slide 2

Slide 2 text

Big Data?
• what do we mean by big?
• why is it becoming an issue now?
• a shift in internet usage: from consumption to sharing
• cost of storage

Slide 3

Slide 3 text

Interesting Numbers
• 400% growth of generated data per year, 5% of global IT spending growth
• 4 billion pieces of content shared every day
• 1 billion tweets sent out every 3 days
• 1 petabyte processed by Google every 72 minutes
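To put these figures on a common scale, they can be converted to per-second rates. This is only a back-of-the-envelope sketch: it assumes a decimal petabyte (10^15 bytes) and perfectly steady traffic, neither of which the slide states.

```python
# Rough per-second rates derived from the figures on the slide.
# Assumptions (not from the slide): 1 PB = 10**15 bytes, steady traffic.
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_second = 1_000_000_000 / (3 * SECONDS_PER_DAY)
shares_per_second = 4_000_000_000 / SECONDS_PER_DAY
bytes_per_second = 10**15 / (72 * 60)

print(f"tweets: about {tweets_per_second:,.0f} per second")
print(f"shared content: about {shares_per_second:,.0f} pieces per second")
print(f"Google processing: about {bytes_per_second / 1e9:,.1f} GB per second")
```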

Slide 4

Slide 4 text

Challenges
• Extracting meaningful results from such large amounts of data
• Presenting relevant data to users
• It’s not a new problem
• Remember life before Google Analytics? (and what it would mean to do the same on your own)

Slide 5

Slide 5 text

Traditional Approaches
• Huge Systems
• Oracle BI
• Still bound to a single processing unit
• Huge upfront costs: the value of the result must be much higher than the cost of producing it

Slide 6

Slide 6 text

Who needs a new approach?
• Web applications
• Logs
• Event Processing
• Open Data Applications
• Everywhere the cost of infrastructure matters

Slide 7

Slide 7 text

The Google Way
• Search is the canonical big data use case
• Needed to use off-the-shelf hardware
• A tool suite:
  • Storage: GFS
  • Database: Bigtable
  • Processing: Map/Reduce

Slide 8

Slide 8 text

The Open Source Ecosystem
• Not settled yet
• Hadoop: Backed by Yahoo
• A lot of other players

Slide 9

Slide 9 text

Category Theory: what we do with data
• Simple Operations
  • counting
• Relational Algebra
  • Projection: “SELECT a,b,c”
  • Selection: “WHERE a > 1”
  • Joins: “LEFT JOIN on”
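The operations named on this slide can be sketched over plain Python dictionaries. This is an illustration only: the rows, the `labels` table, and the helper names `project`, `select`, and `left_join` are hypothetical and not taken from the talk.

```python
# Hypothetical in-memory tables; column names are illustrative only.
rows = [
    {"a": 1, "b": "x", "c": 10, "d": True},
    {"a": 2, "b": "y", "c": 20, "d": False},
    {"a": 3, "b": "z", "c": 30, "d": True},
]
labels = [{"a": 2, "label": "two"}]


def project(table, columns):
    """Projection: keep only the named columns (SELECT a,b,c)."""
    return [{col: row[col] for col in columns} for row in table]


def select(table, predicate):
    """Selection: keep only rows matching the predicate (WHERE a > 1)."""
    return [row for row in table if predicate(row)]


def left_join(left, right, key):
    """Left join: every left row is kept; missing right columns become None."""
    right_columns = {col for row in right for col in row} - {key}
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row[key])
        if matches:
            joined.extend({**row, **match} for match in matches)
        else:
            joined.append({**row, **{col: None for col in right_columns}})
    return joined


count = len(rows)                               # counting
projected = project(rows, ["a", "b", "c"])      # SELECT a,b,c
selected = select(rows, lambda r: r["a"] > 1)   # WHERE a > 1
joined = left_join(selected, labels, "a")       # LEFT JOIN on a
```

Projection and selection look at one row at a time, while counting and joining need to see rows together. That distinction is what decides where each operation can run once the data no longer fits on one machine.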

Slide 10

Slide 10 text

A focus on map/reduce
• One of the more established big data tools
• Primarily for batch jobs, analytics
• Addresses category theory on huge amounts of data
• Doesn’t solve every problem

Slide 11

Slide 11 text

Map/Reduce principles
• Breaks category theory into map and reduce operations
• Map phase: operations that can run in parallel: projection, selection
• Operations in the reduce phase: joins, counts
• Just like cooking!
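The split between the two phases can be sketched in a single Python process. This is a toy model and not Hadoop's API: the function names and sample lines are made up, and a real framework would distribute the map calls and the grouping step across machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record independently; calls share no state,
    so in a real cluster they could run in parallel."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine all values that share a key into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    for word in line.lower().split():
        if len(word) > 2:      # selection: drop very short words
            yield word, 1      # projection: keep only the word and a count of 1


def count_reducer(key, values):
    return sum(values)         # counting happens in the reduce phase


lines = ["big data is big", "data about data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts)
```

The mapper never needs to know about other lines, which is why projection and selection parallelize easily. The reducer only runs after every value for a key has been gathered, which is why counts and joins belong there.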

Slide 12

Slide 12 text

Practical Overview (I)
• http://xkcd.com/1007
• we can feed off of Google’s ngram data
• simple projection and selection
• let’s see some code
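A minimal sketch of such a projection and selection over ngram-style records follows. It assumes tab-separated lines whose first three fields are the ngram, the year, and a match count. The sample lines and the chosen word are invented for illustration and are not real ngram data.

```python
import io

# Made-up sample lines in the assumed layout:
# ngram <TAB> year <TAB> match_count <TAB> further counts...
SAMPLE = (
    "sustainable\t1990\t120\t80\t40\n"
    "sustained\t1990\t900\t500\t200\n"
    "sustainable\t2000\t450\t300\t150\n"
)


def yearly_counts(lines, word):
    """Yield (year, match_count) for every line whose ngram equals `word`."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if ngram.lower() == word:                # selection
            yield int(year), int(match_count)    # projection


result = list(yearly_counts(io.StringIO(SAMPLE), "sustainable"))
print(result)
```

Because each line is handled on its own, this function is the kind of work that fits in a map phase. Summing the counts per year across many files would be the reduce step.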

Slide 13

Slide 13 text

Practical Overview (II)

Slide 14

Slide 14 text

Quick Poll: Next Time?
• Infrastructure automation
• Event Streaming with Apache Kafka
• The CAP Theorem and Cassandra
• Cluster ordering with ZooKeeper
• Hadoop workshop