
Making sense of big data

The outline of the presentation given at the webmardi of February 7th, 2012:

Life without MySQL: Manipulating data sets that break the single machine barrier
Understanding the role of Hadoop & Map/Reduce
Handling large amounts of storage and computing instances
Applications: Logs, Open Data, Streams

Pierre-Yves Ritschard

February 11, 2012


Transcript

  1. Big Data?
     • what do we mean by big?
     • why is it becoming an issue now?
     • a shift in internet usage: from consumption to sharing
     • cost of storage
  2. Interesting Numbers
     • 400% growth of generated data per year, 5% of global IT spending growth
     • 4 billion pieces of content shared every day
     • 1 billion tweets sent out every 3 days
     • 1 petabyte processed by Google every 72 minutes
  3. Challenges
     • Extracting meaningful results from such large amounts of data
     • Presenting relevant data to users
     • It’s not a new problem
     • Remember life before Google Analytics? (and what it would mean to do the same on your own)
  4. Traditional Approaches
     • Huge Systems
     • Oracle BI
     • Still bound to a single processing unit
     • Huge upfront costs: the value of the result must be much higher than the cost of producing it
  5. Who needs a new approach?
     • Web applications
     • Logs
     • Event Processing
     • Open Data Applications
     • Everywhere the cost of infrastructure matters
  6. The Google Way
     • Search is the canonical big data use case
     • Needed to use off-the-shelf hardware
     • A tool suite:
       • Storage: GFS
       • Database: Bigtable
       • Processing: Map/Reduce
  7. The Open Source Ecosystem
     • Not settled yet
     • Hadoop: Backed by Yahoo
     • A lot of other players
  8. Category Theory: what we do with data
     • Simple Operations
       • counting
     • Relational Algebra (see the sketch below)
       • Projection: “SELECT a,b,c”
       • Selection: “WHERE a > 1”
       • Joins: “LEFT JOIN … ON”
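A minimal sketch (not from the deck) of what these relational-algebra operations look like over plain Python dictionaries; the column names a, b, c mirror the SQL fragments on the slide:

    rows = [
        {"a": 1, "b": "x", "c": 10},
        {"a": 2, "b": "y", "c": 20},
        {"a": 3, "b": "z", "c": 30},
    ]

    # Projection: "SELECT a,b" -- keep only some columns of every row
    def project(rows, cols):
        return [{k: r[k] for k in cols} for r in rows]

    # Selection: "WHERE a > 1" -- keep only the rows matching a predicate
    def select(rows, predicate):
        return [r for r in rows if predicate(r)]

    # Join: "LEFT JOIN ... ON key" -- merge rows that share a key
    def left_join(left, right, key):
        index = {r[key]: r for r in right}
        return [{**l, **index.get(l[key], {})} for l in left]

    print(select(project(rows, ["a", "b"]), lambda r: r["a"] > 1))
    # [{'a': 2, 'b': 'y'}, {'a': 3, 'b': 'z'}]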
  9. A focus on map/reduce
     • One of the more established big data tools
     • Primarily for batch jobs, analytics
     • Addresses category theory on huge amounts of data
     • Doesn’t solve every problem
  10. Map/Reduce principles
     • Breaks category theory into map and reduce operations
     • Map phase: operations that can run in parallel: projection, selection
     • Operations in the reduce phase: joins, counts
     • Just like cooking! (see the sketch below)
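A minimal illustration (not the deck's code) of how the operations above split into two phases: the map side does per-record selection and projection, the reduce side sees all values for one key and can count them, and the shuffle/sort step in between is what a framework like Hadoop provides:

    from itertools import groupby
    from operator import itemgetter

    records = [
        {"a": 1, "b": "x"}, {"a": 2, "b": "y"},
        {"a": 3, "b": "y"}, {"a": 0, "b": "x"},
    ]

    def mapper(record):
        # Map phase: runs independently on each record, so it parallelizes.
        if record["a"] > 0:            # selection: WHERE a > 0
            yield (record["b"], 1)     # projection: keep only column b

    def reducer(key, values):
        # Reduce phase: all values for a given key arrive together.
        yield (key, sum(values))

    # The shuffle/sort a framework would perform between the two phases:
    pairs = sorted(kv for r in records for kv in mapper(r))
    for key, group in groupby(pairs, key=itemgetter(0)):
        for result in reducer(key, (v for _, v in group)):
            print(result)              # ('x', 1) then ('y', 2)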
  11. Practical Overview (I)
     • http://xkcd.com/1007
     • we can feed off of Google’s ngram data
     • simple projection and selection
     • let’s see some code
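The demo itself is not in the transcript; below is a hedged sketch of the kind of Hadoop Streaming mapper the slide alludes to. It assumes the public Google Books ngram dumps in tab-separated form (ngram, year, match_count, volume_count); adjust the field indexes if your copy differs. The reduce phase would be a per-key sum, as in the previous sketch.

    #!/usr/bin/env python
    # mapper.py -- projection and selection over ngram lines read from stdin,
    # emitting tab-separated (ngram, match_count) pairs for the reducer.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                   # skip malformed lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        # Selection: only years from 1950 on; projection: drop the other columns.
        if year.isdigit() and int(year) >= 1950:
            print("%s\t%s" % (ngram, match_count))

Locally this can be smoke-tested with a plain pipe (cat sample.tsv | python mapper.py | sort), and the same script can be handed to the hadoop-streaming jar via its -mapper option.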
  12. Quick Poll: Next Time?
     • Infrastructure automation
     • Event Streaming with Apache Kafka
     • The CAP Theorem and Cassandra
     • Cluster ordering with ZooKeeper
     • Hadoop workshop