The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

The Artful Business of Data Mining Distributed Schema-less Document-Based Databases
Wednesday 24 April 13

David Coallier @davidcoallier

Data Scientist At Engine Yard (.com)

RDBMs

Structure Restrictions Safety

id   name   age   address
1    david  1     315
2    divad  3     51
3    foo    41    31
4    bar    42    98
5    john   3315  85
6    jack   4     11
7    jill   8     66
...  ...    ...   ...

What If?

id   name   age  address  phone
1    david  26   IE       353
2    divad  27   US       1
3    foo    42   IE       353
4    bar    31   CA       1
5    john   17   NZ       131
6    jack   128  DK       311
7    jill   21   IE       353
...  ...    ...  ...      ...

Before Moving on

JSON

What is JSON?

{
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": {
    "streetAddress": "Mansfield House",
    "city": "Crosshaven"
  },
  "phoneNumbers": [
    { "type": "mobile", "number": "0863299999" }
  ]
}

What is HTTP?

What is a Schema?

Alternative

Schema-less

Does NOT Mean Structure-less

Documents and K-V Buckets

CouchDB: Cluster of unreliable commodity hardware

Replication, Attachments, Generated “random” ids, Dictionary, Revisions?, JSON Objects, HTTP, CRUD

Documents

{
  "_id": "131dafsd1vasd",
  "_rev": "12-fva32asdf",
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": {
    "streetAddress": "Mansfield House",
    "city": "Crosshaven"
  },
  "phoneNumbers": [
    { "type": "mobile", "number": "0863299999" }
  ]
}
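The `_id` and `_rev` fields are what CouchDB's HTTP API uses to address and version a document. The sketch below only builds the requests for create, read, update and delete rather than sending them; the server URL and the database name `people` are assumptions made for illustration, not values from the talk.

```javascript
// Hypothetical server and database names, used only for illustration.
const BASE_URL = "http://localhost:5984";
const DB = "people";

// Every document lives at /{db}/{id}.
function docUrl(id) {
  return `${BASE_URL}/${DB}/${encodeURIComponent(id)}`;
}

// Create: PUT the JSON document under a caller-chosen id.
function createRequest(doc) {
  return { method: "PUT", url: docUrl(doc._id), body: JSON.stringify(doc) };
}

// Read: GET the document by id.
function readRequest(id) {
  return { method: "GET", url: docUrl(id) };
}

// Update: PUT again, with the current _rev in the body.
// CouchDB answers 409 Conflict when the _rev is stale.
function updateRequest(doc) {
  if (!doc._rev) throw new Error("update needs the current _rev");
  return { method: "PUT", url: docUrl(doc._id), body: JSON.stringify(doc) };
}

// Delete: DELETE with the current revision as a query parameter.
function deleteRequest(doc) {
  return {
    method: "DELETE",
    url: `${docUrl(doc._id)}?rev=${encodeURIComponent(doc._rev)}`,
  };
}
```

Each returned object can be handed to any HTTP client, for example `fetch(req.url, { method: req.method, body: req.body })` with a JSON content type.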

How do you find Anything?

Map/Reduce

...

Riak

Dynamo Paper

CAP Theorem

Key-Value Buckets

Differences?

                 CouchDB               Riak
Storage Model    append-only           bitcask
Access           HTTP                  HTTP, PB
Retrieval        Views (M/R)           M/R, Indexes, Search
Versioning       Eventual Consistency  Vector Clocks
Concurrency      No Locking            Client Resolution
Replication      master/master/slave   replication, clustering
Scaling In/Out   BigCouch              Built-in
Management       Futon/Fauxton         Riak Control

http://downloads.basho.com/papers/bitcask-intro.pdf
http://guide.couchdb.org

Map/Reduce

Mapper: Executed on document
Reducer: Receives output from mappers

{ "_id": "...", "_rev": "...", "age": "26" }
{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
{ "_id": "...", "_rev": "...", "age": "42" }
{ "_id": "...", "_rev": "...", "age": "17" }

{ "_id": "...", "_rev": "...", "age": "26" }
{ "_id": "...", "_rev": "...", "age": "42" }
{ "_id": "...", "_rev": "...", "age": "17" }
{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

{ "age": "32", "heads": "3" }

Map: find-ages

{ "_id": "...", "_rev": "...", "age": "26" }
{ "_id": "...", "_rev": "...", "age": "42" }
{ "_id": "...", "_rev": "...", "age": "17" }
{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

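Map: find-ages

// Emit the age of every document that has one.
// typeof returns a string, so compare against "undefined";
// ages are stored as strings, so convert them before emitting.
function find_ages(doc) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, Number(doc.age));
  }
}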


{ "_id": "...", "_rev": "...", "age": "26" }
{ "_id": "...", "_rev": "...", "age": "42" }
{ "_id": "...", "_rev": "...", "age": "17" }
{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

Map: find-ages
26 32 42 17

Map: find-ages
26 32 42 17

Reduce: sum

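Reduce: sum

// A CouchDB reduce function receives (keys, values, rereduce).
// sum() is CouchDB's built-in helper, so the reducer itself
// must not also be named sum.
function(keys, values, rereduce) {
  return sum(values);
}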

Map: find-ages
26 32 42 17

Reduce: sum
117
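The whole pipeline can be tried outside CouchDB with a few lines of plain JavaScript. This is a minimal sketch, not CouchDB's view engine: `runView` is a hypothetical helper, `emit` is passed in as an argument instead of being a global, and the ids `a` to `d` stand in for the elided `_id` values.

```javascript
// Sample documents from the slides; ages are stored as strings.
// The ids are placeholders for the elided "_id" values.
const docs = [
  { _id: "a", age: "26" },
  { _id: "b", age: "32", heads: "3" },
  { _id: "c", age: "42" },
  { _id: "d", age: "17" },
];

// Hypothetical stand-in for a view engine: run map over every document,
// collect the emitted rows, then hand keys and values to reduce once.
function runView(documents, map, reduce) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  documents.forEach((doc) => map(doc, emit));
  return reduce(rows.map((r) => r.key), rows.map((r) => r.value), false);
}

// Map: emit the age of every document that has one, as a number.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") emit(doc._id, Number(doc.age));
}

// Reduce: add up the emitted values.
function sumAges(keys, values, rereduce) {
  return values.reduce((total, v) => total + v, 0);
}

const total = runView(docs, findAges, sumAges);
console.log(total); // 117
```

Converting with `Number` matters here: adding the raw string ages would concatenate them instead of summing.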

So What?

The Machines They Lurn.

The Problem

Statistics Example

Mean, Std. Deviation Age

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
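Expanding the square shows why a single pass that collects the sum of the ages and the sum of their squares is enough; this is the standard identity, written out here as a short derivation:

```latex
\sigma^2
  = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2
    - 2\mu \cdot \frac{1}{n} \sum_{i=1}^{n} x_i
    + \mu^2
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \mu^2
```

So a mapper only has to emit each age together with its square, and the reducer needs nothing more than the count, the sum and the sum of squares.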

Mapper: Retrieve values, pre-process
Reducer: Receive, process further.

{ "_id": "...", "_rev": "...", "age": "26" }
{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
{ "_id": "...", "_rev": "...", "age": "42" }
{ "_id": "...", "_rev": "...", "age": "17" }

[
  [ 26, 676 ],
  [ 32, 1024 ],
  [ 42, 1764 ],
  [ 17, 289 ]
]

/**
 * Our mapper function: emit each age together with its square.
 * Ages are stored as strings, so convert them to numbers first.
 */
map: function(doc) {
  var age = Number(doc.age);
  emit(null, [age, age * age]);
}

/**
 * Our reducer: mean and standard deviation of the emitted ages.
 * Assumes a single reduce pass (rereduce is false).
 */
reduce: function(keys, values, rereduce) {
  var N = 0;
  var summed = 0;
  var summedSquare = 0;
  for (var i = 0; i < values.length; i++) {
    N += 1;
    summed += values[i][0];
    summedSquare += values[i][1];
  }
  var mean = summed / N;
  var standard_deviation = Math.sqrt((summedSquare / N) - (mean * mean));
  return [mean, standard_deviation];
}

/**
 * Our mapper function: emit each age together with its square.
 * Ages are stored as strings, so convert them to numbers first.
 */
map: function(doc) {
  var age = Number(doc.age);
  emit(null, [age, age * age]);
}

/**
 * Our reducer, rewritten with map() and CouchDB's built-in sum().
 * Assumes a single reduce pass (rereduce is false).
 */
reduce: function(keys, values, rereduce) {
  var N = values.length;
  var summed = sum(values.map(function(v) { return v[0]; }));
  var summedSquares = sum(values.map(function(v) { return v[1]; }));
  var mean = summed / N;
  var standard_deviation = Math.sqrt((summedSquares / N) - (mean * mean));
  return [mean, standard_deviation];
}
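On larger data sets CouchDB may feed a reducer its own earlier output (`rereduce` is true), and a mean of means is not the overall mean. One way to stay correct, sketched here in plain JavaScript under the assumption that the final division happens in the caller, is to reduce to the triple `[N, sum, sumOfSquares]`. The names `reduceStats` and `finalize` are illustrative and not from the talk.

```javascript
// Reduce to [n, sum, sumSq] so partial results can be combined safely.
function reduceStats(keys, values, rereduce) {
  let n = 0;
  let sum = 0;
  let sumSq = 0;
  for (const v of values) {
    if (rereduce) {
      // v is a partial [n, sum, sumSq] from an earlier reduce call.
      n += v[0];
      sum += v[1];
      sumSq += v[2];
    } else {
      // v is a mapper row [age, age * age].
      n += 1;
      sum += v[0];
      sumSq += v[1];
    }
  }
  return [n, sum, sumSq];
}

// Final step, done by the caller: mean and standard deviation.
function finalize([n, sum, sumSq]) {
  const mean = sum / n;
  return { mean, sd: Math.sqrt(sumSq / n - mean * mean) };
}

// Two partial reductions over the mapper rows, then one rereduce.
const left = reduceStats(null, [[26, 676], [32, 1024]], false);
const right = reduceStats(null, [[42, 1764], [17, 289]], false);
const stats = finalize(reduceStats(null, [left, right], true));
console.log(stats.mean, stats.sd);
```

Splitting the rows differently between `left` and `right` gives the same result, which is the property a rereduce-safe reducer needs.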

Naive Bayes

Real Life Fraud

$$P(x_j = k \mid y = \text{fraudulent})$$
$$P(x_j = k \mid y = \text{normal})$$
$$P(y)$$

We need to: Sum $x_j = k$, for each $y$, to calculate $P(x \mid y)$

We need: More than 1 mapper.

We need 4 mappers

Mapper #1: $\sum_{1}^{i} P(x_j = k \mid y = \text{fraudulent})$

Mapper #2: $\sum_{1}^{i} P(x_j = k \mid y = \text{normal})$

Mapper #3: $\sum_{1}^{i} P(y = \text{fraudulent})$

Mapper #4: $\sum_{1}^{i} P(y = \text{normal})$

Reducer Sums up results for parameters
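As a rough illustration of the four mappers and the summing reducer, the sketch below counts occurrences in memory and turns the counts into the Naive Bayes parameters. The records, the feature names `country` and `card`, and the key format are all made up for the example; the talk does not show this code.

```javascript
// Hypothetical transactions: feature values plus a known label y.
const records = [
  { country: "IE", card: "new", y: "fraudulent" },
  { country: "IE", card: "old", y: "normal" },
  { country: "US", card: "new", y: "fraudulent" },
  { country: "IE", card: "old", y: "normal" },
  { country: "US", card: "old", y: "normal" },
];
const FEATURES = ["country", "card"];

// Mappers #1 and #2: emit a 1 for every (feature j, value k) seen under a label.
const featureMapper = (label) => (rec, emit) => {
  if (rec.y !== label) return;
  FEATURES.forEach((j) => emit(`${label}|${j}=${rec[j]}`, 1));
};

// Mappers #3 and #4: emit a 1 for every record carrying a label.
const labelMapper = (label) => (rec, emit) => {
  if (rec.y === label) emit(label, 1);
};

const mappers = [
  featureMapper("fraudulent"),
  featureMapper("normal"),
  labelMapper("fraudulent"),
  labelMapper("normal"),
];

// Reducer: sum the emitted 1s per key.
const counts = {};
const emit = (key, value) => {
  counts[key] = (counts[key] || 0) + value;
};
records.forEach((rec) => mappers.forEach((m) => m(rec, emit)));

// Turn the summed counts into P(y) and P(x_j = k | y).
const prior = (y) => counts[y] / records.length;
const likelihood = (j, k, y) => (counts[`${y}|${j}=${k}`] || 0) / counts[y];

console.log(prior("fraudulent"), likelihood("country", "IE", "fraudulent"));
```

A value never seen under a label gets probability 0 here; smoothing the counts is a common remedy and is left out to keep the sketch short.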

Cluster Analysis

k-means

Mapper: Divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up.
Reducer: Sum up the sums, get new centroids.
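One k-means iteration split along those lines might look like the following sketch. It assumes Euclidean distance for d(p,q) and uses invented data; `mapChunk` and `reduceCentroids` are hypothetical names, and a real job would repeat the step until the centroids stop moving.

```javascript
// Euclidean distance d(p, q) between two vectors of equal length.
function distance(p, q) {
  return Math.sqrt(p.reduce((acc, pi, i) => acc + (pi - q[i]) ** 2, 0));
}

// Mapper: for one subgroup of vectors, assign each vector to its nearest
// centroid and return per-centroid partial sums and counts.
function mapChunk(chunk, centroids) {
  const partial = centroids.map((c) => ({ sum: c.map(() => 0), count: 0 }));
  for (const p of chunk) {
    let best = 0;
    centroids.forEach((c, i) => {
      if (distance(p, c) < distance(p, centroids[best])) best = i;
    });
    partial[best].count += 1;
    p.forEach((v, d) => {
      partial[best].sum[d] += v;
    });
  }
  return partial;
}

// Reducer: sum up the partial sums and divide by the counts
// to get the new centroids.
function reduceCentroids(partials, centroids) {
  return centroids.map((old, i) => {
    const count = partials.reduce((acc, p) => acc + p[i].count, 0);
    if (count === 0) return old; // keep a centroid that attracted no vectors
    return old.map(
      (_, d) => partials.reduce((acc, p) => acc + p[i].sum[d], 0) / count
    );
  });
}

// Hypothetical data: two subgroups (one per mapper) and two starting centroids.
const chunks = [
  [[1, 1], [2, 2]],
  [[9, 9], [10, 10]],
];
const start = [[0, 0], [8, 8]];
const next = reduceCentroids(chunks.map((c) => mapChunk(c, start)), start);
console.log(next); // [ [ 1.5, 1.5 ], [ 9.5, 9.5 ] ]
```

Because each mapper returns only sums and counts, the reducer's work is independent of how many vectors each subgroup holds.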
