The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

Data comes in all forms and shapes. Data also evolves as life and people adapt to new situations, and so should your database.

When working with data, traditional relational database systems come to mind because that is how most of us were trained.

However, data is rarely homogeneous, and your database should not force you into a certain schema if your data is not relational.

During this talk we analyse the composition of "documents" in the context of a document-based database, and cover the basic principles of Map-Reduce and its potential use in the context of computational statistics.

What happens when your data no longer fits on one server? How easily can your favourite database expand and adapt to your growing requirements? What is your contingency plan if your server goes down?

We go over some of the features that CouchDB and Riak provide, alongside some of David's personal opinions.

This is an intermediate talk. Listeners should have a working knowledge of Bayesian statistics, standard internet protocols such as HTTP, and a minimum understanding of programming languages such as JavaScript and Erlang.

David Coallier

March 27, 2013
Transcript

  1. 6.

    id   name    age    address
    1    david   1      315
    2    divad   3      51
    3    foo     41     31
    4    bar     42     98
    5    john    3315   85
    6    jack    4      11
    7    jill    8      66
    ...  ...     ...    ...

    Wednesday 24 April 13
  6. 12.

    id   name    age   address   phone
    1    david   26    IE        353
    2    divad   27    US        1
    3    foo     42    IE        353
    4    bar     31    CA        1
    5    john    17    NZ        131
    6    jack    128   DK        311
    7    jill    21    IE        353
    ...  ...     ...   ...       ...
  7. 16.

    {
      "firstName": "David",
      "lastName": "Coallier",
      "age": 26,
      "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
      },
      "phoneNumbers": [
        { "type": "mobile", "number": "0863299999" }
      ]
    }
  8. 27.

    {
      "_id": "131dafsd1vasd",
      "_rev": "12-fva32asdf",
      "firstName": "David",
      "lastName": "Coallier",
      "age": 26,
      "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
      },
      "phoneNumbers": [
        { "type": "mobile", "number": "0863299999" }
      ]
    }
  9. 36.

                     CouchDB                 Riak
    Storage model    append-only             Bitcask
    Access           HTTP                    HTTP, PB
    Retrieval        Views (M/R)             M/R, Indexes, Search
    Versioning       Eventual consistency    Vector clocks
    Concurrency      No locking              Client resolution
    Replication      master/master/slave     replication, clustering
    Scaling in/out   BigCouch                Built-in
    Management       Futon/Fauxton           Riak Control

    http://downloads.basho.com/papers/bitcask-intro.pdf
    http://guide.couchdb.org
  10. 39.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
  11. 40.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
  12. 42.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

    Map: find-ages
  14. 45.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

    Map: find-ages

    26 32 42 17
  15. 56.

    $$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
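The reducers later in the deck compute the standard deviation from the sum of the values and the sum of their squares, not from the individual deviations. Assuming the formula on this slide is the usual population standard deviation, a short expansion with the same symbols shows why the two forms agree:

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \\
         &= \frac{1}{n} \sum_{i=1}^{n} x_i^2
            \;-\; 2\mu \cdot \frac{1}{n} \sum_{i=1}^{n} x_i \;+\; \mu^2 \\
         &= \frac{1}{n} \sum_{i=1}^{n} x_i^2 \;-\; \mu^2,
\qquad \text{since } \mu = \frac{1}{n} \sum_{i=1}^{n} x_i .
\end{aligned}
```

This is why a mapper only needs to emit each value together with its square.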
  16. 59.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
  17. 60.

    [
      [ 26, 676 ],
      [ 32, 1024 ],
      [ 42, 1764 ],
      [ 17, 289 ]
    ]
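The list of pairs on this slide is what a map step produces when it emits each age together with its square. A minimal Python sketch of that step follows; the `_id` values and the helper names `map_age` and `run_map` are hypothetical and only there for illustration.

```python
def map_age(doc):
    """Map step: emit [age, age squared] for one document."""
    age = int(doc["age"])  # ages are stored as strings in the sample documents
    return [age, age * age]


def run_map(map_fn, docs):
    """Apply the map function to every document, as a view would."""
    return [map_fn(doc) for doc in docs]


# Sample documents mirroring the slide; the ids are made up.
docs = [
    {"_id": "a", "age": "26"},
    {"_id": "b", "age": "32", "heads": "3"},
    {"_id": "c", "age": "42"},
    {"_id": "d", "age": "17"},
]

pairs = run_map(map_age, docs)
print(pairs)
```

Extra fields such as `heads` are simply ignored, which is the point of a schema-less store: the map function picks out what it needs.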
  18. 61.

    /**
     * Our mapper function.
     */
    map: function(doc) {
      // Ages are stored as strings, so convert before doing arithmetic.
      var age = Number(doc.age);
      emit(null, [age, age * age]);
    }

    /**
     * Our reducer...
     * Note: the rereduce flag is ignored, so the result only holds
     * when all values reach a single reduce call.
     */
    reduce: function(keys, values, rereduce) {
      var N = 0;
      var summed = 0;
      var summedSquare = 0;

      for (var i in values) {
        N += 1;
        summed += values[i][0];
        summedSquare += values[i][1];
      }

      var mean = summed / N;
      var standard_deviation = Math.sqrt(
        (summedSquare / N) - (mean * mean)
      );

      return [mean, standard_deviation];
    }
  19. 62.

    /**
     * Our mapper function.
     */
    map: function(doc) {
      // Ages are stored as strings, so convert before doing arithmetic.
      var age = Number(doc.age);
      emit(null, [age, age * age]);
    }

    /**
     * Our reducer...
     * Note: the rereduce flag is ignored, so the result only holds
     * when all values reach a single reduce call.
     */
    reduce: function(keys, values, rereduce) {
      var N = values.length;
      var summed = sum(values.map(function(v) { return v[0]; }));
      var summedSquares = sum(values.map(function(v) { return v[1]; }));

      var mean = summed / N;
      var standard_deviation = Math.sqrt(
        (summedSquares / N) - (mean * mean)
      );

      return [mean, standard_deviation];
    }
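The reducer above returns `[mean, standard_deviation]`, which has a different shape from the pairs it consumes, so it cannot be fed its own output when CouchDB reduces in several passes. One common way around this, sketched here in Python and not taken from the talk, is to carry the partial triple (count, sum, sum of squares) through every pass and only compute the statistics at the end. The function names are hypothetical.

```python
import math


def map_age(doc):
    """Emit the partial statistics (count, sum, sum of squares) for one document."""
    age = float(doc["age"])
    return (1, age, age * age)


def combine(partials):
    """Add partial statistics component-wise; safe to apply to its own output."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    return (n, total, total_sq)


def finalise(partial):
    """Turn the combined triple into (mean, standard deviation)."""
    n, total, total_sq = partial
    mean = total / n
    return mean, math.sqrt(total_sq / n - mean * mean)


docs = [{"age": "26"}, {"age": "32", "heads": "3"}, {"age": "42"}, {"age": "17"}]
partials = [map_age(d) for d in docs]

# Reduce two halves independently, then reduce the two results again.
left = combine(partials[:2])
right = combine(partials[2:])
mean, sd = finalise(combine([left, right]))
print(mean, sd)
```

Because `combine` is associative, splitting the data differently across nodes gives the same final triple.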
  20. 65.

    $$P(x_j = k \mid y = \text{fraudulent})$$

    $$P(x_j = k \mid y = \text{normal})$$

    $$P(y)$$
  21. 66.

    We need to: sum $x_j = k$ for each $y$ to calculate $P(x \mid y)$.
  22. 69.

    Mapper #1:

    $$\sum 1_i \, P(x_j = k \mid y = \text{fraudulent})$$
  23. 70.

    Mapper #2:

    $$\sum 1_i \, P(x_j = k \mid y = \text{normal})$$
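The two mappers above can be read as counting, inside each subgroup of the training data, how often feature j takes value k for a given label; a reducer then adds the counts and divides by the number of records with that label. The Python sketch below follows that reading. The transaction data, the feature layout (country, channel) and the function names are all invented for illustration.

```python
from collections import Counter


def mapper(subgroup, label):
    """Count occurrences of (feature index j, value k) among records with this label."""
    counts = Counter()
    n_label = 0
    for features, y in subgroup:
        if y == label:
            n_label += 1
            for j, k in enumerate(features):
                counts[(j, k)] += 1
    return counts, n_label


def reducer(mapper_outputs):
    """Sum the per-subgroup counts into global counts."""
    total, n = Counter(), 0
    for counts, n_label in mapper_outputs:
        total.update(counts)  # Counter.update adds counts together
        n += n_label
    return total, n


# Hypothetical records: ((country, channel), label), split across two subgroups.
subgroups = [
    [(("IE", "web"), "fraudulent"), (("US", "web"), "normal"), (("IE", "shop"), "normal")],
    [(("IE", "web"), "fraudulent"), (("US", "shop"), "normal"), (("US", "web"), "fraudulent")],
]

fraud_counts, n_fraud = reducer(mapper(s, "fraudulent") for s in subgroups)
normal_counts, n_normal = reducer(mapper(s, "normal") for s in subgroups)

p_ie_given_fraud = fraud_counts[(0, "IE")] / n_fraud  # estimate of P(x_0 = IE | y = fraudulent)
p_fraud = n_fraud / (n_fraud + n_normal)              # estimate of P(y = fraudulent)
print(p_ie_given_fraud, p_fraud)
```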
  24. 76.

    Mapper: Divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up.

    Reducer: Sum up the sums, get new centroids.
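The mapper and reducer described on this slide match one iteration of k-means clustering. The sketch below is one possible reading of it in Python, assuming d(p, q) is the Euclidean distance and that "find centroids" means assigning each vector to its nearest current centroid. The sample points and function names are hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point under Euclidean distance d(p, q)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def mapper(subgroup, centroids):
    """For one subgroup, return the vector sum and count of points per centroid."""
    dims = len(centroids[0])
    sums = {c: ([0.0] * dims, 0) for c in range(len(centroids))}
    for p in subgroup:
        c = nearest(p, centroids)
        vec, n = sums[c]
        sums[c] = ([a + b for a, b in zip(vec, p)], n + 1)
    return sums


def reducer(mapped, centroids):
    """Sum up the per-subgroup sums and divide by the counts to get new centroids."""
    new_centroids = []
    for c, old in enumerate(centroids):
        total, count = [0.0] * len(old), 0
        for sums in mapped:
            vec, n = sums[c]
            total = [a + b for a, b in zip(total, vec)]
            count += n
        # Keep the old centroid if no point was assigned to it.
        new_centroids.append(tuple(t / count for t in total) if count else old)
    return new_centroids


subgroups = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
centroids = [(1, 1), (9, 9)]
new_centroids = reducer([mapper(s, centroids) for s in subgroups], centroids)
print(new_centroids)
```

In practice the map and reduce steps are repeated until the centroids stop moving.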
  25. 85.

    $$P(x \mid y) \propto P(x) \prod_{k \in y} P(k \mid x)$$
  26. 86.

    $$P(x \mid y) \propto P(x) \prod_{k \in y} P(k \mid x)$$

    where x is a class label, y is an entry, and k is a feature.
  27. 91.

    $$P(k \mid x) = \frac{z + c(k \mid x)}{\sum_{k'} \left( z + c(k' \mid x) \right)}$$

    where the estimate is smoothed if z > 0, unsmoothed otherwise.
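A small Python sketch of the smoothed estimate above may help. It assumes that c(k | x) is the number of times feature k was seen with class x and that the sum in the denominator runs over a fixed vocabulary of features; the counts, the vocabulary and the function name are hypothetical.

```python
import math


def smoothed_probability(counts, vocabulary, k, z):
    """Estimate P(k | x) for one class x from its feature counts c(. | x).

    z > 0 gives the smoothed estimate; z == 0 gives the unsmoothed one.
    """
    denominator = sum(z + counts.get(feature, 0) for feature in vocabulary)
    return (z + counts.get(k, 0)) / denominator


# Hypothetical counts for one class; "phone" was never observed with it.
counts = {"web": 3, "shop": 1}
vocabulary = ["web", "shop", "phone"]

p_web = smoothed_probability(counts, vocabulary, "web", z=1)
p_phone = smoothed_probability(counts, vocabulary, "phone", z=1)
p_phone_unsmoothed = smoothed_probability(counts, vocabulary, "phone", z=0)
print(p_web, p_phone, p_phone_unsmoothed)
```

With z = 0 an unseen feature gets probability zero, which wipes out the whole product in the classifier; a positive z avoids that.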
  28. 95.

    # This is pseudo-code
    def mapperPrior(self, _, line):
        pass

    def combinerPrior(self, key, values):
        pass

    def reducerPrior(self, key, values):
        pass

    def mapperProb(self, _, line):
        pass

    def combinerProb(self, key, values):
        pass

    def reducerProb(self, key, values):
        pass

    def steps(self):
        return [
            self.mr(mapper=self.mapperPrior,
                    combiner=self.combinerPrior,
                    reducer=self.reducerPrior),
            self.mr(mapper=self.mapperProb,
                    combiner=self.combinerProb,
                    reducer=self.reducerProb)
        ]
  29. 97.

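    import re

    from mrjob.job import MRJob

    # Matches words, including apostrophes.
    WORD_RE = re.compile(r"[\w']+")


    class MRDoubleWordFreqCount(MRJob):

        def get_words(self, _, line):
            # Emit (word, 1) for every word on the line.
            for word in WORD_RE.findall(line):
                yield word.lower(), 1

        def sum_words(self, word, counts):
            yield word, sum(counts)

        def double_counts(self, word, counts):
            # Second step: counts is a single summed value here.
            yield word, counts * 2

        def steps(self):
            # self.mr() is the step helper from the mrjob releases of the time;
            # newer releases build steps with mrjob.step.MRStep instead.
            return [self.mr(mapper=self.get_words,
                            combiner=self.sum_words,
                            reducer=self.sum_words),
                    self.mr(mapper=self.double_counts)]


    if __name__ == '__main__':
        MRDoubleWordFreqCount.run()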
  30. 99.

    $$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x_i x_i^T \right) - \mu \mu^T$$
  32. 101.

    Summation:

    $$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x_i x_i^T \right) - \mu \mu^T$$
  33. 102.

    $$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$$

    $$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x_i x_i^T \right) - \mu \mu^T$$
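Both formulas above are plain sums over the data, so each mapper can compute them for its own subgroup and a reducer can add the partial results. The Python sketch below illustrates this with plain lists; the sample vectors and function names are hypothetical, and a real job would more likely use a numerical library.

```python
def mapper(subgroup):
    """Partial sums for one subgroup: count, sum of x_i, and sum of x_i x_i^T."""
    d = len(subgroup[0])
    vec_sum = [0.0] * d
    outer_sum = [[0.0] * d for _ in range(d)]
    for x in subgroup:
        for a in range(d):
            vec_sum[a] += x[a]
            for b in range(d):
                outer_sum[a][b] += x[a] * x[b]
    return len(subgroup), vec_sum, outer_sum


def reducer(partials):
    """Combine the partial sums into the mean vector and the covariance matrix."""
    m = sum(p[0] for p in partials)
    d = len(partials[0][1])
    mu = [sum(p[1][a] for p in partials) / m for a in range(d)]
    cov = [
        [sum(p[2][a][b] for p in partials) / m - mu[a] * mu[b] for b in range(d)]
        for a in range(d)
    ]
    return mu, cov


# Four 2-d vectors split across two subgroups; the second coordinate is twice the first.
subgroups = [[(1, 2), (3, 6)], [(5, 10), (7, 14)]]
mu, cov = reducer([mapper(s) for s in subgroups])
print(mu)
print(cov)
```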