The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

Data comes in all forms and shapes. Data also evolves as life and people adapt to new situations, and so should your database.

When working with data, traditional relational database systems come to mind because it is how most of us have been trained.

However, data is rarely homogeneous, and your database should not force you into a certain schema if your data is not relational.

During this talk we analyse the composition of "documents" in the context of a document-based database, and cover the basic principles of Map-Reduce and its potential use in the context of computational statistics.

What happens when your data no longer fits on one server? How easily can your favourite database expand and adapt to your growing requirements? What is your contingency plan if your server goes down?

We go over some of the features that CouchDB and Riak provide, alongside some of David's personal opinions.

This is an intermediate talk. Listeners should have a working knowledge of Bayesian statistics, standard internet protocols such as HTTP, and a basic understanding of programming languages such as JavaScript and Erlang.

David Coallier

March 27, 2013

Transcript

  1. The Artful Business of Data Mining Distributed Schema-less Document-Based Databases

    Wednesday 24 April 13
  2. David Coallier @davidcoallier

  3. Data Scientist At Engine Yard (.com)

  4. RDBMs

  5. Structure Restrictions Safety

  6. id   name    age    address
     1    david   1      315
     2    divad   3      51
     3    foo     41     31
     4    bar     42     98
     5    john    3315   85
     6    jack    4      11
     7    jill    8      66
     ...  ...     ...    ...
  11. What If?

  12. id   name    age   address   phone
      1    david   26    IE        353
      2    divad   27    US        1
      3    foo     42    IE        353
      4    bar     31    CA        1
      5    john    17    NZ        131
      6    jack    128   DK        311
      7    jill    21    IE        353
      ...  ...     ...   ...       ...
  13. Before Moving on

  14. JSON

  15. What is JSON?

  16.

    {
      "firstName": "David",
      "lastName": "Coallier",
      "age": 26,
      "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
      },
      "phoneNumbers": [
        { "type": "mobile", "number": "0863299999" }
      ]
    }
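
As a small illustration of what the JSON on slide 16 gives you once parsed, the sketch below loads the same document with Python's standard `json` module and reads two nested values. The document is written without a trailing comma after `"city"`, since strict JSON parsers reject one. The variable names are only for illustration.

```python
import json

# The document from slide 16, written as strict JSON.
raw = """
{
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": {"streetAddress": "Mansfield House", "city": "Crosshaven"},
  "phoneNumbers": [{"type": "mobile", "number": "0863299999"}]
}
"""

person = json.loads(raw)

# Objects become dicts and arrays become lists, so nesting is plain indexing.
city = person["address"]["city"]
first_number = person["phoneNumbers"][0]["number"]
print(city, first_number)
```

Nothing here depends on a schema: a second document could add or omit fields and would still parse.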
  17. What is HTTP?

  18. What is a Schema?

  19. Alternative

  20. Schema-less

  21. Does NOT Mean Structure-less

  22. Documents and K-V Buckets

  23. CouchDB: Cluster of unreliable commodity hardware

  24. Replication Attachments Generated “random” ids Dictionary Revisions? JSON Objects HTTP CRUD

  25. Documents

  27.

    {
      "_id": "131dafsd1vasd",
      "_rev": "12-fva32asdf",
      "firstName": "David",
      "lastName": "Coallier",
      "age": 26,
      "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
      },
      "phoneNumbers": [
        { "type": "mobile", "number": "0863299999" }
      ]
    }
  28. How do you find Anything?

  29. Map/Reduce

  30. ...

  31. Riak

  32. Dynamo Paper

  33. CAP Theorem

  34. Key-Value Buckets

  35. Differences?

  36.                 CouchDB               Riak
      Storage Model   append-only           bitcask
      Access          HTTP                  HTTP, PB
      Retrieval       Views (M/R)           M/R, Indexes, Search
      Versioning      Eventual Consistency  Vector Clocks
      Concurrency     No Locking            Client Resolution
      Replication     master/master/slave   replication, clustering
      Scaling In/Out  BigCouch              Built-in
      Management      Futon/Fauxton         Riak Control

      http://downloads.basho.com/papers/bitcask-intro.pdf
      http://guide.couchdb.org
  37. Map/Reduce

  38. Mapper: Executed on document. Reducer: Receives output from mappers.
  39.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }

  40.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }

  41. { "age": "32", "heads": "3" }

  42. Map: find-ages

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
  43. Map: find-ages

    function find_ages(doc) {
      // typeof returns a string, so compare against "undefined"
      if (typeof doc.age !== "undefined") {
        emit(doc._id, Number(doc.age));  // ages are stored as strings
      }
    }
  45. Map: find-ages (output: 26, 32, 42, 17)

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
  46. Map: find-ages (26, 32, 42, 17). Reduce: sum.

  47. Reduce: sum

    // sum() is a helper available to CouchDB reduce functions
    function sum_ages(keys, values, rereduce) {
      return sum(values);
    }

  48. Map: find-ages (26, 32, 42, 17). Reduce: sum (117).
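
To make the find-ages and sum steps concrete outside CouchDB, here is a minimal Python sketch that runs the same map and reduce over four in-memory documents. The `_id` values and the function name `sum_ages` are placeholders, and the generator stands in for CouchDB's `emit`; this is an illustration of the idea, not how CouchDB executes views.

```python
# Hypothetical in-memory stand-ins for the four documents on the slides.
docs = [
    {"_id": "a", "age": "26"},
    {"_id": "b", "age": "42"},
    {"_id": "c", "age": "17"},
    {"_id": "d", "age": "32", "heads": "3"},
]


def find_ages(doc):
    """Map step: emit (key, value) for every document that has an age."""
    if "age" in doc:
        # Ages are stored as strings in the documents, so convert them.
        yield doc["_id"], int(doc["age"])


def sum_ages(values):
    """Reduce step: add up the values emitted by the mappers."""
    return sum(values)


emitted = [pair for doc in docs for pair in find_ages(doc)]
total = sum_ages(value for _key, value in emitted)
print(total)  # 117
```

The mapper only ever sees one document at a time, which is what lets the work be spread across machines.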
  49. Mapper: Executed on document. Reducer: Receives output from mappers.

  50. So What?

  51. The Machines They Lurn.

  52. The Problem

  53. Statistics Example

  54. Mean, Std. Deviation Age

  55. $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

  56. $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
  57. Mapper: Executed on document. Reducer: Receives output from mappers.

  58. Mapper: Retrieve values, pre-process. Reducer: Receive, process further.

  59.

    { "_id": "...", "_rev": "...", "age": "26" }
    { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
    { "_id": "...", "_rev": "...", "age": "42" }
    { "_id": "...", "_rev": "...", "age": "17" }

  60. [ [ 26, 676 ], [ 32, 1024 ], [ 42, 1764 ], [ 17, 289 ] ]
  61.

    /**
     * Our mapper function.
     */
    map: function(doc) {
      // Ages are stored as strings, so convert before doing arithmetic.
      var age = Number(doc.age);
      emit(null, [age, age * age]);
    }

    /**
     * Our reducer...
     * It ignores rereduce, so it is only correct when every mapped
     * value reaches a single reduce call.
     */
    reduce: function(keys, values, rereduce) {
      var N = 0;
      var summed = 0;
      var summedSquare = 0;
      for (var i = 0; i < values.length; i++) {
        N += 1;
        summed += values[i][0];
        summedSquare += values[i][1];
      }
      var mean = summed / N;
      var standard_deviation = Math.sqrt(
        (summedSquare / N) - (mean * mean)
      );
      return [mean, standard_deviation];
    }

  62.

    /**
     * Our mapper function.
     */
    map: function(doc) {
      // Ages are stored as strings, so convert before doing arithmetic.
      var age = Number(doc.age);
      emit(null, [age, age * age]);
    }

    /**
     * Our reducer...
     * Same result as the previous slide, using map() and CouchDB's sum().
     */
    reduce: function(keys, values, rereduce) {
      var N = values.length;
      var summed = sum(values.map(function(v) { return v[0]; }));
      var summedSquares = sum(values.map(function(v) { return v[1]; }));
      var mean = summed / N;
      var standard_deviation = Math.sqrt(
        (summedSquares / N) - (mean * mean)
      );
      return [mean, standard_deviation];
    }
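
The reducers on slides 61 and 62 rely on the identity $\sigma^2 = \frac{1}{n}\sum x_i^2 - \mu^2$, which is why the mapper emits both the age and its square. The Python sketch below shows one way to keep that computation correct when reduce results are themselves reduced again. It carries a `(count, sum, sum of squares)` triple instead of the slides' `[mean, standard_deviation]` pair; that change and all function names are assumptions made for illustration, not the talk's code.

```python
import math

ages = [26, 32, 42, 17]


def map_age(age):
    """Map step: emit (count, x, x^2) for one document."""
    return (1, age, age * age)


def combine(partials):
    """Reduce step: add triples component-wise; safe to apply repeatedly."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    return (n, s, sq)


def finalise(partial):
    """Turn the final triple into (mean, population standard deviation)."""
    n, s, sq = partial
    mean = s / n
    return mean, math.sqrt(sq / n - mean * mean)


mapped = [map_age(a) for a in ages]

# Reduce two shards separately, then reduce their results again.
shard_a = combine(mapped[:2])
shard_b = combine(mapped[2:])
mean, std = finalise(combine([shard_a, shard_b]))
print(mean, std)
```

Because the triples simply add, the answer does not depend on how the documents are split across shards.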
  63. Naive Bayes

  64. Real Life Fraud

  65. $P(x_j = k \mid y = \text{fraudulent})$, $P(x_j = k \mid y = \text{normal})$, $P(y)$

  66. We need to: sum $x_j = k$ for each $y$ to calculate $P(x \mid y)$

  67. We need: More than 1 mapper.

  68. We need 4 mappers

  69. Mapper #1: $\sum 1_i \, P(x_j = k \mid y = \text{fraudulent})$

  70. Mapper #2: $\sum 1_i \, P(x_j = k \mid y = \text{normal})$

  71. Mapper #3: $\sum 1_i \, P(y = \text{fraudulent})$

  72. Mapper #4: $\sum 1_i \, P(y = \text{normal})$

  73. Reducer: Sums up results for parameters
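
The counting described on slides 65 to 73 can be sketched in a few lines of Python. For brevity this version uses a single mapper function that produces all four kinds of count per shard, rather than four separate mappers as on the slides. The records, feature names and helper names are invented for illustration.

```python
from collections import Counter

# Hypothetical toy records: (features, label). Not data from the talk.
records = [
    ({"country": "IE", "card": "new"}, "fraudulent"),
    ({"country": "IE", "card": "old"}, "normal"),
    ({"country": "US", "card": "new"}, "fraudulent"),
    ({"country": "IE", "card": "old"}, "normal"),
    ({"country": "US", "card": "old"}, "normal"),
]


def mapper(shard):
    """Count (feature j, value k, label y) and label y within one shard."""
    feature_counts, label_counts = Counter(), Counter()
    for features, label in shard:
        label_counts[label] += 1
        for j, k in features.items():
            feature_counts[(j, k, label)] += 1
    return feature_counts, label_counts


def reducer(partials):
    """Sum the per-shard counters into global counts."""
    feature_counts, label_counts = Counter(), Counter()
    for fc, lc in partials:
        feature_counts.update(fc)  # Counter.update adds counts
        label_counts.update(lc)
    return feature_counts, label_counts


shards = [records[:2], records[2:]]
feature_counts, label_counts = reducer(mapper(s) for s in shards)


def p_label(y):
    """Estimate P(y) from the summed label counts."""
    return label_counts[y] / sum(label_counts.values())


def p_feature_given_label(j, k, y):
    """Estimate P(x_j = k | y) from the summed counts."""
    return feature_counts[(j, k, y)] / label_counts[y]


print(p_label("fraudulent"), p_feature_given_label("card", "new", "fraudulent"))
```

The reducer never looks at a raw record; it only adds up counts, which is the point of the slides.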

  74. Cluster Analysis

  75. k-means

  76. Mapper: Divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up. Reducer: Sum up the sums, get new centroids.
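
One iteration of the k-means split described on slide 76 might look like the following Python sketch. It assumes Euclidean distance for d(p, q) and uses made-up 2-D points; `nearest`, `mapper` and `reducer` are illustrative names. A real job would repeat this step until the centroids stop moving.

```python
import math


def nearest(point, centroids):
    """Index of the centroid with the smallest Euclidean distance d(p, q)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def mapper(shard, centroids):
    """Per shard: for each centroid, the vector sum and count of its points."""
    partial = {}
    for point in shard:
        c = nearest(point, centroids)
        total, count = partial.get(c, ([0.0] * len(point), 0))
        partial[c] = ([t + x for t, x in zip(total, point)], count + 1)
    return partial


def reducer(partials, centroids):
    """Sum the partial sums, then divide by the counts to get new centroids."""
    merged = {}
    for partial in partials:
        for c, (total, count) in partial.items():
            m_total, m_count = merged.get(c, ([0.0] * len(total), 0))
            merged[c] = ([a + b for a, b in zip(m_total, total)], m_count + count)
    # A centroid that attracted no points keeps its previous position.
    return [
        [t / merged[c][1] for t in merged[c][0]] if c in merged else list(centroids[c])
        for c in range(len(centroids))
    ]


# Hypothetical 2-D points split across two shards.
shards = [[(1.0, 1.0), (9.0, 9.0)], [(2.0, 2.0), (10.0, 10.0)]]
centroids = [(0.0, 0.0), (8.0, 8.0)]
new_centroids = reducer([mapper(s, centroids) for s in shards], centroids)
print(new_centroids)
```

Each call to `reducer` produces the input for the next round, which is why the slides move on to iterative Map/Reduce.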
  77. More Problems

  78. Iterative Map/Reduce

  79. Hadoop

  80. Hadoop And Mr. Job

  81. mrjob

  82. Chain Mappers

  83. And... Reduce

  84. Back To Naïve Bayes

  85. $P(x \mid y) \propto P(x) \prod_{k \in y} P(k \mid x)$

  86. $P(x \mid y) \propto P(x) \prod_{k \in y} P(k \mid x)$, where $x$ is the class label, $y$ an entry, $k$ a feature

  87. ARGH! argmax(ln...)

  88. $\hat{P}(x) = \frac{c(N)}{N} = \frac{N_x}{N}$

  89. $P(k \mid x) = \frac{c(k, x)}{\sum_{k'} c(k', x)}$

  90. ARGH!

  91. $P(k \mid x) = \frac{z + c(k \mid x)}{\sum_{k'} \left( z + c(k' \mid x) \right)}$, smoothed if $z > 0$, unsmoothed otherwise
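
The effect of the smoothing term $z$ on slide 91 is easiest to see with numbers. The Python sketch below assumes that $z$ is added to every count inside the sum in the denominator, which is one reading of the slide; the counts, the vocabulary and the function name are made up for illustration.

```python
from collections import Counter


def feature_probability(k, x, counts, vocabulary, z=1.0):
    """Estimate P(k | x) as (z + c(k, x)) / sum over k' of (z + c(k', x))."""
    numerator = z + counts[(k, x)]
    denominator = sum(z + counts[(k_prime, x)] for k_prime in vocabulary)
    return numerator / denominator


# Hypothetical counts c(k, x) of feature k within class x.
counts = Counter({
    ("refund", "fraudulent"): 3,
    ("hello", "fraudulent"): 1,
    ("hello", "normal"): 4,
})
vocabulary = ["refund", "hello"]

# "refund" never appears in the "normal" class.
unsmoothed = feature_probability("refund", "normal", counts, vocabulary, z=0.0)
smoothed = feature_probability("refund", "normal", counts, vocabulary, z=1.0)
print(unsmoothed, smoothed)
```

With `z=0.0` a feature never seen in a class gets probability zero, which wipes out the whole product on slide 85; any `z > 0` avoids that.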
  93. What did You see?

  94. $\frac{c(N)}{N}$, $\frac{c(k, x)}{\sum_{k'} c(k', x)}$

  95.

    # This is pseudo-code
    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRNaiveBayes(MRJob):  # hypothetical class name

        def mapperPrior(self, _, line): pass
        def combinerPrior(self, key, values): pass
        def reducerPrior(self, key, values): pass

        def mapperProb(self, _, line): pass
        def combinerProb(self, key, values): pass
        def reducerProb(self, key, values): pass

        def steps(self):
            # The slide used self.mr(...); MRStep is the current equivalent.
            return [
                MRStep(mapper=self.mapperPrior,
                       combiner=self.combinerPrior,
                       reducer=self.reducerPrior),
                MRStep(mapper=self.mapperProb,
                       combiner=self.combinerProb,
                       reducer=self.reducerProb),
            ]
  96. Simpler Example

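  97.

    import re

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    # Assumed word pattern; the slide did not show how WORD_RE was defined.
    WORD_RE = re.compile(r"[\w']+")

    class MRDoubleWordFreqCount(MRJob):

        def get_words(self, _, line):
            for word in WORD_RE.findall(line):
                yield word.lower(), 1

        def sum_words(self, word, counts):
            yield word, sum(counts)

        def double_counts(self, word, counts):
            yield word, counts * 2

        def steps(self):
            # The slide used self.mr(...); MRStep is the current equivalent.
            return [
                MRStep(mapper=self.get_words,
                       combiner=self.sum_words,
                       reducer=self.sum_words),
                MRStep(mapper=self.double_counts),
            ]

    if __name__ == "__main__":
        MRDoubleWordFreqCount.run()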
  98. Principal Component Analysis

  99. $\Sigma = \frac{1}{m} \left( \sum_{i=1}^{m} x_i x_i^T \right) - \mu\mu^T$

  101. Summation: $\Sigma = \frac{1}{m} \left( \sum_{i=1}^{m} x_i x_i^T \right) - \mu\mu^T$

  102. $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$, $\Sigma = \frac{1}{m} \left( \sum_{i=1}^{m} x_i x_i^T \right) - \mu\mu^T$

  103. Mappers: Separate Processes

  104. Reducer: Sum Partial Results
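
Both $\mu$ and $\Sigma$ on slides 99 to 102 are built from plain sums over the data, so each mapper can compute its share independently and the reducer only has to add the shares. The pure-Python sketch below illustrates that split on made-up 2-D data; the function names and the shard layout are assumptions, and a real implementation would more likely use a linear algebra library.

```python
def mapper(shard):
    """Per shard: count, sum of x_i, and sum of outer products x_i x_i^T."""
    d = len(shard[0])
    total = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for x in shard:
        for a in range(d):
            total[a] += x[a]
            for b in range(d):
                outer[a][b] += x[a] * x[b]
    return len(shard), total, outer


def reducer(partials):
    """Sum the partial results, then form mu and Sigma."""
    partials = list(partials)
    d = len(partials[0][1])
    m = sum(p[0] for p in partials)
    total = [sum(p[1][a] for p in partials) for a in range(d)]
    outer = [[sum(p[2][a][b] for p in partials) for b in range(d)] for a in range(d)]
    mu = [t / m for t in total]
    # Sigma = (1/m) * sum(x_i x_i^T) - mu mu^T
    sigma = [[outer[a][b] / m - mu[a] * mu[b] for b in range(d)] for a in range(d)]
    return mu, sigma


# Hypothetical 2-D data split across two shards.
shards = [[(1.0, 2.0), (3.0, 6.0)], [(5.0, 10.0), (7.0, 14.0)]]
mu, sigma = reducer(mapper(s) for s in shards)
print(mu, sigma)
```

The principal components would then come from an eigendecomposition of `sigma`, which is small enough to do on one machine.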

  105. Shifting Thought Paradigms

  106. Summation Form

  107. $y = \sum f(x)$

  108. $y = \sum f(x)$ (Mapper: $f(x)$, Reducer: $\sum$)

  109. Reducer: $\sum$

  110. Mapper: $f(x)$

  111. Looking Back

  112. Thanks