NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0

Introduction to Map Reduce and how it is used in Couchbase Server 2.0 to query documents

Tugdual Grall

April 26, 2013

Transcript

  1. Introduction to Map Reduce with Couchbase
    Tugdual Grall / @tgrall
    NoSQL Matters ‘13 - Cologne - April 25th 2013
    Friday, April 26, 13
  2. About Me
    • Tugdual “Tug” Grall
      - Couchbase: Technical Evangelist
      - eXo: CTO
      - Oracle: Developer/Product Manager, mainly Java/SOA
      - Developer in consulting firms
    • Web
      - @tgrall
      - http://blog.grallandco.com
      - tgrall
      - NantesJUG co-founder
    • Pet Project: http://www.resultri.com
  3. What’s the Problem?
    • Lots of Data
    • Big Data
    • SaaS/Cloud Computing
    • Big Users
  4. Map Reduce
    MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.
    http://research.google.com/archive/mapreduce.html
  5. In Detail
    • Developer specifies 2 methods:
      - map (in_key, in_value) -> list(out_key, intermediate_value)
        • Processes input data
        • Produces key/value pairs
      - reduce (out_key, list(intermediate_value)) -> list(out_value)
        • Combines all intermediate values for a particular key
        • Produces a set of merged output values
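The two signatures above can be sketched in plain JavaScript. This is a single-process illustration only, not a distributed implementation; the `mapReduce` driver and the word-count functions are hypothetical names chosen for this example.

```javascript
// Hypothetical single-process driver: run map over every input pair,
// group the intermediate values by out_key, then reduce each group.
function mapReduce(inputs, map, reduce) {
  const groups = new Map();
  for (const [inKey, inValue] of inputs) {
    for (const [outKey, value] of map(inKey, inValue)) {
      if (!groups.has(outKey)) groups.set(outKey, []);
      groups.get(outKey).push(value);
    }
  }
  const output = {};
  for (const [outKey, values] of groups) {
    output[outKey] = reduce(outKey, values);
  }
  return output;
}

// map (in_key, in_value) -> list(out_key, intermediate_value)
const wordCountMap = (docName, text) =>
  text.split(/\s+/).filter(Boolean).map((word) => [word.toLowerCase(), 1]);

// reduce (out_key, list(intermediate_value)) -> list(out_value)
const wordCountReduce = (word, counts) => [counts.reduce((a, b) => a + b, 0)];

const counts = mapReduce(
  [["d1", "map reduce map"], ["d2", "reduce"]],
  wordCountMap,
  wordCountReduce
);
console.log(counts);
```

In a real cluster the grouping step is the distributed "shuffle"; here it is just an in-memory `Map`.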
  6. Couchbase Open Source Project
    • Leading NoSQL database project focused on distributed database technology and surrounding ecosystem
    • Supports both key-value and document-oriented use cases
    • All components are available under the Apache 2.0 Public License
    • Obtained as packaged software in both enterprise and community editions
  7. Couchbase Server Core Principles
    • Easy Scalability: grow cluster without application changes, without downtime, with a single click
    • Consistent High Performance: consistent sub-millisecond read and write response times with consistent high throughput
    • Always On 24x365: no downtime for software upgrades, hardware maintenance, etc.
    • Flexible Data Model: JSON document model with no fixed schema
  8. Additional Couchbase Server Features
    • Built-in clustering – all nodes equal
    • Data replication with auto-failover
    • Zero-downtime maintenance
    • Built-in managed cache
    • Append-only storage layer
    • Online compaction
    • Monitoring and admin API & UI
    • SDK for a variety of languages
  9. Couchbase Server 2.0 Architecture
    [Architecture diagram]
    • Data Manager
      - Query API / Query Engine (port 8092)
      - Moxi (port 11211, Memcapable 1.0)
      - Memcached with Couchbase EP Engine (port 11210, Memcapable 2.0)
      - Storage interface to the New Persistence Layer
    • Cluster Manager (Erlang/OTP)
      - REST management API/Web UI (HTTP, port 8091)
      - Erlang port mapper (port 4369)
      - Distributed Erlang (ports 21100 - 21199)
      - Components, labelled “on each node” or “one per cluster” in the diagram: heartbeat, process monitor, global singleton supervisor, configuration manager, rebalance orchestrator, node health monitor, vBucket state and replication manager
  10. Couchbase Server 2.0 Architecture
    [Architecture diagram, annotated by implementation language]
    • RAM Cache, Indexing & Persistence Management (C & V8)
      - Query API / Query Engine (port 8092)
      - Moxi (port 11211, Memcapable 1.0)
      - Couchbase EP Engine (port 11210, Memcapable 2.0) with Object-level Cache
      - Storage interface to the New Persistence Layer (Disk Persistence)
      - Reference: “The Unreasonable Effectiveness of C” by Damien Katz
    • Server/Cluster Management & Communication (Erlang, Erlang/OTP)
      - REST management API/Web UI (HTTP, port 8091)
      - Erlang port mapper (port 4369)
      - Distributed Erlang (ports 21100 - 21199)
      - Components, labelled “on each node” or “one per cluster” in the diagram: heartbeat, process monitor, global singleton supervisor, configuration manager, rebalance orchestrator, node health monitor, vBucket state and replication manager
  11. Basic Operation
    • Docs distributed evenly across servers
    • Each server stores both active and replica docs
      - Only one server is active for a given document at a time
    • Client library provides app with simple interface to database
    • Cluster map provides map to which server a doc is on
      - App never needs to know
    • App reads, writes, updates docs
    • Multiple app servers can access the same document at the same time
    [Diagram: COUCHBASE SERVER CLUSTER with user-configured replica count = 1. Servers 1 to 3 each hold ACTIVE and REPLICA docs; app servers 1 and 2 READ/WRITE/UPDATE through the Couchbase client library and its cluster map.]
  12. Look at a document
    Key -> JSON OBJECT (“DOCUMENT”):
      {
        “string” : “string”,
        “string” : value,
        “string” : { “string” : “string”, “string” : value },
        “string” : [ array ]
      }
    • How to find a document based on its attributes?
      - get employee by email
      - get products by type
      - ...
    • You need to look “into” the document/value
  13. Create the index
    Document (shown repeatedly on the slide):
      { "name": "Aventinus", "abv": 8.2, "ibu": 0, "srm": 0, "upc": 0, "type": "beer", "brewery_id": "110f1f2012", "updated": "2010-07-22 20:00:20", "description": "Dark-ruby, ... Weizenbock", "category": "German Ale" }
    Metadata:
      { "id": "110f37fa30", "rev": "1-000000000", "expiration": 0, "flags": 0, "type": "json" }
    Index:
      Key          Value
      Aventinus    8.2
      Avenue Ale   4.1
      ...          ...
  14. Concrete Example
    • This map function:
      - receives the document and metadata
      - as a developer you just have to emit the K,V
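A minimal sketch of such a map function, assuming the `function (doc, meta)` shape and the `emit(key, value)` call that Couchbase 2.0 views provide. The `emit` stub, the `bucket` array and the example.com addresses are hypothetical stand-ins so the snippet runs outside the server.

```javascript
// Hypothetical stand-in for the view engine: collects every emitted row
// together with the id of the document that produced it.
const rows = [];
let currentId = null;
function emit(key, value) {
  rows.push({ id: currentId, key, value });
}

// The view's map function: index documents by their email attribute.
function map(doc, meta) {
  if (doc.email) {
    emit(doc.email, null);
  }
}

// Hypothetical sample documents keyed like the u::N ids used in the deck.
const bucket = [
  { meta: { id: "u::1" }, doc: { name: "Bob", email: "bob@example.com" } },
  { meta: { id: "u::2" }, doc: { name: "No email on this one" } },
  { meta: { id: "u::3" }, doc: { name: "Alice", email: "alice@example.com" } },
];

for (const { doc, meta } of bucket) {
  currentId = meta.id;
  map(doc, meta);
}

// The index keeps rows ordered by key.
rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
console.log(rows);
```

Documents without an `email` attribute emit nothing and therefore do not appear in the index.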
  15. Range queries on the index
    doc.email            meta.id
    [email protected]    u::1
    [email protected]    u::7
    [email protected]    u::2
    [email protected]    u::5
    [email protected]    u::6
    yeti@couchbase.com   u::4
    [email protected]    u::3

    ?startkey="b1"&endkey="zz"
    ?startkey="bz"&endkey="zn"
    Each query pulls the index keys in the UTF-8 range specified by the startkey and endkey.
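The effect of `startkey` and `endkey` can be sketched as a filter over an index sorted by key. This is an illustration under assumptions: the addresses are hypothetical, the end key is treated as inclusive, and plain JavaScript string comparison stands in for the server's own collation rules.

```javascript
// Hypothetical index rows, already sorted by key (doc.email -> meta.id).
const index = [
  { key: "anna@example.com", id: "u::1" },
  { key: "ben@example.com", id: "u::7" },
  { key: "carl@example.com", id: "u::2" },
  { key: "maria@example.com", id: "u::5" },
  { key: "zoe@example.com", id: "u::3" },
];

// Return every row whose key lies between startkey and endkey (inclusive).
function rangeQuery(rows, startkey, endkey) {
  return rows.filter((row) => row.key >= startkey && row.key <= endkey);
}

const wide = rangeQuery(index, "b1", "zz").map((row) => row.id);
const narrow = rangeQuery(index, "bz", "zn").map((row) => row.id);
console.log(wide, narrow);
```

Because the index is ordered, a real engine can seek to `startkey` and stop at `endkey` instead of scanning every row as this filter does.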
  16. Querying multiple keys
    doc.email            meta.id
    [email protected]    u::1
    [email protected]    u::7
    [email protected]    u::2
    [email protected]    u::5
    [email protected]    u::6
    yeti@couchbase.com   u::4
    [email protected]    u::3

    ?keys=["[email protected]", "[email protected]"]
    Query multiple keys in the set (array notation).
  17. Indexing and Querying
    • Indexing work is distributed amongst nodes
      - Large data sets possible
      - Parallelizes the effort
    • Each node has the index for the data stored on it
    • Queries combine the results from required nodes
    [Diagram: COUCHBASE SERVER CLUSTER with user-configured replica count = 1. Servers 1 to 3 each hold ACTIVE and REPLICA docs; app servers 1 and 2 send a query through the Couchbase client library and its cluster map.]
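The "queries combine the results from required nodes" step can be sketched as a scatter-gather: each node answers from its local index and the partial results are merged in key order. The node contents and function names below are hypothetical.

```javascript
// Hypothetical local indexes: each node only indexes the documents it stores.
const nodes = [
  [{ key: "apple", id: "doc1" }, { key: "mango", id: "doc2" }],
  [{ key: "banana", id: "doc3" }, { key: "melon", id: "doc4" }],
  [{ key: "cherry", id: "doc5" }, { key: "zucchini", id: "doc6" }],
];

// Each node answers the range query from its own index.
function queryNode(localIndex, startkey, endkey) {
  return localIndex.filter((row) => row.key >= startkey && row.key <= endkey);
}

// Scatter the query to every node, then merge the partial results by key.
function queryCluster(allNodes, startkey, endkey) {
  return allNodes
    .flatMap((localIndex) => queryNode(localIndex, startkey, endkey))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const merged = queryCluster(nodes, "b", "n").map((row) => row.key);
console.log(merged);
```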
  18. Couchbase Server 2.0: Views
    • Views can cover a few different use cases
      - Primary index
      - Simple secondary indexes (the most common)
      - Complex secondary, tertiary and composite indexes
      - Aggregation functions (reduction)
        • Example: count the number of “North American Ales”
      - Organizing related data
    • Built using Map/Reduce
      - Map function creates a matrix from document fields
      - Reduce function summarizes (reduces) information
  19. Distributed Index Build Phase
    • Optimized for lookups, in-order access and aggregations
    • All view reads come from disk (different performance profile)
    • View builds run against every document on every node
      - This is why you should group views in a design document
    • Automatically kept up to date
      - “Incremental” Map Reduce
  20. Dynamic Range Queries with Optional Aggregation
    • Efficiently fetch a row or group of related rows
    • Queries use cached values from B-tree inner nodes when possible
    • Take advantage of in-order tree traversal with group_level queries
    Example: ?startkey="J"&endkey="K" returns {"rows":[{"key":"Juneau","value":null}]}
    [Diagram: servers 1 to 3, each holding active docs and replica docs.]
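What a `group_level` query returns can be sketched as grouping array keys by their first N elements and reducing each group. The `[year, month, day]` keys and the `queryGrouped` helper are hypothetical; a real view engine would reuse reductions cached in the B-tree rather than regroup rows like this.

```javascript
// Hypothetical map output with array keys [year, month, day], sorted by key.
const rows = [
  { key: [2012, 12, 30], value: 1 },
  { key: [2013, 1, 5], value: 1 },
  { key: [2013, 1, 20], value: 1 },
  { key: [2013, 4, 25], value: 1 },
];

// Group rows by the first `groupLevel` elements of the key, then reduce each group.
function queryGrouped(sortedRows, groupLevel, reduce) {
  const groups = new Map();
  for (const row of sortedRows) {
    const groupKey = row.key.slice(0, groupLevel);
    const id = JSON.stringify(groupKey);
    if (!groups.has(id)) groups.set(id, { key: groupKey, values: [] });
    groups.get(id).values.push(row.value);
  }
  return [...groups.values()].map((g) => ({ key: g.key, value: reduce(g.values) }));
}

// Counting reduce: one per emitted row.
const count = (values) => values.length;

const byYear = queryGrouped(rows, 1, count);
const byMonth = queryGrouped(rows, 2, count);
console.log(byYear, byMonth);
```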
  21. Append Only Index
    • Disk activity is slow
    • Updating disk blocks is very slow
    • Appending new data to the end of the current file is fast
    • Overhead of reverse reading is small
    • Because existing blocks are not re-used, this can lead to fragmentation
      - Couchbase will compact the index automatically
    [Diagram: docs pass through the view processor to disk; labels: Original, Changed Documents Appended.]
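The trade-off above can be sketched with an in-memory stand-in for an append-only file: updates append a new record, reads scan backwards from the end, and compaction drops superseded records. The `AppendOnlyFile` class is hypothetical and does not reflect the actual on-disk format.

```javascript
// Hypothetical in-memory model of an append-only file.
class AppendOnlyFile {
  constructor() {
    this.records = [];
  }

  // Never overwrite: every write is appended to the end.
  write(key, value) {
    this.records.push({ key, value });
  }

  // Read in reverse so the most recent version of a key is found first.
  read(key) {
    for (let i = this.records.length - 1; i >= 0; i--) {
      if (this.records[i].key === key) return this.records[i].value;
    }
    return undefined;
  }

  // Compaction: keep only the latest record per key.
  compact() {
    const latest = new Map();
    for (const record of this.records) latest.set(record.key, record.value);
    this.records = [...latest].map(([key, value]) => ({ key, value }));
  }
}

const file = new AppendOnlyFile();
file.write("a", 1);
file.write("b", 2);
file.write("a", 3); // supersedes the first record without touching it
const sizeBefore = file.records.length;
file.compact();
const sizeAfter = file.records.length;
console.log(sizeBefore, sizeAfter, file.read("a"));
```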
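  22. Adding a new Document
    [B-tree diagram with a cached reduction (count) on every node]
    • Leaves: A B C (A-C 3), D F (D-F 2), G H (G-H 2), I K L (I-L 3), N O Q R (N-R 4)
    • Inner nodes: A-H 7, I-R 7
    • Root: A-R 14
    • Adding the new key M produces new reductions: M-R 5, I-R 8, and a new root A-R 15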
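The slide's numbers (root A-R 14 becoming A-R 15 after adding M) can be reproduced with a small sketch in which every node caches a count and an insert creates new nodes along one path only. This is a simplification under assumptions: the helper names are hypothetical and node splitting is left out.

```javascript
// Every node caches the reduction (here: a count) of the keys below it.
function leaf(keys) {
  return { keys, count: keys.length };
}
function inner(children) {
  return { children, count: children.reduce((sum, c) => sum + c.count, 0) };
}
function lastKey(node) {
  return node.keys
    ? node.keys[node.keys.length - 1]
    : lastKey(node.children[node.children.length - 1]);
}

// Insert by copying the path from the root to one leaf; untouched subtrees are shared.
function insert(node, key) {
  if (node.keys) return leaf([...node.keys, key].sort());
  let idx = node.children.findIndex((child) => lastKey(child) >= key);
  if (idx === -1) idx = node.children.length - 1;
  const children = node.children.slice();
  children[idx] = insert(children[idx], key);
  return inner(children);
}

const oldRoot = inner([
  inner([leaf(["A", "B", "C"]), leaf(["D", "F"]), leaf(["G", "H"])]),
  inner([leaf(["I", "K", "L"]), leaf(["N", "O", "Q", "R"])]),
]);
const newRoot = insert(oldRoot, "M");
console.log(oldRoot.count, newRoot.count);
```

Only the nodes on the path to M are recomputed, which is why a reduce query over an unchanged subtree can reuse its cached value.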
  23. What about Reduce?
    • Out-of-the-box functions:
      - _count()
      - _sum()
      - _stats()
    • Create your own if needed:
        function(key, values, rereduce) {
          if (rereduce) {
            // values holds partial counts from earlier reduce calls: add them up
            var result = 0;
            for (var i = 0; i < values.length; i++) {
              result += values[i];
            }
            return result;
          } else {
            // values holds the raw map output: count the rows
            return values.length;
          }
        }
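Why the `rereduce` flag matters can be seen by calling the slide's function by hand on two chunks of map output and then on the partial results. The name `countReduce` and the chunk sizes are arbitrary choices for this illustration.

```javascript
// The custom reduce from the slide, given a name so it can be called directly.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    // values are partial counts: sum them
    var result = 0;
    for (var i = 0; i < values.length; i++) {
      result += values[i];
    }
    return result;
  } else {
    // values are raw map outputs: count them
    return values.length;
  }
}

// First pass: two chunks of emitted values are reduced independently.
const partialA = countReduce(null, [1, 1, 1], false);
const partialB = countReduce(null, [1, 1], false);

// Second pass: the partial results are combined with rereduce = true.
const total = countReduce(null, [partialA, partialB], true);

// Without the rereduce branch the second pass would count the partials instead.
const wrong = countReduce(null, [partialA, partialB], false);
console.log(total, wrong);
```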
  24. Reduce Function
    • Key and arrays of values as parameters
    • Written in JavaScript
    • Called after the map function
    • Used to reduce the result of a map to single values
    • Used with grouping
    • Can be ignored when querying
      - reuse the index
  25. Reduce in Action
    • Map() result:
        Key                      Value
        Belgian-Style Dubbel     1
        Belgian-Style Dubbel     1
        Belgian-Style Dubbel     1
        Belgian-Style Pale Ale   1
        Belgian-Style White      1
        Belgian-Style White      1
        ...                      ...
    • Reduce(): _count()
    • Result:
        Key                      Value
        Belgian-Style Dubbel     3
        Belgian-Style Pale Ale   1
        Belgian-Style White      2
  26. How to use it?
    • Use the client SDK to call the view:
        View view = client.getView("beer", "by_name");
        Query query = new Query();
        query.setIncludeDocs(true)
             .setLimit(20)
             .setRangeStart(ComplexKey.of(startKey))
             .setRangeEnd(ComplexKey.of(startKey + "\uefff"));
        ViewResponse result = client.query(view, query);
        for (ViewRow row : result) {
          ....
        }
  27. Hadoop ≠ Couchbase
    Hadoop:
      • Deals with “Big Data”
      • “More” is better than “Faster”
      • Batch oriented
      • Usually used to “extract/transform” data
      • Fully distributed
        - Map, Shuffle, Reduce
    Couchbase:
      • Distributed
      • Executed where the document is
      • Deals with “indexing” data
      • As fast as possible
      • Used to query the data in the database
  28. Map Reduce in Couchbase
    • Like in many other NoSQL databases: used for queries!
    • Indexes are distributed on each node of the cluster
    • Indexes are updated incrementally
    • Write your Map Reduce in JavaScript