
Why and How to integrate Hadoop and NoSQL?

Learn why and how you can integrate Hadoop and NoSQL. This presentation shows some use cases, and concrete examples using Hadoop and Couchbase.

Tugdual Grall

June 10, 2013

Transcript

  1. Goto Night CPH, June 6th 2013
    How to integrate Hadoop with your NoSQL database?
    Tugdual "Tug" Grall, Technical Evangelist
    Monday, June 10, 13
  2. About Me
    • Tugdual "Tug" Grall
      - Couchbase: Technical Evangelist
      - eXo: CTO
      - Oracle: Developer/Product Manager, mainly Java/SOA
      - Developer in consulting firms
    • Web
      - @tgrall
      - http://blog.grallandco.com
      - tgrall
      - NantesJUG co-founder
    • Pet Project: http://www.resultri.com
  3. Big Data
    [Chart: Trillions of Gigabytes (Zettabytes) for 2000, 2006 and 2011; axis from 0 to 2.00]
    • Structured Data
    • Unstructured and Semi-Structured Data: Text, Log Files, Click Streams, Blogs, Tweets, Audio, Video, etc.
    • High Data Variety and Velocity
    • More Flexible Data Model Required
    Source: IDC 2011 Digital Universe Study (http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm)
  4. $30B Database Market Being Disrupted
    [Chart: market share by technology (Relational Technology, NoSQL Technology, Other)]
    • 2013: 95% Relational Technology
    • 2027: <50%? Relational Technology
    • All new database growth will be NoSQL
  5. Operational vs. Analytic Databases
    • Analytic Databases (Cloudera, Hortonworks): get insights from data
    • Real-time, Interactive Databases (NoSQL: Couchbase, Mongo): fast access to data
  6. [Survey chart]
    • Lack of flexibility/rigid schemas: 49%
    • Inability to scale out data: 35%
    • Performance challenges: 29%
    • Cost: 16%
    • All of these: 12%
    • Other: 11%
    Source: Couchbase Survey, December 2011, n = 1351.
  7. What is Hadoop?
    • Highly scalable
    • Unstructured data
    • Open source
    • Big Data Operating System
    • Changing the World One Petabyte at a Time
  8. What is Hadoop?
    • Simplest unit of compute and storage
    (Diagram: Application and Data on one machine with CPU and Disks)
  9. What is Hadoop?
    • And when it grows?
    (Diagram: Application, Data)
  10. What is Hadoop?
    • And when it grows more?
  11. What is Hadoop?
    • NoSQL to the rescue
    (Diagram: Application, Data)
  12. What is Hadoop?
    • Hadoop is a different paradigm
    (Diagram: Application, Data)
  13. Ad and offer targeting
    (Diagram with three numbered steps; labels as shown on the slide)
    • events
    • profiles, campaigns
    • profiles, real time campaign statistics
    • 40 milliseconds to respond with the decision.
  14. Moving Parts
    (Diagram: Ad Targeting Platform, Couchbase Server Cluster, Hadoop Cluster, Logs)
    • Logs: flume flow
    • sqoop import
    • sqoop export
  15. Content & Recommendation Targeting
    (Diagram with three numbered steps: Content Oriented Site, Legacy Relational Database)
    • events
    • user profiles
    • make recommendations
  16. Moving Parts
    (Diagram: Content Driven Web Site, Couchbase Server Cluster, Hadoop Cluster, Original RDBMS, Logs)
    • Logs: flume flow
    • sqoop import (from Couchbase and from the original RDBMS)
    • sqoop export
    To keep up with changing needs for richer, more targeted content delivered very quickly to ever larger audiences, the data behind content-driven sites is shifting to Couchbase.
    Hadoop excels at complex analytics that may involve multiple processing steps and incorporate a number of different data sources.
  17. What is Sqoop?
    Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
    sqoop.apache.org
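
The import, transform, export cycle described above can be sketched without a cluster. This is a toy illustration only, not Sqoop itself: an in-memory SQLite database stands in for the RDBMS, a comma-delimited string stands in for files on HDFS, and the table names, columns and values are hypothetical.

```python
import csv
import io
import sqlite3
from collections import defaultdict


def import_table(conn, table):
    """'Import' step: dump every row of a table as comma-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in conn.execute(f"SELECT * FROM {table}"):
        writer.writerow(row)
    return buf.getvalue()


def transform(text):
    """'Transform' step: total the sales amount per zip code."""
    totals = defaultdict(float)
    for zip_code, amount in csv.reader(io.StringIO(text)):
        totals[zip_code] += float(amount)
    return sorted(totals.items())


def export_rows(conn, table, rows):
    """'Export' step: write the aggregated rows back to a table."""
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()


# Hypothetical source and target tables in a stand-in RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (zip TEXT, amount REAL)")
conn.execute("CREATE TABLE zip_profits (zip TEXT, total REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("44000", 10.0), ("75001", 5.0), ("44000", 2.5)],
)

staged = import_table(conn, "sales")               # RDBMS -> text "files"
export_rows(conn, "zip_profits", transform(staged))  # text -> RDBMS
result = conn.execute(
    "SELECT zip, total FROM zip_profits ORDER BY zip"
).fetchall()
print(result)
```

In a real deployment the middle step would be a MapReduce job over HDFS files rather than a single function call.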
  18. What is Sqoop?
    • Traditional ETL
    (Diagram: Data, T, Data, Application)
  19. What is Sqoop?
    • A different paradigm
    (Diagram: Data, Application, Data)
  20. What is Sqoop?
    • A very scalable different paradigm
    (Diagram: Data and Application repeated across nodes)
  21. What is Sqoop?
    • Where did the Transform go?
    (Diagram: Application, Data, and many T steps spread across the nodes)
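
One reading of the last diagram is that the single "T" of traditional ETL becomes many small transforms, each running next to its own slice of the data. A minimal sketch of that idea follows; the thread pool merely stands in for cluster nodes, and the records and the `transform` function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """A stand-in 'T' step: normalise one log record."""
    return record.strip().lower()


def transform_partition(partition):
    """Runs against one partition of the data, like one map task."""
    return [transform(record) for record in partition]


# Hypothetical data, already split into partitions (one per node).
partitions = [
    [" GET /home ", "POST /login"],
    ["GET /Cart"],
    [" get /checkout"],
]

# Each partition is transformed independently and in parallel.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    transformed = list(pool.map(transform_partition, partitions))

print(transformed)
```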
  22. What is Sqoop?
    • Sqoop "SQL-Hadoop"
      - Default connection is via JDBC
    • Lots of custom connectors
      - Couchbase, VoltDB, Vertica
      - Teradata, Netezza
      - Oracle, MySQL, Postgres
  23. Sqoop: Import
    sqoop import --connect jdbc:mysql://rdbms1.demo.com/CRM --table customers
  24. Sqoop: Export
    sqoop export --connect jdbc:mysql://rdbms1.demo.com/ANALYTICS --table sales --export-dir /user/hive/warehouse/zip_profits --input-fields-terminated-by '\0001'
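
When these commands are launched from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical sketch: `sqoop_command` is not part of Sqoop, it only uses flags that appear on the slides (`--connect`, `--table`, `--export-dir`), and it builds the command line without running it.

```python
import shlex


def sqoop_command(tool, connect, table, **options):
    """Build the argument list for `sqoop import` or `sqoop export`."""
    if tool not in ("import", "export"):
        raise ValueError(f"unsupported tool: {tool}")
    argv = ["sqoop", tool, "--connect", connect, "--table", table]
    for name, value in options.items():
        # A keyword such as export_dir=... becomes --export-dir ...
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


import_cmd = sqoop_command(
    "import", "jdbc:mysql://rdbms1.demo.com/CRM", "customers"
)
export_cmd = sqoop_command(
    "export",
    "jdbc:mysql://rdbms1.demo.com/ANALYTICS",
    "sales",
    export_dir="/user/hive/warehouse/zip_profits",
)

# shlex.join quotes arguments where needed for display in a shell.
print(shlex.join(import_cmd))
print(shlex.join(export_cmd))
```

Passing the list form to `subprocess.run` avoids shell-quoting problems with values such as field delimiters.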
  25. Sqoop: Import
    sqoop import --connect http://localhost:8091/pools --table DUMP
  26. Sqoop: Import
    (Diagram: Sqoop Client, Metadata, Launches MapReduce Job with three Map tasks, each paired with HDFS)
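
The diagram suggests that the client launches a job whose map tasks each move one share of the data. The slide does not say how the work is divided; the sketch below assumes, for illustration only, a numeric key range cut into contiguous slices, one per map task. The function name and the numbers are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer keys than mappers
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Three map tasks sharing keys 1..10.
splits = split_ranges(1, 10, 3)
print(splits)
```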
  27. Sqoop: Export
    sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir /user/hive/profiles/recommendation --username social
  28. Sqoop: Export
    (Diagram: Sqoop Client, Metadata, Launches MapReduce Job with three Map tasks, each paired with HDFS)
  29. Couchbase Server Core Principles
    • Easy Scalability: grow the cluster without application changes and without downtime, with a single click
    • Consistent High Performance: consistent sub-millisecond read and write response times with consistent high throughput
    • Always On 24x365: no downtime for software upgrades, hardware maintenance, etc.
    • Flexible Data Model: JSON document model with no fixed schema.
  30. Couchbase Server 2.0
    • Data Manager
      - Query API (8092), Query Engine
      - Memcapable 1.0 via Moxi (11211)
      - Memcapable 2.0 (11210)
      - Memcached
      - Couchbase EP Engine
      - storage interface
      - New Persistence Layer
    • Cluster Manager (Erlang/OTP)
      - REST management API/Web UI (HTTP 8091)
      - Erlang port mapper (4369)
      - Distributed Erlang (21100 - 21199)
      - Heartbeat, Process monitor, Global singleton supervisor, Configuration manager, Rebalance orchestrator, Node health monitor, vBucket state and replication manager (diagram labels: "on each node", "one per cluster")
  31. Couchbase Server 2.0
    (Same diagram as the previous slide, without the Data Manager and Cluster Manager labels)
    • Query API (8092), Query Engine
    • Memcapable 1.0 via Moxi (11211), Memcapable 2.0 (11210), Memcached
    • Couchbase EP Engine, storage interface, New Persistence Layer
    • Erlang/OTP: REST management API/Web UI (HTTP 8091), Erlang port mapper (4369), Distributed Erlang (21100 - 21199)
    • Heartbeat, Process monitor, Global singleton supervisor, Configuration manager, Rebalance orchestrator, Node health monitor, vBucket state and replication manager (diagram labels: "on each node", "one per cluster")
  32. The Classic Order Entry Structure
    http://martinfowler.com/bliki/AggregateOrientedDatabase.html
    "Relational databases were not designed with clusters in mind, which is why people have cast around for an alternative. Storing aggregates as fundamental units makes a lot of sense for running on a cluster."
  33. Aggregate by Comparison
    o::1001
    {
      uid: "ji22jd",
      customer: "Ann",
      line_items: [
        { sku: 0321293533, quan: 3, unit_price: 48.0 },
        { sku: 0321601912, quan: 1, unit_price: 39.0 },
        { sku: 0131495054, quan: 1, unit_price: 51.0 }
      ],
      payment: { type: "Amex", expiry: "04/2001", last5: 12345 }
    }
    • Easy to distribute data
    • Makes sense to application programmers
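
To see why an aggregate "makes sense to application programmers", here is the same order handled as a single document. This is a sketch under assumptions: a plain dict stands in for the database, `order_total` is a hypothetical helper, and the SKUs are written as strings because numbers with leading zeros are not valid JSON.

```python
import json

order = {
    "uid": "ji22jd",
    "customer": "Ann",
    "line_items": [
        {"sku": "0321293533", "quan": 3, "unit_price": 48.0},
        {"sku": "0321601912", "quan": 1, "unit_price": 39.0},
        {"sku": "0131495054", "quan": 1, "unit_price": 51.0},
    ],
    "payment": {"type": "Amex", "expiry": "04/2001", "last5": 12345},
}


def order_total(doc):
    """Sum quantity x unit price over the embedded line items."""
    return sum(item["quan"] * item["unit_price"] for item in doc["line_items"])


# One key, one value: the whole aggregate is stored and fetched together.
store = {"o::1001": json.dumps(order)}
loaded = json.loads(store["o::1001"])
total = order_total(loaded)
print(total)
```

No join is needed to compute the total, because the line items travel with the order.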
  34. Basic Operations
    • Docs distributed evenly across servers
    • Each server stores both active and replica docs
      - Only one server active at a time
    • Client library provides app with simple interface to database
    • Cluster map provides map to which server doc is on
      - App never needs to know
    • App reads, writes, updates docs
    • Multiple app servers can access same document at same time
    (Diagram: COUCHBASE SERVER CLUSTER with SERVER 1, SERVER 2 and SERVER 3, each holding ACTIVE and REPLICA docs; APP SERVER 1 and APP SERVER 2, each with the COUCHBASE Client Library and CLUSTER MAP; READ/WRITE/UPDATE; User Configured Replica Count = 1)
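
The role of the cluster map can be sketched as a two-step lookup: hash the document key to a bucket, then look the bucket up in a table of servers. This is a simplified illustration, not the exact algorithm of the Couchbase client library; the bucket count, server names and round-robin assignment are assumptions.

```python
import zlib

NUM_VBUCKETS = 8  # kept small for readability (assumption)
SERVERS = ["server1", "server2", "server3"]

# Cluster map: vBucket number -> (active server, replica server).
# Replica count is 1, and the replica always lives on a different server.
CLUSTER_MAP = {
    vb: (SERVERS[vb % len(SERVERS)], SERVERS[(vb + 1) % len(SERVERS)])
    for vb in range(NUM_VBUCKETS)
}


def vbucket_for(key):
    """Hash a document key to a vBucket number."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def locate(key):
    """Return the servers holding the active and replica copy of a key."""
    active, replica = CLUSTER_MAP[vbucket_for(key)]
    return {"active": active, "replica": replica}


print(locate("o::1001"))
```

Because the lookup happens inside the client library, application code only ever passes a key.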
  35. Indexing
    • Indexing work is distributed amongst nodes
    • Large data set possible
    • Parallelize the effort
    • Each node has index for data stored on it
    • Queries combine the results from required nodes
    (Diagram: a Query sent from the app servers to the COUCHBASE SERVER CLUSTER; SERVER 1, SERVER 2 and SERVER 3 each hold ACTIVE and REPLICA docs)
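
"Each node has index for data stored on it" and "queries combine the results" amount to a scatter-gather pattern. The sketch below assumes each node keeps a small sorted index of (customer, document id) pairs; the node names and entries are hypothetical.

```python
import heapq

# Each node indexes only the documents it stores, sorted by index key.
node_indexes = {
    "server1": [("Ann", "o::1001"), ("Dan", "o::1004")],
    "server2": [("Bob", "o::1002")],
    "server3": [("Ann", "o::1005"), ("Cy", "o::1003")],
}


def query(indexes, key=None):
    """Scatter the query to every node, then merge the sorted partial results."""
    partials = []
    for entries in indexes.values():
        # Each node filters its own local index.
        partials.append([e for e in entries if key is None or e[0] == key])
    # Gather: merge the already-sorted partial results into one sorted list.
    return list(heapq.merge(*partials))


print(query(node_indexes, "Ann"))
print(query(node_indexes))
```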
  36. Map Reduce ...
    • Deal with "Big Data"
    • "More" is better than "Faster"
    • Batch Oriented
    • Usually used to "extract/transform" data
    • Fully distributed
      - Map, Shuffle, Reduce
    ≠
    • Distributed
    • Executed where the document is
    • Deal with "indexing" data
    • As fast as possible
    • Use to query the data in the Database
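
The three phases named above (Map, Shuffle, Reduce) can be shown in a few lines on a single machine. This is a minimal sketch with hypothetical log lines; a real Hadoop job distributes each phase across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


lines = ["GET /home", "GET /cart", "POST /cart"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```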
  37. Conclusion
    • Big Data and Big Users working together
    • Use Hadoop to store "everything"
      - Batch oriented
      - Complex data processing
    • MapReduce
    • Expose a subset of the dataset to your application
      - Real time analytics
      - Low latency
      - Simple data interactions and queries