
Sharing a Startup’s Big Data Lessons

Presented in front of an internal audience at Sapient, Boston on 20/2/2013.

George P. Stathis

February 20, 2013

Transcript

  1. Sharing a Startup’s Big Data Lessons
    Experiences with non-RDBMS solutions at

  2. Who we are
    • A search engine
    • A people search engine
    • An influencer search engine
    • Subscription-based

  3. George Stathis
    VP Engineering
    14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.

  4. What’s this talk about?
    • Share what we know about Big Data/NoSQL: what’s behind the buzz words?
    • Our reasons and method for picking a NoSQL database
    • Share the lessons we learned going through the process

  5. Big Data/NoSQL: behind the buzz words

  6. What is Big Data?
    • 3 Vs:
    – Volume
    – Velocity
    – Variety

  7. What is Big Data? Volume + Velocity
    • Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools. E.g.
    – big science
    – web logs
    – RFID
    – sensor networks
    – social networks
    – social data
    – internet text and documents
    – internet search indexing
    – call detail records
    – astronomy
    – atmospheric science
    – genomics
    – biogeochemical
    – military surveillance
    – medical records
    – photography archives
    – video archives
    – large-scale e-commerce

  8. What is Big Data? Variety
    • Big Data is varied and unstructured
    Traditional static reports
    Analytics, exploration & experimentation

  9. What is Big Data?
    • Scaling data processing cost-effectively
    $$$$$$$$
    $$$$$
    $$$

  10. What is NoSQL?
    • NoSQL ≠ No SQL
    • NoSQL ≈ Not Only SQL
    • NoSQL addresses RDBMS limitations; it’s not about the SQL language
    • RDBMS = static schema
    • NoSQL = schema flexibility; don’t have to know exact structure before storing

  11. What is Distributed Computing?
    • Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
    • Allows computations to be accomplished in acceptable timeframes
    • Distributed computation approaches were developed to leverage multiple machines: MapReduce
    • With MapReduce, the program goes to the data since the data is too big to move
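
As a rough illustration of the divide-and-combine idea on this slide, here is a minimal single-process sketch of the MapReduce pattern. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count task are assumptions chosen for illustration, not something from the talk; a real framework such as Hadoop would run the map tasks on the machines that already hold the data.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

def word_count(documents):
    # Each document stands in for a chunk of data living on a separate node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big buzz", "data at scale"])
```

Because each call to `map_phase` only sees its own document, the map step can run wherever that document is stored; only the small (key, value) pairs need to travel.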

  12. What is MapReduce?
    Source: developer.yahoo.com

  13. What is MapReduce?
    • MapReduce = batch processing = analytical
    • MapReduce ≠ interactive
    • Therefore many NoSQL solutions don’t outright replace warehouse solutions, they complement them
    • RDBMS is still safe :-)

  14. What is Big Data? Velocity
    • In some instances, being able to process large amounts of data in real-time can yield a competitive advantage. E.g.
    – Online retailers leveraging buying history and click-through data for real-time recommendations
    • No time to wait for MapReduce jobs to finish
    • Solutions: streaming processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick to read key/value stores (e.g. distributed hashes)
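
The pre-computing option above can be sketched as follows: counts are updated as each event arrives, so a read is a dictionary lookup rather than a batch job. The class name `RunningCounts` and the product ids are hypothetical, used only to illustrate the idea.

```python
from collections import Counter

class RunningCounts:
    """Keep per-key counts up to date as events arrive, so reads are a lookup."""

    def __init__(self):
        self._counts = Counter()

    def record(self, product_id):
        # Called once per incoming event (e.g. a product click).
        self._counts[product_id] += 1

    def count(self, product_id):
        return self._counts[product_id]

    def top(self, n):
        """Return the n most frequent keys with their counts."""
        return self._counts.most_common(n)

clicks = RunningCounts()
for product_id in ["p1", "p2", "p1", "p3", "p1", "p2"]:
    clicks.record(product_id)
```

The trade-off is that the aggregates to maintain must be decided up front, whereas a MapReduce job can answer new questions over the raw data after the fact.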

  15. What is Big Data? Data Science
    • Emergence of Data Science
    • Data Scientist ≈ Statistician
    • Possess scientific discipline & expertise
    • Formulate and test hypotheses
    • Understand the math behind the algorithms so they can tweak when they don’t work
    • Can distill the results into an easy to understand story
    • Help businesses gain actionable insights

  16. Big Data Landscape
    Source: capgemini.com

  17. Big Data Landscape
    Source: capgemini.com

  18. Big Data Landscape
    Source: capgemini.com

  19. So what’s Traackr and why did we need a NoSQL DB?

  20. Traackr: context
    • A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web that can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?

  21. Traackr: a people search engine
    Up to 50 keywords per search!

  22. Traackr: a people search engine
    People as search results
    Content aggregated by author
    Proprietary 3-scale ranking

  23. Traackr: 30,000 feet
    Acquisi

  24. NoSQL is usually associated with “Web Scale” (Volume & Velocity)

  25. Do we fit the “Web scale” profile?
    • In terms of users/traffic?

  26. Source: compete.com

  27. Source: compete.com

  28. Source: compete.com

  29. Source: compete.com

  30. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?

  31. PRIMARY> use traackr
    switched to db traackr
    PRIMARY> db.stats()
    {
     "db" : "traackr",
     "collections" : 12,
     "objects" : 68226121,
     "avgObjSize" : 2972.0800625760330,
     "dataSize" : 202773493971,
     "storageSize" : 221491429671,
     "numExtents" : 199,
     "indexes" : 33,
     "indexSize" : 27472394891,
     "fileSize" : 266623699968,
     "nsSizeMB" : 16,
     "ok" : 1
    }
    That’s a quarter of a terabyte …
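
The "quarter of a terabyte" figure can be checked directly from the numbers in the output above. This is a small sketch, assuming the sizes are reported in bytes (the default for `db.stats()`); the variable names are my own.

```python
# Values copied from the db.stats() output on the slide (assumed to be bytes).
stats = {
    "objects": 68226121,
    "dataSize": 202773493971,
    "storageSize": 221491429671,
    "indexSize": 27472394891,
    "fileSize": 266623699968,
}

TB = 10**12   # decimal terabyte
TIB = 2**40   # binary tebibyte

# Average document size should agree with the reported avgObjSize.
avg_obj_size = stats["dataSize"] / stats["objects"]

# On-disk footprint expressed in both unit conventions.
file_size_tb = stats["fileSize"] / TB
file_size_tib = stats["fileSize"] / TIB

# How large the indexes are relative to the data they index.
index_share = stats["indexSize"] / stats["dataSize"]
```

Either way of counting lands close to 0.25: roughly 0.27 TB in decimal units and 0.24 TiB in binary units, with indexes adding about 14% on top of the raw data.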

  32. Wait! What? My Synology NAS at home can hold 2TB!

  33. No need for us to track the entire web
    Web Content
    Influencer Content
    Not at scale :-)

  34. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?

  35. Variety view of “Web Scale”
    Web data is:
    Heterogeneous
    Unstructured (text)

  36. Visualization of the Internet, Nov. 23rd 2003
    Source: http://www.opte.org/

  37. Data sources are isolated islands of rich data with loose links to one another

  38. How do we build a database that models all possible entities found on the web?

  39. Modeling the web: the RDBMS way

  40. Source: socialbutterflyclt.com

  41. Influencer data as JSON
    {
      "realName": "David Chancogne",
      "title": "CTO",
      "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
      "primaryAffiliation": "Traackr",
      "email": "[email protected]",
      "location": "Cambridge, MA, United States",
      "siteReferences": [
        {
          "siteUrl": "http://twitter.com/dchancogne",
          "metrics": [
            {
              "value": 216,
              "name": "twitter_followers_count"
            },
            {
              "value": 2107,
              "name": "twitter_statuses_count"
            }
          ]
        },
        {
          "siteUrl": "http://traackr.com/blog/author/david",
          "metrics": [
            {
              "value": 21,
              "name": "google_inbound_links"
            }
          ]
        }
      ]
    }
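
To make the schema-flexibility point concrete, here is a minimal sketch that reads a trimmed copy of the document above and flattens its metrics. The helper `metrics_by_name` is hypothetical; the point is that each site can carry a different set of metrics, and a record with no sites at all is still valid.

```python
import json

# Trimmed copy of the influencer document shown on the slide.
influencer_json = """
{
  "realName": "David Chancogne",
  "title": "CTO",
  "siteReferences": [
    {"siteUrl": "http://twitter.com/dchancogne",
     "metrics": [{"name": "twitter_followers_count", "value": 216},
                 {"name": "twitter_statuses_count", "value": 2107}]},
    {"siteUrl": "http://traackr.com/blog/author/david",
     "metrics": [{"name": "google_inbound_links", "value": 21}]}
  ]
}
"""

def metrics_by_name(document):
    """Flatten every site's metrics into one {name: value} dict."""
    return {
        metric["name"]: metric["value"]
        for site in document.get("siteReferences", [])
        for metric in site.get("metrics", [])
    }

influencer = json.loads(influencer_json)
metrics = metrics_by_name(influencer)
```

Adding a new kind of site or metric needs no schema migration; the reader just has to tolerate missing keys, which is what the `.get(..., [])` defaults do here.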

  42. NoSQL = schema flexibility

  43. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?

  44. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?
    • In terms of the variety of the data? ✓

  45. Traackr’s Datastore Requirements
    • Schema flexibility ✓
    • Good at storing lots of variable length text
    • Batch processing options

  46. Requirement: text storage
    Variable text length:
    140 character tweets < big variance < multi-page blog posts

  47. Requirement: text storage
    RDBMS’ answer to variable text length:
    Plan ahead for largest value
    CLOB/BLOB

  48. Requirement: text storage
    Issues with CLOB/BLOB for us:
    No clue what largest value is
    CLOB/BLOB for tweets = wasted space

  49. Requirement: text storage
    NoSQL solutions are great for text:
    No length requirements (automated chunking)
    Limited space overhead

  50. Traackr’s Datastore Requirements
    • Schema flexibility ✓
    • Good at storing lots of variable length text ✓
    • Batch processing options

  51. Requirement: batch processing
    Some NoSQL solutions come with MapReduce
    Source: http://code.google.com/

  52. Requirement: batch processing
    MapReduce + RDBMS:
    Possible but proprietary solutions
    Usually involves exporting data from RDBMS into a NoSQL system anyway.
    Defeats data locality benefit of MR

  53. Traackr’s Datastore Requirements
    • Schema flexibility ✓
    • Good at storing lots of variable length text ✓
    • Batch processing options ✓
    A NoSQL option is the right fit

  54. How did we pick a NoSQL DB?

  55. Bewildering number of options (early 2010)
    Key/Value Databases
    • Distributed hashtables
    • Designed for high load
    • In-memory or on-disk
    • Eventually consistent
    Column Databases
    • Spreadsheet-like
    • Key is a row id
    • Attributes are columns
    • Columns can be grouped into families
    Document Databases
    • Like Key/Value
    • Value = Document
    • Document = JSON/BSON
    • JSON = Flexible Schema
    Graph Databases
    • Graph Theory G=(E,V)
    • Great for modeling networks
    • Great for graph-based query algorithms

  56. Bewildering number of options (early 2010)

  57. Trimming options
    Graph Databases: while we can model our domain as a graph we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.

  58. Trimming options
    Memcache: memory-based, we need true persistence

  59. Trimming options
    Amazon SimpleDB: not willing to store our data in a proprietary datastore.

  60. Trimming options
    Not willing to store our data in a proprietary datastore.
    Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches

  61. Trimming options
    CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes.

  62. Trimming options
    Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).

  63. Trimming options
    MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.

  64. Trimming options
    Riak: very close but in early 2010, we had adoption questions.

  65. Trimming options
    HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the box" secondary indexes through a contrib and support for batch processing using Hadoop/MR.

  66. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  67. Rewards: Choices

  68. Rewards: Choices
    Source: capgemini.com

  69. Lessons Learned

  70. When Big-Data = Big Architectures
    Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
    Must have a Hadoop HDFS cluster of at least 2x replication factor nodes
    Must have an odd number of ZooKeeper quorum nodes
    Then you can run your HBase nodes but it’s recommended to co-locate regionservers with Hadoop datanodes so you have to manage resources.
    Master/slave architecture means a single point of failure, so you need to protect your master.
    And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
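
The odd-number rule for the quorum nodes follows from majority voting: the ensemble stays available only while more than half of its nodes can reach each other. A small sketch of that arithmetic, with function names of my own choosing:

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of an ensemble."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)

# Failures tolerated for ensembles of 3 to 6 nodes.
table = {n: tolerated_failures(n) for n in range(3, 7)}
```

Going from 3 to 4 nodes, or from 5 to 6, adds a machine to manage without tolerating any additional failure, which is why even-sized ensembles are usually avoided.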

  71. Source: socialbutterflyclt.com

  72. Jokes aside, no one said open source was easy to use

  73. To be expected
    • Hadoop/HBase are designed to move mountains
    • If you want to move big stuff, be prepared to sometimes use big equipment

  74. What it means to a startup
    Development capacity before
    Development capacity after
    Congrats, you are now a sysadmin…

  75. Lessons Learned

  76. Mapping a saved search to a column store
    Name
    Ranks
    References to influencer records

  77. Mapping a saved search to a column store
    Unique key
    “attributes” column family for general attributes
    “influencerId” column family for influencer ranks and foreign keys

  78. Mapping a saved search to a column store
    “name” attribute
    Influencer ranks can be attribute names as well

  79. Mapping a saved search to a column store
    Can get pretty long so needs indexing and pagination

  80. Problem: no out-of-the-box row-based indexing and pagination

  81. Jumping right into the code

  82. Lessons Learned

  83. a few months later…

  84. Need to upgrade to HBase 0.90
    • Making sure to remain on recent code base
    • Performance improvements
    • Mostly to get the latest bug fixes
    No thanks!

  85. Looks like something is missing

  86. Our DB indexes depend on this!

  87. Let’s get this straight
    • HBase no longer comes with secondary indexing out-of-the-box
    • It’s been moved out of the trunk to GitHub
    • Where only one other company besides us seems to care about it

  88. Only one other maintainer besides us

  89. What it means to a startup
    Development capacity
    Congrats, you are now an HBase contrib maintainer…

  90. Source: socialbutterflyclt.com

  91. Lessons Learned

  92. Homegrown HBase Indexes
    Row ids for Posts
    Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters

  93. Homegrown HBase Indexes
    Row ids for Posts
    Find posts for influencer_id_1234

  94. Homegrown HBase Indexes
    Row ids for Posts
    Find posts for influencer_id_5678
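
The prefix-scan idea can be sketched without HBase at all, since it relies only on rows being stored in sorted order by row id. The in-memory `SortedRowStore` below is a stand-in, and the row id layout (`influencer_id_<id>_post_<n>`) is an assumption for illustration; the slides only show that post row ids start with the influencer id.

```python
from bisect import bisect_left

class SortedRowStore:
    """In-memory stand-in for a table whose rows are kept sorted by row id."""

    def __init__(self, row_ids):
        self._row_ids = sorted(row_ids)

    def scan(self, start_row, stop_row):
        """Return row ids in [start_row, stop_row), like a start/stop row scan."""
        lo = bisect_left(self._row_ids, start_row)
        hi = bisect_left(self._row_ids, stop_row)
        return self._row_ids[lo:hi]

def prefix_range(prefix):
    """Start and stop rows that bracket every row id beginning with prefix."""
    # Incrementing the last character gives the first key after the prefix range.
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, stop

posts = SortedRowStore([
    "influencer_id_1234_post_001",
    "influencer_id_1234_post_002",
    "influencer_id_5678_post_001",
    "influencer_id_12_post_001",
])
found = posts.scan(*prefix_range("influencer_id_1234_"))
```

Including the trailing separator in the prefix keeps `influencer_id_12` and `influencer_id_1234` from matching each other's rows. The cost of the lookup depends on the size of the matching range, not the size of the table, which is what makes this a workable substitute for a secondary index.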

  95. Homegrown HBase Indexes
    • No longer depending on unmaintained code
    • Work with out-of-the-box HBase installation

  96. What it means to a startup
    Development capacity
    You are back but you still need to maintain indexing logic

  97. a few months later…

  98. Cracks in the data model
    huffingtonpost.com
    huffingtonpost.com
    http://www.huffingtonpost.com/arianna-huffington/post_1.html
    http://www.huffingtonpost.com/arianna-huffington/post_2.html
    http://www.huffingtonpost.com/arianna-huffington/post_3.html
    http://www.huffingtonpost.com/shaun-donovan/post1.html
    http://www.huffingtonpost.com/shaun-donovan/post2.html
    http://www.huffingtonpost.com/shaun-donovan/post3.html
    writes for
    authored by
    published under
    writes for
    authored by
    published under

  99. Cracks in the data model
    Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties

  100. Cracks in the data model
    Content attribution logic could sometimes mis-attribute posts because of the duplicated data.

  101. Cracks in the data model
    Exacerbated when we started tracking people’s content on a daily basis in mid-2011

  102. Fixing the cracks in the data model
    Normalize the sites
    huffingtonpost.com
    http://www.huffingtonpost.com/arianna-huffington/post_1.html
    http://www.huffingtonpost.com/arianna-huffington/post_2.html
    http://www.huffingtonpost.com/arianna-huffington/post_3.html
    http://www.huffingtonpost.com/shaun-donovan/post1.html
    http://www.huffingtonpost.com/shaun-donovan/post2.html
    http://www.huffingtonpost.com/shaun-donovan/post3.html
    writes for
    authored by
    published under
    writes for
    authored by
    published under
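
A minimal sketch of what normalizing the sites means in data terms, using plain dictionaries. The record shapes and the `normalize` helper are assumptions for illustration and not Traackr's actual model: each influencer starts with an embedded copy of the site, and normalization replaces those copies with a reference to one shared site record.

```python
# Denormalized: each influencer record carries its own copy of the site.
denormalized = {
    "arianna-huffington": {"site": {"url": "huffingtonpost.com"}},
    "shaun-donovan": {"site": {"url": "huffingtonpost.com"}},
}

def normalize(influencers):
    """Split embedded site copies into one shared site table plus references."""
    sites, links = {}, {}
    for influencer_id, record in influencers.items():
        url = record["site"]["url"]
        # Only the first copy of each site is kept as the shared record.
        sites.setdefault(url, {"url": url})
        links[influencer_id] = url
    return sites, links

sites, writes_for = normalize(denormalized)
```

With a single site record there is no second copy to drift out of sync, but finding every influencer who writes for a site now means looking up references by site, which is a secondary-index query.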

  103. Fixing the cracks in the data model
    • Normalization requires stronger secondary indexing
    • Our application layer indexing would need revisiting…again!

  104. What it means to a startup
    Development capacity
    Psych! You are back to writing indexing code.

  105. Source: socialbutterflyclt.com

  106. Lessons Learned

  107. Traackr’s Datastore Requirements (Revisited)
    • Schema flexibility
    • Good at storing lots of variable length text
    • Out-of-the-box SECONDARY INDEX support!
    • Simple to use and administer

  108. NoSQL picking – Round 2 (mid 2011)
    Key/Value Databases
    • Distributed hashtables
    • Designed for high load
    • In-memory or on-disk
    • Eventually consistent
    Column Databases
    • Spreadsheet-like
    • Key is a row id
    • Attributes are columns
    • Columns can be grouped into families
    Document Databases
    • Like Key/Value
    • Value = Document
    • Document = JSON/BSON
    • JSON = Flexible Schema
    Graph Databases
    • Graph Theory G=(V,E)
    • Great for modeling networks
    • Great for graph-based query algorithms
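
One way to compare the four families is to store the same fact in each shape and ask the same question of it. The sketch below does that with plain JavaScript objects; the ids, names, and field layouts are made up for illustration and do not correspond to any particular product's storage format.

```javascript
// Hypothetical data: influencer "inf-1" writes for site "site-1".

// Key/value: the store sees only an opaque value behind a key.
const keyValueStore = new Map([
  ["influencer:inf-1", JSON.stringify({ name: "Jane", sites: ["site-1"] })],
]);

// Column: a row id plus columns grouped into families.
const columnRow = {
  rowId: "inf-1",
  families: { info: { name: "Jane" }, sites: { "site-1": "" } },
};

// Document: the value is a structured document the database can look inside.
const documentRecord = {
  _id: "inf-1",
  name: "Jane",
  siteReferences: [{ siteId: "site-1" }],
};

// Graph: vertices connected by labeled edges.
const graph = {
  vertices: ["inf-1", "site-1"],
  edges: [{ from: "inf-1", to: "site-1", label: "writes for" }],
};

// "Which sites does inf-1 write for?" answered against each shape.
const sitesFor = {
  keyValue: () => JSON.parse(keyValueStore.get("influencer:inf-1")).sites,
  column: () => Object.keys(columnRow.families.sites),
  document: () => documentRecord.siteReferences.map((ref) => ref.siteId),
  graph: () => graph.edges.filter((e) => e.from === "inf-1").map((e) => e.to),
};

for (const [model, query] of Object.entries(sitesFor)) {
  console.log(model, query());
}
```

The reverse question ("who writes for site-1?") is the harder one: in the key/value and column shapes it requires a scan or a separately maintained index, which is why secondary index support weighed so heavily in this round.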


  109. NoSQL picking – Round 2 (mid 2011)
    Nope!

  110. NoSQL picking – Round 2 (mid 2011)
    Graph Databases: we looked at Neo4j a bit closer but passed again for the same reasons as before.

  111. NoSQL picking – Round 2 (mid 2011)
    Memcache: still no.

  112. NoSQL picking – Round 2 (mid 2011)
    Amazon SimpleDB: still no.

  113. NoSQL picking – Round 2 (mid 2011)
    Not willing to store our data in a proprietary datastore.
    Redis and LinkedIn’s Project Voldemort: still no.

  114. NoSQL picking – Round 2 (mid 2011)
    CouchDB: more mature but still no ad-hoc queries.

  115. NoSQL picking – Round 2 (mid 2011)
    Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.

  116. NoSQL picking – Round 2 (mid 2011)
    Riak: strong contender still, but adoption questions remained.

  117. NoSQL picking – Round 2 (mid 2011)
    MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.

  118. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  119. Immediate Benefits
    • No more maintaining custom application-layer secondary indexing code

  120. What it means to a startup
    Development capacity
    Yay! I’m back!

  121. Immediate Benefits
    • No more maintaining custom application-layer secondary indexing code
    • Single binary installation greatly simplifies administration

  122. What it means to a startup
    Development capacity
    Honestly, I thought I’d never see you guys again!

  123. Immediate Benefits
    • No more maintaining custom application-layer secondary indexing code
    • Single binary installation greatly simplifies administration
    • Our NoSQL could now support our domain model

  124. many-to-many relationship

  125. Modeling an influencer
    Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
    {
      "_id": "770cf5c54492344ad5e45ˆ791ae5d52",
      "realName": "David Chancogne",
      "title": "CTO",
      "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
      "primaryAffiliation": "Traackr",
      "email": "[email protected]",
      "location": "Cambridge, MA, United States",
      "siteReferences": [
        {
          "siteId": "b31236da306270dc2b5db34e943af88d",
          "contribution": 0.25
        },
        {
          "siteId": "602dc370945d3b3480fff4f2a541227c",
          "contribution": 1.0
        }
      ]
    }

  126. Modeling an influencer
    siteId indexed for “find influencers connected to site X”
    > db.influencers.ensureIndex({"siteReferences.siteId": 1});
    > db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
    {
      "_id": "770cf5c54492344ad5e45ˆ791ae5d52",
      "realName": "David Chancogne",
      "title": "CTO",
      "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
      "primaryAffiliation": "Traackr",
      "email": "[email protected]",
      "location": "Cambridge, MA, United States",
      "siteReferences": [
        {
          "siteId": "b31236da306270dc2b5db34e943af88d",
          "contribution": 0.25
        },
        {
          "siteId": "602dc370945d3b3480fff4f2a541227c",
          "contribution": 1.0
        }
      ]
    }
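
Conceptually, an index on a field nested inside an embedded array holds one entry per array element, each pointing back at the owning document. The sketch below imitates that idea in plain JavaScript so it can run without a database; it is an illustration, not MongoDB's implementation, and `buildArrayFieldIndex` and the shortened ids are made up for the example.

```javascript
// Conceptual sketch of an index over a field inside an embedded array:
// one index entry per array element, each pointing back at the document.
function buildArrayFieldIndex(docs, arrayField, subField) {
  const index = new Map(); // indexed value -> array of document _ids
  for (const doc of docs) {
    for (const element of doc[arrayField] || []) {
      const key = element[subField];
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(doc._id);
    }
  }
  return index;
}

// Hypothetical documents shaped like the influencer above (ids shortened).
const influencers = [
  {
    _id: "inf-1",
    siteReferences: [
      { siteId: "site-a", contribution: 0.25 },
      { siteId: "site-b", contribution: 1.0 },
    ],
  },
  { _id: "inf-2", siteReferences: [{ siteId: "site-b", contribution: 0.5 }] },
];

const siteIndex = buildArrayFieldIndex(influencers, "siteReferences", "siteId");
console.log(siteIndex.get("site-b"));
```

With such an index in place, "find influencers connected to site X" is a single lookup instead of a scan over every influencer document.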


  127. Other Benefits
    • Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
    • Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
    • Great documentation
    • Great adoption and community
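
As an example of the kind of ad hoc report that takes only a few lines of JavaScript, the sketch below totals contribution and counts influencers per site. It runs over a plain array so it works standalone; the `contributionBySite` name and the sample documents are assumptions for illustration.

```javascript
// Hypothetical ad hoc report: influencer count and total contribution per site.
function contributionBySite(influencerDocs) {
  const report = {};
  for (const doc of influencerDocs) {
    for (const ref of doc.siteReferences || []) {
      if (!report[ref.siteId]) {
        report[ref.siteId] = { influencers: 0, totalContribution: 0 };
      }
      report[ref.siteId].influencers += 1;
      report[ref.siteId].totalContribution += ref.contribution;
    }
  }
  return report;
}

// Made-up sample documents following the influencer shape shown earlier.
const sampleDocs = [
  {
    _id: "inf-1",
    siteReferences: [
      { siteId: "site-a", contribution: 0.25 },
      { siteId: "site-b", contribution: 1.0 },
    ],
  },
  { _id: "inf-2", siteReferences: [{ siteId: "site-b", contribution: 0.5 }] },
];

const siteReport = contributionBySite(sampleDocs);
console.log(siteReport);
```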


  128. Looks like we found the right fit!

  129. We have more of this
    Development capacity

  130. And less of this
    Source: socialbutterflyclt.com

  131. Recap & Final Thoughts
    • 3 Vs of Big Data:
    – Volume
    – Velocity
    – Variety ← Traackr
    • Big Data technologies are complementary to SQL and RDBMS
    • Until machines can think for themselves, Data Science will be increasingly important

  132. Recap & Final Thoughts
    • Be prepared to deal with less mature tech
    • Be as flexible as the data => fearless refactoring
    • Importance of ease of use and administration cannot be overstated for a small startup