Sharing a Startup’s Big Data Lessons

Presented in front of an internal audience at Sapient, Boston on 20/2/2013.

George P. Stathis

February 20, 2013

Transcript

  1. Sharing a Startup’s Big Data Lessons
    Experiences with non-RDBMS solutions at

  2. Who we are
    • A search engine
    • A people search engine
    • An influencer search engine
    • Subscription-based

  3. George Stathis
    VP Engineering
    14+ years of experience building full-stack web software systems, with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.

  4. What’s this talk about?
    • Share what we know about Big Data/NoSQL: what’s behind the buzzwords?
    • Our reasons and method for picking a NoSQL database
    • Share the lessons we learned going through the process

  5. Big Data/NoSQL: behind the buzzwords

  6. What is Big Data?
    • 3 Vs:
    – Volume
    – Velocity
    – Variety

  7. What is Big Data? Volume + Velocity
    • Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools.
    E.g. big science, web logs, RFID, sensor networks, social networks, social data, internet text and documents, internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, military surveillance, medical records, photography archives, video archives, large-scale e-commerce

  8. What is Big Data? Variety
    • Big Data is varied and unstructured
    Traditional static reports
    Analytics, exploration & experimentation

  9. What is Big Data?
    • Scaling data processing cost-effectively

  10. What is NoSQL?
    • NoSQL ≠ No SQL
    • NoSQL ≈ Not Only SQL
    • NoSQL addresses RDBMS limitations; it’s not about the SQL language
    • RDBMS = static schema
    • NoSQL = schema flexibility; don’t have to know exact structure before storing

  11. What is Distributed Computing?
    • Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
    • Allows computations to be accomplished in acceptable timeframes
    • Distributed computation approaches were developed to leverage multiple machines: MapReduce
    • With MapReduce, the program goes to the data since the data is too big to move
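
The divide-and-combine idea above can be sketched in a few lines. This is a single-process toy, not Hadoop: the function names (map_phase, shuffle, reduce_phase) and the sample input are made up for illustration. In a real cluster each map call would run on the machine that already holds its split of the data.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document stands in for a split that would live on a separate node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

splits = ["big data big buzz", "data moves slowly"]
counts = word_count(splits)
print(counts)
```

Because every map call only looks at its own split and every reduce call only looks at one key, both phases can be spread across machines without coordination beyond the shuffle step.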

  12. What is MapReduce?
    Source: developer.yahoo.com

  13. What is MapReduce?
    • MapReduce = batch processing = analytical
    • MapReduce ≠ interactive
    • Therefore many NoSQL solutions don’t outright replace warehouse solutions; they complement them
    • RDBMS is still safe :-)

  14. What is Big Data? Velocity
    • In some instances, being able to process large amounts of data in real time can yield a competitive advantage. E.g.
    – Online retailers leveraging buying history and click-through data for real-time recommendations
    • No time to wait for MapReduce jobs to finish
    • Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
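
As a rough illustration of the pre-computing option above, the sketch below updates totals as each event arrives, so a read is a lookup rather than a batch job. The RunningCounts class and the click events are hypothetical; a production system would keep these totals in a key/value store rather than in process memory.

```python
from collections import Counter

class RunningCounts:
    """Keeps per-key totals up to date so reads never scan raw events."""

    def __init__(self):
        self._totals = Counter()

    def record(self, key, amount=1):
        # The aggregation work happens at write time, one event at a time.
        self._totals[key] += amount

    def total(self, key):
        # Reads are a single lookup; unseen keys count as zero.
        return self._totals[key]

clicks = RunningCounts()
for product in ["shoes", "hats", "shoes", "shoes"]:
    clicks.record(product)

print(clicks.total("shoes"), clicks.total("hats"), clicks.total("socks"))
```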

  15. What is Big Data? Data Science
    • Emergence of Data Science
    • Data Scientist ≈ Statistician
    • Possess scientific discipline & expertise
    • Formulate and test hypotheses
    • Understand the math behind the algorithms so they can tweak when they don’t work
    • Can distill the results into an easy-to-understand story
    • Help businesses gain actionable insights

  16. Big Data Landscape
    Source: capgemini.com

  17. Big Data Landscape
    Source: capgemini.com

  18. Big Data Landscape
    Source: capgemini.com

  19. So what’s Traackr and why did we need a NoSQL DB?

  20. Traackr: context
    • A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web that can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?

  21. Traackr: a people search engine
    Up to 50 keywords per search!

  22. Traackr: a people search engine
    People as search results
    Content aggregated by author
    Proprietary 3-scale ranking

  23. Traackr: 30,000 feet
    Acquisi

  24. NoSQL is usually associated with “Web Scale” (Volume & Velocity)

  25. Do we fit the “Web scale” profile?
    • In terms of users/traffic?

  26. Source: compete.com

  27. Source: compete.com

  28. Source: compete.com

  29. Source: compete.com

  30. [image-only slide]

  31. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?

  32. PRIMARY> use traackr
    switched to db traackr
    PRIMARY> db.stats()
    {
        "db" : "traackr",
        "collections" : 12,
        "objects" : 68226121,
        "avgObjSize" : 2972.0800625760330,
        "dataSize" : 202773493971,
        "storageSize" : 221491429671,
        "numExtents" : 199,
        "indexes" : 33,
        "indexSize" : 27472394891,
        "fileSize" : 266623699968,
        "nsSizeMB" : 16,
        "ok" : 1
    }
    That’s a quarter of a terabyte …
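
The "quarter of a terabyte" reading can be checked with a little arithmetic on the figures above. The numbers are copied from the db.stats() output and treated as byte counts; the variable names are only for illustration.

```python
# Figures copied from the db.stats() output above (all sizes in bytes).
stats = {
    "objects": 68226121,
    "dataSize": 202773493971,
    "indexSize": 27472394891,
    "fileSize": 266623699968,
}

avg_obj_size = stats["dataSize"] / stats["objects"]     # bytes per document
file_size_tb = stats["fileSize"] / 10**12               # decimal terabytes
file_size_gib = stats["fileSize"] / 2**30               # binary gibibytes
index_to_data = stats["indexSize"] / stats["dataSize"]  # index overhead ratio

print(f"average object size: {avg_obj_size:.1f} bytes")
print(f"files on disk: {file_size_tb:.2f} TB ({file_size_gib:.0f} GiB)")
print(f"index size relative to data: {index_to_data:.1%}")
```

Dividing dataSize by objects reproduces the reported avgObjSize of roughly 3 KB per document, and fileSize lands a little above 0.25 TB.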

  33. Wait! What? My Synology NAS at home can hold 2TB!

  34. No need for us to track the entire web
    Web Content
    Influencer Content
    Not at scale :-)

  35. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?

  36. Variety view of “Web Scale”
    Web data is:
    Heterogeneous
    Unstructured (text)

  37. Source: http://www.opte.org/
    Visualization of the Internet, Nov. 23rd 2003

  38. Data sources are isolated islands of rich data with loose links to one another

  39. How do we build a database that models all possible entities found on the web?

  40. Modeling the web: the RDBMS way

  41. Source: socialbutterflyclt.com

  42. or

  43. [image-only slide]

  44. Influencer data as JSON
    {
        "realName": "David Chancogne",
        "title": "CTO",
        "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
        "primaryAffiliation": "Traackr",
        "email": "[email protected]",
        "location": "Cambridge, MA, United States",
        "siteReferences": [
            {
                "siteUrl": "http://twitter.com/dchancogne",
                "metrics": [
                    {
                        "value": 216,
                        "name": "twitter_followers_count"
                    },
                    {
                        "value": 2107,
                        "name": "twitter_statuses_count"
                    }
                ]
            },
            {
                "siteUrl": "http://traackr.com/blog/author/david",
                "metrics": [
                    {
                        "value": 21,
                        "name": "google_inbound_links"
                    }
                ]
            }
        ]
    }

  45. NoSQL = schema flexibility
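
A small sketch of what schema flexibility looks like from the application side. The two documents, the example.com URL, and the metric_total helper are invented for illustration; the point is only that records of different shapes sit side by side and the reading code tolerates missing fields.

```python
import json

# Two hypothetical influencer documents with different shapes.
documents = [
    {"realName": "Influencer A", "title": "CTO",
     "siteReferences": [{"siteUrl": "http://example.com/a",
                         "metrics": [{"name": "followers", "value": 216}]}]},
    {"realName": "Influencer B", "location": "Boston, MA"},  # no sites yet
]

def metric_total(doc, metric_name):
    """Sum one metric across a document's sites, tolerating missing fields."""
    return sum(
        metric["value"]
        for site in doc.get("siteReferences", [])
        for metric in site.get("metrics", [])
        if metric["name"] == metric_name
    )

# Both documents serialize and store as-is; no shared schema is declared.
stored = [json.dumps(doc) for doc in documents]
totals = [metric_total(json.loads(raw), "followers") for raw in stored]
print(totals)
```

The trade-off is that the structure checks an RDBMS schema would enforce move into application code such as the doc.get(...) defaults above.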

  46. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?

  47. Do we fit the “Web scale” profile?
    • In terms of users/traffic?
    • In terms of the amount of data?
    • In terms of the variety of the data? ✓

  48. Traackr’s Datastore Requirements
    • Schema flexibility ✓
    • Good at storing lots of variable length text
    • Batch processing options

  49. Requirement: text storage
    Variable text length:
    140-character tweets < big variance < multi-page blog posts

  50. Requirement: text storage
    RDBMS’ answer to variable text length:
    Plan ahead for largest value
    CLOB/BLOB

  51. Requirement: text storage
    Issues with CLOB/BLOB for us:
    No clue what largest value is
    CLOB/BLOB for tweets = wasted space

  52. Requirement: text storage
    NoSQL solutions are great for text:
    No length requirements (automated chunking)
    Limited space overhead

  53. Traackr’s Datastore Requirements
    • Schema flexibility ✓
    • Good at storing lots of variable length text ✓
    • Batch processing options

  54. Requirement: batch processing
    Some NoSQL solutions come with MapReduce
    Source: http://code.google.com/

  55. Requirement: batch processing
    MapReduce + RDBMS:
    Possible but proprietary solutions
    Usually involves exporting data from RDBMS into a NoSQL system anyway.
    Defeats data locality benefit of MR

  56. Traackr’s Datastore Requirements
    • Schema flexibility ✓
    • Good at storing lots of variable length text ✓
    • Batch processing options ✓
    A NoSQL option is the right fit

  57. How did we pick a NoSQL DB?

  58. Bewildering number of options (early 2010)
    Key/Value Databases
    • Distributed hashtables
    • Designed for high load
    • In-memory or on-disk
    • Eventually consistent
    Column Databases
    • Spreadsheet-like
    • Key is a row id
    • Attributes are columns
    • Columns can be grouped into families
    Document Databases
    • Like Key/Value
    • Value = Document
    • Document = JSON/BSON
    • JSON = Flexible Schema
    Graph Databases
    • Graph Theory G=(E,V)
    • Great for modeling networks
    • Great for graph-based query algorithms

  59. Bewildering number of options (early 2010)

  60. Trimming options
    Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.

  61. Trimming options
    Memcache: memory-based, we need true persistence

  62. Trimming options
    Amazon SimpleDB: not willing to store our data in a proprietary datastore.

  63. Trimming options
    Not willing to store our data in a proprietary datastore.
    Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches

  64. Trimming options
    CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes.

  65. Trimming options
    Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).

  66. Trimming options
    MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.

  67. Trimming options
    Riak: very close but in early 2010, we had adoption questions.

  68. Trimming options
    HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib and support for batch processing using Hadoop/MR.

  69. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  70. Rewards: Choices

  71. Rewards: Choices
    Source: capgemini.com

  72. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  73. When Big-Data = Big Architectures
    Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
    Must have a Hadoop HDFS cluster of at least 2x replication factor nodes
    Must have an odd number of ZooKeeper quorum nodes
    Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
    Master/slave architecture means a single point of failure, so you need to protect your master.
    And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
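
The odd-number rule for ZooKeeper follows from majority voting: the ensemble stays available only while more than half of its nodes can talk to each other. The sketch below just works through that arithmetic; the function names are made up, and nothing here calls ZooKeeper itself.

```python
def majority(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """Nodes that can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)

for size in (3, 4, 5, 6):
    print(size, "nodes:", "quorum of", majority(size),
          "tolerates", tolerated_failures(size), "failure(s)")
```

Going from an odd size to the next even size adds a machine to run without adding any failure tolerance, which is why odd ensemble sizes are the usual recommendation.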

  74. Source: socialbutterflyclt.com

  75. Jokes aside, no one said open source was easy to use

  76. To be expected
    • Hadoop/HBase are designed to move mountains
    • If you want to move big stuff, be prepared to sometimes use big equipment

  77. What it means to a startup
    Development capacity before
    Development capacity after
    Congrats, you are now a sysadmin…

  78. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  79. Mapping a saved search to a column store
    Name
    Ranks
    References to influencer records

  80. Mapping a saved search to a column store
    Unique key
    “attributes” column family for general attributes
    “influencerId” column family for influencer ranks and foreign keys

  81. Mapping a saved search to a column store
    “name” attribute
    Influencer ranks can be attribute names as well

  82. Mapping a saved search to a column store
    Can get pretty long so needs indexing and pagination
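
One way to picture the mapping above is a nested dictionary: row key, then column family, then column name. This is a plain-Python model rather than HBase client code. The row key, search name, influencer ids, and the zero-padded rank format are all assumptions made for the sketch; only the two column family names come from the slides.

```python
# One saved-search row, modeled as {column family: {column name: value}}.
# The row key, search name, and influencer ids are made-up sample values.
row_key = "search_0001"
row = {
    "attributes": {"name": "cloud computing bloggers"},
    # Column names are ranks, zero-padded (an assumption) so that they
    # sort in rank order when compared as strings.
    "influencerId": {f"{rank:05d}": f"influencer_id_{rank * 11}"
                     for rank in range(1, 8)},
}

def page_of_influencers(row, page, page_size):
    """Return one page of (rank, influencer id) pairs in rank order."""
    columns = sorted(row["influencerId"].items())
    start = page * page_size
    return [(int(rank), ref) for rank, ref in columns[start:start + page_size]]

first_page = page_of_influencers(row, page=0, page_size=3)
last_page = page_of_influencers(row, page=2, page_size=3)
print(first_page)
print(last_page)
```

Here the whole column list is sorted and sliced in memory, which is exactly the part that stops scaling once a search holds many influencers.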

  83. Problem: no out-of-the-box row-based indexing and pagination

  84. Jumping right into the code

  85. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  86. a few months later…

  87. Need to upgrade to HBase 0.90
    • Making sure to remain on recent code base
    • Performance improvements
    • Mostly to get the latest bug fixes
    No thanks!

  88. Looks like something is missing

  89. [image-only slide]

  90. Our DB indexes depend on this!

  91. Let’s get this straight
    • HBase no longer comes with secondary indexing out-of-the-box
    • It’s been moved out of the trunk to GitHub
    • Where only one other company besides us seems to care about it

  92. Only one other maintainer besides us

  93. What it means to a startup
    Development capacity
    Congrats, you are now an HBase contrib maintainer…

  94. Source: socialbutterflyclt.com

  95. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  96. Homegrown HBase Indexes
    Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters
    Row ids for Posts

  97. Homegrown HBase Indexes
    Find posts for influencer_id_1234
    Row ids for Posts

  98. Homegrown HBase Indexes
    Find posts for influencer_id_5678
    Row ids for Posts
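
The prefix-scan idea above can be imitated with a sorted list, since a column store keeps rows ordered by key. This is a simulation, not the HBase API: the "<influencer id>|<post id>" key layout, the sample keys, and the prefix_scan helper are assumptions chosen for illustration.

```python
from bisect import bisect_left

# Sorted row ids for posts; the "<influencer id>|<post id>" layout is an
# assumption for illustration, not the actual Traackr key format.
row_ids = sorted([
    "influencer_id_1234|post_001",
    "influencer_id_1234|post_002",
    "influencer_id_5678|post_001",
    "influencer_id_5678|post_002",
    "influencer_id_5678|post_003",
])

def prefix_scan(sorted_keys, prefix):
    """Return keys in [start_row, stop_row), both derived from a non-empty prefix."""
    start_row = prefix
    # Exclusive upper bound: bump the last character of the prefix by one.
    stop_row = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

posts_1234 = prefix_scan(row_ids, "influencer_id_1234|")
posts_5678 = prefix_scan(row_ids, "influencer_id_5678|")
print(posts_1234)
print(posts_5678)
```

Because the keys are sorted, both bounds are found by binary search and only the matching rows are touched, which is what makes the row-key prefix behave like an index on the influencer id.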

  99. Homegrown HBase Indexes
    • No longer depending on unmaintained code
    • Work with out-of-the-box HBase installation

  100. What it means to a startup
    Development capacity
    You are back but you still need to maintain indexing logic

  101. a few months later…

  102. Cracks in the data model
    huffingtonpost.com
    huffingtonpost.com
    http://www.huffingtonpost.com/arianna-huffington/post_1.html
    http://www.huffingtonpost.com/arianna-huffington/post_2.html
    http://www.huffingtonpost.com/arianna-huffington/post_3.html
    http://www.huffingtonpost.com/shaun-donovan/post1.html
    http://www.huffingtonpost.com/shaun-donovan/post2.html
    http://www.huffingtonpost.com/shaun-donovan/post3.html
    writes for
    authored by
    published under
    writes for
    authored by
    published under

  103. Cracks in the data model
    Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties

  104. Cracks in the data model
    Content attribution logic could sometimes mis-attribute posts because of the duplicated data.

  105. Cracks in the data model
    Exacerbated when we started tracking people’s content on a daily basis in mid-2011

  106. Fixing the cracks in the data model
    huffingtonpost.com
    http://www.huffingtonpost.com/arianna-huffington/post_1.html
    http://www.huffingtonpost.com/arianna-huffington/post_2.html
    http://www.huffingtonpost.com/arianna-huffington/post_3.html
    http://www.huffingtonpost.com/shaun-donovan/post1.html
    http://www.huffingtonpost.com/shaun-donovan/post2.html
    http://www.huffingtonpost.com/shaun-donovan/post3.html
    writes for
    authored by
    published under
    writes for
    authored by
    published under
    Normalize the sites
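
The difference between the duplicated and the normalized site records can be sketched with plain dictionaries. The field names (site, host, site_id) and the site_1 identifier are hypothetical; only the host and the two author slugs come from the URLs above.

```python
# Denormalized: each influencer embeds its own copy of the site record.
denormalized = {
    "arianna-huffington": {"site": {"host": "huffingtonpost.com"}},
    "shaun-donovan": {"site": {"host": "huffingtonpost.com"}},
}

# Normalized: one site record, referenced by id from each influencer.
sites = {"site_1": {"host": "huffingtonpost.com"}}
normalized = {
    "arianna-huffington": {"site_id": "site_1"},
    "shaun-donovan": {"site_id": "site_1"},
}

def influencers_for_site(influencers, site_id):
    """Reverse lookup from a site to its authors (a full scan in this toy)."""
    return sorted(name for name, doc in influencers.items()
                  if doc["site_id"] == site_id)

# Two copies of the same site exist in the denormalized form.
site_copies = sum(1 for doc in denormalized.values()
                  if doc["site"]["host"] == "huffingtonpost.com")
authors = influencers_for_site(normalized, "site_1")
print(site_copies, authors)
```

With a single site record there is nothing to drift out of sync, but finding the authors of a site now means looking up influencers by site_id, which is the kind of query a datastore answers cheaply only with a secondary index.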

  107. Fixing the cracks in the data model
    • Normalization requires stronger secondary indexing
    • Our application layer indexing would need revisiting…again!

  108. What it means to a startup
    Development capacity
    Psych! You are back to writing indexing code.

  109. Source: socialbutterflyclt.com

  110. Lessons Learned
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost

  111. Traackr’s Datastore Requirements (Revisited)
    • Schema flexibility
    • Good at storing lots of variable length text
    • Out-of-the-box SECONDARY INDEX support!
    • Simple to use and administer

  112. NoSQL  picking  –  Round  2  (mid  2011)  
    Key/Value  Databases  
    •  Distributed  hashtables  
    •  Designed  for  high  load  
    •  In-­‐memory  or  on-­‐disk  
    •  Eventually  consistent  
    Column  Databases  
    •  Spread  sheet  like  
    •  Key  is  a  row  id  
    •  Adributes  are  columns  
    •  Columns  can  be  grouped  
    into  families  
    Document  Databases  
    •  Like  Key/Value  
    •  Value  =  Document  
    •  Document  =  JSON/BSON  
    •  JSON  =  Flexible  Schema  
    Graph  Databases  
    •  Graph  Theory  G=(E,V)  
    •  Great  for  modeling  
    networks  
    •  Great  for  graph-­‐based  
    query  algorithms    

    View Slide

  113. NoSQL  picking  –  Round  2  (mid  2011)  
    Nope!  


  114. NoSQL  picking  –  Round  2  (mid  2011)  
    Graph  Databases:  we  looked  at  
    Neo4J  a  bit  closer  but  passed  again  
    for  the  same  reasons  as  before.  


  115. NoSQL  picking  –  Round  2  (mid  2011)  
    Memcache: still no


  116. NoSQL  picking  –  Round  2  (mid  2011)  
    Amazon SimpleDB: still no.


  117. NoSQL  picking  –  Round  2  (mid  2011)  
    Not willing to store our data in a
    proprietary datastore.
    Redis and LinkedIn’s Project
    Voldemort: still no


  118. NoSQL  picking  –  Round  2  (mid  2011)  
    CouchDB: more mature but still
    no ad-hoc queries.


  119. NoSQL  picking  –  Round  2  (mid  2011)  
    Cassandra: matured quite a bit, added
    secondary indexes and batch processing
    options but more restrictive in its use than
    other solutions. After the HBase lesson,
    simplicity of use was now more important.


  120. NoSQL  picking  –  Round  2  (mid  2011)  
    Riak: still a strong contender but
    adoption questions remained.


  121. NoSQL  picking  –  Round  2  (mid  2011)  
    MongoDB: matured by leaps and bounds, increased
    adoption, support from 10gen, advanced indexing
    out-of-the-box as well as some batch processing
    options, a breeze to use, well documented and fit into
    our existing code base very nicely.


  122. Lessons  Learned  
    Challenges
    - Complexity
    - Missing Features
    - Problem solution fit
    - Resources
    Rewards
    - Choices
    - Empowering
    - Community
    - Cost


  123. Immediate  Benefits  
    •  No more maintaining custom application-layer
    secondary indexing code


  124. What  it  means  to  a  startup  
    Development  capacity  
    Yay!  I’m  back!  


  125. Immediate  Benefits  
    •  No more maintaining custom application-layer
    secondary indexing code
    •  Single binary installation greatly simplifies
    administration


  126. What  it  means  to  a  startup  
    Development  capacity  
    Honestly,  I  thought  
    I’d  never  see  you  
    guys  again!  


  127. Immediate  Benefits  
    •  No more maintaining custom application-layer
    secondary indexing code
    •  Single binary installation greatly simplifies
    administration
    •  Our NoSQL could now support our domain
    model


  128. many-to-many
    relationship


  129. Modeling  an  influencer  
    Embedded list of
    references to sites
    augmented with
    influencer-specific
    site attributes (e.g.
    percent contribution
    to content)
    {
         "_id": "770cf5c54492344ad5e45ˆ791ae5d52",
         "realName": "David Chancogne",
         "title": "CTO",
         "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
         "primaryAffiliation": "Traackr",
         "email": "[email protected]",
         "location": "Cambridge, MA, United States",
         "siteReferences": [
               {
                     "siteId": "b31236da306270dc2b5db34e943af88d",
                     "contribution": 0.25
               },
               {
                     "siteId": "602dc370945d3b3480fff4f2a541227c",
                     "contribution": 1.0
               }
         ]
    }
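
One way to read this model: each site is stored once, and an influencer document only carries references plus the attributes specific to that influencer-site pair. The sketch below shows how an application could resolve those references. The site documents, their `url` field, the short ids and the `resolveSites` helper are all hypothetical, introduced only for illustration.

```javascript
// Hypothetical site documents, stored once and shared by many influencers.
const sites = new Map([
  ["site-1", { _id: "site-1", url: "http://example.com/blog" }],
  ["site-2", { _id: "site-2", url: "http://example.org/news" }],
]);

// Influencer shaped like the slide's document (values are made up).
const influencer = {
  _id: "influencer-1",
  realName: "Example Person",
  siteReferences: [
    { siteId: "site-1", contribution: 0.25 },
    { siteId: "site-2", contribution: 1.0 },
  ],
};

// Application-side "join": merge each reference with the shared site document,
// keeping the influencer-specific attribute (contribution) alongside it.
function resolveSites(influencerDoc, siteLookup) {
  return influencerDoc.siteReferences
    .filter((ref) => siteLookup.has(ref.siteId))
    .map((ref) => ({
      ...siteLookup.get(ref.siteId),
      contribution: ref.contribution,
    }));
}
```

Calling `resolveSites(influencer, sites)` returns one merged object per reference. Because the site itself lives in a single place, a change to a site does not have to be copied into every influencer that references it.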


  130. Modeling  an  influencer  
    siteId indexed for
    “find influencers
    connected to site X”
    > db.influencers.ensureIndex({"siteReferences.siteId": 1});
    > db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
    {
         "_id": "770cf5c54492344ad5e45ˆ791ae5d52",
         "realName": "David Chancogne",
         "title": "CTO",
         "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
         "primaryAffiliation": "Traackr",
         "email": "[email protected]",
         "location": "Cambridge, MA, United States",
         "siteReferences": [
               {
                     "siteId": "b31236da306270dc2b5db34e943af88d",
                     "contribution": 0.25
               },
               {
                     "siteId": "602dc370945d3b3480fff4f2a541227c",
                     "contribution": 1.0
               }
         ]
    }
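
Conceptually, an index on a field inside an embedded array holds one entry per array element, so a single influencer can be reached from each of its sites. The sketch below is a toy in-memory model of that idea, not a description of how MongoDB implements it; `buildSiteIndex` and `findInfluencersBySite` are hypothetical helper names.

```javascript
// Build a siteId -> [influencer _id] lookup with one entry per array element.
function buildSiteIndex(influencers) {
  const index = new Map();
  for (const doc of influencers) {
    for (const ref of doc.siteReferences || []) {
      if (!index.has(ref.siteId)) index.set(ref.siteId, []);
      index.get(ref.siteId).push(doc._id);
    }
  }
  return index;
}

// "Find influencers connected to site X" becomes a single lookup.
function findInfluencersBySite(index, siteId) {
  return index.get(siteId) || [];
}

// Made-up sample data with the same shape as the slide's document.
const sampleInfluencers = [
  { _id: "a", siteReferences: [{ siteId: "s1", contribution: 0.25 }, { siteId: "s2", contribution: 1.0 }] },
  { _id: "b", siteReferences: [{ siteId: "s2", contribution: 0.5 }] },
  { _id: "c", siteReferences: [] },
];

const siteIndex = buildSiteIndex(sampleInfluencers);
console.log(findInfluencersBySite(siteIndex, "s2"));
```

The point of the slide is that the database now maintains this structure itself, so the application issues the query and never touches index upkeep.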


  131. Other  Benefits  
    •  Ad hoc queries and reports became easier to write with JavaScript:
    no need for a Java developer to write MapReduce code to extract
    the data in a usable form, as was needed with HBase.
    •  Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster
    replication is available but experimental and a lot more
    involved to set up.
    •  Great documentation
    •  Great adoption and community
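
As a rough idea of what such an ad hoc report can look like, the sketch below counts how many influencers reference each site using plain JavaScript. It runs over an in-memory array so it is self-contained; the sample documents and the `countInfluencersPerSite` name are assumptions made for illustration, and in a database shell the same function body would iterate over query results instead.

```javascript
// Made-up documents shaped like the influencer example from the earlier slides.
const reportInput = [
  { realName: "A", siteReferences: [{ siteId: "s1", contribution: 0.25 }, { siteId: "s2", contribution: 1.0 }] },
  { realName: "B", siteReferences: [{ siteId: "s2", contribution: 0.5 }] },
  { realName: "C" }, // no siteReferences field at all: flexible schema
];

// Ad hoc report: number of influencers referencing each site.
function countInfluencersPerSite(docs) {
  const counts = {};
  docs.forEach((doc) => {
    (doc.siteReferences || []).forEach((ref) => {
      counts[ref.siteId] = (counts[ref.siteId] || 0) + 1;
    });
  });
  return counts;
}

console.log(countInfluencersPerSite(reportInput));
```

A report like this is a few lines that anyone comfortable with JavaScript can write and adjust, which is the contrast the slide draws with writing a Java MapReduce job for each question.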


  132. Looks like we found the right fit!


  133. We  have  more  of  this  
    Development  capacity  


  134. And  less  of  this  
    Source: socialbutterflyclt.com


  135. Recap  &  Final  Thoughts  
    •  3 Vs of Big Data:
    – Volume
    – Velocity
    – Variety ← Traackr
    •  Big Data technologies are complementary to
    SQL and RDBMS
    •  Until machines can think for themselves, Data
    Science will be increasingly important


  136. Recap  &  Final  Thoughts  
    •  Be  prepared  to  deal  with  less  mature  tech  
    •  Be  as  flexible  as  the  data  =>  fearless  
    refactoring  
    •  Importance of ease of use and
    administration cannot be overstated for a
    small startup


  137. Q&A  
