Sharing a Startup’s Big Data Lessons

Presented in front of an internal audience at Sapient, Boston on 20/2/2013.

George P. Stathis

February 20, 2013

Transcript

  1. Sharing a Startup’s Big Data Lessons: Experiences with non-RDBMS solutions at Traackr
  2. Who we are
     • A search engine
     • A people search engine
     • An influencer search engine
     • Subscription-based
  3. George Stathis, VP Engineering
     14+ years of experience building full-stack web software systems, with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.
  4. What’s this talk about?
     • Share what we know about Big Data/NoSQL: what’s behind the buzzwords?
     • Our reasons and method for picking a NoSQL database
     • Share the lessons we learned going through the process
  5. Big Data/NoSQL: behind the buzzwords
  6. What is Big Data?
     • 3 Vs:
       – Volume
       – Velocity
       – Variety
  7. What is Big Data? Volume + Velocity
     • Data sets too large, or coming in at too high a velocity, to process using traditional databases or desktop tools.
       E.g. big science, web logs, RFID, sensor networks, social networks, social data, internet text and documents, internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, military surveillance, medical records, photography archives, video archives, large-scale e-commerce
  8. What is Big Data? Variety
     • Big Data is varied and unstructured
     (Diagram labels: traditional static reports; analytics, exploration & experimentation)
  9. What is Big Data?
     • Scaling data processing cost-effectively
     (Graphic: $$$$$$$$, $$$$$, $$$)
  10. What is NoSQL?
     • NoSQL ≠ No SQL
     • NoSQL ≈ Not Only SQL
     • NoSQL addresses RDBMS limitations; it’s not about the SQL language
     • RDBMS = static schema
     • NoSQL = schema flexibility; you don’t have to know the exact structure before storing
  11. What is Distributed Computing?
     • Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
     • Allows computations to be accomplished in acceptable timeframes
     • Distributed computation approaches were developed to leverage multiple machines: MapReduce
     • With MapReduce, the program goes to the data since the data is too big to move
  12. What is MapReduce?
     Source: developer.yahoo.com
  13. What is MapReduce?
     • MapReduce = batch processing = analytical
     • MapReduce ≠ interactive
     • Therefore many NoSQL solutions don’t outright replace warehouse solutions, they complement them
     • RDBMS is still safe :-)
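To make the map, shuffle and reduce steps concrete, here is a minimal single-process sketch of a word count in Python. It only imitates the programming model; a real framework would run the map and reduce functions on the machines that hold the data. The function names and the sample input are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, then shuffle, then reduce per key."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Two hypothetical input splits.
splits = ["big data big buzz", "data locality matters"]
counts = word_count(splits)
```

Because the whole input is processed before any result is available, this style suits batch analytics rather than interactive queries.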
  14. What is Big Data? Velocity
     • In some instances, being able to process large amounts of data in real time can yield a competitive advantage. E.g.
       – Online retailers leveraging buying history and click-through data for real-time recommendations
     • No time to wait for MapReduce jobs to finish
     • Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
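The pre-computing idea can be sketched in a few lines of Python: update the aggregate when each event arrives, so that reads never have to scan raw events. The class name, product ids and click stream below are hypothetical and only illustrate the pattern.

```python
from collections import Counter


class RunningCounts:
    """Keeps per-key totals up to date as each event arrives."""

    def __init__(self):
        self._counts = Counter()

    def record(self, product_id):
        # Constant-time update at write time instead of a batch job later.
        self._counts[product_id] += 1

    def top(self, n):
        """Return the n most frequent keys with their counts."""
        return self._counts.most_common(n)


clicks = RunningCounts()
for product_id in ["p1", "p2", "p1", "p3", "p1", "p2"]:
    clicks.record(product_id)

top_two = clicks.top(2)
```

The trade-off is that the questions to be answered have to be known in advance, since only the pre-computed aggregates are cheap to read.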
  15. What is Big Data? Data Science
     • Emergence of Data Science
     • Data Scientist ≈ Statistician
     • Possess scientific discipline & expertise
     • Formulate and test hypotheses
     • Understand the math behind the algorithms so they can tweak them when they don’t work
     • Can distill the results into an easy-to-understand story
     • Help businesses gain actionable insights
  16. Big Data Landscape
     Source: capgemini.com
  17. Big Data Landscape
     Source: capgemini.com
  18. Big Data Landscape
     Source: capgemini.com
  19. So what’s Traackr and why did we need a NoSQL DB?
  20. Traackr: context
     • A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
  21. Traackr: a people search engine
     Up to 50 keywords per search!
  22. Traackr: a people search engine
     People as search results
     Content aggregated by author
     Proprietary 3-scale ranking
  23. Traackr: 30,000 feet
     Acquisition, Processing, Storage & Indexing, Services, Applications
  24. NoSQL is usually associated with “Web Scale” (Volume & Velocity)
  25. Do we fit the “Web scale” profile?
     • In terms of users/traffic?
  26. Source: compete.com
  27. Source: compete.com
  28. Source: compete.com
  29. Source: compete.com
  30. None
  31. Do we fit the “Web scale” profile?
     • In terms of users/traffic?
     • In terms of the amount of data?
  32. PRIMARY> use traackr
     switched to db traackr
     PRIMARY> db.stats()
     {
         "db" : "traackr",
         "collections" : 12,
         "objects" : 68226121,
         "avgObjSize" : 2972.0800625760330,
         "dataSize" : 202773493971,
         "storageSize" : 221491429671,
         "numExtents" : 199,
         "indexes" : 33,
         "indexSize" : 27472394891,
         "fileSize" : 266623699968,
         "nsSizeMB" : 16,
         "ok" : 1
     }
     That’s a quarter of a terabyte…
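The "quarter of a terabyte" figure can be checked with a little arithmetic. The sketch below copies the byte counts from the output above and converts them using decimal units (an assumption; binary units would give slightly smaller numbers).

```python
# Values copied from the db.stats() output above (all sizes in bytes).
stats = {
    "objects": 68226121,
    "dataSize": 202773493971,
    "storageSize": 221491429671,
    "indexSize": 27472394891,
    "fileSize": 266623699968,
}

GB = 10**9   # decimal gigabyte
TB = 10**12  # decimal terabyte

file_size_tb = stats["fileSize"] / TB
data_size_gb = stats["dataSize"] / GB
index_size_gb = stats["indexSize"] / GB

# avgObjSize should equal dataSize divided by the number of objects.
avg_obj_size = stats["dataSize"] / stats["objects"]

print(f"files on disk: {file_size_tb:.2f} TB")
print(f"data: {data_size_gb:.1f} GB, indexes: {index_size_gb:.1f} GB")
print(f"average object size: {avg_obj_size:.2f} bytes")
```

The files on disk come to roughly 0.27 TB, and the computed average object size agrees with the reported avgObjSize of about 2972 bytes.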
  33. Wait! What? My Synology NAS at home can hold 2TB!
  34. No need for us to track the entire web
     Web Content
     Influencer Content
     Not at scale :-)
  35. Do we fit the “Web scale” profile?
     • In terms of users/traffic?
     • In terms of the amount of data?
  36. Variety view of “Web Scale”
     Web data is:
     Heterogeneous
     Unstructured (text)
  37. Visualization of the Internet, Nov. 23rd 2003
     Source: http://www.opte.org/
  38. Data sources are isolated islands of rich data with loose links to one another
  39. How do we build a database that models all possible entities found on the web?
  40. Modeling the web: the RDBMS way
  41. Source: socialbutterflyclt.com
  42. or

  43. None
  44. Influencer data as JSON
     {
       "realName": "David Chancogne",
       "title": "CTO",
       "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
       "primaryAffiliation": "Traackr",
       "email": "dchancogne@traackr.com",
       "location": "Cambridge, MA, United States",
       "siteReferences": [
         {
           "siteUrl": "http://twitter.com/dchancogne",
           "metrics": [
             { "value": 216, "name": "twitter_followers_count" },
             { "value": 2107, "name": "twitter_statuses_count" }
           ]
         },
         {
           "siteUrl": "http://traackr.com/blog/author/david",
           "metrics": [
             { "value": 21, "name": "google_inbound_links" }
           ]
         }
       ]
     }
  45. NoSQL = schema flexibility
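As a rough illustration of what schema flexibility buys, the Python sketch below keeps two differently shaped records side by side and reads whatever metrics each one happens to carry. The first record is a trimmed copy of the JSON above; the second record and the helper function are hypothetical.

```python
import json

# Two records with different shapes; no table definition is needed up front.
influencers = [
    {
        "realName": "David Chancogne",
        "siteReferences": [
            {
                "siteUrl": "http://twitter.com/dchancogne",
                "metrics": [
                    {"name": "twitter_followers_count", "value": 216},
                    {"name": "twitter_statuses_count", "value": 2107},
                ],
            },
            {
                "siteUrl": "http://traackr.com/blog/author/david",
                "metrics": [{"name": "google_inbound_links", "value": 21}],
            },
        ],
    },
    # Hypothetical second record: no site references at all.
    {"realName": "Example Person", "title": "Blogger"},
]


def metric_names(influencer):
    """Collect whichever metric names a record happens to carry."""
    return sorted(
        metric["name"]
        for site in influencer.get("siteReferences", [])
        for metric in site.get("metrics", [])
    )


# Each record serializes as-is, whatever fields it has.
serialized = [json.dumps(doc) for doc in influencers]
names = [metric_names(doc) for doc in influencers]
```

A new site type with new metrics only adds entries to the list; nothing like an ALTER TABLE is involved, although the reading code has to tolerate missing fields.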

  46. Do we fit the “Web scale” profile?
     • In terms of users/traffic?
     • In terms of the amount of data?
  47. Do we fit the “Web scale” profile?
     • In terms of users/traffic?
     • In terms of the amount of data?
     • In terms of the variety of the data ✓
  48. Traackr’s Datastore Requirements
     • Schema flexibility ✓
     • Good at storing lots of variable length text
     • Batch processing options
  49. Requirement: text storage
     Variable text length: 140-character tweets < big variance < multi-page blog posts
  50. Requirement: text storage
     RDBMS’ answer to variable text length:
     Plan ahead for largest value
     CLOB/BLOB
  51. Requirement: text storage
     Issues with CLOB/BLOB for us:
     No clue what largest value is
     CLOB/BLOB for tweets = wasted space
  52. Requirement: text storage
     NoSQL solutions are great for text:
     No length requirements (automated chunking)
     Limited space overhead
  53. Traackr’s Datastore Requirements
     • Schema flexibility ✓
     • Good at storing lots of variable length text ✓
     • Batch processing options
  54. Requirement: batch processing
     Some NoSQL solutions come with MapReduce
     Source: http://code.google.com/
  55. Requirement: batch processing
     MapReduce + RDBMS:
     Possible but proprietary solutions
     Usually involves exporting data from the RDBMS into a NoSQL system anyway
     Defeats the data locality benefit of MR
  56. Traackr’s Datastore Requirements
     • Schema flexibility ✓
     • Good at storing lots of variable length text ✓
     • Batch processing options ✓
     A NoSQL option is the right fit
  57. How did we pick a NoSQL DB?

  58. Bewildering number of options (early 2010)
     Key/Value Databases
     • Distributed hashtables
     • Designed for high load
     • In-memory or on-disk
     • Eventually consistent
     Column Databases
     • Spreadsheet-like
     • Key is a row id
     • Attributes are columns
     • Columns can be grouped into families
     Document Databases
     • Like Key/Value
     • Value = Document
     • Document = JSON/BSON
     • JSON = Flexible Schema
     Graph Databases
     • Graph Theory G=(E,V)
     • Great for modeling networks
     • Great for graph-based query algorithms
  59. Bewildering number of options (early 2010)
     (Same four database categories as slide 58.)
  60. Trimming options
     Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
  61. Trimming options
     Memcache: memory-based; we need true persistence.
  62. Trimming options
     Amazon SimpleDB: not willing to store our data in a proprietary datastore.
  63. Trimming options
     Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches.
  64. Trimming options
     CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
  65. Trimming options
     Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
  66. Trimming options
     MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
  67. Trimming options
     Riak: very close, but in early 2010 we had adoption questions.
  68. Trimming options
     HBase: came across as the most mature at the time, with several deployments, a healthy community, “out-of-the-box” secondary indexes through a contrib and support for batch processing using Hadoop/MR.
  69. Lessons Learned
     Challenges: Complexity, Missing Features, Problem/solution fit, Resources
     Rewards: Choices, Empowering, Community, Cost
  70. Rewards: Choices
     (Same four database categories as slide 58.)
  71. Rewards: Choices
     Source: capgemini.com
  72. Lessons Learned
     Challenges: Complexity, Missing Features, Problem/solution fit, Resources
     Rewards: Choices, Empowering, Community, Cost
  73. When Big-Data = Big Architectures
     Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
     Must have a Hadoop HDFS cluster of at least 2x replication factor nodes
     Must have an odd number of ZooKeeper quorum nodes
     Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
     Master/slave architecture means a single point of failure, so you need to protect your master.
     And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
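The odd-number rule for the ZooKeeper ensemble follows from majority voting: the ensemble stays available only while more than half of its nodes are reachable. A small Python sketch of that arithmetic (the function names are illustrative) shows why an even-sized ensemble adds a machine without adding fault tolerance.

```python
def majority(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


# Failures tolerated for ensembles of 3 to 6 nodes.
table = {n: tolerated_failures(n) for n in range(3, 7)}
```

Here three and four nodes both tolerate a single failure, and five and six both tolerate two, so the extra node in an even-sized ensemble is one more machine to administer for no gain in availability.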
  74. Source: socialbutterflyclt.com
  75. Jokes aside, no one said open source was easy to use
  76. To be expected
     • Hadoop/HBase are designed to move mountains
     • If you want to move big stuff, be prepared to sometimes use big equipment
  77. What it means to a startup
     Development capacity before
     Development capacity after
     Congrats, you are now a sysadmin…
  78. Lessons Learned
     Challenges: Complexity, Missing Features, Problem/solution fit, Resources
     Rewards: Choices, Empowering, Community, Cost
  79. Mapping a saved search to a column store
     Name
     Ranks
     References to influencer records
  80. Mapping a saved search to a column store
     Unique key
     “attributes” column family for general attributes
     “influencerId” column family for influencer ranks and foreign keys
  81. Mapping a saved search to a column store
     “name” attribute
     Influencer ranks can be attribute names as well
  82. Mapping a saved search to a column store
     Can get pretty long, so it needs indexing and pagination
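One way to picture the mapping is as a nested dictionary: row key, then column family, then column qualifier. The Python sketch below is an in-memory stand-in, not HBase client code; the search name, the zero-padded rank qualifiers and the influencer ids are all hypothetical.

```python
# Hypothetical row for one saved search, stored under a unique row key.
# Layout: column family -> column qualifier -> value.
saved_search_row = {
    "attributes": {"name": "cloud computing bloggers"},
    # Qualifier = zero-padded rank, value = foreign key to an influencer row.
    "influencerId": {f"{rank:04d}": f"influencer_{rank}" for rank in range(1, 26)},
}


def page_of_influencers(row, page, page_size=10):
    """Client-side pagination over the sorted rank qualifiers."""
    ranked = sorted(row["influencerId"].items())
    start = page * page_size
    return [influencer_id for _, influencer_id in ranked[start:start + page_size]]


first_page = page_of_influencers(saved_search_row, page=0)
last_page = page_of_influencers(saved_search_row, page=2)
```

Zero-padding the rank keeps lexicographic order equal to numeric order, which matters when qualifiers are compared as strings or bytes. The pagination logic lives in the application here, which is the kind of code a store without built-in support forces you to own.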
  83. Problem: no out-of-the-box row-based indexing and pagination
  84. Jumping right into the code
  85. Lessons Learned
     Challenges: Complexity, Missing Features, Problem/solution fit, Resources
     Rewards: Choices, Empowering, Community, Cost
  86. A few months later…
  87. Need to upgrade to HBase 0.90
     • Making sure to remain on recent code base
     • Performance improvements
     • Mostly to get the latest bug fixes
     No thanks!
  88. Looks like something is missing
  89. None
  90. Our DB indexes depend on this!
  91. Let’s get this straight
     • HBase no longer comes with secondary indexing out-of-the-box
     • It’s been moved out of the trunk to GitHub
     • Where only one other company besides us seems to care about it
  92. Only one other maintainer besides us
  93. What it means to a startup
     Development capacity
     Congrats, you are now an HBase contrib maintainer…
  94. Source: socialbutterflyclt.com
  95. Lessons Learned
     Challenges: Complexity, Missing Features, Problem/solution fit, Resources
     Rewards: Choices, Empowering, Community, Cost
  96. Homegrown HBase Indexes
     Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters
     Row ids for Posts
  97. Homegrown HBase Indexes
     Find posts for influencer_id_1234
     Row ids for Posts
  98. Homegrown HBase Indexes
     Find posts for influencer_id_5678
     Row ids for Posts
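The prefix trick works because rows are kept sorted by key, so every row sharing a prefix sits in one contiguous range. The Python sketch below simulates that with a sorted list and binary search instead of a real HBase scan; the row id format (influencer id, a "|" delimiter, then a post id) is an assumption made for illustration.

```python
from bisect import bisect_left

# Hypothetical post row ids, kept sorted the way a row-key-ordered store
# would keep them.
ROW_IDS = sorted([
    "influencer_id_1234|post_001",
    "influencer_id_1234|post_002",
    "influencer_id_12345|post_001",
    "influencer_id_5678|post_001",
    "influencer_id_5678|post_002",
])


def stop_row_for(prefix):
    """Smallest key greater than every key that starts with prefix."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def scan(row_ids, start_row, stop_row):
    """Range scan: start row inclusive, stop row exclusive."""
    lo = bisect_left(row_ids, start_row)
    hi = bisect_left(row_ids, stop_row)
    return row_ids[lo:hi]


def posts_for(influencer_id):
    # The trailing delimiter keeps 1234 from also matching 12345.
    prefix = influencer_id + "|"
    return scan(ROW_IDS, prefix, stop_row_for(prefix))


posts = posts_for("influencer_id_1234")
```

The lookup touches only the matching range rather than the whole table. The cost is that the row key design has to anticipate every access path, since each key layout serves exactly one kind of prefix query.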
  99. Homegrown HBase Indexes
     • No longer depending on unmaintained code
     • Work with out-of-the-box HBase installation
  100. What it means to a startup
     Development capacity
     You are back, but you still need to maintain indexing logic
  101. A few months later…

  102. Cracks in the data model
     Sites: huffingtonpost.com, huffingtonpost.com
     Posts:
     http://www.huffingtonpost.com/arianna-huffington/post_1.html
     http://www.huffingtonpost.com/arianna-huffington/post_2.html
     http://www.huffingtonpost.com/arianna-huffington/post_3.html
     http://www.huffingtonpost.com/shaun-donovan/post1.html
     http://www.huffingtonpost.com/shaun-donovan/post2.html
     http://www.huffingtonpost.com/shaun-donovan/post3.html
     Relationship labels (each shown twice): writes for, authored by, published under
  103. Cracks in the data model
     (Same diagram as slide 102.)
     Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
  104. Cracks in the data model
     (Same diagram as slide 102.)
     Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
  105. Cracks in the data model
     (Same diagram as slide 102.)
     Exacerbated when we started tracking people’s content on a daily basis in mid-2011
  106. Fixing the cracks in the data model
     Site: huffingtonpost.com
     Posts: (same six post URLs as slide 102)
     Relationship labels (each shown twice): writes for, authored by, published under
     Normalize the sites
  107. Fixing the cracks in the data model
     • Normalization requires stronger secondary indexing
     • Our application layer indexing would need revisiting…again!
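To see why normalization leans on secondary indexing, consider the question "who writes for this site?" once each site is stored only once. Without an index the store would have to scan every influencer record, so the application ends up maintaining something like the inverted index sketched below in Python. All ids and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical normalized data: each site is stored once and influencers
# point at it by id.
sites = {"site_huffpo": {"url": "huffingtonpost.com"}}
influencers = {
    "inf_arianna": {"siteIds": ["site_huffpo"]},
    "inf_shaun": {"siteIds": ["site_huffpo"]},
    "inf_other": {"siteIds": []},
}


def build_site_index(influencers):
    """Secondary index: site id -> ids of influencers who write for it."""
    index = defaultdict(set)
    for influencer_id, record in influencers.items():
        for site_id in record["siteIds"]:
            index[site_id].add(influencer_id)
    return index


site_index = build_site_index(influencers)
writers = sorted(site_index["site_huffpo"])
```

Keeping such an index correct on every insert, update and delete is the maintenance burden that a datastore with built-in secondary indexes takes off the application.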
  108. What it means to a startup
     Development capacity
     Psych! You are back to writing indexing code.
  109. Source: socialbutterflyclt.com
  110. Lessons Learned
     Challenges: Complexity, Missing Features, Problem/solution fit, Resources
     Rewards: Choices, Empowering, Community, Cost
  111. Traackr’s Datastore Requirements (Revisited)
     • Schema flexibility
     • Good at storing lots of variable length text
     • Out-of-the-box SECONDARY INDEX support!
     • Simple to use and administer
  112. NoSQL picking – Round 2 (mid 2011)
     (Same four database categories as slide 58.)
  113. NoSQL picking – Round 2 (mid 2011)
     Nope!
  114. NoSQL picking – Round 2 (mid 2011)
     Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
  115. NoSQL picking – Round 2 (mid 2011)
     Memcache: still no.
  116. NoSQL picking – Round 2 (mid 2011)
     Amazon SimpleDB: still no.
  117. NoSQL  picking  –  Round  2  (mid  2011)   Key/Value  Databases

      •  Distributed  hashtables   •  Designed  for  high  load   •  In-­‐memory  or  on-­‐disk   •  Eventually  consistent   Column  Databases   •  Spread  sheet  like   •  Key  is  a  row  id   •  Adributes  are  columns   •  Columns  can  be  grouped   into  families   Document  Databases   •  Like  Key/Value   •  Value  =  Document   •  Document  =  JSON/BSON   •  JSON  =  Flexible  Schema   Graph  Databases   •  Graph  Theory  G=(E,V)   •  Great  for  modeling   networks   •  Great  for  graph-­‐based   query  algorithms     Not  willing  to  store  our  data  in  a   proprietary  datastore.   Redis  and  LinkedIn’s  Project   Voldermort:  s;ll  no  
  118. NoSQL  picking  –  Round  2  (mid  2011)   Key/Value  Databases

      •  Distributed  hashtables   •  Designed  for  high  load   •  In-­‐memory  or  on-­‐disk   •  Eventually  consistent   Column  Databases   •  Spread  sheet  like   •  Key  is  a  row  id   •  Adributes  are  columns   •  Columns  can  be  grouped   into  families   Document  Databases   •  Like  Key/Value   •  Value  =  Document   •  Document  =  JSON/BSON   •  JSON  =  Flexible  Schema   Graph  Databases   •  Graph  Theory  G=(E,V)   •  Great  for  modeling   networks   •  Great  for  graph-­‐based   query  algorithms     CouchDB:  more  mature  but  s;ll   no  ad-­‐hoc  queries.  
  119. NoSQL  picking  –  Round  2  (mid  2011)   Key/Value  Databases

      •  Distributed  hashtables   •  Designed  for  high  load   •  In-­‐memory  or  on-­‐disk   •  Eventually  consistent   Column  Databases   •  Spread  sheet  like   •  Key  is  a  row  id   •  Adributes  are  columns   •  Columns  can  be  grouped   into  families   Document  Databases   •  Like  Key/Value   •  Value  =  Document   •  Document  =  JSON/BSON   •  JSON  =  Flexible  Schema   Graph  Databases   •  Graph  Theory  G=(E,V)   •  Great  for  modeling   networks   •  Great  for  graph-­‐based   query  algorithms     Cassandra:  matured  quite  a  bit,  added   secondary  indexes  and  batch  processing   op;ons  but  more  restric;ve  in  its’  use  than   other  solu;ons.  AHer  the  Hbase  lesson,   simplicity  of  use  was  now  more  important.  
  120. NoSQL  picking  –  Round  2  (mid  2011)   Key/Value  Databases

      •  Distributed  hashtables   •  Designed  for  high  load   •  In-­‐memory  or  on-­‐disk   •  Eventually  consistent   Column  Databases   •  Spread  sheet  like   •  Key  is  a  row  id   •  Adributes  are  columns   •  Columns  can  be  grouped   into  families   Document  Databases   •  Like  Key/Value   •  Value  =  Document   •  Document  =  JSON/BSON   •  JSON  =  Flexible  Schema   Graph  Databases   •  Graph  Theory  G=(E,V)   •  Great  for  modeling   networks   •  Great  for  graph-­‐based   query  algorithms     Riak:  strong  contender  s;ll  but   adop;on  ques;ons  remained.  
  121. NoSQL picking – Round 2 (mid 2011)

      MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
  122. Lessons Learned

      Challenges
      - Complexity
      - Missing Features
      - Problem solution fit
      - Resources

      Rewards
      - Choices
      - Empowering
      - Community
      - Cost
  123. Immediate Benefits

      • No more maintaining custom application-layer secondary indexing code
  124. What it means to a startup

      Development capacity
      Yay! I’m back!
  125. Immediate Benefits

      • No more maintaining custom application-layer secondary indexing code
      • Single binary installation greatly simplifies administration
  126. What it means to a startup

      Development capacity
      Honestly, I thought I’d never see you guys again!
  127. Immediate Benefits

      • No more maintaining custom application-layer secondary indexing code
      • Single binary installation greatly simplifies administration
      • Our NoSQL could now support our domain model
  128. many-to-many relationship

  129. Modeling an influencer

      Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)

      {
        "_id": "770cf5c54492344ad5e45ˆ791ae5d52",
        "realName": "David Chancogne",
        "title": "CTO",
        "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
        "primaryAffiliation": "Traackr",
        "email": "dchancogne@traackr.com",
        "location": "Cambridge, MA, United States",
        "siteReferences": [
          {
            "siteId": "b31236da306270dc2b5db34e943af88d",
            "contribution": 0.25
          },
          {
            "siteId": "602dc370945d3b3480fff4f2a541227c",
            "contribution": 1.0
          }
        ]
      }
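To make the embedded-reference model concrete, here is a minimal sketch in plain JavaScript of resolving an influencer's siteReferences against site documents. The `sites` array, its `name` field, and the `resolveSites` helper are hypothetical stand-ins for illustration only; the slides show just the influencer document, so the shape of the site documents is an assumption.

```javascript
// Hypothetical site documents: only the two ids come from the slide;
// the `name` field is invented for illustration.
const sites = [
  { _id: "b31236da306270dc2b5db34e943af88d", name: "Example site A" },
  { _id: "602dc370945d3b3480fff4f2a541227c", name: "Example site B" },
];

// Trimmed copy of the influencer document from the slide.
const influencer = {
  realName: "David Chancogne",
  siteReferences: [
    { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
    { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
  ],
};

// Join each embedded reference with its site document, keeping the
// influencer-specific attribute (contribution) next to the shared site data.
function resolveSites(influencer, sites) {
  const sitesById = new Map(sites.map((site) => [site._id, site]));
  return influencer.siteReferences
    .filter((ref) => sitesById.has(ref.siteId)) // skip dangling references
    .map((ref) => ({
      ...sitesById.get(ref.siteId),
      contribution: ref.contribution,
    }));
}

const resolved = resolveSites(influencer, sites);
console.log(resolved);
```

The point of the design is that attributes belonging to the relationship itself, such as contribution, live on the embedded reference, while attributes of the site stay on the site document and are shared by every influencer that references it.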
  130. Modeling an influencer

      siteId indexed for “find influencers connected to site X”

      > db.influencers.ensureIndex({"siteReferences.siteId": 1});
      > db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
      {
        "_id": "770cf5c54492344ad5e45ˆ791ae5d52",
        "realName": "David Chancogne",
        "title": "CTO",
        "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
        "primaryAffiliation": "Traackr",
        "email": "dchancogne@traackr.com",
        "location": "Cambridge, MA, United States",
        "siteReferences": [
          {
            "siteId": "b31236da306270dc2b5db34e943af88d",
            "contribution": 0.25
          },
          {
            "siteId": "602dc370945d3b3480fff4f2a541227c",
            "contribution": 1.0
          }
        ]
      }
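Conceptually, an index on a field inside an array holds one entry per array element, so a single influencer is reachable from every site it references. The sketch below models that idea in plain JavaScript as an inverted map from siteId to influencer ids. It is an illustration of the concept under stated assumptions, not MongoDB's actual index implementation; the helper names and the `inf-1`/`inf-2` ids are hypothetical.

```javascript
// Hypothetical sample data: the influencer ids are invented; the siteIds
// are the two shown on the slide.
const influencers = [
  {
    _id: "inf-1",
    siteReferences: [
      { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
    ],
  },
  {
    _id: "inf-2",
    siteReferences: [
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 0.5 },
    ],
  },
];

// Build an inverted index: one entry per (siteId, influencer) pair.
function buildSiteIndex(influencers) {
  const index = new Map();
  for (const influencer of influencers) {
    for (const ref of influencer.siteReferences || []) {
      if (!index.has(ref.siteId)) {
        index.set(ref.siteId, new Set());
      }
      index.get(ref.siteId).add(influencer._id);
    }
  }
  return index;
}

// "Find influencers connected to site X" becomes a single map lookup.
function findInfluencerIdsBySite(index, siteId) {
  return [...(index.get(siteId) || [])];
}

const siteIndex = buildSiteIndex(influencers);
const connected = findInfluencerIdsBySite(
  siteIndex,
  "602dc370945d3b3480fff4f2a541227c"
);
console.log(connected);
```

Keeping a structure like this in sync on every insert, update, and delete is the kind of bookkeeping that application-layer secondary indexing requires, which is the maintenance burden the earlier "Immediate Benefits" slides describe going away.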
  131. Other Benefits

      • Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
      • Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
      • Great documentation
      • Great adoption and community
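As an illustration of the first bullet, an ad hoc report of the kind described can be a few lines of JavaScript. The sketch below runs over an in-memory array so that it is self-contained; in the mongo shell the same loop could plausibly iterate over a cursor from db.influencers.find() instead. The sample data, the `site-a`/`site-b` ids, and the `siteReport` helper are hypothetical and not taken from the talk.

```javascript
// Hypothetical sample data standing in for the influencers collection.
const influencers = [
  {
    realName: "Influencer One",
    siteReferences: [
      { siteId: "site-a", contribution: 0.25 },
      { siteId: "site-b", contribution: 1.0 },
    ],
  },
  {
    realName: "Influencer Two",
    siteReferences: [{ siteId: "site-a", contribution: 0.75 }],
  },
];

// Report: for each site, how many influencers reference it and their
// average contribution.
function siteReport(influencers) {
  const totals = new Map();
  for (const influencer of influencers) {
    for (const ref of influencer.siteReferences || []) {
      const entry = totals.get(ref.siteId) || { count: 0, sum: 0 };
      entry.count += 1;
      entry.sum += ref.contribution;
      totals.set(ref.siteId, entry);
    }
  }
  return [...totals].map(([siteId, { count, sum }]) => ({
    siteId,
    influencers: count,
    averageContribution: sum / count,
  }));
}

const report = siteReport(influencers);
console.log(report);
```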
  132. Looks like we found the right fit!

  133. We have more of this

      Development capacity

  134. And less of this

      Source: socialbutterflyclt.com

  135. Recap & Final Thoughts

      • 3 Vs of Big Data:
        – Volume
        – Velocity
        – Variety ← Traackr
      • Big Data technologies are complementary to SQL and RDBMS
      • Until machines can think for themselves, Data Science will be increasingly important
  136. Recap & Final Thoughts

      • Be prepared to deal with less mature tech
      • Be as flexible as the data => fearless refactoring
      • Importance of ease of use and administration cannot be overstated for a small startup
  137. Q&A