Big Data Engineering - Top 10 Pragmatics

ksankar
April 27, 2012

PhD Guest Talk at the Naval Postgraduate School

Transcript

  1. Krishna Sankar, http://doubleclix.wordpress.com

    EC4000 – PhD Guest Seminar, Naval Postgraduate School, April 27, 2012

    The road lies plain before me;--'tis a theme
    Single and of determined bounds; …
    - Wordsworth, The Prelude
  2. What is Big Data? Big Data to smart data

    Big Data Pipeline
    Analytic Algorithms
    Storage - NOSQL
    Processing - Hadoop
    Cloud Architectures
    Analytics/Modeling
    R
    Visualization

    Agenda
    o To cover the broad picture
    o Touch upon instances of the technologies employed
    o Of the Big Data domain …
  3. Thanks to … The giants whose shoulders I am standing on

    Special thanks to:
    Peter Ateshian, NPS
    Prof Murali Tummala, NPS
    Shirley Bailes, O’Reilly
    Ed Dumbill, O’Reilly
    Jeff Barr, AWS
    Jenny Kohr Chynoweth, AWS
  4. Porcelain vs. Plumbing

    • The balance is always interesting …
    • This talk has both
    • Would be happy to dive deep into plumbing topics like Hadoop, R, MongoDB, Cassandra et al. …
  5. ① Volume
        o Scale

    ② Velocity
        o Data change rate vs. decision window
    ③ Variety
        o Different sources & formats
        o Structured vs. Unstructured
    ④ Variability
        o Breadth of interpretation &
        o Depth of analytics

    EBC322
    http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
    http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
    http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence
  9. ① Volume
        o Scale

    ② Velocity
        o Data change rate vs. decision window
    ③ Variety
        o Different sources & formats
        o Structured vs. Unstructured
    ④ Variability
        o Breadth of interpretation &
        o Depth of analytics
    ⑤ Contextual
        o Dynamic variability
        o Recommendation
    ⑥ Connectedness

    EBC322
    http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
    http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
  10. • “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker

    • 15 TB memory, across 90 IBM 760 servers, in 10 racks
    • 1 TB of dataset
    • 200 million pages processed by Hadoop
    • This is a good example of Connected data
        – Contextual w/ variability
        – Breadth of interpretation
        – Analytics depth

    http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
    http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
  11. Volume, Velocity, Variety, Variability, Connectedness, Context, Model, Infer-ability

    Decomplexify! Contextualize! Network! Reason! Infer!

    o Logs, Scribe, Flume, Storm, Hadoop …
    o SQL, NOSQL, HDFS, XML, files, …
    o SQL, BI Tools, Hadoop, Pig, Hive, .NET Dryad, various other tools
    o Internal dashboards, Tableau
    o Hand-coded programs, R, Mahout, …

    Ref: http://goo.gl/Mm83k
  12. Twitter

        § 200 million tweets/day
        § Peak 10,000/second
        § How would you handle the fire hose for social network analytics? http://goo.gl/dcBsQ

    Storage
        § 4U box = 40 TB
        § 1 PB = 25 boxes!

    Zynga
        § “Analytics company, not a gaming company!”
        § Harvests data: 15 TB/day
        § Test new features
        § Target advertising
        § 230 million players/month

    AWS – 900 billion objects!
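The figures on this slide lend themselves to a quick back-of-the-envelope check. The Python sketch below derives the average tweet rate implied by 200 million tweets/day, compares it with the quoted peak, and confirms the 25-box figure for 1 PB. It assumes decimal units (1 PB = 1,000 TB) and an even spread of tweets across the day; the function and variable names are illustrative and do not come from the talk.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(events_per_day):
    """Average events per second, assuming an even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


def boxes_needed(total_tb, tb_per_box):
    """Whole storage boxes required to hold total_tb (rounded up)."""
    return math.ceil(total_tb / tb_per_box)


avg_tweets = average_rate_per_second(200e6)  # about 2,315 tweets/second
peak_to_avg = 10_000 / avg_tweets            # peak is about 4.3x the average
boxes_for_1pb = boxes_needed(1_000, 40)      # 25 boxes, decimal units assumed
zynga_tb_per_year = 15 * 365                 # 5,475 TB, roughly 5.5 PB/year

print(f"average tweets/s: {avg_tweets:.0f}")
print(f"peak / average:   {peak_to_avg:.1f}x")
print(f"4U boxes for 1 PB: {boxes_for_1pb}")
print(f"Zynga TB/year:     {zynga_tb_per_year}")
```

The gap between average and peak rate is the practical point: a fire-hose consumer has to be sized for the peak, not the daily mean.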
  13. • 6 billion messages per day

    • 2 PB (w/ compression) online
    • 6 PB w/ replication
    • 250 TB/month growth
    • HBase infrastructure
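A few relationships are implicit in these numbers. The rough Python check below rests on assumptions the slide does not state: that the 6 PB figure is the 2 PB compressed data set multiplied by a replication factor, that messages arrive evenly through the day, and that units are decimal (1 PB = 1,000 TB).

```python
SECONDS_PER_DAY = 86_400

messages_per_day = 6e9
compressed_pb = 2
replicated_pb = 6
growth_tb_per_month = 250

# Assumption: replicated size = compressed size x replication factor.
implied_replication = replicated_pb / compressed_pb            # 3.0 copies
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY   # about 69,444
growth_pb_per_year = growth_tb_per_month * 12 / 1_000          # 3.0 PB/year

print(f"implied replication factor: {implied_replication:.0f}x")
print(f"average messages/s: {avg_messages_per_second:,.0f}")
print(f"growth: {growth_pb_per_year:.1f} PB/year")
```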
  14. eBay Extreme Analytics Architecture

    Path Analysis
    A/B Testing
    50 TB/day
    240 nodes, 84 PB
    Teradata installation
    Very systematic
    Diagram speaks volumes!

    Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
  15. Collect: Splunk, Scribe, Flume, Storm

    Store: NOSQL, Cassandra, MongoDB, HBase, Neo4j
    Transform & Analyze: Hadoop, Pig/Hive, R
    Model & Reason: R, Mahout, BI Tools
    Predict, Recommend & Visualize: D3.js, Tableau, Dashboard

    When I think of my own native land,
    In a moment I seem to be there;
    But, alas! recollection at hand
    Soon hurries me back to despair.
    - Cowper, The Solitude of Alexander Selkirk
  16. NOSQL

    Graph: Neo4j, FlockDB, InfiniteGraph
    Document: CouchDB, MongoDB, Lotus Domino, Riak
    Column: Google BigTable, HBase, Cassandra, HyperTable
    Key Value (In-memory, Disk Based): SimpleDB, Memcached, Redis, Tokyo Cabinet, Dynamo, Voldemort, Azure TS
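To make the four NOSQL categories on this slide concrete, here is a minimal Python sketch that holds the same hypothetical user record in each shape. It uses plain in-memory dicts and lists only; it is not the API of any product named above, and the keys, names, and `followed_by` helper are invented for illustration.

```python
import json

# Key-value: the store sees one key and an opaque value (here a JSON string).
key_value = {"user:42": json.dumps({"name": "Ada", "follows": [7]})}

# Column: row key -> column family -> column name -> value.
column = {"user:42": {"profile": {"name": "Ada"}, "follows": {"7": True}}}

# Document: a nested, self-describing record whose fields can be queried.
document = {"_id": 42, "name": "Ada", "follows": [7]}

# Graph: nodes plus labelled edges; relationships are first-class data.
nodes = {42: {"name": "Ada"}, 7: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 7)]


def followed_by(user_id):
    """Graph-style traversal: ids of the users that user_id follows."""
    return [dst for src, label, dst in edges
            if src == user_id and label == "FOLLOWS"]


name_from_kv = json.loads(key_value["user:42"])["name"]  # client must decode
name_from_column = column["user:42"]["profile"]["name"]  # addressed by column
name_from_doc = document["name"]                         # addressed by field
followed_names = [nodes[n]["name"] for n in followed_by(42)]

print(name_from_kv, name_from_column, name_from_doc, followed_names)
```

The difference that matters is where structure lives: in the client for key-value, in column families for column stores, in the record for documents, and in the edges for graphs.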
  17. Infrastructure As A Service

    Platform As A Service
    Software As A Service
  18. Amazon – Canonical Cloud

    • S3 – Blob storage
    • DynamoDB – NOSQL
    • EMR – Elastic MapReduce
    • EC2 – Compute
    • 1% of Internet traffic
      http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/

    “Scalability is about building wider roads, not about building faster cars” – Steve Swartz
  19. • Social Network Analysis

    • Sentiment Analysis
    • Brand Strength
    • Citation/co-citation ≅ Followed by/Also Follows
    • Metrics
        – Network diameter,
        – Weak-ties,
        – Erdős-Rényi model &
        – Kronecker Graphs

    Tweets, Followers, Follow/Unfollow
    http://www.oscon.com/oscon2012/public/schedule/detail/23130
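Two of the metrics listed on this slide can be sketched in a few lines. The Python below builds an Erdős-Rényi G(n, p) random graph, where each possible edge is included independently with probability p, and measures its diameter with breadth-first search. It is a minimal standard-library sketch, not code from the talk; the function names, the seed parameter, and the choice to return `None` for a disconnected graph are assumptions made for illustration.

```python
import random
from collections import deque


def erdos_renyi(n, p, seed=0):
    """G(n, p): keep each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Longest shortest path in the graph; None if it is disconnected."""
    best = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) < len(adj):
            return None
        best = max(best, max(dist.values()))
    return best


graph = erdos_renyi(200, 0.05, seed=1)
print("diameter:", diameter(graph))
```

Running one BFS per node costs O(n * (n + m)), which is fine for a sketch but not for a follower graph at Twitter scale; there, sampled or approximate diameters are the usual compromise.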
  20. Was it a vision, or a waking dream?

    Fled is that music:--do I wake or sleep?
    - Keats, Ode to a Nightingale