Big Data Engineering - Top 10 Pragmatics

ksankar
April 27, 2012

PhD Guest Talk at the Naval Postgraduate School

Transcript

  1. Krishna Sankar, http://doubleclix.wordpress.com
    EC4000 – PhD Guest Seminar, Naval Postgraduate School, April 27, 2012
    "The road lies plain before me;--'tis a theme
    Single and of determined bounds; …"
    - Wordsworth, The Prelude
  2. Agenda
    o To cover the broad picture of the Big Data domain …
    o Touch upon instances of the technologies employed
    Topics:
    o What is Big Data?
    o Big Data to smart data
    o Big Data pipeline
    o Analytic algorithms
    o Storage - NOSQL
    o Processing - Hadoop
    o Cloud architectures
    o Analytics/modeling - R
    o Visualization
  3. Thanks to … the giants on whose shoulders I am standing
    Special thanks to:
    o Peter Ateshian, NPS
    o Prof Murali Tummala, NPS
    o Shirley Bailes, O’Reilly
    o Ed Dumbill, O’Reilly
    o Jeff Barr, AWS
    o Jenny Kohr Chynoweth, AWS
  4. Porcelain vs. Plumbing
    • The balance is always interesting …
    • This talk has both
    • Would be happy to dive deep into plumbing topics like Hadoop, R, MongoDB, Cassandra et al. …
  5. ① Volume
      o Scale
    ② Velocity
      o Data change rate vs. decision window
    ③ Variety
      o Different sources & formats
      o Structured vs. unstructured
    ④ Variability
      o Breadth of interpretation &
      o Depth of analytics
    EBC322
    http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
    http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
    http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence
  9. ① Volume
      o Scale
    ② Velocity
      o Data change rate vs. decision window
    ③ Variety
      o Different sources & formats
      o Structured vs. unstructured
    ④ Variability
      o Breadth of interpretation &
      o Depth of analytics
    ⑤ Contextual
      o Dynamic variability
      o Recommendation
    ⑥ Connectedness
    EBC322
    http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
    http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
  10. • “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
    • 15 TB memory, across 90 IBM 760 servers, in 10 racks
    • 1 TB of dataset
    • 200 million pages processed by Hadoop
    • This is a good example of connected data
      – Contextual w/ variability
      – Breadth of interpretation
      – Analytics depth
    http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
    http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
  11. Volume, Velocity, Variety, Variability, Connectedness, Context, Model, Infer-ability
    Decomplexify! Contextualize! Network! Reason! Infer!
    • Logs, Scribe, Flume, Storm, Hadoop …
    • SQL, NOSQL, HDFS, XML, files, …
    • SQL, BI tools, Hadoop, Pig, Hive, .NET Dryad, various other tools
    • Internal dashboards, Tableau
    • Hand-coded programs, R, Mahout, …
    Ref: http://goo.gl/Mm83k
  12. Twitter
    § 200 million tweets/day
    § Peak 10,000/second
    § How would you handle the fire hose for social network analytics? http://goo.gl/dcBsQ
    Storage
    § 4U box = 40 TB
    § 1 PB = 25 boxes!
    Zynga
    § “Analytics company, not a gaming company!”
    § Harvests data: 15 TB/day
    § Test new features
    § Target advertising
    § 230 million players/month
    AWS – 900 billion objects!
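
The figures on this slide can be checked with some back-of-envelope arithmetic. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes decimal units (1 PB = 1000 TB), which is what makes the "25 boxes" figure come out exactly.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_PB = 1000  # assumption: decimal units, as implied by "1 PB = 25 boxes"


def average_rate_per_second(events_per_day: float) -> float:
    """Average event rate, spread evenly over one day."""
    return events_per_day / SECONDS_PER_DAY


def boxes_needed(target_tb: float, tb_per_box: float) -> int:
    """Whole storage boxes required to hold target_tb of raw capacity."""
    return math.ceil(target_tb / tb_per_box)


# Figures quoted on the slide.
tweets_per_day = 200_000_000
peak_tweets_per_second = 10_000
tb_per_4u_box = 40
zynga_tb_per_day = 15

avg_tweets_per_second = average_rate_per_second(tweets_per_day)
peak_to_average = peak_tweets_per_second / avg_tweets_per_second
boxes_per_pb = boxes_needed(TB_PER_PB, tb_per_4u_box)
days_to_fill_one_box = tb_per_4u_box / zynga_tb_per_day

print(f"Average tweets/second: {avg_tweets_per_second:,.0f}")
print(f"Peak-to-average ratio: {peak_to_average:.1f}x")
print(f"Boxes per PB:          {boxes_per_pb}")
print(f"Days for 15 TB/day to fill one box: {days_to_fill_one_box:.1f}")
```

The peak rate is several times the daily average, so a pipeline sized for the average alone would fall behind during bursts.
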
  13. • 6 billion messages per day
    • 2 PB (w/ compression) online
    • 6 PB w/ replication
    • 250 TB/month growth
    • HBase infrastructure
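
The numbers on this slide imply a replication factor and an average message rate. The sketch below derives both. It also estimates how long the online data would take to double, but that part rests on a labeled assumption the slide does not confirm: that the 250 TB/month figure refers to compressed, pre-replication data and stays constant.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_PB = 1000  # assumption: decimal units

# Figures quoted on the slide.
messages_per_day = 6_000_000_000
online_pb = 2            # compressed, before replication
replicated_pb = 6        # after replication
growth_tb_per_month = 250

# Ratio of replicated to online storage.
implied_replication_factor = replicated_pb / online_pb

# Average rate if messages were spread evenly over the day.
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY

# Assumption (not stated on the slide): growth is measured on the
# compressed, pre-replication data and remains constant.
months_to_double_online = online_pb * TB_PER_PB / growth_tb_per_month

print(f"Implied replication factor: {implied_replication_factor:.0f}x")
print(f"Average messages/second:    {avg_messages_per_second:,.0f}")
print(f"Months to double (assumed): {months_to_double_online:.0f}")
```
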
  14. eBay Extreme Analytics Architecture
    • Path analysis
    • A/B testing
    • 50 TB/day
    • 240 nodes, 84 PB
    • Teradata installation
    • Very systematic
    • Diagram speaks volumes!
    Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
  15. • Collect: Splunk, Scribe, Flume, Storm
    • Store: NOSQL, Cassandra, MongoDB, HBase, Neo4j
    • Transform & Analyze: Hadoop, Pig/Hive, R
    • Model & Reason: R, Mahout, BI tools
    • Predict, Recommend & Visualize: D3.js, Tableau, Dashboard
    When I think of my own native land,
    In a moment I seem to be there;
    But, alas! recollection at hand
    Soon hurries me back to despair.
    - Cowper, The Solitude Of Alexander Selkirk
  16. NOSQL
    • Key Value (in-memory, disk based): SimpleDB, Memcached, Redis, Tokyo Cabinet, Dynamo, Voldemort, Azure TS
    • Column: Google BigTable, HBase, Cassandra, HyperTable
    • Document: CouchDB, MongoDB, Lotus Domino, Riak
    • Graph: Neo4j, FlockDB, InfiniteGraph
  17. • Infrastructure as a Service
    • Platform as a Service
    • Software as a Service
  18. Amazon – Canonical Cloud
    • S3 – Blob storage
    • DynamoDB – NOSQL
    • EMR – Elastic MapReduce
    • EC2 – Compute
    • 1% of Internet traffic
    http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/
    “Scalability is about building wider roads, not about building faster cars” – Steve Swartz
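
EMR runs Hadoop MapReduce jobs, and the programming model behind it can be sketched in a few lines. The single-process Python example below is only an illustration of the map, shuffle, and reduce phases using word count. It does not use the Hadoop or EMR APIs, and the function names and sample pages are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input.
pages = ["big data to smart data", "smart data pipeline"]
counts = word_count(pages)
print(counts)
```

Each call to `map_phase` depends only on its own document, and each call to `reduce_phase` only on its own key, so in a real cluster both phases can be spread across many machines. That is one way to read the "wider roads" quote above.
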
  19. • Social Network Analysis
    • Sentiment Analysis
    • Brand Strength
    • Citation/co-citation ≅ Followed by/Also Follows
    • Metrics
      – Network diameter,
      – Weak ties,
      – Erdős-Rényi model &
      – Kronecker graphs
    Tweets, Followers, Follow/Unfollow
    http://www.oscon.com/oscon2012/public/schedule/detail/23130
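
Two of the metrics listed above, network diameter and the Erdős-Rényi model, are easy to make concrete. The sketch below is a minimal pure-Python illustration, not the analysis used in the talk. It builds a G(n, p) random graph, where each possible edge is included independently with probability p, and computes the diameter by breadth-first search from every node. The function names and parameter values are hypothetical.

```python
import random
from collections import deque
from itertools import combinations
from typing import Dict, Optional, Set

Graph = Dict[int, Set[int]]


def erdos_renyi(n: int, p: float, seed: Optional[int] = None) -> Graph:
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    adj: Graph = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def eccentricity(adj: Graph, source: int) -> Optional[int]:
    """Longest shortest-path distance from source; None if some node is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    if len(dist) < len(adj):
        return None
    return max(dist.values())


def diameter(adj: Graph) -> Optional[int]:
    """Largest eccentricity over all nodes; None for a disconnected graph."""
    eccentricities = [eccentricity(adj, v) for v in adj]
    if any(e is None for e in eccentricities):
        return None
    return max(eccentricities, default=0)


# A path 0-1-2-3 has diameter 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print("Path diameter:", diameter(path))

# A random follower-style graph; the result depends on the seed.
graph = erdos_renyi(50, 0.2, seed=42)
print("Random graph diameter:", diameter(graph))
```

Running BFS from every node costs O(n·(n + m)) for n nodes and m edges. That is fine for a toy graph, but a graph at Twitter scale would need sampling or approximation instead.
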
  20. Was it a vision, or a waking dream?
    Fled is that music:—do I wake or sleep?
    - Keats, Ode to a Nightingale