Big Data Pipeline: Analytic Algorithms, Storage - NoSQL, Processing - Hadoop, Cloud Architectures, Analytics/Modeling, R, Visualization
o Agenda
o To cover the broad picture
o Touch upon instances of the technologies employed
o Of the Big Data domain …
rate vs. decision window
③ Variety
o Different sources & formats
o Structured vs. Unstructured
④ Variability
o Breadth of interpretation &
o Depth of analytics
EBC322
http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence
⑤ Contextual
o Dynamic variability
o Recommendation
⑥ Connectedness
the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
• 15 TB memory, across 90 IBM Power 750 servers, in 10 racks
• 1 TB of dataset
• 200 million pages processed by Hadoop
• This is a good example of connected data
– Contextual w/ variability
– Breadth of interpretation
– Analytics depth
http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
would you handle the fire hose for social network analytics? http://goo.gl/dcBsQ
Storage
§ 4U box = 40 TB
§ 1 PB = 25 boxes!
Zynga
§ “Analytics company, not a gaming company!”
§ Harvests data: 15 TB/day
§ Test new features
§ Target advertising
§ 230 million players/month
AWS – 900 billion objects!
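The storage arithmetic on this slide can be checked with a short sketch. The 40 TB box and the 15 TB/day harvest rate come from the slide. Decimal units (1 PB = 1000 TB), the function names, and the idea of feeding the Zynga rate into these boxes are assumptions made here for illustration. Replication and compression are ignored.

```python
import math

BOX_CAPACITY_TB = 40   # from the slide: one 4U box holds 40 TB
DAILY_INGEST_TB = 15   # from the slide: Zynga harvests 15 TB/day
TB_PER_PB = 1000       # assumption: decimal units, 1 PB = 1000 TB


def boxes_needed(capacity_tb: float) -> int:
    """Number of 40 TB boxes required to hold capacity_tb (rounded up)."""
    return math.ceil(capacity_tb / BOX_CAPACITY_TB)


def days_to_fill(capacity_tb: float, ingest_tb_per_day: float = DAILY_INGEST_TB) -> float:
    """Days until capacity_tb is full at a constant ingest rate (no replication)."""
    return capacity_tb / ingest_tb_per_day


if __name__ == "__main__":
    print(boxes_needed(TB_PER_PB))            # boxes per petabyte
    print(round(days_to_fill(TB_PER_PB), 1))  # days for 15 TB/day to fill 1 PB
```

Under these assumptions a petabyte is 25 boxes, matching the slide, and a 15 TB/day stream would fill it in a little over two months.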
Store: Cassandra, MongoDB, HBase, Neo4j
Transform & Analyze: Hadoop, Pig/Hive, R
Model & Reason: R, Mahout, BI Tools
Predict, Recommend & Visualize: D3.js, Tableau, Dashboard

When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.
- Cowper, The Solitude of Alexander Selkirk
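The four stages on this slide (store, transform & analyze, model & reason, predict/recommend) can be pictured as a chain of functions. This is only a toy sketch under stated assumptions: every function name is hypothetical, plain in-memory Python stands in for the real tools (a list for the data store, a `Counter` for the model), and the sample events are invented.

```python
from collections import Counter
from typing import Iterable, List


def store(raw_events: Iterable[str]) -> List[str]:
    """Stage 1 (Store): stand-in for a NoSQL store; keeps events in a list."""
    return list(raw_events)


def transform(records: List[str]) -> List[str]:
    """Stage 2 (Transform & Analyze): normalize case, drop empty records."""
    return [r.strip().lower() for r in records if r.strip()]


def model(cleaned: List[str]) -> Counter:
    """Stage 3 (Model & Reason): a trivial frequency model."""
    return Counter(cleaned)


def recommend(counts: Counter, top_n: int = 2) -> List[str]:
    """Stage 4 (Predict, Recommend & Visualize): the most frequent items."""
    return [item for item, _ in counts.most_common(top_n)]


def run_pipeline(raw_events: Iterable[str], top_n: int = 2) -> List[str]:
    """Chain the four stages in the order shown on the slide."""
    return recommend(model(transform(store(raw_events))), top_n)


if __name__ == "__main__":
    # Hypothetical event stream: names of games played
    events = ["Farm ", "poker", "farm", "", "Poker", "farm"]
    print(run_pipeline(events))
```

Each stage only depends on the output of the previous one, which is what lets the real tools at each stage be swapped independently.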
Neo4j FlockDB InfiniteGraph CouchDB MongoDB Lotus Domino Riak Google BigTable HBase Cassandra HyperTable In-memory Disk Based SimpleDB Memcached Redis Tokyo Cabinet Dynamo Voldemort Azure TS
• DynamoDB – NoSQL
• EMR – Elastic MapReduce
• EC2 – Compute
• 1% of Internet traffic
http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/
“Scalability is about building wider roads, not about building faster cars” – Steve Swartz