
Big Data - Should You Care?


Big Data is a buzz-phrase that keeps getting more popular. Are you wondering whether you should care about all the hype? I did, not so long ago, and the answer for me is a big yes. In this talk, I'll show you what I found so appealing about this topic that it pushed me to make quite a radical shift in my career as a developer. At the beginning of my journey, my understanding of Big Data was, as I quickly realized, very limited. That's why I'd like to give you a bird's-eye view of the different sub-areas and specializations that Big Data entails. We'll touch on topics like infrastructure, engineering, data science and visualization. We'll see some real-world use cases. We'll learn a bit about the algorithmic aspects of dealing with large amounts of data. We'll take a look at the recent proliferation of Big Data-related technologies, products and vendors, both open-source and commercial. And finally, we'll get our hands dirty with a live coding demo.


Marek Stój

March 25, 2015

Transcript

  1. @MarekStoj marek.stoj@gmail.com

  2. • Passionate about software engineering and computer science
    • Worked as a .NET software developer @ KRD
    • Working as a Big Data Analytics Engineer @ Credit Suisse
  3. • The Hype • Fundamentals • Bigger Picture

  4. None
  5. None
  6. Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. ~ Dan Ariely
  7. None
  8. None
  9. 2.5 exabytes (2.5 million terabytes) of data created every day

  10. 1 terabyte of trade information during each trading session on the New York Stock Exchange
  11. Every month:
    • 13 million hours of YouTube videos
    • 12 billion Tweets
    • 30 billion Facebook posts
  12. Source: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

  13. Game Dev Data Apps

  14. Game Dev Big Data Analytics

  15. None
  16. • 2000: Google Search Engine running on commodity hardware
    • 2003: Google paper, “Google File System”
    • 2004: Google paper, “MapReduce – Simplified Data Processing on Large Clusters”
  17. • Development started in 2005 at Yahoo
    • Google File System -> Hadoop Distributed File System
    • MapReduce -> Hadoop MapReduce
  18. None
  19. • Designed for very large files (GB, TB, PB)
    • Distributed
    • Scalable
    • Reliable
    • Portable
  20. Source: https://yoyoclouds.wordpress.com/2011/12/15/hdfsarchitecture/

  21. Programming model & implementation for processing large data sets. An abstraction layer for programmers.
  22. Map(key1, value1) -> list(key2, value2)
    Reduce(key2, list(value2)) -> list(value3)
  23. function map(string documentName, string document):
        for each word in document:
          emit(word, 1)

      function reduce(string word, iterator<int> partialCounts):
        sum = 0
        for each partialCount in partialCounts:
          sum += partialCount
        emit(word, sum)
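
The word-count pseudocode on slide 23 can be tried out locally without a cluster. What follows is only a minimal single-process sketch in Python, not the demo code from the talk: the names `map_words`, `shuffle`, `reduce_counts` and `word_count` are made up for illustration, words are split on whitespace (an assumption), and the `shuffle` step stands in for the grouping by key that a framework such as Hadoop performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(document_name: str, document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for word in document.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (done by the framework on a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, partial_counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum the partial counts for a single word."""
    return word, sum(partial_counts)


def word_count(documents: Dict[str, str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in one process over named documents."""
    emitted: List[Tuple[str, int]] = []
    for name, text in documents.items():
        emitted.extend(map_words(name, text))
    grouped = shuffle(emitted)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = {"a.txt": "big data big hype", "b.txt": "data"}
    print(word_count(docs))
```

On a real cluster the map calls run in parallel on different machines and each reduce call receives every value emitted for its key; the sketch only mirrors that data flow.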
  24. Source: http://nagyadat.blog.hu/2014/03/12/mapreduce-rol_bovebben

  25. Source: https://www.youtube.com/watch?v=ht3dNvdNDzI

  26. None
  27. None
  28. Source: http://www.ongridventures.com/2012/10/23/the-big-data-landscape/

  29. Source: http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

  30. A = LOAD 'file1' AS (x, y, z);
      B = LOAD 'file2' AS (t, u, v);
      C = FILTER A BY y > 0;
      D = JOIN C BY x, B BY u;
      E = GROUP D BY z;
      F = FOREACH E GENERATE group, COUNT(D);
      STORE F INTO 'output';
  31. (Same Pig Latin script as on slide 30.)
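
To make the data flow of the Pig Latin script on slide 30 easier to follow, here is a rough in-memory equivalent in Python. It is a sketch under assumptions: the function name `run_pipeline` is hypothetical, the two inputs are plain lists of 3-tuples instead of files, no field is null, and the result is returned as a dictionary rather than stored to 'output'.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Row = Tuple[Hashable, float, Hashable]


def run_pipeline(file1_rows: List[Row], file2_rows: List[tuple]) -> Dict[Hashable, int]:
    """Mimic the script: filter A, inner-join with B, group by z, count."""
    # C = FILTER A BY y > 0;
    c = [(x, y, z) for (x, y, z) in file1_rows if y > 0]

    # D = JOIN C BY x, B BY u;  (an inner join: unmatched rows are dropped)
    b_by_u = defaultdict(list)
    for (t, u, v) in file2_rows:
        b_by_u[u].append((t, u, v))
    d = [row_c + row_b for row_c in c for row_b in b_by_u.get(row_c[0], [])]

    # E = GROUP D BY z;  F = FOREACH E GENERATE group, COUNT(D);
    counts: Dict[Hashable, int] = defaultdict(int)
    for row in d:
        counts[row[2]] += 1  # index 2 is z, taken from the A side of the join
    return dict(counts)


if __name__ == "__main__":
    a = [(1, 5, "a"), (2, -1, "a"), (3, 2, "b"), (1, 1, "b")]
    b = [("t1", 1, "v"), ("t2", 1, "v"), ("t3", 3, "v")]
    print(run_pipeline(a, b))
```

Pig compiles the same seven lines into jobs that run across a cluster, which is the point of the slide: the script stays short while the execution is distributed.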
  32. None
  33. None
  34. None
  35. None
  36. Statistics + Mathematics + Computer Science = Predictive Analytics

  37. F(X) = Y

  38. Numerical Regression
    • expected lifespan
    • real-estate prices
    • players' performance
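
As a concrete reading of F(X) = Y for numerical regression, the sketch below fits a straight line to a handful of points with ordinary least squares and then uses it to predict a new value. It is an illustration only: `fit_line` and `predict` are hypothetical names, the single-feature linear model is an assumption, and the numbers are made up rather than real housing data.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model: Tuple[float, float], x: float) -> float:
    """Apply the learned function F to a new input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Made-up training data: x = flat size, y = price (arbitrary units).
    sizes = [1.0, 2.0, 3.0, 4.0]
    prices = [3.0, 5.0, 7.0, 9.0]
    model = fit_line(sizes, prices)
    print(model, predict(model, 5.0))
```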
  39. Classification
    • digit recognition
    • credit fraud detection
    • e-mail classification
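
Classification maps an input to one of a few labels instead of a number. One of the simplest ways to do that is a nearest-neighbour rule: give a new point the label of the closest known example. The sketch below is an assumption-laden illustration, not a method from the talk: `nearest_neighbor_predict` is a hypothetical name, and the two features and the "legit"/"fraud" labels are invented.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_neighbor_predict(training: List[Example], point: Sequence[float]) -> str:
    """Return the label of the training example closest to `point`."""
    if not training:
        raise ValueError("training set must not be empty")
    _, label = min(training, key=lambda ex: squared_distance(ex[0], point))
    return label


if __name__ == "__main__":
    # Invented features, e.g. (scaled amount, scaled distance from home).
    training = [
        ((0.0, 0.0), "legit"),
        ((0.1, 0.2), "legit"),
        ((5.0, 5.0), "fraud"),
    ]
    print(nearest_neighbor_predict(training, (4.5, 5.2)))
```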
  40. Clustering
    • market segmentation
    • news article categorization
    • movie recommendations
  41. None
  42. “Global organizational spending on Big Data exceeded $31 billion in 2013, and is predicted to reach $118 billion in 2018.”
    Source: http://ebooks.capgemini-consulting.com/cracking-the-data-conundrum/#/2/zoomed
  43. Source: https://www.linkedin.com/pulse/20140618231442-1287-too-much-stem-not-enough-computer-science

  44. Source: http://natemat.pl/105597,it-arystokracja

  45. Source: http://venublog.com/2012/12/10/data-science-vs-data-analytics/comment-page-1/

  46. None
  47. Slide deck available at http://goo.gl/C99kTO
    @MarekStoj marek.stoj@gmail.com
    Homework ;) http://goo.gl/XrMiaG

  48. • Demo Source Code https://github.com/marek-stoj/talk-demos/tree/master/mapreduce-wordcount
    • Big Data University http://bigdatauniversity.com/
    • Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
  49. • edX – Analytics Edge https://www.edx.org/course/analytics-edge-mitx-15-071x-0
    • Coursera – Machine Learning https://www.coursera.org/course/ml
    • Coursera – Mining Massive Datasets https://www.coursera.org/course/mmds
  50. • edX – Health in Numbers https://www.edx.org/course/health-numbers-quantitative-methods-harvardx-ph207x#.VOCWFfnF9yU
    • MIT Professional X – Tackling the Challenges of Big Data https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/about