Big Data - Should You Care?

Big Data is a buzz-phrase that has become increasingly popular. Are you wondering whether you should care about all the hype? I did, not so long ago, and the answer for me is a Big Yes. In this talk, I'll show you what I found so appealing about this topic that it pushed me to make quite a radical shift in my career as a developer. At the beginning of my journey, my understanding of Big Data was, as I quickly realized, very limited. That's why I'd like to give you a bird's-eye view of the different sub-areas and specializations that Big Data entails. We'll touch on topics like infrastructure, engineering, data science and visualization. We'll see some real-world use cases. We'll learn a bit about the algorithmic aspects of dealing with large amounts of data. We'll take a look at the recent proliferation of Big Data-related technologies, products and vendors, both open-source and commercial. And finally, we'll get our hands dirty with a live coding demo.

Marek Stój

March 25, 2015

Transcript

  1. • Passionate about software engineering and computer science
     • Worked as a .NET software developer @ KRD
     • Working as a Big Data Analytics Engineer @ Credit Suisse
  2. Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. ~ Dan Ariely
  3. 13 million hours of YouTube videos, 12 billion Tweets, 30 billion Facebook posts every month
  4. • 2000 - Google Search Engine running on commodity hardware
     • 2003 - Google paper: “Google File System”
     • 2004 - Google paper: “MapReduce - Simplified Data Processing on Large Clusters”
  5. • Development started in 2005 at Yahoo
     • Google File System -> Hadoop Distributed File System
     • MapReduce -> Hadoop MapReduce
  6. • Designed for very large files (GB, TB, PB)
     • Distributed
     • Scalable
     • Reliable
     • Portable
  7. function map(string documentName, string document):
         for each word in document:
             emit(word, 1)

     function reduce(string word, iterator<int> partialCounts):
         sum = 0
         for each partialCount in partialCounts:
             sum += partialCount
         emit(word, sum)
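
The word-count pseudocode on slide 7 can be tried out locally without a cluster. The sketch below is a minimal, single-process imitation in Python, not the talk's demo code: the names `map_words`, `shuffle`, `reduce_counts` and `word_count` are assumptions made for illustration, and the `shuffle` step stands in for the grouping by key that a MapReduce framework would perform between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(document_name: str, document: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every whitespace-separated word."""
    for word in document.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (done by the framework on a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, partial_counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum the partial counts collected for one word."""
    total = 0
    for partial_count in partial_counts:
        total += partial_count
    return word, total


def word_count(documents: Dict[str, str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over in-memory documents."""
    pairs = (
        pair
        for name, text in documents.items()
        for pair in map_words(name, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, made up for this sketch.
    docs = {"a.txt": "big data big hype", "b.txt": "big yes"}
    print(word_count(docs))
```

On a real cluster the map and reduce calls would run in parallel on different machines; here they run one after another, which is enough to see how the two functions fit together.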
  8. A = LOAD 'file1' AS (x, y, z);
     B = LOAD 'file2' AS (t, u, v);
     C = FILTER A BY y > 0;
     D = JOIN C BY x, B BY u;
     E = GROUP D BY z;
     F = FOREACH E GENERATE group, COUNT(D);
     STORE F INTO 'output';
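
To see what the script on slide 8 computes, it can help to trace the same steps on a few in-memory rows. The following is a rough Python sketch under stated assumptions: the function name `run_pipeline` and the sample rows are invented for illustration, tuples stand in for the `(x, y, z)` and `(t, u, v)` records, and details such as null handling are ignored.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple

RowA = Tuple[Hashable, float, Hashable]     # (x, y, z), as in relation A
RowB = Tuple[Hashable, Hashable, Hashable]  # (t, u, v), as in relation B


def run_pipeline(a_rows: List[RowA], b_rows: List[RowB]) -> Dict[Hashable, int]:
    """Mimic FILTER -> JOIN -> GROUP -> COUNT on small in-memory relations."""
    # C = FILTER A BY y > 0;
    c_rows = [row for row in a_rows if row[1] > 0]

    # D = JOIN C BY x, B BY u;  (inner equi-join using a hash index on u)
    b_by_u = defaultdict(list)
    for b_row in b_rows:
        b_by_u[b_row[1]].append(b_row)
    d_rows = [c_row + b_row for c_row in c_rows for b_row in b_by_u.get(c_row[0], [])]

    # E = GROUP D BY z;  F = FOREACH E GENERATE group, COUNT(D);
    return dict(Counter(d_row[2] for d_row in d_rows))


if __name__ == "__main__":
    # Hypothetical sample data, made up for this sketch.
    a = [(1, 5, "red"), (2, -1, "red"), (3, 2, "blue"), (1, 7, "blue")]
    b = [("p", 1, "q"), ("r", 1, "s"), ("t", 3, "u")]
    print(run_pipeline(a, b))
```

The contrast with the script is the point: seven declarative statements describe the whole data flow, and the engine decides how to distribute the filter, join and aggregation.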
  10. “Global organizational spending on Big Data exceeded $31 billion in 2013, and is predicted to reach $118 billion in 2018.”
      Source: http://ebooks.capgemini-consulting.com/cracking-the-data-conundrum/#/2/zoomed
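
As a quick sanity check on the figures quoted on slide 10, the implied compound annual growth rate can be worked out directly. This is only back-of-the-envelope arithmetic: it treats "exceeded $31 billion" as exactly 31, and the helper name `implied_cagr` is an assumption made for this sketch.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # Figures from the quote: $31 billion (2013) and $118 billion (2018).
    rate = implied_cagr(31.0, 118.0, 2018 - 2013)
    print(f"Implied growth: {rate:.1%} per year")
```

Under these assumptions, the forecast corresponds to spending growing by a little over 30% every year for five years.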
  11. • Demo Source Code https://github.com/marek-stoj/talk-demos/tree/master/mapreduce-wordcount
      • Big Data University http://bigdatauniversity.com/
      • Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
  12. • edX - Analytics Edge https://www.edx.org/course/analytics-edge-mitx-15-071x-0
      • Coursera - Machine Learning https://www.coursera.org/course/ml
      • Coursera - Mining Massive Datasets https://www.coursera.org/course/mmds
  13. • edX - Health in Numbers https://www.edx.org/course/health-numbers-quantitative-methods-harvardx-ph207x#.VOCWFfnF9yU
      • MIT Professional X - Tackling the Challenges of Big Data https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/about