Big Data - Should You Care?

Big Data is a buzz-phrase that has become increasingly popular. Are you wondering whether you should care about all the hype? I did, not so long ago, and the answer for me is a Big Yes. In this talk, I'll show you what I found so appealing about this topic that it pushed me to make quite a radical shift in my career as a developer. At the beginning of my journey, my understanding of Big Data was, as I quickly realized, very limited. That's why I'd like to give you a bird's-eye view of the different sub-areas and specializations that Big Data entails. We'll touch on topics like infrastructure, engineering, data science and visualization. We'll see some real-world use cases. We'll learn a bit about the algorithmic aspects of dealing with large amounts of data. We'll take a look at the recent proliferation of Big Data-related technologies, products and vendors, both open-source and commercial. And finally, we'll get our hands dirty with a live coding demo.

Marek Stój

March 25, 2015

Transcript

  1. • Passionate about software engineering and computer science
     • Worked as a .NET software developer @ KRD
     • Working as a Big Data Analytics Engineer @ Credit Suisse
  2. Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. ~ Dan Ariely
  3. 13 million hours of YouTube videos, 12 billion Tweets, 30 billion Facebook posts every month
  4. • 2000 - Google Search Engine running on commodity hardware
     • 2003, Google paper: “Google File System”
     • 2004, Google paper: “MapReduce – Simplified Data Processing on Large Clusters”
  5. • Development started in 2005 at Yahoo
     • Google File System -> Hadoop Distributed File System
     • MapReduce -> Hadoop MapReduce
  6. • Designed for very large files (GB, TB, PB)
     • Distributed
     • Scalable
     • Reliable
     • Portable
  7. function map(string documentName, string document):
         for each word in document:
             emit(word, 1)

     function reduce(string word, iterator<int> partialCounts):
         sum = 0
         for each partialCount in partialCounts:
             sum += partialCount
         emit(word, sum)
  8. A = LOAD 'file1' AS (x, y, z);
     B = LOAD 'file2' AS (t, u, v);
     C = FILTER A BY y > 0;
     D = JOIN C BY x, B BY u;
     E = GROUP D BY z;
     F = FOREACH E GENERATE group, COUNT(D);
     STORE F INTO 'output';
  9. A = LOAD 'file1' AS (x, y, z);
     B = LOAD 'file2' AS (t, u, v);
     C = FILTER A BY y > 0;
     D = JOIN C BY x, B BY u;
     E = GROUP D BY z;
     F = FOREACH E GENERATE group, COUNT(D);
     STORE F INTO 'output';
  10. “Global organizational spending on Big Data exceeded $31 billion in 2013, and is predicted to reach $118 billion in 2018.”
      Source: http://ebooks.capgemini-consulting.com/cracking-the-data-conundrum/#/2/zoomed
  11. • Demo Source Code https://github.com/marek-stoj/talk-demos/tree/master/mapreduce-wordcount
      • Big Data University http://bigdatauniversity.com/
      • Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
  12. • edX – Analytics Edge https://www.edx.org/course/analytics-edge-mitx-15-071x-0
      • Coursera – Machine Learning https://www.coursera.org/course/ml
      • Coursera – Mining Massive Datasets https://www.coursera.org/course/mmds
  13. • edX – Health in Numbers https://www.edx.org/course/health-numbers-quantitative-methods-harvardx-ph207x#.VOCWFfnF9yU
      • MIT Professional X – Tackling the Challenges of Big Data https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/about
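
The word-count pseudocode on slide 7 can be tried out without a cluster. The following is a minimal single-process sketch in Python, not the demo code linked on slide 11. The names `map_words`, `shuffle`, `reduce_counts` and `word_count` are hypothetical, and splitting on whitespace is an assumed tokenization that the slide does not specify. The `shuffle` step stands in for the grouping by key that a framework such as Hadoop MapReduce performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(document_name: str, document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in the document.

    Assumption: words are separated by whitespace.
    """
    for word in document.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (done by the framework on a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word: str, partial_counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum the partial counts collected for one word."""
    return word, sum(partial_counts)


def word_count(documents: Dict[str, str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over a {name: text} mapping."""
    pairs = (
        pair
        for name, text in documents.items()
        for pair in map_words(name, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())
```

Calling `word_count({"a.txt": "big data big", "b.txt": "data"})` returns `{"big": 2, "data": 2}`. On a real cluster the map calls run in parallel on different nodes and the grouped values are streamed to reducers, but the contract of the two functions stays the same.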