Big Data - Should You Care?

@MarekStoj [email protected]

• Passionate about software engineering and computer science
• Worked as a .NET software developer @ KRD
• Working as a Big Data Analytics Engineer @ Credit Suisse

• The Hype
• Fundamentals
• Bigger Picture

Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. ~ Dan Ariely

2.5 exabytes (2.5 million terabytes) of data created every day

1 terabyte of trade information during each trading session on the New York Stock Exchange

Every month:
• 13 million hours of YouTube videos
• 12 billion Tweets
• 30 billion Facebook posts

Source: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

Game Dev Data Apps

Game Dev Big Data Analytics

• 2000 - Google Search Engine running on commodity hardware
• 2003 - Google paper: "The Google File System"
• 2004 - Google paper: "MapReduce: Simplified Data Processing on Large Clusters"

• Development started in 2005 at Yahoo
• Google File System -> Hadoop Distributed File System
• MapReduce -> Hadoop MapReduce

• Designed for very large files (GB, TB, PB)
• Distributed
• Scalable
• Reliable
• Portable
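
To make "distributed" and "reliable" a bit more concrete, here is a toy sketch of how a large file could be cut into fixed-size blocks, with each block copied to several machines. The function names, node names and round-robin placement are assumptions for illustration only; real HDFS placement is rack-aware and more involved. The 128 MB block size and replication factor of 3 are the defaults in Hadoop 2 and later.

```python
# Toy model of block storage in a distributed file system such as HDFS.
# split_into_blocks, place_replicas and the node names are hypothetical.

BLOCK_SIZE_MB = 128      # default block size in Hadoop 2 and later
REPLICATION_FACTOR = 3   # default HDFS replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(1000)  # a 1000 MB file
    nodes = ["node1", "node2", "node3", "node4"]
    for block_id, holders in place_replicas(len(blocks), nodes).items():
        print(block_id, blocks[block_id], "MB ->", holders)
```

Losing any single node still leaves two copies of every block, which is the intuition behind "reliable" on commodity hardware.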

Source: https://yoyoclouds.wordpress.com/2011/12/15/hdfsarchitecture/

A programming model and implementation for processing large data sets. An abstraction layer for programmers.

Map(key1, value1) → list(key2, value2)
Reduce(key2, list(value2)) → list(value3)

function map(string documentName, string document):
    for each word in document:
        emit(word, 1)

function reduce(string word, iterator<int> partialCounts):
    sum = 0
    for each partialCount in partialCounts:
        sum += partialCount
    emit(word, sum)
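
The word-count pseudocode can be tried out on a single machine. The sketch below is a minimal in-process simulation, not Hadoop code: `run_word_count` is a hypothetical driver that plays the role of the framework by grouping the mapper's output by key (the "shuffle") before calling the reducer.

```python
from collections import defaultdict


def map_fn(document_name, document):
    """Map step: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word, 1


def reduce_fn(word, partial_counts):
    """Reduce step: sum all partial counts for one word."""
    return word, sum(partial_counts)


def run_word_count(documents):
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            grouped[word].append(count)  # shuffle: collect values per key
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = {"a.txt": "big data big hype", "b.txt": "big clusters"}
    print(run_word_count(docs))
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the framework moves the grouped values to the reducers; the programmer still writes only the two functions.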

Source: http://nagyadat.blog.hu/2014/03/12/mapreduce-rol_bovebben

Source: https://www.youtube.com/watch?v=ht3dNvdNDzI

Source: http://www.ongridventures.com/2012/10/23/the-big-data-landscape/

Source: http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
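
For readers who do not know Pig, the same data flow can be sketched in plain Python over in-memory tuples. This only illustrates what each statement does; `run_pipeline` is a hypothetical name, loading and storing files is left out, and Pig itself compiles such scripts into jobs that run on the cluster.

```python
from collections import Counter


def run_pipeline(a_rows, b_rows):
    """a_rows holds (x, y, z) tuples, b_rows holds (t, u, v) tuples."""
    # C = FILTER A BY y > 0;
    c_rows = [row for row in a_rows if row[1] > 0]
    # D = JOIN C BY x, B BY u;  (inner join, matching rows concatenated)
    d_rows = [c + b for c in c_rows for b in b_rows if c[0] == b[1]]
    # E = GROUP D BY z;  F = FOREACH E GENERATE group, COUNT(D);
    counts = Counter(row[2] for row in d_rows)
    return sorted(counts.items())


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    a = [(1, 5, "p"), (2, -1, "p"), (3, 2, "q"), (1, 7, "q")]
    b = [(10, 1, "v1"), (11, 1, "v2"), (12, 3, "v3")]
    print(run_pipeline(a, b))
```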

Statistics + Mathematics + Computer Science = Predictive Analytics

F(X) = Y

Numerical Regression
• expected lifespan
• real-estate prices
• player performance
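
A minimal sketch of regression as "learn F so that F(X) = Y": fit a straight line to known (X, Y) pairs with ordinary least squares, then use it to predict Y for a new X. The function names and the numbers are made up for illustration; real work would normally use a statistics library and far more data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned F to a new input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Toy data: X could be flat size, Y could be price (invented values).
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(model, predict(model, 5))
```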

Classification
• digit recognition
• credit fraud detection
• e-mail classification
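
Classification is the same F(X) = Y idea with Y being a label instead of a number. One of the simplest possible classifiers, shown here only as a sketch, is 1-nearest-neighbor: a new point gets the label of the closest known example. The feature values and labels below are invented.

```python
def nearest_neighbor(train, point):
    """train is a list of (features, label); return the closest label."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    _, label = min(train, key=lambda item: squared_distance(item[0], point))
    return label


if __name__ == "__main__":
    # Invented two-feature examples, e.g. for e-mail classification.
    train = [((1.0, 1.0), "spam"), ((5.0, 5.0), "ham")]
    print(nearest_neighbor(train, (1.5, 0.5)))
```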

Clustering
• market segmentation
• news article categorization
• movie recommendations
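
Clustering differs from the two previous cases in that there is no known Y: the algorithm has to find the groups on its own. A minimal sketch, assuming one-dimensional data and caller-supplied starting centroids, is k-means: assign each value to the nearest centroid, move each centroid to the mean of its group, and repeat. The function name and numbers are illustrative only.

```python
def k_means_1d(values, centroids, iterations=10):
    """Tiny k-means for 1-D data with caller-supplied initial centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the nearest centroid's group.
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Update step: move each centroid to the mean of its group.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


if __name__ == "__main__":
    print(k_means_1d([1, 2, 3, 10, 11, 12], [0, 5]))
```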

“Global organizational spending on Big Data exceeded $31 billion in 2013, and is predicted to reach $118 billion in 2018.”
Source: http://ebooks.capgemini-consulting.com/cracking-the-data-conundrum/#/2/zoomed

Source: https://www.linkedin.com/pulse/20140618231442-1287-too-much-stem-not-enough-computer-science

Source: http://natemat.pl/105597,it-arystokracja

Source: http://venublog.com/2012/12/10/data-science-vs-data-analytics/comment-page-1/

Slide deck available at http://goo.gl/C99kTO
@MarekStoj [email protected]
Homework ;) http://goo.gl/XrMiaG

• Demo Source Code https://github.com/marek-stoj/talk-demos/tree/master/mapreduce-wordcount
• Big Data University http://bigdatauniversity.com/
• Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/

• edX – Analytics Edge https://www.edx.org/course/analytics-edge-mitx-15-071x-0
• Coursera – Machine Learning https://www.coursera.org/course/ml
• Coursera – Mining Massive Datasets https://www.coursera.org/course/mmds

• edX – Health in Numbers https://www.edx.org/course/health-numbers-quantitative-methods-harvardx-ph207x#.VOCWFfnF9yU
• MIT Professional X – Tackling the Challenges of Big Data https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/about
