Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013

The workshop will illustrate a number of techniques for data modelling that help us extend our small data capabilities to the world of big data: sampling, resampling, parallelization where possible, etc. We will leverage the functional architecture of R and its statistical analysis prowess in small data environments using the mapreduce technique embedded in Hadoop to tackle large data analysis problems. Particular attention will be paid to the ubiquitous --but non-scalable-- logistic regression technique and its big data alternatives.

Big Data Spain

December 23, 2013

Transcript

  1. Big Data Analytics Carlos J. Gil Bellosta Intro to Hadoop

    & R All about Hadoop Hadoop FS Hadoop & mapreduce All about R Counting (& Graphics) Graphics & big data Let’s count... hexagons Details of mapreduce Scoring, sampling & simulating Data modelling Linear Regression Logistic Regression Trees & Random Forests Final remarks Big Data Analytics R & Hadoop Carlos J. Gil Bellosta [email protected] November 2013
  2. Table of Contents

    1 Intro to Hadoop & R
        All about Hadoop
        Hadoop FS
        Hadoop & mapreduce
        All about R
    2 Counting (& Graphics)
    3 Details of mapreduce
    4 Scoring, sampling & simulating
    5 Data modelling
    6 Final remarks
  3. File system: manages everything about files

    • Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
    • Combination of hardware and software to hide boring activities from users:
      • Find space to write the files
      • Read/write files
      • Manage fragmentation
      • Etc.
    • How many devices per FS?
      • 1-to-1: diskettes, CD-ROMs, HDDs,...
      • n-to-1: partitioned HDDs,...
      • 1-to-n: RAIDs, Hadoop
  4. Hadoop goodies (as a FS)

    • Splits (large) files into chunks spread across machines
    • Replicates chunks (3 copies by default)
    • Balances data
    • Robust to hardware failures
    • It is rack aware

    Obviously, it requires some system to keep track of:
    • Which servers/racks are up/down
    • Where each chunk is located
    • ...
  5. How to work with data in Hadoop?

    • Provides a shell (ls, cp, etc.)
    • You can put/get data from your local FS to Hadoop FS
    • That is:
      • You can dump your data to your local machine
      • You can run your programs on your local machine
      • You can put results back into Hadoop
    • But what if the file is too large?

    Solution: rather than bringing the data to the code, why not move the code to the data?

    One of the ways to move code to data is known as mapreduce.
  6. Mapreduce

    • Two step process:
      • Map: run your code on chunks all over
      • Reduce: reshape the output into the desired format
    • Hadoop manages issues:
      • System failures
      • Threads that do not return
      • And all (?) that made life miserable for users of OpenMP, MPI, etc.
    • Slotted approach: mapreduce provides slots where you put the mapper/reducer code
    • The code is for you to provide!
  7. What is R?

    • R is a
      • software package?
      • programming language?
      • environment?
      for data analysis and graphics.
    • R users are (should be?) used to the mapreduce approach:

        ddply(dfx, .(group, sex), summarize, mean = mean(age), sd = sd(age))
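
The `ddply` call on this slide (from the plyr package) splits a data frame by group, applies a summary to each piece and combines the results. A minimal base R sketch of the same split/apply/combine pattern follows; the data frame `dfx` and its contents are made up here for illustration, since the slides do not show the data.

```r
# Hypothetical toy data frame; the slides do not show the real dfx.
dfx <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  sex   = rep(c("F", "M"), times = 15),
  age   = 20:49
)

# "Map": split the data by (group, sex) and summarise each piece on its own.
pieces <- split(dfx, list(dfx$group, dfx$sex), drop = TRUE)
summaries <- lapply(pieces, function(d) {
  data.frame(group = d$group[1], sex = d$sex[1],
             mean = mean(d$age), sd = sd(d$age))
})

# "Reduce": stack the per-piece summaries into a single table.
result <- do.call(rbind, summaries)
print(result)
```

Each piece is processed without looking at any other piece, which is what makes this pattern a natural fit for mapreduce.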
  8. Table of Contents

    1 Intro to Hadoop & R
    2 Counting (& Graphics)
        Graphics & big data
        Let’s count... hexagons
    3 Details of mapreduce
    4 Scoring, sampling & simulating
    5 Data modelling
    6 Final remarks
  9. Visualizing a million
  10. Fluctuation plot
  11. Table plot
  12. Let’s count... hexagons

    • Non-trivial counting exercise (no, we are not counting words today!)
    • Good visualization features for big datasets
    • Fits in mapreduce framework:
      • Map: assigns points to hexagons
      • Reduce: aggregates counts on hexagons
    • The output is small and can be plotted locally
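
The hexagon count described on this slide can be sketched locally in base R. This is an illustration under stated assumptions, not the workshop's code: the function names `hex_center` and `map_fn`, the simulated points and the four chunks standing in for Hadoop blocks are all hypothetical. Each point is assigned to the nearest centre of a hexagonal grid by comparing the nearest candidate from two offset rectangular lattices.

```r
# Nearest hexagon centre for each point; dx is the horizontal spacing of centres.
# A regular hexagonal grid is the union of two rectangular lattices, the second
# one shifted by (dx / 2, dy / 2) with dy = sqrt(3) * dx.
hex_center <- function(x, y, dx = 1) {
  dy <- sqrt(3) * dx
  x1 <- round(x / dx) * dx            # nearest centre in the first lattice
  y1 <- round(y / dy) * dy
  x2 <- (floor(x / dx) + 0.5) * dx    # nearest centre in the shifted lattice
  y2 <- (floor(y / dy) + 0.5) * dy
  use1 <- (x - x1)^2 + (y - y1)^2 <= (x - x2)^2 + (y - y2)^2
  data.frame(hx = ifelse(use1, x1, x2), hy = ifelse(use1, y1, y2))
}

# Hypothetical data, split into four chunks that play the role of Hadoop blocks.
set.seed(1)
pts <- data.frame(x = rnorm(10000), y = rnorm(10000))
chunks <- split(pts, rep(1:4, length.out = nrow(pts)))

# Map: assign the points of one chunk to hexagons and count within the chunk.
map_fn <- function(chunk) {
  h <- hex_center(chunk$x, chunk$y, dx = 0.5)
  aggregate(n ~ hx + hy, data = cbind(h, n = 1), FUN = sum)
}

# Reduce: add up the partial counts of each hexagon.
mapped <- do.call(rbind, lapply(chunks, map_fn))
hex_counts <- aggregate(n ~ hx + hy, data = mapped, FUN = sum)
```

`hex_counts` has one row per non-empty hexagon, so it is far smaller than the input and can be plotted on a single machine.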
  13. Table of Contents

    1 Intro to Hadoop & R
    2 Counting (& Graphics)
    3 Details of mapreduce
    4 Scoring, sampling & simulating
    5 Data modelling
    6 Final remarks
  14. What you see: input/output, map, reduce

    • input:
      • Type: text, csv, R object,...
      • Options: separator,...
    • output: similar to input
    • map & reduce:
      • Functions with (k,v) argument (k, key; v, value)
      • They return a (k,v) list
      • Thus, mapreduces can be chained together (the output of the first one is the input for the second)
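
To make the (k,v) contract and the chaining concrete, here is a toy, single-machine stand-in for a mapreduce engine written in base R. The helpers `kv` and `local_mapreduce` are hypothetical names invented for this sketch; they are not the API of any Hadoop package, and a real engine would distribute the map and reduce calls across nodes.

```r
# Hypothetical helper: a key-value pair is just a two-element list.
kv <- function(key, val) list(key = key, val = val)

# Toy engine: apply map to every record, group the values by key,
# then call reduce once per key. Input and output are both lists of pairs.
local_mapreduce <- function(input, map, reduce) {
  mapped <- lapply(input, function(rec) map(rec$key, rec$val))
  keys   <- vapply(mapped, function(rec) as.character(rec$key), character(1))
  vals   <- lapply(mapped, function(rec) rec$val)
  groups <- split(vals, keys)
  unname(Map(reduce, names(groups), groups))
}

# Both jobs below use the same pair of functions:
# the map turns the incoming value into the key, the reduce counts.
value_as_key <- function(k, v) kv(v, 1)
count_values <- function(k, vs) kv(k, sum(unlist(vs)))

words <- c("a", "b", "a", "c", "b", "a", "d")
input <- lapply(words, function(w) kv(NULL, w))

# Job 1: how many times does each word occur?
job1 <- local_mapreduce(input, value_as_key, count_values)

# Job 2, chained on the output of job 1: how many words share each frequency?
job2 <- local_mapreduce(job1, value_as_key, count_values)

freq <- setNames(vapply(job2, function(r) r$val, numeric(1)),
                 vapply(job2, function(r) r$key, character(1)))
print(freq)
```

Because the output of one job has the same shape as its input, the second job needs no conversion step.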
  15. What you don’t see

    $HADOOP jar $HADOOP_STREAMING
      -D stream.map.input=typedbytes
      -D stream.map.output=typedbytes
      -D stream.reduce.input=typedbytes
      -D stream.reduce.output=typedbytes
      -D mapred.reduce.tasks=0
      -input /tmp/RtmpUUrNMj/file68c0185e60c
      -output /tmp/RtmpUUrNMj/file68c04c25d5f0
      -mapper \"Rscript rmr-streaming-map68c018acf680 \"
      -file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a
      -file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080
      -file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680
      -inputformat org.apache.hadoop.streaming.AutoInputFormat
      -outputformat org.apache.hadoop.mapred.SequenceFileOutputForm
  16. Table of Contents

    1 Intro to Hadoop & R
    2 Counting (& Graphics)
    3 Details of mapreduce
    4 Scoring, sampling & simulating
    5 Data modelling
    6 Final remarks
  17. Scoring

    • External consultants build a model (using R and small data)
    • Models in R should have a predict method
    • You can then score your huge database (in batch)
    • No need to rewrite the model into your systems!
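
The scoring workflow can be sketched in base R as follows. Everything here is illustrative: the data are simulated, the model is a plain `glm`, and the five chunks stand in for the blocks that Hadoop would hand to the mappers. The point is that the map step only needs the fitted model object and its `predict` method.

```r
# A model built on small data (simulated here for illustration).
set.seed(42)
small <- data.frame(x = rnorm(200))
small$y <- rbinom(200, 1, plogis(0.5 + 1.5 * small$x))
model <- glm(y ~ x, data = small, family = binomial)

# The "huge" table to score, split into chunks as Hadoop would do.
big <- data.frame(id = 1:1000, x = rnorm(1000))
chunks <- split(big, rep(1:5, each = 200))

# Map: score one chunk with the model's predict method.
score_chunk <- function(chunk) {
  data.frame(id = chunk$id,
             score = unname(predict(model, newdata = chunk, type = "response")))
}

# No reduce step is needed; the scored chunks are simply collected.
scores <- do.call(rbind, lapply(chunks, score_chunk))
```

The chunked scores are identical to those obtained by scoring the whole table at once, because `predict` treats each row independently.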
  18. The case for sampling

    • Sampling works!
    • Sampled datasets can be used to build small data models
    • You can use R (& mapreduce) to sample data, but you had better not
  19. Running simulations on Hadoop

    • Some (many?) people say it is not the right tool
    • Hadoop jobs need input data, but simulations often have none
    • You want to control the number of mappers (which run your simulations)
    • Still mapreduce is nice for simulations...
    • ... so let an old dog try its dirty trick!
  20. Table of Contents

    1 Intro to Hadoop & R
    2 Counting (& Graphics)
    3 Details of mapreduce
    4 Scoring, sampling & simulating
    5 Data modelling
        Linear Regression
        Logistic Regression
        Trees & Random Forests
    6 Final remarks
  21. Linear regression can be parallelized

    Simple linear regression: $y \sim \alpha + \beta x$

    $$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$$

    Operations are case by case!
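
Since the slope formula on this slide depends on the data only through sums over cases, each chunk can contribute its own partial sums and the totals can be added afterwards. A minimal base R sketch, with simulated data and a hypothetical helper `chunk_sums`:

```r
# Hypothetical toy data; on a cluster each chunk would sit on a different node.
set.seed(123)
x <- rnorm(1000)
y <- 2 + 3 * x + rnorm(1000)
chunks <- split(data.frame(x = x, y = y), rep(1:4, length.out = 1000))

# Map: reduce each chunk to five numbers.
chunk_sums <- function(d) {
  c(n = nrow(d), sx = sum(d$x), sy = sum(d$y),
    sxy = sum(d$x * d$y), sxx = sum(d$x^2))
}

# Reduce: add the per-chunk vectors element by element.
s <- Reduce(`+`, lapply(chunks, chunk_sums))

# Plug the totals into the second form of the slope formula.
beta_hat  <- (s[["sxy"]] - s[["sx"]] * s[["sy"]] / s[["n"]]) /
             (s[["sxx"]] - s[["sx"]]^2 / s[["n"]])
alpha_hat <- (s[["sy"]] - beta_hat * s[["sx"]]) / s[["n"]]
```

The estimates agree with `coef(lm(y ~ x))` up to floating point error, although no step ever sees more than one chunk.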
  22. Multiple linear regression

    • Based on $X^\top X$ and $X^\top y$: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
    • If $X = [X_1 | \dots | X_n]$ (by blocks of rows), then $X^\top X = \sum_i X_i^\top X_i$.
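
The block decomposition can be checked numerically. In the base R sketch below the data are simulated and the helper `block_stats` is a hypothetical name; each block of rows contributes a small p-by-p matrix and a p-vector, and only those are combined.

```r
# Hypothetical design matrix (with intercept column) and response.
set.seed(321)
n <- 1200
X <- cbind(1, rnorm(n), runif(n))
y <- as.vector(X %*% c(1, -2, 0.5) + rnorm(n))

# Row indices of three blocks, standing in for chunks stored on different nodes.
blocks <- split(seq_len(n), rep(1:3, length.out = n))

# Map: per-block cross-products t(Xi) %*% Xi and t(Xi) %*% yi.
block_stats <- function(i) {
  Xi <- X[i, , drop = FALSE]
  list(XtX = crossprod(Xi), Xty = crossprod(Xi, y[i]))
}
parts <- lapply(blocks, block_stats)

# Reduce: sum the small matrices, then solve the normal equations once.
XtX <- Reduce(`+`, lapply(parts, `[[`, "XtX"))
Xty <- Reduce(`+`, lapply(parts, `[[`, "Xty"))
beta_hat <- as.vector(solve(XtX, Xty))
```

The size of what travels between nodes depends on the number of columns only, not on the number of rows.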
  23. Can logistic regression be parallelized?

    Yes and no.
    • Fitting logistic regression models is iterative, and the iterations themselves must run one after another.
    • However, each iteration can be parallelized (each one is not unlike fitting a linear model, as before)
    • We will explore two big data alternatives:
      • Parallelize iterations using mapreduce (see http://goo.gl/ftx36r)
      • Split your data meaningfully and do standard logistic regression in the nodes
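
The first alternative can be illustrated with Newton's method, one standard way of fitting a logistic regression; whether the linked material uses exactly this scheme is not stated in the slides, so take it as an assumption. In each iteration the gradient and the Hessian are sums over cases, so every chunk computes its share (map) and the shares are added (reduce). The data are simulated and `chunk_stats` is a hypothetical helper.

```r
# Hypothetical data: intercept plus two covariates.
set.seed(7)
n <- 2000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, as.vector(plogis(X %*% c(-0.5, 1, 2))))
idx <- split(seq_len(n), rep(1:4, length.out = n))   # four chunks of rows

# Map: gradient and Hessian contributions of one chunk at the current beta.
chunk_stats <- function(i, beta) {
  Xi <- X[i, , drop = FALSE]
  p  <- as.vector(plogis(Xi %*% beta))
  list(g = crossprod(Xi, y[i] - p),
       H = crossprod(Xi * (p * (1 - p)), Xi))
}

# The iterations are sequential; the work inside each one is not.
beta <- rep(0, ncol(X))
for (iter in 1:25) {
  parts <- lapply(idx, chunk_stats, beta = beta)
  g <- Reduce(`+`, lapply(parts, `[[`, "g"))   # reduce: sum over chunks
  H <- Reduce(`+`, lapply(parts, `[[`, "H"))
  step <- as.vector(solve(H, g))
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break
}
```

Each iteration is one mapreduce job, which is why the per-iteration overhead of Hadoop matters for this approach.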
  24. How many bytes make knowledge? (aka the fractal nature of big data)
  25. Split logistic regression
  26. Viable alternatives to logistic models

    • Trees
      • High interpretability
      • But unstable and tend to miss out details
    • Random forests
      • Black boxes
      • Superb performance
      • These are collections of trees that can be built in parallel
    • Both can be parallelized in different ways:
      • Similar to partitioned logistic models above
      • Within training
  27. Table of Contents

    1 Intro to Hadoop & R
    2 Counting (& Graphics)
    3 Details of mapreduce
    4 Scoring, sampling & simulating
    5 Data modelling
    6 Final remarks
  28. Forget most of what you learned today, seriously

    • People strive to extend small data models to big data (as we did today)...
    • ... but is it the way to go?
    • Achtung: microlocal structure
      • Small data people know microlocal structure as outliers
      • Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
      • But the promises of big data lie precisely there
      • (Otherwise, just sample and you will be fine)
    • Areas to watch for insights on big data modelling:
      • SNA (social network analysis)
      • Text analysis
  29. Thank you very much and...

    ... questions?