Slide 1

Slide 1 text

HadoopStack: Big Data Processing on Cloud
Dharmesh Kakadia, Shashank Sahni, Vasudeva Varma
Understanding Big Data Analytics
IKDD, Feb'13

Slide 2

Slide 2 text

Twins of the tech world
• Big Data
  – Economy to store and process
  – Data Scientists, Algorithms and Frameworks
• Cloud - Infrastructure for everyone
  – Elasticity
  – Pay-per-use
  – On demand

Slide 3

Slide 3 text

How I used to do it
• Write the code
• Search for machines
• Deploy processing framework
• Configure
• Troubleshoot (a lot!!!)
• Finally... run the code
• Is it running?
  10% ... 23% ... 33% ... 60% ... and I missed the paper deadline.
• And this is the NORMAL scenario!!
  – And if it was a workflow? Go back to the first step and repeat
  – What about failures?
  – What if you ran out of space?
  …

Slide 4

Slide 4 text

What if ...
$ hadoopstack \
    --input seedURLs \
    --jar "customCrawl.jar" crawlerMainClass \
    --output index

Slide 5

Slide 5 text

What if ...
$ hadoopstack \
    --input seedURLs \
    --workflow "crawl" \
    --output index

Slide 6

Slide 6 text

What if ...
$ hadoopstack \
    --input seedURLs \
    --workflow "crawl > parse > index" \
    --output index

Slide 7

Slide 7 text

What if ...
$ hadoopstack \
    --input seedURLs \
    --workflow "crawl > parse > index" \
    --output index \
    --cloud aws

Slide 8

Slide 8 text

What if ...
$ hadoopstack \
    --input seedURLs \
    --workflow "crawl > parse > index" \
    --output index \
    --cloud aws \
    --deadline "2 days"
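To make the command line above concrete, here is a minimal Python sketch of how such options could be parsed. It is not taken from the HadoopStack code base: the function names (`parse_workflow`, `parse_deadline`, `build_parser`), the accepted deadline units, and the choice of `argparse` are all assumptions made for illustration, and the `--jar` variant from the earlier slide is left out.

```python
import argparse
import re
from datetime import timedelta


def parse_workflow(text):
    """Split a string such as 'crawl > parse > index' into ordered stage names."""
    stages = [stage.strip() for stage in text.split(">")]
    if not all(stages):
        raise argparse.ArgumentTypeError(f"empty stage in workflow: {text!r}")
    return stages


def parse_deadline(text):
    """Turn '2 days' or '36 hours' into a timedelta (units are an assumption)."""
    match = re.fullmatch(r"\s*(\d+)\s*(hour|day)s?\s*", text)
    if match is None:
        raise argparse.ArgumentTypeError(f"unrecognized deadline: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(hours=amount) if unit == "hour" else timedelta(days=amount)


def build_parser():
    """Hypothetical parser mirroring the flags shown on the slides."""
    parser = argparse.ArgumentParser(prog="hadoopstack")
    parser.add_argument("--input", required=True)
    parser.add_argument("--workflow", type=parse_workflow, required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--cloud", default=None)
    parser.add_argument("--deadline", type=parse_deadline, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "--input", "seedURLs",
        "--workflow", "crawl > parse > index",
        "--output", "index",
        "--cloud", "aws",
        "--deadline", "2 days",
    ])
    print(args.workflow, args.cloud, args.deadline)
```

Parsing the workflow into an ordered list of stages and the deadline into a `timedelta` gives a scheduler plain values to work with, whatever it then does with them.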

Slide 9

Slide 9 text

HadoopStack  

Slide 10

Slide 10 text

HadoopStack
• Multiple clouds

Slide 11

Slide 11 text

HadoopStack
• Multiple clouds
• Auto scaling

Slide 12

Slide 12 text

HadoopStack
• Multiple clouds
• Auto scaling
• Quota

Slide 13

Slide 13 text

HadoopStack
• Multiple clouds
• Auto scaling
• Quota
• Minimal cost

Slide 14

Slide 14 text

HadoopStack
• Multiple clouds
• Auto scaling
• Quota
• Minimal cost
• Deadline-aware

Slide 15

Slide 15 text

HadoopStack
• Multiple clouds
• Auto scaling
• Quota
• Minimal cost
• Deadline-aware
• Data processing ecosystem

Slide 16

Slide 16 text

HadoopStack
• Multiple clouds
• Auto scaling
• Quota
• Minimal cost
• Deadline-aware
• Data processing ecosystem
• Open Source

Slide 17

Slide 17 text

Thoughts?
siel-iiith/hadoopstack
[email protected]