
HadoopStack: Big Data Processing on Cloud

Presented at the Workshop on Understanding Big Data Analytics, ACM India Special Interest Group on Knowledge Discovery and Data Mining (iKDD) 2013, Mysore, India.

dharmeshkakadia

February 15, 2013

Transcript

  1. HadoopStack: Big Data Processing on Cloud
     Dharmesh Kakadia, Shashank Sahni, Vasudeva Varma
     Understanding Big Data Analytics, IKDD, Feb'13
  2. Twins of the tech world
     • Big Data
       – Economy to store and process
       – Data Scientists, Algorithms and Frameworks
     • Cloud - Infrastructure for everyone
       – Elasticity
       – Pay-per-use
       – On demand
  3. How I used to do it
     • Write the code
     • Search for machines
     • Deploy processing framework
     • Configure
     • Troubleshoot (a lot!!!)
     • Finally, run the code
     • Is it running? 10% ... 23% ... 33% ... 60% ... and I missed the paper deadline.
     • And this is the NORMAL scenario!!
       – And if it was a workflow? Go back to the first step and repeat
       – What about failures?
       – You ran out of space?
       – ...
  4. What if ...
     $ hadoopstack --input seedURLs --jar "customCrawl.jar" crawlerMainClass --output index
  5. What if ...
     $ hadoopstack --input seedURLs --workflow "crawl" --output index
  6. What if ...
     $ hadoopstack --input seedURLs --workflow "crawl > parse > index" --output index
  7. What if ...
     $ hadoopstack --input seedURLs --workflow "crawl > parse > index" --output index --cloud aws
  8. What if ...
     $ hadoopstack --input seedURLs --workflow "crawl > parse > index" --output index --cloud aws --deadline "2 days"
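The command line imagined on slides 4 to 8 can be made concrete with a small parser. The following is only a sketch under stated assumptions: the flag names are taken from the slides, but the slides do not show how HadoopStack actually reads them, so the helper names (`parse_workflow`, `parse_deadline`, `build_parser`) and the accepted deadline syntax ("N hours" or "N days") are hypothetical.

```python
import argparse
import re
from datetime import timedelta


def parse_workflow(spec):
    """Split a spec such as 'crawl > parse > index' into ordered stage names."""
    stages = [stage.strip() for stage in spec.split(">")]
    if not all(stages):
        raise ValueError(f"empty stage in workflow: {spec!r}")
    return stages


def parse_deadline(text):
    """Turn '2 days' or '36 hours' into a timedelta (assumed syntax)."""
    match = re.fullmatch(r"\s*(\d+)\s*(hour|day)s?\s*", text)
    if match is None:
        raise ValueError(f"unsupported deadline: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(hours=amount) if unit == "hour" else timedelta(days=amount)


def build_parser():
    """Build a parser for the flags shown on the slides (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="hadoopstack")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    # Slide 4 passes a jar followed by its main class.
    parser.add_argument("--jar", nargs=2, metavar=("JAR", "MAIN_CLASS"))
    parser.add_argument("--workflow", type=parse_workflow)
    parser.add_argument("--cloud")
    parser.add_argument("--deadline", type=parse_deadline)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "--input", "seedURLs",
        "--workflow", "crawl > parse > index",
        "--output", "index",
        "--cloud", "aws",
        "--deadline", "2 days",
    ])
    print(args.workflow, args.cloud, args.deadline)
```

Parsing the workflow into an ordered list keeps the "crawl > parse > index" pipeline explicit, so each stage could later be scheduled in sequence.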
  9. HadoopStack
     • Multiple clouds
     • Auto scaling
     • Quota
     • Minimal cost
  10. HadoopStack
     • Multiple clouds
     • Auto scaling
     • Quota
     • Minimal cost
     • Deadline-aware
  11. HadoopStack
     • Multiple clouds
     • Auto scaling
     • Quota
     • Minimal cost
     • Deadline-aware
     • Data processing ecosystem
  12. HadoopStack
     • Multiple clouds
     • Auto scaling
     • Quota
     • Minimal cost
     • Deadline-aware
     • Data processing ecosystem
     • Open Source
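To illustrate how the "Deadline-aware", "Quota" and "Minimal cost" bullets could interact, here is a minimal sketch. It is not HadoopStack's scheduler, which the slides do not describe. It assumes the job's total work is known in node-hours, that the job scales perfectly linearly with node count, and that nodes are billed per started hour; the function names and parameters are hypothetical.

```python
import math


def nodes_for_deadline(work_node_hours, deadline_hours, quota_nodes):
    """Smallest node count that finishes by the deadline, or None if the quota is too small.

    Assumes perfectly linear scaling, which real Hadoop jobs rarely achieve.
    """
    if deadline_hours <= 0:
        raise ValueError("deadline must be positive")
    needed = max(1, math.ceil(work_node_hours / deadline_hours))
    return needed if needed <= quota_nodes else None


def estimated_cost(work_node_hours, nodes, price_per_node_hour):
    """Cost when every node is billed for each started hour (assumed billing model)."""
    hours = math.ceil(work_node_hours / nodes)
    return nodes * hours * price_per_node_hour


if __name__ == "__main__":
    # 480 node-hours of work, a 2-day (48 hour) deadline, a quota of 20 nodes.
    nodes = nodes_for_deadline(480, 48, 20)
    print(nodes, estimated_cost(480, nodes, 0.5))
```

Under these assumptions the total cost is roughly the same for any node count, so choosing the fewest nodes that still meet the deadline is one simple policy that also leaves the most quota free.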