HadoopStack: Big Data Processing on Cloud

Presented at the Workshop on Understanding Big Data Analytics, ACM India Special Interest Group on Knowledge Discovery and Data Mining (iKDD) 2013, Mysore, India.

dharmeshkakadia

February 15, 2013

Transcript

  1. HadoopStack: Big Data Processing on Cloud
     Dharmesh Kakadia, Shashank Sahni, Vasudeva Varma
     Understanding Big Data Analytics, IKDD, Feb '13
  2. Twins of the tech world
     • Big Data
       – Economy to store and process
       – Data Scientists, Algorithms and Frameworks
     • Cloud - Infrastructure for everyone
       – Elasticity
       – Pay-per-use
       – On demand
  3. How I used to do it
     • Write the code
     • Search for machines
     • Deploy processing framework
     • Configure
     • Troubleshoot (a lot!!!)
     • Finally.. Run the code
     • Is it running? 10% ... 23% ... 33% ... 60% ... and I missed the paper deadline.
     • And this is the NORMAL scenario!!
       – And if it was a workflow? Go back to the first step and repeat
       – What about failures?
       – You ran out of space?
       …
  4. What if ….
     $ hadoopstack --input seedURLs --jar "cutomCrawl.jar" crawlerMainClass --output index
  5. What if ….
     $ hadoopstack --input seedURLs --workflow "crawl" --output index
  6. What if ….
     $ hadoopstack --input seedURLs --workflow "crawl > parse > index" --output index
  7. What if ….
     $ hadoopstack --input seedURLs --workflow "crawl > parse > index" --output index --cloud aws
  8. What if ….
     $ hadoopstack --input seedURLs --workflow "crawl > parse > index" --output index --cloud aws --deadline "2 days"
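
The command line imagined in slides 4 to 8 can be sketched as a small argument parser. This is only an illustration of how such flags might be read, not HadoopStack's actual implementation: the helper names `parse_workflow`, `parse_deadline`, and `build_parser` are hypothetical, the `--jar` form from slide 4 is left out, and the deadline grammar (a whole number followed by hours or days) is an assumption.

```python
import argparse
from datetime import timedelta


def parse_workflow(spec):
    """Split a spec such as 'crawl > parse > index' into ordered stage names."""
    stages = [stage.strip() for stage in spec.split(">")]
    if not all(stages):
        raise argparse.ArgumentTypeError(f"empty stage in workflow: {spec!r}")
    return stages


def parse_deadline(spec):
    """Turn '2 days' or '36 hours' into a timedelta (assumed grammar)."""
    units = {"hour": "hours", "hours": "hours", "day": "days", "days": "days"}
    parts = spec.split()
    if len(parts) != 2 or not parts[0].isdigit() or parts[1] not in units:
        raise argparse.ArgumentTypeError(f"unsupported deadline: {spec!r}")
    return timedelta(**{units[parts[1]]: int(parts[0])})


def build_parser():
    """Hypothetical parser mirroring the flags shown on the slides."""
    parser = argparse.ArgumentParser(prog="hadoopstack")
    parser.add_argument("--input", required=True)
    parser.add_argument("--workflow", type=parse_workflow, required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--cloud", default=None)
    parser.add_argument("--deadline", type=parse_deadline, default=None)
    return parser


# Same arguments as the command on slide 8.
args = build_parser().parse_args([
    "--input", "seedURLs",
    "--workflow", "crawl > parse > index",
    "--output", "index",
    "--cloud", "aws",
    "--deadline", "2 days",
])
print(args.workflow, args.deadline)
```

Parsing the workflow into an ordered list of stages keeps the single-stage case from slide 5 and the chained case from slide 6 on the same code path.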
  9. HadoopStack  

  10. HadoopStack
      • Multiple clouds

  11. HadoopStack
      • Multiple clouds
      • Auto scaling
  12. HadoopStack
      • Multiple clouds
      • Auto scaling
      • Quota
  13. HadoopStack
      • Multiple clouds
      • Auto scaling
      • Quota
      • Minimal cost
  14. HadoopStack
      • Multiple clouds
      • Auto scaling
      • Quota
      • Minimal cost
      • Deadline-aware
  15. HadoopStack
      • Multiple clouds
      • Auto scaling
      • Quota
      • Minimal cost
      • Deadline-aware
      • Data processing ecosystem
  16. HadoopStack
      • Multiple clouds
      • Auto scaling
      • Quota
      • Minimal cost
      • Deadline-aware
      • Data processing ecosystem
      • Open Source
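
The slides do not say how quota, cost, and deadline interact, so the following is only a toy illustration of the idea: pick the smallest cluster that still meets the deadline and refuse if the quota is too small. The function name `nodes_for_deadline`, its parameters, and the assumption of perfectly linear speed-up are all hypothetical and are not taken from HadoopStack.

```python
import math


def nodes_for_deadline(work_node_hours, deadline_hours, quota_nodes):
    """Return the fewest nodes that finish the job in time, or None.

    Assumes (unrealistically) that run time scales linearly with node
    count, so a job needing `work_node_hours` finishes on n nodes in
    work_node_hours / n hours. Returns None when the node quota is too
    small to meet the deadline.
    """
    if work_node_hours <= 0 or deadline_hours <= 0:
        raise ValueError("work and deadline must be positive")
    needed = math.ceil(work_node_hours / deadline_hours)
    return needed if needed <= quota_nodes else None


# A job estimated at 100 node-hours with a 2-day (48-hour) deadline
# and a quota of 10 nodes.
print(nodes_for_deadline(100, 48, 10))
```

Under these assumptions, a real scheduler would also have to account for data transfer, stragglers, and billing granularity, none of which this sketch models.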
  17. Thoughts?
      siel-iiith/hadoopstack
      dharmesh.kakadia@research.iiit.ac.in