HadoopStack: Big Data Processing on Cloud

Dharmesh Kakadia, Shashank Sahni, Vasudeva Varma
Understanding Big Data Analytics IKDD, Feb’13

Twins of the tech world
•  Big Data
   –  Economy to store and process
   –  Data Scientists, Algorithms and Frameworks
•  Cloud - Infrastructure for everyone
   –  Elasticity
   –  Pay-per-use
   –  On demand

How I used to do it
•  Write the code
•  Search for machines
•  Deploy processing framework
•  Configure
•  Troubleshoot (a lot!!!)
•  Finally.. Run the code
•  Is it running? 10% ... 23% ... 33% ... 60% ... and I missed the paper deadline.
•  And this is NORMAL scenario!!
   –  And if it was a workflow? Go back to first step and repeat
   –  What about failures?
   –  You ran out of space? …

What if ….
$ hadoopstack --input seedURLs --jar "cutomCrawl.jar" crawlerMainClass --output index

--workflow "crawl" --output index

--workflow "crawl > parse > index" --output index

--workflow "crawl > parse > index" --output index --cloud aws

--workflow "crawl > parse > index" --output index --cloud aws --deadline "2 days"
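The last invocation packs a whole pipeline and a time budget into two strings. A minimal sketch of how such arguments might be interpreted is shown here. The helper names `parse_workflow` and `parse_deadline` are hypothetical and are not taken from HadoopStack itself. The accepted deadline units are likewise an assumption.

```python
from datetime import timedelta

# Hypothetical helpers for illustration only; not HadoopStack's actual code.

def parse_workflow(spec):
    """Split a string such as 'crawl > parse > index' into ordered stage names."""
    stages = [stage.strip() for stage in spec.split(">")]
    if not all(stages):
        raise ValueError(f"empty stage in workflow: {spec!r}")
    return stages


# Assumed unit vocabulary; a real tool may accept a different set.
_UNITS = {"hour": "hours", "hours": "hours", "day": "days", "days": "days"}


def parse_deadline(spec):
    """Turn a string such as '2 days' into a timedelta."""
    amount, unit = spec.split()
    if unit not in _UNITS:
        raise ValueError(f"unknown deadline unit: {unit!r}")
    return timedelta(**{_UNITS[unit]: float(amount)})


if __name__ == "__main__":
    print(parse_workflow("crawl > parse > index"))
    print(parse_deadline("2 days"))
```

Keeping the stages as an ordered list means each one can be submitted only after the previous one has finished, which is what the `>` in the workflow string suggests.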

HadoopStack
•  Multiple clouds
•  Auto scaling
•  Quota
•  Minimal cost
•  Deadline-aware
•  Data processing ecosystem
•  Open Source
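To make the interplay of quota, cost and deadline concrete, here is a rough sketch of one way a cluster size could be chosen. It is not HadoopStack's scheduler. The function `plan_cluster`, its parameters, and the assumption that a job scales linearly with the number of nodes are all illustrative. It also uses the smallest feasible cluster as a stand-in for minimal cost.

```python
import math


def plan_cluster(work_node_hours, deadline_hours, quota_nodes):
    """Return the smallest node count that meets the deadline within the quota.

    Assumes (for illustration) that the job needs `work_node_hours` of compute
    in total and that running on n nodes takes work_node_hours / n hours.
    """
    if deadline_hours <= 0:
        raise ValueError("deadline must be positive")
    # Fewest nodes that still finish in time; always at least one node.
    nodes = max(1, math.ceil(work_node_hours / deadline_hours))
    if nodes > quota_nodes:
        raise ValueError(
            f"need {nodes} nodes to meet the deadline, quota allows {quota_nodes}"
        )
    return nodes


if __name__ == "__main__":
    # 100 node-hours of work, a 2-day (48 hour) deadline, at most 10 nodes.
    print(plan_cluster(100, 48, 10))
```

An auto-scaling system would presumably repeat a calculation like this as the job progresses, using the remaining work and remaining time instead of the initial estimates.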

Thoughts?
siel-iiith/hadoopstack
[email protected]
