Surviving Hadoop on AWS in Production

Soren Macbeth

January 23, 2013

Transcript

  1. WHERE WE ARE TODAY
     MapR M3 on EMR. All data is read from and written to S3.
  2. CLOJURE FOR DATA PROCESSING
     All of our MapReduce jobs are written in Cascalog. This gives us speed,
     flexibility, and testability. More importantly, Clojure and Cascalog are
     fun to write.
  3. CASCALOG EXAMPLE

     (ns lucene-cascalog.core
       (:gen-class)
       (:use cascalog.api)
       (:import org.apache.lucene.analysis.standard.StandardAnalyzer
                org.apache.lucene.analysis.TokenStream
                org.apache.lucene.util.Version
                org.apache.lucene.analysis.tokenattributes.TermAttribute))

     (defn tokenizer-seq
       "Build a lazy-seq out of a tokenizer with TermAttribute"
       [^TokenStream tokenizer ^TermAttribute term-att]
       (lazy-seq
        (when (.incrementToken tokenizer)
          (cons (.term term-att)
                (tokenizer-seq tokenizer term-att)))))
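     The slide stops at the helper. As a hedged sketch (not from the deck),
     the tokenizer could be wrapped in a Cascalog custom operation along
     these lines; the name tokenize, the field name "contents", and the
     Lucene 3.x Version constant are assumptions:

     ;; Emit one 1-tuple per token in the input text.
     (defmapcatop tokenize [^String text]
       (let [analyzer (StandardAnalyzer. Version/LUCENE_30)
             stream   (.tokenStream analyzer "contents" (java.io.StringReader. text))
             term-att (.addAttribute stream TermAttribute)]
         (doall (tokenizer-seq stream term-att))))

     A query can then call it as (tokenize ?text :> ?word).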
  4. EVEN IN THE BEST-CASE SCENARIO, IT TAKES A LOT OF TUNING TO GET A
     HADOOP CLUSTER RUNNING WELL.
     There are large companies that make money solely by configuring and
     supporting Hadoop clusters for enterprise customers.
  5. CASCALOG AND ELASTICMAPREDUCE
     Learning Emacs, Clojure, and Cascalog was hard, but it was worth it. The
     way our jobs were designed sucked and didn't work well with
     ElasticMapReduce.
  6. CASCALOG AND SELF-MANAGED HADOOP CLUSTER
     We used a hacked-up version of a Cloudera Python script to launch and
     bootstrap a cluster. We ran on spot instances. Cluster boot-up time
     SUCKED, and boot-up often failed. We paid for instances during bootstrap
     and configuration. Our jobs weren't designed to tolerate things like
     spot instances going away in the middle of a job. Drinking heavily
     dulled the pain a little.
  7. CASCALOG AND ELASTICMAPREDUCE AGAIN
     Rebuilt the data processing pipeline from scratch (only took nine
     months!). Data pipelines were broken out into a handful of
     fault-tolerant jobflow steps; each step writes its output to S3. EMR
     supported spot instances at this point.
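     A hedged sketch of that shape (bucket, paths, and stage contents are
     hypothetical): each stage is its own Cascalog query whose sink is an S3
     tap, so a failed jobflow step can be rerun from the previous stage's
     output rather than from the raw data.

     (use 'cascalog.api)
     (require '[cascalog.ops :as c])

     ;; Stage boundaries are S3 paths; nothing important lives on the
     ;; cluster's local HDFS.
     (def raw-tap    (hfs-textline "s3n://example-bucket/raw/"))
     (def stage1-tap (hfs-seqfile  "s3n://example-bucket/stage-1/"))
     (def stage2-tap (hfs-seqfile  "s3n://example-bucket/stage-2/"))

     (defmapcatop split-words [^String line]
       (seq (.split line "\\s+")))

     ;; Stage 1: tokenize raw text and checkpoint the tokens to S3.
     (defn run-stage-1 []
       (?- stage1-tap
           (<- [?word]
               (raw-tap ?line)
               (split-words ?line :> ?word))))

     ;; Stage 2: count tokens, reading from the stage-1 checkpoint.
     (defn run-stage-2 []
       (?- stage2-tap
           (<- [?word ?count]
               (stage1-tap ?word)
               (c/count ?count))))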
  8. WEIRD BUGS THAT WE'VE HIT
     Bootstrap script errors. Random cluster fuckedupedness. AMI version
     changes. Vendor issues. My personal favourite: invisible S3 write
     failures.
  9. IF YOU MUST RUN ON AWS
     Break your processing pipelines into stages; write out to S3 after each
     stage. Bake (a lot of) variability into your expected jobflow run times.
     Compress the data you are reading from and writing to S3 as much as
     possible (a sketch follows). Drinking helps.
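     A hedged sketch of the compression advice, assuming Cascalog's
     with-job-conf macro is available and using the classic Hadoop 1.x
     (MRv1) property names; the bucket and paths are hypothetical:

     (use 'cascalog.api)

     ;; Write block-compressed sequence files to S3; Hadoop decompresses
     ;; transparently on read as long as the codec is available.
     (let [in-tap  (hfs-seqfile "s3n://example-bucket/stage-1/")
           out-tap (hfs-seqfile "s3n://example-bucket/stage-2-compressed/")]
       (with-job-conf {"mapred.output.compress"         "true"
                       "mapred.output.compression.type" "BLOCK"
                       "mapred.output.compression.codec"
                       "org.apache.hadoop.io.compress.GzipCodec"}
         (?- out-tap
             (<- [?word]
                 (in-tap ?word)))))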