Surviving Hadoop on AWS in Production

Soren Macbeth

January 23, 2013

Transcript

  1. WHERE WE ARE TODAY
     MapR M3 on EMR. All data is read from and written to S3.
  2. CLOJURE FOR DATA PROCESSING
     All of our MapReduce jobs are written in Cascalog. This gives us speed,
     flexibility, and testability. More importantly, Clojure and Cascalog are
     fun to write.
  3. CASCALOG EXAMPLE

     (ns lucene-cascalog.core
       (:gen-class)
       (:use cascalog.api)
       (:import org.apache.lucene.analysis.standard.StandardAnalyzer
                org.apache.lucene.analysis.TokenStream
                org.apache.lucene.util.Version
                org.apache.lucene.analysis.tokenattributes.TermAttribute))

     (defn tokenizer-seq
       "Build a lazy-seq out of a tokenizer with TermAttribute"
       [^TokenStream tokenizer ^TermAttribute term-att]
       (lazy-seq
        (when (.incrementToken tokenizer)
          (cons (.term term-att)
                (tokenizer-seq tokenizer term-att)))))
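     The slide stops at the helper. As a hedged sketch (not from the deck),
     the tokenizer could be wrapped in a Cascalog custom operation along
     these lines; the name tokenize, the field name "contents", and the
     Lucene 3.x Version constant are assumptions:

     ;; Emit one 1-tuple per token in the input text.
     (defmapcatop tokenize [^String text]
       (let [analyzer (StandardAnalyzer. Version/LUCENE_30)
             stream   (.tokenStream analyzer "contents" (java.io.StringReader. text))
             term-att (.addAttribute stream TermAttribute)]
         (doall (tokenizer-seq stream term-att))))

     A query can then call it as (tokenize ?text :> ?word).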
  4. EVEN IN THE BEST-CASE SCENARIO, IT TAKES A LOT OF TUNING TO GET A
     HADOOP CLUSTER RUNNING WELL.
     There are large companies that make money solely by configuring and
     supporting Hadoop clusters for enterprise customers.
  5. CASCALOG AND ELASTICMAPREDUCE
     Learning Emacs, Clojure, and Cascalog was hard, but it was worth it. The
     way our jobs were designed sucked and didn't work well with
     ElasticMapReduce.
  6. CASCALOG AND SELF-MANAGED HADOOP CLUSTER
     We used a hacked-up version of a Cloudera Python script to launch and
     bootstrap a cluster. We ran on spot instances. Cluster boot-up time
     SUCKED, and boot-up often failed. We paid for instances during bootstrap
     and configuration. Our jobs weren't designed to tolerate things like
     spot instances going away in the middle of a job. Drinking heavily
     dulled the pain a little.
  7. CASCALOG AND ELASTICMAPREDUCE AGAIN
     Rebuilt the data processing pipeline from scratch (only took nine
     months!). Data pipelines were broken out into a handful of
     fault-tolerant jobflow steps; each step writes its output to S3. EMR
     supported spot instances at this point.
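     A hedged sketch of that shape (bucket, paths, and stage contents are
     hypothetical): each stage is its own Cascalog query whose sink is an S3
     tap, so a failed jobflow step can be rerun from the previous stage's
     output rather than from the raw data.

     (use 'cascalog.api)
     (require '[cascalog.ops :as c])

     ;; Stage boundaries are S3 paths; nothing important lives on the
     ;; cluster's local HDFS.
     (def raw-tap    (hfs-textline "s3n://example-bucket/raw/"))
     (def stage1-tap (hfs-seqfile  "s3n://example-bucket/stage-1/"))
     (def stage2-tap (hfs-seqfile  "s3n://example-bucket/stage-2/"))

     (defmapcatop split-words [^String line]
       (seq (.split line "\\s+")))

     ;; Stage 1: tokenize raw text and checkpoint the tokens to S3.
     (defn run-stage-1 []
       (?- stage1-tap
           (<- [?word]
               (raw-tap ?line)
               (split-words ?line :> ?word))))

     ;; Stage 2: count tokens, reading from the stage-1 checkpoint.
     (defn run-stage-2 []
       (?- stage2-tap
           (<- [?word ?count]
               (stage1-tap ?word)
               (c/count ?count))))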
  8. WEIRD BUGS THAT WE'VE HIT
     Bootstrap script errors. Random cluster fuckedupedness. AMI version
     changes. Vendor issues. My personal favourite: invisible S3 write
     failures.
  9. IF YOU MUST RUN ON AWS
     Break your processing pipelines into stages; write out to S3 after each
     stage. Bake (a lot of) variability into your expected jobflow run times.
     Compress the data you are reading from and writing to S3 as much as
     possible (a sketch follows). Drinking helps.
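     A hedged sketch of the compression advice, assuming Cascalog's
     with-job-conf macro is available and using the classic Hadoop 1.x
     (MRv1) property names; the bucket and paths are hypothetical:

     (use 'cascalog.api)

     ;; Write block-compressed sequence files to S3; Hadoop decompresses
     ;; transparently on read as long as the codec is available.
     (let [in-tap  (hfs-seqfile "s3n://example-bucket/stage-1/")
           out-tap (hfs-seqfile "s3n://example-bucket/stage-2-compressed/")]
       (with-job-conf {"mapred.output.compress"         "true"
                       "mapred.output.compression.type" "BLOCK"
                       "mapred.output.compression.codec"
                       "org.apache.hadoop.io.compress.GzipCodec"}
         (?- out-tap
             (<- [?word]
                 (in-tap ?word)))))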