
Spark On Mesos -- "The Road Less Travelled"

December 15, 2015


Big Things Meetup -- Tel Aviv


As a startup, we've had the luxury of developing our batch-processing infrastructure from scratch, which has allowed us to incorporate some unconventional combinations of technologies. I will outline our current infrastructure (Spark running on Mesos, with data stored exclusively on S3 as a mixture of raw data in Hadoop sequence files and Parquet files) and explain the advantages it offers us over a more typical setup of Spark running on top of YARN backed by HDFS. However, running Spark this way has not been without challenges and a few setbacks. I will highlight a few of the larger problems we encountered and what we did to solve them. Despite the challenges, choosing Spark has opened up many possibilities for us. To highlight the performance and flexibility we gained by using Spark, I will dive into one process, Retention, which we originally implemented using Cascalog (Datalog translated into Cascading/Hadoop) and later rewrote as a Spark job.

Morri's bio:
Morri studied epigenetics as a post-doc at the Weizmann Institute and has a
PhD in Biophysics from the University of California, San Francisco. He left the world of academia to crack Big Data problems. That's why he joined the AppsFlyer dev team.





  1. The Flow – User, Device, Store, Media sources, AppsFlyer Servers

     (diagram: the user is redirected through AppsFlyer's servers on install). Enables: Cost Per Install (CPI), Cost Per In-app Action (CPA), Revenue Share, Network Optimization, Retargeting.
  2. Retention V2 – Spark successfully used for the Cohort report (similar to retention). Cohort

     is written with Spark's low-level RDD API wrapped by sparkling (https://github.com/gorillalabs/sparkling):

     (defn select-angry-birds-clojure [file-name]
       (let [file-lines (split-lines (slurp file-name))
             parsed     (map json/parse-string file-lines)
             desired    (filter (fn [m] (= "com.rovio.angrybirds" (:app_id m))) parsed)]
         desired))

     (defn select-angry-birds-sparkling [file-name]
       (let [raw-rdd     (spark/text-file file-name)
             parsed-rdd  (spark/map json/parse-string raw-rdd)
             desired-rdd (spark/filter (fn [m] (= "com.rovio.angrybirds" (:app_id m))) parsed-rdd)]
         (spark/collect desired-rdd)))
  3. Retention – Spark SQL / Parquet. S3 Data v1 – Hadoop Sequence files: Key, Value = <Kafka

     Offset, Json Message>, gzip compressed, ~1.8 TB / day. S3 Data v2 – Parquet files (schema on write): retain the fields required for retention and apply some business logic while converting; generates “tables” for installs and sessions. Retention v2 – “SELECT … JOIN ON ...”: 18 dimensions vs. 2 in the original report.
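The “SELECT … JOIN ON ...” shape of Retention v2 can be modeled in a few lines. This is a plain-Python sketch, not the production Spark SQL job; the field names (`device`, `app_id`, `cohort_day`, `activity_day`) are illustrative stand-ins for the installs and sessions “tables” generated by the Parquet conversion:

```python
# Hypothetical mini-model of the retention join: each install row carries
# the cohort day, each session row an activity day; joining them yields
# (cohort_day, activity_day) pairs per dimension.

installs = [
    {"device": "d1", "app_id": "com.example.app", "cohort_day": "2015-12-01"},
    {"device": "d2", "app_id": "com.example.app", "cohort_day": "2015-12-01"},
]
sessions = [
    {"device": "d1", "app_id": "com.example.app", "activity_day": "2015-12-02"},
    {"device": "d1", "app_id": "com.example.app", "activity_day": "2015-12-03"},
]

def retention_join(installs, sessions):
    """JOIN sessions to installs ON (device, app_id)."""
    by_key = {(i["device"], i["app_id"]): i for i in installs}
    joined = []
    for s in sessions:
        i = by_key.get((s["device"], s["app_id"]))
        if i is not None:
            joined.append({"cohort_day": i["cohort_day"],
                           "activity_day": s["activity_day"],
                           "app_id": s["app_id"]})
    return joined

rows = retention_join(installs, sessions)
```

In Spark SQL the same hash-join over the two Parquet tables is what the engine does for you; the sketch only shows the shape of the computation.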
  4. Retention Calculation Phases. 1. Daily aggregation: Cohort_day, Activity_day, <Dimensions>, Retained

     Count. 2. Pivot: Cohort_day, <Dimensions>, Day0, Day1, Day2 … After aggregation and pivot: ~1 billion rows.
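The two phases above can be sketched in plain Python. This is an illustrative model, not the production job: it counts rows rather than distinct users, and `country` stands in for the full set of dimensions:

```python
# Phase 1 (daily aggregation) and phase 2 (pivot) of the retention
# calculation, modeled over a handful of joined rows.
from collections import defaultdict
from datetime import date

joined = [
    {"cohort_day": "2015-12-01", "activity_day": "2015-12-01", "country": "IL"},
    {"cohort_day": "2015-12-01", "activity_day": "2015-12-02", "country": "IL"},
    {"cohort_day": "2015-12-01", "activity_day": "2015-12-02", "country": "IL"},
]

def daily_aggregation(rows):
    """Phase 1: retained count per (cohort_day, activity_day, dimensions)."""
    counts = defaultdict(int)
    for r in rows:
        counts[(r["cohort_day"], r["activity_day"], r["country"])] += 1
    return counts

def pivot(counts):
    """Phase 2: turn activity days into Day0, Day1, Day2 ... columns."""
    table = defaultdict(dict)
    for (cohort, activity, country), n in counts.items():
        offset = (date.fromisoformat(activity) - date.fromisoformat(cohort)).days
        table[(cohort, country)][f"Day{offset}"] = n
    return dict(table)

result = pivot(daily_aggregation(joined))
# result[("2015-12-01", "IL")] == {"Day0": 1, "Day1": 2}
```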
  5. Data Warehouse v3 Parquet Files – Schema on Read. Retain

     almost all fields from the original JSON; do not apply any business logic. Business logic is applied at read time through a shared library.
  6. Why? All data on S3 – no need for HDFS.

     Spark and Mesos have a long history. Some interest in moving our attribution services to Mesos. Began using Spark with the EC2 “standalone” cluster scripts (no VPC) – easy to set up. Culture of trying out promising technologies.
  7. Mesos Creature Comforts. Nice UI – job outputs / sandbox

     are easy to find; driver and slave logs are accessible.
  8. Mesos Creature Comforts. Fault tolerant – masters store state in

     ZooKeeper and can fail over smoothly. Nodes join and leave the cluster automatically at boot-up / shutdown.
  9. Spark Execution Concepts. RDD (Resilient Distributed Dataset) = Partitioned Data.

     A Job is divided into Stages; Stages are divided into Tasks; one Task per Partition; Stages are separated by a Shuffle.
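The concepts above can be illustrated with a toy model (this is not Spark code; a “partitioned dataset” is just a list of lists, and the function names are invented for the sketch):

```python
# A narrow map stage runs one task per partition; a shuffle re-groups
# records by key across partition boundaries, marking a stage boundary.

def map_stage(partitions, f):
    """Narrow transformation: each task processes exactly one partition."""
    return [[f(x) for x in part] for part in partitions]

def shuffle(partitions, num_partitions, key):
    """Stage boundary: redistribute records by hash of their key."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for x in part:
            out[hash(key(x)) % num_partitions].append(x)
    return out

rdd = [[1, 2], [3, 4]]                          # 2 partitions -> 2 tasks per stage
mapped = map_stage(rdd, lambda x: x * 10)       # stage 1: no data movement
shuffled = shuffle(mapped, 2, key=lambda x: x)  # shuffle separates stage 1 and 2
```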
  10. Spark Execution Concepts Driver runs the show Executors run Tasks

  11. Specific Lessons / Challenges using Spark, Mesos & S3 -or-

    What Went Wrong with Spark / Mesos & S3 and How We Fixed It. Spark / Mesos in production for nearly 1 year
  12. S3 is not HDFS. S3n gives tons of timeouts and

     DNS errors at 5 pm daily. The timeouts can be compensated for by setting spark.task.maxFailures to 20. Use S3a from Hadoop 2.7 (S3a in 2.6 generates millions of partitions – HADOOP-11584). https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
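The workarounds above boil down to a small configuration fragment. A minimal sketch, assuming a spark-defaults.conf and the Hadoop 2.7 s3a client on the classpath (values are illustrative, not the exact production settings):

```properties
# Retry tasks aggressively to ride out transient S3 timeouts
spark.task.maxFailures        20
# Route s3a:// URLs through the Hadoop 2.7+ S3a filesystem instead of S3n
spark.hadoop.fs.s3a.impl      org.apache.hadoop.fs.s3a.S3AFileSystem
```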
  13. S3 is not HDFS, part 2. Use a Direct Output

     Committer. https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/ Spark writes files to a staging area and renames them at the end of the job; rename on S3 is an expensive operation (tens of minutes for thousands of files). Direct output committers write to the final output location (safe because S3 writes are atomic, so they always succeed). Disadvantages – incompatible with speculative execution; poor recovery from failures during write operations.
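For Spark of that era (1.4/1.5), the direct-committer approach could be configured roughly as below; the committer class name changed between releases, so treat both lines as assumptions to verify against your Spark version:

```properties
# Write Parquet output directly to its final S3 location (no rename step)
spark.sql.parquet.output.committer.class  org.apache.spark.sql.parquet.DirectParquetOutputCommitter
# Direct committers are unsafe with speculative execution; keep it off
spark.speculation                         false
```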
  14. Avoid .0 releases if possible. https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/ Worst example: Spark 1.4.0

     randomly loses data, especially on jobs with many output partitions. Fixed by SPARK-8406.
  15. Coarse-Grained with Dynamic Allocation. Brings some advantages of Fine-Grained to

     Coarse-Grained: executors are shut down once they become idle. Relies on the External Shuffle Service to serve shuffle files, since an executor may or may not still be present after a shuffle. Advantages – similar to fine-grained. Disadvantages – not as fair / dynamic as fine-grained: jobs only share resources when executors become idle at the end of stages. Not production ready: some jobs consistently fail (the reason is not yet clear), and the external shuffle service is sometimes unavailable.
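Dynamic allocation with the external shuffle service, as described above, is switched on with a configuration fragment along these lines (the idle-timeout value is illustrative):

```properties
spark.dynamicAllocation.enabled              true
# Executors can disappear after a stage, so shuffle files must be served
# by a long-running external shuffle service on each node
spark.shuffle.service.enabled                true
# How long an executor may sit idle before it is shut down
spark.dynamicAllocation.executorIdleTimeout  60s
```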
  16. Tuning Jobs in Coarse-Grained. Set executor memory to ~ the entire

     memory of a machine (200 GB for an r3.8xlarge, which has 32 CPUs). spark.task.cpus then effectively controls memory per task: each task gets roughly executor memory divided by the number of concurrently running tasks, so with 200 GB split across 32 concurrent tasks, a memory-hungry task can OOM.
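The arithmetic behind that tuning knob is simple enough to write out. A back-of-the-envelope sketch (the function is invented for illustration; real per-task memory also depends on Spark's internal memory regions):

```python
# One big coarse-grained executor on an r3.8xlarge: 200 GB, 32 CPUs.
# Memory available to a task ~= executor memory / concurrent tasks,
# and raising spark.task.cpus lowers the concurrency.

def memory_per_task_gb(executor_memory_gb, executor_cores, task_cpus):
    concurrent_tasks = executor_cores // task_cpus
    return executor_memory_gb / concurrent_tasks

default = memory_per_task_gb(200, 32, task_cpus=1)  # 32 tasks -> 6.25 GB each
tuned = memory_per_task_gb(200, 32, task_cpus=4)    # 8 tasks -> 25 GB each
```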
  17. Spark on Mesos Future Improvements. Increased stability – dynamic allocation,

     Tungsten. Mesos Maintenance Primitives (experimental in 0.25.0): gracefully reduce the size of the cluster by marking nodes that will soon be killed. Inverse Offers – preemption, more dynamic scheduling.