Spark On Mesos -- "The Road Less Travelled"

SPARK ON MESOS “THE ROAD LESS TRAVELLED”
Morri Feldman
morri@appsflyer.com

The Plan
1. Attribution & Overall Architecture
2. Retention
3. Data Infrastructure - Spark on Mesos

The Flow
[Diagram labels: Media sources, User Device, Store, AppsFlyer Servers, Redirected, -OR-]
Enables
• Cost Per Install (CPI)
• Cost Per In-app Action (CPA)
• Revenue Share
• Network Optimization
• Retargeting

Retention

Retention
• ~25 Million Installs / Day
• ~4.75 Billion Sessions / Day

Retention V1 (MVP)
• Two Dimensions (App-Id and Media-Source)
• Cascalog: Datalog / logic programming over Cascading / Hadoop

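Retention V2-
• Spark successfully used for Cohort report (similar to retention)
• Cohort written with Spark low level RDD API wrapped by sparkling
  https://github.com/gorillalabs/sparkling

    ;; Assumed aliases (not shown on the slide): json = cheshire.core, spark = sparkling.core.
    (ns retention.example
      (:require [clojure.string :refer [split-lines]]
                [cheshire.core :as json]
                [sparkling.core :as spark]))

    ;; Plain Clojure: read the file, parse each JSON line, keep Angry Birds records.
    ;; Passing true to parse-string keywordizes keys so that (:app_id m) works.
    (defn select-angry-birds-clojure [file-name]
      (let [file-lines (split-lines (slurp file-name))
            parsed     (map #(json/parse-string % true) file-lines)
            desired    (filter (fn [m] (= "com.rovio.angrybirds" (:app_id m))) parsed)]
        desired))

    ;; Same pipeline on RDDs with sparkling; sc is the Spark context.
    (defn select-angry-birds-sparkling [sc file-name]
      (let [raw-rdd     (spark/text-file sc file-name)
            parsed-rdd  (spark/map #(json/parse-string % true) raw-rdd)
            desired-rdd (spark/filter (fn [m] (= "com.rovio.angrybirds" (:app_id m))) parsed-rdd)]
        (spark/collect desired-rdd)))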

Retention – Spark SQL / Parquet
S3 Data v1 – Hadoop Sequence files
• Key, Value <Kafka Offset, Json Message>
• Gzip Compressed, ~1.8 TB / Day
S3 Data v2 – Parquet Files (Schema on Write)
• Retain fields required for retention, apply some business logic while converting.
• Generates “tables” for installs and sessions.
Retention v2 – “SELECT … JOIN ON ...”
• 18 Dimensions vs 2 in original report

Retention Calculation Phases
1. Daily aggregation: Cohort_day, Activity_day, <Dimensions>, Retained Count
2. Pivot: Cohort_day, <Dimensions>, Day0, Day1, Day2 …
After Aggregation and Pivot: ~1 billion rows
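The two phases can be illustrated with a small in-memory sketch. This is not the production Spark SQL job; the function names, the integer day numbers, and the choice to count each user at most once per activity day are all assumptions made for illustration.

```python
from collections import defaultdict


def daily_aggregation(installs, sessions):
    """Phase 1: retained count per (cohort_day, activity_day, dims).

    installs: dict user_id -> (cohort_day, dims)   (hypothetical shape)
    sessions: iterable of (user_id, activity_day)  (hypothetical shape)
    """
    seen = set()
    counts = defaultdict(int)
    for user_id, activity_day in sessions:
        if user_id not in installs:
            continue  # session without a matching install
        cohort_day, dims = installs[user_id]
        if activity_day < cohort_day or (user_id, activity_day) in seen:
            continue  # ignore pre-install activity and repeat sessions
        seen.add((user_id, activity_day))
        counts[(cohort_day, activity_day, dims)] += 1
    return dict(counts)


def pivot(aggregated):
    """Phase 2: one row per (cohort_day, dims) with Day0, Day1, ... columns."""
    rows = defaultdict(dict)
    for (cohort_day, activity_day, dims), retained in aggregated.items():
        rows[(cohort_day, dims)][f"Day{activity_day - cohort_day}"] = retained
    return dict(rows)
```

Calling `pivot(daily_aggregation(installs, sessions))` yields one dictionary per cohort and dimension combination, which mirrors the pivoted row layout listed above.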

Data Warehouse v3
• Parquet Files – Schema on Read
• Retain almost all fields from original json
• Do not apply any business logic
• Business logic applied when reading through use of a shared library

Why?
• All Data on S3 – No need for HDFS
• Spark & Mesos have a long history
• Some interest in moving our attribution services to Mesos
• Began using Spark with EC2 “standalone” cluster scripts (No VPC)
• Easy to set up
• Culture of trying out promising technologies

Mesos Creature Comforts
• Nice UI – Job outputs / sandbox easy to find
• Driver and Slave logs are accessible

Mesos Creature Comforts
• Fault tolerant – Masters store data in ZooKeeper and can fail over smoothly
• Nodes join and leave the cluster automatically at bootup / shutdown

Job Scheduling – Chronos ? https://aphyr.com/posts/326-jepsen-chronos

Spark Execution Concepts
http://www.trongkhoanguyen.com/2015/04/understand-shuffle-component-in-spark.html
• RDD (Resilient Distributed Dataset) = Partitioned Data
• Job divided into Stages
• Stages divided into Tasks
• Task = Partition
• Stages separated by Shuffle
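As a rough illustration of these concepts, the toy model below splits a linear chain of operations into stages at every shuffle and counts one task per partition. It is a simplification written for this transcript, not Spark's scheduler: the function names and the `(name, needs_shuffle)` tuples are hypothetical, and real Spark runs the map side of a shuffle in the preceding stage.

```python
def split_into_stages(operations):
    """Group a linear chain of (name, needs_shuffle) operations into stages.

    An operation that needs a shuffle starts a new stage.
    """
    stages = [[]]
    for name, needs_shuffle in operations:
        if needs_shuffle and stages[-1]:
            stages.append([])  # the shuffle closes the current stage
        stages[-1].append(name)
    return stages


def count_tasks(stages, partitions_per_stage):
    """Task = Partition: each stage runs one task per partition."""
    if len(stages) != len(partitions_per_stage):
        raise ValueError("need one partition count per stage")
    return sum(partitions_per_stage)
```

For example, a chain of `textFile`, `map`, `reduceByKey`, `map`, `sortByKey` has two shuffle operations and therefore three stages in this model.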

Spark Execution Concepts
• Driver runs the show
• Executors run Tasks
http://spark.apache.org/docs/latest/cluster-overview.html

Specific Lessons / Challenges using Spark, Mesos & S3
-or-
What Went Wrong with Spark / Mesos & S3 and How We Fixed It.
• Spark / Mesos in production for nearly 1 year

S3 is not HDFS
• S3n gives tons of timeouts and DNS Errors @ 5pm Daily
• Can compensate for timeouts with spark.task.maxFailures set to 20
• Use S3a from Hadoop 2.7 (S3a in 2.6 generates millions of partitions – HADOOP-11584)
https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
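One way to apply both suggestions is sketched below: render Spark properties as `spark-submit --conf` arguments and rewrite input paths to the `s3a://` scheme. `spark.task.maxFailures` is the property named on the slide; the helper names and the bucket path are hypothetical, and the slides do not show how the settings were actually passed.

```python
def spark_submit_conf_args(settings):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key in sorted(settings):
        args += ["--conf", f"{key}={settings[key]}"]
    return args


def to_s3a(path):
    """Rewrite an s3n:// or s3:// URI to the s3a:// scheme; leave others alone."""
    for scheme in ("s3n://", "s3://"):
        if path.startswith(scheme):
            return "s3a://" + path[len(scheme):]
    return path


# Tolerate transient S3 timeouts by allowing more attempts per task.
conf_args = spark_submit_conf_args({"spark.task.maxFailures": 20})
# Hypothetical bucket and prefix, for illustration only.
input_path = to_s3a("s3n://example-bucket/installs/")
```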

S3 is not HDFS part 2
• Use a Direct Output Committer
  https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
• Spark writes files to a staging area and renames them at the end of the job
• Rename on S3 is an expensive operation (~10s of minutes for thousands of files)
• Direct Output Committers write to the final output location (safe because S3 writes are atomic: an object becomes visible only once its upload completes)
Disadvantages
• Incompatible with speculative execution
• Poor recovery from failures during write operations

Avoid .0 releases if possible
https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
• Worst example: Spark 1.4.0 randomly loses data, especially on jobs with many output partitions
• Fixed by SPARK-8406

Coarse-Grained or Fine-Grained?
• TL;DR – Use coarse-grained
• Not Perfect, but Stable

Coarse-Grained – Disadvantages
• spark.cores.max (not dynamic)

http://www.slideshare.net/SparkSummit/wampler- chen

Coarse-Grained with Dynamic Allocation

Coarse-Grained with Dynamic Allocation
• Brings some advantages of Fine-Grained to Coarse-Grained
• Executors are shut down once they are inactive
• Relies on External Shuffle Service to serve shuffle files, since an executor might or might not be present after shuffle
Advantages
• Similar to fine-grained
Disadvantages
• Not as fair / dynamic as fine-grained. Jobs only share resources when executors become idle at the end of stages.
• Not production ready. Some jobs consistently fail. Reason is not yet clear. External shuffle service is sometimes unavailable.
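The dependency on the external shuffle service can be checked before submitting a job. The sketch below uses the standard property names `spark.dynamicAllocation.enabled`, `spark.shuffle.service.enabled`, and `spark.mesos.coarse`; the checker function itself is hypothetical and is not part of Spark or of the setup described in the talk.

```python
def check_dynamic_allocation(conf):
    """Return a list of problems in a dict of Spark properties (string values)."""
    problems = []
    if conf.get("spark.dynamicAllocation.enabled") != "true":
        return problems  # nothing to check when dynamic allocation is off
    if conf.get("spark.shuffle.service.enabled") != "true":
        # Executors may be gone after a shuffle, so something else must serve the files.
        problems.append("dynamic allocation requires spark.shuffle.service.enabled=true")
    if conf.get("spark.mesos.coarse") != "true":
        problems.append("this setup assumes coarse-grained mode (spark.mesos.coarse=true)")
    return problems
```

The service also has to be running on each node for the setting to have any effect, which is the availability problem noted in the disadvantages above.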

Tuning Jobs in Coarse-Grained

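Tuning Jobs in Coarse-Grained
• Set executor memory to ~ entire memory of a machine (200GB for r3.8xlarge)
• spark.task.cpus then effectively sets Spark memory per task: a higher value means fewer concurrent tasks sharing the executor's memory
• [Diagram labels: 200 GB, 32 cpus, OOM!!]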
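The arithmetic behind this tuning knob can be worked through with the numbers from the slide (200 GB, 32 cpus). The function below is a hypothetical helper, and the result is only an upper bound: Spark reserves part of the executor heap for other purposes, which this sketch ignores.

```python
def memory_per_task_gb(executor_memory_gb, executor_cores, task_cpus):
    """Approximate memory available to each concurrently running task."""
    concurrent_tasks = executor_cores // task_cpus
    if concurrent_tasks < 1:
        raise ValueError("spark.task.cpus exceeds the executor's cores")
    return executor_memory_gb / concurrent_tasks


# r3.8xlarge figures from the slide: 200 GB executor memory, 32 cpus.
default_share = memory_per_task_gb(200, 32, task_cpus=1)  # 32 concurrent tasks
larger_share = memory_per_task_gb(200, 32, task_cpus=4)   # 8 concurrent tasks
```

With `spark.task.cpus` at 1 each task gets about 6.25 GB; raising it to 4 gives each task about 25 GB at the cost of running only 8 tasks at a time.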

Tuning Jobs in Coarse-Grained
• On OOM!!, use more shuffle partitions
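More shuffle partitions help because each task then handles a smaller slice of the shuffled data. A minimal sketch of the sizing arithmetic follows; the helper name and the example figures are hypothetical, and the slides do not say which property was tuned (for Spark SQL jobs the partition count is commonly set through `spark.sql.shuffle.partitions`).

```python
import math


def shuffle_partitions_needed(shuffle_gb, target_partition_gb):
    """Smallest partition count that keeps each shuffle partition under the target size."""
    if target_partition_gb <= 0:
        raise ValueError("target partition size must be positive")
    return math.ceil(shuffle_gb / target_partition_gb)


# Hypothetical figures: 1000 GB of shuffle data, aiming for 2 GB per partition.
partitions = shuffle_partitions_needed(1000, 2)
```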

Spark on Mesos Future Improvements
• Increased stability – Dynamic allocation
• Tungsten
• Mesos Maintenance Primitives, experimental in 0.25.0
  • Gracefully reduce size of cluster by marking nodes that will soon be killed
  • Inverse Offers – preemption, more dynamic scheduling

We are Hiring! https://www.appsflyer.com/jobs/
