Slide 1

Adapting from Spark to Dask: what to expect
Irina Truong
@irinatruong

Slide 2

Irina Truong
I work for Parse.ly as a Backend Engineer. We process lots of data with Apache Spark and Apache Storm. I wrote wharfee, and maintain pgcli.
https://github.com/j-bennet/ @irinatruong

Slide 3

To Dask or not to Dask? That is the question. @irinatruong

Slide 4

You may be considering a journey... Is it possible? What will go wrong? What about performance? Is it worth it? @irinatruong

Slide 5

I did too. Because... @irinatruong

Slide 6

How does this make you feel?

ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

✋ @irinatruong

Slide 7

What about this?

17/09/05 16:40:33 ERROR Utils: Aborting task
java.lang.NullPointerException
  at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:590)
  at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainFixedLenArrayDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:307)
  at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162)
  at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:201)
  at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:467)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$10.apply(ParquetWriteSupport.scala:184)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$10.apply(ParquetWriteSupport.scala:172)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields$1.apply$mcV$sp(ParquetWriteSupport.scala:124)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$consumeField(ParquetWriteSupport.scala:437)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields(ParquetWriteSupport.scala:123)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$write$1.apply$mcV$sp(ParquetWriteSupport.scala:114)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:425)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:113)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:51)
  at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
  at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
  at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:465)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:440)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  ... 78 more lines of Java stack trace

Slide 8

Or, I don’t know, this? Let’s not even provide a reason, shall we? @irinatruong

Slide 9

I know how it made me feel... @irinatruong

Slide 10

So when Dask promised:
● No more Java heap space
● Or “Container killed by YARN”
● Real-time visual feedback
● Meaningful stack traces
@irinatruong

Slide 11

I jumped. @irinatruong

Slide 12

I am here to help you set realistic expectations. If something sounds too good to be true, it usually is.

Slide 13

Let’s back up. @irinatruong

Slide 14

What is Spark?
● General data processing engine: batch processing, stream processing, SQL engine
● Java, Scala, Python or R
● Runs on Hadoop YARN, EC2, Mesos or a standalone cluster
● Developed by the Apache Software Foundation
https://spark.apache.org/ ✋

Slide 15

What is Dask?
● Parallel computing library for analytic computing
● Developed by Anaconda Inc
● Built on top of the Numeric Python ecosystem
● Written in Python, C and Fortran
● Python only
● Runs on Hadoop YARN, EC2, Mesos, a local cluster, Kubernetes and more...
http://dask.pydata.org/en/latest/ ✋

Slide 16

So, I had this Spark application... @irinatruong

Slide 17

Read Parquet in S3 → Aggregate → Transform records → Write JSON in S3

Slide 18

Input Parquet record

customer:    a.com
url:         http://a.com/articles/1
referrer:    http://google.com
session_id:  xxx
ts:          2017-09-15 05:04:21.123

@irinatruong

Slide 19

Aggregated JSON record

{
  "_id": "http://a.com/articles/1|a.com|2017-09-15T01:00",
  "_index": "events",
  "customer": "a.com",
  "url": "http://a.com/articles/1",
  "freq": "1hour",
  "ts": "2017-09-15T01:01:00",
  "metrics": {
    "page_views": 3,
    "visitors": 3
  },
  "referrers": {
    "http://google.com/": 1,
    "http://bing.com/": 1,
    "http://facebook.com/": 1
  }
}

@irinatruong

Slide 20

Let’s read our input data! @irinatruong

Slide 21

Input data structure

./events/year=2017/month=09/day=17/hour=00/customer=a.com/
    part-0.gz.parquet
    part-1.gz.parquet
    ...

(year, month, day, hour and customer are the partitioning columns)

@irinatruong

Slide 22

Read Parquet with Spark

df = sqlContext.read.parquet("./events/")

That was easy. @irinatruong

Slide 23

Read Parquet with Dask

import dask.dataframe as dd

# file mask; drop partitioning columns
df = dd.read_parquet('./events/*/*/*/*/*/*.parquet').drop('hour', axis=1)

@irinatruong

Slide 24

Now, aggregate and transform. @irinatruong

Slide 25

Aggregation
● Round timestamps down to an hour
● Group by customer, url and rounded time period
● COUNT(*) (page views)
● COUNT(DISTINCT(session_id)) (visitors)
● COUNT_VALUES(referrer) (referrers)
@irinatruong

Slide 26

Aggregate with Spark

from pyspark.sql import functions as F

agg = (df.groupby('customer', 'url',
                  F.window('ts', '1 hour').start.alias("ts"))
         .agg(F.count("*").alias("page_views"),
              F.countDistinct('session_id').alias('visitors'),
              count_values('referrer').alias('referrers')))

@irinatruong

Slide 27

Aggregate with Spark (SQL)

agg = sqlContext.sql("""
    select
        customer,
        url,
        window(ts, '1 hour').start as ts,
        count(*) as page_views,
        count(distinct(session_id)) as visitors,
        count_values(referrer) as referrers
    from df
    group by customer, url, window(ts, '1 hour').start
""")

@irinatruong

Slide 28

Spark needs a custom aggregation

# the UDAF itself lives on the JVM side (com.jbennet.daskvsspark.udafs)
agg_counter = sc._jvm.com.jbennet.daskvsspark.udafs.AggregateCounter()
sqlContext.sparkSession._jsparkSession.udf().register(
    'count_values', agg_counter)

from pyspark.sql.column import Column, _to_java_column, _to_seq

def count_values(col):
    """Wrap the UDAF for use in aggregations outside of Spark SQL."""
    counter = sc._jvm.com.jbennet.daskvsspark.udafs.AggregateCounter().apply
    return Column(counter(_to_seq(sc, [col], _to_java_column)))

@irinatruong

Slide 29

At this point...

ipdb> agg.show(5)
+--------+--------------------+-------------------+----------+--------+--------------------+
|customer|                 url|                 ts|page_views|visitors|           referrers|
+--------+--------------------+-------------------+----------+--------+--------------------+
|   a.com|http://a.com/arti...|2017-09-17 12:00:00|         1|       1|[http://bing.com/...|
|   a.com|http://a.com/arti...|2017-09-17 05:00:00|         1|       1|[http://bing.com/...|
|   a.com|http://a.com/arti...|2017-09-17 00:00:00|         1|       1|[http://facebook....|
|   a.com|http://a.com/arti...|2017-09-17 19:00:00|         1|       1|[http://google.co...|
|   a.com|http://a.com/arti...|2017-09-17 22:00:00|         1|       1|[http://google.co...|
+--------+--------------------+-------------------+----------+--------+--------------------+

@irinatruong

Slide 30

Aggregate with Dask

# round timestamps down to an hour
df['ts'] = df['ts'].dt.floor('1H')

# group on customer, timestamp (rounded) and url
gb = df.groupby(['customer', 'url', 'ts'])
ag = gb.agg({'session_id': [count_unique, 'count'],
             'referrer': counter})
ag = ag.reset_index()

✋ @irinatruong

Slide 31

Dask needs two custom aggregations

counter = dd.Aggregation(
    'counter',
    lambda s: s.apply(counter_chunk),
    lambda s: s.apply(counter_agg),
)

count_unique = dd.Aggregation(
    'count_unique',
    lambda s: s.apply(nunique_chunk),
    lambda s: s.apply(nunique_agg)
)

@irinatruong

Slide 32

Chunk function: called on each partition

from collections import Counter

def counter_chunk(ser):
    """Return counter of values in series."""
    return list(Counter(ser.values).items())

@irinatruong

Slide 33

Agg function: puts chunks together

def counter_agg(chunks):
    """Add all counters together and return dict items."""
    total = Counter()
    for chunk in chunks:
        current = Counter(dict(chunk))
        total += current
    return list(total.items())

@irinatruong
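The chunk and agg pair behind count_unique is not shown on the slides (the real versions live in the linked repo). A minimal sketch, assuming session_id values are hashable, might look like this:

def nunique_chunk(ser):
    """Collect the unique values seen in this partition."""
    return list(set(ser.values))

def nunique_agg(chunks):
    """Union the per-partition uniques and count them."""
    total = set()
    for chunk in chunks:
        total.update(chunk)
    return len(total)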

Slide 34

At this point...

ipdb> ag.head(5)
  customer                  url                   ts   session_id         referrer
                                                      count_unique count   counter
0    a.com  http://a.com/art...  2017-09-17 01:00:00             1     2   [(http://bing.co...
1    a.com  http://a.com/art...  2017-09-17 03:00:00             1     1   [(http://faceboo...
2    a.com  http://a.com/art...  2017-09-17 19:00:00             1     1   [(http://bing.co...
3    a.com  http://a.com/art...  2017-09-17 23:00:00             1     1   [(http://google....
4    a.com  http://a.com/art...  2017-09-17 01:00:00             1     1   [(http://google....

@irinatruong

Slide 35

We’re not done: get rid of multilevel columns

ag.columns = ['customer', 'url', 'ts', 'visitors', 'page_views', 'referrers']

Multilevel columns would add an undesired level of hierarchy into JSON. @irinatruong

Slide 36

Transformation of an aggregated record
● Create an _id out of customer, url and ts
● Put page_views and visitors into a dict called metrics
● Make referrers a dict with referrer counts
@irinatruong

Slide 37

Transform with Spark

agg = sqlContext.sql("""
    select
        format_id(customer, url, ts) as _id,
        customer,
        url,
        ts,
        format_metrics(page_views, visitors) as metrics,
        referrers
    from df
""")

@irinatruong

Slide 38

At this point...

ipdb> agg.first().asDict(True)
{'_id': 'http://a.com/articles/14|a.com|2017-09-17T12:00:00',
 'customer': 'a.com',
 'url': 'http://a.com/articles/14',
 'ts': datetime.datetime(2017, 9, 17, 12, 0),
 'metrics': {'page_views': 1, 'visitors': 1},
 'referrers': {'http://bing.com/': 1}}

@irinatruong

Slide 39

Transform with Dask

tr = ag.apply(transform_one, axis=1, meta={'data': str})

@irinatruong

Slide 40

Transform one record with Dask

def transform_one(ser):
    """Turn one aggregated row (a series) into a JSON-ready dict."""
    data = ser.to_dict()
    page_views = data.pop('page_views')
    visitors = data.pop('visitors')
    data.update({
        '_id': format_id(data['customer'], data['url'], data['ts']),
        'ts': data['ts'].strftime('%Y-%m-%dT%H:%M:%S'),
        'metrics': format_metrics(page_views, visitors),
        'referrers': dict(data['referrers'])
    })
    return pd.Series([data], name='data')

@irinatruong
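format_id and format_metrics are helpers from the talk's repo and are not shown on the slides. Judging by the _id and metrics values in the outputs, a minimal sketch could be (hypothetical, not the author's code):

def format_id(customer, url, ts):
    """Build an _id like "url|customer|2017-09-17T01:00:00"."""
    if hasattr(ts, 'strftime'):
        ts = ts.strftime('%Y-%m-%dT%H:%M:%S')
    return '{}|{}|{}'.format(url, customer, ts)

def format_metrics(page_views, visitors):
    """Group the two counts into a metrics dict."""
    return {'page_views': page_views, 'visitors': visitors}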

Slide 41

At this point...

ipdb> tr.head(1)
                                                data
0  {'customer': 'a.com', 'url': 'http://a.com/articles/1', 'ts': '2017-09-17T01:00:00', 'referrers': {'http://bing.com/': 1, 'http://facebook.com/': 1}, '_id': 'http://a.com/articles/1|a.com|2017-09-17T01:00:00', 'metrics': {'page_views': 2, 'visitors': 1}}

@irinatruong

Slide 42

Finally, write it out! @irinatruong

Slide 43

Write JSON with Spark

df.write.json("./aggs_spark/")

@irinatruong

Slide 44

Write JSON with Dask

# no df.to_json, so go through a bag
(tr.to_bag()
   .map(lambda t: t[0])
   .map(json.dumps)
   .to_textfiles('./aggs_dask/*.json'))

@irinatruong

Slide 45

What about performance? @irinatruong

Slide 46

Performance

Total records   Records per partition  Files per hour  Spark timings (mm:ss)  Dask timings (mm:ss)  Ratio (Spark / Dask)
100,000,000     2,000,000              2               2:50                   0:37                  4.6
1,000,000,000   2,000,000              20              45:08                  16:11                 2.8

MacBook Pro, 2.8 GHz Intel Core i7 (4 cores), 16 GB 1600 MHz DDR3 @irinatruong

Slide 47

Benchmarking setup

MacBook Pro, 2.8 GHz Intel Core i7 (4 cores), 16 GB 1600 MHz DDR3

Spark: local mode. Driver memory: 6g. Executor memory: 2g. Memory overhead: 2g. Number of executors: 4.
Dask: local cluster. Worker cores: 1. Worker memory: 4g. Number of workers: 4.

@irinatruong

Slide 48

YARN cluster (Amazon EMR)
● c4.2xlarge: 8 vCore, 15 GiB memory, EBS storage: 64 GiB
● c4.2xlarge: 8 vCore, 15 GiB memory, EBS storage: 64 GiB
● m4.xlarge: 8 vCore, 16 GiB memory, EBS storage: 32 GiB
@irinatruong

Slide 49

Performance (YARN)

1,000,000,000 records in s3, 2,000,000 records per partition (2 files per hour), 3 attempts, mm:ss

Configuration                                                              Attempt 1  Attempt 2  Attempt 3
Spark: driver memory 8g, 4 executors x 4 cores + 3g memory + 2g overhead       11:59      12:14      12:15
Dask, YARN scheduler (knit): 4 workers x 3 cores + 5g memory                   11:12      11:15      11:20
Dask, distributed scheduler: 4 workers x 4 cores + 5g memory                   11:29      11:37      11:43
Dask, distributed scheduler: 5 workers x 4 cores + 5g memory (winner!)          9:40       9:44       9:45

@irinatruong

Slide 50

Cluster deployment with Spark (AWS)
● Create a bootstrap shell script and specify it in the cluster settings
● The bootstrap script creates a venv and installs requirements
● Use bash scripts to deploy code, jars and eggs to the cluster master when they change (next step: fabric scripts)
● SSH to the master node to start a job (next step: use RQ, https://github.com/rq/rq, and a UI to schedule jobs remotely)
https://github.com/j-bennet/talks/tree/master/2018/daskvsspark @irinatruong

Slide 51

Cluster deployment with Dask (Kubernetes)
● Create a Kubernetes cluster in Google Cloud, AWS or MS Azure: https://zero-to-jupyterhub.readthedocs.io/en/v0.4-doc/create-k8s-cluster.html
● Build a Docker image that contains requirements and code
● Use a Helm chart to create and manage the deployment: specify resources, number of instances, etc.
● SSH to the scheduler pod to start a job (see the sketch below)
● Kubernetes-in-Docker or Minikube for local development
● Amazon Elastic Container Service for Kubernetes (EKS) is in beta preview
https://github.com/j-bennet/talks/tree/master/2018/daskvsspark @irinatruong
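Once the scheduler pod is reachable, a job script only needs the scheduler address to submit work. A minimal sketch (the service name below is an assumption; 8786 is the default scheduler port):

from dask.distributed import Client

# 'dask-scheduler' is a hypothetical Kubernetes service name
client = Client('tcp://dask-scheduler:8786')
print(client)  # shows the workers, cores and memory available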

Slide 52

How does Dask compare? @irinatruong

Slide 53

Speaking of promises

Promised                         Yes?
No more Java heap space          ✔
Or Container killed by YARN      ✔
Meaningful stack traces          ✔
Real-time visual feedback        ✔
Profiling tools
Diagnostic tools

Slide 54

Oh, the visual feedback! @irinatruong
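The dashboard ships with the distributed scheduler. A minimal way to see it locally (a sketch, not from the slides; the dashboard is served on port 8787 by default):

from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # starts a local cluster and the diagnostic dashboard

# any computation now shows up in the dashboard's task stream
df = dd.read_parquet('./events/*/*/*/*/*/*.parquet')
print(len(df))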

Slide 55

The good
● All Python
● Familiar if coming from Pandas
● Easy to install with pip; included in Anaconda. Up and running immediately
● Large subset of the Pandas API (filtering, grouping, joining)
● Useful DatetimeIndex operations (floor, ceiling, round)
● Community is growing, and developers are responsive
● Performance
@irinatruong

Slide 56

The bad
● No complex Parquet objects
● No SQL
● Only simple aggregations (no collect_list or nunique)
● Can’t write JSON
● More bugs
● Fewer docs and examples
● Cluster deployments are WIP (dask-ec2, knit, skein, kubernetes)
@irinatruong

Slide 57

The full code https://github.com/j-bennet/talks/tree/master/2018/daskvsspark @irinatruong

Slide 58

Questions? @irinatruong
Slides: https://is.gd/vAqzCS
Huge thanks to Dask developers for their help:
● Martin Durant
● Matthew Rocklin