
Adapting from Spark to Dask: What to expect

Until very recently, Apache Spark has been the de facto standard framework for batch data processing. For Python developers, diving into Spark is challenging, because it requires learning the Java infrastructure, its memory management, and its configuration management. The multiple layers of indirection also make things harder to debug, especially once the PySpark wrapper is thrown into the equation.

With Dask emerging as a pure-Python framework for parallel computing, Python developers might be looking at it with new hope, wondering whether it could work for them in place of Spark. In this talk, I use a data aggregation example to highlight the important differences between the two frameworks and to make clear how involved the switch may be.

Irina Truong

May 11, 2018

Transcript

  1. Irina Truong
     I work for Parse.ly as a Backend Engineer. We process lots of data with Apache Spark and Apache Storm. I wrote wharfee, and maintain pgcli.
     https://github.com/j-bennet/ @irinatruong

  2. You may be considering a journey... Is it possible? What will go wrong? What about performance? Is it worth it? @irinatruong

  3. How does this make you feel?
     ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
     ✋ @irinatruong

  4. What about this?
     17/09/05 16:40:33 ERROR Utils: Aborting task
     java.lang.NullPointerException
    at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:590) at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainFixedLenArrayDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:307) at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162) at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:201) at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:467) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$10.apply(ParquetWriteSupport.scala:1 84) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$10.apply(ParquetWriteSupport.scala:1 72) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields$1.apply$mcV$sp(ParquetWriteSupport. scala:124) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$consumeField(ParquetWriteSupport.scala:437) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields(ParquetWriteSupport.scala:123) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$write$1.apply$mcV$sp(ParquetWriteSupport.scala:114) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:425) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:113) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:51) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:465) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask$$anonfun$execute$2.apply(FileFormatWriter.scala:440) at scala.collection.Iterator$class.foreach(Iterator.scala:893) ...78 more lines of Java stack trace
  5. So when Dask promised: no more “Java heap space” or “Container killed by YARN”, real-time visual feedback, meaningful stack traces... @irinatruong

  6. I am here to help you set realistic expectations. If something sounds too good to be true, it usually is.

  7. What is Spark?
     A general data processing engine: batch processing, stream processing, SQL engine. Java, Scala, Python or R. Runs on Hadoop YARN, EC2, Mesos or a standalone cluster. Developed by the Apache Software Foundation. https://spark.apache.org/ ✋

  8. What is Dask?
     A parallel computing library for analytic computing. Developed by Anaconda Inc. Built on top of the numeric Python ecosystem; written in Python, C and Fortran. Python only. Runs on Hadoop YARN, EC2, Mesos, Kubernetes, a local cluster, and more. http://dask.pydata.org/en/latest/ ✋

  9. Input Parquet record
     customer:   a.com
     url:        http://a.com/articles/1
     referrer:   http://google.com
     session_id: xxx
     ts:         2017-09-15 05:04:21.123
     @irinatruong

  10. Aggregated JSON record
      {
        "_id": "http://a.com/articles/1|a.com|2017-09-15T01:00",
        "_index": "events",
        "customer": "a.com",
        "url": "http://a.com/articles/1",
        "freq": "1hour",
        "ts": "2017-09-15T01:01:00",
        "metrics": {"page_views": 3, "visitors": 3},
        "referrers": {
          "http://google.com/": 1,
          "http://bing.com/": 1,
          "http://facebook.com/": 1
        }
      }
      @irinatruong

  11. Read Parquet with Dask
      import dask.dataframe as dd

      # file mask; drop partitioning columns
      df = dd.read_parquet('./events/*/*/*/*/*/*.parquet').drop('hour', axis=1)
      @irinatruong

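      For comparison, the Spark side of the same read might look roughly like this sketch (the SparkSession variable name spark is an assumption; Spark picks up the partitioning columns from the directory layout):

      # A sketch of the equivalent PySpark read; 'spark' is an assumed SparkSession.
      df = spark.read.parquet('./events/').drop('hour')
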
  12. Aggregation
      Round timestamps down to an hour. Group by customer, url and the rounded time period, computing:
      COUNT(*) (page views), COUNT(DISTINCT(session_id)) (visitors), COUNT_VALUES(referrer) (referrers).
      A plain-pandas illustration of this spec follows below. @irinatruong

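      To make this concrete, here is a tiny plain-pandas illustration of the same aggregation on a toy frame (my own example, not from the talk; COUNT_VALUES is approximated with collections.Counter):

      import pandas as pd
      from collections import Counter

      events = pd.DataFrame({
          'customer':   ['a.com'] * 3,
          'url':        ['http://a.com/articles/1'] * 3,
          'referrer':   ['http://google.com/', 'http://bing.com/', 'http://google.com/'],
          'session_id': ['xxx', 'yyy', 'xxx'],
          'ts': pd.to_datetime(['2017-09-15 01:01:00',
                                '2017-09-15 01:15:00',
                                '2017-09-15 01:59:00']),
      })

      events['ts'] = events['ts'].dt.floor('1H')    # round down to the hour

      agg = events.groupby(['customer', 'url', 'ts']).agg({
          'session_id': ['count', 'nunique'],        # page_views, visitors
          'referrer': lambda s: dict(Counter(s)),    # referrers
      })
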
  13. Aggregate with Spark
      from pyspark.sql import functions as F

      agg = (df.groupby('customer', 'url',
                        F.window('ts', '1 hour').start.alias("ts"))
               .agg(F.count("*").alias("page_views"),
                    F.countDistinct('session_id').alias('visitors'),
                    count_values('referrer').alias('referrers')))
      @irinatruong

  14. Aggregate with Spark (SQL)
      agg = sqlContext.sql("""
          select customer, url,
                 window(ts, '1 hour').start as ts,
                 count(*) as page_views,
                 count(distinct(session_id)) as visitors,
                 count_values(referrer) as referrers
          from df
          group by customer, url, window(ts, '1 hour').start
      """)
      @irinatruong

  15. Spark needs a custom aggregation
      agg_counter = sc._jvm.com.jbennet.daskvsspark.udafs.AggregateCounter()
      sqlContext.sparkSession._jsparkSession.udf().register('count_values', agg_counter)

      def count_values(col):
          """Register UDAF for use in aggregations outside of Spark SQL."""
          counter = sc._jvm.com.jbennet.daskvsspark.udafs.AggregateCounter().apply
          return Column(counter(_to_seq(sc, [col], _to_java_column)))
      @irinatruong

  16. At this point...
      ipdb> agg.show(5)
      +--------+--------------------+-------------------+----------+--------+--------------------+
      |customer|                 url|                 ts|page_views|visitors|           referrers|
      +--------+--------------------+-------------------+----------+--------+--------------------+
      |   a.com|http://a.com/arti...|2017-09-17 12:00:00|         1|       1|[http://bing.com/...|
      |   a.com|http://a.com/arti...|2017-09-17 05:00:00|         1|       1|[http://bing.com/...|
      |   a.com|http://a.com/arti...|2017-09-17 00:00:00|         1|       1|[http://facebook....|
      |   a.com|http://a.com/arti...|2017-09-17 19:00:00|         1|       1|[http://google.co...|
      |   a.com|http://a.com/arti...|2017-09-17 22:00:00|         1|       1|[http://google.co...|
      +--------+--------------------+-------------------+----------+--------+--------------------+
      @irinatruong

  17. Aggregate with Dask
      # round timestamps down to an hour
      df['ts'] = df['ts'].dt.floor('1H')

      # group on customer, timestamp (rounded) and url
      gb = df.groupby(['customer', 'url', 'ts'])
      ag = gb.agg({'session_id': [count_unique, 'count'],
                   'referrer': counter})
      ag = ag.reset_index()
      ✋ @irinatruong

  18. Dask needs two custom aggregations
      counter = dd.Aggregation(
          'counter',
          lambda s: s.apply(counter_chunk),
          lambda s: s.apply(counter_agg),
      )

      count_unique = dd.Aggregation(
          'count_unique',
          lambda s: s.apply(nunique_chunk),
          lambda s: s.apply(nunique_agg),
      )
      @irinatruong

  19. Chunk function: called on each partition
      def counter_chunk(ser):
          """Return counter of values in series."""
          return list(Counter(ser.values).items())
      @irinatruong

  20. Agg function: puts chunks together
      def counter_agg(chunks):
          """Add all counters together and return dict items."""
          total = Counter()
          for chunk in chunks:
              current = Counter(dict(chunk))
              total += current
          return list(total.items())
      @irinatruong

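      The nunique_chunk / nunique_agg pair referenced on slide 18 is not shown in the deck; a minimal sketch following the same chunk/agg pattern might look like this (my assumption, not the speaker's code):

      def nunique_chunk(ser):
          """Per partition: collect the distinct values seen in this series."""
          return list(set(ser.values))

      def nunique_agg(chunks):
          """Across partitions: union the distinct values and count them."""
          distinct = set()
          for chunk in chunks:
              distinct.update(chunk)
          return len(distinct)
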
  21. At this point...
      ipdb> ag.head(5)
        customer                  url                   ts    session_id           referrer
                                                            count_unique  count     counter
      0    a.com  http://a.com/art...  2017-09-17 01:00:00              1      2  [(http://bing.co...
      1    a.com  http://a.com/art...  2017-09-17 03:00:00              1      1  [(http://faceboo...
      2    a.com  http://a.com/art...  2017-09-17 19:00:00              1      1  [(http://bing.co...
      3    a.com  http://a.com/art...  2017-09-17 23:00:00              1      1  [(http://google....
      4    a.com  http://a.com/art...  2017-09-17 01:00:00              1      1  [(http://google....
      @irinatruong

  22. We’re not done: get rid of multilevel columns
      ag.columns = ['customer', 'url', 'ts', 'visitors', 'page_views', 'referrers']
      Multilevel columns would add an undesired level of hierarchy into the JSON. @irinatruong

  23. Transformation
      To produce the aggregated record: create an _id out of customer, url and ts; put page_views and visitors into a dict called metrics; make referrers a dict of referrer counts. @irinatruong

  24. Transform with Spark
      agg = sqlContext.sql("""
          select format_id(customer, url, ts) as _id,
                 customer, url, ts,
                 format_metrics(page_views, visitors) as metrics,
                 referrers
          from df
      """)
      @irinatruong

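      format_id and format_metrics are used above as SQL functions, but their registration is not shown in the deck. A minimal sketch, assuming plain Python implementations (the bodies below are my guesses, shaped after the output on the next slide):

      from pyspark.sql.types import MapType, StringType, IntegerType

      # Hypothetical helpers; the real ones live in the talk's repository.
      def format_id(customer, url, ts):
          return '{}|{}|{}'.format(url, customer, ts.strftime('%Y-%m-%dT%H:%M:%S'))

      def format_metrics(page_views, visitors):
          return {'page_views': page_views, 'visitors': visitors}

      sqlContext.registerFunction('format_id', format_id, StringType())
      sqlContext.registerFunction('format_metrics', format_metrics,
                                  MapType(StringType(), IntegerType()))
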
  25. At this point...
      ipdb> agg.first().asDict(True)
      {'_id': 'http://a.com/articles/14|a.com|2017-09-17T12:00:00',
       'customer': 'a.com',
       'url': 'http://a.com/articles/14',
       'ts': datetime.datetime(2017, 9, 17, 12, 0),
       'metrics': {'page_views': 1, 'visitors': 1},
       'referrers': {'http://bing.com/': 1}}
      @irinatruong

  26. Transform one record with Dask
      data = ser.to_dict()
      page_views = data.pop('page_views')
      visitors = data.pop('visitors')
      data.update({
          '_id': format_id(data['customer'], data['url'], data['ts']),
          'ts': data['ts'].strftime('%Y-%m-%dT%H:%M:%S'),
          'metrics': format_metrics(page_views, visitors),
          'referrers': dict(data['referrers'])
      })
      return pd.Series([data], name='data')
      @irinatruong

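      The slide shows only the function body. One way to wrap it and apply it across the aggregated Dask frame is sketched below, under my own assumptions (the talk's repository may do this differently; here the row function returns the dict directly rather than a one-element Series):

      import pandas as pd

      def transform_one(ser):
          """Turn one aggregated row (a pandas Series) into a JSON-ready dict."""
          data = ser.to_dict()
          page_views = data.pop('page_views')
          visitors = data.pop('visitors')
          data.update({
              '_id': format_id(data['customer'], data['url'], data['ts']),
              'ts': data['ts'].strftime('%Y-%m-%dT%H:%M:%S'),
              'metrics': format_metrics(page_views, visitors),
              'referrers': dict(data['referrers']),
          })
          return data

      # Apply row by row within each partition; meta describes the output column.
      tr = ag.map_partitions(
          lambda pdf: pdf.apply(transform_one, axis=1).to_frame('data'),
          meta={'data': object},
      )
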
  27. At this point...
      ipdb> tr.head(1)
                                                      data
      0  {'customer': 'a.com', 'url': 'http://a.com/articles/1', 'ts': '2017-09-17T01:00:00', 'referrers': {'http://bing.com/': 1, 'http://facebook.com/': 1}, '_id': 'http://a.com/articles/1|a.com|2017-09-17T01:00:00', 'metrics': {'page_views': 2, 'visitors': 1}}
      @irinatruong

  28. Performance
      Total records  | Records per partition | Files per hour | Spark (mm:ss) | Dask (mm:ss) | Ratio (Spark / Dask)
      100,000,000    | 2,000,000             | 2              | 2:50          | 0:37         | 4.6
      1,000,000,000  | 2,000,000             | 20             | 45:08         | 16:11        | 2.8
      MacBook Pro, 2.8 GHz Intel Core i7 (4 cores), 16 GB 1600 MHz DDR3
      @irinatruong

  29. Benchmarking setup
      MacBook Pro, 2.8 GHz Intel Core i7 (4 cores), 16 GB 1600 MHz DDR3
      Spark: local mode; driver memory 6g, executor memory 2g, memory overhead 2g, 4 executors.
      Dask: local cluster; 4 workers, 1 core and 4g memory per worker.
      @irinatruong

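      For reference, the Dask side of such a local setup can be spun up with dask.distributed (a sketch matching the numbers above, not code from the talk):

      from dask.distributed import Client, LocalCluster

      # Four single-threaded workers with 4 GB each, as in the benchmark setup.
      cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='4GB')
      client = Client(cluster)
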
  30. YARN cluster (Amazon EMR)
      c4.2xlarge: 8 vCore, 15 GiB memory, EBS storage 64 GiB
      c4.2xlarge: 8 vCore, 15 GiB memory, EBS storage 64 GiB
      m4.xlarge:  8 vCore, 16 GiB memory, EBS storage 32 GiB
      @irinatruong

  31. Performance (YARN)
      1,000,000,000 records in S3, 2,000,000 records per partition (2 files per hour), 3 attempts, mm:ss
      Spark, driver memory 8g, 4 executors x 4 cores + 3g memory + 2g overhead: 11:59, 12:14, 12:15
      Dask YARN scheduler (knit), 4 workers x 3 cores + 5g memory: 11:12, 11:15, 11:20
      Dask distributed scheduler, 4 workers x 4 cores + 5g memory: 11:29, 11:37, 11:43
      Dask distributed scheduler, 5 workers x 4 cores + 5g memory: 9:40, 9:44, 9:45 (winner!)
      @irinatruong

  32. Cluster deployment with Spark (AWS)
      • Create a bootstrap shell script and specify it in cluster settings
      • The bootstrap script creates a venv and installs requirements
      • Use bash scripts to deploy code, jars and eggs to the cluster master when they change (next step: fabric scripts)
      • SSH to the master node to start a job (next step: use RQ, https://github.com/rq/rq, and a UI to schedule jobs remotely)
      https://github.com/j-bennet/talks/tree/master/2018/daskvsspark @irinatruong

  33. Cluster deployment with Dask (Kubernetes)
      • Cluster with Kubernetes in Google Cloud, AWS or MS Azure: https://zero-to-jupyterhub.readthedocs.io/en/v0.4-doc/create-k8s-cluster.html
      • Docker image that contains requirements and code
      • Helm chart to create and manage the deployment: specify resources, number of instances, etc.
      • SSH to the scheduler pod to start a job
      • Kubernetes-in-Docker or Minikube for local development
      • Amazon Elastic Container Service for Kubernetes (EKS) is in beta preview
      https://github.com/j-bennet/talks/tree/master/2018/daskvsspark @irinatruong

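      Once the scheduler pod is running, a job script typically just points a dask.distributed Client at it (a sketch; the service name is an assumption, 8786 is the default scheduler port):

      from dask.distributed import Client

      # 'dask-scheduler' is a hypothetical in-cluster service name.
      client = Client('tcp://dask-scheduler:8786')
      print(client)  # shows the workers currently registered with the scheduler
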
  34. Speaking of promises
      Promised                        Yes?
      No more Java heap space         ✔
      Or Container killed by YARN     ✔
      Meaningful stack traces         ✔
      Real-time visual feedback       ✔
      Profiling tools
      Diagnostic tools

  35. The good
      • All Python
      • Familiar if coming from Pandas
      • Easy to install with pip; included in Anaconda; up and running immediately
      • Large subset of the Pandas API (filtering, grouping, joining)
      • Useful DatetimeIndex operations (floor, ceiling, round)
      • Community is growing, and developers are responsive
      • Performance
      @irinatruong

  36. The bad
      • No complex Parquet objects
      • No SQL
      • Only simple aggregations (no collect_list or nunique)
      • Can’t write JSON
      • More bugs
      • Less docs and examples
      • Cluster deployments are WIP (dask-ec2, knit, skein, kubernetes)
      @irinatruong