Leveraging Databricks for Spark Pipelines

Leveraging Databricks for Spark Pipelines How Coatue Management saved time
and money Rose Toomey AWS Data Analytics Week 27 February 2020 1

About me Finance. Technology. Code. And data. So much data.
A software engineer making up for a lifetime of trying to ﬁx data problems at the wrong end of the pipeline. • Software Engineer at Coatue Management • Lead API Developer at Gemini Trust • Director at Novus Partners 2

Apache Spark At Scale In The Cloud Spark+AI Summit Europe
2019, Amsterdam video / slides Moonshot Spark: Serverless Spark with GraalVM Scale By The Bay 2019, Oakland video / slides Upcoming! ✨ A Field Mechanic's Guide To Integration Testing Your Apache Spark App ScalaDays 2020, Seattle Recent Talks 3

About Coatue Management What we do: data engineering at Coatue
• Terabyte scale, billions of rows • Lambda architecture • Functional programming Our stack • Scala (cats, shapeless, fs2, http4s) • Spark / Hadoop • Data warehouses • Python / R / Tableau 4

Our Challenges Coatue Management deal with a massive volume of
data. 1. Many Spark pipelines, many diﬀerent sizes of Spark pipelines. Small, medium, large: each pipeline size has its own tuning and processing issues. 2. All running in the cloud: each cloud provider presents separate operational challenges. Improving the cost eﬃciency and reliability of our Spark pipelines is a huge win for our data engineering team. We want to focus on the input and output of our pipelines, not the operational details surrounding them. 5

• Standardize how pipelines are deployed • Run faster •
Spend less money • Don't get paged over nonsense • That goes double for being rate limited when using S3 • Become BFF with our Databricks account team OK, it wasn't a goal but it happened anyway Our goals 6

• Deployment and scaling toolbox: autoscale the cluster, autoscale the
local storage, define jobs and clusters or just go ad hoc. Multiple cloud providers through a single API. Bigger master, smaller workers. • We wanted to see if Databricks Runtime join and filter optimizations could make our jobs faster relative to what's offered in Apache Spark • Superior, easy to use tools.Spark history server (only recently available elsewhere), historical Ganglia screenshots , easy access to logs from a browser. • Optimized cloud storage access • Run our Spark jobs in the same environment where we run our notebooks Why Databricks? 7

What happened next We began migrating over our pipelines in
order of size and production impact. This was not a direct transposition but brought about a useful re-evaluation of how we were running our Spark jobs. • Submitting a single large job using the Databricks Jobs API instead of multiple smaller jobs using spark-submit • Making the most of the Databricks Runtime • Improve the performance and reliability of reading from and writing to S3 by switching to instance types that support Delta Cache • Optimizing join and ﬁlter operations 8

The Doge Pipeline Case study 1

The "Doge" pipeline A sample large data pipeline that drags.
• ~20 billion rows • File skew. Highly variable row length. • CSV that needs multiline support. Occasional bad rows. • Cluster is 200 nodes with 1600 vCPU, 13 TB memory • Each run takes four hours ...and growing 10

• Instead of multiple jobs (ingestion, multiple processing segments, writing
ﬁnal output to S3) pipeline now runs as a single job • Due to complex query lineages, we ran into an issue where the whole cluster bogged down • And thanks to using an instance type that supports Databricks Delta Caching, found a novel workaround The changes 11

1. Dataset foo (+ lineage ) is in memory 2.
Write dataset foo to S3 (delta cache not populated on write) 3. Read dataset foo (- lineage ) from S3 (delta cache lazily populated by read) 4. Optional… some time later, read dataset foo from S3 again - this time we would expect to hit delta cache instead Counterintuitively, this is faster than checkpointing. Even if your local storage is NVMe SSD. Truncating lineage faster 12

The outcome Runs almost twice as fast on about half
the cluster. • Pipeline now completes in around 2.5 hours • Uses a cluster of 100 nodes • Larger master, 16 vCPU and 120GB memory • Worker nodes total 792 vCPUs and 5.94TB memory • Big pickup provided by NVMe SSD local storage 13

The Hamster Pipeline Case Study 2

A medium sized older pipeline that turns out to have
been running on its hamster wheel most of the time. • 5 billion rows. • CSVs. Wide rows increase size of dataset to medium, relative to small number of rows. • Processing stages like Jurassic Park, but for regexes • Cluster is 100 nodes with 800 vCPU, 6.5 TB memory • Each run takes three hours, but recently started failing due to resource issues even though "nothing changed" The "Hamster" Pipeline 15

We were hoping for a straightforward win by dumping the
Hamster job on Databricks. That didn't happen. We made the same superﬁcial changes as for Doge: single job, NVMe SSD local storage. The job didn't die but it was really slow. Examining the query planner results, we saw cost-based optimizer results all over the place. Some of the queries made choices that were logical at that time but counterproductive for Spark 2.4 using Databricks Runtime. Time to rewrite those older queries! Goodbye Hamster Wheel 16

The Hamster pipeline uses roughly the same amount of resources
as before but now completes with over 5x speedup from the previous runtime. • Pipeline now completes in 35 minutes • Uses a cluster of 200 nodes • Larger master, 16 vCPU and 120GB memory • Worker nodes total 792 vCPUs and 5.94TB memory Hello Hamster Rocket 17

Results • Migrating Coatue Management’s Spark pipelines to Databricks has
reduced operational overhead while saving time and money. • Our cloud storage reads and writes are now more reliable • Jobs and clusters are now managed through a single simple REST API instead of a Tower of Babel toolchain for diﬀerent cloud providers Interested in ﬁnding out more about data engineering at Coatue Management? Come talk to us! 18

Leveraging Databricks for Spark Pipelines

Leveraging Databricks for Spark Pipelines

Rose Toomey

More Decks by Rose Toomey

Other Decks in Technology

Featured

Transcript

Leveraging Databricks for Spark Pipelines How Coatue Management saved time

About me Finance. Technology. Code. And data. So much data.

Apache Spark At Scale In The Cloud Spark+AI Summit Europe

About Coatue Management What we do: data engineering at Coatue

Our Challenges Coatue Management deal with a massive volume of

• Standardize how pipelines are deployed • Run faster •

• Deployment and scaling toolbox: autoscale the cluster, autoscale the

What happened next We began migrating over our pipelines in

The Doge Pipeline Case study 1

The "Doge" pipeline A sample large data pipeline that drags.

• Instead of multiple jobs (ingestion, multiple processing segments, writing

1. Dataset foo (+ lineage ) is in memory 2.

The outcome Runs almost twice as fast on about half

The Hamster Pipeline Case Study 2

A medium sized older pipeline that turns out to have

We were hoping for a straightforward win by dumping the

The Hamster pipeline uses roughly the same amount of resources

Results • Migrating Coatue Management’s Spark pipelines to Databricks has