
Spark in the ☁️ without going bankrupt

Spark in the cloud without going bankrupt. Platform Lunar's journey from AWS Elastic MapReduce to ephemeral Spark clusters on AWS EC2 spot instances to save costs and boost data engineer productivity.

Saulius Grigaliunas

February 22, 2018



Transcript

  1. Disclaimer
     • We run on Amazon Web Services. Should work on other cloud providers
     • We just started product development
     • We currently only run batch workloads
  2. • On-premise, self-managed infrastructure
     • Fast dedicated servers
     • Fast network
     • Self-managed Cloudera’s CDH distribution
  3. • On-premise, self-managed infrastructure
     • Fast dedicated servers
     • Fast network
     • Self-managed Cloudera’s CDH distribution
     vs.
     • Runs on AWS
     • Shared servers
     • Shared network
     • Uses AWS Elastic MapReduce service with on-demand EC2 instances
  4. Use cases for the project
     • Daily workflows
     • Hourly workflows
     • Data integrity checks (via Apache Hive)
     • Ad-hoc job runs
     • Notebook (Jupyter) usage with Spark
  5. Challenges with Spark on EMR
     • Costs
     • Performance
     • Data engineer productivity
     • Cost allocation visibility
  6. TODO for the next project
     • Minimise costs
     • As little infrastructure management as possible
     • Infrastructure scaling as needed
     • Enable data engineers
     • Data-related feature cost visibility
     • Apache Spark all the things
  7. ???

  8. On-demand instances vs. reserved instances vs. spot instances
     On-demand:
     • Default instance type
     • Pay-as-you-use, per-second billing
     • Runs uninterrupted
     Reserved:
     • Up to 75% cheaper than on-demand
     • Fixed price for contract term
     • Runs uninterrupted
     Spot:
     • Up to 90% cheaper than on-demand
     • Pay-as-you-use, per-second billing
     • Marketplace-based, with bidding on compute capacity
     • Can be interrupted at any time, with a 2-minute warning beforehand
  9. S3 for data storage
     • Shared across instances and clusters
     • Zero maintenance
     • Cheap
     • Independent from compute infrastructure
     • Up to 25 Gbit/s of bandwidth from EC2
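A minimal PySpark sketch of what this looks like in practice: datasets live under s3a:// paths instead of HDFS, so any ephemeral cluster can read and write the same data. The bucket names and paths are placeholders, not the actual layout from the deck.

```python
# Sketch: reading and writing S3-backed datasets from an ephemeral cluster.
# Assumes hadoop-aws/s3a and credentials are already configured on the AMI.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-backed-dataset")
    .getOrCreate()
)

# Hypothetical bucket and partition layout.
events = spark.read.parquet("s3a://example-data-lake/events/date=2018-02-21/")
daily_counts = events.groupBy("user_id").count()
daily_counts.write.mode("overwrite").parquet(
    "s3a://example-data-lake/daily_counts/date=2018-02-21/"
)
```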
  10. Hive metastore on a dedicated machine
     • Dedicated, on-demand machine
     • Central source of truth about dataset structure and location
     • Uses a dedicated MySQL instance running on AWS Relational Database Service (RDS)
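A sketch of how a Spark session on an ephemeral cluster could point at the shared metastore. The Thrift endpoint hostname is a placeholder and this exact configuration is an assumption, not something shown in the deck.

```python
# Sketch: every short-lived cluster talks to the same, long-lived metastore,
# so table definitions and S3 locations survive cluster teardown.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-metastore")
    # Placeholder hostname for the dedicated metastore machine.
    .config("hive.metastore.uris", "thrift://hive-metastore.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables resolve to the S3 locations recorded in the metastore.
spark.sql("SHOW TABLES").show()
```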
  11. Custom Amazon Machine Image (AMI) for fast boot
     • Needed for fast booting
     • Spark pre-installed
     • Spark in standalone mode
     • Config files kept in S3, synced on boot
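The boot-time sync could look roughly like this boto3 sketch; the bucket, prefix, and destination directory are assumptions, not the deck's actual paths.

```python
# Sketch: pull the cluster's config files from S3 into Spark's conf dir at boot.
import os
import boto3

BUCKET = "example-spark-config"   # hypothetical bucket
PREFIX = "conf/prod/"             # hypothetical per-environment prefix
DEST = "/opt/spark/conf"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        filename = os.path.basename(obj["Key"])
        if filename:  # skip the prefix "directory" entry itself
            s3.download_file(BUCKET, obj["Key"], os.path.join(DEST, filename))
```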
  12. Clusters API
     • API Gateway + Lambda functions + DynamoDB
     • POST /clusters - create a new cluster
     • GET /clusters/:workflow - get cluster info
     • DELETE /clusters/:workflow - destroy a cluster
     • PATCH /clusters/:workflow/keepalive - keep a cluster running for a while
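A minimal sketch of what a Lambda handler behind API Gateway could look like for these routes, assuming a DynamoDB table named "clusters" keyed by workflow name. The table name, item fields, and the two-hour TTL are assumptions, not the actual implementation.

```python
# Sketch: one proxy-integration Lambda handling the clusters API routes.
import json
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("clusters")  # hypothetical table, keyed by "workflow"

KEEPALIVE_SECONDS = 2 * 3600  # assumed TTL extension per keepalive ping

def handler(event, context):
    method = event["httpMethod"]
    workflow = (event.get("pathParameters") or {}).get("workflow")

    if method == "POST":  # POST /clusters
        body = json.loads(event.get("body") or "{}")
        item = {
            "workflow": body["workflow"],
            "worker_count": body.get("worker_count", 2),
            "ttl": int(time.time()) + KEEPALIVE_SECONDS,
            "state": "requested",
        }
        table.put_item(Item=item)
        return {"statusCode": 201, "body": json.dumps(item)}

    if method == "GET":  # GET /clusters/:workflow
        resp = table.get_item(Key={"workflow": workflow})
        return {"statusCode": 200, "body": json.dumps(resp.get("Item"), default=str)}

    if method == "PATCH":  # PATCH /clusters/:workflow/keepalive
        table.update_item(
            Key={"workflow": workflow},
            UpdateExpression="SET #t = :t",
            ExpressionAttributeNames={"#t": "ttl"},
            ExpressionAttributeValues={":t": int(time.time()) + KEEPALIVE_SECONDS},
        )
        return {"statusCode": 204, "body": ""}

    if method == "DELETE":  # DELETE /clusters/:workflow
        table.delete_item(Key={"workflow": workflow})
        return {"statusCode": 202, "body": ""}

    return {"statusCode": 405, "body": ""}
```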
  13. • ✅ Minimise costs - only spot instances
     • ✅ As little infrastructure management as possible - S3, standalone Hive metastore, spot instances
     • ✅ Infrastructure scaling as needed - S3, spot instances
     • ✅ Enable data engineers - cluster API, spot instances
     • ✅ Data-related feature cost visibility - cost allocation tags, node tagging
     • Apache Spark all the things
  14. Apache Spark all the things
     Then → Now
     • Custom Python apps for ETL → Spark
     • Hive for integrity checks → Spark
     • Hive for JDBC client access → Spark Thriftserver
     • Hadoop and YARN → Spark Standalone master
     • HDFS → S3
  15. Where is my cluster ⁉
     • ip-10-55-22-104.eu-west-1.compute.internal
     • Dynamic DNS via AWS Route53 and Lambda functions
     • workflow-name.spark.environment.domain.com
     • e.g. sg-run-job.spark.prod.platform-lunar.com
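A sketch of the record upsert such a Lambda function could perform with boto3: map the friendly workflow name to the cluster master's IP. The hosted zone ID and the function shape are assumptions.

```python
# Sketch: upsert an A record so the cluster is reachable at a stable name.
import boto3

route53 = boto3.client("route53")

def register_cluster(workflow: str, master_ip: str, env: str = "prod"):
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone ID
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"{workflow}.spark.{env}.platform-lunar.com",
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": master_ip}],
                },
            }]
        },
    )
```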
  16. How do you run a Spark job?
     • Using Apache Livy (incubating) to run Spark jobs
     • Livy - REST API to an Apache Spark cluster
     • POST /batches - JAR file, class name, SparkConf params, executor and driver params
     • GET /batches/:batch_id - job state
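A sketch of submitting and polling a batch through Livy's documented /batches REST API from Python. The Livy host, JAR location, class name, and resource settings are placeholders, and Livy's default port 8998 is assumed.

```python
# Sketch: submit a JAR as a Livy batch, then poll its state until it finishes.
import time
import requests

LIVY = "http://sg-run-job.spark.prod.platform-lunar.com:8998"  # hypothetical host

resp = requests.post(f"{LIVY}/batches", json={
    "file": "s3a://example-artifacts/jobs/etl-assembly.jar",  # hypothetical JAR
    "className": "com.example.etl.DailyJob",                  # hypothetical class
    "executorMemory": "4g",
    "numExecutors": 8,
    "conf": {"spark.sql.shuffle.partitions": "200"},
})
batch_id = resp.json()["id"]

while True:
    state = requests.get(f"{LIVY}/batches/{batch_id}").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(30)
```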
  17. How do you turn off a cluster?
     • ./rmcluster.sh
     • DELETE /clusters/:workflow - destroy a cluster
     • Pinging: PATCH /clusters/:workflow/keepalive
     • Time to live
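Client-side, the keepalive pinging and explicit teardown could be as simple as this sketch against the clusters API from slide 12; the base URL, workflow name, and ping interval are placeholders.

```python
# Sketch: extend the cluster's TTL while work is running, or tear it down.
import time
import requests

API = "https://clusters.example.com"  # hypothetical API Gateway endpoint
WORKFLOW = "sg-run-job"               # hypothetical workflow name

def keep_alive_while(still_running):
    """Ping the keepalive endpoint periodically so the TTL keeps being extended."""
    while still_running():
        requests.patch(f"{API}/clusters/{WORKFLOW}/keepalive", timeout=10)
        time.sleep(300)

def destroy():
    """Explicit teardown, the API call behind ./rmcluster.sh."""
    requests.delete(f"{API}/clusters/{WORKFLOW}", timeout=10)
```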
  18. What happens if a cluster node is reclaimed by Amazon?
     • Bid high enough - happens extremely rarely
     • Block duration for the master node (30-45% cheaper than on-demand instances)
     • Worker spot fleet allocation strategy: cheapest vs. diversified
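A boto3 sketch of the two ideas above: a defined-duration ("block") spot request for the master, and a spot fleet with a chosen allocation strategy for the workers. AMI IDs, instance types, the fleet role ARN, and capacity numbers are placeholders, not the deck's actual provisioning code.

```python
# Sketch: master on a block-duration spot instance, workers via a spot fleet.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Master: runs for the whole block without being reclaimed,
# at a 30-45% discount versus on-demand.
ec2.request_spot_instances(
    InstanceCount=1,
    BlockDurationMinutes=180,  # 1-6 hour blocks
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # the custom Spark AMI (placeholder ID)
        "InstanceType": "m4.xlarge",
        "KeyName": "spark-clusters",
    },
)

# Workers: allocation strategy decides whether capacity comes from the single
# cheapest pool or is spread across pools to survive reclamation better.
ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
        "TargetCapacity": 8,
        "AllocationStrategy": "diversified",  # or "lowestPrice"
        "LaunchSpecifications": [
            {"ImageId": "ami-0123456789abcdef0", "InstanceType": "m4.xlarge"},
            {"ImageId": "ami-0123456789abcdef0", "InstanceType": "m4.2xlarge"},
        ],
    },
)
```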
  19. How do you persist Spark job logs?
     • Experiments with an S3 log appender and an AWS CloudWatch log appender
     • AWS Elastic File System (EFS), a Network File System (NFS) based solution
     • Files from EFS served via a tiny Nginx instance, deleted periodically
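The periodic deletion could be a small scheduled job along these lines; the mount point and retention window are assumptions.

```python
# Sketch: prune old log files from the EFS mount that Nginx serves.
import time
from pathlib import Path

LOG_DIR = Path("/mnt/efs/spark-logs")  # hypothetical EFS mount point
MAX_AGE = 7 * 24 * 3600                # assumed one-week retention

now = time.time()
for log_file in LOG_DIR.glob("*"):
    if log_file.is_file() and now - log_file.stat().st_mtime > MAX_AGE:
        log_file.unlink()
```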