
Spark in the ☁️ without going bankrupt

Spark in the cloud without going bankrupt. Platform Lunar's journey from AWS Elastic MapReduce to ephemeral Spark clusters on AWS EC2 spot instances to save costs and boost data engineer productivity.

Saulius Grigaliunas

February 22, 2018



Transcript

  1. Disclaimer
     • We run on Amazon Web Services. Should work on other cloud providers
     • We just started product development
     • We currently only run batch workloads
  2. • On-premise, self-managed infrastructure
     • Fast dedicated servers
     • Fast network
     • Self-managed Cloudera’s CDH distribution
  3. • On-premise, self-managed infrastructure
     • Fast dedicated servers
     • Fast network
     • Self-managed Cloudera’s CDH distribution
     vs.
     • Runs on AWS
     • Shared servers
     • Shared network
     • Uses AWS Elastic MapReduce service with on-demand EC2 instances
  4. Use cases for the project
     • Daily workflows
     • Hourly workflows
     • Data integrity checks (via Apache Hive)
     • Ad-hoc job runs
     • Notebook (Jupyter) usage with Spark
  5. Challenges with Spark on EMR
     • Costs
     • Performance
     • Data engineer productivity
     • Cost allocation visibility
  6. TODO for the next project
     • Minimise costs
     • As little infrastructure management as possible
     • Infrastructure scaling as needed
     • Enable data engineers
     • Data-related feature cost visibility
     • Apache Spark all the things
  7. ???

  8. On-demand instances vs. reserved instances vs. spot instances
     On-demand:
     • Default instance type
     • Pay-as-you-use, per-second billing
     • Runs uninterrupted
     Reserved:
     • Up to 75% cheaper than on-demand
     • Fixed price for contract term
     • Runs uninterrupted
     Spot:
     • Up to 90% cheaper than on-demand
     • Pay-as-you-use, per-second billing
     • Marketplace-based, with bidding on compute capacity
     • Can be interrupted at any time, with a 2-minute warning beforehand
  9. S3 for data storage
     • Shared across instances and clusters
     • Zero maintenance
     • Cheap
     • Independent from compute infrastructure
     • Up to 25 Gbit/s of bandwidth from EC2
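A minimal PySpark sketch of what this looks like in practice: datasets live under s3a:// paths instead of HDFS, so any ephemeral cluster can read and write the same data. The bucket names and paths are placeholders, not the actual layout from the deck.

```python
# Sketch: reading and writing S3-backed datasets from an ephemeral cluster.
# Assumes hadoop-aws/s3a and credentials are already configured on the AMI.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-backed-dataset")
    .getOrCreate()
)

# Hypothetical bucket and partition layout.
events = spark.read.parquet("s3a://example-data-lake/events/date=2018-02-21/")
daily_counts = events.groupBy("user_id").count()
daily_counts.write.mode("overwrite").parquet(
    "s3a://example-data-lake/daily_counts/date=2018-02-21/"
)
```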
  10. Hive metastore on a dedicated machine
     • Dedicated, on-demand machine
     • Central source of truth about dataset structure and location
     • Uses a dedicated MySQL instance running on AWS Relational Database Service (RDS)
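A sketch of how a Spark session on an ephemeral cluster could point at the shared metastore. The Thrift endpoint hostname is a placeholder and this exact configuration is an assumption, not something shown in the deck.

```python
# Sketch: every short-lived cluster talks to the same, long-lived metastore,
# so table definitions and S3 locations survive cluster teardown.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-metastore")
    # Placeholder hostname for the dedicated metastore machine.
    .config("hive.metastore.uris", "thrift://hive-metastore.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables resolve to the S3 locations recorded in the metastore.
spark.sql("SHOW TABLES").show()
```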
  11. Custom Amazon Machine Image (AMI) for fast boot
     • Needed for fast booting
     • Spark pre-installed
     • Spark in standalone mode
     • Config files kept in S3, synced on boot
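The boot-time sync could look roughly like this boto3 sketch; the bucket, prefix, and destination directory are assumptions, not the deck's actual paths.

```python
# Sketch: pull the cluster's config files from S3 into Spark's conf dir at boot.
import os
import boto3

BUCKET = "example-spark-config"   # hypothetical bucket
PREFIX = "conf/prod/"             # hypothetical per-environment prefix
DEST = "/opt/spark/conf"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        filename = os.path.basename(obj["Key"])
        if filename:  # skip the prefix "directory" entry itself
            s3.download_file(BUCKET, obj["Key"], os.path.join(DEST, filename))
```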
  12. Clusters API
     • API Gateway + Lambda functions + DynamoDB
     • POST /clusters - create a new cluster
     • GET /clusters/:workflow - get cluster info
     • DELETE /clusters/:workflow - destroy a cluster
     • PATCH /clusters/:workflow/keepalive - keep a cluster running for a while
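A minimal sketch of what a Lambda handler behind API Gateway could look like for these routes, assuming a DynamoDB table named "clusters" keyed by workflow name. The table name, item fields, and the two-hour TTL are assumptions, not the actual implementation.

```python
# Sketch: one proxy-integration Lambda handling the clusters API routes.
import json
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("clusters")  # hypothetical table, keyed by "workflow"

KEEPALIVE_SECONDS = 2 * 3600  # assumed TTL extension per keepalive ping

def handler(event, context):
    method = event["httpMethod"]
    workflow = (event.get("pathParameters") or {}).get("workflow")

    if method == "POST":  # POST /clusters
        body = json.loads(event.get("body") or "{}")
        item = {
            "workflow": body["workflow"],
            "worker_count": body.get("worker_count", 2),
            "ttl": int(time.time()) + KEEPALIVE_SECONDS,
            "state": "requested",
        }
        table.put_item(Item=item)
        return {"statusCode": 201, "body": json.dumps(item)}

    if method == "GET":  # GET /clusters/:workflow
        resp = table.get_item(Key={"workflow": workflow})
        return {"statusCode": 200, "body": json.dumps(resp.get("Item"), default=str)}

    if method == "PATCH":  # PATCH /clusters/:workflow/keepalive
        table.update_item(
            Key={"workflow": workflow},
            UpdateExpression="SET #t = :t",
            ExpressionAttributeNames={"#t": "ttl"},
            ExpressionAttributeValues={":t": int(time.time()) + KEEPALIVE_SECONDS},
        )
        return {"statusCode": 204, "body": ""}

    if method == "DELETE":  # DELETE /clusters/:workflow
        table.delete_item(Key={"workflow": workflow})
        return {"statusCode": 202, "body": ""}

    return {"statusCode": 405, "body": ""}
```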
  13. • ✅ Minimise costs - only spot instances
     • ✅ As little infrastructure management as possible - S3, standalone Hive metastore, spot instances
     • ✅ Infrastructure scaling as needed - S3, spot instances
     • ✅ Enable data engineers - cluster API, spot instances
     • ✅ Data-related feature cost visibility - cost allocation tags, node tagging
     • Apache Spark all the things
  14. Apache Spark all the things
     Then → Now
     • Custom Python apps for ETL → Spark
     • Hive for integrity checks → Spark
     • Hive for JDBC client access → Spark Thriftserver
     • Hadoop and YARN → Spark Standalone master
     • HDFS → S3
  15. Where is my cluster ⁉
     • ip-10-55-22-104.eu-west-1.compute.internal
     • Dynamic DNS via AWS Route53 and Lambda functions
     • workflow-name.spark.environment.domain.com
     • e.g. sg-run-job.spark.prod.platform-lunar.com
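A sketch of the record upsert such a Lambda function could perform with boto3: map the friendly workflow name to the cluster master's IP. The hosted zone ID and the function shape are assumptions.

```python
# Sketch: upsert an A record so the cluster is reachable at a stable name.
import boto3

route53 = boto3.client("route53")

def register_cluster(workflow: str, master_ip: str, env: str = "prod"):
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone ID
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"{workflow}.spark.{env}.platform-lunar.com",
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": master_ip}],
                },
            }]
        },
    )
```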
  16. How do you run a Spark job?
     • Using Apache Livy (incubating) to run Spark jobs
     • Livy - REST API to an Apache Spark cluster
     • POST /batches - JAR file, class name, SparkConf params, executor and driver params
     • GET /batches/:batch_id - job state
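A sketch of submitting and polling a batch through Livy's documented /batches REST API from Python. The Livy host, JAR location, class name, and resource settings are placeholders, and Livy's default port 8998 is assumed.

```python
# Sketch: submit a JAR as a Livy batch, then poll its state until it finishes.
import time
import requests

LIVY = "http://sg-run-job.spark.prod.platform-lunar.com:8998"  # hypothetical host

resp = requests.post(f"{LIVY}/batches", json={
    "file": "s3a://example-artifacts/jobs/etl-assembly.jar",  # hypothetical JAR
    "className": "com.example.etl.DailyJob",                  # hypothetical class
    "executorMemory": "4g",
    "numExecutors": 8,
    "conf": {"spark.sql.shuffle.partitions": "200"},
})
batch_id = resp.json()["id"]

while True:
    state = requests.get(f"{LIVY}/batches/{batch_id}").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(30)
```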
  17. How do you turn off a cluster?
     • ./rmcluster.sh
     • DELETE /clusters/:workflow - destroy a cluster
     • Pinging: PATCH /clusters/:workflow/keepalive
     • Time to live
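Client-side, the keepalive pinging and explicit teardown could be as simple as this sketch against the clusters API from slide 12; the base URL, workflow name, and ping interval are placeholders.

```python
# Sketch: extend the cluster's TTL while work is running, or tear it down.
import time
import requests

API = "https://clusters.example.com"  # hypothetical API Gateway endpoint
WORKFLOW = "sg-run-job"               # hypothetical workflow name

def keep_alive_while(still_running):
    """Ping the keepalive endpoint periodically so the TTL keeps being extended."""
    while still_running():
        requests.patch(f"{API}/clusters/{WORKFLOW}/keepalive", timeout=10)
        time.sleep(300)

def destroy():
    """Explicit teardown, the API call behind ./rmcluster.sh."""
    requests.delete(f"{API}/clusters/{WORKFLOW}", timeout=10)
```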
  18. What happens if a cluster node is reclaimed by Amazon?
     • Bid high enough - happens extremely rarely
     • Block duration for the master node (30-45% cheaper than on-demand instances)
     • Worker spot fleet allocation strategy: cheapest vs. diversified
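A boto3 sketch of the two ideas above: a defined-duration ("block") spot request for the master, and a spot fleet with a chosen allocation strategy for the workers. AMI IDs, instance types, the fleet role ARN, and capacity numbers are placeholders, not the deck's actual provisioning code.

```python
# Sketch: master on a block-duration spot instance, workers via a spot fleet.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Master: runs for the whole block without being reclaimed,
# at a 30-45% discount versus on-demand.
ec2.request_spot_instances(
    InstanceCount=1,
    BlockDurationMinutes=180,  # 1-6 hour blocks
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # the custom Spark AMI (placeholder ID)
        "InstanceType": "m4.xlarge",
        "KeyName": "spark-clusters",
    },
)

# Workers: allocation strategy decides whether capacity comes from the single
# cheapest pool or is spread across pools to survive reclamation better.
ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
        "TargetCapacity": 8,
        "AllocationStrategy": "diversified",  # or "lowestPrice"
        "LaunchSpecifications": [
            {"ImageId": "ami-0123456789abcdef0", "InstanceType": "m4.xlarge"},
            {"ImageId": "ami-0123456789abcdef0", "InstanceType": "m4.2xlarge"},
        ],
    },
)
```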
  19. How do you persist Spark job logs?
     • Experiments with an S3 log appender and an AWS CloudWatch log appender
     • AWS Elastic File System (EFS), a Network File System (NFS) based solution
     • Files from EFS served via a tiny Nginx instance, deleted periodically
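The periodic deletion could be a small scheduled job along these lines; the mount point and retention window are assumptions.

```python
# Sketch: prune old log files from the EFS mount that Nginx serves.
import time
from pathlib import Path

LOG_DIR = Path("/mnt/efs/spark-logs")  # hypothetical EFS mount point
MAX_AGE = 7 * 24 * 3600                # assumed one-week retention

now = time.time()
for log_file in LOG_DIR.glob("*"):
    if log_file.is_file() and now - log_file.stat().st_mtime > MAX_AGE:
        log_file.unlink()
```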