
Spark: From Interactivity To Production (And Back)

A talk given at fwdays AI '21, an annual Ukrainian conference


Yuri Ostapchuk

September 13, 2021

Transcript

  1. WHAT IS THIS ABOUT
     A story of how I tried to save money and time, and to organize the workflow, with Spark:
     - Spark environments
     - workflow
     - pains & needs
  2. PLAN
     Where it all started; then the different and common needs:
     1. repeatable environment
     2. deploying (to prod)
     3. debugging & testing
     4. business, ad-hoc querying
     5. wrangling & exploration
     6. ETL & streaming applications
  3. 0. WHERE IT ALL STARTED
     - dynamic environment: an ad-tech startup, RTB, demand/supply matching
     - 100k/s real-time decision making
     - large-scale analytics
     - covid ☠ => trinityaudio.ai, a text-to-speech audio player
  4. LOTS OF FUN
     - cron jobs on EMR nodes, random jobs on Rundeck
     - data processes are not centralized
     - ETLs in Python, JS, PHP
     - Jenkins & Rundeck & GoCD
     - Scala, Java, Akka, Node.js, PHP, Bash
  5. CONSIDERATIONS
     - local vs cloud?
     - flexibility vs versatility
     - code vs data: parameterize code, reuse/mock data (sketch below)
     - input vs output
     - immutability vs mutability
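A minimal sketch of the "code vs data" point, assuming a Spark/Scala batch job; JobConfig, the paths, and the filter are hypothetical, not from the talk:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical config: the same job can point at prod data or at a small fixture.
    case class JobConfig(inputPath: String, outputPath: String)

    object ParamJob {
      // Pure transformation: DataFrame in, DataFrame out.
      // Keeping I/O out of it lets tests reuse or mock the data.
      def transform(df: DataFrame): DataFrame =
        df.filter("kind = 'imp'").groupBy("campaign_id").count()

      def run(spark: SparkSession, cfg: JobConfig): Unit =
        transform(spark.read.parquet(cfg.inputPath))       // input is a parameter...
          .write.mode("overwrite").parquet(cfg.outputPath) // ...and so is the output
    }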
  6. OPTIONS
     - simply local
     - docker, VM: bad as a dev env, good for tests (testcontainers.org; sketch below)
     - EMR stage: parameterize a lot, complexity
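A sketch of the testcontainers idea in Scala, calling the Java Testcontainers Kafka module; the image tag is an assumption, pin your own:

    import org.testcontainers.containers.KafkaContainer
    import org.testcontainers.utility.DockerImageName

    object KafkaFixture extends App {
      // A throwaway Kafka broker in Docker: starts with the test, dies with it.
      val kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))
      kafka.start()
      try {
        val bootstrap = kafka.getBootstrapServers // feed this to the job's
        println(bootstrap)                        // kafka.bootstrap.servers option
      } finally kafka.stop()                      // nothing left running to pay for
    }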
  7. OPTIONS
     - take 1 (default): push to branch ⇨ Jenkins ⇨ jar ⇨ spark-submit (10 min)
     - take 2: sbt build ⇨ scp ⇨ spark-submit (3-4 min)
     - take 3: rsync source code to the EMR master
     - option (hardcore): emacs/vim, develop directly on the EMR master
     - continuous rsync/lsyncd .. ok, this is good enough for me
  8. BIG-DATA & TESTING
     - bad input may break the whole pipeline
     - bad input will happen much faster
     - the effect of a bug may take weeks or months to get noticed
     - distributed-system effects
     - huge data
     - what you can automate and what you cannot: divide & conquer
     - know your library: how to test Structured Streaming - SparkTest (sketch below)
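One way to test a Structured Streaming transformation without a cluster or Kafka is Spark's MemoryStream source plus the memory sink; a sketch with arbitrary names, not the SparkTest helper the slide mentions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.streaming.MemoryStream

    object StreamSpec extends App {
      val spark = SparkSession.builder().master("local[2]").appName("stream-test").getOrCreate()
      import spark.implicits._
      implicit val sqlCtx = spark.sqlContext

      // Push records into a streaming DataFrame from the test itself.
      val input  = MemoryStream[String]
      val counts = input.toDF().groupBy("value").count()

      val query = counts.writeStream
        .format("memory")          // results land in an in-memory table
        .queryName("test_out")
        .outputMode("complete")
        .start()

      input.addData("imp", "imp", "playClicked")
      query.processAllAvailable()  // block until the micro-batch is processed

      spark.table("test_out").show() // in a real test: collect() and assert
    }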
  9. MANUAL
     A separate cluster?
     - complexity: parameterize - cf, tools
     - time to start
     - money: do not forget to shut it down
     The same prod cluster?
     - interfering with existing jobs
     - isolation: yarn queues (complex)
     - data input: kafka - offload data into some topic
     - data output: parameterize/mock (sketch below)
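What "data output: parameterize/mock" can look like: the same query writes to Kafka in prod and to the console during a manual run. A sketch; the env switch, topic, and checkpoint path are hypothetical:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.StreamingQuery

    object Sinks {
      // Assumes `out` already has the key/value columns the Kafka sink requires.
      def startOutput(out: DataFrame, env: String): StreamingQuery = env match {
        case "prod" =>
          out.writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", sys.env("KAFKA_BROKERS"))
            .option("topic", "metrics-out")                      // hypothetical topic
            .option("checkpointLocation", "s3://bucket/chk/out") // hypothetical path
            .start()
        case _ =>
          out.writeStream
            .format("console")     // mocked output: inspect batches by eye
            .option("truncate", "false")
            .start()
      }
    }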
  10. TDD DOESN'T WORK WELL HERE
      What works better: experiment ⇨ prototype ⇨ test ⇨ beta prod ⇨ ..
      - Spark, Scala, SQL
      - Structured Streaming for the win!
      - Zeppelin
  11. SPARK-SQL & STRUCTURED STREAMING FTW!
      class Stream {
        def readStream: DataStreamReader =
          kafkaStreamReader(inputTopic, failOnDataLoss, maxOffsetsPerTrigger)

        def processStream(df: DataFrame): DataFrame = {
          df.withWatermark("ts", watermark)
            .createOrReplaceTempView("table1")
          spark.sql(s"""
            | select date_trunc('hour', ts)       as datehour,
            |   coalesce(seller_id, -1)           as publisher_id,
            |   coalesce(campaign_id, -1)         as campaign_id,
            |   coalesce(ab_test, -1)             as ab_test,
            |   -- metrics
            |   sum(if(kind='imp', 1, 0))         as imps,
            |   sum(if(kind='playClicked', 1, 0)) as clicks
            |   -- (further metrics cut off on the slide)
            | from table1
            | group by 1, 2, 3, 4
            |""".stripMargin)
        }
      }
  12. NEEDS
      - one-time maintenance operations
      - one-time data processing
      - ad-hoc querying
      - analytics vs ops: searching vs operating
      All of these need an interactive interface.
  13. SOME USE CASES
      - business: doing SQL
      - me: wrangling and searching
      - me: building a prototype & testing at scale
      - business: beta testing
      - me: building a streaming app / ETL
      - me: performing one-time operations
  14. CONSIDERATIONS: PRESTO VS SPARK
      - boring to rewrite SQL
      - Presto: lack of custom code
      - Presto is much faster
      - Presto: much easier to glue different storages together
      - Thrift server & SQL clients
      - the beauty of Spark SQL + Structured Streaming
      - Spark: SQL vs Scala API? (sketch below)
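The "SQL vs Scala API" question in one picture: the same aggregation both ways, in a sketch that reuses the column names from slide 11:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object SqlVsApi {
      def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
        df.createOrReplaceTempView("events")
        spark.sql(
          "select campaign_id, sum(if(kind = 'imp', 1, 0)) as imps " +
          "from events group by campaign_id")
      }

      def viaApi(df: DataFrame): DataFrame =
        df.groupBy("campaign_id")
          .agg(sum(when(col("kind") === "imp", 1).otherwise(0)).as("imps"))
    }

Both compile to equivalent plans; SQL is friendlier for the business use cases above, while the Scala API composes better into library code.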
  15. IN MY CASE I'M BOUND TO SPARK
      - prototype in Scala + Spark
      - easily move from an experiment/prototype ⇒ a productionized streaming/ETL application
      - reuse production code for further experiments/prototypes
  16. BOTH THESE WORKFLOWS NEED
      - versioned code
      - shared dependencies
      - shared code (classpath)
      - a unified workflow: 1. REPL, experiment 2. test at scale 3. acceptance 4. productionize
      (shared-library sketch below)
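A sketch of the "shared lib" shape that lets all four steps run one codebase; the package and function are hypothetical:

    // Hypothetical shared library, built with sbt and published as a versioned
    // jar that sits on the classpath of BOTH the Zeppelin interpreter and the
    // compiled streaming job, so REPL and production run the same code.
    package myco.etl

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    object Metrics {
      // One definition of the transformation, under version control.
      def withDatehour(df: DataFrame): DataFrame =
        df.withColumn("datehour", date_trunc("hour", col("ts")))
    }

In a Zeppelin paragraph (step 1) this is myco.etl.Metrics.withDatehour(sampleDf).show(); in the productionized app (step 4) the very same call feeds writeStream.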
  17. OPTIONS?
      - Zeppelin + copy-paste
      - Zeppelin + shared lib
      - Jupyter + sparkmagic + Livy + copy-paste
      - started my own project
  18. WRAP UP
      - haven't built my ideal world yet
      - grateful for any feedback
      - would like to hear about your experience!
      Thank you!