
Spark: From Interactivity To Production (And Back)

A talk given at fwdays AI '21, an annual Ukrainian conference


Yuri Ostapchuk

September 13, 2021

Transcript

  1. WHAT IS THIS ABOUT
     A story of how I tried to save money and time, and to organize the workflow, with Spark:
     - Spark environments
     - workflow
     - pains & needs
  2. PLAN
     Where it all started; then the different and common needs:
     1. repeatable environment
     2. deploying (to prod)
     3. debugging & testing
     4. business, ad-hoc querying
     5. wrangling & exploration
     6. ETL & streaming applications
  3. 0. WHERE IT ALL STARTED
     - dynamic environment: an ad-tech startup, RTB, demand/supply matching
     - 100k/s real-time decision making
     - large-scale analytics
     - covid ☠ => trinityaudio.ai, a text-to-speech audio player
  4. LOTS OF FUN
     - cron jobs on EMR nodes, random jobs on Rundeck
     - data processes are not centralized
     - ETLs in Python, JS, PHP
     - Jenkins & Rundeck & GoCD
     - Scala, Java, Akka, Node.js, PHP, Bash
  5. CONSIDERATIONS
     - local vs cloud?
     - flexibility vs versatility
     - code vs data: parameterize code, reuse/mock data (sketch below)
     - input vs output
     - immutability vs mutability
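A minimal sketch of the "code vs data" point, assuming a Spark/Scala batch job; JobConfig, the paths, and the filter are hypothetical, not from the talk:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical config: the same job can point at prod data or at a small fixture.
    case class JobConfig(inputPath: String, outputPath: String)

    object ParamJob {
      // Pure transformation: DataFrame in, DataFrame out.
      // Keeping I/O out of it lets tests reuse or mock the data.
      def transform(df: DataFrame): DataFrame =
        df.filter("kind = 'imp'").groupBy("campaign_id").count()

      def run(spark: SparkSession, cfg: JobConfig): Unit =
        transform(spark.read.parquet(cfg.inputPath))       // input is a parameter...
          .write.mode("overwrite").parquet(cfg.outputPath) // ...and so is the output
    }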
  6. OPTIONS
     - simply local
     - docker, VM: bad as a dev env, good for tests (testcontainers.org; sketch below)
     - EMR stage: parameterize a lot, complexity
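A sketch of the testcontainers idea in Scala, calling the Java Testcontainers Kafka module; the image tag is an assumption, pin your own:

    import org.testcontainers.containers.KafkaContainer
    import org.testcontainers.utility.DockerImageName

    object KafkaFixture extends App {
      // A throwaway Kafka broker in Docker: starts with the test, dies with it.
      val kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))
      kafka.start()
      try {
        val bootstrap = kafka.getBootstrapServers // feed this to the job's
        println(bootstrap)                        // kafka.bootstrap.servers option
      } finally kafka.stop()                      // nothing left running to pay for
    }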
  7. OPTIONS
     - take 1 (default): push to branch ⇨ Jenkins ⇨ jar ⇨ spark-submit (10 min)
     - take 2: sbt build ⇨ scp ⇨ spark-submit (3-4 min)
     - take 3: rsync source code to the EMR master
     - option (hardcore): emacs/vim, develop directly on the EMR master
     - continuous rsync/lsyncd .. ok, this is good enough for me
  8. BIG-DATA & TESTING
     - bad input may break the whole pipeline
     - bad input will happen much faster
     - the effect of a bug may take weeks or months to get noticed
     - distributed-system effects
     - huge data
     - what you can automate and what you cannot: divide & conquer
     - know your library: how to test Structured Streaming - SparkTest (sketch below)
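One way to test a Structured Streaming transformation without a cluster or Kafka is Spark's MemoryStream source plus the memory sink; a sketch with arbitrary names, not the SparkTest helper the slide mentions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.streaming.MemoryStream

    object StreamSpec extends App {
      val spark = SparkSession.builder().master("local[2]").appName("stream-test").getOrCreate()
      import spark.implicits._
      implicit val sqlCtx = spark.sqlContext

      // Push records into a streaming DataFrame from the test itself.
      val input  = MemoryStream[String]
      val counts = input.toDF().groupBy("value").count()

      val query = counts.writeStream
        .format("memory")          // results land in an in-memory table
        .queryName("test_out")
        .outputMode("complete")
        .start()

      input.addData("imp", "imp", "playClicked")
      query.processAllAvailable()  // block until the micro-batch is processed

      spark.table("test_out").show() // in a real test: collect() and assert
    }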
  9. MANUAL
     A separate cluster?
     - complexity: parameterize - cf, tools
     - time to start
     - money: do not forget to shut it down
     The same prod cluster?
     - interfering with existing jobs
     - isolation: yarn queues (complex)
     - data input: kafka - offload data into some topic
     - data output: parameterize/mock (sketch below)
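What "data output: parameterize/mock" can look like: the same query writes to Kafka in prod and to the console during a manual run. A sketch; the env switch, topic, and checkpoint path are hypothetical:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.StreamingQuery

    object Sinks {
      // Assumes `out` already has the key/value columns the Kafka sink requires.
      def startOutput(out: DataFrame, env: String): StreamingQuery = env match {
        case "prod" =>
          out.writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", sys.env("KAFKA_BROKERS"))
            .option("topic", "metrics-out")                      // hypothetical topic
            .option("checkpointLocation", "s3://bucket/chk/out") // hypothetical path
            .start()
        case _ =>
          out.writeStream
            .format("console")     // mocked output: inspect batches by eye
            .option("truncate", "false")
            .start()
      }
    }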
  10. TDD DOESN'T WORK WELL HERE
      What works better: experiment ⇨ prototype ⇨ test ⇨ beta prod ⇨ ..
      - Spark, Scala, SQL
      - Structured Streaming for the win!
      - Zeppelin
  11. SPARK-SQL & STRUCTURED STREAMING FTW!
      class Stream {
        def readStream: DataStreamReader =
          kafkaStreamReader(inputTopic, failOnDataLoss, maxOffsetsPerTrigger)

        def processStream(df: DataFrame): DataFrame = {
          df.withWatermark("ts", watermark)
            .createOrReplaceTempView("table1")
          spark.sql(s"""
            | select date_trunc('hour', ts)       as datehour,
            |   coalesce(seller_id, -1)           as publisher_id,
            |   coalesce(campaign_id, -1)         as campaign_id,
            |   coalesce(ab_test, -1)             as ab_test,
            |   -- metrics
            |   sum(if(kind='imp', 1, 0))         as imps,
            |   sum(if(kind='playClicked', 1, 0)) as clicks
            |   -- (further metrics cut off on the slide)
            | from table1
            | group by 1, 2, 3, 4
            |""".stripMargin)
        }
      }
  12. NEEDS
      - one-time maintenance operations
      - one-time data processing
      - ad-hoc querying
      - analytics vs ops: searching vs operating
      All of these need an interactive interface.
  13. SOME USE CASES
      - business: doing SQL
      - me: wrangling and searching
      - me: building a prototype & testing at scale
      - business: beta testing
      - me: building a streaming app / ETL
      - me: performing one-time operations
  14. CONSIDERATIONS: PRESTO VS SPARK
      - boring to rewrite SQL
      - Presto: lack of custom code
      - Presto is much faster
      - Presto: much easier to glue different storages together
      - Thrift server & SQL clients
      - the beauty of Spark SQL + Structured Streaming
      - Spark: SQL vs Scala API? (sketch below)
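The "SQL vs Scala API" question in one picture: the same aggregation both ways, in a sketch that reuses the column names from slide 11:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    object SqlVsApi {
      def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
        df.createOrReplaceTempView("events")
        spark.sql(
          "select campaign_id, sum(if(kind = 'imp', 1, 0)) as imps " +
          "from events group by campaign_id")
      }

      def viaApi(df: DataFrame): DataFrame =
        df.groupBy("campaign_id")
          .agg(sum(when(col("kind") === "imp", 1).otherwise(0)).as("imps"))
    }

Both compile to equivalent plans; SQL is friendlier for the business use cases above, while the Scala API composes better into library code.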
  15. IN MY CASE I'M BOUND TO SPARK
      - prototype in Scala + Spark
      - easily move from an experiment/prototype ⇒ a productionized streaming/ETL application
      - reuse production code for further experiments/prototypes
  16. BOTH THESE WORKFLOWS NEED
      - versioned code
      - shared dependencies
      - shared code (classpath)
      - a unified workflow: 1. REPL, experiment 2. test at scale 3. acceptance 4. productionize
      (shared-library sketch below)
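A sketch of the "shared lib" shape that lets all four steps run one codebase; the package and function are hypothetical:

    // Hypothetical shared library, built with sbt and published as a versioned
    // jar that sits on the classpath of BOTH the Zeppelin interpreter and the
    // compiled streaming job, so REPL and production run the same code.
    package myco.etl

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    object Metrics {
      // One definition of the transformation, under version control.
      def withDatehour(df: DataFrame): DataFrame =
        df.withColumn("datehour", date_trunc("hour", col("ts")))
    }

In a Zeppelin paragraph (step 1) this is myco.etl.Metrics.withDatehour(sampleDf).show(); in the productionized app (step 4) the very same call feeds writeStream.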
  17. OPTIONS?
      - Zeppelin + copy-paste
      - Zeppelin + shared lib
      - Jupyter + sparkmagic + Livy + copy-paste
      - started my own project
  18. WRAP UP
      - haven't built my ideal world yet
      - grateful for any feedback
      - would like to hear about your experience!
      Thank you!