Slide 1

Slide 1 text

music is just wiggly air
Lynn Root | Staff Engineer | @roguelynn
building infrastructure to support audio research

Slide 2

Slide 2 text

intro @roguelynn

Slide 3

Slide 3 text

— Audio intelligence research at Spotify advances the state of the art in understanding music at scale to enhance how it is created, identified and consumed. research.spotify.com

Slide 4

Slide 4 text

— Audio intelligence research at Spotify advances the state of the art in understanding music at scale to enhance how it is created, identified and consumed. research.spotify.com

Slide 5

Slide 5 text

research workflow @roguelynn

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

productionization requirements @roguelynn

Slide 10

Slide 10 text

— graph execution modes

Slide 11

Slide 11 text

— top-down execution

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

— bottom-up execution

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

— research workload

Slide 26

Slide 26 text

— python

Slide 27

Slide 27 text

— custom environment

Slide 28

Slide 28 text

— streaming

Slide 29

Slide 29 text

— avoid duplicate work

Slide 30

Slide 30 text

— scalability

Slide 31

Slide 31 text

— summary Top-down & bottom-up Python Custom environment Avoid duplicate work Streaming Scalability

Slide 32

Slide 32 text

early approaches @roguelynn

Slide 33

Slide 33 text

— music intelligence pipeline Google Cloud PubSub + Microservices

Slide 34

Slide 34 text

— Apache Beam

Slide 35

Slide 35 text

— Apache Beam: Google Cloud PubSub + Apache Beam on Google Cloud Dataflow

Slide 36

Slide 36 text

— what’s left?

Slide 37

Slide 37 text

our solution: klio @roguelynn

Slide 38

Slide 38 text

Kleio

Slide 39

Slide 39 text

Kleio

Slide 40

Slide 40 text

— goals of klio

Slide 41

Slide 41 text

— ecosystem

Slide 42

Slide 42 text

— ecosystem: user PoV

Slide 43

Slide 43 text

Develop

Slide 44

Slide 44 text

Develop
  $ klio job create

Slide 45

Slide 45 text

Develop
  $ klio job create
  $ klio job verify

Slide 46

Slide 46 text

Develop
  $ klio job create
  $ klio job verify
Test

Slide 47

Slide 47 text

Develop
  $ klio job create
  $ klio job verify
Test
  $ klio job test

Slide 48

Slide 48 text

Develop
  $ klio job create
  $ klio job verify
Test
  $ klio job test
  $ klio job audit

Slide 49

Slide 49 text

Develop
  $ klio job create
  $ klio job verify
Test
  $ klio job test
  $ klio job audit
  $ klio job profile

Slide 50

Slide 50 text

Develop
  $ klio job create
  $ klio job verify
Test
  $ klio job test
  $ klio job audit
  $ klio job profile
Deploy

Slide 51

Slide 51 text

Develop
  $ klio job create
  $ klio job verify
Test
  $ klio job test
  $ klio job audit
  $ klio job profile
Deploy
  $ klio job run

Slide 52

Slide 52 text

Develop
  $ klio job create
  $ klio job verify
Test
  $ klio job test
  $ klio job audit
  $ klio job profile
Deploy
  $ klio job run
  $ klio message publish

Slide 53

Slide 53 text

Develop
  $ klio job create
  $ klio job verify
Test
  $ klio job test
  $ klio job audit
  $ klio job profile
Deploy
  $ klio job run
  $ klio message publish
  $ klio job logs
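
As one annotated session, in the order the slides build it up (the per-command glosses are my own reading of the stage groupings, not official descriptions):

$ klio job create       # Develop: scaffold a new klio job
$ klio job verify       # Develop: sanity-check the job's setup
$ klio job test         # Test: run the job's tests
$ klio job audit        # Test: audit the job for common mistakes
$ klio job profile      # Test: profile the job
$ klio job run          # Deploy: build and launch the pipeline
$ klio message publish  # Deploy: feed the running job some work
$ klio job logs         # Deploy: tail the job's logs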

Slide 54

Slide 54 text

— ecosystem: behind the scenes

Slide 55

Slide 55 text

Local / CI Machine

Slide 56

Slide 56 text

Local / CI Machine: $ klio job run

Slide 57

Slide 57 text

Local / CI Machine: $ klio job run → $ klio image build

Slide 58

Slide 58 text

Local / CI Machine: $ klio job run → $ klio image build → worker container: $ klioexec run

Slide 59

Slide 59 text

Local / CI Machine: $ klio job run → $ klio image build → worker container: $ klioexec run → Google Cloud

Slide 60

Slide 60 text

Local / CI Machine: $ klio job run → $ klio image build → worker container: $ klioexec run → Google Cloud

Slide 61

Slide 61 text

Local / CI Machine: $ klio job run → $ klio image build → worker container: $ klioexec run → Google Cloud

Slide 62

Slide 62 text

Local / CI Machine: $ klio job run → $ klio image build → worker container: $ klioexec run → Google Cloud
$ klio message publish

Slide 63

Slide 63 text

Local / CI Machine: $ klio job run → $ klio image build → worker container: $ klioexec run → Google Cloud
$ klio message publish

Slide 64

Slide 64 text

Local / CI Machine: $ klio job run → $ klio image build → worker container: $ klioexec run → Google Cloud
$ klio message publish
$ klio job logs
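
Read as a sequence, the slides above sketch what a single klio job run fans out into. The annotations below are my reading of the diagram, not official klio documentation:

$ klio job run          # issued on the local / CI machine...
$ klio image build      # ...which first builds the job's worker Docker image,
$ klioexec run          # ...then launches the pipeline from inside the worker container,
                        # deploying it to Google Cloud
$ klio message publish  # afterwards, publish work to the running job
$ klio job logs         # and tail its logs, all from the local machine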

Slide 65

Slide 65 text

— architecture

Slide 66

Slide 66 text

— architecture: klio job

Slide 67

Slide 67 text

docker container dataflow worker klio job

Slide 68

Slide 68 text

docker container dataflow worker s0ng2 klio job

Slide 69

Slide 69 text

docker container dataflow worker klio preprocessing s0ng2

Slide 70

Slide 70 text

docker container dataflow worker klio preprocessing user-implemented transform s0ng2

Slide 71

Slide 71 text

docker container dataflow worker klio preprocessing user-implemented transform s0ng2 s0ng2.wav

Slide 72

Slide 72 text

docker container dataflow worker klio preprocessing user-implemented transform s0ng2 s0ng2.wav s0ng2.json

Slide 73

Slide 73 text

docker container dataflow worker klio preprocessing user-implemented transform klio postprocessing s0ng2 s0ng2.wav s0ng2.json

Slide 74

Slide 74 text

docker container dataflow worker klio preprocessing user-implemented transform klio postprocessing s0ng2 s0ng2.wav s0ng2.json s0ng2

Slide 75

Slide 75 text

docker container dataflow worker klio preprocessing user-implemented transform klio postprocessing UUID-like klio job
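
Traced end to end, a klio job wraps the user's transform in klio-managed pre- and postprocessing: the incoming message carries only a UUID-like ID (s0ng2), preprocessing resolves it to input audio (s0ng2.wav), the user's transform produces the output artifact (s0ng2.json), and postprocessing passes the same ID downstream. A rough Python sketch of that flow, with hypothetical stand-in helpers (not klio internals):

def download_audio(track_id):                 # stand-in for klio preprocessing
    return "{}.wav".format(track_id)          # s0ng2 -> s0ng2.wav

def run_model(wav_path):                      # stand-in for the user-implemented transform
    return wav_path.replace(".wav", ".json")  # s0ng2.wav -> s0ng2.json

def handle_element(track_id):
    wav_path = download_audio(track_id)       # klio preprocessing: ID -> input data
    json_path = run_model(wav_path)           # user code: input -> output artifact
    print("produced", json_path)              # (the real job would upload this)
    return track_id                           # klio postprocessing: same ID goes downstream

handle_element("s0ng2")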

Slide 76

Slide 76 text

— architecture: klio message

Slide 77

Slide 77 text

Downstream? Yes No

Slide 78

Slide 78 text

Downstream? No → Drop

Slide 79

Slide 79 text

Downstream? No → Drop; Yes → Ping Mode?

Slide 80

Slide 80 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? (Yes / No)

Slide 81

Slide 81 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru

Slide 82

Slide 82 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?

Slide 83

Slide 83 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? (Yes / No)

Slide 84

Slide 84 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?

Slide 85

Slide 85 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?
Force mode? (Yes / No)

Slide 86

Slide 86 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?
Force mode? No → Pass Thru

Slide 87

Slide 87 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?; No → Input data exists?
Force mode? No → Pass Thru

Slide 88

Slide 88 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?; No → Input data exists?
Force mode? No → Pass Thru
Input data exists? (Yes / No)

Slide 89

Slide 89 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?; No → Input data exists?
Force mode? No → Pass Thru
Input data exists? No → Trigger Parent & Drop

Slide 90

Slide 90 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?; No → Input data exists?
Force mode? Yes → Process; No → Pass Thru
Input data exists? Yes → Process; No → Trigger Parent & Drop

Slide 91

Slide 91 text

Downstream? No → Drop; Yes → Ping Mode?
Ping Mode? Yes → Pass Thru; No → Output data exists?
Output data exists? Yes → Force mode?; No → Input data exists?
Force mode? Yes → Process; No → Pass Thru
Input data exists? Yes → Process; No → Trigger Parent & Drop
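
Rendered as code, the routing above is a small decision function. This is my transliteration of the flowchart; the boolean flags stand in for klio's internal checks and are not klio's API:

def route_message(downstream, ping_mode, output_exists, force_mode, input_exists):
    if not downstream:                  # "Downstream?": message isn't meant for this
        return "Drop"                   # job or anything below it
    if ping_mode:                       # ping mode: skip the work, just forward
        return "Pass Thru"
    if output_exists:                   # output already produced: skip it,
        return "Process" if force_mode else "Pass Thru"  # unless force mode says redo
    if input_exists:                    # input available: do the work
        return "Process"
    return "Trigger Parent & Drop"      # ask the parent job to produce the input
                                        # (bottom-up execution at work)

# A fresh message whose input data exists gets processed:
assert route_message(True, False, False, False, True) == "Process"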

Slide 92

Slide 92 text

show us code! @roguelynn

Slide 93

Slide 93 text

— automixer

Slide 94

Slide 94 text

— automixer: vanilla Beam

Slide 95

Slide 95 text

$ tree
.
├── Dockerfile
├── job-requirements.txt
├── mixer.py
├── run.py
└── track_storage.py

Slide 96

Slide 96 text

$ tree
.
├── Dockerfile
├── job-requirements.txt
├── mixer.py
├── run.py
└── track_storage.py

Slide 97

Slide 97 text

import logging
import re
import threading

import apache_beam as beam
from apache_beam.io.gcp import gcsio
from apache_beam.options import pipeline_options

import mixer
import track_storage


class MixerDoFn(beam.DoFn):
    PROJECT = "sigint"
    GCS_BUCKET = "sigint-output"
    GCS_OBJECT_PATH = "automixer-beam"
    OUTPUT_NAME_TPL = "{track1_id}-{track2_id}-mix.ogg"
    GCS_OUTPUT_TPL = "gs://{bucket}/{object_path}/{filename}"
    _thread_local = threading.local()

    @property
    def gcs_client(self):
        client = getattr(self._thread_local, "gcs_client", None)
        if not client:
            self._thread_local.gcs_client = gcsio.GcsIO()
        return self._thread_local.gcs_client

    def process(self, entity_ids):
        track1_id, track2_id = entity_ids.decode("utf-8").split(",")
        output_filename = MixerDoFn.OUTPUT_NAME_TPL.format(
            track1_id=track1_id, track2_id=track2_id
        )
        gcs_output_path = MixerDoFn.GCS_OUTPUT_TPL.format(
            bucket=MixerDoFn.GCS_BUCKET,
            object_path=MixerDoFn.GCS_OBJECT_PATH,
            filename=output_filename,
        )

        # Check if output already exists:
        if self.gcs_client.exists(gcs_output_path):
            # Don't do unnecessary work
            logging.info(
                "Mix for {} & {} already exists: {}".format(
                    track1_id, track2_id, gcs_output_path
                )
            )
            return

        # Check if input data is available
        err_msg = "Input for {track} is not available: {e}"
        try:
            track1_input_path = track_storage.download_track(track1_id)
        except Exception as e:
            logging.error(err_msg.format(track=track1_id, e=e))
            return
        try:
            track2_input_path = track_storage.download_track(track2_id)
        except Exception as e:
            logging.error(err_msg.format(track=track2_id, e=e))
            return

        # Get input track ids
        track1 = mixer.Track(track1_id, track1_input_path)
        track2 = mixer.Track(track2_id, track2_input_path)

        # Mix tracks & save to output file
        mixer.mix(track1, track2, output_filename)

        # Upload mix
        logging.info("Uploading mix to {}".format(gcs_output_path))
        with self.gcs_client.open(
            gcs_output_path, "wb", mime_type="application/octet-stream"
        ) as dest:
            with open(output_filename, "rb") as source:
                dest.write(source.read())

        yield entity_ids


def run():
    input_subscription = (
        "projects/sigint/subscriptions/automixer-klio-input-automixer-klio"
    )
    output_topic = "projects/sigint/topics/automixer-klio-output"

    options = pipeline_options.PipelineOptions()
    gcp_opts = options.view_as(pipeline_options.GoogleCloudOptions)
    gcp_opts.job_name = "automixer-beam"
    gcp_opts.project = "sigint"
    gcp_opts.region = "europe-west1"
    gcp_opts.temp_location = "gs://sigint-dataflow-tmp/automixer-beam/temp"
    gcp_opts.staging_location = "gs://sigint-dataflow-tmp/automixer-beam/staging"

    worker_opts = options.view_as(pipeline_options.WorkerOptions)
    worker_opts.subnetwork = "https://www.googleapis.com/compute/v1/projects/some-network/regions/europe-west1/subnetworks/foo1"
    worker_opts.machine_type = "n1-standard-2"
    worker_opts.disk_size_gb = 32
    worker_opts.num_workers = 2
    worker_opts.max_num_workers = 2
    worker_opts.worker_harness_container_image = "gcr.io/sigint/automixer-worker-beam:1"

    standard_opts = options.view_as(pipeline_options.StandardOptions)
    standard_opts.streaming = True
    standard_opts.runner = "dataflow"

    debug_opts = options.view_as(pipeline_options.DebugOptions)
    debug_opts.experiments = ["beam_fn_api"]

    options.view_as(pipeline_options.SetupOptions).save_main_session = True

    logging.info("Launching pipeline...")
    pipeline = beam.Pipeline(options=options)
    (pipeline
     | beam.io.ReadFromPubSub(subscription=input_subscription)
     | beam.ParDo(MixerDoFn())
     | beam.io.WriteToPubSub(output_topic))
    result = pipeline.run()
    result.wait_until_finish()


if __name__ == "__main__":
    fmt = "%(asctime)s %(message)s"
    logging.basicConfig(format=fmt, level=logging.INFO)
    run()

Slide 98

Slide 98 text

import logging
import re
import threading

import apache_beam as beam
from apache_beam.io.gcp import gcsio
from apache_beam.options import pipeline_options

import mixer
import track_storage


class MixerDoFn(beam.DoFn):
    PROJECT = "sigint"
    GCS_BUCKET = "sigint-output"
    GCS_OBJECT_PATH = "automixer-beam"
    OUTPUT_NAME_TPL = "{track1_id}-{track2_id}-mix.ogg"
    GCS_OUTPUT_TPL = "gs://{bucket}/{object_path}/{filename}"
    _thread_local = threading.local()

    @property
    def gcs_client(self):
        client = getattr(self._thread_local, "gcs_client", None)
        if not client:
            self._thread_local.gcs_client = gcsio.GcsIO()
        return self._thread_local.gcs_client

    def process(self, entity_ids):
        track1_id, track2_id = entity_ids.decode("utf-8").split(",")
        output_filename = MixerDoFn.OUTPUT_NAME_TPL.format(
            track1_id=track1_id, track2_id=track2_id
        )
        gcs_output_path = MixerDoFn.GCS_OUTPUT_TPL.format(
            bucket=MixerDoFn.GCS_BUCKET,
            object_path=MixerDoFn.GCS_OBJECT_PATH,
            filename=output_filename,
        )

        # Check if output already exists:
        if self.gcs_client.exists(gcs_output_path):
            # Don't do unnecessary work
            logging.info(
                "Mix for {} & {} already exists: {}".format(
                    track1_id, track2_id, gcs_output_path
                )
            )
            return

        # Check if input data is available
        err_msg = "Input for {track} is not available: {e}"
        try:
            track1_input_path = track_storage.download_track(track1_id)
        except Exception as e:
            logging.error(err_msg.format(track=track1_id, e=e))
            return
        try:
            track2_input_path = track_storage.download_track(track2_id)
        except Exception as e:
            logging.error(err_msg.format(track=track2_id, e=e))
            return

        # Get input track ids
        track1 = mixer.Track(track1_id, track1_input_path)
        track2 = mixer.Track(track2_id, track2_input_path)

        # Mix tracks & save to output file
        mixer.mix(track1, track2, output_filename)

        # Upload mix
        logging.info("Uploading mix to {}".format(gcs_output_path))
        with self.gcs_client.open(
            gcs_output_path, "wb", mime_type="application/octet-stream"
        ) as dest:
            with open(output_filename, "rb") as source:
                dest.write(source.read())

        yield entity_ids


def run():
    input_subscription = (
        "projects/sigint/subscriptions/automixer-klio-input-automixer-klio"
    )
    output_topic = "projects/sigint/topics/automixer-klio-output"

    options = pipeline_options.PipelineOptions()
    gcp_opts = options.view_as(pipeline_options.GoogleCloudOptions)
    gcp_opts.job_name = "automixer-beam"
    gcp_opts.project = "sigint"
    gcp_opts.region = "europe-west1"
    gcp_opts.temp_location = "gs://sigint-dataflow-tmp/automixer-beam/temp"
    gcp_opts.staging_location = "gs://sigint-dataflow-tmp/automixer-beam/staging"

    worker_opts = options.view_as(pipeline_options.WorkerOptions)
    worker_opts.subnetwork = "https://www.googleapis.com/compute/v1/projects/some-network/regions/europe-west1/subnetworks/foo1"
    worker_opts.machine_type = "n1-standard-2"
    worker_opts.disk_size_gb = 32
    worker_opts.num_workers = 2
    worker_opts.max_num_workers = 2
    worker_opts.worker_harness_container_image = "gcr.io/sigint/automixer-worker-beam:1"

    standard_opts = options.view_as(pipeline_options.StandardOptions)
    standard_opts.streaming = True
    standard_opts.runner = "dataflow"

    debug_opts = options.view_as(pipeline_options.DebugOptions)
    debug_opts.experiments = ["beam_fn_api"]

    options.view_as(pipeline_options.SetupOptions).save_main_session = True

    logging.info("Launching pipeline...")
    pipeline = beam.Pipeline(options=options)
    (pipeline
     | beam.io.ReadFromPubSub(subscription=input_subscription)
     | beam.ParDo(MixerDoFn())
     | beam.io.WriteToPubSub(output_topic))
    result = pipeline.run()
    result.wait_until_finish()


if __name__ == "__main__":
    fmt = "%(asctime)s %(message)s"
    logging.basicConfig(format=fmt, level=logging.INFO)
    run()

125 LoC

Slide 99

Slide 99 text

# start job from local dev machine
(env) $ docker build . -t my-worker-image:v1
(env) $ docker push my-worker-image:v1
(env) $ python run.py

Slide 100

Slide 100 text

# start job from worker container
$ docker build . -t my-worker-image:v1
$ docker push my-worker-image:v1
$ docker run --rm -it \
    --entrypoint /bin/bash \
    -v ~/.config/gcloud/:/usr/gcloud/ \
    -v $(pwd)/:/usr/src/app/ \
    -e GOOGLE_APPLICATION_CREDENTIALS=/path/to/creds.json \
    -e GOOGLE_CLOUD_PROJECT=my-gcp-project \
    my-worker-image:v1 \
    python run.py

Slide 101

Slide 101 text

— automixer: klio

Slide 102

Slide 102 text

$ tree
.
├── Dockerfile
├── job-requirements.txt
├── klio-job.yaml
├── mixer.py
├── run.py
└── track_storage.py

Slide 103

Slide 103 text

$ tree
.
├── Dockerfile
├── job-requirements.txt
├── klio-job.yaml
├── mixer.py
├── run.py
└── track_storage.py

Slide 104

Slide 104 text

$ tree
.
├── Dockerfile
├── job-requirements.txt
├── klio-job.yaml
├── mixer.py
├── run.py
└── track_storage.py

Slide 105

Slide 105 text

import os

import apache_beam as beam
from klio.transforms import decorators

import mixer
import track_storage


class AutomixerJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        # Get input track ids
        track1_id, track2_id = data.element.split(",")
        track1 = mixer.Track(track1_id)
        track2 = mixer.Track(track2_id)

        # Cross fade tracks
        local_output_path = mixer.mix(track1, track2)

        # Upload crossfaded track
        gcs_output_path = os.path.join(
            self._klio.config.job_config.outputs[0].data_location,
            local_output_path,
        )
        self._klio.logger.info("Uploading mix to {}".format(gcs_output_path))
        track_storage.upload_track(gcs_output_path, local_output_path)

        yield data

Slide 106

Slide 106 text

import os

import apache_beam as beam
from klio.transforms import decorators

import mixer
import track_storage


class AutomixerJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        # Get input track ids
        track1_id, track2_id = data.element.split(",")
        track1 = mixer.Track(track1_id)
        track2 = mixer.Track(track2_id)

        # Cross fade tracks
        local_output_path = mixer.mix(track1, track2)

        # Upload crossfaded track
        gcs_output_path = os.path.join(
            self._klio.config.job_config.outputs[0].data_location,
            local_output_path,
        )
        self._klio.logger.info("Uploading mix to {}".format(gcs_output_path))
        track_storage.upload_track(gcs_output_path, local_output_path)

        yield data

30 LoC

Slide 107

Slide 107 text

import os

import apache_beam as beam
from klio.transforms import decorators

import mixer
import track_storage


class AutomixerJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        # Get input track ids
        track1_id, track2_id = data.element.split(",")
        track1 = mixer.Track(track1_id)
        track2 = mixer.Track(track2_id)

        # Cross fade tracks
        local_output_path = mixer.mix(track1, track2)

        # Upload crossfaded track
        gcs_output_path = os.path.join(
            self._klio.config.job_config.outputs[0].data_location,
            local_output_path,
        )
        self._klio.logger.info("Uploading mix to {}".format(gcs_output_path))
        track_storage.upload_track(gcs_output_path, local_output_path)

        yield data

30 LoC
over 75% off!

Slide 108

Slide 108 text

$ klio job run

Slide 109

Slide 109 text

from klio.transforms import decorators


class AutomixerJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        ...

Slide 110

Slide 110 text

from klio.transforms import decorators


class AutomixerJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        ...

Slide 111

Slide 111 text

from klio.transforms import decorators


class AutomixerJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        ...
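
Judging from how the automixer job uses it, @decorators.handle_klio is what turns a plain Beam DoFn into a klio-aware one: process receives the parsed message as data (with .element holding the payload) and the instance gains a _klio context carrying the parsed config and a logger. A minimal sketch using only the attributes the automixer code above touches:

import apache_beam as beam
from klio.transforms import decorators


class MyJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        # data.element: the payload of the incoming klio message
        self._klio.logger.info("Processing {}".format(data.element))
        # self._klio.config: the parsed klio-job.yaml
        outputs = self._klio.config.job_config.outputs
        self._klio.logger.info("Writing to {}".format(outputs[0].data_location))
        yield data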

Slide 112

Slide 112 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 113

Slide 113 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 114

Slide 114 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 115

Slide 115 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 116

Slide 116 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 117

Slide 117 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 118

Slide 118 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 119

Slide 119 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 120

Slide 120 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 121

Slide 121 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 122

Slide 122 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 123

Slide 123 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 124

Slide 124 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 125

Slide 125 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 126

Slide 126 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 127

Slide 127 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

gs://my-job/output-bucket/s0m3-aud10-1d.wav

Slide 128

Slide 128 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

gs://my-job/output-bucket/s0m3-aud10-1d.wav

Slide 129

Slide 129 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

gs://my-job/output-bucket/s0m3-aud10-1d.wav

Slide 130

Slide 130 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 131

Slide 131 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

gs://my-parent-job/output-bucket/s0m3-aud10-1d.ogg

Slide 132

Slide 132 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav

Slide 133

Slide 133 text

job_name: my-job
pipeline_options:
  streaming: True
  # <-- snip -->
job_config:
  events:
    inputs:
      - type: pubsub
        topic: my-parent-job-output-topic
        subscription: my-job-input-subscription
    outputs:
      - type: pubsub
        topic: my-job-output-topic
  data:
    inputs:
      - type: gcs
        location: gs://my-parent-job/output-bucket
        file_suffix: ogg
    outputs:
      - type: gcs
        location: gs://my-job/output-bucket
        file_suffix: wav
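
The data section is what powers the existence checks: for each message, klio combines the element with the configured location and file_suffix to derive concrete paths, as the slides show for element s0m3-aud10-1d. A hypothetical illustration of that mapping (the string formatting is mine; klio resolves these paths internally):

element = "s0m3-aud10-1d"
input_path = "gs://my-parent-job/output-bucket/{}.ogg".format(element)   # from data.inputs
output_path = "gs://my-job/output-bucket/{}.wav".format(element)         # from data.outputs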

Slide 134

Slide 134 text

import logging
import re
import threading

import apache_beam as beam
from apache_beam.io.gcp import gcsio
from apache_beam.options import pipeline_options

import mixer
import track_storage


class MixerDoFn(beam.DoFn):
    PROJECT = "sigint"
    GCS_BUCKET = "sigint-output"
    GCS_OBJECT_PATH = "automixer-beam"
    OUTPUT_NAME_TPL = "{track1_id}-{track2_id}-mix.ogg"
    GCS_OUTPUT_TPL = "gs://{bucket}/{object_path}/{filename}"
    _thread_local = threading.local()

    @property
    def gcs_client(self):
        client = getattr(self._thread_local, "gcs_client", None)
        if not client:
            self._thread_local.gcs_client = gcsio.GcsIO()
        return self._thread_local.gcs_client

    def process(self, entity_ids):
        track1_id, track2_id = entity_ids.decode("utf-8").split(",")
        output_filename = MixerDoFn.OUTPUT_NAME_TPL.format(
            track1_id=track1_id, track2_id=track2_id
        )
        gcs_output_path = MixerDoFn.GCS_OUTPUT_TPL.format(
            bucket=MixerDoFn.GCS_BUCKET,
            object_path=MixerDoFn.GCS_OBJECT_PATH,
            filename=output_filename,
        )

        # Check if output already exists:
        if self.gcs_client.exists(gcs_output_path):
            # Don't do unnecessary work
            logging.info(
                "Mix for {} & {} already exists: {}".format(
                    track1_id, track2_id, gcs_output_path
                )
            )
            return

        # Check if input data is available
        err_msg = "Input for {track} is not available: {e}"
        try:
            track1_input_path = track_storage.download_track(track1_id)
        except Exception as e:
            logging.error(err_msg.format(track=track1_id, e=e))
            return
        try:
            track2_input_path = track_storage.download_track(track2_id)
        except Exception as e:
            logging.error(err_msg.format(track=track2_id, e=e))
            return

        # Get input track ids
        track1 = mixer.Track(track1_id, track1_input_path)
        track2 = mixer.Track(track2_id, track2_input_path)

        # Mix tracks & save to output file
        mixer.mix(track1, track2, output_filename)

        # Upload mix
        logging.info("Uploading mix to {}".format(gcs_output_path))
        with self.gcs_client.open(
            gcs_output_path, "wb", mime_type="application/octet-stream"
        ) as dest:
            with open(output_filename, "rb") as source:
                dest.write(source.read())

        yield entity_ids


def run():
    input_subscription = (
        "projects/sigint/subscriptions/automixer-klio-input-automixer-klio"
    )
    output_topic = "projects/sigint/topics/automixer-klio-output"

    options = pipeline_options.PipelineOptions()
    gcp_opts = options.view_as(pipeline_options.GoogleCloudOptions)
    gcp_opts.job_name = "automixer-beam"
    gcp_opts.project = "sigint"
    gcp_opts.region = "europe-west1"
    gcp_opts.temp_location = "gs://sigint-dataflow-tmp/automixer-beam/temp"
    gcp_opts.staging_location = "gs://sigint-dataflow-tmp/automixer-beam/staging"

    worker_opts = options.view_as(pipeline_options.WorkerOptions)
    worker_opts.subnetwork = "https://www.googleapis.com/compute/v1/projects/some-network/regions/europe-west1/subnetworks/foo-1"
    worker_opts.machine_type = "n1-standard-2"
    worker_opts.disk_size_gb = 32
    worker_opts.num_workers = 2
    worker_opts.max_num_workers = 2
    worker_opts.worker_harness_container_image = "gcr.io/sigint/automixer-worker-beam:1"

    standard_opts = options.view_as(pipeline_options.StandardOptions)
    standard_opts.streaming = True
    standard_opts.runner = "dataflow"

    debug_opts = options.view_as(pipeline_options.DebugOptions)
    debug_opts.experiments = ["beam_fn_api"]

    options.view_as(pipeline_options.SetupOptions).save_main_session = True

    logging.info("Launching pipeline...")
    pipeline = beam.Pipeline(options=options)
    (pipeline
     | beam.io.ReadFromPubSub(subscription=input_subscription)
     | beam.ParDo(MixerDoFn())
     | beam.io.WriteToPubSub(output_topic))
    result = pipeline.run()
    result.wait_until_finish()


if __name__ == "__main__":
    fmt = "%(asctime)s %(message)s"
    logging.basicConfig(format=fmt, level=logging.INFO)
    run()

Slide 135

Slide 135 text

import os

import apache_beam as beam
from klio.transforms import decorators

import mixer
import track_storage


class AutomixerJob(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        # Get input track ids
        track1_id, track2_id = data.element.split(",")
        track1 = mixer.Track(track1_id)
        track2 = mixer.Track(track2_id)

        # Cross fade tracks
        local_output_path = mixer.mix(track1, track2)

        # Upload crossfaded track
        gcs_output_path = os.path.join(
            self._klio.config.job_config.outputs[0].data_location,
            local_output_path,
        )
        self._klio.logger.info("Uploading mix to {}".format(gcs_output_path))
        track_storage.upload_track(gcs_output_path, local_output_path)

        yield data

Slide 136

Slide 136 text

— klio vs vanilla Beam

Slide 137

Slide 137 text

takeaways @roguelynn

Slide 138

Slide 138 text

— what worked?

Slide 139

Slide 139 text

— what was hard?

Slide 140

Slide 140 text

— what’s next?

Slide 141

Slide 141 text

thanks!
Lynn Root | @roguelynn
We're hiring: spotifyjobs.com
Find more information on klio at docs.klio.io and github.com/spotify/klio