Music Is Just Wiggly Air

Lynn Root
November 10, 2021

Apache Beam Summit 2020

Digital signal processing (DSP) has been made easy with the help of many Python libraries, allowing engineers and researchers to quickly and effortlessly analyze audio, images, and video. However, scaling these algorithms and models to process millions of files has not been nearly as seamless. At Spotify, we're addressing the challenge of scaling DSP across our catalog of more than 50 million songs. This talk covers the challenges we've encountered while building the infrastructure needed to support audio processing at scale: how we've leveraged Apache Beam for streaming data pipelines, and the tooling we've built on top of Beam to support our heavy resource requirements.


Transcript

  1. music is just wiggly air Lynn Root | Staff Engineer

    | @roguelynn building infrastructure to support audio research
  2. intro @roguelynn

  3. — Audio intelligence research at Spotify advances the state of

    the art in understanding music at scale to enhance how it is created, identified and consumed. research.spotify.com
  5. research workflow @roguelynn

  6.-8. (image-only slides)
  9. productionization requirements @roguelynn

  10. — graph execution modes

  11. — top-down execution

  12.-16. (image-only slides)
  17. — bottom-up execution

  18.-24. (image-only slides)
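The two modes named above can be read as: top-down, where a new input (say, a newly ingested track) enters at the top of the job graph and flows through every downstream job; and bottom-up, where a request for one job's output triggers parent jobs only where their outputs are missing (the "Trigger Parent & Drop" step later in the deck). A purely conceptual sketch over a toy dependency graph (hypothetical names, not klio's implementation):

# Conceptual sketch of the two graph execution modes (hypothetical names,
# not klio's actual implementation).

# Toy dependency graph, listed in topological order: each job names the
# jobs whose output it consumes.
JOBS = {
    "download-audio": [],
    "transcode": ["download-audio"],
    "automixer": ["transcode"],
}

_OUTPUTS = set()  # pretend storage of already-computed (job, track_id) outputs


def output_exists(job, track_id):
    return (job, track_id) in _OUTPUTS


def run_job(job, track_id):
    print(f"running {job} for {track_id}")
    _OUTPUTS.add((job, track_id))


def run_top_down(track_id):
    """Top-down: a new track flows through every job in the graph."""
    for job in JOBS:
        run_job(job, track_id)


def run_bottom_up(job, track_id):
    """Bottom-up: ask for one job's output; missing inputs trigger parents first."""
    for parent in JOBS[job]:
        if not output_exists(parent, track_id):
            run_bottom_up(parent, track_id)
    if not output_exists(job, track_id):
        run_job(job, track_id)


if __name__ == "__main__":
    run_top_down("s0ng1")                 # every job runs for the new track
    run_bottom_up("automixer", "s0ng2")   # only the work this request needs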
  25. — research workload

  26. — python

  27. — custom environment

  28. — streaming

  29. — avoid duplicate work

  30. — scalability

  31. — summary Top-down & bottom-up Python Custom environment Avoid duplicate

    work Streaming Scalability
  32. early approaches @roguelynn

  33. — music intelligence pipeline Google Cloud PubSub + Microservices

  34. — Apache Beam

  35. — Apache Beam Google Cloud PubSub + Apache Beam on Google Cloud Dataflow
  36. — what’s left?

  37. our solution: klio @roguelynn

  38. Kleio

  39. Kleio

  40. — goals of klio

  41. — ecosystem

  42. — ecosystem: user PoV

  43. Develop

  44. $ klio job create Develop

  45. $ klio job create Develop $ klio job verify

  46. $ klio job create Develop Test $ klio job verify

  47. $ klio job create $ klio job test Develop Test

    $ klio job verify
  48. $ klio job create $ klio job test Develop Test

    $ klio job verify $ klio job audit
  49. $ klio job create $ klio job test Develop Test

    $ klio job verify $ klio job audit $ klio job profile
  50. $ klio job create $ klio job test Develop Test

    Deploy $ klio job verify $ klio job audit $ klio job profile
  51. $ klio job create $ klio job run $ klio

    job test Develop Test Deploy $ klio job verify $ klio job audit $ klio job profile
  52. $ klio job create $ klio job run $ klio

    message publish $ klio job test Develop Test Deploy $ klio job verify $ klio job audit $ klio job profile
  53. $ klio job create $ klio job run $ klio

    message publish $ klio job test Develop Test Deploy $ klio job verify $ klio job audit $ klio job profile $ klio job logs
  54. — ecosystem: behind the scenes

  55. Local / CI Machine

  56. $ klio job run Local / CI Machine

  57. $ klio job run $ klio image build Local /

    CI Machine
  58. $ klio job run $ klio image build $ klioexec

    run Local / CI Machine worker container
  59. $ klio job run $ klio image build $ klioexec

    run Local / CI Machine Google Cloud worker container
  62. $ klio job run $ klio message publish $ klio

    image build $ klioexec run Local / CI Machine Google Cloud worker container
  64. $ klio job run $ klio message publish $ klio

    image build $ klioexec run Local / CI Machine Google Cloud worker container $ klio job logs
  65. — architecture

  66. — architecture: klio job

  67. docker container dataflow worker klio job
  68. docker container dataflow worker s0ng2 klio job
  69. docker container dataflow worker klio preprocessing s0ng2
  70. docker container dataflow worker klio preprocessing user-implemented transform s0ng2
  71. docker container dataflow worker klio preprocessing user-implemented transform s0ng2 s0ng2.wav
  72. docker container dataflow worker klio preprocessing user-implemented transform s0ng2 s0ng2.wav s0ng2.json
  73. docker container dataflow worker klio preprocessing user-implemented transform klio postprocessing s0ng2 s0ng2.wav s0ng2.json
  74. docker container dataflow worker klio preprocessing user-implemented transform klio postprocessing s0ng2 s0ng2.wav s0ng2.json s0ng2
  75. docker container dataflow worker klio preprocessing user-implemented transform klio postprocessing UUID-like klio job
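In other words, each klio job wraps the researcher's transform between klio-owned pre- and post-processing steps inside the Dataflow worker's Docker container: an ID like s0ng2 comes in, the audio (s0ng2.wav) is fetched, the user transform produces its output (s0ng2.json), and only the ID is passed downstream. A rough sketch of that shape as Beam transforms (the helper calls in comments are hypothetical, not klio's internals):

import apache_beam as beam


class KlioPreprocess(beam.DoFn):
    """Sketch: turn an incoming ID (e.g. "s0ng2") into a local audio file."""

    def process(self, track_id):
        local_path = f"/tmp/{track_id}.wav"        # hypothetical download location
        # download_from_gcs(track_id, local_path)  # assumed helper
        yield track_id, local_path


class UserTransform(beam.DoFn):
    """Sketch: the researcher's code, producing e.g. s0ng2.json from s0ng2.wav."""

    def process(self, element):
        track_id, local_path = element
        output_path = f"/tmp/{track_id}.json"
        # analyze(local_path, output_path)         # assumed helper
        yield track_id, output_path


class KlioPostprocess(beam.DoFn):
    """Sketch: upload the result and pass only the ID downstream."""

    def process(self, element):
        track_id, output_path = element
        # upload_to_gcs(output_path)               # assumed helper
        yield track_id


def build(ids):
    """Compose the three stages over a PCollection of track IDs."""
    return (
        ids
        | "klio preprocessing" >> beam.ParDo(KlioPreprocess())
        | "user transform" >> beam.ParDo(UserTransform())
        | "klio postprocessing" >> beam.ParDo(KlioPostprocess())
    )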
  76. — architecture: klio message

  77. Downstream? Yes No

  78. Downstream? Drop Yes No

  79. Downstream? Drop Ping Mode? Yes No

  80. Downstream? Drop Ping Mode? Yes No No Yes

  81. Downstream? Drop Ping Mode? Pass Thru Yes No No Yes

  82. Downstream? Drop Ping Mode? Output data exists? Pass Thru Yes

    No No Yes
  83. Downstream? Drop Ping Mode? Output data exists? Pass Thru Yes

    No No No Yes Yes
  84. Downstream? Drop Ping Mode? Output data exists? Force mode? Pass

    Thru Yes No No No Yes Yes
  85. Downstream? Drop Ping Mode? Output data exists? Force mode? Pass

    Thru Yes No No Yes No Yes Yes No
  86. Downstream? Drop Ping Mode? Output data exists? Force mode? Pass

    Thru Pass Thru Yes No No Yes No Yes Yes No
  87. Downstream? Drop Ping Mode? Output data exists? Input data exists?

    Force mode? Pass Thru Pass Thru Yes No No Yes No Yes Yes No
  88. Downstream? Drop Ping Mode? Output data exists? Input data exists?

    Force mode? Pass Thru Pass Thru Yes No No Yes Yes No Yes Yes No No
  89. Downstream? Drop Ping Mode? Output data exists? Input data exists?

    Force mode? Pass Thru Pass Thru Trigger Parent & Drop Yes No No Yes Yes No Yes Yes No No
  90. Downstream? Drop Ping Mode? Output data exists? Input data exists?

    Force mode? Pass Thru Process Pass Thru Trigger Parent & Drop Yes No No Yes Yes No Yes Yes No No
  91. Downstream? Drop Ping Mode? Output data exists? Input data exists?

    Force mode? Pass Thru Process Pass Thru Trigger Parent & Drop Yes No No Yes Yes No Yes Yes No No
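Reading the flow chart built up above left to right, the per-message decision amounts to something like the sketch below (method and attribute names are hypothetical stand-ins for what the slides label, not klio's actual API):

# Sketch of the klio message handling flow shown above (hypothetical names).

def handle_message(msg, job):
    """Decide what to do with one incoming message for this job."""
    # "Downstream?": is this message relevant to this job (or its downstream
    # path) at all? If not, drop it.
    if not msg.intended_for(job):
        return "drop"

    # "Ping mode?": trace the message through the graph without doing work.
    if msg.ping_mode:
        return "pass thru"

    # "Output data exists?" + "Force mode?": avoid duplicate work unless the
    # message explicitly forces a re-run.
    if job.output_exists(msg.element) and not msg.force_mode:
        return "pass thru"

    # "Input data exists?": if the input is missing, this is where bottom-up
    # execution kicks in -- ask the parent job to produce it, then drop.
    if not job.input_exists(msg.element):
        job.trigger_parent(msg)
        return "trigger parent & drop"

    return "process"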
  92. show us code! @roguelynn

  93. — automixer

  94. — automixer: vanilla Beam

  95. $ tree . ├── Dockerfile ├── job-requirements.txt ├── mixer.py ├──

    run.py └── track_storage.py
  98. import logging import re import threading import apache_beam as beam

    from apache_beam.io.gcp import gcsio from apache_beam.options import pipeline_options import mixer import track_storage class MixerDoFn(beam.DoFn): PROJECT = "sigint" GCS_BUCKET = "sigint-output" GCS_OBJECT_PATH = "automixer-beam" OUTPUT_NAME_TPL = "{track1_id}-{track2_id}-mix.ogg" GCS_OUTPUT_TPL = "gs://{bucket}/{object_path}/{filename}" _thread_local = threading.local() @property def gcs_client(self): client = getattr(self._thread_local, "gcs_client", None) if not client: self._thread_local.gcs_client = gcsio.GcsIO() return self._thread_local.gcs_client def process(self, entity_ids): track1_id, track2_id = entity_ids.decode("utf-8").split(",") output_filename = MixerDoFn.OUTPUT_NAME_TPL.format( track1_id=track1_id, track2_id=track2_id ) gcs_output_path = MixerDoFn.GCS_OUTPUT_TPL.format( bucket=MixerDoFn.GCS_BUCKET, object_path=MixerDoFn.GCS_OBJECT_PATH, filename=output_filename, ) # Check if output already exists: if self.gcs_client.exists(gcs_output_path): # Don't do unnecessary work logging.info( "Mix for {} & {} already exists: {}".format( track1_id, track2_id, gcs_output_path ) ) return # Check if input data is available err_msg = "Input for {track} is not available: {e}" try: track1_input_path = track_storage.download_track(track1_id) except Exception as e: logging.error(err_msg.format(track=track1_id, e=e)) return try: track2_input_path = track_storage.download_track(track2_id) except Exception as e: logging.error(err_msg.format(track=track2_id, e=e)) return # Get input track ids track1 = mixer.Track(track1_id, track1_input_path) track2 = mixer.Track(track2_id, track2_input_path) # Mix tracks & save to output file mixer.mix(track1, track2, output_filename) # Upload mix logging.info("Uploading mix to {}".format(gcs_output_path)) with self.gcs_client.open(gcs_output_path, "wb", mime_type="application/octet-stream") as dest: with open(output_filename, "rb") as source: dest.write(source.read()) yield entity_ids 
 def run(): input_subscription = "projects/sigint/subscriptions/automixer-klio-input-automixer-klio" output_topic = "projects/sigint/topics/automixer-klio-output" options = pipeline_options.PipelineOptions() gcp_opts = options.view_as(pipeline_options.GoogleCloudOptions) gcp_opts.job_name = "automixer-beam" gcp_opts.project = "sigint" gcp_opts.region = "europe-west1" gcp_opts.temp_location = "gs://sigint-dataflow-tmp/automixer-beam/temp" gcp_opts.staging_location = "gs://sigint-dataflow-tmp/automixer-beam/staging" worker_opts = options.view_as(pipeline_options.WorkerOptions) worker_opts.subnetwork = "https://www.googleapis.com/compute/v1/projects/some-network/regions/europe-west1/subnetworks/foo1" worker_opts.machine_type = "n1-standard-2" worker_opts.disk_size_gb = 32 worker_opts.num_workers = 2 worker_opts.max_num_workers = 2 worker_opts.worker_harness_container_image = "gcr.io/sigint/automixer-worker-beam:1" standard_opts = options.view_as(pipeline_options.StandardOptions) standard_opts.streaming = True standard_opts.runner = "dataflow" debug_opts = options.view_as(pipeline_options.DebugOptions) debug_opts.experiments = ["beam_fn_api"] options.view_as(pipeline_options.SetupOptions).save_main_session = True logging.info("Launching pipeline...") pipeline = beam.Pipeline(options=options) (pipeline | beam.io.ReadFromPubSub(subscription=input_subscription) | beam.ParDo(MixerDoFn()) | beam.io.WriteToPubSub(output_topic)) result = pipeline.run() result.wait_until_finish() if __name__ == "__main__": fmt = '%(asctime)s %(message)s' logging.basicConfig(format=fmt, level=logging.INFO) run() 125 LoC
  99. # start job from local dev machine (env) $ docker

    build . -t my-worker-image:v1 (env) $ docker push my-worker-image:v1 (env) $ python run.py
  100. # start job from worker container $ docker build .

    -t my-worker-image:v1 $ docker push my-worker-image:v1 $ docker run --rm -it \
 --entrypoint /bin/bash \ -v ~/.config/gcloud/:/usr/gcloud/ \ -v $(pwd)/:/usr/src/app/ \ -e GOOGLE_APPLICATION_CREDENTIALS=/path/to/creds.json \ -e GOOGLE_CLOUD_PROJECT=my-gcp-project \ my-worker-image:v1 \
 python run.py
  101. — automixer: klio

  102. $ tree . ├── Dockerfile ├── job-requirements.txt ├── klio-job.yaml ├──

    mixer.py ├── run.py └── track_storage.py
  107. import os import apache_beam as beam from klio.transforms import decorators

    import mixer import track_storage class AutomixerJob(beam.DoFn): @decorators.handle_klio def process(self, data): # Get input track ids track1_id, track2_id = data.element.split(",") track1 = mixer.Track(track1_id) track2 = mixer.Track(track2_id) # Cross fade tracks local_output_path = mixer.mix(track1, track2) # Upload crossfaded track gcs_output_path = os.path.join( self._klio.config.job_config.outputs[0].data_location, local_output_path ) self._klio.logger.info("Uploading mix to {}".format(gcs_output_path)) track_storage.upload_track(gcs_output_path, local_output_path) yield data 30 LoC, over 75% off!
  108. $ klio job run

  109. from klio.transforms import decorators class AutomixerJob(beam.DoFn): @decorators.handle_klio def process(self, data):

    ...
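The decorator is what hides most of the boilerplate from the vanilla Beam version: it handles deserializing the incoming message, exposing a context object (config, logger) on the transform, and serializing whatever the transform yields. A very rough sketch of that decorator pattern in general (explicitly not klio's real implementation; all names are hypothetical):

import functools
import logging
import types


def handle_message(method):
    """Rough sketch of such a decorator (hypothetical; not klio's code)."""

    @functools.wraps(method)
    def wrapper(self, raw_bytes):
        # Give the transform a context object (logger, config), similar in
        # spirit to self._klio in the slides.
        if not hasattr(self, "_ctx"):
            self._ctx = types.SimpleNamespace(
                logger=logging.getLogger("job"),
                config={},  # stand-in for the parsed job config
            )
        # Deserialize the raw Pub/Sub payload into a message-like object.
        msg = types.SimpleNamespace(element=raw_bytes.decode("utf-8"))
        # Run the user's logic, then re-serialize whatever it yields so the
        # next job in the graph receives plain bytes again.
        for out in method(self, msg):
            yield out.element.encode("utf-8")

    return wrapper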
  112. job_name: my-job
       pipeline_options:
         streaming: True
         # <-- snip -->
       job_config:
         events:
           inputs:
             - type: pubsub
               topic: my-parent-job-output-topic
               subscription: my-job-input-subscription
           outputs:
             - type: pubsub
               topic: my-job-output-topic
         data:
           inputs:
             - type: gcs
               location: gs://my-parent-job/output-bucket
               file_suffix: ogg
           outputs:
             - type: gcs
               location: gs://my-job/output-bucket
               file_suffix: wav
  127. job_name: my-job pipeline_options: streaming: True # <-- snip --> job_config:

    events: inputs: - type: pubsub topic: my-parent-job-output-topic subscription: my-job-input-subscription outputs: - type: pubsub topic: my-job-output-topic data: inputs: - type: gcs location: gs://my-parent-job/output-bucket file_suffix: ogg outputs: - type: gcs location: gs://my-job/output-bucket file_suffix: wav gs://my-job/output-bucket/s0m3-aud10-1d.wav
  131. job_name: my-job pipeline_options: streaming: True # <-- snip --> job_config:

    events: inputs: - type: pubsub topic: my-parent-job-output-topic subscription: my-job-input-subscription outputs: - type: pubsub topic: my-job-output-topic data: inputs: - type: gcs location: gs://my-parent-job/output-bucket file_suffix: ogg outputs: - type: gcs location: gs://my-job/output-bucket file_suffix: wav gs://my-parent-job/output-bucket/s0m3-aud10-1d.ogg
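The data section of this config is what lets a klio job resolve a bare ID on the event stream into concrete input and output locations, as the gs:// examples above show for s0m3-aud10-1d, and checking the resolved output path for existence is what powers "avoid duplicate work". A small sketch of that mapping (hypothetical helper; the real resolution lives inside klio):

import posixpath

# The data portion of the klio-job.yaml above, as a plain dict.
DATA_CONFIG = {
    "inputs": [
        {"type": "gcs", "location": "gs://my-parent-job/output-bucket", "file_suffix": "ogg"},
    ],
    "outputs": [
        {"type": "gcs", "location": "gs://my-job/output-bucket", "file_suffix": "wav"},
    ],
}


def resolve_paths(element_id, data_config=DATA_CONFIG):
    """Map an element ID to its input/output object paths (sketch only)."""
    inp, out = data_config["inputs"][0], data_config["outputs"][0]
    input_path = posixpath.join(inp["location"], f"{element_id}.{inp['file_suffix']}")
    output_path = posixpath.join(out["location"], f"{element_id}.{out['file_suffix']}")
    return input_path, output_path


# resolve_paths("s0m3-aud10-1d") ->
#   ("gs://my-parent-job/output-bucket/s0m3-aud10-1d.ogg",
#    "gs://my-job/output-bucket/s0m3-aud10-1d.wav")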
  136. — klio vs vanilla Beam

  137. takeaways @roguelynn

  138. — what worked?

  139. — what was hard?

  140. — what’s next?

  141. thanks! Lynn Root | @roguelynn We’re hiring: spotifyjobs.com Find more

    information on klio at docs.klio.io and github.com/spotify/klio