Beam in Production - Speaker Deck

Slide 1

Slide 1 text

Beam in Production Mike Clarke

Slide 2

Slide 2 text

What we’ll cover
● Problems we encountered going to production
● Solutions we found useful to address these problems

Slide 3

Slide 3 text

Problem: Unexpected data in production

Slide 4

Slide 4 text

Problem - Unexpected data in production
● Schema changes happen
  ○ Schema of input data from a Kafka stream
  ○ Schema of destination BigQuery table
● Bad data happens
  ○ Assumptions we made about source data did not hold up in production

Mistakes will be made - but how quickly can we fix them?

Slide 5

Slide 5 text

Error messages buried in logs

INFO:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery: [] index: 0>, ] index: 1>, ...

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

What if your pipeline told you when it broke?

Slide 8

Slide 8 text

Python SDK

    import sentry_sdk
    from sentry_sdk.integrations.beam import BeamIntegration

    # Sentry init
    integrations = [BeamIntegration()]
    sentry_sdk.init(dsn="YOUR DSN", integrations=integrations)

    # Pipeline code goes here

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Java SDK

    public static void main(String[] args) {
      MyOptions options = PipelineOptionsFactory
          .fromArgs(args)
          .withValidation()
          .as(MyOptions.class);

      // Initialize Sentry before building the pipeline
      Sentry.init(options.getSentryDsn());

      try {
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply(...); // apply your PTransforms
        pipeline.run().waitUntilFinish();
      } catch (Exception e) {
        // Report the failure to Sentry, then rethrow so the job still fails
        Sentry.capture(e);
        throw e;
      }
    }

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

More demos (python, java, scio) at github.com/sentry-demos/apache-beam

Slide 13

Slide 13 text

Problem: Dropping Data

Slide 14

Slide 14 text

Problem - Dropping Data
● Data missing but stream watermark advancing anyway
  ○ Very surprising to discover this
  ○ Uncovered that transient errors are skipped over
  ○ No way for us to replay dropped data
● Beam & BigQuery retry options
  ○ All errors are considered transient unless BigQuery reports an error reason that is one of ImmutableSet.of("invalid", "invalidQuery", "notImplemented")

Slide 15

Slide 15 text

Option                 Behavior
alwaysRetry            Always retry all failures.
neverRetry             Never retry any failures.
retryTransientErrors   Retry all failures except for known persistent errors.
shouldRetry            Calls a function you write; return true if this failure should be retried.
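The four options can be read as predicates over a failed insert. The following is a minimal Python sketch of that decision logic only, not Beam's implementation: the function names, the `RetryPolicy` type, and the idea of passing a list of error reason strings are assumptions made for illustration. The set of persistent reasons is the one quoted on the previous slide, and "backendError" is used only as an example of a reason outside that set.

```python
from typing import Callable, Iterable

# Reasons treated as persistent (non-transient), as quoted on the previous slide.
PERSISTENT_REASONS = frozenset({"invalid", "invalidQuery", "notImplemented"})

# Hypothetical type: a policy receives the error reasons reported for one
# failed row and returns True if the insert should be retried.
RetryPolicy = Callable[[Iterable[str]], bool]


def always_retry(error_reasons: Iterable[str]) -> bool:
    """alwaysRetry: retry every failure."""
    return True


def never_retry(error_reasons: Iterable[str]) -> bool:
    """neverRetry: retry nothing."""
    return False


def retry_transient_errors(error_reasons: Iterable[str]) -> bool:
    """retryTransientErrors: retry unless any reason is known to be persistent."""
    return not any(reason in PERSISTENT_REASONS for reason in error_reasons)


def retry_only_backend_errors(error_reasons: Iterable[str]) -> bool:
    """Example of a user-written shouldRetry function (illustrative rule)."""
    return all(reason == "backendError" for reason in error_reasons)


if __name__ == "__main__":
    for reasons in (["backendError"], ["invalid"], ["backendError", "notImplemented"]):
        print(reasons, "->", retry_transient_errors(reasons))
```

Note that under `retry_transient_errors` any reason outside the persistent set counts as transient, which is why the choice of policy matters for whether rows can be dropped or retried.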

Slide 16

Slide 16 text

Solution - Dropping Data
● Verify the retry policy matches the needs of a pipeline
● Earlier versions had an unsafe default: retryTransientErrors
● Versions > 0.2.17 (Dec. 2019) include https://issues.apache.org/jira/browse/BEAM-8803
● Monitor the logs for BigQuery errors (transient and permanent)
● Or, consider sending your errors to a third-party service
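One way to monitor the logs is to attach a handler to the logger that emits the BigQuery insert errors. The sketch below uses only the Python standard library; the logger name and message prefix are taken from the log line shown earlier in the deck, while `BigQueryErrorHandler`, `watch_bigquery_errors`, and the callback are hypothetical names. It assumes the handler is installed in the process that actually emits the log records (on a distributed runner that would be the workers, not the launcher).

```python
import logging
from typing import Callable, List

# Logger name and message prefix as they appear in the log line earlier in the deck.
BQ_LOGGER_NAME = "apache_beam.io.gcp.bigquery"
BQ_ERROR_PREFIX = "There were errors inserting to BigQuery"


class BigQueryErrorHandler(logging.Handler):
    """Forward BigQuery insert-error log messages to a callback (hypothetical helper)."""

    def __init__(self, on_error: Callable[[str], None]) -> None:
        super().__init__(level=logging.INFO)
        self._on_error = on_error

    def emit(self, record: logging.LogRecord) -> None:
        message = record.getMessage()
        if message.startswith(BQ_ERROR_PREFIX):
            self._on_error(message)


def watch_bigquery_errors(on_error: Callable[[str], None]) -> BigQueryErrorHandler:
    """Attach the handler; the errors are logged at INFO, so that level must be enabled."""
    logger = logging.getLogger(BQ_LOGGER_NAME)
    logger.setLevel(logging.INFO)
    handler = BigQueryErrorHandler(on_error)
    logger.addHandler(handler)
    return handler


if __name__ == "__main__":
    seen: List[str] = []
    watch_bigquery_errors(seen.append)
    logging.getLogger(BQ_LOGGER_NAME).info("There were errors inserting to BigQuery: example")
    print(len(seen), "BigQuery error message(s) captured")
```

The callback is where an alert or a call to a third-party error service would go; here it only appends to a list.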

Slide 17

Slide 17 text

Problem: Poor Performance

Slide 18

Slide 18 text

Problem - Poor Performance
● Python Dataflow pipeline requires lots of machines for our throughput (order 1000s of rows per second)
● Python process was crashing every few days (in some cases, every few hours)
  ○ We weren’t able to narrow down what caused this

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Solution
● Move to Java SDK (line-for-line rewrite)
  ○ With Python SDK: 100+ vCPUs required to process throughput (~1000 events / second at peak)
  ○ With Java SDK: Down to 2 n1-standard-4 machines (8 vCPUs total) (!)
  ○ 90% savings month-over-month
● (near term) better tooling to understand performance
● Question: why was Python so much less efficient?
  ○ Might be related to the Python SDK’s usage of the `inspect` module
  ○ Might also be related to concurrency primitives and the JVM’s ability to execute threads more efficiently
  ○ Might be that the Python SDK is still catching up to the Java SDK
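As a back-of-the-envelope check, the vCPU figures on this slide are roughly in line with the quoted savings. The sketch below treats "100+" as exactly 100 and assumes cost scales with vCPU count, which the slide does not state.

```python
# Figures quoted on the slide; "100+" is treated as a lower bound of 100.
python_vcpus = 100
java_vcpus = 2 * 4  # two n1-standard-4 machines, 4 vCPUs each
peak_events_per_second = 1000

# Fraction of vCPUs no longer needed after the rewrite.
vcpu_reduction = 1 - java_vcpus / python_vcpus

# Peak throughput handled per vCPU under each SDK.
events_per_vcpu_python = peak_events_per_second / python_vcpus
events_per_vcpu_java = peak_events_per_second / java_vcpus

print(f"vCPU reduction: {vcpu_reduction:.0%}")
print(f"events/s per vCPU: Python {events_per_vcpu_python}, Java {events_per_vcpu_java}")
```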

Slide 23

Slide 23 text

Problem: Observability

Slide 24

Slide 24 text

Problem

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.

● How can we be confident the pipeline is healthy?
● Metrics within Stackdriver are a decent starting point for point alerts
● Datadog offers nice advantages over built-in Stackdriver
  ○ Longer retention
  ○ More threshold alerting options
  ○ Also, more expensive

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Future: APM for Beam

Slide 27

Slide 27 text

Source: https://cloud.google.com/blog/products/data-analytics/better-data-pipeline-observability-for-batch-and-stream-processing

Slide 28

Slide 28 text

Sentry’s APM

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

We’re hiring. sentry.io/careers

Slide 32

Slide 32 text

Try out Sentry. $50 credit Promo Code: beam2020