Beam in Production

Beam in Production Mike Clarke

What we’ll cover
• Problems we encountered going to production
• Solutions we found useful to address these problems

Problem: Unexpected data in production

Problem - Unexpected data in production
• Schema changes happen
  ◦ Schema of input data from a Kafka stream
  ◦ Schema of destination BigQuery table
• Bad data happens
  ◦ Assumptions we made about source data did not hold up in production

Mistakes will be made - but how quickly can we fix them?
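
One way to make bad data visible early is to validate each row before it reaches the sink and to set failing rows aside instead of letting the insert fail. The following is a minimal plain-Python sketch of that idea, not code from the talk. The schema, the field names other than `phonenumber.areacode`, and the helpers `lookup`, `validate_row` and `partition_rows` are all hypothetical; in a real pipeline the same check would sit inside a transform with a separate output for rejected rows.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical expected schema: dotted field path -> required Python type.
EXPECTED_SCHEMA = {
    "name": str,
    "phonenumber.areacode": int,
    "phonenumber.number": int,
}


def lookup(row: Any, path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if a part is missing."""
    current = row
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the row is OK."""
    errors = []
    for path, expected_type in EXPECTED_SCHEMA.items():
        value = lookup(row, path)
        if value is None:
            # Covers both an absent field and an explicit None value.
            errors.append(f"{path}: missing")
        elif isinstance(value, bool) or not isinstance(value, expected_type):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(
                f"{path}: expected {expected_type.__name__}, got {value!r}"
            )
    return errors


def partition_rows(
    rows: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], List[str]]]]:
    """Split rows into (valid rows, rejected rows paired with their errors)."""
    good, rejected = [], []
    for row in rows:
        errors = validate_row(row)
        if errors:
            rejected.append((row, errors))
        else:
            good.append(row)
    return good, rejected
```

Keeping the rejected rows together with their error messages means they can be stored and replayed once the schema or the upstream producer is fixed.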

Error messages buried in logs

    INFO:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery: [
      <InsertErrorsValueListEntry errors: [<ErrorProto debugInfo: '' location: 'phonenumber.areacode' message: 'Cannot convert value to integer (bad value):asdf' reason: 'invalid'>] index: 0>,
      <InsertErrorsValueListEntry errors: [<ErrorProto debugInfo: '' location: 'phonenumber.areacode' message: 'Cannot convert value to integer (bad value):asdf' reason: 'invalid'>] index: 1>,
      ...
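
A log line like the one above is hard to scan by eye, but its structure is regular enough to summarize. The sketch below is an illustration only: the regular expression is written against the sample line shown on this slide, the function name `summarize_insert_errors` is hypothetical, and the pattern assumes messages contain no single quotes.

```python
import re
from collections import Counter

# Pattern derived from the sample log line above; it is an assumption that
# other insert-error lines follow the same field order.
ERROR_PROTO = re.compile(
    r"<ErrorProto\s+debugInfo: '(?P<debug>[^']*)'\s+"
    r"location: '(?P<location>[^']*)'\s+"
    r"message: '(?P<message>[^']*)'\s+"
    r"reason: '(?P<reason>[^']*)'>"
)


def summarize_insert_errors(log_line: str) -> Counter:
    """Count distinct (location, reason, message) triples found in one log line."""
    counts = Counter()
    for match in ERROR_PROTO.finditer(log_line):
        key = (match["location"], match["reason"], match["message"])
        counts[key] += 1
    return counts
```

Grouping by field and reason turns hundreds of repeated entries into a short list of distinct failures, which is usually what is needed to decide what to fix first.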

What if your pipeline told you when it broke?

Python SDK

    import sentry_sdk
    from sentry_sdk.integrations.beam import BeamIntegration

    # Sentry init
    integrations = [BeamIntegration()]
    sentry_sdk.init(dsn="YOUR DSN", integrations=integrations)

    # Pipeline code goes here

Java SDK

    public static void main(String[] args) {
      MyOptions options = PipelineOptionsFactory
          .fromArgs(args)
          .withValidation()
          .as(MyOptions.class);

      // Initialize Sentry before the pipeline is built.
      Sentry.init(options.getSentryDsn());

      try {
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply(...); // apply your PTransforms
        pipeline.run().waitUntilFinish();
      } catch (Exception e) {
        // Report the failure to Sentry, then rethrow so the job still fails.
        Sentry.capture(e);
        throw e;
      }
    }

More demos (python, java, scio) at github.com/sentry-demos/apache-beam

Problem: Dropping Data

Problem - Dropping Data
• Data missing but stream watermark advancing anyway
  ◦ Very surprising to discover this
  ◦ Uncovered that transient errors are skipped over
  ◦ No way for us to replay dropped data
• Beam & BigQuery retry options
  ◦ All errors are considered transient except if BigQuery says that the error reason contains one of ImmutableSet.of("invalid", "invalidQuery", "notImplemented")

Option                 Behavior
alwaysRetry            Always retry all failures.
neverRetry             Never retry any failures.
retryTransientErrors   Retry all failures except for known persistent errors.
shouldRetry            Calls a function you write; return true if this failure should be retried.
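
The four options can be read as one decision function over the error reason BigQuery reports. The sketch below models that decision in plain Python to make the differences concrete. It is not the Beam API: the function `should_retry`, its string policy names and the `custom` callback parameter are hypothetical stand-ins, and only the set of persistent reasons is taken from the slide above.

```python
from typing import Callable, Optional

# Reasons treated as persistent (non-transient), as listed on the slide above.
PERSISTENT_REASONS = frozenset({"invalid", "invalidQuery", "notImplemented"})


def should_retry(
    policy: str,
    reason: str,
    custom: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Model of the retry decision for one failed insert under a given policy."""
    if policy == "alwaysRetry":
        return True
    if policy == "neverRetry":
        return False
    if policy == "retryTransientErrors":
        # Everything is considered transient unless the reason is known persistent.
        return reason not in PERSISTENT_REASONS
    if policy == "shouldRetry":
        if custom is None:
            raise ValueError("shouldRetry needs a callback")
        return bool(custom(reason))
    raise ValueError(f"unknown policy: {policy}")
```

In this model a row that fails with reason `invalid` is never retried under `retryTransientErrors`, so something else has to capture it if it must not be lost.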

Solution - Dropping Data
• Verify the retry policy matches the needs of a pipeline
• Earlier versions had an unsafe default: retryTransientErrors
• > 0.2.17 (Dec. 2019) includes the change tracked in https://issues.apache.org/jira/browse/BEAM-8803
• Monitor the logs for BigQuery errors (transient and permanent)
• Or, consider sending your errors to a third-party service

Problem: Poor Performance

Problem - Poor Performance
• Python Dataflow pipeline required lots of machines for our throughput (on the order of 1000s of rows per second)
• Python process was crashing every few days (in some cases, every few hours)
  ◦ We weren’t able to narrow down what caused this

Solution
• Move to Java SDK (line-for-line rewrite)
  ◦ With Python SDK: 100+ vCPUs required to process throughput (~1000 events / second at peak)
  ◦ With Java SDK: Down to 2 n1-standard-4 machines (8 vCPUs total) (!)
  ◦ 90% savings month-over-month
• (near term) better tooling to understand performance
• Question: why was Python so much less efficient?
  ◦ Might be related to the Python SDK’s usage of the `inspect` module
  ◦ Might also be related to concurrency natives and the JVM’s ability to execute threads more efficiently
  ◦ Might be that the Python SDK is still catching up to the Java SDK

Problem: Observability

Problem

In control theory, observability is a measure of how well internal states of a system can be inferred by knowledge of its external outputs. The observability and controllability of a system are mathematical duals.

• How can we be confident the pipeline is healthy?
• Metrics within Stackdriver are a decent starting point for point alerts
• Datadog offers nice advantages over built-in Stackdriver
  ◦ Longer retention
  ◦ More threshold alerting options
  ◦ Also, more expensive

Future: APM for Beam

Source: https://cloud.google.com/blog/products/data-analytics/better-data-pipeline-observability-for-batch-and-stream-processing

Sentry’s APM

We’re hiring. sentry.io/careers

Try out Sentry. $50 credit Promo Code: beam2020
