Beam in Production

Mike will describe the gotchas and early struggles Sentry hit moving streaming data pipelines off our laptops and into production. He'll cover some unexpected Beam defaults, detecting schema errors, comparing performance between the Python and Java SDKs, and proactively identifying when production pipelines break due to unexpected data.

Mike Clarke is an engineering manager at Sentry, an open-source error monitoring tool that helps developers ship better software, faster. Mike's passion is bringing Sentry's monitoring solutions to data engineers & data scientists. Connect with Mike and leverage Sentry on your next project.

Mike Clarke

February 19, 2020

Transcript

  1. What we’ll cover
    • Problems we encountered going to production
    • Solutions we found useful to address these problems
  2. Problem - Unexpected data in production
    • Schema changes happen
      ◦ Schema of input data from a Kafka stream
      ◦ Schema of destination BigQuery table
    • Bad data happens
      ◦ Assumptions we made about source data did not hold up in production
    Mistakes will be made - but how quickly can we fix them?
  3. Error messages buried in logs

        INFO:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery:
        [<InsertErrorsValueListEntry
           errors: [<ErrorProto
             debugInfo: ''
             location: 'phonenumber.areacode'
             message: 'Cannot convert value to integer (bad value):asdf'
             reason: 'invalid'>]
           index: 0>,
         <InsertErrorsValueListEntry
           errors: [<ErrorProto
             debugInfo: ''
             location: 'phonenumber.areacode'
             message: 'Cannot convert value to integer (bad value):asdf'
             reason: 'invalid'>]
           index: 1>, ...
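One way to surface errors like these instead of leaving them buried is to pull the `location`, `message`, and `reason` fields out of the log line and count them. The following is a minimal sketch using only the Python standard library. The sample line is copied from the slide (closed off after two entries); the names `ERROR_PATTERN`, `extract_insert_errors`, and `summarize` are hypothetical, and the regular expression assumes field values never contain a single quote.

```python
import re
from collections import Counter

# Sample log line copied from the slide, closed off after two entries.
LOG_LINE = (
    "INFO:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery: "
    "[<InsertErrorsValueListEntry errors: [<ErrorProto debugInfo: '' "
    "location: 'phonenumber.areacode' "
    "message: 'Cannot convert value to integer (bad value):asdf' "
    "reason: 'invalid'>] index: 0>, "
    "<InsertErrorsValueListEntry errors: [<ErrorProto debugInfo: '' "
    "location: 'phonenumber.areacode' "
    "message: 'Cannot convert value to integer (bad value):asdf' "
    "reason: 'invalid'>] index: 1>]"
)

# Hypothetical pattern: assumes the three fields appear in this order and
# that no value contains a single quote.
ERROR_PATTERN = re.compile(
    r"location: '(?P<location>[^']*)' "
    r"message: '(?P<message>[^']*)' "
    r"reason: '(?P<reason>[^']*)'"
)


def extract_insert_errors(log_line):
    """Return one dict per ErrorProto found in a BigQuery insert-error log line."""
    return [match.groupdict() for match in ERROR_PATTERN.finditer(log_line)]


def summarize(errors):
    """Count errors by (location, reason) so repeated failures stand out."""
    return Counter((err["location"], err["reason"]) for err in errors)


errors = extract_insert_errors(LOG_LINE)
summary = summarize(errors)
for (location, reason), count in summary.items():
    print(f"{count} x {reason} at {location}")
```

A summary keyed on field location and reason makes a schema mismatch (here, a non-integer `phonenumber.areacode`) visible as a single repeated failure rather than a wall of log text.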
  4. Python SDK

        import sentry_sdk
        from sentry_sdk.integrations.beam import BeamIntegration

        # Sentry init
        integrations = [BeamIntegration()]
        sentry_sdk.init(dsn="YOUR DSN", integrations=integrations)

        # Pipeline code goes here
  5. Java SDK

        public static void main(String[] args) {
          MyOptions options = PipelineOptionsFactory
              .fromArgs(args)
              .withValidation()
              .as(MyOptions.class);

          // Initialize Sentry before the pipeline is built
          Sentry.init(options.getSentryDsn());

          try {
            Pipeline pipeline = Pipeline.create(options);
            pipeline.apply(...); // apply your PTransforms
            pipeline.run().waitUntilFinish();
          } catch (Exception e) {
            // Report the failure to Sentry, then rethrow so the job still fails
            Sentry.capture(e);
            throw e;
          }
        }
  6. Problem - Dropping Data
    • Data missing but stream watermark advancing anyways
      ◦ Very surprising to discover this
      ◦ Uncovered that transient errors are skipped over
      ◦ No way for us to replay dropped data
    • Beam & BigQuery retry options
      ◦ All errors are considered transient except if BigQuery says that the error reason contains one of ImmutableSet.of("invalid", "invalidQuery", "notImplemented")
  7. Option                  Behavior
     alwaysRetry             Always retry all failures.
     neverRetry              Never retry any failures.
     retryTransientErrors    Retry all failures except for known persistent errors.
     shouldRetry             Calls a function you write; you return true if this failure should be retried.
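The four options can be modeled as simple predicates over the error reasons BigQuery returns for a failed row. This is a plain-Python sketch of the decision logic only, not Beam's API: the function names are hypothetical, the persistent reasons are the three listed on the earlier slide, and `retry_only_rate_limited` is an invented example of the "function you write" option.

```python
# Reasons treated as persistent, as listed on the slide.
PERSISTENT_REASONS = frozenset({"invalid", "invalidQuery", "notImplemented"})


def always_retry(reasons):
    """Model of alwaysRetry: every failed row is retried."""
    return True


def never_retry(reasons):
    """Model of neverRetry: no failed row is retried."""
    return False


def retry_transient_errors(reasons):
    """Model of retryTransientErrors: retry unless any reason is persistent."""
    return not any(reason in PERSISTENT_REASONS for reason in reasons)


def retry_only_rate_limited(reasons):
    """Hypothetical custom policy: retry only when every reason is rateLimitExceeded."""
    return bool(reasons) and all(r == "rateLimitExceeded" for r in reasons)


def rows_not_retried(policy, failed_rows):
    """Return the rows a policy gives up on; these need another destination."""
    return [row for row, reasons in failed_rows if not policy(reasons)]


# Hypothetical failed rows paired with their error reasons.
failed = [
    ({"areacode": "asdf"}, ["invalid"]),
    ({"areacode": 415}, ["backendError"]),
]
print(rows_not_retried(retry_transient_errors, failed))
```

Under the transient-only model, the row with reason `invalid` is the one that is not retried, so unless it is captured somewhere else it is lost while the rest of the stream moves on.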
  8. Solution - Dropping Data
    • Verify the retry policy matches the needs of a pipeline
    • Earlier versions had an unsafe default: retryTransientErrors
    • > 0.2.17 (Dec. 2019) has https://issues.apache.org/jira/browse/BEAM-8803
    • Monitor the logs for BigQuery errors (transient and permanent)
    • Or, consider sending your errors to a third-party service
  9. Problem - Poor Performance
    • Python Dataflow pipeline requires lots of machines for our throughput (order 1000s of rows per second)
    • Python process was crashing every few days (in some cases, every few hours)
      ◦ We weren’t able to narrow down what caused this
  10. Solution
    • Move to Java SDK (line for line rewrite)
      ◦ With Python SDK: 100+ vCPUs required to process throughput (~1000 events / second at peak)
      ◦ With Java SDK: Down to 2 n1-standard-4 machines (8 vCPUs total) (!)
      ◦ 90% savings month-over-month
    • (near term) better tooling to understand performance
    • Question: why was Python so much less efficient?
      ◦ Might be related to Python SDK’s usage of `inspect` module which
      ◦ Might also be related to concurrency natives and the JVM’s ability to execute threads more efficiently
      ◦ Might be that the Python SDK is still catching up to the Java SDK
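As a quick sanity check on the numbers above, the vCPU reduction can be worked out directly. This sketch uses only figures from the slide and treats "100+" as a lower bound of 100. It measures vCPU count, not billing, so it is only a rough cross-check of the 90% savings figure; the function name is hypothetical.

```python
def vcpu_reduction(before_vcpus, machines_after, vcpus_per_machine):
    """Return (vCPUs after, fractional reduction) for a fleet resize."""
    after_vcpus = machines_after * vcpus_per_machine
    return after_vcpus, 1 - after_vcpus / before_vcpus


# Slide figures: "100+" vCPUs before (lower bound), 2 machines with 4 vCPUs each after.
java_vcpus, reduction = vcpu_reduction(
    before_vcpus=100, machines_after=2, vcpus_per_machine=4
)
print(f"{java_vcpus} vCPUs, at least {reduction:.0%} fewer than before")
```

Because 100 is a lower bound on the Python fleet, the actual reduction in vCPUs would be at least this large.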
  11. Problem
    In control theory, observability is a measure of how well internal states of a system can be inferred by knowledge of its external outputs. The observability and controllability of a system are mathematical duals.
    • How can we be confident the pipeline is healthy?
    • Metrics within StackDriver are a decent starting point for point alerts
    • Datadog offers nice advantages over built-in StackDriver
      ◦ Longer retention
      ◦ More threshold alerting options
      ◦ Also, more expensive