End-to-End “Exactly Once” with Heron & Pulsar by Ivan Kelly at Big Data Spain 2017

Heron is an open-source streaming engine, employed by Twitter, Microsoft and Google, to process billions of events every day.

https://www.bigdataspain.org/2017/talk/end-to-end-exactly-once-with-heron-pulsar

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 01, 2017
Transcript

  1. 3 common semantics in stream processing
    • At most once
    • At least once
    • Effectively once
  2. Effectively once?
    ☒ A processing node will only ever process a tuple once
    ☑ Result of a processed tuple only perceived once
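The distinction on this slide can be illustrated with a toy checkpoint-and-replay loop. This is a minimal sketch, not Heron's implementation: `CountingNode`, `run`, `crash_after`, and `checkpoint_every` are hypothetical names invented for the example. A simulated crash rolls the node back to its last checkpoint, so one tuple is processed twice, yet the visible result matches a failure-free run.

```python
class CountingNode:
    """Toy processing node (hypothetical): a running sum plus one checkpoint."""

    def __init__(self):
        self.total = 0               # visible result
        self.next_offset = 0         # next source offset to read
        self.times_processed = {}    # offset -> number of times process() ran
        self._checkpoint = (0, 0)    # saved (total, next_offset)

    def process(self, offset, value):
        self.times_processed[offset] = self.times_processed.get(offset, 0) + 1
        self.total += value
        self.next_offset = offset + 1

    def checkpoint(self):
        self._checkpoint = (self.total, self.next_offset)

    def restore(self):
        # Roll state AND read position back together.
        self.total, self.next_offset = self._checkpoint


def run(source, crash_after=None, checkpoint_every=2):
    """Process `source`, optionally simulating one crash right after an offset."""
    node = CountingNode()
    crashed = False
    while node.next_offset < len(source):
        offset = node.next_offset
        node.process(offset, source[offset])
        if crash_after is not None and not crashed and offset == crash_after:
            crashed = True
            node.restore()           # lose uncheckpointed work, then replay
            continue
        if node.next_offset % checkpoint_every == 0:
            node.checkpoint()
    return node


node = run([1, 2, 3, 4, 5], crash_after=2)
```

Here the tuple at offset 2 is processed twice, but `node.total` is the same as in a run with no crash, because the state and the read position are restored together.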
  3. Effectively Once Support
    Columns: Optimistic, Pessimistic
    Systems listed: Twitter Heron, Apache Flink, Google MillWheel, Apache Kafka, Apache Apex, Twitter Heron (In-development)
  4. On failure, the messaging system must not:
    • Lose messages
    • Duplicate messages
    • Reorder messages*
  5. Subscribers must be able to reattach where they left off
    Catching up should not adversely affect other operations
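Reattaching where a subscriber left off can be sketched with a per-subscription cursor kept by the messaging system. The code below is a toy model under stated assumptions, not the Pulsar client API: `Topic`, `receive`, and `ack` are hypothetical names, and every subscription here starts at offset 0.

```python
class Topic:
    """Toy topic (hypothetical): an append-only log plus one cursor per subscription."""

    def __init__(self):
        self.log = []
        self.cursors = {}  # subscription name -> next offset to deliver

    def publish(self, message):
        self.log.append(message)

    def receive(self, subscription):
        """Return (offset, message) at the subscription's cursor, or None if caught up."""
        offset = self.cursors.setdefault(subscription, 0)
        if offset >= len(self.log):
            return None
        return offset, self.log[offset]

    def ack(self, subscription, offset):
        """Move the cursor past `offset`; the cursor never moves backwards."""
        current = self.cursors.get(subscription, 0)
        self.cursors[subscription] = max(current, offset + 1)


topic = Topic()
for message in ["a", "b", "c"]:
    topic.publish(message)

offset, message = topic.receive("s")   # first message
topic.ack("s", offset)
topic.receive("s")                     # second message delivered, subscriber dies before ack
after_reattach = topic.receive("s")    # a new subscriber instance resumes at the cursor
```

Because the cursor lives with the topic rather than with the subscriber process, the unacknowledged message is redelivered after the reattach, and a second subscription reads from its own independent cursor.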
  6. Once acked by messaging, data is persisted there
    But before this point, publishers must not:
    • Lose data
    • Duplicate data
  7. Apache Pulsar
    • Both PubSub and Queue semantics
    • Multi-tenant & Cross-DC
    • Apache Incubating project
    • 2 years production usage at scale in Yahoo
  8. Apache BookKeeper
    • Horizontally scalable replicated logging
    • Pulsar topic replicated log is a sequence of BookKeeper ledgers (stored in ZK)
    • Uses ZooKeeper to close ledgers
    • Many different streams per storage node (bookies)
    • Isolated read and write paths
  9. Idempotent publish
    • Publisher communicates offset in source data to broker
    • Deduplicate published messages with lower or equal offset
    • Broker stores publisher offsets in a cursor
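The idempotent publish rule on this slide can be sketched as follows. This is a minimal illustration, not Pulsar's broker code: `DedupBroker` and its `publish` signature are hypothetical, and a plain dictionary stands in for the cursor in which the broker stores publisher offsets.

```python
class DedupBroker:
    """Toy broker (hypothetical) that drops publishes it has already accepted."""

    def __init__(self):
        self.log = []
        # Stand-in for the cursor: publisher name -> highest source offset accepted.
        self.publisher_offsets = {}

    def publish(self, publisher, source_offset, payload):
        """Append payload unless source_offset is lower than or equal to the last accepted one.

        Returns True if the message was appended, False if it was deduplicated.
        """
        last = self.publisher_offsets.get(publisher, -1)
        if source_offset <= last:
            return False
        self.log.append(payload)
        self.publisher_offsets[publisher] = source_offset
        return True


broker = DedupBroker()
results = [
    broker.publish("p", 0, "a"),
    broker.publish("p", 1, "b"),
    broker.publish("p", 1, "b"),  # retry after a lost ack: dropped
    broker.publish("p", 0, "a"),  # replay from an older offset: dropped
    broker.publish("p", 2, "c"),
]
```

A publisher that is unsure whether a message was persisted can simply resend it with the same source offset; the broker's stored offset makes the retry harmless.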
  10. Pulling it all together
    • Optimistic Effectively Once
    • Pessimistic Effectively Once (Coming soon)
    • Total Order Atomic Broadcast
    • Scalable Storage for many streams
    • Cursors
    • Idempotent Publish