End-to-End “Exactly Once” with Heron & Pulsar by Ivan Kelly at Big Data Spain 2017

Heron is an open-source streaming engine, employed by Twitter, Microsoft and Google, to process billions of events every day.


Big Data Spain 2017
November 16th - 17th Kinépolis Madrid


Big Data Spain

December 01, 2017


  1. None
  2. End-to-end Effectively Once with Heron and Pulsar. Ivan Kelly, @ivankelly

  3. What is effectively once?

  4. 3 common semantics in stream processing
    • At most once
    • At least once
    • Effectively once
  5. Only interesting in the case of failures!

  6. Example application: State in each count node

  7. At most once: Drops tuples

  8. At least once: Retries tuples

  9. Effectively once?
    ☒ A processing node will only ever process a tuple once
    ☑ Result of a processed tuple only perceived once
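
The difference between the three semantics on slides 7 to 9 can be sketched with a toy count node. This is only an illustration, not Heron code: `count_tuples`, the `failures` map, and the two failure kinds (`"lost"` and `"ack_lost"`) are hypothetical names invented for the example. It assumes effectively once is obtained by retrying and then discarding tuples the node has already applied, so a tuple may be sent twice while its result is perceived once.

```python
def count_tuples(tuples, failures, retry, dedupe):
    """Return the count a node ends up with after one pass over `tuples`.

    `failures` maps tuple id -> "lost" (the first attempt never arrived)
    or "ack_lost" (the tuple was applied, but the sender never saw the ack).
    """
    count = 0
    seen = set()  # ids already applied; only consulted when dedupe=True

    def apply(tuple_id):
        nonlocal count
        if dedupe and tuple_id in seen:
            return  # result of this tuple was already perceived
        seen.add(tuple_id)
        count += 1

    for tuple_id in tuples:
        failure = failures.get(tuple_id)
        if failure != "lost":
            apply(tuple_id)  # first attempt reached the node
        if failure is not None and retry:
            apply(tuple_id)  # sender saw no ack, so it sends again
    return count


tuples = [1, 2, 3, 4]
failures = {2: "lost", 3: "ack_lost"}

at_most_once = count_tuples(tuples, failures, retry=False, dedupe=False)
at_least_once = count_tuples(tuples, failures, retry=True, dedupe=False)
effectively_once = count_tuples(tuples, failures, retry=True, dedupe=True)
print(at_most_once, at_least_once, effectively_once)  # 3 5 4
```

With four tuples, at most once undercounts (3), at least once overcounts (5), and only the deduplicating run reports 4, even though tuple 3 reached the node twice.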
  10. Effectively once? Optimistic & Pessimistic: I/O requirements versus recovery latency

  11. Optimistic Effectively Once

  12. Pessimistic Effectively Once
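
The slides do not spell out what "optimistic" means in detail. The sketch below assumes one common reading: state is checkpointed only periodically, and on failure the node rolls back to the last checkpoint and replays from a rewindable source. That costs little I/O per tuple but more recovery latency, which is the trade-off slide 10 names. `CheckpointedCounter` and its methods are hypothetical names for illustration and are not Heron's API.

```python
class CheckpointedCounter:
    """Count node that saves (position, counts) every few tuples."""

    def __init__(self, source, checkpoint_every):
        self.source = source                  # replayable list of words
        self.checkpoint_every = checkpoint_every
        self.position = 0                     # next source index to read
        self.counts = {}
        self.saved = (0, {})                  # last checkpoint

    def step(self):
        """Process one tuple and checkpoint if the interval is reached."""
        word = self.source[self.position]
        self.counts[word] = self.counts.get(word, 0) + 1
        self.position += 1
        if self.position % self.checkpoint_every == 0:
            self.saved = (self.position, dict(self.counts))

    def recover(self):
        """Discard uncheckpointed work and rewind to the last checkpoint."""
        position, counts = self.saved
        self.position = position
        self.counts = dict(counts)


source = ["a", "b", "a", "c", "a"]
node = CheckpointedCounter(source, checkpoint_every=2)

for _ in range(3):
    node.step()      # checkpoint is taken after the 2nd tuple
node.recover()       # crash: the 3rd tuple's effect is rolled back

while node.position < len(source):
    node.step()      # the 3rd tuple is processed a second time
print(node.counts)   # {'a': 3, 'b': 1, 'c': 1}
```

The third tuple is processed twice, yet the final counts match a failure-free run, because the first attempt's effect was discarded along with the uncheckpointed state.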

  13. Effectively Once Support (Optimistic / Pessimistic): Twitter Heron, Apache Flink, Google MillWheel, Apache Kafka, Apache Apex, Twitter Heron (In-development)
  14. There are more ways to fail

  15. None
  16. On failure, the messaging system must not:
    • Lose messages
    • Duplicate messages
    • Reorder messages*
  17. Subscribers must be able to reattach where they left off.
    Catching up should not adversely affect other operations.
  18. Once acked by messaging, data is persisted there. But before this point, publishers must not:
    • Lose data
    • Duplicate data
  19. Apache Pulsar
    • Both PubSub and Queue semantics
    • Multi-tenant & Cross-DC
    • Apache Incubating project
    • 2 years production usage at scale in Yahoo
  20. None
  21. No loss, No dupes, No reorder* = Total Order Atomic Broadcast (TOAB) = Consensus
  22. KV/Filesystem API
    • No access to log
    Single replicated log
    • Can’t scale out
  23. Apache BookKeeper
    • Horizontally scalable replicated logging
    • Pulsar topic replicated log is a sequence of BookKeeper ledgers (stored in ZK)
    • Uses ZooKeeper to close ledgers
    • Many different streams per storage node (bookie)
    • Isolated read and write paths
  24. Reattach with Cursors
    • Enhanced offsets
    • Managed by Pulsar
    • Stored in BookKeeper ledgers (no scale issue)
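
Reattaching with a cursor can be illustrated with an in-memory stand-in. This is a simplified sketch, not the Pulsar client API: `Topic`, `receive`, and `ack` are hypothetical names, the cursor is a single position rather than the "enhanced offset" the slide mentions, and a new subscription is assumed to start at the beginning of the log.

```python
class Topic:
    """In-memory stand-in for a topic with one cursor per subscription."""

    def __init__(self):
        self.log = []      # append-only message log
        self.cursors = {}  # subscription name -> position of next unacked message

    def publish(self, message):
        self.log.append(message)

    def receive(self, subscription):
        """Return (position, message) at the cursor, or None when caught up."""
        position = self.cursors.setdefault(subscription, 0)
        if position >= len(self.log):
            return None
        return position, self.log[position]

    def ack(self, subscription, position):
        """Advance the cursor past an acknowledged message."""
        self.cursors[subscription] = position + 1


topic = Topic()
for message in ["m0", "m1", "m2"]:
    topic.publish(message)

position, first = topic.receive("counts")
topic.ack("counts", position)           # m0 is done
_, in_flight = topic.receive("counts")  # m1 delivered, subscriber crashes before acking

# A new subscriber instance attaches to the same subscription name.
_, redelivered = topic.receive("counts")
print(first, in_flight, redelivered)    # m0 m1 m1
```

Because the cursor lives with the messaging system rather than the subscriber, the replacement subscriber resumes at the first unacknowledged message without keeping any position of its own.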
  25. None
  26. Idempotent publish
    • Publisher communicates offset in source data to broker
    • Broker deduplicates published messages with a lower or equal offset
    • Broker stores publisher offsets in a cursor
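
The deduplication rule on slide 26 can be sketched as follows. This is a minimal in-memory model, not Pulsar's broker code: `Broker`, `publish`, and `last_offset` are hypothetical names, and a plain dictionary stands in for the cursor that the slide says stores publisher offsets.

```python
class Broker:
    """Toy broker that drops publishes it has already persisted."""

    def __init__(self):
        self.log = []          # persisted payloads, in order
        self.last_offset = {}  # publisher name -> highest source offset persisted

    def publish(self, publisher, offset, payload):
        """Persist the payload unless its offset is lower than or equal to the last one."""
        if offset <= self.last_offset.get(publisher, -1):
            return False       # duplicate: already persisted
        self.log.append(payload)
        self.last_offset[publisher] = offset
        return True


broker = Broker()
results = [
    broker.publish("source-1", 0, "a"),
    broker.publish("source-1", 1, "b"),
    broker.publish("source-1", 1, "b"),  # retry after a lost ack
    broker.publish("source-1", 2, "c"),
]
print(results)     # [True, True, False, True]
print(broker.log)  # ['a', 'b', 'c']
```

The publisher can retry blindly after a failure: any message whose source offset the broker has already seen is acknowledged as a duplicate and never appended a second time.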
  27. Pulling it all together
    • Optimistic Effectively Once
    • Pessimistic Effectively Once (Coming soon)
    • Total Order Atomic Broadcast
    • Scalable Storage for many streams
    • Cursors
    • Idempotent Publish
  28. Blog: https://streaml.io/blog YouTube: https://goo.gl/qnWXBT @streamlio @ivankelly