
Gwen Shapira on Realtime Data Processing at Facebook

Realtime data processing powers many use cases at Facebook, including realtime reporting of the aggregated, anonymized voice of Facebook users, analytics for mobile applications, and insights for Facebook page administrators. Many companies have developed their own systems; we have a realtime data processing ecosystem at Facebook that handles hundreds of Gigabytes per second across hundreds of data pipelines.

Many decisions must be made while designing a realtime stream processing system. In this paper, we identify five important design decisions that affect their ease of use, performance, fault tolerance, scalability, and correctness. We compare the alternative choices for each decision and contrast what we built at Facebook to other published systems.

Our main decision was targeting seconds of latency, not milliseconds. Seconds is fast enough for all of the use cases we support and it allows us to use a persistent message bus for data transport. This data transport mechanism then paved the way for fault tolerance, scalability, and multiple options for correctness in our stream processing systems Puma, Swift, and Stylus...

Papers_We_Love

June 26, 2017


Transcript

  1. 5 This is NOT the one true architecture. Please don’t cargo-cult this paper.
  2. 6 A few real-time systems at Facebook • Chorus – aggregate trends • Realtime feedback for mobile app developers • Page analytics – likes, engagement… • Offload CPU-intensive dashboard queries
  6. 13 Decision #1 – Language Paradigm • Declarative (SQL) – easy & limited • Functional • Procedural (C++, Java, Python) – most flexibility, control, and performance, but a longer dev cycle (see the first sketch after the transcript)
  8. 15 Decision #2: Data Transfer • RPC (MillWheel, Flink, Spark Streaming) – all about speed • Message-forwarding broker (Heron) – applies back-pressure, multiplexes • Persistent stream storage (Samza, Kafka’s Streams API) – most reliable, decouples processors (sketch after the transcript)
  9. 17 Love Song to Scribe: “Independent stream processing nodes / and storing inputs / outputs / made everything great”
  10. 19 Decision #3 – Processing Semantics. Facebook verdict: it depends on the requirements • The ranker writes to an idempotent system – at-least-once • Scuba can lose data but cannot handle duplicates – at-most-once • Exactly-once is REALLY HARD and requires transactions (sketch after the transcript)
  11. 20 Don’t miss the side-note on side-effects • Exactly-once means writing output + offsets to a transactional system • This takes time • Why just wait when you can deserialize, and maybe do other stateless stuff? (sketch after the transcript)
  12. 21 Decision #4 – State Saving • In-memory state with replication (old VoltDB) – requires lots of hardware and network • Local database (Samza, Kafka Streams API) • Remote database (MillWheel) • Upstream (i.e. replay everything on failure) • Global consistent snapshot (Flink)
  13. 23 Best part of the paper – by far: how do you efficiently work with state in a remote DB? (sketch after the transcript)
  15. 25 Decision #5 – Reprocessing • Stream only – requires long retention in the stream store • Maintain both batch and stream systems • Develop systems that can run in both stream and batch modes (Flink, Spark). Facebook verdict: SQL runs everywhere, and binary generation FTW (sketch after the transcript)
  16. 27 Lessons Learned! The biggest win is pipelines composed of independent processors • Mixing multiple systems lets us move fast • High-level abstractions let us improve the implementation • Ease of debugging – independent nodes and the ability to replay • Ease of deployment – Puma as-a-service • Ease of monitoring – lag is the most important metric, and everything is instrumented out of the box (sketch after the transcript) • In the future – auto-scale based on lag
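
Code sketches

The sketches referenced from the transcript follow. They are illustrative Python written for this summary, not code from the paper or from Facebook; every class, function, and parameter name in them is hypothetical.

Decision #1 – Language Paradigm. The trade-off on slide 13 is easiest to see side by side: the commented query is what a declarative (Puma/SQL-style) engine might accept, while the function spells out the same windowed aggregation procedurally.

```python
# Hypothetical declarative version of the same job:
#   SELECT topic, COUNT(*) AS events
#   FROM input_stream
#   GROUP BY topic, window_start(event_time, '5 minutes')
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def window_of(event_time: float) -> int:
    """Map an event timestamp to the start of its tumbling window."""
    return int(event_time) // WINDOW_SECONDS * WINDOW_SECONDS

def count_by_topic(events):
    """Procedural equivalent of the GROUP BY above: (window, topic) -> count."""
    counts = defaultdict(int)
    for event in events:  # each event is a dict: {"topic": str, "time": float}
        counts[(window_of(event["time"]), event["topic"])] += 1
    return counts
```

The procedural version is more code for the same result, but it is also the version you can extend with arbitrary logic, which is exactly the flexibility-versus-effort trade-off the slide describes.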
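
Decision #2 – Data Transfer. A toy stand-in for a persistent message bus (Scribe in the paper; this is not Scribe’s API) shows why persistent stream storage decouples processors: producer and consumer never talk directly, and a crashed consumer resumes, or replays, from any offset.

```python
class PersistentLog:
    """Toy append-only log; a stand-in for a durable stream store."""
    def __init__(self):
        self._entries = []

    def append(self, message) -> int:
        self._entries.append(message)
        return len(self._entries) - 1  # offset of the written message

    def read(self, offset: int):
        """Read everything from `offset` onward; replay is just re-reading."""
        return self._entries[offset:]

log = PersistentLog()
for msg in ["a", "b", "c"]:
    log.append(msg)

# A consumer that crashed after processing offset 0 resumes from offset 1,
# without the producer ever knowing or caring.
assert log.read(1) == ["b", "c"]
```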
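
Decision #3 – Processing Semantics. The two cheaper guarantees fall out of nothing more than the order of processing and checkpointing; `process` and `checkpoint` below are hypothetical callables.

```python
def at_least_once(messages, process, checkpoint):
    # Process first, checkpoint after: a crash between the two steps replays
    # the message on restart, so duplicates are possible but nothing is lost.
    for offset, msg in enumerate(messages):
        process(msg)
        checkpoint(offset)

def at_most_once(messages, process, checkpoint):
    # Checkpoint first, process after: a crash between the two steps skips
    # the message on restart, so data can be lost but is never duplicated.
    for offset, msg in enumerate(messages):
        checkpoint(offset)
        process(msg)

processed, committed = [], []
at_least_once(["m0", "m1"], processed.append, committed.append)
assert processed == ["m0", "m1"] and committed == [0, 1]
```

At-least-once into an idempotent sink (the ranker case on slide 19) looks like exactly-once from the reader’s side, which is why the verdict is “it depends on requirements.”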
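
The side-note on slide 20, sketched under the assumption that `transactional_write` atomically commits both output and offsets: while that commit is in flight, stateless work such as deserializing the next batch has no side effects and is safe to overlap.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def process_exactly_once(batches, transactional_write):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for raw_batch in batches:
            # Stateless deserialization overlaps with the pending commit.
            records = [json.loads(line) for line in raw_batch]
            if pending is not None:
                pending.result()  # wait for the previous transaction to land
            pending = pool.submit(transactional_write, records)
        if pending is not None:
            pending.result()

out = []
process_exactly_once([['{"v": 1}'], ['{"v": 2}']], out.append)
assert out == [[{"v": 1}], [{"v": 2}]]
```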
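
Item 13’s question (efficient state in a remote DB) admits several answers; one common tactic, sketched here with a fake client rather than any real database API, is to buffer increments locally and flush them as a single batched write instead of doing a remote read-modify-write per event.

```python
from collections import defaultdict

class FakeRemoteDB:
    """Stand-in for a remote store that can apply a batch of increments."""
    def __init__(self):
        self.counters = defaultdict(int)
        self.calls = 0

    def apply_increments(self, increments):
        self.calls += 1  # one network round-trip per batch, not per event
        for key, amount in increments.items():
            self.counters[key] += amount

class BatchedCounterState:
    """Write-behind counter state: accumulate locally, flush in batches."""
    def __init__(self, remote_db, flush_every=1000):
        self._db = remote_db
        self._pending = defaultdict(int)
        self._flush_every = flush_every
        self._events = 0

    def increment(self, key, amount=1):
        self._pending[key] += amount  # local only, no network round-trip
        self._events += 1
        if self._events >= self._flush_every:
            self.flush()

    def flush(self):
        if self._pending:
            self._db.apply_increments(dict(self._pending))
        self._pending.clear()
        self._events = 0

db = FakeRemoteDB()
state = BatchedCounterState(db, flush_every=3)
for _ in range(3):
    state.increment("page:123:likes")
assert db.counters["page:123:likes"] == 3 and db.calls == 1
```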
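
Decision #5 – Reprocessing. One way to get the “runs in both stream and batch” property is to write the core logic once over a plain iterable and feed it either a bounded backfill or a live source; this is a sketch of the idea, not Facebook’s binary-generation mechanism.

```python
def transform(events):
    """Core logic, agnostic to whether `events` is bounded or unbounded."""
    for event in events:
        if event.get("valid"):
            yield {"user": event["user"], "score": event["score"] * 2}

backfill = [{"user": "a", "score": 1, "valid": True},
            {"user": "b", "score": 2, "valid": False}]
live = iter(backfill)  # stand-in for an unbounded stream consumer

assert list(transform(backfill)) == list(transform(live))
```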
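
Finally, the monitoring lesson from slide 27: lag is just the distance between what the stream store has accepted and what the consumer has processed, which is also the natural signal for the auto-scaling mentioned at the end.

```python
def consumer_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages written to the stream but not yet processed."""
    return max(0, log_end_offset - committed_offset)

assert consumer_lag(log_end_offset=1_500, committed_offset=1_200) == 300
```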