Gwen Shapira on Realtime Data Processing at Facebook

by Papers_We_Love

Slide 1

Slide 1 text

1 Papers We Love: Realtime Data Processing at Facebook Gwen Shapira Confluent Inc.

Slide 2

Slide 2 text

2 Papers We Love: Realtime Data Processing at Facebook

Slide 3

Slide 3 text

3 Published in 2016 (!)

Slide 4

Slide 4 text

4 What kind of paper is this?

Slide 5

Slide 5 text

5 This is NOT the one true architecture. Please don’t cargo-cult this paper.

Slide 6

Slide 6 text

6 A few real-time systems at Facebook
• Chorus – aggregate trends
• Realtime feedback for mobile app developers
• Page analytics – likes, engagement…
• Offload CPU-intensive dashboard queries

Slide 10

Slide 10 text

10 Looking for trending topics in 5 minute windows
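
A minimal sketch of the windowing idea on this slide, assuming tumbling (non-overlapping) 5-minute windows and in-memory counting. The function name `trending_by_window` and the `(timestamp, topic)` event shape are hypothetical and chosen only for illustration; this is not how Facebook's systems implement it.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # the 5-minute window from the slide


def trending_by_window(events, top_n=3):
    """Count topics per tumbling 5-minute window.

    `events` is an iterable of (unix_timestamp_seconds, topic) pairs.
    Returns {window_start: [(topic, count), ...]} with the top_n topics
    of each window, most frequent first.
    """
    counts = defaultdict(Counter)
    for ts, topic in events:
        # Align the timestamp to the start of its window.
        window_start = ts - (ts % WINDOW_SECONDS)
        counts[window_start][topic] += 1
    return {start: c.most_common(top_n) for start, c in sorted(counts.items())}
```

Each event lands in exactly one window, so per-window counts can be emitted as soon as a window closes.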

Slide 11

Slide 11 text

11 The Tofu & Potatoes of the paper: Design Decisions

Slide 12

Slide 12 text

12 / KafkaStreams + exactly once

Slide 13

Slide 13 text

13 Decision #1 – Language Paradigm
• Declarative (SQL) – easy & limited
• Functional
• Procedural (C++, Java, Python) – most flexibility, control, performance. Longer dev cycle.

Slide 15

Slide 15 text

15 Decision #2: Data Transfer
• RPC (Millwheel, Flink, SparkStreaming)
  • All about speed
• Message-forwarding broker (Heron)
  • Applies back-pressure, multiplex
• Persistent stream storage (Samza, Kafka’s Stream API)
  • Most reliable
  • Decouples processors

Slide 16

Slide 16 text

16 Decision #2: Data Transfer

Slide 17

Slide 17 text

17 Love Song to Scribe
Independent stream processing nodes and storing inputs/outputs made everything great

Slide 18

Slide 18 text

18 Decision #3 – Processing Semantics

Slide 19

Slide 19 text

19 Decision #3 – Processing Semantics
Facebook Verdict: It depends on requirements
• Ranker writes to an idempotent system – at least once
• Scuba can tolerate lost data, but cannot handle duplicates – at most once
• …. Exactly once is REALLY HARD and requires transactions
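
The difference between the first two semantics comes down to the order in which a processor emits output and saves its position. The sketch below simulates a crash between those two steps; the `run` function, the dict-based checkpoint and the list-based sink are all hypothetical stand-ins, not code from the paper.

```python
def run(events, sink, checkpoint, semantics, crash_after=None):
    """Process events starting at checkpoint["offset"].

    semantics="at_least_once": emit output, then save the offset.
    semantics="at_most_once":  save the offset, then emit output.
    crash_after=i simulates a crash between the two steps for event i.
    """
    for i in range(checkpoint["offset"], len(events)):
        if semantics == "at_least_once":
            sink.append(events[i])
            if crash_after == i:
                return  # crashed before the offset was saved
            checkpoint["offset"] = i + 1
        else:
            checkpoint["offset"] = i + 1
            if crash_after == i:
                return  # crashed before the output was emitted
            sink.append(events[i])


# A crash while handling "b", followed by a restart from the checkpoint:
for semantics in ("at_least_once", "at_most_once"):
    sink, checkpoint = [], {"offset": 0}
    run(["a", "b", "c"], sink, checkpoint, semantics, crash_after=1)
    run(["a", "b", "c"], sink, checkpoint, semantics)
    print(semantics, sink)
```

With at-least-once the restart replays "b", so the sink sees it twice, which is harmless only if the sink is idempotent. With at-most-once "b" is never emitted, which is acceptable only if the sink can tolerate loss.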

Slide 20

Slide 20 text

20 Don’t miss the side-note on side-effects
• Exactly once means writing output + offsets to a transactional system
• This takes time
• Why just wait when you can deserialize? And maybe do other stateless stuff?
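
A minimal sketch of the first bullet, assuming an in-memory SQLite database stands in for the transactional system. The table names, `open_store` and `process` are hypothetical. Output and offset are committed in one transaction, so a crash rolls back both and a restart neither loses nor duplicates an event. The sketch does not show overlapping the commit wait with deserialization.

```python
import sqlite3


def open_store():
    """Create a store holding both the output and the read position."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE output (value TEXT)")
    db.execute("CREATE TABLE progress (id INTEGER PRIMARY KEY, next_offset INTEGER)")
    db.execute("INSERT INTO progress VALUES (1, 0)")
    db.commit()
    return db


def process(db, events, crash_at=None):
    """Resume from the stored offset; write output and offset atomically."""
    (start,) = db.execute("SELECT next_offset FROM progress WHERE id = 1").fetchone()
    for i in range(start, len(events)):
        try:
            with db:  # commits on success, rolls back if an exception escapes
                db.execute("INSERT INTO output VALUES (?)", (events[i].upper(),))
                if crash_at == i:
                    raise RuntimeError("simulated crash")
                db.execute("UPDATE progress SET next_offset = ? WHERE id = 1", (i + 1,))
        except RuntimeError:
            return  # the half-finished event was rolled back


db = open_store()
process(db, ["a", "b", "c"], crash_at=1)  # crashes while handling "b"
process(db, ["a", "b", "c"])              # restart resumes at "b"
print([row[0] for row in db.execute("SELECT value FROM output ORDER BY rowid")])
```

The cost the slide points at is visible here: every event waits for a commit, which is why doing stateless work in the meantime is attractive.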

Slide 21

Slide 21 text

21 Decision #4 – State Saving
• In-memory state with replication (Old VoltDB)
  • Requires lots of hardware and network
• Local database (Samza, Kafka Streams API)
• Remote database (Millwheel)
• Upstream (i.e. replay everything on failure)
• Global consistent snapshot (Flink)

Slide 22

Slide 22 text

22 Decision #4 – State Saving
Facebook Verdict: It depends
[Image labels: Rhode Island, Alaska]

Slide 23

Slide 23 text

23 Best Part of the Paper – by far
How to efficiently work with state in a remote DB?

Slide 24

Slide 24 text

24 Decision #5 - Reprocessing
• Stream only – requires long retention in the stream store
• Maintain both batch and stream systems
• Develop systems that can run in streams and batch (Flink, Spark)

Slide 25

Slide 25 text

25 Decision #5 - Reprocessing
• Stream only – requires long retention in the stream store
• Maintain both batch and stream systems
• Develop systems that can run in streams and batch (Flink, Spark)
Facebook Verdict: SQL runs everywhere. And binary generation FTW.

Slide 26

Slide 26 text

26 Applications – Or a whirlwind tour of good patterns
One example:

Slide 27

Slide 27 text

27 Lessons Learned!
The biggest win is pipelines composed of independent processors
• Mixing multiple systems let us move fast
• High level abstractions let us improve implementation
• Ease of debugging – Independent nodes and ability to replay
• Ease of deployment – Puma as-a-service
• Ease of monitoring – Lag is the most important metric. Everything is instrumented out of the box.
• In the future – auto-scale based on lag
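
The last two points can be illustrated with a small sketch. It assumes lag is measured in messages per partition (latest written offset minus last processed offset); the slide does not say how lag is measured at Facebook, and a time-based measure would work the same way. `consumer_lag`, `desired_workers` and the `max_lag_per_worker` threshold are hypothetical names and values.

```python
import math


def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet processed."""
    return {
        partition: latest - committed_offsets.get(partition, 0)
        for partition, latest in latest_offsets.items()
    }


def desired_workers(total_lag, max_lag_per_worker=1000):
    """Hypothetical scaling rule: enough workers to keep lag per worker bounded."""
    return max(1, math.ceil(total_lag / max_lag_per_worker))


lag = consumer_lag({0: 100, 1: 250}, {0: 90, 1: 50})
print(lag, desired_workers(sum(lag.values()), max_lag_per_worker=100))
```

Because every processor reads from a persistent stream, the same lag calculation applies to any node in the pipeline, which is what makes it a convenient single metric to alert and scale on.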

Slide 28

Slide 28 text

28 Thank You!