Slide 1

Slide 1 text

Billing the cloud
Real world stream processing

Slide 2

Slide 2 text

@pyr
- Co-Founder, CTO at Exoscale
- Open source developer

Slide 3

Slide 3 text

Tonight
- Problem domain
- Scaling methodologies
- Our approach

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Infrastructure isn't free!

Slide 8

Slide 8 text

Business Model
- Provide cloud infrastructure
- ???
- Profit!

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

10,000-mile-high view

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

- Quantities
- Resources

Slide 14

Slide 14 text

Quantities
- 10 megabytes have been sent from 159.100.251.251 over the last minute

Slide 15

Slide 15 text

Resources
- Account geneva-jug started instance foo with profile large today at 12:00
- Account geneva-jug stopped instance foo today at 12:15

Slide 16

Slide 16 text

A bit closer to reality

    {:type     :usage
     :entity   :vm
     :action   :create
     :time     #inst "2016-12-12T15:48:32.000-00:00"
     :template "ubuntu-16.04"
     :source   :cloudstack
     :account  "geneva-jug"
     :uuid     "7a070a3d-66ff-4658-ab08-fe3cecd7c70f"
     :version  1
     :offering "medium"}

Slide 17

Slide 17 text

A bit closer to reality

    message IPMeasure {
      /* Versioning */
      required uint32 header = 1;
      required uint32 saddr  = 2;
      required uint64 bytes  = 3;
      /* Validity */
      required uint64 start  = 4;
      required uint64 end    = 5;
    }

Slide 18

Slide 18 text

Theory

Slide 19

Slide 19 text

Quantities are simple
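One way to read "simple" here: quantity measurements are additive, so billing them is mostly a matter of summing per source. The sketch below assumes a hypothetical `Measure` record loosely modelled on the IPMeasure fields shown earlier (with the address kept as a string for readability); it is an illustration, not the code used in production.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Measure(NamedTuple):
    """Hypothetical stand-in for one IPMeasure record."""
    saddr: str   # source address, as a string for readability
    nbytes: int  # bytes sent during the window
    start: int   # window start (epoch seconds)
    end: int     # window end (epoch seconds)


def total_bytes(measures: Iterable[Measure]) -> Dict[str, int]:
    """Sum bytes per source address; arrival order does not matter."""
    totals: Dict[str, int] = defaultdict(int)
    for m in measures:
        totals[m.saddr] += m.nbytes
    return dict(totals)
```

Because addition is commutative and associative, late or reordered measurements give the same total, which is what makes this case easy.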

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Resources are harder

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

This is per-account

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

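Solving for all events

    resources = {}  # uuid -> start time of each running resource
    metering = []   # completed usage records

    def usage_metering():
        # fetch_all_events(), duration() and Usage are helpers
        # that the slide leaves undefined.
        for event in fetch_all_events():
            uuid = event.uuid()
            time = event.time()
            if event.action() == 'start':
                resources[uuid] = time
            else:
                # pop the start time so a stopped resource is not metered twice
                timespan = duration(resources.pop(uuid), time)
                usage = Usage(uuid, timespan)
                metering.append(usage)
        return metering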
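A self-contained version of the same idea might look like the sketch below. The `Event` and `Usage` types are assumptions made for illustration (the slide does not define them), events are passed in rather than fetched, and a stop with no matching start is simply ignored.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    uuid: str
    action: str     # 'start' or 'stop'
    time: datetime


class Usage(NamedTuple):
    uuid: str
    timespan: timedelta


def usage_metering(events: Iterable[Event]) -> List[Usage]:
    """Pair each stop with the preceding start of the same resource."""
    running: Dict[str, datetime] = {}
    metering: List[Usage] = []
    # Sort by time so that out-of-order input still pairs correctly.
    for event in sorted(events, key=lambda e: e.time):
        if event.action == "start":
            running[event.uuid] = event.time
        elif event.uuid in running:
            started = running.pop(event.uuid)
            metering.append(Usage(event.uuid, event.time - started))
        # A stop without a known start is skipped in this sketch.
    return metering
```

With the earlier example (instance foo started at 12:00 and stopped at 12:15), this yields a single usage record of 15 minutes.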

Slide 26

Slide 26 text

Practical matters
- This is a never-ending process
- Minute precision billing
- Only apply once an hour
- Avoid overbilling at all costs
- Avoid underbilling (we need to eat!)

Slide 27

Slide 27 text

Practical matters
- Keep a small operational footprint

Slide 28

Slide 28 text

A naive approach

Slide 29

Slide 29 text

32 * * * * usage-metering >/dev/null 2>&1

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Advantages

Slide 33

Slide 33 text

- Low operational overhead
- Simple functional boundaries
- Easy to test

Slide 34

Slide 34 text

Drawbacks
- High pressure on SQL server
- Hard to avoid overlapping jobs
- Overlaps result in longer metering intervals
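To make the overlap problem concrete, here is one common guard for cron-driven jobs: a non-blocking advisory file lock. This is a sketch under assumptions, not something the talk describes; the `single_run` helper and the lock path are hypothetical, and `fcntl` is available on Unix-like systems only.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def single_run(lock_path: str) -> Iterator[bool]:
    """Yield True if this run holds the lock, False if another run does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the descriptor releases the lock if this run held it.
        os.close(fd)
```

A job would wrap its work in `with single_run(path) as ok:` and return early when `ok` is false. Note that a skipped run is exactly the drawback listed above: the next successful run has a longer interval to meter.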

Slide 35

Slide 35 text

You are in a room full of overlapping cron jobs.
You can hear the screams of a dying MySQL server.
An Oracle vendor is here.
To the West, a door is marked "Map/Reduce".
To the East, a door is marked "Streaming".

Slide 36

Slide 36 text

> Talk to Oracle

Slide 37

Slide 37 text

You have been eaten by a grue.

Slide 38

Slide 38 text

> Go West

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

- Conceptually simple
- Spreads easily
- Data-locality aware processing

Slide 41

Slide 41 text

- ETL
- High latency
- High operational overhead

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

> Go East

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

Continuous computation on an unbounded stream

Slide 46

Slide 46 text

- Each event processed as it comes in
- Very low latency
- A never-ending reduce

Slide 47

Slide 47 text

    (reductions + [1 2 3 4])
    ;; => (1 3 6 10)
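For readers less familiar with Clojure, the closest Python standard-library equivalent of `reductions` is `itertools.accumulate`, which emits every intermediate result of a reduce rather than only the last one.

```python
from itertools import accumulate
from operator import add

# Each output element is the running result after one more input.
running_totals = list(accumulate([1, 2, 3, 4], add))
print(running_totals)  # [1, 3, 6, 10]
```

`accumulate` is lazy, so it can also be applied to an iterator that never ends, which is the sense in which stream processing is "a never-ending reduce".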

Slide 48

Slide 48 text

- Conceptually harder
- Where do we store intermediate results?
- How does data flow between computation steps?

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Deciding factors

Slide 51

Slide 51 text

Our shopping list

Slide 52

Slide 52 text

- Operational simplicity
- Integration through our whole stack
- Going beyond billing
- Room to grow

Slide 53

Slide 53 text

Operational simplicity
- Experience matters
- Spark and Storm are intimidating
- HBase & Hive discarded

Slide 54

Slide 54 text

Integration
- HDFS would require simple integration
- Spark usually goes hand in hand with Cassandra
- Storm tends to prefer Kafka

Slide 55

Slide 55 text

Room to grow
- A ton of logs
- A ton of metrics

Slide 56

Slide 56 text

Thursday confessions
- Previously knew Kafka

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

- Publish & Subscribe
- Processing
- Store

Slide 60

Slide 60 text

Publish & Subscribe
- Messages are produced to topics
- Topics have a predefined number of partitions
- Messages have a key which determines their partition
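The key-to-partition mapping can be sketched as a hash of the key modulo the partition count. The function below is illustrative only: it uses CRC32 as a stand-in hash (Kafka's Java client uses murmur2), and keying events by account name is an assumption, not something stated on the slide.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition in [0, num_partitions)."""
    # CRC32 is a stand-in; any stable hash gives the same property.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every message with the same key lands on the same partition,
# so one consumer sees all events for that key, in order.
p = partition_for("geneva-jug", 8)
```

This is also why the partition count is fixed up front: changing it would change where existing keys map.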

Slide 61

Slide 61 text

- Consumers get assigned a set of partitions
- Consumers store their last consumed offset
- Brokers own partitions, handle replication

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

- Stable consumer topology
- Memory disaggregation
- Can rely on in-memory storage

Slide 64

Slide 64 text

Stream expiry

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

Problem solved?

Slide 70

Slide 70 text

- Process crashes
- Undelivered message?
- Avoiding double billing

Slide 71

Slide 71 text

Process crashes
- Triggers a rebalance
- Loss of in-memory cache
- No initial state!

Slide 72

Slide 72 text

Reconciliation
- Snapshot of full inventory
- Converges stored resource state if necessary
- Handles failed deliveries as well
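The reconciliation step can be pictured as a diff between what the stream processor believes is running and a full inventory snapshot. The sketch below is a guess at the shape of that logic, not the actual implementation: the `reconcile` function, the uuid-to-offering maps, and the correction tuples are all hypothetical.

```python
from typing import Dict, List, Tuple

Correction = Tuple[str, str, str]  # (action, uuid, offering)


def reconcile(
    stored: Dict[str, str], snapshot: Dict[str, str]
) -> Tuple[Dict[str, str], List[Correction]]:
    """Converge stored state (uuid -> offering) onto the snapshot."""
    corrections: List[Correction] = []
    for uuid, offering in snapshot.items():
        if stored.get(uuid) != offering:
            # Missed create, or the offering changed without an event.
            corrections.append(("start", uuid, offering))
    for uuid, offering in stored.items():
        if uuid not in snapshot:
            # Missed destroy: the resource no longer exists.
            corrections.append(("stop", uuid, offering))
    return dict(snapshot), corrections
```

Starting from an empty `stored` map, the same function rebuilds state after a crash, since every resource in the snapshot shows up as a correction.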

Slide 73

Slide 73 text

Avoiding double billing
- Reconciler acts as logical clock
- When supplying usage, attach a unique transaction ID
- Reject multiple transaction attempts on a single ID
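Rejecting repeated transaction IDs is a standard idempotency pattern, and a minimal in-memory sketch is shown below. The `Ledger` class and its fields are hypothetical, and how the real system derives its transaction IDs is not specified on the slide, so they are treated as opaque strings here.

```python
from collections import defaultdict
from typing import Dict, Set


class Ledger:
    """Applies each usage transaction at most once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.minutes: Dict[str, int] = defaultdict(int)

    def apply(self, transaction_id: str, account: str, minutes: int) -> bool:
        """Record usage; return False if this ID was already applied."""
        if transaction_id in self._seen:
            return False  # replayed delivery: do not bill again
        self._seen.add(transaction_id)
        self.minutes[account] += minutes
        return True
```

With this in place a retried or replayed message is harmless: the second `apply` with the same ID leaves the account total unchanged.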

Slide 74

Slide 74 text

Looking back
- Things stay simple (roughly 600 LoC)
- Room to grow
- Stable and resilient
- DNS, Logs, Metrics, Event Sourcing

Slide 75

Slide 75 text

What about batch?
- Streaming doesn't work for everything
- Sometimes throughput matters more than latency
- Building models in batch, applying with stream processing

Slide 76

Slide 76 text

Questions?
Thanks!