Slide 1

Slide 1 text

@pyr Billing the cloud: Real-world stream processing

Slide 2

Slide 2 text

@pyr Three-line bio ● CTO & co-founder at Exoscale ● Open Source Developer ● Monitoring & Distributed Systems Enthusiast

Slide 3

Slide 3 text

@pyr Billing the cloud: Real-world stream processing

Slide 4

Slide 4 text

@pyr ● Billing resources ● Scaling methodologies ● Our approach

Slide 5

Slide 5 text

@pyr

Slide 6

Slide 6 text

@pyr provider "exoscale" { api_key = "${var.exoscale_api_key}" secret_key = "${var.exoscale_secret_key}" } resource "exoscale_instance" "web" { template = "ubuntu 17.04" disk_size = "50g" template = "ubuntu 17.04" profile = "medium" ssh_key = "production" }

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

@pyr Infrastructure isn’t free! (sorry)

Slide 10

Slide 10 text

@pyr Business Model ● Provide cloud infrastructure ● (???) ● Profit!

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

@pyr 10,000-mile-high view

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Quantities

Slide 16

Slide 16 text

Quantities ● 10 megabytes have been sent from 159.100.251.251 over the last minute

Slide 17

Slide 17 text

Resources

Slide 18

Slide 18 text

Resources ● Account WAD started instance foo with profile large today at 12:00 ● Account WAD stopped instance foo today at 12:15

Slide 19

Slide 19 text

A bit closer to reality

{:type     :usage
 :entity   :vm
 :action   :create
 :time     #inst "2016-12-12T15:48:32.000-00:00"
 :template "ubuntu-16.04"
 :source   :cloudstack
 :account  "geneva-jug"
 :uuid     "7a070a3d-66ff-4658-ab08-fe3cecd7c70f"
 :version  1
 :offering "medium"}

Slide 20

Slide 20 text

A bit closer to reality

message IPMeasure {
  /* Versioning */
  required uint32 header = 1;
  required uint32 saddr  = 2;
  required uint64 bytes  = 3;
  /* Validity */
  required uint64 start  = 4;
  required uint64 end    = 5;
}

Slide 21

Slide 21 text

@pyr Theory

Slide 22

Slide 22 text

@pyr Quantities are simple
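
Quantities are additive, which is what keeps them simple: metering boils down to summing amounts per key and time window. Below is a minimal Python sketch of that idea; it assumes IPMeasure-style records exposing saddr, bytes and start, and the one-minute bucketing is illustrative rather than the production logic.

from collections import defaultdict

def aggregate_quantities(measures, window=60):
    # Sum bytes per (source address, time window) bucket.
    # `measures` is any iterable of objects exposing saddr, bytes and
    # start (epoch seconds), mirroring the IPMeasure record shown earlier.
    totals = defaultdict(int)
    for m in measures:
        bucket = m.start - (m.start % window)   # align to the window start
        totals[(m.saddr, bucket)] += m.bytes    # addition is all we need
    return totals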

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

@pyr Resources are harder

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

@pyr This is per account

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

@pyr Solving for all events

Slide 29

Slide 29 text

resources = {}
metering  = []

def usage_metering():
    # Walk every event recorded so far, pairing starts with stops.
    for event in fetch_all_events():
        uuid = event.uuid()
        time = event.time()
        if event.action() == 'start':
            # Remember when this resource was started.
            resources[uuid] = time
        else:
            # On stop, bill the elapsed timespan.
            timespan = duration(resources[uuid], time)
            usage = Usage(uuid, timespan)
            metering.append(usage)
    return metering

Slide 30

Slide 30 text

@pyr In Practice

Slide 31

Slide 31 text

@pyr ● This is a never-ending process ● Minute-precision billing ● Applied every hour

Slide 32

Slide 32 text

@pyr ● Avoid overbilling at all cost ● Avoid underbilling (we need to eat!)

Slide 33

Slide 33 text

@pyr ● Keep a small operational footprint

Slide 34

Slide 34 text

@pyr A naive approach

Slide 35

Slide 35 text

# Run usage metering at minute 30 of every hour, discarding all output.
30 * * * * usage-metering >/dev/null 2>&1

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

@pyr Advantages

Slide 38

Slide 38 text

@pyr ● Low operational overhead ● Simple functional boundaries ● Easy to test

Slide 39

Slide 39 text

@pyr Drawbacks

Slide 40

Slide 40 text

@pyr ● High pressure on SQL server ● Hard to avoid overlapping jobs ● Overlaps result in longer metering intervals

Slide 41

Slide 41 text

You are in a room full of overlapping cron jobs.
You can hear the screams of a dying MySQL server.
An Oracle vendor is here.
To the West, a door is marked “Map/Reduce”.
To the East, a door is marked “Stream Processing”.

Slide 42

Slide 42 text

> Talk to Oracle

Slide 43

Slide 43 text

You’ve been eaten by a grue.

Slide 44

Slide 44 text

> Go West

Slide 45

Slide 45 text

@pyr

Slide 46

Slide 46 text

@pyr ● Conceptually simple ● Spreads easily ● Data locality aware processing

Slide 47

Slide 47 text

@pyr ● ETL ● High latency ● High operational overhead

Slide 48

Slide 48 text

> Go East

Slide 49

Slide 49 text

@pyr

Slide 50

Slide 50 text

@pyr ● Continuous computation on an unbounded stream ● Each record processed as it arrives ● Very low latency
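
To make the contrast with the earlier batch loop concrete, here is a minimal Python sketch of per-record processing. It assumes state fits in a plain dict, and that duration, Usage and emit_usage stand in for the same helpers as in the batch pseudocode; the names are illustrative, not the production code.

# In-memory intermediate state: start time of each running resource.
running = {}

def handle_event(event):
    # Called once per record, as it arrives; no "fetch everything" step.
    uuid = event.uuid()
    if event.action() == 'start':
        running[uuid] = event.time()
    else:
        # Emit usage as soon as the stop event shows up.
        timespan = duration(running.pop(uuid), event.time())
        emit_usage(Usage(uuid, timespan))   # e.g. hand off to the next stage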

Slide 51

Slide 51 text

@pyr ● Conceptually harder ● Where do we store intermediate results? ● How does data flow between computation steps?

Slide 52

Slide 52 text

@pyr Deciding factors

Slide 53

Slide 53 text

@pyr Our shopping list ● Operational simplicity ● Integration through our whole stack ● Room to grow

Slide 54

Slide 54 text

@pyr Operational simplicity ● Experience matters ● Spark and Storm are intimidating ● HBase & Hive discarded

Slide 55

Slide 55 text

@pyr Integration ● HDFS & Kafka are simple to integrate ● Spark goes hand in hand with Cassandra

Slide 56

Slide 56 text

@pyr Room to grow ● A ton of logs ● A ton of metrics

Slide 57

Slide 57 text

@pyr Small confession ● Prior experience with Kafka

Slide 58

Slide 58 text

@pyr

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

@pyr ● Publish & Subscribe ● Processing ● Store

Slide 61

Slide 61 text

@pyr Publish & Subscribe ● Records are produced on topics ● Topics have a predefined number of partitions ● Records have a key which determines their partition

Slide 62

Slide 62 text

@pyr ● Consumers get assigned a set of partitions ● Consumers store their last consumed offset ● Brokers own partitions, handle replication
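
The consuming side, in the same hedged spirit (kafka-python again, names illustrative): the brokers assign partitions to the members of the consumer group, and committing only after a record has been processed is what lets a restarted consumer resume from its last stored offset.

from kafka import KafkaConsumer   # kafka-python client

consumer = KafkaConsumer(
    'resource-events',                    # same hypothetical topic as above
    group_id='usage-metering',            # partitions are shared within the group
    bootstrap_servers='localhost:9092',
    enable_auto_commit=False,             # commit offsets only after processing
)

for record in consumer:
    process(record.value)                 # stand-in handler; deserialisation omitted
    consumer.commit()                     # store the last consumed offset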

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

@pyr ● Stable consumer topology ● Memory disaggregation ● Can rely on in-memory storage ● Age expiry and log compaction

Slide 65

Slide 65 text

@pyr

Slide 66

Slide 66 text

@pyr Billing at Exoscale

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

@pyr Problem solved?

Slide 71

Slide 71 text

@pyr ● Process crashes ● Undelivered message? ● Avoiding overbilling

Slide 72

Slide 72 text

@pyr Reconciliation ● Snapshot of full inventory ● Converges stored resource state if necessary ● Handles failed deliveries as well
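
A minimal sketch of the reconciliation idea, assuming both the inventory snapshot and the stream-derived state are dicts keyed by resource uuid (the names and corrective actions are illustrative only):

def reconcile(snapshot, state):
    # Converge stream-derived `state` onto the full inventory `snapshot`.
    corrections = []
    # Running according to the stream, but gone from the inventory:
    for uuid in set(state) - set(snapshot):
        corrections.append(('stop', uuid))    # a stop event was missed
    # Present in the inventory, but unknown to the stream:
    for uuid in set(snapshot) - set(state):
        corrections.append(('start', uuid))   # a start event was missed
    return corrections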

Slide 73

Slide 73 text

@pyr Avoiding overbilling ● Reconciler acts as logical clock ● When supplying usage, attach a unique transaction ID ● Reject multiple transaction attempts on a single ID

Slide 74

Slide 74 text

@pyr Avoiding overbilling ● Reconciler acts as logical clock ● When supplying usage, attach a unique transaction ID ● Reject multiple transaction attempts on a single ID
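
A sketch of the deduplication idea, assuming usage is applied through a single function and that the set of seen transaction IDs lives in durable storage in production (here it is just an in-memory set, and bill is a hypothetical call):

applied_transactions = set()    # durable storage in production

def apply_usage(transaction_id, usage):
    # Apply a usage record at most once: a replayed transaction ID
    # (e.g. after a crash and retry) is rejected instead of billed twice.
    if transaction_id in applied_transactions:
        return False
    bill(usage)
    applied_transactions.add(transaction_id)
    return True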

Slide 75

Slide 75 text

@pyr Parting words

Slide 76

Slide 76 text

@pyr Looking back ● Things stay simple (roughly 600 LoC) ● Room to grow ● Stable and resilient ● DNS, Logs, Metrics, Event Sourcing

Slide 77

Slide 77 text

@pyr What about batch? ● Streaming doesn’t work for everything ● Sometimes throughput matters more than latency ● Building models in batch, applying with stream processing

Slide 78

Slide 78 text

@pyr Thanks! Questions?