Billing the Cloud

@pyr Billing the cloud Real world stream processing

@pyr Three-line bio • CTO & co-founder at Exoscale • Open Source Developer • Monitoring & Distributed Systems Enthusiast

@pyr • Billing resources • Scaling methodologies • Our approach

@pyr
provider "exoscale" {
  api_key    = "${var.exoscale_api_key}"
  secret_key = "${var.exoscale_secret_key}"
}
resource "exoscale_instance" "web" {
  template  = "ubuntu 17.04"
  disk_size = "50g"
  profile   = "medium"
  ssh_key   = "production"
}

@pyr Infrastructure isn’t free! (sorry)

@pyr Business Model • Provide cloud infrastructure • (???) • Profit!

@pyr 10,000-mile-high view

Quantities

Quantities • 10 megabytes have been sent from 159.100.251.251 over the last minute

Resources

Resources • Account WAD started instance foo with profile large today at 12:00 • Account WAD stopped instance foo today at 12:15

A bit closer to reality
{:type     :usage
 :entity   :vm
 :action   :create
 :time     #inst "2016-12-12T15:48:32.000-00:00"
 :template "ubuntu-16.04"
 :source   :cloudstack
 :account  "geneva-jug"
 :uuid     "7a070a3d-66ff-4658-ab08-fe3cecd7c70f"
 :version  1
 :offering "medium"}

A bit closer to reality
message IPMeasure {
  /* Versioning */
  required uint32 header = 1;
  required uint32 saddr  = 2;
  required uint64 bytes  = 3;
  /* Validity */
  required uint64 start  = 4;
  required uint64 end    = 5;
}

@pyr Theory

@pyr Quantities are simple
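Quantities are simple because each record already carries the amount to bill, so metering reduces to a sum per source. A minimal sketch, assuming records shaped like the IPMeasure message shown earlier; the `IPMeasure` dataclass and the `bytes_per_source` helper are hypothetical names for illustration, not Exoscale's code:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IPMeasure:
    saddr: int   # source address as an integer, as in the message definition
    bytes: int   # bytes transferred during the validity window
    start: int   # window start (assumed here to be epoch seconds)
    end: int     # window end (assumed here to be epoch seconds)


def bytes_per_source(measures):
    """Sum transferred bytes per source address."""
    totals = defaultdict(int)
    for m in measures:
        totals[m.saddr] += m.bytes
    return dict(totals)


measures = [
    IPMeasure(saddr=1, bytes=10_000_000, start=0, end=60),
    IPMeasure(saddr=1, bytes=5_000_000, start=60, end=120),
    IPMeasure(saddr=2, bytes=1_000, start=0, end=60),
]
totals = bytes_per_source(measures)
```

No pairing of events is needed here: every record can be added to the total on its own.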

@pyr Resources are harder

@pyr This is per account

@pyr Solving for all events

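# Pair each start event with its matching stop event to produce usage records.
# fetch_all_events, duration and Usage are the helpers assumed by the slide.
def usage_metering():
    resources = {}  # uuid -> start time of resources currently running
    metering = []   # completed usage records
    for event in fetch_all_events():
        uuid = event.uuid()
        time = event.time()
        if event.action() == 'start':
            resources[uuid] = time
        elif uuid in resources:
            # Stop event: close the interval opened by the matching start.
            timespan = duration(resources.pop(uuid), time)
            metering.append(Usage(uuid, timespan))
    return metering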

@pyr In Practice

@pyr • This is a never-ending process • Minute-precision billing • Applied every hour
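To make "minute precision, applied every hour" concrete, here is a rough sketch of how many minutes of a resource's lifetime fall inside one hourly window. The `billable_minutes` helper is hypothetical, and rounding down to whole minutes is an assumption; the talk does not state a rounding policy.

```python
from datetime import datetime, timedelta


def billable_minutes(started, stopped, window_start, window=timedelta(hours=1)):
    """Whole minutes a resource ran inside [window_start, window_start + window)."""
    window_end = window_start + window
    begin = max(started, window_start)
    # A resource that is still running (stopped is None) is billed up to the window end.
    end = window_end if stopped is None else min(stopped, window_end)
    if end <= begin:
        return 0
    return int((end - begin).total_seconds() // 60)


hour = datetime(2016, 12, 12, 12, 0)
# Started at 12:00 and stopped at 12:15, like instance foo in the earlier example.
ran = billable_minutes(datetime(2016, 12, 12, 12, 0), datetime(2016, 12, 12, 12, 15), hour)
# Started before the window and still running at the end of it.
still_running = billable_minutes(datetime(2016, 12, 12, 11, 30), None, hour)
```

Clamping the interval to the window is what lets a long-running resource be billed hour after hour without waiting for its stop event.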

@pyr • Avoid overbilling at all costs • Avoid underbilling (we need to eat!)

@pyr • Keep a small operational footprint

@pyr A naive approach

30 * * * * usage-metering >/dev/null 2>&1

@pyr Advantages

@pyr • Low operational overhead • Simple functional boundaries • Easy to test

@pyr Drawbacks

@pyr • High pressure on SQL server • Hard to avoid overlapping jobs • Overlaps result in longer metering intervals

You are in a room full of overlapping cron jobs.
You can hear the screams of a dying MySQL server.
An Oracle vendor is here.
To the West, a door is marked “Map/Reduce”.
To the East, a door is marked “Stream Processing”.

> Talk to Oracle

You’ve been eaten by a grue.

> Go West

@pyr • Conceptually simple • Spreads easily • Data-locality-aware processing

@pyr • ETL • High latency • High operational overhead

> Go East

@pyr • Continuous computation on an unbounded stream • Each record processed as it arrives • Very low latency
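The difference from the batch loop can be sketched with a generator that emits a usage record the moment a stop event arrives, instead of returning one list at the end. The tuple-shaped events and the `stream_usage` name are assumptions made for illustration.

```python
def stream_usage(events):
    """Yield (uuid, seconds) as soon as a stop event closes a running resource."""
    running = {}  # intermediate state: uuid -> start timestamp
    for action, uuid, ts in events:
        if action == "start":
            running[uuid] = ts
        elif action == "stop" and uuid in running:
            yield uuid, ts - running.pop(uuid)


# The iterator stands in for an unbounded stream; here it happens to end.
events = iter([("start", "foo", 0), ("start", "bar", 60), ("stop", "foo", 900)])
results = list(stream_usage(events))
```

The `running` dictionary is the intermediate state that a real deployment has to keep somewhere durable.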

@pyr • Conceptually harder • Where do we store intermediate results? • How does data flow between computation steps?

@pyr Deciding factors

@pyr Our shopping list • Operational simplicity • Integration through our whole stack • Room to grow

@pyr Operational simplicity • Experience matters • Spark and Storm are intimidating • HBase & Hive discarded

@pyr Integration • HDFS & Kafka require simple integration • Spark goes hand in hand with Cassandra

@pyr Room to grow • A ton of logs • A ton of metrics

@pyr Small confession • Previously knew Kafka

@pyr • Publish & Subscribe • Processing • Store

@pyr Publish & Subscribe • Records are produced on topics • Topics have a predefined number of partitions • Records have a key which determines their partition
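The key-to-partition mapping can be sketched as a hash of the key modulo the partition count. `zlib.crc32` is only a stand-in here (Kafka's default partitioner uses a different hash function), and `partition_for` is a hypothetical helper, not part of any Kafka client.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions)."""
    # crc32 is an illustrative hash, chosen because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Every record keyed by the same account lands on the same partition.
first = partition_for("geneva-jug", 8)
second = partition_for("geneva-jug", 8)
```

Because the mapping is deterministic, all events for one key are seen in order by whichever consumer owns that partition.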

@pyr • Consumers get assigned a set of partitions • Consumers store their last consumed offset • Brokers own partitions, handle replication
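Storing the last consumed offset is what lets a consumer resume where it left off. The toy in-memory classes below (`PartitionLog`, `Consumer`) are hypothetical and only illustrate the idea; they are not the Kafka client API.

```python
class PartitionLog:
    """Append-only list of records; a record's index is its offset."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset):
        return list(enumerate(self.records))[offset:]


class Consumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0  # next offset to read

    def poll(self):
        seen = []
        for offset, record in self.log.read_from(self.committed):
            seen.append(record)
            self.committed = offset + 1  # remember progress after each record
        return seen


log = PartitionLog()
log.append("start foo")
log.append("stop foo")
consumer = Consumer(log)
first_batch = consumer.poll()
log.append("start bar")
second_batch = consumer.poll()
```

The second poll returns only the record appended after the first one, since the committed offset already covers the earlier records.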

@pyr • Stable consumer topology • Memory disaggregation • Can rely on in-memory storage • Age expiry and log compaction
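Log compaction keeps the most recent record for each key, which is why a consumer can rebuild its in-memory state by replaying a topic. A simplified sketch follows; the real mechanism works on log segments in the background and also handles deletions, which this hypothetical `compact` function ignores.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    ordered = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in ordered]


log = [("foo", "started"), ("bar", "started"), ("foo", "stopped")]
compacted = compact(log)
```

After compaction, replaying the log yields one current value per resource instead of its whole history.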

@pyr Billing at Exoscale

@pyr Problem solved?

@pyr • Process crashes • Undelivered message? • Avoiding overbilling

@pyr Reconciliation • Snapshot of full inventory • Converges stored resource state if necessary • Handles failed deliveries as well
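Reconciliation can be pictured as a diff between the inventory snapshot and the stored resource state, turned into corrective events. This is a sketch under assumptions: both sides are modelled as dictionaries keyed by resource UUID, and `reconcile` is a hypothetical name.

```python
def reconcile(inventory, stored):
    """Return corrective events that make `stored` match the `inventory` snapshot."""
    corrections = []
    # Running according to the inventory but unknown to billing: a start was missed.
    for uuid in sorted(inventory.keys() - stored.keys()):
        corrections.append(("start", uuid))
    # Known to billing but absent from the inventory: a stop was missed.
    for uuid in sorted(stored.keys() - inventory.keys()):
        corrections.append(("stop", uuid))
    return corrections


inventory = {"a": "medium", "b": "large"}
stored = {"b": "large", "c": "small"}
fixes = reconcile(inventory, stored)
```

When both sides already agree, the function returns an empty list, so running it repeatedly is harmless.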

@pyr Avoiding overbilling • Reconciler acts as logical clock • When supplying usage, attach a unique transaction ID • Reject multiple transaction attempts on a single ID
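Rejecting repeated transaction IDs makes usage submission idempotent, so a retry after a crash cannot bill twice. Below is a minimal in-memory sketch; the `UsageLedger` class and the format of the ID string are assumptions, and a real system would need durable storage for the set of seen IDs.

```python
class UsageLedger:
    def __init__(self):
        self.seen = set()   # transaction IDs already accepted
        self.entries = []   # accepted (account, minutes) usage records

    def submit(self, transaction_id, account, minutes):
        """Record usage once per transaction ID; return False for a duplicate."""
        if transaction_id in self.seen:
            return False
        self.seen.add(transaction_id)
        self.entries.append((account, minutes))
        return True


ledger = UsageLedger()
accepted = ledger.submit("foo:2016-12-12T12", "WAD", 15)
retried = ledger.submit("foo:2016-12-12T12", "WAD", 15)
```

The retry is refused and the ledger still holds a single entry for that window.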

@pyr Parting words

@pyr Looking back • Things stay simple (roughly 600 LoC) • Room to grow • Stable and resilient • DNS, Logs, Metrics, Event Sourcing

@pyr What about batch? • Streaming doesn’t work for everything • Sometimes throughput matters more than latency • Building models in batch, applying with stream processing

@pyr Thanks! Questions?


More Decks by Pierre-Yves Ritschard
