Billing the Cloud

This talk describes how Exoscale approaches usage metering and billing with Apache Kafka.

Pierre-Yves Ritschard

December 15, 2016

Transcript

  1. Billing the cloud: real-world stream processing

  2. @pyr: Co-Founder and CTO at Exoscale, open-source developer
  3. Tonight: problem domain, scaling methodologies, our approach

  7. Infrastructure isn't free!

  8. Business model: provide cloud infrastructure, ???, profit!

  11. The 10,000-mile-high view

  13. Quantities and Resources

  14. Quantities: "10 megabytes have been sent from 159.100.251.251 over the last minute"
  15. Resources: "Account geneva-jug started instance foo with profile large today at 12:00. Account geneva-jug stopped instance foo today at 12:15."
  16. A bit closer to reality

    {:type     :usage
     :entity   :vm
     :action   :create
     :time     #inst "2016-12-12T15:48:32.000-00:00"
     :template "ubuntu-16.04"
     :source   :cloudstack
     :account  "geneva-jug"
     :uuid     "7a070a3d-66ff-4658-ab08-fe3cecd7c70f"
     :version  1
     :offering "medium"}
  17. A bit closer to reality

    message IPMeasure {
      /* Versioning */
      required uint32 header = 1;
      required uint32 saddr  = 2;
      required uint64 bytes  = 3;
      /* Validity */
      required uint64 start  = 4;
      required uint64 end    = 5;
    }
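Quantities like these IPMeasure windows aggregate with a plain sum per key. A minimal Python sketch (mine, not from the talk; plain dicts stand in for the decoded protobuf messages, and `aggregate_bytes` is a hypothetical name):

```python
from collections import defaultdict

def aggregate_bytes(measures):
    """Sum bytes per source address from IPMeasure-like records."""
    totals = defaultdict(int)
    for m in measures:
        totals[m["saddr"]] += m["bytes"]
    return dict(totals)
```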
  18. Theory

  19. Quantities are simple

  21. Resources are harder

  23. This is per-account

  25. Solving for all events

    resources = {}
    metering = []

    def usage_metering():
        for event in fetch_all_events():
            uuid = event.uuid()
            time = event.time()
            if event.action() == 'start':
                resources[uuid] = time
            else:
                # On stop, compute how long the resource ran and
                # drop the stored start time.
                timespan = duration(resources.pop(uuid), time)
                metering.append(Usage(uuid, timespan))
        return metering
  26. Practical matters: this is a never-ending process. Minute-precision billing, applied only once an hour. Avoid over-billing at all cost; avoid under-billing (we need to eat!).
  27. Practical matters: keep a small operational footprint

  28. A naive approach

  29. 32 * * * * usage-metering >/dev/null 2>&1


  32. Advantages

  33. Low operational overhead; simple functional boundaries; easy to test

  34. Drawbacks: high pressure on the SQL server; hard to avoid overlapping jobs; overlaps result in longer metering intervals
  35. You are in a room full of overlapping cron jobs. You can hear the screams of a dying MySQL server. An Oracle vendor is here. To the West, a door is marked "Map/Reduce". To the East, a door is marked "Streaming".
  36. > Talk to Oracle

  37. You have been eaten by a grue.

  38. > Go West

  40. Conceptually simple; spreads easily; data-locality-aware processing

  41. ETL; high latency; high operational overhead


  43. > Go East

  45. Continuous computation on an unbounded stream

  46. Each event is processed as it comes in: very low latency; a never-ending reduce
  47. (reductions + [1 2 3 4]) ;; => (1 3 6 10)
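The same idea in Python, for readers who don't speak Clojure: `itertools.accumulate` is the standard-library analogue of `reductions` (this sketch is mine, not from the talk):

```python
from itertools import accumulate

def running_sums(xs):
    # Each element folds into the previous accumulator, yielding
    # every intermediate result instead of only the final sum.
    return accumulate(xs)

print(list(running_sums([1, 2, 3, 4])))  # → [1, 3, 6, 10]
```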
  48. Conceptually harder: where do we store intermediate results? How does data flow between computation steps?

  50. Deciding factors

  51. Our shopping list

  52. Operational simplicity; integration through our whole stack; going beyond billing; room to grow
  53. Operational simplicity: experience matters. Spark and Storm are intimidating; HBase & Hive discarded.
  54. Integration: HDFS would require simple integration; Spark usually goes hand in hand with Cassandra; Storm tends to prefer Kafka
  55. Room to grow: a ton of logs, a ton of metrics
  56. Thursday confessions: previously knew Kafka


  59. Publish & Subscribe; Processing; Store

  60. Publish & Subscribe: messages are produced to topics; topics have a predefined number of partitions; each message has a key which determines its partition
  61. Consumers get assigned a set of partitions; consumers store their last consumed offset; brokers own partitions and handle replication
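The key-to-partition rule can be sketched as a stable hash modulo the partition count. This is a simplification: Kafka's default partitioner actually uses murmur2, and `partition_for` is a hypothetical name, not a Kafka API.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # A stable hash of the key modulo the partition count. The same
    # key always lands on the same partition, so a single consumer
    # sees one account's events in order.
    return zlib.crc32(key) % num_partitions
```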

  63. Stable consumer topology; memory disaggregation; can rely on in-memory storage
  64. Stream expiry


  69. Problem solved?

  70. Process crashes; undelivered messages; avoiding double billing

  71. Process crashes: triggers a rebalance; loss of in-memory cache; no initial state!
  72. Reconciliation: a snapshot of the full inventory converges stored resource state if necessary, and handles failed deliveries as well
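One way to picture the reconciler, as a sketch rather than Exoscale's actual code: diff the in-memory state against the inventory snapshot and let the snapshot win. The function name and dict shapes here are assumptions for illustration.

```python
def reconcile(stored, snapshot):
    """Converge in-memory resource state onto a full-inventory
    snapshot; the snapshot is authoritative."""
    # Resources the stream missed (e.g. a dropped 'start' event).
    missing = {k: v for k, v in snapshot.items() if k not in stored}
    # Resources the stream believes exist but the inventory does not.
    stale = {k: v for k, v in stored.items() if k not in snapshot}
    return dict(snapshot), missing, stale
```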
  73. Avoiding double billing: the reconciler acts as a logical clock. When supplying usage, attach a unique transaction ID; reject multiple transaction attempts on a single ID.
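A minimal sketch of that idempotency rule (a hypothetical `UsageLedger`, not the production code): remember which transaction IDs have already been applied and reject repeats, so a retried delivery can never bill twice.

```python
class UsageLedger:
    def __init__(self):
        self._seen = set()   # transaction IDs already applied
        self.records = []    # accepted usage records

    def record(self, tx_id, usage):
        # Reject any second attempt on the same transaction ID.
        if tx_id in self._seen:
            return False
        self._seen.add(tx_id)
        self.records.append(usage)
        return True
```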
  74. Looking back: things stay simple (roughly 600 LoC); room to grow; stable and resilient; DNS, logs, metrics, event sourcing
  75. What about batch? Streaming doesn't work for everything; sometimes throughput matters more than latency. Build models in batch, apply them with stream processing.
  76. 76 . 1 Questions? Thanks!