Billing the Cloud

This talk describes how Exoscale approaches usage metering and billing with Apache Kafka.

Pierre-Yves Ritschard

December 15, 2016

Transcript

  1. Billing the cloud
    Real world stream processing

  2. @pyr
    Co-Founder, CTO at Exoscale
    Open source developer

  3. Tonight
    Problem domain
    Scaling methodologies
    Our approach

  7. Infrastructure isn't free!

  8. Business Model
    Provide cloud infrastructure
    ???
    Profit!

  11. 10000 mile high view

  13. Quantities
    Resources

  14. Quantities
    10 megabytes have been sent from
    159.100.251.251 over the last minute

  15. Resources
    Account geneva-jug started instance foo
    with profile large today at 12:00
    Account geneva-jug stopped instance foo
    today at 12:15

  16. A bit closer to reality
    {:type :usage
     :entity :vm
     :action :create
     :time #inst "2016-12-12T15:48:32.000-00:00"
     :template "ubuntu-16.04"
     :source :cloudstack
     :account "geneva-jug"
     :uuid "7a070a3d-66ff-4658-ab08-fe3cecd7c70f"
     :version 1
     :offering "medium"}

  17. A bit closer to reality
    message IPMeasure {
      /* Versioning */
      required uint32 header = 1;
      required uint32 saddr = 2;
      required uint64 bytes = 3;
      /* Validity */
      required uint64 start = 4;
      required uint64 end = 5;
    }

  18. Theory

  19. Quantities are simple
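
One plausible reading of this slide is that quantities only need to be summed per source. A minimal Python sketch, assuming measurements shaped like the `IPMeasure` message shown earlier; the `IPMeasure` dataclass, its `nbytes` field name, and the `total_bytes_by_source` helper are illustrative names and not taken from the talk:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IPMeasure:
    """Illustrative stand-in for the protobuf message, not a real binding."""
    saddr: str    # source address, kept as a string for readability
    nbytes: int   # bytes sent during the interval
    start: int    # interval start, epoch seconds
    end: int      # interval end, epoch seconds


def total_bytes_by_source(measures):
    """Sum byte counts per source address; arrival order does not matter."""
    totals = defaultdict(int)
    for m in measures:
        totals[m.saddr] += m.nbytes
    return dict(totals)


measures = [
    IPMeasure("159.100.251.251", 10_000_000, 0, 60),
    IPMeasure("159.100.251.251", 2_500_000, 60, 120),
    IPMeasure("159.100.251.250", 1_000, 0, 60),
]
totals = total_bytes_by_source(measures)
```

Since addition is commutative, measurements can arrive late or out of order without changing the total, which may be why quantities are the easy case.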

  21. Resources are harder

  23. This is per-account

  25. Solving for all events
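    resources = {}  # uuid -> start time of each running resource
    metering = []   # completed usage records

    def usage_metering():
        # fetch_all_events, duration and Usage are left undefined, as on the slide
        for event in fetch_all_events():
            uuid = event.uuid()
            time = event.time()
            if event.action() == 'start':
                resources[uuid] = time
            else:
                # any other action closes the interval opened by the start
                timespan = duration(resources[uuid], time)
                usage = Usage(uuid, timespan)
                metering.append(usage)
        return metering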
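
The pseudocode above can be made runnable along these lines. This is a sketch under assumptions: the `Event` and `Usage` tuples, the `"start"`/`"stop"` action names, whole-minute durations, and the choice to skip a stop that has no matching start are all illustrative, not taken from the talk.

```python
from collections import namedtuple
from datetime import datetime

Event = namedtuple("Event", "uuid action time")   # hypothetical event shape
Usage = namedtuple("Usage", "uuid minutes")       # hypothetical usage record


def usage_metering(events):
    """Pair each start with the next stop seen for the same uuid."""
    running = {}    # uuid -> start time
    metering = []
    for event in sorted(events, key=lambda e: e.time):
        if event.action == "start":
            running[event.uuid] = event.time
        elif event.uuid in running:
            started = running.pop(event.uuid)
            minutes = int((event.time - started).total_seconds() // 60)
            metering.append(Usage(event.uuid, minutes))
        # a stop without a known start is skipped rather than guessed at
    return metering


events = [
    Event("foo", "start", datetime(2016, 12, 15, 12, 0)),
    Event("foo", "stop", datetime(2016, 12, 15, 12, 15)),
    Event("bar", "stop", datetime(2016, 12, 15, 12, 20)),
]
result = usage_metering(events)
```

With the sample events, instance `foo` is metered for 15 minutes and the orphan stop for `bar` produces no usage record.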

  26. Practical matters
    This is a never-ending process
    Minute precision billing
    Only apply once an hour
    Avoid overbilling at all costs
    Avoid underbilling (we need to eat!)

  27. Practical matters
    Keep a small operational footprint

  28. A naive approach

  29. 32 * * * * usage-metering >/dev/null 2>&1

  32. Advantages

  33. Low operational overhead
    Simple functional boundaries
    Easy to test

  34. Drawbacks
    High pressure on SQL server
    Hard to avoid overlapping jobs
    Overlaps result in longer metering intervals

  35. You are in a room full of overlapping cron jobs.
    You can hear the screams of a dying MySQL server.
    An Oracle vendor is here.
    To the West, a door is marked "Map/Reduce"
    To the East, a door is marked "Streaming"

  36. > Talk to Oracle

  37. You have been eaten by a grue.

  38. > Go West

  40. Conceptually simple
    Spreads easily
    Data-locality aware processing

  41. ETL
    High latency
    High operational overhead

  43. > Go East

  45. Continuous computation on an unbounded stream

  46. Each event processed as it comes in
    Very low latency
    A never-ending reduce

  47. (reductions + [1 2 3 4]) ;; => (1 3 6 10)
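
For readers more at home in Python, `itertools.accumulate` behaves like Clojure's `reductions` in this case: it yields every intermediate result of the reduce rather than only the final one. The `running_totals` name is made up for this sketch.

```python
from itertools import accumulate
import operator


def running_totals(stream):
    """Lazily yield the running sum after each element of the stream."""
    return accumulate(stream, operator.add)


totals = list(running_totals([1, 2, 3, 4]))
print(totals)  # [1, 3, 6, 10]
```

Because `accumulate` is lazy, the same function also accepts a generator that never terminates, which is the sense in which a stream processor is a never-ending reduce.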

  48. Conceptually harder
    Where do we store intermediate results?
    How does data flow between computation steps?

  50. Deciding factors

  51. Our shopping list

  52. Operational simplicity
    Integration through our whole stack
    Going beyond billing
    Room to grow

  53. Operational simplicity
    Experience matters
    Spark and Storm are intimidating
    HBase & Hive discarded

  54. Integration
    HDFS would require simple integration
    Spark usually goes hand in hand with Cassandra
    Storm tends to prefer Kafka

  55. Room to grow
    A ton of logs
    A ton of metrics

  56. Thursday confessions
    Previously knew Kafka

  59. Publish & Subscribe
    Processing
    Store

  60. Publish & Subscribe
    Messages are produced to topics
    Topics have a predefined number of partitions
    Each message has a key that determines its partition

  61. Consumers get assigned a set of partitions
    Consumers store their last consumed offset
    Brokers own partitions, handle replication
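
A toy model of the publish and subscribe ideas above, assuming a CRC32-based partitioner and an in-memory log. Kafka's real partitioner and offset storage work differently; `NUM_PARTITIONS`, `partition_for`, `produce`, and `consume` are names invented for this sketch.

```python
import zlib

NUM_PARTITIONS = 4  # fixed when the (toy) topic is created

# One append-only list per partition stands in for the broker's log.
topic = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(key: str) -> int:
    """Map a key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def produce(key: str, value: str) -> int:
    """Append a message to its partition and return that partition."""
    p = partition_for(key)
    topic[p].append((key, value))
    return p


def consume(partition: int, offsets: dict) -> list:
    """Return unread messages and advance the consumer's stored offset."""
    start = offsets.get(partition, 0)
    batch = topic[partition][start:]
    offsets[partition] = start + len(batch)
    return batch


p1 = produce("geneva-jug", "vm create")
p2 = produce("geneva-jug", "vm destroy")
offsets = {}
first = consume(p1, offsets)
second = consume(p1, offsets)  # nothing new since the last stored offset
```

If the account name is used as the key, as assumed here, every event for one account lands in one partition and is read in order by whichever consumer owns that partition.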

  63. Stable consumer topology
    Memory disaggregation
    Can rely on in-memory storage

  64. Stream expiry

  69. Problem solved?

  70. Process crashes
    Undelivered message?
    Avoiding double billing

  71. Process crashes
    Triggers a rebalance
    Loss of in-memory cache
    No initial state!

  72. Reconciliation
    Snapshot of full inventory
    Converges stored resource state if necessary
    Handles failed deliveries as well
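
One way to picture reconciliation: compare a snapshot of the full inventory with the state the stream processor has stored, and emit corrective events for every difference. This is a sketch under assumptions; the state layout (resource uuid mapped to offering), the `reconcile` function, and the `(action, uuid, offering)` tuples are invented for illustration.

```python
def reconcile(snapshot: dict, stored: dict) -> list:
    """Return corrective events that make `stored` converge on `snapshot`."""
    corrections = []
    for uuid, offering in snapshot.items():
        if stored.get(uuid) != offering:
            # missing or stale entry: (re)declare the resource as running
            corrections.append(("start", uuid, offering))
    for uuid, offering in stored.items():
        if uuid not in snapshot:
            # stored state says running, inventory disagrees: a stop was missed
            corrections.append(("stop", uuid, offering))
    return corrections


snapshot = {"vm-1": "medium", "vm-2": "large"}
stored = {"vm-1": "medium", "vm-3": "tiny"}   # e.g. after a lost message
corrections = reconcile(snapshot, stored)
```

After a crash that wipes the in-memory cache, `stored` is empty and the same function rebuilds the initial state by emitting a start for every resource in the snapshot.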

  73. Avoiding double billing
    Reconciler acts as logical clock
    When supplying usage, attach a unique transaction ID
    Reject multiple transaction attempts on a single ID
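
The transaction ID idea amounts to an idempotent write. A minimal sketch, assuming an in-memory ledger; `Ledger`, `apply_usage`, and the ID format are hypothetical, and a real system would presumably enforce uniqueness in durable storage.

```python
class Ledger:
    """Toy billing ledger that accepts each transaction ID at most once."""

    def __init__(self):
        self.seen = set()      # transaction IDs already applied
        self.balances = {}     # account -> billed amount

    def apply_usage(self, txid: str, account: str, amount: int) -> bool:
        """Apply a charge; return False if this txid was already applied."""
        if txid in self.seen:
            return False
        self.seen.add(txid)
        self.balances[account] = self.balances.get(account, 0) + amount
        return True


ledger = Ledger()
first = ledger.apply_usage("2016-12-15T12:00/geneva-jug", "geneva-jug", 15)
retry = ledger.apply_usage("2016-12-15T12:00/geneva-jug", "geneva-jug", 15)
```

A retried delivery carries the same ID, so the second call is rejected and the account is billed once.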

  74. Looking back
    Things stay simple (roughly 600 LoC)
    Room to grow
    Stable and resilient
    DNS, Logs, Metrics, Event Sourcing

  75. What about batch
    Streaming doesn't work for everything
    Sometimes throughput matters more than latency
    Building models in batch, applying with stream processing

  76. Questions?
    Thanks!
