
How we process billions of metrics to make your microservices run smoothly

Fabian Lange

September 28, 2017

Transcript

  1. About Me • Co-Founder of Instana • Core Architect and

    Developer • Performance Consultant who has seen the worst 2
  2. About This Talk • At Instana we deal with billions

    of metrics every second • This talk covers • Collection • Processing • Storage • Presentation • Mostly using Java and Amazon Web Services • Tips and caveats for all of the above • It also contains a few things about monitoring and tracing in general 3
  3. A Short Glimpse Of Instana In Action • We automated

    our demo to avoid human error :) • www.youtube.com/watch?v=z14wXHzw5lU 4
  4. A Beautiful And Meaningful User Interface • Written using React

    and Three.js • Provides access to various types of data with dedicated preconfigured views • Realtime • Fast • Rarely requests data • Live streaming of information from the backend via SockJS 6
  5. Displaying Tons of Data Points Quickly • React DOM diffing

    • Flux stores • RxJS based subscriptions with tweaks • Custom charts • HW accelerated canvases over HTML 7
  6. Agents Send Tons Of Data • Customers start an Instana

    Agent process on every host that should be monitored • Collection of metrics and configuration • Standard APIs like statsd or JMX • Proprietary “tricks” • Push or Poll • Trace data describing the code executions • Automatic no-impact instrumentation • Manual OpenTracing API • Everything is running on timers • Reduces and compresses data representing the complete state of everything on the host • Single-digit CPU usage 8
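
    As a rough illustration of the timer-driven collection described above, here is a minimal Java sketch that samples one JMX metric (heap usage via the standard platform MemoryMXBean) once per second. The real agent's sensors, batching and transport are not shown in the deck; the println stands in for whatever reporter is actually used.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Minimal timer-driven sensor: samples heap usage via the platform
    // MemoryMXBean once per second and hands the value to a stand-in reporter.
    public class HeapSensor {
        public static void main(String[] args) {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.scheduleAtFixedRate(() -> {
                long usedHeap = memory.getHeapMemoryUsage().getUsed();
                // a real agent would batch, compress and push this value to the backend
                System.out.println("heap.used=" + usedHeap);
            }, 0, 1, TimeUnit.SECONDS);
        }
    }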
  7. API First Design • The edges of the Instana processing

    pipeline are designed API first • No tricks or shortcuts for Instana code • Growing documentation and examples • 3rd party integrations 10
  8. High Level Architecture 11

    (Architecture diagram: Agents and a Custom Event Source feed through APIs into the Processing Pipeline, where the "magic happens"; the pipeline serves Browser UIs and sends notifications via Email and PagerDuty)
  9. Building a Complete Picture • We could just route the

    data straight through • When an agent sends a metric value, we could update the UI. • However, we want to do more than that • Analyze correlations between any metrics or events in the whole system. • E.g. when one server is overloaded, another one times out on network connections at the same time. • Build a canonical representation of EVERYTHING 13
  10. Managing Large Amounts of Data Streams • Lightweight proxy component

    at the edge • Authenticates traffic • Decompresses traffic • Handles backchannel • Jetty 9.4 using HTTP/2 • Agents are not time synced • Traffic spreads out over time • Internal routing via Kafka • LZ4 compression cut bandwidth by an order of magnitude 14
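
    The internal routing and LZ4 point can be sketched with the standard Apache Kafka producer API; the broker address and topic name below are made up, and the payload is a placeholder for the agent data being re-routed internally.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    // Routes raw agent payloads into Kafka with LZ4 compression enabled.
    public class AgentTrafficRouter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                byte[] payload = new byte[0]; // stands in for a decompressed agent payload
                producer.send(new ProducerRecord<>("agent-metrics", payload));
            }
        }
    }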
  11. (image-only slide, page 15)

  12. (image-only slide, page 16)

  13. (image-only slide, page 17)
  14. Scaling for Many Customers • All customer-specific processing is

    performed in a “tenant unit”. • Customers have access to all of their units via SSO. • Units have multiple purposes • Separating environments like production and test • Allowing organisational separation despite centralised billing • Scaling via sharding 18
  15. Central Components Of Our Stream Processing • Heart: unified view

    and relation • builds Dynamic Graph from incoming data • sanitizes and time-normalizes data and fills in missing data • Optimizes for long-term storage • Brain: analysis and understanding • runs on the Dynamic Graph and continuously evaluates health • performs root cause analysis and alerting 19
  16. The Heart Of our Stream Processing • Holds the complete

    state of a customer environment in memory • Incorporates stream of incoming changes / new data • Metrics • Events • Traces • Configuration changes • Performs the canonical time ticking 20
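
    A very rough, assumption-laden sketch of what "holds the complete state in memory" plus "canonical time ticking" could look like; this is not Instana's actual Heart, and publishDownstream is a hypothetical hook for the downstream processing steps.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // In-memory state is updated as data arrives; once per second a
    // second-aligned snapshot is handed to the downstream processing steps.
    public class Heart {
        private final Map<String, Double> latestMetricValues = new ConcurrentHashMap<>();

        public void onIncomingMetric(String metricKey, double value) {
            latestMetricValues.put(metricKey, value);                // incorporate incoming changes
        }

        public void startTicking() {
            Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
                long tickSecond = System.currentTimeMillis() / 1000; // canonical, whole-second tick
                publishDownstream(tickSecond, new HashMap<>(latestMetricValues));
            }, 0, 1, TimeUnit.SECONDS);
        }

        private void publishDownstream(long tickSecond, Map<String, Double> snapshot) {
            // hypothetical hook: rollups, event derivation, indexing, ...
        }
    }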
  17. Reactive Streams • Implemented on Project Reactor and Java 8

    • Every processing step is a subscriber on a stream • Rollup calculation and persistence • Deriving events from changes • Short term storage for live mode • Indexing for search 21
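
    A minimal Project Reactor sketch of "every processing step is a subscriber on a stream"; the interval source fakes incoming data, and the two print-only subscribers are placeholders for the real rollup, event and indexing steps.

    import java.time.Duration;
    import reactor.core.publisher.Flux;

    // Several processing steps subscribe independently to one shared stream.
    public class MetricStream {
        public static void main(String[] args) throws InterruptedException {
            Flux<Long> metrics = Flux.interval(Duration.ofMillis(200)) // fake incoming data
                                     .publish()                        // share a single upstream
                                     .autoConnect(2);                  // start once both steps subscribed

            metrics.subscribe(v -> System.out.println("rollup + persist " + v));
            metrics.subscribe(v -> System.out.println("derive events   " + v));

            Thread.sleep(2000); // let the demo run briefly
        }
    }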
  18. Instana's Reactive Streams • Subscribers do not unsubscribe on error

    • No supervision of subscribers • No backpressure • Small ring buffer size • Every subscriber is metered • Throughput • Processed • Errors • Drops 22
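
    A hedged sketch of the drop-instead-of-backpressure approach with per-subscriber metering: the AtomicLong counters stand in for whatever metrics library is actually used, and the buffer size of 32 is just an example of a "small ring buffer".

    import java.time.Duration;
    import java.util.concurrent.atomic.AtomicLong;
    import reactor.core.publisher.Flux;
    import reactor.core.scheduler.Schedulers;

    // A fast producer feeds a slow subscriber; overflow is dropped and counted.
    public class MeteredSubscriber {
        public static void main(String[] args) throws InterruptedException {
            AtomicLong processed = new AtomicLong();
            AtomicLong dropped = new AtomicLong();

            Flux.interval(Duration.ofMillis(1))                 // fast producer
                .onBackpressureDrop(v -> dropped.incrementAndGet())
                .publishOn(Schedulers.single(), 32)             // small buffer between producer and consumer
                .subscribe(v -> {
                    processed.incrementAndGet();
                    sleepQuietly(5);                            // slow consumer
                });

            Thread.sleep(1000);
            System.out.println("processed=" + processed + " dropped=" + dropped);
        }

        private static void sleepQuietly(long ms) {
            try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }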
  19. (image-only slide, page 23)

  20. (image-only slide, page 24)

  21. (image-only slide, page 25)

  22. (image-only slide, page 26)

  23. (image-only slide, page 27)
  24. The Brain Of Our Stream Processing • Receives consistent sets

    of data for each monitored component every second • Health analysis subscribes dynamically to the relevant incoming data • Performs • Configuration analysis • Static thresholds • Trend prediction • Outlier and anomaly detection • Neural network based forecasting • Found issues are stored in Elasticsearch and forwarded via Kafka • Recurring issues are detected • Causality is analysed based on temporal order and relationships in the Dynamic Graph 28
  25. Algorithms • Algorithms heavily use Median Absolute Deviation (MAD) •

    en.wikipedia.org/wiki/Median_absolute_deviation • a more robust alternative to the average • Neural Network • TensorFlow • Long Short-Term Memory • Daily training • Hourly prediction updates 29
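
    For reference, MAD is just the median of the absolute deviations from the median. A plain Java illustration (the sample latencies are made up); note how the single outlier barely moves the result:

    import java.util.Arrays;

    public class Mad {
        static double median(double[] values) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            int mid = sorted.length / 2;
            return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
        }

        // MAD = median of the absolute deviations from the median
        static double mad(double[] values) {
            double med = median(values);
            double[] deviations = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                deviations[i] = Math.abs(values[i] - med);
            }
            return median(deviations);
        }

        public static void main(String[] args) {
            double[] latencies = {12, 13, 12, 14, 13, 250, 12}; // one outlier
            System.out.println("median = " + median(latencies)); // 13.0
            System.out.println("MAD    = " + mad(latencies));    // 1.0
        }
    }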
  26. Sudden Drop / Increase 30 “Twitter Paper”: Leveraging Cloud Data

    to Mitigate User Experience from Breaking Bad
 E-DIVISIVE WITH MEDIANS arxiv.org/pdf/1411.7955.pdf
  27. Not So Micro Services • We run several components communicating

    via Kafka • Majority of processing is happening within two central components • Most processing operates on the same data set:
 Current state of the monitored system • Copying data over the network is not free • Plugin architecture speeds up development 35
  28. Optimization: Data Conversion and Passing • Reading bytes from Kafka

    and converting them into our data model 36
 Option A • Thread 1 • Connects to Kafka • Reads and decompresses bytes • Passes bytes on • Thread 2 • Reads bytes • Builds domain model • Passes domain model on • Thread 3 • Processes domain model
 Option B • Thread 1 • Connects to Kafka • Reads and decompresses bytes • Builds domain model • Passes domain model on • Thread 2 • Processes domain model (a sketch of Option B follows below)
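
    A sketch of Option B, assuming a recent kafka-clients API and a made-up MetricUpdate domain type: one thread reads, decompresses and builds the domain model, and a second thread only processes finished domain objects handed over via a bounded queue.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OptionB {
        static class MetricUpdate {                       // placeholder domain model
            final byte[] raw;
            MetricUpdate(byte[] raw) { this.raw = raw; }
        }

        public static void main(String[] args) {
            BlockingQueue<MetricUpdate> handoff = new ArrayBlockingQueue<>(1024);

            Thread readerAndBuilder = new Thread(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka:9092");
                props.put("group.id", "processor");
                props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
                try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Collections.singletonList("agent-metrics"));
                    while (!Thread.currentThread().isInterrupted()) {
                        for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(100))) {
                            // decompress + deserialize in the SAME thread, so temporary
                            // garbage stays in this thread's TLAB (see the next slide)
                            handoff.offer(new MetricUpdate(record.value()));
                        }
                    }
                }
            }, "thread-1-read-and-build");

            Thread processor = new Thread(() -> {
                try {
                    while (true) {
                        MetricUpdate update = handoff.take();
                        // domain-level processing happens here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "thread-2-process");

            readerAndBuilder.start();
            processor.start();
        }
    }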

  29. Optimization: Data Conversion and Passing • Option B • Domain

    model objects live longer • Keep garbage thread-local (TLAB) • Use G1GC • -XX:+UseG1GC • When dealing with lots of string data use StringDeduplication • -XX:+UseStringDeduplication • When dealing with lots of primitive types use primitive collections • Trove • GS / Eclipse Collections • various others with a different focus 37
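
    The primitive-collections point, sketched with Eclipse Collections (the class choice and the time-series use case are just an example); the JVM flags from the slide are shown as a comment, noting that string deduplication only works with G1.

    // JVM flags from the slide (string deduplication requires G1):
    //   java -XX:+UseG1GC -XX:+UseStringDeduplication ...
    import org.eclipse.collections.impl.map.mutable.primitive.LongDoubleHashMap;

    // A long -> double map keeps timestamps and values unboxed, unlike a
    // HashMap<Long, Double>, which allocates wrapper objects for every entry.
    public class MetricSeries {
        private final LongDoubleHashMap valuesByTimestamp = new LongDoubleHashMap();

        public void record(long timestampMillis, double value) {
            valuesByTimestamp.put(timestampMillis, value);   // no Long/Double boxing
        }

        public double valueAt(long timestampMillis) {
            return valuesByTimestamp.get(timestampMillis);   // 0.0 if the timestamp is absent
        }
    }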
  30. (image-only slide, page 38)
  31. How We Deliver • Agent updates are released daily •

    Backend is released bi-weekly • SaaS first, on-prem a week later • SaaS releases at the beginning of the week • 1-2 hotfixes in the same week • Real load never matches tests • Ability to roll back or forward quickly 39
  32. SaaS Deployment • Hashicorp Terraform used to provision infrastructure •

    Shared infrastructure is installed, updated and scaled “manually” • Instana components are delivered as Docker containers • VPCs and AZs • Hashicorp Nomad manages deployments • Not limited to Docker • Hashicorp Consul facilitates service discovery and config management • Customers get a domain name via Route53 • Tip: Avoid complex dependency chains or even cycles 40
  33. Scaling on AWS (Costs Money) • Not really flexible but

    straightforward • Inter-AZ traffic costs money https://aws.amazon.com/ec2/pricing/on-demand/ • IOPS scale with disk space http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html • Bandwidth scales with machine size (roughly, Expected MB/s = Max Mbps / 8) http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html 41

    Instance Type    Max Mbps    Expected MB/s
    r4.large              437               54
    r4.xlarge             875              109
    r4.2xlarge           1750              218
    r4.4xlarge           3500              437
    r4.8xlarge           7000              875
    r4.16xlarge         14000             1750