How we process billions of metrics to make your microservices run smoothly

Fabian Lange
September 28, 2017


Transcript

  1. About Me • Co-Founder of Instana • Core Architect and Developer • Performance Consultant who has seen the worst
  2. About This Talk • At Instana we deal with billions of metrics every second • This talk covers • Collection • Processing • Storage • Presentation • Mostly using Java and Amazon Web Services • Tips and caveats for all of the above • It also contains a few things about monitoring and tracing in general
  3. A Short Glimpse Of Instana In Action • We automated our demo to avoid human error :) • www.youtube.com/watch?v=z14wXHzw5lU
  4. A Beautiful And Meaningful User Interface • Written using React and Three.js • Provides access to various types of data with dedicated preconfigured views • Realtime • Fast • Rarely requests data • Live streaming of information from the backend via SockJS
  5. Displaying Tons of Data Points Quickly • React DOM diffing • Flux stores • RxJS based subscriptions with tweaks • Custom charts • HW-accelerated canvases instead of HTML
  6. Agents Send Tons Of Data • Customers start an Instana Agent process on every host that is to be monitored • Collection of metrics and configuration • Standard APIs like statsd or JMX • Proprietary “tricks” • Push or poll • Trace data describing the code executions • Automatic, no-impact instrumentation • Manual OpenTracing API • Everything is running on timers • Reduces and compresses data representing the complete state of everything on the host • Single-digit CPU usage
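
Since collection runs entirely on timers, a single agent sensor can be pictured as a scheduled task that samples a standard API such as JMX. A minimal, hypothetical sketch (not Instana's agent code; a real sensor would batch, reduce and compress its samples before sending them):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sensor: samples a standard JMX metric on a fixed timer,
// mirroring the "everything is running on timers" collection model.
public class HeapSensor {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            long usedBytes = memory.getHeapMemoryUsage().getUsed();
            // A real agent batches, reduces and compresses such samples
            // before shipping them to the backend.
            System.out.printf("heap.used=%d ts=%d%n", usedBytes, System.currentTimeMillis());
        }, 0, 1, TimeUnit.SECONDS);
    }
}
```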
  7. API First Design • The edges of the Instana processing pipeline are designed API first • No tricks or shortcuts for Instana code • Growing documentation and examples • 3rd party integrations
  8. High Level Architecture • (Diagram: Agents, a Custom Event Source and the API feed the Processing Pipeline (“Magic happens here”), which serves the Browser UI, the API, Email and PagerDuty)
  9. Building a Complete Picture • We could just route the data straight through • When an agent sends a metric value, we could update the UI • However, we want to do more than that • Analyze correlation between any metric or event in the whole system • E.g. when one server is overloaded, another one times out on network connections at the same time • Build a canonical representation of EVERYTHING
  10. Managing Large Amounts of Data Streams • Lightweight proxy component at the edge • Authenticates traffic • Decompresses traffic • Handles backchannel • Jetty 9.4 using HTTP/2 • Agents are not time synced • Traffic spreads out over time • Internal routing via Kafka • LZ4 compression cut bandwidth by an order of magnitude
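
The talk does not say whether LZ4 is applied by the application or by Kafka itself; one straightforward way to get LZ4 on the internal routing hop is Kafka's built-in producer compression, sketched below (broker address and topic name are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Producer-side LZ4: whole record batches are compressed before they hit the wire.
public class MetricsForwarder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("compression.type", "lz4");   // batch-level LZ4 compression
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = "{\"metric\":\"cpu.user\",\"value\":0.42}".getBytes(StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("raw-metrics", payload));
        }
    }
}
```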
  14. Scaling for Many Customers • All customer-specific processing is performed in a “tenant unit” • Customers have access to all of their units via SSO • Units have multiple purposes • Separating environments like production and test • Allowing organisational separation despite centralised billing • Scaling via sharding
  15. Central Components Of Our Stream Processing • Heart: unified view and relation • Builds the Dynamic Graph from incoming data • Sanitizes and time-normalizes data and fills missing data • Optimizes for long-term storage • Brain: analysis and understanding • Runs on the Dynamic Graph and continuously evaluates health • Performs root cause analysis and alerting
  16. The Heart Of Our Stream Processing • Holds the complete state of a customer environment in memory • Incorporates the stream of incoming changes / new data • Metrics • Events • Traces • Configuration changes • Performs the canonical time ticking
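
The canonical time ticking can be pictured as one clock that, once per second, turns the in-memory state into a consistent snapshot for everything downstream. A toy sketch under that assumption (invented types, no Dynamic Graph, Java 8 style):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy "heart": incoming updates mutate in-memory state, and a single timer
// emits one consistent per-second snapshot for downstream processing steps.
public class Heart {
    private final Map<String, Double> latestValues = new ConcurrentHashMap<>();

    public void onMetric(String key, double value) {
        latestValues.put(key, value);   // incorporate an incoming change
    }

    public void startTicking() {
        ScheduledExecutorService clock = Executors.newSingleThreadScheduledExecutor();
        clock.scheduleAtFixedRate(() -> {
            Map<String, Double> snapshot = new HashMap<>(latestValues);  // view for this tick
            publish(System.currentTimeMillis() / 1000, snapshot);
        }, 1, 1, TimeUnit.SECONDS);
    }

    private void publish(long epochSecond, Map<String, Double> snapshot) {
        // Rollups, event derivation and indexing would subscribe to these ticks.
        System.out.println(epochSecond + " -> " + snapshot.size() + " series");
    }
}
```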
  17. Reactive Streams • Implemented on Project Reactor and Java 8 • Every processing step is a subscriber on a stream • Rollup calculation and persistence • Deriving events from changes • Short term storage for live mode • Indexing for search
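
As a rough illustration of “every processing step is a subscriber”, here is a minimal Project Reactor fan-out (using today's Sinks API rather than the 2017-era processors; the step names are invented):

```java
import reactor.core.publisher.Flux;
import reactor.core.publisher.Sinks;

// One hot stream of updates, fanned out to several independent processing
// steps; each step is simply another subscriber on the same Flux.
public class HeartStreams {
    public static void main(String[] args) {
        Sinks.Many<String> updates = Sinks.many().multicast().onBackpressureBuffer();
        Flux<String> stream = updates.asFlux();

        stream.subscribe(u -> System.out.println("rollup + persistence: " + u));
        stream.subscribe(u -> System.out.println("derive events:        " + u));
        stream.subscribe(u -> System.out.println("index for search:     " + u));

        updates.tryEmitNext("host-42 cpu.user=0.42");
        updates.tryEmitNext("host-42 heap.used=123456789");
        updates.tryEmitComplete();
    }
}
```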
  18. Instana's Reactive Streams • Subscribers do not unsubscribe on error • No supervision of subscribers • No backpressure • Small ring buffer size • Every subscriber is metered • Throughput • Processed • Errors • Drops
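
A metered subscriber along those lines might look roughly like this (a sketch on Reactor's BaseSubscriber with plain counters; Instana's actual metering and drop accounting are not shown in the talk):

```java
import java.util.concurrent.atomic.LongAdder;
import reactor.core.publisher.BaseSubscriber;

// Sketch of a processing step that never gives up its subscription on a
// processing error and counts its own work, so processed items and errors
// can be exposed as meters.
public class MeteredSubscriber<T> extends BaseSubscriber<T> {
    private final String name;
    private final LongAdder processed = new LongAdder();
    private final LongAdder errors = new LongAdder();

    public MeteredSubscriber(String name) {
        this.name = name;
    }

    @Override
    protected void hookOnNext(T value) {
        try {
            process(value);
            processed.increment();
        } catch (Exception e) {
            errors.increment();   // count the failure, keep consuming the stream
        }
    }

    protected void process(T value) {
        // Actual work (rollups, event derivation, indexing, ...) goes here.
    }

    public String meterSnapshot() {
        return name + ": processed=" + processed.sum() + " errors=" + errors.sum();
    }
}
```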
  24. The Brain Of Our Stream Processing • Receives consistent sets of data for each monitored component every second • Health analysis subscribes dynamically to the relevant incoming data • Performs • Configuration analysis • Static thresholds • Trend prediction • Outlier and anomaly detection • Neural network based forecasting • Found issues are stored in Elasticsearch and forwarded via Kafka • Recurring issues are detected • Causality is analysed based on temporal order and relationships in the Dynamic Graph
  25. Algorithms • Algorithms heavily use the Median Absolute Deviation (MAD) • en.wikipedia.org/wiki/Median_absolute_deviation • A more robust “average” • Neural network • TensorFlow • Long Short-Term Memory • Daily training • Hourly prediction updates
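
A straightforward MAD implementation for reference (not Instana's code; empty-input handling omitted):

```java
import java.util.Arrays;

// Median Absolute Deviation: median of |x_i - median(x)|.
// Far less sensitive to outliers than mean and standard deviation.
public final class Mad {
    public static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static double mad(double[] values) {
        double med = median(values);
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - med);
        }
        return median(deviations);
    }

    public static void main(String[] args) {
        double[] latencies = {10, 11, 12, 11, 10, 250};   // one outlier
        System.out.println("median = " + median(latencies));  // 11.0
        System.out.println("MAD    = " + mad(latencies));     // 1.0, barely moved by the outlier
    }
}
```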
  26. Sudden Drop / Increase • “Twitter Paper”: Leveraging Cloud Data to Mitigate User Experience from Breaking Bad • E-Divisive with Medians • arxiv.org/pdf/1411.7955.pdf
  27. Not So Micro Services • We run several components communicating via Kafka • Majority of the processing happens within two central components • Most processing operates on the same data set: the current state of the monitored system • Copying data over the network is not free • Plugin architecture speeds up development
  28. Optimization: Data Conversion and Passing • Reading bytes from Kafka and converting them into our data model • Option A • Thread 1: connects to Kafka, reads and decompresses bytes, passes bytes on • Thread 2: reads bytes, builds the domain model, passes the domain model on • Thread 3: processes the domain model • Option B • Thread 1: connects to Kafka, reads and decompresses bytes, builds the domain model, passes the domain model on • Thread 2: processes the domain model
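
A rough sketch of Option B with hypothetical types (the real pipeline uses reactive streams rather than a plain queue): the thread that owns the Kafka consumer also decompresses the bytes and builds the domain objects, so the short-lived byte arrays never leave it.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Option B: Thread 1 connects to Kafka, reads and decompresses bytes and builds
// the domain model; only the domain model is passed on to Thread 2.
public class OptionBPipeline {
    static final class MetricUpdate {                 // stand-in domain type
        final byte[] raw;
        MetricUpdate(byte[] raw) { this.raw = raw; }
    }

    public static void main(String[] args) {
        BlockingQueue<MetricUpdate> toProcessor = new ArrayBlockingQueue<>(1024);

        Thread processor = new Thread(() -> {         // Thread 2: processes the domain model
            try {
                while (true) {
                    MetricUpdate update = toProcessor.take();
                    // ... process the domain model ...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "processor");
        processor.start();

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "option-b-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Thread 1 (here: main): read bytes, build the domain model, pass the model on.
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-metrics"));
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(100))) {
                    toProcessor.offer(new MetricUpdate(record.value()));
                }
            }
        }
    }
}
```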

  29. Optimization: Data Conversion and Passing • Option B • Domain model objects live longer • Keep garbage thread-local (TLAB) • Use G1GC • -XX:+UseG1GC • When dealing with lots of string data, use string deduplication • -XX:+UseStringDeduplication • When dealing with lots of primitive types, use primitive collections • Trove • GS / Eclipse Collections • Various others with different focus
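
To illustrate the primitive-collections point, here is a boxed map next to its primitive counterpart from Eclipse Collections, one of the libraries named above (Trove and others offer similar types):

```java
import java.util.HashMap;
import java.util.Map;
import org.eclipse.collections.impl.map.mutable.primitive.LongDoubleHashMap;

// A timestamp -> value series held boxed vs. as primitives.
public class PrimitiveCollectionsDemo {
    public static void main(String[] args) {
        // Boxed: every key and value becomes a Long/Double object on the heap.
        Map<Long, Double> boxed = new HashMap<>();
        boxed.put(1506556800L, 0.42);

        // Primitive: keys and values live in long[]/double[] backing arrays,
        // avoiding per-entry objects and the garbage they generate.
        LongDoubleHashMap primitive = new LongDoubleHashMap();
        primitive.put(1506556800L, 0.42);
        System.out.println(primitive.get(1506556800L));
    }
}
```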
  31. How We Deliver • Agent updates are released daily • Backend is released bi-weekly • SaaS first, on-prem a week later • SaaS releases at the beginning of the week • 1-2 hotfixes the same week • Real load never matches tests • Ability to roll back or forward quickly
  32. SaaS Deployment • HashiCorp Terraform is used to provision infrastructure • Shared infrastructure is installed, updated and scaled “manually” • Instana components are delivered as Docker containers • VPCs and AZs • HashiCorp Nomad manages deployments • Not limited to Docker • HashiCorp Consul facilitates service discovery and config management • Customers get a domain name via Route53 • Tip: Avoid complex dependency chains or even cycles
  33. Scaling on AWS (Costs Money) • Not really flexible but straightforward • Inter-AZ traffic costs money https://aws.amazon.com/ec2/pricing/on-demand/ • IOPS via disk space http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html • Bandwidth via machine size http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html
      EBS bandwidth per instance type (Max Mbps / 8 ≈ expected MB/s):
      Instance Type | Max Mbps | Expected MB/s
      r4.large      |      437 |            54
      r4.xlarge     |      875 |           109
      r4.2xlarge    |     1750 |           218
      r4.4xlarge    |     3500 |           437
      r4.8xlarge    |     7000 |           875
      r4.16xlarge   |    14000 |          1750