Slide 1

Ewan Fairweather

Slide 2

What this drill down is about
• Apps in the cloud are composed of many pieces, and sometimes those pieces break
• How do I find and fix problems in the cloud service before my customers are affected?
  …all the time?
  …with potentially millions of customers?
  …without spending infinite money on telemetry?
• I want to spend precious development time building the core logic, not telemetry solutions
• Most customers are where we were a few years ago with our own service visibility
• I’ll talk about the lessons we’ve learned managing the platform and the apps running on it

Slide 3

Buy vs. Build
• This space is evolving rapidly
• The choices available today will change further 12 months from now – so assume you will revisit whatever you choose now
• Invest in re-usable pieces, not monoliths
• Scale of service usually determines the option
• Not all options scale to the largest sizes

Slide 4

Internal Gaming Studio: “How do I know what is happening with the users in the system before Twitter knows?”

Slide 5

Telemetry Pipelines They Wanted
Solution = Active App Logging + Ambient Info
Using ETW + Teleminion Agent + Elasticsearch + Kibana

Slide 6

The “Build Your Own Telemetry” Experience
Typically these efforts evolve as follows:
1. Hook up something like SQL Azure or WA Tables to store data
2. Dump more and more stuff in
3. Queries get slower OR you run out of space (or both)
Once you hit this limit, things get interesting and you move to Big Data approaches.
These work OK for reporting/data science, but poorly for alerts.
This leads to two systems: a “Batch” pipeline and a “Fast” pipeline.
We will go through this evolution so you can see how to do each one.

Slide 7

What Data To Collect?
• Perf Counters
• ETW Events
• XEvents (SQL)
• DMVs (SQL)
• Custom Tracing (Application-Generated Events)
(Expect to iterate on this – as you run your service, you find things you need and things that are not worth collecting – you tune frequently)
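To make the “Custom Tracing” bullet concrete, here is a minimal Python sketch of an application-generated event written as a single JSON line that a collection agent could pick up later; the emit_event helper and the field names are hypothetical, not a prescribed schema.

```python
# Hypothetical custom-tracing helper: the app emits structured, machine-parseable
# events alongside the ambient data (perf counters, ETW, XEvents, DMVs) the
# platform already exposes.
import json
import socket
import uuid
from datetime import datetime, timezone

def emit_event(event_name, **fields):
    """Write one application event as a single JSON line (easy to ship and parse later)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event_name,
        "host": socket.gethostname(),
        "activity_id": str(uuid.uuid4()),  # lets you correlate related events
        **fields,
    }
    # In a real service this would go to a local log file or an ETW session that the
    # collection agent tails; printing keeps the sketch self-contained.
    print(json.dumps(record))

emit_event("OrderCheckoutFailed", error_code=40613, user_tier="free", latency_ms=812)
```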

Slide 8

Our Default Telemetry Configuration
• Log events from application code into WA Table Storage
• Manually query Table Storage to find data when there is a problem (no cross-partition or cross-table query support)
• Put each kind of data (errors, perf counters) in separate tables
• Hook up to on-premises SCOM and run machines like you do on-premises
• This model works fine for limited scales
• Often this is the “first attempt” for telemetry systems: re-use on-premises capabilities for the first cloud deployments
SCOM Azure Management Pack: http://www.microsoft.com/en-us/download/details.aspx?id=11324
[Diagram: application and database roles writing telemetry into table storage, surfaced through SCOM/Cerebrata]
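A minimal sketch of that default configuration, written against the current azure-data-tables package (which post-dates this deck, but the shape is the same); the connection string, table name, and entity fields are placeholders.

```python
# Log one application error event into WA (Azure) Table Storage.
import uuid
from datetime import datetime, timezone
from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    "<storage-account-connection-string>", table_name="AppErrors"
)
try:
    table.create_table()
except ResourceExistsError:
    pass  # table already provisioned

now = datetime.now(timezone.utc)
table.create_entity({
    # Partitioning by hour keeps "what happened around 14:00?" a single-partition
    # query; remember there is no cross-partition or cross-table query support.
    "PartitionKey": now.strftime("%Y%m%d%H"),
    "RowKey": f"{now.strftime('%H%M%S%f')}-{uuid.uuid4()}",  # unique within the partition
    "Role": "web-frontend",
    "ErrorCode": 40613,
    "Message": "Database unavailable, retrying",
})
```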

Slide 9

Our System Health Landscape
• Multitude of agents (WAD, SCOM, New Relic) that don’t meet all the requirements
• Ability to transform arbitrary logging data into a common format (Logstash grok filter capability in the ELK stack)
• Target a diverse set of analytics repositories
• Surface-area support gaps, e.g. no worker-role support for App Insights
• Guarantees on the ingestion pipeline – how do I meet my time-to-detect and time-to-remediate requirements?
• Separation of channels (cold, warm, hot)
• Walled gardens: ambient-information focused vs. active application logging
• Lack of developer composition points – choice of analytics repository, access to the data, configurable pipeline properties
• Quick, iterative, temporal visualization, the ability to define derivative KPIs, and drill-down into active app logging are missing in our stack
[Diagram: tools plotted from “No Access” (WAD agent, SCOM agent, 3rd-party agents) through “Configurable/Somewhat Extensible” (Visual Studio Online App Insights, SCOM) to “Full Access / Fully Extensible” (build your own, BI tools), across the pipeline Producers (devices, services, apps) → Collection → Broker → Transformation → Storage → Presentation and Action]

Slide 10

North Star and V1

Slide 11

ELK is simple but it works
In a nutshell: it provides the ability to take arbitrary data, transform it, and get temporal visualization and search.
Pipeline:
1. Transform arbitrary data easily: distributed Logstash agent, grok filters
2. Load log content: using Logstash, or by hitting the REST endpoint on :9200 with JSON
3. Store documents in Elasticsearch: runs on Windows or Linux
4. Quickly get insight: choose a timestamp field and get instant temporal visualization, plus full-text search and fast interactive exploration using Kibana
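Step 2 (“hitting the REST endpoint on :9200 with JSON”) can be as simple as the sketch below, which indexes one log event directly into Elasticsearch; the host, index name, and fields are illustrative, and in practice Logstash would usually do this for you.

```python
# Index one JSON document into Elasticsearch over its REST API.
import json
from datetime import datetime, timezone
import requests

doc = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),  # Kibana keys its time filter off a timestamp field
    "level": "error",
    "role": "web-frontend",
    "error_code": 40613,
    "message": "Database unavailable, retrying",
}

resp = requests.post(
    "http://localhost:9200/app-logs/_doc",  # index "app-logs"; older ES versions used /index/type rather than _doc
    headers={"Content-Type": "application/json"},
    data=json.dumps(doc),
    timeout=5,
)
resp.raise_for_status()
print(resp.json()["_id"])  # Elasticsearch returns the generated document id
```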

Slide 12

(Near Real-Time) Alerting
• Batch pipes are great at doing things at scale
• But they are not fast – often it takes minutes to hours to process at scale
• Alerting for errors is all about speed (time-to-detect)
• This leads to a different class of solution for “fast pipe” monitoring
• We measure every incident on how long it took us to detect it
• We file repair bugs to keep driving that metric lower next time
• You need to be selective about what you pass through the fast pipe
  • Perhaps you only look at key errors or pre-aggregate values
  • Otherwise you will overwhelm the alerting system
• Storage efficiency is also key – I see lots of denormalized row solutions
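To illustrate the “pre-aggregate values” point, here is a hedged sketch that rolls raw errors up into per-minute counts per error code before anything is sent down the fast pipe; the function names and the forward_to_fast_pipe stub are hypothetical.

```python
# Roll raw error events up into (minute, error_code) -> count so the fast pipe only
# carries small aggregates instead of every raw row.
from collections import Counter
from datetime import datetime, timezone

def minute_bucket(ts):
    return ts.strftime("%Y-%m-%dT%H:%M")

def pre_aggregate(events):
    """events: iterable of (timestamp, error_code) tuples from the raw stream."""
    counts = Counter()
    for ts, error_code in events:
        counts[(minute_bucket(ts), error_code)] += 1
    return counts

def forward_to_fast_pipe(rollups):
    for (minute, error_code), count in sorted(rollups.items()):
        # Stand-in for a send to the alerting system (e.g. via Event Hub, next slide).
        print(f"{minute} error={error_code} count={count}")

now = datetime.now(timezone.utc)
forward_to_fast_pipe(pre_aggregate([(now, 40613), (now, 40613), (now, 10054)]))
```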

Slide 13

NRT Software
• Azure Event Hub is in public preview
• Lets you send messages and scale out as needed
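A minimal sketch of sending those pre-aggregated rollups into Event Hub, written against the current azure-eventhub Python package (which post-dates the preview this slide refers to); the connection string and hub name are placeholders.

```python
# Send one pre-aggregated telemetry rollup into an Event Hub.
import json
from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="telemetry-fast-pipe",
)

with producer:
    batch = producer.create_batch()  # batches respect the hub's message size limits
    batch.add(EventData(json.dumps(
        {"minute": "2014-10-28T14:05", "error_code": 40613, "count": 17}
    )))
    producer.send_batch(batch)  # downstream consumers (Storm, Stream Analytics, ...) read from the hub
```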

Slide 14

Another Option - Storm
• Storm (Apache) has a lot of flexibility
• You build your own query plan, effectively
• Takes a bit of time to learn the model (Java-based)
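Storm topologies themselves are written in Java, so the following is only a plain-Python sketch of the “build your own query plan” idea: a spout-like source feeding bolt-like processing steps. It is not the Storm API, just the shape of the model.

```python
# Illustrative spout/bolt decomposition (not Storm's actual API).
from collections import Counter

def error_spout():
    """Spout: emits a stream of (minute, error_code) tuples (stubbed here)."""
    yield ("14:05", 40613)
    yield ("14:05", 40613)
    yield ("14:06", 10054)

def count_bolt(stream):
    """Bolt: groups the stream by (minute, error_code) and emits counts."""
    for key, count in Counter(stream).items():
        yield key, count

def alert_bolt(counted, threshold=2):
    """Bolt: fires an alert when a count crosses a threshold."""
    for (minute, error_code), count in counted:
        if count >= threshold:
            print(f"ALERT {minute}: error {error_code} seen {count} times")

# Wiring the "topology": spout -> count bolt -> alert bolt
alert_bolt(count_bolt(error_spout()))
```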

Slide 15

Machine Learning
• After you have NRT alerts, you can do machine learning
• Applications:
  • Auto-tuned alerts
  • Prediction models (for failures based on historical behaviors, etc.)
  • Watching multiple things for errors without defined alerts
• We use ML algorithms to detect new bugs in WA SQL Database (SQL Azure)
  • Watch all errors from all users (every minute or two)
  • See if new kinds of errors start spiking
  • Fire alerts for errors of appropriate severity
• This is far better than
  • Firing alerts with static limits (they break as your service grows)
  • Hand-coding each limit (takes a long time)
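The deck does not spell out the algorithm used in production, but one simple way to “see if new kinds of errors start spiking” is a rolling z-score per error code, sketched below; the window size, threshold, and minimum count are illustrative.

```python
# Compare each error code's latest per-minute count against its own recent history.
import math
from collections import defaultdict, deque

class SpikeDetector:
    def __init__(self, window=60, z_threshold=4.0, min_count=20):
        self.history = defaultdict(lambda: deque(maxlen=window))  # error_code -> recent counts
        self.z_threshold = z_threshold
        self.min_count = min_count  # ignore tiny volumes to reduce false positives

    def observe(self, error_code, count):
        """Feed one per-minute count; return True if it looks like a spike."""
        hist = self.history[error_code]
        spike = False
        if len(hist) >= 10 and count >= self.min_count:
            mean = sum(hist) / len(hist)
            var = sum((x - mean) ** 2 for x in hist) / len(hist)
            std = math.sqrt(var) or 1.0
            spike = (count - mean) / std > self.z_threshold
        hist.append(count)
        return spike

detector = SpikeDetector()
for minute_count in [3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 150]:  # last minute spikes
    if detector.observe(40613, minute_count):
        print(f"ALERT: error 40613 spiked to {minute_count}/min")
```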

Slide 16

Using Machine Learning
• Option 1: Go get R – it is free
  • Then figure out how to pump lots of data through it, do alerts, etc.
• Option 2: Try the Azure ML service (not free, but easier to start)
  • Go author a job and try it out

Slide 17

Watching User Errors for Spikes
• We watch user errors
• If they spike, our systems fire alerts
• False positives get corrected over time with auto-leveling of thresholds
• This is a simple example of SQL Azure user errors
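One hedged way to implement “auto-leveling of thresholds” is an exponentially weighted moving average (EWMA) of the recent error rate, alerting only when the current rate exceeds a multiple of that baseline; the smoothing factor and multiplier below are illustrative, not values from the deck.

```python
# Adaptive threshold: the baseline grows with the service, so alerts don't break
# as volume grows and no static limit has to be hand-coded.
class AutoLevelingThreshold:
    def __init__(self, alpha=0.05, multiplier=3.0, floor=10.0):
        self.alpha = alpha            # how quickly the baseline follows normal growth
        self.multiplier = multiplier  # how far above baseline counts as a spike
        self.floor = floor            # minimum rate worth alerting on at all
        self.baseline = None

    def observe(self, errors_per_minute):
        if self.baseline is None:
            self.baseline = errors_per_minute
            return False
        threshold = max(self.floor, self.multiplier * self.baseline)
        alert = errors_per_minute > threshold
        # Update the baseline afterwards so a genuine spike does not immediately drag
        # the threshold up with it; repeated false positives still raise it over time.
        self.baseline = (1 - self.alpha) * self.baseline + self.alpha * errors_per_minute
        return alert

monitor = AutoLevelingThreshold()
for rate in [20, 22, 25, 24, 26, 30, 140]:
    if monitor.observe(rate):
        print(f"ALERT: user error rate {rate}/min exceeds auto-leveled threshold")
```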

Slide 18

Alerting Using Observation Models
• We measure our telemetry-uploading agent to determine if it is sending data properly
• We can detect when spikes or a lack of data should cause alerts
• We file bugs/incidents automatically, with no human intervention
• This lets us detect and fix issues before our customers notice
• Example: we deployed a bug fix to reduce telemetry volume and our alerting caught it – we resolved the bug as “expected change”
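A sketch of the observation-model idea, under the assumption that each upload agent reports how many rows it sent per interval: flag silence, drops, and spikes relative to that agent’s own recent norm. The thresholds and the file_incident stub are hypothetical.

```python
# Monitor the telemetry pipeline itself: compare each agent's upload volume
# against its recent average and file an incident on silence, drops, or spikes.
from collections import defaultdict, deque

def file_incident(agent, reason):
    # Stand-in for automatic bug/incident filing.
    print(f"INCIDENT [{agent}]: {reason}")

class AgentObservationModel:
    def __init__(self, window=24, drop_ratio=0.5, spike_ratio=2.0):
        self.history = defaultdict(lambda: deque(maxlen=window))  # agent -> rows uploaded per interval
        self.drop_ratio = drop_ratio
        self.spike_ratio = spike_ratio

    def observe(self, agent, rows_uploaded):
        hist = self.history[agent]
        if len(hist) >= 6:
            expected = sum(hist) / len(hist)
            if rows_uploaded == 0:
                file_incident(agent, f"no data received (expected ~{expected:.0f} rows)")
            elif rows_uploaded < self.drop_ratio * expected:
                file_incident(agent, f"volume dropped to {rows_uploaded} (expected ~{expected:.0f})")
            elif rows_uploaded > self.spike_ratio * expected:
                file_incident(agent, f"volume spiked to {rows_uploaded} (expected ~{expected:.0f})")
        hist.append(rows_uploaded)

model = AgentObservationModel()
for rows in [1000, 980, 1010, 995, 1005, 990, 400]:  # last interval: telemetry volume halved
    model.observe("node-17", rows)
```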

Slide 19

Canonical Architecture To Think About
[Diagram: Producers (applications, services, devices, Internet of Things, infrastructure devices, external data sources), instrumented via a library/agent → Collection → Broker (data broker, data directory) → Transformation (stream processing, map-reduce processing, text indexing, machine learning) → Long-term storage (data store, results cache) → Presentation and Action (dashboards, log search, distributed tracing, data analytics, App Insights, multi-dimensional metrics, alerting, monitoring/health state/remediation, synthetic transactions, dial-tone services, connector for devices)]
• A common schema and data encoding enables democratization, correlation, and reuse of data
• Monitoring is a form of stream processing and logs a copy of its results to analytics