
Managing and Monitoring Apps in MS Azure

Talk by Ewan Fairweather, Engineer @MS CAT, at the Data Science London @ds_ldn Open Healthcare Hackathon

Data Science London

February 02, 2015

Transcript

1. What this drill down is about
• Apps in the cloud are composed of stuff which sometimes breaks
• How do I find and fix problems in the cloud service before my customers are affected? …all the time? …with potentially millions of customers? …without spending infinite money on telemetry?
• I want to spend precious development time building the core logic … not telemetry solutions
• Most customers are where we were a few years ago with our own service visibility
• I'll talk about the lessons we've learned managing the platform and the apps running on it
2. Buy vs. Build
• This space is evolving rapidly
• The choices available today will change further 12 months from now, so assume you will revisit the choices you make now
• Invest in re-usable pieces, not monoliths
• Scale of service usually determines the option
• Not all options scale to the largest sizes
3. Internal Gaming Studio: "How do I know what is happening with the users in the system before Twitter knows?"…
4. Telemetry Pipelines They Wanted
Solution = Active App Logging + Ambient Info, using ETW + Teleminion Agent + Elasticsearch + Kibana
5. The "Build Your Own Telemetry" Experience
Typically these efforts evolve as follows:
1. Hook up something like SQL Azure or WA Tables to store data
2. Dump more and more stuff in
3. Queries get slower OR you run out of space (or both)
Once you hit this limit, things get interesting and you move to Big Data approaches. These work OK for reporting/data science, but poorly for alerts. This leads to two systems: a "Batch" pipeline and a "Fast" pipeline. We will go through this evolution so you can see how to do each one.
6. What Data To Collect?
• Perf Counters
• ETW Events
• XEvents (SQL)
• DMVs (SQL)
• Custom Tracing (Application-Generated Events)
(Expect to iterate on this – as you run your service, you find things you need and things that are not worth collecting – you tune frequently)
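As an illustration of the "Custom Tracing" bucket, the sketch below shows one possible shape for application-generated events: a tiny wrapper that stamps every event with ambient information (timestamp, source instance, severity) plus custom dimensions before handing it to a sink. This is not the schema from the talk; the field names and the `emit` helper are assumptions made for the example.

```python
# Minimal sketch of an application-generated trace event (assumed schema,
# not the one used in the talk). The point is to attach ambient info to
# every event so it can be correlated later in the pipeline.
import json
import socket
from datetime import datetime, timezone

def make_event(name, level="Info", **dimensions):
    """Build a structured event with ambient fields plus custom dimensions."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": socket.gethostname(),   # e.g. role instance name
        "event": name,                    # e.g. "OrderFailed"
        "level": level,                   # Info / Warning / Error
        "dimensions": dimensions,         # anything app-specific
    }

def emit(event, sink):
    """Hand the event to a sink (file, queue, agent); here just a callable."""
    sink(json.dumps(event))

# Example usage: write events to stdout; a real service would send them
# to an agent or broker instead.
emit(make_event("OrderFailed", level="Error", order_id="1234", error_code=40613), print)
```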
7. Our Default Telemetry Configuration
• Log events from application code into WA Table Storage
• Manually query Table Storage to find data when there is a problem (no cross-partition or cross-table query support)
• Put each kind of data (errors, perf counters) in separate tables
• Hook up to on-premises SCOM and run machines like you do on-premises
• This model works fine for limited scales
• Often this is the "first attempt" at a telemetry system – re-using on-premises capabilities for the first cloud deployments
SCOM Azure Management Pack: http://www.microsoft.com/en-us/download/details.aspx?id=11324
[Diagram: Application → DB; DB telemetry → SCOM/Cerebrata]
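For reference, here is roughly what "log events into WA Table Storage and query them manually" looks like. This sketch uses the current azure-data-tables Python SDK rather than whatever SDK was available at the time; the table name, partitioning scheme, and connection string are placeholders, not the team's actual setup.

```python
# Sketch: write error events to Azure Table Storage and query them back.
# Uses the azure-data-tables package (pip install azure-data-tables);
# connection string, table name and PartitionKey scheme are illustrative.
from datetime import datetime, timezone
from azure.data.tables import TableClient

CONN_STR = "<storage-account-connection-string>"   # assumed placeholder
table = TableClient.from_connection_string(CONN_STR, table_name="errorevents")
# (the table must already exist, or be created once with table.create_table())

# Partition by hour so "what broke around 14:00?" stays a single-partition
# query; there is no cross-partition or cross-table query support, so the
# partition scheme has to match how you will look things up later.
now = datetime.now(timezone.utc)
entity = {
    "PartitionKey": now.strftime("%Y%m%d%H"),
    "RowKey": f"{now.strftime('%M%S%f')}-webrole_0",
    "EventName": "SqlTimeout",
    "Severity": "Error",
    "Message": "Timeout expired talking to the orders database",
}
table.create_entity(entity)

# Manual investigation: pull everything from that hour's partition.
for row in table.query_entities(f"PartitionKey eq '{entity['PartitionKey']}'"):
    print(row["RowKey"], row["EventName"], row["Message"])
```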
8. Our System Health Landscape
• Multitude of agents (WAD, SCOM, New Relic) that don't meet all the requirements
• Ability to transform arbitrary logging data into a common format (Logstash grok filter capability in the ELK stack)
• Target a diverse set of analytics repositories
• Surface-area support, e.g. no worker roles for App Insights
• Guarantees on the ingestion pipeline – how do I meet my time-to-detect and time-to-remediate requirements?
• Separation of channels (cold, warm, hot)
• Walled gardens: ambient-information focused vs. active application logging
• Lack of developer composition points – choice of analytics repository, access to the data, configurable pipeline properties
• Quick, iterative, temporal visualization, the ability to define derivative KPIs, and drill-down into active app logging are missing in our stack
[Diagram: pipeline stages Producers (devices, services, apps) → Collection → Broker → Transformation → Storage → Presentation and Action, with agents and tools (WAD agent, SCOM agent, 3rd-party agents, Visual Studio Online App Insights, SCOM, Build Your Own, BI tool) placed on a scale from "No Access" through "Configurable/Somewhat Extensible" to "Full Access / Fully Extensible"]
9. ELK is simple but works
In a nutshell: it provides the ability to take arbitrary data, transform it, and provide temporal visualization and search.
Pipeline:
• Transform arbitrary data easily: distributed Logstash agent, grok filters
• Load log content: using Logstash or hitting the REST endpoint :9200 with JSON
• Store documents in Elasticsearch: runs on Windows or Linux
• Quickly get insight: choose a timestamp field and get instant temporal visualization, plus fast interactive exploration with full search using Kibana
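The "hit the REST endpoint :9200 with JSON" option really is this simple. The sketch below posts one document straight to a local Elasticsearch node and includes a timestamp field so Kibana can plot it; the index and field names are made up for the example, and the URL shape varies slightly between old and new Elasticsearch versions.

```python
# Sketch: load a log document into Elasticsearch directly over REST,
# bypassing Logstash. Assumes a node listening on localhost:9200 and
# uses illustrative index/field names.
from datetime import datetime, timezone
import requests

doc = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),  # field Kibana plots on
    "level": "Error",
    "role": "webrole_0",
    "message": "Timeout expired talking to the orders database",
}

# POST to <index>/<type>; older Elasticsearch versions use a mapping type
# in the URL, newer ones use /_doc.
resp = requests.post("http://localhost:9200/telemetry/_doc", json=doc)
resp.raise_for_status()
print(resp.json()["_id"])
```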
10. (Near Real-Time) Alerting
• Batch pipes are great at doing things at scale
• But they are not fast – it often takes minutes to hours to process at scale
• Alerting for errors is all about speed (time-to-detect)
• This leads to a different class of solution for "fast pipe" monitoring
• We measure every incident on how long it took us to detect it
• We file repair bugs to keep driving that metric lower next time
• You need to be selective about what you pass through the fast pipe
• Perhaps you only look at key errors or pre-aggregate values
• Otherwise you will overwhelm the alerting system
• Storage efficiency is also key – I see lots of denormalized row solutions
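One way to read "pre-aggregate values" is shown below: instead of pushing every raw error event through the fast pipe, roll counts up per minute and per error code and forward only the aggregates. The windowing scheme and field names here are assumptions for illustration, not the team's implementation.

```python
# Sketch: pre-aggregate raw error events into per-minute counts before
# they enter the fast (alerting) pipe, so the alerting system sees a
# handful of counters instead of every raw event.
from collections import Counter
from datetime import datetime, timezone

def aggregate(events):
    """events: iterable of (timestamp, error_code) -> {(minute, code): count}."""
    counts = Counter()
    for ts, code in events:
        minute = ts.replace(second=0, microsecond=0)
        counts[(minute, code)] += 1
    return counts

# Example: three raw events collapse into two aggregate records.
now = datetime(2015, 2, 2, 14, 30, 12, tzinfo=timezone.utc)
raw = [(now, 40613), (now, 40613), (now, 40197)]
for (minute, code), count in aggregate(raw).items():
    print(minute.isoformat(), code, count)   # forward these to the fast pipe
```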
11. NRT Software
• Azure Event Hub is in public preview
• Lets you send messages and scale out as needed
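A minimal sketch of sending telemetry into Event Hubs follows. It uses the current azure-eventhub Python SDK, which is not the preview-era client the slide refers to; the connection string and hub name are placeholders.

```python
# Sketch: send pre-aggregated telemetry records to Azure Event Hubs.
# Uses the modern azure-eventhub package (pip install azure-eventhub);
# the connection string and event hub name are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

CONN_STR = "<event-hubs-namespace-connection-string>"
producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name="telemetry")

records = [
    {"minute": "2015-02-02T14:30:00Z", "error_code": 40613, "count": 2},
    {"minute": "2015-02-02T14:30:00Z", "error_code": 40197, "count": 1},
]

# Batch the records and send them; the downstream consumer (Storm, a
# stream processor, or an alerting job) reads from the hub's partitions.
with producer:
    batch = producer.create_batch()
    for record in records:
        batch.add(EventData(json.dumps(record)))
    producer.send_batch(batch)
```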
12. Another Option – Storm
• Storm (Apache) has a lot of flexibility
• You effectively build your own query plan
• Takes a bit of time to learn the model (Java-based)
13. Machine Learning
• After you have NRT alerts, you can do machine learning
• Applications: auto-tuned alerts; prediction models (for failures based on historical behaviors, etc.); watching multiple things for errors without defined alerts
• We use ML algorithms to detect new bugs in WA SQL Database (SQL Azure)
  • Watch all errors from all users (every minute or two)
  • See if new kinds of errors start spiking
  • Fire alerts for errors of appropriate severity
• This is far better than
  • Firing alerts with static limits (these break as your service grows)
  • Hand-coding each limit (takes a long time)
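The "watch all errors and see if new kinds of errors start spiking" idea can be approximated without any ML service at all. The toy detector below flags a minute whose error count sits several standard deviations above the recent baseline; the window length and threshold are arbitrary choices for the example, not the algorithm the SQL Database team actually runs.

```python
# Toy spike detector over per-minute error counts: flag a point that is
# more than k standard deviations above the mean of the trailing window.
# Window size and k are illustrative; a real system would tune them per
# error class and auto-level them as the service grows.
from statistics import mean, stdev

def spikes(counts, window=30, k=4.0):
    """counts: list of per-minute error counts -> indices of spiky minutes."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if counts[i] > mu + k * max(sigma, 1.0):  # floor sigma to avoid noise on flat data
            flagged.append(i)
    return flagged

# Example: steady background of ~5 errors/minute, then a burst.
series = [5, 6, 4, 5, 7, 5, 6, 5, 4, 6] * 4 + [60, 75]
print(spikes(series))   # -> [40, 41], the two burst minutes
```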
14. Using Machine Learning
• Option 1: Go get R – it is free
  • Then figure out how to pump lots of data through it, do alerts, etc.
• Option 2: Try the Azure ML Service (not free, but easier to start)
  • Go author a job and try it out
15. Watching User Errors for Spikes
• We watch user errors
• If they spike, our systems fire alerts
• False positives get corrected over time with auto-leveling of thresholds
• This is a simple example of SQL Azure user errors
16. Alerting Using Observation Models
• We measure our telemetry uploading agent to determine if it is sending data properly
• We can detect when spikes in data, or a lack of data, should cause alerts
• We file bugs/incidents automatically, with no human intervention
• This lets us detect and fix issues before our customers notice
• Example: we deployed a bug fix to reduce telemetry volume and our alerting caught it – we resolved the bug as an "expected change"
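The observation-model idea (alerting when the telemetry agent itself goes quiet or its upload volume changes sharply) can be sketched as a simple per-agent check. The grace period, drop ratio, and data shapes below are assumptions for the illustration, not the production model.

```python
# Sketch of an "observation model" over the telemetry pipeline itself:
# flag agents that have stopped uploading (silence) and agents whose
# upload volume has dropped sharply versus their recent average.
from datetime import datetime, timedelta, timezone

def quiet_agents(last_seen, now, max_silence=timedelta(minutes=10)):
    """last_seen: {agent: datetime of last upload} -> agents that went silent."""
    return [agent for agent, ts in last_seen.items() if now - ts > max_silence]

def volume_drops(history, latest, min_ratio=0.2):
    """history: {agent: [recent upload sizes]}, latest: {agent: newest size}.
    Flag agents whose newest upload is under min_ratio of their recent average."""
    flagged = []
    for agent, sizes in history.items():
        avg = sum(sizes) / len(sizes)
        if latest.get(agent, 0) < min_ratio * avg:
            flagged.append(agent)
    return flagged

now = datetime.now(timezone.utc)
print(quiet_agents({"webrole_0": now, "webrole_1": now - timedelta(hours=1)}, now))
print(volume_drops({"webrole_0": [100, 120, 110]}, {"webrole_0": 5}))
```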
17. Canonical Architecture To Think About
[Architecture diagram: Producers (applications, devices, Internet of Things, infrastructure devices, external data sources, instrumentation library f() / agent) → Collection → Broker (data broker, data directory) → Transformation (stream processing, map-reduce processing, text indexing, machine learning) → Long-term storage (data store, results cache) → Presentation and Action (dashboard, log search, distributed tracing, data analytics, App Insights, dial-tone services, connector for devices, monitoring, health state, remediation, multi-dimensional metrics, alerting, synthetic transactions)]
• Common schema and data encoding for democratization, correlation, and reuse of data
• Monitoring is a form of stream processing and logs a copy of its results to analytics