Black Box Monitoring in Go

You've set up your favorite monitoring agent on all of your services and you have incredible visibility into the internals of your infrastructure. All seems good in the world, but can your end user actually use your service? In this talk we'll write a simple app that simulates end user activity.

Grant Griffiths

August 03, 2018

Transcript

  1. Black Box Monitoring in Go

    An introduction to building black box monitors in Go
    Grant Griffiths, Software Engineer, Platform Cloud Engineering, GE Digital
  2. Who am I?

    Climbing + Mountaineering
    GE Digital - Platform Cloud Engineering
    Sr. Software Engineer
    San Francisco, CA
    GE Go User Group Founder & Organizer
    @ggriffiths @griffithsgrant
  3. [Outline]

    1. Engineering @ GE
    2. Introduction to Black Box Monitoring
    3. Black Box monitoring @ GE
    4. Sample black box monitor
    5. Takeaways
  4. Fun facts about GE

    • GE Power generates roughly 33% of the world’s electricity
    • Every two seconds, an aircraft powered by GE takes off
    • 35,000 wind turbines globally
    • 25% of all global hydropower
  5. Some customers

    • Schindler
    • Exelon
    • Rosneft
    • BP
    • GE Power
    • GE Aviation
    • GE Renewables
  6. Industrial Internet of Things (IIoT)

    • GE assets produce petabytes of useful and mission-critical data
    • Stored and analyzed in Predix, our platform for IIoT
    • Downtime can have a huge impact: lives lost and millions of dollars
    • We must focus on reliability and availability
  7. [Outline slide repeated; see slide 3]
  8. Site Reliability Engineering

    • “Fundamentally, it’s what happens when you ask a software engineer to design an operations function.” - Ben Treynor, VP of Engineering @ Google
    • Using software expertise to solve operations problems through automation
    • Google SRE Book - landing.google.com/sre/book
      o Additionally, the Site Reliability Workbook (free to download until Aug 23)
      o Covers monitoring amongst many other topics
  9. What is Black Box Monitoring?

    • Testing externally visible behavior as a user would see it
    • Questions that it answers
      o Is my service up?
      o Is my service consumable?
      o Is a dependency of my service down?
    • Node uptime != service uptime
  10. What is NOT Black Box monitoring?

    • White box monitoring
      o CPU/RAM/Disk utilization
      o JVM stats
      o pprof/tracing
      o Logging
      o System internals
    • Individual component health checks
    • Anything the user cannot find out
  11. What are probes?

    • Small applications that measure service uptime
    • Know nothing about system internals
    • Simulate an end user (black box)
    • Automated; page engineers upon failure
  12. Simple REST API Architecture

    [Diagram: User 1, User 2, ... User n send GET /v1/users to a REST API backed by PostgreSQL]
  13. Let’s add a Probe to monitor it

    [Diagram: a Probe joins the users in sending GET /v1/users to the REST API, and an APM solution (New Relic, etc.) is attached to the probe. Labels: "Every X seconds", "Query success"]
  14. Service queries DB

    [Diagram: same setup; the REST API's query to PostgreSQL is labeled "Query Failed"]
  15. Probe returns 500

    [Diagram: same setup, "Query Failed"; the probe answers the APM solution with "Return 500"]
  16. Page engineer on PostgreSQL team

    [Diagram: same setup, "Query Failed" and "Return 500"; the APM solution pages a sleeping engineer]
  17. Monitoring a simple REST API – Problem fixed by engineer

    [Diagram: same setup, now "Query Success"; the engineer goes back to sleep. Note on slide: "Auto healing would be nice…"]
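
The probe flow on slides 13 to 17 can be sketched in Go roughly as follows. This is a minimal sketch, not the code from the talk: the names `checkUsers`, `statusFor`, `probe`, and `demo` are hypothetical, the `/status` path is made up, and an `httptest` server stands in for the REST API so the example runs on its own. It also assumes the APM solution polls the probe over HTTP, as the later M&D slides describe.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// checkUsers issues the same request an end user would and reports
// whether the service answered with 200 OK.
func checkUsers(client *http.Client, baseURL string) bool {
	resp, err := client.Get(baseURL + "/v1/users")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// statusFor maps the binary probe result to the status code the
// APM solution sees when it polls the probe.
func statusFor(up bool) int {
	if up {
		return http.StatusOK
	}
	return http.StatusInternalServerError
}

// probe remembers the latest check result and serves it over HTTP.
type probe struct {
	up atomic.Bool
}

func (p *probe) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(statusFor(p.up.Load()))
}

// demo runs one check while the stand-in API works and one while its
// "database" is down, and returns what the APM solution would see.
func demo() (healthy, broken int) {
	var dbDown atomic.Bool
	api := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if dbDown.Load() {
			http.Error(w, "query failed", http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, "[]")
	}))
	defer api.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	p := &probe{}

	// poll performs one probe check, then one APM request against the probe.
	poll := func() int {
		p.up.Store(checkUsers(client, api.URL))
		rec := httptest.NewRecorder()
		p.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/status", nil))
		return rec.Code
	}

	healthy = poll()
	dbDown.Store(true)
	broken = poll()
	return healthy, broken
}

func main() {
	healthy, broken := demo()
	fmt.Println("database up:  ", healthy)
	fmt.Println("database down:", broken)
}
```

In a deployed probe the check would run on a timer and the handler would be registered with `http.Handle` and served with `http.ListenAndServe`; paging is left to the APM solution, as on the slides.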
  18. Why Go for these probes?

    • Easily simulate end users with the time package
    • Simple and concise HTTP package (net/http) for interacting with services
    • Create HTTP endpoints for New Relic or other services
    • Fits with cloud & SRE ecosystem
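
As one illustration of the first bullet, the `time` package's `Ticker` is enough to drive a simulated user on a fixed schedule. The helper name `every` and its stop-when-false contract are assumptions made so the sketch terminates; they are not from the talk.

```go
package main

import (
	"fmt"
	"time"
)

// every calls check immediately and then once per interval, stopping as
// soon as check returns false. It returns the number of calls made.
func every(interval time.Duration, check func() bool) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	calls := 0
	for {
		calls++
		if !check() {
			return calls
		}
		<-ticker.C // wait for the next tick before checking again
	}
}

func main() {
	remaining := 3
	n := every(50*time.Millisecond, func() bool {
		fmt.Println("simulated user request at", time.Now().Format(time.StampMilli))
		remaining--
		return remaining > 0
	})
	fmt.Println("checks run:", n) // prints "checks run: 3"
}
```

A long-running probe would simply have `check` keep returning true, or replace the return value with a `context.Context` for shutdown.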
  19. Probe best practices

    • Check components that will provide the most visibility
    • Should be lightweight in deployment and operational cost
    • Probe status should be binary – your service is either up or down
  20. Advanced probe features

    • Service SLA measurements
    • Detailed status on service components
    • Load spikes and behavior
    • Auto healing systems
    • Randomized user activity
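
The slides do not say how SLA measurements are taken. One simple possibility, shown here purely as an assumption, is to count probe results and report the share of successful checks. The type `availability` and its methods are hypothetical names.

```go
package main

import "fmt"

// availability counts probe results so an uptime percentage can be reported.
// It is not safe for concurrent use; a real probe would guard it with a mutex.
type availability struct {
	total, up int
}

// record adds the outcome of one probe check.
func (a *availability) record(ok bool) {
	a.total++
	if ok {
		a.up++
	}
}

// percent returns the share of successful checks, or 0 before any check ran.
func (a *availability) percent() float64 {
	if a.total == 0 {
		return 0
	}
	return 100 * float64(a.up) / float64(a.total)
}

func main() {
	var a availability
	for _, ok := range []bool{true, true, false, true} {
		a.record(ok)
	}
	fmt.Printf("availability: %.1f%%\n", a.percent()) // prints "availability: 75.0%"
}
```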
  21. [Outline slide repeated; see slide 3]
  22. Monitoring and Diagnostics service (M&D)

    • Stores time series data
    • Store datapoints via WSS
    • Query datapoints via REST
    • Internal components for parsing, writing, etc.
      o Many moving parts
      o Black box monitor covers all components

    M&D Ingestion Message:

    {
      "messageId": "<MessageID>",
      "body": [{
        "name": "<TagName>",
        "datapoints": [
          [ <Timestamp>, <Value>, <Quality> ],
          [ <Timestamp>, <Value>, <Quality> ],
          [ <Timestamp>, <Value>, <Quality> ]
        ]
      }]
    }
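
The ingestion message above could be modeled in Go along these lines. The JSON field names come from the slide; the Go type names, the helper `buildMessage`, and the field types (epoch-millisecond timestamps, float values, integer quality codes) are assumptions, since the slide only shows placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Datapoint is one [timestamp, value, quality] triple.
// The field types are assumptions; the slide only shows placeholders.
type Datapoint struct {
	Timestamp int64   // assumed: epoch milliseconds
	Value     float64 // assumed: numeric measurement
	Quality   int     // assumed: integer quality code
}

// MarshalJSON encodes the datapoint as a three-element JSON array
// instead of a JSON object, matching the message layout on the slide.
func (d Datapoint) MarshalJSON() ([]byte, error) {
	return json.Marshal([]interface{}{d.Timestamp, d.Value, d.Quality})
}

// Tag groups the datapoints recorded for one tag name.
type Tag struct {
	Name       string      `json:"name"`
	Datapoints []Datapoint `json:"datapoints"`
}

// IngestMessage is the top-level ingestion payload.
type IngestMessage struct {
	MessageID string `json:"messageId"`
	Body      []Tag  `json:"body"`
}

// buildMessage returns the JSON text for a single-tag ingestion message.
func buildMessage(id, tag string, points ...Datapoint) (string, error) {
	msg := IngestMessage{
		MessageID: id,
		Body:      []Tag{{Name: tag, Datapoints: points}},
	}
	b, err := json.Marshal(msg)
	return string(b), err
}

func main() {
	out, err := buildMessage("msg-1", "probe_tag",
		Datapoint{Timestamp: 1533254400000, Value: 42.5, Quality: 3})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Sending the message over WSS would need a WebSocket client, which is not part of the Go standard library and is left out here.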
  23. Monitoring and Diagnostics service (M&D)

    [Diagram: an Edge device/App ingests datapoints into the Cloud Gateway (Go), which publishes to Apache Kafka; the Pipeline (Go) subscribes and writes to Apache Cassandra (C*); the Customer Query App queries for datapoints through the Query Service (Java), which queries C*]
  24. Probe ingests to cloud gateway every 10 seconds

    [Diagram: the M&D architecture from the previous slide with an M&D Probe added. Label: "Ingest for probe_tag every 10 seconds"]
  25. Cloud gateway publishes to Kafka

    [Diagram: M&D architecture with the M&D Probe. Label: "Ingest for probe_tag every 10 seconds"]
  26. Pipeline parses data from Kafka

    [Diagram: M&D architecture with the M&D Probe. Label: "Ingest for probe_tag every 10 seconds"]
  27. Pipeline writes to C*

    [Diagram: M&D architecture with the M&D Probe. Label: "Ingest for probe_tag every 10 seconds"]
  28. Every X seconds, APM solution checks the probe

    [Diagram: M&D architecture with the M&D Probe and an APM solution (New Relic, etc.). Labels: "Ingest for probe_tag every 10 seconds", "Check Probe", "Query for probe_tag past 30 seconds data"]
  29. Probe queries for probe_tag data in the past 30 seconds

    [Diagram: M&D architecture with the M&D Probe and an APM solution (New Relic, etc.). Labels: "Ingest for probe_tag every 10 seconds", "Check Probe", "Query for probe_tag past 30 seconds data"]
  30. Query service checks C*

    [Diagram: M&D architecture with the M&D Probe and an APM solution (New Relic, etc.). Labels: "Ingest for probe_tag every 10 seconds", "Check Probe", "Query for probe_tag past 30 seconds data"]
  31. Returned to APM solution

    [Diagram: M&D architecture with the M&D Probe and an APM solution (New Relic, etc.). Labels: "Ingest for probe_tag every 10 seconds", "Query for probe_tag past 30 seconds data", "Return 500 if no data", "Return 200 if data exists"]
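
The decision the M&D probe makes on slides 28 to 31 can be sketched as below. The `Ingester` and `Querier` interfaces, `statusAt`, `memoryStore`, and `demo` are hypothetical stand-ins, not Predix APIs: the real probe talks to the cloud gateway over WSS and to the query service over REST, which this sketch replaces with an in-memory fake. The 10-second ingest timer is omitted; only the 30-second window check and the 200/500 mapping are shown.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Ingester and Querier are hypothetical stand-ins for clients of the
// cloud gateway (ingest) and the query service (read).
type Ingester interface {
	Ingest(tag string, at time.Time, value float64) error
}

type Querier interface {
	CountSince(tag string, since time.Time) (int, error)
}

const (
	probeTag    = "probe_tag"
	queryWindow = 30 * time.Second
)

// statusAt returns 200 if at least one probe_tag datapoint exists within
// the window ending at now, and 500 otherwise (including on query errors).
func statusAt(q Querier, now time.Time) int {
	n, err := q.CountSince(probeTag, now.Add(-queryWindow))
	if err != nil || n == 0 {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

// memoryStore is an in-memory fake that plays both roles for the demo.
type memoryStore struct {
	mu     sync.Mutex
	points map[string][]time.Time
}

func (m *memoryStore) Ingest(tag string, at time.Time, _ float64) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.points == nil {
		m.points = make(map[string][]time.Time)
	}
	m.points[tag] = append(m.points[tag], at)
	return nil
}

func (m *memoryStore) CountSince(tag string, since time.Time) (int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := 0
	for _, at := range m.points[tag] {
		if !at.Before(since) {
			n++
		}
	}
	return n, nil
}

// demo returns the probe status before any ingest, shortly after an
// ingest, and once the newest datapoint is older than the window.
func demo() (empty, fresh, stale int) {
	store := &memoryStore{}
	var in Ingester = store
	start := time.Date(2018, time.August, 3, 12, 0, 0, 0, time.UTC)

	empty = statusAt(store, start)
	_ = in.Ingest(probeTag, start, 1)
	fresh = statusAt(store, start.Add(10*time.Second))
	stale = statusAt(store, start.Add(time.Minute))
	return empty, fresh, stale
}

func main() {
	empty, fresh, stale := demo()
	fmt.Println(empty, fresh, stale) // prints "500 200 500"
}
```

Because the probe writes through one end of the system and reads through the other, a failure in the gateway, Kafka, the pipeline, Cassandra, or the query service all surface as the same missing datapoint.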
  32. Other probes at GE Digital

    • Monitors our big data scheduling & execution service (Predix Insights)
    • Submit Airflow DAG, execute analytic at an interval
    • Check for analytic success
    • Monitors our service instance creation
    • Create, Bind, Unbind, Delete service instances
    • Predix Event Hub, Blobstore, Columnar datastore
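
A lifecycle probe such as the create, bind, unbind, delete check can be structured as an ordered list of steps that stops at the first failure. The sketch below is an assumption about one way to organize that; `step`, `runLifecycle`, and `errStepFailed` are hypothetical names, and the step bodies are placeholders rather than real service broker calls.

```go
package main

import (
	"errors"
	"fmt"
)

// errStepFailed is a placeholder error used by the demo steps.
var errStepFailed = errors.New("step failed")

// step is one named action in a lifecycle check.
type step struct {
	name string
	run  func() error
}

// runLifecycle executes the steps in order and stops at the first failure.
// It returns the name of the failed step, or "" if every step passed.
func runLifecycle(steps []step) (string, error) {
	for _, s := range steps {
		if err := s.run(); err != nil {
			return s.name, fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return "", nil
}

func main() {
	ok := func() error { return nil }
	steps := []step{
		{"create", ok},
		{"bind", ok},
		{"unbind", func() error { return errStepFailed }},
		{"delete", ok},
	}
	failed, err := runLifecycle(steps)
	fmt.Println(failed, err) // prints "unbind unbind: step failed"
}
```

A production version would likely still attempt cleanup after a failed step so that test instances do not accumulate.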
  33. [Outline slide repeated; see slide 3]
  34. [Outline slide repeated; see slide 3]
  35. Takeaways

    • Node uptime != service uptime
    • Use Go to build black box monitors
    • Simulate end user traffic
    • GE uses this concept for many services in prod
    • Catches many production issues early that white box monitoring might miss