How to monitor Mesos

Alexis Lê-Quôc

August 20, 2015

Transcript

  1. How to monitor Mesos Alexis Le-Quoc (@alq) MesosCon 2015 https://goo.gl/8FI1fk

  2. What brings me here • CTO at Datadog • Monitoring for a living since 2010
  3. Datadog in a few words • Monitoring for modern apps ◦ cloud native ◦ microservices • Collect metrics + events: ◦ analyze, graph, detect anomalies and alert • Over 100 built-in integrations • Mesos & Marathon are among the newest ◦ initially customer-contributed ◦ now officially supported
  4. Datadog in a few pictures: Alerts & Anomalies, Collaboration, Metrics

  5. Out of the box integrations - OS - Datastores - Queues - Containers - Web servers - IaaS: AWS, Azure, GCE - SaaS - Escalation services - ChatOps
  6. Customer demand for Mesos monitoring • Clear uptick in Docker adoption since 1.0 • Growing production use • Next logical step: orchestration • Be where our customers are going
  7. Table of contents: 1. Monitoring theory: How to monitor X 2. Monitoring: from imperative to declarative 3. Applying the theory: key metrics 4. How we collect Mesos metrics 5. Putting it all together
  8. Monitoring theory: How to monitor X

  9. Too many metrics! • Mesos: 98 different metrics • Marathon: 90 different metrics • Not even thinking about frameworks or services… Need a rational method to reduce the metric space
  10. https://goo.gl/t1Rgcg

  11. tl;dr data types

  12. tl;dr act on work metrics. Fine print: some resource metrics (e.g. no disk space, no master, no slaves) are actionable resource metrics
  13. tl;dr recurse down the rabbit hole

  14. From imperative to declarative

  15. Degraded is the new normal • In distributed apps, there is always something broken (hence orchestration and resource scheduling) • In distributed apps, tasks are containerized and have very short lifecycles (seconds to minutes) • In distributed apps, tasks don’t always have consistent locality
  16. The old monitoring model is dead! • Host-centric ◦ Cannot track workload migrations across hosts • Imperative ◦ “Host X must be running Cassandra”
  17. Imperative, host-centric monitoring

  18. Query-based monitoring • Aggregates matter since the actual infrastructure is layers of abstraction below • Everything is expressed as queries on predicates ◦ “sum of running tasks across all slaves > 0” ◦ “max of elected master across all masters == 1” • Predicates are based on metadata (aka tags, labels)
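A minimal sketch of the two predicates quoted on the query-based monitoring slide, assuming metric samples arrive as values with a set of tags attached. The `Point` type, the `select` helper, the metric names and the tag strings are all hypothetical and chosen only for illustration; this is not how Datadog evaluates queries internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One metric sample plus the tags (metadata) describing its source."""
    metric: str
    value: float
    tags: frozenset


def select(points, metric, required_tags=()):
    """Return the values of `metric` from every source carrying all required tags."""
    required = frozenset(required_tags)
    return [p.value for p in points if p.metric == metric and required <= p.tags]


# Hypothetical samples: two slaves and three masters in one cluster.
points = [
    Point("tasks_running", 4, frozenset({"role:slave", "cluster:prod"})),
    Point("tasks_running", 0, frozenset({"role:slave", "cluster:prod"})),
    Point("elected", 1, frozenset({"role:master", "cluster:prod"})),
    Point("elected", 0, frozenset({"role:master", "cluster:prod"})),
    Point("elected", 0, frozenset({"role:master", "cluster:prod"})),
]

# "sum of running tasks across all slaves > 0"
tasks_ok = sum(select(points, "tasks_running", ["role:slave"])) > 0

# "max of elected master across all masters == 1"
master_ok = max(select(points, "elected", ["role:master"]), default=0) == 1

print(tasks_ok, master_ok)
```

Neither predicate names a host: a slave can disappear or a master can be re-elected elsewhere and the checks still hold, as long as the aggregate over the tagged group stays healthy.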
  19. Declarative, query-based monitoring

  20. Applying the theory: key metrics

  21. Mesos at a high level • Masters: 1. Broker resources between slaves and frameworks 2. Distribute tasks to slaves • Slaves: 1. Execute tasks
  22. Masters’ work/resource metrics. Work: • elected master • running tasks • finished tasks • lost tasks • failed tasks • error tasks. Resources: • cpu/net/disk/mem • slaves • messages • events • …
  23. Slaves’ work/resource metrics. Work: • sum(running tasks) • sum(finished tasks) • sum(lost tasks) • sum(failed tasks) • sum(error tasks). Resources: • cpu • network • memory • executors • messages • …
  24. Metadata • cluster • role • mesos pid • application version • infrastructure (e.g. availability zone, instance type, etc.)
  25. How we collect Mesos metrics

  26. Mesos classic. When running on full-blown Linux nodes: 1. Run agent directly on OS (chef/puppet/etc.) a. sudo apt-get install datadog-agent b. Edit mesos_master|slave.yaml 2. Agent hits metrics endpoint on localhost:5050 and localhost:5051 3. Agent collects stats and metadata every 10s
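To make step 2 concrete, here is a minimal sketch of reading a metrics endpoint and keeping only the masters' work metrics from the earlier slide. It assumes the master serves a flat JSON object at `/metrics/snapshot` on port 5050 and uses the `master/...` key names Mesos publishes there; the function names and the sample values are hypothetical, and this is not the Datadog agent's actual check.

```python
import json
from urllib.request import urlopen

# Assumption: the master's metrics endpoint lives at this path on localhost.
MASTER_URL = "http://localhost:5050/metrics/snapshot"

# Work metrics listed on the masters slide, under their assumed Mesos key names.
WORK_KEYS = (
    "master/elected",
    "master/tasks_running",
    "master/tasks_finished",
    "master/tasks_lost",
    "master/tasks_failed",
    "master/tasks_error",
)


def fetch_snapshot(url=MASTER_URL, timeout=5):
    """Fetch the metrics endpoint and decode its JSON body into a dict."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def work_metrics(snapshot):
    """Keep only the work metrics, dropping everything else in the snapshot."""
    return {key: snapshot[key] for key in WORK_KEYS if key in snapshot}


# Hypothetical snapshot, standing in for a live call to fetch_snapshot().
sample = {
    "master/elected": 1.0,
    "master/tasks_running": 12.0,
    "master/tasks_lost": 0.0,
    "master/cpus_percent": 0.4,
    "system/load_1min": 0.8,
}

print(work_metrics(sample))
```

Against a live master, `work_metrics(fetch_snapshot())` called every 10 seconds would approximate the collection loop described on the slide.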
  27. Mesos + Marathon. Same as classic except: 1. Run agent in Docker container a. https://hub.docker.com/r/alq666/docker-dd-agent/ 2. Agent monitors other local Docker containers 3. Agent monitors local slave and master
  28. Mesos + Marathon • Placement (marathon.json) "constraints": [ ["hostname", "UNIQUE"], ["hostname", "GROUP_BY"] ], • Currently requires setting instances to the number of slaves
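A rough sketch of what the placement slide implies for `marathon.json`: one agent instance per slave, pinned by the hostname constraints shown above. The app id and the `agent_app` helper are hypothetical, the image name is taken from the Docker Hub URL on the previous slide, and the agent's own settings are left out, so treat this as an outline and not a deployable definition.

```python
import json


def agent_app(slave_count, image="alq666/docker-dd-agent"):
    """Build a Marathon app definition that runs one agent per slave."""
    return {
        "id": "/datadog-agent",  # hypothetical app id
        "instances": slave_count,  # must be set by hand to the number of slaves
        "constraints": [["hostname", "UNIQUE"], ["hostname", "GROUP_BY"]],
        "container": {"type": "DOCKER", "docker": {"image": image}},
    }


# Example: a cluster with three slaves.
print(json.dumps(agent_app(3), indent=2))
```

Because `instances` is a fixed number, the definition has to be updated whenever slaves are added or removed, which is the limitation the slide points out.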
  29. DCOS (on AWS). Same as Mesos + Marathon, except: 1. Packaged for DCOS in universe (PR) 2. Simple options file 3. dcos package update 4. dcos package install --yes --options=datadog.json datadog 5. Profit
  30. Demo

  31. Thank you Feedback? @alq