
Elastic{ON} 2018 - Monitoring Anything and Everything with Beats at eBay

Elastic Co

March 01, 2018

Transcript

  1. eBay Scale at a Glance: 1.2 PB application logs, 5M/s metric data points, 100K+ computes, 100B+ URL hits.
     • 1.2 PB of logs collected just from web servers.
     • Does not include drop-in deployments like Hadoop.
     • Log volume of drop-in solutions can easily be several PBs.

  2. The Lay of the Land…
     • Applications are built with internal frameworks.
     • Internal PaaS provides lifecycle management for applications built on supported frameworks.
     • Internal Logging and Monitoring APIs provide the ability to ship operational data.

  3. X-as-a-Service and Managed Ops
     • XaaS teams manage applications that are widely used inside of eBay.
     • XaaS takes care of PDMR.
     • Managed Ops provisions gear for other use cases.
     • The developer manages DMR.

  4. How Do I Ship My Logs?
     • Use internal APIs to ship to the centralized logging infrastructure when using frameworks.
     • Use open source or paid solutions for drop-in deployments.
     • If everything fails… do tail -F:
       % tail -F /var/log/foo.log

  5. How Do I Ship My Metrics?
     • Use internal APIs to ship to the internal monitoring platform when using frameworks.
     • Many, many open source solutions.

  6. The Challenges…
     • Too many free/paid options.
     • Each team manages its own.
     • Operations doesn't approve of all stacks.
     • Not every drop-in solution can be instrumented with internal APIs.

  7. The New Way of Deploying Applications…
     • Spec-based deployments.
     • Kubernetes manages lifecycle and provides self-healing.
     • Containerize everything.

  8. The dream…
     • Allow developers to drop in any application on Tess.
     • Shipping logs should be as simple as writing to stdout/stderr.
     • Shipping metrics should be as simple as exposing an HTTP endpoint.
     • Provide an interface for users to view their logs/metrics/events.
     • Operations approved!!!

  9. The challenges…
     • Users now have the flexibility to deploy any application with ease.
     • Each application has its own unique way of logging and generating metrics.
     • Pods are dynamic and can move around hosts.
     • Logs/metrics need to be tagged with Pod metadata before shipping.
     • The ability to discover workloads needs to be present.

  10. Logs…
     • Ability to tail log files.
     • Treat each line as a log message.
     • Parseable by a well-defined grok pattern.
     • Pass offset with each log message.
     • Ability to read JSON logs.
     • Ability to stitch stack traces (see the Filebeat sketch after the examples).

     JSON (Docker) logs:
```
{"log":"2017/06/30 23:01:57 [I] Completed 10.64.244.96 - \"GET /kietu/ HTTP/1.0\" 404 Not Found 2931 bytes in 524us\n","stream":"stderr","time":"2017-06-30T23:01:57.557914444Z"}
{"log":"2017/06/30 23:01:57 [I] Completed 10.64.244.96 - \"GET /sf2k/ HTTP/1.0\" 404 Not Found 2931 bytes in 360us\n","stream":"stderr","time":"2017-06-30T23:01:57.557918844Z"}
```
     Plain-text logs:
```
root@localhost:/home/vija# tail -F /var/log/syslog
Jul 29 07:17:01 tess-7829 CRON[6772]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jul 29 08:17:01 tess-7829 CRON[6775]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
```
     Multiline stack traces:
```
28-Jul-2017 15:14:28.527 SEVERE [main] org.apache.catalina.util.LifecycleBase.handleSubClassException Failed to initialize component [Connector[HTTP/1.1-8080]]
org.apache.catalina.LifecycleException: Protocol handler initialization failed
    at org.apache.catalina.connector.Connector.initInternal(Connector.java:935)
    at org.apache.catalina.util.LifecycleBase.init(LifecycleBase.java:136)
```

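     A minimal Filebeat-style sketch of how these requirements map onto stock configuration; the path and the multiline pattern are illustrative assumptions, not eBay's actual settings:
```yaml
filebeat.prospectors:
  - type: log
    paths:
      - /var/lib/docker/containers/*/*.log   # assumed Docker JSON-log location
    # Decode each Docker JSON line; the real message sits under the "log" key
    json.keys_under_root: true
    json.message_key: log
    # Stitch indented stack-trace continuation lines onto the preceding message
    multiline.pattern: '^[[:space:]]+at |^Caused by:'
    multiline.negate: false
    multiline.match: after
```
     Filebeat also tracks the byte offset of every line in its registry, which is what satisfies the "pass offset" requirement.
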
  11. Metrics…
     • Applications always generate metrics in a well-defined manner.
     • Metrics are either exposed or pushed in well-defined formats (see the Metricbeat sketch after the examples).

     Pull: MySQL status counters:
```
mysql> SHOW GLOBAL STATUS;
+------------------------------------------+-------------+
| Variable_name                            | Value       |
+------------------------------------------+-------------+
| Aborted_clients                          | 8           |
| Aborted_connects                         | 10          |
| Binlog_cache_disk_use                    | 0           |
| Binlog_cache_use                         | 0           |
| Binlog_stmt_cache_disk_use               | 0           |
| Binlog_stmt_cache_use                    | 0           |
```
     Pull: Prometheus endpoint:
```
vija@tess-7829:~$ curl http://prometheus.system.svc.31.tess.io/metrics
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000347879
go_gc_duration_seconds{quantile="0.25"} 0.000503623
go_gc_duration_seconds{quantile="0.5"} 0.000728009
go_gc_duration_seconds{quantile="0.75"} 0.003317691
go_gc_duration_seconds{quantile="1"} 0.061497008
```
     Push: Graphite plaintext protocol:
```
$ echo "local.random.diceroll 4 `date +%s`" | nc localhost 2003
```

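     A hedged Metricbeat-style sketch of collectors for these three formats; hosts, credentials, and protocol choices are illustrative assumptions:
```yaml
metricbeat.modules:
  # Pull: reads the SHOW GLOBAL STATUS counters from MySQL
  - module: mysql
    metricsets: ["status"]
    hosts: ["root:secret@tcp(127.0.0.1:3306)/"]   # assumed DSN
  # Pull: scrapes a Prometheus /metrics endpoint
  - module: prometheus
    metricsets: ["collector"]
    hosts: ["localhost:8080"]
    metrics_path: /metrics
  # Push: listens for Graphite plaintext-protocol datapoints
  - module: graphite
    metricsets: ["server"]
    protocol: "udp"
```
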
  12. Introducing Beats…
     • Beats: lightweight shippers of operational data. Extensible, with configurable data destinations.
     • Filebeat: lightweight log tailer. Handles JSON files, configurable multiline patterns, built-in back pressure based on the last offset shipped.
     • Metricbeat: lightweight metric collector. Modules understand how to collect metrics from various applications.

  13. What we did…
     • Built the `add_kubernetes_metadata` Beats processor to append Pod metadata (sketch below).
     • Added Metricbeat modules for Prometheus and Dropwizard.
     • Adopted Beats plugin support internally to manage eBay-specific code.
     • For the rest we wrote…

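     A minimal sketch of enabling that processor in a Filebeat config; the matcher path is the common Docker default and an assumption here:
```yaml
processors:
  # Enrich each event with the metadata of the Pod that produced it
  - add_kubernetes_metadata:
      in_cluster: true
      matchers:
        # Map a log file path back to the container/Pod that wrote it
        - logs_path:
            logs_path: /var/lib/docker/containers/   # assumed Docker log root
```
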
  14. What is Collectbeat?
     • A Beat built on top of libbeat that provides discovery capabilities.
     • Can be started in Filebeat/Metricbeat modes to collect logs/metrics.
     • Fits well into the Kubernetes ecosystem.
     • Can be extended to other discovery-based environments as well.
     • Can even collect logs from inside containers!!!

  15. Collectbeat Internals…
     • Discoverers can talk to endpoints to obtain metadata of deployed objects.
     • Builders build Beat configurations based on rules.
     • Appenders add additional parameters to Config objects.
     • RunnerFactory is used to deploy the generated configurations.
     [Diagram: Discoverer queries the discovery endpoint for metadata → qualifying metadata is passed to the Builders → generated configs are passed to the Appenders → RunnerFactory deploys the resulting input/metricset configs.]

  16. Using Collectbeat…
     • The user annotates a Pod with the kind of metrics the Pod exposes and the pattern used to stitch stack traces.
     • Collectbeat listens to the Kube API server for such annotations and spins up pollers to query metrics and prospectors to tail logs (full Pod sketch below).
```yaml
annotations:
  io.collectbeat.metrics/type: "prometheus"
  io.collectbeat.metrics/endpoints: ":8080"
  io.collectbeat.metrics/namespace: "tess-apps"
  io.collectbeat.metrics/interval: "1m"
  io.collectbeat.logs/pattern: "^[0-9]{4}"
```

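     In context, those annotations live in the Pod spec's metadata; a sketch with a hypothetical Pod name and image:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-frontend                      # hypothetical Pod name
  namespace: tess-apps
  annotations:
    io.collectbeat.metrics/type: "prometheus"
    io.collectbeat.metrics/endpoints: ":8080"
    io.collectbeat.metrics/namespace: "tess-apps"
    io.collectbeat.metrics/interval: "1m"
    io.collectbeat.logs/pattern: "^[0-9]{4}"   # a line starting with a year starts a new message
spec:
  containers:
    - name: app
      image: payments-frontend:1.0             # hypothetical image
      ports:
        - containerPort: 8080                  # matches the metrics endpoint annotation
```
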
  17. Collectbeat Deployment
     • Collectbeat is deployed as a DaemonSet, so one instance runs on every node (manifest sketch below).
     • Collectbeat uses the Kube API server to discover workloads.
     • Collected data is sent to the internal monitoring system, Sherlock.io.

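     A minimal DaemonSet sketch for such a deployment; the image path, namespace, and mounts are illustrative assumptions:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectbeat
  namespace: kube-system                 # assumed namespace
spec:
  selector:
    matchLabels:
      app: collectbeat
  template:
    metadata:
      labels:
        app: collectbeat
    spec:
      serviceAccountName: collectbeat    # needs RBAC to watch the Kube API server
      containers:
        - name: collectbeat
          image: registry.example/collectbeat:latest   # hypothetical image
          volumeMounts:
            - name: containerlogs
              mountPath: /var/lib/docker/containers    # read container logs from the host
              readOnly: true
      volumes:
        - name: containerlogs
          hostPath:
            path: /var/lib/docker/containers
```
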
  18. Sherlock.io Architecture
     • Technology constantly changes!!!
     • Standardize on Ingress and Egress APIs.
     • Build Adapters and Storage APIs.
     • Onboard processing capabilities onto Flink.

  19. Beats working with Sherlock.io
     • Manage custom code as Beats plugins.
     • New http2 output plugin to ingest into the Ingress APIs (illustrative config below).
     • Processors to manipulate standard Beat payloads into Ingress-style payloads.
     • Custom Collectbeat Appenders to restructure fields as required by the Ingress APIs.

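     The http2 output is eBay-internal, so the snippet below is purely an assumption about how such a custom output would be switched on; it is not a stock Beats option:
```yaml
# Hypothetical: internal http2 output plugin (not part of stock Beats)
output.http2:
  hosts: ["ingress.sherlock.example:443"]   # assumed Ingress API endpoint
```
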
  20. Resources are finite…
     • Organic logging growth is 50% YoY.
     • Monitoring infrastructure expenditure is usually around 10%.
     • Rogue logging can potentially kill the platform.
     • Checks and balances are essential.

  21. There shall be established limits…
     • Applications shall be metered at the host/pool level.
     • Applications, based on Tier, shall have:
       • Set quotas
       • Set retention
     • Any logs/metrics/events beyond the ingest threshold shall be shed.
     • Limits shall be determined based on available CPU/memory/storage.
     • Provide autodiscover with metering information for every source being polled, and stop modules once a source exceeds its quota.

  22. There shall be sampling…
     • Soft quotas can be established, beyond which events get sampled.
     • Beats would drop/retain messages based on configured conditions.
     • Maybe use `drop_event`, `include_lines`, `exclude_lines` (sketch below).

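     A hedged sketch of condition-based shedding with those stock options; the path and conditions are illustrative assumptions (`include_lines` is the allow-list counterpart of `exclude_lines`):
```yaml
filebeat.prospectors:
  - type: log
    paths:
      - /var/log/app/*.log                 # assumed path
    exclude_lines: ['^DEBUG', '^TRACE']    # shed chatter at the source
processors:
  # Drop whole events that match a condition once past the soft quota
  - drop_event:
      when:
        contains:
          message: "healthcheck"           # illustrative condition
```
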
  23. There shall be fairness…
     • Quality of Service (QoS) is crucial for Tier 1 applications and state-change signals.
     • Introduce Event Schedulers into the Beat pipeline.
     • Allow autodiscover to add "weights" to configurations.
     [Diagram: Producer → Scheduler (weighted fair queue) → Message Queue.]

  24. There shall only be stock Beats…
     • Collectbeat meets the needs of today.
     • Beats is a rapidly evolving community.
     • Vendor upgrades can become more and more cumbersome.
     • Move all of Collectbeat into autodiscovery (sketch below).
     • Retire Collectbeat!!!

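     For reference, stock autodiscover (in Beats 6.x) already expresses the same discover-and-configure pattern; a minimal sketch, with the namespace condition as an illustrative assumption:
```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        # When a Pod appears in the assumed namespace, tail its container logs
        - condition:
            equals:
              kubernetes.namespace: tess-apps   # illustrative condition
          config:
            - type: docker
              containers.ids:
                - "${data.kubernetes.container.id}"
```
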
  25. The possibilities are infinite…
     • If we could do it inside of Kubernetes, then maybe we could do it outside.
     • Package Beats with the internal package manager.
     • Bundle the package with the OS image.
     • Beats will be omnipresent.

  26. Growing list of applications…
     • New applications are open sourced every day. How do we keep up?
     • Community-based development.
     • Go after formats and protocols.
     • >100 applications supported out of the box!!!

  27. What we have so far…
     • All Kubernetes clusters at eBay are monitored with Beats.
     • No custom code running internally.
     • 40+ PRs to the Elastic Beats repository.
     • Highest contributor to Beats outside of Elastic!!
     • In progress:
       • Packaging and deploying Beats outside of Kubernetes.
       • Merge of Collectbeat into autodiscovery.
       • QoS in libbeat.

  28. What's next?
     • Sampling and rate limiting.
     • Get Beats deployed in all data centers.
     • Will be Operations approved!
     • More applications! More applications! More applications! :)