Monitoring of SmartNews

Nobutoshi Ogata

March 24, 2016

Transcript

  1. 2.

    Self Introduction • Nobutoshi Ogata • Manager, Site Reliability Engineering • @nobu666 • ❤ Whiskey, Cat, Heavy Metal • Contract development (10y) ➡ GREE infrastructure division (3y) ➡ Some startup (1y) ➡ SmartNews (2015/05-)
  2. 3.
  3. 6.
  4. 7.
  5. 8.
  6. 9.

    Before Datadog • We used: • Munin • GrowthForecast • CloudWatch • We wanted centralized management!
  7. 10.

    After Datadog - Phase1 • OK, we can manage centrally • But...? • We respect our engineers' freedom to develop as they like! • Problem: monitoring settings get left out
  8. 11.

    Phase2 • Introduced Interferon • Datadog DSL • Well, we can monitor all resources automatically • But...? • Not actively maintained! • Can't freely mute from the Web UI • Lack of flexibility
  9. 12.

    Phase3 • Integrated itamae • Our engineers were used to writing Chef • Easy to override default settings • It's asynchronous. Feel free to mute from the Web UI • Integrated dogaws @takus • Yet another Datadog CloudWatch Integration • We use it in combination with itamae
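The "easy to override default settings" point can be illustrated with a small sketch. This is not the team's itamae recipe (itamae is a Ruby tool) and not a Datadog API call; it only shows the general idea of keeping one default monitor definition and letting each service override a few fields. The names `DEFAULT_MONITOR` and `deep_merge`, and the example query, are hypothetical.

```python
import copy

# Hypothetical default monitor definition shared by every service.
DEFAULT_MONITOR = {
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{*} by {host} > 90",
    "options": {"notify_no_data": False, "thresholds": {"critical": 90}},
}


def deep_merge(defaults, overrides):
    """Return a new dict: defaults with overrides applied recursively.

    Neither input is modified, so the shared defaults stay intact.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A service overrides only the critical threshold; everything else is inherited.
service_monitor = deep_merge(
    DEFAULT_MONITOR, {"options": {"thresholds": {"critical": 95}}}
)
```

With this shape, a service definition only lists what differs from the defaults, which keeps per-service settings short.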
  10. 14.

    Datadog tips • Easy anomaly detection • Until quite recently, we couldn't compare over periods longer than 24 hours • We requested the ability to compare over a longer period. Thanks to Datadog for implementing it! • This is a non-public feature. If you want to use it, ask Datadog support
  11. 15.

    For example • Compare Kinesis records count EWMA pct_change(median(last_1h), 1w_ago):ewma_20(avg:aws.kinesis.incoming_records{env:production,cost:smartnews} by {name}) > 50 • Compare application warn log change(median(last_1h),1w_ago): sum:app.log.warn{env:production} by {autoscaling_group} > 25
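To make the arithmetic behind the first query concrete, here is a minimal sketch of "smooth with an EWMA, take the median of the last hour, and compare it with the same window one week ago as a percentage". It is not Datadog's implementation: the smoothing factor `alpha = 2 / (span + 1)`, the per-window smoothing, and the function names `ewma`, `pct_change`, and `should_alert` are all assumptions made for illustration.

```python
from statistics import median


def ewma(values, span=20):
    """Exponentially weighted moving average.

    Assumes alpha = 2 / (span + 1); Datadog's ewma_20 may weight differently.
    """
    alpha = 2.0 / (span + 1)
    smoothed = []
    for v in values:
        if not smoothed:
            smoothed.append(float(v))  # seed with the first observation
        else:
            smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def pct_change(current_window, past_window):
    """Percent change of the current window's median versus the past one."""
    past = median(past_window)
    if past == 0:
        raise ValueError("past median is zero; percent change is undefined")
    return (median(current_window) - past) / past * 100.0


def should_alert(current_window, past_window, threshold=50.0):
    """True when the smoothed median rose by more than `threshold` percent."""
    return pct_change(ewma(current_window), ewma(past_window)) > threshold


# Last hour's record counts versus the same hour one week ago (made-up data).
last_hour = [200, 210, 190, 205, 200]
week_ago = [100, 105, 95, 100, 100]
alert = should_alert(last_hour, week_ago)
```

Comparing against the same window a week earlier lets the weekly traffic pattern cancel out, which a fixed threshold on the raw metric cannot do.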
  12. 17.

    We're hiring! Only two people on the Site Reliability Engineering Team! • SmartNews is recruiting Site Reliability Engineers! • http://about.smartnews.com/en/careers/
  13. 18.