SRE at SmartNews

This talk covers the SRE team's work at SmartNews, focusing on monitoring and postmortems.

Nobutoshi Ogata

May 17, 2018

Transcript

  1. SRE @ SmartNews • SRE Lounge #3 • Nobutoshi Ogata • 2018/05/17

  2. Introduction

  3. Introduction: Self Introduction
    • 尾形 暢俊 (Nobutoshi Ogata / @nobu666)
    • Engineering Manager, SRE
    • 2015/05 Joined SmartNews
    • 2016/01 SRE team organized
    • For about a year along the way, I have also run Corporate IT on my own
  4. SmartNews

  5. SmartNews • 30M downloads worldwide • Delivering the world’s quality information to the people who need it
  6. Engineering Organization

  7. Engineering Organization (org chart labels) • Co-CEO • VPoE • Cross-functional Expert • SRE • Mobile App • Corporate Engineering • News and Content Delivery • US Engineering • Ads
  8. Responsibilities

  9. Responsibilities of an SRE
    • Automate, codify, and standardize operations
    • Log collection / analysis platform
    • Monitoring, provisioning, deployments, and development flows
    • Server-side security
    • Reviewing architecture decisions
    • Responding to incidents
    • Supporting postmortems
  11. Monitoring

  12. Monitoring: Datadog • Server metrics • Custom app metrics w/JMX

  13. Monitoring: New Relic • Application performance

  14. Monitoring: Chartio • Data exploration & visualization

  15. Monitoring: Runscope • API performance, data validation

  16. Monitoring: VAddy • Vulnerability scan w/CI

  17. Monitoring: CloudWatch + Lambda • AWS service limits
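
The slide does not show how the service-limit check is implemented. As a minimal sketch of the comparison a scheduled Lambda might perform, assuming usage and limit figures have already been fetched from AWS by some other means: the function name `find_limits_at_risk`, the 80% threshold, and the sample figures are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_limits_at_risk(
    usage: Dict[str, int],
    limits: Dict[str, int],
    threshold: float = 0.8,  # assumed alert ratio, not taken from the talk
) -> List[Tuple[str, float]]:
    """Return (resource, usage ratio) pairs at or above the threshold."""
    at_risk = []
    for name, limit in limits.items():
        if limit <= 0:
            continue  # skip unknown or unlimited quotas
        ratio = usage.get(name, 0) / limit
        if ratio >= threshold:
            at_risk.append((name, ratio))
    # Highest utilisation first so the most urgent item leads the alert.
    return sorted(at_risk, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    sample_usage = {"ec2-instances": 18, "elastic-ips": 2, "vpcs": 5}
    sample_limits = {"ec2-instances": 20, "elastic-ips": 5, "vpcs": 5}
    for name, ratio in find_limits_at_risk(sample_usage, sample_limits):
        print(f"{name}: {ratio:.0%} of limit")
```

Keeping the comparison as a pure function separates it from the AWS calls, so the alerting rule can be tested without credentials.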

  18. Monitoring: Search ELB logs w/Presto • Find logs quickly • Partitions created daily by Airflow DAGs
    $ presto --server presto.smartnews.internal:8081 --user $USER --catalog hive --schema default --execute "select count(*) from raw_elb_access_log where dt = '2018-05-04' and name = 'ELB-Name' and elb_status_code = '200'" --output-format TSV
    187130322
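
Because the table is partitioned by day, the same count can be parameterized by date. The following is a rough sketch that builds the argument list for the `presto` command shown on the slide; the helper name `build_elb_count_command` and its parameters are hypothetical, while the flags, table, and column names are copied from the slide.

```python
import datetime
from typing import List


def build_elb_count_command(
    day: datetime.date, elb_name: str, status_code: int, user: str
) -> List[str]:
    """Build the presto CLI arguments for a one-day ELB status count."""
    if "'" in elb_name:
        # Refuse names that would break out of the quoted SQL literal.
        raise ValueError("ELB name must not contain a single quote")
    query = (
        "select count(*) from raw_elb_access_log "
        f"where dt = '{day.isoformat()}' "
        f"and name = '{elb_name}' "
        f"and elb_status_code = '{int(status_code)}'"
    )
    return [
        "presto",
        "--server", "presto.smartnews.internal:8081",
        "--user", user,
        "--catalog", "hive",
        "--schema", "default",
        "--execute", query,
        "--output-format", "TSV",
    ]


if __name__ == "__main__":
    cmd = build_elb_count_command(
        datetime.date(2018, 5, 4), "ELB-Name", 200, "example-user"
    )
    print(cmd)
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a host where the `presto` CLI is installed. Passing a list rather than a shell string avoids a second layer of quoting.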
  19. Monitoring: AWS Config • AWS resource settings

  20. Monitoring: E-mail + Zapier • AWS retirement and maintenance notices

  21. Monitoring: E-mail + PagerDuty • Airflow SLA miss • Google Security Warning
  22. Monitoring: API + Jenkins • Presto slow queries • A Jenkins job hits the query API and parses the response every 5 minutes
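
The parsing step is not shown on the slide. One possible sketch, assuming the query API returns a JSON list in which each entry carries `queryId`, `state`, and `queryStats.elapsedTime` as a duration string such as `12.50m` (worth verifying against the Presto version in use): the function names and the 10-minute threshold are hypothetical.

```python
import re
from typing import Any, Dict, List

# Seconds per unit for duration strings such as "12.50m" or "3.20s".
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0,
}
_DURATION_RE = re.compile(r"^(\d+(?:\.\d+)?)\s*(ns|us|ms|s|m|h|d)$")


def parse_duration_seconds(text: str) -> float:
    """Convert a duration string like '12.50m' into seconds."""
    match = _DURATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


def find_slow_queries(
    queries: List[Dict[str, Any]],
    max_seconds: float = 600.0,  # assumed threshold, not from the talk
) -> List[str]:
    """Return IDs of RUNNING queries that have exceeded max_seconds."""
    slow = []
    for query in queries:
        if query.get("state") != "RUNNING":
            continue  # finished or queued queries are not reported
        elapsed = parse_duration_seconds(query["queryStats"]["elapsedTime"])
        if elapsed > max_seconds:
            slow.append(query["queryId"])
    return slow


if __name__ == "__main__":
    # Hypothetical payload for illustration only.
    sample = [
        {"queryId": "q1", "state": "RUNNING", "queryStats": {"elapsedTime": "12.50m"}},
        {"queryId": "q2", "state": "RUNNING", "queryStats": {"elapsedTime": "30.00s"}},
        {"queryId": "q3", "state": "FINISHED", "queryStats": {"elapsedTime": "2.00h"}},
    ]
    print(find_slow_queries(sample))
```

A job scheduled every five minutes would fetch the list over HTTP, pass the decoded JSON to `find_slow_queries`, and notify on a non-empty result.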
  23. Monitoring: Kaonavi + GitHub + Spreadsheet • Review requests for any repository are sent as mentions to any channel
  24. Postmortem

  25. Postmortem: Incident Response
    1. PagerDuty call
    2. Ack
    3. Investigate & respond
       a. Post everything in #incident: what you are about to do and what you have done
       b. Report the current situation in #status so that people outside the response, including non-engineers, can follow it
    4. Write the “Incident Report”
       a. The EM decides who writes it
    5. Hold the “Incident Review”
       a. The EM arranges the date, time, and attendees. Ogata and the primary author of the incident report must attend.
  26. Postmortem

  27. Postmortem: Purpose of incident reviews
    • To avoid repeating the mistake, discuss the following topics:
      ◦ Better ways to detect the situation
      ◦ Problems in the current development/operation process
      ◦ Useful tools or systems for improvement
      ◦ Automation
    • Don't criticize a person or a specific mistake
      ◦ Asking "Why did you do xxx instead of yyy?" is meaningless
      ◦ Think together about how to keep the situation from happening again
    • Clarify the action items for improvement at the end of the meeting
  28. Postmortem: Template of Incident Report
    # Description // incident overview
    # Area of Influence // scope of impact
    # Operation // response actions
    ## Timeline // sequence of events
    # Cause // cause of the incident
    ## Root Causes // root causes
    # Recurrence Prevention Measure // recurrence prevention measures
    ## Action Items
    # Supporting information // supplementary notes
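
To keep every report consistent with the template, the skeleton could be generated rather than copied by hand. This is only an illustrative sketch: the heading names come from the slide, while the function name `render_incident_report_skeleton` and the title line are hypothetical.

```python
from typing import List, Tuple

# (heading level, title) pairs taken from the template on the slide.
SECTIONS: List[Tuple[int, str]] = [
    (1, "Description"),
    (1, "Area of Influence"),
    (1, "Operation"),
    (2, "Timeline"),
    (1, "Cause"),
    (2, "Root Causes"),
    (1, "Recurrence Prevention Measure"),
    (2, "Action Items"),
    (1, "Supporting information"),
]


def render_incident_report_skeleton(title: str) -> str:
    """Return an empty incident report with every template heading."""
    lines = [f"Incident Report: {title}", ""]
    for level, name in SECTIONS:
        lines.append(f"{'#' * level} {name}")
        lines.append("")  # blank line left for the author to fill in
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_incident_report_skeleton("example incident"))
```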
  29. Monitoring Postmortem Tracking

  30. Any Questions?

  31. We’re Hiring! https://smartnews.workable.com/jobs/606363