SRE at SmartNews

This talk covers the SRE team's work at SmartNews, focusing on monitoring and postmortems.

Nobutoshi Ogata

May 17, 2018

Transcript

  1. 3.

    Self Introduction

    • 尾形 暢俊 (Nobutoshi Ogata / @nobu666)
    • Engineering Manager, SRE
    • Joined in 2015/05
    • SRE team organized in 2016/01
    • For about one year along the way, also single-handedly in charge of Corporate IT
  2. 5.

    SmartNews

    • 30M downloads worldwide
    • Delivering the world’s quality information to the people who need it
  3. 7.

    Engineering Organization (org chart, labels only): Co-CEO • Cross-functional Expert • SRE • Mobile App • VPoE • Corporate Engineering • News and Content Delivery • US Engineering • Ads
  4. 9.

    Responsibilities of an SRE

    • Automate, codify, and standardize operations
    • Log collection / analysis platform
    • Monitoring, provisioning, deployments, and development flows
    • Server-side security
    • Reviewing architecture decisions
    • Responding to incidents
    • Supporting postmortems
  6. 18.

    Monitoring: Search ELB logs w/ Presto

    • Find logs quickly
    • Partitions are created by Airflow DAGs every day

    $ presto --server presto.smartnews.internal:8081 --user $USER --catalog hive --schema default --execute "select count(*) from raw_elb_access_log where dt = '2018-05-04' and name = 'ELB-Name' and elb_status_code = '200'" --output-format TSV
    187130322
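As a rough illustration of how the query above could be parameterized by date and ELB name, here is a minimal Python sketch. The table, columns, server address, and CLI flags are taken from the slide; the helper names `build_count_query` and `build_presto_argv` and the input validation rules are assumptions added for illustration, not something the talk describes.

```python
import re
from datetime import date

# Assumption: ELB names and status codes are restricted to these shapes
# so they can be embedded in the SQL string without quoting problems.
_ELB_NAME = re.compile(r"[A-Za-z0-9-]+")
_STATUS = re.compile(r"[0-9]{3}")


def build_count_query(day: date, elb_name: str, status_code: str) -> str:
    """Build the count query from the slide for one daily partition."""
    if not _ELB_NAME.fullmatch(elb_name):
        raise ValueError(f"unexpected ELB name: {elb_name!r}")
    if not _STATUS.fullmatch(status_code):
        raise ValueError(f"unexpected status code: {status_code!r}")
    return (
        "select count(*) from raw_elb_access_log "
        f"where dt = '{day.isoformat()}' and name = '{elb_name}' "
        f"and elb_status_code = '{status_code}'"
    )


def build_presto_argv(query: str, user: str,
                      server: str = "presto.smartnews.internal:8081") -> list:
    """Return the presto CLI invocation as an argument list."""
    return [
        "presto", "--server", server, "--user", user,
        "--catalog", "hive", "--schema", "default",
        "--execute", query, "--output-format", "TSV",
    ]


if __name__ == "__main__":
    query = build_count_query(date(2018, 5, 4), "ELB-Name", "200")
    print(build_presto_argv(query, "alice"))
```

The argument list can be handed to `subprocess.run` as is, which avoids shell quoting of the SQL string. Filtering on `dt` keeps the scan limited to a single daily partition.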
  7. 22.

    Monitoring: API + Jenkins

    • Presto slow queries
    • Hit the query API and parse the response every 5 min.
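The parsing step of such a periodic check might look roughly like the sketch below. It assumes the query API returns a JSON array whose entries carry `queryId`, `state`, and `queryStats.elapsedTime` as a duration string such as `12.00m`; those field names, the threshold, and the function names are assumptions for illustration, and the HTTP call and Jenkins scheduling are left out.

```python
import json
import re

# Seconds per duration unit (assumed units of the duration strings).
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0,
}
_DURATION = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*(ns|us|ms|s|m|h|d)\s*$")


def parse_duration(text: str) -> float:
    """Convert a duration string such as '2.50m' into seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def find_slow_queries(payload: str, threshold_seconds: float) -> list:
    """Return (query_id, elapsed_seconds) for running queries over the threshold."""
    slow = []
    for query in json.loads(payload):
        if query.get("state") != "RUNNING":
            continue  # only queries that are still running are of interest
        elapsed = parse_duration(query["queryStats"]["elapsedTime"])
        if elapsed >= threshold_seconds:
            slow.append((query["queryId"], elapsed))
    return slow


if __name__ == "__main__":
    sample = json.dumps([
        {"queryId": "q1", "state": "RUNNING", "queryStats": {"elapsedTime": "12.00m"}},
        {"queryId": "q2", "state": "RUNNING", "queryStats": {"elapsedTime": "3.20s"}},
    ])
    print(find_slow_queries(sample, threshold_seconds=300))
```

Keeping the parser separate from the HTTP request makes it easy to test against a saved response before wiring it into a scheduled job.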
  8. 23.

    Monitoring: Kaonavi + GitHub + Spreadsheet

    • Review requests on any repository can be sent to any channel as a mention
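One way to picture the routing described above is a pair of lookup tables: one from GitHub login to chat handle, one from repository to channel. The sketch below is purely illustrative. The CSV column names, the default channel, and both function names are hypothetical, and the slide does not say how the actual integration is built.

```python
import csv
import io


def load_mapping(csv_text: str, key: str, value: str) -> dict:
    """Read a two-column lookup table from CSV text (e.g. a spreadsheet export)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row[value] for row in reader}


def build_review_notification(repo: str, pr_url: str, reviewer_login: str,
                              user_map: dict, channel_map: dict,
                              default_channel: str = "#review") -> tuple:
    """Return (channel, message) for a review request."""
    channel = channel_map.get(repo, default_channel)
    handle = user_map.get(reviewer_login)
    # Fall back to the raw GitHub login when no chat handle is known.
    mention = f"@{handle}" if handle else reviewer_login
    return channel, f"{mention} review requested on {repo}: {pr_url}"


if __name__ == "__main__":
    users = load_mapping("github_login,chat_handle\noctocat,taro\n",
                         "github_login", "chat_handle")
    channels = load_mapping("repo,channel\nexample/api,#api-dev\n",
                            "repo", "channel")
    print(build_review_notification(
        "example/api", "https://example.invalid/pr/1", "octocat", users, channels))
```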
  9. 25.

    Postmortem: Incident Response

    1. PagerDuty call
    2. Ack
    3. Investigate & respond
       a. Post everything in #incident: what you are about to do and what you have done
       b. Describe the current situation in #status so that people outside the response, including non-engineers, can follow it
    4. Write the “Incident Report”
       a. The EM decides who writes it
    5. Hold an “Incident Review”
       a. The EM is responsible for arranging the date, time, and attendees of the meeting. Attendance by Ogata and the primary author of the incident report is mandatory
  11. 27.

    Postmortem: Purpose of incident reviews

    • To avoid repeating the mistake, discuss the following topics:
      ◦ Better ways to detect the situation
      ◦ Problems in the current development/operation process
      ◦ Useful tools or systems for improvement
      ◦ Automation
    • Don't criticize a person or a specific mistake
      ◦ Asking "Why did you do xxx instead of yyy?" is meaningless
      ◦ Think together about how to keep the situation from happening again
    • Clarify the action items for improvement at the end of the meeting
  12. 28.

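    Postmortem: Template of Incident Report

    # Description // 障害概要 (incident summary)
    # Area of Influence // 影響範囲 (scope of impact)
    # Operation // 対応内容 (response actions)
    ## Timeline // 流れ (sequence of events)
    # Cause // 発生原因 (cause)
    ## Root Causes // 根本原因 (root causes)
    # Recurrence Prevention Measure // 再発防止策 (measures to prevent recurrence)
    ## Action Items
    # Supporting information // 補足 (supplementary notes)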
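A template like this is easy to stamp out programmatically when a new incident report is opened. The following is a minimal sketch that renders the headings from the slide as a Markdown skeleton; the function name, the `TODO` placeholder, and the idea of generating the file at all are assumptions, not something the talk states.

```python
# (heading level, title) pairs in the order shown on the slide.
SECTIONS = [
    (1, "Description"),
    (1, "Area of Influence"),
    (1, "Operation"),
    (2, "Timeline"),
    (1, "Cause"),
    (2, "Root Causes"),
    (1, "Recurrence Prevention Measure"),
    (2, "Action Items"),
    (1, "Supporting information"),
]


def render_incident_report_skeleton(placeholder: str = "TODO") -> str:
    """Render an empty incident report with one placeholder per section."""
    lines = []
    for level, title in SECTIONS:
        lines.append(f"{'#' * level} {title}")
        lines.append("")
        lines.append(placeholder)
        lines.append("")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    print(render_incident_report_skeleton())
```

Generating the skeleton from one list keeps every report in the same section order, which makes reports easier to compare during incident reviews.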