SRE at SmartNews

I will talk about the SRE team's work at SmartNews, focusing on monitoring and postmortems.

Nobutoshi Ogata

May 17, 2018

Transcript

  1. Self Introduction

    • 尾形 暢俊 (Nobutoshi Ogata / @nobu666) • Engineering Manager, SRE • Joined 2015/05 • SRE team organized 2016/01 • For about one year along the way, also in charge of Corporate IT single-handedly
  2. SmartNews

    • 30M downloads worldwide • Delivering the world’s quality information to the people who need it
  3. Engineering Organization

    (Org chart labels) Co-CEO • Cross-functional Expert • SRE • Mobile App • VPoE • Corporate Engineering • News and Content Delivery • US Engineering • Ads
  4. Responsibilities of an SRE

    • Automate, codify, and standardize operations • Log collection / analysis platform • Monitoring, provisioning, deployments, and development flows • Server-side security • Reviewing architecture decisions • Responding to incidents • Supporting postmortems
  5. Responsibilities of an SRE

    • Automate, codify, and standardize operations • Log collection / analysis platform • Monitoring, provisioning, deployments, and development flows • Server-side security • Reviewing architecture decisions • Responding to incidents • Supporting postmortems
  6. Monitoring: Search ELB logs w/ Presto

    • Find logs quickly • Partitions are created by Airflow DAGs every day

    $ presto --server presto.smartnews.internal:8081 --user $USER --catalog hive --schema default --execute "select count(*) from raw_elb_access_log where dt = '2018-05-04' and name = 'ELB-Name' and elb_status_code = '200'" --output-format TSV
    187130322
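The same invocation could be assembled programmatically when the date, ELB name, or status code varies. The following is a minimal sketch, not the team's actual tooling: the function name `build_presto_command` and the user `alice` are hypothetical, while the flags, server, table, and column names are copied from the command on the slide.

```python
import shlex


def build_presto_command(dt, elb_name, status_code, user,
                         server="presto.smartnews.internal:8081"):
    """Return the argv list for the presto CLI call shown on the slide."""
    # The values are embedded as SQL string literals, so reject quotes outright
    # instead of trying to escape them (a deliberate simplification).
    for value in (dt, elb_name, status_code):
        if "'" in value:
            raise ValueError(f"unexpected quote in {value!r}")
    # Filtering on dt keeps the scan inside a single daily partition.
    query = (
        "select count(*) from raw_elb_access_log "
        f"where dt = '{dt}' and name = '{elb_name}' "
        f"and elb_status_code = '{status_code}'"
    )
    return [
        "presto", "--server", server, "--user", user,
        "--catalog", "hive", "--schema", "default",
        "--execute", query, "--output-format", "TSV",
    ]


argv = build_presto_command("2018-05-04", "ELB-Name", "200", user="alice")
print(shlex.join(argv))  # a shell-safe rendering of the command
```

Returning an argv list rather than a single string means the result can be handed to `subprocess.run` without going through a shell.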
  7. Monitoring: API + Jenkins

    • Presto slow queries • Hit the query API and parse the response every 5 minutes.
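The parsing half of that check might look like the sketch below. It is an illustration under assumptions, not the job described in the talk: it assumes the query API returns a JSON list whose entries carry `queryId`, `state`, and `queryStats.elapsedTime` as a duration string such as `12.50m`, and the 5-minute threshold and function names are hypothetical. Fetching the response and scheduling the job in Jenkins are left out.

```python
import re

# Seconds per unit for duration strings such as "4.00s" or "12.50m" (assumed format).
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3,
                 "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def duration_to_seconds(text):
    """Convert a duration string like '12.50m' into seconds."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([a-z]+)\s*", text)
    if not match or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def find_slow_queries(queries, threshold_seconds=300.0):
    """Return (queryId, elapsed seconds) for running queries over the threshold."""
    slow = []
    for query in queries:
        if query.get("state") != "RUNNING":
            continue  # finished or failed queries are not reported here
        elapsed = duration_to_seconds(query["queryStats"]["elapsedTime"])
        if elapsed >= threshold_seconds:
            slow.append((query["queryId"], elapsed))
    return slow


# Hand-written sample standing in for a parsed API response.
sample = [
    {"queryId": "q1", "state": "RUNNING", "queryStats": {"elapsedTime": "12.50m"}},
    {"queryId": "q2", "state": "RUNNING", "queryStats": {"elapsedTime": "4.00s"}},
    {"queryId": "q3", "state": "FINISHED", "queryStats": {"elapsedTime": "2.00h"}},
]
print(find_slow_queries(sample))  # [('q1', 750.0)]
```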
  8. Monitoring: Kaonavi + GitHub + Spreadsheet

    • Review requests in any repository are sent to any channel as a mention
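One way such a notification could be put together is sketched below. The slide does not describe the implementation, so everything here is an assumption: the event is taken to have the shape of a GitHub `pull_request` webhook payload with the `review_requested` action, the `github_to_chat` mapping stands in for whatever the spreadsheet and Kaonavi provide, and `build_review_mention` is a hypothetical name. Posting the message to a channel is omitted.

```python
def build_review_mention(event, github_to_chat):
    """Return a mention message for a review request, or None if not applicable."""
    if event.get("action") != "review_requested":
        return None
    reviewer = event.get("requested_reviewer")
    if not reviewer:
        return None  # e.g. a team was requested rather than a single user
    handle = github_to_chat.get(reviewer["login"])
    if handle is None:
        return None  # no chat handle known for this GitHub account
    pull_request = event["pull_request"]
    repo = event["repository"]["full_name"]
    return (f"@{handle} review requested on {repo}: "
            f"{pull_request['title']} {pull_request['html_url']}")


# Hypothetical mapping and event, for illustration only.
github_to_chat = {"octocat": "octo"}
event = {
    "action": "review_requested",
    "requested_reviewer": {"login": "octocat"},
    "pull_request": {"title": "Fix typo",
                     "html_url": "https://github.com/example/repo/pull/1"},
    "repository": {"full_name": "example/repo"},
}
print(build_review_mention(event, github_to_chat))
```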
  9. Postmortem: Incident Response

    1. PagerDuty call
    2. Ack
    3. Investigate & respond
       a. Post in #incident everything you are about to do and everything you have done
       b. Report the current situation in #status so that people outside the response, including non-engineers, can follow it
    4. Write the “Incident Report”
       a. The EM decides who will write it
    5. Hold the “Incident Review”
       a. The EM is responsible for arranging the date, time, and attendees of the meeting. Ogata and the primary author of the incident report must attend
  10. Postmortem: Purpose of incident reviews

    • To avoid repeating the mistake, discuss the following topics:
      ◦ Better ways to detect the situation
      ◦ Problems in the current development/operation process
      ◦ Useful tools or systems for improvement
      ◦ Automation
    • Don't criticize a person or a specific mistake
      ◦ Asking "Why did you do xxx instead of yyy?" is meaningless
      ◦ We should think together about how to keep the situation from happening again
    • Clarify the action items for improvement at the end of the meeting
  11. Postmortem: Template of Incident Report

    # Description // 障害概要
    # Area of Influence // 影響範囲
    # Operation // 対応内容
    ## Timeline // 流れ
    # Cause // 発生原因
    ## Root Causes // 根本原因
    # Recurrence Prevention Measure // 再発防止策
    ## Action Items
    # Supporting information // 補足