SRE at SmartNews

I will talk about the SRE team's work at SmartNews, mainly monitoring and postmortems.

Nobutoshi Ogata

May 17, 2018

Transcript

  1. SRE @ SmartNews
    SRE Lounge #3 Nobutoshi Ogata
    2018/05/17

  2. Introduction

  3. Introduction
    Self Introduction
    ● 尾形 暢俊 (Nobutoshi Ogata / @nobu666)
    ● Engineering Manager, SRE
    ● 2015/05 Joined SmartNews
    ● 2016/01 SRE team formed
    ● For about one year along the way, I was also in
    charge of Corporate IT on my own

  4. SmartNews

  5. SmartNews
    ● 30M downloads worldwide
    ● Delivering the world’s quality information to
    the people who need it

  6. Engineering
    Organization

  7. Engineering Organization
    ● Co-CEO
    ● Cross-functional Expert
    ● SRE
    ● Mobile App
    ● VPoE
    ● Corporate Engineering
    ● News and Content Delivery
    ● US Engineering
    ● Ads

  8. Responsibilities

  9. Responsibilities
    Responsibilities of an SRE
    ● Automate, Codify, Standardize operations
    ● Log collection / Analysis platform
    ● Monitoring, provisioning, deployments, and
    development flows
    ● Ensure server-side security
    ● Reviewing architecture decisions
    ● Responding to incidents
    ● Supporting postmortems

  10. Responsibilities
    Responsibilities of an SRE
    ● Automate, Codify, Standardize operations
    ● Log collection / Analysis platform
    ● Monitoring, provisioning, deployments, and
    development flows
    ● Ensure server-side security
    ● Reviewing architecture decisions
    ● Responding to incidents
    ● Supporting postmortems

  11. Monitoring

  12. Monitoring
    Datadog
    ● Server metrics
    ● Custom app metrics w/JMX

  13. Monitoring
    New Relic
    ● Application performance

  14. Monitoring
    Chartio
    ● Data exploration & visualization

  15. Monitoring
    Runscope
    ● API performance, Data validation

  16. Monitoring
    VAddy
    ● Vulnerability scan w/CI

  17. Monitoring
    CloudWatch + Lambda
    ● AWS service limits
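
The slide does not show how the Lambda function evaluates the limits. As a rough sketch only, the core comparison such a check might perform could look like the following. The function name `limits_near_exhaustion`, the resource names, and the 80% warning ratio are all hypothetical and not taken from the talk; fetching the numbers from AWS and publishing to CloudWatch are deliberately left out.

```python
from typing import Dict, List, Tuple


def limits_near_exhaustion(
    usage: Dict[str, float],
    limits: Dict[str, float],
    warn_ratio: float = 0.8,  # hypothetical threshold, not from the talk
) -> List[Tuple[str, float]]:
    """Return (resource, usage/limit) pairs at or above warn_ratio."""
    flagged = []
    for name, limit in sorted(limits.items()):
        if limit <= 0:
            # Skip resources with no meaningful limit to avoid dividing by zero.
            continue
        ratio = usage.get(name, 0) / limit
        if ratio >= warn_ratio:
            flagged.append((name, ratio))
    return flagged


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    current = {"ec2_instances": 18, "elastic_ips": 2}
    quota = {"ec2_instances": 20, "elastic_ips": 5}
    for resource, ratio in limits_near_exhaustion(current, quota):
        print(f"{resource}: {ratio:.0%} of limit used")
```

Keeping the comparison separate from the AWS calls makes it easy to test without credentials.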

  18. Monitoring
    Search ELB logs w/Presto
    ● Find logs quickly
    ● Partitions are created daily by Airflow DAGs
    $ presto --server presto.smartnews.internal:8081 \
        --user $USER --catalog hive --schema default \
        --execute "select count(*) from raw_elb_access_log where dt = '2018-05-04' and name = 'ELB-Name' and elb_status_code = '200'" \
        --output-format TSV
    187130322
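
The query above filters on the `dt` partition, the ELB name, and the status code. A minimal sketch of building the same statement for another day or load balancer might look like this. The table and column names come from the slide; the helper name `build_elb_count_query` and its validation rules are illustrative assumptions.

```python
import datetime
import re


def build_elb_count_query(dt: datetime.date, elb_name: str, status_code: str) -> str:
    """Build the count query from the slide for a single dt partition."""
    # The values are inlined into SQL, so they are validated strictly first.
    if not re.fullmatch(r"[A-Za-z0-9-]+", elb_name):
        raise ValueError(f"unexpected ELB name: {elb_name!r}")
    if not re.fullmatch(r"[0-9]{3}", status_code):
        raise ValueError(f"unexpected status code: {status_code!r}")
    return (
        "select count(*) from raw_elb_access_log "
        f"where dt = '{dt.isoformat()}' "
        f"and name = '{elb_name}' "
        f"and elb_status_code = '{status_code}'"
    )


if __name__ == "__main__":
    # Prints a statement that can be passed to `presto --execute`.
    print(build_elb_count_query(datetime.date(2018, 5, 4), "ELB-Name", "200"))
```

Restricting the query to one `dt` value keeps the scan inside a single daily partition.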

  19. Monitoring
    AWS Config
    ● AWS resource settings

  20. Monitoring
    E-mail + Zapier
    ● AWS retirement and maintenance notices

  21. Monitoring
    E-mail + PagerDuty
    ● Airflow SLA misses
    ● Google security warnings

  22. Monitoring
    API + Jenkins
    ● Presto slow queries
    ● Hit the query API and parse the response every 5 minutes
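
The parsing step of such a periodic job could be sketched as below. This is not the team's actual Jenkins job. It assumes the response has already been fetched and decoded into a list of dicts, and that each entry carries `queryId`, `state`, and `queryStats.elapsedTime` as in the Presto coordinator's query list. The 300-second threshold and the function names are hypothetical.

```python
import re
from typing import Any, Dict, List, Tuple

# Presto renders durations as a number plus a unit, e.g. "312.00ms" or "6.00m".
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0,
}
_DURATION_RE = re.compile(r"^(\d+(?:\.\d+)?)\s*(ns|us|ms|s|m|h|d)$")


def duration_seconds(text: str) -> float:
    """Convert a Presto-style duration string into seconds."""
    match = _DURATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def find_slow_queries(
    queries: List[Dict[str, Any]],
    threshold_seconds: float = 300.0,  # hypothetical threshold
) -> List[Tuple[str, float]]:
    """Return (queryId, elapsed seconds) for running queries over the threshold."""
    slow = []
    for query in queries:
        if query.get("state") != "RUNNING":
            continue  # only queries still running are of interest
        elapsed = duration_seconds(query["queryStats"]["elapsedTime"])
        if elapsed >= threshold_seconds:
            slow.append((query["queryId"], elapsed))
    return slow


if __name__ == "__main__":
    # Hand-written sample data, not real output.
    sample = [
        {"queryId": "q1", "state": "RUNNING", "queryStats": {"elapsedTime": "6.00m"}},
        {"queryId": "q2", "state": "FINISHED", "queryStats": {"elapsedTime": "10.00m"}},
    ]
    print(find_slow_queries(sample))
```

A scheduler such as Jenkins can run this every few minutes and forward any non-empty result to a chat channel or pager.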

  23. Monitoring
    Kaonavi + GitHub + Spreadsheet
    ● Review requests for any repository are sent to
    any channel as a mention

  24. Postmortem

  25. Postmortem
    Incident Response
    1. PagerDuty call
    2. Ack
    3. Investigate & respond
      a. Post in #incident what you are about to do and what you have done
      b. Report the current situation in #status so that people who are not
      involved in the response, including non-engineers, can follow it
    4. Write the “Incident Report”
      a. The EM decides who will write it
    5. Hold the “Incident Review”
      a. The EM arranges the date, time, and attendees of the meeting.
      Attendance by Ogata and the primary author of the incident report
      is mandatory

  26. Postmortem

  27. Postmortem
    Purpose of incident reviews
    ● To avoid repeating the mistake, discuss the following topics:
      ○ Better ways to detect the situation
      ○ Problems in the current development/operation process
      ○ Useful tools or systems for improvement
      ○ Automation
    ● Don't criticize a person or a specific mistake
      ○ Asking "Why did you do xxx instead of yyy?" is meaningless
      ○ Think together about how to keep the situation from happening again
    ● Clarify the action items for improvement at the end of the meeting

  28. Postmortem
    Template of Incident Report
    # Description // Incident summary
    # Area of Influence // Scope of impact
    # Operation // Response actions
    ## Timeline // Sequence of events
    # Cause // Cause of the incident
    ## Root Causes // Root causes
    # Recurrence Prevention Measure // Measures to prevent recurrence
    ## Action Items
    # Supporting information // Supplementary notes
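
To start every report from the same skeleton, the headings above could be rendered by a small script. The sketch below is only an illustration: the headings are copied from the slide, while the function name `render_incident_report` and the title line format are assumptions.

```python
import datetime

# Headings copied from the template on the slide.
TEMPLATE_HEADINGS = [
    "# Description",
    "# Area of Influence",
    "# Operation",
    "## Timeline",
    "# Cause",
    "## Root Causes",
    "# Recurrence Prevention Measure",
    "## Action Items",
    "# Supporting information",
]


def render_incident_report(title: str, occurred_on: datetime.date) -> str:
    """Return an empty incident report with every template heading in order."""
    # The first line is a hypothetical title format, not part of the template.
    lines = [f"Incident Report: {title} ({occurred_on.isoformat()})", ""]
    for heading in TEMPLATE_HEADINGS:
        lines.extend([heading, ""])  # blank line leaves room for the author
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_incident_report("Example outage", datetime.date(2018, 5, 4)))
```

Generating the skeleton keeps the section order identical across reports, which makes past incidents easier to compare.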

  29. Monitoring, Postmortem
    Tracking

  30. Any Questions?

  31. We’re Hiring!
    https://smartnews.workable.com/jobs/606363
