Slide 1

Slide 1 text

SRE @ SmartNews SRE Lounge #3 Nobutoshi Ogata 2018/05/17

Slide 2

Slide 2 text

Introduction

Slide 3

Slide 3 text

Introduction: Self Introduction
● 尾形 暢俊 (Nobutoshi Ogata / @nobu666)
● Engineering Manager, SRE
● 2015/05 Joined SmartNews
● 2016/01 SRE Team organized
● For about a year along the way, also in charge of Corporate IT single-handedly

Slide 4

Slide 4 text

SmartNews

Slide 5

Slide 5 text

SmartNews
● 30M downloads worldwide
● Delivering the world’s quality information to the people who need it

Slide 6

Slide 6 text

Engineering Organization

Slide 7

Slide 7 text

Engineering Organization (org chart): Co-CEO; Cross-functional Expert; SRE; Mobile App; VPoE; Corporate Engineering; News and Content Delivery; US Engineering; Ads

Slide 8

Slide 8 text

Responsibilities

Slide 9

Slide 9 text

Responsibilities: Responsibilities of an SRE
● Automate, codify, and standardize operations
● Log collection / analysis platform
● Monitoring, provisioning, deployments, and development flows
● Server-side security
● Reviewing architecture decisions
● Responding to incidents
● Supporting postmortems

Slide 11

Slide 11 text

Monitoring

Slide 12

Slide 12 text

Monitoring: Datadog
● Server metrics
● Custom app metrics w/JMX

Slide 13

Slide 13 text

Monitoring: New Relic
● Application performance

Slide 14

Slide 14 text

Monitoring: Chartio
● Data exploration & visualization

Slide 15

Slide 15 text

Monitoring: Runscope
● API performance, data validation

Slide 16

Slide 16 text

Monitoring: VAddy
● Vulnerability scan w/CI

Slide 17

Slide 17 text

Monitoring: CloudWatch + Lambda
● AWS service limits
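The slide does not show how the service-limit check is implemented. As a rough illustration only, the comparison a Lambda function might run before publishing an alarm could look like the sketch below. The `LimitUsage` type, the `near_limit` function, the 80% threshold, and the sample names are all hypothetical and not taken from the talk; fetching the actual usage and limit values from AWS is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LimitUsage:
    """One service limit and its current usage (hypothetical structure)."""
    name: str   # illustrative label, e.g. "ec2/on-demand-instances"
    used: int   # current usage count
    limit: int  # account limit for this resource


def near_limit(usages: Iterable[LimitUsage], threshold: float = 0.8) -> List[LimitUsage]:
    """Return the entries whose used/limit ratio is at or above the threshold.

    The 0.8 default is an assumption chosen for illustration.
    """
    flagged = []
    for usage in usages:
        if usage.limit <= 0:
            # Skip limits that are unknown or effectively unlimited.
            continue
        if usage.used / usage.limit >= threshold:
            flagged.append(usage)
    return flagged


if __name__ == "__main__":
    sample = [
        LimitUsage("ec2/on-demand-instances", used=18, limit=20),
        LimitUsage("vpc/elastic-ips", used=1, limit=5),
    ]
    for usage in near_limit(sample):
        print(f"{usage.name}: {usage.used}/{usage.limit}")
```

Keeping the comparison as a pure function makes it easy to test without AWS credentials; the result could then be turned into a custom metric or a notification.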

Slide 18

Slide 18 text

Monitoring: Search ELB logs w/Presto
● Find logs quickly
● Partitions are created every day by Airflow DAGs
$ presto --server presto.smartnews.internal:8081 --user $USER --catalog hive --schema default --execute "select count(*) from raw_elb_access_log where dt = '2018-05-04' and name = 'ELB-Name' and elb_status_code = '200'" --output-format TSV
187130322
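For repeated lookups, the command on this slide can be assembled programmatically. The sketch below is one possible wrapper, not something shown in the talk: the function names are hypothetical, and the input validation (ELB names limited to letters, digits, and hyphens) is an assumption added so that values can be interpolated into the SQL string safely. The table, columns, server address, and CLI flags are copied from the slide.

```python
import datetime
import re
from typing import List


def build_elb_count_query(dt: datetime.date, elb_name: str, status_code: int) -> str:
    """Build the count query from the slide for one date partition."""
    # Assumption: ELB names contain only letters, digits, and hyphens.
    if not re.fullmatch(r"[A-Za-z0-9-]+", elb_name):
        raise ValueError(f"unexpected ELB name: {elb_name!r}")
    if not 100 <= status_code <= 599:
        raise ValueError(f"unexpected HTTP status code: {status_code}")
    return (
        "select count(*) from raw_elb_access_log "
        f"where dt = '{dt.isoformat()}' and name = '{elb_name}' "
        f"and elb_status_code = '{status_code}'"
    )


def build_presto_command(query: str, user: str) -> List[str]:
    """Return the argv for the presto CLI call, using the flags from the slide."""
    return [
        "presto",
        "--server", "presto.smartnews.internal:8081",
        "--user", user,
        "--catalog", "hive",
        "--schema", "default",
        "--execute", query,
        "--output-format", "TSV",
    ]


if __name__ == "__main__":
    query = build_elb_count_query(datetime.date(2018, 5, 4), "ELB-Name", 200)
    # Print the command instead of running it; pass the list to
    # subprocess.run() on a host where the presto CLI is installed.
    print(build_presto_command(query, "example-user"))
```

Filtering on `dt` matters here because it restricts the scan to a single daily partition.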

Slide 19

Slide 19 text

Monitoring: AWS Config
● AWS resource settings

Slide 20

Slide 20 text

Monitoring: E-mail + Zapier
● AWS retirement and maintenance notices

Slide 21

Slide 21 text

Monitoring: E-mail + PagerDuty
● Airflow SLA misses
● Google security warnings

Slide 22

Slide 22 text

Monitoring: API + Jenkins
● Presto slow queries
● Hit the query API and parse the response every 5 min.
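The parsing step of such a Jenkins job might look like the sketch below. It assumes the query API returns a JSON array in which each entry has `queryId`, `state`, and `queryStats.elapsedTime` fields, with the elapsed time formatted as a string such as `"5.02m"`; verify this against the Presto version in use. The function names and the 300-second threshold are hypothetical, and the HTTP request itself is omitted.

```python
import json
import re
from typing import List, Tuple

# Seconds per unit for duration strings such as "5.02m" or "30.00s".
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0,
}
_DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ns|us|ms|s|m|h|d)\s*$")


def parse_duration(text: str) -> float:
    """Convert a duration string like "5.02m" into seconds."""
    match = _DURATION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def find_slow_queries(payload: str, threshold_seconds: float = 300.0) -> List[Tuple[str, float]]:
    """Return (queryId, elapsed seconds) for running queries over the threshold.

    Assumes `payload` is the JSON text of the query list described above.
    """
    slow = []
    for query in json.loads(payload):
        if query.get("state") != "RUNNING":
            continue  # only queries still running are of interest
        elapsed = parse_duration(query["queryStats"]["elapsedTime"])
        if elapsed >= threshold_seconds:
            slow.append((query["queryId"], elapsed))
    return slow


if __name__ == "__main__":
    sample = json.dumps([
        {"queryId": "q1", "state": "RUNNING", "queryStats": {"elapsedTime": "6.00m"}},
        {"queryId": "q2", "state": "RUNNING", "queryStats": {"elapsedTime": "30.00s"}},
        {"queryId": "q3", "state": "FINISHED", "queryStats": {"elapsedTime": "10.00m"}},
    ])
    print(find_slow_queries(sample))
```

A job scheduled every five minutes could fail its build or post a notification whenever the returned list is non-empty.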

Slide 23

Slide 23 text

Monitoring: Kaonavi + GitHub + Spreadsheet
● Review requests for any repository are notified to any channel with a mention

Slide 24

Slide 24 text

Postmortem

Slide 25

Slide 25 text

Postmortem: Incident Response
1. PagerDuty call
2. Ack
3. Investigate & respond
   a. Post everything you are trying to do and everything you did in #incident
   b. Post the current situation in #status so that people outside the response, including non-engineers, can understand it
4. Write “Incident Report”
   a. The EM decides who will write it
5. Hold “Incident Review”
   a. The EM is in charge of arranging the date, time, and attendees of the meeting. Attendance of Ogata and the primary author of the incident report is mandatory

Slide 26

Slide 26 text

Postmortem

Slide 27

Slide 27 text

Postmortem: Purpose of incident reviews
● To avoid repeating the mistake, discuss the following topics:
  ○ Better ways to detect the situation
  ○ Problems in the current development/operation process
  ○ Useful tools or systems for improvement
  ○ Automation
● Don't criticize a person or a specific mistake
  ○ Asking "Why did you do xxx instead of yyy?" is meaningless
  ○ We should think together about how to keep the situation from happening again
● Clarify the action items for improvement at the end of the meeting

Slide 28

Slide 28 text

Postmortem: Template of Incident Report
# Description // Incident summary
# Area of Influence // Scope of impact
# Operation // Response actions
## Timeline // Sequence of events
# Cause // Cause of occurrence
## Root Causes // Root causes
# Recurrence Prevention Measure // Measures to prevent recurrence
## Action Items
# Supporting information // Supplementary notes

Slide 29

Slide 29 text

Monitoring Postmortem Tracking

Slide 30

Slide 30 text

Any Questions?

Slide 31

Slide 31 text

We’re Hiring! https://smartnews.workable.com/jobs/606363