Self Introduction
● 尾形 暢俊 (Nobutoshi Ogata / @nobu666)
● Engineering Manager, SRE
● 2015/05: Joined
● 2016/01: SRE team formed
● For about a year along the way, I have also been in charge of Corporate IT on my own

Responsibilities of an SRE
● Automate, codify, and standardize operations
● Log collection and analysis platform
● Monitoring, provisioning, deployments, and development flows
● Ensure server-side security
● Review architecture decisions
● Respond to incidents
● Support postmortems

Incident Response
1. PagerDuty call
2. Ack
3. Investigate & Respond
   a. Post everything in #incident: what you are about to do and what you have done
   b. Post the current situation in #status so that people who are not involved in the response, including non-engineers, can understand it
4. Write the "Incident Report"
   a. The EM decides who writes it
5. Hold the "Incident Review"
   a. The EM arranges the date, time, and attendees of the meeting. Ogata and the primary author of the incident report must attend

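The five steps above can be read as a strictly ordered workflow. The following is a minimal sketch of that ordering in Python, not the tooling actually used by the team: the names `Step`, `Incident`, `note`, and `required_review_attendees` are hypothetical, and the two lists only stand in for messages that would be posted to #incident and #status.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class Step(Enum):
    """Hypothetical names for the five steps on the slide, in order."""
    PAGED = 1           # 1. PagerDuty call
    ACKED = 2           # 2. Ack
    INVESTIGATING = 3   # 3. Investigate & Respond
    REPORT_WRITTEN = 4  # 4. Write the "Incident Report"
    REVIEWED = 5        # 5. Hold the "Incident Review"


@dataclass
class Incident:
    title: str
    step: Step = Step.PAGED
    incident_log: List[str] = field(default_factory=list)  # stands in for #incident
    status_log: List[str] = field(default_factory=list)    # stands in for #status
    report_author: Optional[str] = None                    # chosen by the EM

    def advance(self, to: Step) -> None:
        """Move to the next step; skipping or going backwards is rejected."""
        if to.value != self.step.value + 1:
            raise ValueError(f"cannot go from {self.step.name} to {to.name}")
        if to is Step.REPORT_WRITTEN and self.report_author is None:
            raise ValueError("the EM must decide who writes the report first")
        self.step = to

    def note(self, action: str, plain_summary: str) -> None:
        """Record an action for responders and a plain-language status update."""
        if self.step is not Step.INVESTIGATING:
            raise ValueError("notes are recorded while investigating")
        self.incident_log.append(action)
        self.status_log.append(plain_summary)


def required_review_attendees(incident: Incident) -> Set[str]:
    """Ogata and the primary report author must attend the review."""
    if incident.report_author is None:
        raise ValueError("no report author assigned yet")
    return {"Ogata", incident.report_author}
```

Keeping the #incident and #status entries paired in one `note` call mirrors step 3: every action logged for responders comes with a summary that non-engineers can follow.
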
Purpose of incident reviews
● To avoid repeating the mistake, discuss the following topics:
  ○ Better ways to detect the situation
  ○ Problems in the current development/operation process
  ○ Useful tools or systems for improvement
  ○ Automation
● Don't criticize a person or a specific mistake
  ○ Asking "Why did you do xxx instead of yyy?" is meaningless
  ○ Think together about how to keep the situation from happening again
● Clarify the action items for improvement at the end of the meeting