Slide 1

Slide 1 text

17Media SRE Journey
Sammy Lin
2019-10-18, DevOpsDays Taipei

Slide 2

Slide 2 text

SRE: Site Reliability Engineering

Slide 3

Slide 3 text

What is SRE?
SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services (Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few) with an ever-watchful eye on their availability, latency, performance, and capacity.
Reference: https://landing.google.com/sre

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

SRE is not just a position. It's a culture.

Slide 6

Slide 6 text

Job Description
System Administrator
• Architecture planning, setup, backup, update, security protection, and management of Linux servers and operating systems.
DevOps Engineer
• Build and improve our CI/CD process and tools.
• Manage the AWS or GCP environment.
SRE
• Scale our applications and infrastructure.
• Develop monitoring systems.
• Participate in our on-call rotation.

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

There are no best practices. It has to be customized toward your organization.

Slide 9

Slide 9 text

17Media SRE Milestone

Slide 10

Slide 10 text

Why is it called SRE?

Slide 11

Slide 11 text

17Media SRE
• 2015: 17Media founded; built on AWS
• 2016/8: First DevOps Engineer joined
• 2017/1: Migrated from Beanstalk to ECS
• 2018/5: Migrated to GCP
• 2018/6: Migrated from Node.js to Golang
• 2019/8: Building GRE

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Reference: https://cloud.google.com/blog/products/gcp/bringing-pokemon-go-to-life-on-google-cloud

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

GRE: Group Reliability Engineering

Slide 16

Slide 16 text

• 舉手電商有限公司 (literally "Hands Up E-commerce Co., Ltd.")
• Paktor Group (拍拖集團)
• 17 Media BVI
• 麻吉一七股份有限公司 台灣分公司 (Taiwan branch)

Slide 17

Slide 17 text

Our Vision
• SREs in different companies can support each other.
• Train junior SREs to become senior SREs.
• Stronger negotiating power with third-party vendors.
• Replicate successful experiences.

Slide 18

Slide 18 text

SRE Culture

Slide 19

Slide 19 text

Broken into 5 key areas:
• Accept Failure as Normal
• Reduce Organizational Silos
• Implement Gradual Change
• Leverage Tooling & Automation
• Measure Everything

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Accept Failure as Normal

Slide 23

Slide 23 text

Accept Failure as Normal
• Psychological Safety
• Blamelessness

Slide 24

Slide 24 text

Psychological Danger
• Fear of admitting mistakes
• Blaming others
• Less likely to share different views
• Common knowledge effect
Psychological Safety
• Comfort admitting mistakes
• Learning from failure
• Everyone openly shares ideas
• Better innovation & decision-making

Slide 25

Slide 25 text

Blamelessness
Story: Three Pieces of Chocolate: How a French Mother Educates Her Child
Story content: https://kknews.cc/education/3jm6nay.html

Slide 26

Slide 26 text

Lessons from this story
• Blameless
• Action Items
• Lessons Learned

Slide 27

Slide 27 text

Failure is the key to success; each mistake teaches us something. (Morihei Ueshiba)

Slide 28

Slide 28 text

How does 17Media Accept Failure as Normal?

Slide 29

Slide 29 text

One of Our Postmortems
• Summary
  • A full disk caused an SQL outage.
• Root Cause
  • mysql-tailer had no log rotation configured.
  • The mysql-4 boot disk was full.
• Timeline
  • 2019/04/26 10:25: Alert triggered.
  • 2019/04/26 10:28: Discussion thread began.
  • 2019/04/26 10:35: Alert closed.
  • 2019/04/26 10:42: Fully recovered.
• How do we prevent this from happening again?
  • Add log rotation configuration.
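
The prevention item in this postmortem is log rotation. The slides do not show how mysql-tailer writes its logs, so the following is only a minimal sketch of size-based rotation using Python's standard `logging.handlers.RotatingFileHandler`. The file name `tailer.log`, the 1 KiB limit, and the backup count are assumptions chosen for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def make_rotating_logger(path, max_bytes=1024, backup_count=3):
    """Return (logger, handler) writing to `path` with size-based rotation.

    Disk usage is bounded by roughly max_bytes * (backup_count + 1).
    """
    logger = logging.getLogger(f"rotating-demo-{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger, handler


def total_log_bytes(directory):
    """Sum the sizes of all files in `directory`."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


def demo(lines=2000):
    """Write many lines and report (file count, total bytes) on disk."""
    with tempfile.TemporaryDirectory() as directory:
        # "tailer.log" is a hypothetical file name, not mysql-tailer's real one.
        logger, handler = make_rotating_logger(os.path.join(directory, "tailer.log"))
        for i in range(lines):
            logger.info("row %d replicated", i)
        handler.close()
        logger.removeHandler(handler)
        return len(os.listdir(directory)), total_log_bytes(directory)
```

However many lines are written, the directory never holds more than the active file plus `backup_count` rotated files, which is the property that keeps a boot disk from filling up.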

Slide 30

Slide 30 text

Reduce Organizational Silos

Slide 31

Slide 31 text

Reduce Organizational Silos

Slide 32

Slide 32 text

How does 17Media Reduce Organizational Silos?

Slide 33

Slide 33 text

Sharing on a regular basis

Slide 34

Slide 34 text

1 on 1
(Diagram labels: You, Colleagues, Manager/Supervisors, Subordinates)

Slide 35

Slide 35 text

1 on 1
• Have regular 1:1s.
• Optimize settings.
• Have a doc.

Slide 36

Slide 36 text

Implement Gradual Change

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

In early January 2016, Netflix completed its cloud migration. Netflix announced on its company blog that its journey to the cloud was finally complete after more than 7 years.

Slide 39

Slide 39 text

How does 17Media Implement Gradual Change?

Slide 40

Slide 40 text

17Media SRE
• 2015: 17Media founded; built on AWS
• 2016/8: First DevOps Engineer joined
• 2017/1: Migrated from Beanstalk to ECS
• 2018/5: Migrated to GCP
• 2018/6: Migrated from Node.js to Golang
• 2019/8: Building GRE

Slide 41

Slide 41 text

Migrate from Node.js to Golang
• Step 1: Create a proxy layer to redirect traffic.
• Step 2: Develop new features in Golang.
• Step 3: Migrate Node.js to Golang one by one.
(Diagram labels: LB, Proxy, Node.js, Golang)
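
The three steps above can be sketched as a routing decision inside the proxy layer. This is a simplified illustration, not 17Media's actual proxy: the class name `MigrationProxy`, the prefix-based matching rule, and the example paths are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationProxy:
    """Route each request path to the legacy or the new backend.

    Step 1: every path goes to the legacy backend (nothing migrated yet).
    Step 2: new features are registered as migrated from the start.
    Step 3: existing paths are marked as migrated one by one.
    """

    legacy_backend: str = "nodejs"
    new_backend: str = "golang"
    migrated_prefixes: set = field(default_factory=set)

    def migrate(self, prefix: str) -> None:
        """Mark a path prefix as served by the new backend."""
        self.migrated_prefixes.add(prefix.rstrip("/"))

    def route(self, path: str) -> str:
        """Return the backend name that should serve `path`."""
        for prefix in self.migrated_prefixes:
            # Match the prefix itself or anything below it, but not
            # unrelated paths that merely share leading characters.
            if path == prefix or path.startswith(prefix + "/"):
                return self.new_backend
        return self.legacy_backend
```

Because the default is always the legacy backend, each call to `migrate` changes behavior for one prefix only, which keeps every step small and easy to roll back.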

Slide 42

Slide 42 text

Leverage Tooling & Automation

Slide 43

Slide 43 text

Leverage Tooling & Automation
• Tooling to automate repetitive work
  • Releases
  • Creating accounts for new users
  • Automatic remediation of interrupts?
• Less manual work, more R&D.
• Something to show off on your resume.

Slide 44

Slide 44 text

How does 17Media Leverage Tooling & Automation?

Slide 45

Slide 45 text

Leverage Tooling & Automation - 1
• Situation
  • Sensitive data such as tokens and passwords is stored in the repository, so anyone can access it.
• Solution
  • #1: Manually encrypt passwords. Each new password takes 2 developer days.
  • #2: Develop a tool called macgyver: http://github.com/17media/macgyver

Slide 46

Slide 46 text

Leverage Tooling & Automation - 1

Slide 47

Slide 47 text

Leverage Tooling & Automation - 1

Slide 48

Slide 48 text

Leverage Tooling & Automation - 2
• Situation
  • Our application is released daily, and the process takes 30 minutes.
Pipeline: Git Tag → Unit test / Image Build → Deploy to Staging → E2E test → Deploy to Production

Slide 49

Slide 49 text

Leverage Tooling & Automation - 2
Pipeline: Git Tag → Unit test / Image Build → Deploy to Staging → E2E test → Deploy to Production

Slide 50

Slide 50 text

Leverage Tooling & Automation - 2
• Solution
  • Automatically deploy to staging at 6 a.m. every day.
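
A daily 6 a.m. trigger comes down to computing the next run time from the current time. The slides do not say which scheduler or time zone 17Media uses, so this is a minimal sketch with naive local datetimes; the function name `next_daily_run` is hypothetical.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, run_at: time = time(6, 0)) -> datetime:
    """Return the first occurrence of `run_at` strictly after `now`.

    Uses naive datetimes (no time zone), which is an assumption;
    a real scheduler would need an explicit zone.
    """
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed (or is right now): use tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

A scheduler loop would sleep until `next_daily_run(datetime.now())` and then start the staging deploy. In practice a cron entry or a CI schedule achieves the same result without custom code.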

Slide 51

Slide 51 text

Measure Everything

Slide 52

Slide 52 text

Measure Everything
• A dashboard that shows every state at a glance.
• The ability to drill down and trace issues.
• Data that helps make important design decisions.

Slide 53

Slide 53 text

How does 17Media Measure Everything?

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

Measure Everything
• Cluster Level
  • Node count
  • Pod count
  • …

Slide 56

Slide 56 text

Measure Everything
• System Level
  • CPU
  • Memory
  • Disk
  • IOPS
  • …

Slide 57

Slide 57 text

Measure Everything
• Application Level
  • Latency
  • QPS
  • Error rate
  • …
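
The three application-level metrics named above can be derived from a window of request records. This is a minimal sketch, not 17Media's monitoring stack: the `Request` record, the choice of the 95th percentile (nearest-rank method), and the rule that status codes of 500 and above count as errors are all assumptions.

```python
from typing import NamedTuple, Sequence


class Request(NamedTuple):
    latency_ms: float
    status: int  # HTTP status code


def summarize(requests: Sequence[Request], window_seconds: float) -> dict:
    """Compute QPS, error rate, and p95 latency for one time window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    count = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p95: the smallest rank covering 95% of samples,
    # computed with integer arithmetic to avoid float rounding.
    rank = (95 * count + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "qps": count / window_seconds,
        "error_rate": errors / count,
        "p95_latency_ms": latencies[rank - 1],
    }
```

A percentile is used instead of a mean because a few slow requests can hide behind a healthy-looking average.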

Slide 58

Slide 58 text

Measure Everything
• Code Level
  • Cache hit rate
  • Concurrent users
  • How many times your code enters an “else” branch
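
Code-level metrics such as cache hit rate and "else" branch counts only need a counter at each branch. The following is a minimal in-process sketch; the class name `CountingCache` and the metric names are hypothetical, and a real service would export these counters to its monitoring system rather than keep them in memory.

```python
from collections import Counter
from typing import Callable, Hashable


class CountingCache:
    """A dictionary cache that counts which branch each lookup takes."""

    def __init__(self, fetch: Callable[[Hashable], object]):
        self._fetch = fetch          # called on a miss to load the value
        self._store = {}
        self.metrics = Counter()

    def get(self, key: Hashable):
        if key in self._store:
            self.metrics["cache_hit"] += 1
            return self._store[key]
        else:
            # The "else" branch: counting it shows how often the slow path runs.
            self.metrics["cache_miss"] += 1
            value = self._fetch(key)
            self._store[key] = value
            return value

    def hit_rate(self) -> float:
        """Hits divided by total lookups, or 0.0 before any lookup."""
        total = self.metrics["cache_hit"] + self.metrics["cache_miss"]
        return self.metrics["cache_hit"] / total if total else 0.0
```

For example, after looking up key `1` three times and key `2` once on a fresh cache, two lookups are hits and two are misses.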

Slide 59

Slide 59 text

It has to be customized toward your organization.

Slide 60

Slide 60 text

SRE Resources in Taiwan
• https://www.facebook.com/groups/sretaiwan
• https://www.facebook.com/groups/DevOpsTaiwan

Slide 61

Slide 61 text

Try to Change your organization

Slide 62

Slide 62 text

Try to Change your organization Job

Slide 63

Slide 63 text

We're hiring
• Site Reliability Engineer - Infrastructure
• Site Reliability Engineer - Automation & Tools
• Site Reliability Engineer - DBA
• Backend Engineer
Check our job page! 17 Media