2019-10-17 17Media SRE Journey

Sammy Lin
October 18, 2019

2019-10-17 DevOpsDays Taipei

Transcript

  1. What is SRE?

    "SRE is what you get when you treat operations as if it's a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google's public services (Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few) with an ever-watchful eye on their availability, latency, performance, and capacity."
    Reference: https://landing.google.com/sre
  3. Job Description

    System Administrator
    • Architecture planning, setup, backup, update, security protection, and management of Linux servers and operating systems.
    DevOps Engineer
    • Build and improve our CI/CD process and tools.
    • Manage AWS or GCP environment.
    SRE
    • Scale our applications and infrastructure.
    • Develop monitoring systems.
    • Participate in our on-call rotation.
  4. 17Media SRE

    • 2015: 17Media founded, built on AWS
    • 2016/8: First DevOps Engineer joined
    • 2017/1: Migrated from beanstalk to ECS
    • 2018/5: Migrated to GCP
    • 2018/6: Migrated from node.js to golang
    • 2019/8: Building GRE
  5. Our Vision

    • SREs in different companies can support each other.
    • Train junior SREs to become senior SREs.
    • Stronger negotiating power with third-party vendors.
    • Replicate successful experience.
  6. Broken into 5 key areas:

    • Accept Failure as Normal
    • Reduce Organization Silos
    • Implement Gradual Change
    • Leverage Tooling & Automation
    • Measure Everything
  7. Psychological Danger / Psychological Safety

    Psychological Danger
    • Fear of admitting mistakes
    • Blaming others
    • Less likely to share different views
    • Common knowledge effect
    Psychological Safety
    • Comfort admitting mistakes
    • Learning from failure
    • Everyone openly shares ideas
    • Better innovation & decision-making
  8. "Failure is the key to success; each mistake teaches us something."

    - Morihei Ueshiba
  9. One of Our Postmortems

    • Summary
      • Full disk caused SQL outage
    • Root Cause
      • mysql-tailer didn't set log rotation
      • mysql-4 boot disk was full
    • Timeline
      • 2019/04/26 10:25: Alert triggered.
      • 2019/04/26 10:28: Discussion thread began.
      • 2019/04/26 10:35: Alert closed.
      • 2019/04/26 10:42: Fully recovered.
    • How to prevent this from happening again?
      • Add log rotate configuration
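
    The postmortem's fix is a log rotate configuration. The slide does not show that configuration, so the following is only a minimal Python sketch of the same principle: cap the size of each log file and the number of old files kept, so the log's disk footprint stays bounded. `MAX_BYTES`, `BACKUP_COUNT`, the file name, and the helper functions are hypothetical and not taken from the deck.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

MAX_BYTES = 1024     # hypothetical cap per log file, in bytes
BACKUP_COUNT = 3     # hypothetical number of rotated files to keep


def build_rotating_logger(log_path: str) -> logging.Logger:
    """Return a logger that rolls over before a file reaches MAX_BYTES."""
    logger = logging.getLogger(f"tailer-{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = RotatingFileHandler(
        log_path, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT
    )
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory: str) -> int:
    """Sum the sizes of all files in the directory."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


# Write far more data than the cap allows; old data is discarded on rollover.
log_dir = tempfile.mkdtemp()
logger = build_rotating_logger(os.path.join(log_dir, "tailer.log"))
for i in range(2000):
    logger.info("binlog event %d", i)
```

    However many records are written, at most `BACKUP_COUNT + 1` files exist and none reaches `MAX_BYTES`, which is the property that was missing when the boot disk filled up.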
  10. Netflix Cloud Migration Complete (early January 2016)

    Netflix announced through its company blog that its journey to the cloud is finally complete after more than 7 years.
  11. 17Media SRE

    • 2015: 17Media founded, built on AWS
    • 2016/8: First DevOps Engineer joined
    • 2017/1: Migrated from beanstalk to ECS
    • 2018/5: Migrated to GCP
    • 2018/6: Migrated from node.js to golang
    • 2019/8: Building GRE
  12. Migrate from node.js to golang

    • Step 1: Create a proxy layer to redirect traffic
    • Step 2: Develop new features in golang
    • Step 3: Migrate node.js to golang one by one
    [Diagram labels: LB, Proxy, Node.js, Golang]
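
    The three steps describe a gradual cutover behind a proxy. The deck does not show how the proxy decides where to send a request, so the following is only a sketch under the assumption that routing is by path prefix: anything already migrated goes to the golang service, everything else stays on node.js. The backend labels, route names, and `pick_backend` are hypothetical.

```python
from typing import Iterable

LEGACY_BACKEND = "nodejs"   # hypothetical label for the existing service
NEW_BACKEND = "golang"      # hypothetical label for the rewritten service


def pick_backend(path: str, migrated_prefixes: Iterable[str]) -> str:
    """Route a request path to the new backend only if its prefix is migrated."""
    for prefix in migrated_prefixes:
        base = prefix.rstrip("/")
        # Match the prefix itself or anything beneath it, but not
        # unrelated paths that merely share leading characters.
        if path == base or path.startswith(base + "/"):
            return NEW_BACKEND
    return LEGACY_BACKEND


# Step 1: the proxy is in place, nothing is migrated yet.
migrated = set()
before = pick_backend("/v1/users/42", migrated)

# Step 3: endpoints move over one by one by adding their prefix.
migrated.add("/v1/users")
after = pick_backend("/v1/users/42", migrated)
```

    Because the set of migrated prefixes is the only thing that changes, each endpoint can be moved, or moved back, independently of the others.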
  13. Leverage Tooling & Automation

    • Tooling to automate repetitive work
      • release
      • create an account for new users
      • automatic repair interrupt?
    • Less manual work. More R&D
    • Something to show off on resume
  14. Leverage Tooling & Automation - 1

    • Situation
      • Sensitive data such as tokens and passwords is stored in the repository, so anyone with access to the repository can read it.
    • Solution
      • #1: Manually encrypt passwords. Each new password takes 2 developer days.
      • #2: Develop a tool called macgyver: http://github.com/17media/macgyver
    Reference: https://github.com/17media/macgyver
  15. Leverage Tooling & Automation - 2

    • Situation
      • Our application is released daily, and the release process takes 30 minutes.
    [Pipeline diagram labels: Git Tag, Deploy to Staging, Unit test/Image Build, E2E test, Deploy to Production]
  16. Leverage Tooling & Automation - 2

    [Pipeline diagram labels: Git Tag, Deploy to Staging, Unit test/Image Build, E2E test, Deploy to Production]
  17. Leverage Tooling & Automation - 2

    • Solution
      • Auto deploy to staging at 6 a.m. every day
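
    The slide does not say how the 6 a.m. deploy is triggered (cron, a CI scheduler, or something else). As a rough sketch of the scheduling rule alone, the function below computes the next 6 a.m. run from a given time. `DEPLOY_TIME` and `next_staging_deploy` are hypothetical names, and the sketch assumes naive local datetimes.

```python
from datetime import datetime, time, timedelta

DEPLOY_TIME = time(hour=6)  # 6 a.m., as stated on the slide


def next_staging_deploy(now: datetime) -> datetime:
    """Return the next 6 a.m. strictly after `now`."""
    candidate = datetime.combine(now.date(), DEPLOY_TIME)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now): use tomorrow.
        candidate += timedelta(days=1)
    return candidate


# A tag pushed the evening before is picked up by the next morning's run.
evening = datetime(2019, 10, 17, 23, 30)
scheduled = next_staging_deploy(evening)
```

    With staging refreshed before the working day starts, the remaining manual effort is the promotion to production rather than the full 30-minute sequence.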
  18. Measure Everything

    • A dashboard to see all states.
    • Can drill down to trace issues.
    • Helps make important design decisions.
  19. Measure Everything

    • Code level
      • Cache hit rate
      • Concurrent users
      • How many times does your code go into the "else" statement?
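
    Code-level measurements like these can be as simple as counters incremented on each branch. The deck does not name a metrics library, so the following is only an in-memory sketch: `CodeMetrics`, the counter names, and `load_profile` are hypothetical, and a real service would export the counts to its monitoring system.

```python
from collections import Counter


class CodeMetrics:
    """Minimal in-memory counters; a stand-in for a real metrics client."""

    def __init__(self) -> None:
        self.counts = Counter()

    def incr(self, name: str) -> None:
        self.counts[name] += 1

    def cache_hit_rate(self) -> float:
        hits = self.counts["cache.hit"]
        total = hits + self.counts["cache.miss"]
        return hits / total if total else 0.0


metrics = CodeMetrics()
cache = {}


def load_profile(user_id: int) -> str:
    """Look up a profile, counting which branch was taken."""
    if user_id in cache:
        metrics.incr("cache.hit")
        return cache[user_id]
    else:
        # Count every trip through the "else" branch, as the slide suggests.
        metrics.incr("cache.miss")
        metrics.incr("load_profile.else_branch")
        cache[user_id] = f"profile-{user_id}"  # placeholder for a real lookup
        return cache[user_id]


for uid in [1, 2, 1, 1]:
    load_profile(uid)
```

    Reading `metrics.cache_hit_rate()` and the branch counter after a run shows how often the slow path is actually taken, which is the kind of evidence the previous slide says should drive design decisions.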
  20. We're hiring

    • Site Reliability Engineer - Infrastructure
    • Site Reliability Engineer - Automation & Tools
    • Site Reliability Engineer - DBA
    • Backend Engineer
    Check our job page! 17 Media