Site Reliability Engineering - Session 1

SRE
September 29, 2021

Transcript

  1. SRE Introduction

  2. AGENDA

    • What is SRE? • SRE & DevOps • Principles • Practices • SLOs: Vocabulary, Error Budgets • State of the Art • A day in the life of an SRE
  3. TITANIC QUEBEC BRIDGE

  4. WHAT DO THESE CASES HAVE IN COMMON?

  5. Those systems were not RELIABLE!

    Reliability describes the ability of a system or component to function under unexpected conditions for an unspecified period of time.
  6. RELIABILITY ENGINEERING

    Reliability engineering is the discipline that applies scientific know-how to a component, product, or process to ensure that it performs its intended function without failure for a specified period of time.
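One common way to put a number on "without failure for a specified period of time" is a survival probability. The sketch below is only an illustration and assumes a constant failure rate (the exponential model), which the slides do not state; the function name `reliability` and the MTBF figure are hypothetical.

```python
import math


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of running `mission_hours` without failure.

    Assumption (not from the slides): failures arrive at a constant
    rate of 1 / mtbf_hours, so R(t) = exp(-t / MTBF).
    """
    if mission_hours < 0 or mtbf_hours <= 0:
        raise ValueError("mission_hours must be >= 0 and mtbf_hours > 0")
    return math.exp(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical component: MTBF of 10,000 hours, 30-day mission (720 h).
    print(f"{reliability(720, 10_000):.4f}")
```

Under this assumption, a longer specified period always lowers the probability of finishing it without failure, which is why the time window is part of the definition.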
  7. CLOUD OPERATING MODEL: SRE SQUAD

    Diagram labels: TOOLING; DevOps Feature Squad 1, DevOps Feature Squad 2, ... DevOps Feature Squad N; SRE Squad.
    KPIs: Performance, Uptime, Minimal Cost per Service, Deployability.
    SERVICES: Automation Framework, Self-Service Infrastructure, Logging & Metrics Monitoring & Reporting, Scaling.
    PRACTICES: Know the Service Level, Embrace Risk, Eliminate Toil, Know What's Broken and Why, Stuff Happens, Automate [Almost] Everything, Reliable Releases, Keep it Simple, Chaos Engineering.
  8. AREAS OF EXPERTISE Introduction

  9. HISTORY OF SRE

    • 2003: Ben Treynor coined SRE • DevOps is born • 2014: First conference about SRE: SREcon • 2016-2018: SRE books are released • 2019: SRE goes mainstream
  10. 3 CHARACTERISTICS

    • NALSD (Non-Abstract Large System Design) • Avoid unnecessary work • Yes to curiosity • It's toil!
  11. DEVOPS OR SRE

  12. DEVOPS OR SRE

  13. PRINCIPLES

  14. PRINCIPLES

    • Embrace Risk • Monitoring Distributed Systems • Service Level Objectives • Eliminating Toil
  15. PRINCIPLES

    • Automation • Release Engineering • Simplicity
  16. EMBRACING RISK

    You might expect to build 100% reliable services, ones that never fail. However, past a certain point, increasing reliability is worse for a service rather than better! Extreme reliability comes at a cost! Embrace the risk! "I don't know … I like the adrenaline."
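The cost of extreme reliability is easier to see as arithmetic: each extra "nine" shrinks the downtime a service may accumulate by a factor of ten. The following is a minimal sketch under assumed values (a 30-day window and three example targets); the function name `allowed_downtime_minutes` is hypothetical and not from the deck.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window.

    `availability_target` is a fraction such as 0.999 (assumed example value).
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


if __name__ == "__main__":
    # Example targets chosen for illustration only.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} target: {minutes:.1f} minutes of downtime per 30 days")
```

A 99.9% target leaves about 43 minutes per 30 days, while a 100% target leaves none, which means no room for releases, experiments, or any accepted risk.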
  17. ELIMINATING TOIL

    Toil is not just "work I don't like to do." If a human operator needs to touch your system during normal operations, you have a bug. At Google, at least 50% of each SRE's time should be spent on engineering project work that will either reduce future toil or add service features.
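The 50% guideline can be checked against a simple time log. This is a rough sketch, assuming each logged entry has already been classified as toil or engineering work; `TimeEntry`, `toil_fraction`, and `exceeds_toil_cap` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable

# From the slide: at least 50% of time goes to engineering work,
# so toil should not exceed the remaining 50%.
TOIL_CAP = 0.50


@dataclass
class TimeEntry:
    hours: float
    is_toil: bool  # assumed to be classified by the person logging the work


def toil_fraction(entries: Iterable[TimeEntry]) -> float:
    """Share of logged hours that were spent on toil (0.0 if nothing logged)."""
    entries = list(entries)
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


def exceeds_toil_cap(entries: Iterable[TimeEntry], cap: float = TOIL_CAP) -> bool:
    """True when toil takes more than `cap` of the logged time."""
    return toil_fraction(entries) > cap
```

For example, a week with 30 hours of manual tickets and 10 hours of project work gives a toil fraction of 0.75 and would be flagged.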
  18. MONITORING

    Monitoring means collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. A Google SRE team with 10–12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.
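As a toy illustration of the collect-and-aggregate steps, the sketch below keeps query counts by type, error counts by type, and processing times in memory. It is not a real monitoring system; the class `RequestStats` and its methods are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import Counter
from typing import Optional


class RequestStats:
    """In-memory aggregation of query counts, error counts, and latencies."""

    def __init__(self) -> None:
        self.queries = Counter()   # query type -> count
        self.errors = Counter()    # error type -> count
        self.latencies_ms = []     # one processing time per request

    def record(self, query_type: str, latency_ms: float,
               error_type: Optional[str] = None) -> None:
        """Collect one request observation."""
        self.queries[query_type] += 1
        self.latencies_ms.append(latency_ms)
        if error_type is not None:
            self.errors[error_type] += 1

    def error_rate(self) -> float:
        """Fraction of recorded requests that ended in an error."""
        total = sum(self.queries.values())
        return sum(self.errors.values()) / total if total else 0.0

    def latency_percentile(self, pct: int) -> float:
        """Nearest-rank percentile of the recorded processing times."""
        if not self.latencies_ms:
            raise ValueError("no latencies recorded")
        ordered = sorted(self.latencies_ms)
        rank = max(math.ceil(pct * len(ordered) / 100), 1)
        return ordered[rank - 1]
```

A dashboard or alerting rule would then read `error_rate()` and `latency_percentile(95)` at a fixed interval; that display step is left out here.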
  19. EMBRACING RISK

  20. AUTOMATING

    Automate Yourself Out of a Job: Automate ALL the Things! • User account creation • Software or hardware installation preparation and decommissioning • Rollouts of new software versions • Runtime configuration changes. Automate the Resilience!
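Taking user account creation from the list above as an example, automation is safest when it is idempotent, so that running it twice leaves the system in the same state. The sketch below assumes a plain dictionary standing in for a real user directory; `ensure_user` is a hypothetical helper, not an API of any actual tool.

```python
from typing import Dict, Set


def ensure_user(directory: Dict[str, Set[str]], username: str, groups: Set[str]) -> bool:
    """Bring an account to the desired state.

    `directory` is an in-memory stand-in for a real user directory
    (an assumption for this sketch). Returns True if anything changed,
    False if the account already matched the desired state.
    """
    if directory.get(username) == groups:
        return False  # already in the desired state: nothing to do
    directory[username] = set(groups)  # create the account or fix its groups
    return True
```

Because the function describes the desired end state instead of a one-off action, it can be rerun on a schedule or after a partial failure without creating duplicates.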
  21. RELEASE ENGINEERING

    Continuous Build and Deployment! Release engineers have a solid understanding of source code management, compilers, build configuration languages, automated build tools, package managers, and installers. 4 principles: Self-Service Model, High Velocity, Hermetic Builds, Enforcement of Policies and Procedures.
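A hermetic build depends only on inputs that are declared and pinned, not on whatever happens to be installed on the build machine. One way to picture this, as a sketch only, is a fingerprint computed from every declared input: the same inputs always give the same identifier, and any change in a tool or dependency version changes it. The function `build_fingerprint` and its parameters are hypothetical.

```python
import hashlib
import json
from typing import Dict


def build_fingerprint(source_revision: str,
                      toolchain: Dict[str, str],
                      dependencies: Dict[str, str]) -> str:
    """Hash every declared build input into one stable identifier."""
    declared_inputs = {
        "source_revision": source_revision,
        "toolchain": toolchain,        # e.g. {"compiler": "1.2.3"} (made-up version)
        "dependencies": dependencies,  # e.g. {"libfoo": "4.5.6"} (made-up version)
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(declared_inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two builds with equal fingerprints are expected to produce equivalent artifacts, which makes it possible to rebuild an old release later with the tool versions it originally used.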
  22. PRACTICES

  23. PRACTICES

    • Postmortem Culture • Being On-Call • Incident Response • Load Management
  24. PRACTICES

    • Testing for Reliability • Simplicity • Software Engineering • Effective Solutions
  25. PRACTICES

    • Pipelines • Chaos Engineering • Data Integrity • Canary Releases
  26. PRACTICES

  27. PRACTICES

  28. SKILLS

    • Software Engineering • Distributed Systems Design • Operating Systems • Networking • Databases • Security • Reliability • Troubleshooting • Customer Support
  29. The best way to promote a DevOps & SRE culture is to adopt a new view, one focused on the symptoms, not on the causes ...