Site Reliability Engineering - Session 1

SRE
September 29, 2021


Transcript

  1. AGENDA • What is SRE? • SRE & DevOps • Principles • Practices • SLOs - Vocabulary, Error Budgets • State of the Art • A day in the life of an SRE
  2. Those systems were not RELIABLE! Reliability describes the ability of a system or component to function under unexpected conditions for an unspecified period of time.
  3. RELIABILITY ENGINEERING: Reliability Engineering is the discipline that applies scientific know-how to a component, product, or process to ensure that it performs its intended function without failure for a specified period of time.
  4. TOOLING: CLOUD OPERATING MODEL, SRE SQUAD
    Squads: DevOps Feature Squad 1, DevOps Feature Squad 2, DevOps Feature Squad N, SRE Squad.
    KPIs: Performance, Uptime, Minimal Cost per Service, Deployability.
    SERVICES: Automation Framework, Self-Service Infrastructure, Logging & Metrics Monitoring & Reporting, Scaling.
    PRACTICES: Know the Service Level, Embrace Risk, Eliminate Toil, Know What's Broken and Why, Stuff Happens, Automate [Almost] Everything, Reliable Releases, Keep It Simple, Chaos Engineering.
  5. SRE HISTORY • 2003: Ben Treynor coins the term SRE • DevOps is born • 2014: First conference about SRE, SREcon • 2016-2018: SRE books are released • 2019: SRE reaches mass adoption
  6. PRINCIPLES • Embrace Risk • Monitor Distributed Systems • Service Level Objectives • Eliminate Toil
  7. EMBRACING RISK: You might expect to build 100% reliable services, ones that never fail. However, past a certain point, increasing reliability is worse for a service rather than better! Extreme reliability comes at a cost! Embrace the Risk! I don't know … I like the adrenaline.
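The cost of each extra "nine" can be made concrete with a little arithmetic. The sketch below is only an illustration: the 30-day window and the function name `allowed_downtime_minutes` are assumptions, not something taken from the slides. It converts an availability target into the downtime that target still tolerates.

```python
# Minutes in a 30-day window (an assumed reporting period, not from the deck).
WINDOW_MINUTES = 30 * 24 * 60


def allowed_downtime_minutes(availability_target: float,
                             window_minutes: int = WINDOW_MINUTES) -> float:
    """Return the downtime budget implied by an availability target in [0, 1]."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    # Whatever fraction of the window is not promised as available
    # is the room left for failure.
    return (1.0 - availability_target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 1.0):
        budget = allowed_downtime_minutes(target)
        print(f"{target:.2%} target -> {budget:.2f} minutes of downtime allowed")
```

Each additional nine cuts the budget by a factor of ten, and a 100% target leaves no budget at all, which is one way to read the claim that extreme reliability comes at a cost.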
  8. ELIMINATING TOIL: Toil is not just "work I don’t like to do." If a human operator needs to touch your system during normal operations, you have a bug. At Google, at least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.
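The 50% guideline can be checked with a simple ratio. This is a minimal sketch under stated assumptions: the `WorkLog` type, its field names, and the hour-based bookkeeping are all hypothetical, chosen only to show the calculation.

```python
from dataclasses import dataclass

# Ceiling on toil implied by "at least 50% on engineering project work".
TOIL_CAP = 0.5


@dataclass
class WorkLog:
    """Hypothetical record of how one SRE spent a period of time."""
    toil_hours: float         # manual, repetitive operational work
    engineering_hours: float  # project work that reduces toil or adds features


def toil_fraction(log: WorkLog) -> float:
    """Share of logged time that went to toil."""
    total = log.toil_hours + log.engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.toil_hours / total


def over_toil_cap(log: WorkLog, cap: float = TOIL_CAP) -> bool:
    """True when toil takes more than the allowed share of time."""
    return toil_fraction(log) > cap


if __name__ == "__main__":
    week = WorkLog(toil_hours=12, engineering_hours=28)
    print(f"toil share: {toil_fraction(week):.0%}, over cap: {over_toil_cap(week)}")
```

How hours are classified as toil is the hard part in practice; the code only does the division once that judgment has been made.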
  9. MONITORING: Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. A Google SRE team with 10–12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.
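To illustrate the collect, aggregate, and display steps named in the monitoring definition, here is a small in-memory sketch. The class `ServiceMetrics`, its method names, and the sample query and error types are assumptions made up for the example; a real service would normally rely on a dedicated monitoring system rather than code like this.

```python
from collections import Counter
from statistics import mean
from typing import Optional


class ServiceMetrics:
    """Hypothetical in-memory collector for per-query measurements."""

    def __init__(self) -> None:
        self.query_counts = Counter()   # query counts by type
        self.error_counts = Counter()   # error counts by type
        self.processing_times_ms = []   # one entry per query

    def record(self, query_type: str, duration_ms: float,
               error_type: Optional[str] = None) -> None:
        """Collect one observation."""
        self.query_counts[query_type] += 1
        self.processing_times_ms.append(duration_ms)
        if error_type is not None:
            self.error_counts[error_type] += 1

    def summary(self) -> dict:
        """Aggregate the raw observations into a few headline numbers."""
        total = sum(self.query_counts.values())
        errors = sum(self.error_counts.values())
        return {
            "total_queries": total,
            "error_rate": errors / total if total else 0.0,
            "mean_processing_ms": (mean(self.processing_times_ms)
                                   if self.processing_times_ms else 0.0),
        }


if __name__ == "__main__":
    metrics = ServiceMetrics()
    metrics.record("read", 10.0)
    metrics.record("write", 30.0, error_type="timeout")
    print(metrics.summary())  # the "display" step, reduced to a print
```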
  10. AUTOMATING: Automate Yourself Out of a Job: Automate ALL the Things! • User account creation. • Software or hardware installation preparation and decommissioning. • Rollouts of new software versions. • Runtime configuration changes. Automate the Resilience!
  11. RELEASE ENGINEERING: Continuous Build and Deployment! Release engineers have a solid understanding of source code management, compilers, build configuration languages, automated build tools, package managers, and installers. 4 principles: Self-Service Model, High Velocity, Hermetic Builds, Enforcement of Policies and Procedures.
  12. SKILLS • Software Engineering • Distributed Systems Design • Operating Systems • Networking • Databases • Security • Reliability • Troubleshooting • Customer Support
  13. The best way to promote a DevOps & SRE culture is to adopt a new view, one focused on the symptoms, not on the causes ...