Site Reliability Engineering - Session 1

SRE Introduction

AGENDA • What is SRE? • SRE & DevOps • Principles • Practices • SLOs: Vocabulary, Error Budgets • State of the Art • A Day in the Life of an SRE

TITANIC • QUEBEC BRIDGE

WHAT DO THESE CASES HAVE IN COMMON?

Those systems were not RELIABLE! Reliability describes the ability of a system or component to perform its required function under stated conditions for a specified period of time.

RELIABILITY ENGINEERING: Reliability engineering is the discipline that applies scientific know-how to a component, product, or process to ensure that it performs its intended function, without failure, for a specified time.

TOOLING • CLOUD OPERATING MODEL
SQUADS • SRE Squad • DevOps Feature Squad 1 • DevOps Feature Squad 2 • DevOps Feature Squad N
SRE SQUAD KPIs • Performance • Uptime • Minimal Cost per Service • Deployability
SERVICES • Automation Framework • Self-Service Infrastructure • Logging & Metrics • Monitoring & Reporting • Scaling
PRACTICES • Know the Service Level • Embrace Risk • Eliminate Toil • Know What's Broken and Why • Stuff Happens • Automate [Almost] Everything • Reliable Releases • Keep It Simple • Chaos Engineering

AREAS OF EXPERTISE • Introduction

SRE HISTORY • 2003: Ben Treynor coins the term SRE at Google • 2009: DevOps is born • 2014: First conference about SRE: SREcon • 2016-2018: The SRE books are released • 2019: SRE goes mainstream

3 CHARACTERISTICS • NALSD (Non-Abstract Large System Design) • Avoid unnecessary work: it's toil! • Say yes to curiosity

DEVOPS OR SRE

PRINCIPLES

PRINCIPLES • Embrace Risk • Monitor Distributed Systems • Service Level Objectives • Eliminate Toil

PRINCIPLES • Automation • Release Engineering • Simplicity

EMBRACING RISK
You might expect to build 100% reliable services, ones that never fail. However, past a certain point, increasing reliability is worse for a service rather than better! Extreme reliability comes at a cost! Embrace the risk! I don't know… I like the adrenaline.
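Embracing risk is usually made concrete with an error budget: if the SLO is an availability target, the budget is the unavailability that target still allows. The sketch below assumes a time-based availability SLO over a fixed window. The function names, the 99.9% target and the 30-day window are illustrative assumptions, not taken from the slides.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability SLO over a window (illustrative helper).

    `slo` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after subtracting observed downtime; negative means overspent."""
    return error_budget(slo, window) - downtime


# 99.9% over 30 days allows roughly 43 minutes of downtime.
print(error_budget(0.999, timedelta(days=30)))  # 0:43:12
```

A common policy, offered here as one option rather than a rule, is to keep shipping while the remaining budget is positive and to freeze non-emergency releases once it is spent.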

ELIMINATING TOIL
Toil is not just "work I don’t like to do." If a human operator needs to touch your system during normal operations, you have a bug. At Google, at least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.
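The 50% rule can be checked from a simple time log. The sketch below assumes each SRE records hours per category. The `TimeLog` class, its three categories and the `meets_engineering_target` helper are hypothetical, for illustration only.

```python
from dataclasses import dataclass


@dataclass
class TimeLog:
    """Hours one SRE logged in a period (hypothetical categories)."""

    engineering_hours: float  # project work that reduces toil or adds features
    toil_hours: float         # manual, repetitive, automatable operational work
    other_hours: float = 0.0  # meetings, training, etc.

    @property
    def engineering_share(self) -> float:
        total = self.engineering_hours + self.toil_hours + self.other_hours
        if total <= 0:
            raise ValueError("no hours logged")
        return self.engineering_hours / total


def meets_engineering_target(log: TimeLog, target: float = 0.5) -> bool:
    """True if at least `target` of the logged time went to engineering work."""
    return log.engineering_share >= target


print(meets_engineering_target(TimeLog(engineering_hours=20, toil_hours=15, other_hours=5)))
```

Tracking the trend matters more than any single week. A team whose engineering share keeps falling is accumulating toil.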

MONITORING
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. A Google SRE team with 10–12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.
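The "aggregating" step often means turning raw error counts into a rate over a recent window. The sketch below assumes per-request events arrive with non-decreasing timestamps in seconds. The `SlidingWindowCounter` class is hypothetical and is not the API of any monitoring product.

```python
from collections import deque


class SlidingWindowCounter:
    """Tracks requests and errors seen in the last `window_s` seconds (illustrative)."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        # Assumes timestamps never go backwards.
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def error_rate(self, now: float) -> float:
        """Fraction of requests in the window that failed (0.0 if none)."""
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)


counter = SlidingWindowCounter(window_s=60)
for ts, failed in [(0, False), (10, True), (20, False), (30, True)]:
    counter.record(ts, failed)
print(counter.error_rate(now=30))
```

Real monitoring systems usually store pre-aggregated counters rather than individual events. The deque simply keeps the windowing logic easy to follow.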

EMBRACING RISK

AUTOMATING
Automate Yourself Out of a Job: Automate ALL the Things! • User account creation • Software or hardware installation preparation and decommissioning • Rollouts of new software versions • Runtime configuration changes. Automate the Resilience!
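Automation tasks like account creation and decommissioning are often written as reconciliation: compare the desired state with the actual state and act only on the difference. That makes the job safe to re-run. This is one common pattern, not necessarily the one the talk had in mind. The `plan` and `reconcile` helpers and the `create`/`remove` callbacks are hypothetical stand-ins for real provisioning calls.

```python
from __future__ import annotations

from typing import Callable, Iterable


def plan(desired: Iterable[str], actual: Iterable[str]) -> tuple[list[str], list[str]]:
    """Return (to_create, to_remove), sorted so runs are deterministic."""
    desired_set, actual_set = set(desired), set(actual)
    return sorted(desired_set - actual_set), sorted(actual_set - desired_set)


def reconcile(
    desired: Iterable[str],
    actual: Iterable[str],
    create: Callable[[str], None],  # hypothetical: provisions one account
    remove: Callable[[str], None],  # hypothetical: decommissions one account
) -> None:
    """Apply only the difference; running it again on converged state does nothing."""
    to_create, to_remove = plan(desired, actual)
    for name in to_create:
        create(name)
    for name in to_remove:
        remove(name)


reconcile({"ana", "bob"}, {"bob", "carl"}, create=print, remove=print)
```

Because the plan is computed from state rather than from a list of steps, a run that fails halfway can simply be repeated.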

RELEASE ENGINEERING
Continuous Build and Deployment! Release engineers have a solid understanding of source code management, compilers, build configuration languages, automated build tools, package managers, and installers. 4 principles: Self-Service Model, High Velocity, Hermetic Builds, Enforcement of Policies and Procedures.
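A hermetic build depends only on its declared inputs, so the same sources and the same pinned toolchain always produce the same result. The sketch below shows one way to see this property: derive a build key by hashing every input in a fixed order. The `build_key` function and the toolchain version string are illustrative assumptions, not any build tool's actual API.

```python
import hashlib
from typing import Mapping


def build_key(sources: Mapping[str, bytes], toolchain_version: str) -> str:
    """SHA-256 over all declared build inputs, independent of dict order."""
    h = hashlib.sha256()
    h.update(toolchain_version.encode("utf-8"))
    h.update(b"\0")  # separator so the version cannot run into the first path
    for path in sorted(sources):
        content = sources[path]
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(len(content).to_bytes(8, "big"))  # length prefix avoids ambiguity
        h.update(content)
    return h.hexdigest()


inputs = {"main.c": b"int main(void) { return 0; }", "util.h": b"#pragma once"}
print(build_key(inputs, "gcc-13"))
```

If anything outside `sources` and `toolchain_version` could change the artifact, such as an unpinned system library, the build would not be hermetic even though the key stays the same.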

PRACTICES

PRACTICES • Postmortem Culture • Being On-Call • Incident Response • Load Management

PRACTICES • Testing for Reliability • Simplicity • Software Engineering • Effective Solutions

PRACTICES • Pipelines • Chaos Engineering • Data Integrity • Canary Releases

PRACTICES

SKILLS • Software Engineering • Distributed Systems Design • Operating Systems • Networking • Databases • Security • Reliability • Troubleshooting • Customer Support

The best way to promote a DevOps & SRE culture is to adopt a new view, one focused on the symptoms, not on the causes...

SRE