“three nines” (99.9%) service, this is your error budget:
– 43 minutes of downtime in a month
– 2.2 hours of downtime in a quarter
– 8.8 hours of downtime in a year
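The error budget figures above follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the window. A minimal sketch, assuming a 30-day month and a 365-day year (the function and constant names are illustrative, not from the original slides):

```javascript
// Allowed downtime, in minutes, for an availability target over a window.
// `availabilityPercent` and `windowHours` are hypothetical parameter names.
function errorBudgetMinutes(availabilityPercent, windowHours) {
  const unavailableFraction = 1 - availabilityPercent / 100;
  return unavailableFraction * windowHours * 60;
}

// Window lengths are assumptions: a 30-day month and a 365-day year.
const HOURS_PER_MONTH = 30 * 24;
const HOURS_PER_YEAR = 365 * 24;
const HOURS_PER_QUARTER = HOURS_PER_YEAR / 4;

const monthlyBudget = errorBudgetMinutes(99.9, HOURS_PER_MONTH);
const quarterlyBudget = errorBudgetMinutes(99.9, HOURS_PER_QUARTER);
const yearlyBudget = errorBudgetMinutes(99.9, HOURS_PER_YEAR);

console.log(`month:   ${monthlyBudget.toFixed(1)} minutes`);
console.log(`quarter: ${(quarterlyBudget / 60).toFixed(2)} hours`);
console.log(`year:    ${(yearlyBudget / 60).toFixed(2)} hours`);
```

Under these assumptions, 99.9% works out to roughly 43.2 minutes per month, 2.19 hours per quarter, and 8.76 hours per year, which matches the rounded figures on the slide.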
called SLAs (service level agreements)
▪ An SLA consists of an SLO (service level objective), such as “99% of the requests finish in under 200 ms”, and penalties if you don’t meet it
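An objective like “99% of the requests finish in under 200 ms” can be checked mechanically against measured latencies. A minimal sketch, assuming latencies are already collected as an array of milliseconds (the function name and the choice to treat an empty sample as compliant are assumptions):

```javascript
// Returns true when at least `targetFraction` of the requests finished
// faster than `thresholdMs`. All names here are illustrative.
function meetsLatencySlo(latenciesMs, thresholdMs, targetFraction) {
  // Assumption: with no traffic there is nothing to violate.
  if (latenciesMs.length === 0) return true;
  const fastEnough = latenciesMs.filter((ms) => ms < thresholdMs).length;
  return fastEnough / latenciesMs.length >= targetFraction;
}

// Example: 98 fast requests and 2 slow ones miss a 99% objective.
const sample = [...Array(98).fill(120), 450, 900];
console.log(meetsLatencySlo(sample, 200, 0.99));
```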
▪ How to secure your Node.js applications
▪ How to get full visibility into production systems
  – Best practices for logging
  – Monitoring your Node.js applications
▪ What to do after all hell breaks loose
res, next) => {
  if (err.isServer) {
    // log the error...
    // probably you don't want to log unauthorized access
    // or do you?
  }
  return res.status(err.output.statusCode).json(err.output.payload)
})
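The handler fragment above reads `err.isServer` and `err.output`, which looks like the shape of Boom-style errors; that is an inference from the property names, not something the slide states. A self-contained sketch of the same idea, with an added fallback for errors that lack `output` (the fallback and the function name are assumptions):

```javascript
// Error-handling middleware in the four-argument Express style.
// Assumes errors may carry Boom-like `isServer` and `output` fields.
function errorHandler(err, req, res, next) {
  const hasOutput = Boolean(err.output);
  // Assumption: anything without `output` is treated as a server error.
  const statusCode = hasOutput ? err.output.statusCode : 500;
  const payload = hasOutput
    ? err.output.payload
    : { statusCode: 500, error: 'Internal Server Error' };

  if (err.isServer || !hasOutput) {
    // Log only server-side failures; client errors stay quiet here.
    console.error(err);
  }
  return res.status(statusCode).json(payload);
}
```

With Express this would be registered last, for example `app.use(errorHandler)`, so that it receives errors passed to `next(err)` by earlier middleware.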
clickjacking attacks,
• Strict-Transport-Security to keep your users on HTTPS,
• X-XSS-Protection to prevent reflected XSS attacks,
• X-DNS-Prefetch-Control to disable browsers’ DNS prefetching.
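To make the header list concrete, here is a minimal sketch that sets these headers by hand on a Node.js response object. The header values are illustrative assumptions, and `X-Frame-Options` is included on the assumption that it is the clickjacking header the truncated first item refers to:

```javascript
// Header values below are example settings, not recommendations
// taken from the original slides.
const SECURITY_HEADERS = {
  'X-Frame-Options': 'SAMEORIGIN',
  'Strict-Transport-Security': 'max-age=15552000; includeSubDomains',
  'X-XSS-Protection': '1; mode=block',
  'X-DNS-Prefetch-Control': 'off',
};

// Works with any object exposing setHeader(name, value),
// such as http.ServerResponse.
function applySecurityHeaders(res) {
  for (const [name, value] of Object.entries(SECURITY_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}
```

In a plain `http.createServer((req, res) => { ... })` handler, calling `applySecurityHeaders(res)` before writing the body would attach all four headers to the response.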
given event happened,
• Format to keep log lines readable for both humans and machines,
• Destination should be the standard output and error only,
• Support for log levels
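The logging requirements above (a timestamp, a consistent format, standard output and error as the only destinations, and log levels) fit in a few lines. A minimal sketch, assuming JSON lines as the format and `error`, `warn`, `info`, `debug` as the level names (both are assumptions, as are all identifiers):

```javascript
// Lower number means higher severity.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

// One JSON object per line: easy to grep and easy to parse.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({ ...fields, timestamp: now.toISOString(), level, message });
}

// Creates a logger that drops events less severe than `minLevel`.
function createLogger(minLevel = 'info') {
  const logAt = (level) => (message, fields) => {
    if (LEVELS[level] > LEVELS[minLevel]) return;
    // Errors and warnings go to stderr, everything else to stdout.
    const stream = LEVELS[level] <= LEVELS.warn ? process.stderr : process.stdout;
    stream.write(formatLogLine(level, message, fields) + '\n');
  };
  return {
    error: logAt('error'),
    warn: logAt('warn'),
    info: logAt('info'),
    debug: logAt('debug'),
  };
}

const logger = createLogger('info');
logger.info('HTTP server listening', { port: 3000 });
logger.debug('this line is filtered out at the info level');
```

Writing only to the standard streams leaves routing and retention to whatever runs the process, which is the point of the destination rule.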
errors, always reported
• Used whenever an unexpected error happens that prevents further processing
• The app may try to recover (for example, after a lost database connection) or forcefully terminate
events indicating irregular circumstances, with a clearly defined recovery strategy
• They have no impact on system availability or performance
• These events should be reported too
events indicate major state changes in the application, like the startup of the HTTP server
• Each component should log:
  • When it starts and when it becomes operational
  • When it starts shutting down, and just before it stops
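The four lifecycle events listed above can be logged by wrapping a component's start and stop routines. A minimal sketch, assuming each component exposes asynchronous `start()` and `stop()` methods (that interface, the wrapper name, and the message wording are all hypothetical):

```javascript
// Wraps a component so that every lifecycle transition is logged.
// `log` defaults to console.log; pass your own logger function instead.
function withLifecycleLogging(name, component, log = console.log) {
  return {
    async start() {
      log(`${name}: starting`);
      await component.start();
      log(`${name}: operational`);
    },
    async stop() {
      log(`${name}: shutdown started`);
      await component.stop();
      log(`${name}: shutdown complete, about to stop`);
    },
  };
}
```

A database client or HTTP server adapter could be wrapped this way, so that the time between "starting" and "operational" is visible in the logs when a deployment hangs.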
level events, for internal state changes
• These events are usually not reported; they are used only for troubleshooting
• Logged at the discretion of the engineer developing the system component
rate, as they directly affect customer satisfaction;
• Latency, as the slower the service, the more likely your customers are to close your application;
• Throughput, to put error rate and latency in context;
• Saturation, to tell if you can handle more traffic.
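The four signals above can be derived from a window of request records plus a capacity figure. A minimal sketch, assuming each record has `statusCode` and `durationMs`, that only 5xx responses count as errors, and that saturation is measured as in-flight requests over a known capacity (every one of these is an assumption for illustration):

```javascript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return 0;
  const index = Math.ceil((p / 100) * sortedValues.length) - 1;
  return sortedValues[Math.max(0, index)];
}

// Summarizes one observation window. All field names are hypothetical.
function summarizeSignals({ requests, windowSeconds, inFlight, capacity }) {
  const latencies = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = requests.filter((r) => r.statusCode >= 500).length;
  return {
    errorRate: requests.length === 0 ? 0 : errors / requests.length,
    latencyP95Ms: percentile(latencies, 95),
    throughputPerSecond: requests.length / windowSeconds,
    saturation: inFlight / capacity,
  };
}

const summary = summarizeSignals({
  requests: [
    { statusCode: 200, durationMs: 10 },
    { statusCode: 200, durationMs: 20 },
    { statusCode: 500, durationMs: 30 },
    { statusCode: 200, durationMs: 40 },
  ],
  windowSeconds: 2,
  inFlight: 5,
  capacity: 10,
});
console.log(summary);
```

Reporting throughput next to the error rate is what gives the latter context: one failure out of four requests means something different from one out of four thousand.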
will think about how their software is going to run in production
• Encourages ownership and accountability, which leads to more independent, responsible teammates
• Leads to operational excellence
• Which leads to more satisfied customers
disaster recovery for Kubernetes cluster resources and persistent volumes
• Helps with
  • Disaster recovery
  • Cloud provider migration
  • Cloning the production environment for development / testing