more information
• In-band and out-of-band monitoring are complementary

In-band monitoring
• The kernel is well positioned to know when hardware fails
• It has a ton of contextual information

Interfacing
• But logging on the console has limited usability
• Is there an API to get this context?

in strategic position in the kernel
• Log structured information with very low overhead
• Ftrace API

Example Hardware errors
• Many kernel error code paths have tracepoints (e.g. block_rq_error)
• rasdaemon
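
As a rough illustration of the tracepoint bullets above, the sketch below switches a single tracepoint on through tracefs. It assumes tracefs is mounted at /sys/kernel/tracing (older systems expose it under /sys/kernel/debug/tracing) and that the running kernel provides the block:block_rq_error event. The helper names are made up for this example.

```python
import os

# Assumption: tracefs mount point; older systems use /sys/kernel/debug/tracing.
TRACEFS_ROOT = "/sys/kernel/tracing"


def event_enable_path(subsystem, event, root=TRACEFS_ROOT):
    """Return the path of the tracefs 'enable' control file for one tracepoint."""
    return os.path.join(root, "events", subsystem, event, "enable")


def set_event_enabled(subsystem, event, enabled, root=TRACEFS_ROOT):
    """Write '1' or '0' to the tracepoint's enable file.

    On a real system this needs root privileges and an existing event directory.
    """
    path = event_enable_path(subsystem, event, root)
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
    return path
```

On a real machine, `set_event_enabled("block", "block_rq_error", True)` run as root would turn the event on, after which records can be read from the `trace_pipe` file under the same tracefs root.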

if you want more context?
• Do you need to recompile the kernel?

Observability superpowers
• Low overhead in-kernel virtual machine
• BPF programs can be attached to tracepoints / error functions
• Can be used to create our own context

Flight recorder pattern
• Record context in kernel-space
• Output to user-space on error
• User-space exfiltrates to a centralized location for analysis
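
The three flight-recorder steps above can be modelled with a small sketch. It only simulates the logic in plain Python. In the pattern described here the recording side would be a BPF program keeping context in kernel-space, which this example does not attempt. The class, function, and field names are hypothetical.

```python
from collections import deque


class FlightRecorder:
    """Keep only the most recent events, like a fixed-size ring buffer."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._events = deque(maxlen=capacity)

    def record(self, event):
        """Step 1: record context continuously (kernel-space in the real pattern)."""
        self._events.append(event)

    def dump_on_error(self, error):
        """Step 2: on error, hand the buffered context over (to user-space)."""
        snapshot = {"error": error, "context": list(self._events)}
        self._events.clear()
        return snapshot


def ship(snapshot, sink):
    """Step 3: forward the snapshot to a central location (here, just a list)."""
    sink.append(snapshot)
```

The fixed capacity is the point of the design: recording stays cheap and bounded during normal operation, and only the last few events before a failure are ever sent out.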