Panic Attack – a discussion about kdump, panic notifiers, graphics on crash event and all of that
The crash/panic path and its related machinery have always been contentious; it is an area naturally full of trade-offs, conflicting views and antagonistic goals. On one side we have kdump (also called crash_kexec), which requires touching the system as little as possible before the kexec actually happens; at the same time, such a crash kexec requires adapter resets and other special clean-ups (especially under hypervisors) to work properly. On top of that, there are the non-kdump users that rely on panic notifiers to perform last-minute actions, the data collection mechanisms such as kmsg dumpers (pstore, for example), and the firmware-based approaches (like the PowerPC fadump). Finally, graphical output is completely absent in such scenarios, which makes it hard to debug, or even for a regular user to see what is happening.
The starting point for the discussion proposed here is the thread “The panic notifiers refactor”. It all started with a notifiers filter [0] that I submitted; Petr Mladek suggested that we should instead improve the notifiers themselves, which led to this refactor. The topic is quite contentious: as mentioned, we have conflicting users and goals, so it is hard to reach a consensus. Part of the difficulty comes from the involvement of architecture code, which is very important for kexec/crash (for an example of an architecture spin-off discussion, see [1]). Also, as a reference for PCI device resets and their complexities in the kdump realm, see [2]. Part of this effort was merged as a set of fixes (see [3]), and I am working on the second round of the refactor itself, taking into account the ideas from V1; I plan to submit it in July 2023.
Now, to the second part of this proposal: currently there is absolutely no way of having graphics output in a crash event. There was a (broken!) panic notifier some years ago, but it was properly removed. Recent efforts in this direction were not merged or did not progress much; see [4] for example. This area is becoming increasingly interesting since Linux is now being used for gaming: for example, the Steam Deck [5] console is fully based on Linux and FOSS, but its users are unable to notice a panic due to the lack of graphical output. Finally, there is also potential for firmware-aided data collection or even graphics help; we currently struggle to have framebuffer graphics on kdump (see [6] and [7] for other discussions we started about that some time ago).
As the topics above show, this area is active on multiple fronts, but it does not usually receive the necessary “love” from vendors or even distros; the efforts are usually diffuse and scattered.
The goal of this proposal is to present the latest advances, what is still missing, and what could be improved with regard to kernel crashes and the mechanisms that collect data in a panic event.
Guilherme G. Piccoli