André Restivo, Hugo Sereno Ferreira {jpmdias,arestivo,hugosf}@fe.up.pt June 3rd, 2021 3rd International Workshop on Software Engineering Research & Practices for the Internet of Things (SERP4IoT 2021) Co-located with the 43rd ACM/IEEE International Conference on Software Engineering (ICSE 2021)
  - Rushed development of devices and systems by competing vendors;
  - Overall neglect of interoperability standards and best practices;
  - The result is a highly complex, heterogeneous, and frangible ecosystem.
• More and more stories about devices that stop working or behave in unforeseeable ways:
  - e.g., smart locks that open at random, doorbells that do not work without Internet access, unsafe thermostat temperature adjustments...
• Few to no safeguards or fallbacks:
  - Ignoring years of research in other fields, e.g., mission-critical systems.
• Traditional development approaches are becoming unsuitable for IoT development:
  - Visual programming solutions (and other low-code approaches) have been proposed as an alternative.
popular visual programming solution used for IoT development:
  - Leverages flows, built by dragging and dropping nodes and links;
  - Provides both a development environment and a runtime;
  - Additional nodes can be added by implementing them in JavaScript.
• Few available nodes allow developers to improve the resilience of flows, and none come with the default node palette.
  - This is a current limitation of most low-code IoT development solutions.
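As an illustration of what "implementing a node in JavaScript" involves, the sketch below follows the registration pattern from the Node-RED node-creation documentation (`RED.nodes.createNode`, `RED.nodes.registerType`, `node.on("input", ...)`, `node.send`, `node.warn`). The node name `range-guard` and its `min`/`max` options are hypothetical and only serve the example; this is not one of the nodes presented in this work.

```javascript
// Sketch of a custom Node-RED node. The node name "range-guard" and its
// min/max options are hypothetical and exist only for this example.
function registerRangeGuard(RED) {
  function RangeGuardNode(config) {
    // Let the runtime initialise this node instance.
    RED.nodes.createNode(this, config);
    const node = this;
    const min = Number(config.min);
    const max = Number(config.max);

    // Called for every message that arrives on the node's input.
    node.on("input", function (msg) {
      const value = Number(msg.payload);
      if (Number.isFinite(value) && value >= min && value <= max) {
        node.send(msg); // forward in-range readings unchanged
      } else {
        node.warn("dropping out-of-range payload: " + msg.payload);
      }
    });
  }
  RED.nodes.registerType("range-guard", RangeGuardNode);
}

// In an actual node file this function would be the module export:
// module.exports = registerRangeGuard;
```

A real node also needs an accompanying HTML file describing its editor appearance and configuration fields, which is omitted here.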
computing as an approach to mitigate some management issues of complex IoT systems.
• An autonomic computing system should be able to:
  - Configure itself (self-configuration);
  - Constantly improve its performance (self-optimization);
  - Protect itself against malicious attacks (self-protection);
  - Automatically detect, diagnose, and repair system defects (self-healing).
responsible objects, stating that things should be self-aware of their context and apply smart self-healing decisions.
  - However, the transactional nature of the proposed solution has several shortcomings.
• Aktas et al. and Leotta et al. were among the few to propose the use of runtime verification mechanisms to detect system problems. The former use a complex-event processing approach and the latter use formal specifications of the system (i.e., UML).
• Szydlo et al., Blackstock et al., and others have proposed solutions to improve the reliability of Node-RED itself:
  - Approaches include partitioning Node-RED flows across instances or converting flows (or nodes) into code that can be run by edge devices.
  - These approaches disregard typical edge tier capabilities or assume computational power above what is typical of constrained devices.
  - We proposed an approach to improve the reliability of Node-RED flows by leveraging existing nodes (abstracted into sub-flows), and found several limitations and shortcomings of the approach and of Node-RED itself.
• “A Pattern-Language for Self-Healing Internet-of-Things Systems”:
  - We systematized the existing knowledge from several fields regarding reliability, fault-tolerance, and self-healing into a pattern language with 27 patterns.
  - Defined two pattern categories:
    - Error Detection (probes)
    - Recovery and Maintenance of Health
of 17 nodes that can be used to add self-healing capabilities to Node-RED flows.
• A single node might leverage more than one self-healing pattern.
  - Some nodes do both detection and recovery (or maintenance of health), while others do only one of the two.
  - Some nodes only cover specific use cases of the general pattern (e.g., Kalman noise filter).
• Some self-healing patterns cannot be implemented in Node-RED alone, e.g., because they depend on the device's features or exposed interfaces.
• The extension only encompasses nodes (and flows) with reactive behavior.
  - We consider proactive (e.g., preventive) approaches as a future research direction.
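To make the Kalman noise filter example concrete, the following is a minimal sketch of a scalar (one-dimensional) Kalman filter that smooths noisy readings of a slowly changing value. The class name, the constant-state model, and the default noise parameters are assumptions chosen for illustration; they do not describe the extension's actual node.

```javascript
// Scalar Kalman filter for a roughly constant signal with noisy readings.
// processNoise (q) and measurementNoise (r) are illustrative tuning values.
class ScalarKalmanFilter {
  constructor({ processNoise = 0.01, measurementNoise = 1.0 } = {}) {
    this.q = processNoise;
    this.r = measurementNoise;
    this.estimate = null; // no estimate until the first reading arrives
    this.errorCovariance = 1.0;
  }

  update(measurement) {
    if (this.estimate === null) {
      // Bootstrap the filter with the first reading.
      this.estimate = measurement;
      return this.estimate;
    }
    // Predict: the state is assumed constant, so only uncertainty grows.
    const predictedCovariance = this.errorCovariance + this.q;
    // Update: blend prediction and measurement using the Kalman gain.
    const gain = predictedCovariance / (predictedCovariance + this.r);
    this.estimate = this.estimate + gain * (measurement - this.estimate);
    this.errorCovariance = (1 - gain) * predictedCovariance;
    return this.estimate;
  }
}
```

Feeding each incoming reading to `update` yields a smoothed value; a larger `measurementNoise` makes the filter trust individual readings less.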
temperature and humidity readings every 60 seconds.
  - Possible errors: the sensor does not emit readings, values are out-of-spec for the sensor, or the sensor misbehaves in other ways (e.g., stuck-at readings). Additionally, Node-RED can restart (e.g., due to a crash).
• The extra nodes detect (1) if the device stops emitting values (heartbeat) and (2) if values are out-of-spec (threshold-check). If any issue appears, (3) missing values are compensated (compensate). If Node-RED restarts for some reason, the last reading is injected into the flow output (checkpoint).
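The detection and compensation steps of this scenario can be sketched in plain JavaScript as follows. All names, the timeout, the value range, and the choice of compensating with the mean of recent valid readings are assumptions made for this illustration; the actual nodes may use different strategies and options.

```javascript
// Hypothetical helpers; names and defaults are not the extension's API.

// Heartbeat: flags a device as missing when no reading arrives in time.
class HeartbeatMonitor {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = null;
  }
  beat(nowMs) {
    this.lastSeen = nowMs;
  }
  isMissing(nowMs) {
    return this.lastSeen === null || nowMs - this.lastSeen > this.timeoutMs;
  }
}

// Threshold check: is the value a finite number inside the sensor's range?
function isInSpec(value, min, max) {
  return typeof value === "number" && Number.isFinite(value) &&
    value >= min && value <= max;
}

// Compensate: keeps recent valid readings and offers their mean as a stand-in.
class Compensator {
  constructor(historySize = 5) {
    this.historySize = historySize;
    this.history = [];
  }
  record(value) {
    this.history.push(value);
    if (this.history.length > this.historySize) this.history.shift();
  }
  compensate() {
    if (this.history.length === 0) return null; // nothing to fall back on yet
    const sum = this.history.reduce((acc, v) => acc + v, 0);
    return sum / this.history.length;
  }
}

// Combines the threshold check with compensation for one reading.
function processReading(value, spec, compensator) {
  if (isInSpec(value, spec.min, spec.max)) {
    compensator.record(value);
    return { value, compensated: false };
  }
  return { value: compensator.compensate(), compensated: true };
}
```

For readings expected every 60 seconds, a heartbeat timeout somewhat above that (for example 90 seconds) leaves room for jitter before a reading is treated as missing.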
is used to authenticate accesses to the lab.
  - The usage frequency varies during the day, and the load can require extra resources.
• The extra nodes (1) classify the frequency of readings (timing) as slow, fast, or normal, configured with a 15 s interval per reading. If there is a load spike, the requests are balanced among the available resources.
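The timing classification can be sketched as a comparison of each inter-arrival interval against the expected 15 s interval. The function names and the 20% tolerance band are assumptions for illustration only; the source does not state how the margins are defined.

```javascript
// Classifies one inter-arrival interval against the expected interval.
// The tolerance band (20% by default) is an illustrative assumption.
function classifyInterval(intervalMs, expectedMs = 15000, tolerance = 0.2) {
  const lower = expectedMs * (1 - tolerance);
  const upper = expectedMs * (1 + tolerance);
  if (intervalMs < lower) return "fast";
  if (intervalMs > upper) return "slow";
  return "normal";
}

// Labels every gap between consecutive arrival timestamps (in milliseconds).
function classifyArrivals(timestampsMs, expectedMs = 15000, tolerance = 0.2) {
  const labels = [];
  for (let i = 1; i < timestampsMs.length; i++) {
    const interval = timestampsMs[i] - timestampsMs[i - 1];
    labels.push(classifyInterval(interval, expectedMs, tolerance));
  }
  return labels;
}
```

A run of "fast" labels could then serve as the signal for a load spike that triggers balancing across resources.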
Node-RED instances running (which have different flows configured), if one fails, the other (which becomes the main one) must enable a specific flow (which ensures the maintenance of health of the system).
• The extra nodes (1) manage different instances by exchanging ping and election messages (redundancy) and (2) enable or disable flows at runtime (flow-control), thus allowing such self-healing behaviors to be configured (RUNTIME ADAPTATION).
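One way to picture the ping and election exchange is the sketch below, in which each instance records when it last heard from its peers and the live instance with the highest identifier is treated as the main one. The class, the numeric identifiers, and the highest-id election rule are assumptions for illustration; the source does not specify the election algorithm used by the redundancy node.

```javascript
// Hypothetical model of one instance taking part in ping-based redundancy.
class RedundancyMember {
  constructor(id, pingTimeoutMs) {
    this.id = id;
    this.pingTimeoutMs = pingTimeoutMs;
    this.lastPing = new Map(); // peer id -> timestamp of its last ping
  }

  // Record a ping received from a peer instance.
  onPing(peerId, nowMs) {
    this.lastPing.set(peerId, nowMs);
  }

  // Peers whose last ping is recent enough to be considered alive.
  alivePeers(nowMs) {
    return [...this.lastPing.entries()]
      .filter(([, seen]) => nowMs - seen <= this.pingTimeoutMs)
      .map(([peerId]) => peerId);
  }

  // Election rule (assumption): the highest id among live instances wins.
  electMain(nowMs) {
    return Math.max(this.id, ...this.alivePeers(nowMs));
  }

  isMain(nowMs) {
    return this.electMain(nowMs) === this.id;
  }
}

// Flow control: the recovery flow runs only on the instance that is main.
function recoveryFlowEnabled(member, nowMs) {
  return member.isMain(nowMs);
}
```

With two instances, the lower-id instance enables its recovery flow only once the pings from the other instance stop arriving within the timeout.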
we showcase the feasibility of using self-healing mechanisms within Node-RED flows to improve the system dependability.
• However, we consider that these scenarios show only a portion of the possible configurations and uses of the self-healing extensions.
• The experimental scenarios, although inspired by real-world use cases, have been hand-picked with prior knowledge of the system, which is one of the considered threats to validity.
• We also consider that using a real deployed testbed enhances the quality of the experiments. Nonetheless, it also poses some limitations/threats:
  - Limits the number of devices used during the experiments due to additional costs;
  - Makes the experiments more difficult to replicate;
  - Capturing failures-over-time requires long-running experiments;
  - The users that typically interact with our system exhibit a level of expertise that is not representative of most IoT deployment scenarios.
extensions we have enabled Node-RED users to improve the overall system dependability via the addition of self-healing mechanisms.
• We have also encountered several limitations in the current version, which we consider as future work:
  - Some nodes have issues dealing with some cases (e.g., the redundancy node is unable to deal with runtime network partitioning);
  - Node-RED's points of extension limit what we can do without modifying Node-RED itself or the end-devices;
  - Most of the nodes do not take acceptable delays/margins into consideration (e.g., a delay of 1 s can be ignored for most smart home applications);
  - The nodes for device/service discovery, device registry, and resource monitoring are limited due to the nature of IoT (e.g., lack of standards and the heterogeneity of communication protocols);
  - To better understand the limitations of our approach we need to be able to deliberately provoke failures, a process which is currently mostly manual.