Chaos testing: Testing the robustness of self-adaptive distributed applications against perturbations
14 August 2019
Context: Most current software systems are highly distributed, modular, and configurable, and embed their own adaptation mechanisms to withstand the perturbations they may encounter. These fault-tolerance mechanisms operate at the system configuration level, updating settings such as the running software versions, the node configuration, or the application properties. The softwarization of modern networks has led to the emergence of virtual network functions, which are themselves highly configurable; this makes every layer of a distributed system more agile but also makes the system as a whole more complex. Problem: This complexity, combined with the absence of a single entry point for managing network reconfiguration, software deployment, and internal application policies, prevents an a priori assessment of how a complex system behaves when facing perturbations or even a simple update of a software brick. The lack of a validation tool to evaluate the robustness of self-adaptive distributed applications against perturbations is a major problem, as recent bugs have highlighted. Contribution: The emergence of platforms that automate and finely monitor the deployment of distributed software bricks makes it possible to build robustness test campaigns for these systems. In this paper, we define the notion of chaos tests, which aim to assess the robustness of such systems. We present a chaos testing framework that supports the definition of chaos tests through dedicated abstractions for exercising a system against various perturbations. Validation: We validate this framework by demonstrating how a set of bugs detected in production systems following a software brick update could have been detected using our approach.
