The importance of resilience


Data communication networks are serving all kinds of human activities. Whether used for professional or leisure purposes, for safety-critical applications or e-commerce, the Internet in particular has become an integral part of our everyday lives, affecting the way societies operate. However, the Internet was not intended to serve all these roles and, as such, is vulnerable to a wide range of challenges. Malicious attacks, software and hardware faults, human mistakes (e.g., software and hardware misconfigurations), and large-scale natural disasters threaten its normal operation.

Resilience, the ability of a network to defend against such challenges and to maintain an acceptable level of service in their presence, is viewed today, more than ever before, as a major requirement and design objective. These concerns are reflected in, among other initiatives, the Cyber Storm III exercise, carried out in the United States in September 2010, and the "cyber stress tests" conducted in Europe by the European Network and Information Security Agency (ENISA) in November 2010; both aimed precisely at assessing the resilience of the Internet, this "critical infrastructure used by citizens, governments, and businesses".

Resilience evidently cuts across several thematic areas, such as information and network security, fault tolerance, software dependability, and network survivability. A significant body of research has been carried out around these themes, typically focusing on specific mechanisms for resilience and subsets of the challenge space. However, a shortcoming of existing research and deployed systems is the lack of a systematic view of the resilience problem, i.e., a view of how to engineer networks that are resilient to challenges that transcend those considered by a single thematic area. A non-systematic approach to understanding resilience targets and challenges, e.g., one that does not span thematic areas, leads to an impoverished view of resilience objectives, potentially resulting in ill-suited solutions.

The EU-funded ResumeNet project argues for resilience as a critical and integral property of networks. It advances the state of the art by adopting a systematic approach to resilience, which takes into account the wide variety of challenges that may occur. At the core of our approach is a coherent resilience framework, which includes implementation guidelines, processes, and toolsets that can be used to underpin the design of resilience mechanisms at various levels in the network. Central to the framework is a control loop, which defines the conceptual components necessary to ensure network resilience. The other elements (a risk assessment process, metrics definitions, policy-based network management, and information sensing mechanisms) emerge from the control loop as necessary to realise our systematic approach.

Within the project, our framework has been evaluated using a number of future Internet case studies, including wireless mesh networks, opportunistic networks, a publish-subscribe platform that supports an Internet of Things application, and a Voice over IP application. In addition to findings that relate to specific aspects of the case studies and network resilience, reflections on the framework include the difficulty of detecting challenges in opportunistic networks because of their disconnected and stochastic nature, and the observation that certain forms of resilience mechanism, e.g., those that discourage selfishness, are difficult to model with respect to the project's overall resilience strategy. Despite these difficulties, our experiments indicate that there are significant benefits to be had from addressing network resilience using our systematic resilience framework.

©Copyright by ResumeNet