As the IT industry has become more virtualised, with the ongoing migration to the cloud, the emergence of the Industrial Internet of Things and the rise of connectivity, the network has become more complex and difficult to manage. As more people work from home or connect remotely while off-site, it has also become more dispersed. Taken together, these developments make it more important that the network is kept up and running, but also more likely that there will be outages.
Businesses are adding layers of complexity to networks, and that can bring vulnerabilities. Today, a raft of factors can cause network or system outages – from ISP carrier issues to fibre cuts to simple human error. Added to this, network devices are becoming ever more complex. As software stacks require more frequent updates, they become more exposed to bugs, exploits and cyber-attacks, all of which leads to more outages.
For all these reasons, we are seeing a growing focus on the concept of network resilience – but what exactly do we mean by this, why does it matter and how can it best be achieved? Network resilience is the ability to withstand and recover from a disruption of service.[1] One way of measuring it is how quickly the business can get up and running again at normal capacity following an outage.
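As a rough illustration of that measure, the sketch below averages the time between losing service and returning to normal capacity. The outage timestamps and the function name `mean_time_to_recover` are hypothetical, chosen only for the example; a real figure would come from an organisation's own incident records.

```python
from datetime import datetime, timedelta

# Hypothetical outage log: (service lost, normal capacity restored).
outages = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 45)),
    (datetime(2024, 3, 2, 22, 15), datetime(2024, 3, 3, 0, 15)),
]


def mean_time_to_recover(records):
    """Average time between losing service and returning to normal capacity."""
    if not records:
        return timedelta(0)
    total = sum((restored - lost for lost, restored in records), timedelta(0))
    return total / len(records)


print(mean_time_to_recover(outages))
```

A shorter average suggests the business is getting back to normal capacity faster, although a single number of this kind says nothing about how often outages occur.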
Network resilience is unfortunately often confused with redundancy. Organisations sometimes think that if they put two boxes in the core or the edge rather than one, they have solved their problem. In reality, they are just moving it somewhere else. A redundant system duplicates some network elements so that if one path fails, another can be used. Redundancy removes a single point of failure, whereas resilience considers the full ecosystem from core to edge.
Yet, despite this, many organisations still neglect to consider resilience when designing and building their networks. Unless they have just experienced an outage, they may not appreciate the importance of resilience or assign sufficient resources to it. Moreover, few businesses have the necessary in-house expertise to design a resilient network from the outset.
Out-of-band (OOB) management, a separate path to network equipment that does not depend on the primary network, is a good example. In the network design phase at least, it is likely always to be a small part of a much larger project. There is a process of education needed here, because organisations that include resilience in their network from the outset save time and money compared with those that have to implement it reactively after an outage.
The fact is that many organisations today struggle to identify and remediate reliability or resilience issues quickly. Take a large organisation with a Network Operations Centre (NOC). It has lots of branches and offices, often on different continents, with the attendant time zone issues that this typically brings. Often, it is trying to do more with less, so it may have fewer technical staff based at these remote sites. As a result, it may struggle to get visibility that an outage has even occurred, because nobody is proactively notified if something goes offline. Even when staff are aware, it may be difficult to understand which piece of equipment at which specific location has a problem if nobody is on site to physically look.
True network resilience is not just about providing resilience to a single piece of equipment, whether that be a router or a core switch. In a global economy, it is important that any such solution can plug into all of the equipment at a data centre or edge site, map it, and establish what is online and offline at any given time, wherever in the world it is located.
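The idea of mapping equipment and establishing what is offline at each location can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the `Device` record, the `offline_by_site` function and the TCP-based `tcp_probe` check are all assumptions made for the example, and a real deployment would more likely probe over the out-of-band path.

```python
import socket
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Device:
    """Hypothetical inventory record for one piece of equipment."""
    name: str
    site: str
    address: str


def tcp_probe(device: Device, port: int = 22, timeout: float = 2.0) -> bool:
    """One possible reachability check: can a TCP connection be opened?"""
    try:
        with socket.create_connection((device.address, port), timeout=timeout):
            return True
    except OSError:
        return False


def offline_by_site(
    devices: Iterable[Device],
    is_reachable: Callable[[Device], bool] = tcp_probe,
) -> Dict[str, List[str]]:
    """Group the names of unreachable devices by the site they belong to."""
    offline: Dict[str, List[str]] = defaultdict(list)
    for device in devices:
        if not is_reachable(device):
            offline[device.site].append(device.name)
    return dict(offline)
```

Passing the probe in as a parameter keeps the grouping logic independent of how reachability is actually tested, so the same report could be driven by ping, SNMP or a console-server check.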
That visibility enables a system reboot to be carried out quickly and remotely. If that does not work, it might well be that an issue with a software update is the root of the problem. With the latest smart out-of-band devices this can be readily addressed, because an image of the core equipment and its configuration, whether it be a switch or a router, can be retained, and the device can be quickly rebuilt remotely without the need to send somebody on site. In the event of an outage, it is also possible to deliver network resilience via failover to cellular while the original fault is being remotely addressed, enabling the business to keep running even while the primary network is down.
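The sequence described above (fail over to cellular, try a remote reboot, then rebuild from a retained image) can be expressed as a simple escalation routine. This is a sketch under stated assumptions: the four callables are hypothetical hooks into whatever out-of-band tooling is in use, and the final "site visit" step is an assumed last resort that the text does not describe.

```python
from typing import Callable, List


def remediate(
    is_healthy: Callable[[], bool],
    enable_cellular_failover: Callable[[], None],
    reboot: Callable[[], None],
    restore_saved_image: Callable[[], None],
) -> List[str]:
    """Run the escalation steps for one faulty device and return the actions taken."""
    actions: List[str] = []
    if is_healthy():
        return actions  # nothing to do

    # Keep the business connected while the fault is investigated.
    enable_cellular_failover()
    actions.append("failover_to_cellular")

    # First attempt: a remote reboot.
    reboot()
    actions.append("reboot")
    if is_healthy():
        return actions

    # Second attempt: rebuild from the retained image and configuration.
    restore_saved_image()
    actions.append("restore_saved_image")
    if not is_healthy():
        # Assumed fallback, not described in the text above.
        actions.append("escalate_to_site_visit")
    return actions
```

Returning the list of actions gives the NOC a record of how far the escalation had to go for each incident.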
Building in resilience through the OOB approach does cost money, of course, but it also pays for itself over the long term. You might only use it a couple of times a year, say – but when you need it, you really need it. Anyone who has just suffered a network outage will understand the benefits of OOB as a way of keeping their business running in what is effectively an emergency but, as noted above, it is likely to be much better to plan for resilience from the word go. After all, networks are the fundamental ‘backbone’ to the success of almost every organisation today, and many businesses will benefit from bringing network resilience into the heart of their approach from the very outset.
1. Ray A. Rothrock, Digital Resilience (AMACOM, 2018)