System down

Are we more or less tolerant of system downtime? Peter Groucutt, Managing Director at Databarracks, provides some analysis

It's hard to know whether downtime has suddenly become more common or whether we simply hear about it more. In business continuity and risk management we talk about likelihood and impact, the two critical components to consider.

When you create a Risk Register, you assign a score to each perceived risk, and this defines what needs to be addressed first: it sets priorities. However, we need to understand whether outages are generally becoming more likely, or whether it is an increase in their impact that is pushing IT outages into the headlines. I think it is probably a combination of both.
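The likelihood-and-impact scoring behind a Risk Register can be sketched in a few lines. The risks and 1-5 scores below are purely illustrative, not drawn from the article or any real register:

```python
# Minimal sketch of a Risk Register scored by likelihood x impact.
# The entries and 1-5 scores are illustrative placeholders.
risks = [
    {"risk": "Hardware failure", "likelihood": 4, "impact": 3},
    {"risk": "Human error",      "likelihood": 4, "impact": 2},
    {"risk": "Ransomware",       "likelihood": 3, "impact": 5},
]

for r in risks:
    r["score"] = r["likelihood"] * r["impact"]

# Highest score first: this ordering sets remediation priorities.
for r in sorted(risks, key=lambda r: r["score"], reverse=True):
    print(f'{r["risk"]}: {r["score"]}')
```

Note how a low-likelihood, high-impact risk such as ransomware can still top the list, which is exactly the prioritising effect a register is meant to have.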

The impact of IT downtime has increased as businesses have progressed through their digital transformation initiatives. In the past, if IT systems went offline, staff could do other work. Now IT is so central to operations that there are fewer manual processes for them to revert to.

IT downtime is also more visible, and in this respect, the expectation of customers has changed dramatically. The outages themselves haven't necessarily changed but the perception certainly has. Even a brief outage now draws complaints, criticism and social media attention.

For over ten years we have carried out an annual survey to track the causes of data loss and IT downtime. Hardware failure and human error top the list each year as the most common causes of downtime, but there is no trend to suggest this is either improving or getting worse. The one factor that has shown an increase as a cause of both data loss and downtime is the cyber threat.

The response to non-cyber issues is simply to recover using your most recent backup. It's rarely as easy for cyber issues though, and ransomware for instance may be present in your backups. Recovering from ransomware therefore requires you to carry out several historical recoveries to find the most recent clean backup. Other cyber-attacks might force you to take systems offline to prevent the spread of malware, minimise exposure and prevent further hacks. In each case these extra steps take time and extend that difficult period of IT downtime.
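The "several historical recoveries" step amounts to walking restore points from newest to oldest until one passes a malware check. A minimal sketch, in which `is_infected` stands in for whatever scan or integrity check your backup tooling actually provides:

```python
# Sketch: walk backups from newest to oldest to find the most recent
# clean restore point. is_infected() is a hypothetical stand-in for a
# real malware scan or integrity check on each restore point.
from datetime import date

def find_clean_backup(backups, is_infected):
    """backups: list of restore points, newest first."""
    for backup in backups:
        if not is_infected(backup):
            return backup
    return None  # no clean restore point found: escalate

# Illustrative data: the two newest backups already contain the malware.
backups = [date(2019, 6, 3), date(2019, 6, 2), date(2019, 6, 1)]
infected = {date(2019, 6, 3), date(2019, 6, 2)}
clean = find_clean_backup(backups, lambda b: b in infected)
```

The practical implication is retention depth: if ransomware sat dormant for longer than your retention window, every restore point in the loop fails the check.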

Review your response strategy: outright prevention of a cyber-attack is impossible, so the focus should be on how the organisation acts if compromised. Your Incident Response Team (IRT) must have the authority to make large-scale operational decisions to take systems offline to limit the spread of infection. Once the threat has been isolated and contained, you must establish when the initial attack occurred to be able to restore using clean data.

Crisis communications plan: Good recovery isn't just about technology; it's also about managing the situation. For example, Norsk Hydro recently suffered a ransomware attack and yet provided a masterclass in crisis communications with regular, honest and transparent updates. We can learn a lot from this.

Assess cloud risks: We should recognise that our IT is actually far more resilient than it was 10 or 15 years ago, when entire IT estates were hosted from a single server room or data centre. Most organisations now operate a hybrid cloud to some extent. This could include AWS and Azure or might just be a few tactical SaaS applications; either way, it reduces risk, because a single incident is unlikely to affect such disparate IT.

However, disparate IT makes protecting data more difficult, residing as it does in multiple locations. Protecting cloud data needs a different approach, and it's best not to rely entirely on built-in data protection options. Make your own backups of cloud data and store them separately from the production environment, either in another cloud or in your own data centre.
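The "store them separately" step can be as simple as replicating backup exports to a second, independent location. A minimal sketch using local directories as stand-ins; the paths and the `.bak` naming are hypothetical, and in practice the destination would be another cloud provider or your own data centre:

```python
# Sketch: keep an independent copy of cloud/SaaS backup exports outside
# the production environment. Paths and the .bak extension are
# hypothetical placeholders for real backup tooling.
import shutil
from pathlib import Path

def replicate_backups(source_dir, offsite_dir):
    """Copy every .bak file from source_dir into offsite_dir."""
    src, dst = Path(source_dir), Path(offsite_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.glob("*.bak"):
        shutil.copy2(f, dst / f.name)  # copy2 preserves timestamps
        copied.append(f.name)
    return copied
```

The design point is independence: the second copy must not share credentials or infrastructure with the production environment, otherwise the same incident can destroy both.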