Operational resilience is a concept that has gained even further traction. It first came to prominence through financial regulators, in particular the Bank of England and then others, who define it as follows:
“Operational Resilience is the ability of firms and the financial system as a whole to absorb and adapt to shocks, rather than contribute to them”.
This concept, very much applicable to all sectors, initially met with some eye-rolling, with some people saying: “we already do this – it’s called business continuity planning”. That sentiment has subsided as people have realized the potential benefits of the subtle and not-so-subtle approaches the framework introduces, which help converge the management of an array of operational risks – not least cyber, IT and business process risk.
So, in simple terms, what is operational resilience and how does it differ from existing risk frameworks and approaches? In essence, it does four things:
Take a customer perspective. It takes a business-service-oriented view of resilience - as opposed to a business function (department/process) or IT-centric view. Some organizations already have this as part of their resilience program. However, some think they do but actually take a business function approach rather than a business service approach (from the customer's perspective). When you look closely you often find a business service is, naturally, made up of multiple business functions - and resilience planning for an end-to-end service can require trade-offs between those functions.
Examine the interplay of risks. It looks at the interplay of all operational risks across people, process and technology, and across risk domains like cybersecurity, physical disasters, capacity risk, software lifecycle risk and so on. This involves looking at their interplay in quantitative as well as qualitative terms - and being able to test trade-offs when operating in degraded states under adverse conditions.
Look at the extended enterprise. This means looking not just downstream into an organization's supply chain but also upstream into customers' processes and systems, to cover the full end-to-end business/digital interaction - from web and mobile to APIs and agents.
Consider more severe, but plausible, scenarios. Finally, and most significantly, it establishes impact tolerances. Most current resilience programs tend to focus on the ability of organizations to meet Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) for punishing but not necessarily extreme events, e.g. they contemplate a data center outage but not always a whole region of data centers, or a supplier going away for a day but not for weeks (or permanently). The more severe or extreme events are the real test of crisis response, beyond the more "routine" resiliency management approaches many organizations have. Impact tolerances will likely be in excess of more routine RTOs and require more significant effort to sustain business services in the face of these scenarios - in fact, you'll know you're on the right track if you are selecting scenarios that really push the envelope (a small illustrative sketch of the distinction follows below).
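As a minimal, hypothetical illustration of that distinction - the service names, scenarios, numbers and field names below are invented for this sketch, not drawn from any regulation or real program - an impact tolerance can be recorded alongside the routine RTO so the gap between the two is explicit:

```python
# Hypothetical record contrasting a routine recovery target with an impact
# tolerance for a severe-but-plausible scenario. Everything here is illustrative.
from dataclasses import dataclass

@dataclass
class ServiceResilienceTargets:
    service: str
    routine_scenario: str           # the kind of event existing plans already cover
    rto_hours: float                # routine Recovery Time Objective
    severe_scenario: str            # severe-but-plausible scenario being added
    impact_tolerance_hours: float   # maximum tolerable disruption to the business service

payments = ServiceResilienceTargets(
    service="Retail payments",
    routine_scenario="Loss of a single data center",
    rto_hours=4,
    severe_scenario="Loss of an entire region of data centers for days",
    impact_tolerance_hours=24,      # likely in excess of the routine RTO
)
print(payments)
```

The point is simply that the severe scenario and its tolerance are stated separately from the routine recovery target, rather than being assumed to be covered by it.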
When I first looked at this some time ago, I was tempted to think of it as essentially a wrapper around an existing business continuity/resilience program, but I now see it as transformative for three reasons:
Truly taking a customer's perspective on the resilience of your services can uncover some seams in your resilience planning that you might not expect - and it can yield some interesting risk mitigation approaches such as substitutability of services or functions.
Many organizations have tested RTOs against scenarios that don't go far enough into the tail of low-frequency but high-severity events that should also be prepared for. The shift to more severe events - with the flexibility to not simply adhere to an existing RTO - is a healthy step toward contemplating more worst-case scenarios and hence increasing robustness.
Organizations need to move beyond a binary view of perfect resilience vs. failure and embrace operating in degraded states during the process of recovery – this framework is an opportunity to enshrine that, in addition to further driving consideration of a wider array of risks in scenario selection.
I’ve seen this applied in many organizations, and indeed in the PCAST report I worked on for cyber-physical resilience we adapted many of the concepts into some more practical techniques, particularly for cybersecurity. Last month we also did a Google Cloud Security Podcast on this very topic, which has proved to be an accessible introduction. From this, Anton Chuvakin put together a great blog and paper applying it to modern cloud architectures.
If you want to distill all of this into actionable guidance - using operational resilience principles to really drive improvements in your organization - then take these steps:
1. Know Your Minimum Viable Delivery Objectives
Work through what your organization does, particularly the services that might be considered critical national infrastructure. Some organizations have business process catalogs; most don't. But you can fairly easily construct or revalidate these from existing business continuity plans.
Then, working with business units, executive leadership and the Board, come up with your Minimum Viable Delivery Objectives. These are the services you should aim to provide no matter what is thrown at you from cyber-attacks, disasters, errors and outages - both at small and wide scale. These are the things you want to fight through to either continue to deliver, maybe in a degraded state, or to recover within a specific time frame - no matter what extreme scenario you face.
The goal here is to get leadership to understand what must be focused on in extreme events to provide those minimum viable services. This is harder than it sounds when you look at the dependencies that exist across your organization. Keeping the minimum services running is not the same as keeping the whole organization running. For example, if you're a bank your minimum viable services might be payments, corporate cash management and bond market support. You can go without doing IPOs, issuing new mortgages or supporting M&A transactions for, say, weeks. But to really know you can do this in the face of extreme scenarios you might need to decouple dependencies across your organization or technology stack. For some organizations it might not be possible to assure continued operation in the face of extreme scenarios, but it might be plausible to almost guarantee the restoration of minimum services within some defined time frame. For example, imagine a water utility: you might commit to restoring the provision of clean drinking water to a covered area within at most 72 hours, even in extreme scenarios. Societally, we can then communicate the expectation, or provide for, people having emergency supplies of bottled water to serve basic needs for up to 72 hours.
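As a minimal sketch of what such a catalog might look like - the services, time frames and field names below are illustrative only, loosely echoing the bank and water utility examples above - a minimum viable delivery objective can be written down as structured data that leadership can review and challenge:

```python
# A minimal, hypothetical catalog of Minimum Viable Delivery Objectives (MVDOs).
# Service names and time frames loosely echo the bank and water utility
# examples above; they are not real commitments.
MVDO_CATALOG = [
    {
        "business_service": "Payments",
        "minimum_viable": "Process critical inbound/outbound payments, possibly delayed",
        "continue_or_restore": "continue",   # fight through, even in a degraded state
        "restore_within_hours": None,
    },
    {
        "business_service": "Corporate cash management",
        "minimum_viable": "Balance visibility and critical transfers only",
        "continue_or_restore": "continue",
        "restore_within_hours": None,
    },
    {
        "business_service": "Clean drinking water supply",
        "minimum_viable": "Potable water to the covered area",
        "continue_or_restore": "restore",
        "restore_within_hours": 72,          # bridged societally by emergency bottled water
    },
]

# Services deliberately left out (IPOs, new mortgages, M&A support) are the ones
# you accept going without for days or weeks in an extreme scenario - that
# omission is as much a leadership decision as what is included.
```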
But, like the old adage that planning is more useful than plans, the very act of shifting leadership's mindset - making the hard choices about what the minimum viable delivery objectives are, and contemplating more extreme but still plausible disaster scenarios that push the limits of recoverability - will drive more resilient outcomes. This is different from traditional business continuity planning processes, which often deliberately exclude such scenarios and focus on the recoverability of all services or, worse, prioritize recovery of what is most critical for the business vs. most critical for the customers.
Again, it's a subtle but crucial shift: instead of thinking about how to recover everything within some time frame, and then having to constrain the scenarios under which you can do this, consider what minimum services you have to maintain under all (or at least many more) scenarios, and then adjust your business to be able to achieve that - decoupling dependencies, operating in degraded states, improving recoverability and so on.
2. Do Some Stress Testing
Now that you know your minimum viable delivery objectives, and the more extreme scenarios under which you intend to keep delivering them, you need to do some stress testing. Stress testing is a common term in many industries - materials, construction, aviation and more. It has been especially adopted in the context of resilience in financial services, where many financial institutions are required by regulators to model how they would perform in various extreme scenarios, e.g. the stock market drops by 40%, interest rates rise by 10%, 30% of loans default - all at the same time. The idea is to make sure the financial markets survive, and that banks can continue to operate and absorb the losses and impact. It's not to make sure everything is fine - it won't be: losses will happen, some business will stop and many other consequences will be dire. But, even in an immensely degraded state, it makes sure banks and the system overall can still deliver their basic functions and not trigger a wide-scale economic collapse.
The stress tests are, by definition, not intended to show they can be passed with ease. Rather, they show where the breaking points are so those breaking points can be pushed further out or otherwise compensated for - to meet the goals of systemic resilience.
Adopting stress testing in your organization means contemplating scenarios you really hope won't happen and then seeing how bad it would be if they do. Some examples:
Internet Denial. Switch the Internet off to all or some of your systems and watch how many things break that really should still work. I know a lot of organizations that have done this and seen both hardware and software failures. This is often because there are “heartbeats” back to the vendor for license checks, diagnostics or other reasons, and if they fail they disable device functionality. The preferred option is to permit minimum viable operation without such connectivity for a reasonable period of time.
Cold Restarts. Do an actual cold restart of a set of systems. The problem with conventional backup and recovery is that success is not always determined by a full restore but rather by checking that a sufficient sample of what was backed up is readable. Organizations that only do this are often surprised in an extreme scenario (like a ransomware event) to find that a full restore (essentially a “reboot” of your company) is more difficult than expected. Some of the reasons for this include:
The backups don't contain all that you need to do a mass restore, e.g. if the index to the backups or the backup software itself is not in the backup then you're going to have issues. This is usually because the assumption behind backup and recovery is that you will be recovering a failed set of systems, not the whole environment.
The backups only contain your data and not all the software needed for recovery, and you don't have a reliable and timely process for software and infrastructure reproducibility from source (see next section).
There are circular dependencies that are unresolvable. Imagine you are totally wiped and, in recovery, you can't bring up your authentication infrastructure without DNS, but restoring DNS needs that same authentication infrastructure.
Finally, you might not have the capacity to do a wide-scale recovery in the time you want or need to. If your environment is wiped and you need to bring back peta- or exabytes of data, your backup systems won't scale or parallelize for that volume. I've seen a few organizations recovering from ransomware that were in good shape in principle, but it still took them weeks to come back simply due to the network and systems capacity available to re-instantiate everything from a single-threaded backup and restore system. A back-of-envelope calculation, like the sketch below, can surface this constraint long before an incident does.
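Here is a minimal sketch of that kind of capacity check - the data volumes, throughput figures and the efficiency fudge factor are all hypothetical; substitute your own measured numbers:

```python
# Back-of-envelope restore-time estimate - a rough sketch, not a real planning tool.
# All numbers below are hypothetical; use your own measured figures.

def restore_time_days(data_tb: float,
                      per_stream_mbps: float,
                      parallel_streams: int,
                      efficiency: float = 0.7) -> float:
    """Estimate wall-clock days to restore `data_tb` terabytes.

    per_stream_mbps: sustained throughput of one restore stream in megabytes/sec
    parallel_streams: streams the backup system and network can actually sustain
    efficiency: fudge factor for retries, verification, catalog lookups, etc.
    """
    total_mb = data_tb * 1_000_000                      # TB -> MB (decimal)
    effective_mbps = per_stream_mbps * parallel_streams * efficiency
    seconds = total_mb / effective_mbps
    return seconds / 86_400                             # seconds -> days

# Example: 2 PB of data at 200 MB/s per stream, single-threaded restore...
print(round(restore_time_days(2_000, 200, 1), 1), "days")    # ~165 days
# ...vs. the same data restored over 50 parallel streams
print(round(restore_time_days(2_000, 200, 50), 1), "days")   # ~3.3 days
```

Even rough numbers like these make the conversation about parallelism, network capacity and restore priorities concrete.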
3. Focus on Leading Indicators of Performance
Much of our cyber-physical resilience focus to date has been on managing lagging indicators of performance. This so-called outcome-based approach is necessary but not at all sufficient. What you really need, especially for resilience, is to focus on leading indicators. The logic is that if you get these leading indicators of performance in good shape, you can't help but realize sustained improvements in your lagging indicators. It's like a manufacturing plant: you absolutely need to measure the quality of what comes off the production line, but the real focus is on, for example, the quality of the raw materials, the preventive maintenance of the plant machinery, the expertise of the workforce and the extent to which the products are designed for manufacturability. Doing all of this well means that what rolls off the production line is almost guaranteed to be fine.
I've covered a number of the principles for how to construct these, and what they might be, in this post. The important ones for resilience objectives are below (a rough sketch of how such indicators might be tracked follows the list):
High Assurance Software Reproducibility. What percentage of your entire software estate is reproducible through a CI/CD pipeline (or equivalent)? From this comes so much in terms of security, reliability, resilience and recovery, as well as business agility.
Infrastructure Reproducibility. What percentage of your infrastructure (on-premises or cloud) is software-defined, follows an immutable infrastructure pattern, and has configuration code that adheres to the software reproducibility approach described above?
Time to Reboot the Company. Essentially the cold restart time we discussed above. Imagine everything you have is wiped by a destructive attack or other cause. All you have is bare metal in your own data centers or empty cloud instances and a bunch of backups (tape, optical, or other immutable storage). Then ask: how long does it take to rehydrate/rebuild your environment? The great thing about this metric is that it is feasible to put a cost to each reduction in risk, and it's usually quite clear where the diminishing returns are. For example, to go from 1 month to 1 week might be $X, 1 week to 1 day might be $10X, but 1 day to 1 minute might be $1000X. Many organizations I have seen that use a variant of this measure have gotten executive buy-in for a $10M+ investment to go from 1 week+ to sub-1-day for core functions. Many also use cloud and IT modernization approaches as a means of doing this and, of course, it becomes easier if you are really good at software and infrastructure reproducibility.
Systems Stagnancy. Some legacy systems can very well be a true “legacy” of the company and might even be well maintained, reliable, secure and supported despite being of an older technology generation. The real issue - and what I think we actually mean when we casually throw around the phrase “legacy systems” - is systems that are stagnant: in other words, systems that are not maintained or kept up to date.
Preventative Maintenance. One obvious root cause of many of the issues that prevent good resilience outcomes is insufficient budget or resources. This reduces a team's ability to undertake activities like maintenance, technical debt paydown, or other work that would in other realms be considered preventative maintenance. There's likely no single correct level for this, but it's reasonable for there to be an assigned budget amount, expressed as a percentage of the wider operating budget, that can go up or down. It's reasonable for management or the Board to dictate that this budget increases in response to prior failures (a signal that more maintenance is needed), or decreases once the positive effect of maintenance has been fully demonstrated (to avoid premature cutting). The key point is to have that budget eat into, or free up, the operating capital or expenditure of that business/department so there are aligned incentives.
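As a rough sketch of tracking these - the inventory fields, the one-year stagnancy threshold and the roll-up itself are all invented for illustration - each indicator reduces to something countable that can be trended over time:

```python
# Hypothetical leading-indicator roll-up - a sketch only; thresholds and
# inventory fields are invented for illustration.

def pct(numerator: int, denominator: int) -> float:
    return round(100.0 * numerator / denominator, 1) if denominator else 0.0

def leading_indicators(systems: list[dict]) -> dict:
    """`systems` is an inventory where each entry records, for one system:
    reproducible_build, infra_as_code, last_update_days, and restore_tested."""
    total = len(systems)
    return {
        "software_reproducibility_pct": pct(
            sum(s["reproducible_build"] for s in systems), total),
        "infrastructure_reproducibility_pct": pct(
            sum(s["infra_as_code"] for s in systems), total),
        "stagnant_systems_pct": pct(
            sum(s["last_update_days"] > 365 for s in systems), total),  # arbitrary 1-year threshold
        "restore_tested_pct": pct(
            sum(s["restore_tested"] for s in systems), total),
    }

# Example with a toy inventory of three systems
inventory = [
    {"reproducible_build": True,  "infra_as_code": True,  "last_update_days": 30,  "restore_tested": True},
    {"reproducible_build": False, "infra_as_code": True,  "last_update_days": 900, "restore_tested": False},
    {"reproducible_build": True,  "infra_as_code": False, "last_update_days": 120, "restore_tested": True},
]
print(leading_indicators(inventory))
# {'software_reproducibility_pct': 66.7, 'infrastructure_reproducibility_pct': 66.7,
#  'stagnant_systems_pct': 33.3, 'restore_tested_pct': 66.7}
```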
4. Use Degraded States for Resilience
It's important to understand how the services behind your minimum viable delivery objectives can operate in a degraded state. Figuring out how to continue to operate in extreme scenarios is easier if you build systems and business processes so that they can continue to operate in some basic way without all their dependencies or other surrounding services. Some services are designed, often inadvertently, to fail when even just one non-critical dependency fails. This is because it's harder to design for degraded operation, and such a goal needs to be asserted by leadership through the resilience program. It's easier to illustrate this by example (not an exhaustive list):
Temporarily Limited Functions. For example, you might want to make payments through your bank, and while you'd prefer to be able to look at your historical transactions, that is not strictly necessary if you just want to see your balance and make a payment. In some cases it might even be ok for the bank to authorize some limited payments without a balance service being available, accepting that some losses might occur but judging that this is more than made up for by customer goodwill.
Deferred Decisions. There might be situations where you continue to operate with some increased risk or incomplete services, in the spirit of permitting activity even in the face of dependency failures. For example, signing up for a credit card needs a credit check, but if that service is not available you can still complete the sign-up with a lower default limit and then, when the service is back, assign the right limit to that particular customer - or even fall back to a relatively simple homegrown credit scoring model (a minimal sketch of this pattern appears after this list).
Run to Fail. Certain services might go down, like an authentication or authorization service, which you would think should cause a complete stop of all dependent services. But in many cases those dependent services can be safely engineered to run until an authentication refresh is needed, or to some time limit, while caching permissions. I've seen this in some SaaS and cloud outages where an identity service had an outage but dependent services continued to run; you just couldn't add or change principals until the service came back.
Degraded State Portability. Another, broader example is the goal of ensuring multi-cloud operational portability for services. This can be done in various ways: designing for portability, using open source, open standards and containerised deployments, and then testing across cloud providers to make sure they're not deviating in some way that would impact portability. But in a lot of cases certain cloud providers have highly differentiated services that are unique to that provider. Let's say this is a highly performant database service that is so capable as to be desired by many customers. Despite the resilience of that cloud and the ability to deploy both regionally and zonally, customers might still want the assurance of a failover to another cloud for truly extreme scenarios. In this case, by definition, that is not possible because the service is proprietary. But it does become a possibility if you consider resilient operations in a degraded state. For example, let's say you are running an online ordering system that uses some uniquely performant cloud database service. You might still snapshot the data tables to another database at another provider that can't replicate the superior features of the primary service, but is good enough to support basic lookups to permit continued shipments and manufacturing, though not necessarily new ordering. You can continue business operations in a degraded state to meet commitments while your primary provider recovers.
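To make the deferred decisions example above concrete, here is a minimal sketch - the service, the limit, the queue and the function names are all hypothetical - in which the dependency failure is caught, a conservative default is applied, and the decision is queued for reconciliation once the dependency recovers:

```python
# Hypothetical "deferred decision" fallback for a credit card sign-up flow.
# Function names, limits and the queue are illustrative only.

PROVISIONAL_LIMIT = 500      # conservative default applied when the credit check is down
reconciliation_queue = []    # decisions to revisit once the dependency recovers

class CreditCheckUnavailable(Exception):
    pass

def check_credit(applicant_id: str) -> int:
    """Placeholder for the real credit scoring dependency."""
    raise CreditCheckUnavailable("credit scoring service unreachable")

def sign_up(applicant_id: str) -> dict:
    try:
        limit = check_credit(applicant_id)          # normal path
        provisional = False
    except CreditCheckUnavailable:
        limit = PROVISIONAL_LIMIT                   # degraded path: accept the customer anyway
        provisional = True
        reconciliation_queue.append(applicant_id)   # re-score when the service is back
    return {"applicant": applicant_id, "limit": limit, "provisional": provisional}

print(sign_up("cust-123"))
# {'applicant': 'cust-123', 'limit': 500, 'provisional': True}
```

The design choice is that accepting a bounded, known risk (a low provisional limit) is better for the business and the customer than refusing all sign-ups while the dependency is down.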
Now, all of these require some careful consideration of trade-offs. It might be that the right thing in a security- or safety-critical environment is to fail closed or fail safe, but even there it is likely that some compromises can be made in the name of resilience without sacrificing security or safety, even if only for a limited time.
Bottom line: operational resilience is more than just a fancy way of describing business continuity planning. It's about looking at more extreme but plausible scenarios, stress testing how you fare against them, and finding where the breaking points are. Then plan to continue to achieve your minimum viable delivery objectives - themselves perhaps in some degraded state - so you can keep running almost no matter what. In doing this, business units and executive leadership become more engaged in what their resilience actually is, more committed to managing leading vs. lagging indicators, and less likely to assume that a static business continuity plan is sufficient for more extreme scenarios.