
Ask any CIO whether they have a disaster recovery plan. The answer is almost always yes. Ask them when it was last tested under realistic conditions, with real pressure, real time constraints, and people who were not told in advance it was a drill. The answer changes.

According to Cockroach Labs’ State of Resilience 2025 report, only 20% of organizations describe themselves as fully prepared for outages, 39% of executives admit their outage handling is purely reactive with no formal protocols in place, and 71% do no failover testing at all. These are not small organizations with limited resources. These are enterprises with IT teams, governance frameworks, and documented recovery procedures, but those procedures are stored in shared drives that nobody opens until something fails.

The gap between having a plan and having a capability is not a documentation problem. It is a leadership problem.

Angle One: The Plan on Paper

The Uptime Institute’s 2025 Annual Outage Analysis found that human error caused a major outage for nearly 40% of organizations, with 85% of those incidents attributed to staff failing to follow procedures or flaws in existing processes. Not cyberattacks. Not hardware failures. People not following procedures they theoretically had.

This is what a plan without a capability looks like in practice. The procedure exists. The training happened once, perhaps two years ago. The team has turned over since then. Nobody ran a realistic simulation. When the incident arrived, the plan did not hold because the organization had never built the muscle memory to execute it under pressure.

In July 2024, a faulty CrowdStrike configuration update triggered blue screen failures on 8.5 million Windows systems simultaneously, hitting airports, hospitals, retail stores, and banks almost at once. The organizations that recovered fastest shared one characteristic: their teams had practiced. Not tabletop exercises where everyone agrees the plan makes sense. Actual simulations where systems were deliberately taken down and people had to respond without a script.

A resilience plan that has never been stress-tested is a document. That is not the same as being prepared.

Angle Two: The Capability Gap

Building genuine resilience capability requires four things most organizations treat as optional.

Proactive monitoring with defined response triggers. This means not relying on dashboards that only generate alerts. It means systems that escalate automatically to named humans when thresholds are breached, with response time expectations attached. Network downtime costs an average of $2 million per hour for large enterprises, and 90% of mid-sized and large organizations lose upwards of $300,000 per hour during an outage. At that cost, the time between detection and response is not a technical metric. It is a financial one.
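To make "response triggers" concrete, here is a minimal Python sketch of an escalation chain with named owners and response windows, plus the cost-of-delay arithmetic. Everything in it is an illustrative assumption rather than a reference to any monitoring product: the metric name, the 5% threshold, the role labels, and the response windows are hypothetical, and the default hourly cost simply reuses the $300,000 figure quoted above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EscalationStep:
    owner: str               # a named person in practice; role labels here are placeholders
    respond_within_min: int  # response-time expectation attached to this step


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float
    steps: Tuple[EscalationStep, ...]


def current_owner(trigger: Trigger, value: float,
                  minutes_unacknowledged: float) -> Optional[str]:
    """Return who should hold the alert right now, or None below the threshold."""
    if value < trigger.threshold:
        return None
    deadline = 0
    for step in trigger.steps:
        deadline += step.respond_within_min
        if minutes_unacknowledged < deadline:
            return step.owner
    # Every response window has expired: the alert stays with the last owner.
    return trigger.steps[-1].owner


def cost_of_delay(minutes: float, cost_per_hour: float = 300_000) -> float:
    """Translate a detection-to-response gap into money (assumed hourly cost)."""
    return minutes * cost_per_hour / 60


# Hypothetical trigger: checkout error rate above 5% starts the chain.
checkout_errors = Trigger(
    metric="checkout_error_rate_pct",
    threshold=5.0,
    steps=(
        EscalationStep("on-call engineer", 5),
        EscalationStep("infrastructure lead", 10),
        EscalationStep("CIO", 15),
    ),
)
```

With these assumed numbers, an alert left unacknowledged for 12 minutes has already moved from the on-call engineer to the infrastructure lead, and a 20-minute gap at $300,000 per hour represents $100,000. The point of the sketch is that both the owner and the deadline are written down before the incident, not negotiated during it.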

Practiced response, not documented response. Simulations that catch people off guard. Scenarios that combine a technical failure with a communication breakdown and a key person being unavailable. The stress test reveals gaps that no document review ever will.

Pre-authorized decision rights. When a critical system fails at midnight, someone needs the authority to make consequential decisions immediately. That authority needs to be defined before the incident, understood by everyone in the chain, and tested in the simulations. Ambiguity under pressure is expensive. I covered the broader question of decision authority in Which Decisions Should You Never Delegate to AI, but the same principle applies to any crisis scenario: unclear ownership produces delayed response.

Business impact awareness at every level. Every infrastructure leader and engineer should know specifically what their systems support commercially. Which processes stop. Which revenue streams are affected. Which regulatory obligations are at risk. This is what transforms IT from a technical function into a business-critical one, and it is also what makes every operational decision more deliberate.

Angle Three: The Leadership Accountability

In January 2025, Russian-aligned hackers launched coordinated DDoS attacks against Swiss cantonal banks, municipal websites, and energy suppliers, taking several institutions offline simultaneously. The organizations that handled it well had already made a leadership decision: resilience was not an IT project. It was an executive accountability. (Source: Statista)

When the board asks what happened after a serious incident, the first question is rarely technical. It is, “Did we know this was a risk, and what did we do about it before it happened?” That question lands on the CIO. The answer needs to be better than pointing to a document.

23% of companies never test their disaster recovery plans, and only 27% of IT leaders believe their DR strategy is fully adequate. In a room of ten CIOs, roughly seven are leading organizations that are not confident in their resilience posture. Most of them know it. Very few are treating it as the leadership priority it deserves to be.

The CIOs who close this gap are not the ones who commission better documentation. They are the ones who schedule the uncomfortable simulation, assign named owners to every recovery scenario, and report resilience metrics to the board the same way they report project delivery or budget performance.
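As a rough illustration of what reporting resilience metrics to the board could look like, the Python sketch below summarizes a register of recovery scenarios into three lists: scenarios with no named owner, scenarios never drilled or not drilled recently, and scenarios whose last drill missed the recovery target. The field names, the sample scenarios, and the 180-day drill interval are all assumptions chosen for the example, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class RecoveryScenario:
    name: str
    owner: Optional[str]               # named owner, or None if nobody is assigned
    last_drill: Optional[date]         # date of the last realistic simulation
    target_recovery_min: int           # agreed recovery time target
    drill_recovery_min: Optional[int]  # recovery time achieved in the last drill


def board_summary(scenarios: List[RecoveryScenario], today: date,
                  max_drill_age_days: int = 180) -> Dict[str, object]:
    """Condense the scenario register into figures a board can track over time."""
    unowned = [s.name for s in scenarios if not s.owner]
    stale = [
        s.name for s in scenarios
        if s.last_drill is None
        or (today - s.last_drill).days > max_drill_age_days
    ]
    missed = [
        s.name for s in scenarios
        if s.drill_recovery_min is not None
        and s.drill_recovery_min > s.target_recovery_min
    ]
    return {
        "scenarios": len(scenarios),
        "unowned": unowned,
        "stale_or_untested": stale,
        "missed_target": missed,
    }


# Hypothetical register entries, for illustration only.
register = [
    RecoveryScenario("payment gateway outage", "A. Example", date(2025, 3, 1), 60, 45),
    RecoveryScenario("identity provider failure", None, None, 30, None),
    RecoveryScenario("primary datacenter loss", "B. Example", date(2024, 9, 1), 120, 240),
]

summary = board_summary(register, today=date(2025, 6, 1))
```

In this made-up register, one scenario has no owner, two are untested or overdue for a drill, and one missed its recovery target. Three short lists like these tell a board more about actual capability than the existence of a recovery document does.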

Resilience does not announce itself as urgent until something fails. By then, the conversation has moved from prevention to damage control. The time to act is now, while the plan still has a chance to become a capability.


If the themes here connect to challenges you are navigating in your organization, I explore the broader picture of how technology is reshaping operational risk, leadership responsibility, and business continuity in Life in the Digital Bubble. More perspectives on IT leadership and digital transformation are across the Insights section of this site.