The End of “Keep the Lights On” IT Operations

For years, IT operations carried a simple promise: keep systems running, fix incidents fast, control costs, and stay invisible when everything works.

I believe that promise is no longer enough. It does not fit the operating model now expected of a modern IT organization.

In today’s business, every major process depends on digital services. Sales, production, customer service, logistics, finance, compliance, and decision-making all run through platforms, networks, data flows, cloud services, and APIs. When operations fail, the business does not simply face an IT problem. It faces lost revenue, dissatisfied customers, regulatory noncompliance, and reputational damage.

Recent outages confirm this. In June 2025, Google Cloud reported a major incident that affected many of its services, including IAM and BigQuery. The incident lasted about 8 hours and was later linked to problems serving API requests, with broad regional and product impact.

Another example: in October 2025, AWS released a summary of an outage in US-EAST-1. The root cause was a flaw in the automated DNS management system. It produced an incorrect empty DNS record for the regional DynamoDB endpoint, which caused errors in the DynamoDB APIs and in turn affected a large set of major AWS services.

Microsoft Azure also suffered a major outage on October 29, 2025. Microsoft attributed the issue to a configuration change in its Azure infrastructure and acknowledged problems with Azure Front Door.

Cloudflare followed in November 2025 with an outage caused by a faulty configuration file for bot management. Cloudflare’s own postmortem shows how quickly a configuration issue can affect dependent services at internet scale. The company stopped propagation of new configuration files, deployed a corrected file globally, and reported full service restoration later that day.

These incidents are not a reason to blame cloud providers. They are a reminder of reality: modern operations are highly connected, highly automated, and highly dependent on shared digital infrastructure.

Operations Must Move from Availability to Resilience

The old operating model was built around infrastructure availability. The new model must be built around business resilience.

A server can be up while the customer journey is broken. A dashboard can be green while payments fail. A cloud region can recover while backlogs, delayed transactions, broken integrations, and customer complaints continue for hours.

This is why “keep the lights on” is becoming the wrong ambition.

The new ambition is to make operations proactive, data-driven, automated, and business-aligned.

“Proactive” means we do not wait for users to report problems. We monitor service health, transaction flow, user experience, dependency behavior, change risk, and early-warning signals.

“Data-driven” means we connect telemetry, logs, traces, events, incidents, changes, capacity, cost, security, and business impact. Observability is no longer a technical luxury. It is the control system of the digital enterprise. New Relic’s 2025 observability report says many high-impact outages now cost $2 million per hour, while 75% of businesses report positive ROI from observability and 52% are actively consolidating tools.

“Automated” means we remove repetitive manual work from operations. This does not mean blindly handing control to machines. It means using automation for detection, enrichment, correlation, routing, rollback, remediation, scaling, and recovery where the risk is understood.

“Business-aligned” means IT operations must speak in business terms. The executive question is no longer “Are the systems available?” It is, “Can the business continue to serve customers, protect revenue, meet obligations, and recover fast when something goes wrong?”

AI Will Help, But Only If the Foundation Is Ready

The trend is already visible. PagerDuty’s 2025 State of Digital Operations report found that 64% of respondents expected IT operations budgets to increase in 2025, 53% of CIOs and CTOs viewed agentic AI as core to future IT operations, and 88% saw agentic AI as core or peripheral to future IT operations.

But AI will not fix weak operations by itself.

Splunk’s 2025 State of Observability report found that 78% of respondents say AI has helped them spend more time on innovation than maintenance, but 48% report low data quality as the main barrier to AI readiness.

That point matters. AIOps depends on clean, connected, trusted operational data. If alerts are noisy, CMDB data is outdated, service ownership is unclear, and incident records are poor, AI will only accelerate confusion.

Many organizations still have too many tools and too little context. Grafana’s 2025 observability survey found that companies use an average of eight observability technologies, while respondents cited 101 different observability technologies currently in use. It also found that 39% saw complexity and overhead as their biggest observability obstacles.

The answer is not to buy another dashboard. The answer is to build an operating model.

That model needs clear service ownership, reliable telemetry, defined service-level objectives, automated runbooks, change risk controls, dependency mapping, incident learning, resilience testing, and a direct link between technical events and business impact.
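To make "defined service-level objectives" concrete, here is a minimal Python sketch of the arithmetic behind an availability target. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not figures from any report cited here. The idea is that an objective implies a fixed allowance of downtime, often called an error budget, that a service can consume before the objective is missed.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows in the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining_minutes(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after the downtime already recorded in the window.

    A negative result means the objective has been missed.
    """
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # Hypothetical service: 99.9% target, 30 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining_minutes(0.999, 30.0), 1))
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime. A single incident of the length described earlier would consume that allowance many times over, which is why the objective has to be tied to the business service and not only to a server.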

Regulation is moving in the same direction. The EU’s Digital Operational Resilience Act has applied since 17 January 2025 and requires financial entities to address ICT risk management, incident management and reporting, resilience testing, third-party ICT risk, and information sharing.

This is a useful signal even outside financial services. Resilience is becoming a board-level discipline.

The Lessons for Executive Leaders

The lesson for executives is simple: operations are part of business strategy. Underfunded, reactive IT creates business risk. Ask your leadership team which services generate revenue, which services protect trust, and how fast the organization can recover when they fail.

The lesson for CIOs is clear: move operations from cost center to value system. Build a service reliability view that connects IT health to business outcomes. Reduce tool sprawl. Invest in automation, platform engineering, and operational resilience.

The practical step is to stop measuring success only through tickets closed and infrastructure uptime. Measure recurring incidents, change failure rate, mean time to detect, mean time to restore, automation coverage, backlog recovery time, and business service impact.
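Three of these measures can be computed from data most teams already hold. The Python sketch below is one possible way to do it, under stated assumptions: the Incident record and its field names are hypothetical, mean time to restore is measured from the start of the fault (some teams measure from detection), and each change-caused incident is counted as one failed change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime        # when the fault began
    detected: datetime       # when monitoring or a user flagged it
    restored: datetime       # when the business service was usable again
    caused_by_change: bool   # True if a change triggered the incident


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations, expressed in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def operations_metrics(incidents: list[Incident], total_changes: int) -> dict:
    """Compute MTTD, MTTR, and change failure rate for one reporting period."""
    if not incidents or total_changes <= 0:
        raise ValueError("need at least one incident and one change")
    # Assumption: one change-caused incident equals one failed change.
    failed_changes = sum(1 for i in incidents if i.caused_by_change)
    return {
        "mttd_minutes": mean_minutes([i.detected - i.started for i in incidents]),
        "mttr_minutes": mean_minutes([i.restored - i.started for i in incidents]),
        "change_failure_rate": failed_changes / total_changes,
    }


if __name__ == "__main__":
    # Invented sample data for illustration only.
    t0 = datetime(2025, 1, 6, 9, 0)
    sample = [
        Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=70), True),
        Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=50), False),
    ]
    print(operations_metrics(sample, total_changes=40))
```

With the invented sample data, the sketch reports a mean time to detect of 15 minutes, a mean time to restore of 60 minutes, and a change failure rate of 2.5%. The value is less in the numbers themselves than in agreeing on the definitions, so that the same figures mean the same thing to operations and to the business.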

The lesson for everyone responsible for infrastructure is direct: the foundation matters more than ever. Cloud, network, identity, storage, backup, monitoring, endpoint, data center, and platform services are no longer background functions. They are the operating base of the digital business.

The future of IT operations is not about larger war rooms and more heroic firefighting.

It is about fewer surprises, faster diagnosis, safer automation, stronger recovery, and clearer business ownership.

“Keep the lights on” was a useful phrase for a simpler time.

Today, the real job is different.

Keep the business running.

Keep trust intact.

Keep learning before the next incident forces the lesson.

If this argument resonates with you, the question is no longer whether IT operations belong on the executive agenda.

They do.

The real question is whether your organization is ready to move beyond reactive firefighting and build an operating model that connects reliability, resilience, automation, and business outcomes.

This is one of the reasons I wrote Life in the Digital Bubble. The book looks at how AI and digital systems will reshape technology, work, families, and society in the years ahead.

For organizations facing these shifts today, my work in digital transformation and AI consulting focuses on one practical goal: helping leaders turn fragmented technology efforts into clear operating models that create measurable business value.

If this is a conversation your organization needs to have, I would be glad to connect.
