The outage began on 18 November 2025, around 11:20 UTC, and affected a large portion of the internet.
Cloudflare’s own status updates indicate the root issue was a configuration file (automatically generated for bot-mitigation / threat-traffic handling) that grew larger than expected, triggering a software crash in a core traffic-handling subsystem.
Specifically, the company said: “In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.”
Cloudflare also said they observed “a spike in unusual traffic” around 11:20 UTC that caused error rates to rise.
The company issued a fix, and by early afternoon UTC most services were restored.
What it affected
Because Cloudflare provides services for a very large portion of websites and apps (roughly 20% of global web traffic), the outage had widespread ripple effects.
Affected platforms included:
ChatGPT (OpenAI)
X (formerly Twitter)
Spotify, Claude, Canva, and others.
Public services were also hit: transport websites (such as NJ Transit) and government agencies reported degraded or unavailable service.
The outage underscores how dependent the internet has become on a few infrastructure providers: one service failure cascades broadly.
Preliminary RCA (Root Cause Analysis)
Configuration file growth: An automatically generated configuration file (for bot mitigation / threat-traffic handling) grew beyond its expected size. That triggered a crash in the software subsystem that applies that configuration.
Cascade effect: The crash in that subsystem created broader service degradation rather than a localized fault, affecting traffic handling across multiple Cloudflare services.
Not a malicious attack: Cloudflare has stated there is no evidence that the outage was caused by an external attack or malicious activity.
Triggering event: A spike in unusual traffic around 11:20 UTC may have stressed the system and exposed the latent bug.
Fix deployed: A change was implemented which resolved the issue, and normal service began recovering.
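The config-growth failure mode above suggests a simple guardrail: validate an auto-generated configuration before publishing it, rather than letting downstream consumers crash on an oversized file. Below is a minimal sketch of that idea; the size and entry-count limits, the `publish_config` helper, and the JSON format are all illustrative assumptions, not Cloudflare's actual pipeline.

```python
import json
import tempfile
from pathlib import Path

# Illustrative caps -- real values would come from what consumers can safely handle.
MAX_CONFIG_BYTES = 1_000_000
MAX_ENTRIES = 10_000

class ConfigTooLargeError(Exception):
    """Raised when a generated config exceeds its size budget."""

def publish_config(entries: list, dest: Path) -> None:
    """Serialize entries to JSON and write atomically, enforcing size caps."""
    if len(entries) > MAX_ENTRIES:
        raise ConfigTooLargeError(f"{len(entries)} entries exceeds cap of {MAX_ENTRIES}")
    payload = json.dumps(entries).encode()
    if len(payload) > MAX_CONFIG_BYTES:
        raise ConfigTooLargeError(f"{len(payload)} bytes exceeds cap of {MAX_CONFIG_BYTES}")
    # Write to a temp file in the same directory, then rename atomically,
    # so consumers never observe a partially written config.
    with tempfile.NamedTemporaryFile(dir=dest.parent, delete=False) as tmp:
        tmp.write(payload)
    Path(tmp.name).replace(dest)
```

The key design choice is failing loudly at generation time: an oversized config is rejected at the source instead of being shipped to every machine that consumes it.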
Key takeaways & lessons
Single point of failure risk: Even with distributed infrastructure, a bug in one subsystem at a major infrastructure provider can cascade broadly — many sites were impacted simply because they used Cloudflare.
Importance of limit/size controls: The configuration file grew beyond its expected size. Systems that automatically generate configs should enforce hard limits and validate output before deployment.
Traffic spikes + latent bugs = danger: The unusual traffic spike exposed a hidden bug. Systems must be stress-tested for unusual loads, not just typical ones.
Transparent communication: Cloudflare’s early acknowledgement and updates helped clarify the incident — good practice for critical infrastructure providers.
Resilience planning: Customers relying on third-party infrastructure should have failover or backup plans in case a provider goes down.
Monitoring & alerting: Adequate monitoring of error rates, internal config growth, and abnormal traffic flows can help detect issues before they ripple out.
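The resilience point above can be sketched in a few lines: try a primary endpoint and fall back to a secondary when it fails. This is a minimal illustration using only the standard library; the `fetch_with_failover` function and the idea of an ordered endpoint list are assumptions for the example, not a specific vendor's API.

```python
import urllib.request
import urllib.error

def fetch_with_failover(urls, timeout: float = 3.0) -> bytes:
    """Try each URL in order; return the first successful response body."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise RuntimeError(f"all endpoints failed: {last_error}")
```

In practice the fallback would be a mirror on a different provider (or a cached copy), so that an outage at one infrastructure vendor degrades the experience rather than taking the service fully offline.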
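One way to act on the monitoring point is a simple anomaly check: alert when the latest observation (config-file size, error rate, traffic volume) jumps well above its recent baseline. The sketch below is a toy moving-average detector; the window size and threshold multiplier are illustrative assumptions, and a production system would use proper time-series alerting.

```python
from collections import deque

class GrowthMonitor:
    """Flags samples that exceed a multiple of the recent average."""

    def __init__(self, window: int = 10, multiplier: float = 2.0):
        self.history = deque(maxlen=window)  # rolling window of recent samples
        self.multiplier = multiplier

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it should trigger an alert."""
        alert = False
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            alert = value > baseline * self.multiplier
        self.history.append(value)
        return alert
```

For example, a config file that normally hovers near 100 KB would trip the alert on a sudden jump to 500 KB, giving operators a chance to intervene before the oversized file propagates.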