What happened with Cloudflare yesterday?

What happened?

- The outage occurred on 18 November 2025, early in the day (UTC) and affected a large portion of the internet.
- Cloudflare’s own status updates indicate the root issue was in a configuration file (automatically generated for bot-mitigation / threat traffic) that became larger than expected, triggering a software crash in a core traffic-handling subsystem.
- Specifically, the company said: “In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.”
- Cloudflare also said they observed “a spike in unusual traffic” around 11:20 UTC that caused error rates to rise.
- Cloudflare deployed a fix, and most services had been restored by mid-afternoon UTC.

What it affected
- Because Cloudflare provides services for a very large portion of websites and apps (≈ 20% of global web traffic) the outage had widespread ripple effects.
- Affected platforms included:
  - ChatGPT (OpenAI)
  - X (formerly Twitter)
  - Spotify, Claude, Canva, and others
- Even public services were hit: transport websites (e.g., NJ Transit) and government agencies reported degraded or unavailable service.
- Users experienced:
  - “500 Internal Server Error” messages
  - failed authentication, blocked site access, and slow page loads
- The outage underscores how dependent the internet has become on a few infrastructure providers: one service failure cascades broadly.

Preliminary RCA (Root Cause Analysis)
- Configuration file growth: An automatically generated configuration file (for bot mitigation / threat-traffic handling) grew beyond its expected size. That triggered a crash in the software subsystem that applies that configuration.
- Cascade effect: The crash in that subsystem created broader service degradation rather than a localized fault, affecting traffic handling across multiple Cloudflare services.
- Not a malicious attack: Cloudflare has stated there is no evidence that the outage was caused by an external attack or malicious activity.
- Triggering event: a spike in unusual traffic around 11:20 UTC may have stressed the system and exposed the latent bug.
- Fix deployed: A change was implemented which resolved the issue, and normal service began recovering.
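
The failure mode described above — an automatically generated config file growing past its expected size and crashing the subsystem that consumes it — can be defended against with a validation step that rejects a bad config and falls back to the last-known-good one instead of crashing. A minimal sketch (all file names, limits, and field names here are illustrative assumptions, not Cloudflare's actual code):

```python
import json

# Illustrative limits; real systems would tune these to their workload.
MAX_CONFIG_BYTES = 1_000_000   # hard cap on raw file size
MAX_FEATURES = 200             # cap on number of entries

def load_config(path: str, fallback: dict) -> dict:
    """Apply a newly generated config only if it passes sanity checks;
    otherwise keep the last-known-good config instead of crashing."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
        if len(raw) > MAX_CONFIG_BYTES:
            raise ValueError(f"config too large: {len(raw)} bytes")
        config = json.loads(raw)
        if len(config.get("features", [])) > MAX_FEATURES:
            raise ValueError("too many feature entries")
        return config
    except (OSError, ValueError, json.JSONDecodeError) as exc:
        # Fail safe: log and keep serving with the previous config.
        print(f"rejecting new config ({exc}); keeping last-known-good")
        return fallback
```

The key design choice is failing closed on the *config update*, not the *traffic path*: an oversized file degrades to stale bot-mitigation rules rather than taking down the proxy.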

Key takeaways & lessons
- Single point of failure risk: Even with distributed infrastructure, a bug in one subsystem at a major infrastructure provider can cascade broadly — many sites were impacted simply because they used Cloudflare.
- Importance of limit/size controls: the configuration file grew beyond its expected size. Systems that automatically generate configs should enforce strict size and entry-count limits and validate output before it is deployed.
- Traffic spikes + latent bugs = danger: The unusual traffic spike exposed a hidden bug. Systems must be stress-tested for unusual loads, not just typical ones.
- Transparent communication: Cloudflare’s early acknowledgement and updates helped clarify the incident — good practice for critical infrastructure providers.
- Resilience planning: Customers relying on third-party infrastructure should have failover or backup plans if a provider goes down.
- Monitoring & alerting: Adequate monitoring of error rates, internal config growth, and abnormal traffic flows can help detect issues before they ripple out.
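
The monitoring point above can be made concrete with a sliding-window error-rate check that fires an alert when errors exceed a threshold. A minimal sketch (the class name, window length, and 5% threshold are illustrative assumptions):

```python
import time
from collections import deque
from typing import Optional

class ErrorRateMonitor:
    """Track request outcomes over a sliding time window and flag when
    the error rate crosses a threshold (values are illustrative)."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.events: deque = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, is_error))
        # Evict events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def error_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(1 for _, err in self.events if err) / len(self.events)

    def should_alert(self) -> bool:
        return self.error_rate() >= self.threshold
```

In practice a provider would feed this from edge request logs and page on-call when `should_alert()` trips, catching an error-rate spike like the one on 18 November within the window length rather than after user reports.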
