
What happened with Cloudflare yesterday?

What happened?

Timeline
  • The outage occurred on 18 November 2025, early in the day (UTC) and affected a large portion of the internet.
  • Cloudflare’s own status updates indicate the root issue was in a configuration file (automatically generated for bot-mitigation / threat traffic) that became larger than expected, triggering a software crash in a core traffic-handling subsystem.
  • Specifically, the company said: “In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.”
  • Cloudflare also said they observed “a spike in unusual traffic” around 11:20 UTC that caused error rates to rise.
  • The company deployed a fix, and by early afternoon UTC most services were restored.

What it affected

  • Because Cloudflare fronts a very large share of websites and apps (≈ 20% of global web traffic), the outage had widespread ripple effects.
  • Affected platforms included:
    • ChatGPT (OpenAI)
    • X (formerly Twitter)
    • Spotify, Claude, Canva and others.
  • Public services were hit too: transit websites (like NJ Transit) and government agencies reported degraded or unavailable service.
  • Users experienced:
    • “500 Internal Server Error” messages.
    • Disrupted authentication, site-access issues, delayed load times.
  • The outage underscores how dependent the internet has become on a few infrastructure providers: one service failure cascades broadly.
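For client applications caught in an outage like this, the standard mitigation is to retry transient 5xx failures with backoff and, if the primary (CDN-fronted) endpoint stays down, fall back to a secondary origin. A minimal sketch of that pattern; the URLs, retry count, and function name are illustrative assumptions, not anything from Cloudflare's report:

```python
import time
import urllib.request
import urllib.error

def fetch_with_fallback(primary_url, fallback_url, retries=3):
    """Fetch primary_url, retrying 5xx and network errors with exponential
    backoff; if all retries fail, try fallback_url once. (Hypothetical
    helper for illustration -- not part of any real client library.)"""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(primary_url, timeout=5) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code < 500:
                raise  # 4xx: retrying will not help
        except urllib.error.URLError:
            pass  # network-level failure; worth retrying
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... backoff
    # Primary exhausted; fall back to the secondary origin.
    with urllib.request.urlopen(fallback_url, timeout=5) as resp:
        return resp.read()
```

In practice the fallback would be a different provider or a direct-to-origin path, which is exactly the kind of redundancy many affected sites lacked.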

Preliminary RCA (Root Cause Analysis)

  • Configuration file growth: An automatically generated configuration file (for bot mitigation / threat-traffic handling) grew beyond its expected size. That triggered a crash in the software subsystem that applies that configuration.
  • Cascade effect: The crash in that subsystem created broader service degradation rather than a localized fault, affecting traffic handling across multiple Cloudflare services.
  • Not a malicious attack: Cloudflare has stated there is no evidence that the outage was caused by an external attack or malicious activity.
  • Triggering event: A spike in unusual traffic, observed around 11:20 UTC, may have stressed the system and exposed the latent bug.
  • Fix deployed: Cloudflare rolled out a change that resolved the issue, and services began recovering.
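The failure mode above — an auto-generated config growing past what its consumer can handle — suggests an obvious guard: validate hard limits before applying the file, and reject it rather than crash. A minimal sketch under assumed limits; the constants, the `rules` key, and the function name are hypothetical, not Cloudflare's actual implementation:

```python
import json
import os

MAX_CONFIG_BYTES = 1_000_000  # assumed hard ceiling on file size
MAX_ENTRIES = 10_000          # assumed cap on generated rule count

def load_generated_config(path):
    """Refuse an auto-generated config that exceeds hard limits, so the
    caller can keep its last-known-good version instead of crashing."""
    size = os.path.getsize(path)
    if size > MAX_CONFIG_BYTES:
        raise ValueError(f"config is {size} bytes, over the {MAX_CONFIG_BYTES}-byte limit")
    with open(path) as f:
        config = json.load(f)
    if len(config.get("rules", [])) > MAX_ENTRIES:
        raise ValueError("generated config has too many rules")
    return config
```

The key design point is that a rejected config is a recoverable event (keep serving with the previous version and alert), whereas an unvalidated oversized config reached the crash path.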

Key takeaways & lessons

  • Single point of failure risk: Even with distributed infrastructure, a bug in one subsystem at a major infrastructure provider can cascade broadly — many sites were impacted simply because they used Cloudflare.
  • Importance of limit/size controls: The configuration file grew “beyond expected” size. Systems that automatically generate configs should enforce strong limits.
  • Traffic spikes + latent bugs = danger: The unusual traffic spike exposed a hidden bug. Systems must be stress-tested for unusual loads, not just typical ones.
  • Transparent communication: Cloudflare’s early acknowledgement and updates helped clarify the incident — good practice for critical infrastructure providers.
  • Resilience planning: Customers relying on third-party infrastructure should have failover or backup plans if a provider goes down.
  • Monitoring & alerting: Adequate monitoring of error rates, internal config growth, and abnormal traffic flows can help detect issues before they ripple out.
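The monitoring point above can be made concrete with a rolling error-rate check: track the share of 5xx responses over a recent window and flag when it crosses a threshold. A minimal sketch; the window size, threshold, and class name are assumptions for illustration:

```python
from collections import deque

class ErrorRateMonitor:
    """Rolling window over the last N responses; flags when the share
    of 5xx errors exceeds a threshold. Numbers are illustrative."""

    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)  # True = server error
        self.threshold = threshold

    def record(self, status_code):
        self.window.append(status_code >= 500)

    def alarming(self):
        if not self.window:
            return False
        return sum(self.window) / len(self.window) > self.threshold
```

A real deployment would feed this from access logs or metrics pipelines and page an operator when `alarming()` flips, well before users start reporting widespread 500 errors.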
