Incident History
A log of AI service incidents, outages, and degraded performance events detected by TensorFeed monitoring.
Outages happen. They're embarrassing, costly, and entirely predictable. Infrastructure fails. Load balancers get misconfigured. Deployments break things. Databases run out of disk space. Not a single major AI provider is immune. By studying incident patterns, we can anticipate when failures are likely and design our systems to tolerate them.
This database captures every incident we've detected in the TensorFeed monitoring network: when it started, how long it lasted, its severity (full outage or partial degradation), and which provider was affected. The data reveals that outages cluster. Claude API might be flaky for a week, then stable for two months. OpenAI's API has experienced multiple major incidents, each lasting 30 to 90 minutes. Hugging Face and Replicate have historically lower reliability than the major commercial providers. Our monthly AI service outage report synthesizes this into actionable insights.
What should you learn from this? First, avoid single points of failure. Distribute your traffic across multiple providers if feasible: Claude and GPT-4 rarely go down at the same time, so the two combined are more reliable than either alone. Second, implement exponential backoff and retry logic in your client code. Third, cache successful responses and degrade gracefully when APIs are down. Finally, monitor your own dependencies. The earlier you know an API is degraded, the earlier you can mitigate customer impact.
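The retry and failover advice above can be sketched in a few lines of Python. This is a minimal, provider-agnostic illustration, not code from any particular SDK: the function names (`call_with_retry`, `call_with_failover`) and the provider list are hypothetical, and a production client would catch specific exception types (timeouts, 429s, 5xx) rather than bare `Exception`.

```python
import random
import time

def call_with_retry(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Delay doubles each attempt; random jitter spreads out
            # retries so clients don't hammer a recovering API in sync.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            sleep(delay)

def call_with_failover(providers, **retry_kwargs):
    """Try each (name, fn) provider in order; return the first success."""
    last_error = None
    for name, fn in providers:
        try:
            return name, call_with_retry(fn, **retry_kwargs)
        except Exception as exc:
            last_error = exc  # this provider exhausted its retries; move on
    raise RuntimeError("all providers failed") from last_error
```

A caller would wrap each provider's request function and pass them in preference order, e.g. `call_with_failover([("claude", claude_request), ("gpt4", gpt4_request)])`. Layering caching of successful responses on top of this gives the graceful-degradation path described above.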