Zero-Downtime LLMs: Building High-Availability AI

Building an AI wrapper is easy. Keeping it online is hard. High availability requires treating LLM APIs as unreliable dependencies.

The "Always-Up" Architecture

At Shahriar Labs, we engineered freelm with an obsessive focus on fault tolerance:

Circuit Breakers: Every API key has its own breaker. A 5xx response or a timeout opens the breaker so we stop hammering a dead key. After a cooldown it half-opens and lets trial requests through again.
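Interleaved Failover: We try the best model from every provider before falling back to the second-best model of any provider. This keeps failover fast: when one provider is down, the next attempt goes to another provider's top model instead of walking down the failing provider's list.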
Token Bucket Pacing: We track requests-per-minute locally so the router can hold a request back before the provider would answer with a 429.
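
To make the first two ideas concrete, here is a minimal Python sketch. It is not freelm's actual code: the names CircuitBreaker and interleaved_order, the 30-second cooldown, and the provider and model names are all assumptions chosen for illustration. The breaker opens on a recorded failure and reports half-open once the cooldown has passed; the ordering function walks every provider's best model before anyone's second-best.

```python
import time
from itertools import zip_longest


class CircuitBreaker:
    """Hypothetical per-key breaker: closed -> open on failure -> half-open after cooldown."""

    def __init__(self, cooldown=30.0, clock=time.monotonic):
        self.cooldown = cooldown   # seconds to stay open (assumed default)
        self.clock = clock         # injectable so the logic can be tested
        self.opened_at = None      # None means the breaker is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"     # cooldown elapsed: trial requests may pass
        return "open"

    def allow(self):
        # Requests are blocked only while the breaker is fully open.
        return self.state() != "open"

    def record_failure(self):
        # Call this on a 5xx response or a timeout; it (re)starts the cooldown.
        self.opened_at = self.clock()

    def record_success(self):
        # A successful request closes the breaker again.
        self.opened_at = None


def interleaved_order(ranked):
    """Yield (provider, model) tier by tier: every provider's best model,
    then every provider's second-best, and so on.

    `ranked` maps a provider name to its models, best first.
    """
    providers = list(ranked)
    for tier in zip_longest(*ranked.values()):
        for provider, model in zip(providers, tier):
            if model is not None:  # this provider has no model at this tier
                yield provider, model


if __name__ == "__main__":
    # Placeholder provider and model names, purely illustrative.
    ranked = {"provider_a": ["a-large", "a-small"], "provider_b": ["b-large"]}
    for provider, model in interleaved_order(ranked):
        print(provider, model)
```

A real router would combine the two by skipping any candidate whose key's breaker returns False from allow(). The clock is passed in as a parameter so the cooldown logic can be checked without sleeping.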

Because reliability is the whole point, freelm exposes live state so you can see exactly why the router picked a specific path:

# llm is an already-constructed freelm client
for row in llm.health():
    print(row)  # ready status, breaker state, requests used, and latency

Q: Is AsyncFreeLLM thread-safe?
A: Yes, in the sense that matters for asyncio: one instance can be shared safely across many concurrent tasks on one event loop.

Q: What if every key is rate-limited?
A: Enable wait=True and the call sleeps briefly until a key frees up. max_wait caps how long it is allowed to wait.
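
The wait and max_wait behavior can be pictured with a small token bucket for a single key. This is a sketch under assumptions, not freelm's implementation: the TokenBucket class, its acquire method, the 5-second default, and the synchronous sleep are all hypothetical, and only the names wait and max_wait come from the answer above. An async client would await asyncio.sleep instead of blocking.

```python
import time


class TokenBucket:
    """Hypothetical local pacer: `rate_per_minute` requests per minute,
    with bursts of up to `capacity` requests. `rate_per_minute` must be > 0."""

    def __init__(self, rate_per_minute, capacity=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(capacity if capacity is not None else rate_per_minute)
        self.tokens = self.capacity         # start with a full bucket
        self.clock = clock                  # injectable for testing
        self.sleep = sleep
        self.updated = clock()

    def _refill(self):
        # Add the tokens earned since the last update, capped at capacity.
        now = self.clock()
        earned = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now

    def acquire(self, wait=False, max_wait=5.0):
        """Take one token. Return True if the request may be sent now,
        False if the caller should fail over or give up."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        needed = (1.0 - self.tokens) / self.rate  # seconds until one token exists
        if not wait or needed > max_wait:
            return False                          # refuse rather than wait too long
        self.sleep(needed)
        self._refill()
        self.tokens = max(0.0, self.tokens - 1.0)
        return True
```

Returning False when the required wait exceeds max_wait keeps the delay bounded: the caller learns immediately that it should try another key or surface an error, instead of sleeping for an unknown time.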

Stop apologizing for API outages. Use freelm to build resilient, enterprise-grade AI architecture.