How We Build AI That Actually Works: The Agency Stack
Beyond chatbots. How Shahriar Labs orchestrates multi-agent systems to solve complex engineering problems autonomously.
Production AI agents need durable workflows, tool sandboxing, memory, and model fallback — not just a prompt loop with tool calls. Most demo-quality agents fail in production because they lack retries, have no fallback when models hallucinate, and get stuck when tools return unexpected results. This is the reference architecture Shahriar Labs uses for production agent systems.
1. Workflow engine: Use Temporal.io (or Prefect for lighter workloads) to make agent steps durable. Run every tool call as a Temporal activity, so it is retried on transient failure and its result is recorded in the workflow history on success. An interrupted agent can then resume from its last completed step without re-executing the tool calls that already succeeded.
2. Sandboxed tool use: Never give agents unrestricted shell access. Define tools as typed functions with explicit input/output schemas. Log every tool invocation. For code execution, use Firecracker MicroVMs or E2B's sandboxed environments.
3. Multi-tier memory: In-context (current task), session (Redis, TTL-based), and long-term (vector DB + structured store). Retrieval cost and staleness tolerance differ per tier — design retrieval to match the access pattern.
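The retry-and-checkpoint behaviour described in step 1 can be sketched without any workflow engine. The block below is a minimal, standard-library illustration, not Temporal's API: `run_step`, `TransientError`, and the in-memory `checkpoints` dict are hypothetical names, and a real deployment would rely on the engine's own persistence and retry policies.

```python
import json
import time
from typing import Any, Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def run_step(
    checkpoints: Dict[str, str],
    step_id: str,
    tool: Callable[..., Any],
    max_attempts: int = 3,
    backoff_s: float = 0.0,
    **tool_args: Any,
) -> Any:
    """Run one tool call: skip it if already checkpointed, retry it if transient."""
    if step_id in checkpoints:
        # Completed in an earlier run: return the recorded result, do not re-execute.
        return json.loads(checkpoints[step_id])

    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**tool_args)
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted, let the caller decide what to do
            time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff
        else:
            # Checkpoint on success. Results must be JSON-serializable here;
            # a real system would write to durable storage, not a dict.
            checkpoints[step_id] = json.dumps(result)
            return result

    raise ValueError("max_attempts must be at least 1")


# Usage: the first call fails once and is retried; the second call is served
# from the checkpoint, so the tool is not invoked again.
calls = {"count": 0}


def flaky_search(query: str) -> dict:
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("simulated timeout")
    return {"query": query, "hits": 2}


store: Dict[str, str] = {}
first = run_step(store, "search-1", flaky_search, query="agent memory")
second = run_step(store, "search-1", flaky_search, query="agent memory")
```

Keying checkpoints by a stable step ID is what lets a restarted agent skip finished work; the ID scheme itself is an assumption of this sketch.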
No single model is always available or always correct. Production agents need: (1) a primary model for complex reasoning (Claude Opus or GPT-4o), (2) a fast secondary for high-volume steps (Claude Sonnet or GPT-4o-mini), and (3) a free fallback for non-critical paths via openrouter-free. Route by task complexity and cost budget, not just availability.
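One way to express this routing is sketched below. It is an illustration under stated assumptions: the tier labels, complexity scores (1 to 10), and cost units are made up for the example, `call_model` is a hypothetical callable standing in for whatever client you use, and `ModelUnavailable` is a hypothetical exception type.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ModelUnavailable(Exception):
    """Hypothetical error raised by the client when a model cannot serve a request."""


@dataclass(frozen=True)
class ModelTier:
    name: str             # placeholder label, not a real model ID
    max_complexity: int   # highest task complexity (1-10) this tier should handle
    cost_per_call: float  # illustrative cost units
    critical_ok: bool     # whether the tier may serve critical paths


# Ordered cheapest first; all numbers are illustrative.
TIERS = [
    ModelTier("free-fallback", 3, 0.0, critical_ok=False),
    ModelTier("fast-secondary", 6, 1.0, critical_ok=True),
    ModelTier("primary", 10, 10.0, critical_ok=True),
]


def route(complexity: int, budget: float, critical: bool) -> List[ModelTier]:
    """Return the tiers to try, in order, for one task."""
    allowed = [
        t for t in TIERS
        if t.cost_per_call <= budget and (t.critical_ok or not critical)
    ]
    # Prefer the cheapest tier that is capable enough for the task...
    capable = [t for t in allowed if t.max_complexity >= complexity]
    # ...then fall back to weaker tiers, strongest first.
    weaker = [t for t in reversed(allowed) if t.max_complexity < complexity]
    return capable + weaker


def call_with_fallback(
    prompt: str,
    tiers: List[ModelTier],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each tier in order; return (tier name, response) from the first success."""
    errors = []
    for tier in tiers:
        try:
            return tier.name, call_model(tier.name, prompt)
        except ModelUnavailable as exc:
            errors.append(f"{tier.name}: {exc}")
    raise RuntimeError("all model tiers failed: " + "; ".join(errors))
```

Separating `route` (which tiers are eligible) from `call_with_fallback` (what happens when one is down) keeps the cost policy testable without any network calls.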
Add an output validation layer between the model and tool execution. Structured output parsing (Pydantic models, JSON Schema) catches malformed tool calls before they hit your infrastructure.
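The validation layer can be as small as a function that refuses anything not matching a known tool schema. The sketch below is a dependency-free stand-in for what a Pydantic model or JSON Schema validator would do; the tool names, the `{"tool": ..., "args": ...}` message shape, and `InvalidToolCall` are assumptions made for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "read_file": {"path": str},
    "search_docs": {"query": str, "limit": int},
}


class InvalidToolCall(ValueError):
    """Raised when model output is not a well-formed call to a known tool."""


def parse_tool_call(raw: str) -> Dict[str, Any]:
    """Validate raw model output before anything is executed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidToolCall(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise InvalidToolCall("tool call must be a JSON object")

    name = data.get("tool")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        raise InvalidToolCall(f"unknown tool: {name!r}")

    args = data.get("args")
    if not isinstance(args, dict):
        raise InvalidToolCall("'args' must be a JSON object")

    schema = TOOL_SCHEMAS[name]
    missing = sorted(schema.keys() - args.keys())
    unexpected = sorted(args.keys() - schema.keys())
    if missing or unexpected:
        raise InvalidToolCall(f"missing={missing} unexpected={unexpected}")

    for key, expected in schema.items():
        value = args[key]
        # bool is a subclass of int in Python; no field in this sketch accepts
        # booleans, so reject them explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise InvalidToolCall(f"'{key}' must be {expected.__name__}")

    return {"tool": name, "args": args}
```

Rejecting unexpected arguments, not only missing ones, is a deliberate choice here: it stops a hallucinated parameter from being silently dropped or passed through to the tool.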
Define failure thresholds explicitly. At Shahriar Labs, the rule is: if an agent fails the same subtask 3 times, it sends a structured alert (Slack webhook, PagerDuty) with full context — what it was trying to do, what failed, and the last 5 tool invocations. Humans resolve and optionally re-queue. This avoids the silent failure mode where agents loop forever on a stuck task.
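The three-strikes rule above could be tracked with a small helper like the following. This is a sketch, not the Shahriar Labs implementation: `EscalationTracker`, the payload field names, and the `send_alert` callable (which would wrap a Slack webhook or PagerDuty client) are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Any, Callable, Deque, Dict

MAX_FAILURES = 3   # failures on the same subtask before a human is alerted
CONTEXT_CALLS = 5  # number of most recent tool invocations included in the alert


class EscalationTracker:
    """Counts failures per subtask and escalates once the threshold is reached."""

    def __init__(self, send_alert: Callable[[Dict[str, Any]], None]) -> None:
        self._send_alert = send_alert
        self._failures: Dict[str, int] = defaultdict(int)
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self._recent_calls: Deque[Dict[str, Any]] = deque(maxlen=CONTEXT_CALLS)

    def record_tool_call(self, tool: str, args: Dict[str, Any], outcome: str) -> None:
        self._recent_calls.append({"tool": tool, "args": args, "outcome": outcome})

    def record_failure(self, subtask_id: str, goal: str, error: str) -> bool:
        """Register one failure. Returns True if the subtask was escalated."""
        self._failures[subtask_id] += 1
        if self._failures[subtask_id] < MAX_FAILURES:
            return False  # below the threshold, the caller may retry

        self._send_alert({
            "subtask_id": subtask_id,
            "goal": goal,                      # what the agent was trying to do
            "error": error,                    # what failed
            "attempts": self._failures[subtask_id],
            "last_tool_calls": list(self._recent_calls),
        })
        # Reset so a human re-queue starts with a fresh failure count.
        del self._failures[subtask_id]
        return True
```

The caller is expected to stop retrying as soon as `record_failure` returns `True`, which is what prevents the agent from looping on a stuck task.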
For multi-agent systems, see our post on RAG, knowledge graphs, and agent memory. For the workflow layer, see how we use Temporal for GenAI pipelines.
Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. Building LetX, QuantumSketch, and open-source AI agent skills.
In 2026, AI agents handle planning, coding, testing, and deployment under human direction — shifting developers from implementers to architects and reviewers.