AI Infrastructure
AI Vibe Coding CI/CD Engine
A fully autonomous CI/CD orchestration system that manages AI coding agents building production software. It coordinates multiple agents working in parallel — handling work decomposition, isolation, monitoring, failure recovery, and deployment.
The Problem
AI agents are powerful but unreliable
Anyone who has used an LLM coding agent for more than a trivial task knows the failure modes. Building production software with AI requires solving all of these simultaneously.
Context drift
As the context window fills, agents lose track of instructions given earlier. Rules stated clearly in the system prompt get deprioritized as the conversation grows.
Spinning
When an approach fails, agents try the same approach again — or oscillate between two failing approaches — consuming their entire context window.
Coordination failure
Two agents editing the same file, starting servers on the same port, running tests that interfere with each other. Without isolation, parallelism is impossible.
Silent failures
Code that passes TypeScript and unit tests can still produce a blank page. Agents can't see the app they're building.
Abandoned state
When agents crash, they leave behind running containers, orphaned worktrees, and half-finished code that blocks future work.
The Insight
Treat AI agents like unreliable distributed workers in a fault-tolerant system — the same way you'd design for unreliable network nodes or crash-prone processes.
Give them atomic units of work small enough to complete within their reliable context window. Monitor them with heartbeat checks. When they fail, clean up and retry with a fresh agent. Never let two agents edit the same file. Gate every phase with runtime verification. The build engine is, in essence, an operating system for AI workers.
How It Works
The Bead Abstraction
The atomic unit of work is a bead — typically 15-30 minutes of coding work. Each bead specifies exactly which files to touch, what acceptance criteria to meet, and what other beads must complete first. Beads are grouped into epics and organized with dependency chains.
An agent's reliability degrades as context fills. Small beads keep agents in their most reliable window. If one fails, the cost is minimal — stall it, clean up, let a fresh agent retry. Fresh context is cheaper than spinning.
PSO-01:   Add packing list data model   ┐
PSO-02:   Build packing list UI         ┤ Phase 1
PSO-03:   Add shopping integration      ┤ (parallel)
PSO-04:   Wire up Firestore persistence ┘
PSO-GT:   Smoke gate — runtime verification
PSO-05:   Add sharing between travelers ┐
PSO-06:   Build suggestions engine      ┤ Phase 2
PSO-07:   Weather-based recommendations ┘
PSO-GT2:  Smoke gate — verify Phase 2
PSO-PUSH: Deploy to dev
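A bead can be pictured as a small record plus a readiness rule. This is a sketch only — the field names and the `readyBeads` helper are illustrative assumptions, not the engine's actual schema:

```typescript
// Illustrative bead shape; field names are assumptions, not the engine's schema.
type BeadStatus = "queued" | "claimed" | "done" | "stalled";

interface Bead {
  id: string;           // e.g. "PSO-02"
  epic: string;         // e.g. "PSO"
  files: string[];      // exactly which files the agent may touch
  acceptance: string[]; // criteria that must pass before completion
  dependsOn: string[];  // bead ids that must complete first
  status: BeadStatus;
  attempts: number;
}

// A bead is ready when every dependency is done and no in-flight bead
// has claimed an overlapping file — this is what makes parallelism safe.
function readyBeads(beads: Bead[]): Bead[] {
  const done = new Set(beads.filter(b => b.status === "done").map(b => b.id));
  const claimedFiles = new Set(
    beads.filter(b => b.status === "claimed").flatMap(b => b.files)
  );
  return beads.filter(
    b =>
      b.status === "queued" &&
      b.dependsOn.every(d => done.has(d)) &&
      b.files.every(f => !claimedFiles.has(f))
  );
}
```

Two queued beads may both look ready here; the file-conflict race between them is resolved at acquire time, inside the transaction.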
Transactional State Management
All state lives in Firestore, accessed exclusively through a REST API. No agent ever writes to Firestore directly. This is the most important architectural decision in the system.
Early iterations let agents access state directly. They corrupted it. They skipped steps. One agent decided incrementing the attempt counter was unnecessary. Another marked its own bead complete without running tests.
The API boundary creates a hard wall. It validates every state transition, enforces invariants, detects races, and logs every action. The critical acquire operation runs as a Firestore transaction that atomically checks engine status, validates dependencies, detects file conflicts, and claims both bead and slot — or rolls back everything.
A tripwire circuit breaker monitors for excessive claim races. Three races in 60 seconds auto-pauses the engine — a signal that too many agents are competing for too few beads.
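The tripwire reduces to a sliding-window counter. The 3-in-60-seconds threshold comes from the text; the class and method names are assumptions:

```typescript
// Tripwire sketch: signal an auto-pause after 3 claim races
// within a 60-second window. Names are illustrative.
class Tripwire {
  private races: number[] = []; // timestamps (ms) of recent claim races

  constructor(
    private readonly limit = 3,
    private readonly windowMs = 60_000
  ) {}

  // Record one race; returns true when the engine should auto-pause.
  recordRace(nowMs: number): boolean {
    this.races = this.races.filter(t => nowMs - t < this.windowMs);
    this.races.push(nowMs);
    return this.races.length >= this.limit;
  }
}
```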
The Heartbeat Control Loop
Every 4 minutes and 45 seconds, each agent sends a health report and receives instructions back. This bidirectional communication keeps agents on track as context drifts.
Agent → Engine (uplink)
Context window remaining, files changed, TypeScript status, commit count, current stage.
Engine → Agent (downlink)
Warnings — informational. Injected rules — behavioral commands the agent must obey. Standing reminders — critical rules repeated on every heartbeat because agents forget.
Each cycle runs a battery of health checks: context exhaustion, scope drift, uncommitted work, spinning detection, file churn, TypeScript regression, stagnation, and build duration limits.
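The uplink and downlink can be sketched as message shapes, with one health check wired in as an example. Field names, thresholds, and the rule text are all assumptions:

```typescript
// Illustrative heartbeat message shapes; field names and thresholds
// are assumptions, not the engine's real protocol.
interface Uplink {
  beadId: string;
  contextRemainingPct: number; // how much context window is left
  filesChanged: number;
  typescriptOk: boolean;
  commits: number;
  stage: string;
}

interface Downlink {
  warnings: string[];          // informational
  injectedRules: string[];     // behavioral commands the agent must obey
  standingReminders: string[]; // repeated on every heartbeat
}

// One check as a sketch: low remaining context forces a wrap-up rule,
// and a TypeScript regression blocks further feature work.
function evaluate(up: Uplink): Downlink {
  const down: Downlink = {
    warnings: [],
    injectedRules: [],
    standingReminders: [
      "Commit before any large refactor",
      "Touch only your bead's files",
    ],
  };
  if (up.contextRemainingPct < 20) {
    down.injectedRules.push("Stop new work: commit, summarize, hand off");
  } else if (up.contextRemainingPct < 40) {
    down.warnings.push("Context filling: prefer small committed steps");
  }
  if (!up.typescriptOk) {
    down.injectedRules.push("Fix the TypeScript regression before continuing");
  }
  return down;
}
```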
Self-Healing Loops
Four interlocking feedback loops handle failures automatically:
Bead Retry
Failed bead → stall → re-enters queue → fresh agent retries. Most failures are context-related — a different agent with clean context often succeeds.
Smoke → Hotfix → Re-Smoke
Runtime verification fails → hotfix beads auto-created at highest priority → smoke gate re-runs. Catches blank pages, broken routes, invisible elements.
P0 Defect Gate
Critical defect logged → all feature work halted → only hotfixes proceed → defect resolved → work resumes. A global circuit breaker.
Stall Detection
No heartbeat for 30 minutes → monitor stalls bead → slot released → worktree cleaned → bead re-enters queue.
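The stall-detection loop can be sketched as a periodic sweep. The 30-minute threshold comes from the text; the record shape and function names are assumptions:

```typescript
// Stall-detection sweep sketch: any claimed bead silent for 30 minutes
// is stalled, its slot released, and the bead re-queued for a fresh agent.
// Names are illustrative.
interface Claim {
  beadId: string;
  slot: number;
  lastHeartbeatMs: number;
}

const STALL_AFTER_MS = 30 * 60 * 1000;

function sweep(
  claims: Claim[],
  nowMs: number
): { stalled: string[]; freedSlots: number[]; live: Claim[] } {
  const stalled: string[] = [];
  const freedSlots: number[] = [];
  const live: Claim[] = [];
  for (const c of claims) {
    if (nowMs - c.lastHeartbeatMs >= STALL_AFTER_MS) {
      stalled.push(c.beadId);  // bead re-enters the queue
      freedSlots.push(c.slot); // slot (and its worktree) gets cleaned up
    } else {
      live.push(c);
    }
  }
  return { stalled, freedSlots, live };
}
```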
Workspace Isolation
Each agent gets an isolated git worktree with its own Docker container running Vite (frontend) and Express (backend) on unique port pairs. The slot pool has 10 positions, each with full process isolation — true parallelism without interference.
Worktrees isolate the filesystem but not processes. Docker gives each slot its own process namespace and network. Each agent can run dev servers, execute tests, and run browser automation in complete isolation.
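Unique port pairs fall out of a simple slot-indexed allocation. The base ports and stride below are assumptions (5173 is just Vite's default dev port), not the engine's real values:

```typescript
// Port-pair sketch: each of the 10 slots gets a unique Vite/Express pair.
// Base ports are assumptions, not the engine's real configuration.
const SLOT_COUNT = 10;
const VITE_BASE = 5173;    // Vite's default dev port, used as an assumed base
const EXPRESS_BASE = 4000; // assumed backend base

function portsForSlot(slot: number): { vite: number; express: number } {
  if (slot < 0 || slot >= SLOT_COUNT) throw new Error(`invalid slot ${slot}`);
  return { vite: VITE_BASE + slot, express: EXPRESS_BASE + slot };
}
```

Because ports are a pure function of the slot index, no registry or lock is needed — claiming the slot claims the ports.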
Slot Pool
Deployment Pipeline
main
Working branch. Bead merges accumulate here.
dev
First deployed environment. Cloud Build triggered.
stage
Pre-production. Tim reviews the full experience.
prod
Live. Requires Tim's explicit approval.
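The pipeline above can be sketched as a promotion map. The structure and flags are illustrative — only the branch order, the Cloud Build trigger on dev, and the prod approval requirement come from the text:

```typescript
// Promotion-map sketch; shape and field names are assumptions.
interface Stage {
  branch: "main" | "dev" | "stage" | "prod";
  deployed: boolean;         // does a merge here trigger Cloud Build?
  requiresApproval: boolean; // does promotion here need explicit sign-off?
}

const pipeline: Stage[] = [
  { branch: "main",  deployed: false, requiresApproval: false }, // bead merges accumulate
  { branch: "dev",   deployed: true,  requiresApproval: false }, // first deployed env
  { branch: "stage", deployed: true,  requiresApproval: false }, // pre-production review
  { branch: "prod",  deployed: true,  requiresApproval: true },  // explicit approval
];

// Where does a branch promote to next, if anywhere?
function nextStage(branch: Stage["branch"]): Stage | undefined {
  const i = pipeline.findIndex(s => s.branch === branch);
  return i >= 0 ? pipeline[i + 1] : undefined;
}
```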
What I Learned
After 40+ epics, several patterns emerged
Fresh context solves most failures
When an agent fails, the instinct is to debug. The better strategy is to stall and let a fresh agent try. Agent failures are rarely deterministic. The retry loop has a remarkably high success rate on second attempts.
Rules must be repeated, not just stated
An instruction in the system prompt works for the first 10 minutes. By minute 30, it's been deprioritized. The heartbeat reminder system exists because "tell them once" doesn't work.
Runtime verification catches what tests miss
TypeScript and unit tests catch about 70% of issues. The remaining 30% — blank pages, broken routing, invisible elements — only appear when you run the app. Smoke gates are not optional.
The API boundary is load-bearing
Letting agents access state directly failed repeatedly. They find shortcuts, skip validations, and "improve" processes. The HTTP API is the only thing preventing agents from helpfully destroying the system.
Atomic work units are the foundation
Every feature — parallelism, fault tolerance, retry logic — depends on work being small and well-scoped. Bad decomposition cascades into every downstream problem.
The system needs to watch itself
AI agents don't know when they're stuck. External observation combined with forced behavioral change is essential. The agent doesn't decide to stop — the engine tells it to.
The Human Role
Tim is not a programmer. He's a technical leader who orchestrates AI agents. The system is designed so that a non-programmer with strong product sense can direct AI agents to build production software.
Every operational task — from running tests to deploying to production — is handled by the agents and the engine. Tim never runs terminal commands, modifies code, commits, or deploys.
Decision maker
Approves deployments, resolves ambiguity, makes business calls
Visual reviewer
Reviews UI in his own browser — he is the smoke gate
Quality gatekeeper
Feedback flows back into specs for future beads
System administrator
Configures cloud services, manages credentials