How I Let an Agent Into Production (and Why I Made It Worse on Purpose)
Some context first: Greenkub builds wooden homes. Not software. Actual houses, with actual wood. The kind of company where "deploy" means a truck. So when I tell you we've been elbow-deep in AI since 2023, that's the fun part of the story.
Our CEO is the kind of person who, the moment the recent wave of models got good, put a Claude subscription on every desk that could use one. And people did use it. Enthusiastically.
The honeymoon, and the cold sweat
People wired their AI straight into their tools: an MCP here, a hand-rolled skill there when the tool didn't have one. It worked. It worked a little too well, actually, which is exactly when I started losing sleep.
Because "it works" quietly meant an LLM was now one cheerful tool-call away from the ERP, the CRM, the planning boards, the project tracking, the whole load-bearing stack. Great until the day it isn't. So the brief wrote itself: give everyone an AI they can use without holding their breath, plugged into critical tools, safely.
Attempt one: the tempting shortcut
My first move was to not build anything. I stood up a POC on Openclaw and its deterministic-workflow integration, Lobster. On paper it was a goldmine: a prebuilt agent, battle-tested for months by hundreds of thousands of people. Plug in the right config, collect the prize.
I was wrong.
It was wildly oversized for the small, sharp tasks we were handing it. The tooling got messy fast, it reached for the terminal a lot more often than I was comfortable with, and the Lobster integration was still a bit young. The result was a stew of plumbing, a pile of .mjs scripts for the deterministic bits, and a Slack human-in-the-loop (HITL) step duct-taped on the side. It ran. I just didn't want to maintain it for the next two years.
Attempt two: build for the long haul
So I started over with LangGraph, and a deliberately boring plan. The agent should take requests from any channel, fully decoupled, and hand results back the same decoupled way. And because I actually wanted a product, with nice UX and outputs richer than a wall of prose, I left room for that too. (Another post.)
Don't reinvent the wheel
Tech is full of elegant patterns for exactly this, and event-driven architecture was the one I wanted (I just like it, fight me). A pnpm workspace + Turborepo monorepo, and the first thing I built wasn't a feature, it was the contracts.
Here's the rule that holds the whole thing up: nothing in the agent-core types mentions a specific tool, channel, or ingress. Not Slack, not Monday, not WhatsApp. Nothing. The core only speaks in neutral events. That means I can bolt on any tool or channel, human-driven or not, without touching the brain. The agent is just as happy reacting to a webhook as to a human typing a message.
Here's the command every gate's message ultimately becomes, straight from the contracts package. Notice what it carries and what it doesn't:
```typescript
import { z } from 'zod';

// identity + role, resolved upstream by the gateway
// (declared first: a const cannot be referenced before its declaration)
const userContextSchema = z.object({
  employee_id: z.string().min(1),
  role: z.string().min(1), // selects the graph
  external_ids: z.record(z.string()).default({}),
  // ...first_name, last_name
});

// the one envelope the agent-core consumes
// (sourceSchema is not shown in this excerpt)
export const messageReceivedSchema = z.object({
  schema: z.literal('agent.gateway.message.received.v1'),
  correlation_id: z.string().min(1), // traceable end to end
  session_id: z.string().min(1), // conversation / HITL thread
  ts: z.string().datetime(),
  user: userContextSchema, // identity + role, resolved upstream
  source: sourceSchema, // slack | whatsapp | generic
  data: z.object({ text: z.string().min(1) }),
});
```

The source is a discriminated union, so a Slack thread and a WhatsApp group are both just "a source" to the core. The user is already a real person with a role. There is no field called slack_channel_id anywhere near the brain. That's the whole trick.
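To make the "discriminated union" point concrete, here is a minimal sketch in plain TypeScript types rather than zod, so it runs without dependencies. Only the three variant names (slack, whatsapp, generic) come from the excerpt; the discriminator name `kind` and every other field are assumptions for illustration, not the real sourceSchema.

```typescript
// Hypothetical source variants. Only the three variant names come from the post;
// `kind`, `channel`, `group_id` and `origin` are made up for this sketch.
type Source =
  | { kind: 'slack'; channel: string }
  | { kind: 'whatsapp'; group_id: string }
  | { kind: 'generic'; origin: string };

// Code that must care about the variant narrows on the discriminator.
// Everything else can pass a Source around without looking inside it.
function describeSource(source: Source): string {
  switch (source.kind) {
    case 'slack':
      return `slack:${source.channel}`;
    case 'whatsapp':
      return `whatsapp:${source.group_id}`;
    case 'generic':
      return `generic:${source.origin}`;
  }
}
```

The compiler checks the switch for exhaustiveness, so adding a fourth variant forces every narrowing site to be revisited, while code that never narrows stays untouched.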
Three jobs, one broker
Concretely there are three kinds of thing in the system, and they only ever talk through a single RabbitMQ broker with four topic exchanges. No app holds a reference to another app; they hold a reference to an event name.
The gates are adapters at the edge. A gate might be a RabbitMQ consumer on the company's global event bus, a Slack bot, a webhook receiver, or whatever comes next. It speaks the messy native dialect of its source on the outside, and publishes neutral events onto channel.in on the inside. The brain never learns a new language; you just write another adapter.
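A gate, reduced to its essence, is a pure translation function. The sketch below assumes a simplified Slack-like payload and a made-up neutral event shape (`InboundEvent`, `SlackLikeMessage` and all their fields are hypothetical), and leaves out the actual publish to channel.in.

```typescript
// Hypothetical neutral event a gate would publish on channel.in.
interface InboundEvent {
  correlation_id: string;
  source: { kind: string; ref: string };
  external: { system: string; external_id: string }; // resolved to a person later
  text: string;
}

// Hypothetical, trimmed-down native payload from one source.
interface SlackLikeMessage {
  user: string;
  channel: string;
  text: string;
}

// The adapter: native dialect in, neutral event out. No business logic here.
function slackToInbound(msg: SlackLikeMessage, correlationId: string): InboundEvent {
  return {
    correlation_id: correlationId,
    source: { kind: 'slack', ref: msg.channel },
    external: { system: 'slack', external_id: msg.user },
    text: msg.text.trim(),
  };
}
```

Supporting a new channel then means writing one more function with the same return type, and nothing downstream changes.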
The gateway is the hub. It consumes everything off channel.in, does the unglamorous-but-critical work (identity, sessions, the human-in-the-loop bridge) and emits a clean command on agent.commands. The agent-core (the LangGraph graph) is the only thing that consumes commands. It runs, calls its tools over MCP, and streams its whole lifecycle back out on agent.events. The gateway picks those up, turns them into outbound intents on channel.out, and whichever gate is listening renders them. Often that's the channel the message came from, but not always. A reply can land somewhere else entirely, or nowhere at all: not every event needs an answer.
Reading it as a loop: a gate adapts the outside world onto channel.in; the gateway resolves identity → role → graph and emits a command; the core runs that graph and streams its thinking / tool / lifecycle events back on agent.events; the gateway turns those into intents on channel.out, and a gate picks them up, whether that's back to the origin channel, on to a different one, or nowhere if no answer is due.
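The loop is easier to see when it runs. This is a toy sketch: a synchronous in-memory bus stands in for RabbitMQ, and each handler just forwards the payload instead of doing real work. The exchange names are the ones from the post; `InMemoryBus` and `wireLoop` are hypothetical.

```typescript
type Handler = (payload: unknown) => void;

// Toy stand-in for the broker: named exchanges, synchronous delivery.
class InMemoryBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(exchange: string, handler: Handler): void {
    const list = this.handlers.get(exchange) ?? [];
    list.push(handler);
    this.handlers.set(exchange, list);
  }

  publish(exchange: string, payload: unknown): void {
    for (const handler of this.handlers.get(exchange) ?? []) handler(payload);
  }
}

// Wires the three roles together. `trace` records which exchange was hit, in order.
function wireLoop(bus: InMemoryBus, trace: string[]): void {
  // gateway: consumes channel.in, emits a command
  bus.subscribe('channel.in', (p) => {
    trace.push('channel.in');
    bus.publish('agent.commands', p);
  });
  // agent-core: the only consumer of commands, streams events back
  bus.subscribe('agent.commands', (p) => {
    trace.push('agent.commands');
    bus.publish('agent.events', p);
  });
  // gateway again: turns core events into outbound intents
  bus.subscribe('agent.events', (p) => {
    trace.push('agent.events');
    bus.publish('channel.out', p);
  });
  // some gate: renders the intent (here it only records it)
  bus.subscribe('channel.out', () => {
    trace.push('channel.out');
  });
}
```

Publishing one message on channel.in after calling `wireLoop` walks all four exchanges in order, and no handler ever references another handler, only an exchange name.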
Those four exchanges are the whole bus, and every queue dead-letters to a fifth, agent.dlx, so a poison message parks in a DLQ instead of wedging a consumer.
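In RabbitMQ itself, dead-lettering is configured on the queue (the x-dead-letter-exchange argument) and triggered when a consumer rejects a message without requeueing it. The sketch below only imitates that behaviour in memory to show why it matters; `Delivery` and `consume` are hypothetical names, not part of any client library.

```typescript
interface Delivery {
  id: string;
  body: string;
}

// Processes every delivery. A message whose handler throws is parked in
// `deadLetters` instead of being retried forever at the head of the queue.
function consume(
  queue: Delivery[],
  handler: (delivery: Delivery) => void,
  deadLetters: Delivery[],
): number {
  let handled = 0;
  for (const delivery of queue) {
    try {
      handler(delivery);
      handled += 1;
    } catch {
      deadLetters.push(delivery); // poison message: set aside, keep consuming
    }
  }
  return handled;
}
```

The messages behind the poison one still get processed, and the failed one stays available for inspection.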
It all rides the same broker. The response doesn't take some special return path; channel.out is just another exchange the gates are listening on, and the gateway decides which one (if any) should act on it. Decoupled in, decoupled out, and a brain that doesn't know or care where any of it came from, or where the answer goes.
Identity picks the graph
The gateway, the middle box in that loop, does more than route. A Slack user id, a WhatsApp number, a Monday account: none of those mean anything on their own. So the first thing the gateway does is identity resolution. It maps the raw (system, external_id) pair off the channel to a real person in our directory and, from that person, to a role. That resolved identity rides along in every command as the event's user context, and the core never sees a raw Slack handle.
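Identity resolution can be sketched as a lookup from a (system, external_id) pair to a directory entry. The `UserContext` fields mirror the schema excerpt earlier in the post; the in-memory `directory`, its contents, and `resolveIdentity` are assumptions standing in for whatever the real directory is.

```typescript
interface UserContext {
  employee_id: string;
  role: string;
  external_ids: Record<string, string>; // system name -> id in that system
}

// Hypothetical directory with made-up people and ids.
const directory: UserContext[] = [
  { employee_id: 'e-001', role: 'installer', external_ids: { whatsapp: '+33600000001' } },
  { employee_id: 'e-002', role: 'ops_lead', external_ids: { slack: 'U123', monday: '777' } },
];

// Maps a raw channel identity to a person, or undefined when nobody matches.
function resolveIdentity(system: string, externalId: string): UserContext | undefined {
  return directory.find((person) => person.external_ids[system] === externalId);
}
```

An unresolved identity returns undefined, which gives the gateway an obvious place to stop before any command is emitted.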
And the role is the interesting bit, because it doesn't just trim a list of tools, it selects an entirely different LangGraph graph. A field installer messaging from a chantier (a job site) and an ops lead in Slack can send the exact same sentence and be handled by two different state machines: different nodes, different prompts, different retrieval steps, different gated writes at the end. The installer's graph might go classify → pull chantier context → propose a Monday incident; another role's graph wouldn't even have that path.
It's the same instinct as the rest of the system. The blast radius isn't a prompt asking the model to behave, it's a hard boundary decided before the model runs: your identity chooses which graph executes, and a graph can only do what its nodes were built to do. You can't wander into a capability that isn't wired into the graph you were routed to.
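Here is the shape of that boundary as a sketch. Plain functions stand in for LangGraph graphs, the installer path follows the example above, and the `ops_lead` graph and all node names are invented for illustration.

```typescript
// A node appends its name to the state so the executed path is visible.
type GraphNode = (state: string[]) => string[];
const step = (name: string): GraphNode => (state) => [...state, name];

// One graph per role. A capability exists only if a node for it is wired in.
const graphsByRole = new Map<string, GraphNode[]>([
  ['installer', [step('classify'), step('pull_chantier_context'), step('propose_monday_incident')]],
  ['ops_lead', [step('classify'), step('answer')]], // hypothetical second graph
]);

// The role is resolved before any model call. Unknown roles run nothing.
function runForRole(role: string): string[] {
  const graph = graphsByRole.get(role);
  if (graph === undefined) throw new Error(`no graph for role: ${role}`);
  return graph.reduce<string[]>((state, node) => node(state), []);
}
```

Nothing the model says at runtime can add `propose_monday_incident` to the ops lead's path, because that node is simply not in the list being executed.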
More than a wall of text
Remember the "I wanted a product" bit? This is where it shows up. The core doesn't just publish a final answer, it narrates itself on agent.events as it goes. The gateway relays those as outbound intents, and a gate gets to render them into a live timeline instead of a paragraph that appears all at once.
| Event (agent.core.*.v1) | What the user sees |
|---|---|
| run.started / completed / failed | spinner, then a duration or an error |
| thinking.phase_started | a labelled step ("Looking up the chantier…") |
| thinking.emitted | reasoning streaming in token by token |
| reasoning_summary.emitted | a collapsible titled summary block |
| tool.invoked / tool.completed | "called Monday → 1 item updated" |
| message.completed | the final answer |
Because it's all events, each gate gets to decide how to render them. For now the Slack gate does the pragmatic thing: it translates certain events into certain Slack Block Kit blocks. That works really well for the renders whose shape you already know up front, the HITL approval card, an ask-user-question prompt, a tool-result summary, where a hand-mapped block is exactly what you want.
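A hand-mapped renderer is little more than a switch over event types. The sketch below renders to plain strings rather than real Block Kit blocks, uses shortened event names, and assumes payload fields (`duration_ms`, `label`, `tool`, `summary`, `text`, `error`) that the post does not specify.

```typescript
// Hypothetical subset of the core's events, with made-up payload fields.
type AgentEvent =
  | { type: 'run.started' }
  | { type: 'run.completed'; duration_ms: number }
  | { type: 'run.failed'; error: string }
  | { type: 'thinking.phase_started'; label: string }
  | { type: 'tool.completed'; tool: string; summary: string }
  | { type: 'message.completed'; text: string };

// One gate's rendering decision: each known event shape maps to one timeline line.
function renderLine(event: AgentEvent): string {
  switch (event.type) {
    case 'run.started':
      return 'working…';
    case 'run.completed':
      return `done in ${(event.duration_ms / 1000).toFixed(1)}s`;
    case 'run.failed':
      return `failed: ${event.error}`;
    case 'thinking.phase_started':
      return `step: ${event.label}`;
    case 'tool.completed':
      return `called ${event.tool} → ${event.summary}`;
    case 'message.completed':
      return event.text;
  }
}
```

A different gate could map the same events to something else entirely, which is the point of keeping rendering at the edge.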
The interesting question is everything else: what about output whose shape the agent decides at runtime? That's a whole post on its own, building an A2UI renderer for Slack with generative UI, on top of Google's A2UI project. Soon.
Why I sleep
And this is the part that lets me actually close my laptop: the agent never writes on its own. When it wants to touch something real (open an incident, update a status), it doesn't call the tool. It proposes, and the proposal shows up as an approval card in Slack. A human taps yes. Only then does anything happen.
The hard part isn't the "ask first", it's that the message came in on WhatsApp and the approval happens in Slack, possibly hours later, possibly after the process restarted. Nothing can live in memory. The whole bridge sits in a gateway_sessions row:
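A rough sketch of what such a row could hold and how the bridge might use it. Every column name below is an assumption, not the actual schema, and a Map of JSON strings stands in for the database table so that nothing survives by object reference.

```typescript
// Hypothetical proposal the agent wants a human to approve.
interface PendingWrite {
  tool: string;
  args: Record<string, unknown>;
}

// Hypothetical gateway_sessions row; all column names are made up.
interface GatewaySessionRow {
  session_id: string;
  origin: string; // where the request came in, e.g. 'whatsapp'
  approval_channel: string; // where the approval card is shown, e.g. 'slack'
  status: 'open' | 'closed';
  pending_hitl: PendingWrite | null;
}

// Stand-in for the table: rows are serialised, as they would be in a database.
type SessionTable = Map<string, string>;

// The agent proposes: the row is persisted and nothing is executed.
function saveProposal(table: SessionTable, row: GatewaySessionRow): void {
  table.set(row.session_id, JSON.stringify(row));
}

// A human answers the card, possibly hours later and after a restart.
function resolveProposal(
  table: SessionTable,
  sessionId: string,
  approved: boolean,
  execute: (write: PendingWrite) => void,
): GatewaySessionRow {
  const raw = table.get(sessionId);
  if (raw === undefined) throw new Error(`unknown session: ${sessionId}`);
  const row = JSON.parse(raw) as GatewaySessionRow;
  if (row.pending_hitl === null) throw new Error('nothing to approve');
  if (approved) execute(row.pending_hitl); // the only place a write can happen
  const next: GatewaySessionRow = { ...row, pending_hitl: null };
  table.set(sessionId, JSON.stringify(next));
  return next;
}
```

Because `resolveProposal` rebuilds everything from the stored row, it behaves the same whether it runs in the process that saved the proposal or in a fresh one. A rejected proposal clears the pending entry without ever calling `execute`.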
The dangerous edges are deterministic and live below the model: which tools even exist, what a write requires, what states are legal. One invariant I enforce in the database rather than in a prompt: closed ⇒ no pending HITL, as a partial index, so a resolved incident physically cannot carry a dangling approval. The LLM is allowed to be wrong, because when it is, a human catches it before anything changes and the bad path is just a no-op.
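The invariant itself is a one-line implication. The post enforces it in the database; the sketch below is only an application-level mirror of the same rule to make it concrete, with hypothetical names (`SessionState`, `satisfiesInvariant`, `closeSession`) and a simplified state shape.

```typescript
interface SessionState {
  status: 'open' | 'closed';
  pending_hitl: string | null; // hypothetical: id of an unanswered approval card
}

// closed ⇒ no pending HITL, written as "not closed, or nothing pending".
function satisfiesInvariant(state: SessionState): boolean {
  return state.status !== 'closed' || state.pending_hitl === null;
}

// Refuses a transition that would leave a closed session with a dangling approval.
function closeSession(state: SessionState): SessionState {
  const next: SessionState = { ...state, status: 'closed' };
  if (!satisfiesInvariant(next)) throw new Error('cannot close: approval still pending');
  return next;
}
```

Keeping the real check in the database means it holds even for code paths that forget to call a guard like this one.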
An agent in production isn't scary because it's autonomous. It's scary when it's autonomous and unbounded. Bound it (neutral contracts, gates at the edges, a human on every real write) and you get to go home. Even at a company that, fundamentally, makes houses out of wood.
"But doesn't a human saying yes defeat the point?"
Fair objection. Making a person approve every agent action sounds like it cancels out the whole reason you brought in an AI. I'd argue it doesn't, and the best evidence is you and me.
If you've read this far, you're probably in tech, which means you probably code with an agent. When Cursor first landed, I reviewed every output. I re-read every prompt, checked it against the repo's conventions, the whole ritual. Then the quality got good enough that I started leaning on regression and e2e tests instead, just scanning for anti-patterns and keeping the code roughly elegant.
And as it kept getting better, and I got used to it, I reviewed less and less. These days I sometimes catch myself prompting an agent to build something I built fifteen minutes ago. (I should run fewer of them in parallel.)
Here's the catch, and it's not about output quality: it's about information. Adoption across our non-technical teams went great. Output quality got genuinely good. The signal I'd wired up to count how often someone corrected the agent's proposed data drifted closer and closer to zero. Edits after an agent create-or-update: also nearly zero.
Our non-technical teams were no exception: the same drift I'd gone through with code review happened to them. Less time reading the HITL prompts, less review of the data, and with it a quiet loss of information. For a helper that capitalises the first letter of a string, near-zero edits is fine. For a system that detects field incidents and logs them into the stack by itself, near-zero edits means nobody is actually processing what's being entered.
And that matters, because the system doesn't close the loop. It doesn't call the supplier. It doesn't get goods to the post office in under 24 hours. A human still does all of that downstream, and to do it well, that human has to have actually absorbed the information, not rubber-stamped a card.
So I added a margin of error. On purpose. I told the teams the incoming data wasn't clean enough to trust blindly. The fixes were minor, and not on every prompt, just frequent enough to make people stop and read. To understand and process what they were putting into the system. The human-in-the-loop isn't there to babysit the AI. It's there to keep the human in the loop.