The demo works. Then you point it at real data and it hallucinates something confidently wrong, loops 40 times before timing out, or fails silently and you don't notice for two days.
This is the pattern. Almost every founder who's tried to build agents has hit it at least once. And almost every time, the model gets blamed. But the model isn't the problem.
The architecture is.
# What an agent actually is
An agent isn't a chatbot with extra steps. It's a system that has a goal, has tools it can call to take action in the world, and runs a loop — observe, reason, act, observe — until the goal is achieved or it hits a stop condition.
That loop is where the intelligence lives. It's also where most implementations break.
When founders treat agents like chatbots, they skip the loop entirely. One call, one response, done. That's a prompt wrapped in an API call. Not an agent.
# The failure modes, roughly in order of how often I see them
No grounding in real-world state. The agent makes decisions based on what it thinks is true rather than what is true right now.
Classic example: a customer support agent responding to billing questions based on your pricing structure from six months ago, because nobody gave it a tool to look up live subscription data. Every answer it gives about plan features is confidently, plausibly wrong.
The fix is straightforward but easy to skip: give the agent tool calls that read live data. A get_customer_tier(email) function that hits your actual database beats any amount of context stuffed into the system prompt. If the agent can look something up, don't make it remember it.
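Here is a minimal sketch of what such a lookup tool could look like. The table name, columns, and the extra `conn` parameter are assumptions for illustration (the real signature depends on your stack), and an in-memory SQLite database stands in for your production one.

```python
import sqlite3

def get_customer_tier(conn: sqlite3.Connection, email: str) -> dict:
    """Read the customer's current plan from the database at call time."""
    row = conn.execute(
        "SELECT tier, renews_on FROM subscriptions WHERE email = ?",
        (email.strip().lower(),),
    ).fetchone()
    if row is None:
        # Explicit "not found" so the agent has nothing to guess from.
        return {"found": False, "email": email}
    return {"found": True, "email": email, "tier": row[0], "renews_on": row[1]}

# Hypothetical schema and data, standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriptions (email TEXT PRIMARY KEY, tier TEXT, renews_on TEXT)"
)
conn.execute("INSERT INTO subscriptions VALUES ('ana@example.com', 'pro', '2025-01-31')")

print(get_customer_tier(conn, "Ana@Example.com"))
print(get_customer_tier(conn, "nobody@example.com"))
```

Returning a structured "not found" instead of raising or returning nothing gives the agent a fact to reason about, which is usually safer than leaving a gap it might fill on its own.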
Unbounded loops. The agent gets stuck reasoning about a problem, calling the same tool repeatedly, because you never defined what "done" looks like. This is how you wake up to a $400 API bill on a Tuesday with no explanation.
The agent didn't fail loudly. It just kept going.
Hard-code a maximum iteration count before anything ships. If the agent reaches that limit without completing the goal, route it to a human — don't let it loop silently into your bill. The stop condition isn't optional.
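A hard cap can be as small as the sketch below. The limit of 8 and the `escalate_to_human` label are placeholder assumptions, and `step` stands in for one full reason-and-act cycle of your agent.

```python
from typing import Callable

MAX_ITERATIONS = 8  # assumed value; tune it per task

def run_bounded(step: Callable[[int], bool], max_iterations: int = MAX_ITERATIONS) -> str:
    """Call step() until it reports the goal is met or the cap is reached."""
    for i in range(max_iterations):
        if step(i):
            return "done"
    # Cap reached without success: hand off instead of spending more.
    return "escalate_to_human"

# A step that never succeeds stands in for a stuck agent.
print(run_bounded(lambda i: False))    # escalate_to_human
# A step that succeeds on its third iteration.
print(run_bounded(lambda i: i == 2))   # done
```

The point of the `for` loop over a `while True` is that the worst-case cost is known before the run starts.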
Using the LLM for things tools should do. Founders ask the model to do math, track state across turns, or look up real-time data — things LLMs handle poorly — instead of building deterministic functions that handle those tasks reliably.
An LLM is excellent at reasoning, planning, interpreting ambiguous input, and generating text. It should not be your calculator, your database, or your search engine. Before adding something to the system prompt, ask: could a simple function do this better and more reliably? If yes, build the function. Give it to the agent as a tool.
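As one illustration, take a billing calculation. The `prorated_refund` function below is a hypothetical tool, not something from a specific product; it uses Python's `decimal` module so the arithmetic is exact and repeatable, which is the kind of work that is better kept out of the model.

```python
from decimal import Decimal, ROUND_HALF_UP

def prorated_refund(monthly_price: str, days_in_period: int, days_used: int) -> str:
    """Refund for the unused part of a billing period, rounded to cents."""
    if days_in_period <= 0:
        raise ValueError("days_in_period must be positive")
    if not 0 <= days_used <= days_in_period:
        raise ValueError("days_used must be between 0 and days_in_period")
    price = Decimal(monthly_price)
    unused_fraction = Decimal(days_in_period - days_used) / Decimal(days_in_period)
    refund = (price * unused_fraction).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(refund)

print(prorated_refund("49.00", 30, 12))  # 29.40
```

The agent decides whether a refund applies and explains it to the customer; the function decides what the number is.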
No observability. When something goes wrong, you can't figure out why because you never logged what the agent decided and why it decided it.
Build the trace before you build features. Every tool call, every observation, every reasoning step — log it. Running blind in production isn't a fast iteration loop. It's a fast path to breaking things you can't diagnose.
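A trace does not need a vendor to get started. The sketch below writes one JSON object per line; the `Trace` class, its field names, and the event kinds are assumptions, and `io.StringIO` stands in for a log file.

```python
import io
import json
import time
import uuid

class Trace:
    """Append-only JSON Lines trace for a single agent run."""

    def __init__(self, sink):
        self.run_id = uuid.uuid4().hex
        self.sink = sink  # any object with .write(), e.g. an open file
        self.step = 0

    def log(self, kind: str, **fields) -> None:
        self.step += 1
        record = {
            "run_id": self.run_id,
            "step": self.step,
            "ts": time.time(),
            "kind": kind,
            **fields,
        }
        # default=str keeps logging from crashing on non-JSON values.
        self.sink.write(json.dumps(record, default=str) + "\n")

sink = io.StringIO()  # stands in for a log file
trace = Trace(sink)
trace.log("reasoning", text="Need the customer's current plan before answering.")
trace.log("tool_call", tool="get_customer_tier", args={"email": "ana@example.com"})
trace.log("observation", tool="get_customer_tier", result={"tier": "pro"})
print(sink.getvalue())
```

With a shared `run_id` and a step counter, you can replay a bad run in order and see the decision that sent it off course.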
# The architecture that actually holds up
AGENT LOOP
├── System prompt (goal + constraints + available tools)
├── Current state (what has happened so far)
├── Tool calls (deterministic functions the agent can invoke)
├── Observation (what the tool returned)
└── Stop condition (max_iterations OR goal_met OR error_threshold)
The system prompt is the job description: goal, constraints, tools available. Keep it specific, and don't try to handle every edge case in the prompt. Edge cases belong in your error-handling logic.
State is the agent's working memory within a single run. Keep it minimal. Everything that can live in a tool call should live there, not in the context window.
Tools are deterministic, testable, and completely separate from the LLM. Build each one like a small API endpoint you can test in isolation before wiring it into the agent. If you can't test it independently, it's not ready.
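Testing a tool in isolation can look like the sketch below. `normalize_domain` is a hypothetical helper of the kind a domain-based tool might need, and the tests are plain functions you could run with pytest or call directly. No LLM is involved at any point.

```python
from urllib.parse import urlparse

def normalize_domain(value: str) -> str:
    """Reduce a URL or bare host to a lowercase domain without 'www.'."""
    value = value.strip()
    if "://" not in value:
        value = "//" + value  # lets urlparse treat a bare host as the netloc
    host = urlparse(value).hostname or ""
    return host[4:] if host.startswith("www.") else host

def test_full_url():
    assert normalize_domain("https://www.Example.com/pricing") == "example.com"

def test_bare_host():
    assert normalize_domain("Example.com") == "example.com"

def test_empty_input():
    assert normalize_domain("   ") == ""
```

If a tool passes tests like these on its own, a bad agent run later points at the prompt or the loop, not at the plumbing.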
The stop condition is non-negotiable. Without one, you don't have an agent. You have a liability.
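Put together, the pieces above might be wired up roughly like this. It is a sketch under assumptions: `decide` stands in for the LLM call, the action dictionary shape is invented for illustration, and the limits are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class RunResult:
    status: str  # "goal_met", "max_iterations", or "error_threshold"
    state: list = field(default_factory=list)

def run_agent(decide, tools, max_iterations=6, error_threshold=2):
    """Observe, reason, act until one of three stop conditions fires."""
    state, errors = [], 0
    for _ in range(max_iterations):
        action = decide(state)  # reason: in a real agent this is the LLM call
        if action["type"] == "finish":
            return RunResult("goal_met", state)
        try:
            observation = tools[action["tool"]](**action["args"])  # act
        except Exception as exc:
            errors += 1
            observation = {"error": str(exc)}
        state.append({"action": action, "observation": observation})  # observe
        if errors >= error_threshold:
            return RunResult("error_threshold", state)
    return RunResult("max_iterations", state)

def scripted_decide(state):
    """Stand-in for the model: look something up once, then finish."""
    if not state:
        return {"type": "tool", "tool": "lookup", "args": {"key": "plan"}}
    return {"type": "finish"}

tools = {"lookup": lambda key: {"plan": "pro"}[key]}
result = run_agent(scripted_decide, tools)
print(result.status, len(result.state))  # goal_met 1
```

Every exit path returns a named status, so the caller can route `max_iterations` and `error_threshold` to a human instead of treating them as success.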
# A real example: lead research at scale
Here's an agent pattern running in production at B2B companies doing outbound work:
Goal: Given a company name and domain, produce a prospect brief — company size, funding stage, recent news, tech stack signals, suggested outreach angle.
Tools:
- search_web(query) → returns top results
- get_linkedin_data(domain) → headcount, industry, recent posts
- get_crunchbase_summary(name) → funding history
- analyze_tech_stack(domain) → detected technologies
The loop: Agent receives company input → calls tools to gather data → synthesizes into a structured brief → checks if all sections are complete with recent data → if not, identifies the gap and calls additional tools → returns brief, or routes to human after N iterations if data is insufficient.
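The "identify the gap" step can be made deterministic. In the sketch below, the section names and the mapping from section to tool are assumptions (the tool names are the ones listed above), and the freshness check is left out to keep it short.

```python
# Assumed mapping: which tool is expected to fill which section of the brief.
SECTION_TOOLS = {
    "company_size": "get_linkedin_data",
    "funding_stage": "get_crunchbase_summary",
    "recent_news": "search_web",
    "tech_stack": "analyze_tech_stack",
}

def missing_sections(brief: dict) -> list:
    """Sections that are absent or empty in the current brief."""
    return [name for name in SECTION_TOOLS if not brief.get(name)]

def next_tool_calls(brief: dict) -> list:
    """Tools worth calling next, in the order the sections are defined."""
    return [SECTION_TOOLS[name] for name in missing_sections(brief)]

brief = {
    "company_size": "51-200",
    "funding_stage": None,
    "recent_news": [],
    "tech_stack": ["React", "Postgres"],
}
print(next_tool_calls(brief))  # ['get_crunchbase_summary', 'search_web']
```

An empty result means the brief is complete; a non-empty one after N iterations is the signal to route it to a human.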
In practice: paste 20 company names into a spreadsheet and run the agent on each. Twenty minutes later you have 20 research briefs, built from live data the agent pulled rather than from plausible-sounding guesses. Research that used to take a person 3 hours now costs close to zero additional researcher time per run once the initial setup is done.
# How to actually start
Pick one repetitive research or data-gathering task you do at least three times a week. Map the data sources you use to complete it — those become your tools. Write the system prompt in plain English: what's the goal, what are the constraints, what does a good output look like?
Build the tools first. Test each one independently. Then connect the agent, run 10 test cases with full logging, and verify the outputs before it touches anything real. Add the stop condition and error handling before anything goes to production.
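A test run like that can be a short script. The sketch below assumes the agent is a function from an input dict to an output dict; `fake_agent`, the case format, and the `required_keys` check are all hypothetical stand-ins for your own agent and acceptance criteria.

```python
def run_cases(agent, cases):
    """Run every case, never stop early, and collect a report per case."""
    report = []
    for case in cases:
        try:
            output = agent(case["input"])
            missing = [k for k in case["required_keys"] if not output.get(k)]
            status = "fail" if missing else "pass"
            detail = f"missing: {missing}" if missing else ""
        except Exception as exc:
            status, detail = "error", f"{type(exc).__name__}: {exc}"
        report.append({"name": case["name"], "status": status, "detail": detail})
    return report

def fake_agent(company: dict) -> dict:
    """Stand-in agent with one gap and one crash, to exercise the harness."""
    if not company.get("domain"):
        raise ValueError("domain is required")
    funding = "" if company["name"] == "Globex" else "Series A"
    return {"company_size": "51-200", "funding_stage": funding}

required = ["company_size", "funding_stage"]
cases = [
    {"name": "acme", "input": {"name": "Acme", "domain": "acme.example"}, "required_keys": required},
    {"name": "globex", "input": {"name": "Globex", "domain": "globex.example"}, "required_keys": required},
    {"name": "no-domain", "input": {"name": "Initech"}, "required_keys": required},
]
for row in run_cases(fake_agent, cases):
    print(row)
```

Distinguishing "fail" (ran, but the output is incomplete) from "error" (crashed) tells you whether to look at the prompt or at the tools.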
The best first agent is the simplest one that reliably does one thing. Complexity is what you add after it works — not before.