Adding AI Features to Products

The AI features that ship are the ones built around the LLM's actual capabilities, not its imagined ones. The features that fail are the ones where someone heard "AI can do anything" and tried to ship that. This page walks through the patterns that work, the patterns that look like they should work but do not, and the operational realities of running language models in production. The goal is to help you spot which AI features will earn their keep and which ones will consume engineering hours, burn API credits, and generate user complaints in roughly equal measure.

An AI feature is a bet. Every prompt is a bet that the model can produce a useful response in this context, at this cost, in this time budget, with this failure rate. Some bets are easy and obvious. Summarizing a 2,000-word article into three bullets is a near-certain win for any modern frontier model. Generating tax advice that won't get someone audited is a near-certain loss. Most product features sit in between, and the work of an AI engineer is mostly figuring out where in that spectrum a given feature lives, then building around the answer.

The teams that ship working AI features tend to share a few habits. They start with the failure mode, not the happy path. They pick the smallest model that produces correct output, not the biggest model that produces flashy output. They build evals before they tune prompts. They cache aggressively. They have a fallback for when the API is down. They measure cost per request the same way they measure latency. None of this is exciting, and that is precisely why it works.

When AI Is the Right Answer (And When It Absolutely Is Not)

The first decision in shipping an AI feature is the decision not to ship one. Most product problems have better solutions than calling an LLM. Search has a query parser. Validation has a regex. Authorization has a policy engine. The LLM is a hammer that can hit almost any nail, but most of those nails have purpose-built drivers that are faster, cheaper, and more reliable. Reach for the LLM when the alternatives are genuinely worse, not when "AI" sounds good in a press release.

The shape of a good AI feature is recognizable. The input is messy, unstructured text or context that resists parsing by traditional means. The output benefits from natural language fluency or from synthesis across many inputs. The cost of being wrong is low, recoverable, or visible to the user. The user can correct the output if it goes off, either by rephrasing or by editing the result directly. When all four of these conditions hold, an LLM is usually the cleanest tool. When any one of them breaks, you should look harder at alternatives.

Good shape for an AI feature

A user pastes a 1,500-word product description and asks for a 100-word marketing summary. The input is unstructured, the output benefits from fluency, the cost of mediocrity is "user rewrites it" rather than "lawsuit," and the user sees the result and can edit it before publishing. Claude Haiku handles this in 800 milliseconds for a fraction of a cent. Ship it.

Bad shape for an AI feature

A user asks "how much sales tax do I owe in California for a $4,200 order?" The input is structured, the right answer is a single number determined by law, and the cost of being wrong is a customer who gets fined. This is a lookup table joined with a multiplication. An LLM will probably get it right but might confidently invent a wrong rate. Use the table.

The rule of thumb that filters most of these calls correctly: would a human handle this in 30 seconds with a clear right answer? If yes, an LLM is overkill, and a deterministic system will be cheaper, faster, and more reliable. The LLM earns its place when the human would take 30 minutes, the right answer is a matter of judgment, or the input is so varied that no fixed schema captures it.

Tasks that LLMs handle well, in rough order of how forgiving they are:

  • Summarization. Compressing long inputs into shorter outputs. Low-stakes, near-deterministic in quality.
  • Classification with fuzzy boundaries. "Is this email a complaint, a question, or feedback?" categories that resist clean rules.
  • Structured extraction. Pulling fields out of unstructured text into JSON. Strong if you provide a schema.
  • Content generation. Drafts, outlines, first passes, fill-in-the-blank text.
  • Translation and rewriting. Tone shifts, language conversion, formality adjustments.
  • Conversational interfaces. Chatbots, Q&A over docs, customer support assistants.
  • Code generation and explanation. Snippets, refactors, doc strings.

Tasks where LLMs reliably disappoint or actively cause harm:

  • Exact arithmetic. Numbers, dates, percentages. The model can usually do it, but small errors slip through, delivered with full confidence. Use a calculator.
  • Regulatory compliance answers. Tax, legal, medical advice presented as authoritative. The hallucination rate is non-zero and the consequences are not.
  • Real-time facts. Stock prices, sports scores, current events past the training cutoff. Without retrieval, the model will guess.
  • Deterministic lookups. "What's the SKU for this product?" If you have a database, query it.
  • Anything where wrong-but-confident is worse than no answer. Medical dosage, legal deadlines, financial transactions.

The clearest signal that you are reaching for an LLM in the wrong place is when you find yourself constructing elaborate guard rails around the output. Validation pipelines that check the model's number against a reference, regex parsers that extract the right field from prose, retry loops that ask the model again when the first answer was wrong: each of these is the system telling you that a deterministic alternative would do the job better. Build AI features where fluency, synthesis, or judgment are the bottleneck. Skip them where the answer is a fact, a number, or a rule. The LLM is a tool for things that resist rules, not a replacement for the rules you already have.

LLM Integration Patterns

There are four shapes most LLM-powered features take. Pure prompt is the simplest. Retrieval-augmented generation grounds the model in your data. Tool use lets the model call functions. Agent loops chain tool use into multi-step autonomous behavior. The cost and complexity grow at each step, and so does the surface area for things to go wrong. Most product features should sit at the simpler end of this spectrum and only move up when the simpler pattern provably falls short.

Pure prompt is what most demos look like. You take user input, drop it into a prompt template, send the prompt to the Claude API, and stream the response back to the user. There is no state, no retrieval, no tools. This pattern handles classification, summarization, basic Q&A, content generation, and a surprising amount of customer-facing chat. It is fast, cheap, easy to debug, and easy to reason about. If a pure-prompt approach gets you 90% of the way to a working feature, do not reach for anything fancier just because RAG sounds more legitimate.
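
A minimal sketch of the pure-prompt shape, using the Anthropic SDK; the model ID matches the streaming example later on this page, and the categories and prompt wording are illustrative. In practice you would route a task this simple to the smallest model that passes your evals.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Illustrative pure-prompt feature: classify a support email into one of three buckets.
export async function classifyEmail(emailBody) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5", // swap in your smallest model that passes evals for a task this simple
    max_tokens: 10,
    system: "Classify the email as exactly one of: complaint, question, feedback. Respond with the single word only.",
    messages: [{ role: "user", content: emailBody }],
  });

  // The response content is an array of blocks; the first text block holds the label.
  const block = response.content[0];
  return block.type === "text" ? block.text.trim().toLowerCase() : null;
}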

Retrieval-augmented generation, or RAG, is the pattern when your data does not fit in a prompt. You have 50,000 internal documents, a knowledge base, a corpus of customer tickets. You cannot stuff all of it into context, and even if you could, the model would lose the relevant signal in the noise. So you embed each document into a vector, store the vectors in a database (Postgres with pgvector, Pinecone, Weaviate, Qdrant), and at query time you embed the user's question and pull the top-K most similar chunks. Those chunks go into the prompt as context, and the model answers from them.

User question -> Embed query -> Vector search -> Top-K chunks -> Prompt + chunks -> Grounded answer

RAG works well when the answer is in your data and you need to find it. It is less useful when the answer requires reasoning across many documents, when the relevant chunk is hard to retrieve via embedding similarity (technical jargon, code, structured data), or when the question itself is poorly formed. The teams that ship good RAG systems spend most of their time on retrieval quality, not on prompt engineering. Better chunking, hybrid search (combining vector similarity with keyword search via BM25 or Postgres full-text search), reranking with a cross-encoder, query rewriting before retrieval. The LLM at the end of the pipeline is often the easy part.
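
A sketch of the retrieval half of the pipeline, assuming Postgres with pgvector and a hypothetical embedQuery helper that wraps whatever embedding model you use; the table and column names are illustrative and the chunking step is assumed to have happened at indexing time.

import Anthropic from "@anthropic-ai/sdk";
import pg from "pg";
import { embedQuery } from "./embeddings"; // hypothetical helper returning a number[] embedding

const { Pool } = pg;
const pool = new Pool(); // connection settings come from the standard PG* env vars
const client = new Anthropic();

export async function answerFromDocs(question) {
  const queryEmbedding = await embedQuery(question);

  // Top-K similarity search with pgvector; <=> is cosine distance, so smaller is closer.
  const { rows } = await pool.query(
    "SELECT content FROM chunks ORDER BY embedding <=> $1::vector LIMIT 5",
    [JSON.stringify(queryEmbedding)],
  );

  const context = rows.map((r, i) => `[${i + 1}] ${r.content}`).join("\n\n");

  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: "Answer using only the provided context. Say you do not know if the context is insufficient.",
    messages: [{ role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` }],
  });

  return response.content[0].type === "text" ? response.content[0].text : "";
}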

Tool use is the next step up. The model is given a list of functions it can call, with names, descriptions, and parameter schemas. It decides when to call them, you execute the call on your side, you feed the result back, and the model continues. The classic example is a weather tool: the user asks "what's the weather in Tokyo?" and the model calls get_weather(city="Tokyo"), your code hits a weather API, and you return the result so the model can phrase the final answer. Tool use is what lets an LLM interact with the real world, not just the world inside the prompt.

User request -> Model + tools -> Tool call -> Your function -> Result back -> Final response
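
A sketch of the weather-tool loop described above, using the Claude API's tools parameter; fetchWeather is a hypothetical helper that wraps your weather API, and the loop handles a single tool per turn for brevity.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools = [
  {
    name: "get_weather",
    description: "Get the current weather for a city. Returns temperature and conditions.",
    input_schema: {
      type: "object",
      properties: { city: { type: "string", description: "City name, e.g. Tokyo" } },
      required: ["city"],
    },
  },
];

export async function weatherAnswer(userQuestion, fetchWeather) {
  const messages = [{ role: "user", content: userQuestion }];

  let response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    tools,
    messages,
  });

  // While the model keeps asking for the tool, execute it and feed the result back.
  while (response.stop_reason === "tool_use") {
    const toolUse = response.content.find((block) => block.type === "tool_use");
    const result = await fetchWeather(toolUse.input.city); // your code hits the weather API

    messages.push({ role: "assistant", content: response.content });
    messages.push({
      role: "user",
      content: [{ type: "tool_result", tool_use_id: toolUse.id, content: JSON.stringify(result) }],
    });

    response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      tools,
      messages,
    });
  }

  return response.content.find((block) => block.type === "text")?.text ?? "";
}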

Agent loops are tool use turned up to eleven. The model takes an action, observes the result, decides on the next action, takes it, observes again. It keeps doing this until the task is done or some stopping condition triggers. This is how you build something that actually does work autonomously: file an expense report, debug a failing test, gather research from multiple sources. The cost is real. Each turn is another API call. The latency stacks up. Debugging gets harder because the model's choices are non-deterministic. But for tasks that genuinely require multi-step exploration, no simpler pattern will work.

The mistake most teams make with agents is reaching for them too early. A pure prompt with a clean schema handles 70% of "AI feature" use cases. RAG handles another 20%. Tool use covers most of the remainder. Agents are for the last 5% where the problem genuinely needs an autonomous loop. If you can do it with a single prompt or a single retrieval, you should.

The pattern selection rule is simple. Start with pure prompt. Add retrieval when the data does not fit in context. Add tools when the model needs to act on the world. Add agent loops only when the problem is genuinely multi-step and exploratory. Each step up the ladder adds cost, latency, and new failure modes. Earn the complexity by demonstrating that the simpler pattern provably falls short, not by assuming the more sophisticated approach must be better. Most "agentic" features in production today would work just as well as a single well-designed prompt; the agent label is often marketing language for what is really a chain of three or four prompts.

Streaming and UX

A 10-second wait with a spinner feels like the app is broken. The same 10 seconds, but with text streaming in word by word starting at 500 milliseconds, feels alive. This is not a small UX detail. It is the single biggest perceived-quality lever in any LLM-powered feature. Users will tolerate slow if they can see progress. They will not tolerate fast that looks frozen.

The mechanism is server-sent events (SSE). The Anthropic API and the OpenAI API both support streaming responses. You set stream: true on the request, and instead of waiting for the full response, you get a sequence of chunks each containing a few tokens. Your server forwards those chunks to the client over an SSE connection (or a WebSocket, or a fetch stream consumed via the browser's ReadableStream API). The client appends each chunk to the visible output as it arrives. The first token appears in 200 to 800 milliseconds depending on model and load. The rest follow at maybe 30 to 100 tokens per second.

The concrete numbers matter. A 500-token response from Claude Sonnet without streaming might take 6 seconds end-to-end. With streaming, the first character appears in around 600 milliseconds, and the user is reading the response while the model is still generating. Perceived latency drops from "frustrating" to "fluent." For chat interfaces, autocomplete, content generation, and any feature where the output is shown directly to the user, streaming is non-optional.

  • ~600ms: first token (Sonnet, streaming)
  • ~6s: full 500-token response, no streaming
  • ~50 tok/s: typical streaming rate, Sonnet
  • ~150 tok/s: typical streaming rate, Haiku

Implementing streaming is straightforward with the Anthropic SDK. The pattern below is the basic shape: you call messages.stream, iterate over the events as they come in, and emit each text chunk to your transport (SSE, WebSocket, etc.).

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function streamCompletion(prompt, onChunk) {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  for await (const event of stream) {
    if (event.type === "content_block_delta" &&
        event.delta.type === "text_delta") {
      onChunk(event.delta.text);
    }
  }

  const finalMessage = await stream.finalMessage();
  return finalMessage;
}

On the client side, the consumption pattern depends on your transport. With Server-Sent Events from a Next.js Route Handler, you set the response to text/event-stream and write chunks as they arrive. With the Vercel AI SDK, you get higher-level abstractions: useChat handles the streaming connection, message state, and rendering for you with a few lines of React. For most product builds, the AI SDK is the right starting point unless you need fine control over the wire format.
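
For teams rolling their own transport rather than using the AI SDK, a minimal sketch of browser-side consumption with the fetch ReadableStream API, assuming a hypothetical /api/complete route that writes plain text chunks as they arrive from the model:

// Append each decoded chunk to the visible output as it arrives.
export async function streamIntoElement(prompt, element) {
  const res = await fetch("/api/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    element.textContent += decoder.decode(value, { stream: true });
  }
}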

Streaming is not always available. Some providers do not support it for all model variants. Some intermediate hosting layers buffer responses and break the stream. Mobile clients on flaky networks may drop the connection mid-stream. For these cases, you need a fallback: skeleton states that animate while the request is in flight, a clear loading indicator, and a hard timeout (typically 30 seconds for chat, 60 seconds for long-form generation) after which you show a retry button.

The deeper UX point is that LLM responses are inherently uncertain in length and timing. Treat them like any other long-running operation: show progress, allow cancellation, surface errors clearly, and never block the rest of the UI. A "stop generating" button is essential. So is graceful handling of partial responses (the user can read what arrived even if the stream cuts off). If you would not ship a feature this slow without streaming, do not ship it at all.

The bottom line on streaming: it turns a 10-second wait into a 600-millisecond engagement. Use server-sent events or the Vercel AI SDK. Always include cancellation, timeouts, and skeleton fallbacks for the cases where streaming fails or is unavailable. The implementation is a few hundred lines once and then forgotten; the perceived quality difference is enormous and permanent. Skip streaming for batch jobs and webhook callbacks where there is no user waiting; use it everywhere else.

Cost and Latency Tradeoffs

Every API call has a cost. Every API call has a latency. The numbers are small enough that early-stage prototypes can ignore them, and large enough that production features cannot. The teams that ship sustainable AI features are the ones who treat cost-per-request and latency-per-request as first-class metrics, tracked alongside error rate and throughput.

Model selection is the biggest lever. Modern frontier providers offer a tiered family: a small, fast, cheap model for high-volume simple tasks; a mid-tier model for most production work; and a large, slow, expensive model for high-stakes reasoning. Anthropic's lineup is Claude Haiku, Claude Sonnet, and Claude Opus. The pattern repeats at OpenAI (mini, standard, premium) and Google (Flash, Pro). The numbers below are illustrative ballparks for current Claude pricing tiers and will shift as new model versions ship, but the order of magnitude is stable.

  • Haiku: fast classification, extraction, simple Q&A
  • Sonnet: most production features, balanced cost/quality
  • Opus: complex reasoning, agentic work, high-stakes output
  • ~10x: typical cost gap between tiers

The default move is to start with Sonnet. It handles the bulk of real-world tasks well. You drop down to Haiku when you find a high-volume task where Haiku produces correct output: routing requests, extracting structured fields, simple classification, summarizing short text. You climb up to Opus when Sonnet's reasoning falls short: multi-step planning, complex code review, nuanced writing, agentic loops where each decision compounds. Most production AI features run primarily on Sonnet, with Haiku for edges and Opus reserved for specific subtasks.

Token economics matter. Input tokens (the prompt you send) cost less per token than output tokens (what the model generates); output tokens typically cost three to five times as much. This shapes how you write prompts. Rich, detailed prompts with extensive examples are usually fine cost-wise because input is cheap. The expensive part is when the model writes a lot back. If you can constrain output (with explicit instructions like "respond in 50 words or fewer," with structured output schemas, with tight max_tokens limits), you save real money at scale.

Prompt caching is the second major cost lever. The Anthropic API lets you mark sections of your prompt as cacheable. The first request pays full price for those tokens. Subsequent requests within the cache TTL (5 minutes by default, with extended caching available) read those tokens from cache at roughly 10% of the input cost. For features with stable system messages, large reference documents, or repeated few-shot examples, this cuts costs dramatically. A RAG system that includes the same 20,000-token reference doc on every query is the canonical case: cache it once, reuse it cheaply.

Latency tracks model size. Haiku is roughly 3x to 5x faster than Sonnet, which is roughly 2x to 3x faster than Opus. A request that takes 600 milliseconds on Haiku might take 2 seconds on Sonnet and 4-5 seconds on Opus. This is end-to-end latency, not first-token latency (streaming makes the difference much smaller for perceived speed). For real-time interactions, prefer the smallest model that produces correct output. For batch jobs where the user is not waiting, model size matters less and quality matters more.

The discipline summary on cost and latency: pick the smallest model that produces correct output, use prompt caching for stable context, constrain output length where you can, and track cost-per-request and latency-per-request the same way you track error rate. The teams that get this wrong run out of API budget at the worst possible moment, usually a few months after launch when usage has scaled past their projections. The teams that get it right ship features that hold up under real load and stay economically sustainable across model price changes and growing user bases.

Caching Strategies

Caching is the single biggest cost optimization in any LLM system. It is also one of the most under-implemented because it requires thinking about your prompts in terms of stable and dynamic parts, and most prompts are written as a single blob. The teams that hit 60-80% cache hit rates on their LLM calls are paying a fraction of what the teams at 0% are paying for the same workload.

There are three caching strategies that compose. Prompt caching at the API level reuses prefix tokens. Semantic caching at the application level reuses entire responses for similar queries. Structured caching reuses outputs by exact key for deterministic-ish workloads. Each addresses a different layer.

Prompt caching is provider-side. With the Anthropic API, you mark blocks of your prompt with cache_control: { type: "ephemeral" }. The system message, large reference documents, few-shot examples, tool definitions: anything stable across requests. The API hashes the cached blocks and stores them. When the same prefix shows up on a subsequent request, the cached tokens are charged at roughly 10% of the input rate, and the time to first token drops because the model does not need to reprocess them. The cache TTL is 5 minutes by default, with longer-lived options available, which means a steady-state workload (more than one request every five minutes touching the same prefix) gets the discount continuously.

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: stableReferenceDocument,
          cache_control: { type: "ephemeral" },
        },
        { type: "text", text: userQuery },
      ],
    },
  ],
});

Semantic caching works at the application level. The idea: many user queries are paraphrases of queries you have already answered. "What are your refund policies?" and "How do I get a refund?" should resolve to the same answer. You compute an embedding of the incoming query, search a cache keyed by embedding similarity, and if you find a hit above a similarity threshold (typically 0.92 or higher with cosine similarity on a good embedding model), you return the cached response without calling the LLM at all. This is most powerful for support bots, FAQ assistants, and any feature where the question space is finite and repetitive.

The risks are real. Semantic caching can return a wrong answer if your similarity threshold is too loose, if the cached response was based on stale data, or if the user's query has subtle context the embedding does not capture. The practical pattern: start with conservative thresholds (0.95+), invalidate cache entries when the underlying data changes, and instrument the system so you can see the hit rate and the false-positive rate. A well-tuned semantic cache might serve 30-50% of queries in a stable Q&A workload.
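
A sketch of the idea with an in-memory cache; embedQuery and generateAnswer are hypothetical stand-ins for your embedding model and your LLM call, and a production version would live in a vector store with TTLs and explicit invalidation when the underlying data changes.

const cache = []; // entries of shape { embedding: number[], response: string }

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function cachedAnswer(query, embedQuery, generateAnswer, threshold = 0.95) {
  const embedding = await embedQuery(query);

  // Return the closest cached response if it clears the conservative similarity threshold.
  let best = null;
  for (const entry of cache) {
    const score = cosineSimilarity(embedding, entry.embedding);
    if (score >= threshold && (!best || score > best.score)) best = { ...entry, score };
  }
  if (best) return best.response;

  // Cache miss: call the model, store the result for future paraphrases.
  const response = await generateAnswer(query);
  cache.push({ embedding, response });
  return response;
}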

Structured caching is the simplest and the most overlooked. If your AI feature produces an output that is a function of a known input ("summary of document X version Y," "tags for article Z," "translation of paragraph P to French"), you can cache the output by the exact input key. This is just regular caching, the kind you would implement with Redis for any database query. The key insight is to recognize that most "AI" features are at heart deterministic-ish (the same input usually produces a usable output, even though it is not bit-exact across runs), and you can store the output once and serve it forever, or until the input changes.
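
A sketch of structured caching with a plain Map; the key fields and the prompt version tag are illustrative, and in production the Map would be Redis or a database table with the same logic.

const summaryCache = new Map();

export async function summarizeDocument(doc, summarize) {
  // Any change to the document version or the prompt version busts the cache.
  const key = `summary:${doc.id}:${doc.version}:prompt-v3`;

  if (summaryCache.has(key)) return summaryCache.get(key);

  const summary = await summarize(doc.body); // your LLM call
  summaryCache.set(key, summary);
  return summary;
}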

  • Prompt cache hit rate, well-tuned RAG: 70%
  • Semantic cache hit rate, FAQ bot: 40%
  • Structured cache hit rate, doc summaries: 85%
  • Cost reduction at 70% prompt cache hit rate: 60%

Cache hit rate as a metric should sit on the same dashboard as error rate and p95 latency. If your hit rate is zero, you are either not caching or your cache is broken. If it is above 80%, you should worry that you might be serving stale answers. The right number depends on the feature, but the absence of any number means nobody is paying attention. Surface it.

Some workloads are inherently uncacheable. Personalized responses where the prompt includes user-specific context that changes per request. Real-time data fetches where the answer depends on the current timestamp. Generative features where each call is supposed to produce something different. For these, focus on making each call as cheap and fast as possible (smaller model, shorter output, prompt caching for the stable parts) rather than trying to cache the response.

The caching summary: layer prompt caching, semantic caching, and structured caching. Track cache hit rate as a real metric. A well-tuned cache stack cuts LLM costs by 60% or more without changing the user-facing behavior. The teams that ignore this end up paying for the same tokens over and over. The implementation cost is small (a few weeks of engineering for a mature stack) and the recurring savings are real and permanent. Set up cache hit rate dashboards on day one of any production AI feature; if you can see the metric, you can act on it.

Fallback and Graceful Degradation

The LLM API will fail. Sometimes the provider has an incident. Sometimes you hit a rate limit. Sometimes the request times out. Sometimes the response comes back malformed and your downstream parsing chokes. Sometimes the model refuses the request because the input triggered a safety filter. Designing for these cases is not optional. A feature that breaks the moment the upstream API breaks is a feature that breaks for users.

The fallback hierarchy is a decision tree. Each level represents a less ideal but still functional response. The goal is to keep the user in motion: if you cannot give them the perfect answer, give them a usable one; if you cannot give them a usable one, give them a clear failure message they can act on; never give them a silent broken state.

Primary call -> Retry with backoff -> Fall back to smaller model -> Serve cached response -> Show clear error UI

Retry with backoff is the first line of defense. Transient failures (timeouts, 5xx errors, occasional rate limits) often resolve on the second or third attempt. The standard pattern: exponential backoff with jitter, capped at maybe 3 attempts and 5 seconds total wait. The Anthropic SDK and the OpenAI SDK both have built-in retry logic with sensible defaults, so for most cases you do not need to implement this from scratch. You do need to know what they retry and what they do not.

Falling back to a smaller model is a useful intermediate step. If your primary call to Sonnet fails, retry with Haiku. The output quality drops, but a working answer at lower quality is usually better than no answer. This is especially valuable for high-volume, low-stakes features where any reasonable response is acceptable. Implement it as a wrapper that takes a list of model IDs in priority order and tries each on failure.
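
A sketch of that wrapper, with a hand-rolled backoff shown for illustration even though the SDK retries transient errors on its own; the second model ID is a placeholder for whatever smaller tier you configure.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

export async function completeWithFallback(
  prompt,
  models = ["claude-sonnet-4-5", "YOUR_SMALLER_MODEL_ID"], // priority order: primary first, fallback after
) {
  for (const model of models) {
    for (let attempt = 0; attempt < 3; attempt++) {
      try {
        return await client.messages.create({
          model,
          max_tokens: 1024,
          messages: [{ role: "user", content: prompt }],
        });
      } catch (err) {
        // Exponential backoff with jitter: roughly 500ms, 1s, 2s, plus up to 250ms of noise.
        await sleep(500 * 2 ** attempt + Math.random() * 250);
      }
    }
  }
  throw new Error("All models in the fallback ladder failed");
}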

Cached responses are the next fallback. If the same query has been answered before, even imperfectly, serving the cached answer is better than failing the request. This is one place where a slightly aggressive semantic cache earns its keep: when the API is down, the threshold for "close enough" can drop because the alternative is no response at all. Some teams maintain a "last known good response" cache specifically for fallback, separate from their normal optimization cache.

The circuit breaker pattern prevents cascading failures. If your LLM provider is failing 80% of requests in the last 60 seconds, do not keep hitting them. Open the circuit (stop making calls), serve from cache or static fallbacks for some interval, then half-open (try one or two requests to see if they recover), and close the circuit when health returns. Libraries like opossum in Node or pybreaker in Python implement this directly. The benefit: you avoid spending API budget on calls that will fail, you avoid making your own latency worse by waiting on doomed requests, and you protect the provider from your retries amplifying their incident.
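
A sketch using opossum; the thresholds are illustrative and should be tuned to your traffic, and the fallback string mirrors the user-facing message in the next paragraph.

import Anthropic from "@anthropic-ai/sdk";
import CircuitBreaker from "opossum";

const client = new Anthropic();

async function callClaude(prompt) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  return response.content[0].type === "text" ? response.content[0].text : "";
}

const breaker = new CircuitBreaker(callClaude, {
  timeout: 10_000,              // treat a call as failed after 10 seconds
  errorThresholdPercentage: 50, // open the circuit when half of recent calls fail
  resetTimeout: 30_000,         // after 30 seconds, half-open and probe with one call
});

// While the circuit is open, serve a static fallback instead of hitting the API.
breaker.fallback(() => "AI assistant is temporarily unavailable. You can still browse and search manually.");

export const completeWithBreaker = (prompt) => breaker.fire(prompt);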

The user-facing failure state matters. "Something went wrong, please try again later" is the minimum bar. Better is something that hints at the cause and the user's options: "AI assistant is temporarily unavailable. You can still browse and search manually." Best is a fallback feature that does something useful without the AI (a basic keyword search instead of semantic search, a FAQ list instead of a chatbot). Treat the AI as an enhancement layered over a working baseline; if the AI fails, the baseline still works.

Multi-provider fallback is the highest level of resilience. Some teams use the Vercel AI Gateway or a similar proxy layer to route between Anthropic, OpenAI, and Google. If Claude is down, you fail over to GPT-4 or Gemini Pro. The catch: prompts tuned for one model may not work as well on another, and your output schema may differ subtly. For most products this is overkill until you have specific reliability requirements (enterprise SLAs, regulated workloads). For products that need it, plan for it from day one because retrofitting multi-provider support after launch is painful.

The fallback playbook in summary: build the chain explicitly with retry, smaller model, cached response, baseline non-AI feature, and clear failure UI. Add a circuit breaker for cascading failure prevention. Multi-provider routing is for products with hard reliability requirements; for most features, a clean fallback ladder on a single provider is enough. Test the fallback paths regularly. The first time you discover that your fallback is broken should not be during an actual provider incident with users watching the feature fail.

Evaluating Output Quality

Without evals, you are flying blind. Every prompt change is a coin flip. You think you improved it; the user notices it got worse on a case you forgot about. The change ships, complaints roll in, and you scramble to figure out what broke. The fix is not heroic effort or genius prompts. It is a simple, boring, maintained eval set, run on every prompt change, that tells you whether the change made things better or worse.

An eval set is a list of inputs paired with expected outputs or expected properties of outputs. For a classification task: "this email is a complaint" -> expected category "complaint." For a summarization task: "this 2,000-word article" -> expected summary contains key facts X, Y, Z. For a chatbot: "user asks question Q in context C" -> expected answer matches a reference, or passes a rubric (mentions the right policy, correct tone, no hallucination).

  1. Collect 20-50 real cases. Pull from production logs or write them by hand. Cover happy paths, edge cases, and known failure modes. More is better; 20 is the floor.
  2. Define the success criterion per case. Exact match, regex match, structured field equality, contains keyword, passes a rubric, or LLM-as-judge against a reference. Pick what fits.
  3. Run a baseline. Send each case through your current prompt. Score each one. Record the pass rate and the failures.
  4. Change one thing. Prompt tweak, model change, parameter adjustment. One variable at a time so the delta is attributable.
  5. Re-run and compare. Did the pass rate go up or down? Did any cases regress? Inspect the diffs case by case, not just the aggregate.
  6. Ship if better, otherwise iterate. The eval is the source of truth. If the change does not improve the eval, do not ship it just because it feels nicer.

Offline evals run before deployment. You curate a fixed test set, you score outputs against expected behavior, you track the score over time. This is the slow, reliable feedback loop that catches regressions. It is also the only honest measurement of "did this prompt change help?" Without an offline eval, you are doing taste-based prompt engineering, which is exactly as accurate as it sounds.

Online metrics run after deployment. Thumbs-up and thumbs-down on responses. Completion rate (did the user actually use the AI output, or did they immediately rephrase?). Time-on-result. Implicit feedback from session duration and follow-up actions. These signals are noisy and biased (only some users provide explicit feedback, and they are not representative), but they catch problems offline evals miss. The combination is the right answer: offline evals as the development feedback loop, online metrics as the production health signal.

LLM-as-judge is the technique that makes large eval sets practical. For tasks where the success criterion is fuzzy (was the summary "good"? was the response "helpful"?), a human grader is too slow. Instead, you use a separate, often more capable LLM to grade the output. Claude Opus grading Sonnet's outputs is a common pattern. The grader is given the input, the output, the reference (if any), and a rubric. It scores the output. Inter-rater agreement with human graders is generally 80-90% for well-designed rubrics, which is good enough for most product-quality decisions.
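
A sketch of a judge call; JUDGE_MODEL is a placeholder for whatever stronger model you grade with, and the rubric wording is illustrative.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const JUDGE_MODEL = "YOUR_OPUS_MODEL_ID"; // placeholder: use a stronger model than the one being graded

export async function judgeSummary(article, summary) {
  const response = await client.messages.create({
    model: JUDGE_MODEL,
    max_tokens: 10,
    messages: [{
      role: "user",
      content: [
        "You are grading a summary against a rubric.",
        "Rubric: the summary must be under 120 words, contain no claims absent from the article, and mention the article's main conclusion.",
        `Article:\n${article}`,
        `Summary:\n${summary}`,
        'Reply with exactly one word: "pass" or "fail".',
      ].join("\n\n"),
    }],
  });

  const verdict = response.content[0].type === "text" ? response.content[0].text.trim().toLowerCase() : "fail";
  return verdict === "pass";
}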

The tooling is reasonable. Promptfoo is a CLI tool for running prompt-vs-prompt comparisons against eval sets, with built-in support for many providers and assertion types. LangSmith (from LangChain) provides eval orchestration with a UI for tracing and analysis. The Anthropic Workbench includes prompt testing and evaluation. OpenAI's Evals framework is open source. For most teams, a 200-line custom Python or TypeScript script that loads a JSON file of test cases, sends them through the API, scores the outputs, and writes results to a CSV is enough to get started. Tools come later.
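
A sketch of that kind of script, assuming a cases.json file of inputs and required keywords; the file format, assertion style, and system prompt are illustrative.

import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Assumed format: [{ "input": "...", "mustContain": ["refund", "14 days"] }, ...]
const cases = JSON.parse(fs.readFileSync("cases.json", "utf8"));

async function runEval(systemPrompt) {
  const rows = [["input", "passed"]];
  let passed = 0;

  for (const testCase of cases) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 512,
      system: systemPrompt,
      messages: [{ role: "user", content: testCase.input }],
    });
    const output = response.content[0].type === "text" ? response.content[0].text : "";

    // Simple assertion: every required keyword must appear in the output.
    const ok = testCase.mustContain.every((kw) => output.toLowerCase().includes(kw.toLowerCase()));
    if (ok) passed++;
    rows.push([JSON.stringify(testCase.input), String(ok)]);
  }

  fs.writeFileSync("eval-results.csv", rows.map((r) => r.join(",")).join("\n"));
  console.log(`${passed}/${cases.length} cases passed`);
}

runEval("You are a support assistant. Answer concisely and cite the relevant policy.");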

The mindset shift is the hard part. AI engineers used to traditional software testing want a green-or-red answer. Evals do not give that. They give a pass rate (something like 87 out of 100 cases passed), and the question is whether 87 is better than the previous 84, and whether the 13 failures are acceptable for the use case. You are doing statistics, not deterministic testing. The teams that get good at this learn to think probabilistically about quality.

The eval summary: build evals before you tune prompts. A simple test set of 20-50 cases, scored on every change, beats taste-based iteration every time. Combine offline evals (development feedback) with online metrics (production health). Use LLM-as-judge to scale. The teams who do this ship better AI features faster. The teams who skip it ship faster initially and then spend the next three months apologizing for regressions and quality drops they did not see coming. Eval discipline pays for itself within the first week of production traffic.

The Agent Pattern

An agent is a loop. The model takes input, picks an action from a set of tools, executes it (or asks you to execute it), observes the result, decides on the next action, and continues until it produces a final answer or hits a stopping condition. This pattern enables features that pure prompts and even RAG cannot reach: the model can navigate a problem space, gather information as needed, recover from errors, and synthesize a result that depends on multiple intermediate steps.

The classic agent example is research. The user asks "what are the top three open-source vector databases by GitHub stars and what are their key tradeoffs?" A pure prompt cannot answer this; the model does not know current GitHub stars and would either guess or refuse. RAG cannot answer it without indexing every relevant repo. An agent with a web search tool, a GitHub API tool, and the ability to read the results can plan: search for vector databases, look up stars, read the README of each top result, synthesize a comparison. The same loop pattern applies to debugging tasks, expense reports, customer support escalations, and a wide range of multi-step workflows.

Pure prompt

Single API call. Stateless. Best for classification, summarization, content generation, single-shot Q&A. Latency: 1-5 seconds. Cost: predictable per request. Failure mode: bad output, retry. Use when the task fits in one prompt and the model has the context it needs.

RAG

One retrieval, one generation. Stateless after retrieval. Best for grounded Q&A over a known corpus. Latency: 2-7 seconds. Cost: vector search plus one generation. Failure mode: retrieval misses the right chunk. Use when the answer is in your data and the question is single-step.

Agent

Multiple turns, tool calls, branching paths. Best for multi-step exploration where the path is not known in advance. Latency: 10-60+ seconds. Cost: variable, often 5-20x a single prompt. Failure mode: the model loops, picks bad tools, or runs out of budget. Use only when simpler patterns provably fail.

Agents are expensive in three ways. They make many API calls, often 5 to 20 per task, where each call carries the full conversation context (input cost grows fast). They have unpredictable latency, because the model decides how many turns it needs. They are harder to debug, because the model's choices are non-deterministic and a failed run might be a prompt issue, a tool issue, a retrieval issue, or just bad luck.

The cost discipline for agents starts with budgets. Set a hard cap on the number of turns (maybe 10 by default, configurable per task). Set a hard cap on input tokens accumulated (maybe 100,000). Set a wall-clock timeout (maybe 60 seconds for interactive use, 300 seconds for batch). When any of these trip, you stop the agent and return whatever partial result it produced. Without these, an agent can loop indefinitely, burning API credits and producing nothing.
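
A sketch of a budgeted agent loop; tools and executeTool are assumed to be defined elsewhere (your tool schemas and your dispatcher), and the default caps mirror the numbers above.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function runAgent(
  task,
  tools,
  executeTool,
  budgets = { maxTurns: 10, maxInputTokens: 100_000, timeoutMs: 60_000 },
) {
  const messages = [{ role: "user", content: task }];
  const startedAt = Date.now();
  let inputTokens = 0;

  for (let turn = 0; turn < budgets.maxTurns; turn++) {
    // Stop when any budget trips: turns, accumulated input tokens, or wall-clock time.
    if (Date.now() - startedAt > budgets.timeoutMs || inputTokens > budgets.maxInputTokens) break;

    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 2048,
      tools,
      messages,
    });
    inputTokens += response.usage.input_tokens;

    // Done: the model produced a final answer instead of another tool call.
    if (response.stop_reason !== "tool_use") {
      return response.content.find((b) => b.type === "text")?.text ?? "";
    }

    // Execute every tool call in this turn and feed the results back.
    const toolResults = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const result = await executeTool(block.name, block.input);
      toolResults.push({ type: "tool_result", tool_use_id: block.id, content: JSON.stringify(result) });
    }
    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: toolResults });
  }

  // Budget exhausted: return whatever partial result exists rather than looping forever.
  return null;
}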

Tool design matters more than prompt design for agents. The tools you expose are the action space the model has to work with. Bad tools (vague names, unclear parameters, ambiguous return values) lead to bad agent behavior. Good tools (clear names, narrow scope, structured returns) lead to clear traces and reliable outcomes. Each tool description should read like API documentation: what it does, what arguments it takes, what it returns, what failure modes look like. The model treats your tool descriptions as ground truth, and it will use the tools the way you describe them.

The Claude API supports tool use natively through the tools parameter on the messages endpoint. You define tools as JSON schemas, the model decides when to call them, you execute and return the result with a tool_result block, and the conversation continues. The Vercel AI SDK provides a higher-level tool abstraction that handles the schema, the call, and the typing automatically. LangChain has agent abstractions but tends to add complexity that most product features do not need; for new builds, the AI SDK or direct API calls are usually simpler.

When not to reach for agents: when the task is single-step, when the right answer is one query and one response, when the failure mode of "model picks wrong tool" is worse than just hardcoding the workflow. Most product features are not agentic. A document summarizer is not an agent. A customer support chatbot answering FAQ is not an agent. A code completion tool is not an agent. These are pure prompt or RAG features, and dressing them up as agents adds cost and brittleness without adding capability.

Agents earn their place in three rough categories. Research-style tasks where the model needs to gather information from multiple sources. Workflow automation where each step depends on the previous result and the path is data-dependent. Coding assistants that need to read files, run commands, and respond to errors. Outside these patterns, the agent label is usually marketing.

Takeaway

An agent is a loop with tools. Use them for genuine multi-step problems where the path is not known in advance. Set hard budgets on turns, tokens, and time. Design clear, narrow tools. Default to pure prompt or RAG for most features; the agent label gets thrown around too freely, and most "agentic" products would be simpler and better as straightforward LLM calls.

Putting It Together

A working AI feature stack looks like this. You start with a clear definition of what the feature does, what it does not do, and what the failure mode is. You pick the simplest pattern that handles the problem: pure prompt if the input fits, RAG if the data is large, tool use if the model needs to act, agents only if you need a loop. You build evals first, then iterate on prompts. You stream responses. You cache aggressively. You set fallbacks. You track cost and latency the way you track errors. None of this is glamorous; all of it is the difference between a feature that ships and a feature that gets demoed once and abandoned.

The model itself is the easiest part of the stack to swap. Anthropic, OpenAI, Google, Meta, and the open-weights ecosystem all ship capable models with broadly similar APIs. Claude Sonnet is our default recommendation for most production work because the quality is consistent, the long context is reliable, and the prompt caching support is mature. Switch to OpenAI for specific features where their tool ecosystem (DALL-E, Whisper, code interpreter) gives you something off-the-shelf. Switch to Gemini Pro when you need extremely long context or when cost per token is the binding constraint. The portable skill is not loyalty to one provider; it is the ability to evaluate models against your eval set and pick the one that wins on the metrics that matter for your feature.

The infrastructure layer keeps growing. The Vercel AI SDK provides a unified TypeScript interface across providers with first-class streaming, tool use, and React integration. The Vercel AI Gateway centralizes provider routing and observability. LangChain and LlamaIndex provide higher-level abstractions for RAG and agent patterns; useful for prototyping, sometimes overkill in production. Promptfoo and LangSmith help with evals. Pinecone, Weaviate, Qdrant, and pgvector handle the vector store. Pick the parts you need; resist the urge to install the entire stack on day one.

The two failure modes to watch out for are scope creep and capability worship. Scope creep is when a feature that was supposed to summarize emails grows into a full customer support automation system; the eval gets harder, the prompts get tangled, and quality drops. Capability worship is when the team picks the largest, most expensive model "for quality" without testing whether a smaller model would work; cost balloons, latency tanks, and the actual user experience does not improve. Both come from the same root cause: not letting the data drive the decision. Build evals, run the comparisons, and let the numbers tell you what to ship.

Every shipped AI feature is a bet that the model can do this specific thing reliably enough at this specific cost in this specific time budget. Pick the bets you can win. Skip the ones where the model almost-does-it but the failure mode is worse than no feature at all. The wins compound: each working feature buys credibility for the next, while each broken one teaches the user not to trust your AI promises. Ship the ones that work, and skip the ones that do not.