A backend built with AI is only as good as the API contract you defined before the agent started typing. The contract is the load-bearing decision. Everything downstream is implementation.
This is the part of AI-assisted backend work that most people get wrong. They open a chat, ask for a user signup endpoint, and accept whatever the model produces. The result is code that works for the demo and falls apart the first time anyone hits it from a real client. The fix is not a smarter model. The fix is doing the work before the typing starts: define the request shape, the response shape, the error shape, the auth requirement, the rate limit, and the side effects. Hand the agent constraints, not vibes.
What follows is the discipline that makes AI-built backends shippable. None of it is exotic. All of it is the same thing senior backend engineers have done for fifteen years, except now you are explicit about it because the agent needs the explicit version to produce the right code.
API Design Before Implementation
Contract-first means you write the request and response shapes before any handler code exists. Field names, types, optional versus required, error envelopes, status codes, idempotency rules. Once that exists, everything else is mechanical translation, and translation is exactly what AI agents are good at.
The reason this works has nothing to do with AI specifically. It works because you have separated two distinct cognitive tasks. Designing an API is a thinking task: what does this endpoint mean, who calls it, what happens if it fails, what does the next version look like. Implementing an API is a typing task: take the spec, write the route, validate the input, call the database, format the output. AI agents are decent at the second and unreliable at the first. Contract-first puts the right work in the right place.
The tools you can use to express the contract:
- OpenAPI 3.1 for REST. Verbose but every code generator and documentation tool understands it. Best when the API is consumed by external clients or multiple internal teams.
- GraphQL SDL for GraphQL. The schema is the contract; there is no separate spec.
- tRPC type definitions for TypeScript-only stacks. The types are the contract; clients import them directly.
- Plain TypeScript interfaces with Zod schemas for runtime validation. Lightweight, no extra tooling, works for internal APIs where you control both ends.
- Protocol Buffers for gRPC, or when you have polyglot services and need binary efficiency.
Pick one and commit. The worst outcome is half-using two of them, where the OpenAPI spec drifts from the actual handler signatures and nobody trusts either.
Here is a contract-first OpenAPI fragment for a user creation endpoint. Notice that the error responses are defined as carefully as the success response.
paths:
  /v1/users:
    post:
      summary: Create a user
      operationId: createUser
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [email, password]
              properties:
                email:
                  type: string
                  format: email
                  maxLength: 254
                password:
                  type: string
                  minLength: 12
                  maxLength: 128
                display_name:
                  type: string
                  maxLength: 80
      responses:
        '201':
          description: User created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/User'
        '400':
          description: Validation failed
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ValidationError'
        '409':
          description: Email already in use
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Conflict'
        '429':
          description: Rate limit exceeded
components:
  schemas:
    User:
      type: object
      required: [id, email, created_at]
      properties:
        id: { type: string, format: uuid }
        email: { type: string, format: email }
        display_name: { type: string, nullable: true }
        created_at: { type: string, format: date-time }
Fifty-odd lines of YAML and the agent now has every constraint it needs. The handler it generates from this will be reviewable in minutes because every line maps back to a clause in the contract. If the handler does something the contract did not authorize, that is a defect, not an opinion.
The other side benefit of contract-first work shows up in code review. Without a contract, every reviewer has to reason from scratch about whether the handler is doing the right thing. With a contract, the question becomes mechanical: does the handler match the spec? You can check it in two passes. First pass, read the spec. Second pass, read the handler with the spec next to it. Anything in the handler that is not authorized by the spec is a defect or a contract gap, and either way it needs a decision before merging.
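The plain-TypeScript-with-Zod option from the tools list expresses the same contract in less space. A minimal sketch, mirroring the YAML above; the export names are illustrative, not a convention:

import { z } from 'zod';

// Request contract: same fields and constraints as the OpenAPI fragment.
export const CreateUserRequest = z.object({
  email: z.string().email().max(254),
  password: z.string().min(12).max(128),
  display_name: z.string().max(80).optional(),
});

// Success response contract for 201.
export const User = z.object({
  id: z.string().uuid(),
  email: z.string().email(),
  display_name: z.string().nullable(),
  created_at: z.string().datetime(),
});

// The schema is also the type: clients import these instead of a spec.
export type CreateUserRequest = z.infer<typeof CreateUserRequest>;
export type User = z.infer<typeof User>;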
One more practical detail. Treat the contract as a versioned artifact, not a living document that gets updated in place. When you change a contract, you bump the version. Existing clients keep talking to the old version, new clients pick up the new one. Breaking changes go in a major version (v1 to v2), additive changes go in a minor version. AI agents will happily generate breaking changes by default ("just add a required field, it makes the API cleaner") so the contract version discipline lives with the human reviewing the change.
Code generation from contracts is where the productivity win is real. From a single OpenAPI spec you can generate: TypeScript types, request validators, mock servers, API documentation, client SDKs in five languages, and Postman collections. Tools like openapi-zod-client, orval, openapi-typescript, and Stainless do the heavy lifting. Make the contract change once and every downstream artifact updates automatically. AI agents are good at filling the gaps the generator does not cover (custom auth flows, edge-case formatting, business-specific defaults).
REST vs GraphQL vs RPC, and Which One You Should Pick
This is the question that eats weeks of architecture meetings and produces almost no business value. The honest version of the answer is short: most backends should be REST. The other two are answers to specific problems and should be chosen only when you actually have those problems.
- REST. The right default. Every tool understands it. Easy to cache at the edge. Easy to debug with curl. Pagination is well-understood. Versioning is well-understood. The downside is over-fetching: clients pull more fields than they need. For most apps this does not matter.
- GraphQL. Right when you have many client types with different data needs: a web app, a mobile app, an admin dashboard, a partner integration. The single schema lets each client request exactly what it needs. The cost is operational: caching is harder, query cost is non-obvious, and untrusted clients can write expensive queries unless you use persisted queries.
- RPC (tRPC or gRPC). Right when client and server share a language, especially TypeScript. The types flow end-to-end without a separate spec. The cost is interoperability: outside clients cannot consume tRPC easily, and gRPC needs HTTP/2 and a code generator.
The numbers behind the recommendation. A REST API serving JSON has been the dominant pattern since 2010. Every cloud provider, every CDN, every monitoring tool, every load tester, every API gateway treats REST as the first-class citizen. Caching a GET response at Cloudflare or Fastly takes one header. Caching a GraphQL POST takes either persisted queries or client-side normalization. The same goes for rate limiting, request logging, and replay debugging. You pay an operational tax for every non-REST choice. Sometimes that tax is worth it. Often it is not.
When AI agents generate REST handlers, the output is usually fine because there is twenty years of REST code in the training data. When they generate GraphQL resolvers, the output is correct in shape but often misses the N+1 query problem and dataloader patterns. When they generate gRPC, the output usually compiles but the streaming semantics need a careful read. The complexity of the framework correlates with the supervision burden.
The framework choices within REST that are worth knowing in 2026:
- Express on Node.js. Still ubiquitous. Maintenance is steady. The middleware ecosystem is enormous. AI agents produce correct Express code at very high rates.
- Hono for edge runtimes (Cloudflare Workers, Deno Deploy, Vercel Edge). Smaller, faster, modern API. Works on Node.js too. Increasingly the default for new TypeScript projects.
- NestJS when you want opinionated structure: dependency injection, modules, decorators, a clear way to organize a large codebase. Heavier than Express but the structure pays off past 50 endpoints.
- Fastify when raw throughput matters and you want JSON schema validation built in. Faster than Express by 2-3x in benchmarks.
- FastAPI on Python when the team is Python-first. Excellent OpenAPI integration, Pydantic for validation, async support.
- Go's net/http with a router like chi when you need predictable performance and small binaries. The standard library is good enough on its own.
The choice between Express and Hono in 2026 deserves a closer look because it is the most common decision for new TypeScript backends. Express is older, slower per request, and has a callback-style API that predates async/await. The reason it persists is gravity: the middleware ecosystem is enormous, every Stack Overflow answer assumes Express, every contractor knows it. Hono is faster, smaller, has a modern API built around Web Fetch (Request, Response), runs on every modern runtime including Cloudflare Workers and Bun, and the type inference for routes is cleaner. If you are starting today and you are not bound by an existing Express codebase, Hono is the better choice. If you are bound, the migration cost from Express to Hono is real but bounded; you can do it route by route.
The TypeScript-first stack worth committing to: Hono for routing, Zod for validation, Drizzle or Prisma for database access, Vitest for testing, OpenAPI generation from Hono routes via @hono/zod-openapi. The whole stack composes. The agent generates code that fits the stack because every piece reinforces the others. The contract is in TypeScript types, the validation runs from those types, the database layer is typed, the tests use the same types, and the OpenAPI spec falls out automatically.
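A minimal sketch of how the pieces snap together, assuming the CreateUserRequest schema from the contract section and a hypothetical createUser service function:

import { Hono } from 'hono';
import { zValidator } from '@hono/zod-validator';
import { z } from 'zod';

const CreateUserRequest = z.object({
  email: z.string().email().max(254),
  password: z.string().min(12).max(128),
});

declare function createUser(input: z.infer<typeof CreateUserRequest>): Promise<{ id: string; email: string }>; // hypothetical service

const app = new Hono();

// zValidator rejects bad input with a 400 before the handler runs,
// and c.req.valid('json') is typed from the schema.
app.post('/v1/users', zValidator('json', CreateUserRequest), async (c) => {
  const body = c.req.valid('json');
  const user = await createUser(body);
  return c.json(user, 201);
});

export default app;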
For Python, FastAPI on Pydantic 2 is the equivalent recommendation. The Pydantic v2 rewrite (released 2023) made validation 5-50x faster than v1, which removed the last performance argument against Python for API work. Combine FastAPI with SQLAlchemy 2 (typed ORM) and pytest, and you get a stack that holds together the same way the Hono one does.
Authentication and Authorization
Authentication is who you are. Authorization is what you can do. They are different problems, they fail in different ways, and you should implement them in that order with auditing as the third step.
The patterns that always come up:
- Session cookies. The original, still the most reliable for browser-based apps. The session ID is opaque, stored server-side, transmitted in an HttpOnly Secure cookie with SameSite set to Lax or Strict. Logout invalidates the session immediately because the server controls the lookup table.
- JWT. A signed token that contains claims. Fast because verification needs no database call. The downside is revocation: a compromised token is valid until expiry unless you maintain a revocation list, which defeats the stateless property. Use short expiry windows (15 minutes is typical) with refresh tokens that hit the database.
- OAuth 2.0 and OpenID Connect. The protocol for delegating auth to a provider: Google, Apple, Microsoft, GitHub. You almost never want to implement OAuth from scratch. Use a library or a hosted service.
- API keys. For server-to-server and developer-facing APIs. Long random strings, scoped to a single client, revocable independently. Hash them at rest the same way you hash passwords.
- WebAuthn and passkeys. Passwordless using device-bound credentials. Harder to phish, harder to leak, increasingly supported across browsers and platforms.
The hosted services that save real time: Auth0, Clerk, Supabase Auth, Firebase Auth, AWS Cognito, and WorkOS all take login, sessions, MFA, password reset, and the social providers off your plate. The trade is vendor coupling and per-user pricing; the win is that you stop maintaining the most security-critical code in the app yourself.
The implementation order that actually works in practice:
1. Pick a provider or library. Get login, logout, password reset, and email verification working. Test that sessions expire correctly and that logout invalidates server-side state.
2. Decide how the user identity flows through your handlers. Middleware that attaches a verified user object to the request is the standard pattern. The handler should never have to re-verify.
3. Decide the granularity. Roles (admin, user) are the simplest. Permissions (read:posts, write:posts) are more flexible. Resource-level rules (this user owns this post) are the most expressive. Most apps need a mix.
4. Centralize the check. A middleware or decorator that says requireRole('admin') or canAccess(user, resource). Do not scatter if-statements across handlers.
5. Log every privileged action. Who, what, when, from which IP, with which token. Audit logs are write-only, retained separately from operational logs, and reviewed during incidents.
6. Before launch, walk every endpoint and ask: who can call this, what happens if a user calls it with another user's ID, what happens if the token is missing or invalid. AI agents miss these checks regularly.
The single biggest mistake in AI-generated auth code is missing the authorization check on resource access. The handler authenticates the caller, looks up the requested resource, and returns it without checking that the caller is allowed to see it. This is called insecure direct object reference and it is the single most common vulnerability in AI-generated CRUD code. The fix is the authorization middleware: never let a handler return a resource without an explicit canAccess(user, resource) check.
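A sketch of what the explicit check looks like in a Hono handler; the ownership rule and the loadPost lookup are illustrative:

import { Hono } from 'hono';

type User = { id: string; role: 'admin' | 'user' };
type Post = { id: string; ownerId: string; title: string };

declare function loadPost(id: string): Promise<Post | null>; // hypothetical lookup

// Centralized policy: admins see everything, users see what they own.
function canAccess(user: User, resource: { ownerId: string }): boolean {
  return user.role === 'admin' || resource.ownerId === user.id;
}

const app = new Hono<{ Variables: { user: User } }>();

app.get('/v1/posts/:id', async (c) => {
  const user = c.get('user'); // attached by the auth middleware upstream
  const post = await loadPost(c.req.param('id'));
  if (!post) return c.json({ error: { code: 'not_found' } }, 404);
  // The line AI-generated handlers omit: authenticated is not authorized.
  if (!canAccess(user, post)) return c.json({ error: { code: 'forbidden' } }, 403);
  return c.json(post);
});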
For tokens specifically, when you need JWT, sign with EdDSA (Ed25519) or ES256 if you can. HS256 with a shared secret is fine for single-service apps but breaks down across services. Rotate signing keys on a schedule (every 90 days is a reasonable target). Store the keys in a secrets manager, not in environment variables checked into git.
Rate limiting deserves its own paragraph in the auth section because the two concerns intersect. Login endpoints get brute-forced. Password reset endpoints get used for email enumeration. Token refresh endpoints get hammered. The right pattern is per-IP and per-account rate limiting on auth-adjacent routes, with sliding windows. Five failed login attempts per IP per minute, ten per account per hour, with exponential backoff on the response. The libraries: express-rate-limit on Node (or the rate-limiter-flexible package, more flexible), slowapi on Python, the equivalent middleware in NestJS. AI agents do not add rate limiting unless you ask. Ask.
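A sketch of the per-IP login limiter with express-rate-limit, assuming the v7 option names; the window and limit are the numbers from the paragraph above:

import express from 'express';
import { rateLimit } from 'express-rate-limit';

const app = express();
app.use(express.json());

// Five attempts per IP per minute, on the login route only.
const loginLimiter = rateLimit({
  windowMs: 60 * 1000,
  limit: 5,
  standardHeaders: true, // emit RateLimit-* headers so clients can back off
  legacyHeaders: false,
});

app.post('/v1/auth/login', loginLimiter, (req, res) => {
  // The per-account rule (ten per hour) needs shared state, e.g. a Redis
  // counter keyed by the account, because it must hold across instances.
  res.status(501).json({ error: { code: 'not_implemented' } });
});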
The session lifecycle that handles real-world cases. A session has an expiry and an idle timeout. The expiry is the maximum age (30 days is typical for "remember me", 8 hours for sensitive apps). The idle timeout is how long without activity before logout (30 minutes for banking, 24 hours for consumer apps). Both timers run, whichever expires first ends the session. On every request, you bump the idle timer. On expiry, you redirect to login. The agent often gets one of these timers and forgets the other; spec out both.
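Both timers, spelled out. A minimal sketch using the consumer-app numbers above:

type Session = { createdAt: number; lastSeenAt: number };

const MAX_AGE_MS = 30 * 24 * 60 * 60 * 1000; // 30-day absolute expiry
const IDLE_MS = 24 * 60 * 60 * 1000; // 24-hour idle timeout

// Both timers run; whichever fires first ends the session.
function sessionAlive(s: Session, now = Date.now()): boolean {
  return now - s.createdAt < MAX_AGE_MS && now - s.lastSeenAt < IDLE_MS;
}

// On every authenticated request: check first, then bump the idle timer.
function touchSession(s: Session, now = Date.now()): Session | null {
  if (!sessionAlive(s, now)) return null; // caller redirects to login
  return { ...s, lastSeenAt: now };
}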
Multi-factor auth is the right default for any privileged account in 2026. The cost has dropped: TOTP via apps like Authy, Duo, or 1Password, push notifications via Auth0 Guardian or Okta Verify, WebAuthn for passkey support, SMS as a fallback you discourage. Email magic links are fine for low-security flows but are not a replacement for MFA. Decide which roles require MFA, enforce at login, and re-prompt for MFA when the user does anything sensitive (changing email, exporting data, accessing financial records).
One more pattern. Service-to-service authentication. When backend A calls backend B, there is no human user. The two services authenticate with each other using either mTLS (mutual TLS, certificates on both sides), signed JWTs with short expiry, or scoped API keys. mTLS is the most secure, also the highest operational overhead. JWTs are the standard middle ground. AI agents tend to default to long-lived API keys for service-to-service auth, which is the worst option. Push back on that pattern in review.
Error Handling Patterns
Inconsistent error responses are the silent killer of API quality. The success path gets attention because that is what the demo shows. The error paths get whatever the framework produces by default, which is usually wrong, leaky, or both.
The contract for errors should be as explicit as the contract for success. Every error response has the same shape: a status code that follows HTTP semantics, a structured body that the client can parse, and a machine-readable code that does not change between releases.
The status code rules:
- 4xx is client error. The caller did something wrong. Bad input, missing auth, requesting a resource they cannot access, requesting a resource that does not exist. The client can fix this by changing what they send.
- 5xx is server error. The server failed at something the caller could not have prevented. Database is down, an upstream API timed out, a panic in the handler. The client cannot fix this by retrying with different input.
- The boundary cases: 404 for "this resource does not exist." 403 for "you are authenticated but not authorized." 401 for "you are not authenticated." 422 for "the input was syntactically valid but semantically wrong." 429 for rate limiting. 409 for conflicts (duplicate keys, version mismatches).
The body shape that works across teams:
{
  "error": {
    "code": "validation_failed",
    "message": "Request body failed validation",
    "request_id": "req_8f2a1c3d",
    "details": [
      {
        "field": "email",
        "code": "invalid_format",
        "message": "Must be a valid email address"
      },
      {
        "field": "password",
        "code": "too_short",
        "message": "Must be at least 12 characters"
      }
    ]
  }
}
Three things to notice. First, the error is wrapped in an "error" key so the client can distinguish error responses from success responses without checking status codes (which still works, but the wrapper helps generic client libraries). Second, the code field is a stable machine-readable string. Clients write code against the code, not the message. The message is for humans, the code is for software. Third, validation errors are an array of per-field details so the client can render them next to the right form input.
Validation specifically is where AI-generated handlers are weakest. The agent will write a route, write a handler that pulls fields off the request body, and forget to validate that those fields are even present, let alone the right type. The fix is a validation library called from the framework as middleware, not from inside the handler.
The validation libraries to know:
- Zod on TypeScript. The schema is also the type. Compose schemas, derive types, parse and refine. The dominant choice in 2026.
- Yup, older, similar shape, still in use.
- Valibot, smaller bundle than Zod, growing for edge runtimes.
- Pydantic on Python. The standard since FastAPI made it ubiquitous.
- JSON Schema when you want validation that flows from your OpenAPI spec directly. AJV is the de facto standard implementation in JavaScript.
The "do not leak internal details" rule is non-negotiable. A 500 response should say "internal error" with a request ID and nothing else. The full stack trace, the SQL query that failed, the file path of the source file, the database connection string, all of that goes to logs and never to the client. AI agents default to verbose error messages because verbose error messages help during development. Strip them in production. The middleware that converts thrown exceptions to error responses should have two modes: development (verbose) and production (minimal). Default to production-safe behavior and only flip the switch when running locally.
Request IDs deserve their own paragraph. Every request gets a unique ID, generated at the entry point (the load balancer or the first middleware), attached to every log line for that request, and returned in every response (success or error). When a customer reports a problem, they give you the request ID, and you find every log line for that request in seconds. Without request IDs, support is guesswork.
Idempotency keys are the next layer up. For any non-GET endpoint that creates or modifies state, accept an Idempotency-Key header from the client. Store the result of the request keyed by that header for 24 hours. If the same key comes in again, return the cached result instead of re-running the operation. This is how Stripe, Square, and every other payment-grade API handles network retries. The client can retry safely without creating duplicate charges or duplicate users. The agent does not add idempotency keys by default; specify them in the contract for any endpoint that costs money or creates persistent state.
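A sketch of the replay path with an in-memory Map standing in for Redis; the header name follows the Stripe convention and the TTL is the 24 hours from the paragraph above:

import { Hono } from 'hono';

const app = new Hono();

// In production this lives in Redis with a TTL; a real version also stores
// the original status code and guards against two in-flight requests
// racing on the same key.
const results = new Map<string, { body: unknown; expiresAt: number }>();

app.post('/v1/charges', async (c) => {
  const key = c.req.header('Idempotency-Key');
  if (!key) return c.json({ error: { code: 'missing_idempotency_key' } }, 400);

  const hit = results.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    return c.json(hit.body as object, 201); // replay the stored result, never re-run
  }

  const body = { id: crypto.randomUUID(), status: 'succeeded' }; // stand-in for the real work
  results.set(key, { body, expiresAt: Date.now() + 24 * 60 * 60 * 1000 });
  return c.json(body, 201);
});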
Distinguish error categories that look similar but mean different things. A 404 because the resource never existed is different from a 404 because the resource was deleted is different from a 410 (Gone, "this resource used to exist and is permanently removed"). A 429 with a Retry-After header tells the client when to retry; a 429 without one tells them nothing useful. A 503 with a Retry-After is a graceful "we are overloaded, slow down"; a 500 is "we crashed, debug us." These distinctions feel pedantic until a client integration breaks because your API returns 500 for anything that goes wrong, including expected business cases.
The error catalog is the document that makes this manageable. Every machine-readable error code your API can return, listed in one place, with the HTTP status, the meaning, the typical cause, and the recommended client action. Generate it from the same source as the OpenAPI spec. Hand it to client developers along with the API docs. AI agents are great at writing error catalogs from existing code; ask the agent to walk every endpoint and produce the table.
Background Jobs and Queues
Synchronous request-response is the right default. The client sends, the server processes, the server responds, the connection closes. Most endpoints fit this model and should stay this way.
The cases where synchronous breaks down:
- Long operations. Anything past 10 seconds is a UX problem. Past 30 seconds and you start hitting load balancer timeouts. Past 60 seconds and you definitely hit them.
- Retries on transient failure. An email send fails because the SMTP provider is rate-limiting you. You want to try again in 30 seconds, then 2 minutes, then 10 minutes, before giving up.
- Scheduled work. Send the weekly digest on Friday at 9am. Run the cleanup job at midnight. Renew expiring certificates an hour before they expire.
- Decoupled side effects. The user signs up, and three things should happen: send a welcome email, create a Stripe customer record, log to the analytics pipeline. None of these should block the signup response. None of them failing should cause the signup to fail.
The pattern that handles all of this is a job queue. The HTTP handler enqueues a job. A worker (a separate process) picks the job off the queue, runs the work, marks the job done. If the work fails, the queue retries with backoff. If it fails enough times, the job goes to a dead-letter queue for human review.
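The shape of that pattern in BullMQ (first entry in the list below); a minimal sketch assuming a local Redis and a hypothetical sendWelcomeEmail:

import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const emails = new Queue('emails', { connection });

// Called from the HTTP handler: enqueue and return immediately.
export async function enqueueWelcomeEmail(userId: string) {
  await emails.add(
    'welcome',
    { userId },
    { attempts: 5, backoff: { type: 'exponential', delay: 30_000 } }, // bounded retries
  );
}

declare function sendWelcomeEmail(userId: string): Promise<void>; // hypothetical

// Runs in a separate worker process; a throw triggers the retry policy.
new Worker(
  'emails',
  async (job) => {
    await sendWelcomeEmail(job.data.userId); // must be idempotent: retries mean it can run twice
  },
  { connection },
);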
The queue products to know in 2026:
- BullMQ on Node.js with Redis. The default for TypeScript projects. Strong concurrency control, scheduled jobs, repeatable jobs, priorities, rate limiting, all built in. The dashboard (Bull Board or Arena) is mature.
- Celery on Python with Redis or RabbitMQ as the broker. The default for Python. Mature, deeply integrated with Django and Flask, well-documented.
- Sidekiq on Ruby with Redis. The default for Rails. The "pro" and "enterprise" tiers add features but the open-source version handles most apps.
- Inngest, hosted, language-agnostic, function-based. Define jobs as functions, deploy them, Inngest handles the queue and the retries. Pay per execution.
- Trigger.dev, similar to Inngest, TypeScript-first, with built-in long-running step orchestration.
- AWS SQS or Google Cloud Tasks when you are already on those platforms and want managed infrastructure. Less ergonomic than the language-specific libraries but you do not run anything yourself.
- Postgres-based queues like pg-boss or Graphile Worker. If you already have Postgres and your throughput is moderate, you may not need Redis at all.
The mistakes AI-generated queue code makes:
- No idempotency. The job runs twice (because of a retry, or two workers picked it up). The agent does not consider whether running the job twice is safe. Make every job idempotent: check before write, use unique constraints, store an idempotency key.
- Unbounded retries. The job fails, the retry count is infinite, the queue fills up. Always set a max retry count and a dead-letter destination.
- Synchronous-shaped jobs. The agent enqueues a job, then waits for it to finish in the same handler. This is the worst of both worlds: you have queue infrastructure but you blocked the response. Either run synchronously or run asynchronously, not both.
- Missing observability. The job runs, the job fails, nobody knows. Every job emits structured logs, success/failure metrics, and a trace span. Same observability as your HTTP handlers.
The performance numbers worth knowing. Redis-backed queues like BullMQ can sustain 10,000 to 50,000 jobs per second on a modest single-node Redis (depending on payload size). Postgres-based queues are slower (1,000 to 5,000 jobs per second on a single node) but the operational simplicity is real. Hosted queues (Inngest, Trigger) are slower per job because of the network round-trip but you stop running infrastructure.
Workflow orchestration is the next category up from job queues. Sometimes the work is not a single job but a sequence: charge the card, then provision the resource, then send the receipt, then trigger the analytics event, with rollback if any step fails. Tools like Temporal, AWS Step Functions, Inngest's step orchestration, or Trigger.dev's workflow API model this as a durable function. Each step has its own retries, its own timeout, and the whole sequence resumes after a process crash. For anything more complex than "send an email," workflow orchestration is worth considering.
Scheduled jobs deserve their own pattern. The naive approach is "run a cron on the server." This works until you have two servers, at which point both run the cron and you double-charge users. The fix is either a single dedicated scheduler process, or distributed locks (Redis, ZooKeeper) to ensure only one process runs the job at a time. BullMQ has built-in repeatable jobs that handle this. Inngest and Trigger handle it at the platform level. Whatever you pick, do not hand-roll it; the failure modes are subtle.
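The BullMQ repeatable-job version of the Friday digest, assuming a recent BullMQ where repeat takes a cron pattern:

import { Queue } from 'bullmq';

const digests = new Queue('digests', { connection: { host: 'localhost', port: 6379 } });

// BullMQ stores the schedule in Redis and deduplicates it, so two app
// instances registering the same repeatable job produce one Friday run.
await digests.add(
  'weekly-digest',
  {},
  { repeat: { pattern: '0 9 * * 5', tz: 'UTC' } }, // Fridays at 09:00
);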
The handoff between HTTP and queue is where bugs hide. The handler accepts a request, validates it, enqueues a job, returns 202 Accepted with a job ID. The client polls for the result, or subscribes to a webhook, or opens a websocket. The job runs, succeeds or fails, updates the status. The client checks the status and presents a result. Each link in this chain has a failure mode: the enqueue could fail, the worker could crash, the result could be lost. Build observability for the whole chain, not just the HTTP request.
Performance From Day One
The advice in older systems books is "make it work, then make it fast." The advice in 2026 is slightly different: make it work, then measure it, then make the measured-slow parts fast. The middle step is what gets skipped in AI-generated code, because the agent does not have a profiler.
The patterns that move the needle, ranked by typical impact:
N+1 is the number-one performance bug in AI-generated code. The agent writes a handler that fetches a list of posts, then loops through them and fetches each author. One query becomes N+1 queries. On a list of 100 posts, that is 101 round-trips to the database, each with its own latency. The fix is either eager loading (Prisma's include, Sequelize's include, SQLAlchemy's joinedload) or a dataloader pattern (batch the author lookups into a single IN query). When you review AI-generated code, the first thing to check is whether any iteration over a result set issues another query inside the loop.
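The bug and the fix side by side, using Prisma's include; the post/author models are illustrative:

import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// The N+1 shape: one query for the list, then one more per row.
async function postsWithAuthorsSlow() {
  const posts = await prisma.post.findMany({ take: 100 });
  return Promise.all(
    posts.map(async (post) => ({
      ...post,
      author: await prisma.user.findUnique({ where: { id: post.authorId } }), // 100 extra round-trips
    })),
  );
}

// The fix: eager-load the relation, one round-trip.
async function postsWithAuthorsFast() {
  return prisma.post.findMany({ take: 100, include: { author: true } });
}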
The caching layers, from outermost to innermost:
- Browser cache. Send the right cache-control headers and the browser stops asking. Free, no infrastructure.
- CDN cache. Cloudflare, Fastly, CloudFront. Cache public GETs at the edge, hundreds of milliseconds saved per request, less load on your servers.
- Application cache (Redis). The classic shared cache. Sub-millisecond reads on local Redis, single-digit milliseconds on cloud Redis. Use for query results, computed views, session lookups.
- In-process cache. A Map in memory, or a library like lru-cache. Fastest possible (nanoseconds) but only useful for data that does not change often and where staleness across instances is acceptable.
- Database query cache. Some databases (Postgres with shared_buffers, MySQL query cache historically) cache results internally. Free if the working set fits in RAM, useless if it does not.
Connection pooling is the un-glamorous fix that prevents production fires. Every Postgres connection costs RAM on the database side (the classic rule of thumb is around 10MB each). Opening a fresh connection per request adds tens of milliseconds of handshake latency, and an idle connection does nothing but consume that RAM. The fix is a pool: the application opens a small number of connections (typically 5-20 per process) and reuses them across requests. PgBouncer is the standard external pooler. The frameworks have their own internal pools (Prisma, knex, SQLAlchemy). Run both in production: the framework pool for in-process reuse, PgBouncer in transaction mode for cross-process pooling.
Pagination deserves a paragraph. OFFSET-based pagination (LIMIT 20 OFFSET 1000) gets slower as the offset grows because the database still scans the skipped rows. Cursor-based pagination (WHERE id < cursor LIMIT 20) is constant time. AI agents default to OFFSET because the SQL is shorter. Push them to cursor-based for any list that can grow past a thousand rows.
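Both shapes as SQL inside a TypeScript helper; query is a hypothetical stand-in for your driver:

type Row = { id: string; created_at: string };
declare function query(sql: string, params: unknown[]): Promise<Row[]>; // hypothetical driver call

// OFFSET: the database still walks the skipped rows, so page 500 is slow.
async function pageByOffset(page: number, size = 20) {
  return query('SELECT id, created_at FROM posts ORDER BY id DESC LIMIT $1 OFFSET $2', [size, page * size]);
}

// Cursor: constant time at any depth. The cursor is the last id the client saw.
async function pageByCursor(cursor: string | null, size = 20) {
  if (cursor === null) {
    return query('SELECT id, created_at FROM posts ORDER BY id DESC LIMIT $1', [size]);
  }
  return query('SELECT id, created_at FROM posts WHERE id < $2 ORDER BY id DESC LIMIT $1', [size, cursor]);
}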
The "measure before optimizing" rule matters more with AI than without it. The agent has been trained on a lot of code that includes a lot of premature optimization. It will happily add a Redis cache to a function that runs once a day. It will denormalize a table that has 100 rows. It will write a microservice for a problem that needs a function. The way to push back: ask for the latency budget, the QPS estimate, and the cost ceiling, and let those drive what gets optimized. Without numbers, the agent will optimize the wrong thing.
Database indexing is where AI agents are unreliable in a specific way. They will add an index when you ask for one, but they will not notice that the existing query needs one. The fix is to read the EXPLAIN output. For Postgres, EXPLAIN ANALYZE on the query that backs each list endpoint, looking for sequential scans on tables larger than a few thousand rows. For MySQL, EXPLAIN with extended output. The agent can interpret the EXPLAIN if you paste it in; what it cannot do is run the query against your data and notice the problem. That step is yours.
The classic indexing mistakes the agent makes when you do ask for indexes. First, indexing every column individually instead of a composite index that matches the actual WHERE clause. A query that filters on (tenant_id, status, created_at DESC) wants an index on those three columns in that order, not three separate indexes. Second, missing the WHERE clause selectivity. An index on a boolean column with two values is useless; the planner will scan the table anyway. Third, ignoring the cost of writes: every index slows down INSERT, UPDATE, and DELETE on that table. Indexes have a budget too.
Cache invalidation is the second-hardest problem in computer science (the first is naming things). The patterns that work in practice:
- Time-based. Cache for 60 seconds, accept that data is up to 60 seconds stale. Simple, predictable, fits most cases.
- Event-based. Invalidate the cache when the underlying data changes, by pushing a message from the writer to the cache. Correct in theory, complex in practice because you have to find every code path that writes the data.
- Tag-based. Cache entries are tagged with the keys they depend on, and any write to a key invalidates all entries tagged with that key. Used by Vercel's runtime cache, Cloudflare cache tags, and Stripe's internal caching layer.
The default approach: time-based first, push to event-based only when staleness causes a real problem.
Database connection pool sizing has a formula that the agent will not derive on its own. The right size is roughly (number of CPU cores on the database) * 2, plus the effective disk spindle count (one for SSDs). For a 4-core Postgres database, a pool of around 8-10 connections per application instance, with maybe 5 instances, gives 40-50 total connections, well within Postgres's typical 100-connection default. Setting the pool to 100 and running 10 instances will saturate the database with idle connections and add contention without adding throughput. PgBouncer in transaction mode lets you have a small server-side pool with a much larger client-side virtual pool, decoupling the two.
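The formula applied with node-postgres; the numbers assume the 4-core example above:

import { Pool } from 'pg';

// 4-core database: (4 * 2) + 1 spindle is roughly 9, rounded to 10 per instance.
const pool = new Pool({
  max: 10, // hard cap per application instance
  idleTimeoutMillis: 30_000, // release idle connections back to the server
  connectionTimeoutMillis: 5_000, // fail fast instead of queueing forever
});

export async function getUser(id: string) {
  // pool.query checks a connection out, runs, and returns it automatically.
  const { rows } = await pool.query('SELECT id, email FROM users WHERE id = $1', [id]);
  return rows[0] ?? null;
}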
Observability and Logging
You do not have observability when the system is working. You have observability when the system is broken at 3am and you need to know why in under five minutes. Everything in this section is for that moment.
The three pillars are logs, metrics, and traces. They answer different questions. Logs answer "what happened on this specific request." Metrics answer "how is the system trending over time." Traces answer "where did time go in this one slow request, across all the services it touched."
Structured logs first. Plain text logs ("user 12345 logged in at 14:32 with email [email protected]") are unsearchable at scale because every variation in phrasing breaks your grep. Structured logs are JSON, with stable keys, that ingestion tools parse and index. The minimum fields:
- timestamp (ISO 8601, UTC)
- level (debug, info, warn, error)
- request_id
- message (human-readable, but stable across releases)
- service name
- environment (dev, staging, prod)
- any structured fields specific to the event (user_id, route, status_code, duration_ms)
The libraries: pino on Node.js (fast, JSON by default), Winston (slower, more flexible), structlog on Python, zap on Go. Configure them at the application entry point, never at the call site.
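A sketch of the entry-point setup with pino; the service name and redaction paths are illustrative:

import pino from 'pino';

// Configured once; every module imports this instance instead of console.log.
export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: { service: 'api', env: process.env.NODE_ENV ?? 'dev' },
  timestamp: pino.stdTimeFunctions.isoTime, // ISO 8601 instead of epoch millis
  redact: ['req.headers.authorization', 'password', 'token'], // never log secrets
});

// Usage: stable keys, stable message, context as structured fields.
logger.info(
  { request_id: 'req_8f2a1c3d', route: '/v1/users', status_code: 201, duration_ms: 42 },
  'request completed',
);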
Metrics second. The four golden signals from the Google SRE book: latency, traffic, errors, saturation. For an HTTP API:
- Request rate (requests per second per route per status code)
- Latency distribution (p50, p95, p99 per route, in histograms not averages)
- Error rate (5xx per route, 4xx per route separately)
- Resource saturation (CPU, memory, database connections, queue depth)
The reason histograms beat averages: an average of 100ms hides that 5% of requests are taking 5 seconds. The p95 and p99 surface the long tail that average smooths over. Always look at p95 and p99, not the mean.
Traces third. A trace is a tree of spans, where each span is a unit of work (an HTTP request, a database query, an external API call). Spans have parent-child relationships, durations, and attributes. When a request crosses three services, the trace shows you exactly which service ate the time. Without traces, you guess.
The tools that work in 2026:
- Sentry. Error tracking with stack traces, release tagging, user impact. The default for catching exceptions in production. Pricing scales with event volume.
- Datadog. The everything-store. Logs, metrics, traces, infrastructure monitoring, security. Expensive but unified. Strong agent and integrations.
- Honeycomb. Trace-first observability. Strong for diagnosing the long-tail latency problems that aggregate dashboards miss. The query language (BubbleUp, heatmaps) is the differentiator.
- OpenTelemetry. The vendor-neutral standard for instrumenting code. Emit OTel traces, metrics, and logs from your application; ship them to whatever backend you choose. The right default unless you are locked into a vendor SDK.
- Grafana with Prometheus and Loki. Self-hosted, open-source, the metrics-and-logs side. Cheaper than Datadog at scale, more work to operate.
- Jaeger or Tempo for self-hosted traces.
- Better Stack, Axiom, Logflare for managed log aggregation if Datadog is overkill.
The setup order, end to end:
1. Generate a request ID at the load balancer or the first middleware. Attach it to the request context so every downstream call sees it.
2. Replace console.log with a structured logger. Set up the standard fields (timestamp, level, request_id, message, service, env). Write logs at info for normal flow, warn for recoverable issues, error for unrecoverable.
3. Emit a metric for every request: rate, status, duration. Use a histogram for duration, not a gauge (see the sketch after this list). Tag by route, method, status code.
4. Wire up OpenTelemetry. Auto-instrumentation gets you most of the way: HTTP requests, database queries, outbound HTTP calls all become spans automatically. Add manual spans around critical custom logic.
5. Send unhandled exceptions to Sentry or equivalent. Tag with release and environment. Set up alerts on new error types and on error rate spikes.
6. Build the four-golden-signals dashboard per service. Set alerts on p99 latency, error rate, and saturation. Page on real problems, do not page on every blip.
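Step 3 as code, with prom-client; the bucket boundaries are illustrative and should bracket your latency budget:

import client from 'prom-client';

// Histogram, not gauge: buckets are what make p95/p99 queries possible later.
const httpDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'HTTP request duration in milliseconds',
  labelNames: ['route', 'method', 'status'],
  buckets: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
});

// Called from middleware around every request.
export function recordRequest(route: string, method: string, status: number, ms: number) {
  httpDuration.observe({ route, method, status: String(status) }, ms);
}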
The two mistakes AI agents consistently make on observability code. First, they log too much: every input parameter, every intermediate value, every step. The result is logs that cost money to ingest and signal that drowns in noise. Log decisions and outcomes, not narration. Second, they log secrets: passwords, tokens, payment details, PII. Always sanitize before logging. The logger config should have a redaction list, and you should test that the redaction works.
Sampling is the operational lever you tune as traffic grows. Logging every request at full detail works at 10 requests per second. At 10,000 requests per second, the storage and ingest cost becomes real. The standard pattern is full logging on errors, head-based sampling on success (1% or 0.1% of successful requests get full detail, the rest get summary metrics). Tail-based sampling is fancier (collect all spans, decide after the fact whether to keep the trace based on whether anything went wrong), supported by Honeycomb, Datadog, and the OpenTelemetry collector. AI agents do not configure sampling; you configure sampling.
Service-level objectives (SLOs) are the higher-level discipline that turns observability from "we have dashboards" into "we know when something is wrong." Define for each critical path: the indicator (p95 latency, error rate), the objective (under 200ms, under 0.1%), the window (rolling 30 days). Track the error budget (the amount you can fail without missing the objective). When the error budget is consumed, slow down feature work and fix reliability. Without an SLO, every dashboard is a Rorschach test; with one, the team has a number to point at.
Where AI Agents Excel and Where They Need Supervision
The honest map of what works and what does not. This is the section that should change how you allocate review time.
Where agents excel: CRUD scaffolding, route handlers, validation schemas, OpenAPI to TypeScript translation, test stubs, database migrations from a schema description, boilerplate middleware, parsing well-defined formats, refactoring within a single file, generating typed clients from a spec.
Where agents need supervision: auth flows (security-critical), payment integrations (correctness-critical and money-critical), schema design (load-bearing for years), perf-sensitive paths (the agent does not run a profiler), distributed system semantics (consistency, idempotency, ordering), anything involving cryptography, anything that touches user data exports or deletions for compliance.
The reason the split looks like this. Tasks the agent does well are tasks where the answer is well-defined and bounded by the contract. Translate this OpenAPI spec to a Hono handler. Generate a Zod schema for this Postgres table. Write a test for this function. The space of correct answers is narrow and the agent stays in it.
Tasks the agent needs supervision on are tasks where the correct answer depends on context that is not in the immediate prompt. Auth flows depend on threat models. Payment code depends on the specific provider's idempotency rules and webhook signing. Schema design depends on access patterns that have not been written yet. Performance depends on real-world load that the agent has not seen. The agent will produce code that looks correct and is wrong in ways that only show up under specific conditions.
The patterns for supervised work:
- Diff review with intention. Read every line of generated code in security-critical paths. Do not skim.
- Adversarial prompting. After the agent writes the auth handler, prompt: "list every way this could be exploited" and act on the answers. The agent is decent at red-teaming its own code if you ask.
- Reference real specifications. Hand the agent the actual Stripe webhook signing docs, not your description of them. Hand it the actual Postgres pg_dump format spec, not a hand-rolled summary.
- Test against fixtures. For payments specifically, every payment provider ships test cards and webhook fixtures. Run the integration against them in CI before any human review.
- Scope the agent. For schema changes, do not let the agent design the schema in one shot. Have it generate options, you pick, then it implements. The decision stays with the human.
One thing the agent is unexpectedly bad at: knowing when to ask for clarification. By default it produces an answer even when the spec is ambiguous. The fix is in the prompt: "if any requirement is unclear, ask before writing code." This single line cuts the rate of "wrong-but-plausible" output noticeably. Anthropic's Claude is the model that does this best in 2026; if you are using a different model and finding it forges ahead too easily, switch.
The frameworks for this kind of work in production codebases. Claude Code as the IDE-integrated agent. Cursor with Claude as the model when you want a more visual diff workflow. The OpenAI tools (Codex CLI, ChatGPT) are an option but Anthropic's models are stronger on the long-context, large-codebase work that backends require, and the prompt caching makes them cheaper to run at scale.
Test coverage is the underrated control on AI-generated code. The agent will write tests if you ask for them, and the tests it writes are usually decent for the surface behavior. What it misses are edge cases: empty inputs, extreme values, concurrent requests, malformed data. The pattern that works is property-based testing on top of unit tests. Tools like fast-check (TypeScript), Hypothesis (Python), and PropEr (Erlang) generate hundreds of inputs per test and let you assert invariants instead of specific outputs. The agent can write the property tests once you give it the invariant; coming up with the invariants is the part that needs human thought.
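A sketch of the shape with fast-check under Vitest; the invariants are the human contribution, the agent can fill in the mechanics:

import { it, expect } from 'vitest';
import fc from 'fast-check';
import { z } from 'zod';

const CreateUserRequest = z.object({
  email: z.string().email().max(254),
  password: z.string().min(12).max(128),
});

// Invariant 1: validation never throws on arbitrary junk, it only reports.
it('never throws on arbitrary input', () => {
  fc.assert(
    fc.property(fc.anything(), (input) => {
      const result = CreateUserRequest.safeParse(input);
      expect(typeof result.success).toBe('boolean');
    }),
  );
});

// Invariant 2: every password in the documented length range is accepted.
it('accepts any password within the contract bounds', () => {
  fc.assert(
    fc.property(fc.string({ minLength: 12, maxLength: 128 }), (password) => {
      expect(CreateUserRequest.safeParse({ email: 'a@example.com', password }).success).toBe(true);
    }),
  );
});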
Database migrations are another supervised area worth calling out. The agent will write a migration when you ask. It will not always notice that the migration locks the table for ten minutes on a 100-million-row database, or that the new column with a default value rewrites every row. The patterns: add nullable columns, backfill in batches, then add the NOT NULL constraint in a separate migration. Drop columns in a deprecation cycle: stop reading first, stop writing second, drop third. AI-generated migrations need the same review you would give human migrations on a hot table.
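The batched-backfill step as a sketch with node-postgres; the table, column, and batch size are illustrative:

import { Pool } from 'pg';

const pool = new Pool();

// Step 2 of three: the column was added as nullable in a prior migration,
// and the NOT NULL constraint lands in a later one. Small batches keep
// each statement's lock short on a hot table.
export async function backfillStatus(batchSize = 1000) {
  for (;;) {
    const res = await pool.query(
      `UPDATE users SET status = 'active'
       WHERE id IN (SELECT id FROM users WHERE status IS NULL LIMIT $1)`,
      [batchSize],
    );
    if (!res.rowCount) break; // nothing left to backfill
    await new Promise((r) => setTimeout(r, 100)); // breathe between batches
  }
}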
Use AI for the wide top of the work: scaffolding, translation, test stubs, refactor. Spend your review time on auth, payments, schemas, performance, and distributed semantics. The split is the discipline.
Closing
The backend is where AI-built systems either ship to production or fall over. The patterns that survive the transition are not new. They are the same patterns that worked before the agents existed: contract-first design, consistent error handling, real authentication and authorization, queues for the things that should not block, performance discipline tied to measurement, observability set up before it is needed, and a clear-eyed view of what the agent is trustworthy for.
The shift is in how you spend your time. The agent does the typing. You do the deciding: what the API should look like, what counts as a valid input, what counts as a real error, what gets logged, what gets cached, what gets queued, what gets reviewed line by line and what gets accepted on a glance. The discipline is about being explicit about those decisions so the agent has the context it needs and you have the criteria to know whether the result is right.
If you remember one rule from this whole page: write the contract before the code. Every other decision flows from it. Get that one right and the agent becomes a force multiplier. Skip it and you ship a backend that is fast to write and slow to fix, which is the worst trade in software.
