The Scale Brief · Issue #146

Your agent will hit the context window.
Plan for it.

Look at the longest conversations your agent had last month. Bet you can find one where the agent suddenly forgot context mid-thread.

It happens around the 40th or 50th turn. The user has been working with the agent for ten minutes, building up context. Then the agent answers a question and the answer is weirdly… disconnected. It forgets a detail the user mentioned six turns ago. The user repeats themselves. The trust drops a notch. The conversation ends not long after.

You can ignore this until it bites, or you can plan for it. The pattern that solves it has a self-describing name: retry-with-summary.

What's actually happening

Every LLM has a context window. The window is finite. Most agent setups today operate well within it for the first 10-20 turns of a conversation. Past that, depending on system prompt length + tool-call overhead + retrieved context + per-turn message size, the agent starts crowding the ceiling.
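To make the crowding concrete, here is a back-of-envelope sketch. `turnsUntilTrigger` is a hypothetical helper, and every number in the example is an illustrative assumption, not a measurement; plug in your own.

```javascript
// Rough estimate of how many turns fit before a conversation reaches a
// trigger threshold. All inputs are assumptions to replace with your own
// measurements.
function turnsUntilTrigger({ contextLimit, triggerRatio, fixedTokens, tokensPerTurn }) {
  const budget = contextLimit * triggerRatio - fixedTokens;
  if (budget <= 0) return 0; // fixed overhead alone is already past the threshold
  return Math.floor(budget / tokensPerTurn);
}

const example = turnsUntilTrigger({
  contextLimit: 120000, // tokens
  triggerRatio: 0.8,    // start worrying at 80% of the window
  fixedTokens: 6000,    // system prompt + tool definitions (assumed)
  tokensPerTurn: 2000,  // user message + reply + tool output + retrieved context (assumed)
});
console.log(example); // 45
```

Heavier tool output or retrieval per turn pulls that number down quickly, which is why the ceiling tends to show up only in your longest conversations.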

Two failure modes typically follow:

  1. Naive truncation: the agent silently drops the oldest messages to make room for the new one. The agent "forgets" the early context. The user notices. Trust drops.
  2. Hard error: the call throws a context_length_exceeded error. The agent shows the user something like "Sorry, something went wrong." The conversation dies.

Both are bad. Both are avoidable.

The pattern

Before each model call, check the prospective token count against the model's context limit. If you're approaching the ceiling — say 80% — pause the live exchange and do a summarization pass:

  1. Compress the conversation: a cheap model (for example Claude Haiku or gpt-4.1-nano) summarizes the older turns into a 200-word digest that preserves the user's stated goal, the key decisions made so far, any concrete data the user provided (names, IDs, preferences), and the current state of the conversation.
  2. Restart with the digest: replace the older messages with a single system-role or assistant-role message containing the summary. Keep the most recent 4-6 turns verbatim so continuity is intact.
  3. Continue: the agent answers the user's latest question with the digest as context. The user notices nothing.

The user sees no error, no interruption, no "let me start over." The conversation just continues.
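The pre-call check can be sketched like this. The estimator is a crude heuristic (about 4 characters per token for English text, plus a small per-message overhead); both constants are assumptions, and your provider's tokenizer will give better numbers. `shouldSummarize` is a hypothetical name.

```javascript
// Crude token estimate: ~4 characters per token plus a small per-message
// overhead. Both constants are assumptions, not provider-specified values.
const CHARS_PER_TOKEN = 4;
const PER_MESSAGE_OVERHEAD = 4;

function estimateTokens(messages) {
  return messages.reduce((total, m) => {
    // Non-string content (e.g. structured tool output) is measured as JSON.
    const text = typeof m.content === "string" ? m.content : JSON.stringify(m.content ?? "");
    return total + Math.ceil(text.length / CHARS_PER_TOKEN) + PER_MESSAGE_OVERHEAD;
  }, 0);
}

// True once the estimated size reaches the trigger ratio of the window.
function shouldSummarize(messages, contextLimit, triggerRatio = 0.8) {
  return estimateTokens(messages) >= contextLimit * triggerRatio;
}
```

Because the estimate is approximate, a trigger ratio well below 100% leaves headroom for the error.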

The implementation

If you're using the standard chat-completion protocol, this is a 30-line helper:

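// Assumes two helpers defined elsewhere in your codebase:
//   chat({ messages, model }) -> Promise<{ content: string }>
//   estimateTokens(messages)  -> number
async function chatWithSummaryFallback(messages, model, opts = {}) {
  const ESTIMATED_LIMIT = opts.contextLimit || 120000;  // tokens
  const TRIGGER_AT = ESTIMATED_LIMIT * 0.8;
  const KEEP_RECENT = 6;  // keep last N messages verbatim (about 3 user/assistant turns)

  // Keep the original system prompt out of the summary so it survives compression
  const system = messages.filter((m) => m.role === "system");
  const history = messages.filter((m) => m.role !== "system");

  const estimated = estimateTokens(messages);
  if (estimated < TRIGGER_AT || history.length <= KEEP_RECENT) {
    return await chat({ messages, model });
  }

  // Compress everything except the last KEEP_RECENT messages
  const old = history.slice(0, -KEEP_RECENT);
  const recent = history.slice(-KEEP_RECENT);

  const digest = await chat({
    model: opts.cheapModel || model,  // fall back to the main model if no cheap one is set
    messages: [
      { role: "system", content: "Summarize the conversation. Preserve: user goal, decisions made, any specific data the user mentioned, current state. Be concise. 200 words." },
      ...old
    ]
  });

  const compressed = [
    ...system,
    { role: "system", content: `Earlier conversation summary: ${digest.content}` },
    ...recent
  ];

  return await chat({ messages: compressed, model });
}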

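That's it. The agent now degrades gracefully instead of failing at the ceiling. Conversations can run to 200 turns without a hard error, provided the older turns still fit in the cheap model's window for the summarization call.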
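The helper above is proactive: it trusts a token estimate. Estimates can be off, so one hedge is a reactive wrapper that catches the context-length error and retries after compressing. This is a sketch under assumptions: `chat` and `compress` are injected, hypothetical functions, and the error shape (a `code` or message containing `context_length_exceeded`) varies by provider and SDK.

```javascript
// Assumption: the provider surfaces an error whose `code` or message
// contains "context_length_exceeded". Adjust this check for your SDK.
function isContextLengthError(err) {
  const marker = "context_length_exceeded";
  return err?.code === marker || String(err?.message ?? "").includes(marker);
}

// `chat` and `compress` are injected, hypothetical functions:
//   chat({ messages, model }) -> Promise<reply>
//   compress(messages)        -> Promise<shorter messages array>
async function chatWithRetry({ chat, compress, messages, model, maxRetries = 1 }) {
  let current = messages;
  for (let attempt = 0; ; attempt++) {
    try {
      return await chat({ messages: current, model });
    } catch (err) {
      // Anything other than a context-length error, or running out of
      // retries, is rethrown to the caller unchanged.
      if (!isContextLengthError(err) || attempt >= maxRetries) throw err;
      current = await compress(current); // shrink the history, then try again
    }
  }
}
```

Capping retries matters: if compression cannot get the history under the limit, the caller should see the error rather than loop.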

The edge cases

A few things the naive version misses:

Why this beats "just use a bigger context window"

Modern models advertise 200K+ context windows. Some go to 1M+. So you can just buy your way out of this, right?

Two reasons no:

  1. Quality degrades at the long end. Retrieval accuracy in extreme long-context regimes is documented to drop, especially for facts that sit in the middle of the conversation. A 200-word digest of relevant facts often beats 200K tokens of raw history.
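  2. Cost scales with context length. Every turn re-sends the entire conversation, so the input bill per call grows linearly with turn count and the cumulative bill grows roughly quadratically. At the 80th turn, you're paying for all 80 turns of input again. Summarization caps that growth.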

The retry-with-summary pattern is a quality + cost win, not a workaround.
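The cost argument is easy to check with arithmetic. The sketch below uses a deliberately simple model: every call re-sends a fixed prefix plus all history so far, and summarization is approximated as a ceiling on history size. The numbers are illustrative assumptions, and the cost of the summarization calls themselves is ignored.

```javascript
// Total input tokens billed across a conversation, assuming every call
// re-sends a fixed prefix plus all history so far. `historyCap` models
// summarization as a simple ceiling on history size (a simplification).
function cumulativeInputTokens({ turns, fixedTokens, tokensPerTurn, historyCap = Infinity }) {
  let total = 0;
  for (let k = 1; k <= turns; k++) {
    total += fixedTokens + Math.min(k * tokensPerTurn, historyCap);
  }
  return total;
}

const base = { turns: 80, fixedTokens: 2000, tokensPerTurn: 1000 }; // assumed values
const uncapped = cumulativeInputTokens(base);
const capped = cumulativeInputTokens({ ...base, historyCap: 10000 });
console.log(uncapped, capped); // 3400000 915000
```

Under these assumptions the capped conversation bills roughly a quarter of the input tokens, and the gap widens as the conversation gets longer.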

The dashboard signal

Add one chart to your agent observability dashboard: per-conversation message count, bucketed by outcome (completed / abandoned). If you see a spike in abandonment past a specific turn count, that's likely where your agent is hitting the ceiling without recovery. The fix is the pattern above.
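If your logs can be exported as one record per conversation, the bucketing behind that chart is a few lines. This is a sketch: the record shape (`{ turns, outcome }`) and the function name are assumptions, so map them onto whatever your observability tool actually stores.

```javascript
// Group conversations into turn-count buckets and compute the abandonment
// rate per bucket. The record shape ({ turns, outcome }) is an assumption.
function abandonmentByTurnBucket(conversations, bucketSize = 10) {
  const buckets = new Map();
  for (const { turns, outcome } of conversations) {
    const start = Math.floor(turns / bucketSize) * bucketSize; // e.g. 47 -> 40
    const b = buckets.get(start) ?? { start, total: 0, abandoned: 0 };
    b.total += 1;
    if (outcome === "abandoned") b.abandoned += 1;
    buckets.set(start, b);
  }
  // Sorted by bucket start so a jump in `rate` is easy to spot.
  return [...buckets.values()]
    .sort((a, b) => a.start - b.start)
    .map((b) => ({ ...b, rate: b.abandoned / b.total }));
}
```

A bucket whose rate jumps well above its neighbors is the turn range to inspect first.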

The one-line summary

Every long conversation eventually hits the context-window ceiling. Naive truncation makes the agent forget. Hard errors kill the conversation. A 30-line retry-with-summary helper makes the failure invisible. Ship it before your longest-conversation tail breaks something visible.
