The Context Sandwich: Why 'Infinite Memory' is a Trap
Everyone wants an "infinite context window." The math says no.
When you chat with your codebase using an AI agent, we have to build a context sandwich for every single message you send:
- The Bun (Required): System prompts, persona rules, and tool definitions. The non-negotiable instruction manual.
- The Meat (Expensive): Vector search results pulled from your repository for relevant background knowledge.
- The Condiments (Critical): The files you currently have open, plus persistent "Sticky Context" like your active branch or workspace.
- The Leftovers (First thing to go): Your actual chat history from the current session.
- The Bottom Bun: Your newest query.
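To make the layering concrete, here is a minimal sketch of how a prompt assembler might stack these layers under a fixed token budget. Every name and number in it (build_prompt, the 8,000-token budget, the four-characters-per-token estimate) is an illustrative assumption, not AICoven's actual code.

```python
# Minimal sketch of a "context sandwich" assembler.
# Names, the budget, and the token estimate are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly four characters per token."""
    return max(1, len(text) // 4)

def build_prompt(system_prompt: str,            # the Bun: non-negotiable
                 retrieved_chunks: list[str],   # the Meat: vector search results
                 sticky_context: str,           # the Condiments: branch, open files
                 chat_history: list[str],       # the Leftovers: first thing to go
                 user_query: str,               # the Bottom Bun: always included
                 budget: int = 8_000) -> str:
    # Reserve the required layers first; history only gets what is left over.
    required = [system_prompt, sticky_context, *retrieved_chunks, user_query]
    remaining = budget - sum(estimate_tokens(part) for part in required)

    kept_history: list[str] = []
    for message in reversed(chat_history):    # walk newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break                             # the cutoff line: older turns are dropped
        kept_history.append(message)
        remaining -= cost

    layers = [system_prompt, *retrieved_chunks, sticky_context,
              *reversed(kept_history), user_query]
    return "\n\n".join(layers)
```

The design choice that matters is the order of reservation: the required layers claim their tokens first, and the Leftovers only ever get what remains, newest messages first.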
The Cost of a Full Sandwich
A typical prompt payload runs about 2,500 tokens. Even a 10-token user query ("Fix the auth bug") still requires the Bun, the Meat, and the Condiments — all of that context just to ensure the model can actually fix the bug.
Now imagine you've been chatting for an hour. Your "Leftovers" have ballooned to hundreds of thousands of tokens. If you blindly stuff all of that history into the context window:
- The cost skyrockets. You pay for every token, every turn.
- The "Lost in the Middle" effect. Once the payload gets massive, the LLM literally forgets the system prompt at the top. Your expert coding agent suddenly forgets how to use its tools or starts writing code in the wrong format because its original instructions were pushed out of working memory.
Token 0       Token 4,000            Token 120,000  Token 128,000
┌─────────────┬──────────────────┬────────────────┬──────────────┐
│ System      │                  │                │  User Query  │
│ Prompt      │ ...forgotten...  │ ...forgotten...│              │
│ (The Bun)   │                  │                │ (Bottom Bun) │
└─────────────┴──────────────────┴────────────────┴──────────────┘
  ↑ remembers                                       ↑ remembers
                ↑     "Lost in the Middle"      ↑
Managing the Cutoff Line
Engineering an AI agent is about ruthlessly managing this cutoff line. In AICoven, we don't append chat history indefinitely. We built three specific systems:
1. Sticky Context over Recitation
Instead of forcing the LLM to read through 50 old messages to remember you're working in the api/ directory on the dev branch, our backend extracts that data and pins it to the top of the context window. Chat history doesn't need to bloat with repeated facts.
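Here is a minimal sketch of the idea, with hypothetical names (StickyContext, render_sticky) standing in for whatever the backend actually does: the session facts are captured once and rendered as a small pinned block directly under the system prompt, so they cost a few dozen tokens per turn instead of fifty messages of recitation.

```python
# Sketch: pin session facts once instead of re-deriving them from chat history.
# Field names and the rendered format are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class StickyContext:
    branch: str                 # e.g. "dev"
    working_dir: str            # e.g. "api/"
    open_files: list[str] = field(default_factory=list)

def render_sticky(ctx: StickyContext) -> str:
    """Render the pinned block that sits directly under the system prompt."""
    files = ", ".join(ctx.open_files) or "none"
    return (
        "[Sticky context]\n"
        f"Branch: {ctx.branch}\n"
        f"Working directory: {ctx.working_dir}\n"
        f"Open files: {files}"
    )

# A few dozen tokens per turn, instead of making the model re-read 50 old
# messages to rediscover the same facts.
pinned = render_sticky(StickyContext("dev", "api/", ["api/auth.py"]))
```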
2. Unified Memory over Raw History
When an agent learns something important (like how your authentication flow routes), it saves that to AICoven's local Unified Memory. The agent can then fetch just that memory chunk — a few hundred tokens — rather than keeping 10,000 tokens of chat history alive just in case.
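The sketch below shows the shape of this with a deliberately naive stand-in: MemoryStore, save, and search are hypothetical names, and keyword overlap replaces whatever retrieval the real Unified Memory uses, just to keep the example self-contained.

```python
# Sketch: a tiny local memory store. Keyword overlap stands in for whatever
# retrieval the real Unified Memory uses; every name here is an assumption.

class MemoryStore:
    def __init__(self) -> None:
        self._notes: list[str] = []

    def save(self, note: str) -> None:
        """Persist a distilled fact the agent learned this session."""
        self._notes.append(note)

    def search(self, query: str, limit: int = 3) -> list[str]:
        """Return the notes that share the most words with the query."""
        query_words = set(query.lower().split())
        ranked = sorted(
            self._notes,
            key=lambda note: len(query_words & set(note.lower().split())),
            reverse=True,
        )
        return ranked[:limit]

memory = MemoryStore()
memory.save("Auth flow: requests pass through api/middleware/auth.py before routes/session.py.")

# A later turn fetches a few hundred tokens of memory instead of keeping
# 10,000 tokens of raw chat history alive.
relevant_notes = memory.search("where does the auth flow route requests?")
```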
3. Strict Truncation Policies
When the agent reads a file, we check the byte length first. We don't blindly paste massive config files into the prompt. File attachments are formatted explicitly ([File contents: {path}]\n{content}) and truncated at a fixed threshold so the system prompt stays intact.
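A sketch of that policy follows. MAX_ATTACHMENT_BYTES and the truncation marker are assumptions rather than AICoven's real limits; the [File contents: {path}] framing is the format quoted above.

```python
# Sketch: size-check a file before attaching it to the prompt.
# MAX_ATTACHMENT_BYTES and the truncation marker are assumptions.
from pathlib import Path

MAX_ATTACHMENT_BYTES = 32_768   # hard ceiling per attached file

def attach_file(path: str) -> str:
    """Format a file for the prompt, cutting oversized contents at the ceiling."""
    raw = Path(path).read_bytes()
    clipped = raw[:MAX_ATTACHMENT_BYTES].decode("utf-8", errors="replace")
    marker = "\n[...truncated...]" if len(raw) > MAX_ATTACHMENT_BYTES else ""
    return f"[File contents: {path}]\n{clipped}{marker}"
```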
Even a near-infinite context window doesn't mean you should fill it. Context is a budget, and the best agents know exactly what to cut.
About the Author
I'm Andreea, the creator of AICoven. I build local-first tools for developers who care about architecture, privacy, and prompt economics.
See more of my work at papillonmakes.tech →