The Context Sandwich: Why 'Infinite Memory' is a Trap

Andreea

Everyone wants an "infinite context window." The math says no.

When you chat with your codebase using an AI agent, we have to build a "Context Sandwich" for every single message you send. Here is what that looks like in practice:

  1. The Bun (Required): System prompts, persona rules, and tool definitions. This is the non-negotiable instruction manual that tells the AI who it is and what it can do.
  2. The Meat (Expensive): Vector search results (RAG) pulled from your repository to give the agent relevant background knowledge.
  3. The Condiments (Critical): The specific files you currently have open and are asking the agent to edit, along with your persistent "Sticky Context" (like your current active branch or workspace).
  4. The Leftovers (The first thing to go): Your actual chat history from the current session.
  5. The Bottom Bun: Your newest user query.
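The layering above can be sketched in a few lines. This is an illustrative sketch, not AICoven's actual code: the function names, the 8,000-token budget, and the characters-per-token heuristic are all assumptions. The key property is that the required layers go in first and the Leftovers only fill whatever budget remains, oldest turns cut first.

```python
# Hypothetical sketch of the "Context Sandwich" assembly described above.
# All names (build_context, TOKEN_BUDGET, estimate_tokens) are illustrative.

TOKEN_BUDGET = 8_000  # assumed model context limit for this sketch

def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def build_context(system: str, rag_chunks: list[str], open_files: list[str],
                  history: list[str], query: str) -> list[str]:
    # Required layers go in first: the Bun, the Meat, the Condiments, the query.
    required = [system, *rag_chunks, *open_files, query]
    used = sum(estimate_tokens(part) for part in required)

    # The Leftovers (chat history) fill whatever budget remains,
    # walking newest-first so the oldest turns are the first to be cut.
    kept_history: list[str] = []
    for msg in reversed(history):
        cost = estimate_tokens(msg)
        if used + cost > TOKEN_BUDGET:
            break
        kept_history.append(msg)
        used += cost
    kept_history.reverse()  # restore chronological order

    return [system, *rag_chunks, *open_files, *kept_history, query]
```

Note that the cut falls on history alone; the system prompt and open files are never trimmed to make room for old chat turns.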

The Physics of the Context Window

Consider a typical 2,500-token payload. Even with a 10-token user query ("Fix the auth bug"), we still have to send the Bun, the Meat, and the Condiments so the model actually has the context to fix the bug. That payload inflates quickly.

But what happens when you've been chatting for an hour, and your "Leftovers" (chat history) push into the hundreds of thousands of tokens?

If we blindly stuff all of that history into the context window, two things happen:

  1. The Cost Skyrockets: You pay for every token, every time.
  2. The "Lost in the Middle" Effect: as the payload grows, models attend less reliably to content buried deep in the prompt. Your expert coding agent forgets how to use its tools or starts writing code in the wrong format because the System Prompt (the top Bun) no longer gets the attention it needs.
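The cost point can be put in rough numbers. The price, the fixed payload size, and the per-turn history growth below are all illustrative assumptions, not real provider rates, but the shape of the curve is the point: re-sending the full history makes total input cost grow quadratically with conversation length.

```python
# Back-of-the-envelope input cost when chat history is appended verbatim.
# Price and sizes are illustrative assumptions, not real provider rates.

PRICE_PER_1K_INPUT_TOKENS = 0.003  # assumed: $3 per million input tokens

def conversation_cost(turns: int, fixed_payload: int = 2_500,
                      tokens_per_turn: int = 800) -> float:
    """Total input cost when every turn re-sends the whole history."""
    total_tokens = 0
    history = 0
    for _ in range(turns):
        total_tokens += fixed_payload + history  # full sandwich, every message
        history += tokens_per_turn               # history grows each turn
    return total_tokens * PRICE_PER_1K_INPUT_TOKENS / 1_000
```

Doubling the number of turns more than doubles the bill, because every new message re-pays for every old one.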

Managing the Cutoff Line in AICoven

Engineering an AI agent isn't about magic; it's about ruthlessly managing the cutoff line. In AICoven, we don't just infinitely append your chat history. We built three specific systems to manage the sandwich:

1. Sticky Context over Recitation

Instead of forcing the LLM to read through 50 old messages to remember that you are working in the api/ directory on the dev branch, our backend actively extracts that data and pins it to the top of the context window. Your chat history doesn't need to bloat with repeated facts.
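Pinning stable session facts might look something like the sketch below. The StickyContext shape and the header format are assumptions for illustration; the real backend extracts these facts from git state and tool calls rather than holding them in a dataclass.

```python
# Illustrative sketch of "Sticky Context": stable session facts are stored
# once and rendered as a pinned header, instead of being repeated in history.
from dataclasses import dataclass, field

@dataclass
class StickyContext:
    branch: str = "main"
    workspace: str = "."
    facts: dict[str, str] = field(default_factory=dict)  # extra pinned facts

def pinned_header(ctx: StickyContext) -> str:
    # Rendered once at the top of the context window, every turn.
    lines = [f"Active branch: {ctx.branch}", f"Workspace: {ctx.workspace}"]
    lines += [f"{key}: {value}" for key, value in ctx.facts.items()]
    return "[Sticky Context]\n" + "\n".join(lines)
```

A ~30-token pinned header replaces dozens of history turns that would otherwise exist only to restate the same facts.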

2. Unified Memory over Raw History

When an agent learns something important (like how your authentication flow routes), it is instructed to save that to AICoven's local Unified Memory. The agent can then dynamically fetch just that memory chunk (a few hundred tokens) to add to the "Meat," rather than keeping 10,000 tokens of chat history alive just in case it needs to remember something from yesterday.
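The save-then-fetch loop can be sketched as a tiny store. Naive keyword overlap stands in here for whatever retrieval AICoven actually uses; the class and method names are assumptions.

```python
# Minimal sketch of a "Unified Memory" store: the agent saves durable facts
# and later fetches only the relevant chunk instead of replaying raw history.

class UnifiedMemory:
    def __init__(self) -> None:
        self._chunks: list[str] = []

    def save(self, fact: str) -> None:
        # Called when the agent learns something worth keeping.
        self._chunks.append(fact)

    def fetch(self, query: str, limit: int = 1) -> list[str]:
        # Score each saved chunk by word overlap with the query
        # (a stand-in for real retrieval) and return the best matches.
        words = set(query.lower().split())
        scored = sorted(self._chunks,
                        key=lambda chunk: len(words & set(chunk.lower().split())),
                        reverse=True)
        return scored[:limit]
```

The fetched chunk costs a few hundred tokens per turn; the raw conversation it replaces would cost thousands.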

3. Strict Truncation Policies

When the agent uses a tool to read a file, it checks the byte length. We don't blindly paste huge binaries or massive configuration files into the prompt. AICoven formats file attachments explicitly (e.g., [File contents: {path}]\n{content}) and truncates anything that breaches the threshold, ensuring the system prompt remains the central focus.
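A minimal version of that policy, using the attachment format quoted above: the 16 KB threshold and function name are assumptions for this sketch, not AICoven's actual limits.

```python
# Sketch of the truncation policy: check byte length before attaching a file,
# cut past a threshold, and mark the cut explicitly in the attachment.

MAX_ATTACHMENT_BYTES = 16_384  # assumed threshold for this sketch

def attach_file(path: str, content: str) -> str:
    data = content.encode("utf-8")
    if len(data) > MAX_ATTACHMENT_BYTES:
        # Keep the head of the file and mark the cut explicitly,
        # so the model knows the attachment is incomplete.
        content = data[:MAX_ATTACHMENT_BYTES].decode("utf-8", errors="ignore")
        content += "\n[... truncated ...]"
    return f"[File contents: {path}]\n{content}"
```

The explicit marker matters: a silently clipped file invites the model to hallucinate the missing half, while a labeled cut tells it to go read more if it needs to.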

An infinite context window doesn't mean you should use it. Context is a budget, and the best agents know exactly what to cut.

About the Author

I'm Andreea, the creator of AICoven. I build local-first tools for developers who care about architecture, privacy, and prompt economics.

See more of my work at papillonmakes.tech →