
How We Rebuilt Tool Calling: From 75 Schemas to Smart Selection

Andreea

Last week, I wrote about the reality of local agent tool calling: how cloud APIs give you clean JSON function calling while local models require regex parsers held together by string and prayer.

Since then, I found Anthropic's published research on Advanced Tool Use. Three features fundamentally changed how they handle tool-heavy agents: Tool Search, Programmatic Tool Calling, and Tool Use Examples.

The results were dramatic. Context dropped from 77K to 8.7K tokens. Accuracy jumped from 49% to 74%. Token usage on complex tasks fell 37%.

AICoven supports OpenAI, Anthropic, and Google, and runs local models on-device via MLX. We couldn't just flip a switch. We had to rebuild all three ideas at the application layer, provider-agnostic, and then figure out what they look like for a 2B-parameter model running on your iPhone.


The Problem: 44 Tools, Every Single Prompt

AICoven has 44 built-in tools — GitHub, Google Drive, Shell, Scheduling, Firebase, Search Console — plus whatever MCP servers you've connected. Every tool has a JSON schema with parameter descriptions.

We'd already solved this for MCP tools. Our MCPToolEmbeddingCache used semantic embedding search to select only the relevant MCP tools and compress everything else into a compact catalog. That was working well for MCP tools. The 44 built-in schemas, though, still shipped with every prompt.


Phase 1: Show, Don't Tell

Anthropic's research on Tool Use Examples showed that adding concrete usage examples to tool definitions improved accuracy from 72% to 90% on complex parameter handling. JSON schemas tell you structure. Examples teach patterns.

We built a central example registry covering our 14 highest-error tools — 29 examples total showing minimal, typical, and edge-case usage:

"github.createPR": [
    {
        "description": "Create a PR with minimal parameters",
        "input": {
            "repo": "acme/frontend",
            "base": "main",
            "head": "feature/new-login",
            "title": "Add SSO login flow"
        }
    }
]

The injection strategy is provider-specific. Anthropic has a native input_examples API field — we populate it directly. OpenAI and Google don't, so we append formatted examples to the tool's description text. The LLM sees them either way.
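Here's a minimal sketch of that per-provider split. The registry shape matches the example above, but EXAMPLE_REGISTRY, inject_examples, and the tool-dict layout are illustrative assumptions, not AICoven's actual internals:

# Hypothetical sketch of the per-provider injection described above.
import json

EXAMPLE_REGISTRY = {
    "github.createPR": [
        {"description": "Create a PR with minimal parameters",
         "input": {"repo": "acme/frontend", "base": "main",
                   "head": "feature/new-login", "title": "Add SSO login flow"}},
    ],
}

def inject_examples(tool: dict, provider: str) -> dict:
    examples = EXAMPLE_REGISTRY.get(tool["name"])
    if not examples:
        return tool
    if provider == "anthropic":
        # Anthropic exposes a native field for usage examples.
        tool["input_examples"] = [ex["input"] for ex in examples]
    else:
        # OpenAI and Google have no such field: append formatted
        # examples to the tool's description text instead.
        rendered = "\n".join(
            f"Example ({ex['description']}):\n{json.dumps(ex['input'], indent=2)}"
            for ex in examples
        )
        tool["description"] = f"{tool.get('description', '')}\n\n{rendered}".strip()
    return tool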

For local models running on-device, the LoRA training data already teaches tool patterns through few-shot examples baked into the weights. So tool use examples primarily benefit cloud models.


Phase 2: Only Load What You Need

This turned out to be the biggest win.

Anthropic's Tool Search uses a meta-tool: Claude calls a search function to discover relevant tools on-demand. Elegant, but it requires an extra inference step. We already had embedding infrastructure from MCP tool selection. So we extended it:

Before: All 44 built-in tool schemas dumped into every prompt.

After: 9 always-loaded tools (web search, shell, file ops, time) get full schemas. The rest are scored for relevance against the user's message; the top 15 get full schemas, and everything else collapses into a compressed catalog:

## Other Available Tools (20 additional)
- **firebase**: firebase.listProjects, firebase.listApps, ...
- **google_sheets**: google_sheets.getValues, google_sheets.updateValues, ...
- **schedule**: schedule.setReminder, schedule.listReminders, ...

~200 tokens instead of ~2,000. If you ask about GitHub, you get GitHub tools. If you ask about spreadsheets, you get Google Sheets tools. Firebase tools aren't in your way when you're debugging code. (I assumed we already did this, because it makes sense, but apparently not: my product spec never specifically addressed it. I have only myself to blame.)
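A minimal sketch of the selection logic, assuming cosine similarity over precomputed tool-description embeddings. The function names, the always-loaded set, and the grouping heuristic are illustrative, not AICoven's real API:

# Illustrative relevance-based tool selection.
import numpy as np

ALWAYS_LOADED = {"web_search", "shell_execute", "file_read", "file_write", "current_time"}
TOP_K = 15

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_tools(message_vec, tools, tool_vecs):
    """Return (full_schema_tools, catalog_tools) for one prompt."""
    always = [t for t in tools if t["name"] in ALWAYS_LOADED]
    rest = [t for t in tools if t["name"] not in ALWAYS_LOADED]
    scored = sorted(rest, key=lambda t: cosine(message_vec, tool_vecs[t["name"]]),
                    reverse=True)
    return always + scored[:TOP_K], scored[TOP_K:]

def build_catalog(catalog_tools) -> str:
    """Collapse deselected tools into a compact name-only catalog."""
    by_group: dict[str, list[str]] = {}
    for t in catalog_tools:
        by_group.setdefault(t["name"].split(".")[0], []).append(t["name"])
    lines = [f"## Other Available Tools ({len(catalog_tools)} additional)"]
    lines += [f"- **{g}**: {', '.join(names)}" for g, names in sorted(by_group.items())]
    return "\n".join(lines)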

For local models on-device, the same principle applies but the stakes are higher. Context windows on 2B-4B models are 2-8K tokens — every token matters. We extended MCPToolEmbeddingCache (renamed ToolEmbeddingCache) to also cache embeddings for native tool definitions, and filter based on relevance scoring before building the local prompt. The expected reduction is ~40-50% fewer tool tokens per prompt for local models.

We also extended the forced execution path. Local models can't make reliable JSON tool calls, so AICoven pre-executes tools deterministically before the model sees the prompt. Previously, forceNativeToolCall() only matched 2 tools via hardcoded keywords (current_time and web_search). Now it uses the relevance scorer to match any built-in tool above a confidence threshold, with expanded keyword patterns as a fast-path fallback.
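Roughly, the extended matcher looks like this. The keyword patterns, threshold value, and scorer signature are assumptions for illustration:

# Sketch of the extended forced-execution matcher.
import re

KEYWORD_PATTERNS = {
    "current_time": re.compile(r"\b(what time|current time|time in)\b", re.I),
    "web_search":   re.compile(r"\b(search|look up|latest news)\b", re.I),
}
CONFIDENCE_THRESHOLD = 0.55  # assumed value

def force_native_tool_call(message: str, scorer) -> str | None:
    """Pick a tool to pre-execute before the local model sees the prompt."""
    # Fast path: cheap keyword match, no embedding lookup needed.
    for tool, pattern in KEYWORD_PATTERNS.items():
        if pattern.search(message):
            return tool
    # Slow path: relevance scorer over all built-in tools.
    best_tool, best_score = scorer(message)  # -> (tool_name, score)
    return best_tool if best_score >= CONFIDENCE_THRESHOLD else None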


Phase 3: Tools via Code, Not Chat

This is the idea that got me most excited from Anthropic's research.

In a normal tool loop, every tool call is a full inference pass. The model calls web_search, gets 2,000 tokens of results back in its context, processes them, calls github.readFile, gets another 1,000 tokens back, processes those, calls github.writeFile... Each intermediate result stays in context forever, bloating the window.

Programmatic Tool Calling flips this. The model writes a Python script instead:

# Read 3 config files and compare versions
files = ["config/dev.yml", "config/staging.yml", "config/prod.yml"]
versions = {}
for f in files:
    content = github_readFile(repo="acme/api", path=f)
    # Parse the version line
    for line in content["content"].split("\n"):
        if line.startswith("version:"):
            versions[f] = line.split(":")[1].strip()

print(f"Version comparison: {versions}")

AICoven executes this in a sandbox. Tool function stubs are auto-generated from tool descriptions — github_readFile(), web_search(), shell_execute() — each routing back to the real AgentToolsService.execute(). Only stdout returns to the model's context. The 3 file contents (~3K tokens) never enter the conversation. The model just sees "Version comparison: {dev: 2.1.0, staging: 2.0.9, prod: 2.0.8}".
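The stub generation can be sketched in a few lines. Only AgentToolsService.execute is named above; make_stubs and the executor signature here are assumptions:

# Illustrative stub generation for the sandbox.
def make_stubs(tool_names, execute):
    """Build Python functions like github_readFile() that route back to
    the real tool executor and return its result inside the script."""
    stubs = {}
    for name in tool_names:                  # e.g. "github.readFile"
        py_name = name.replace(".", "_")     # -> "github_readFile"
        def stub(_name=name, **kwargs):
            # Real tool call; the result stays inside the sandbox
            # unless the script prints it.
            return execute(_name, kwargs)
        stubs[py_name] = stub
    return stubs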

Token savings: ~37% on multi-tool workflows.

The sandbox is locked down: restricted builtins, a whitelist of safe imports (json, math, re, datetime — no os, no subprocess), a 30-second timeout, and output capped at 10K characters. Read-only tools (27 of them) are safe by default; write tools (13) require explicit opt-in.
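For shape, a minimal sandbox sketch, assuming Python's exec with restricted globals. The 30-second timeout is assumed to be enforced by a worker process and is omitted; all names here are illustrative:

# Minimal sandbox: restricted builtins, import whitelist, capped output.
import builtins, contextlib, io

SAFE_BUILTINS = {k: getattr(builtins, k)
                 for k in ("print", "len", "range", "sorted", "min", "max",
                           "dict", "list", "str", "int", "float", "enumerate")}
SAFE_MODULES = {"json", "math", "re", "datetime"}   # no os, no subprocess
MAX_OUTPUT = 10_000

def safe_import(name, *args, **kwargs):
    if name not in SAFE_MODULES:
        raise ImportError(f"module {name!r} is not allowed in the sandbox")
    return __import__(name, *args, **kwargs)

def run_in_sandbox(script: str, tool_stubs: dict) -> str:
    env = {"__builtins__": {**SAFE_BUILTINS, "__import__": safe_import},
           **tool_stubs}
    out = io.StringIO()
    with contextlib.redirect_stdout(out):    # only stdout returns to the model
        exec(script, env)
    return out.getvalue()[:MAX_OUTPUT]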

Anthropic offers native server-side PTC via a beta header, and we enable that for the Anthropic path. But the application-layer sandbox works for every provider — OpenAI, Google, even local models. Any model that can write Python can use it.

For local models, multi-step tool execution takes a different shape. A 2B model isn't going to reliably write Python scripts. Instead, we're building a ToolChainResolver that detects composite intents via pattern matching and executes deterministic tool chains:

  • "Search for X and summarize"[web_search(query)] → feed result → model summarizes
  • "Read this file and explain"[file.read(path)] → feed result → model explains
  • "What time in Tokyo and London?"[current_time(tz: Asia/Tokyo), current_time(tz: Europe/London)]

No model inference for tool selection. The app layer detects the pattern, executes the chain, and gives the model the combined results to write the final answer.
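ToolChainResolver is still being built, so the sketch below is a guess at its shape: the pattern table, chain builders, and resolve signature are all illustrative assumptions:

# Sketch of the ToolChainResolver idea: detect a composite intent by
# pattern, run the chain deterministically, hand results to the model.
import re

CHAIN_PATTERNS = [
    (re.compile(r"search for (.+?) and summariz", re.I),
     lambda m: [("web_search", {"query": m.group(1)})]),
    (re.compile(r"read (.+?) and explain", re.I),
     lambda m: [("file.read", {"path": m.group(1)})]),
]

def resolve_and_execute(message: str, execute):
    for pattern, build_chain in CHAIN_PATTERNS:
        m = pattern.search(message)
        if m:
            # No model inference here: the app layer runs the whole chain.
            results = [execute(tool, args) for tool, args in build_chain(m)]
            return results   # fed to the model for the final answer
    return None              # no composite intent detected; normal path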


What Changed, Concretely

| Feature | Cloud Models (API) | Local Models (MLX) |
|---|---|---|
| Tool Examples | Injected per-provider (native Anthropic, text for others) | Baked into LoRA training data |
| Tool Search | Relevance scoring, compressed catalogs, 9 always-loaded | Same scoring, tighter budget, extended force-execution |
| Programmatic Tool Calling | Python sandbox, any provider | Deterministic tool chains via pattern matching |

All of this is live in the AICoven Local 1.2.0 betas. The cloud implementation is committed and tested (371/371 tests pass) and will ship in the AICoven cloud version within the next week.


The Takeaway

Anthropic's research is excellent. But their features are locked behind anthropic-beta headers and Anthropic-specific APIs. If you're building a multi-provider agent — or worse, a local-first one — you have to rebuild these concepts at the application layer.

The core insight translates universally: don't dump everything into context. Select what's relevant, compress the rest, and keep intermediate results out of the conversation when you can. Whether that's an API call to Claude or a regex parser wrestling with a Qwen model on a MacBook, the principle holds.

If you're running into the same tool-scaling problems, come compare notes — @aicoven.

About the Author

I'm Andreea, the creator of AICoven. I build local-first tools for developers who care about architecture, privacy, and prompt economics.

See more of my work at papillonmakes.tech →