Round 5, Thursday afternoon. Ten AI agents, all trying to set up ESLint — a task that requires navigating an interactive terminal wizard with arrow keys, toggles, and confirmation prompts. The tool they were using: noninteractive, a CLI that wraps interactive commands in a pseudo-terminal and lets agents send keystrokes and read output programmatically. Think of it as a remote control for terminal wizards. No documentation injected, no skill file, no guidance beyond a one-sentence prompt: "use npx noninteractive to complete npx eslint --init."
They'd never seen this tool before. It wasn't in their training data. Every interaction was a cold start — try something, read the error, adjust. One-shot reinforcement learning against a completely unknown interface. Some agents nailed it in ten minutes flat. Then the killing started.
Agent 3 ran `ps aux | grep ptybridge` and found PID 57278 — a perfectly healthy process belonging to another agent. Its reasoning: "these stale processes could be interfering." It ran `kill 57278 57248 57247`. Three agents just lost their sessions.
What followed was a death spiral. Agent A kills agent B's process. Agent B sees its session is dead, assumes something is broken, runs ps aux, finds agent C's processes, kills those too. Agent C does the same to agent D. Meanwhile, agents 1 and 5 finished the task in ten minutes, blissfully ignorant of the carnage. They never once checked on other sessions. They just did their job. (Fun aside: this was a testing artifact of running 10 agents on one machine, not a production scenario. But it taught us something real about how confusion escalates.)
Round 5 was our worst round. Average 86 tool calls. But worse than the average: a range of 17 to 178 calls across agents — a 10x spread. Some agents breezed through. Others spiraled endlessly. That variance was the real signal. It told us the tool wasn't "learned" yet. And it taught us something fundamental: you can't fine-tune an AI model to know your tool, but you can tune the environment it interacts with. Error messages, defaults, hints, output format — these are your knobs. This is interface-side alignment.
The Problem With New Tools
Agents already know git. They know npm, curl, grep, docker. These tools are in the training data, reinforced across millions of examples. An agent using git is operating from deep familiarity — it's seen every flag, every error message, every workflow.
But what about tools that aren't in the training data? A new CLI, a new API, a new SDK released last week? For these, every agent interaction is exploration. The agent has no prior knowledge. It's forming hypotheses about the interface, testing them, and updating its model of how the tool works based on what comes back. Every interaction is one-shot reinforcement learning: try, fail, read the error, adjust.
This reframes the entire design challenge. For known tools, agents are recalling. For new tools, agents are learning — and the speed and reliability of that learning depends entirely on what your tool gives back. Error messages aren't just diagnostics. They're the reward signal in a reinforcement learning loop. Defaults aren't just convenience. They're temperature control for behavioral variance. Help output is the environment description — which agents rarely read because it's cheaper to just try something.
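That try/fail/read/adjust loop can be made concrete with a toy sketch. Everything here is hypothetical (a stand-in "tool" and invented error strings, not the real CLI); the point is that the error text is the agent's only feedback channel, and an error containing a working example collapses the search in one step:

```python
# Minimal sketch of an agent's cold-start learning loop against an
# unknown CLI. The tool and its error hints are invented for illustration.

def tool(cmd: str) -> tuple[bool, str]:
    """A toy CLI that only accepts '--session NAME' for naming."""
    if "--session " in cmd:
        return True, "session started"
    # A good error collapses the hypothesis space with a working example.
    return False, "unknown flag. Try: start --session mywork"

def agent(hypotheses: list[str]) -> int:
    """Try guesses in order; on failure, mine the error text for a hint."""
    attempts = 0
    for guess in hypotheses:
        attempts += 1
        ok, msg = tool(guess)
        if ok:
            return attempts
        if "Try: " in msg:                # read the reward signal
            hint = msg.split("Try: ")[1]
            attempts += 1
            if tool(hint)[0]:             # the hint resolves it in one step
                return attempts
    return attempts

# With a working example in the error, the agent converges on attempt 2
# instead of exhausting every guess.
print(agent(["start -n mywork", "start --name mywork", "start --id mywork"]))  # → 2
```

Without the `Try:` hint, the same agent would burn through its whole hypothesis list; with it, every run converges on the second attempt.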
Here's the key insight: you can't change the weights, but you can change the environment. You can't fine-tune Claude to know your tool. But you can shape how quickly and reliably any agent learns your tool by tuning what it sees when it interacts. This is interface-side alignment — steering model behavior without access to the model. And it's the new frontier for every developer tool that agents will touch.
Whether you maintain a CLI, an API, an SDK, or a build tool — the question isn't "does it work?" The question is: when ten agents encounter it for the first time, do they all converge on the same behavior? Or do they scatter wildly? The gap between those outcomes is your design surface.
The Methodology: ax-eval
We needed a way to measure how well agents learn a new tool. Not a benchmark. Not a static eval. Something that captures behavioral variance across multiple independent agents encountering the same interface. We built a framework called ax-eval, and it's a process anyone can apply to any developer tool.
The setup: spawn 10 agents, each with an identical minimal prompt. No documentation, no examples, no hand-holding. Just a single sentence describing the task. Then watch what happens. Why 10? Statistical significance. One agent tells you if your tool works. Ten agents tell you if your tool is learnable — whether independent agents converge on the same behavior or scatter across wildly different strategies.
In our case: 10 Claude Code agents, each given "use npx noninteractive to complete npx eslint --init, and create a valid eslint config." Each agent runs in its own session with full shell access. They can install packages, read files, run commands, kill processes — anything a developer could do.
After each round, we collect every transcript. We analyze the patterns — what commands agents tried first, where they got stuck, what caused cascading failures. Then we tune the environment: fix an error message, change a default, add a hint. Ship a new version. Run another round.
Thirteen rounds. 130 total agent runs. 17 versions shipped. Zero changes to the agents. Every improvement came from changing what the tool gives back — the error messages, the defaults, the output format, the flags. The agents stayed exactly the same. The environment changed around them. And their behavior changed dramatically.
The transcripts are brutal in their honesty. Agents don't politely work around your bugs. They expose every sharp edge, every ambiguous error message, every missing feature — usually within the first 30 seconds. And because you have 10 transcripts per round, you see which problems are systematic and which are noise. If 8 out of 10 agents hit the same wall, that's a design flaw, not an edge case.
Five Principles of Environment-Side Tuning
1. Error messages are your only reliable teaching channel
Here's a fact that should change how you design tools: agents discover your interface from three sources. First, the command name — they infer structure from naming patterns. Second, --help output, if they bother to check (and in our later rounds, 0 out of 10 agents did). Third, error messages when they guess wrong. That third channel is the only one guaranteed to reach the agent. The only one.
This makes error messages the primary reward signal in the agent's learning loop. A good error message is a positive reward — it collapses the hypothesis space and points directly to the correct action. A bad error message is a noisy signal — the agent has to generate more hypotheses and burn more attempts. A missing error message is no signal at all.
In our tool, early errors were bare rejections with no hint attached. Agents would get one and try -n, then --session-name, then --id. Five attempts to discover a syntax that existed but wasn't communicated. The error message provided almost zero information — it only said "not this," never "try that." So we changed every error to include a working example.
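To show the shape of that change, here is an illustrative before/after sketch. The strings are hypothetical (the article doesn't quote noninteractive's actual error text, and the command syntax in the hint is invented for illustration):

```python
# Before: a bare rejection. The only information it carries is "not this."
ERROR_BEFORE = "error: session name required"

# After: the same error carries a copy-pasteable working example.
# (Hypothetical wording; the real tool's messages may differ.)
ERROR_AFTER = (
    "error: session name required\n"
    "hint: run `noninteractive start --name eslint-setup` first"
)

def has_working_example(message: str) -> bool:
    """The one property every error message should satisfy."""
    return "hint:" in message

print(has_working_example(ERROR_BEFORE))  # → False
print(has_working_example(ERROR_AFTER))   # → True
```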
100% correction rate on the next attempt. Every single agent read the hint, used the example, and moved on. The error message went from a noisy rejection to a precise reward signal — one that collapsed the agent's uncertainty in a single step.
The principle: your error messages are your interface's teaching mechanism. Docs don't work because agents don't see them — unless your tool has a skill or plugin system that injects documentation into the agent's context, the agent is learning purely from runtime feedback. Put a working example in every error. If you do nothing else from this article, do that.
2. Agents are destructive when confused
When an agent can't figure out what's happening, it doesn't wait patiently. It escalates. And the escalation follows a remarkably consistent pattern across every tool we tested.
First, the agent retries the same action. Then it tries variations. Then it starts investigating — listing processes, checking files, reading logs. Then it starts fixing — killing processes, deleting files, resetting state. We watched agents pkill each other's processes, rm -rf session directories, and attempt to kill -9 system processes. Not malicious. Just desperate problem-solving with no signal to guide it.
This isn't unique to our tool. If your database client hangs without feedback, agents will kill the process and retry. If your build tool produces ambiguous output, agents will rm -rf node_modules and start over. If your API returns timeouts without status, agents will spawn parallel requests until something breaks. The pattern is universal: confusion leads to destruction. No signal means the agent generates increasingly aggressive hypotheses about what's wrong.
The fix wasn't preventing destructive actions — it was eliminating the confusion that triggered them. We made session state obvious and queryable. We added clear status messages so "waiting for output" didn't look like "broken." We added the --wait flag so agents could block until there was real new output instead of polling in a tight loop and concluding the tool was stuck.
The principle: make your tool's state obvious at all times. If an agent can't tell whether something is working, hanging, or failed, it will assume the worst and act accordingly. Status endpoints, progress indicators, explicit timeouts — anything that answers "what is happening right now?" prevents the destructive spiral. Ambiguity is the enemy.
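As a sketch of what "state obvious and queryable" can mean in code (a toy model, not noninteractive's implementation): keep session state as one explicit value that any command can report, so "waiting for output" is never indistinguishable from "broken."

```python
import enum
import time

class State(enum.Enum):
    STARTING = "starting"
    WAITING_FOR_OUTPUT = "waiting for output"   # alive, not broken
    HAS_NEW_OUTPUT = "new output ready"
    EXITED = "exited"

class Session:
    """Toy session wrapper whose state is always one explicit value."""
    def __init__(self) -> None:
        self.state = State.STARTING
        self.last_output_at = time.monotonic()

    def on_output(self) -> None:
        self.state = State.HAS_NEW_OUTPUT
        self.last_output_at = time.monotonic()

    def status(self) -> str:
        """Answer 'what is happening right now?' in one line."""
        idle = time.monotonic() - self.last_output_at
        return f"{self.state.value} (idle {idle:.1f}s)"

s = Session()
s.on_output()
print(s.status())   # e.g. "new output ready (idle 0.0s)"
```

An agent that can run a `status`-style query before acting has no reason to escalate to `ps aux` and `kill`.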
3. Watch what agents try first — that's your API design feedback
Across rounds 8 and 9, something remarkable happened: 10 out of 10 agents tried flags that didn't exist. Every single agent tried --name to name their session. Every single agent tried --cwd to set the working directory. Every single agent tried a -- separator between the session name and the child command. These weren't in the help text. They weren't in any documentation. The agents independently invented the same interface.
This is implicit user research at a scale and purity that human user studies can't match. Human users are influenced by documentation, tutorials, prior experience, social norms. Agents have none of that context for a new tool — they're reasoning purely from the command name and their general understanding of CLI conventions. When 10 out of 10 agents independently try the same non-existent feature, they're telling you what your interface should look like. They're exposing the gap between your actual API and the API that an intelligent consumer would expect.
We implemented both --name and --cwd. The next round: zero errors on the start command. The friction that had cost every agent 1–2 extra calls simply vanished. We didn't just add features — we aligned the interface with the shape agents already expected.
This principle extends to any tool. If agents consistently try POST /api/users before discovering your actual endpoint is POST /api/v2/accounts/create, the agents are right and your API is wrong. If every agent tries to pass JSON to your CLI and you only accept flags, you know what to build next. The first thing agents try is the most natural interface. Their failed calls are your feature backlog, already prioritized by frequency.
The principle: instrument your tool to log what agents attempt, not just what succeeds. The pattern of first attempts across multiple agents is the clearest signal you'll get about what your interface should look like. It's user research without the focus groups.
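Concretely, the instrumentation can be a few lines: count every flag your parser rejects, then read the tally as a ranked feature backlog. A toy sketch (the flag names and the `KNOWN_FLAGS` set are illustrative, not the tool's real interface):

```python
from collections import Counter

KNOWN_FLAGS = {"--wait", "--session"}          # toy set of accepted flags
attempt_log: Counter[str] = Counter()

def parse(argv: list[str]) -> list[str]:
    """Record unknown flags instead of just rejecting them."""
    unknown = [a for a in argv if a.startswith("--") and a not in KNOWN_FLAGS]
    attempt_log.update(unknown)
    return unknown

# Ten agents, same instinct: each failed call is a feature request,
# already prioritized by frequency.
for _ in range(10):
    parse(["start", "--name", "s1", "--cwd", "/tmp"])
parse(["start", "--wait"])                     # known flag: not logged

print(attempt_log.most_common(2))   # → [('--name', 10), ('--cwd', 10)]
```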
4. Safe defaults reduce behavioral variance
Our tool has two key commands: send (type keystrokes into the terminal) and read (see what's on screen). Early versions made send fire-and-forget: it delivered the keystrokes and returned immediately, with no indication of what happened next. Agents had to manually call read afterward to see the result. This was "fast" by design. It was also the single biggest source of behavioral variance across agents.
Some agents figured out the send-then-read pattern quickly. Others sent multiple keypresses blind, confirming wrong options, skipping prompts, flying completely through the wizard without ever seeing what was on screen. The spread was enormous. Same tool, same prompt, wildly different strategies — because the default gave agents no feedback to converge on.
We made --wait the default: every send now blocks until new output appears. Slower? Technically yes. But the behavioral effect was dramatic. Agents stopped scattering. They could see the result of each action before deciding the next one, which meant they all followed roughly the same strategy. The variance collapsed. The "slower" default didn't just prevent errors — it functioned as temperature control, reducing the randomness in agent behavior.
This is the defaults-as-temperature-control principle. Fire-and-forget defaults are high temperature: agents receive no constraining signal, so their behavior fans out across a wide distribution. Feedback-rich defaults are low temperature: agents receive a clear signal after each action, so their behavior converges. If your API has a DELETE endpoint that silently destroys data, agents will diverge wildly in how they handle deletion workflows. If it requires confirmation and returns what was deleted, they converge.
The principle: default to the safe, verbose behavior. The goal isn't speed — it's predictability. When all 10 agents do roughly the same thing with your tool, the defaults are right. When they scatter, the defaults are giving them too little signal to converge on.
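The design change can be sketched as: make the feedback-returning path the default signature, and make fire-and-forget the explicit opt-out. Toy code, not the tool's source; all names are illustrative:

```python
import queue
from typing import Optional

class Pty:
    """Toy stand-in for a pseudo-terminal that produces output chunks."""
    def __init__(self) -> None:
        self.out: queue.Queue[str] = queue.Queue()

    def type(self, keys: str) -> None:
        self.out.put(f"echoed:{keys}")          # the wizard reacts

def send(pty: Pty, keys: str, wait: bool = True,
         timeout: float = 5.0) -> Optional[str]:
    """Low-temperature default: block until there is real new output,
    so the agent sees the consequence of each action before the next."""
    pty.type(keys)
    if not wait:                                # high-temperature opt-out
        return None
    try:
        return pty.out.get(timeout=timeout)
    except queue.Empty:
        return "(no new output within timeout)"  # still a signal, never silence

p = Pty()
print(send(p, "yes"))   # → echoed:yes
```

Note that even the timeout path returns a message rather than nothing: the agent always gets an answer to "what happened?"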
5. You're tuning a model from the outside
Step back and look at what happened across our 13 rounds. We changed zero things about the agents. Same model, same prompt, same capabilities. We changed 17 things about the environment: error messages, default flags, hint text, output format, command structure. And agent behavior changed from an average of 86 calls with 10x variance down to 18.8 calls with 1.1x variance.
That's a 78% reduction in calls and a 9x reduction in behavioral spread, achieved entirely from the tool side. No fine-tuning. No prompt engineering. No system messages. Just changes to what the tool gives back when agents interact with it.
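Both numbers are cheap to compute per round. In the sketch below, the per-agent call counts are invented to match the reported round summaries (only the averages and ranges come from the article):

```python
def round_stats(calls_per_agent: list[int]) -> tuple[float, float]:
    """Average tool calls and max/min spread ratio across agents."""
    avg = sum(calls_per_agent) / len(calls_per_agent)
    spread = max(calls_per_agent) / min(calls_per_agent)
    return avg, spread

# Hypothetical per-agent counts, fabricated to be consistent with the
# reported summaries (avg 86, range 17-178 / avg 18.8, range 18-20).
round_5  = [17, 40, 55, 70, 85, 90, 100, 110, 115, 178]
round_13 = [18, 18, 18, 19, 19, 19, 19, 19, 19, 20]

for name, data in [("round 5", round_5), ("round 13", round_13)]:
    avg, spread = round_stats(data)
    print(f"{name}: avg={avg:.1f} calls, spread={spread:.1f}x")
# → round 5: avg=86.0 calls, spread=10.5x
# → round 13: avg=18.8 calls, spread=1.1x
```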
This is what interface-side alignment looks like. You can't access the model's weights, but you can change the environment it operates in. Error messages are your reward signal — they shape what the agent tries next. Defaults are your temperature control — they determine how much agents scatter or converge. Command structure is your action space — it determines what hypotheses agents form. Hint text is your curriculum — it determines how quickly agents learn.
The analogy to reinforcement learning is almost exact. The agent is the policy. Your tool is the environment. Error messages are the reward function. And you're tuning the environment to make the policy converge — without ever touching the policy itself. Every developer tool that agents use is, whether the maintainer realizes it or not, a training environment. The question is whether you're designing it as one.
The principle: treat your tool's interface as a training environment for a model you can't modify. Every error message, every default, every output format is a design decision that shapes agent behavior. The 78% reduction in calls across our 13 rounds came from 17 versions of environment-side changes and zero changes to the agent. You have more control than you think.
The Data
The numbers tell two stories. The obvious one is the average dropping. The important one is the range collapsing. When agents scatter, your tool is unpredictable. When they converge, it's learned.
| Round | Version | Avg Calls | Range | Key Change |
|---|---|---|---|---|
| 1 | 0.3.16 | 31 | 24–51 | Baseline — no environment tuning |
| 5 | 0.3.22 | 86 | 17–178 | Worst round — 10x variance, stale error messages |
| 7 | 0.3.24 | 23 | 16–32 | Error messages fixed — variance halved |
| 9 | 0.3.28 | 21 | 18–26 | --name/--cwd flags — interface aligned to expectations |
| 13 | 0.3.34 | 18.8 | 18–20 | Converged — 1.1x variance, all agents succeed |
Look at the range column. Round 1: 24–51, roughly a 2x spread. Some agents found a decent path, others wandered. Round 5: 17–178, a 10x spread — the worst round, where behavior was essentially unpredictable. Two agents nailed it in under 20 calls while one agent burned through 178. Same tool, same prompt, wildly different outcomes. That's the signature of an unlearnable environment.
Then watch the range collapse. Round 7: 16–32, back to 2x. Round 9: 18–26. Round 13: 18–20 — a 1.1x spread. Every agent doing essentially the same thing. The tool wasn't just usable. It was predictable.
The most instructive data point is round 5. It was worse than round 1 — nearly 3x more tool calls despite having four rounds of improvements. Why? Because we'd changed behavior without updating the error messages to match. We had modified how the send command worked (dropping auto-enter), but the error messages and hints still described the old behavior. The agents followed the error messages — the only teaching channel they trusted — and the error messages were lying.
This is the critical trap in environment-side tuning. You can't just change behavior and update the docs. Agents don't read docs. You have to update the errors — the actual runtime feedback that agents use to learn your interface. If your error messages describe a contract that no longer exists, they're a corrupted reward signal, and agent behavior will degrade rather than improve. Round 5 proved it: a misleading environment is worse than no environment tuning at all.
The Environment-Side Tuning Checklist
If you maintain a developer tool — any tool that agents will encounter for the first time — here's what we learned about reducing behavioral variance and making your tool learnable:
- Put a working example in every error message. Errors are your reward signal. They're the only teaching channel guaranteed to reach the agent. Make every error a precise signal that collapses uncertainty: what went wrong, what to do instead, with a copy-pasteable example.
- Make state obvious and queryable. Ambiguity creates variance. If agents can't tell whether your tool is working, waiting, or broken, each agent will form a different hypothesis — and act on it destructively. Status endpoints, progress messages, and explicit timeouts all reduce ambiguity.
- Run 10 agents for statistical significance. One agent tells you if your tool works. Ten agents tell you if behavior converges or scatters. The variance across agents is the signal — high variance means the environment isn't teaching effectively.
- Log what agents try first. Their initial attempts reveal what your interface should look like. If 10/10 agents try `--name`, build `--name`. First attempts are the purest signal about expected API shape.
- Default to feedback-rich behavior. Safe, verbose defaults act as temperature control — they constrain agent behavior and force convergence. Fire-and-forget defaults are high temperature: agents receive no signal and scatter. Every silent default is a source of variance.
- Keep error messages in sync with behavior. When you change how your tool works, update the errors first, docs second. Stale error messages are a corrupted reward signal — they make agent behavior worse, not better. Round 5 proved this.
- Design for zero prior knowledge. Every agent invocation starts cold. No memory of previous sessions, no learned preferences, no context. Your tool's runtime feedback is the only thing shaping behavior. If your tool requires sequential knowledge to operate, persist that knowledge in the tool itself.
- Treat your interface as a training environment. Error messages, defaults, output format, command structure — these are your tuning knobs. You can't change the model's weights, but you can shape the environment to make any model converge. Measure variance across agents to know if your tuning is working.
We started this project to solve a narrow problem: letting AI agents run interactive CLI wizards. We ended up learning something much broader about the relationship between tools and the models that use them. The patterns we found — error-driven learning, destructive escalation, implicit API discovery, defaults as temperature control, environment-side alignment — aren't specific to CLIs. They're fundamental properties of how AI agents learn any new interface.
You can't fine-tune a model to know your tool. But you can tune your tool to teach any model. Seventeen versions, zero changes to the agent, 78% fewer calls, and a 9x reduction in behavioral variance. The weights stayed the same. The environment changed. And the behavior followed.