Everyone's obsessing over which model to use. GPT-5, Claude, Qwen, Llama. That conversation matters, model quality does affect outcomes, but after building AI agents into real internal tooling at Driveline Baseball I'm pretty convinced the model is only half of what determines whether something actually works.

The other half is the scaffolding around it: how you build context, how you structure tool calls, how you handle failures, and how you define done. A well-designed agent scaffold makes a good-enough model useful on a real task. A sloppy one makes a great model useless. I've watched both happen.

This matters especially under compute constraints. I'm running Qwen3-Coder-30B on a single A10 GPU on Modal, which is real hardware with real limits. Context windows fill up, latency matters, and a 30B model needs more guidance than a 200B system would naturally give itself. The scaffold isn't decoration here; it's doing structural work.

What the scaffold actually needs to answer

On every turn, a coding agent needs to answer 3 questions: what context does the model need right now, what tools can it call with what constraints, and how do I know when the task is actually done? Most implementations answer the first question by dumping everything into context and hoping the model sorts it out. They skip the second entirely. And they answer the third with "when it stops calling tools," which fails in basically every non-trivial case.

How I designed mine

The code index in clivic is SQLite-backed with symbol extraction, chunked source, and import graph edges, but it doesn't run on startup. It builds the first time a retrieval tool actually needs it. This isn't just a startup-time optimization; it's a constraint on how the model gets information. It asks for what it needs, you give it exactly that, and the context window stays clean for the actual task.

The system prompt encodes tool-use rules as behavioral contracts: batch independent reads in one turn, use multi_edit instead of sequential single-file edits, run grep to find the exact region before reading a whole file. On a 30B model with a limited context window, wasted turns are wasted tokens. The prompt engineering that pays off most is the kind that stops the model from burning turns on things it doesn't need to do.

At specific turn counts, the runner injects structured reminders into the conversation as user messages. Turn 3: have you understood the task? Turn 6: have you planned the changes? Turn 10: have you confirmed your exact edit locations? Near the turn budget the message shifts to something like "X turns remaining, are you done?" This sounds almost patronizing written out, but it materially improves task completion on a smaller model that can't self-organize as naturally as a frontier system running on unlimited compute.

The part I care most about

A run is FAIL by default. It only resolves as success if the agent calls mark_complete, or if a real validation command (unit tests, a build, whatever the project uses) exits 0 after the agent made edits. There's no "I think I'm done," no defaulting to success because it ran out of turns and didn't crash. This came directly from watching agents on less disciplined setups confidently declare they'd finished while leaving tests broken, and deciding that optimistic defaults were the wrong call.

When context gets long, the runner compacts older tool observations by replacing them with stubs. It keeps the fact that a tool ran without keeping the full output in context. This matters because without it the model gets buried in its own history on long tasks and starts losing track of what it already did, which wastes turns and produces worse edits.

You have to measure it

None of this matters if you can't measure it. The companion project clivic-eval is a Python benchmark that runs real coding tasks against the model with ground-truth pass/fail criteria. Before a feature ships to clivic, I have a test that checks whether the model can reliably do what that feature requires, not just whether it seemed to work once on my prompt.

A measured pass rate across 40 task instances is a different thing than "it worked in my test run." When you're betting real internal workflows on a 30B model you're hosting yourself, the benchmark is what earns that bet. Skipping it means you're shipping on vibes, and that catches up with you fast.

The model is one variable. The scaffolding around it is where the actual engineering happens.