shanraisshan/claude-code-best-practice / Chapter 2


Chapter 2 of 5 8m Article Learning path

01 — The Wrapper Does the Heavy Lifting

Your teammate opens the same project, types the same prompt, gets a different endpoint. You open your terminal to debug. The endpoint works perfectly. Both of you used Claude Code. Both of you got Claude's "best effort." So why are the two best efforts different?

I used to think the wrapper around Claude Code was theatre.

A skill is just a prompt. A command is just a prompt. A subagent is just a prompt with extra steps. Eventually all of it gets reduced to tokens the model sees, and a strong prompt alone should be enough.

This is the most common reduction among experienced Claude Code users, and it is wrong in a way that matters. Not in a subtle, philosophical way. In a *your output is wrong by an order of magnitude* way.

The reduction is right about one thing: at the moment of inference, the model only sees tokens. That's it. No magic, no hidden state. Tokens in, tokens out.

But that's not the layer where engineering happens. The engineering happens before tokens arrive, after tokens are produced, across sessions, and between contexts. At every one of those layers, the harness is doing work a prompt cannot do — and the most expensive mistake you can make as a practitioner is to mistake the model's view of the world for the system's view of the world.

Let me show you what I mean.

Six tokens

You sit down at your keyboard on a Tuesday afternoon. You open Claude Code in a half-million-line TypeScript monorepo. You type:

add a notes feature

Six tokens. You hit enter.

Now consider what you believe Claude sees.

If you're like most engineers I talk to, the answer is some version of: "Claude sees my six tokens, plus maybe a CLAUDE.md if we have one, plus whatever files it reads along the way."

That mental model is roughly 5% of the story.

What the model actually sees

The harness assembles something called the effective context — the actual token stream that reaches the model at inference time. For a six-token user prompt in Claude Code, the effective context looks like this:

flowchart TD
    A["User prompt<br/>'add a notes feature'<br/>~6 tokens"] --> B[Effective context at inference<br/>~5,000–50,000+ tokens]
    B --> C1["CLAUDE.md<br/>(project conventions)"]
    B --> C2[".claude/rules/*.md<br/>(lazy-loaded via paths: frontmatter)"]
    B --> C3["Modular system prompt<br/>(110+ fragments, conditionally loaded)"]
    B --> C4["Tool definitions<br/>(Read, Write, Bash, Agent, etc.)"]
    B --> C5["Environment context<br/>(cwd, git status, platform)"]
    B --> C6["Prior turn history"]
    B --> C7["Files read via Read/Grep"]
    B --> C8["User's 6-token request"]

Notice what's missing from your mental model. The CLAUDE.md is obvious. But the harness also injects matching rules from .claude/rules/ based on which files Claude is touching. It loads modular system prompt fragments — over a hundred of them, conditionally — that you cannot hand-author or swap in. It prepends tool definitions so the model knows what tools it can call. It appends environment context so the model knows what platform it's on. It carries forward every prior turn. It reads files via tools as the agentic loop progresses.
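
The path-based rule loading described above can be sketched in a few lines. This is an illustration, not Claude Code's actual loader: the `Rule` class, the `rules_for` function, and the two rule files are hypothetical, and Python's `fnmatchcase` stands in for whatever glob semantics the real `paths:` matching uses.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    name: str                # hypothetical rule file name
    paths: tuple[str, ...]   # glob patterns, standing in for `paths:` frontmatter
    body: str                # text that would be injected into the context


def rules_for(touched_files: list[str], rules: list[Rule]) -> list[Rule]:
    """Return the rules whose patterns match at least one touched file."""
    return [
        rule
        for rule in rules
        if any(fnmatchcase(f, p) for f in touched_files for p in rule.paths)
    ]


# Hypothetical rule set; the file names and patterns are made up for the example.
RULES = [
    Rule("api.md", ("src/api/*",), "Routes return a {data, error} envelope."),
    Rule("ui.md", ("src/ui/*",), "Sidebar entries reuse existing icons."),
]

loaded = rules_for(["src/api/notes.ts"], RULES)
print([rule.name for rule in loaded])  # ['api.md']
```

The point of the sketch is the input: the selection depends on which files are being touched at runtime, which a static typed prompt has no way to know in advance.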

The user types 6 tokens. The model sees somewhere between 5,000 and 50,000 tokens at inference, most of which the user did not write and cannot directly see.

Output quality is a function of the effective context. Output quality is *not* a function of your typed prompt.

That's the asymmetry, and it's not subtle.

The two meanings of "prompt"

Here's where the reduction collapses. The word *prompt* is doing two jobs in everyday speech:

| Meaning | Who controls it | Typical size |
|---|---|---|
| (a) What the user typed | You | ~6–60 tokens |
| (b) What the model sees at inference | The harness | ~5,000–50,000+ tokens |

In a chatbot, (a) and (b) are the same thing. In Claude Code, they are radically different.

A "strong prompt" — meaning, a strong (a) — controls a sliver of (b). The harness constructs the rest. The harness's job is precisely to make (b) much richer than (a).
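
To put a number on "a sliver", here is the arithmetic using only the token ranges from the table above. The function name `typed_share` is mine, and the figures are the article's rough ranges, not measurements.

```python
def typed_share(typed_tokens: int, effective_tokens: int) -> float:
    """Fraction of the effective context that the user actually typed."""
    if effective_tokens <= 0 or not 0 <= typed_tokens <= effective_tokens:
        raise ValueError("expected 0 <= typed_tokens <= effective_tokens, effective > 0")
    return typed_tokens / effective_tokens


# (typed, effective) pairs taken from the ranges in the table above.
for typed, effective in [(6, 5_000), (6, 50_000), (60, 5_000)]:
    share = typed_share(typed, effective)
    print(f"{typed} typed / {effective} seen = {share:.2%}")
# 6 typed / 5000 seen = 0.12%
# 6 typed / 50000 seen = 0.01%
# 60 typed / 5000 seen = 1.20%
```

Even in the most generous case, a long typed prompt against a small effective context, the typed part is about one percent of what the model reads.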

If you confuse (a) with (b), you'll spend your time rewriting the same six tokens, trying to outflank the model. You won't. The model isn't the bottleneck. The wrapper around the model is the bottleneck — and the wrapper is what you should be writing.

What prompts cannot do

There are ten architectural capabilities of the harness that operate at layers where prompts have no access. Each row in this table is a place where "strong wording" is not a substitute.

| # | Capability | What it does | Why a prompt can't replicate |
|---|------------|--------------|------------------------------|
| 1 | Context isolation | Subagents run in separate context windows | A prompt fills one window. N parallel subagents give ~N× effective context. |
| 2 | Harness-enforced tool restrictions | allowed-tools / disallowedTools block tools before the model can use them | Prompt instructions are advisory; the model can ignore them. Deny rules cannot be ignored. |
| 3 | Lazy-loaded rules & memory | paths: frontmatter and descendant CLAUDE.md files load only when Claude touches matching paths | A prompt is static: it cannot conditionally load based on which files are being read at runtime. |
| 4 | Hooks: deterministic code execution | Shell commands run at lifecycle events (PreToolUse, PostToolUse, Stop) and can block tool calls | A prompt cannot intercept its own tool calls. Hooks execute even if the model doesn't "want" them to. |
| 5 | Model routing | model: haiku or model: opus routes a call to a different model endpoint | No token in the prompt can change which model answers. |
| 6 | Parallelism | Multiple subagents execute concurrently | A prompt is sequential. The harness schedules and collects results from parallel processes. |
| 7 | Cross-session persistence | Memory system and settings hierarchy persist across conversations | A prompt dies when the session ends. |
| 8 | Modular system prompt | The CLI loads 110+ system prompt fragments conditionally based on features activated | A user cannot hand-author or swap in the CLI's internal prompt fragments. |
| 9 | Skill preloading | skills: field injects a skill's full content into a subagent's starting context | The user cannot pre-stuff another agent's context; only the harness loader can. |
| 10 | Permission classification | auto permission mode uses a background classifier to pre-approve or block tool calls | A prompt cannot add a pre-execution safety layer to itself. |
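
Row 4 is the easiest one to make concrete. Below is a minimal sketch of the decision logic a PreToolUse hook might run. It assumes the hook is handed a JSON event with `tool_name` and `tool_input` fields and that exit code 2 signals "block this call"; check the hooks documentation for your version before relying on either. The `decide` function and the `BLOCKED_SUBSTRINGS` policy are hypothetical.

```python
# Sketch of a PreToolUse-style check. In a real hook script the event would be
# read from stdin (json.load(sys.stdin)) and the script would exit with the
# returned code; here sample events are passed in directly so the logic can be
# run on its own.

BLOCKED_SUBSTRINGS = ("rm -rf", "git push --force")  # hypothetical team policy


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message). Assumption: 0 allows the call, 2 blocks it."""
    if event.get("tool_name") != "Bash":
        return 0, ""  # this policy only inspects shell commands
    command = event.get("tool_input", {}).get("command", "")
    for needle in BLOCKED_SUBSTRINGS:
        if needle in command:
            return 2, f"Blocked by hook: command contains {needle!r}"
    return 0, ""


sample_events = [
    {"tool_name": "Bash", "tool_input": {"command": "pytest -q"}},
    {"tool_name": "Bash", "tool_input": {"command": "rm -rf build/"}},
]
for event in sample_events:
    print(decide(event))
```

Nothing in this code consults the model. That is the whole difference from writing "never run rm -rf" in a prompt: the check is ordinary code, and it runs whether or not the model was inclined to comply.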

The pattern across all ten rows is the same: the harness controls what the system does at layers the model cannot reach. Before tokens arrive, after tokens are produced, across sessions, across contexts, across processes.

Read that sentence again. It's the article in one line.
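
Rows 1 and 6 can be pictured with a toy fan-out. This is a sketch of the shape only: `run_subagent` and `fan_out` are hypothetical stand-ins, threads stand in for separate subagent processes, and no model is called. What it shows is that each worker builds a private context and only a short summary flows back to the parent.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str, files: list[str]) -> str:
    """Hypothetical worker: fills its own context, returns only a summary."""
    private_context = [f"contents of {path}" for path in files]  # never shared
    return f"{task}: reviewed {len(private_context)} file(s)"


def fan_out(tasks: dict[str, list[str]]) -> list[str]:
    """Run every task concurrently and collect summaries in submission order."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(run_subagent, task, files) for task, files in tasks.items()]
        return [future.result() for future in futures]


# File names other than routes/todos.py are made up for the example.
parent_context = fan_out({
    "backend": ["routes/todos.py", "models.py"],
    "frontend": ["Sidebar.tsx"],
})
print(parent_context)
# ['backend: reviewed 2 file(s)', 'frontend: reviewed 1 file(s)']
```

The parent ends up holding two short strings rather than three files' worth of content, which is where the "~N× effective context" figure in row 1 comes from.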

Why this matters for your Tuesday afternoon

Let's return to your six-token prompt: *add a notes feature.*

A strong-prompt-only advocate will tell you to write something like:

Add a notes feature. Follow the existing patterns in routes/todos.py for the backend route, use the same Pydantic model location, mirror the auth dependency. For the frontend, integrate into the existing sidebar — slot reserved, icon reused, route registered. Add test_notes.py matching test_todos.py style. Match the {data, error} response envelope. One commit, clean diff, all tests green.

This is a better prompt. It will produce a better result — for one engineer, on one day, in one session, on one feature.

But the engineer who wrote that prompt cannot be everywhere. They cannot be on every team, in every codebase, for every intern's 2am prompt. The strong prompt doesn't generalize. The strong prompt doesn't survive a teammate. The strong prompt doesn't compound.

The wrapper does.

If your project has a CLAUDE.md that says *"backend routes follow routes/todos.py; new tests mirror test_todos.py; response envelope is {data, error}"*, then any prompt, including a sloppy six-token prompt from an intern, starts from the same conventions and is steered toward the same result. The wrapper carries the convention. The convention carries across the team. The team's prompts get better not because anyone rewrote them but because the wrapper grew.

The kitchen analogy

A useful way to think about it:

| Layer | Chatbot | Claude Code |
|---|---|---|
| Recipe | The user's message | The user's message + harness-assembled context |
| Kitchen | None, just a student | Tools, hooks, memory, parallel workers, lifecycle events |

You can write the world's best recipe. Without a kitchen, you cannot cook at scale.

The wrapper isn't a more verbose prompt. It's the kitchen. It's the prep station, the ingredient sourcing, the parallel line cooks, the dishwasher that runs after every dish, the recipe book that every cook in the kitchen can consult. A better recipe helps. But the kitchen is what scales.

What the reduction gets right

I don't want to oversell this. There are regimes where the reduction holds.

For an atomic single-shot task such as *"write me a recursive Fibonacci function"*, the harness contributes almost nothing. Hand the same tokens to the same model and you get the same distribution of outputs whether they arrived via a skill, a command, or a raw prompt.

Output quality ≈ prompt quality.

This is the regime where Claude Code offers little value over a plain chatbot. It is also the regime the reduction implicitly assumes — and precisely the regime real engineering work is not in.

Real engineering work has context. Real engineering work spans sessions. Real engineering work requires determinism, isolation, parallelism, and persistence. Real engineering work is what the wrapper is for.

The correct mental model

Prompts control what the model is asked to do.
The harness controls what the system does at layers the model cannot reach.

Features are not prompts with extra steps. They are harness-level primitives — deterministic execution, context architecture, infrastructure routing — that operate at layers where the model has no voice.

A useful practitioner rule:

  • Use prompts for what the model is asked to do in this session.
  • Use the wrapper for what the system does before tokens arrive, after tokens are produced, across sessions, and between contexts.

The two are complementary. Prompts without a wrapper are recipe cards in an empty kitchen. A wrapper without prompts is a kitchen that nobody tells what to cook. You need both.

But the wrapper scales. The prompt doesn't.

What I revised while writing this

I started this chapter believing prompts could do the wrapper's job. The harness proved me wrong.

The model isn't the bottleneck — the wrapper is. And the wrapper is built out of files, not sentences: CLAUDE.md, .claude/rules/*.md, .claude/skills/, .claude/agents/, .claude/commands/, hooks, MCP servers. None of those are prompts. All of them are configuration. All of them live in your repository, version-controlled, reviewed, evolved.

The next chapter is about the most important file in that wrapper — the file that, if you write nothing else, you should write first.

CLAUDE.md. The most important file you haven't written.