# One Tool, One Thought Per Step

> The agent does not chain tools. Every round-trip is one call to one tool called `AgentOutput`, and that tool's input schema is built to make the model reflect before it acts. The architecture treats reflection as a structural property, not a prompt-level suggestion.

## Key Takeaways

- Each ReAct step is exactly one tool call. There is no multi-tool reasoning within a step.
- The single tool's schema carries `evaluation_previous_goal`, `memory`, `next_goal`, and `action`. Only `action` is required by the schema; the system prompt pushes the model to fill in the reflection fields on every step.
- Tools are async, isolated from each other, and share state only through `BrowserState` snapshots.
- Cancellation is a single `AbortSignal` flowing through the LLM fetch, every tool, and the JavaScript runtime.

---

The most surprising file in the page-agent repository is not the DOM extractor. It is packages/core/src/PageAgentCore.ts — 660 lines that define the entire agent loop, and which look, on first reading, almost nothing like the OpenAI tool-calling examples I had internalized from the SDK documentation.

```typescript
// PageAgentCore.ts:285-289
const result = await this.#llm.invoke(messages, macroTool, signal, {
    toolChoiceName: 'AgentOutput',
    normalizeResponse: (res) => normalizeResponse(res, this.tools),
})
```

One tool. One call. One signal. That call is the heart of the loop. The `#packMacroTool()` method, defined further down the same file, is the trick that makes this work: it merges every registered tool into a single OpenAI tool definition called `AgentOutput`. The model never picks among `click_element_by_index`, `input_text`, and `scroll`. It picks *one* tool, `AgentOutput`, whose `action` field is a discriminated union over the registered tool set.

I had expected an OpenAI-style tool loop: the model emits multiple tool calls, the runtime executes them, the results come back, the model reasons again. Page-agent explicitly forbids that. parallel_tool_calls: false is set in packages/llms/src/OpenAIClient.ts:45. The result is that every round-trip is one thought.
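To make the packing step concrete, here is a minimal sketch of how several tools could be folded into one `AgentOutput` definition and sent with parallel calls disabled. The `ToolSpec` and `FunctionTool` shapes, the `packMacroTool` and `buildRequestBody` helpers, and the JSON Schema layout are my own assumptions for illustration; they are not page-agent's actual code. The `tools`, `tool_choice`, and `parallel_tool_calls` request fields are the standard OpenAI chat completions parameters.

```typescript
// Hypothetical shapes for illustration; not page-agent's real types.
interface ToolSpec {
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the tool's input
}

interface FunctionTool {
  type: "function";
  function: { name: string; description: string; parameters: Record<string, unknown> };
}

// Fold every registered tool into one function definition named AgentOutput.
// Each tool becomes one branch of `action`: an object with exactly one key,
// the tool name, whose value is that tool's input.
function packMacroTool(tools: Record<string, ToolSpec>): FunctionTool {
  const actionBranches = Object.entries(tools).map(([name, spec]) => ({
    type: "object",
    description: spec.description,
    properties: { [name]: spec.parameters },
    required: [name],
    additionalProperties: false,
  }));

  return {
    type: "function",
    function: {
      name: "AgentOutput",
      description: "Reflect on the previous step, then choose exactly one action.",
      parameters: {
        type: "object",
        properties: {
          evaluation_previous_goal: { type: "string" },
          memory: { type: "string" },
          next_goal: { type: "string" },
          action: { anyOf: actionBranches },
        },
        required: ["action"], // the reflection fields stay optional
      },
    },
  };
}

// Request body for an OpenAI-compatible chat completions endpoint.
function buildRequestBody(model: string, messages: unknown[], macroTool: FunctionTool) {
  return {
    model,
    messages,
    tools: [macroTool],
    // Force the model to call the single macro tool...
    tool_choice: { type: "function", function: { name: macroTool.function.name } },
    // ...and forbid several tool calls in one response.
    parallel_tool_calls: false,
  };
}

const demoTools: Record<string, ToolSpec> = {
  click_element_by_index: {
    description: "Click an indexed element.",
    parameters: { type: "object", properties: { index: { type: "integer" } }, required: ["index"] },
  },
  scroll: {
    description: "Scroll the page.",
    parameters: { type: "object", properties: { down: { type: "boolean" } } },
  },
};

const macroTool = packMacroTool(demoTools);
const requestBody = buildRequestBody("example-model", [], macroTool);
console.log(JSON.stringify(requestBody.tool_choice));
```

However many tools are registered, the model is only ever offered one function, so "which tool?" becomes a field inside the answer rather than a choice among calls.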

## The reflection schema

The Zod schema that defines what the model must produce on every step is the most consequential piece of code in the repository. It is forty lines that define the agent's reasoning discipline.

```typescript
// PageAgentCore.ts:395-401
const macroToolSchema = z.object({
    evaluation_previous_goal: z.string().optional(),
    memory: z.string().optional(),
    next_goal: z.string().optional(),
    action: actionSchema,
})
```

Three reflection fields, then an action. Each step the model is asked four things, in order:

1. **Did the previous step succeed?** `evaluation_previous_goal` is a one-sentence verdict: success, failure, or uncertain. The system prompt makes the verdict mandatory and forbids the lazy answer:

   > Never assume an action succeeded just because it appears to be executed in your last step in `<agent_history>`. If the expected change is missing, mark the last action as failed (or uncertain) and plan a recovery.

2. **What should I remember?** `memory` is a free-text working memory. The system prompt recommends 1–3 sentences of "specific memory of this step and overall progress." Counts of pages visited, IDs already tried, blockers encountered: anything that should survive into the next step's context goes here.

3. **What is the immediate next goal?** `next_goal` is a single sentence. Not a plan, a goal. The system prompt explicitly tells the model not to plan the whole task here: "your next immediate goal and action to achieve it, in one clear sentence."

4. **Which tool, with what input?** `action` is the discriminated union. The model picks one tool from the registered set and provides its typed input.

The schema supplies the structure; it does not strictly enforce the discipline. A model that emits only `{"action": {"click_element_by_index": {"index": 33}}}` will pass schema validation, because the reflection fields are all optional. In practice the prompt steers models to fill them, and the structural guarantee is narrower: *if* the model reflects, it has a place to put each piece.
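The optional-field behavior is easy to check by hand. The sketch below is a dependency-free stand-in for the Zod schema, written only to show which inputs pass; `parseAgentOutput` and `missingReflection` are hypothetical helpers of mine, not functions from the repository.

```typescript
interface AgentOutput {
  evaluation_previous_goal?: string;
  memory?: string;
  next_goal?: string;
  action: Record<string, unknown>; // exactly one key: the chosen tool name
}

const REFLECTION_FIELDS = ["evaluation_previous_goal", "memory", "next_goal"] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hand-rolled equivalent of the schema: optional string reflections, required action.
function parseAgentOutput(raw: unknown, toolNames: string[]): AgentOutput {
  if (!isRecord(raw)) throw new Error("output must be an object");

  // Reflection fields may be absent, but must be strings when present.
  for (const field of REFLECTION_FIELDS) {
    const value = raw[field];
    if (value !== undefined && typeof value !== "string") {
      throw new Error(`${field} must be a string`);
    }
  }

  // The action is required and must name exactly one registered tool.
  const action = raw["action"];
  if (!isRecord(action)) throw new Error("action is required");
  const keys = Object.keys(action);
  const [toolName] = keys;
  if (keys.length !== 1 || toolName === undefined || !toolNames.includes(toolName)) {
    throw new Error("action must name exactly one registered tool");
  }
  return raw as unknown as AgentOutput;
}

// Report which reflection fields the model skipped on this step.
function missingReflection(output: AgentOutput): string[] {
  return REFLECTION_FIELDS.filter((field) => output[field] === undefined);
}

const toolNames = ["click_element_by_index", "input_text", "scroll"];
const actionOnly = parseAgentOutput(
  { action: { click_element_by_index: { index: 33 } } },
  toolNames,
);
console.log(missingReflection(actionOnly));
```

The action-only output parses cleanly and reports all three reflection fields as missing, which is the gap the system prompt has to close.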

The effect is that the LLM's context window becomes the agent's working memory. Each step's `memory` becomes part of `<agent_history>` in the next step's user prompt. The agent never has to take notes elsewhere. It also remembers nothing across task boundaries, because the history stream is the only memory it has.
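A rough sketch of that hand-off, assuming each finished step is stored as a small record and rendered into the next prompt. The `StepRecord` shape, the `<step_N>` wrappers, and the label text other than "Action Results" are placeholders of mine; the real prompt format may differ.

```typescript
// Hypothetical record of one finished step.
interface StepRecord {
  evaluation?: string;
  memory?: string;
  nextGoal?: string;
  actionResult: string;
}

// Render all finished steps into the block the next user prompt will carry.
function renderAgentHistory(steps: StepRecord[]): string {
  const blocks = steps.map((step, i) =>
    [
      `<step_${i + 1}>`,
      `Evaluation of Previous Step: ${step.evaluation ?? "(none)"}`,
      `Memory: ${step.memory ?? "(none)"}`,
      `Next Goal: ${step.nextGoal ?? "(none)"}`,
      `Action Results: ${step.actionResult}`,
      `</step_${i + 1}>`,
    ].join("\n"),
  );
  return `<agent_history>\n${blocks.join("\n")}\n</agent_history>`;
}

const renderedHistory = renderAgentHistory([
  {
    memory: "Tried ID 12, no match. 2 of 5 pages visited.",
    nextGoal: "Open page 3 of the results.",
    actionResult: "✅ Clicked element (Next).",
  },
]);
console.log(renderedHistory);
```

Whatever the model wrote into `memory` on one step is exactly what it reads back on the next; there is no second store to fall out of sync with.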

## What one step actually looks like

A complete step in PageAgentCore.ts is the following sequence. I am walking it linearly so the simplicity is visible.

```mermaid
sequenceDiagram
  participant Agent as PageAgentCore
  participant PC as PageController
  participant LLM as LLM (OpenAI-compat)
  participant Tool as Concrete Tool

  Agent->>PC: getBrowserState()
  PC-->>Agent: {url, title, header, content, footer}
  Agent->>Agent: assembleUserPrompt (history + state + browser_state)
  Agent->>LLM: invoke(messages, {AgentOutput: macroTool})
  LLM-->>Agent: tool_call("AgentOutput", {evaluation, memory, next_goal, action})
  Agent->>Agent: pack reflection into console + emit activity event
  Agent->>Tool: tool.execute(args, {signal})
  Tool-->>Agent: "✅ Clicked element (Login)."
  Agent->>Agent: emit historychange event
  Agent->>PC: cleanUpHighlights()
```

There is no branching. There is no parallel execution. There is no "the model called three tools at once, here are the results." Each step is a single observation, a single thought, a single action, a single result.

This is not how browser-use works. Browser-use, in its Python incarnation, lets the model chain multiple actions in a single step. Page-agent's designers considered this and rejected it — the parallel_tool_calls: false setting is the explicit refusal. The reason appears in the action results field of MacroToolResult: after every tool execution, the model sees only that tool's string result. There is no aggregation across multiple actions. A model that wants to click then type must do it across two steps, with reflection in between.

The benefit is bounded complexity per step. The cost is that multi-step tasks run as long, predictable chains. Both are visible in the panel: every step has a thinking phase (LLM call) and an executing phase (tool execution), and the chain length is bounded by maxSteps (default 40 since v1.5.1).
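The shape of that chain can be sketched as a plain loop: one model call, one tool execution, one history entry, repeated until `done` or the step budget runs out. Everything here (`runTask`, `Think`, `Tool`, `HistoryEntry`) is a simplified stand-in I wrote for illustration; the real loop also handles browser state, events, and cancellation.

```typescript
interface ModelOutput {
  memory?: string;
  next_goal?: string;
  action: Record<string, unknown>; // one key: the tool name
}
interface HistoryEntry { step: number; action: string; result: string }

type Think = (history: readonly HistoryEntry[]) => Promise<ModelOutput>;
type Tool = (input: unknown) => Promise<string>;

async function runTask(
  think: Think,
  tools: Record<string, Tool>,
  maxSteps = 40,
): Promise<HistoryEntry[]> {
  const history: HistoryEntry[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    // Thinking phase: exactly one model call per step.
    const output = await think(history);
    const first = Object.entries(output.action)[0];
    if (first === undefined) {
      history.push({ step, action: "(none)", result: "❌ No action provided." });
      continue;
    }
    const [name, input] = first;

    // `done` is caught by the loop itself rather than executed as a tool.
    if (name === "done") {
      history.push({ step, action: name, result: "Task finished." });
      break;
    }

    // Executing phase: exactly one tool call, whose string result is recorded.
    const tool = tools[name];
    const result = tool ? await tool(input) : `❌ Unknown tool: ${name}`;
    history.push({ step, action: name, result });
  }
  return history;
}

// A scripted "model" that clicks, types, then finishes.
const script: ModelOutput[] = [
  { action: { click_element_by_index: { index: 3 } } },
  { action: { input_text: { index: 3, text: "hello" } } },
  { action: { done: { success: true, text: "ok" } } },
];
const scriptedThink: Think = async (history) =>
  script[history.length] ?? { action: { done: {} } };
const demoLoopTools: Record<string, Tool> = {
  click_element_by_index: async () => "✅ Clicked element.",
  input_text: async () => "✅ Typed text.",
};

runTask(scriptedThink, demoLoopTools).then((trace) => {
  console.log(trace.map((entry) => `${entry.step}: ${entry.action}`).join("\n"));
});
```

Clicking and then typing costs two iterations, each with its own model call, which is exactly the trade the paragraph above describes.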

## The tool vocabulary

The registered tool set in packages/core/src/tools/index.ts is short — eight tools, plus done which is handled by the main loop rather than executed. The shape of each tool reveals what the runtime considers the irreducible vocabulary of web interaction:

| Tool | Purpose | Notes |
|------|---------|-------|
| `done` | End the task with a text result and a success flag. | Not actually executed; the main loop catches it. |
| `wait` | Sleep for 1–10 seconds. | Compensates by subtracting `getLastUpdateTime()` so the model doesn't over-wait on a settled page. |
| `ask_user` | Prompt the user for clarification. | Only available if `onAskUser` is set; otherwise auto-disabled. |
| `click_element_by_index` | Click an indexed element. | Delegates to `PageController.clickElement(index)`. |
| `input_text` | Click an input and type text. | Uses the native value setter so React's `onChange` fires. |
| `select_dropdown_option` | Pick an `<option>` by visible text. | |
| `scroll` / `scroll_horizontally` | Scroll page or container by pages or pixels. | Indexed mode walks up to 10 parents to find a scrollable ancestor. |
| `execute_javascript` | Run arbitrary JS on the page. | Wraps the script in `(async (signal) => { ... })`; the abort signal is in scope. |
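The `wait` compensation is the least obvious row. One plausible reading, and it is only my reading, is that time already elapsed since the page last changed counts toward the requested wait. The sketch below implements that interpretation with hypothetical helpers (`clampWaitSeconds`, `remainingWaitMs`) and explicit timestamps so the arithmetic is visible.

```typescript
// Keep the requested wait inside the tool's 1 to 10 second range.
function clampWaitSeconds(requested: number): number {
  return Math.min(10, Math.max(1, requested));
}

// Assumed interpretation: subtract the time the page has already been settled
// (now minus the last update time) from the requested wait.
function remainingWaitMs(requestedSeconds: number, lastUpdateTime: number, now: number): number {
  const requestedMs = clampWaitSeconds(requestedSeconds) * 1000;
  const alreadySettledMs = Math.max(0, now - lastUpdateTime);
  return Math.max(0, requestedMs - alreadySettledMs);
}

// The page last changed 1.5 s ago and the model asks for 3 s:
// only the remaining 1.5 s is actually slept.
console.log(remainingWaitMs(3, 1_000, 2_500)); // 1500
```

Under this reading a page that has been quiet for longer than the requested wait costs no extra time at all.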

Two absences are worth noting. There is no extract_structured_data tool — the comment says "Tables need a dedicated parser to extract structured data. This tool is useless." There is no upload_file tool — listed as a TODO. The runtime is honest about what it cannot do.

Every tool receives `{ signal: AbortSignal }` as its `ToolContext`. The signal comes from the same `AbortController` whose signal is passed to the LLM fetch. Tools are required to honor it, and the system prompt makes cooperation explicit:

> An AbortSignal named signal is available in scope: long-running async code MUST honor it (e.g. await fetch(url, { signal }), or signal.throwIfAborted() in loops)

When the user clicks the panel's Stop button, the agent calls `this.#abortController.abort()`. The signal propagates: the LLM fetch rejects with an `AbortError`, any in-flight tool that checks `signal.throwIfAborted()` throws so its promise rejects, and any user-supplied `execute_javascript` script can react to `signal.aborted === true`. v1.9.0 made this discipline comprehensive across all packages; before that, sync tools and loop execution occasionally ignored the signal.
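Cooperative cancellation only works if long-running code actually looks at the signal. A minimal sketch, using the standard `AbortController` API and a hypothetical `processItems` tool body; the explicit `signal.aborted` check is a manual version of what `signal.throwIfAborted()` does.

```typescript
// Hypothetical long-running tool body: handles items one at a time and
// checks the shared signal before each item.
function processItems(
  items: string[],
  signal: AbortSignal,
  onItem: (item: string) => void,
): number {
  let processed = 0;
  for (const item of items) {
    if (signal.aborted) throw new Error("Aborted by user");
    onItem(item);
    processed++;
  }
  return processed;
}

const controller = new AbortController();
const seen: string[] = [];
let outcome = "";

try {
  processItems(["a", "b", "c", "d"], controller.signal, (item) => {
    seen.push(item);
    // Stands in for the Stop button being clicked while the tool runs.
    if (item === "b") controller.abort();
  });
  outcome = "completed";
} catch (error) {
  outcome = error instanceof Error ? error.message : String(error);
}

console.log(seen, outcome);
```

The abort lands after the second item, and the loop stops at its next check instead of finishing the list. A fetch given the same `controller.signal` would be cancelled by the same `abort()` call.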

## The auto-fixer

Small models fail in interesting ways. The auto-fixer in `packages/core/src/utils/autoFixer.ts` exists because `qwen2.5-coder` and similar small models occasionally emit primitive arguments directly:

```json
{ "click_element_by_index": 33 }
```

Instead of the schema-required object:

```json
{ "click_element_by_index": { "index": 33 } }
```

The auto-fixer inspects each tool's Zod schema and re-wraps primitive arguments into objects before they reach validation. The repair happens *before* the LLM client sees the failure — the model never learns its response was malformed. It is a tiny piece of code, but it is the difference between "works on gpt-4 only" and "works on every OpenAI-compatible model including the cheap ones."
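The re-wrapping idea fits in a few lines. This sketch assumes each tool exposes the names of its parameters (`ToolParams` is a hypothetical stand-in for inspecting the Zod schema) and only repairs the unambiguous case where a tool takes exactly one parameter; the real auto-fixer may cover more cases.

```typescript
// Hypothetical stand-in for what the fixer learns from each tool's schema.
interface ToolParams { paramNames: string[] }

function isPrimitive(value: unknown): value is string | number | boolean {
  return typeof value === "string" || typeof value === "number" || typeof value === "boolean";
}

// Re-wrap `{ tool: 33 }` as `{ tool: { index: 33 } }` when the tool has
// exactly one parameter; leave everything else untouched.
function autoFixAction(
  action: Record<string, unknown>,
  tools: Record<string, ToolParams>,
): Record<string, unknown> {
  const fixed: Record<string, unknown> = {};
  for (const [toolName, args] of Object.entries(action)) {
    const paramNames = tools[toolName]?.paramNames ?? [];
    const [onlyParam] = paramNames;
    if (isPrimitive(args) && paramNames.length === 1 && onlyParam !== undefined) {
      fixed[toolName] = { [onlyParam]: args };
    } else {
      fixed[toolName] = args;
    }
  }
  return fixed;
}

const fixerTools: Record<string, ToolParams> = {
  click_element_by_index: { paramNames: ["index"] },
  input_text: { paramNames: ["index", "text"] },
};

console.log(JSON.stringify(autoFixAction({ click_element_by_index: 33 }, fixerTools)));
// {"click_element_by_index":{"index":33}}
```

A primitive sent to a two-parameter tool such as `input_text` is left alone, because there is no safe way to guess which parameter it was meant for; validation then reports it as an ordinary error.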

Models that hallucinate also produce the second-most-common failure mode, one the auto-fixer cannot repair: the model emits an index that is out of bounds. The runtime handles this in `PageController.clickElement`:

```typescript
// PageController.ts:243-269
async clickElement(index: number): Promise<ActionResult> {
    try {
        this.assertIndexed()
        const element = getElementByIndex(this.selectorMap, index)
        const elemText = this.elementTextMap.get(index)
        await clickElement(element)
        // ...
        return { success: true, message: `✅ Clicked element (${elemText ?? index}).` }
    } catch (error) {
        return { success: false, message: `❌ Failed to click element: ${error}` }
    }
}
```

A failed action does not throw; it returns a result whose `message` string describes the failure. That string goes into `<agent_history>` as Action Results. The next step's prompt contains the failure message, and the model's `evaluation_previous_goal` field can mark the previous action as failed. The system recovers without throwing exceptions across the loop boundary. The architecture is robust to the model being wrong.
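The same pattern in isolation, with an in-memory map standing in for the DOM. `lookupElement`, `clickByIndex`, and `IndexedElement` are hypothetical names for this sketch; only the `ActionResult` shape and the message prefixes follow the excerpt above.

```typescript
interface ActionResult { success: boolean; message: string }
interface IndexedElement { text: string }

// Throws when the model asks for an index that was never assigned.
function lookupElement(selectorMap: Map<number, IndexedElement>, index: number): IndexedElement {
  const element = selectorMap.get(index);
  if (element === undefined) {
    throw new Error(`No interactive element found at index ${index}`);
  }
  return element;
}

// Converts any thrown error into a failure result instead of letting it escape.
function clickByIndex(selectorMap: Map<number, IndexedElement>, index: number): ActionResult {
  try {
    const element = lookupElement(selectorMap, index);
    return { success: true, message: `✅ Clicked element (${element.text}).` };
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    return { success: false, message: `❌ Failed to click element: ${reason}` };
  }
}

const selectorMap = new Map<number, IndexedElement>([
  [0, { text: "Login" }],
  [1, { text: "Sign up" }],
]);

// Both outcomes end up as plain strings in the history the model reads next.
const actionHistory: string[] = [];
for (const index of [0, 99]) {
  actionHistory.push(`Action Results: ${clickByIndex(selectorMap, index).message}`);
}
console.log(actionHistory.join("\n"));
```

The out-of-bounds click becomes a line of text in the next prompt rather than an exception that ends the task.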

## What the runtime does between steps

After every tool executes, the agent performs a small handful of housekeeping operations. None are glamorous; together they make the loop survivable.

- **Activity events.** `thinking`, `executing`, `executed`, `retrying`, and `error` events are emitted to the panel UI. These are transient: they are not part of `<agent_history>` and do not feed back to the model. The UI uses them to render the live status. The model never sees them.
- **History events.** `step`, `observation`, `user_takeover`, and `error` events are persistent: they accumulate in `<agent_history>` and survive into the next prompt. The model *does* see these.
- **Wait time accounting.** After every non-wait action, the cumulative `totalWaitTime` resets to zero. The system uses this counter to inject an observation when the model has been waiting for more than 3 seconds total: "You have waited X seconds accumulatively. DO NOT wait any longer unless you have a good reason."
- **URL change detection.** After every step, the runtime compares the current `window.location.href` to the last known URL. If it changed, an observation is pushed: "Page navigated to → {url}." A 500ms settle delay gives the new page time to render before the next DOM extraction.
- **Remaining-step warnings.** At 5 steps remaining, the model sees: "Only 5 steps remaining. Consider wrapping up or calling done with partial results." At 2 steps remaining, it becomes a critical alert.

These are not bugs; they are the loop's self-regulation. The model can thrash in a stuck state for a few steps, but it cannot thrash indefinitely without bumping into a step-count or wait-time boundary.
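The three observation rules can be written as one pure function over the loop's bookkeeping. `LoopState`, `collectObservations`, and `nextTotalWait` are hypothetical names, `step` is assumed to count completed steps, and the wording of the 2-step alert is invented here because the original text is not quoted above.

```typescript
// Hypothetical snapshot of the loop's bookkeeping after a step.
interface LoopState {
  lastUrl: string;
  currentUrl: string;
  totalWaitSeconds: number;
  step: number;      // steps completed so far (assumption)
  maxSteps: number;
}

// Observations that get pushed into the history the model will read next.
function collectObservations(state: LoopState): string[] {
  const observations: string[] = [];

  if (state.totalWaitSeconds > 3) {
    observations.push(
      `You have waited ${state.totalWaitSeconds} seconds accumulatively. ` +
        "DO NOT wait any longer unless you have a good reason.",
    );
  }

  if (state.currentUrl !== state.lastUrl) {
    observations.push(`Page navigated to → ${state.currentUrl}`);
  }

  const remaining = state.maxSteps - state.step;
  if (remaining === 5) {
    observations.push(
      "Only 5 steps remaining. Consider wrapping up or calling done with partial results.",
    );
  } else if (remaining === 2) {
    // Placeholder wording: the source only says this becomes a critical alert.
    observations.push("Critical: only 2 steps remaining.");
  }

  return observations;
}

// Waits accumulate; any other action resets the counter to zero.
function nextTotalWait(total: number, actionName: string, waitedSeconds: number): number {
  return actionName === "wait" ? total + waitedSeconds : 0;
}

console.log(
  collectObservations({
    lastUrl: "https://example.com/login",
    currentUrl: "https://example.com/home",
    totalWaitSeconds: 4,
    step: 35,
    maxSteps: 40,
  }),
);
```

Keeping these rules separate from the tools means every tool stays simple, while the loop still tells the model when it is stalling or running out of budget.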

## The discipline this imposes

The single-tool, single-step design is not faster than chaining. A user request that takes 6 actions takes 6 LLM round-trips — each with its own think time, each with its own token cost. A chained design might compress that to 2 round-trips with 3 actions each. Page-agent rejects that compression because it makes the model undebuggable.

When the model fails, the failure is pinned to one step. The reflection fields tell you *what the model thought*. The action field tells you *what it tried*. The action result tells you *what actually happened*. The next step's `evaluation_previous_goal` tells you *whether the model noticed*. A chained design collapses three of those four signals into "the model emitted a tool batch, here is what happened", and debugging that is much harder.

For the user, this means the panel's display is faithful to what the model did. The "Step 6 of 40" label is not a marketing number — it is the count of LLM round-trips the task required. When the panel says "Done (success: false)" it means the model reached a state it was willing to call done on, and it believed the task was incomplete. The bounded vocabulary produces bounded failure modes.

## What the contract hides

The reflection-before-action schema is also a defense against a specific failure mode: silent reasoning drift. If the model could chain tools, it could quietly accumulate state across multiple actions, and the next step's prompt would have to summarize all of it. The model would be tempted to summarize cheaply ("I clicked some buttons"). The structured reflection forces it to write a clean summary before it acts — and the next step reads that summary verbatim.

That is not a guarantee against hallucinations. The model can still emit evaluation_previous_goal: "Success" when the previous action failed. But the next prompt's <agent_history> contains the actual Action Results string from the failed action, and the user can compare the model's claim to the recorded result. The architecture does not lie about what happened. It only fails to detect when the model lies about it.
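Because both the claim and the recorded result live in the history, the comparison can be automated outside the agent. The sketch below is a hypothetical audit helper, not part of page-agent: it flags steps whose evaluation opens with "Success" although the previous action's result starts with the failure marker. The string matching is a crude heuristic chosen for illustration.

```typescript
// One entry per step: the model's verdict on the previous step, and the
// recorded result of this step's own action.
interface TraceStep {
  evaluation?: string;
  actionResult: string;
}

// Returns the 0-based indexes of steps whose evaluation claims success
// even though the previous step's recorded result was a failure.
function findUnnoticedFailures(steps: TraceStep[]): number[] {
  const flagged: number[] = [];
  for (let i = 1; i < steps.length; i++) {
    const previous = steps[i - 1];
    const current = steps[i];
    if (previous === undefined || current === undefined) continue;

    const previousFailed = previous.actionResult.startsWith("❌");
    const claimsSuccess = /^\s*success/i.test(current.evaluation ?? "");
    if (previousFailed && claimsSuccess) flagged.push(i);
  }
  return flagged;
}

const auditTrace: TraceStep[] = [
  { actionResult: "❌ Failed to click element: index 99 not found" },
  { evaluation: "Success. The login button was clicked.", actionResult: "✅ Input text." },
  { evaluation: "Success. Text was entered.", actionResult: "✅ Clicked element (Submit)." },
];

console.log(findUnnoticedFailures(auditTrace)); // [ 1 ]
```

Step 1 is flagged because its verdict contradicts the failure recorded for step 0; step 2 is not, because the action it judges did succeed.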

That limitation is real and acknowledged. The honest framing in the README: "Traceability and predictability is more important than success rate." When success rate and traceability conflict, page-agent picks traceability. That is the design choice that produces the single-tool loop.
