alibaba/page-agent

JavaScript in-page GUI agent. Control web interfaces with natural language.

5 chapters / 1 audio lesson / 1 video / 3 free previews

Start here

1. E00_The_Web_The_Model_Can_See

The Web The Model Can See

A pure in-page JavaScript GUI agent does not give the LLM eyes. It gives it Braille. That choice — to strip the page down to indexed text rather than render it as pixels — is the move that makes page-agent possible.

Key Takeaways

  • Page-agent feeds the LLM indexed text, never pixels, never screenshots.
  • The trade is real and worth it: cost drops, latency drops, permission boundaries disappear, and the page stays inside the user's browser.
  • The trade's cost surfaces the moment the DOM mutates — the model is now blind to anything the indexed text doesn't describe.
  • The model never sees CSS selectors, XPath, or coordinates. It sees [33]<button>Submit</button> and replies with {"index": 33}.

---

A user pastes a single line of HTML into a page they were already on:

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.10.0/dist/iife/page-agent.demo.js" crossorigin="true"></script>

A small floating panel appears at the bottom-right. They type: *"Click the login button."* Twenty seconds later the panel reads:

Step 6 of 40
✅ Clicked element (Login).
Done (success: true)

That is the entire product. There is no Python, no browser extension, no headless Chrome. There is no server holding the page. There is no multimodal vision model. There is a script tag, a text-only LLM, and a piece of JavaScript that turned the user's page into something a language model could read.

I had assumed — when I first read the README — that the agent's cleverness was in how it *acted* on the page. I was wrong. The cleverness is in what it *sees*.

The perception that almost won

Most GUI agents see the way humans do: they take a screenshot, hand the image to a multimodal model, and ask it where to click. This is the natural shape of "browser automation" in 2024–2026. Browser-use, Anthropic's computer-use, OpenAI's Operator — they all sit somewhere on this stack.

The shape has three real costs.

First, money. Image tokens are expensive. A 1920×1080 screenshot at reasonable detail can be 800–1,500 tokens per frame. Over a 10-step task that is 8,000–15,000 image tokens before any text is counted, and the LLM bill starts to look like a SaaS subscription of its own. Text-based representations of the same page compress to a few hundred tokens because they ignore pixels that don't carry information.

Second, latency. Screenshots are synchronous with the GPU render loop. In an in-page environment that loop is owned by the user's browser and is not exposed to scripts at all. Page-agent runs in the page's main JavaScript context. It cannot take a screenshot of itself.

Third, permissions. Screenshot APIs in the browser require user opt-in, or a browser extension that calls tabs.captureVisibleTab and therefore holds the activeTab or <all_urls> permission. A library that needs the user to grant a permission to load is not a library; it's a download.

Page-agent's README states the trade plainly: "No screenshots. No multi-modal LLMs or special permissions needed." That sentence is not a feature claim. It is a forced architectural decision.

The alternative — and the choice page-agent made — is to give the model the page's *source of truth* instead of its *appearance*. The browser's DOM is the same thing the application code is rendering. The text of a button label, the type of an <input>, the aria-label of a <div> — all of this is in the DOM, accessible synchronously from any script running on the page. No GPU. No permissions. No round-trip.

The catch: a text representation is impoverished compared to a screenshot. A screenshot tells the model "this region looks like a disabled button." The DOM tells the model "this is a <button disabled> element with the text 'Submit.'" Both are useful. They are not equivalent.

What the model actually sees

The model does not see a querySelector('button.login').click() chain. It does not see XPath. It does not see pixel coordinates. It sees something like this:

[0]<a aria-label="page-agent.js 首页" />
[1]<div >P />
[3]<a >Docs />
[4]<a aria-label="View source (new window)">Source />
[5]<a role=button>Quick start />
[6]<a role=button>View docs />
[33]<button aria-label='Submit form'>Submit</button>

That is a real example — taken almost verbatim from packages/page-controller/src/dom/index.ts:flatTreeToString(). Interactive elements get a numeric [index] prefix. New elements since the previous step get a * prefix. Indentation (\t) encodes parent-child relationships. Attributes the LLM needs for disambiguation — title, role, placeholder, aria-label, aria-expanded, data-state, value — are inlined. Attributes the LLM does not need are dropped. Text the model can read at a glance is the *element's own text content*, joined from descendant text nodes until the next clickable element.
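The serialization rules above can be sketched in a few lines. This is a minimal illustration, not the code in flatTreeToString(): the node shape ({ tag, attrs, text, index, isNew, children }), the serialize function, and the exact placement of the * marker are assumptions made for the example. Only the attribute allow-list and the [index]<tag ...>text /> line shape come from the description above.

```javascript
// Attributes kept for disambiguation, as listed in the text above.
const KEPT_ATTRS = ['title', 'role', 'placeholder', 'aria-label', 'aria-expanded', 'data-state', 'value'];

// Hypothetical node shape: { tag, attrs, text, index, isNew, children }.
// `index` is present only on interactive elements.
function serialize(node, depth = 0, lines = []) {
  let childDepth = depth;
  if (node.index !== undefined) {
    // Keep allow-listed attributes, drop everything else (class, style, ...).
    const attrs = KEPT_ATTRS
      .filter((name) => node.attrs && node.attrs[name] !== undefined)
      .map((name) => `${name}="${node.attrs[name]}"`)
      .join(' ');
    const marker = node.isNew ? '*' : ''; // assumed position of the "new element" marker
    const open = attrs ? `<${node.tag} ${attrs}>` : `<${node.tag}>`;
    lines.push(`${'\t'.repeat(depth)}${marker}[${node.index}]${open}${node.text ?? ''} />`);
    childDepth = depth + 1; // children of an indexed element are indented one tab
  }
  for (const child of node.children ?? []) serialize(child, childDepth, lines);
  return lines;
}

const page = {
  tag: 'form',
  children: [
    { tag: 'input', index: 0, attrs: { placeholder: 'Email', class: 'x' } },
    { tag: 'button', index: 1, isNew: true, attrs: { 'aria-label': 'Submit form' }, text: 'Submit' },
  ],
};

const indexedText = serialize(page).join('\n');
console.log(indexedText);
```

The non-interactive form wrapper produces no line of its own, and the class attribute never reaches the model.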

The model replies in the same vocabulary. It does not say "click the third button under the second section." It says {"click_element_by_index": {"index": 33}}. The runtime resolves index 33 against a map of number → HTMLElement and runs the click. The model never names an element by selector, attribute, or location. It names them by *ordinal position in a flattened view of the page*.
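A rough sketch of that resolution step, under stated assumptions: selectorMap, runAction, and the plain objects standing in for HTMLElements are hypothetical, and the result strings only imitate the panel output quoted earlier. The point is that the reply carries nothing but a tool name and an integer.

```javascript
// Hypothetical stand-ins for real HTMLElements; each records whether it was clicked.
const selectorMap = new Map([
  [32, { label: 'Cancel', clicked: false }],
  [33, { label: 'Submit', clicked: false }],
]);

// Resolve a model reply such as {"click_element_by_index": {"index": 33}}.
function runAction(reply, map) {
  const [toolName, input] = Object.entries(reply)[0];
  if (toolName !== 'click_element_by_index') {
    return `❌ Unknown tool: ${toolName}`;
  }
  const element = map.get(input.index);
  if (!element) {
    // The model used an index that was never provided in the indexed text.
    return `❌ No element with index ${input.index}`;
  }
  element.clicked = true; // a real runtime would dispatch the click here
  return `✅ Clicked element (${element.label}).`;
}

const ok = runAction({ click_element_by_index: { index: 33 } }, selectorMap);
const stale = runAction({ click_element_by_index: { index: 99 } }, selectorMap);
console.log(ok);
console.log(stale);
```

An index that is not in the map fails loudly instead of guessing, which is the runtime-side half of the "only use indexes that are explicitly provided" rule.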

The discipline is total. From the system prompt in packages/core/src/prompts/system_prompt.md:

Only interact with elements that have a numeric [index] assigned.
Only use indexes that are explicitly provided.

That is the entire contract. The model is given a Braille page and a Braille vocabulary and told to operate.

The tra

10m / Article + audio + video

2. E01_One_Tool_One_Thought_Per_Step

One Tool, One Thought Per Step

The agent does not chain tools. Every round-trip is one call to one tool called AgentOutput, and that tool's input schema puts three reflection fields ahead of the action. The architecture gives reflection a structural slot in every step instead of leaving it entirely to the prompt.

Key Takeaways

  • Each ReAct step is exactly one tool call. There is no multi-tool reasoning within a step.
  • The single tool's schema has slots for evaluation_previous_goal, memory, next_goal, and action. Only action is required; the system prompt pushes the model to fill in the rest, so every step carries explicit reasoning.
  • Tools are async, isolated from each other, and share state only through BrowserState snapshots.
  • Cancellation is a single AbortSignal flowing through the LLM fetch, every tool, and the JavaScript runtime.

---

The most surprising file in the page-agent repository is not the DOM extractor. It is packages/core/src/PageAgentCore.ts — 660 lines that define the entire agent loop, and which look, on first reading, almost nothing like the OpenAI tool-calling examples I had internalized from the SDK documentation.

// PageAgentCore.ts:285-289
const result = await this.#llm.invoke(messages, macroTool, signal, {
    toolChoiceName: 'AgentOutput',
    normalizeResponse: (res) => normalizeResponse(res, this.tools),
})

One tool. One call. One signal. That is the entire loop. The #packMacroTool() method, defined a few hundred lines below, is the trick that makes this work — it merges every registered tool into a single OpenAI tool definition called AgentOutput. The model never picks among click_element_by_index and input_text and scroll. It picks *one* tool, AgentOutput, whose action field is a discriminated union over the registered tool set.

I had expected an OpenAI-style tool loop: the model emits multiple tool calls, the runtime executes them, the results come back, the model reasons again. Page-agent explicitly forbids that. parallel_tool_calls: false is set in packages/llms/src/OpenAIClient.ts:45. The result is that every round-trip is one thought.
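To make the merging idea concrete, here is a sketch of how a registry of tools could be folded into one function definition. It is not the body of #packMacroTool(): the registry shape, the JSON Schema layout, and the placeholder model name are assumptions. The tool_choice and parallel_tool_calls fields are the standard OpenAI Chat Completions request parameters.

```javascript
// Hypothetical tool registry: tool name -> JSON Schema for that tool's input.
const tools = {
  click_element_by_index: {
    type: 'object',
    properties: { index: { type: 'integer' } },
    required: ['index'],
  },
  input_text: {
    type: 'object',
    properties: { index: { type: 'integer' }, text: { type: 'string' } },
    required: ['index', 'text'],
  },
};

function packMacroTool(registry) {
  // One variant per tool: an object with exactly one key, the tool name.
  const variants = Object.entries(registry).map(([name, schema]) => ({
    type: 'object',
    properties: { [name]: schema },
    required: [name],
    additionalProperties: false,
  }));
  return {
    type: 'function',
    function: {
      name: 'AgentOutput',
      parameters: {
        type: 'object',
        properties: {
          evaluation_previous_goal: { type: 'string' },
          memory: { type: 'string' },
          next_goal: { type: 'string' },
          action: { anyOf: variants }, // the union over registered tools
        },
        required: ['action'],
      },
    },
  };
}

const macroTool = packMacroTool(tools);
const requestBody = {
  model: 'example-model', // placeholder, not a real model name
  messages: [],
  tools: [macroTool], // the model is offered exactly one tool
  tool_choice: { type: 'function', function: { name: 'AgentOutput' } },
  parallel_tool_calls: false,
};
console.log(JSON.stringify(requestBody.tools.map((t) => t.function.name)));
```

However many tools are registered, the request still advertises a single function, so "which tool" becomes a field inside the arguments and not a choice among tool calls.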

The reflection schema

The Zod schema that defines what the model must produce on every step is the most consequential piece of code in the repository. It is forty lines that define the agent's reasoning discipline.

// PageAgentCore.ts:395-401
const macroToolSchema = z.object({
    evaluation_previous_goal: z.string().optional(),
    memory: z.string().optional(),
    next_goal: z.string().optional(),
    action: actionSchema,
})

Three reflection fields, then an action. Each step the model is asked four things, in order:

1. Did the previous step succeed? evaluation_previous_goal is a one-sentence verdict — success, failure, or uncertain. The system prompt makes the verdict mandatory and forbids the lazy answer:

> Never assume an action succeeded just because it appears to be executed in your last step in <agent_history>. If the expected change is missing, mark the last action as failed (or uncertain) and plan a recovery.

2. What should I remember? memory is a free-text working memory. The system prompt recommends 1–3 sentences of "specific memory of this step and overall progress." Counts of pages visited, IDs already tried, blockers encountered — anything that should survive into the next step's context goes here.

3. What is the immediate next goal? next_goal is a single sentence. Not a plan — a goal. The system prompt explicitly tells the model not to plan the whole task here: "your next immediate goal and action to achieve it, in one clear sentence."

4. Which tool, with what input? action is the discriminated union. The model picks one tool from the registered set and provides its typed input.

The schema supports the discipline but does not enforce it. A model that emits only {"action": {"click_element_by_index": {"index": 33}}} will pass schema validation, because the reflection fields are all optional. In practice the prompt steers models to fill them; the structural guarantee is only that *if* the model reflects, it has a place to put each piece.
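The optional-versus-required split is easy to check by hand. The validator below is a dependency-free stand-in for the Zod schema, written only to illustrate the behaviour described above; validateAgentOutput and the TOOL_NAMES list are hypothetical.

```javascript
// Hypothetical registered tool names, taken from the tools mentioned in the text.
const TOOL_NAMES = ['click_element_by_index', 'input_text', 'scroll'];
const REFLECTION_FIELDS = ['evaluation_previous_goal', 'memory', 'next_goal'];

function validateAgentOutput(output) {
  const errors = [];
  // Reflection fields are optional, but must be strings when present.
  for (const field of REFLECTION_FIELDS) {
    if (output[field] !== undefined && typeof output[field] !== 'string') {
      errors.push(`${field} must be a string when present`);
    }
  }
  // The action is the only required field: exactly one registered tool.
  const action = output.action;
  if (typeof action !== 'object' || action === null) {
    errors.push('action is required');
  } else {
    const keys = Object.keys(action);
    if (keys.length !== 1 || !TOOL_NAMES.includes(keys[0])) {
      errors.push('action must name exactly one registered tool');
    }
  }
  return { ok: errors.length === 0, errors };
}

const actionOnly = validateAgentOutput({ action: { click_element_by_index: { index: 33 } } });
const noAction = validateAgentOutput({ memory: 'Tried login once.' });
console.log(actionOnly.ok, noAction.ok, noAction.errors);
```

A reply with no reflection passes and a reply with reflection but no action does not, which is the asymmetry the paragraph above describes.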

The effect is that the LLM's context window becomes the agent's working memory. Each step's memory becomes part of <agent_history> in the next step's user prompt. The agent never has to take notes elsewhere. It never has to remember across task boundaries because the history stream is the only memory it has.
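As an illustration of how reflection fields could be carried forward, the sketch below folds earlier steps into an <agent_history> block. The step_N tags, the line labels, and the renderHistory name are invented for this example; the real prompt layout may differ.

```javascript
// Render earlier steps into the block that the next user prompt would include.
function renderHistory(steps) {
  const blocks = steps.map((step, i) =>
    [
      `<step_${i + 1}>`,
      `Evaluation of previous goal: ${step.evaluation_previous_goal ?? '(none)'}`,
      `Memory: ${step.memory ?? '(none)'}`,
      `Next goal: ${step.next_goal ?? '(none)'}`,
      `Action result: ${step.result}`,
      `</step_${i + 1}>`,
    ].join('\n')
  );
  return `<agent_history>\n${blocks.join('\n')}\n</agent_history>`;
}

// Illustrative data only.
const pastSteps = [
  {
    evaluation_previous_goal: 'Success. The login form opened.',
    memory: 'Login form is open.',
    next_goal: 'Type the email address.',
    result: '✅ Input text into element (Email).',
  },
  {
    memory: 'Email entered.',
    next_goal: 'Click the login button.',
    result: '✅ Clicked element (Login).',
  },
];

const historyBlock = renderHistory(pastSteps);
console.log(historyBlock);
```

Because the only store is this rendered text, whatever the model wrote into memory at step N is exactly what it can read back at step N+1.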

What one step actually looks like

A complete step in PageAgentCore.ts is the following sequence. I am walking it linearly so the simplicity is visible.

sequenceDiagram
  participant Agent as PageAgentCore
  participant PC as PageController
  participant LLM as LLM (OpenAI-compat)
  participant Tool as Concrete Tool

  Agent->>PC: getBrowserState()
  PC-->>Agent: {url, title, header, content, footer}
  Agent->>Agent: assembleUserPrompt (history + state + browser_state)
  Agent->>LLM: invoke(messages, {AgentOutput: macroTool})
  LLM-->>Agent: tool_call("AgentOutput", {evaluation, memory, next_goal, action})
  Agent->>Agent: pack reflection into console + emit activity event
  Agent->>Tool: tool.execute(args, {signal})
  Tool-->>Agent: "✅ Clicked element (Login)."
  Agent->>Agent: emit historychange event
  Agent->>PC: cleanUpHighlights()

There is no branching. There is no parallel execution. There is no "the model called three tools at once, here are the results." Each step is a single observation, a single thought, a single action, a single result.
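The sequence can be compressed into a small loop body. This is a sketch with fake dependencies, not an excerpt from PageAgentCore.ts: runStep, the deps object, and the error message are hypothetical. It does show the two properties claimed above, one action per step and one AbortSignal passed to everything.

```javascript
// One observation, one LLM call, one tool execution, one history entry.
async function runStep({ getBrowserState, invokeLLM, tools, history, signal }) {
  if (signal.aborted) throw new Error('Step aborted');
  const state = await getBrowserState();
  const output = await invokeLLM({ history, state }, signal);
  // The action object has exactly one key: the chosen tool.
  const [toolName, input] = Object.entries(output.action)[0];
  if (signal.aborted) throw new Error('Step aborted');
  const result = await tools[toolName](input, { signal });
  history.push({ ...output, result });
  return result;
}

async function demo() {
  const controller = new AbortController();
  const history = [];
  const deps = {
    // Fake page state and a fake model, for illustration only.
    getBrowserState: async () => ({ url: 'https://example.com', content: '[0]<button>Login />' }),
    invokeLLM: async () => ({
      next_goal: 'Click the login button.',
      action: { click_element_by_index: { index: 0 } },
    }),
    tools: { click_element_by_index: async ({ index }) => `✅ Clicked element ${index}.` },
    history,
    signal: controller.signal,
  };

  const first = await runStep(deps);
  controller.abort(); // the user cancels the task
  const second = await runStep(deps).catch((err) => err.message);
  return { first, second, steps: history.length };
}

demo().then(console.log);
```

After the abort, the next step stops before it reads the page or calls the model, and the history keeps only the step that completed.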

This is not how browser-use works. Browser-use, in its Python incarnation, lets the model chain multiple actions in a single step. Page-agent's designers considered this and rejected it — the parallel_tool_calls: false setting is the explicit refusal. The reason appears in the action results field of MacroToolResult: after every tool execution, the model sees only that tool's string result. There is no aggregation across multiple actions. A model that wants to click then type must do it across two steps, with reflection in between.

The benefit is bounded complexity per step. The cost is that multi-step tasks run as long, predictable chains. Both are visible in the panel: every step has a thinking phase (LLM call) and an executing phase

10m / Article + audio

3. E02_Why_dot_click_Isnt_Enough

Why .click() Isn't Enough

The runtime does not call .click() on elements. It dispatches the W3C pointer/mouse event sequence verbatim — pointerover → mouseover → pointerdown → mousedown → focus → pointerup → mouseup → click — because modern web apps listen for the sequence, not the abstraction. Without this, half the web silently breaks.

Key Takeaways

  • A real browser click is eight events in a specific order, dispatched to the deepest element at the click point — not a single method call.
  • For programmatic input, React's onChange fires only when the *native* value setter is used and an input event follows; assigning input.value = text is not enough.
  • Contenteditable requires Plan A (synthetic InputEvents) plus a Plan B fallback (execCommand) because some editors ignore synthetic events.
  • Indexed scroll walks up to ten parents looking for a scrollable container; the deprecated page-scroll heuristic is still in the code, marked for removal.
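The indexed-scroll takeaway can be pictured with a short sketch. Everything here is assumed for illustration: the plain objects stand in for DOM elements, isScrollable is a simplified test, and whether the real code checks the starting element itself is not stated in this chapter preview.

```javascript
const MAX_PARENT_HOPS = 10;

// Simplified scrollability test on a stand-in object.
function isScrollable(node) {
  const overflowAllows = node.overflowY === 'auto' || node.overflowY === 'scroll';
  return overflowAllows && node.scrollHeight > node.clientHeight;
}

// Check the starting element, then at most ten ancestors.
function findScrollContainer(start) {
  let node = start;
  for (let hops = 0; node && hops <= MAX_PARENT_HOPS; hops++) {
    if (isScrollable(node)) return node;
    node = node.parent;
  }
  return null; // nothing scrollable within reach
}

const list = { name: 'list', overflowY: 'auto', scrollHeight: 900, clientHeight: 300, parent: null };
const row = { name: 'row', overflowY: 'visible', scrollHeight: 40, clientHeight: 40, parent: list };
const cell = { name: 'cell', overflowY: 'visible', scrollHeight: 20, clientHeight: 20, parent: row };

const found = findScrollContainer(cell);
console.log(found ? found.name : null);
```

The hop limit keeps the walk bounded on deeply nested pages, at the price of missing a container that sits more than ten levels up.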

---

The question I had, the first time I read packages/page-controller/src/actions.ts, was: why is this file 557 lines long? It exposes four public functions. Each of them does something the platform already does. HTMLElement.click() is a built-in. element.scrollIntoView() is built-in. document.execCommand('insertText') is a single line. What is the runtime doing that justifies re-implementing all of it?

The answer, after reading the comments and the test cases, is that the runtime is *not* re-implementing what the platform does. It is re-implementing what a real *user* does — and the two are not the same.

The discrepancy is invisible until you try to write a browser test against a modern React application and discover that .click() does not fire onPointerDown, that setting input.value does not fire React's onChange, that calling element.focus() does not match how a user's tab key navigates. Each of these surprises has a workaround. The runtime implements all of them.

The eight events that constitute a click

A user clicks a button. From the application's perspective, one event happened. From the browser's perspective, eight happened in a specific order, each targeting a specific element. The order matters because some applications install handlers on intermediate events (onMouseEnter, onPointerDown) and use them to update state before the click arrives.

flowchart TD
  A["pointerover<br/>bubbles"] --> B["pointerenter<br/>no-bubble"]
  B --> C["mouseover<br/>bubbles"]
  C --> D["mouseenter<br/>no-bubble"]
  D --> E["pointerdown<br/>bubbles"]
  E --> F["mousedown<br/>bubbles"]
  F --> G["focus<br/>on original element"]
  G --> H["pointerup<br/>bubbles"]
  H --> I["mouseup<br/>bubbles"]
  I --> J["click<br/>activation bubbles to interactive ancestor"]

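actions.ts:clickElement() dispatches the whole sequence: the eight events listed at the top of this chapter plus the two non-bubbling enter events shown in the diagram. The dispatch is not naive; there are three subtleties worth seeing: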

1. Hit-test for the deepest element. A button is rarely alone in a coordinate. There is an icon, a label, a tooltip wrapper. A user clicking the button is hitting whatever element is on top at that pixel — usually the icon, sometimes the label, sometimes the button itself. The runtime uses elementFromPoint(x, y) after temporarily disabling the visual mask's pointer-events, then dispatches events to the *hit-test target* while keeping focus on the *original element*:

// actions.ts:86-91
const hitTarget = doc.elementFromPoint(x, y)
await disablePassThrough()
const target =
    hitTarget instanceof HTMLElement && element.contains(hitTarget) ? hitTarget : element

This matches what a real browser does when a click lands inside a nested element. The deeper target receives the events; the original element receives focus; the activation (the click() call) bubbles up to the button so onClick handlers fire correctly.

2. The pointer/mouse event split. Pointer events came after mouse events in the platform timeline. Many applications still listen for *both*, in part because Safari did not ship pointer events until version 13 (2019). The runtime dispatches each pointer event just before its mouse counterpart, with the exact same coordinates. Comments in actions.ts explain the choice:

Simulate a full click following W3C Pointer Events + UI Events spec order: pointerover/enter → mouseover/enter → pointerdown → mousedown → [focus] → pointerup → mouseup → click

3. blurLastClickedElement between actions. When a previous click set focus on element A and the current click targets element B, the runtime first fires pointerout/pointerleave/mouseout/mouseleave/blur on A. Without this, applications that track "currently focused element" lose track of focus, and the next interaction (a hover-triggered tooltip, a focusin listener) silently fails.

The click sequence is, in total, eight events plus two cross-element state transitions. .click() is one event. The gap between them is the gap between "what the platform exposes" and "what real applications rely on."
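The ordering can be observed with nothing more than EventTarget. The sketch below is a simplification and not the code in actions.ts: it uses the generic Event constructor so it runs outside a browser, whereas a real page would use PointerEvent, MouseEvent, and coordinates, and focus would normally come from calling focus() on the element. CLICK_SEQUENCE and simulateClick are hypothetical names.

```javascript
// Event order from the diagram above, with each event's bubbling behaviour.
const CLICK_SEQUENCE = [
  { type: 'pointerover', bubbles: true },
  { type: 'pointerenter', bubbles: false },
  { type: 'mouseover', bubbles: true },
  { type: 'mouseenter', bubbles: false },
  { type: 'pointerdown', bubbles: true },
  { type: 'mousedown', bubbles: true },
  { type: 'focus', bubbles: false },
  { type: 'pointerup', bubbles: true },
  { type: 'mouseup', bubbles: true },
  { type: 'click', bubbles: true },
];

// Dispatch every event in order on the given target.
function simulateClick(target) {
  for (const { type, bubbles } of CLICK_SEQUENCE) {
    target.dispatchEvent(new Event(type, { bubbles, cancelable: true }));
  }
}

// A bare EventTarget stands in for the button; the listeners record what arrives.
const button = new EventTarget();
const seen = [];
for (const { type } of CLICK_SEQUENCE) {
  button.addEventListener(type, (event) => seen.push(event.type));
}

simulateClick(button);
console.log(seen.join(' → '));
```

A handler registered for pointerdown or mouseenter runs before the click arrives here, which is what applications that update state on those intermediate events depend on.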

React's value setter

The text input is harder.

<input type="text" />

A user types "hello". For each keystroke the browser fires keydown, keypress (a legacy event), input, and keyup, in that order; change fires later, when the field loses focus with a modified value. React's onChange handler listens to input events and updates component state. The state update triggers a re-render. The re-render commits a new value to the DOM input element.

Now an agent wants to set the input value to "hello" programmatically. The natural code is:

element.value = 'hello'

This sets the DOM property. It does *not* fire an input event. React's synthetic event system, which diffs the input's value against the React state, sees no event to react to. The component state is not updated. The user-visible state diverges from the React state. The next re-render wipes the agent's change.

This is a well-known React quirk. The workaround is to call the *native* value setter, the one the browser defines on the element's prototype, and so bypass the tracking setter React installs on the input instance when it mounts:

// actions.ts:223
getNativeValueSetter(element as HTMLInputElement | HTMLTextAreaElement).call(element, text)

The getNativeValueSetter helper, defined in packages/page-controller/src/utils/index.ts, returns the prototype's value setter:

const nativeInputValueSetter = Object.getOwnPropertyDescriptor(
    window.HTMLInputElement.prototype,
    'value'
).set
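Why the prototype setter behaves differently can be shown without a browser. The sketch below is a simplified model of React-style value tracking, not React's code: FakeInput stands in for HTMLInputElement, and installTracker is a hypothetical helper that shadows value on the instance the way a tracking setter would.

```javascript
// Stand-in for HTMLInputElement: the prototype owns the real `value` accessor.
class FakeInput {
  #value = '';
  get value() { return this.#value; }
  set value(next) { this.#value = String(next); }
}

// Shadow `value` on the instance so every plain assignment is recorded.
function installTracker(input) {
  const native = Object.getOwnPropertyDescriptor(FakeInput.prototype, 'value');
  let tracked = native.get.call(input);
  Object.defineProperty(input, 'value', {
    configurable: true,
    get() { return native.get.call(this); },
    set(next) {
      tracked = String(next); // the tracker learns the new value immediately
      native.set.call(this, next);
    },
  });
  // Called when an input event arrives: report a change only if the real
  // value differs from what the tracker last recorded.
  return function changedSinceLastEvent() {
    const current = native.get.call(input);
    if (current === tracked) return false;
    tracked = current;
    return true;
  };
}

// Path 1: plain assignment goes through the instance-level tracking setter.
const viaAssignment = new FakeInput();
const assignmentChanged = installTracker(viaAssignment);
viaAssignment.value = 'hello';
const firedAfterAssignment = assignmentChanged();

// Path 2: the prototype setter writes the value behind the tracker's back.
const viaNativeSetter = new FakeInput();
const nativeChanged = installTracker(viaNativeSetter);
const nativeSetter = Object.getOwnPropertyDescriptor(FakeInput.prototype, 'value').set;
nativeSetter.call(viaNativeSetter, 'hello');
const firedAfterNativeSetter = nativeChanged();

console.log({ firedAfterAssignment, firedAfterNativeSetter });
```

Both inputs end up holding "hello", but only the second path leaves the tracker stale, so only there would a subsequent input event be treated as a change.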

The agent calls that, then dispatches a single synthetic input event. React's listener picks it up. The state upd

10m / Article + audio

Premium chapters

4. E03_One_Core_Three_Surfaces
Available after upgrade / 11m
5. E04_When_the_Agent_Should_Stop
Available after upgrade / 11m