alibaba/page-agent / Chapter 1


The Web The Model Can See

A pure in-page JavaScript GUI agent does not give the LLM eyes. It gives it Braille. That choice — to strip the page down to indexed text rather than render it as pixels — is the move that makes page-agent possible.

Key Takeaways

  • Page-agent feeds the LLM indexed text, never pixels, never screenshots.
  • The trade is real and worth it: cost drops, latency drops, permission boundaries disappear, and the page stays inside the user's browser.
  • The trade's cost surfaces the moment the DOM mutates — the model is now blind to anything the indexed text doesn't describe.
  • The model never sees CSS selectors, XPath, or coordinates. It sees [33]<button>Submit</button> and replies with {"index": 33}.

---

A user pastes a single line of HTML into a page they were already on:

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.10.0/dist/iife/page-agent.demo.js" crossorigin="true"></script>

A small floating panel appears at the bottom-right. They type: *"Click the login button."* Twenty seconds later the panel reads:

Step 6 of 40
✅ Clicked element (Login).
Done (success: true)

That is the entire product. There is no Python, no browser extension, no headless Chrome. There is no server holding the page. There is no multimodal vision model. There is a script tag, a text-only LLM, and a piece of JavaScript that turned the user's page into something a language model could read.

I had assumed — when I first read the README — that the agent's cleverness was in how it *acted* on the page. I was wrong. The cleverness is in what it *sees*.

The perception that almost won

Most GUI agents see the way humans do: they take a screenshot, hand the image to a multimodal model, and ask it where to click. This is the natural shape of "browser automation" in 2024–2026. Browser-use, Anthropic's computer-use, OpenAI's Operator — they all sit somewhere on this stack.

The shape has three real costs.

First, money. Image tokens are expensive. A 1920×1080 screenshot at reasonable detail can be 800–1,500 tokens per frame. Multiply that by a 10-step task and the LLM bill starts to look like a SaaS subscription of its own. Text-based representations of the same page compress to a few hundred tokens because they ignore pixels that don't carry information.
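To make the multiplication concrete, here is a minimal arithmetic sketch. It uses the 800–1,500 range quoted above for screenshots and the 300–600 range this article's comparison diagram gives for the text-DOM path. Both ranges are rough figures, not measurements, and the function name is made up for illustration.

```typescript
// Per-step token ranges quoted in this article; real numbers depend on model and page.
interface TokenRange {
  min: number;
  max: number;
}

const SCREENSHOT_TOKENS: TokenRange = { min: 800, max: 1500 };
const TEXT_DOM_TOKENS: TokenRange = { min: 300, max: 600 };

// Perception tokens for a whole task, ignoring system prompt and history.
function perceptionTokens(perStep: TokenRange, steps: number): TokenRange {
  return { min: perStep.min * steps, max: perStep.max * steps };
}

const screenshotTask = perceptionTokens(SCREENSHOT_TOKENS, 10);
const textTask = perceptionTokens(TEXT_DOM_TOKENS, 10);
// screenshotTask → { min: 8000, max: 15000 }
// textTask       → { min: 3000, max: 6000 }
```

Under these assumptions a 10-step task spends roughly 2.5 times as many perception tokens on the screenshot path, before any difference in per-token pricing is counted.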

Second, latency. Screenshots are synchronous with the GPU render loop. In an in-page environment that loop is owned by the user's browser and is not exposed to scripts at all. Page-agent runs in the page's main JavaScript context. It cannot take a screenshot of itself.

Third, permissions. Screenshot APIs in the browser require user opt-in (for example, the screen-sharing prompt shown by getDisplayMedia) or a browser extension that calls tabs.captureVisibleTab, which itself needs the activeTab or <all_urls> permission. A library that needs the user to grant a permission to load is not a library; it's a download.

Page-agent's README states the trade plainly: "No screenshots. No multi-modal LLMs or special permissions needed." That sentence is not a feature claim. It is a forced architectural decision.

The alternative — and the choice page-agent made — is to give the model the page's *source of truth* instead of its *appearance*. The browser's DOM is the same thing the application code is rendering. The text of a button label, the type of an <input>, the aria-label of a <div> — all of this is in the DOM, accessible synchronously from any script running on the page. No GPU. No permissions. No round-trip.

The catch: a text representation is impoverished compared to a screenshot. A screenshot tells the model "this region looks like a disabled button." The DOM tells the model "this is a <button disabled> element with the text 'Submit.'" Both are useful. They are not equivalent.

What the model actually sees

The model does not see a querySelector('button.login').click() chain. It does not see XPath. It does not see pixel coordinates. It sees something like this:

[0]<a aria-label="page-agent.js 首页" />
[1]<div >P />
[3]<a >Docs />
[4]<a aria-label="View source (new window)">Source />
[5]<a role=button>Quick start />
[6]<a role=button>View docs />
[33]<button aria-label='Submit form'>Submit</button>

That is a real example — taken almost verbatim from packages/page-controller/src/dom/index.ts:flatTreeToString(). Interactive elements get a numeric [index] prefix. New elements since the previous step get a * prefix. Indentation (\t) encodes parent-child relationships. Attributes the LLM needs for disambiguation — title, role, placeholder, aria-label, aria-expanded, data-state, value — are inlined. Attributes the LLM does not need are dropped. Text the model can read at a glance is the *element's own text content*, joined from descendant text nodes until the next clickable element.
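The formatting rules in the paragraph above can be sketched in a few lines. This is not the project's flatTreeToString(); it is a simplified illustration, and the SketchNode shape and serialize function are hypothetical names invented here. It only shows the three conventions described: a numeric index for interactive elements, a * marker for new ones, and tab indentation for nesting.

```typescript
// Hypothetical, simplified node shape; the real flat DOM tree is richer.
interface SketchNode {
  tag: string;
  text?: string;
  attributes?: Record<string, string>;
  interactive?: boolean; // interactive nodes receive an [index]
  isNew?: boolean; // appeared since the previous step
  children?: SketchNode[];
}

function serialize(root: SketchNode): { text: string; indexed: SketchNode[] } {
  const lines: string[] = [];
  const indexed: SketchNode[] = []; // position in this array is the index

  const visit = (node: SketchNode, depth: number): void => {
    let childDepth = depth;
    if (node.interactive) {
      const index = indexed.push(node) - 1;
      const attrs = Object.entries(node.attributes ?? {})
        .map(([name, value]) => ` ${name}="${value}"`)
        .join("");
      const marker = node.isNew ? "*" : "";
      const indent = "\t".repeat(depth);
      lines.push(`${indent}${marker}[${index}]<${node.tag}${attrs}>${node.text ?? ""} />`);
      childDepth = depth + 1; // only indexed ancestors add indentation
    } else if (node.text) {
      lines.push(`${"\t".repeat(depth)}${node.text}`);
    }
    for (const child of node.children ?? []) visit(child, childDepth);
  };

  visit(root, 0);
  return { text: lines.join("\n"), indexed };
}

const page: SketchNode = {
  tag: "body",
  children: [
    { tag: "a", text: "Docs", interactive: true },
    {
      tag: "div",
      attributes: { role: "dialog" },
      interactive: true,
      children: [
        {
          tag: "button",
          text: "Submit",
          attributes: { "aria-label": "Submit form" },
          interactive: true,
          isNew: true,
        },
      ],
    },
  ],
};

const result = serialize(page);
```

The indexed array returned alongside the text is what lets a runtime map an index in the model's reply back to a concrete element.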

The model replies in the same vocabulary. It does not say "click the third button under the second section." It says {"click_element_by_index": {"index": 33}}. The runtime resolves index 33 against a map of number → HTMLElement and runs the click. The model never names an element by selector, attribute, or location. It names them by *ordinal position in a flattened view of the page*.
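A minimal sketch of that resolution step follows. The function name, the result type, and the error messages are assumptions made for illustration. Only the action shape and the idea of a number-to-element map come from the text above. The element type is generic so the sketch runs outside a browser; in a page it would be HTMLElement.

```typescript
// Action shape mirroring the article's {"click_element_by_index": {"index": 33}}.
interface ClickAction {
  click_element_by_index: { index: number };
}

type ResolveResult<E> =
  | { ok: true; element: E }
  | { ok: false; error: string };

// Parse the model's reply and look the index up in the map built for this step.
function resolveClick<E>(reply: string, selectorMap: Map<number, E>): ResolveResult<E> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(reply);
  } catch {
    return { ok: false, error: "reply is not valid JSON" };
  }
  const index = (parsed as Partial<ClickAction> | null)?.click_element_by_index?.index;
  if (typeof index !== "number" || !Number.isInteger(index)) {
    return { ok: false, error: "missing or non-integer index" };
  }
  const element = selectorMap.get(index);
  if (element === undefined) {
    // The model used an index that was never offered in this step.
    return { ok: false, error: `index ${index} was not provided in this step` };
  }
  return { ok: true, element };
}

// Strings stand in for real elements here.
const selectorMap = new Map<number, string>([[33, "<button>Submit</button>"]]);
const hit = resolveClick('{"click_element_by_index": {"index": 33}}', selectorMap);
const miss = resolveClick('{"click_element_by_index": {"index": 34}}', selectorMap);
```

Rejecting unknown indexes in the runtime, rather than trusting the prompt alone, is one way to enforce the "only use indexes that are explicitly provided" rule.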

The discipline is total. From the system prompt in packages/core/src/prompts/system_prompt.md:

Only interact with elements that have a numeric [index] assigned.
Only use indexes that are explicitly provided.

That is the entire contract. The model is given a Braille page and a Braille vocabulary and told to operate.

The trade is real

The cost of this design surfaces immediately when the model meets a dynamic page.

flowchart LR
  A["Screenshot path<br/>Page → render → pixels → multimodal LLM<br/>800–1500 tokens/frame · GPU bound · permissions required"] -->|cost, latency, permission| B[Higher bills, slower tasks, less deployable]
  C["Text-DOM path<br/>Page → FlatDomTree → indexed text → text LLM<br/>300–600 tokens/frame · synchronous · no permissions"] -->|cheap, fast, embeddable| D[Lower bills, real-time, in-page]
  C --> E[Blindness: layout, color, disabled visual state, hover, animation]
  A --> F[Heavy infra: headless browser or extension, GPU access, user consent]

A button that *looks* disabled but whose disabled property is false will mislead the model, because the model reads the property, not the appearance. A loading spinner with no aria-busy="true" attribute is invisible. A draggable element has no index until the runtime decides what to do with drag, and the runtime has not yet implemented drag. Hover-only interactions (tooltips, dropdowns that appear on mouseover) are equally invisible until pointer events fire.

These are real limits. The README's "Known Limitations" section lists them. The system prompt goes further:

It is ok to fail the task.
User can be wrong. If the request of user is not achievable, inappropriate or you do not have enough information or tools to achieve it. Tell user to make a better request.

The text-DOM trade is the *cause* of these limits and the *motivation* for that honesty. A perception system that is fundamentally impoverished cannot pretend to see what it doesn't. Better to admit blindness than to thrash.

The design choice that earned the trade

The dehydration from a live <button> to [33]<button aria-label='Submit form'>Submit</button> is not trivial. The runtime has to walk the DOM, decide what is interactive, decide what attributes matter, decide how to nest text content, decide what to drop. The work lives in packages/page-controller/src/dom/dom_tree/index.js — 1,751 lines of judgment, ported from browser-use and refined across the project's ten minor versions.

A handful of choices in that file are worth naming because they reveal the project's values:

  • Heuristic interactivity. A <div> with role="button" and an onclick handler is interactive. A <span> is not. An <a> without href is not. The heuristics are not exhaustive, and they are not meant to be. They're good enough to surface ~95% of clickable elements on a typical SaaS page.
  • Attribute filtering with wildcards. includeAttributes accepts glob patterns. The default set includes title, placeholder, aria-*, data-state, value — the attributes that disambiguate elements when the model needs to choose. Cosmetic attributes (class, style, id unless opted in) are dropped. The goal is the minimum information the model needs to identify each element.
  • New-element detection. A WeakMap<HTMLElement, string> caches elements the runtime has seen. New elements get a * prefix in the next prompt, so the model knows what changed since last step. URL changes also reset the tree — popstate, hashchange, beforeunload, and the modern Navigation API navigate event all trigger cleanUpHighlights() in packages/page-controller/src/dom/index.ts:528-569. The fallback for browsers without the Navigation API is a 500ms setInterval poll on window.location.href — pragmatic, ugly, working.
  • data-scrollable hint. Scrollable containers get their scroll distances inlined as a custom attribute: data-scrollable="top=200, bottom=600". The model uses this to choose scroll tool calls with indexed targets rather than blind page scrolls.
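The attribute-filtering bullet above is easy to picture with a small sketch. This is not the project's implementation; globToRegExp and filterAttributes are hypothetical helpers, and the pattern list simply repeats the defaults named in this article, so the real option may differ.

```typescript
// Turn a glob such as "aria-*" into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except "*", which becomes ".*".
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`);
}

// Keep only attributes whose names match at least one pattern.
function filterAttributes(
  attributes: Record<string, string>,
  patterns: string[],
): Record<string, string> {
  const matchers = patterns.map(globToRegExp);
  const kept: Record<string, string> = {};
  for (const [name, value] of Object.entries(attributes)) {
    if (matchers.some((re) => re.test(name))) kept[name] = value;
  }
  return kept;
}

// Pattern list follows the defaults named in the article (an assumption).
const patterns = ["title", "placeholder", "aria-*", "data-state", "value"];
const kept = filterAttributes(
  {
    class: "btn btn-primary",
    "aria-label": "Submit form",
    style: "color: red",
    "data-state": "open",
  },
  patterns,
);
// kept → { "aria-label": "Submit form", "data-state": "open" }
```

Cosmetic attributes such as class and style fall out because nothing in the pattern list matches them, which is the "minimum information" goal in miniature.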

Each of these is a small compromise. Together they let a text-only LLM perform actions that, in a screenshot world, would have required a vision model and a fast GPU.
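The new-element detection among those choices can be sketched with a WeakMap, as a rough illustration rather than the project's code. The class name, the method names, and the idea of storing a label string as the value are assumptions. The generic parameter stands in for HTMLElement so the sketch runs outside a browser.

```typescript
// Tracks which elements have already been shown to the model.
// Keys are held weakly, so elements removed from the page can be garbage collected.
class SeenElements<E extends object> {
  private seen = new WeakMap<E, string>();

  // Returns true the first time an element is observed, false afterwards.
  markAndCheckNew(element: E, label: string): boolean {
    const isNew = !this.seen.has(element);
    this.seen.set(element, label);
    return isNew;
  }

  // WeakMap has no clear(); a reset (for example after navigation) swaps in a fresh map.
  reset(): void {
    this.seen = new WeakMap<E, string>();
  }
}

const tracker = new SeenElements<{ tag: string }>();
const button = { tag: "button" };
const first = tracker.markAndCheckNew(button, "Submit"); // true: would get the * prefix
const second = tracker.markAndCheckNew(button, "Submit"); // false: already seen
```

A WeakMap suits this job because the runtime never has to remember to delete entries for elements the application has thrown away.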

What this buys

The text-DOM path is cheap enough that the entire agent can run inside the user's existing browser tab, with the user's existing LLM subscription, with no server. The script tag above is the deployment artifact. It is 60 KB gzipped, with no peer dependencies and no build step required of the host application.

For SaaS companies, this changes the calculus of "should we ship an AI copilot?" The build-vs-buy analysis collapses. Shipping a copilot is a <script> tag and an API key. The copilot runs against the same DOM your QA tests run against, with the same selectors your features use, in the same browser session the user is already in.

That is the point of the text-DOM choice. It is not a technical preference. It is a deployment preference disguised as a technical one. The runtime picked the cheapest perception that could still drive a web page, so that any product team could embed an agent without spinning up infrastructure.

What the model cannot see, and what it does about it

The text-DOM path is not just impoverished in obvious ways. It has subtler blind spots that surface only when the agent is mid-task.

The first is *temporal state*. The runtime snapshots the DOM at the start of each step. If the page mutates between the snapshot and the model's action — a tooltip appears, a popover dismisses, an animation completes — the model's next observation will reflect the new state, but its reflection block cannot anticipate the mutation. The model has to observe, decide, act, and observe again. It cannot act on a predicted state.
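The observe, decide, act, observe cycle described above can be sketched as a bounded loop. All names here (LoopDeps, runLoop, Decision) are hypothetical, and the three callbacks are injected so the sketch runs without a browser or an LLM. The step cap of 40 mirrors the "Step 6 of 40" panel shown earlier and is an assumption about where that number comes from.

```typescript
interface Decision {
  action: string;
  done: boolean;
}

interface LoopDeps {
  observe: () => Promise<string>; // fresh indexed-text snapshot
  decide: (snapshot: string) => Promise<Decision>; // one LLM call
  act: (action: string) => Promise<void>; // apply the action to the page
}

async function runLoop(
  deps: LoopDeps,
  maxSteps: number,
): Promise<{ steps: number; done: boolean }> {
  for (let step = 1; step <= maxSteps; step++) {
    // Always re-read the page: it may have changed since the last action.
    const snapshot = await deps.observe();
    const decision = await deps.decide(snapshot);
    if (decision.done) return { steps: step, done: true };
    await deps.act(decision.action);
  }
  return { steps: maxSteps, done: false };
}

// A fake page whose content changes after the first action.
let pageText = "[0]<button>Open menu />";
const snapshotsSeen: string[] = [];
const outcome = runLoop(
  {
    observe: async () => pageText,
    decide: async (snapshot) => {
      snapshotsSeen.push(snapshot);
      return snapshot.includes("Logout")
        ? { action: "", done: true }
        : { action: "click 0", done: false };
    },
    act: async () => {
      pageText = "[0]<button>Open menu />\n*[1]<a>Logout />";
    },
  },
  40,
);
```

The stand-in model only learns about the Logout link on its second observation, which is the point: it reacts to the mutation after the fact and never acts on a predicted state.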

The second is *off-screen content*. By default, viewportExpansion: -1 returns the full page, but the LLM still has to scroll to interact with content below the fold. Off-screen elements are not in the user's visual focus, but they are in the indexed text. The model sees them; the user does not. This asymmetry can produce actions the user did not anticipate — the model scrolls, clicks something out of view, and the page jumps. The runtime surfaces the scroll-distance hint (... 600 pixels below ...) in the prompt's footer to give the model context.
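The scroll-distance hint reduces to simple arithmetic on scroll metrics. The sketch below is illustrative only: the function names and the exact wording of the hint are assumptions, since the text above shows just the fragment "... 600 pixels below ...". The field names match the standard DOM properties scrollTop, clientHeight, and scrollHeight.

```typescript
interface ScrollMetrics {
  scrollTop: number; // pixels already scrolled past
  clientHeight: number; // visible height of the viewport or container
  scrollHeight: number; // total height of the content
}

function scrollDistances(m: ScrollMetrics): { above: number; below: number } {
  const above = Math.max(0, Math.round(m.scrollTop));
  const below = Math.max(0, Math.round(m.scrollHeight - m.clientHeight - m.scrollTop));
  return { above, below };
}

// Hypothetical wording for the hint line.
function scrollHint(m: ScrollMetrics): string {
  const { above, below } = scrollDistances(m);
  return `... ${above} pixels above, ${below} pixels below ...`;
}

const hint = scrollHint({ scrollTop: 200, clientHeight: 800, scrollHeight: 1600 });
// hint → "... 200 pixels above, 600 pixels below ..."
```

The same two numbers would also fill a per-container hint such as data-scrollable="top=200, bottom=600".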

The third is *visual ambiguity*. Two buttons with the same text — "Submit" — are indistinguishable by text alone. The runtime preserves aria-label, title, data-state, and disabled to help, but if the application does not provide those attributes, the model is guessing. The system prompt acknowledges this: "If there are multiple elements that could be the target, the model should look for attributes that disambiguate." When disambiguation fails, the ask_user tool exists for exactly this case — but the runtime disables ask_user if the host has not provided an onAskUser callback.
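The gating of ask_user on a host callback might look roughly like the sketch below. HostOptions and availableTools are hypothetical names, and the tool names other than ask_user and click_element_by_index are placeholders; only the onAskUser callback and the disable-when-missing behavior come from the text above.

```typescript
type AskUser = (question: string) => Promise<string>;

interface HostOptions {
  onAskUser?: AskUser; // supplied by the embedding application, if at all
}

// Offer ask_user to the model only when the host can actually answer.
function availableTools(options: HostOptions): string[] {
  const tools = ["click_element_by_index", "scroll", "done"]; // illustrative list
  if (typeof options.onAskUser === "function") tools.push("ask_user");
  return tools;
}

const withoutCallback = availableTools({});
const withCallback = availableTools({
  onAskUser: async (question) => `answer to: ${question}`,
});
```

Hiding the tool entirely, instead of letting the call fail at runtime, keeps the model from planning around a capability the host never wired up.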

These blind spots are not bugs. They are the predictable consequences of the perception choice. The runtime does not pretend they do not exist. The README lists them under "Known Limitations," and the system prompt instructs the model to prefer honest reporting over silent guessing.

The trade is not "text is better than screenshots" or "screenshots are better than text." The trade is "what does this agent need to see, and what is the cheapest way to give it that view?" For page-agent, the answer is: enough to identify interactive elements, not enough to render them. The rest is left to the user, the model, and the application.

If the page is reduced to indexed text, the obvious next question is: how does the model *think* about that text? It receives a flat string with hundreds of [index]<tag>text</tag> lines. It must output a structured response — not just an action, but an action plus a reflection on what it just learned. That reflection contract is where the agent's reasoning lives.

---
