One Core, Three Surfaces

The same PageAgentCore powers three deployments: a <script> tag in your SaaS, a Chrome extension with a side panel, and an MCP server that lets Claude Desktop drive your browser. The decomposition is not product strategy — it is the natural consequence of separating agent reasoning from DOM control.

Key Takeaways

  • PageAgentCore has no DOM dependency of its own. It accepts a PageController interface that can be local, remote, or test-mocked.
  • The in-page surface (the script tag), the extension surface (the side panel), and the MCP surface (Claude Desktop) compose the same five packages in three different orders.
  • The cross-page extension adds a content-script boundary; the MCP server adds an HTTP/WebSocket boundary. Both preserve the single-tool, single-step loop.
  • v1.8.0 bound the MCP HTTP+WS server to localhost only — a security decision, not a constraint.

---

The page-agent repository ships six packages, two applications, and one website. From the outside, the project looks like a small product suite — a library, an extension, an MCP server, a docs site. From the inside, the package graph tells a different story. There is exactly one piece of agent logic. Everything else is plumbing to give that logic a place to run.

That piece is packages/core/src/PageAgentCore.ts. It does not import from react, vite, chrome, or mcp. It imports from @page-agent/llms and @page-agent/page-controller, and it never constructs a controller itself: it accepts a PageController through the constructor config:

```typescript
// packages/core/src/PageAgentCore.ts:29
export type PageAgentCoreConfig = AgentConfig & { pageController: PageController }
```

The core treats PageController as an interface rather than as one concrete class. The contract is implemented by packages/page-controller/src/PageController.ts (the in-page version) and by packages/extension/src/agent/RemotePageController.ts (the version that crosses the Chrome content-script boundary), and it could also be implemented by a test double that returns scripted responses. The core does not know which one it is talking to.

That single interface split is why three product surfaces can exist. Each surface provides its own implementation of PageController and adds the surface-specific UI. The agent loop is unchanged.
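A minimal sketch of that split, assuming a much smaller contract than the real one. The method names getBrowserState and click echo operations named in this chapter, but their signatures, the BrowserState shape, and the names PageControllerLike, ScriptedController, and runOneStep are illustrative assumptions, not the library's API.

```typescript
// Hypothetical, simplified contract; the real PageController has more methods.
interface BrowserState {
  url: string
  elements: string[]
}

interface PageControllerLike {
  getBrowserState(): Promise<BrowserState>
  click(index: number): Promise<string>
}

// Test double: replays a scripted state and records every call it receives.
class ScriptedController implements PageControllerLike {
  readonly calls: string[] = []
  private readonly state: BrowserState

  constructor(state: BrowserState) {
    this.state = state
  }

  async getBrowserState(): Promise<BrowserState> {
    this.calls.push('getBrowserState')
    return this.state
  }

  async click(index: number): Promise<string> {
    this.calls.push(`click(${index})`)
    return `clicked ${this.state.elements[index] ?? 'nothing'}`
  }
}

// Stand-in for one step of the agent loop: observe, then run exactly one action.
// It depends only on the interface, so any implementation can be swapped in.
async function runOneStep(controller: PageControllerLike, target: string): Promise<string> {
  const state = await controller.getBrowserState()
  const index = state.elements.indexOf(target)
  if (index === -1) return `no element named ${target}`
  return controller.click(index)
}
```

Because runOneStep only sees PageControllerLike, the same function would run unchanged against an in-page controller, a remote one, or this scripted double.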

The in-page theater

The most familiar surface is the script tag. A developer writes:

```html
<script src="https://cdn.jsdelivr.net/npm/page-agent/dist/iife/page-agent.demo.js" crossorigin="true"></script>
```

A floating panel appears. The user types "click the login button." The agent runs.

Behind the panel:

```typescript
// packages/page-agent/src/PageAgent.ts:13-29
export class PageAgent extends PageAgentCore {
    panel: Panel

    constructor(config: PageAgentConfig) {
        const pageController = new PageController({
            ...config,
            enableMask: config.enableMask ?? true,
        })
        super({ ...config, pageController })
        this.panel = new Panel(this, {
            language: config.language,
            promptForNextTask: config.promptForNextTask,
        })
    }
}
```

The wiring is short. PageAgent constructs a PageController in the same JavaScript realm as the host page, hands it to the core, and attaches a Panel to the core's event stream. The whole class is about fifteen lines. The complexity is in the parts being composed, not in the composition.
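One detail in the constructor is worth isolating: config.enableMask ?? true uses nullish coalescing, so the mask defaults to on only when the caller left the option unset, and an explicit false survives. A small sketch, where the MaskOptions type and both helper names are hypothetical stand-ins for the real config:

```typescript
// Hypothetical slice of the config; only the enableMask name comes from the source.
type MaskOptions = { enableMask?: boolean }

// Default to true only when the option is null or undefined.
function resolveEnableMask(options: MaskOptions): boolean {
  return options.enableMask ?? true
}

// For contrast: `||` also replaces an explicit false, so the mask could never be turned off.
function resolveWithOr(options: MaskOptions): boolean {
  return options.enableMask || true
}
```

With ??, a host application that passes enableMask: false keeps the mask disabled; with ||, that choice would be silently overridden.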

This is the surface for SaaS AI copilots. The deployment artifact is a CDN script tag; the runtime shares the user's browser session with the host application; the LLM API key belongs to the user or to the application depending on configuration. No backend rewrite, no headless browser farm, no GPU.

The extension theater

The cross-page problem breaks the in-page theater. A user opens five tabs to compare prices across vendors. A single tab's PageAgent cannot see the other four. It cannot click into them, read their content, or coordinate actions across them. Each tab is a sandbox.

The Chrome extension theater solves this. The architecture is documented in packages/mcp/README.md and lives in packages/extension/src/agent/MultiPageAgent.ts:

```mermaid
flowchart TB
  subgraph Sidepanel
    A["MultiPageAgent<br/>Agent loop · LLM call"]
  end
  subgraph Background["Background Service Worker<br/>(message relay)"]
    B["chrome.runtime.sendMessage<br/>↔ chrome.tabs.sendMessage"]
  end
  subgraph ContentScript["Content Script (per tab)"]
    C["PageController<br/>real DOM ops · getBrowserState · click · input"]
  end
  A -->|"structured message"| B
  B -->|"relay to target tab"| C
  C -->|"BrowserState snapshot"| B
  B -->|"relay back"| A
```

MultiPageAgent runs in the extension's side panel or popup. It uses the same LLM client and the same reflection schema as PageAgentCore. The difference is that its PageController is RemotePageController — a thin client that serializes method calls and sends them across the chrome.runtime boundary to a PageController running in each tab's content script. The content script's PageController performs the real DOM operation and serializes the result back.
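The "thin client that serializes method calls" idea can be sketched without any Chrome APIs. Everything here is an assumption made for illustration: the RpcRequest and RpcResponse shapes, the Transport type, and the class and function names are hypothetical, and a JSON round trip stands in for the chrome.runtime message boundary.

```typescript
// Wire format for one remote call; the field names are illustrative.
type RpcRequest = { method: string; args: unknown[] }
type RpcResponse = { ok: true; value: unknown } | { ok: false; error: string }

// Anything that can carry a request to the other side and bring back a reply.
type Transport = (request: RpcRequest) => Promise<RpcResponse>

// Tab side: owns the object that does the real work and dispatches by method name.
function createTabHandler(target: Record<string, (...args: any[]) => unknown>): Transport {
  return async (request) => {
    const fn = target[request.method]
    if (typeof fn !== 'function') {
      return { ok: false, error: `unknown method ${request.method}` }
    }
    try {
      // Round-trip through JSON to mimic the structured-message boundary.
      const args = JSON.parse(JSON.stringify(request.args)) as unknown[]
      const value = await fn(...args)
      return { ok: true, value: JSON.parse(JSON.stringify(value ?? null)) }
    } catch (err) {
      return { ok: false, error: err instanceof Error ? err.message : String(err) }
    }
  }
}

// Panel side: a thin client that turns each method call into a message.
class RemoteControllerSketch {
  private readonly send: Transport

  constructor(send: Transport) {
    this.send = send
  }

  private async call(method: string, ...args: unknown[]): Promise<unknown> {
    const response = await this.send({ method, args })
    if (!response.ok) throw new Error(response.error)
    return response.value
  }

  click(index: number): Promise<unknown> {
    return this.call('click', index)
  }

  getBrowserState(): Promise<unknown> {
    return this.call('getBrowserState')
  }
}
```

The caller awaits click exactly as it would on a local controller; only the latency and the serialization cost differ.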

Three things this surface adds over the in-page surface:

  • A trust boundary. The content script runs in an isolated world attached to the tab: it shares the page's DOM but not the page's JavaScript variables. The side panel runs as an extension page in the extension's own context. Every operation crosses a serialization boundary between the two. This is slower than in-page, but it lets one agent coordinate multiple tabs without a server.
  • A tab controller. TabsController (packages/extension/src/agent/TabsController.ts) tracks which tabs are open, which one is active, and which ones the agent has permission to control. The user can grant or revoke permission per tab. v1.6.3 added an experimental flag, experimentalIncludeAllTabs, for users who want the agent to control every tab without prompting.
  • A hub for cross-context communication. The extension's entrypoints/hub/hub-ws.ts exposes a WebSocket endpoint that other surfaces (notably the MCP server) can talk to. The hub is generic — it has no knowledge of MCP, no knowledge of the agent loop. It relays messages.
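The per-tab permission idea from the list above can be sketched as a small tracker. This is not the extension's TabsController: the class name, its methods, and the rule that the include-all flag overrides individual grants are all assumptions made for illustration.

```typescript
// Hypothetical permission tracker; not the extension's TabsController.
class TabPermissionsSketch {
  private readonly granted = new Set<number>()
  private readonly includeAllTabs: boolean

  // includeAllTabs loosely mirrors the idea behind experimentalIncludeAllTabs.
  constructor(includeAllTabs = false) {
    this.includeAllTabs = includeAllTabs
  }

  grant(tabId: number): void {
    this.granted.add(tabId)
  }

  revoke(tabId: number): void {
    this.granted.delete(tabId)
  }

  // Assumption: with the include-all flag set, every tab is controllable without a grant.
  canControl(tabId: number): boolean {
    return this.includeAllTabs || this.granted.has(tabId)
  }

  // Filter a list of open tab ids down to the ones the agent may touch.
  controllable(openTabIds: number[]): number[] {
    return openTabIds.filter((id) => this.canControl(id))
  }
}
```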

The cross-page loop still does one tool call per step. The tool execution is now async over a message channel. The reflection contract is unchanged. The agent does not know it is crossing a trust boundary — it just sees slower responses.

The MCP theater

The MCP server is the newest surface. It exists to let AI agents that are not the page-agent (Claude Desktop, Cursor, Copilot, anything that speaks the Model Context Protocol) control the user's browser through page-agent.

```mermaid
flowchart LR
  subgraph Desktop["Claude Desktop / Cursor / Copilot"]
    A["MCP client<br/>stdio JSON-RPC"]
  end
  subgraph Server["@page-agent/mcp (Node.js process)"]
    B["MCP server<br/>stdio"]
    C["HTTP + WebSocket bridge<br/>localhost:38401"]
  end
  subgraph Browser["User's Browser"]
    D["Launcher page<br/>opens hub tab"]
    E["Hub tab<br/>extension-side WebSocket peer"]
    F["MultiPageAgent<br/>(via extension)"]
  end
  A -->|stdio| B
  B -->|HTTP| D
  D -->|triggers| E
  E -->|WebSocket| C
  B -->|executes task via hub| E
  E -->|delegates to| F
```

The MCP server is a tiny process. It exposes three tools:

| Tool | Input | Behavior |
|------|-------|----------|
| execute_task | { task: string } | Sends a task to the hub; blocks until the agent reports completion. |
| get_status | (none) | Returns { connected, busy }. |
| stop_task | (none) | Aborts the in-flight task via the hub's WebSocket protocol. |
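To make the three-tool contract concrete, here is a sketch of a dispatcher over those tool names. The tool names and the { connected, busy } status shape come from the table; the HubLike interface, the ToolCall and ToolResult types, and the refusal messages for a disconnected or busy hub are hypothetical, and no MCP SDK is used.

```typescript
// Hypothetical hub client; the real hub speaks a WebSocket protocol.
interface HubLike {
  connected: boolean
  busy: boolean
  run(task: string): Promise<string>
  stop(): void
}

type ToolCall =
  | { name: 'execute_task'; input: { task: string } }
  | { name: 'get_status' }
  | { name: 'stop_task' }

type ToolResult = { text: string }

async function handleToolCall(hub: HubLike, call: ToolCall): Promise<ToolResult> {
  switch (call.name) {
    case 'execute_task':
      // Assumed guard rails: refuse when there is no hub or a task is already running.
      if (!hub.connected) return { text: 'hub not connected' }
      if (hub.busy) return { text: 'another task is running' }
      // Awaits until the agent reports completion, matching the blocking behavior.
      return { text: await hub.run(call.input.task) }
    case 'get_status':
      return { text: JSON.stringify({ connected: hub.connected, busy: hub.busy }) }
    case 'stop_task':
      hub.stop()
      return { text: 'stop requested' }
  }
}
```

The discriminated union keeps the switch exhaustive: adding a fourth tool to ToolCall would make the compiler flag the missing case.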

The server is bound to localhost only. v1.8.0 made this explicit:

Bound the MCP HTTP + WebSocket server to localhost only.

That is a security decision. The MCP server has direct access to the user's browser session via the extension hub. If the server accepted remote connections, any host on the network could issue tasks. Binding to localhost means only processes running on the user's own machine can reach the server.
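In Node.js terms, a localhost-only binding comes down to passing an explicit loopback host to listen. The sketch below uses only the standard node:http module; the function names and the trivial response body are illustrative, and this is not the @page-agent/mcp server's actual code.

```typescript
import { createServer, type Server } from 'node:http'
import type { AddressInfo } from 'node:net'

// Start an HTTP server reachable only through the loopback interface.
// The explicit host matters: when it is omitted, Node listens on all interfaces.
function startLoopbackServer(port: number): Promise<Server> {
  const server = createServer((_req, res) => {
    res.writeHead(200, { 'content-type': 'text/plain' })
    res.end('ok')
  })
  return new Promise((resolve, reject) => {
    server.once('error', reject)
    server.listen(port, '127.0.0.1', () => resolve(server))
  })
}

// Report where the server is actually bound, for logging or a startup check.
function boundAddress(server: Server): string {
  const info = server.address() as AddressInfo
  return `${info.address}:${info.port}`
}
```

A startup check that compares boundAddress against the expected loopback host is a cheap way to catch a regression to an all-interfaces bind.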

The interesting surface-level shift here is *who initiates the conversation*. In the in-page surface, the user's application initiates — it loads the script, and the user types a task. In the extension surface, the user initiates through the side panel. In the MCP surface, the *AI agent* in Claude Desktop initiates. The user asks Claude a question, Claude decides a browser task is needed, Claude calls execute_task, the agent in the user's browser executes it, and the result flows back to Claude. The user is two agents deep — one in their desktop, one in their browser — and the user's request is decomposed across both.

The actor playing Hamlet

There is a useful analogy here, though I want to be careful with it. The page-agent stack is like an actor playing the same role — a web page operator — in three different theaters with three different stage-managers and three different audiences.

In the in-page theater, the stage-manager is PageController and the audience is the SaaS application the agent is embedded in. The actor has full access to the stage (the page's DOM) and the script (the system prompt). The performance is local, immediate, and free of serialization.

In the extension theater, the stage-manager is RemotePageController and the audience is whatever tabs the extension is allowed to control. The actor rehearses in the side panel but performs in each tab. Every line crosses a content-script boundary. The performance is local to the user's machine but distributed across tabs.

In the MCP theater, the stage-manager is the WebSocket hub and the audience is whatever AI agent is talking to the MCP server — Claude Desktop today, possibly other MCP clients tomorrow. The actor is not even aware of the audience's identity. The performance is whatever the upstream agent asks for, executed in the user's browser.

The actor is the same. The role is the same. The system prompt is the same. The tool vocabulary is the same. What changes is the *interface* between the role and its environment.

The analogy breaks at one place: in real theater, the actor adjusts to the stage. In page-agent, the agent does not adjust at all. The architecture refuses to special-case per surface. There is no "extension mode" or "MCP mode" branch in PageAgentCore. The differences are entirely in how the surrounding plumbing handles serialization, scheduling, and UI.

What the package boundaries enforce

The repository's AGENTS.md documents the module boundaries in plain language:

Page Agent: Main entry with UI. Extends PageAgentCore and adds Panel. Imports from @page-agent/core, @page-agent/ui
Core: PageAgentCore without UI. Imports from @page-agent/llms, @page-agent/page-controller
LLMs: LLM client with MacroToolInput contract. No dependency on page-agent
UI: Panel and i18n. Decoupled from PageAgent via PanelAgentAdapter interface
Page Controller: DOM operations with optional visual feedback (SimulatorMask). No LLM dependency. Enable mask via enableMask: true config

The topology is a DAG. core depends on llms and page-controller. page-agent depends on core, ui, and page-controller. extension depends on core, ui, and page-controller. mcp depends on extension (transitively, through the hub protocol). Nothing depends on page-agent — the entry-class library is the leaf product, not a shared dependency.
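The claims in that paragraph are checkable. The sketch below encodes the edges exactly as stated above, treats the mcp-to-extension relationship as a plain edge for simplicity, and assumes llms, ui, and page-controller have no dependencies inside this set (the text does not say either way). The function names are illustrative.

```typescript
// Dependency edges as described in the text: package -> packages it depends on.
const deps: Record<string, string[]> = {
  llms: [],
  ui: [],
  'page-controller': [],
  core: ['llms', 'page-controller'],
  'page-agent': ['core', 'ui', 'page-controller'],
  extension: ['core', 'ui', 'page-controller'],
  mcp: ['extension'],
}

// Depth-first topological sort; throws if it finds a cycle.
function buildOrder(graph: Record<string, string[]>): string[] {
  const done = new Set<string>()
  const visiting = new Set<string>()
  const order: string[] = []

  const visit = (name: string): void => {
    if (done.has(name)) return
    if (visiting.has(name)) throw new Error(`cycle through ${name}`)
    visiting.add(name)
    for (const dep of graph[name] ?? []) visit(dep)
    visiting.delete(name)
    done.add(name)
    order.push(name)
  }

  for (const name of Object.keys(graph)) visit(name)
  return order
}

// Packages that list `target` as a direct dependency.
function dependents(graph: Record<string, string[]>, target: string): string[] {
  return Object.keys(graph).filter((name) => (graph[name] ?? []).indexOf(target) !== -1)
}
```

buildOrder(deps) succeeding is the "it is a DAG" claim, and dependents(deps, 'page-agent') returning an empty list is the "nothing depends on page-agent" claim.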

This is what makes the three-surface story possible. If the core depended on chrome.* APIs, the in-page library could not run it on an ordinary web page. If the core depended on window directly, it could not be exercised in Node, where no window exists. If the controller depended on a specific LLM, the test suite could not stub the model.

The boundaries are not product strategy. They are the architectural shape that lets the same agent appear in three theaters without modification. Every package boundary is a boundary the LLM does not know about.

The cost of the decomposition

The decomposition is not free. There are three concrete costs.

First, deployment complexity. A user who wants all three surfaces must install the library, the extension, and run the MCP server. The cross-surface wiring (the launcher page that triggers the hub, the WebSocket connection, the local HTTP server) is non-trivial. The mcp package ships a launcher.html that detects the extension and triggers the hub to open — a piece of glue that exists only because the surfaces are separate processes.

Second, debugging opacity. When a task fails in the MCP theater, the failure could be in Claude Desktop, in the MCP server, in the WebSocket bridge, in the hub, in the extension, in the content script, in the page controller, in the agent loop, or in the LLM. The architecture provides structured logging at each layer, but a user tracing a failure must follow the log across process boundaries.

Third, version drift. The extension, the MCP server, and the in-page library share the agent protocol, but they ship at different cadences. v1.10.0 introduced a new lifecycle state (stopped); the extension's heartbeat logic was updated to clear stale activity on any non-running status, including stopped. Each surface must update its state handling in lockstep with the core's lifecycle changes. The architecture accommodates this through clear interfaces, but the work is real.
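The "any non-running status" rule is a small but useful defense against drift. In the sketch below only running and stopped come from the text; the other status names and both function names are assumptions chosen for illustration.

```typescript
// 'running' and 'stopped' are named in the text; the other states are assumed.
type AgentStatus = 'idle' | 'running' | 'completed' | 'error' | 'stopped'

// Written as "anything that is not running" rather than as a list of terminal states,
// so a status added later (as 'stopped' was) is handled without touching this function.
function shouldClearActivity(status: AgentStatus): boolean {
  return status !== 'running'
}

// A listing-style check, for contrast: it silently misses states it was never told about.
function shouldClearActivityListed(status: string): boolean {
  return status === 'completed' || status === 'error'
}
```

A surface built on the listing-style check would have kept showing stale activity after a stop until someone remembered to add the new state.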

For these reasons, the in-page surface is by far the most-used. The extension is for users who specifically need cross-tab automation. The MCP server is for users who already use an MCP-aware agent and want it to control their browser. Each surface has a niche; the union is the product.

Why the MCP surface is the most interesting

Among the three surfaces, the in-page one is the obvious one — the README's "ship an AI copilot in lines of code" pitch. The extension surface is the obvious one for power users. The MCP surface is the one I find most interesting, and not because of what it does today.

The MCP surface inverts the control flow. In the in-page and extension surfaces, the agent is initiated by a human typing into the panel. In the MCP surface, the agent is initiated by *another AI* that has decided a browser task is needed. The user is no longer the agent's interlocutor. The user is the agent's *environment*.

This is a different shape. The user's prompt to Claude Desktop becomes the user-facing task. Claude's decision to invoke the page-agent MCP becomes the internal-to-Claude reasoning. The page-agent executes the task. Claude integrates the result back into its answer. The user sees a Claude response that includes real browser actions.

The architectural implication is that the page-agent protocol becomes a *capability protocol* — a way for any sufficiently capable agent to delegate browser work to a specialist that already lives in the user's browser. The page-agent does not need to be Claude. The page-agent does not need to know which agent is calling it. The protocol is the interface.

That is why the localhost binding matters. The page-agent in the user's browser is *the user's*. It acts under the user's authority, in the user's session, with the user's permissions. The MCP server is a relay, not a controller. The AI agent on the other end of the protocol issues requests, but the page-agent in the browser remains the user's agent.

The three surfaces are not three products. They are three views onto a single capability. The user's AI copilot, the user's browser assistant, the user's MCP-callable browser — all three are the same agent with three different intercoms.
