Why .click() Isn't Enough
The runtime does not call .click() on elements. It dispatches the W3C pointer/mouse event sequence verbatim (pointerover → mouseover → pointerdown → mousedown → focus → pointerup → mouseup → click) because modern web apps listen for the sequence, not the abstraction. Without it, any app that depends on the intermediate events silently breaks.
Key Takeaways
- A real browser click is eight events in a specific order, dispatched to the deepest element at the click point — not a single method call.
- React's onChange fires only when the *native* value setter is used, not when input.value = text is assigned.
- Contenteditable requires Plan A (synthetic InputEvents) plus a Plan B fallback (execCommand) because some editors ignore synthetic events.
- Indexed scroll walks up to ten parents looking for a scrollable container; the deprecated page-scroll heuristic is still in the code, marked for removal.
---
The question I had, the first time I read packages/page-controller/src/actions.ts, was: why is this file 557 lines long? It exposes four public functions. Each of them does something the platform already does. HTMLElement.click() is a built-in. element.scrollIntoView() is built-in. document.execCommand('insertText') is a single line. What is the runtime doing that justifies re-implementing all of it?
The answer, after reading the comments and the test cases, is that the runtime is *not* re-implementing what the platform does. It is re-implementing what a real *user* does — and the two are not the same.
The discrepancy is invisible until you try to write a browser test against a modern React application and discover that .click() does not fire onPointerDown, that setting input.value does not fire React's onChange, that calling element.focus() does not match how a user's tab key navigates. Each of these surprises has a workaround. The runtime implements all of them.
The eight events that constitute a click
A user clicks a button. From the application's perspective, one event happened. From the browser's perspective, eight happened in a specific order, each targeting a specific element. The order matters because some applications install handlers on intermediate events (onMouseEnter, onPointerDown) and use them to update state before the click arrives.
flowchart TD
A["pointerover<br/>bubbles"] --> B["pointerenter<br/>no-bubble"]
B --> C["mouseover<br/>bubbles"]
C --> D["mouseenter<br/>no-bubble"]
D --> E["pointerdown<br/>bubbles"]
E --> F["mousedown<br/>bubbles"]
F --> G["focus<br/>on original element"]
G --> H["pointerup<br/>bubbles"]
H --> I["mouseup<br/>bubbles"]
I --> J["click<br/>activation bubbles to interactive ancestor"]
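actions.ts:clickElement() dispatches all eight; the diagram shows ten nodes only because each over event is drawn separately from its non-bubbling enter counterpart. The dispatch is not naive. There are three subtleties worth seeing: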
1. Hit-test for the deepest element. A button is rarely alone in a coordinate. There is an icon, a label, a tooltip wrapper. A user clicking the button is hitting whatever element is on top at that pixel — usually the icon, sometimes the label, sometimes the button itself. The runtime uses elementFromPoint(x, y) after temporarily disabling the visual mask's pointer-events, then dispatches events to the *hit-test target* while keeping focus on the *original element*:
// actions.ts:86-91
const hitTarget = doc.elementFromPoint(x, y)
await disablePassThrough()
const target =
hitTarget instanceof HTMLElement && element.contains(hitTarget) ? hitTarget : element
This matches what a real browser does when a click lands inside a nested element. The deeper target receives the events; the original element receives focus; the activation (the click() call) bubbles up to the button so onClick handlers fire correctly.
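The target-selection rule in that excerpt can be modelled without a DOM. What follows is a minimal sketch, not the code in actions.ts: NodeLike, containsNode, and chooseDispatchTarget are hypothetical names, and containsNode is assumed to mirror Node.contains (a node contains itself and its descendants).

```typescript
// Hypothetical stand-in for a DOM node; only the parent link matters here.
interface NodeLike {
  label: string
  parentNode: NodeLike | null
}

// Mirrors Node.contains: true when `inner` is `outer` or one of its descendants.
function containsNode(outer: NodeLike, inner: NodeLike | null): boolean {
  let cursor: NodeLike | null = inner
  while (cursor !== null) {
    if (cursor === outer) return true
    cursor = cursor.parentNode
  }
  return false
}

// Events go to the hit-test result only when it lies inside the element the
// caller asked to click; anything else (an overlay, a sibling, nothing at all)
// falls back to the requested element.
function chooseDispatchTarget(requested: NodeLike, hit: NodeLike | null): NodeLike {
  return hit !== null && containsNode(requested, hit) ? hit : requested
}

const buttonNode: NodeLike = { label: 'button', parentNode: null }
const iconNode: NodeLike = { label: 'icon', parentNode: buttonNode }
const overlayNode: NodeLike = { label: 'overlay', parentNode: null }

console.log(chooseDispatchTarget(buttonNode, iconNode).label)    // icon
console.log(chooseDispatchTarget(buttonNode, overlayNode).label) // button
console.log(chooseDispatchTarget(buttonNode, null).label)        // button
```

The fallback branch is the one that matters in practice: if something unrelated covers the click point, the events still reach the element the model chose instead of whatever happens to be on top.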
2. The pointer/mouse event split. Pointer events came after mouse events in the platform timeline. Modern applications often listen for *both*, in part because Safari did not ship pointer events until version 13 in 2019. The runtime dispatches each pointer event immediately before its mouse counterpart, with the exact same coordinates. Comments in actions.ts explain the choice:
Simulate a full click following W3C Pointer Events + UI Events spec order: pointerover/enter → mouseover/enter → pointerdown → mousedown → [focus] → pointerup → mouseup → click
3. blurLastClickedElement between actions. When a previous click set focus on element A and the current click targets element B, the runtime first fires pointerout/pointerleave/mouseout/mouseleave/blur on A. Without this, applications that track "currently focused element" lose track of focus, and the next interaction (a hover-triggered tooltip, a focusin listener) silently fails.
The click sequence is, in total, eight events plus two cross-element state transitions. .click() is one event. The gap between them is the gap between "what the platform exposes" and "what real applications rely on."
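To make the ordering concrete, here is a sketch that writes the sequence down as plain data. It is an illustration under assumptions, not the runtime's implementation: ClickStep, makeStep, and planClick are hypothetical names, the bubbles flags follow the diagram above and the UI Events and Pointer Events specs, and the receiver labels encode the "events to the hit target, focus to the original element" rule described earlier.

```typescript
type Receiver = 'previous' | 'hit' | 'original'

// One step of the simulated click.
interface ClickStep {
  type: string
  bubbles: boolean
  receiver: Receiver
}

function makeStep(type: string, bubbles: boolean, receiver: Receiver): ClickStep {
  return { type, bubbles, receiver }
}

// Returns the ordered steps for one click. When a previously clicked element
// exists, its leave/blur steps come first (the blurLastClickedElement idea).
function planClick(hasPreviousElement: boolean): ClickStep[] {
  const leavePrevious: ClickStep[] = hasPreviousElement
    ? [
        makeStep('pointerout', true, 'previous'),
        makeStep('pointerleave', false, 'previous'),
        makeStep('mouseout', true, 'previous'),
        makeStep('mouseleave', false, 'previous'),
        makeStep('blur', false, 'previous'),
      ]
    : []

  return leavePrevious.concat([
    makeStep('pointerover', true, 'hit'),
    makeStep('pointerenter', false, 'hit'),
    makeStep('mouseover', true, 'hit'),
    makeStep('mouseenter', false, 'hit'),
    makeStep('pointerdown', true, 'hit'),
    makeStep('mousedown', true, 'hit'),
    makeStep('focus', false, 'original'), // focus stays on the requested element
    makeStep('pointerup', true, 'hit'),
    makeStep('mouseup', true, 'hit'),
    makeStep('click', true, 'hit'), // bubbles up to the interactive ancestor
  ])
}

console.log(planClick(true).map((s) => `${s.type}@${s.receiver}`).join(' → '))
```

Keeping the plan as data separates "what order" from "how to construct each event", which makes the order itself easy to test without a browser.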
React's value setter
The text input is harder.
<input type="text" />
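A user types "hello". For each keystroke the browser fires keydown, beforeinput, input, and keyup, in that order (the deprecated keypress, where still emitted, comes right after keydown). The change event is not part of that per-key sequence; it fires later, when the field loses focus or the value is committed. React's onChange handler listens to input events and updates component state. The state update triggers a re-render. The re-render commits a new value to the DOM input element.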
Now an agent wants to set the input value to "hello" programmatically. The natural code is:
element.value = 'hello'
This sets the DOM property. It does *not* fire an input event. React's synthetic event system, which diffs the input's value against the React state, sees no event to react to. The component state is not updated. The user-visible state diverges from the React state. The next re-render wipes the agent's change.
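This is a well-known React quirk, and dispatching an input event after the assignment does not fix it on its own: React shadows the value property on each mounted input with a tracking setter, so a plain assignment updates the tracked value too, and the later event looks like a no-op. The workaround is to call the *native* value setter, the one defined on the element's prototype, which writes to the DOM without passing through React's instance-level setter: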
// actions.ts:223
getNativeValueSetter(element as HTMLInputElement | HTMLTextAreaElement).call(element, text)
The getNativeValueSetter helper, defined in packages/page-controller/src/utils/index.ts, returns the prototype's value setter:
const nativeInputValueSetter = Object.getOwnPropertyDescriptor(
window.HTMLInputElement.prototype,
'value'
).set
The agent calls that, then dispatches a single synthetic input event. React's listener picks it up. The state update propagates. The re-render commits the value. The component and the DOM are now consistent.
This is what actions.ts:inputTextElement() does for <input> and <textarea> elements. The runtime could not use element.value = text because it would silently break every React form in the user's application.
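The mechanism is easier to see in a DOM-free model. The sketch below is a simplification of how React tracks input values, not React's code and not the runtime's: FakeInput, attachTracker, ValueTracker, and wouldFireOnChange are hypothetical names, and the only claim being modelled is that an instance-level setter records every write while the prototype setter bypasses it.

```typescript
// Stand-in for HTMLInputElement: `value` is an accessor on the prototype.
class FakeInput {
  private stored = ''
  get value(): string {
    return this.stored
  }
  set value(next: string) {
    this.stored = next
  }
}

interface ValueTracker {
  lastSeen: () => string
}

// Looks up the prototype-level accessor, the analogue of the "native" setter.
function getPrototypeAccessor(): { get: () => string; set: (v: string) => void } {
  const descriptor = Object.getOwnPropertyDescriptor(FakeInput.prototype, 'value')
  if (!descriptor || !descriptor.get || !descriptor.set) {
    throw new Error('value accessor missing on prototype')
  }
  return { get: descriptor.get, set: descriptor.set }
}

// Model of the tracker: shadow `value` on the instance and remember each write.
function attachTracker(node: FakeInput): ValueTracker {
  const native = getPrototypeAccessor()
  let seen = node.value
  Object.defineProperty(node, 'value', {
    configurable: true,
    get() {
      return native.get.call(node)
    },
    set(next: string) {
      seen = String(next)
      native.set.call(node, next)
    },
  })
  return { lastSeen: () => seen }
}

// An input event only yields onChange when the DOM value differs from what
// the tracker last recorded.
function wouldFireOnChange(node: FakeInput, tracker: ValueTracker): boolean {
  return node.value !== tracker.lastSeen()
}

const plainNode = new FakeInput()
const plainTracker = attachTracker(plainNode)
plainNode.value = 'hello' // goes through the tracking setter
console.log(wouldFireOnChange(plainNode, plainTracker)) // false

const nativeNode = new FakeInput()
const nativeTracker = attachTracker(nativeNode)
getPrototypeAccessor().set.call(nativeNode, 'hello') // bypasses the tracker
console.log(wouldFireOnChange(nativeNode, nativeTracker)) // true
```

In a real page the second half of the trick still applies: after the prototype setter runs, a bubbling input event has to be dispatched so the framework's listener gets a chance to notice the difference.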
Contenteditable and the Plan A / Plan B pattern
The hard case is contenteditable. These are <div contenteditable="true"> elements used by rich-text editors — Notion-style apps, comment fields with mentions, code editors like Monaco. They are not <input>s. They have no value property. They have no native value setter. Their internal state is owned by the editor library (Quill, Slate, Draft.js, ProseMirror, etc.), and the way each library consumes input varies.
// actions.ts:139-218 (excerpted)
if (isContentEditable) {
// Plan A: Dispatch synthetic InputEvents
if (element.dispatchEvent(new InputEvent('beforeinput', {
bubbles: true, cancelable: true,
inputType: 'insertText', data: text,
}))) {
element.innerText = text
element.dispatchEvent(new InputEvent('input', {
bubbles: true, inputType: 'insertText', data: text,
}))
}
// Verify Plan A worked
const planASucceeded = element.innerText.trim() === text.trim()
if (!planASucceeded) {
// Plan B: execCommand fallback
element.focus()
const selection = window.getSelection()
const range = document.createRange()
range.selectNodeContents(element)
selection.removeAllRanges()
selection.addRange(range)
document.execCommand('delete', false)
document.execCommand('insertText', false, text)
}
}
The runtime tries Plan A first — synthetic InputEvents with the correct inputType. This works for React-managed contenteditable, Quill, and most modern editors. The runtime *verifies* by reading element.innerText after the dispatch. If the text wasn't actually inserted, it falls back to Plan B: the deprecated document.execCommand('insertText'), which integrates with the browser's undo stack and most editors' native handling.
The comments are unusually candid:
Strategy: Try Plan A (synthetic events) first, then verify and fall back to Plan B (execCommand) if the text wasn't actually inserted.
Plan A: Dispatch synthetic events
Works: React contenteditable, Quill.
Fails: Slate.js, some contenteditable editors that ignore synthetic events.
That is the entire battle. Some editors will not accept synthetic events under any circumstances. The runtime tries Plan A, checks the result, and falls back. When the fallback also fails, the action returns a failure string, and the next step's prompt tells the model that typing didn't work.
@todo Monaco/CodeMirror: Require direct JS instance access. No universal way to obtain.
@todo Draft.js: Not responsive to synthetic/execCommand/Range/DataTransfer. Unmaintained.
Those editor families are listed as known-unsupported. The runtime is honest about its reach.
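Stripped of the DOM calls, the Plan A / Plan B logic is a try-verify-fall-back loop. The sketch below illustrates that shape under assumptions: EditorLike, InsertStrategy, insertWithFallback, and FakeEditor are hypothetical names, and the fake editor simply stands in for a library that either honors or ignores a given input path.

```typescript
interface EditorLike {
  read: () => string
}

interface InsertStrategy<E extends EditorLike> {
  label: string
  apply: (editor: E, text: string) => void
}

// Tries each strategy in order and trusts none of them: success is decided by
// reading the content back, mirroring the innerText comparison in the excerpt.
// Returns the label of the strategy that worked, or null if all of them failed.
function insertWithFallback<E extends EditorLike>(
  editor: E,
  text: string,
  strategies: InsertStrategy<E>[]
): string | null {
  for (const strategy of strategies) {
    try {
      strategy.apply(editor, text)
    } catch {
      // A throwing strategy counts as a failed attempt; move on to the next.
    }
    if (editor.read().trim() === text.trim()) return strategy.label
  }
  return null
}

// Fake editor that honors or ignores each input path, for demonstration only.
class FakeEditor implements EditorLike {
  content = ''
  constructor(
    readonly honorsSyntheticEvents: boolean,
    readonly honorsExecCommand: boolean
  ) {}
  read(): string {
    return this.content
  }
}

const planA: InsertStrategy<FakeEditor> = {
  label: 'plan A (synthetic events)',
  apply: (ed, text) => {
    if (ed.honorsSyntheticEvents) ed.content = text
  },
}
const planB: InsertStrategy<FakeEditor> = {
  label: 'plan B (execCommand)',
  apply: (ed, text) => {
    if (ed.honorsExecCommand) ed.content = text
  },
}

console.log(insertWithFallback(new FakeEditor(true, true), 'hi', [planA, planB]))   // plan A (synthetic events)
console.log(insertWithFallback(new FakeEditor(false, true), 'hi', [planA, planB]))  // plan B (execCommand)
console.log(insertWithFallback(new FakeEditor(false, false), 'hi', [planA, planB])) // null
```

The null result corresponds to the case described above where the action reports a failure string and the model is told that typing did not work.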
The indexed scroll
Scrolling is the simplest action to write and the hardest to make reliable. A user scrolls with the mouse wheel — but mouse-wheel events are not what actions.ts dispatches. They are not dispatched at all. Instead, the runtime sets element.scrollTop directly, which is what the wheel handler *would* do if the user had scrolled.
The interesting question is *which* element to scroll. The page has many scrollable containers — the viewport itself, side panels, modal dialogs, internal scroll regions in custom widgets. The user knows which one they want to scroll by looking at it. The runtime has to know which one by walking the DOM.
// actions.ts:288-326 (excerpted)
while (currentElement && attempts < 10) {
const computedStyle = window.getComputedStyle(currentElement)
const hasScrollableY =
/(auto|scroll|overlay)/.test(computedStyle.overflowY) ||
(computedStyle.scrollbarWidth && computedStyle.scrollbarWidth !== 'auto') ||
(computedStyle.scrollbarGutter && computedStyle.scrollbarGutter !== 'auto')
const canScrollVertically = currentElement.scrollHeight > currentElement.clientHeight
if (hasScrollableY && canScrollVertically) {
// try to scroll
}
if (currentElement === document.body) break
currentElement = currentElement.parentElement
attempts++
}
When the model calls scroll({index: 14, num_pages: 0.5}), the runtime looks up element 14 in the selector map, walks up to 10 parents checking overflowY, scrollbar-width, and scrollbar-gutter, and finds the first ancestor that actually has scrollable height. It scrolls that. The model does not need to know about the container hierarchy — it only needs to provide the element whose *visible region* it wants to scroll.
This is the indexed scroll the system prompt advertises. It works on multi-panel layouts because the model picks the element in the panel it cares about, and the runtime finds the right container.
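The ancestor walk can be exercised against a hand-built tree. This is a reduced sketch rather than the runtime's function: ScrollNodeLike and findScrollableAncestor are hypothetical names, the computed style is flattened into a plain overflowY string, and the scrollbar-width, scrollbar-gutter, and document.body checks from the excerpt are left out for brevity.

```typescript
// Hypothetical flattened view of an element: style and geometry in one object.
interface ScrollNodeLike {
  label: string
  overflowY: string
  scrollHeight: number
  clientHeight: number
  parentNode: ScrollNodeLike | null
}

// Walks from `start` toward the root, visiting at most `maxAttempts` nodes, and
// returns the first one whose style allows vertical scrolling AND whose content
// actually overflows. Returns null when nothing qualifies within the budget.
function findScrollableAncestor(
  start: ScrollNodeLike,
  maxAttempts: number = 10
): ScrollNodeLike | null {
  let current: ScrollNodeLike | null = start
  let attempts = 0
  while (current !== null && attempts < maxAttempts) {
    const styleAllowsScroll = /(auto|scroll|overlay)/.test(current.overflowY)
    const contentOverflows = current.scrollHeight > current.clientHeight
    if (styleAllowsScroll && contentOverflows) return current
    current = current.parentNode
    attempts++
  }
  return null
}

// body > panel (clipped) > list (scrollable) > row (the indexed element)
const bodyBox: ScrollNodeLike = { label: 'body', overflowY: 'visible', scrollHeight: 3000, clientHeight: 800, parentNode: null }
const panelBox: ScrollNodeLike = { label: 'panel', overflowY: 'hidden', scrollHeight: 400, clientHeight: 400, parentNode: bodyBox }
const listBox: ScrollNodeLike = { label: 'list', overflowY: 'auto', scrollHeight: 2000, clientHeight: 400, parentNode: panelBox }
const rowBox: ScrollNodeLike = { label: 'row', overflowY: 'visible', scrollHeight: 40, clientHeight: 40, parentNode: listBox }

const found = findScrollableAncestor(rowBox)
console.log(found === null ? 'none' : found.label) // list
```

Requiring both conditions is what keeps the walk from stopping at an overflow: auto container whose content happens to fit, where setting scrollTop would do nothing.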
The fallback for when the model does *not* provide an index — scroll({down: true}) with no index — is the deprecated heuristic. The comments mark it as such:
@deprecated Heuristic container search. Unreliable in multi-panel layouts. Should guide LLMs to use indexed scroll for consistency. TODO: remove this fallback
The fallback still works on simple pages. It walks up from document.activeElement, then falls back to a querySelectorAll('*') search for the first scrollable element. On a single-panel page this finds the body. On a multi-panel page it can find a panel the user did not intend to scroll.
The runtime's design choice — index-first, heuristic-second — reflects a deeper commitment: prefer the deterministic, debuggable path. The LLM's role is to choose which element to scroll. The runtime's role is to find the container that holds that element. The deterministic path works as long as the LLM picks the right element. The heuristic path is only used when the LLM cannot pick an element, which is rare.
The visual mask
The runtime ships with a SimulatorMask that overlays the page during automation. It is not required (enableMask: false by default for PageAgentCore; true by default for PageAgent), but it is the default for the in-page experience. Its purpose is twofold: it tells the user "the agent is operating; do not interact with the page right now," and it prevents double-clicks when the agent is mid-action.
// PageController.ts:179-214
// Temporarily bypass mask to allow DOM extraction
if (this.mask) {
this.mask.wrapper.style.pointerEvents = 'none'
}
dom.cleanUpHighlights()
this.flatTree = dom.getFlatTree({...this.config, interactiveBlacklist: blacklist})
// ...
// Restore mask blocking
if (this.mask) {
this.mask.wrapper.style.pointerEvents = 'auto'
}
The mask's pointer events are toggled around DOM extraction so the walk can pass through it. v1.7.0 added passthrough event handling so the mask can selectively allow some input through. v1.8.0 fixed a memory leak where the requestAnimationFrame loop wasn't cancelled on dispose — a sign that visual feedback systems are where production agents leak memory first.
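One way to keep that toggle from leaking is to wrap it so the restore always runs. The sketch below is a variation on the excerpt, not what PageController.ts does: MaskLike and withMaskBypassed are hypothetical names, and the try/finally plus restoring the previous value (rather than hard-coding 'auto') are my assumptions about a defensive version.

```typescript
// Minimal shape of the mask as used in the excerpt: a wrapper with a style.
interface MaskLike {
  wrapper: { style: { pointerEvents: string } }
}

// Runs `work` with the mask made click-through, then restores the previous
// pointer-events value even if `work` throws.
function withMaskBypassed<T>(mask: MaskLike | null, work: () => T): T {
  if (mask === null) return work()
  const previous = mask.wrapper.style.pointerEvents
  mask.wrapper.style.pointerEvents = 'none'
  try {
    return work()
  } finally {
    mask.wrapper.style.pointerEvents = previous
  }
}

const demoMask: MaskLike = { wrapper: { style: { pointerEvents: 'auto' } } }

const observed = withMaskBypassed(demoMask, () => demoMask.wrapper.style.pointerEvents)
console.log(observed)                             // none
console.log(demoMask.wrapper.style.pointerEvents) // auto
```

The finally block covers the failure mode where DOM extraction throws halfway through and the page is left with a mask that no longer blocks the user's clicks.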
The mask is cosmetic relative to the agent's reasoning — the LLM does not know it exists — but it is essential to the user experience. Without it, the agent and the user can both click at once. With it, the agent is the only thing that clicks.
The patterns this leaves behind
Three patterns recur across the action layer:
Verification, not assertion. The runtime does not assume a synthetic event did what it was supposed to. Plan A is verified before Plan B is attempted. The click is dispatched to the deepest element, but focus is kept on the original. The scroll checks actualScrollDelta > 0.5 before reporting success.
Specificity over generality. Each editor family has a different input contract. Rather than abstracting over them, the runtime maintains the lowest-common-denominator Plan A and a deprecated-but-effective Plan B. The abstraction layer would have hidden the differences; the runtime surfaces them.
Honest TODO comments. actions.ts lists the things it cannot do: drag-drop, hover-only interactions, canvas operations, file upload, Monaco and Draft.js. The architecture's reach is bounded, and the bounds are documented inline.
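The scroll check mentioned under "Verification, not assertion" fits in a few lines. This is a sketch under assumptions, not the code in actions.ts: FakeScrollBox and scrollAndVerify are hypothetical names, the fake box clamps scrollTop the way a browser does, and the 0.5 threshold is taken from the actualScrollDelta comparison quoted above.

```typescript
// Fake scroll container whose scrollTop clamps to [0, scrollHeight - clientHeight].
class FakeScrollBox {
  private offset = 0
  constructor(
    readonly scrollHeight: number,
    readonly clientHeight: number
  ) {}
  get scrollTop(): number {
    return this.offset
  }
  set scrollTop(next: number) {
    const max = Math.max(0, this.scrollHeight - this.clientHeight)
    this.offset = Math.min(max, Math.max(0, next))
  }
}

// Requests a scroll, then measures what actually happened instead of assuming
// the assignment took effect.
function scrollAndVerify(
  box: FakeScrollBox,
  deltaPx: number,
  threshold: number = 0.5
): { moved: boolean; actualDelta: number } {
  const before = box.scrollTop
  box.scrollTop = before + deltaPx
  const actualDelta = box.scrollTop - before
  return { moved: Math.abs(actualDelta) > threshold, actualDelta }
}

const demoBox = new FakeScrollBox(1000, 400) // 600px of scrollable range
console.log(scrollAndVerify(demoBox, 200)) // { moved: true, actualDelta: 200 }
console.log(scrollAndVerify(demoBox, 600)) // { moved: true, actualDelta: 400 }
console.log(scrollAndVerify(demoBox, 100)) // { moved: false, actualDelta: 0 }
```

The third call is the case worth reporting back to the model: the container was found and the assignment ran, but the content was already at the bottom.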
These patterns are not unique to browser automation. They are the patterns of any system that interacts with a stateful environment through an event API. What page-agent shows is how strict they have to be when the environment is a web browser in 2026, where every framework adds its own layer between the user's input and the application's state.
The W3C sequence in actions.ts is what the platform already does when a real user clicks. What the runtime adds is the discipline of doing it consistently, in the right order, to the right element, with verification at each step.
---
References:
- packages/page-controller/src/actions.ts: W3C event sequence, contenteditable fallback, indexed scroll
- packages/page-controller/src/utils/index.ts: native value setter, pass-through toggling
- W3C UI Events Specification: click activation behavior
- W3C Pointer Events Specification: pointerdown/pointerup order
- React source: value setter tracking