When the Agent Should Stop
The most underappreciated line in the page-agent codebase is in the system prompt: "It is OK to fail the task." The architecture is built around graceful stopping: done({success: false}), a stopped lifecycle state separate from error, a max-steps guard, and an abort signal threaded through every layer. Reliability here comes from bounded effort, not from always getting it right.
Key Takeaways
- The agent has four terminal states: completed, error, and stopped, plus disposed. Each is a distinct decision the architecture makes visible.
- A single AbortController flows through the LLM fetch, every tool execution, and any user-supplied JavaScript. Stopping the agent takes one method call.
- The done tool takes {success, text}. The model is explicitly told to set success: false if any part of the request is missing; the architecture prefers honest failure over quiet completion.
- v1.10.0 decoupled run status from task outcome. A self-reported failure ends as completed, not error. The lifecycle now distinguishes "the agent finished" from "the task succeeded."
---
A captcha appears on step 38 of 40. The agent is partway through a multi-page form submission. The form is on the user's SaaS application. The captcha is the SaaS's defense against automated submissions. The agent cannot solve it.
What should the agent do?
The system prompt gives a precise answer:
If a captcha appears, tell user you can not solve captcha. Finish the task and ask user to solve it.
So the agent calls done with {success: false, text: "I encountered a captcha on the order confirmation page. Please solve it manually; I will continue from where I stopped."}. The panel shows:
Step 38 of 40
✅ Clicked element (Confirm order).
❌ Failed: I encountered a captcha on the order confirmation page...
Done (success: false)
The user solves the captcha. The user re-issues the task (or, with promptForNextTask: true, the panel prompts the user for the next task and the agent continues). The form completes. The user's data is preserved.
That is the architecture speaking. The agent did not retry the click that hit the captcha. It did not attempt to bypass the captcha. It did not pretend the captcha was not there. It stopped, reported what happened, and yielded control.
I used to think reliability meant higher success rate. The lifecycle in PageAgentCore.ts showed me it means predictable failure boundaries.
The state machine
The agent's lifecycle is one of the few places in the codebase where the design is explicit about its states. AgentStatus is a string union with five values:
```ts
// packages/core/src/types.ts
export type AgentStatus = 'idle' | 'running' | 'completed' | 'error' | 'stopped'
```
The transitions are managed by #setStatus:
```ts
// packages/core/src/PageAgentCore.ts:179-184
#setStatus(status: AgentStatus): void {
  if (this.#status !== status) {
    this.#status = status
    this.#emitStatusChange()
  }
}
```
At the end of every execute() cycle, #setStatus is called in the finally block with the run's terminal state, which the run computes along the way:
```mermaid
stateDiagram-v2
  [*] --> idle
  idle --> running : execute()
  running --> completed : done({success: true or false}) or LLM self-reports failure
  running --> error : agent caught an exception
  running --> stopped : user clicks Stop, signal aborted
  running --> error : step > maxSteps
  completed --> [*]
  error --> [*]
  stopped --> [*]
  note right of completed
    v1.10.0: includes self-reported failures
    "the agent finished" ≠ "the task succeeded"
  end note
```
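Read as data, the diagram is a small transition table. The sketch below encodes it that way so an illegal jump (stopped straight to completed, say) can be caught. transitions, canTransition, and isTerminal are hypothetical names, not page-agent APIs, and letting any terminal state start a fresh run is an assumption based on the reuse behavior described below.

```typescript
// Lifecycle statuses, as declared in packages/core/src/types.ts.
type AgentStatus = 'idle' | 'running' | 'completed' | 'error' | 'stopped'

// Hypothetical transition table mirroring the state diagram.
// Assumption: any terminal state may start a fresh run via execute().
const transitions: Record<AgentStatus, readonly AgentStatus[]> = {
  idle: ['running'],
  running: ['completed', 'error', 'stopped'],
  completed: ['running'],
  error: ['running'],
  stopped: ['running'],
}

// True if the diagram allows moving from `from` to `to`.
function canTransition(from: AgentStatus, to: AgentStatus): boolean {
  return transitions[from].includes(to)
}

// A status is terminal if a running task can end in it.
function isTerminal(status: AgentStatus): boolean {
  return transitions.running.includes(status)
}
```

Keeping transitions as data makes it easy to assert in a test that no path leads from stopped to completed without a new run.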
The terminal states are distinct events with distinct consequences, and idle is the resting state between runs.
completed: the agent called done. The task may have succeeded or failed; either way, the agent reached a point it was willing to call done on. The outcome is reported via lastResult: ExecutionResult { success, data, history }. v1.10.0 made this explicit: previously, LLM self-reported failures ended as error. Since v1.10.0, the agent's *self-assessment* is honored: if the model says "I failed but I am done," the status is completed. error is reserved for things the agent did not expect.
error: something the agent did not expect. An exception escaped the inner try/catch, for example because the OpenAI client returned an invalid schema or a tool threw, and the agent's recovery path could not handle it. The status is error and the message goes into the history as {type: 'error'}.
stopped: the user pressed Stop, dispose() was called, or the abort signal fired for some other reason. The agent is in a clean state. lastResult is populated with success: false, the message "Task aborted", and the history up to that point. After a plain stop, the agent can be reused: a new execute() call resets the state and starts again.
idle: the agent has never run, or has run and is ready to run again.
There is one more state, separate from the lifecycle: disposed. A disposed agent cannot be reused. dispose() cleans up the page controller, aborts any in-flight signal, emits a dispose event for the UI, and sets this.disposed = true. Any subsequent execute() throws "PageAgent has been disposed. Create a new instance." The distinction matters: stop() is reversible; dispose() is terminal.
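The completed/error/stopped split can be sketched as a single try/catch/finally. This is a minimal sketch, not the PageAgentCore implementation: MiniAgent and RunFn are hypothetical, the run callback stands in for the whole step loop, and history, events, and disposal are left out.

```typescript
type AgentStatus = 'idle' | 'running' | 'completed' | 'error' | 'stopped'

// Simplified stand-in for ExecutionResult (history omitted).
interface ExecutionResult {
  success: boolean
  data: string
}

// Hypothetical: runs the whole task and resolves with the done() payload, or throws.
type RunFn = (signal: AbortSignal) => Promise<{ success: boolean; text: string }>

class MiniAgent {
  status: AgentStatus = 'idle'
  lastResult: ExecutionResult | null = null
  private controller = new AbortController()

  async execute(run: RunFn): Promise<ExecutionResult> {
    this.controller = new AbortController() // one controller per task
    this.status = 'running'
    let finalStatus: AgentStatus = 'error'
    let result: ExecutionResult = { success: false, data: 'no result' }
    try {
      const done = await run(this.controller.signal)
      // The model's verdict is kept; the run itself still "completed".
      result = { success: done.success, data: done.text }
      finalStatus = 'completed'
    } catch (error) {
      const isAbortError = (error as Error)?.name === 'AbortError'
      result = { success: false, data: isAbortError ? 'Task aborted' : String(error) }
      finalStatus = isAbortError ? 'stopped' : 'error'
    } finally {
      // Single write point for the terminal status.
      this.lastResult = result
      this.status = finalStatus
    }
    return result
  }

  // Abort the current task; the catch block above maps this to 'stopped'.
  stop(): void {
    this.controller.abort()
  }
}
```

Because the finally block is the only place the terminal status is written, every path through execute() ends in exactly one of completed, error, or stopped.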
The abort signal
The single most important design choice in the lifecycle is that one AbortController is constructed per task and threaded through every async operation that the task performs.
```ts
// packages/core/src/PageAgentCore.ts:222-223
this.#abortController = new AbortController()
const signal = this.#abortController.signal
```
This signal is then passed to:
- the LLM fetch in OpenAIClient.ts:65-73, as the signal option of the fetch call;
- every tool's ToolContext.signal (packages/core/src/tools/index.ts);
- the execute_javascript tool's script scope (the signal is bound into the wrapper function so user-supplied scripts can call signal.throwIfAborted() or check signal.aborted);
- any executeJavascript caller in PageController.ts:383-398.
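Threading one signal through a tool context can be sketched with a cancellable delay. ToolContext here mirrors the idea of ToolContext.signal, while abortError, sleep, and waitTool are hypothetical helpers. The error is tagged with name 'AbortError' because that is the check the agent's outer catch makes. The same signal object could be handed to the LLM request as fetch(url, { signal }).

```typescript
// Hypothetical per-tool context carrying the task's shared signal.
interface ToolContext {
  signal: AbortSignal
}

// Build an error the agent's outer catch would classify as an abort.
function abortError(): Error {
  const error = new Error('The operation was aborted')
  error.name = 'AbortError'
  return error
}

// Cancellable delay: resolves after `ms`, rejects as soon as the signal fires.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(abortError())
      return
    }
    const onAbort = () => {
      clearTimeout(timer)
      reject(abortError())
    }
    const timer = setTimeout(() => {
      signal.removeEventListener('abort', onAbort)
      resolve()
    }, ms)
    signal.addEventListener('abort', onAbort, { once: true })
  })
}

// Hypothetical 'wait' tool: it never needs to know who pressed Stop.
async function waitTool(seconds: number, ctx: ToolContext): Promise<string> {
  await sleep(seconds * 1000, ctx.signal)
  return `Waited ${seconds}s`
}
```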
When the user clicks Stop, the panel calls agent.stop():
```ts
// packages/core/src/PageAgentCore.ts:200-204
async stop(): Promise<void> {
  if (this.#status !== 'running') return
  this.#abortController.abort()
  await this.#running
}
```
That abort() call propagates everywhere. The LLM fetch rejects with AbortError. The in-flight tool receives a rejected promise. The user's execute_javascript script can react to signal.aborted === true. v1.9.0 made this discipline comprehensive; before that, sync tools and loop execution could occasionally ignore the signal and resolve successfully after the user had clicked Stop.
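For execute_javascript, the signal only helps if the script can see it. One way to bind it, assuming the script body is compiled with the standard Function constructor; runUserScript and pollingScript are hypothetical, and the real wrapper may differ:

```typescript
// Hypothetical wrapper: compiles the user's script body with `signal` in scope.
function runUserScript(body: string, signal: AbortSignal): Promise<unknown> {
  // new Function(paramName, functionBody): the body can refer to `signal`
  // and use `await`, because it is wrapped in an async IIFE.
  const compiled = new Function('signal', `return (async () => {\n${body}\n})()`)
  return Promise.resolve(compiled(signal))
}

// Example user script: polls until the task is aborted, then reports its progress.
const pollingScript = `
  let ticks = 0
  while (!signal.aborted) {
    ticks++
    await new Promise((resolve) => setTimeout(resolve, 1))
  }
  return ticks
`
```

abort() only flips signal.aborted and fires the abort event. A script that never checks the flag keeps running, so cancellation inside user code is cooperative.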
AbortError is handled distinctly from InvokeError. v1.9.0 decoupled them:
```ts
// packages/llms/src/index.ts:80-82
if ((error as any)?.name === 'AbortError') throw error
if (error instanceof InvokeError && !error.retryable) throw error
```
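Those two guard lines only make sense inside a retry loop. This sketch places them in one, in the spirit of withRetry but not its actual code: InvokeError is reduced to a message plus the retryable flag, and the default retry count is an assumption.

```typescript
// Simplified stand-in for the library's InvokeError.
class InvokeError extends Error {
  constructor(
    message: string,
    readonly retryable: boolean,
  ) {
    super(message)
    this.name = 'InvokeError'
  }
}

// Retry helper in the spirit of withRetry (not its actual code).
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 2): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error) {
      // The user asked to stop: never retry, hand the error straight up.
      if ((error as Error)?.name === 'AbortError') throw error
      // The client already decided this failure is permanent.
      if (error instanceof InvokeError && !error.retryable) throw error
      lastError = error // transient: try again
    }
  }
  throw lastError
}
```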
AbortError does not trigger the retry loop. It propagates immediately to the agent's outer catch:
```ts
// packages/core/src/PageAgentCore.ts:329-337
const isAbortError = (error as any)?.name === 'AbortError'
if (!isAbortError) console.error('Task failed', error)
const message = isAbortError ? 'Task aborted' : String(error)
this.#emitActivity({ type: 'error', message: message })
this.#emitHistoryChange({ type: 'error', message: message, rawResponse: error })
taskResult = { success: false, data: message, history: this.history }
this.#lastResult = taskResult
finalStatus = isAbortError ? 'stopped' : 'error'
```
The terminal state is stopped, not error. The history event records "Task aborted." The user knows the difference between "the agent failed" and "I stopped the agent." The architecture preserves the distinction.
The max-steps guard
Every step the agent takes is one round-trip. The cost of unbounded reasoning is unbounded spend. The runtime caps this with maxSteps (default 40 since v1.5.1):
```ts
// packages/core/src/PageAgentCore.ts:349-358
step++
if (step > maxSteps) {
  const message = 'Step count exceeded maximum limit'
  console.error(message)
  this.#emitActivity({ type: 'error', message: message })
  this.#emitHistoryChange({ type: 'error', message: message })
  taskResult = { success: false, data: message, history: this.history }
  this.#lastResult = taskResult
  finalStatus = 'error'
  break
}
```
The user sees "Step count exceeded maximum limit" in the panel. The task result has success: false. The history contains every step the agent took before the cap.
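The guard turns the step loop into a bounded loop. A minimal sketch, assuming a hypothetical nextStep callback that performs one LLM round-trip plus tool call and reports whether the model called done; StepOutcome, LoopResult, and runBounded are illustrative names:

```typescript
// Hypothetical result of one LLM round-trip plus tool call.
type StepOutcome = { kind: 'continue' } | { kind: 'done'; success: boolean; text: string }

interface LoopResult {
  status: 'completed' | 'error'
  success: boolean
  data: string
  stepsTaken: number
}

// Bounded loop: the model gets at most maxSteps turns, then the backstop fires.
async function runBounded(
  nextStep: (step: number) => Promise<StepOutcome>,
  maxSteps = 40,
): Promise<LoopResult> {
  let step = 0
  while (true) {
    step++
    if (step > maxSteps) {
      return { status: 'error', success: false, data: 'Step count exceeded maximum limit', stepsTaken: maxSteps }
    }
    const outcome = await nextStep(step)
    if (outcome.kind === 'done') {
      return { status: 'completed', success: outcome.success, data: outcome.text, stepsTaken: step }
    }
  }
}
```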
The model is warned before the cap is hit. At 5 remaining steps:
⚠️ Only 5 steps remaining. Consider wrapping up or calling done with partial results.
At 2 remaining steps:
⚠️ Critical: Only 2 steps left! You must finish the task or call done immediately.
These are not hard stops. The model can ignore them and burn the last steps on a doomed recovery attempt. But the architecture gives the model an honest countdown so it can choose to call done with partial results rather than crash into the wall.
This is what graceful stopping looks like in practice: not a hard cutoff, but a soft countdown with a hard backstop. The model is informed. The model has agency. The architecture guarantees that the model cannot exceed the cap regardless.
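The soft countdown can be sketched as a pure function of the step counter. The thresholds and wording come from the warnings quoted above; the function name stepBudgetWarning and the definition remaining = maxSteps - step are assumptions.

```typescript
// Hypothetical countdown helper; returns the note to show the model, if any.
// Assumption: "remaining" means maxSteps - step after the current step.
function stepBudgetWarning(step: number, maxSteps: number): string | null {
  const remaining = maxSteps - step
  if (remaining === 5) {
    return '⚠️ Only 5 steps remaining. Consider wrapping up or calling done with partial results.'
  }
  if (remaining === 2) {
    return '⚠️ Critical: Only 2 steps left! You must finish the task or call done immediately.'
  }
  return null // far from the cap: no note
}
```

The note only informs. The step > maxSteps check is what actually enforces the limit.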
The done action
The done tool is the formal channel for the agent to declare its task complete. Its schema is two fields:
```ts
// packages/core/src/tools/index.ts:38-52
tools.set(
  'done',
  tool({
    description: 'Complete task. Text is your final response to the user — keep it concise unless the user explicitly asks for detail.',
    inputSchema: z.object({
      text: z.string(),
      success: z.boolean().default(true),
    }),
    execute: async function (this: PageAgentCore, input) {
      // main loop will handle this one
      return Promise.resolve('Task completed')
    },
  })
)
```
The execute body is a placeholder — the main loop intercepts the done action before it reaches the tool executor:
```ts
// packages/core/src/PageAgentCore.ts:317-325
if (actionName === 'done') {
  const success = action.input?.success ?? false
  const data = action.input?.text || 'no text provided'
  console.log(chalk.green.bold('Task completed'), success, data)
  taskResult = { success, data, history: this.history }
  this.#lastResult = taskResult
  finalStatus = 'completed'
  break
}
```
The agent can set success: false and still end as completed. The architecture treats "the agent finished" as orthogonal to "the task succeeded." The history and the lastResult preserve both signals. The UI displays both.
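The interception logic can be isolated into a small function. Action and TaskResult are simplified, hypothetical shapes, and interceptDone is an illustrative name; the fallbacks (success ?? false and the placeholder text) mirror the quoted loop code.

```typescript
// Simplified, hypothetical shapes for a parsed model action and its result.
interface Action {
  name: string
  input?: { text?: string; success?: boolean }
}
interface TaskResult {
  success: boolean
  data: string
}

// 'done' short-circuits the tool executor and always ends the run as 'completed';
// the model's success flag travels with the result instead of deciding the status.
function interceptDone(action: Action): { status: 'completed'; result: TaskResult } | null {
  if (action.name !== 'done') return null // ordinary tools go to the executor
  const success = action.input?.success ?? false
  const data = action.input?.text || 'no text provided'
  return { status: 'completed', result: { success, data } }
}
```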
The system prompt drives this distinction home:
Set success to true only if the full USER REQUEST has been completed with no missing components.
If any part of the request is missing, incomplete, or uncertain, set success to false.
A model that reports success: true for a partial task is lying. The architecture cannot prevent the lie — that is a model-side concern. But the architecture makes the lie visible: every step's action result is in the history; the model's evaluation_previous_goal for each step is in the history; the next_goal for each step is in the history. The user can audit the model's claims against the recorded actions.
This is what "traceability over success rate" means in code. The runtime prefers to record every step and let the user verify, rather than discard steps to save tokens or silently retry until something works.
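Because evaluations and actions are recorded side by side, a reader can audit a success claim mechanically. A hypothetical sketch: HistoryStep is trimmed to the fields discussed here, and auditClaim simply lists failed actions that sit under a success: true verdict, leaving the reader to decide whether each was recovered.

```typescript
// Hypothetical, trimmed history entry (only the fields discussed here).
interface HistoryStep {
  evaluation_previous_goal: string
  next_goal: string
  action: string
  result: 'ok' | 'failed'
}

// Lists recorded failures under a success: true verdict so a reader can check
// whether each one was actually recovered. An honest failure needs no flag.
function auditClaim(history: HistoryStep[], claimedSuccess: boolean): string[] {
  if (!claimedSuccess) return []
  return history
    .filter((step) => step.result === 'failed')
    .map((step) => `verify: "${step.action}" failed before the success claim`)
}
```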
What the lifecycle protects against
The lifecycle is shaped by the failure modes the project has seen.
Runaway loops. A model can ask the same question 30 times without making progress. The max-steps guard caps the damage.
Mid-action cancellation. A user clicks Stop while a click is dispatching. The abort signal reaches the click's event sequence mid-flight. v1.9.0 made sure sync tools (the click dispatcher, the scroll handler) honor the signal — before that, the click would complete even after Stop.
LLM timeouts. A slow model takes 60 seconds per response. The user expects Stop to work immediately. The abort signal reaches the fetch and rejects it. The agent's status becomes stopped, not error.
Concurrent execution. v1.9.0 added a guard: if execute() is called while another task is running, it throws "A task is already running." Before that, two concurrent runs could interleave their history events and confuse the UI.
Disposal during execution. dispose() aborts the in-flight signal and sets disposed = true. Any subsequent execute() throws. The agent cannot be reused, but the in-flight run resolves cleanly.
Each of these is a small guard. Together they form a runtime that does not trap the user in a broken state.
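The concurrency and disposal guards can be sketched together. GuardedAgent is hypothetical; the error messages are the ones quoted above, and swallowing AbortError so that a disposed run still resolves is an assumption modeled on the outer catch shown earlier.

```typescript
// Hypothetical guard layer around a task runner.
class GuardedAgent {
  disposed = false
  private running: Promise<void> | null = null
  private controller: AbortController | null = null

  async execute(task: (signal: AbortSignal) => Promise<void>): Promise<void> {
    // Both checks run synchronously, before any await.
    if (this.disposed) throw new Error('PageAgent has been disposed. Create a new instance.')
    if (this.running) throw new Error('A task is already running.')
    this.controller = new AbortController()
    this.running = task(this.controller.signal)
      .catch((error: Error) => {
        if (error?.name !== 'AbortError') throw error // aborts end quietly
      })
      .finally(() => {
        this.running = null
      })
    return this.running
  }

  dispose(): void {
    this.controller?.abort() // the in-flight run winds down through its catch
    this.disposed = true
  }
}
```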
The pattern this leaves
The lifecycle is one expression of a more general principle in the codebase: prefer bounded effort over unbounded persistence.
The system prompt's "It is OK to fail the task" is the verbal expression of the principle. The done({success: false}) schema field is the structural expression. The max-steps guard is the runtime expression. The stopped lifecycle state is the temporal expression. The auto-fixer's tolerance for malformed model output is the input-handling expression. The deprecation comments in scrollVertically are the maintenance expression.
Each subsystem chooses its own bounded effort. The agent as a whole is bounded by the union of those choices.
The principle extends to the package decomposition. The MCP server is bound to localhost. The in-page library does not implement drag-drop. The extension does not implement canvas operations. Each surface has a defined reach, declared in its README, enforced by its missing tools. The user is told, in advance, what the agent cannot do.
That is not a limitation framed as a feature. It is a feature framed honestly. The architecture's reliability comes from being honest about what it cannot do, and designing the lifecycle so the user is not surprised when it stops.
Where the loop closes
I opened the first piece with a script tag and a panel reaction. Twenty seconds after the user typed "click the login button," the panel read:
Step 6 of 40
✅ Clicked element (Login).
Done (success: true)
I want to return to that panel now. It shows six steps. Each step is one LLM round-trip. Each step is one tool call. Each tool call has a reflection before the action. Each reflection has an evaluation, a memory, a next goal. Each evaluation is honest because the runtime made it hard to be dishonest. Each memory is preserved because the runtime records the history.
If the login flow had ended at a captcha, the panel would have read:
Step 6 of 40
✅ Clicked element (Continue).
Done (success: false)
Six steps still. Same architecture. Different verdict. The panel did not lie. The agent did not pretend. The architecture produced a clean outcome — success or failure, with the boundary visible in both cases.
That is what it means for an agent to know when to stop. It means the runtime tells the model where the boundaries are. It means the model reports what it saw. It means the lifecycle distinguishes "I finished" from "I succeeded." It means the user is never trapped in an agent that will not release them.
The captcha is not a failure of the agent. It is a feature of the agent — the agent refusing to pretend it can do something it cannot, and the architecture making that refusal possible.
---
References:
- packages/core/src/PageAgentCore.ts — state machine and abort flow
- packages/core/src/prompts/system_prompt.md — "It is OK to fail the task," capability limits, completion rules
- packages/core/src/tools/index.ts — done, wait, ask_user definitions
- packages/llms/src/index.ts — withRetry excluding AbortError
- docs/CHANGELOG.md — v1.10.0 lifecycle rework, v1.9.0 abort handling