browser-use/video-use / Chapter 1


Chapter 1 of 5

Why the LLM shouldn't watch the video

Don't feed your coding agent the raw video. Give it a 12 KB text file built from word-level transcripts and a handful of on-demand filmstrips. The recipe sounds perverse, but it is the only way this skill works.

The naive instinct is wrong. When I first read the browser-use/video-use README — drop raw footage in a folder, chat with the agent, get final.mp4 back — I assumed the agent must be running a vision model over the frames. Maybe it had stitched the video into a grid of thumbnails and asked a multimodal LLM to summarize. It does not. It does something stranger: the agent never watches the video at all. It reads a packed phrase-level transcript and makes every editing decision from text, calling a filmstrip-and-waveform PNG only at decision points where the text is ambiguous.

This contrarian move is the entire thesis. As with every interesting contrarian move, getting it wrong fails silently: you'd produce a video with the right cuts but jarring transitions, or a video that no platform will surface because the audio is too quiet, or subtitles that vanish the moment an overlay lands. The rest of this series covers what makes the move actually work: how the transcript gets built, why the edit decisions must still snap to word boundaries, which twelve production rules the renderer enforces in code, how an editor sub-agent picks the best of twenty-three takes of the same line, and what the self-evaluation loop catches before you ever see the output.

Key Takeaways

  • The LLM treats the video the way browser-use treats a web page — as a structured document, not raw pixels. That choice cascades into every other design decision in the skill.
  • A packed phrase-level transcript of an hour of footage is roughly 12 KB. The same footage as raw frames would be about 45 million tokens of what the repository itself frames as "noise".
  • The architecture is not magic. It rests on twelve hard production rules enforced inside helpers/render.py, plus a parallel sub-agent pattern and a self-evaluation loop you don't see unless you go looking.
  • The reader for these chapters is a developer using Claude Code, Codex, or Hermes who wants to understand what video-use does well, what it can't do, and how the helpers enforce the rules that distinguish a usable edit from a silent failure.

How the contrarian move is built

The simplest way to see why this works is to follow the data. A single video source runs through helpers/transcribe.py, which extracts mono 16 kHz PCM with ffmpeg, sends it to ElevenLabs Scribe with model_id=scribe_v1, diarize=true, tag_audio_events=true, and timestamps_granularity=word, and writes the response to <edit_dir>/transcripts/<stem>.json. Caching is a plain existence check: if the JSON already exists, the upload is skipped (transcribe.py L106–109). Over a folder of takes, transcribe_batch.py parallelises this with four workers, and pack_transcripts.py then collapses every transcript into a single takes_packed.md whose lines break on silence gaps and speaker changes. That single markdown file, roughly twelve kilobytes, is the agent's *primary* reading surface.
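The extraction and caching steps can be sketched roughly as follows. This is a minimal illustration, not the code in helpers/transcribe.py: the function names transcript_path, extract_audio_cmd, and needs_upload are hypothetical, the ffmpeg flags are simply the standard ones for mono 16 kHz 16-bit PCM, and the Scribe upload itself is left out.

```python
from pathlib import Path
from typing import List


def transcript_path(edit_dir: Path, video: Path) -> Path:
    """Cached transcript location, following the <edit_dir>/transcripts/<stem>.json layout."""
    return edit_dir / "transcripts" / f"{video.stem}.json"


def extract_audio_cmd(video: Path, wav: Path) -> List[str]:
    """ffmpeg arguments for mono 16 kHz 16-bit PCM with the video stream dropped."""
    return [
        "ffmpeg", "-y",
        "-i", str(video),
        "-vn",                # drop the video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit PCM
        str(wav),
    ]


def needs_upload(edit_dir: Path, video: Path) -> bool:
    """True only when no cached transcript exists yet (hypothetical helper)."""
    return not transcript_path(edit_dir, video).exists()
```

Keying the cache on the file stem keeps re-runs cheap, at the cost of assuming that two takes in one edit directory never share a stem.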

```mermaid
flowchart LR
  A[Raw video<br/>30,000 frames<br/>~45M tokens] -->|can't read| LLM1[LLM]
  B[transcribe.py<br/>+ pack_transcripts.py<br/>12 KB packed MD] -->|reads every cut from this| LLM2[LLM]
  LLM2 -->|edl.json| C[render.py]
  C -->|base.mp4 + overlays + subtitles| D[final.mp4]
  D -->|timeline_view at every cut| E{Self-eval OK?}
  E -->|no, fix & re-render ≤3| C
  E -->|yes| F[Show to user]
```

The same pipeline appears, in different forms, inside README.md L94 (Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval) and SKILL.md L83–105 (the eight-step process: Inventory → Pre-scan → Converse → Propose → Execute → Preview → Self-eval → Iterate). The two agree on the topology, which is one of the cross-source consistencies I keep coming back to. The architecture isn't a slide. It's a sequence of artifacts that survive being implemented independently by transcribe.py, pack_transcripts.py, and render.py.
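The bounded self-evaluation edge in the flow (fix and re-render at most three times) might look something like the sketch below. The three callables are hypothetical stand-ins for the renderer, the inspection step, and whatever applies a fix. Reading "≤3" as three re-renders after the first render, rather than three renders in total, is an assumption.

```python
from typing import Callable, List, Tuple

MAX_RERENDERS = 3  # assumed reading of the "≤3" on the re-render edge


def render_with_self_eval(
    render: Callable[[], str],
    find_issues: Callable[[str], List[str]],
    fix: Callable[[List[str]], None],
) -> Tuple[str, List[str]]:
    """Render, inspect, and re-render at most MAX_RERENDERS times.

    Returns the last output path and any issues that are still open.
    """
    output = render()
    issues = find_issues(output)
    rerenders = 0
    while issues and rerenders < MAX_RERENDERS:
        fix(issues)                    # adjust the edit based on what was found
        output = render()              # produce a fresh output
        issues = find_issues(output)   # inspect the new render, not the old one
        rerenders += 1
    return output, issues
```

Returning the open issues instead of raising leaves the decision about an imperfect render with the caller, which fits a workflow where the user sees the result either way.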

What the text-first architecture buys you is precision. You can ask the agent, in plain English, to "cut the third false start and keep the take where she laughs at 0:42", and it can act on that request without an image: Scribe has already emitted the word-level timestamps, the speaker labels, and the (laughter) audio events as inline tokens. Everything that request needs is already in the text, and the text costs orders of magnitude fewer tokens to read than the frames do. The agent can reason across the entire transcript in a single context window.
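To make the idea of a packed reading surface concrete, here is a small sketch of grouping word-level entries into phrase lines that break on silence gaps and speaker changes. The word dictionary shape (text, start, end, speaker), the 0.5 s gap threshold, and the output line format are all assumptions for illustration. They are not the actual Scribe response schema or the format pack_transcripts.py writes.

```python
def pack_words(words, gap_s=0.5):
    """Group word dicts into phrase lines, breaking on silence or speaker change."""
    lines, current = [], []
    for w in words:
        if current:
            prev = current[-1]
            silent = w["start"] - prev["end"] >= gap_s
            new_speaker = w["speaker"] != prev["speaker"]
            if silent or new_speaker:
                lines.append(current)
                current = []
        current.append(w)
    if current:
        lines.append(current)
    # One text line per phrase: time range, speaker, then the words themselves.
    return [
        f'[{ln[0]["start"]:.2f}-{ln[-1]["end"]:.2f}] {ln[0]["speaker"]}: '
        + " ".join(w["text"] for w in ln)
        for ln in lines
    ]


# Hypothetical sample: an audio event sits inline like any other token.
sample = [
    {"text": "So", "start": 41.10, "end": 41.30, "speaker": "S0"},
    {"text": "anyway", "start": 41.35, "end": 41.80, "speaker": "S0"},
    {"text": "(laughter)", "start": 42.00, "end": 42.90, "speaker": "S0"},
    {"text": "Take", "start": 44.20, "end": 44.50, "speaker": "S0"},
    {"text": "two.", "start": 44.55, "end": 44.90, "speaker": "S0"},
]
packed = pack_words(sample)
```

With this sample, the laugh near 0:42 lands on the first line and the 1.3 s pause before "Take two." starts a second one, so a request phrased in terms of words and moments can be resolved from text alone.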

What it doesn't buy you is real-time visual reasoning. When the agent is choosing between two takes whose transcripts are identical (same words, same duration, same speaker), text is not enough. That's where helpers/timeline_view.py enters: given a video and a [start, end] range, it extracts N evenly-spaced frames and builds a 1920-pixel-wide composite PNG with a waveform envelope, shading over silence gaps ≥ 400 ms, and word labels positioned above the waveform. The agent calls it only at decision points, never in a scan loop and never over every utterance. SKILL.md (L77) is emphatic about this: *"Not a scan tool — use it at decision points, not constantly."*
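Two of the inputs to that composite can be sketched without touching any pixels: the timestamps of the N evenly-spaced frames, and the silence gaps that would be shaded. Both functions are hypothetical illustrations rather than the code in helpers/timeline_view.py. Including both endpoints of the range, and deriving gaps from word timestamps rather than from the audio signal, are assumptions.

```python
def frame_times(start, end, n):
    """N evenly spaced timestamps across [start, end], endpoints included."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [(start + end) / 2]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def silence_gaps(words, min_gap_s=0.4):
    """(gap_start, gap_end) pairs between consecutive words lasting >= min_gap_s."""
    return [
        (a["end"], b["start"])
        for a, b in zip(words, words[1:])
        if b["start"] - a["end"] >= min_gap_s
    ]
```

For example, frame_times(10.0, 12.0, 5) yields one frame every half second from 10.0 to 12.0.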

The pushback I expected, and what changed my mind

I had two priors when I started reading the repository. First, that an LLM editing video would want raw pixel access for aesthetic reasoning: "does the lighting feel warm?" is a question text cannot answer. Second, that a 12 KB transcript would lose whatever signal a vision model would extract from facial expression, gesture, or shot framing. Both priors turn out to be half-right. The transcript does lose those signals. The skill does not care, because most of the cuts the agent is asked to make (filler words, dead space, false starts, retake selection across clips, slideshow pacing) are already captured by Scribe's word-level timestamps, speaker diarization, and audio-event tags. The aesthetic reasoning happens elsewhere: in the color-grading preset (grade.py ships four), in the subtitle style (the shipped bold-overlay is one of many possible force_style strings), and in the choice of animation engine (HyperFrames, Remotion, Manim, PIL; all four are referenced, none is mandatory).

The aesthetic decisions are not the LLM's. They're the user's, made once at the start of the session during "Converse" — what kind of video this is, what aesthetic/brand direction, what pacing feel, what must be preserved, what must be cut. SKILL.md L86: *"Collect: content type, target length/aspect, aesthetic/brand direction, pacing feel, must-preserve moments, must-cut moments, animation and grade preferences, subtitle needs. Do not use a fixed checklist — the right questions are different every time."* The hard-coded defaults live somewhere else: in render.py's SUB_FORCE_STYLE (MarginV=90 to clear TikTok's bottom 25–30% UI chrome), in grade.py's auto_grade_for_clip (bounded to ±8% on every axis), in extract_segment's 30 ms audio fades. Taste is the user's. Correctness is the helpers'.
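Two of those defaults are simple enough to sketch: a grade adjustment bounded to ±8%, and 30 ms audio fades at each end of a segment. The function names below are hypothetical and the code is not taken from grade.py or render.py. The only external detail relied on is ffmpeg's afade filter with its t, st, and d options.

```python
GRADE_LIMIT = 0.08  # the ±8% bound described in the text
FADE_S = 0.030      # the 30 ms audio fade described in the text


def clamp_adjustment(delta: float, limit: float = GRADE_LIMIT) -> float:
    """Keep a proposed grade adjustment inside [-limit, +limit]."""
    return max(-limit, min(limit, delta))


def fade_filter(duration_s: float, fade_s: float = FADE_S) -> str:
    """ffmpeg afade chain: fade in from 0, fade out ending at the segment's end."""
    out_start = max(0.0, duration_s - fade_s)
    return (
        f"afade=t=in:st=0:d={fade_s:.3f},"
        f"afade=t=out:st={out_start:.3f}:d={fade_s:.3f}"
    )
```

For a four-second segment, fade_filter(4.0) produces a fade-out that starts at 3.970 s. The point of bounds like these is that no request from the agent can push a helper outside a safe range.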

That's the deal. You give up fine-grained visual reasoning over the frames. You gain small transcript budgets, word-boundary cuts, deterministic helpers, and a renderer that bakes its own correctness into the filter graph — so the agent's errors are confined to *which* segments to keep, not *how* they compose.

What this series covers

The remaining four chapters each take one slice of that contract:

  • The reading view itself — how takes_packed.md is built, where the silence-break threshold comes from, what the editor sub-agent brief looks like when there are dozens of takes per beat.
  • Twelve hard rules and where the helpers enforce them — every rule has a load-bearing code citation, including the HDR → SDR tonemap chain that catches iPhone footage, the lossless -c copy concat that prevents double re-encoding, and the subtitles-last filter ordering that prevents overlays from hiding captions.
  • Cut craft from speech boundaries — the audio-first cut logic, retake selection across clips, silence-gap targeting, and the working window of 30–200 ms cut padding that absorbs Scribe's 50–100 ms timestamp drift.
  • Self-evaluation and the missing layer — the loop that runs timeline_view.py on the *rendered* output, the parallel sub-agent pattern for animations, and what you'd have to add to make this work for your stack.
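As a preview of the cut-padding idea in the list above, the sketch below widens a segment outward from its first and last word by a padding value clamped to the 30–200 ms window. The function name, the 80 ms default, and the choice to pad outward on both sides are assumptions for illustration, not behavior confirmed from the helpers.

```python
PAD_MIN_S, PAD_MAX_S = 0.030, 0.200  # the 30–200 ms window from the text


def padded_cut(first_word, last_word, pad_s=0.080):
    """Return (in_point, out_point) padded outward from the word boundaries."""
    pad = min(PAD_MAX_S, max(PAD_MIN_S, pad_s))
    in_point = max(0.0, first_word["start"] - pad)  # never before the clip start
    out_point = last_word["end"] + pad
    return in_point, out_point
```

A padding larger than the 50–100 ms drift means a timestamp that is slightly off still leaves the whole word inside the kept segment.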

If you finish the series with a working mental model of how a text-reading LLM produces a render-correct video, the series has done its job. If you finish it slightly suspicious of the architecture and eager to test it on your own footage — better still.

Imagine you have a folder of talking-head takes — three to five clips per beat, maybe forty minutes of raw video between them. You want a three-minute cut. Where you would normally spend two hours in a timeline tool, the agent reads a 12 KB markdown file and proposes an EDL in under a minute. The next chapter shows you what it actually reads.
