browser-use/video-use

Edit videos with coding agents



1. Introduction

Why the LLM shouldn't watch the video

Don't feed your coding agent the raw video. Give it a 12 KB text file built from word-level transcripts and a handful of on-demand filmstrips. The recipe sounds perverse, and it's the only way this skill works.

The naive instinct is wrong. When I first read the browser-use/video-use README (drop raw footage in a folder, chat with the agent, get final.mp4 back), I assumed the agent must be running a vision model over the frames. Maybe it stitched the video into a grid of thumbnails and asked a multimodal LLM to summarize. It does not. It does something stranger: the agent never watches the video at all. It reads a packed phrase-level transcript and makes every editing decision from text, requesting a filmstrip-and-waveform PNG only at decision points where the text is ambiguous.

This contrarian move is the entire thesis. And like every interesting contrarian move, the cost of getting it wrong is silent — you'd produce a video with the right cuts but jarring transitions, or a video that no platform will surface because the audio is too quiet, or subtitles that vanish the moment an overlay lands. The rest of this series is what makes the contrarian move actually work: how the transcript gets built, why the edit decisions must still snap to word boundaries, which twelve production rules the renderer enforces in code, how an editor sub-agent picks the best of twenty-three takes of the same line, and what the self-evaluation loop catches before you ever see the output.

Key Takeaways

The LLM treats the video the way browser-use treats a web page — as a structured document, not raw pixels. That choice cascades into every other design decision in the skill.
A packed phrase-level transcript of an hour of footage is roughly 12 KB of text. The same footage fed in as raw frames would cost about 45 million tokens of what the repository itself calls "noise".
The architecture is not magic. It rests on twelve hard production rules enforced inside helpers/render.py, plus a parallel sub-agent pattern and a self-evaluation loop you don't see unless you go looking.
The reader for these chapters is a developer using Claude Code, Codex, or Hermes who wants to understand what video-use does well, what it can't do, and how the helpers enforce the rules that distinguish a usable edit from a silent failure.
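
The figures above imply a rough per-frame cost and a rough compression ratio. A back-of-envelope check follows, using only the 30,000 frames, 45M tokens, and 12 KB quoted in this chapter. The 4-bytes-per-token conversion is an assumption (a common rule of thumb for English text), not something the repository states.

```python
# Back-of-envelope token budget, using only figures quoted in this chapter.
FRAMES = 30_000              # frame count from the pipeline diagram
RAW_TOKENS = 45_000_000      # "~45M tokens" for the raw-frame approach
PACKED_BYTES = 12 * 1024     # "12 KB" packed transcript

# Assumption: ~4 bytes of English text per token (rule of thumb, not a measurement).
BYTES_PER_TOKEN = 4

tokens_per_frame = RAW_TOKENS // FRAMES
packed_tokens = PACKED_BYTES // BYTES_PER_TOKEN
ratio = RAW_TOKENS // packed_tokens

print(tokens_per_frame)  # 1500
print(packed_tokens)     # 3072
print(ratio)             # 14648
```

Under that assumption the text view is four orders of magnitude cheaper to read, which is why the design can afford to spend tokens on occasional filmstrips.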

How the contrarian move is built

The simplest way to see why this works is to follow the data. A single video source runs through helpers/transcribe.py, which extracts mono 16 kHz PCM with ffmpeg, sends it to ElevenLabs Scribe with model_id=scribe_v1, diarize=true, tag_audio_events=true, timestamps_granularity=word, and writes the response to <edit_dir>/transcripts/<stem>.json. Results are cached on disk: if the JSON already exists, the upload is skipped (transcribe.py L106–109). Over a folder of takes, transcribe_batch.py parallelises this with four workers, and pack_transcripts.py then collapses every transcript into a single takes_packed.md whose lines break on silence gaps and speaker changes. That single markdown file, about twelve kilobytes, is the agent's *primary* reading surface.
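
The cache-then-parallelise behaviour described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names and the injected `transcribe` callable are hypothetical, and a thread pool is assumed as the parallelism mechanism because the work is dominated by network I/O.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import json


def transcript_path(edit_dir: Path, video: Path) -> Path:
    """Where the cached transcript for `video` would live."""
    return edit_dir / "transcripts" / f"{video.stem}.json"


def transcribe_cached(video: Path, edit_dir: Path, transcribe) -> Path:
    """Return the transcript path, calling `transcribe` only on a cache miss."""
    out = transcript_path(edit_dir, video)
    if out.exists():  # cache hit: skip the upload entirely
        return out
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(transcribe(video)))
    return out


def transcribe_batch(videos, edit_dir: Path, transcribe, workers: int = 4):
    """Transcribe several takes in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda v: transcribe_cached(v, edit_dir, transcribe), videos))
```

Passing the transcription function in as an argument keeps the sketch testable without network access; a second run over the same folder touches no API at all.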

flowchart LR
  A[Raw video<br/>30,000 frames<br/>~45M tokens] -->|can't read| LLM1[LLM]
  B[transcribe.py<br/>+ pack_transcripts.py<br/>12 KB packed MD] -->|reads every cut from this| LLM2[LLM]
  LLM2 -->|edl.json| C[render.py]
  C -->|base.mp4 + overlays + subtitles| D[final.mp4]
  D -->|timeline_view at every cut| E{Self-eval OK?}
  E -->|no, fix & re-render ≤3| C
  E -->|yes| F[Show to user]

The same diagram appears — in different forms — inside README.md L94 (Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval) and SKILL.md L83–105 (the eight-step process: Inventory → Pre-scan → Converse → Propose → Execute → Preview → Self-eval → Iterate). The two dia


2. The 12 KB Reading View

The packed transcript that beats raw frames

The single artifact the LLM actually reads is takes_packed.md — phrase-level lines, word-boundary timestamps, breaks on silence ≥ 0.5 s and on speaker change. Once you see what it looks like, you can't unsee why a text-only agent can outperform a vision-augmented one on cut selection.

A packed transcript is deceptively boring to look at. SKILL.md L117–121 shows the shape:

## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.

That is the entire reading substrate. A ## line per take with duration and phrase count, followed by one phrase per line with [start-end] timestamps, a speaker tag, and the verbatim text. There is no granularity finer than the phrase. There is no metadata about framing, lighting, gesture, facial expression, or shot scale. There is, however, a vocabulary that maps directly onto everything an editor needs to know for a talking-head cut: where each utterance begins and ends, who said it, what the previous utterance's last-word timestamp was, and therefore how big the silence gap before the next phrase is.
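
To make the "silence gap" reasoning concrete, here is a small sketch that parses the two sample lines shown above and derives the gap between them. The regular expression and function names are illustrative assumptions based only on the line shape in that sample, not on the repository's parser.

```python
import re

# Matches lines like "  [002.52-005.36] S0 Some text." from the sample above.
PHRASE_RE = re.compile(r"^\s*\[(\d+\.\d+)-(\d+\.\d+)\]\s+(S\d+)\s+(.*)$")


def parse_phrases(packed: str):
    """Return (start, end, speaker, text) tuples; header lines are skipped."""
    phrases = []
    for line in packed.splitlines():
        m = PHRASE_RE.match(line)
        if m:
            start, end, speaker, text = m.groups()
            phrases.append((float(start), float(end), speaker, text))
    return phrases


def silence_gaps(phrases):
    """Gap in seconds between each phrase's end and the next phrase's start."""
    return [round(nxt[0] - cur[1], 2) for cur, nxt in zip(phrases, phrases[1:])]


sample = """## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
"""
phrases = parse_phrases(sample)
print(silence_gaps(phrases))  # [0.72]
```

The 0.72 s gap between the two phrases is exactly the kind of number an editor uses to decide whether a cut can land there.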

When the LLM in this skill "reads" the video, this is what it reads.

flowchart LR
  A[Raw MP4] -->|ffmpeg -vn -ac 1 -ar 16000| B[WAV 16 kHz mono]
  B -->|ElevenLabs Scribe<br/>scribe_v1 / word timestamps / diarize| C["transcripts/C0103.json<br/>~30 KB per hour"]
  C -->|pack_transcripts.py<br/>silence ≥ 0.5 s splits phrases| D["takes_packed.md<br/>~12 KB per hour"]
  D -->|editor sub-agent reads| E[EDL JSON<br/>ranges + grades + overlays + subs]

Where the file comes from

Start at transcribe.py L49–55: extract_audio() runs ffmpeg -vn -ac 1 -ar 16000 -c:a pcm_s16le against the source. That command strips video, forces a single audio channel, resamples to 16 kHz, and writes 16-bit little-endian PCM. The 16 kHz / mono / 16-bit profile is intentional: 16 kHz mono is a common input profile for speech recognition, and the upload is dramatically smaller than the original video file. For a one-hour source the WAV is about 115 MB (16,000 samples/s × 2 bytes × 3,600 s). That is a small fraction of a typical camera file, although it is not necessarily smaller than the source's compressed audio track.
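
The size of the extracted audio follows directly from the profile: uncompressed PCM is sample rate × channels × bytes per sample × seconds. The sketch below computes that and assembles the ffmpeg argument list from the flags quoted above. The function names are hypothetical, and the `-y` (overwrite) flag is an added assumption.

```python
def extract_audio_cmd(src: str, dst: str) -> list:
    """ffmpeg arguments for mono 16 kHz 16-bit PCM, per the flags quoted above."""
    return [
        "ffmpeg", "-y",        # -y (overwrite output) is an assumption
        "-i", src,
        "-vn",                 # drop the video stream
        "-ac", "1",            # one audio channel
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        dst,
    ]


def pcm_bytes(seconds: float, sample_rate: int = 16_000,
              channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Payload size of uncompressed PCM audio (WAV header excluded)."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


print(pcm_bytes(3600) / 1e6)  # 115.2
```

One hour therefore comes to roughly 115 MB of PCM, or about 32 KB for every second of footage.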

Scribe receives the file with diarize=true and tag_audio_events=true (transcribe.py L66–69). Diarization matters because the agent later asks questions like *"is this a clean single-speaker take, or did the second speaker jump in at 0:30?"* If diarization is off, the entire S0 / S1 / ... tagging collapses. Audio-event tagging matters because (laughter), (sigh), (applause) come back as inline tokens that the agent treats as semantic signals: *preserve past the laugh, don't cut mid-sigh, the applause IS the beat.* SKILL.md L107: *"Audio events as signals. (laughs), (sighs), (applause) mark beats. Extend past them."*

call_scribe() posts to https://api.elevenlabs.io/v1/speech-to-text with data={"model_id": "scribe_v1", "diarize": "true", "tag_audio_events": "true", "timestamps_granularity": "word"} (transcribe.py L64–69) and returns the full JSON, which is written verbatim to <edit_dir>/transcripts/<video_stem>.json. The 30 KB-or-so JSON for a one-hour source is then collapsed by pack_transcripts.py into the much smaller markdown file.

The collapse does the real work. The packing rule, from SKILL.md L114–121, has two triggers:

Phrases break on any silence ≥ 0.5 s OR speaker change. This is the artifact the editor sub-agent reads to pick cuts — it gives word-boundary precision from text alone at 1/10 the tokens of raw JSON.

A phrase in this vocabulary is a contiguous run of words from one speaker with no internal gap of 0.5 s or more. Each phrase line carries the start of the first word, the end of the last word, the speaker tag, and the verbatim text. That gives the agent three independent things to reason about per line: the time the phrase starts (useful for cut pre-roll), the time it ends (useful for cut post-roll), and the silence gap to the next phrase (useful for cut *placement*).
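
The grouping rule is simple enough to sketch directly. The version below is an illustration under stated assumptions, not pack_transcripts.py itself: each word is assumed to be a dict with `text`, `start`, `end`, and `speaker` keys (the real Scribe JSON may use different field names), and the output format copies the sample lines shown earlier.

```python
def pack_phrases(words, max_gap: float = 0.5):
    """Group words into phrases; split on silence >= max_gap or a speaker change."""
    phrases = []
    for w in words:
        last = phrases[-1] if phrases else None
        if (last is None
                or w["speaker"] != last["speaker"]
                or w["start"] - last["end"] >= max_gap):
            phrases.append({"start": w["start"], "end": w["end"],
                            "speaker": w["speaker"], "text": w["text"]})
        else:
            last["end"] = w["end"]           # extend the running phrase
            last["text"] += " " + w["text"]
    return phrases


def format_phrase(p) -> str:
    """Render one phrase in the packed-line shape shown earlier."""
    return f"  [{p['start']:06.2f}-{p['end']:06.2f}] {p['speaker']} {p['text']}"


words = [
    {"text": "We", "start": 6.08, "end": 6.30, "speaker": "S0"},
    {"text": "fixed", "start": 6.32, "end": 6.55, "speaker": "S0"},
    {"text": "this.", "start": 6.56, "end": 6.74, "speaker": "S0"},
    {"text": "Really?", "start": 6.80, "end": 7.20, "speaker": "S1"},   # speaker change
    {"text": "Yes.", "start": 8.00, "end": 8.30, "speaker": "S0"},      # speaker change
    {"text": "Absolutely.", "start": 9.10, "end": 9.60, "speaker": "S0"},  # 0.8 s silence
]
for p in pack_phrases(words):
    print(format_phrase(p))
```

Six words collapse into four lines: one split comes from each speaker change and one from the 0.8 s pause, while the first and last word timestamps of every phrase survive intact.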

Why "word boundary" is the cut unit

SKILL.md L25 enumerates this as Hard Rule 6: *"Never cut inside a word. Snap every cut edge to a word boundary from the Scribe transcript."* The justification lives in one place: Scribe's word timestamps drift by 50–100 ms relative to where the human ear thinks words begin and end, and any cut inside a word loses its consonant tail. If you cut between a word's start and its end, you get an audible *clip*. Every cut edge in the EDL is therefore two seconds-past the last word-end of th


3. Twelve Rules That Save Your Render

The twelve rules that save your render

Every video editing skill in the world has taste rules — fonts, palettes, easing curves, runtimes. video-use has twelve *correctness* rules where deviation produces silent failure, not bad aesthetics. These rules live in helpers/render.py because the renderer is the only place correctness can be enforced.

I want to begin with the most counter-intuitive one. Rule 1, subtitles go LAST in the filter chain, looks trivial until you see what happens when it doesn't hold. If you burn subtitles into the base video and *then* composite overlays on top of the result, every overlay drawn over the lower 30% of the frame will sit on top of the caption. Silent failure: the user sees the caption until an animated overlay slides in, then loses it. The fix isn't a re-render with different assets; it's a re-render with the filter graph re-ordered. render.py orders the parts of the filter graph the only safe way: base → overlays → subtitles. Look at build_final_composite L538–543:

if has_subs:
    subs_abs = str(subtitles_path.resolve()).replace(":", r"\:").replace("'", r"\'")
    filter_parts.append(
        f"{current}subtitles='{subs_abs}':force_style='{SUB_FORCE_STYLE}'[outv]"
    )

The subtitles= filter is the *final* filter part, regardless of how many overlays preceded it. That's not taste; that's correctness.
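
The ordering can be illustrated with a stripped-down graph builder. This is a hedged sketch rather than a copy of build_final_composite: the function name, label names, and the `null` pass-through for the no-subtitles case are assumptions, and path escaping and PTS shifting are deliberately left out.

```python
from typing import Optional


def build_filter_parts(overlay_windows, subs: Optional[str] = None):
    """Chain overlays first, subtitles last.

    overlay_windows: list of (start, end) seconds; ffmpeg input 0 is the base
    video and inputs 1..N are the overlay clips, in the same order.
    """
    parts = []
    current = "[0:v]"
    for i, (start, end) in enumerate(overlay_windows, start=1):
        out = f"[v{i}]"
        parts.append(f"{current}[{i}:v]overlay=enable='between(t,{start},{end})'{out}")
        current = out
    if subs is not None:
        # Always appended after every overlay, so captions are drawn on top.
        parts.append(f"{current}subtitles='{subs}'[outv]")
    else:
        parts.append(f"{current}null[outv]")  # pass-through so [outv] always exists
    return parts


for part in build_filter_parts([(1.0, 3.0), (5.0, 6.5)], "master.srt"):
    print(part)
```

Because each part consumes the label produced by the previous one, the position of the subtitles part in the list is also its position in the draw order.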

flowchart TD
  A[Raw MP4<br/>per source] -->|ffmpeg extract<br/>30ms audio fades<br/>grade baked in| B[clips_graded/seg_NN.mp4]
  B -->|-c copy concat demuxer| C[base.mp4]
  C -->|setpts=PTS-STARTPTS+T/TB<br/>per overlay| D[overlay shifted to window]
  D -->|overlay=enable='between(t,...)'| E[composited video]
  E -->|subtitles=...:force_style=...| F[final.mp4 — captions on top]
  F -->|loudnorm pass 1 + pass 2<br/>-14 LUFS / -1 dBTP / LRA 11| G[final_loudnorm.mp4]
  G -->|timeline_view at every cut| H{Self-eval pass?}
  H -->|no, fix| B
  H -->|yes| I[Deliver to user]
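
The loudness stage in the diagram uses ffmpeg's two-pass loudnorm pattern: the first pass measures, the second applies the measured values. A minimal sketch of the two filter strings follows, with the targets taken from the diagram (-14 LUFS, -1 dBTP, LRA 11). The function names are hypothetical and this is not necessarily how render.py assembles them; the `measured` keys are the ones loudnorm prints with print_format=json.

```python
TARGET_I, TARGET_TP, TARGET_LRA = -14, -1, 11  # targets quoted in the diagram


def _targets() -> str:
    return f"loudnorm=I={TARGET_I}:TP={TARGET_TP}:LRA={TARGET_LRA}"


def loudnorm_pass1() -> str:
    """Measurement pass: ffmpeg prints a JSON block of measured values."""
    return _targets() + ":print_format=json"


def loudnorm_pass2(measured: dict) -> str:
    """Correction pass: feed the pass-1 measurements back in."""
    return (
        _targets()
        + f":measured_I={measured['input_i']}"
        + f":measured_TP={measured['input_tp']}"
        + f":measured_LRA={measured['input_lra']}"
        + f":measured_thresh={measured['input_thresh']}"
        + f":offset={measured['target_offset']}"
        + ":linear=true"
    )


# Made-up measurements, only to show the shape of the resulting string.
example = {"input_i": "-23.5", "input_tp": "-4.2", "input_lra": "6.1",
           "input_thresh": "-34.0", "target_offset": "0.4"}
print(loudnorm_pass1())
print(loudnorm_pass2(example))
```

The measurement values in `example` are invented for illustration; in a real run they come from parsing the first pass's output.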

The order matters more than the parts

The pipeline is not just nine steps; it is nine steps in a specific order. Each reorder breaks one property. Let me walk through the four most consequential rules:

Rule 2: Per-segment extract → lossless -c copy concat, not single-pass filtergraph (render.py L267–283). The naive approach is to build a single filter graph that takes every source, trims every range, and concatenates, all in one ffmpeg invocation. That's brittle. When the user adds overlays or subtitles, the same single-pass graph has to be re-invoked with the new filter chain, which re-encodes every segment again. The two-stage pipeline extracts each segment first (with grade + 30 ms fades baked in), then concatenates with -c copy and no re-encode. Adding overlays is a third pass that touches only the compositing, not the segments. Payoff: each segment is extracted and graded once, even when overlays change three times during iteration.
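
The two stages can be sketched as command builders. This is a simplified illustration with hypothetical function names: the grade and fade filters are omitted, encoder settings are left to ffmpeg defaults, and `-y` plus `-safe 0` are assumptions. The concat demuxer and its `file '...'` list format are standard ffmpeg.

```python
def segment_cmd(src: str, start: float, end: float, out: str) -> list:
    """Stage 1: encode one range on its own (grade and fades would be added here)."""
    return ["ffmpeg", "-y",
            "-ss", f"{start:.3f}", "-t", f"{end - start:.3f}",
            "-i", src, out]


def concat_list(segment_paths) -> str:
    """Text of the list file that the concat demuxer reads in stage 2."""
    return "".join(f"file '{p}'\n" for p in segment_paths)


def concat_cmd(list_path: str, out: str) -> list:
    """Stage 2: join the already-encoded segments without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", out]


print(segment_cmd("C0103.mp4", 2.52, 5.36, "seg_00.mp4"))
print(concat_list(["seg_00.mp4", "seg_01.mp4"]), end="")
print(concat_cmd("segments.txt", "base.mp4"))
```

Stream copy only works when every segment shares the same codec parameters, which is one reason to produce all of them with the same stage-1 settings.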

Rule 3: 30 ms audio fades at every segment boundary (render.py L188–189). The fade filter chain is afade=t=in:st=0:d=0.03,afade=t=out:st={dur-0.03}:d=0.03. Without it, every transition between segments carries an audible *pop*: a step discontinuity in the waveform where one segment's last sample meets the next segment's first, which listeners notice immediately. ffmpeg's -c copy concat cannot smooth that itself; you have to bake the fades in during the segment extract. The 30 ms value is not arbitrary: it is short enough to be inaudible (you don't *hear* the dip in volume) and long enough to remove the click. Anything below 20 ms is unreliable across ffmpeg versions; anything above 50 ms becomes a perceptible volume dip at every cut.
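
The fade chain is a pure function of segment duration, so it is easy to sketch. The helper name and the guard for very short segments are assumptions added for illustration; the filter syntax is the one quoted above.

```python
def fade_chain(dur: float, fade: float = 0.03) -> str:
    """afade in at the start and out at the end of a segment of length `dur` seconds."""
    if dur < 2 * fade:
        # Assumption: refuse segments too short to hold both fades.
        raise ValueError(f"segment of {dur}s is shorter than two {fade}s fades")
    return (f"afade=t=in:st=0:d={fade},"
            f"afade=t=out:st={dur - fade:.3f}:d={fade}")


print(fade_chain(4.0))
# afade=t=in:st=0:d=0.03,afade=t=out:st=3.970:d=0.03
```

The fade-out start is computed from the segment's own duration, which is why the fades have to be applied per segment rather than once on the concatenated file.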

Rule 4: PTS-shift overlays to land their frame 0 at the window start (render.py L522–524). Animation overlays are short rendered clips with their own internal timeline. If you composite them at output time T using their raw PTS, the visible "first frame" lands at output T plus the animation's pre-roll, which is not T. The visual effect is that every overlay's content appears to start a fraction of a second *into* the window, then snap forward as the overlay catches up. The fix is setpts=PTS-STARTPTS+T/TB, a per-overlay expression that re-stamps the first frame to timestamp T. This is unique to overlays: base video and grade-extracted segments do not need it because they were already time-aligned.
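
For one overlay, the shift and the composite are two filter parts. The sketch below builds both; the function name and the stream labels are hypothetical, while setpts, overlay, and the enable='between(t,...)' timeline option are standard ffmpeg filter syntax.

```python
def overlay_filters(index: int, start: float, end: float,
                    base: str = "[0:v]", out: str = "[outv]") -> list:
    """Shift overlay input `index` so its frame 0 lands at `start`, then composite it."""
    shifted = f"[ov{index}]"
    return [
        # Re-stamp the clip: first frame gets PTS 0, then add the window start.
        f"[{index}:v]setpts=PTS-STARTPTS+{start}/TB{shifted}",
        # Draw it only while the output clock is inside the window.
        f"{base}{shifted}overlay=enable='between(t,{start},{end})'{out}",
    ]


for part in overlay_filters(1, 12.5, 15.0):
    print(part)
# [1:v]setpts=PTS-STARTPTS+12.5/TB[ov1]
# [0:v][ov1]overlay=enable='between(t,12.5,15.0)'[outv]
```

The same start time appears in both parts: once to move the overlay's clock, once to gate when it is visible.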

Rule 5: Master SRT uses output-timeline offsets (render.py L362–363). When you build the master subtitle file from per-source transcripts, you cannot just dump the original timestamps. You must remap each word's start and end into the *output* timeline:

out_start = max(0.0, local_start - seg_start) + seg_offset
out_end = max(0.0, local_end - seg_start) + seg_offset
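
A worked example makes the remap concrete. The sketch below wraps the two lines above in a function and applies them to a made-up two-segment edit, accumulating seg_offset from the durations of earlier segments. The segment values and helper names are illustrative assumptions; the timestamp format is standard SRT (HH:MM:SS,mmm).

```python
def remap(local_start, local_end, seg_start, seg_offset):
    """Map source-local word times into the output timeline."""
    out_start = max(0.0, local_start - seg_start) + seg_offset
    out_end = max(0.0, local_end - seg_start) + seg_offset
    return out_start, out_end


def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


# Two kept ranges in source seconds: 10 s from one take, then 5 s from later on.
segments = [(2.0, 12.0), (40.0, 45.0)]
offsets, running = [], 0.0
for seg_start, seg_end in segments:
    offsets.append(running)          # output time at which this segment begins
    running += seg_end - seg_start

# A word spoken at 41.5-42.0 s in the source, inside the second segment.
start, end = remap(41.5, 42.0, seg_start=40.0, seg_offset=offsets[1])
print(srt_time(start), "-->", srt_time(end))  # 00:00:11,500 --> 00:00:12,000
```

Without the remap that caption would be stamped at 41.5 s, well past the end of a 15-second output.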

seg_offset is the accumulated duration of every segment that came before this one. If you skip this remap, captions land at the source's local time — which means they appear *before* their visual segment in the output. The

