browser-use/video-use / Chapter 2

The_12KB_Reading_View

The packed transcript that beats raw frames

The single artifact the LLM actually reads is takes_packed.md — phrase-level lines, word-boundary timestamps, breaks on silence ≥ 0.5 s and on speaker change. Once you see what it looks like, you can't unsee why a text-only agent can outperform a vision-augmented one on cut selection.

A packed transcript is deceptively boring to look at. SKILL.md L117–121 shows the shape:

```
## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
```

That is the entire reading substrate. A ## line per take with duration and phrase count, followed by one phrase per line with [start-end] timestamps, a speaker tag, and the verbatim text. There is no granularity finer than the phrase. There is no metadata about framing, lighting, gesture, facial expression, or shot scale. There is, however, a vocabulary that maps directly onto everything an editor needs to know for a talking-head cut: where each utterance begins and ends, who said it, what the previous utterance's last-word timestamp was, and therefore how big the silence gap before the next phrase is.

When the LLM in this skill "reads" the video, this is what it reads.
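
To make the shape concrete, here is a minimal sketch of a reader for that format. It is not code from the repository: the `Phrase` dataclass, the two regular expressions, and `parse_packed()` are hypothetical names, and the patterns assume the layout is exactly as in the sample above.

```python
import re
from dataclasses import dataclass

# Assumed layout of a take header: "## C0103  (duration: 43.0s, 8 phrases)"
TAKE_RE = re.compile(r"^##\s+(\S+)\s+\(duration:\s*([\d.]+)s,\s*(\d+)\s+phrases\)")
# Assumed layout of a phrase line: "  [002.52-005.36] S0 Some verbatim text."
PHRASE_RE = re.compile(r"^\s*\[(\d+\.\d+)-(\d+\.\d+)\]\s+(S\d+)\s+(.*)$")


@dataclass
class Phrase:
    take: str
    start: float
    end: float
    speaker: str
    text: str


def parse_packed(markdown: str) -> list[Phrase]:
    """Turn packed-transcript text into a flat list of phrases."""
    phrases: list[Phrase] = []
    take = None
    for line in markdown.splitlines():
        header = TAKE_RE.match(line)
        if header:
            take = header.group(1)
            continue
        m = PHRASE_RE.match(line)
        if m and take is not None:
            phrases.append(
                Phrase(take, float(m.group(1)), float(m.group(2)), m.group(3), m.group(4))
            )
    return phrases


SAMPLE = """## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
"""

phrases = parse_packed(SAMPLE)
# The silence before a phrase is the difference between two adjacent lines.
gap = phrases[1].start - phrases[0].end
print(f"{len(phrases)} phrases, gap before second phrase: {gap:.2f}s")
```

Everything an editor needs from the two sample lines, including the 0.72 s pause between them, falls out of plain string parsing.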

```mermaid
flowchart LR
  A[Raw MP4] -->|ffmpeg -vn -ac 1 -ar 16000| B[WAV 16 kHz mono]
  B -->|ElevenLabs Scribe<br/>scribe_v1 / word timestamps / diarize| C["transcripts/C0103.json<br/>~30 KB per hour"]
  C -->|pack_transcripts.py<br/>silence ≥ 0.5 s splits phrases| D["takes_packed.md<br/>~12 KB per hour"]
  D -->|editor sub-agent reads| E[EDL JSON<br/>ranges + grades + overlays + subs]
```

Where the file comes from

Start at transcribe.py L49–55: extract_audio() runs ffmpeg -vn -ac 1 -ar 16000 -c:a pcm_s16le against the source. That command strips video, forces a single audio channel, resamples to 16 kHz, and writes 16-bit little-endian PCM. The 16 kHz / mono / 16-bit profile is intentional: 16 kHz mono is a standard input profile for speech recognition, and the upload is far smaller than the same audio at full production quality. For a one-hour source the WAV is about 115 MB (16,000 samples/s × 2 bytes × 3,600 s); the same hour as 48 kHz stereo 16-bit PCM would be about 690 MB.
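
The extraction step can be sketched as follows. This is an illustration, not the body of extract_audio(): the helper names are hypothetical, and the `-y` overwrite flag and the argument order are assumptions. Only the `-vn -ac 1 -ar 16000 -c:a pcm_s16le` flags come from the text. The second helper works out the uncompressed size that this profile implies.

```python
import subprocess
from pathlib import Path


def ffmpeg_extract_args(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg command line for a 16 kHz mono 16-bit WAV."""
    return [
        "ffmpeg",
        "-y",                 # assumption: overwrite an existing WAV
        "-i", str(src),
        "-vn",                # drop the video stream
        "-ac", "1",           # one audio channel
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(dst),
    ]


def extract_audio_sketch(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(ffmpeg_extract_args(src, dst), check=True)


def pcm_bytes(seconds: float, sample_rate: int = 16000,
              channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Size of raw PCM audio, ignoring the small WAV header."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


one_hour_mb = pcm_bytes(3600) / 1_000_000
print(f"one hour at 16 kHz mono 16-bit: {one_hour_mb:.1f} MB")
```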

Scribe receives the file with diarize=true and tag_audio_events=true (transcribe.py L66–69). Diarization matters because the agent later asks questions like *"is this a clean single-speaker take, or did the second speaker jump in at 0:30?"* With diarization off there are no speaker labels, and the entire S0 / S1 / ... tagging collapses. Audio-event tagging matters because (laughs), (sighs), (applause) come back as inline tokens that the agent treats as semantic signals: *preserve past the laugh, don't cut mid-sigh, the applause IS the beat.* SKILL.md L107: *"Audio events as signals. (laughs), (sighs), (applause) mark beats. Extend past them."*

call_scribe() posts to https://api.elevenlabs.io/v1/speech-to-text with data={"model_id": "scribe_v1", "diarize": "true", "tag_audio_events": "true", "timestamps_granularity": "word"} (transcribe.py L64–69) and returns the full JSON, which is written verbatim to <edit_dir>/transcripts/<video_stem>.json. The 30 KB-or-so JSON for a one-hour source is then collapsed by pack_transcripts.py into the much smaller markdown file.
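
A rough sketch of that request and of the output path is below. The form fields and the URL are the ones quoted above. Everything else is an assumption for illustration: the `_sketch` helper names, the use of the third-party `requests` package, the `xi-api-key` header, the `file` upload field, and the timeout. The real call_scribe() may differ.

```python
import json
from pathlib import Path

SCRIBE_URL = "https://api.elevenlabs.io/v1/speech-to-text"

# Form fields as quoted from transcribe.py in the text above.
SCRIBE_FIELDS = {
    "model_id": "scribe_v1",
    "diarize": "true",
    "tag_audio_events": "true",
    "timestamps_granularity": "word",
}


def transcript_path(edit_dir: Path, video: Path) -> Path:
    """<edit_dir>/transcripts/<video_stem>.json"""
    return edit_dir / "transcripts" / f"{video.stem}.json"


def call_scribe_sketch(wav: Path, api_key: str) -> dict:
    """Upload the WAV and return the parsed JSON response."""
    import requests  # third-party; imported lazily so the helpers above work without it

    with wav.open("rb") as fh:
        resp = requests.post(
            SCRIBE_URL,
            headers={"xi-api-key": api_key},  # assumption: header-based auth
            data=SCRIBE_FIELDS,
            files={"file": fh},               # assumption: upload field name
            timeout=600,
        )
    resp.raise_for_status()
    return resp.json()


def save_transcript(payload: dict, edit_dir: Path, video: Path) -> Path:
    """Write the response without reshaping it, one JSON file per source video."""
    out = transcript_path(edit_dir, video)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return out
```

Keeping the raw response on disk means the packed markdown can be regenerated at any time without paying for a second transcription.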

The collapse is where the magic happens. The packaging rule, from SKILL.md L114–121, is two-fold:

Phrases break on any silence ≥ 0.5 s OR speaker change. This is the artifact the editor sub-agent reads to pick cuts — it gives word-boundary precision from text alone at 1/10 the tokens of raw JSON.

A phrase in this vocabulary is a contiguous run of words from one speaker with no internal gap of 0.5 s or more. Each phrase line carries the start of the first word, the end of the last word, the speaker tag, and the verbatim text. That gives the agent three things to reason about per line: the time the phrase starts (useful for cut pre-roll), the time it ends (useful for cut post-roll), and the silence gap to the next phrase, obtained by subtracting this line's end from the next line's start (useful for cut *placement*).
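
The packing rule is small enough to sketch in a few lines. This is not the code in pack_transcripts.py: the function names are hypothetical, and the input is assumed to be already normalised to dicts with `text`, `start`, `end`, and `speaker` keys. The real Scribe JSON may use different field names and also carries spacing tokens that would need filtering first.

```python
SILENCE_SPLIT_S = 0.5  # a gap this long or longer starts a new phrase


def pack_words(words: list[dict], silence_s: float = SILENCE_SPLIT_S) -> list[dict]:
    """Group time-ordered words into phrases, splitting on silence or speaker change."""
    phrases: list[dict] = []
    for w in words:
        last = phrases[-1] if phrases else None
        starts_new = (
            last is None
            or w["speaker"] != last["speaker"]
            or w["start"] - last["end"] >= silence_s
        )
        if starts_new:
            phrases.append(
                {"start": w["start"], "end": w["end"],
                 "speaker": w["speaker"], "words": [w["text"]]}
            )
        else:
            last["end"] = w["end"]  # phrase end tracks the last word's end
            last["words"].append(w["text"])
    return phrases


def format_phrase(p: dict) -> str:
    """Render one phrase in the packed line shape."""
    text = " ".join(p["words"])
    return f"  [{p['start']:06.2f}-{p['end']:06.2f}] {p['speaker']} {text}"


words = [
    {"text": "We", "start": 6.08, "end": 6.20, "speaker": "S0"},
    {"text": "fixed", "start": 6.22, "end": 6.50, "speaker": "S0"},
    {"text": "this.", "start": 6.52, "end": 6.74, "speaker": "S0"},
    {"text": "Really?", "start": 6.90, "end": 7.30, "speaker": "S1"},  # speaker change
    {"text": "Yes.", "start": 8.10, "end": 8.40, "speaker": "S1"},     # 0.80 s silence
]
lines = [format_phrase(p) for p in pack_words(words)]
print("\n".join(lines))
```

The five words collapse into three lines: one split comes from the speaker change, the other from the 0.80 s pause.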

Why "word boundary" is the cut unit

SKILL.md L25 enumerates this as Hard Rule 6: *"Never cut inside a word. Snap every cut edge to a word boundary from the Scribe transcript."* The justification is simple: a cut that lands between a word's start and its end chops off part of the word, usually its consonant tail, and the result is an audible *clip*. On top of that, Scribe's word timestamps drift by 50–100 ms relative to where the human ear thinks words begin and end, so even a cut placed exactly on a reported boundary can land inside the word. Every kept segment in the EDL therefore starts a few tens of milliseconds before its first word-start and ends a few tens of milliseconds past its last word-end.

SKILL.md L28 makes this concrete: *"Pad every cut edge. Working window: 30–200 ms. Scribe timestamps drift 50–100 ms — padding absorbs the drift."* The shipped launch video used 50 ms pre / 80 ms post (SKILL.md L108). That's the user's worked example and is not a mandate — but the broader window is a correctness constraint. Below 30 ms you risk clipping the consonant; above 200 ms you leak silence that the silence-gap detection has already said is editable. The agent picks a value in that window based on the material's pacing.
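
As a sketch of how the padding rule might be applied, the helper below widens a kept range by a pre-roll and a post-roll, clamping each to the 30–200 ms working window. The function names are hypothetical. The defaults are the 50 ms / 80 ms values from the launch-video example, not a recommendation.

```python
PAD_MIN_S = 0.030  # below this the consonant may clip
PAD_MAX_S = 0.200  # above this editable silence leaks in


def clamp_pad(pad_s: float) -> float:
    """Force a padding value into the 30-200 ms working window."""
    return min(max(pad_s, PAD_MIN_S), PAD_MAX_S)


def padded_range(first_word_start: float, last_word_end: float,
                 pre_s: float = 0.050, post_s: float = 0.080) -> tuple[float, float]:
    """Widen a word-boundary range by clamped pre-roll and post-roll."""
    start = max(0.0, first_word_start - clamp_pad(pre_s))  # never before the file start
    end = last_word_end + clamp_pad(post_s)
    return round(start, 3), round(end, 3)


# "We fixed this." runs from 6.08 to 6.74 in the sample take.
start, end = padded_range(6.08, 6.74)
print(start, end)
```

A fast-paced edit would pass smaller values and an interview larger ones. The clamp keeps either choice inside the window.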

Crucially, that decision does not require a vision model. It's a textual reasoning exercise: "this is a fast-paced launch video, cut padding tighter; this is a documentary interview, cut padding looser." All the inputs are in takes_packed.md.

The silence-gap test, from text alone

The most useful cut decision the agent makes is also the most purely textual. Given consecutive phrases:

```
[014.20-018.66] S0 The fix we deployed cut latency by half in production.
[019.84-022.10] S0 We tested it against six competitor stacks.
```

The cut between phrases lands in the silence gap from 18.66 to 19.84: 1.18 s of silence, well past the 400 ms threshold SKILL.md L107 calls *"usually the cleanest"*. Both cut edges stay inside that gap. The first range ends at 18.76, which is the word-end plus 100 ms of post-roll so the last consonant of "production" can ring out. The next range is written as {"source": "...", "start": 19.80, "end": 22.20, "beat": "BENEFIT", ...}: it starts 40 ms before "We tested it" and ends 100 ms after the phrase's last word. All of that reasoning sits inside two markdown lines. No frame ever had to be loaded.
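
The gap arithmetic for the two phrases above can be sketched as follows. The function names are hypothetical and the 400 ms threshold is the one quoted from SKILL.md. The sketch assumes 40 ms of pre-roll before the incoming phrase and 100 ms of post-roll after the outgoing one, and it keeps both cut edges inside the silence rather than inside either word.

```python
CLEAN_GAP_S = 0.400  # threshold quoted from SKILL.md


def silence_gap(prev_end: float, next_start: float) -> float:
    """Silence between the last word of one phrase and the first word of the next."""
    return round(next_start - prev_end, 3)


def cut_points(prev_end: float, next_start: float,
               pre_s: float = 0.040, post_s: float = 0.100) -> tuple[float, float]:
    """Return (out_point, in_point), both inside the silence gap."""
    if pre_s + post_s > next_start - prev_end:
        raise ValueError("gap too short for the requested padding")
    out_point = round(prev_end + post_s, 3)   # where the outgoing range ends
    in_point = round(next_start - pre_s, 3)   # where the incoming range starts
    return out_point, in_point


gap = silence_gap(18.66, 19.84)
out_point, in_point = cut_points(18.66, 19.84)
print(f"gap={gap}s clean={gap >= CLEAN_GAP_S} out={out_point} in={in_point}")
```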

Now consider the harder case:

```
[043.10-046.20] S0 Honestly? I don't think — I mean, the migration went fine but — look, the latency —
[046.42-047.10] S0 Look, the latency — sorry.
[047.32-050.78] S0 The latency story is, we cut it by half.
```

The first phrase is a false start with three restarts; the second is the speaker restarting *again*; the third is the clean delivery. A vision model watching the speaker's face would have a hard time deciding which to keep, because the visual difference between the three attempts is likely to be slight. The text gives the answer at a glance: the third phrase is the one to keep. Phrases 1 and 2 are caught in the transcript pre-scan, marked as verbal slips, and left out of the EDL. The agent's job is to follow the *textual* signal, not the gestural one.

SKILL.md L89 codifies this: *"Pre-scan for problems. One pass over takes_packed.md to note verbal slips, obvious mis-speaks, or phrasings to avoid."* The pre-scan is text-only by design. The slips it finds feed the editor sub-agent brief as a "verbal slips to avoid" list (SKILL.md L138).

Retake selection, the editor sub-agent pattern

When the agent has twenty-three takes of the same beat — common for a launch video with a tight script — the picking problem is different. Now the agent must *compare* phrases across takes. The structure SKILL.md L123–160 spells out for that is deliberately a *sub-agent*:

When the task is "pick the best take of each beat across many clips," spawn a dedicated sub-agent with a brief shaped like this. The structure is load-bearing.

The sub-agent receives:

  • takes_packed.md for every clip
  • The product/narrative context (one or two sentences from the user)
  • Speaker(s) with delivery-style notes
  • An expected structure archetype (HOOK → PROBLEM → SOLUTION → BENEFIT → EXAMPLE → CTA is the shipped tech-launch example, but the brief is explicit about inventing one)
  • The verbal slips to avoid
  • A target runtime in seconds

The sub-agent's output is a JSON array, no prose, one entry per selected segment. Every entry has source, start, end, beat, quote, reason. The reason field is not optional in spirit: it forces the sub-agent to justify why it chose this take over its alternatives, using the transcript as evidence.
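
Because the output is plain JSON with a fixed set of fields, it can be checked mechanically before rendering. The sketch below is one way to do that. The validator, its messages, and the sample entries are hypothetical. Only the six field names come from the brief.

```python
import json

REQUIRED_FIELDS = ("source", "start", "end", "beat", "quote", "reason")


def validate_edl(raw: str) -> tuple[list[dict], list[str]]:
    """Parse the sub-agent's reply and list any structural problems."""
    entries = json.loads(raw)  # raises ValueError if the reply contains prose
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of segments")
    problems: list[str] = []
    for i, entry in enumerate(entries):
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            problems.append(f"entry {i}: missing {', '.join(missing)}")
            continue
        if entry["end"] <= entry["start"]:
            problems.append(f"entry {i}: end must be after start")
        if not str(entry["reason"]).strip():
            problems.append(f"entry {i}: empty reason")
    return entries, problems


def runtime_s(entries: list[dict]) -> float:
    """Total duration of the selected segments, for comparison with the target."""
    return round(sum(e["end"] - e["start"] for e in entries), 3)


SAMPLE = json.dumps([
    {"source": "C0103", "start": 2.47, "end": 5.44, "beat": "HOOK",
     "quote": "Ninety percent of what a web agent does is completely wasted.",
     "reason": "Cleanest delivery of the hook."},
    {"source": "C0103", "start": 6.03, "end": 6.82, "beat": "SOLUTION",
     "quote": "We fixed this.", "reason": ""},
])

entries, problems = validate_edl(SAMPLE)
print(problems, runtime_s(entries))
```

Rejecting an empty reason is a deliberate choice in this sketch: it enforces the "not optional in spirit" reading of the field.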

This is where the design choice crystallises. The LLM picks the best take of each beat by *reading the transcripts for that beat from multiple clips*, deciding which delivery is cleanest, which has the right emotional shape, which has the audio events you want to preserve, and which is free of any slip the editor cannot live with. Every reasoning input is textual. Every output is a textual EDL. The compose step that turns the EDL into MP4 is render.py.

What's missing on purpose

SKILL.md L9 enumerates an anti-pattern worth quoting: *"Hierarchical pre-computed codec formats with USABILITY / tone tags / shot layers. Over-engineering. Derive from the transcript at decision time."* This is the negative-space of the design. The transcript is *the* artifact. There is no second artifact pre-computed alongside it. Shot classification, emphasis scoring, filler tagging, retake suitability — none of these are persisted as derived data. They are computed at decision time from the source-of-truth phrase lines. The transcript is small enough that the agent can re-derive anything on demand; the precomputation would just be a cache invalidation liability.

The same reasoning covers a second anti-pattern, *"Hand-tuned moment-scoring functions"* (L311). Heuristics for "is this take worth keeping" or "is this a peak moment" are unnecessary because the design bets that an LLM reading the full text picks better than any fixed scoring function. The 12 KB packed transcript leaves plenty of room in the context window for the agent to read all candidates in full and apply judgment.

And the third anti-pattern, on Whisper output, is the most clarifying: *"Whisper SRT / phrase-level output. Loses sub-second gap data. Always word-level verbatim."* (L313.) Whisper in SRT mode returns phrase-level time ranges, which lose the granularity you need to snap cuts to word boundaries. Whisper also tends to drop or smooth over fillers ("um", "uh", false starts), which kills the editorial signal. The transcript must preserve the exact text and the per-word timestamps, even when the words are noise. Scribe returns both. Whisper in its default config does not.

The packed transcript is the substrate. The next chapter is about the layer *above* it — the twelve production rules that distinguish a render-correct video from one that compiles but fails silently.
