The twelve rules that save your render
Every video editing skill in the world has taste rules: fonts, palettes, easing curves, runtimes. video-use has twelve *correctness* rules where deviation produces silent failure, not bad aesthetics. These rules live in helpers/render.py because the renderer is the only place correctness can be enforced.
I want to begin with the most counter-intuitive one. Rule 1 (subtitles go LAST in the filter chain) looks trivial until you see what happens when it is violated. If you burn subtitles into the base video and *then* composite overlays over the result, every overlay drawn over the lower 30% of the frame sits on top of the caption. Silent failure: the user sees the caption for the first half of every clip, then loses it the moment an animated overlay slides in. The fix isn't a re-render with different assets; it's a re-render with the filter graph re-ordered. render.py orders the parts of the filter graph the only safe way: base → overlays → subtitles. Look at build_final_composite L538–543:
    if has_subs:
        subs_abs = str(subtitles_path.resolve()).replace(":", r"\:").replace("'", r"\'")
        filter_parts.append(
            f"{current}subtitles='{subs_abs}':force_style='{SUB_FORCE_STYLE}'[outv]"
        )
The subtitles= filter is the *final* filter part, regardless of how many overlays preceded it. That's not taste; that's correctness.
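To make the ordering concrete, here is a minimal sketch of a filter-part builder that always appends the subtitles stage last. It is not the code in render.py: the function name `build_filter_parts`, its parameters, and the default `force_style` value are assumptions made for illustration, and path escaping is left out to keep the ordering visible.

```python
def build_filter_parts(overlay_count, subtitles_path=None, force_style="FontSize=18"):
    """Return filter_complex parts ordered base -> overlays -> subtitles.

    Hypothetical helper: names and defaults are illustrative, not render.py's.
    """
    parts = []
    current = "[0:v]"  # input 0 is assumed to be the base video
    for i in range(overlay_count):
        # Overlay i is assumed to be ffmpeg input i + 1.
        out = f"[v{i + 1}]"
        parts.append(f"{current}[{i + 1}:v]overlay{out}")
        current = out
    if subtitles_path is not None:
        # Appended after every overlay, so captions are drawn on top of them.
        parts.append(
            f"{current}subtitles='{subtitles_path}':force_style='{force_style}'[outv]"
        )
    else:
        # No captions: pass the last label through unchanged.
        parts.append(f"{current}null[outv]")
    return parts


parts = build_filter_parts(2, "master.srt")
```

Joining `parts` with `;` gives a graph in which the only `subtitles=` stage is the final one, however many overlays were requested.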
flowchart TD
A[Raw MP4<br/>per source] -->|ffmpeg extract<br/>30ms audio fades<br/>grade baked in| B[clips_graded/seg_NN.mp4]
B -->|-c copy concat demuxer| C[base.mp4]
C -->|setpts=PTS-STARTPTS+T/TB<br/>per overlay| D[overlay shifted to window]
D -->|overlay=enable='between(t,...)'| E[composited video]
E -->|subtitles=...:force_style=...| F[final.mp4 — captions on top]
F -->|loudnorm pass 1 + pass 2<br/>-14 LUFS / -1 dBTP / LRA 11| G[final_loudnorm.mp4]
G -->|timeline_view at every cut| H{Self-eval pass?}
H -->|no, fix| B
H -->|yes| I[Deliver to user]
The order matters more than the parts
The pipeline is not just nine steps; it is nine steps in a specific order. Each reorder breaks one property. Let me walk through the four most consequential rules:
Rule 2 — Per-segment extract → lossless -c copy concat, not single-pass filtergraph (render.py L267–283). The naive approach is to build a single filter graph that takes every source, trims every range, and concatenates, all in one ffmpeg invocation. That's brittle. When the user adds overlays or subtitles, the same single-pass graph has to be re-invoked with the new filter chain, which re-encodes every segment. The two-stage pipeline extracts each segment first (with grade + 30 ms fades baked in), then concatenates with -c copy, so nothing is re-encoded. Adding overlays is a third pass that touches only the compositing, not the segments. The payoff: each segment is encoded exactly once, even when overlays change three times during iteration.
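The concat stage can be sketched as follows, assuming the segments were already extracted with identical codec settings. The helper names `concat_list_text` and `concat_command` are hypothetical; the ffmpeg options used (`-f concat`, `-safe 0`, `-c copy`) are the standard concat-demuxer ones, but the exact invocation in render.py may differ.

```python
def concat_list_text(segment_paths):
    """Build the text of a concat-demuxer list file, one `file '...'` line per segment."""
    lines = []
    for path in segment_paths:
        # Inside single quotes, a literal ' has to be written as '\''
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path, out_path):
    """ffmpeg arguments for a stream-copy concat: no decode, no re-encode."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", str(list_path),
        "-c", "copy",
        str(out_path),
    ]


listing = concat_list_text(["clips_graded/seg_00.mp4", "clips_graded/seg_01.mp4"])
command = concat_command("concat.txt", "base.mp4")
```

Writing `listing` to `concat.txt` and passing `command` to `subprocess.run` would produce base.mp4 without touching the encoded frames, which is why changing an overlay later never forces the segments to be extracted again.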
Rule 3 — 30 ms audio fades at every segment boundary (render.py L188–189). The fade filter chain is afade=t=in:st=0:d=0.03,afade=t=out:st={dur-0.03}:d=0.03. Without it, every transition between segments carries an audible *pop*: a step discontinuity in the waveform, where one segment's last sample and the next segment's first sample do not line up, that humans hear instantly and find unbearable. ffmpeg's -c copy concat cannot smooth that itself; you have to bake the fades in during the segment extract. The 30 ms value is not arbitrary: it is short enough to be inaudible (you don't *hear* the dip in volume), long enough to remove the click. Anything below 20 ms is unreliable across ffmpeg versions; anything above 50 ms becomes a perceptible volume dip at every cut.
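As a small illustration, the fade chain quoted above can be generated from a segment duration like this. The function name `fade_chain` and the guard against very short segments are my assumptions, not something taken from render.py.

```python
def fade_chain(duration, fade=0.03):
    """Return an afade in/out chain for one segment of `duration` seconds."""
    if duration < 2 * fade:
        # Assumed guard: the two fades would overlap on a segment this short.
        raise ValueError("segment is shorter than the two fades combined")
    out_start = duration - fade  # the fade-out begins `fade` seconds before the end
    return (
        f"afade=t=in:st=0:d={fade:g},"
        f"afade=t=out:st={out_start:.3f}:d={fade:g}"
    )


chain = fade_chain(4.0)
# chain == "afade=t=in:st=0:d=0.03,afade=t=out:st=3.970:d=0.03"
```

The string is meant to be passed as the audio filter of the per-segment extract, so the fades are already in the samples by the time the stream-copy concat runs.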
Rule 4 — PTS-shift overlays to land their frame 0 at the window start (render.py L522–524). Animation overlays are short rendered clips with their own internal timeline. If you composite them at output time T using their raw PTS, the visible "first frame" lands at output T plus the animation's pre-roll, which is not T. The visual effect is that every overlay's content appears to start a fraction of a second *into* the window, then snap forward as the overlay catches up. The fix is setpts=PTS-STARTPTS+T/TB — a per-overlay expression that re-stamps the first frame to timestamp T. This is unique to overlays: base video and grade-extracted segments do not need it because they were already time-aligned.
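A minimal sketch of the two filter parts one overlay needs, under the assumption that each overlay is its own ffmpeg input and is shown for a window of `start` to `end` seconds on the output timeline. The function name `overlay_filters` and the label scheme are hypothetical.

```python
def overlay_filters(input_index, start, end, base_label, out_label):
    """Return [shift, composite] filter parts for one overlay window."""
    shifted = f"[ov{input_index}]"
    # Re-stamp the overlay so its first frame lands exactly at `start`.
    shift = f"[{input_index}:v]setpts=PTS-STARTPTS+{start}/TB{shifted}"
    # Only draw the overlay while output time t is inside the window.
    composite = (
        f"{base_label}{shifted}"
        f"overlay=enable='between(t,{start},{end})'{out_label}"
    )
    return [shift, composite]


shift, composite = overlay_filters(1, 12.5, 15.0, "[0:v]", "[v1]")
# shift     == "[1:v]setpts=PTS-STARTPTS+12.5/TB[ov1]"
# composite == "[0:v][ov1]overlay=enable='between(t,12.5,15.0)'[v1]"
```

Without the `setpts` part, the `enable` window would still open at 12.5 s, but the overlay's own timestamps would decide which of its frames is visible at that moment.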
Rule 5 — Master SRT uses output-timeline offsets (render.py L362–363). When you build the master subtitle file from per-source transcripts, you cannot just dump the original timestamps. You must remap each word's start and end into the *output* timeline:
    out_start = max(0.0, local_start - seg_start) + seg_offset
    out_end = max(0.0, local_end - seg_start) + seg_offset
seg_offset is the accumulated duration of every segment that came before this one. If you skip this remap, captions land at the source's local time, which means they appear before or after their visual segment in the output. The error is invisible in the JSON but catastrophic in the resulting MP4: viewers see captions about what they're about to hear, or about what just played five seconds ago. The fix lives in build_master_srt, which is why extract_segment writes per-segment output and the master SRT can only be assembled after all segment offsets are known.
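A worked example makes the remap easier to check by hand. The sketch below wraps the two expressions in a function and derives seg_offset from a list of segment durations; the names `segment_offsets` and `remap_word` and the sample numbers are illustrative assumptions, not taken from build_master_srt.

```python
def segment_offsets(durations):
    """Offset of each segment on the output timeline: the sum of all earlier durations."""
    offsets, total = [], 0.0
    for duration in durations:
        offsets.append(total)
        total += duration
    return offsets


def remap_word(local_start, local_end, seg_start, seg_offset):
    """Map one word's source-local times onto the output timeline."""
    out_start = max(0.0, local_start - seg_start) + seg_offset
    out_end = max(0.0, local_end - seg_start) + seg_offset
    return out_start, out_end


# Three segments lasting 4.0 s, 6.5 s and 3.0 s in the final cut.
offsets = segment_offsets([4.0, 6.5, 3.0])  # [0.0, 4.0, 10.5]

# A word spoken at 42.25-42.75 s in its source, inside the second segment,
# which was cut starting at 41.0 s of that source.
start, end = remap_word(42.25, 42.75, seg_start=41.0, seg_offset=offsets[1])
# (start, end) == (5.25, 5.75)
```

The caption that sat at 42 seconds in the source therefore appears a little over 5 seconds into the output, right where its segment plays.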
The rules you don't see at runtime
Some rules are not filter-graph correctness; they're correctness in the agent's *reasoning*, but the helpers still enforce them as a sanity check or a workflow guarantee.
Rule 6 — Never cut inside a word. This is enforced indirectly. The EDL schema accepts any start and end in seconds; nothing in the schema rejects a word-internal cut. The agent must apply it from takes_packed.md. The closest *technical* enforcement is the cut-padding window from Rule 7.
Rule 7 — Pad every cut edge 30–200 ms. The padding absorbs Scribe's 50–100 ms timestamp drift. Tighter padding risks clipping the consonant the LLM thinks it's keeping; looser padding leaks silence the silence-gap detector already said was editable. SKILL.md L107–108 gives the window; the agent chooses the value based on material pacing.
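The padding step can be sketched as a small function that widens a cut around the first and last kept word. Everything here is an assumption for illustration: the name `pad_cut`, the 80 ms default, and the choice to reject pad values outside the 30–200 ms window rather than clamp them.

```python
def pad_cut(first_word_start, last_word_end, pad=0.08, source_duration=None):
    """Widen a cut by `pad` seconds on both edges, staying inside the source."""
    if not 0.03 <= pad <= 0.2:
        raise ValueError("pad must stay within the 30-200 ms window")
    start = max(0.0, first_word_start - pad)  # never before the start of the source
    end = last_word_end + pad
    if source_duration is not None:
        end = min(end, source_duration)  # never past the end of the source
    return start, end


start, end = pad_cut(10.0, 12.0, pad=0.125)
# (start, end) == (9.875, 12.125)
```

The resulting pair is what would go into the EDL as the segment's start and end, so a timestamp that drifts by 50–100 ms still leaves the whole word inside the cut.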
Rule 8 — Word-level verbatim ASR only. Enforced by transcribe.py L66–69: timestamps_granularity=word and no filler normalisation. If you ever swap Scribe for Whisper's SRT mode, you lose sub-second gap data *and* the editorial signal that verbatim fillers carry. The helpers do not reject other input, but the entire downstream pipeline assumes word-level verbatim output.
Rule 9 — Cache transcripts per source. Enforced by transcribe.py L106–109:
    if out_path.exists():
        if verbose:
            print(f"cached: {out_path.name}")
        return out_path
Re-transcription silently invalidates the editorial signal: Scribe regenerates timestamps and tokens, and any cached EDL built on the prior transcript becomes ungrounded. The cache is therefore mandatory, and the helper enforces it.
Rule 10 — Parallel sub-agents for multiple animations. Agent-side; the structure is in SKILL.md L249–261. Each animation slot gets its own sub-agent, its own slot_<id>/render.mp4, its own palette and spec. Spawning sequentially negates the speed benefit; parallelising is a hard correctness requirement because the *cost* of a sequential animation pipeline is the difference between a one-minute iteration and a fifteen-minute iteration.
Rule 11 — Strategy confirmation before execution. Agent-side; codified in the eight-step process as "Propose → Wait" (SKILL.md L88). The agent describes the strategy in 4–8 sentences, waits for explicit confirmation, then writes the EDL. Skipping the confirmation step means redoing work after the user objects to a cut they didn't approve.
Rule 12 — All session outputs in <videos_dir>/edit/. Enforced by the directory layout in SKILL.md L42–55 and the EDL's path semantics. The skill directory never receives user footage or per-session artefacts; everything lives under <videos_dir>/edit/. This is what makes the skill directory symlinkable into multiple agent install paths (install.md L73–88) without crossing wires.
Two hidden rules the renderer enforces for you
The twelve listed rules are not all the correctness. There are two additional invariants the helpers handle without explicit numbering:
HDR → SDR before any other transformation (render.py L95–131, L180–182). iPhones default to HLG HDR in Rec.2020; many mirrorless cameras ship PQ. If the source is HDR and you only downconvert bit depth (yuv420p10le → yuv420p), the output is 8-bit but still carries HLG/PQ transfer metadata. Players that honour the metadata (screen recorders, social upload re-encodes) interpret 8-bit values in an HDR container and the result is oversaturated. is_hdr_source() checks the stream's color_transfer. If it's smpte2084 or arib-std-b67, the filter graph gets a tonemap chain prepended:
    TONEMAP_CHAIN = (
        "zscale=t=linear:npl=100,"
        "format=gbrpf32le,"
        "zscale=p=bt709,"
        "tonemap=tonemap=hable:desat=0,"
        "zscale=t=bt709:m=bt709:r=tv,"
        "format=yuv420p"
    )
That's the only place in the entire skill where the fix is a six-stage filter chain. The reason: it's the only failure mode that doesn't manifest as a *visual* error in QuickTime on macOS. Macs hide the HDR metadata locally. Screen recording and uploaded renders cannot.
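The detection step might look roughly like the sketch below. It assumes the stream description is one entry from the `streams` array of `ffprobe -show_streams -of json`, which reports the transfer characteristic under `color_transfer`. The names `stream_is_hdr` and `video_filter` are hypothetical and not the signatures used in render.py.

```python
HDR_TRANSFERS = {"smpte2084", "arib-std-b67"}  # PQ and HLG


def stream_is_hdr(stream):
    """True when an ffprobe video stream dict reports a PQ or HLG transfer."""
    transfer = stream.get("color_transfer") or ""
    return transfer.lower() in HDR_TRANSFERS


def video_filter(stream, grade_filter, tonemap_chain):
    """Prepend the tonemap chain for HDR sources, then apply the grade."""
    parts = []
    if stream_is_hdr(stream):
        parts.append(tonemap_chain)  # HDR -> SDR happens before anything else
    if grade_filter:
        parts.append(grade_filter)
    return ",".join(parts)


hlg_stream = {"codec_type": "video", "color_transfer": "arib-std-b67"}
sdr_stream = {"codec_type": "video", "color_transfer": "bt709"}
```

With a placeholder chain, `video_filter(hlg_stream, "eq=contrast=1.02", "TONEMAP")` returns `"TONEMAP,eq=contrast=1.02"`, while the SDR stream gets only the grade.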
Loudness normalisation at -14 LUFS, -1 dBTP, LRA 11 LU (render.py L388–490). YouTube, Instagram, TikTok, X, and LinkedIn all normalise to -14 LUFS integrated with -1 dB true peak and 11 LU loudness range. The renderer runs a two-pass loudnorm after compositing: pass 1 measures the input; pass 2 applies the correction with measured_* parameters and :linear=true to keep the gain across the whole file consistent. Without this pass, the social platforms' own normalisation re-encodes the audio *and* introduces audible pumping on quiet sections. The two-pass is the only way to hit -14 LUFS without measurement drift.
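To show how the two passes connect, here is a sketch that turns the JSON printed by a first `loudnorm` pass (run with `print_format=json`) into the filter string for the second pass. The function name `loudnorm_pass2` is hypothetical and the sample measurements are invented for the example; the option names and JSON keys are the ones the ffmpeg loudnorm filter uses.

```python
import json


def loudnorm_pass2(measured_json, i=-14.0, tp=-1.0, lra=11.0):
    """Build the second-pass loudnorm filter from first-pass measurements."""
    m = json.loads(measured_json)
    return (
        f"loudnorm=I={i:g}:TP={tp:g}:LRA={lra:g}"
        f":measured_I={m['input_i']}"
        f":measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}"
        f":measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}"
        ":linear=true"
    )


# Invented measurements, shaped like the block a first pass prints.
sample = """{
    "input_i": "-23.10", "input_tp": "-6.20", "input_lra": "7.40",
    "input_thresh": "-33.50", "target_offset": "0.30"
}"""
audio_filter = loudnorm_pass2(sample)
```

Feeding the measured values back in is what lets the second pass apply one consistent gain to the whole file instead of adapting as it goes.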
Auto-grade is a hard rule in disguise
The third hidden correctness layer is in grade.py. Auto-grade clamps every axis to a narrow band around 1.0 (at most 6% down and 6–10% up, depending on the axis) and applies *no* colour shift:
    contrast_adj = max(0.94, min(1.08, contrast_adj))
    gamma_adj = max(0.94, min(1.10, gamma_adj))
    sat_adj = max(0.94, min(1.06, sat_adj))
(grade.py L246–249.) The rule isn't documented as a hard rule in SKILL.md because it's enforced by the helper's math, not by the agent's behaviour. But it is one. If the helper produced unbounded adjustments, the renderer would silently destroy skin tones, crush highlights, and shift palette. The boundedness is what makes auto-mode safe to default. The user can opt into warm_cinematic (which is flagged as "too aggressive for standard launch content" in the same file, L51), but they cannot opt into a grade.py that ignores its own caps.
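The effect of the clamps is easiest to see with deliberately extreme inputs. This sketch reuses the limits quoted above; the `LIMITS` table and the function names are my own framing, not how grade.py is organised.

```python
LIMITS = {
    "contrast": (0.94, 1.08),
    "gamma": (0.94, 1.10),
    "saturation": (0.94, 1.06),
}


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def bound_adjustments(raw):
    """Clamp each raw adjustment to its per-axis limits."""
    return {axis: clamp(raw[axis], *LIMITS[axis]) for axis in LIMITS}


bounded = bound_adjustments({"contrast": 1.5, "gamma": 0.5, "saturation": 1.0})
# bounded == {"contrast": 1.08, "gamma": 0.94, "saturation": 1.0}
```

However wild the analysis of a clip turns out, the values that reach the renderer stay inside the table.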
I bring this up because it's the only hard-coded rule the user might *not* realise is a rule. The others are visible in the filter graph, in the EDL schema, or in the agent's process flow. Auto-grade's boundedness is hidden inside auto_grade_for_clip's clamping. Knowing it is there means knowing why the renderer is safe to leave on "auto" for the entire session.
Walk through render.py once with the twelve rules in hand and you'll see why the renderer is 658 lines: each rule has a load-bearing code site, and most rules interact — adding overlays (Rule 4) reorders the filter graph relative to subtitles (Rule 1) while still requiring per-segment 30 ms fades (Rule 3) and lossless concat (Rule 2). The next chapter is the layer above this: how the agent makes decisions about *which* segments go in the EDL in the first place.
---
References:
- SKILL.md — the twelve hard rules, in order (L18–33)
- render.py — HDR tonemap chain, 30 ms fades, lossless concat, subtitles LAST (L95–131, L149–212, L267–283, L522–569)
- render.py — loudnorm two-pass for social targets (L388–490)
- grade.py — auto-grade clamps on contrast, gamma, and saturation (L178–271)
- install.md — symlinked skill directory + per-session output separation (L73–88)