What the Factory Cannot Yet Test for You
Harness ships QA guidance distilled from seven real boundary-mismatch bugs in SatangSlide — but the most dangerous failures sit *between* skills, not *within* them, and the factory has not yet built a smoke test for that surface. The next chapter of Harness's evolution will be cross-component QA, and the source is already pointing at it.
In late 2024, a team at Satang built a slide-generation web app. They had a Next.js API layer returning slide-project data, React hooks typed against SlideProject[], a status-transition map defining generating_template → template_approved, and a dashboard with links to the creation page. Each component worked correctly in isolation. npm run build passed. TypeScript strict mode was on.
When the dashboard went live, every link 404'd. The user typed claude "build a harness for SatangSlide", and Harness shipped a fresh team. One of the first items on that team's QA checklist was the bug class that produces exactly those 404s whenever the same code pattern recurs: boundary mismatch.
The QA agent guide at skills/harness/references/qa-agent-guide.md is not abstract guidance. It is a distilled lesson from seven specific production incidents in SatangSlide. The guide's central claim — *cross-component verification, not existence verification* — is the most operationally valuable single idea in the entire Harness repository. It is also the most visible hole in the factory: the guide describes the failure class but does not yet ship the automated cross-component smoke test that would catch it before deploy.
This chapter is about that gap, and what it implies for Harness's evolution.
Key Takeaways
- The dominant runtime failure in multi-agent-built web apps is the boundary mismatch: two components each pass their own tests but fail at the seam.
- Harness's QA guide lists six categories of boundary mismatch, all sourced from real SatangSlide incidents.
- The QA agent must be general-purpose, not Explore. The reasoning is concrete: Grep + script execution + cross-component comparison.
- Incremental QA (after each module) is structurally better than end-of-build QA. The spec calls this out explicitly.
- The factory ships the QA *methodology* but not the QA *smoke test*. That is the load-bearing gap.
- Harness's own _workspace/01_auditor_repo_audit.md (internal trending-readiness audit, 5.5/10) admits the community-infra gap. The QA smoke test is the parallel engineering gap.
The seven bugs
The QA guide ends with a table of seven bugs from SatangSlide. They are not theoretical. Each one corresponds to a category in the boundary-mismatch table earlier in the same guide.
| Bug | Boundary | Why static review missed it |
|-----|----------|----------------------------|
| projects?.filter is not a function | API → hook | API returns { projects: [] }, hook expects SlideProject[] |
| Dashboard links all 404 | File path → href | /dashboard/ prefix missing from links |
| Theme image invisible | API → component | thumbnailUrl (camelCase) vs thumbnail_url (snake_case) |
| Theme selection does not save | API → hook | select-theme API exists, no hook calls it |
| Creation page hangs forever | Status map → code | template_approved transition defined, never executed |
| data.failedIndices crash | Immediate response → frontend | Background job result accessed on synchronous response |
| "View slides after completion" 404 | File path → href | /projects/ vs /dashboard/projects/ |
The pattern in all seven is identical: each side of the boundary is correct in its own context, but the contract between them is wrong. npm run build passes because TypeScript's generic casting (fetchJson<SlideProject[]>(...)) trusts the type annotation rather than the runtime response. Explore-only QA misses them because reading one side does not reveal the seam.
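The cast problem can be reproduced without a network call. The following is a minimal sketch, assuming a hypothetical `parseAs<T>` helper standing in for `fetchJson<T>` and a simplified `SlideProject` shape; neither is taken from the SatangSlide source.

```typescript
// Hypothetical, simplified shape; the real SlideProject type is not shown in the source.
interface SlideProject {
  id: string;
  title: string;
}

// Stand-in for fetchJson<T>: the generic is a compile-time promise only.
// JSON.parse returns `any`, so the cast is never checked at runtime.
function parseAs<T>(body: string): T {
  return JSON.parse(body) as T;
}

// Runtime guard that actually inspects the payload.
function isSlideProjectArray(value: unknown): value is SlideProject[] {
  return (
    Array.isArray(value) &&
    value.every(
      (p) =>
        typeof p === "object" &&
        p !== null &&
        typeof (p as Record<string, unknown>).id === "string" &&
        typeof (p as Record<string, unknown>).title === "string",
    )
  );
}

// What the API actually sends: a wrapper object, not a bare array.
const apiBody = JSON.stringify({ projects: [{ id: "p1", title: "Deck" }] });

// Compiles under strict mode, yet `projects` is a plain object at runtime,
// so calling projects.filter(...) would throw.
const projects = parseAs<SlideProject[]>(apiBody);

// The guard rejects the wrapped payload, which makes the seam visible.
const guardRejectsWrapper = !isSlideProjectArray(projects);
```

The compiler accepts both sides because the annotation is the only evidence it has. Only a check that runs against the real payload, or a comparison of both source files, can see the wrapper.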
The guide's response is two structural rules:
Rule 1. The QA agent must be general-purpose, not Explore. *Reason:* Grep, script execution, and cross-component comparison require Edit/Write/Bash. Explore is read-only.
Rule 2. QA runs after each module, not at end of build. *Reason:* end-of-build QA lets boundary errors propagate through the rest of the codebase, multiplying fix cost.
These are the right rules. They are also rules that operate *on the QA agent itself*, not on the boundary surface. The boundary surface still gets discovered by hand, by the QA agent reading both sides and writing a checklist.
The boundary surface, drawn
flowchart LR
subgraph LEFT["Left Side — Producer"]
APIR["API route<br/>NextResponse.json()"]
SM["Status Map<br/>STATE_TRANSITIONS"]
DBM["DB Schema<br/>snake_case fields"]
FP["File System<br/>src/app/* page paths"]
end
subgraph RIGHT["Right Side — Consumer"]
HOOK["React hook<br/>fetchJson<T>"]
UC["Update Code<br/>.update({ status })"]
TY["Type Definition<br/>camelCase fields"]
LINK["Code Links<br/>href, router.push"]
end
APIR -.JSON shape.-> HOOK
SM -.transition used?.-> UC
DBM -.field rename.-> TY
FP -.path match?.-> LINK
classDef producer fill:#fde68a,stroke:#b45309;
classDef consumer fill:#bae6fd,stroke:#0369a1;
classDef boundary fill:#f5f5f4,stroke:#000,stroke-dasharray: 5 5;
class APIR,SM,DBM,FP producer;
class HOOK,UC,TY,LINK consumer;
Each dotted edge is a contract that must hold. The QA agent's job is to verify each contract by reading both sides simultaneously. The factory ships a checklist for each contract:
- API ↔ Hook: response shape matches the hook's type parameter T; wrapper objects are unwrapped correctly; snake_case ↔ camelCase applied consistently; sync vs async response handled.
- Status map ↔ Update code: every defined transition is executed; every code transition is defined; intermediate → final transitions not skipped.
- DB schema ↔ API ↔ Type: field name mapping is consistent end-to-end; null/undefined handled on optional fields.
- File paths ↔ Links: every href and router.push matches a real page path; route groups (group) correctly stripped from URLs.
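The File paths ↔ Links contract is the easiest of the four to mechanize. The sketch below assumes the Next.js App Router convention of `src/app/**/page.tsx` files, in which route groups such as `(app)` do not appear in the URL. The function names and the sample file list are hypothetical, and catch-all segments, query strings, and hashes are deliberately not handled.

```typescript
// Convert an App Router page file into the URL path it serves.
// Route groups like "(app)" are dropped because they never appear in URLs.
function pagePathToRoute(filePath: string): string {
  const segments = filePath
    .replace(/^src\/app\//, "")
    .replace(/(^|\/)page\.tsx$/, "")
    .split("/")
    .filter((s) => s.length > 0 && !/^\(.*\)$/.test(s));
  return "/" + segments.join("/");
}

// Compare segment by segment; a dynamic segment such as "[id]" matches any non-empty value.
function routeMatches(route: string, href: string): boolean {
  const routeParts = route.split("/");
  const hrefParts = href.split("/");
  return (
    routeParts.length === hrefParts.length &&
    routeParts.every((part, i) => {
      const actual = hrefParts[i] ?? "";
      return /^\[.+\]$/.test(part) ? actual.length > 0 : part === actual;
    })
  );
}

// Return every href that no page file can serve.
function findDeadLinks(pageFiles: string[], hrefs: string[]): string[] {
  const routes = pageFiles.map(pagePathToRoute);
  return hrefs.filter((href) => !routes.some((r) => routeMatches(r, href)));
}

// Illustrative inputs; in a real run these would come from a file listing and a Grep for href / router.push.
const pageFiles = [
  "src/app/page.tsx",
  "src/app/(app)/dashboard/page.tsx",
  "src/app/(app)/dashboard/projects/[id]/page.tsx",
];
const hrefs = ["/dashboard", "/projects/42", "/dashboard/projects/42"];

const deadLinks = findDeadLinks(pageFiles, hrefs);
```

With these inputs, `/projects/42` is the only dead link, which mirrors the `/projects/` vs `/dashboard/projects/` row in the bug table.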
The checklist is the factory's most detailed artifact. It is also the artifact most likely to be skipped by users — it requires the QA agent to actually run a Grep-extract-compare cycle, which is more work than reading a single file.
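To make the Grep-extract-compare cycle concrete, here is a rough sketch of the Status map ↔ Update code check. Only the `generating_template` → `template_approved` pair comes from the text; the other statuses, the `db.from(...)` client, and the function names are hypothetical. The regex only sees string-literal statuses, so a status passed through a variable would be missed.

```typescript
// Hypothetical transition map: each key lists the statuses it may move to.
const STATE_TRANSITIONS: Record<string, string[]> = {
  draft: ["generating_template"],
  generating_template: ["template_approved"],
};

// Stand-in for the concatenated source of the update code (normally read from disk).
const updateCode = `
  await db.from("projects").update({ status: "generating_template" }).eq("id", id);
  await db.from("projects").update({ status: "completed" }).eq("id", id);
`;

// Extract: every status literal written via .update({ status: "..." }).
function extractWrittenStatuses(source: string): string[] {
  const pattern = /\.update\(\s*\{\s*status:\s*["']([a-z_]+)["']/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (!found.includes(match[1])) found.push(match[1]);
  }
  return found;
}

// Compare: target statuses in the map against statuses the code actually writes.
function diffTransitions(map: Record<string, string[]>, written: string[]) {
  const defined = Object.keys(map).reduce<string[]>(
    (acc, from) => acc.concat(map[from]),
    [],
  );
  return {
    definedButNeverWritten: defined.filter((s) => !written.includes(s)),
    writtenButNeverDefined: written.filter((s) => !defined.includes(s)),
  };
}

const transitionReport = diffTransitions(
  STATE_TRANSITIONS,
  extractWrittenStatuses(updateCode),
);
```

Here `template_approved` is reported as defined but never written, the same shape as the "creation page hangs forever" bug, and `completed` is reported as written but never defined.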
The missing smoke test
The factory has not yet shipped a *programmatic* boundary smoke test. The closest is skill-testing-guide.md §4.2, which recommends "programmable verification" for assertions, with scripts that can be re-run per iteration. But the boundary checklist itself is still prose. A team that wants to enforce boundary consistency has to write their own Grep scripts against their specific framework.
For Next.js + Prisma + React, that script is straightforward: extract NextResponse.json() calls, extract fetchJson< calls, diff the shapes, flag mismatches. For Express + Mongoose + Vue, the script is different but analogous. For Django + SQLAlchemy + HTMX, different again. The factory does not ship the script because it cannot anticipate the framework choice — and yet the failure class is framework-independent.
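For the Next.js case, the extract-and-diff step might look roughly like the sketch below. It is deliberately crude: it only distinguishes "route returns a wrapper object" from "hook expects an array", it assumes the route's URL path is already known (in practice it would be derived from the file path), and its regex does not handle nested generics such as `Array<SlideProject>`. All type and function names are hypothetical.

```typescript
interface RouteSource {
  path: string;   // URL the route serves, e.g. "/api/projects"
  source: string; // text of the route handler file
}

interface HookCall {
  path: string;    // URL literal passed to fetchJson
  generic: string; // type argument, e.g. "SlideProject[]"
}

// Extract every fetchJson<T>("...") call from hook source text.
function extractHookCalls(source: string): HookCall[] {
  const pattern = /fetchJson<([^>]+)>\(\s*["'`]([^"'`]+)["'`]/g;
  const calls: HookCall[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    calls.push({ generic: match[1].trim(), path: match[2] });
  }
  return calls;
}

// True when the handler passes an object literal to NextResponse.json.
function returnsWrapperObject(routeSource: string): boolean {
  return /NextResponse\.json\(\s*\{/.test(routeSource);
}

// Diff: flag hooks that expect a bare array from a route that wraps its payload.
function findShapeMismatches(routes: RouteSource[], hookSource: string): string[] {
  const problems: string[] = [];
  for (const call of extractHookCalls(hookSource)) {
    const route = routes.find((r) => r.path === call.path);
    if (!route) {
      problems.push(`${call.path}: no route found for this hook call`);
      continue;
    }
    if (/\[\]$/.test(call.generic) && returnsWrapperObject(route.source)) {
      problems.push(
        `${call.path}: hook expects ${call.generic} but route returns a wrapper object`,
      );
    }
  }
  return problems;
}

// Illustrative inputs modelled on the first row of the bug table.
const routeSources: RouteSource[] = [
  { path: "/api/projects", source: "return NextResponse.json({ projects });" },
];
const hookSource = 'const data = await fetchJson<SlideProject[]>("/api/projects");';

const shapeProblems = findShapeMismatches(routeSources, hookSource);
```

A regex pass like this is a smoke test, not a type checker. It is cheap enough to re-run after every module, which is where its value lies.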
The hole in the factory is structural, not editorial. It is not that the maintainers forgot to ship the smoke test. It is that the smoke test depends on the framework, and the factory is framework-agnostic. The general solution would be:
1. A boundary-check tool that takes the team's stack as input and emits the appropriate Grep + diff pipeline.
2. A canonical boundary-check tool for the most common stacks (Next.js + Prisma, Express + Mongoose, Django + DRF), so users do not start from zero.
3. Integration into Phase 6 so that claude "build a harness for X" includes a boundary-check step by default.
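One possible shape for the first two items is a lookup from stack to search patterns. Nothing like this exists in the source; the stack identifiers, the plan structure, and the glob layouts below are all assumptions, and the consumer patterns in particular would vary by project.

```typescript
// Hypothetical stack identifiers for the three stacks named above.
type Stack = "nextjs-prisma" | "express-mongoose" | "django-drf";

interface BoundaryCheckPlan {
  producerPattern: RegExp; // where response bodies are built
  consumerPattern: RegExp; // where they are read on the client side
  sourceGlobs: string[];   // assumed project layout; adjust per repository
}

const PLANS: Record<Stack, BoundaryCheckPlan> = {
  "nextjs-prisma": {
    producerPattern: /NextResponse\.json\(/,
    consumerPattern: /fetchJson</,
    sourceGlobs: ["src/app/api/**/route.ts", "src/hooks/**/*.ts"],
  },
  "express-mongoose": {
    producerPattern: /res\.json\(/,
    consumerPattern: /fetch\(/,
    sourceGlobs: ["src/routes/**/*.js", "src/client/**/*.js"],
  },
  "django-drf": {
    producerPattern: /Response\(/,
    consumerPattern: /fetch\(/,
    sourceGlobs: ["**/views.py", "static/**/*.js"],
  },
};

// Emit the human-readable pipeline a QA agent would run for a given stack.
function describePipeline(stack: Stack): string[] {
  const plan = PLANS[stack];
  return [
    `search ${plan.sourceGlobs.join(", ")} for ${plan.producerPattern.source}`,
    `search ${plan.sourceGlobs.join(", ")} for ${plan.consumerPattern.source}`,
    "pair producers with consumers by URL and diff the shapes",
  ];
}

const nextPipeline = describePipeline("nextjs-prisma");
```

The table keeps the framework-specific part small and swappable, while the pair-and-diff step stays the same across stacks.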
None of those exists yet in the source. The QA guide tells you to do it. The smoke test does not exist.
Your webhook works. Your API works. Your user sees a TypeError. Welcome to the boundary.
Incremental QA and the producer-reviewer loop
The QA guide's recommendation to run QA *after each module* is more important than the README suggests. End-of-build QA fails for a specific reason: by the time all modules exist, the boundary mismatches have already propagated. A component built against the wrong assumption has produced downstream artifacts that depend on the wrong shape. Fixing the boundary mismatch now means rewriting those downstream artifacts too.
Incremental QA interrupts the propagation. After the API is written, QA reads the API and reads the (still-empty) hooks directory. The shape mismatch is flagged before the hook is written against the wrong type. After the first hook is written, QA reads both. After the first integration, QA reads API + hook + status update + link.
The pattern is structurally Producer-Reviewer, applied to the QA function. The QA agent is the reviewer. The producer is whichever agent just wrote the module. The retry cap from Chapter 2 applies — a boundary mismatch that survives two QA cycles is force-passed with a warning.
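The loop with its retry cap can be sketched as follows. The cap of two cycles and the force-pass-with-warning behavior are as described above; the type names, the `runModuleQa` function, and the synchronous callbacks are illustrative simplifications rather than anything Harness ships.

```typescript
interface ReviewResult {
  passed: boolean;
  findings: string[];
}

interface QaOutcome {
  status: "passed" | "force_passed";
  cycles: number;
  warning?: string;
}

// Retry cap: a mismatch that survives this many QA cycles is force-passed.
const MAX_QA_CYCLES = 2;

// review = the QA agent (reviewer); fix = the agent that wrote the module (producer).
function runModuleQa(
  review: (cycle: number) => ReviewResult,
  fix: (findings: string[]) => void,
): QaOutcome {
  let last: ReviewResult = { passed: false, findings: [] };
  for (let cycle = 1; cycle <= MAX_QA_CYCLES; cycle++) {
    last = review(cycle);
    if (last.passed) return { status: "passed", cycles: cycle };
    // Only hand findings back to the producer if another review will follow.
    if (cycle < MAX_QA_CYCLES) fix(last.findings);
  }
  return {
    status: "force_passed",
    cycles: MAX_QA_CYCLES,
    warning: `Unresolved after ${MAX_QA_CYCLES} QA cycles: ${last.findings.join("; ")}`,
  };
}

const mismatch = "GET /api/projects returns { projects: [] }, hook expects SlideProject[]";

// Producer fixes the mismatch after the first review.
const recovered = runModuleQa(
  (cycle) =>
    cycle === 1
      ? { passed: false, findings: [mismatch] }
      : { passed: true, findings: [] },
  () => {},
);

// Mismatch survives both cycles and is force-passed with a warning.
const stubborn = runModuleQa(
  () => ({ passed: false, findings: [mismatch] }),
  () => {},
);
```

The force-pass branch is the part worth watching: it keeps the build moving, but the warning is the only record that a known seam problem was carried forward.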
This is a coherent model. It is also one that depends entirely on the QA agent actually running. The factory ships the instruction. The user has to enforce it.
Evolution as the factory's other load-bearing phase
Phase 7 is the second load-bearing lifecycle phase. Its job is to ensure that the team improves after each execution. The mechanism is straightforward:
1. After each run, the orchestrator asks the user for feedback.
2. Feedback is routed to the correct artifact (quality → skill, role → agent, workflow → orchestrator, trigger → skill description).
3. The change is recorded in the CLAUDE.md change-log table.
4. Future sessions read the change log and start from the new baseline.
The factory's evolution model is auditable. Every change has a date, an artifact, and a reason. The change log is the factory's memory across sessions, and it is the same Markdown table the user already edits.
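The routing and recording steps amount to a lookup followed by a table row. A minimal sketch, assuming a three-column change-log layout (date, artifact, reason); the actual column schema in CLAUDE.md is not shown here, and the function names are hypothetical.

```typescript
type FeedbackKind = "quality" | "role" | "workflow" | "trigger";
type Artifact = "skill" | "agent" | "orchestrator" | "skill description";

// Routing table taken directly from the four pairs listed above.
const FEEDBACK_ROUTING: Record<FeedbackKind, Artifact> = {
  quality: "skill",
  role: "agent",
  workflow: "orchestrator",
  trigger: "skill description",
};

interface ChangeLogEntry {
  date: string; // ISO date, e.g. "2025-01-15"
  artifact: Artifact;
  reason: string;
}

// Route one piece of feedback to the artifact that should change.
function routeFeedback(kind: FeedbackKind, reason: string, date: string): ChangeLogEntry {
  return { date, artifact: FEEDBACK_ROUTING[kind], reason };
}

// Render the entry as a Markdown table row; pipes in the reason are escaped
// so they cannot break the table. The column order is an assumption.
function toChangeLogRow(entry: ChangeLogEntry): string {
  const reason = entry.reason.replace(/\|/g, "\\|");
  return `| ${entry.date} | ${entry.artifact} | ${reason} |`;
}

const entry = routeFeedback(
  "trigger",
  "skill did not activate on 'make slides'",
  "2025-01-15",
);
const row = toChangeLogRow(entry);
```

Note what the entry does not carry: there is no field for whether the change was re-tested.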
The limitation is that the change log captures *what* changed, not *whether the change worked*. Phase 7 records changes; it does not re-test them. If a skill was modified based on user feedback, the next session picks up the modified skill, but the factory does not re-run Phase 6's A/B test against the new version. The user has to ask for it.
This is a small gap, but it is the gap that distinguishes "factory that records changes" from "factory that improves quality." Harness ships the first. The second is open work.
The gap the QA agent guide reveals
Synthesizing across the seven bugs, the QA guide, the Phase 7 evolution model, and the [Unreleased] section of CHANGELOG.md, the load-bearing gap in the current factory is:
The factory ships the methodology for cross-component verification but not the test harness.
Specifically:
- The QA agent *description* is shipped (in qa-agent-guide.md).
- The QA agent *methodology* is shipped (Grep + diff + cross-component compare).
- The QA agent *checklist* is shipped (six contract categories).
- The QA agent *smoke test framework* is not shipped (no programmatic boundary check; no stack-specific tools).
Until the smoke test framework exists, the QA methodology is advisory. Users who follow it get high-quality teams. Users who skip it get SatangSlide-style boundary bugs. The factory's evolution path is to convert the methodology into a default-on smoke test — the equivalent of Phase 6's with-skill vs without-skill A/B, but applied to the team's own boundaries.
I thought the QA guide was an add-on. It turned out to be the most operational evidence of what "factory" means — the factory ships the standard, not just the product, and the standard is what the next phase of evolution will close the loop on.
What this means for the series
The red thread of this series has been: *the hardest problem in multi-agent AI is making the team a team, and the factory is the right shape for that work.* Across four chapters, I have argued that the factory has a disciplined pipeline (Chapter 1), a defensible design vocabulary (Chapter 2), an enforceable execution rule (Chapter 3), and — in this chapter — an honest gap.
The gap is not a flaw. It is the boundary the factory has not yet crossed. Every factory has one — the surface where its current standards are advisory rather than enforced, where the methodology is shipped but the test harness is not. For Harness, that surface is cross-component QA. The seven SatangSlide bugs are the proof that the surface exists; the QA guide is the proof that the maintainers know it exists; the missing smoke test is the proof that they have not yet closed it.
If the next chapter of Harness's evolution closes this gap — if Phase 6 ships with a boundary-check tool that runs by default — the factory's claim of *design-time discipline covering runtime quality* will be complete. Until then, the claim is *design-time discipline covering most runtime quality*, and the boundary mismatch is the exception that proves the rule.
If the factory's QA agent catches seven boundary bugs in a SatangSlide clone, who catches the bugs at the boundary *between harnesses*? That is the question the next generation of factory tooling — whether from the same maintainers or from the L3 neighbors — will have to answer. Harness ships the discipline. The discipline does not yet extend to the seams between its own outputs.
---
References:
- skills/harness/references/qa-agent-guide.md: Boundary mismatch taxonomy, six contract categories, seven SatangSlide bugs, incremental QA rule
- skills/harness/SKILL.md: Phase 6 validation, Phase 7 evolution, change-log table schema
- skills/harness/references/skill-testing-guide.md: Programmable verification pattern (closest existing analog to boundary smoke test)
- _workspace/01_auditor_repo_audit.md: Internal trending-readiness audit (5.5/10 baseline; community-infra gap parallel to engineering gap)
- Hwang, M. (2026). *Harness: Structured Pre-Configuration for Enhancing LLM Code Agent Output Quality*: Author-measured A/B on 15 tasks, n=15, third-party replications pending