Hypothesis-Shaped Research: Why Every Backtest Is a Clinical Trial

A research workflow that produces a transaction is transactional. A research workflow that produces an artifact you can inspect, hash, and re-run is scientific. Vibe-Trading's research surface is the second kind — and the same five-layer architecture that defends a broker account defends a research conclusion.

Key Takeaways

A research workflow shaped like a transaction (prompt → answer) is not auditable. A workflow shaped like a hypothesis (Goal → Hypothesis → Signal Engine → Backtest → Attribution) is, because every step produces an artifact you can hash and re-run.
Layered attribution is the most under-discussed element. After every backtest, the system runs trade-level winners/losers, beta regression, market-regime analysis, and a Monte Carlo permutation test, gated by data availability. The output is not a PnL number; it is a five-axis diagnostic.
The Shadow Account loop is the workflow in miniature: read journal → profile behavior → extract rules → backtest → render. Same five steps, scaled to one trader's history.
The series closes where it opened. "Trust is the product" was the E00 claim. Every chapter since has shown what the claim looks like in code: chokepoints, capability layers, structural guards, artifact trails. The five layers are not a stack. They are a cycle. The cycle is the product.

---

A researcher uploads a one-year broker CSV to Vibe-Trading's Shadow Account endpoint. The system reads the journal, profiles the trader's behavior — holding days, win rate, PnL ratio, drawdown, disposition effect, overtrading, momentum chasing, anchoring — and distills three to five if-then rules from the profitable roundtrips. It scans a deterministic OHLCV feature space (not the old calendar-phase stub), backtests the rules across A-share, HK, US, and crypto markets, and renders an eight-section HTML/PDF report. The researcher did not write a line of code. She did not specify a feature. She did not run a backtest. She uploaded a file, and the system produced a delta-PnL report across four markets with attribution analysis on the winners and losers.

That is the workflow in miniature. It is also the workflow at full size. The same five-step shape — read, profile, extract, backtest, attribute — applies whether the input is a trader's CSV or a fund's research mandate. The shape is the principle. The scale is the variable.

The deeper claim I want to make in this closing chapter is that Vibe-Trading's research workflow is shaped like a clinical trial, not like a transaction. Every step produces an artifact you can inspect later. The Goal produces a research-goal object with claims, acceptance criteria, evidence, budgets, and a completion policy. The Hypothesis produces a registry entry with a falsification rule. The Signal Engine produces a pre-flight-validated Python module with a hash. The Backtest produces a run card with config_hash + strategy_hash. The Attribution produces a five-axis diagnostic on the backtest output, gated by data availability. The system does not just produce answers. It produces *auditable answers* — answers you can re-trace, re-run, and re-attribute.

This is closer to clinical-trial methodology than to retail trading. And it is the reason the series exists.

The Goal → Hypothesis → Signal → Backtest → Attribution shape

Let me walk the shape end to end, because each step has an artifact and the artifacts are the point.

Goal. Pull request 2026-05-24 introduced the Research Goal runtime: a session-scoped object that carries the user's claim, the acceptance criteria for the claim, the evidence to be gathered, the budgets (time, compute, API calls) that constrain the search, and the completion policy that decides when the goal is satisfied. The Goal is the *mandate* of the research workflow, in the same sense the broker mandate is the mandate of the trading layer. It is what the user commits to before the agent investigates. Pull request 2026-05-26 made the Web UI goal creation send the kickoff turn immediately, so the Goal is not a dead letter — it is the first artifact in the run.
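To make the Goal concrete, here is a minimal sketch of such an object, assuming a dataclass layout. The class names, field names, and the completion policy are illustrative assumptions, not the runtime's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Hypothetical limits that constrain the search."""
    max_seconds: float
    max_api_calls: int


@dataclass
class ResearchGoal:
    """Illustrative session-scoped goal; not the real runtime class."""
    claim: str
    acceptance_criteria: list
    budget: Budget
    evidence: dict = field(default_factory=dict)  # criterion -> artifact reference
    api_calls_used: int = 0

    def record_evidence(self, criterion: str, artifact_ref: str) -> None:
        if criterion not in self.acceptance_criteria:
            raise ValueError(f"unknown criterion: {criterion}")
        self.evidence[criterion] = artifact_ref

    def status(self) -> str:
        """Assumed completion policy: satisfied once every criterion has evidence."""
        if all(c in self.evidence for c in self.acceptance_criteria):
            return "satisfied"
        if self.api_calls_used >= self.budget.max_api_calls:
            return "budget_exhausted"
        return "open"


goal = ResearchGoal(
    claim="5-day momentum predicts next-day return",
    acceptance_criteria=["backtest_run", "permutation_test"],
    budget=Budget(max_seconds=600.0, max_api_calls=50),
)
goal.record_evidence("backtest_run", "run_card:abc123")
print(goal.status())  # open: one criterion still lacks evidence
```

The point of the sketch is that the commitment (claim, criteria, budget) is fixed before any evidence arrives, so the completion check reads state rather than judgment.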

Hypothesis. The hypothesis registry (PR 2026-05-16, CLI added 2026-05-20) is a backend with create_hypothesis, update_hypothesis, link_backtest, and search_hypotheses. Each hypothesis carries a falsification rule: the condition under which the claim is invalidated. The registry is what makes a claim *falsifiable* — the difference between a research conclusion and a market opinion. The hypothesis is also the auto-expiry mechanism: claims that no longer hold are invalidated, and invalidated claims are not silently deleted. They are marked, dated, and queryable.
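The four registry operations named above can be sketched with an in-memory store. The function names come from the text; the signatures, the `Hypothesis` fields, and the status strings are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Hypothesis:
    hypothesis_id: int
    claim: str
    falsification_rule: str
    status: str = "active"
    invalidated_on: Optional[date] = None
    backtests: list = field(default_factory=list)


class HypothesisRegistry:
    """In-memory stand-in; the real backend's storage and signatures are not shown here."""

    def __init__(self) -> None:
        self._items = {}

    def create_hypothesis(self, claim: str, falsification_rule: str) -> Hypothesis:
        h = Hypothesis(len(self._items) + 1, claim, falsification_rule)
        self._items[h.hypothesis_id] = h
        return h

    def update_hypothesis(self, hypothesis_id: int, status: str, on: date) -> Hypothesis:
        h = self._items[hypothesis_id]
        h.status = status
        if status == "invalidated":
            h.invalidated_on = on  # marked and dated, never deleted
        return h

    def link_backtest(self, hypothesis_id: int, run_id: str) -> None:
        self._items[hypothesis_id].backtests.append(run_id)

    def search_hypotheses(self, status: Optional[str] = None) -> list:
        return [h for h in self._items.values() if status is None or h.status == status]


registry = HypothesisRegistry()
h = registry.create_hypothesis(
    claim="RSI(14) below 30 precedes a 5-day rebound",
    falsification_rule="permutation p-value above 0.05 on out-of-sample data",
)
registry.link_backtest(h.hypothesis_id, "run_card:abc123")
registry.update_hypothesis(h.hypothesis_id, status="invalidated", on=date(2026, 7, 1))
print([x.claim for x in registry.search_hypotheses(status="invalidated")])
```

Invalidation here is a state change plus a date, so an invalidated claim still appears in searches, which is the property the paragraph above describes.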

Signal Engine. LLM-generated signal engines pass interface validation before instantiation (PR #149, 2026-05-30). The pre-flight catches circular self-imports, missing generate(), non-defaulted __init__, wrong return types — each failure surfaces as an actionable JSON error. The Signal Engine is the *pre-trade gate* of the research workflow, in the same sense the broker pre-trade gate rejects orders that violate the mandate. An engine that fails the pre-flight never runs. An engine that passes the pre-flight runs against data that has been validated by the loader registry's OHLC sanity check.
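A pre-flight of this kind can be approximated with the standard `inspect` module. The sketch below is not the project's validator: the error codes, the hint strings, and the assumption that `generate()` should return a `dict` are all hypothetical, and the circular-import check is left out.

```python
import inspect
import json


def preflight(engine_cls: type, sample_data: dict) -> list:
    """Return JSON-serializable errors; an empty list means the engine may run."""
    errors = []
    if not callable(getattr(engine_cls, "generate", None)):
        errors.append({"code": "missing_generate",
                       "hint": "define generate(self, data) on the engine class"})
    required = [
        p.name for p in inspect.signature(engine_cls).parameters.values()
        if p.default is inspect.Parameter.empty
        and p.kind not in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    ]
    if required:
        errors.append({"code": "non_defaulted_init",
                       "hint": f"give defaults to: {', '.join(required)}"})
    if not errors:
        # Instantiate only after the static checks pass.
        result = engine_cls().generate(sample_data)
        if not isinstance(result, dict):
            errors.append({"code": "wrong_return_type",
                           "hint": f"generate() returned {type(result).__name__}, expected dict"})
    return errors


class BrokenEngine:
    def __init__(self, window):  # no default, and no generate()
        self.window = window


class ListEngine:
    def generate(self, data):
        return []  # wrong type under the dict assumption


print(json.dumps(preflight(BrokenEngine, {}), indent=2))
print(json.dumps(preflight(ListEngine, {}), indent=2))
```

Each failure is a small structured record rather than a traceback, so the agent that wrote the engine can read the hint and repair it.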

Backtest. Pull request 2026-06-25 (PR #306) made validation JSON strict-JSON-safe; pull request 2026-05-12 made run cards the durable artifact for every finished run. The run card carries config_hash + strategy_hash, plus validation JSON, Monte Carlo permutation, Bootstrap CI, and Walk-Forward results. The backtest is the *audit ledger* of the research workflow. It is what makes the research reproducible. A researcher who disagrees with a backtest can re-run it with the same hashes and verify the result, or with different hashes and explain the divergence.
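The two hashes can be illustrated with the standard library. This sketch assumes SHA-256 over canonical JSON for the configuration and over the raw source text for the strategy; the actual hashing scheme and run-card layout are not specified in this chapter, so `stable_hash` and `build_run_card` are hypothetical helpers.

```python
import hashlib
import json


def stable_hash(obj) -> str:
    """SHA-256 over canonical JSON: sorted keys, no whitespace, NaN/Infinity rejected."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"), allow_nan=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_run_card(config: dict, strategy_source: str, validation: dict) -> dict:
    card = {
        "config_hash": stable_hash(config),
        "strategy_hash": hashlib.sha256(strategy_source.encode("utf-8")).hexdigest(),
        "validation": validation,
    }
    json.dumps(card, allow_nan=False)  # raises ValueError if validation holds NaN or Infinity
    return card


source = "def generate(data): ..."
card_a = build_run_card({"symbol": "NVDA", "window": 5}, source, {"sharpe": 1.2})
card_b = build_run_card({"window": 5, "symbol": "NVDA"}, source, {"sharpe": 1.2})
print(card_a["config_hash"] == card_b["config_hash"])  # True: key order does not change the hash
```

Passing `allow_nan=False` is one way to get strict JSON from Python's `json` module, since the default would emit the non-standard tokens `NaN` and `Infinity`.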

Attribution. Pull request 2026-06-21 (PR #280) added layered attribution: after every backtest, the system runs trade-level winners/losers, beta regression, market-regime analysis, and a Monte Carlo permutation test, gated by data availability. The output is not a single PnL number. It is a five-axis diagnostic that explains *why* the strategy won or lost, which trades contributed, how much of the return is beta, which regimes it survives, and whether the result is statistically distinguishable from random. The attribution is the *bounded-autonomy element* I missed in E03 — it is the layer that prevents the researcher from confusing a lucky backtest with a real edge.
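Two of those axes are easy to show in a few lines. The sketch below computes beta as an ordinary least-squares slope and runs a permutation test that shuffles positions against market returns. It is a generic textbook version under simplifying assumptions (no costs, one asset, daily positions), not the attribution module's implementation, and returning `None` is just one way to express gating on data availability.

```python
import random


def beta(strategy_returns, market_returns):
    """OLS slope of strategy returns on market returns; None when data is insufficient."""
    n = len(strategy_returns)
    if n < 2 or n != len(market_returns):
        return None
    mean_s = sum(strategy_returns) / n
    mean_m = sum(market_returns) / n
    var_m = sum((m - mean_m) ** 2 for m in market_returns)
    if var_m == 0:
        return None
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(strategy_returns, market_returns))
    return cov / var_m


def permutation_p_value(positions, market_returns, n_shuffles=1000, seed=0):
    """Share of shuffled position sequences whose PnL matches or beats the observed PnL."""
    rng = random.Random(seed)  # seeded so the test itself is reproducible
    observed = sum(p * r for p, r in zip(positions, market_returns))
    shuffled = list(positions)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if sum(p * r for p, r in zip(shuffled, market_returns)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)  # add-one correction keeps p above zero


market = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03, 0.01, -0.02]
positions = [1, 0, 1, 0, 1, 0, 1, 0]  # long only on the up days: suspiciously good timing
strategy = [p * r for p, r in zip(positions, market)]
print("beta:", beta(strategy, market))
print("p-value:", permutation_p_value(positions, market))
```

A small p-value says the timing is hard to reproduce by chance; it does not say the edge will persist, which is why the other axes sit beside it.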

flowchart LR
    G["Goal<br/>claim · acceptance · evidence · budget · completion"]
    H["Hypothesis<br/>falsification rule · registry · auto-invalidate"]
    S["Signal Engine<br/>pre-flight · interface validation · hash"]
    B["Backtest<br/>run_card.json · config_hash · strategy_hash"]
    A["Attribution<br/>winners/losers · beta · regime · permutation · bootstrap"]

    G --> H --> S --> B --> A
    A -.invalidates.-> H
    A -.closes the loop.-> G

The arrows in that diagram are the same shape as the bounded-autonomy diagram in E03. The Goal flows into the Hypothesis. The Hypothesis flows into the Signal Engine. The Signal Engine flows into the Backtest. The Backtest flows into the Attribution. The Attribution loops back to invalidate Hypotheses and close Goals. The cycle is the product.

The Shadow Account loop

The Shadow Account module under agent/src/shadow_account/ is the workflow in miniature. The five components — extractor.py, scanner.py, backtester.py, codegen.py, reporter.py — implement the same five-step shape, scaled down to a single trader's journal.

The extractor distills if-then rules from profitable roundtrips. Pull request #302 (2026-06-24) made the extraction see PIT-safe entry context — entry_rsi14 and prior_5d_return fetched through the loader registry as of buy_dt. The extractor is the Goal step: it commits to a set of rules the trader's history suggests.
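Point-in-time safety comes down to one filter: nothing dated on or after the decision may enter the feature. The sketch below assumes "as of buy_dt" means only closes strictly before the buy date, and it reads from a plain list instead of the loader registry; the function body is illustrative.

```python
from datetime import date


def prior_5d_return(closes, buy_dt):
    """Return over the 5 bars ending at the last close strictly before buy_dt, or None."""
    # Point-in-time filter: drop every bar dated on or after the entry.
    history = [price for day, price in sorted(closes) if day < buy_dt]
    if len(history) < 6:
        return None  # not enough history to look back five bars
    return history[-1] / history[-6] - 1.0


closes = [(date(2026, 6, d), p) for d, p in zip(range(1, 8), [100, 101, 102, 103, 104, 105, 200])]
print(prior_5d_return(closes, buy_dt=date(2026, 6, 7)))  # about 0.05; the 200 close on the buy date is ignored
print(prior_5d_return(closes, buy_dt=date(2026, 6, 6)))  # None: only five prior bars exist
```

Without the filter, the spike on the buy date would leak into the feature and the extracted rule would look better than anything the trader could have known at entry.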

The scanner uses deterministic OHLCV feature evaluation, replacing the old calendar-phase stub (PR 2026-05-16). The scanner is the Hypothesis step: it tests the rules against a feature space with a falsification condition (the rule fires or it does not).

The backtester runs the rules across multiple markets. Pull request #316 (2026-06-27) added a PRICE_FEATURES contract that the extractor and codegen share, with four-decimal return bounds — preventing rule/codegen drift and precision loss on prior_5d_return. The backtester is the Signal Engine step: it produces a hash-verified artifact per market.
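One way to picture a shared contract is a single table that both sides must pass through before a rule bound is stored or rendered. Everything about the shape below is hypothetical: the real `PRICE_FEATURES` layout is not shown in this chapter, and only the four-decimal precision for returns comes from the text.

```python
# Hypothetical contract shape; only the four-decimal return precision is from the text.
PRICE_FEATURES = {
    "entry_rsi14": {"decimals": 2, "min": 0.0, "max": 100.0},
    "prior_5d_return": {"decimals": 4, "min": -1.0, "max": None},
}


def normalize_bound(feature: str, value: float) -> float:
    """Round a rule bound to the contract's precision and check its range."""
    spec = PRICE_FEATURES[feature]  # KeyError for features outside the contract
    rounded = round(value, spec["decimals"])
    low, high = spec["min"], spec["max"]
    if (low is not None and rounded < low) or (high is not None and rounded > high):
        raise ValueError(f"{feature} bound {rounded} is outside the contract range")
    return rounded


def render_condition(feature: str, op: str, value: float) -> str:
    """Codegen side: emit the condition using the same normalized bound as the extractor."""
    return f"{feature} {op} {normalize_bound(feature, value)!r}"


print(normalize_bound("prior_5d_return", 0.0312649))        # 0.0313
print(render_condition("prior_5d_return", ">", 0.0312649))  # prior_5d_return > 0.0313
```

Because the extractor and the code generator call the same function, a bound cannot be rounded one way when the rule is stored and another way when the strategy is emitted.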

The codegen generates runnable strategy code from the extracted rules. The codegen is the Backtest step: it produces a strategy module that can be re-run, re-hashed, and re-attributed.

The reporter renders an eight-section HTML/PDF report. The reporter is the Attribution step: it produces the diagnostic the user sees, with winners/losers, beta, regime, and permutation analysis gated by data availability.

flowchart LR
    J["Broker CSV<br/>trader journal"]
    EX["Extractor<br/>3-5 if-then rules"]
    SC["Scanner<br/>deterministic OHLCV features"]
    BT["Backtester<br/>A-share · HK · US · crypto"]
    CG["Codegen<br/>runnable strategy module"]
    RP["Reporter<br/>8-section HTML/PDF"]

    J --> EX --> SC --> BT --> CG --> RP
    RP -.delta-PnL report.-> U["Researcher"]

The Shadow Account loop is what a retail researcher touches. The full Goal → Hypothesis → Signal → Backtest → Attribution loop is what a quant desk touches. The shape is the same. The artifacts are the same. The principle is the same.

What this looks like in code

The Goal runtime lives in agent/src/goal/. The Hypothesis registry lives in agent/src/hypotheses/. The Signal Engine pre-flight is in agent/src/backtest/validation.py. The Run Card writer is in agent/src/backtest/run_card.py. The Layered Attribution is in agent/src/backtest/attribution.py. Each module is small. Each module is independently testable. Each module produces an artifact that the next module consumes.

The artifacts are linked by hash. A run_card.json contains the config_hash of the configuration, the strategy_hash of the strategy, the data_hash of the input data, and the attribution_hash of the attribution result. A researcher who wants to verify a backtest can fetch the run card, verify the hashes, re-run the backtest, and compare. The system supports reproducibility as a first-class property, not as an afterthought.
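Verification then reduces to recomputing each hash and comparing. The four field names below come from the paragraph above; hashing every artifact as sorted-key JSON with SHA-256 is an assumption made to keep the sketch short, and `write_run_card` and `verify_run_card` are hypothetical helpers.

```python
import hashlib
import json

HASH_FIELDS = {  # run-card field -> the artifact it is assumed to cover
    "config_hash": "config",
    "strategy_hash": "strategy",
    "data_hash": "data",
    "attribution_hash": "attribution",
}


def sha256_json(obj) -> str:
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


def write_run_card(artifacts: dict) -> dict:
    return {field: sha256_json(artifacts[name]) for field, name in HASH_FIELDS.items()}


def verify_run_card(card: dict, artifacts: dict) -> list:
    """Return the hash fields whose recomputed value differs from the card."""
    return [f for f, name in HASH_FIELDS.items() if card.get(f) != sha256_json(artifacts[name])]


artifacts = {
    "config": {"symbol": "NVDA", "window": 5},
    "strategy": "def generate(data): ...",
    "data": [[1, 100.0], [2, 101.5]],
    "attribution": {"beta": 0.4},
}
card = write_run_card(artifacts)
print(verify_run_card(card, artifacts))  # []: everything matches

artifacts["config"]["window"] = 6        # someone re-ran with a different window
print(verify_run_card(card, artifacts))  # ['config_hash']
```

The mismatch list tells the researcher which input diverged, which is the starting point for explaining a different result.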

The shadow_account pipeline links the same way. The PRICE_FEATURES contract (PR #316) is the shared type that the extractor and codegen both speak. A rule extracted from a trader's journal is serialized against that contract, hashed, passed to the scanner, hashed, passed to the backtester, hashed. The chain of hashes is the chain of provenance. A break in the chain is a break in the workflow; the system surfaces the break as an actionable error.
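A chain of hashes can be sketched as records that each commit to their own content and to the previous record's hash. The record layout and the helper names here are assumptions for illustration; the pipeline's actual serialization is not shown in this chapter.

```python
import hashlib
import json


def _digest(stage, payload, parent) -> str:
    body = json.dumps({"stage": stage, "payload": payload, "parent": parent}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_link(chain: list, stage: str, payload: dict) -> None:
    """Add a record that commits to its payload and to the previous record's hash."""
    parent = chain[-1]["hash"] if chain else None
    chain.append({"stage": stage, "payload": payload, "parent": parent,
                  "hash": _digest(stage, payload, parent)})


def first_break(chain: list):
    """Index of the first record whose hash or parent pointer fails, else None."""
    expected_parent = None
    for index, rec in enumerate(chain):
        recomputed = _digest(rec["stage"], rec["payload"], rec["parent"])
        if rec["parent"] != expected_parent or recomputed != rec["hash"]:
            return index
        expected_parent = rec["hash"]
    return None


chain = []
append_link(chain, "extractor", {"feature": "prior_5d_return", "op": ">", "value": 0.0313})
append_link(chain, "scanner", {"rule_fired": True})
append_link(chain, "backtester", {"market": "US", "delta_pnl": 0.012})
print(first_break(chain))  # None: every link verifies

chain[1]["payload"]["rule_fired"] = False
print(first_break(chain))  # 1: the scanner record no longer matches its hash
```

Reporting the index of the first failing record is what turns a broken chain into an actionable error: it names the stage where provenance was lost.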

I came into this codebase assuming research workflow design was about UX — give the agent more tools, make the chat prettier, ship a dashboard. The artifact trail changed my mind. The five-step shape is not UX. It is reproducibility. Every step must produce a hash-verified, content-recoverable artifact, and the artifacts must link. That is the requirement. The chat is the surface. The hashes are the substrate.

The five-layer cycle, closed

In E00 I drew the five layers as a vertical stack — data, model, broker, research workflow, audit — with a loop arrow from audit back to data. I want to revisit that diagram now, because the loop is the point.

flowchart TB
    L1["Layer 1 — Data<br/>loader registry · sanity check · cache · fallback chains"]
    L2["Layer 2 — Model<br/>capability-per-provider · stream isolation · empty-response surfacing"]
    L3["Layer 3 — Broker<br/>mandate · kill switch · fail-closed gate · audit ledger · structural per-broker guard"]
    L4["Layer 4 — Research workflow<br/>Goal → Hypothesis → Signal → Backtest → Attribution"]
    L5["Layer 5 — Audit trail<br/>run cards · config_hash · strategy_hash · session fsync · call_id correlation"]

    L1 --> L2 --> L3 --> L4 --> L5
    L5 -.every artifact becomes next-run input.-> L1

The loop is not decorative. The audit trail from a finished backtest becomes the input to the next research goal: a validated hypothesis, a remembered PnL, an attribution result that the next agent will reference when it drafts the next claim. The swarm workers (E01) feed the data layer; the data layer feeds the model layer (E02); the model layer's outputs feed the broker layer (E03); the broker layer's decisions feed the research workflow (this chapter); the research workflow's artifacts feed the audit layer; and the audit layer's outputs feed the data layer, completing the cycle.

The five layers are not a stack. They are a cycle. The cycle is the product.

Three cognitive shifts, one decision table

This series opened with a question: *how do you earn the right to connect an LLM to your financial life?* I want to close with three shifts in how I think about the answer, and a decision table for the reader who has to make this choice in their own system.

Shift 1: The bug is not the model. I started this series assuming the hard part of an LLM-trading system was the model. The evidence pushed me toward "the bug is the data layer." Then "the bug is the provider shim." Then "the bug is the absence of bounded autonomy." Then "the bug is the absence of an artifact trail." The model is the surface. The architecture is the substrate. You fix substrate problems at the substrate layer.

Shift 2: The shim is the bug. A shim that handles multiple providers, multiple brokers, or multiple data sources with one piece of code is a single point of cross-contamination. The fix is never a smarter shim. The fix is to delete the shim and let each system's behavior be its own concern, with its own tests, its own overrides, its own failure modes. "Default" is a stance, not a fact — make the stance explicit or it will leak.

Shift 3: The trust boundary does not end at the broker. The same five-element pattern that defends a broker account defends a research conclusion, a backtest artifact, a hypothesis claim. The pattern is portable. The vocabulary is portable. The enforcement style is portable. If you swap "broker" for "production database," "compliance team," or "customer-facing action," the same five elements apply. The five elements were not added to make the system safe. They were specified to make the system possible.

| If your agent touches… | The five elements are… |
|---|---|
| A broker API | Mandate, kill switch, pre-trade gate, audit ledger, auto-expiry runner |
| A research artifact | Goal, falsification rule, signal-engine pre-flight, run-card audit, attribution gate |
| A production database | Mandate (read-only / row-level scopes), kill switch (query cancel), pre-flight (SQL plan validation), audit ledger (query log), runner (auto-expire token) |
| A customer-facing action | Mandate (rate limit / persona scope), kill switch (feature flag), pre-flight (policy check), audit ledger (action log), runner (session timeout) |

The table is the through-line. The pattern is the product. The architecture is the policy.

The narrower claim

I am arguing that for any LLM agent that produces artifacts consumed by humans or downstream systems, the research workflow must be shaped like a hypothesis, not like a transaction. Every step must produce a hash-verified, content-recoverable artifact. The artifacts must link. The links must be queryable. The query results must invalidate the upstream claims when they no longer hold.

I am not arguing that every workflow needs all five elements. A workflow that does not produce artifacts (a chat reply, a one-shot classification) does not need a hypothesis registry. A workflow that does not have consequences (a research summary for personal use) does not need a bounded-autonomy pattern. The principle is "match the architecture to the blast radius," not "apply the pattern everywhere."

I am also not arguing that this pattern is unique to Vibe-Trading. The clinical-trial shape shows up in any rigorous scientific workflow. The five-element bounded-autonomy pattern shows up in any safety-critical control system. The contribution of this codebase is the *consistent application* across both layers, with the same vocabulary, with the same enforcement style, with the same level of detail. The pattern is a habit, not a feature.

Closing the loop

I opened this series with a researcher running a fifty-worker investment committee on NVDA, and the swarm workers hallucinating prices from fifty different ad-hoc data fetches. The root cause was not a bad model. It was not a slow broker. It was the absence of a chokepoint. The bug moved downward, and the fix was at the bottom.

I close the series with a researcher uploading a broker CSV to a Shadow Account endpoint, and the system producing a delta-PnL report across four markets with five-axis attribution. The workflow is shaped like a clinical trial. The artifacts are hash-linked. The results are reproducible. The conclusions are falsifiable.

The two scenes are not different systems. They are the same system, looked at from two ends. The first scene is the bottom of the cycle: data flowing in, being validated by the loader registry, being consumed by the swarm. The second scene is the top of the cycle: research flowing in, being shaped by the Goal, producing hash-linked artifacts, looping back to validate the next round of hypotheses. The cycle is the product. The product is the cycle.

Five layers, each assuming the layer outside it can fail. Data, model, broker, research workflow, audit. The pattern is portable. The pattern scales. The pattern is a habit. The pattern is what it looks like when an agent earns the broker.

---

References:

source/vibe-trading/agent/src/shadow_account/ — five-step Shadow Account pipeline
source/vibe-trading/agent/src/hypotheses/ — hypothesis registry with falsification rules
source/vibe-trading/agent/src/goal/ — Research Goal runtime (claims, acceptance, evidence, budget, completion)
source/vibe-trading/agent/src/backtest/ — backtest engines, validation, run cards, attribution
PR 2026-05-12 — run cards (run_card.json + run_card.md) for reproducible research artifacts
PR 2026-05-16 — hypothesis registry with invalidation policy
PR 2026-05-24 — Research Goal runtime
PR 2026-05-26 — Web UI goal creation sends kickoff turn immediately
PR 2026-05-30 — signal-engine pre-flight interface validation (PR #149)
PR 2026-06-20 — Research Autopilot Phase 1 (PR scaffolding); Phase 2 (2026-06-20) closed the loop with scaffold_signal_engine + link_autopilot_backtest (PR #267)
PR 2026-06-21 — layered attribution: winners/losers, beta, regime, Monte Carlo permutation (PR #280)
PR 2026-06-24 — PIT-safe Shadow Account entry context (PR #302)
PR 2026-06-25 — strict JSON validation normalization (PR #306)
PR 2026-06-27 — PRICE_FEATURES contract shared between extractor and codegen (PR #316)

---

A research workflow that produces a transaction is forgettable. A research workflow that produces an artifact you can inspect, hash, and re-run is *citable*. Vibe-Trading chose the second kind. The five layers are how it earns the right to keep choosing it, run after run, claim after claim, artifact after artifact. The trust is the product. The cycle is the trust.