HKUDS/Vibe-Trading / Chapter 2

The Loader Registry: How 18 Data Sources Stopped Fighting Each Other

When one tool call can return eighteen different answers for the same symbol, the only safe move is a single chokepoint — and centralizing it eliminates an entire class of bugs.

Key Takeaways

  • The bug moves downward. A swarm worker's hallucinated price was a symptom; the cause was fifty uncentralized data fetches; the fix was the loader registry at the bottom.
  • A loader registry is not a list of sources. It is a contract: OHLC sanity at the boundary, fallback chains by IP-ban risk, cache with staleness guard, PIT-safe fundamental enrichment.
  • Eighteen sources exist not because the author is indecisive but because each market has different failure modes — China A-share has IP-ban risk; US has data fragmentation; crypto has venue fragmentation. One chokepoint absorbs all of that.
  • Centralizing market data was the precondition for everything else in this series. Without it, the provider capability layer (next chapter) would have inherited the same fifty-shapes-of-truth problem.

---

Here is a function call that returned the wrong answer without throwing an exception. In the Vibe-Trading source tree under agent/src/market_data.py, the Tushare loader exposes a method called daily() that, for an A-share ETF like 510300.SH, returns an empty DataFrame. No error. No warning. Just an empty frame that the calling code interprets as "no trading happened on those days." The trading day did happen. The loader is wrong.

This is the canonical shape of every data-layer bug I want to discuss in this chapter. The bug is invisible from the call site. It looks like a market that did nothing, when in fact the data source simply did not know how to ask for that instrument. The fix landed on 2026-06-26 in pull request #315, and it is worth reading for its parsimony: ETFs route to fund_daily(), indices to index_daily(), HK equities to hk_daily(). The bug existed for months because nobody had built the routing layer that the call site assumed was there.
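The routing described above can be sketched as a small lookup that fails loudly instead of returning an empty frame. This is a minimal illustration, not the code from pull request #315: the endpoint names come from the text, while the asset-class labels, the `route_endpoint` helper, and the choice to map plain stocks to `daily` are assumptions made for the example.

```python
# Endpoint names are the ones mentioned in the text. The asset-class labels
# and this helper are hypothetical and exist only for illustration.
ENDPOINT_BY_ASSET_CLASS = {
    "stock": "daily",         # assumed default for ordinary A-share equities
    "etf": "fund_daily",
    "index": "index_daily",
    "hk_equity": "hk_daily",
}


def route_endpoint(asset_class: str) -> str:
    """Return the endpoint name for an asset class, or raise if none is known."""
    try:
        return ENDPOINT_BY_ASSET_CLASS[asset_class]
    except KeyError:
        # Raising here is the point: an unknown instrument type should be an
        # error at the loader, not an empty result that looks like a quiet market.
        raise ValueError(
            f"no endpoint registered for asset class {asset_class!r}"
        ) from None
```

With a table like this, asking for an ETF resolves to `fund_daily`, and an instrument type the loader does not understand surfaces as an exception at the boundary.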

The lesson is not "fix the tushare loader." The lesson is: a market-data call site should not be choosing between sources, choosing between endpoints, or choosing between fallback chains. It should be calling one chokepoint. When that chokepoint does not exist, every consumer of market data reimplements it badly, in its own shape. The NVDA incident I opened the series with — fifty swarm workers writing fifty yfinance snippets — was the architectural symptom of the same disease.

Imagine you are operating a quant research platform. Your agent loop calls get_market_data("NVDA", days=30). Behind that single call, the loader registry runs an ordered decision tree: which market? which source is best for that market today? which fallback if that source fails? what is the cache? what does OHLC integrity look like at the boundary? Each of those questions is non-trivial. The right answer is to ask them once, in one place, and to make every consumer — backtests, swarm workers, the Web UI, the MCP server — go through the same gate.

Vibe-Trading does exactly this, and the consequences ripple outward. I want to walk through the design in three steps: the contract the loader enforces, the fallback chains it picks from, and the integrity checks that sit at the boundary. Then I want to show why centralizing this was the precondition for the swarm-grounding fix in PR #199 — because without the registry, you cannot give fifty workers a "use this one tool" instruction that actually means anything.

The contract

Every loader in the registry conforms to a fixed shape: (symbol, start, end, fields=None) -> DataFrame with normalized OHLCV columns, point-in-time-safe enrichment, and strict JSON serialization. Two non-obvious rules govern that shape.
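One way to picture the fixed shape is a registry that wraps every loader in the same column check. The sketch below is an assumption-heavy illustration: it uses a list of dicts as a dependency-free stand-in for a DataFrame, and `register_loader`, `LOADERS`, and `demo_loader` are hypothetical names, not identifiers from the Vibe-Trading source.

```python
from datetime import date
from typing import Callable, Dict, List, Optional, Sequence

Bars = List[dict]  # stand-in for a DataFrame, to keep the sketch stdlib-only
LoaderFn = Callable[[str, date, date, Optional[Sequence[str]]], Bars]

OHLCV_COLUMNS = ("open", "high", "low", "close", "volume")
LOADERS: Dict[str, LoaderFn] = {}  # hypothetical registry


def register_loader(name: str) -> Callable[[LoaderFn], LoaderFn]:
    """Register a loader and enforce the normalized OHLCV columns on its output."""
    def decorator(fn: LoaderFn) -> LoaderFn:
        def checked(symbol, start, end, fields=None):
            bars = fn(symbol, start, end, fields)
            for bar in bars:
                missing = [c for c in OHLCV_COLUMNS if c not in bar]
                if missing:
                    raise ValueError(f"{name} returned a bar missing {missing}")
            return bars
        LOADERS[name] = checked
        return checked
    return decorator


@register_loader("demo")
def demo_loader(symbol, start, end, fields=None):
    # A fake source that returns one well-formed bar.
    return [{"date": start, "open": 1.0, "high": 2.0,
             "low": 0.5, "close": 1.5, "volume": 100}]
```

Every consumer then calls `LOADERS[name](symbol, start, end)` and can rely on the same columns regardless of which source produced them.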

First, the OHLC sanity check sits at the boundary, not in the consumer. Pull request #274, merged 2026-06-20, drops dirty bars (high < low, non-positive prices, bad bracketing) centrally at the loader exit. Before that PR, every consumer had its own definition of "dirty" — a backtest engine might tolerate high == low, a portfolio simulator might not — and the disagreement surfaced as bugs that looked like model failures. Centralizing the check made the contract enforceable: if a row reaches the agent, it has already been vetted.

Second, non-finite floats serialize as null, not as the bare NaN token, which is not valid JSON. This sounds pedantic until you watch a strict-mode parser downstream reject the payload. Pull request #238 made run-card payloads strict-JSON-clean; pull request #306 made validation JSON sanitize nested NaN/Infinity. The pattern is consistent: every exit point from the loader must produce JSON that survives a strict parser, because the loader's consumers include the Web UI, which speaks strict JSON to the browser, and the swarm workers, which pipe results through tool-call argument previews.
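A minimal sketch of that rule using only the Python standard library is shown below. The helper names `sanitize` and `dumps_strict` are hypothetical; the only library behavior relied on is that `json.dumps(..., allow_nan=False)` raises `ValueError` on non-finite floats, which acts as a safety net if the sanitizer misses anything.

```python
import json
import math


def sanitize(value):
    """Recursively replace NaN and +/-Infinity with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


def dumps_strict(payload) -> str:
    """Serialize to JSON that a strict parser will accept."""
    # allow_nan=False makes json.dumps raise instead of emitting NaN/Infinity.
    return json.dumps(sanitize(payload), allow_nan=False)
```

For example, `dumps_strict({"close": float("nan")})` yields a document in which `close` is `null`, so a strict parser in the browser or in a tool-call preview can read it back.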

The contract also covers point-in-time safety. When a backtest asks for "the income statement as of yesterday," the loader must not silently use a restated quarterly figure that was only published this morning. Pull request #302 (2026-06-24) made Shadow Account rule extraction see PIT-safe entry context — entry_rsi14 and prior_5d_return fetched through the loader registry as of buy_dt. Pull request #76 (2026-05-08) added fundamental_fields with the same discipline. The contract is the same: the loader knows what "as of" means, and the consumers do not have to argue about it.
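The "as of" rule can be illustrated with a filter on publication date. This is a sketch under stated assumptions: the `Filing` record, its fields, and `latest_as_of` are hypothetical and are not the `fundamental_fields` implementation from the repository.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Filing:
    period_end: date   # fiscal period the figure describes
    published: date    # date the figure became publicly known
    revenue: float


def latest_as_of(filings: Iterable[Filing], as_of: date) -> Optional[Filing]:
    """Return the newest figure that was already published on `as_of`."""
    visible = [f for f in filings if f.published <= as_of]
    if not visible:
        return None
    # Prefer the most recent period; within a period, the latest revision
    # that was visible at the time.
    return max(visible, key=lambda f: (f.period_end, f.published))
```

A restatement published on 2026-06-25 is invisible to a query as of 2026-06-24, so a backtest run for that date sees the originally reported figure.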

```mermaid
flowchart LR
    C["Consumer<br/>(backtest · swarm worker · web UI · MCP tool)"]
    R["Loader registry<br/>(single chokepoint)"]
    OHLC["OHLC sanity gate<br/>PR #274"]
    PIT["PIT-safe enrichment<br/>PR #302, #76"]
    STRICT["Strict JSON serialization<br/>PR #238, #306"]
    CACHE["Opt-in local cache<br/>PR #177"]
    SRC["18 sources<br/>tencent · mootdx · eastmoney · baostock · akshare · tushare · yahoo · sina · stooq · yfinance · finnhub · alphavantage · tiingo · fmp · okx · ccxt · futu · local"]

    C -->|get_market_data| R
    R --> OHLC --> PIT --> STRICT --> CACHE
    CACHE -.miss.-> SRC
    SRC -.normalized bars.-> CACHE
```

The fallback chains

Eighteen sources is not a brag. It is a fact about markets. China A-shares have IP-ban risk on the rich providers. US equities have data fragmentation across brokerages, regulators, and aggregators. HK equities live on a different trading calendar and use a different currency. Crypto lives on hundreds of venues with different bars and conventions. A loader that knows only one source for each market is a loader that will be down every time that source has a bad day.

The chains are ordered by IP-ban risk, not by data quality. This is a counterintuitive choice that pays off in production. The A-share chain reads tencent · mootdx · eastmoney · baostock · akshare · tushare · local. The first two never IP-ban because they speak the 通达信 (TDX) TCP protocol directly; the next two throttle; tushare, the last network source, is token-gated. The US chain is yahoo · stooq · sina · eastmoney · yfinance · tiingo · fmp · finnhub · alphavantage · akshare · local. Crypto is okx · ccxt · yfinance · local. The HK chain is eastmoney · yahoo · futu · yfinance · akshare · local. Each chain is a story about a specific market's failure modes.
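Walking a chain can be sketched as a loop that tries each source in order and moves on when one fails or returns nothing. The chain contents below are copied from the text; the market keys, the `fetchers` mapping, `SourceError`, and `fetch_with_fallback` are hypothetical names for illustration and say nothing about how the real registry is structured.

```python
from typing import Callable, Dict, List, Tuple

# Source order as listed in the text; the dictionary keys are assumptions.
CHAINS: Dict[str, List[str]] = {
    "a_share": ["tencent", "mootdx", "eastmoney", "baostock",
                "akshare", "tushare", "local"],
    "us": ["yahoo", "stooq", "sina", "eastmoney", "yfinance", "tiingo",
           "fmp", "finnhub", "alphavantage", "akshare", "local"],
    "crypto": ["okx", "ccxt", "yfinance", "local"],
    "hk": ["eastmoney", "yahoo", "futu", "yfinance", "akshare", "local"],
}


class SourceError(Exception):
    """Raised when every source in a chain failed or returned no bars."""


def fetch_with_fallback(
    market: str,
    symbol: str,
    fetchers: Dict[str, Callable[[str], list]],
) -> Tuple[str, list]:
    """Try each source in chain order; return (source_name, bars) on first success."""
    errors = []
    for name in CHAINS[market]:
        fetch = fetchers.get(name)
        if fetch is None:
            continue  # source not configured in this environment
        try:
            bars = fetch(symbol)
        except Exception as exc:  # a failed source must not abort the walk
            errors.append(f"{name}: {exc}")
            continue
        if bars:
            return name, bars
        errors.append(f"{name}: empty result")
    raise SourceError(f"all sources failed for {symbol}: {errors}")
```

Treating an empty result as a failure is a design choice in this sketch: it keeps a source that does not know the instrument from masquerading as a market with no trades.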

Three subtle rules govern the walk. First, an explicit local: symbol never silently falls back to a network source. This is the kind of clause that reads paranoid in a spec doc and prevents incidents in production. Second, a staleness guard never caches a range ending today, because today's last bar is still forming. Third, cached frames round-trip byte-identical to freshly fetched ones. Pull request #177 verifies this property in CI, so a cache hit is observably indistinguishable from a network fetch, and every consumer gets the same downstream behavior whether the data came from disk or from the wire.
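The first two rules are small enough to sketch directly. The helper names `candidate_sources` and `is_cacheable` are hypothetical, and the sketch also assumes that a range ending after today is not cacheable either.

```python
from datetime import date
from typing import List, Sequence

LOCAL_PREFIX = "local:"


def candidate_sources(symbol: str, chain: Sequence[str]) -> List[str]:
    """An explicit local: symbol is pinned to the local source only."""
    if symbol.startswith(LOCAL_PREFIX):
        return ["local"]  # never widen to network sources
    return list(chain)


def is_cacheable(range_end: date, today: date) -> bool:
    """Only cache ranges that end strictly before today.

    A range ending today may include a bar that is still forming, so
    caching it would freeze a partial bar. Ranges ending in the future
    are treated the same way (an assumption of this sketch).
    """
    return range_end < today
```

Under these rules, a request for `local:my_series` either succeeds from disk or fails visibly, and a request whose range ends today is always served fresh.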

There are also two endpoint-correctness fixes that look tiny in isolation and matter enormously in production. The exclusive end boundary in yfinance used to drop the final requested trading day; pull request #226 made the download pass end + 1 day while keeping cache keys on the original range. Parsing of the CCXT_TIMEOUT_MS and OKX_TIMEOUT_S environment variables used to raise at import when a value was malformed; pull request #227 made it warn and fall back. These are not data-layer features; they are correctness fixes for the call-and-response surface. The pattern is the same as the OHLC sanity check: move the validation to the loader boundary so the consumer does not have to.
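Both fixes follow a pattern that can be sketched in a few lines. The helpers below (`download_window`, `cache_key`, `env_timeout`) are hypothetical names, the cache-key format is invented for the example, and the fallback values are passed in by the caller rather than taken from the project.

```python
import logging
import math
import os
from datetime import date, timedelta
from typing import Mapping, Optional, Tuple

logger = logging.getLogger(__name__)


def download_window(start: date, end: date) -> Tuple[date, date]:
    """Convert an inclusive [start, end] request into an exclusive-end window."""
    return start, end + timedelta(days=1)


def cache_key(symbol: str, start: date, end: date) -> str:
    """Key the cache on the range the caller asked for, not the widened window."""
    return f"{symbol}:{start.isoformat()}:{end.isoformat()}"


def env_timeout(
    name: str,
    default: float,
    environ: Optional[Mapping[str, str]] = None,
) -> float:
    """Read a timeout from the environment; warn and use the default if malformed."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        value = math.nan
    if not math.isfinite(value) or value <= 0:
        logger.warning("ignoring malformed %s=%r, using %s", name, raw, default)
        return default
    return value
```

Keeping the cache key on the original range means the end-date widening stays an internal detail of the one source that needs it.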

The integrity check, one more time

I want to circle back to the OHLC sanity check because it is the most important of these fixes. The check is dead simple: a row is dropped if high < low, if any price is non-positive, or if the bar is bracketed incorrectly (the open sits outside the high-low range, or the close sits outside it). Before pull request #274, this check did not exist centrally. Backtests had their own filtering, swarm workers had their own filtering, the Web UI had its own filtering — each defined "dirty" slightly differently.
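The three conditions translate almost directly into code. The sketch below is not the code from pull request #274: it operates on plain dicts rather than DataFrame rows, the function names are hypothetical, and it makes one explicit choice the text leaves open, namely that a flat bar with high == low is kept.

```python
from typing import Iterable, List


def is_clean_bar(bar: dict) -> bool:
    """Return True if the bar passes the three boundary checks."""
    o, h, l, c = bar["open"], bar["high"], bar["low"], bar["close"]
    # Non-positive prices are rejected.
    if o <= 0 or h <= 0 or l <= 0 or c <= 0:
        return False
    # high below low is rejected (high == low is allowed in this sketch).
    if h < l:
        return False
    # Open and close must sit inside the high-low range. Comparisons with
    # NaN are False, so bars with missing prices are rejected here too.
    return l <= o <= h and l <= c <= h


def drop_dirty_bars(bars: Iterable[dict]) -> List[dict]:
    """Keep only bars that pass is_clean_bar, preserving order."""
    return [bar for bar in bars if is_clean_bar(bar)]
```

Because the filter runs once at the loader exit, whatever definition of "dirty" it encodes is the one every consumer sees.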

Why does centralization matter? Two reasons. First, the cost of disagreement is invisible drift: a backtest that accepts high == low will run slightly differently from a swarm worker that does not, and the divergence compounds across a long-horizon backtest. Second, the cost of omission is invisible silence: a consumer that skips the check silently produces wrong numbers, and the wrong numbers become a downstream hallucination. The NVDA incident shows both costs at scale: the malformed bar was accepted by some workers and rejected by others, and the disagreement propagated up the swarm as inconsistent risk assessments.

Pull request #274 is two paragraphs of code. The architectural claim it makes is much larger: *integrity is a registry-level concern, not a consumer-level concern.* Once you accept that claim, every other consumer gets the check for free, forever, and the next malformed bar that arrives from any source will be dropped before it can mislead any agent.

What this enabled

Here is the part that surprised me. The loader registry was not built for the swarm. It was built, originally, as a clean abstraction over eighteen heterogeneous sources. But once it existed, the swarm-grounding fix in pull request #199 became a one-paragraph change instead of an architectural rewrite: every market-data preset in the swarm got the same get_market_data tool backed by the loader registry, with strict JSON, with non-finite floats serialized as null, and the prompt policy steered OHLCV work tool-first.

Before that PR, the prompt policy was a polite suggestion and the workers wrote their own scripts. After that PR, the workers had one tool, and using it was cheaper than writing their own, and the architectural claim of the registry — *call the chokepoint* — became an enforceable convention. The fix moved downward. The bug had been "swarm workers hallucinate prices." The cause had been "no shared data layer." The fix was at the bottom, where the loader registry enforces its contract.

```mermaid
flowchart TB
    subgraph Before["Before PR #199"]
        W1[Worker A<br/>ad-hoc yfinance]
        W2[Worker B<br/>ad-hoc yfinance]
        W3[Worker C<br/>ad-hoc yfinance]
        W1 -.different shape.-> H[Hallucinated consensus]
        W2 -.different shape.-> H
        W3 -.different shape.-> H
    end

    subgraph After["After PR #199"]
        WR1[Worker A]
        WR2[Worker B]
        WR3[Worker C]
        WR1 --> R["Loader registry<br/>single chokepoint"]
        WR2 --> R
        WR3 --> R
        R --> H2["Consistent inputs"]
    end
```

I came into this codebase assuming the loader registry was an implementation detail. It is not. It is the foundation of every claim about correctness that the rest of the system makes. When the data layer centralizes, the model layer can stop worrying about input hygiene. When the model layer can trust inputs, the broker layer can stop worrying about hallucinated prices triggering real trades. The trust chain is a vertical stack, and it starts at the bottom, with the loader.

The narrower claim

Let me be specific about what I am and am not arguing.

I am arguing that for any LLM agent that consumes external data, a single loader chokepoint with a strict contract is load-bearing. The contract should cover: source selection (per-market fallback chains ordered by a property like IP-ban risk, not by data quality), boundary validation (OHLC sanity, strict-JSON serialization, PIT safety), and observability (cache hit/miss, per-symbol warnings, partial-fetch detection). Every consumer goes through it. Every consumer gets the same answer for the same symbol on the same day.

I am not arguing that eighteen sources are always the right number. Vibe-Trading's count is specific to its markets — A-share, US, HK, crypto, futures, forex — and a system that only does US equities needs fewer. The principle is "one chokepoint with N sources behind it," not "eighteen."

I am also not arguing that centralization is free. The registry is the file the entire team touches when a new source is added, when an existing source changes its API, when a market's failure modes shift. Centralization is a coordination cost. But coordination cost is cheaper than fifty workers writing their own data layer.

The hardest part of the design, looking at it now, is the discipline to keep the chokepoint honest. Pull request #199 — the swarm-grounding fix — exists because somebody noticed that workers were bypassing the registry by writing ad-hoc scripts. The PR did not add new infrastructure; it removed a bypass. That is the kind of PR you only write once you have internalized the architectural claim.

The bug moves downward. You fix it at the lowest layer. And the lowest layer, in a system that consumes external data, is the loader.

---

References:

  • source/vibe-trading/agent/src/market_data.py — single normalized loader entry
  • PR #274 (2026-06-20) — central OHLC sanity check at the loader boundary
  • PR #177 (2026-06-04) — opt-in local data cache with staleness guard
  • PR #315 (2026-06-26) — tushare loader routes ETF/LOF/indices/HK to the correct endpoints
  • PR #199 (2026-06-11) — swarm workers ground through the loader registry
  • PR #302 (2026-06-24) — PIT-safe Shadow Account entry context
  • PR #226 (2026-06-14) — yfinance end-boundary fix
  • PR #227 (2026-06-14) — CCXT/OKX timeout env-var fallback
  • PR #306 (2026-06-25) — strict JSON validation normalization

---

The loader registry assumes the model handles a different kind of quirk — not dirty bars, but provider-specific protocol deviations. That is the next layer.