The OpenAI-Compatible Lie
"OpenAI-compatible" is not a contract. It is a sketch. Treating the sketch as a contract produced four completely different production failures from one shared shim — and the fix was to delete the shim.
Key Takeaways
- "OpenAI-compatible" describes wire format, not behavior. Wire-format compatibility is the easy part. Behavioral compatibility is the part that breaks production.
- A single shim that handles per-provider quirks globally cross-contaminates them. The DeepSeek hang fix broke Kimi. The Kimi workaround broke Gemini. The Gemini fix broke DeepSeek.
- The replacement is an explicit capability layer: each provider's quirk lives in its own file, gated to its own provider, never applied to any other.
- The same lesson applies to brokers, to data sources, to anything with heterogeneous behavior behind a homogeneous API. "Default" is a stance, not a fact — make the stance explicit or it will leak.
---
On the morning of 2026-06-12, four independent production issues landed on the Vibe-Trading issue tracker within ninety minutes of each other. DeepSeek runs were stuck on "Agent is working…" indefinitely (#208). Kimi rejected the client outright (#204). The UI never recovered after a stall (#195). And reached max iterations was masking empty model responses (#203). Four bugs. Four surfaces. Four different reproduction paths. One root cause, diagnosed in the same day, fixed in the same pull request: every OpenAI-compatible provider ran through a single shim that applied DeepSeek/Kimi/Gemini quirks globally and silently swallowed stream failures.
I want to walk through that day in detail because it is the cleanest case study I know for "wire-format compatibility is not behavioral compatibility." The lesson generalizes: anywhere you have a heterogeneous set of systems behind a homogeneous API, a single shim is a single point of cross-contamination. The fix is always to delete the shim and let each system's behavior be its own.
The single shim, in Vibe-Trading, was the OpenAI client wrapper used by LangChain. Every provider — DeepSeek, Kimi, Moonshot, Gemini via OpenAI-compatible, OpenRouter, GLM/Zhipu, Qwen, plus the obvious OpenAI and Anthropic paths — went through it. The shim did four things that turned out to be incompatible. It applied DeepSeek-specific reasoning-content replay to all providers (which broke Kimi). It set a single User-Agent header (which Kimi rejected). It captured Gemini thought signatures only on the in-memory path (which left dict-replayed history without the signature, breaking multi-turn tool calls). And it swallowed stream failures silently, falling back to a slow non-streaming call (which masked reached max iterations errors when the model returned nothing).
Each individual behavior was correct for the provider it was designed for. Each was wrong for every other provider. The bugs were not bugs in any single behavior — they were bugs in the architecture of "one shim handles all."
The four quirks, named
Let me name the four quirks explicitly, because they are general patterns that show up in any system with a homogeneous API and heterogeneous behavior behind it.
Quirk 1: The reasoning-content format is per-provider. DeepSeek, Kimi, and Qwen all support reasoning-mode outputs, but they emit the reasoning payload in different fields, with different replay rules, and with different policies on whether assistant-prefill handoff messages are accepted. A shim that captures reasoning as one canonical format will get two of three providers wrong. The fix, in Vibe-Trading, is reasoning capture and replay gated per-provider — Kimi's path is verified end-to-end against the live API on kimi-k2.6, with tool calls and strict multi-turn reasoning replay, but the same code path does not run for DeepSeek.
Quirk 2: The User-Agent header is a handshake. Some providers fingerprint the client by User-Agent and reject unknown strings. Kimi does this; DeepSeek does not. A shim that sets one header for all providers will be accepted by one and rejected by the other. The fix is a per-provider User-Agent override — MOONSHOT_USER_AGENT is exposed as an env var specifically because the Kimi path needs a different default than the OpenAI path.
Quirk 3: Per-call signatures are per-provider. Gemini 2.5 and 3.x attach a thoughtSignature to each tool call, and the signature must round-trip on the next request or multi-turn tool calling fails with INVALID_ARGUMENT. The signature is preserved on the in-memory path but dropped when the agent loop replays history as OpenAI-format dicts through LangChain. A shim that handles the in-memory case will miss the dict-replay case. The fix lives in a single _convert_input chokepoint where both invoke and stream pass through; the signature is re-attached at that chokepoint, including for parallel calls where only the first of N is signed (pull requests #176 and #184, 2026-06-05 and 2026-06-08).
Quirk 4: Stream-failure handling is per-provider. Some providers return transient connection resets that should be retried; others return deterministic 4xx errors that should fail fast. A shim that retries all stream failures will hammer an already-rejected request. A shim that fails fast on all stream failures will give up on a recoverable one. The fix is a contextual provider_stream_error exception with one automatic retry for transient resets, deterministic 4xx fail-fast on the rest, surfaced to the user as a meaningful message instead of "max iterations."
flowchart TB
subgraph Before["Before 2026-06-12 — single shim"]
REQ["Stream / invoke request"]
SHIM["OpenAI-compatible shim<br/>applies ALL quirks globally"]
D["DeepSeek<br/>❌ contaminated by Kimi/Gemini behavior"]
K["Kimi<br/>❌ rejected by wrong User-Agent"]
G["Gemini<br/>❌ thought_signature lost in dict replay"]
end
subgraph After["After 2026-06-12 — capability layer"]
REQ2["Stream / invoke request"]
CAP["_convert_input chokepoint<br/>provider-routed capabilities"]
D2["DeepSeek path<br/>native adapter opt-in"]
K2["Kimi path<br/>MOONSHOT_USER_AGENT override<br/>kimi-k2.x temp=1"]
G2["Gemini path<br/>thought_signature round-trip"]
O["OpenAI / Anthropic / OpenRouter / GLM<br/>each on its own capability"]
end
REQ --> SHIM --> D
REQ --> SHIM --> K
REQ --> SHIM --> G
REQ2 --> CAP
CAP --> D2
CAP --> K2
CAP --> G2
CAP --> O
"Default" is a stance
Here is the part that, once you see it, you cannot unsee. Every shim encodes a *default*. The default is some provider's behavior, applied to all providers as the baseline. DeepSeek was the default in the Vibe-Trading shim because it was the most heavily used. So Kimi got DeepSeek's reasoning replay (wrong), Gemini got DeepSeek's User-Agent (rejected), and OpenAI got DeepSeek's stream-failure retry policy (sometimes too aggressive). The bugs were not bugs in DeepSeek's behavior. They were bugs in DeepSeek's behavior *being the default for everyone else*.
The capability layer makes the stance explicit. There is no default. Each provider has its own adapter file, gated to its own provider, with no global flag that affects anything else. The cost is duplication — the same invoke method exists in seven slightly different shapes — but the benefit is that a fix to one provider cannot break another. The reproduction cases for the four production bugs above now live in separate test files, gated to their own provider, so a future change to the DeepSeek path cannot regress Kimi because the Kimi tests do not run the DeepSeek code.
I came into this codebase assuming "OpenAI-compatible" meant drop-in. The cluster of June 12 incidents changed my mind. Wire-format compatibility is the easy part; any provider can match the JSON schema. Behavioral compatibility is the part that breaks production, and it is invisible until something goes wrong. The fix is not a smarter shim. The fix is to make each provider's behavior its own concern, with its own tests, its own overrides, and its own failure modes — and to surface that taxonomy to the user as a provider doctor CLI that prints a redacted provider/model/package/proxy snapshot for triage.
The reasoning indicator and the empty-response semantic
Two design decisions made by the capability-layer refactor deserve separate mention because they are not provider-quirk fixes; they are *semantic* fixes. The reasoning-only stream used to look like dead air — the model was thinking, but the UI showed nothing for thirty seconds. Now a live "Reasoning…" indicator renders while the model is reasoning-only, replacing dead air with a signal that the system is working.
The empty-response semantic is more important. Before the refactor, a model that returned an empty response was indistinguishable, from the agent loop's perspective, from a model that needed more iterations to reach a conclusion. Both surfaced as reached max iterations. The user could not tell whether the model was thinking hard and failing to converge, or whether the model had simply returned nothing on iteration one. The capability layer distinguishes: empty model response is empty_model_response, an actionable diagnostic; out-of-budget is max_iterations, a separate one. The agent loop mirrors the swarm worker's wrap-up nudge at 80% of the iteration budget so it never burns the whole 50-iteration budget into a failed status with no output (pull request #148, 2026-05-30) — and that nudge is gated to fire only mid-run so it never displaces research-goal context.
These are not "fix the bug" changes. They are "fix the semantic" changes. The behavior of the system is more legible to the user because the failure modes are named.
The narrower claim
I am arguing that for any LLM agent that talks to multiple providers behind a homogeneous API, the single-shim approach is a single point of cross-contamination. The replacement is an explicit per-provider capability layer: each provider has its own adapter file, its own quirks file, its own test file. There is no global flag that affects multiple providers. The cost is duplication. The benefit is that fixes are local.
I am not arguing that all providers deserve equal treatment. DeepSeek, OpenAI, and Anthropic are first-class; smaller providers have thinner coverage; experimental providers ship behind feature flags. The principle is "explicit per-provider gating," not "equal per-provider effort."
I am also not arguing that wire-format compatibility is bad. It is the only reason an ecosystem like this can exist. The point is that wire-format compatibility is the floor, not the ceiling. Once you reach the floor, every other behavioral assumption — User-Agent, retry policy, reasoning format, signature round-trip, stream-failure handling, empty-response semantic — is per-provider. Pretending otherwise is the bug.
What this looks like in practice
The practical surface of the capability layer is small. There is a _convert_input chokepoint where every provider-quirk-relevant transformation happens — reasoning capture, signature round-trip, parallel-call handling. There is a provider_stream_error exception that carries enough context for the agent loop to decide retry-or-fail-fast. There is an empty_model_response exception that distinguishes a model that returned nothing from a model that ran out of iterations. There is vibe-trading provider doctor that prints the redacted snapshot. There are per-provider test files that gate regressions to the right provider.
The architectural surface is much larger. The capability layer is the answer to "what does it mean for a provider to be supported?" in this codebase. A provider is supported if its quirks are explicitly modeled and explicitly tested. A provider is *not* supported if it works because the global shim happens to handle its case. The latter is a coincidence, not a contract.
The same lesson applies to brokers, and we will get there in the next chapter. It applies to data sources, which is why the loader registry exists. It applies to swarm worker MCP servers, which is why pull request #142 pinned the trust boundary. Anywhere a system has heterogeneous behavior behind a homogeneous interface, "default" is a stance, and the stance must be explicit or it will leak.
The bug moves downward. The provider shim was a layer; below it, the bugs were per-provider. The fix is per-provider.
---
References:
source/vibe-trading/agent/src/providers/— capability layer files- PR 2026-06-12 — single overhaul: provider capability layer replacing the OpenAI-compatible shim
- Issue #208 — DeepSeek runs stuck on "Agent is working…"
- Issue #204 — Kimi rejecting client
- Issue #203 —
reached max iterationsmasking empty model responses - Issue #195 — UI not recovering after stall
- PR #176 (2026-06-05) and PR #184 (2026-06-08) — Gemini
thoughtSignatureround-trip - PR #248 (2026-06-17) — Kimi/Moonshot User-Agent override; Opus 4.8+ prefill handoff fix
- PR #148 (2026-05-30) — wrap-up nudge at 80% iteration budget
vibe-trading provider doctorCLI — redacted provider/model/package/proxy snapshot
---
But the model can be perfectly behaved — quirks handled, streams recovered, empty responses surfaced, signatures round-tripped — and still blow up the moment you connect it to a broker. The next layer is the broker layer, and the shape of the answer there is not a shim at all.