topoteretes/cognee / Chapter 5



What Beating RAG Actually Proves

BEAM is a memory benchmark, not a QA benchmark. The shift is in the test, not the score.

Beating the state of the art on a benchmark is not the headline. The headline is what *kind* of benchmark it is.

Cognee reports two numbers from the BEAM benchmark: 0.79 at the 100K-token setting and 0.67 at 10M tokens. Both are at least 2x the Obsidian/RAG baseline of ~0.33. Against the prior state of the art (0.735 and 0.641), the margins are 0.055 and 0.029 respectively. The numbers are real. The README is explicit that they are "directional, not definitive", but directional is the right kind of evidence here, because the benchmark itself is the point.
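The arithmetic behind these comparisons can be checked directly from the quoted figures. The sketch below is only a sanity check: the variable names are illustrative, and applying the same ~0.33 baseline to both token settings is an assumption, since the text quotes a single baseline.

```python
# Figures as quoted in the paragraph above; nothing here is measured independently.
scores = {"100K": 0.79, "10M": 0.67}
prior_sota = {"100K": 0.735, "10M": 0.641}
rag_baseline = 0.33  # assumption: the same ~0.33 baseline applies to both settings


def compare(setting: str) -> tuple[float, float]:
    """Return (ratio over the RAG baseline, absolute margin over the prior SOTA)."""
    ratio = scores[setting] / rag_baseline
    margin = scores[setting] - prior_sota[setting]
    return round(ratio, 2), round(margin, 3)


for setting in scores:
    ratio, margin = compare(setting)
    print(f"{setting}: {ratio}x baseline, +{margin} over prior SOTA")
```

On these inputs the ratio over the baseline is about 2.4x at 100K and about 2.0x at 10M, and the margins over the prior state of the art are 0.055 and 0.029.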

I started out skeptical of vendor benchmark numbers. The BEAM story, though, is shaped differently from typical QA benchmarks, and the shape of the test is what matters. BEAM is short for *Benchmark for Epistemic Agent Memory*. The README describes it as testing "whether a system can keep track of a long conversation as it changes", which it calls "a more useful test for agent memory than typical needle-in-a-haystack benchmarks." This is exactly the failure mode the customer-support agent exhibited on day fourteen, in chapter zero. The 0.79 score is a product of cognee's architecture; the test that produced the score is the evidence that the architecture is solving the right problem.

Key Takeaways

  • BEAM = Benchmark for Epistemic Agent Memory — long conversations that change, not static QA
  • The 0.79 / 0.67 numbers are at least 2x the Obsidian/RAG baseline and 0.055 / 0.029 above the prior SOTA: meaningful but directional
  • The 0.79 was achieved with default settings, no custom models, no BEAM-specific pipelines (per README.md:400-405)
  • Session memory is the cache layer; permanent memory is the graph. cognee.improve() (or self_improvement=True on remember()) is the explicit bridge
  • "Memory is not context" — the architectural shift is the thesis. Cognee is the engineering embodiment of the shift
  • For the engineering reader: the action is to (1) try cognee-cli -ui for a local playground, (2) run the BEAM-style eval yourself, (3) put ENABLE_BACKEND_ACCESS_CONTROL=True before any production deploy

The shape of BEAM

Standard QA and retrieval benchmarks such as HotpotQA and Natural Questions measure "find the right fact in a static corpus." Each item is a single query against a fixed document set, and the metric is accuracy, recall, or F1. These benchmarks are well suited to measuring a retriever in isolation.

BEAM is not that. A BEAM question is delivered in the context of a *running conversation* in which prior turns have established, modified, or contradicted facts. The test is whether a system can keep its answers consistent with the *current* state of the conversation, not the initial state. A BEAM question about a billing case might assume a refund was processed, then retracted, then re-issued — and the test system has to track which version of the fact is current.
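To make the test shape concrete, here is a minimal sketch of what an evolving-conversation item could look like. It is not BEAM's actual data format or scoring rule: `Turn`, `EvolvingCase`, and `score` are hypothetical names, and exact-match scoring is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One conversation turn that sets or revises a fact."""
    index: int
    key: str    # which fact the turn touches, e.g. "refund_status"
    value: str  # the value asserted at this turn


@dataclass
class EvolvingCase:
    """A hypothetical item: a conversation plus a question about one fact."""
    turns: list[Turn]
    question_key: str

    def _relevant(self) -> list[Turn]:
        return [t for t in self.turns if t.key == self.question_key]

    def current_value(self) -> str:
        # The expected answer comes from the latest turn touching the fact.
        return max(self._relevant(), key=lambda t: t.index).value

    def initial_value(self) -> str:
        # What a system would say if it only remembered the first mention.
        return min(self._relevant(), key=lambda t: t.index).value


def score(case: EvolvingCase, answer: str) -> float:
    """1.0 if the answer matches the current state, else 0.0 (assumed metric)."""
    return 1.0 if answer == case.current_value() else 0.0


case = EvolvingCase(
    turns=[
        Turn(3, "refund_status", "processed"),
        Turn(41, "refund_status", "retracted"),
        Turn(97, "refund_status", "re-issued"),
    ],
    question_key="refund_status",
)
```

A system that answers with `case.initial_value()` scores 0.0 on this item even though its answer was correct at turn 3. A static QA metric cannot express that distinction.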

| Benchmark | Test shape | What it measures |
|-----------|-----------|------------------|
| HotpotQA | Static multi-hop QA over Wikipedia | Retriever recall across documents |
| Natural Questions | Static QA over Wikipedia | Retriever + reader accuracy |
| BEAM | Conversation that evolves over turns | Memory: tracking *current* state of dynamic facts |

The 0.79 vs 0.33 RAG baseline tells you the gap. Standard RAG retrieves over a static document collection; when the conversation evolves, the retrieved chunks become stale. The RAG system answers the question it would have answered on turn one, not the question as it stands on turn one hundred. The graph-based memory in cognee, in contrast, encodes the *evolution* — every remember() call adds new facts, every improve() call reconciles them, and the graph stores the current state. The retriever queries the current state, not the historical record.
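The staleness argument can be illustrated with a toy comparison. This is a deliberately simplified sketch, not a model of any real RAG stack or of cognee: the "retriever" ranks by word overlap only, and the "state lookup" simply prefers the latest turn. All names are hypothetical.

```python
import re

# (turn index, text) pairs; the conversation revises the same fact three times.
chunks = [
    (3, "refund status for the billing case: processed"),
    (41, "refund status for the billing case: retracted"),
    (97, "refund status for the billing case: re-issued"),
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z\-]+", text.lower()))


def static_retrieve(query: str) -> str:
    """Toy similarity search: rank by word overlap only, with no notion of time."""
    q = tokens(query)
    # max() keeps the first of equally scored chunks, so the oldest one wins ties.
    _, best = max(chunks, key=lambda c: len(q & tokens(c[1])))
    return best


def current_state(fact_prefix: str) -> str:
    """Toy state lookup: among chunks about the fact, the latest turn wins."""
    matching = [c for c in chunks if c[1].startswith(fact_prefix)]
    _, latest = max(matching, key=lambda c: c[0])
    return latest


stale = static_retrieve("what is the refund status for the billing case")
fresh = current_state("refund status")
```

All three chunks match the query equally well, so a similarity-only ranking has no reason to prefer the newest one. Here it returns the turn-3 chunk, while the state lookup returns the turn-97 chunk.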

The architecture that won

Cognee's 0.79 was achieved with default settings, no custom models, no BEAM-specific pipelines. The architecture that won is the architecture you can read in this series. Recap of the load-bearing pieces, in the order they contribute to a BEAM-style question:

1. The ECL pipeline (chapter 1) extracts entities and relationships from each chunk via a per-chunk LLM call. The result is a graph where nodes are entities and edges are typed relationships with a one-sentence description in Edge.description. The graph captures the *current* state because the most recent remember() call has overwritten the relevant node's updated_at field.

2. The 17 search types (chapter 2) are the retrieval layer. For a BEAM-style question — "what does the user currently believe about the billing case?" — GRAPH_COMPLETION with triplet_distance_penalty=6.5 and feedback_influence weighting surfaces the triplets whose endpoints have the most recent updates. The brute-force triplet search is not just retrieval; it is retrieval *re-ranked by temporal recency*, which is the dimension BEAM tests.

3. The session→graph sync is the bridge that keeps the graph current. Each remember() call with session_id writes to the session cache (fast, ephemeral, 7-day TTL). When self_improvement=True (the default for permanent memory), cognee.improve() runs asynchronously, reconciling the session turn into the permanent graph. Without this sync, the graph would lag the session. The BEAM score would be lower.

4. The multi-tenant substrate (chapter 3) keeps the right answers accessible to the right user. The _GraphEngineHandle proxy re-resolves through the LRU on every access; the per-tenant ContextVar isolation ensures the retriever is querying the right dataset. A BEAM score over a multi-tenant deployment requires the per-dataset isolation to be correct; otherwise the answers would mix across tenants.

The 0.79 is the artifact of all four pieces working together. None of them is the headline by itself. The headline is that the system *has* all four pieces and that they compose.
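The idea of retrieval re-ranked by recency, from item 2 above, can be sketched in a few lines. The scoring formula, the `recency_weight` parameter, and the `Triplet` fields are assumptions made for illustration. This is not cognee's ranking code, and it does not claim to show how `triplet_distance_penalty` or `feedback_influence` behave.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Triplet:
    source: str
    relation: str
    target: str
    distance: float       # vector distance to the query; lower is closer
    updated_at: datetime  # last time either endpoint changed


def rerank(triplets: list[Triplet], now: datetime, recency_weight: float = 0.5) -> list[Triplet]:
    """Order triplets by distance plus an age penalty (hypothetical formula)."""
    oldest = min(t.updated_at for t in triplets)
    span = (now - oldest).total_seconds() or 1.0  # avoid division by zero

    def cost(t: Triplet) -> float:
        age_fraction = (now - t.updated_at).total_seconds() / span
        return t.distance + recency_weight * age_fraction

    return sorted(triplets, key=cost)


def utc(day: int) -> datetime:
    return datetime(2025, 1, day, tzinfo=timezone.utc)  # illustrative dates


candidates = [
    Triplet("billing_case", "refund_status", "processed", 0.20, utc(1)),
    Triplet("billing_case", "refund_status", "retracted", 0.21, utc(8)),
    Triplet("billing_case", "refund_status", "re-issued", 0.22, utc(14)),
]
ranked = rerank(candidates, now=utc(15))
by_distance_only = rerank(candidates, now=utc(15), recency_weight=0.0)
```

With the age penalty, the most recently updated triplet ranks first even though its raw distance is slightly worse. With `recency_weight=0.0` the order falls back to pure similarity and the oldest fact wins.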

Where memory actually lives

The key architectural insight, and the one the README is implicitly arguing, is that *memory is not context*. An LLM context window is a transient buffer; a knowledge graph is a substrate. The two have different lifecycles, different update semantics, and different query patterns.

flowchart LR
    A["Agent context window<br/>(8K-1M tokens)<br/>reset every session"] -->|"queries"| B["Session cache<br/>(Redis/FS/SQLite/Postgres)<br/>7-day TTL"]
    B -->|"cognee.improve()"| C["Permanent graph<br/>(Ladybug/Neo4j/Postgres/Neptune)<br/>+ vector store<br/>(LanceDB/PGVector/...)"]
    C -->|"GRAPH_COMPLETION<br/>brute-force triplet search"| D["Answer"]
    B -->|"recall(session_id)"| D
    A -->|"prompt stuffing"| E["Static RAG baseline<br/>(~0.33 on BEAM)"]
    E -->|"answer based on stale chunks"| D

A standard RAG system is the branch through the "Static RAG baseline" node in this diagram: it embeds documents, retrieves top-k chunks at query time, and stuffs them into the prompt. The chunks are static; the retrieval is one-shot; the answer is based on whatever the chunks say, regardless of whether they are current. This is the architecture the BEAM baseline of 0.33 measures.

Cognee is the path through the session cache and the permanent graph. The session cache is the fast write path: a session-scoped call such as remember("the user prefers X") writes to the cache in milliseconds, with no LLM call. The permanent graph is the slow write path: remember() followed by cognify() extracts entities, runs the LLM, and writes to the graph and the vector store. The two are bridged by cognee.improve(), which runs asynchronously and reconciles the session turn into the graph. A query to cognee.recall() first checks the session cache (recency), then the graph (completeness), then re-ranks and returns. The graph's temporal recency is the dimension BEAM tests; the cache's recency is the dimension standard RAG ignores.
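The two-tier lifecycle described above can be modelled in miniature. This is a toy, in-memory sketch under stated assumptions, not cognee's implementation. The class `TwoTierMemory`, its method signatures, and the plain dict standing in for the graph are all hypothetical; only the verb names and the 7-day TTL come from the text.

```python
import time
from typing import Optional


class TwoTierMemory:
    """Toy model of a session cache in front of a permanent store."""

    def __init__(self, ttl_seconds: float = 7 * 24 * 3600):
        self.ttl = ttl_seconds
        # session_id -> list of (timestamp, key, value)
        self.session: dict[str, list[tuple[float, str, str]]] = {}
        self.graph: dict[str, str] = {}  # stand-in for the permanent graph

    def remember(self, session_id: str, key: str, value: str,
                 now: Optional[float] = None) -> None:
        # Fast path: append to the session cache only; no extraction happens here.
        ts = time.time() if now is None else now
        self.session.setdefault(session_id, []).append((ts, key, value))

    def improve(self, session_id: str) -> None:
        # Bridge: fold session entries into the permanent store, oldest first,
        # so the latest value for each key is the one that remains.
        for _, key, value in sorted(self.session.get(session_id, [])):
            self.graph[key] = value

    def recall(self, session_id: str, key: str,
               now: Optional[float] = None) -> Optional[str]:
        # Unexpired session entries first (recency), then the store (completeness).
        ts_now = time.time() if now is None else now
        fresh = [(ts, v) for ts, k, v in self.session.get(session_id, [])
                 if k == key and ts_now - ts <= self.ttl]
        if fresh:
            return max(fresh)[1]
        return self.graph.get(key)
```

In this model, a fact that was remembered but never passed through `improve()` disappears once its session entry expires. That is the lag the session-to-graph bridge exists to prevent.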

The action frame

Imagine you are an engineering lead deciding whether to put cognee in front of your agent stack. The decision is not "is cognee the right product?" — it is "is the architecture the right substrate?" The answer to that question is empirical, and the empirical test is the action frame.

Step 1: local playground. cognee-cli -ui launches a local stack — the API on port 8000, the MCP server on port 8001, the Next.js UI on port 3000 — in Docker. The README flags that the MCP server requires Docker Desktop, Colima, or any OCI-compatible runtime. The local playground is the fastest way to feel the architecture. Add a few documents, cognify them, recall, and watch the search types. The FEELING_LUCKY search type is the most informative — it shows you what the LLM thinks your question is asking for.

Step 2: run the BEAM-style eval yourself. The eval framework is at evals/. The competitor eval scripts (evals/src/qa/qa_benchmark_{cognee,graphiti,mem0,lightrag}.py) are the same scripts the BEAM comparison was run on. Running them on your own data — a real customer-support transcript, a real corpus of internal documentation — is the empirical test. The directional numbers from the README are a starting point; your numbers will be different. The shape of the gap between cognee and a RAG baseline is the artifact you want to see.
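Before wiring up the real scripts, it can help to pin down what "the shape of the gap" means as a computation. The harness below is a generic sketch and not the code under evals/. The `System` callables are stubs standing in for a RAG baseline and a memory-backed system, and exact-match scoring is an assumption; a real eval would likely use a softer metric.

```python
from statistics import mean
from typing import Callable

# (question, expected answer) pairs; in practice these come from your own transcripts.
Case = tuple[str, str]
System = Callable[[str], str]


def evaluate(system: System, cases: list[Case]) -> float:
    """Mean exact-match score over the cases."""
    return mean(1.0 if system(q).strip().lower() == a.lower() else 0.0
                for q, a in cases)


def gap(candidate: System, baseline: System, cases: list[Case]) -> float:
    """Score difference between two systems on the same cases."""
    return evaluate(candidate, cases) - evaluate(baseline, cases)


# Hypothetical stubs: the baseline answers with the first-mentioned refund state.
cases = [("refund status?", "re-issued"), ("plan tier?", "pro")]


def baseline(question: str) -> str:
    return "processed" if "refund" in question else "pro"


def candidate(question: str) -> str:
    return "re-issued" if "refund" in question else "pro"
```

Splitting the cases into facts that changed and facts that never changed, and computing `gap` on each subset, shows whether the difference is concentrated where the conversation evolved.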

Step 3: production posture. ENABLE_BACKEND_ACCESS_CONTROL=True is the line in .env that flips the multi-tenant switch. The CLAUDE.md is explicit: "When true, API auth is required and per-user/dataset DB isolation is enabled. When false, single-user mode: shared DBs and auth off unless overridden." The lazy-handle pattern, the per-dataset ContextVar isolation, the _LeasedValueProxy — none of this is active without the flag. For a single-user local deployment, leaving it off is fine. For any production deployment, leaving it off is a security hole. The flag is the difference between a research prototype and a production system.
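The per-dataset ContextVar isolation mentioned in Step 3 rests on a standard-library mechanism that is easy to see in isolation. The sketch below uses only `contextvars` and `asyncio`. The variable `current_dataset`, the `dataset-<tenant>` naming, and the accepted spellings of the flag value are assumptions for illustration, not cognee's internals.

```python
import asyncio
import os
from contextvars import ContextVar

# Hypothetical name; stands in for whatever per-request dataset context a system keeps.
current_dataset: ContextVar[str] = ContextVar("current_dataset", default="shared")


def access_control_enabled() -> bool:
    """Read the flag named in the text; the accepted spellings are an assumption."""
    value = os.environ.get("ENABLE_BACKEND_ACCESS_CONTROL", "false")
    return value.strip().lower() == "true"


async def handle_request(tenant: str, isolate: bool) -> str:
    if isolate:
        # set() affects only the current task's copy of the context.
        current_dataset.set(f"dataset-{tenant}")
    await asyncio.sleep(0)  # yield so the other request interleaves
    return current_dataset.get()


async def main(isolate: bool) -> list[str]:
    # asyncio.gather wraps each coroutine in a Task, and each Task copies the context.
    return await asyncio.gather(
        handle_request("a", isolate),
        handle_request("b", isolate),
    )


if __name__ == "__main__":
    print(asyncio.run(main(access_control_enabled())))
```

With isolation on, each request sees its own dataset even though the two interleave on one event loop. With it off, both fall through to the shared default, which is the single-user behaviour the quote describes.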

Step 4: the session→graph sync discipline. The default is self_improvement=True on remember(), which schedules an asynchronous improve() call after the session write. This is the discipline: every session turn is bridged to the permanent graph in the background. Without it, the graph lags the session; the BEAM score drops. The cost is an LLM call per session turn; the benefit is the graph stays current. For high-throughput agents, the cost can be batched — cognee.improve() accepts a list of session IDs and processes them in a single transaction.
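One way to picture the batching trade-off is a small collector that gathers session IDs and hands them off in groups. This is a generic sketch: `SessionBatcher` and its `flush` callback are hypothetical, and the callback stands in for whatever reconciliation call you use. It makes no claim about the actual signature of `cognee.improve()`.

```python
from typing import Callable


class SessionBatcher:
    """Collect sessions with unsynced turns and flush them in groups."""

    def __init__(self, flush: Callable[[list[str]], None], max_batch: int = 3):
        self._flush = flush
        self._max = max_batch
        self._pending: list[str] = []

    def mark_dirty(self, session_id: str) -> None:
        # A session with several new turns still needs only one reconciliation.
        if session_id not in self._pending:
            self._pending.append(session_id)
        if len(self._pending) >= self._max:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush(batch)
```

The knob is `max_batch`: a larger value means fewer reconciliation runs but a longer window in which the graph lags the session. Calling `flush()` on a timer bounds that window.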

Step 5: pick the LLM carefully. The default is openai/gpt-5-mini (per CLAUDE.md). The structured output framework is instructor via litellm, with json_schema_mode as the default Instructor mode. For high-stakes production, the CLAUDE.md flags LLM_INSTRUCTOR_MODE as the knob — tool_call is faster but less reliable, md_json is the fallback. The BAML alternative (STRUCTURED_OUTPUT_FRAMEWORK=baml) is worth evaluating for projects that need strict type guarantees.

The closing argument

The customer-support agent on day fourteen, the one that contradicted itself about the billing case — that agent is the failure mode the architecture prevents. The fix is not "give the agent a bigger context window." The fix is "give the agent a substrate that remembers." The substrate is a knowledge graph; the LLM queries the graph on demand; the answers are grounded in the current state, not the historical record.

Cognee is the engineering embodiment of that fix. The memory verbs (remember/recall/forget/improve) are the user-facing API. The ECL pipeline is the workhorse. The 17 search types are the retrieval toolkit. The lazy-handle multi-tenant substrate is the production engineering. The session→graph sync is the discipline. The BEAM benchmark is the proof.

| What it is | Where it lives | Why it matters |
|------------|----------------|----------------|
| ECL pipeline | cognee/api/v1/cognify/cognify.py | The workhorse. Five tasks, strict sequence, per-chunk LLM. |
| 17 search types | cognee/modules/search/types/SearchType.py | The retrieval toolkit. One contract, many paradigms. |
| Lazy handle | cognee/infrastructure/databases/graph/get_graph_engine.py:59-107 | The production engineering. Cache eviction safety. |
| Session→graph sync | cognee/api/v1/improve/ | The discipline. The graph stays current. |
| BEAM 0.79 | README.md:400-405 | The proof. The right test, the right number. |

The thesis of this series is that cognee is the engineering answer to "LLM context windows are too short for agents that need to remember." The five chapters have walked from the public API to the runtime substrate to the production engineering to the benchmark that proves the architecture works. The action frame is concrete: cognee-cli -ui, the eval framework, ENABLE_BACKEND_ACCESS_CONTROL=True, the session sync, the LLM choice. The reader who finishes this series and puts cognee in front of their agent stack is the reader who has internalized the thesis.

Memory is not context. The graph is the substrate. The rest is engineering.