The Architectural Shape of Agent Memory

A series on how cognee replaced context windows with knowledge graphs — and what that replacement actually costs.

Imagine you're watching a customer-support agent on day fourteen. On day one it was correct. Today it is contradicting itself about a billing case it closed last week. The conversation log is in the database. The agent is not. The same week, a junior analyst asks an SQL copilot for a customer-retention query, and the copilot gives a textbook answer for a different schema. The expert query it should have echoed is three months of context-window rotation away — and that context is gone.

These are not LLM failures. They are memory failures. And they are the failure mode that cognee, the open-source AI memory platform from topoteretes/cognee, was built to address.

This series argues that cognee is the most thoroughly engineered open-source answer to "give an agent long-term memory," and that the memory metaphor is a thin skin over a much older idea: knowledge graphs. Across five chapters we'll walk from the public API to the runtime substrate. The work happens in a five-task ECL pipeline, behind seventeen search strategies, inside a multi-tenant substrate that uses a lazy-handle pattern to keep connections live across LRU evictions. The agent memory is the side effect; the engineering is the substance. If you finish this series and decide cognee is a vector-store wrapper, I will have failed to write it.

I started this analysis assuming cognee was a vector-store wrapper. Reading the README, I expected the storage story to be the point. The code tells a different story.

What "memory for an agent" means in 2026

An LLM has a context window. A context window is a temporary buffer — somewhere between eight thousand and a million tokens, depending on the model, and it is reset every time the conversation restarts. If you want an agent to remember something across sessions, you cannot rely on the context window. You have to externalize the memory: store it somewhere the LLM can query on demand.

The naive version of this is retrieval-augmented generation (RAG): chunk the documents, embed them, store the vectors, retrieve the top-k by cosine similarity at query time, and stuff the chunks into the prompt. This works. It is also brittle. It loses relationships between entities. It cannot answer "what changed between the time the user said X and the time they said Y?" because the chunks are flat. It forgets across sessions, because the retrieval is over static documents, not over what the agent has learned.
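To make the brittleness concrete, here is a minimal sketch of the naive loop. The `embed` function is a toy bag-of-words stand-in that I am assuming for illustration (a real system would call an embedding model), and the chunk texts are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank every chunk against the query and keep the k best."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "alice closed the billing case on monday",
    "the billing case was reopened on friday",
    "bob prefers detailed explanations",
]
question = "what happened to the billing case"
hits = top_k(question, chunks, k=2)

# "Stuff the chunks into the prompt": the retrieved text is simply concatenated.
prompt = "Context:\n" + "\n".join(hits) + "\n\nQuestion: " + question
```

Both billing chunks come back, but as two flat strings. Nothing in the result says that one event happened after the other or that they concern the same case, which is the relationship loss described above.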

The deeper version is a knowledge graph: extract entities and relationships, store them as nodes and edges, and let the LLM query the graph structure on demand. This is what cognee does. The remember verb in the public API is a thin wrapper over a graph write; the recall verb is a thin wrapper over a graph read. The interesting question is what happens in between: what kind of graph, what kind of read, and how the substrate is engineered to be safe for production.
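A graph write and a graph read can be sketched in a few lines. This is an in-memory toy, not cognee's storage layer: the class `TinyGraph`, its method names, and the example entities are all hypothetical.

```python
from collections import defaultdict

class TinyGraph:
    """Hypothetical in-memory graph: nodes are strings, edges are labeled."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # node -> [(relation, node), ...]

    def write(self, subject: str, relation: str, obj: str) -> None:
        """Store one (subject, relation, object) triple as a labeled edge."""
        self.out_edges[subject].append((relation, obj))

    def read(self, start: str, hops: int = 2) -> list[tuple[str, str, str]]:
        """Collect the triples reachable from `start` within `hops`, breadth-first."""
        seen, frontier, triples = {start}, [start], []
        for _ in range(hops):
            next_frontier = []
            for node in frontier:
                for relation, obj in self.out_edges.get(node, []):
                    triples.append((node, relation, obj))
                    if obj not in seen:
                        seen.add(obj)
                        next_frontier.append(obj)
            frontier = next_frontier
        return triples

graph = TinyGraph()
graph.write("case-42", "closed_by", "alice")
graph.write("case-42", "reopened_after", "refund-7")
graph.write("refund-7", "requested_by", "bob")

# Two hops from the case reach the person behind the refund.
context = graph.read("case-42", hops=2)
```

The read returns connected facts rather than similar text, so a question about the case can surface bob even though no single chunk mentions both.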

The five-thousand-foot view

Cognee's repository, as cloned at source/cognee/, is a Python package of roughly 2,400 tracked files organized in a layered architecture that the project's CLAUDE.md documents explicitly:

API Layer (cognee/api/v1/)               — remember/recall/forget/improve
Main Functions                           — add/cognify/search/memify
Pipeline Orchestrator (modules/pipelines/)  — Task, run_pipeline, per-dataset lock
Task Execution Layer (cognee/tasks/)        — classify_documents, extract_graph_from_data, add_data_points
Domain Modules (modules/)                  — graph, retrieval, ingestion, cognify
Infrastructure Adapters (infrastructure/)  — LLM gateway, graph/vector/cache engines
External Services                         — OpenAI, Ladybug, LanceDB, etc.

There are also several siblings: cognee-mcp/ (the Model Context Protocol server, a separate Docker image), cognee-frontend/ (the local Next.js UI), cognee_db_workers/ (subprocess helpers for the default graph database), a starter kit, and an eval framework.

The most important observation is that there are *two* public APIs. The v1 API is the pipeline-shaped one: add(), cognify(), search(), memify(). The v2 API is the memory-shaped one: remember(), recall(), forget(), improve(). The v2 functions are wrappers over the v1 functions, with session-memory handling added. When you call cognee.remember("the user prefers detailed explanations"), the runtime executes add() followed by cognify() for permanent memory, or a session-cache write followed by an asynchronous graph sync for ephemeral session memory. The two APIs exist because the use cases differ: v1 is for the engineer who is building a pipeline; v2 is for the agent that is acting.
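The wrapping relationship is easier to see with stubs. The sketch below does not import cognee. Its `add`, `cognify`, and `remember` are local stand-ins that only record what ran, and the `session_id` parameter, `session_cache` dict, and `sync_to_graph` helper are assumptions made for illustration rather than the library's real signatures.

```python
import asyncio

calls = []          # records which lower-level steps actually ran
session_cache = {}  # hypothetical stand-in for the session cache

async def add(text):
    """Stub for the v1 add() step."""
    calls.append(("add", text))

async def cognify():
    """Stub for the v1 cognify() step."""
    calls.append(("cognify",))

async def sync_to_graph(text):
    """Background sync: push one cached session entry into the graph."""
    await add(text)
    await cognify()

async def remember(text, session_id=None):
    """Memory-shaped wrapper over the pipeline-shaped steps."""
    if session_id is None:
        # Permanent memory: run the pipeline before returning.
        await add(text)
        await cognify()
        return None
    # Session memory: cheap cache write now, graph sync scheduled for later.
    session_cache.setdefault(session_id, []).append(text)
    return asyncio.create_task(sync_to_graph(text))

async def demo():
    await remember("the user prefers detailed explanations")
    pending = await remember("user is looking at invoice 7", session_id="s1")
    steps_before_sync = len(calls)  # the session write has not reached the graph yet
    await pending
    return steps_before_sync

steps_before_sync = asyncio.run(demo())
```

The permanent path pays the full pipeline cost up front. The session path returns after a dictionary write and lets the graph catch up, which is the trade the two memory modes make.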

flowchart TB
    subgraph API["Public API"]
        V2["remember / recall / forget / improve"]
        V1["add / cognify / search / memify"]
    end
    subgraph Pipeline["Pipeline Orchestrator"]
        T1["Task(classify_documents)"]
        T2["Task(extract_chunks_from_documents)"]
        T3["Task(extract_graph_and_summarize)"]
        T4["Task(add_data_points)"]
        T5["Task(extract_dlt_fk_edges)"]
    end
    subgraph Domain["Domain Modules"]
        G["modules/graph"]
        R["modules/retrieval"]
        S["modules/session_distillation"]
    end
    subgraph Infra["Infrastructure Adapters"]
        LLM["LLM Gateway (9 providers)"]
        GDB["Graph engine (6 backends)"]
        VDB["Vector engine (3 backends)"]
        Cache["Cache engine (5 backends)"]
    end
    V2 --> V1
    V1 --> Pipeline
    Pipeline --> Domain
    Domain --> Infra
    LLM --> External["OpenAI / Anthropic / Gemini / ..."]
    GDB --> External
    VDB --> External
    Cache --> External

This is the picture before any of the chapters open a file. From here, the rest of the series zooms in.
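As a rough sketch of the Pipeline Orchestrator box in the flowchart above: the five task names are taken from the diagram, but every function body here is a toy I made up, and the shapes of `Task` and `run_pipeline` are assumptions rather than cognee's actual classes.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    """Hypothetical wrapper: one pipeline stage around a plain callable."""
    fn: Callable[[Any], Any]

GRAPH = []  # toy stand-in for the graph store

def classify_documents(docs):
    return [{"kind": "text", "body": d} for d in docs]

def extract_chunks_from_documents(docs):
    # Toy chunking: one chunk per sentence.
    return [s.strip() for d in docs for s in d["body"].split(".") if s.strip()]

def extract_graph_and_summarize(chunks):
    # Toy extraction: read each three-word chunk as (subject, relation, object).
    return [tuple(c.split()) for c in chunks if len(c.split()) == 3]

def add_data_points(triples):
    GRAPH.extend(triples)
    return triples

def extract_dlt_fk_edges(triples):
    return triples  # nothing relational in this toy input, so pass through

async def run_pipeline(tasks, data, lock):
    """Run the stages strictly in order; each stage consumes the previous output."""
    async with lock:  # stand-in for the per-dataset lock
        for task in tasks:
            data = task.fn(data)
    return data

async def main():
    lock = asyncio.Lock()
    stages = (classify_documents, extract_chunks_from_documents,
              extract_graph_and_summarize, add_data_points, extract_dlt_fk_edges)
    tasks = [Task(fn) for fn in stages]
    return await run_pipeline(tasks, ["alice closed case-42. bob reopened case-42."], lock)

result = asyncio.run(main())
```

The point of the shape is that no stage knows about any other. The orchestrator owns the ordering and the lock, so swapping a stage does not touch its neighbors.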

Why cognee and not its competitors

The repository ships with an evaluation suite in evals/. The README reports a BEAM benchmark result that, if you read it carefully, is more interesting than the number itself.

We ran cognee against BEAM, a long-context benchmark that tests whether a system can keep track of a long conversation as it changes — a more useful test for agent memory than typical needle-in-a-haystack benchmarks. Using only cognee's default settings and standard open-source features (no custom models, no BEAM-specific pipelines), we beat the previous state of the art at the 100K-token setting and matched it at 10M tokens.

| Benchmark | Setting | cognee | Previous SOTA | Obsidian / RAG baseline |
|-----------|---------|--------|---------------|--------------------------|
| BEAM | 100K tokens | 0.79 (>0.8 with per-question routing) | 0.735 | ~0.33 |
| BEAM | 10M tokens | 0.67 | 0.641 | ~0.33 |

I will come back to these numbers in chapter four. For now, the relevant fact is that BEAM is not a typical QA benchmark. It is short for *Benchmark for Epistemic Agent Memory* — a test of whether a system can keep track of a long conversation *as it changes*. This is exactly the failure mode the customer-support agent exhibited on day fourteen. And the 0.79 score is the artifact; the underlying shift is from "context equals memory" to "graph equals memory."

Cognee is the engineering embodiment of that shift. The rest of the series is about how the embodiment is built.

A short tour of the next four chapters

**Chapter 1, *Extract, Cognify, Load*,** opens the engine room. The cognify() function in cognee/api/v1/cognify/cognify.py is a strictly sequential pipeline of five Task objects. The chapter walks each one, surfaces the two-schema problem (LLM-facing KnowledgeGraph with string IDs vs runtime DataPoint with UUIDs), and explains the bridge function integrate_chunk_graphs that translates one into the other. I expected the cognify function to be "run the LLM once and store the result." It is not.
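The two-schema problem can be previewed with a small sketch. This is not integrate_chunk_graphs itself. The dataclasses, the `bridge` function, and the choice of deterministic `uuid5` identifiers over normalized string IDs are all my assumptions about one plausible way to translate an LLM-facing graph into runtime records.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMNode:
    id: str    # free-form string ID, as an LLM would emit it
    name: str

@dataclass(frozen=True)
class LLMEdge:
    source: str
    target: str
    relation: str

@dataclass(frozen=True)
class DataPoint:
    id: uuid.UUID  # runtime identity
    name: str

def to_uuid(string_id: str) -> uuid.UUID:
    """Deterministic UUID: the same normalized string always maps to the same ID."""
    return uuid.uuid5(uuid.NAMESPACE_OID, string_id.strip().lower())

def bridge(chunk_graphs):
    """Merge per-chunk LLM graphs into one set of UUID-keyed points and edges."""
    points, edges = {}, set()
    for nodes, llm_edges in chunk_graphs:
        for node in nodes:
            key = to_uuid(node.id)
            points.setdefault(key, DataPoint(key, node.name))  # first spelling wins
        for edge in llm_edges:
            edges.add((to_uuid(edge.source), edge.relation, to_uuid(edge.target)))
    return points, edges

# Two chunks mention the same person with different casing.
chunk_a = ([LLMNode("Alice", "Alice"), LLMNode("case-42", "Case 42")],
           [LLMEdge("Alice", "case-42", "closed")])
chunk_b = ([LLMNode("alice", "alice"), LLMNode("Bob", "Bob")],
           [LLMEdge("Bob", "alice", "reports_to")])

points, edges = bridge([chunk_a, chunk_b])
```

Under this assumption, the bridge is also where deduplication happens: "Alice" and "alice" collapse into one node, so edges from different chunks land on the same entity.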

**Chapter 2, *Seventeen Ways to Ask a Graph*,** takes the SearchType enum apart. It has seventeen values, all behind a uniform three-method contract on BaseRetriever: get_retrieved_objects, get_context_from_objects, get_completion_from_context. The chapter follows a single query through the dispatcher, the retriever factory, and the default GRAPH_COMPLETION retriever's *brute-force triplet search* algorithm, the algorithm behind the BEAM score. I expected seventeen search types to be redundancy. They are not.
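A three-method contract of this kind might look like the sketch below. Only the three method names, BaseRetriever, SearchType, and GRAPH_COMPLETION come from the text. The synchronous signatures, the `run` helper, the `ToyTripletRetriever` class, the word-overlap scoring, and the `search` dispatcher are assumptions for illustration, and no LLM is called.

```python
from abc import ABC, abstractmethod
from enum import Enum

class SearchType(Enum):
    GRAPH_COMPLETION = "GRAPH_COMPLETION"

class BaseRetriever(ABC):
    @abstractmethod
    def get_retrieved_objects(self, query): ...

    @abstractmethod
    def get_context_from_objects(self, query, objects): ...

    @abstractmethod
    def get_completion_from_context(self, query, context): ...

    def run(self, query):
        """Hypothetical helper: chain the three contract methods in order."""
        objects = self.get_retrieved_objects(query)
        context = self.get_context_from_objects(query, objects)
        return self.get_completion_from_context(query, context)

class ToyTripletRetriever(BaseRetriever):
    def __init__(self, triples, top_k=1):
        self.triples, self.top_k = triples, top_k

    def get_retrieved_objects(self, query):
        words = set(query.lower().split())
        # Brute force: score every triple by word overlap, no index, no pruning.
        ranked = sorted(self.triples, key=lambda t: len(words & set(t)), reverse=True)
        return ranked[: self.top_k]

    def get_context_from_objects(self, query, objects):
        return "\n".join(" ".join(triple) for triple in objects)

    def get_completion_from_context(self, query, context):
        # A real retriever would call an LLM here; the sketch only builds the prompt.
        return f"Context:\n{context}\n\nQuestion: {query}"

TRIPLES = [("alice", "closed", "case-42"), ("bob", "prefers", "detail")]
RETRIEVERS = {SearchType.GRAPH_COMPLETION: lambda: ToyTripletRetriever(TRIPLES)}

def search(query, search_type=SearchType.GRAPH_COMPLETION):
    """Dispatcher: look up the factory for the search type, build, and run."""
    return RETRIEVERS[search_type]().run(query)

answer = search("who closed case-42")
```

With a uniform contract, adding a search type means adding one class and one dictionary entry. The dispatcher never changes.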

**Chapter 3, *The Handle That Survived Eviction*,** opens the cognee/infrastructure/ directory. The interesting design is a three-layer caching architecture: an LRU, a proxy whose every attribute access re-resolves through the cache, and a per-tenant DB isolation layer driven by Python ContextVars. The chapter explains why the pattern exists, what the codebase explicitly rejected (per-tenant connection pools, one database process per tenant), and why it works for six graph backends, three vector backends, and five cache backends without changing the engine interface.
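The three layers fit in a short sketch. Everything named here (`FakeEngine`, `EngineLRU`, `EngineHandle`, `current_tenant`) is hypothetical and stands in for whatever the codebase actually uses. The sketch only shows why a handle that re-resolves on each attribute access keeps working after the engine behind it has been evicted.

```python
from collections import OrderedDict
from contextvars import ContextVar

# Layer 3: the active tenant travels with the execution context.
current_tenant: ContextVar[str] = ContextVar("current_tenant", default="default")

class FakeEngine:
    """Stand-in for a database engine that becomes unusable once closed."""
    opened = 0

    def __init__(self, tenant):
        FakeEngine.opened += 1
        self.tenant, self.closed = tenant, False

    def query(self, q):
        if self.closed:
            raise RuntimeError("engine was evicted and closed")
        return f"{self.tenant}:{q}"

    def close(self):
        self.closed = True

class EngineLRU:
    """Layer 1: keep at most `capacity` engines open, closing the least recently used."""

    def __init__(self, capacity):
        self.capacity, self.engines = capacity, OrderedDict()

    def get(self, tenant):
        if tenant in self.engines:
            self.engines.move_to_end(tenant)
        else:
            self.engines[tenant] = FakeEngine(tenant)
            if len(self.engines) > self.capacity:
                _, evicted = self.engines.popitem(last=False)
                evicted.close()
        return self.engines[tenant]

class EngineHandle:
    """Layer 2: a proxy that looks the engine up again on every attribute access."""

    def __init__(self, cache):
        self._cache = cache

    def __getattr__(self, name):
        # Only called for names not found on the handle itself, such as `query`.
        return getattr(self._cache.get(current_tenant.get()), name)

cache = EngineLRU(capacity=1)
handle = EngineHandle(cache)

current_tenant.set("acme")
stale = cache.get("acme")       # a raw reference, held across the eviction
first = handle.query("q1")
current_tenant.set("globex")
second = handle.query("q2")     # opens globex, evicts and closes acme
current_tenant.set("acme")
third = handle.query("q3")      # the handle transparently reopens acme
```

The raw `stale` reference is now dead, while the handle still answers. The cost is one cache lookup per attribute access, and a bound method captured before an eviction would go stale in the same way, so the pattern relies on callers going through the handle each time.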

**Chapter 4, *What Beating RAG Actually Proves*,** is the closing case study. BEAM is not a needle-in-a-haystack test. The 0.79 score is the artifact; the shift is in the test, not the score. The chapter ends with an action list: try cognee-cli -ui for a local playground, run the BEAM-style eval yourself, and set ENABLE_BACKEND_ACCESS_CONTROL=True before any production deployment.

A metaphor for the road

Think of agent memory as a geological formation. Sediment accumulates. The lower layers are the older facts, compressed and slightly transformed by the pressure of newer layers above. The layers interact: a fact in layer three gets reinterpreted when a related fact lands in layer seven. Memory is not a stack of independent chunks. It is a stratified record where every new fact can disturb the meaning of the old ones.

The ECL pipeline in cognee is, in this image, the geological process. The cognify step is the sedimentation — each chunk of input is compressed into a node, embedded, and linked. The memify step is the metamorphism — when the graph is enriched, the meaning of the older nodes shifts. The recall step is the archaeologist: the LLM reads the strata and answers a question. The session cache is the surface weather — quick, ephemeral, easily erased. The permanent graph is the bedrock.

This metaphor is not decoration. It is the architecture. And it is the lens through which the next four chapters read the code.