The Architectural Shape of Agent Memory
A series on how cognee replaced context windows with knowledge graphs — and what that replacement actually costs.
Imagine you're watching a customer-support agent on day fourteen. On day one it was correct. Today it is contradicting itself about a billing case it closed last week. The conversation log is in the database. The agent is not. The same week, a junior analyst asks an SQL copilot for a customer-retention query, and the copilot gives a textbook answer for a different schema. The expert query it should have echoed is three months of context-window rotation away — and that context is gone.
These are not LLM failures. They are memory failures. And they are the failure mode that cognee, the open-source AI memory platform from topoteretes/cognee, was built to address.
This series argues that cognee is the most thoroughly engineered open-source answer to "give an agent long-term memory," and that the memory metaphor is a thin skin over a much older idea: knowledge graphs. Across five chapters we'll walk from the public API to the runtime substrate. The work happens in a five-task ECL pipeline, behind seventeen search strategies, inside a multi-tenant substrate that uses a lazy-handle pattern to keep connections live across LRU evictions. The agent memory is the side effect; the engineering is the substance. If you finish this series and decide cognee is a vector-store wrapper, I will have failed to write it.
I started this analysis assuming cognee is a vector-store wrapper. Reading the README, I expected the storage story to be the point. The code tells a different story.
What "memory for an agent" means in 2026
An LLM has a context window. A context window is a temporary buffer — somewhere between eight thousand and a million tokens, depending on the model, and it is reset every time the conversation restarts. If you want an agent to remember something across sessions, you cannot rely on the context window. You have to externalize the memory: store it somewhere the LLM can query on demand.
The naive version of this is retrieval-augmented generation (RAG): chunk the documents, embed them, store the vectors, retrieve the top-k by cosine similarity at query time, and stuff the chunks into the prompt. This works. It is also brittle. It loses relationships between entities. It cannot answer "what changed between the time the user said X and the time they said Y?" because the chunks are flat. It forgets across sessions, because the retrieval is over static documents, not over what the agent has learned.
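To make the naive RAG loop concrete, here is a minimal sketch. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and every name in it (`embed`, `cosine`, `top_k`) is hypothetical rather than anything from cognee.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "alice closed the billing case on monday",
    "the billing case was reopened by bob on friday",
    "the office cafeteria serves lunch at noon",
]
context = top_k("who closed the billing case", chunks)

# The retrieved chunks are simply concatenated into the prompt.
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Both billing chunks are retrieved, but nothing in the flat result says that the second event reopened what the first one closed. That ordering and linkage is exactly the relationship information the paragraph above says is lost.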
The deeper version is a knowledge graph: extract entities and relationships, store them as nodes and edges, and let the LLM query the graph structure on demand. This is what cognee does. The "memory" verb in the public API is a thin wrapper over a graph write; the "recall" verb is a thin wrapper over a graph read. The interesting question is what happens in between — what kind of graph, what kind of read, and how the substrate is engineered to be safe for production.
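As a rough illustration of the difference, the sketch below stores facts as (source, relation, target) triples and walks them. It is a toy in-memory structure under assumed names (`TripleGraph`, `paths`), not cognee's storage layer.

```python
from collections import defaultdict


class TripleGraph:
    """Minimal directed graph of (source, relation, target) triples."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, source, relation, target):
        # A "graph write": one edge between two entity nodes.
        self._out[source].append((relation, target))

    def neighbors(self, node):
        # A "graph read": everything directly connected to a node.
        return list(self._out.get(node, []))

    def paths(self, start, max_hops=2):
        """Yield every relation path of up to max_hops edges from start."""
        frontier = [(start, [])]
        for _ in range(max_hops):
            next_frontier = []
            for node, path in frontier:
                for relation, target in self._out.get(node, []):
                    new_path = path + [(node, relation, target)]
                    yield new_path
                    next_frontier.append((target, new_path))
            frontier = next_frontier


graph = TripleGraph()
graph.add("alice", "closed", "billing_case_42")
graph.add("billing_case_42", "belongs_to", "acme_corp")

# A two-hop question ("which customer's case did alice close?") is a path,
# something a flat list of chunks cannot express directly.
two_hop = [p for p in graph.paths("alice") if len(p) == 2]
```

The point of the sketch is only that multi-hop questions become path queries once entities and relationships are first-class; which graph database holds them is a separate concern.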
The five-thousand-foot view
Cognee's repository, as cloned at source/cognee/, is a Python package of roughly 2,400 tracked files organized in a layered architecture that the project's CLAUDE.md documents explicitly:
API Layer (cognee/api/v1/) — remember/recall/forget/improve
Main Functions — add/cognify/search/memify
Pipeline Orchestrator (modules/pipelines/) — Task, run_pipeline, per-dataset lock
Task Execution Layer (cognee/tasks/) — classify_documents, extract_graph_from_data, add_data_points
Domain Modules (modules/) — graph, retrieval, ingestion, cognify
Infrastructure Adapters (infrastructure/) — LLM gateway, graph/vector/cache engines
External Services — OpenAI, Ladybug, LanceDB, etc.
There are also several sibling components: cognee-mcp/ (the Model Context Protocol server, a separate Docker image), cognee-frontend/ (the local Next.js UI), cognee_db_workers/ (subprocess helpers for the default graph database), a starter kit, and an eval framework.
The most important observation is that there are *two* public APIs. The v1 API is the pipe
Extract, Cognify, Load
The five-task pipeline that turns text into a knowledge graph — and the two-schema problem that has to be solved in between.
```python
# From cognee/api/v1/cognify/cognify.py, lines 290-353
default_tasks = [
    Task(classify_documents),
    Task(
        extract_chunks_from_documents,
        max_chunk_size=chunk_size or get_max_chunk_tokens(),
        chunker=chunker,
    ),
    Task(
        extract_graph_and_summarize,
        graph_model,
        config,
        custom_prompt,
        task_config={"batch_size": chunks_per_batch},
    ),
    Task(
        add_data_points,
        embed_triplets=embed_triplets,
        task_config={"batch_size": chunks_per_batch},
    ),
    Task(extract_dlt_fk_edges),
]
```
The simplest API in cognee hides the deepest pipeline. cognee.remember("the user prefers detailed explanations") reads like a single operation. Behind it is a five-task pipeline that classifies documents, chunks them, runs an LLM extraction per chunk, writes the result to a graph and a vector store, and extracts foreign-key edges from any structured data sources — all strictly sequential, all stamped with provenance, and all protected by a per-dataset lock to prevent concurrent runs against the same data.
I initially read cognify() as "run the LLM once and store the result." The five tasks and the integrate_chunk_graphs function show this is wrong. The LLM is called *per chunk*, and there is a translation step between the LLM's output and the runtime storage that is one of the more interesting pieces of engineering in the codebase.
Key Takeaways
- cognify() is a strictly sequential pipeline of five Task objects, not a "run the LLM and store" one-shot
- The KnowledgeGraph ↔ DataPoint translation in integrate_chunk_graphs is the architecture's hinge
- DLT (data-load tool) chunks bypass the LLM entirely: they get deterministic FK-based graph construction
- Per-chunk parallel LLM calls (asyncio.gather) inside extract_graph_and_summarize are bounded by an asyncio.Semaphore(data_per_batch=20); this is how cognee extracts 100+ entities per second per chunk
- Every task stamps provenance onto the DataPoints it creates: source_pipeline, source_task, source_user, source_content_hash, topological_rank
The pipeline at a glance
The five tasks, in order, are:
1. classify_documents: turn raw Data items into typed Document objects (PDF, audio, image, text, etc.)
2. extract_chunks_from_documents: split the documents into semantic text chunks
3. extract_graph_and_summarize: the LLM call. For each chunk, extract entities and relationships AND generate a per-chunk summary, batched under an asyncio.Semaphore(data_per_batch=20)
4. add_data_points: persist the resulting nodes, edges, and embeddings to the graph database and the vector database, dual-write
5. extract_dlt_fk_edges: for structured data loaded via DLT, extract foreign-key edges deterministically (no LLM call)
Imagine you pass the string "Cognee turns documents into AI memory" to cognee.add() and then run cognee.cognify(). By the time cognify() returns, the five tasks have run, the LLM has been called against the chunk containing that string, and the resulting entities ("Cognee," "documents," "AI memory") and the relationship between them have been written to a graph that the next cognee.recall() can query.
```mermaid
flowchart LR
    A["add()<br/>raw data"] --> B["classify_documents<br/>raw Data → typed Document"]
    B --> C["extract_chunks_from_documents<br/>Document → DocumentChunk[]"]
    C --> D["extract_graph_and_summarize<br/>LLM per chunk<br/>→ KnowledgeGraph"]
    D --> E["add_data_points<br/>graph + vector write<br/>(dual or atomic)"]
    E --> F["extract_dlt_fk_edges<br/>DLT chunks only<br/>deterministic FKs"]
    F --> G["searchable graph<br/>+ vector store"]
```
The pipeline is *strictly sequential* at the task level. The orchestrator in cognee/modules/pipelines/operations/run_tasks_base.py:262-283 takes tasks[0] as the running task and recursively chains the remainder, tasks[1:], as the leftover tasks; there is no automatic parallelization between tasks. Data items within a single task, however, are processed in parallel via asyncio.gather, bounded by an asyncio.Semaphore(data_per_batch) whose default size is 20. This is the right level of concurrency: enough to saturate an LLM rate limiter, not enough to overwhelm a graph database.
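The two levels of scheduling described here can be sketched in a few lines. This is an illustration of the pattern only, not cognee's run_tasks_base.py: the function names `run_task` and `run_tasks` and the two stand-in tasks are hypothetical.

```python
import asyncio


async def run_task(task, items, data_per_batch=20):
    """Run one task over all items concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(data_per_batch)

    async def bounded(item):
        async with semaphore:  # at most data_per_batch items in flight
            return await task(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(item) for item in items))


async def run_tasks(tasks, items, data_per_batch=20):
    """Finish tasks[0] on every item, then recurse on tasks[1:]."""
    if not tasks:
        return items
    running_task, leftover_tasks = tasks[0], tasks[1:]
    results = await run_task(running_task, items, data_per_batch)
    return await run_tasks(leftover_tasks, results, data_per_batch)


# Hypothetical stand-in tasks; real tasks would classify, chunk, call an LLM.
async def classify(text):
    await asyncio.sleep(0)
    return {"type": "text", "body": text}


async def chunk(document):
    await asyncio.sleep(0)
    return document["body"].split(". ")


output = asyncio.run(
    run_tasks([classify, chunk], ["First point. Second point", "Only one"])
)
```

No item reaches `chunk` until every item has cleared `classify`, which is the task-level sequencing the text describes, while the semaphore limits how many items a single task works on at once.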
The two-schema problem
The interesting part is what happens inside extract_graph_and_summarize. The function in cognee/tasks/graph/extract_graph_from_data.py calls extract_content_graph (in cognee/infrastructure/llm/extraction/knowledge_graph/) for each chunk. The LLM call is wrapped by Instructor, which forces the response into a Pydantic model — by default, KnowledgeGraph from cognee/shared/data_models.py.
```python
# From cognee/shared/data_models.py
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: str
    name: str = ""  # defaults to id in __init__
    type: str
    description: str


class Edge(BaseModel):
    source_node_id: str
    target_node_id: str
    relationship_name: str
    description: str | None


class KnowledgeGraph(BaseModel):
    nodes: list[Node] = Field(default_factory=list)
    edges: list[Edge] = Fi
```
Seventeen Ways to Ask a Graph
A single dispatcher, a uniform three-method contract, and the algorithm that won the BEAM benchmark.
What does "graph completion" even mean when a graph is the wrong shape for the question?
Cognee ships with seventeen SearchType values. Seventeen. That sounds like redundancy. It is not. The seventeen values cover five retrieval paradigms (graph-traversal, vector-only, hybrid, code/time-specialized, and agentic) plus an LLM router, all unified behind a single three-method contract on BaseRetriever. The default, GRAPH_COMPLETION, is not a graph traversal at all. It is a *brute-force triplet search*: parallel vector search across five collections, an in-memory graph projection, distance-mapped re-ranking, and a top-k cutoff. That algorithm is what beat the BEAM benchmark. The other sixteen search types are variations on the theme of "use the right retrieval for the right question."
I expected seventeen search types to be redundancy. After mapping them to retrieval paradigms, the redundancy is the design.
Key Takeaways
- All seventeen SearchType values conform to a 3-method contract: get_retrieved_objects → get_context_from_objects → get_completion_from_context
- The default GRAPH_COMPLETION is a brute-force triplet search, not a graph traversal: it embeds the query, runs parallel vector search across 5 collections, projects a graph fragment, then re-ranks by triplet_distance_penalty=6.5
- FEELING_LUCKY lets the LLM pick a search type at runtime; this is only possible because of the uniform contract
- HYBRID_COMPLETION adds a "truth subspace" axis beyond vector similarity: a separate scoring signal that biases toward context-aligned answers
- AGENTIC_COMPLETION is a ReAct-style agent loop with per-tool ACL, not just a retriever
- Permission denials return empty lists, not errors: a deliberate information-leak prevention
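The re-ranking step in the second takeaway can be pictured roughly as follows. This is a sketch under stated assumptions: the text gives the penalty value 6.5, but the scoring rule used here (sum the vector distances of a triplet's source node, edge, and target node, substituting the penalty for anything the vector search did not return) is an assumption, and `rank_triplets` is a hypothetical name.

```python
import heapq

# Value quoted in the text; how cognee applies it is assumed here.
TRIPLET_DISTANCE_PENALTY = 6.5


def rank_triplets(triplets, node_distances, edge_distances, top_k=2,
                  penalty=TRIPLET_DISTANCE_PENALTY):
    """Return the top_k triplets with the smallest combined distance."""

    def score(triplet):
        source, relation, target = triplet
        # Elements missing from the vector hits fall back to the penalty.
        return (node_distances.get(source, penalty)
                + edge_distances.get(relation, penalty)
                + node_distances.get(target, penalty))

    return heapq.nsmallest(top_k, triplets, key=score)


# Triplets from a projected graph fragment (illustrative data).
triplets = [
    ("user", "opened", "billing_case"),
    ("billing_case", "resolved_by", "refund"),
    ("user", "likes", "coffee"),
]
# Distances as they might come back from the parallel vector searches.
node_distances = {"user": 0.4, "billing_case": 0.1, "refund": 0.3}
edge_distances = {"opened": 0.5, "resolved_by": 0.2}

ranked = rank_triplets(triplets, node_distances, edge_distances)
```

Under this rule a triplet only ranks well when several of its parts were close to the query, so the unrelated "likes coffee" fact is pushed far down by two penalties even though it shares the "user" node.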
The seventeen, grouped by paradigm
```python
# From cognee/modules/search/types/SearchType.py
from enum import Enum


class SearchType(str, Enum):
    SUMMARIES = "SUMMARIES"
    CHUNKS = "CHUNKS"
    CHUNKS_LEXICAL = "CHUNKS_LEXICAL"
    RAG_COMPLETION = "RAG_COMPLETION"
    HYBRID_COMPLETION = "HYBRID_COMPLETION"
    TRIPLET_COMPLETION = "TRIPLET_COMPLETION"
    GRAPH_COMPLETION = "GRAPH_COMPLETION"
    GRAPH_COMPLETION_DECOMPOSITION = "GRAPH_COMPLETION_DECOMPOSITION"
    GRAPH_SUMMARY_COMPLETION = "GRAPH_SUMMARY_COMPLETION"
    GRAPH_COMPLETION_COT = "GRAPH_COMPLETION_COT"
    GRAPH_COMPLETION_CONTEXT_EXTENSION = "GRAPH_COMPLETION_CONTEXT_EXTENSION"
    CYPHER = "CYPHER"
    NATURAL_LANGUAGE = "NATURAL_LANGUAGE"
    TEMPORAL = "TEMPORAL"
    FEELING_LUCKY = "FEELING_LUCKY"
    CODING_RULES = "CODING_RULES"
    AGENTIC_COMPLETION = "AGENTIC_COMPLETION"
```
| Paradigm | Search types | What it is |
|----------|--------------|------------|
| Vector-only | CHUNKS, CHUNKS_LEXICAL, SUMMARIES, RAG_COMPLETION | Traditional RAG: embed the query, retrieve top-k, optionally LLM-complete. No graph. |
| Graph-traversal | TRIPLET_COMPLETION, GRAPH_COMPLETION, GRAPH_SUMMARY_COMPLETION, GRAPH_COMPLETION_COT, GRAPH_COMPLETION_DECOMPOSITION, GRAPH_COMPLETION_CONTEXT_EXTENSION | The default paradigm. Use the graph; pick a flavor. |
| Hybrid | HYBRID_COMPLETION | Multi-channel (chunks + entities + facts + global context) with a "truth subspace" scoring axis. |
| Specialized | CYPHER, NATURAL_LANGUAGE, TEMPORAL, CODING_RULES | Domain-specific: Cypher queries, NL→Cypher translation, time-aware, code-rules. |
| Agentic | AGENTIC_COMPLETION | ReAct-style agent loop with skills + tools, gated by per-tool ACL. |
| Router | FEELING_LUCKY | Lets the LLM pick a search type at runtime. |
Imagine you call cognee.recall("What did the user say about billing?"). The router picks GRAPH_COMPLETION (or, if you set auto_route=True and skip query_type, the LLM picks via FEELING_LUCKY). What happens next is the same regardless of which paradigm you chose.
The three-method contract
Every retriever in cognee subclasses BaseRetriever (cognee/modules/retrieval/base_retriever.py:5-118) and implements three methods:
```python
from abc import ABC


class BaseRetriever(ABC):
    async def get_retrieved_objects(self, query) -> list: ...
    async def get_context_from_objects(self, query, retrieved_objects) -> str | list: ...
    async def get_completion_from_context(self, query, retrieved_objects, context) -> str: ...
```
This contract is the single most important abstraction in the retrieval layer. The dispatcher (cognee/modules/search/methods/get_search_type_retriever_instance.py:38-389) holds a registry — search_core_registry — that maps each SearchType to a (RetrieverClass, init_kwargs_dict) pair. The retriever factory instantiates the right class and the dispatcher calls the three methods in order.
The contract's power is that it makes FEELING_LUCKY possible. When the dispatcher receives a FEELING_LUCKY query, it calls select_search_type(query_text) (in cognee/modules/search/operations/select_search_type.py:9-42) which asks the LLM to pick a SearchType from the enum. Then it falls through to the same dispatcher logic as if the user had named the type explicitly. The dynamic router is only possible because the contract is uniform — otherwise every retriever would need its own dispatch logic and the LLM's pick would be an un-routable string.
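A toy version of this dispatch shows why the uniform contract matters. Everything below is a hypothetical sketch: the `Retriever` classes, the `REGISTRY` dict, and the keyword-based `select_search_type` stub (standing in for the LLM call) are illustrative and do not reproduce cognee's code.

```python
import asyncio
from abc import ABC, abstractmethod


class Retriever(ABC):
    """The three-method contract, reduced to its shape."""

    @abstractmethod
    async def get_retrieved_objects(self, query): ...

    @abstractmethod
    async def get_context_from_objects(self, query, retrieved_objects): ...

    @abstractmethod
    async def get_completion_from_context(self, query, retrieved_objects, context): ...


class ChunkRetriever(Retriever):
    def __init__(self, chunks):
        self.chunks = chunks

    async def get_retrieved_objects(self, query):
        words = set(query.lower().split())
        return [c for c in self.chunks if words & set(c.lower().split())]

    async def get_context_from_objects(self, query, retrieved_objects):
        return "\n".join(retrieved_objects)

    async def get_completion_from_context(self, query, retrieved_objects, context):
        return f"[chunks] {context}"


class SummaryRetriever(Retriever):
    def __init__(self, summary):
        self.summary = summary

    async def get_retrieved_objects(self, query):
        return [self.summary]

    async def get_context_from_objects(self, query, retrieved_objects):
        return retrieved_objects[0]

    async def get_completion_from_context(self, query, retrieved_objects, context):
        return f"[summary] {context}"


# Each search type maps to a (class, init kwargs) pair, as the text describes.
REGISTRY = {
    "CHUNKS": (ChunkRetriever, {"chunks": ["billing case closed monday", "lunch is at noon"]}),
    "SUMMARIES": (SummaryRetriever, {"summary": "one billing case, now closed"}),
}


def select_search_type(query):
    """Stub router; the real one asks an LLM to choose from the enum."""
    return "SUMMARIES" if "summar" in query.lower() else "CHUNKS"


async def search(query, search_type):
    if search_type == "FEELING_LUCKY":
        search_type = select_search_type(query)  # then fall through unchanged
    retriever_class, init_kwargs = REGISTRY[search_type]
    retriever = retriever_class(**init_kwargs)
    objects = await retriever.get_retrieved_objects(query)
    context = await retriever.get_context_from_objects(query, objects)
    return await retriever.get_completion_from_context(query, objects, context)


explicit = asyncio.run(search("billing status", "CHUNKS"))
routed = asyncio.run(search("summarize the billing history", "FEELING_LUCKY"))
```

The router adds one line to the dispatcher. Because every retriever answers to the same three calls, the selected type needs no special handling after it is chosen.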
The default algorithm: brute-force triplet search
GRAPH_COMPLETION is the default and the algorithm is worth a
Premium chapters
4. The_Handle_That_Survived_Eviction
5. What_Beats_RAG_Actually_Proves
6. README