topoteretes/cognee / Chapter 4



The Handle That Survived Eviction

The 3-layer caching architecture that keeps a multi-tenant graph database alive across LRU evictions.
# From cognee/infrastructure/databases/graph/get_graph_engine.py
class _GraphEngineHandle:
    def __getattr__(self, name):
        # Re-resolve through the cache on every attribute access
        return getattr(self._resolve(), name)

The line is unremarkable in isolation. The architecture it supports is not. Cognee claims to be production-grade, multi-tenant, backend-agnostic across six graph databases (Ladybug, Neo4j, Postgres, Neptune, plus the Neptune Analytics hybrid and a community registry) and three vector databases (LanceDB, PGVector, Neptune Analytics). The claim holds because of a 3-layer caching architecture: an LRU at the bottom, a proxy whose every attribute access re-resolves through the cache in the middle, and a per-tenant DB isolation layer driven by Python ContextVars at the top. The "lazy handle" pattern at the middle is the answer to a question most engineers never think to ask: *what happens when an LRU evicts an adapter that someone is still holding a reference to?*

I had assumed that a production-grade multi-tenant system needs a heavyweight database-per-tenant setup with connection pools and explicit lifecycle management. The lazy-handle pattern shows a different economy. The cost is one indirection per attribute access. The benefit is that the cache can be small, the eviction policy can be aggressive, and the application code never sees a stale adapter.

Key Takeaways

  • The "lazy handle" pattern (_GraphEngineHandle, _VectorEngineHandle, _LeasedValueProxy) is the answer to "what happens when an LRU evicts an adapter that someone is still holding?"
  • ENABLE_BACKEND_ACCESS_CONTROL=True triggers per-dataset DB isolation via set_database_global_context_variables — the database engines read ContextVars and switch to per-tenant instances
  • The 5 cache backends (Redis, FS, Tapes, SQLite, Postgres) are not interchangeable — Redis is for cross-process locks (the shared_ladybug_lock feature), FS is for single-process dev, Tapes is for cloud-style external cache
  • A 256-entry LRU of generated model classes (in get_graph_from_model, mirrored by _lance_datapoint_class_cache on the vector side) bounds a Pydantic memory leak that a tracemalloc measurement quoted in the code comment put at roughly +50 MB per large-text cognify cycle
  • The LLM gateway's "facade pattern" (LLMGateway is a static-method namespace, not a stateful class) means swapping to BAML (STRUCTURED_OUTPUT_FRAMEWORK=baml) is one config flag and zero code change

The constraint that shaped the design

The codebase documents the design constraint in cognee/infrastructure/databases/graph/get_graph_engine.py:60-81. Direct quote from the handle's docstring:

A proxy object that survives cache invalidation. Every __getattr__ re-calls create_graph_engine(**config) so the caller always reaches a live adapter. Without this, code that stored the return of get_graph_engine() would get a closed adapter after a prune/delete cycle.

This is a real engineering constraint, not a theoretical one. The graph engine factory is wrapped in closing_lru_cache (cognee/infrastructure/databases/utils/closing_lru_cache.py:200-289), which means an adapter can be evicted and closed at any time. A caller that had stashed a reference to the adapter would then be holding a closed object. The handle pattern is the indirection that makes this safe: the caller holds the handle, the handle holds the config, and every attribute access re-resolves through the LRU.

Imagine 200 concurrent tenants, each in its own dataset, each calling get_graph_engine() from a different request handler. The LRU holds 32 entries by default, so evictions are routine. Without the handle pattern, every eviction is a potential failure: a handler holds a reference, the LRU evicts and closes the adapter, and the handler's next call fails with a connection error. With the handle pattern, the handler's reference is to the handle, and the handle re-resolves on every access. The 200-tenant case becomes safe.
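
The failure mode and the fix can be reproduced in miniature. The sketch below is not Cognee's code: FakeAdapter, ClosingLRU, create_engine, and LazyHandle are hypothetical stand-ins, and the cache is shrunk to two entries so that eviction happens immediately.

```python
from collections import OrderedDict


class FakeAdapter:
    """Stand-in for a graph adapter; a real one would hold connections."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def query(self, text):
        if self.closed:
            raise RuntimeError(f"adapter {self.name} is closed")
        return f"{self.name}: {text}"

    def close(self):
        self.closed = True


class ClosingLRU:
    """Tiny LRU that closes a value at the moment it is evicted."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._items = OrderedDict()

    def get_or_create(self, key, factory):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = factory(key)
        self._items[key] = value
        if len(self._items) > self.maxsize:
            _, evicted = self._items.popitem(last=False)  # least recently used
            evicted.close()
        return value


CACHE = ClosingLRU(maxsize=2)


def create_engine(name):
    return CACHE.get_or_create(name, FakeAdapter)


class LazyHandle:
    """Holds only the config and re-resolves on every attribute access."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, attr):
        # __getattr__ runs only when normal lookup fails, so self._name
        # is found directly and does not recurse.
        return getattr(create_engine(self._name), attr)


raw_a = create_engine("tenant-a")     # direct reference to the adapter
handle_a = LazyHandle("tenant-a")     # indirection through the cache

create_engine("tenant-b")
create_engine("tenant-c")             # evicts and closes tenant-a

stale = raw_a.closed                  # the raw reference is now dead
result = handle_a.query("MATCH (n)")  # the handle recreates the adapter
```

The raw reference ends up pointing at a closed object, while the handle gets a fresh adapter on its next attribute access. Nothing in the calling code has to know an eviction happened.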

The three layers, top to bottom

flowchart TB
    subgraph "Layer 3: Application code"
        A["tenant A: get_graph_engine() → handle"]
        B["tenant B: get_graph_engine() → handle"]
        C["tenant C: get_graph_engine() → handle"]
    end
    subgraph "Layer 2: Proxy (lazy handle)"
        H1["_GraphEngineHandle (A)"]
        H2["_GraphEngineHandle (B)"]
        H3["_GraphEngineHandle (C)"]
    end
    subgraph "Layer 1: closing_lru_cache"
        L["LRU(maxsize=32)<br/>key: hashable(config)<br/>value: _LeasedValueProxy"]
    end
    subgraph "Layer 0: Adapter"
        AD["LadybugAdapter / Neo4jAdapter / PostgresAdapter / ..."]
    end
    A --> H1
    B --> H2
    C --> H3
    H1 --> L
    H2 --> L
    H3 --> L
    L --> AD

Reading the diagram from the bottom up, the first layer is the LRU. The key is a to_hashable_dict() of the adapter config (graph provider, URL, credentials), and the value is a _LeasedValueProxy wrapping the actual adapter. On eviction, the LRU calls close() on the proxy, but the proxy defers the real close until the *caller's* reference is dropped instead of closing at eviction time (detach_from_cache, line 131). This decoupling is the subtle part: cache eviction and caller-side cleanup are independent events.
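
The deferred close can be pictured as a small holder count. The following is a minimal sketch under assumptions, not the actual _LeasedValueProxy: only the method name detach_from_cache comes from the text above, while Resource, LeasedValue, acquire, and release are hypothetical.

```python
class Resource:
    """Stand-in for an adapter with a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class LeasedValue:
    """Closes the wrapped value only once it is both evicted and unused."""

    def __init__(self, value):
        self._value = value
        self._holders = 0
        self._detached = False

    def acquire(self):
        # A caller starts using the value.
        self._holders += 1
        return self._value

    def release(self):
        # A caller is done; this may trigger the deferred close.
        self._holders -= 1
        self._close_if_unused()

    def detach_from_cache(self):
        # The cache evicted this entry; close now only if nobody holds it.
        self._detached = True
        self._close_if_unused()

    def _close_if_unused(self):
        if self._detached and self._holders == 0 and not self._value.closed:
            self._value.close()


resource = Resource()
lease = LeasedValue(resource)

lease.acquire()                         # a request starts using the adapter
lease.detach_from_cache()               # eviction happens mid-request
closed_during_use = resource.closed     # still open: the close was deferred
lease.release()                         # the request finishes
closed_after_release = resource.closed  # the deferred close has now run
```

This sketch is single-threaded. A real implementation would need a lock around the counter and the close decision.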

The second layer is the handle. Every call to get_graph_engine() returns a _GraphEngineHandle, not the adapter. The handle's __getattr__ calls self._resolve(), which re-runs create_graph_engine(**self._config) — re-using the LRU, which may return a live adapter (cache hit) or create a new one (cache miss after eviction). The caller never sees a closed adapter.

The third layer is the application code. It calls get_graph_engine() once, holds the handle, and uses it across many operations. The handle is the abstraction boundary.

The alternatives the codebase rejected

The lazy-handle pattern is the third of three options. The two rejected ones are worth naming because they are the obvious approaches and the codebase explicitly chose not to use them.

Per-tenant connection pools. The textbook multi-tenant solution. Each tenant gets a dedicated connection pool that is created at tenant creation and closed at tenant deletion. The cost is memory: 200 tenants × 10 connections × 100 KB per connection = 200 MB of idle connections. The benefit is no cache invalidation. Cognee rejected this because the memory cost does not scale to the deployment scenarios the README targets (Modal, Fly.io, and Render, which are serverless and edge platforms where memory is constrained).

Process-per-tenant. Stronger isolation. Each tenant gets its own Python process, its own connection state, its own cache. The cost is process management overhead. The benefit is the strictest possible isolation. Cognee rejected this because it does not fit the deployment model: Modal and Fly.io are designed around single-process applications, and forcing a process-per-tenant model would be incompatible with the deployment surface.

That leaves the lazy handle. Its cost is one indirection per attribute access, which on a cache hit amounts to a lookup in the LRU. In exchange, the cache can be small, the eviction policy can be aggressive, and the application code never sees a stale adapter. The indirection is cheap enough to treat as essentially free.

The per-tenant DB isolation layer

The handle pattern is the bottom of the multi-tenant story. The top of the story is set_database_global_context_variables (cognee/context_global_variables.py:281). The function sets ContextVars — Python's task-local storage — that the database engines read when they need a config. When a request comes in for a specific (user_id, dataset_id), the dispatcher wraps the request in:

async with set_database_global_context_variables(dataset, user_id):
    # graph_engine and vector_engine resolve to the dataset's isolated instances
    await cognee.search(...)
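
The mechanism underneath is the standard library's contextvars module. The sketch below shows, under assumptions, why two interleaved requests do not see each other's settings: graph_db_config, resolve_graph_database, and handle_request are hypothetical names, not Cognee's.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical variable holding the graph DB config for the current task.
graph_db_config = ContextVar("graph_db_config", default=None)


def resolve_graph_database():
    """What an engine factory might do: read the task-local config."""
    config = graph_db_config.get()
    return config["database"] if config else "shared-default"


async def handle_request(dataset_id, seen):
    token = graph_db_config.set({"database": f"graph_{dataset_id}"})
    try:
        await asyncio.sleep(0)  # yield so the two requests interleave
        seen[dataset_id] = resolve_graph_database()
    finally:
        graph_db_config.reset(token)


async def main():
    seen = {}
    # gather() runs each coroutine as its own task with a copied context.
    await asyncio.gather(handle_request("a", seen), handle_request("b", seen))
    return seen


seen = asyncio.run(main())
```

Each task resolves its own database name even though both run on the same event loop, and the value set inside a task does not leak into the caller's context.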

The DatasetDatabase SQL table (cognee/modules/users/models/DatasetDatabase.py:7) is the per-(owner, dataset) "connection info" row. Columns include vector_database_name, graph_database_name, *_database_provider, *_database_url, *_database_key, *_connection_info JSON blob (so new providers don't need schema changes), and migration_revision / migration_last_error. When get_or_create_dataset_database is called (databases/utils/get_or_create_dataset_database.py:67), a missing row is created by calling the *dataset handler* for each side — _get_vector_db_info calls the handler's create_dataset(dataset_id, user), and the handler decides what "create" means for its backend (filesystem-backed handlers create a sub-directory; Postgres handlers issue CREATE DATABASE).

This is the second pattern: *the configuration is data, not code*. New backends don't require schema changes — they need a handler that knows how to provision a per-dataset instance and a way to store its connection info in the JSON blob.

The backend_access_control_enabled() function (context_global_variables.py:91) is the central switch. It reads ENABLE_BACKEND_ACCESS_CONTROL first, then checks multi_user_support_possible() — which requires that both graph and vector *_dataset_database_handler are registered and that the configured provider is in the supported list (VECTOR_DBS_WITH_MULTI_USER_SUPPORT = ["lancedb", "pgvector", "falkor"], line 103; GRAPH_DBS_WITH_MULTI_USER_SUPPORT = ["ladybug", "kuzu", "falkor", "postgres"], line 104). The architecture is "opt in, but the opt-in is conditional on the backends you have configured."
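
The conditional opt-in can be sketched as two small functions. The provider lists are the ones quoted above; everything else is an assumption made to keep the example self-contained, including the explicit parameters, the way the flag is parsed, and the choice to return False when a backend is unsupported.

```python
import os

# Lists as quoted in the text above.
VECTOR_DBS_WITH_MULTI_USER_SUPPORT = ["lancedb", "pgvector", "falkor"]
GRAPH_DBS_WITH_MULTI_USER_SUPPORT = ["ladybug", "kuzu", "falkor", "postgres"]


def multi_user_support_possible(graph_provider, vector_provider, registered_handlers):
    """Simplified signature (assumption): providers and handlers are passed in."""
    handlers_ready = {"graph", "vector"} <= set(registered_handlers)
    return (
        handlers_ready
        and graph_provider in GRAPH_DBS_WITH_MULTI_USER_SUPPORT
        and vector_provider in VECTOR_DBS_WITH_MULTI_USER_SUPPORT
    )


def backend_access_control_enabled(graph_provider, vector_provider,
                                   registered_handlers, env=None):
    env = os.environ if env is None else env
    # Assumption: the flag is parsed as a case-insensitive "true".
    requested = env.get("ENABLE_BACKEND_ACCESS_CONTROL", "false").lower() == "true"
    # Assumption: an unsupported backend simply leaves isolation off.
    return requested and multi_user_support_possible(
        graph_provider, vector_provider, registered_handlers
    )
```

With the flag set, a ladybug plus lancedb configuration with both handlers registered returns True, while swapping the graph provider for one outside the list returns False.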

The 5 cache backends and why they are not interchangeable

The cache engine — CACHE_BACKEND — is the third infrastructure layer. The five backends, defined in cognee/infrastructure/databases/cache/config.py:36 as Literal["redis", "fs", "tapes", "sqlite", "postgres"], are not interchangeable. They serve different deployment scenarios:

| Backend | Use case | Why |
|---------|----------|-----|
| redis | Production multi-process | Cross-process locks via shared_ladybug_lock. Required for the Ladybug backend to coordinate file locks across workers. |
| fs | Single-process dev | Filesystem-backed. No cross-process coordination. The README flags this as "single-process only." |
| tapes | Cloud-style external cache | HTTP to TAPES_INGEST_URL. For deployments where the cache lives outside the cognee process. |
| sqlite (default) | Single-machine production | SQLite-backed. The default. Fast for single-machine, no cross-process coordination. |
| postgres | Production with Postgres already deployed | Reuse the existing Postgres for the cache. The postgres extra (pip install cognee[postgres]) enables this. |

The default is sqlite. The 7-day TTL (session_ttl_seconds=604800) is the default session lifetime. The usage_logging flag is a separate channel: when true, the cache engine returns None from get_cache_engine, and the cache writes go to a separate logging channel rather than the main cache.

I find the shared_ladybug_lock feature particularly interesting. Ladybug (the default graph database, formerly Kuzu) uses file-based storage. Multiple Python processes opening the same database file need a cross-process lock to avoid corruption. The cache engine's acquire_lock / release_lock abstraction is the substrate, and redis is the production-quality implementation. Without Redis, the codebase cannot run Ladybug safely across multiple processes. The README documents this as a deployment constraint, not a bug.

The Pydantic memory leak the codebase caught

The most surprising detail in the infrastructure layer is the Pydantic memory leak. The function get_graph_from_model in cognee/modules/graph/utils/get_graph_from_model.py recursively walks a DataPoint tree to extract (nodes, edges) tuples. For each call, it generates a fresh BaseModel subclass via copy_model and attaches fresh FieldInfo / SchemaValidator / SchemaSerializer to Pydantic's global caches. The comment in the file (lines 12-22) reads:

Tracemalloc attributed +~50 MB per large-text cognify cycle. The cache bounds this.

The fix is an LRU cache of size 256, keyed by (DataPoint subclass, tuple(sorted(excluded_fields))). On a hit, the cache returns the same BaseModel subclass instead of minting a new one. The same pattern is used on the vector side: _lance_datapoint_class_cache in LanceDBAdapter.py:113, capped at 256.
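
The shape of the fix can be shown without Pydantic. In this sketch, build_model is a hypothetical stand-in for copy_model that mints a new class on every call, and functools.lru_cache stands in for whatever cache structure the codebase actually uses. Only the key shape and the 256-entry bound come from the text.

```python
from functools import lru_cache


class DataPoint:
    """Stand-in for a model class; fields are just a tuple of names."""

    fields = ("id", "text", "embedding", "metadata")


def build_model(base, excluded):
    """Stand-in for copy_model: creates a brand-new class on every call."""
    kept = tuple(f for f in base.fields if f not in excluded)
    return type(f"{base.__name__}Copy", (base,), {"fields": kept})


@lru_cache(maxsize=256)
def _cached_model(base, excluded_key):
    # Keyed by (class, sorted tuple of excluded fields), as described above.
    return build_model(base, excluded_key)


def get_model(base, excluded_fields):
    # Sorting makes the key independent of the order the fields arrive in.
    return _cached_model(base, tuple(sorted(excluded_fields)))


uncached_1 = build_model(DataPoint, ("embedding",))
uncached_2 = build_model(DataPoint, ("embedding",))   # a second, distinct class

cached_1 = get_model(DataPoint, ["embedding", "metadata"])
cached_2 = get_model(DataPoint, ["metadata", "embedding"])  # same class object
```

Without the cache, every call leaves another class object behind. With it, the number of generated classes is bounded by the cache size no matter how many times the tree is walked.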

This is the kind of engineering that does not show up in benchmarks. It shows up in long-running production deployments that OOM after a few hours. The codebase caught it with tracemalloc, fixed it with an LRU, and documented the fix in a code comment that names the leak. This is the discipline of a production engineering team.

The LLM gateway's facade

The LLM gateway is the simplest piece of the infrastructure layer. LLMGateway (cognee/infrastructure/llm/LLMGateway.py:52-92) is a static-method namespace, not a class with state; the class itself is little more than a docstring holder. Every method delegates to a backend, such as the BAML extraction module or litellm_instructor.llm.get_llm_client().acreate_structured_output.

The provider list is at cognee/infrastructure/llm/structured_output_framework/litellm_instructor/llm/get_llm_client.py:70-92: OPENAI, OLLAMA, ANTHROPIC, CUSTOM, GEMINI, MISTRAL, AZURE, BEDROCK, LLAMA_CPP. Each maps to an adapter file. The BAML alternative is a drop-in: STRUCTURED_OUTPUT_FRAMEWORK=baml swaps the instructor.from_litellm wrapping for the BAML extraction module, with no code change.
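
A static-method facade that switches on a config flag can be sketched in a few lines. Gateway and the two _extract_with_* helpers are hypothetical, and the fallback value "instructor" is an assumption. Only the STRUCTURED_OUTPUT_FRAMEWORK variable and its baml value come from the text.

```python
import os


def _extract_with_instructor(text):
    # Placeholder for the litellm + instructor path.
    return {"framework": "instructor", "text": text}


def _extract_with_baml(text):
    # Placeholder for the BAML extraction path.
    return {"framework": "baml", "text": text}


class Gateway:
    """Hypothetical miniature of a stateless, static-method facade."""

    @staticmethod
    def extract(text, env=None):
        env = os.environ if env is None else env
        # Assumption: any value other than "baml" selects the instructor path.
        framework = env.get("STRUCTURED_OUTPUT_FRAMEWORK", "instructor").lower()
        if framework == "baml":
            return _extract_with_baml(text)
        return _extract_with_instructor(text)
```

Callers always write Gateway.extract(...), so changing the environment variable changes the backend without touching any call site.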

The rate limiter is a singleton (llm_rate_limiter, line 109 of rate_limiter.py) built on limits.storage.MemoryStorage and MovingWindowRateLimiter, and configured through LLM_RATE_LIMIT_ENABLED, LLM_RATE_LIMIT_REQUESTS, and LLM_RATE_LIMIT_INTERVAL. The error pattern list (lines 59-74) matches messages such as "rate limit", "too many requests", "quota", "throttled", and "tps limit exceeded", and the retry loop uses tenacity with exponential jitter.
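
To make the moving-window idea concrete, here is a standard-library sketch. It does not use the limits package; MovingWindowLimiter and is_rate_limit_error are hypothetical, and the pattern tuple contains only the examples named above, not the full list from the file.

```python
import time
from collections import deque


class MovingWindowLimiter:
    """Allows at most max_requests within any interval_seconds window."""

    def __init__(self, max_requests, interval_seconds, clock=time.monotonic):
        self._max = max_requests
        self._interval = interval_seconds
        self._clock = clock
        self._hits = deque()  # timestamps of accepted requests, oldest first

    def try_acquire(self):
        now = self._clock()
        # Drop timestamps that have slid out of the window.
        while self._hits and now - self._hits[0] >= self._interval:
            self._hits.popleft()
        if len(self._hits) >= self._max:
            return False
        self._hits.append(now)
        return True


# Only the patterns named in the text; the real list is longer.
RATE_LIMIT_PATTERNS = (
    "rate limit",
    "too many requests",
    "quota",
    "throttled",
    "tps limit exceeded",
)


def is_rate_limit_error(message):
    """Case-insensitive substring match against the known patterns."""
    lowered = message.lower()
    return any(pattern in lowered for pattern in RATE_LIMIT_PATTERNS)
```

The injectable clock exists only to make the limiter easy to test. A retry loop would call is_rate_limit_error on the provider's error message to decide whether backing off is worthwhile.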

The gateway is the entry point, but the actual call goes through acreate_structured_output on the litellm-instructor client, which wraps litellm.acompletion with instructor.from_litellm(..., mode=instructor.Mode(self.instructor_mode)). The mode defaults per provider (json_schema_mode for OpenAI) and is overridable via LLM_INSTRUCTOR_MODE. This is the structured-output backbone of the entire system.

The pattern, in summary

Cognee's production claim rests on a 3-layer caching architecture that is *boring in implementation, surprising in its consequences*. An LRU. A proxy. A per-tenant isolation layer. The LRU can be small and aggressive because the proxy makes eviction safe. The proxy can be trivial because the application code never holds direct references. The isolation layer can be a context manager because the proxy decouples caller lifetime from cache lifetime.

The hotel key card is the metaphor. A hotel key card re-validates against the front desk on every door. The cache can be aggressive because every attribute access re-validates. The cost is one round-trip per access; the benefit is that a 32-entry cache can safely serve 200 tenants. The lazy handle is the key card. The LRU is the front desk. The tenant isolation is the room number.

The question this design provokes is whether abstractions are the right currency in multi-tenant systems. A heavyweight per-tenant pool is more explicit. A process-per-tenant is more isolated. The lazy-handle pattern is neither — it is a thin indirection that turns out to be enough. The codebase's bet is that one indirection per attribute access is a price worth paying for the deployment scenarios the README targets. The architecture is the bet; the next chapter shows the bet paying off.