
Why Your AI Agent Keeps Forgetting: The Three Memory Layers That Fix It

Vector databases are powerful — but they're not enough. Here's why agents built on a single memory layer keep failing, and the three-layer architecture that finally gets it right.


The Agent That Couldn't Connect the Dots

You've built an AI agent to help your engineering team stay on top of project work. You feed it everything: Jira tickets, Confluence docs, Slack threads, sprint planning notes, architecture decisions. You wire up a vector database, embed it all, and ship it. The team starts using it.

At first, it seems great. Ask it "What's the current status of the auth service refactor?" — it nails it. Ask it "Who's been working on the payment gateway?" — it finds the right Slack messages. But then someone asks a slightly harder question:

💬 The question that breaks it

"Which engineers on the payments team have unresolved blockers that are holding up Q2 deliverables?"

The agent returns a generic list of open Jira tickets. It doesn't connect that Sarah owns the payments gateway, that she raised a blocker in Monday's standup Slack thread, that the blocker involves a dependency owned by the infra team, and that the infra team's sprint is already over capacity per last week's planning doc.

Four facts. Four different sources. Each one retrievable in isolation. None of them connected.

This isn't a data problem. All of that information exists in the knowledge base. The agent can recall any individual fact with high confidence. What it cannot do is reason across the chain of relationships that links them together.

This is the fundamental limitation of a vector-only memory architecture — and it's the most common mistake teams make when building production agents.

Bigger Context Windows Don't Fix It

The instinct is to throw more context at the problem. Dump everything into a 128K or 1M token window and let the model figure it out. This feels like it should work. It mostly doesn't.

📉
Research from Stanford and Berkeley found that LLM recall drops by 30% or more when relevant information is positioned in the middle of a long context, rather than near the start or end. This is the "Lost in the Middle" effect (Liu et al., 2023). A bigger context window isn't a memory solution — it's a larger haystack with a duller needle.

The model doesn't have a filing system. It has attention — and attention has limits. Stuffing more tokens into the context is like solving a messy desk by buying a bigger desk: you can fit more on it, but you still can't find anything.

The real fix requires rethinking memory as a layered architecture, not a single store.

The Three Memory Layers

World-class agent memory systems combine three complementary layers, each solving a different part of the problem. Used together, they give agents the ability to recall facts precisely, understand semantic meaning, and reason across complex relationships.

Layer 01
Relational Layer — Provenance & Structure

A traditional relational database (Postgres, SQLite, etc.) that stores structured facts with context: where did this information come from? When was it recorded? Who created it, who has access to it, and when does it expire? This layer is the source of truth for provenance. It answers the question "where did this come from, and can I trust it?" — something a vector store fundamentally cannot answer.
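As a concrete sketch of what this layer stores, here is a minimal provenance-aware facts table using Python's built-in sqlite3. The schema and all names in it are hypothetical, chosen for illustration; a production system would use Postgres with real access control.

```python
import sqlite3

# Illustrative schema (table and column names are hypothetical): every
# stored fact carries provenance so the agent can decide whether to trust it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        id          INTEGER PRIMARY KEY,
        content     TEXT NOT NULL,
        source      TEXT NOT NULL,   -- e.g. 'jira', 'slack', 'confluence'
        created_by  TEXT NOT NULL,
        recorded_at TEXT NOT NULL,   -- ISO-8601 timestamp
        expires_at  TEXT,            -- NULL = never expires
        access_role TEXT NOT NULL    -- role required to read this fact
    )
""")
conn.execute(
    "INSERT INTO facts (content, source, created_by, recorded_at, expires_at, access_role) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("Sarah owns the payments gateway", "jira", "sync-bot",
     "2024-04-01T09:00:00", None, "engineer"),
)

# The provenance question in SQL: only current facts this user may see.
rows = conn.execute(
    "SELECT content, source FROM facts "
    "WHERE (expires_at IS NULL OR expires_at > ?) AND access_role = ?",
    ("2024-04-15T00:00:00", "engineer"),
).fetchall()
print(rows)  # [('Sarah owns the payments gateway', 'jira')]
```

The point is not the specific columns but that freshness, authorship, and access are first-class, queryable attributes — none of which survive embedding into a vector.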

Layer 02
Vector Layer — Semantic Similarity

A vector index (pgvector, Pinecone, Weaviate, etc.) that stores embedded representations of meaning. This layer answers "what is this similar to?" It enables fuzzy, intent-driven retrieval: finding docs about "payment system failures" when the user asked about "transaction errors." It's the semantic bridge between natural language and stored knowledge. Powerful for initial retrieval — but blind to relationships.
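The mechanics of that fuzzy retrieval reduce to nearest-neighbor search over embeddings. The toy sketch below uses hand-made 3-dimensional vectors purely for illustration — a real system would get embeddings from a model and search them with an index like pgvector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" (hand-made; real ones come from an embedding model).
docs = {
    "payment system failures": [0.9, 0.1, 0.0],
    "sprint planning notes":   [0.1, 0.9, 0.2],
    "auth service refactor":   [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # stand-in embedding for "transaction errors"

# Semantic retrieval = pick the document closest to the query in vector space.
best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # payment system failures
```

Note what this returns: the single most similar document — not anything about who owns it, whether it's stale, or what it depends on.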

Layer 03
Graph Layer — Relationships & Multi-Hop Reasoning

A property graph database (Neo4j, Apache AGE, etc.) that stores explicit connections between entities. This layer answers "how do these facts connect?" It enables the kind of multi-hop reasoning our project manager example requires: Sarah → owns → payments gateway → blocked by → infra dependency → owned by → Team X → sprint status: over-capacity. Without a graph layer, that chain is invisible. With it, it's a three-hop traversal.
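That traversal can be sketched in a few lines with an in-memory adjacency list standing in for the graph database. Entity and relation names mirror the running example; in production this would live in Neo4j or Apache AGE and be queried with Cypher:

```python
from collections import deque

# Tiny property graph as an adjacency list of (relation, target) edges.
graph = {
    "Sarah":            [("owns", "payments gateway")],
    "payments gateway": [("blocked_by", "infra dependency")],
    "infra dependency": [("owned_by", "infra team")],
    "infra team":       [("sprint_status", "over capacity")],
}

def traverse(start, max_hops=4):
    """Breadth-first walk collecting the relationship chain from `start`."""
    chain, queue, seen = [], deque([(start, 0)]), {start}
    while queue:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue
        for relation, target in graph.get(node, []):
            chain.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return chain

for src, rel, dst in traverse("Sarah"):
    print(f"{src} -[{rel}]-> {dst}")
# Sarah -[owns]-> payments gateway
# payments gateway -[blocked_by]-> infra dependency
# infra dependency -[owned_by]-> infra team
# infra team -[sprint_status]-> over capacity
```

Each hop is a constant-time edge lookup — exactly the operation a vector store has no way to express, because similarity search has no notion of "follow this edge."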

How the Three Layers Work Together

These layers aren't alternatives — they're complements. A query flows through all three, each contributing what it does best.

%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#1a1a2e", "primaryTextColor": "#e2e8f0", "primaryBorderColor": "#00d4ff", "lineColor": "#7c3aed", "secondaryColor": "#0d0d1f", "tertiaryColor": "#0a0a1a", "edgeLabelBackground": "#0a0a1a", "clusterBkg": "#0d0d1f", "clusterBorder": "#7c3aed", "titleColor": "#00d4ff", "nodeTextColor": "#fff"}}}%%
flowchart LR
    Q["User Query"]:::query --> V

    subgraph MEM ["Memory Architecture"]
        V["Vector Layer
semantic search
pgvector"]:::vector
        R["Relational Layer
provenance + access
Postgres"]:::relational
        G["Graph Layer
relationships
Neo4j / AGE"]:::graph
        V -->|candidate entities| G
        G -->|hop traversal| R
        V -->|access filter| R
    end

    MEM --> CTX["Assembled Context
ranked and filtered"]:::ctx
    CTX --> LLM["LLM Response
grounded and precise"]:::llm

    classDef query fill:#7c3aed,stroke:#a78bfa,color:#fff
    classDef vector fill:#1a1040,stroke:#7c3aed,color:#c4b5fd
    classDef relational fill:#001a22,stroke:#00d4ff,color:#7dd3fc
    classDef graph fill:#220011,stroke:#ec4899,color:#f9a8d4
    classDef ctx fill:#0d0d1f,stroke:#00d4ff,color:#e2e8f0
    classDef llm fill:#0a0a1a,stroke:#00d4ff,color:#00d4ff

Fig 1. Query flows through all three layers; each contributes what it does best.

The flow in practice:

  1. The user's query is embedded and hits the vector layer first, retrieving semantically relevant candidate documents and entities.
  2. Those entities are handed to the graph layer, which traverses relationships — connecting engineers to tickets, tickets to blockers, blockers to dependent teams.
  3. Results are filtered through the relational layer, which checks provenance: is this information current? Does this user have access? What's the source confidence?
  4. The assembled, relationship-aware context is passed to the LLM — which now has everything it needs to answer the question correctly.

Practical Architecture Options

The good news: you don't need three separate managed services to get started. Here are the most practical paths depending on your stack:

| Option | Stack | Notes |
| --- | --- | --- |
| pgvector + Apache AGE | Postgres extensions | Both layers in one Postgres instance. Great for self-hosted setups. ⚠️ Apache AGE is not available on Neon's managed Postgres — requires your own Postgres host or Docker. |
| pgvector + Neo4j AuraDB | Postgres + managed graph | Best of both worlds: Neon-compatible for vectors plus Neo4j AuraDB's free tier for the graph layer. Production-ready with zero graph infra to manage. |
| Cognee | Open-source orchestration | Automatically orchestrates all three layers. You give it raw documents; it handles embedding, relationship extraction, and graph construction. Lowest barrier to get started. |
| Docker (self-hosted) | Apache AGE in Docker | The fastest way to get Apache AGE running locally or on a VPS with full extension support. |

For the Docker option, getting Apache AGE running is a single command:

docker run -p 5432:5432 \
  -e POSTGRES_PASSWORD=secret \
  apache/age:latest

Then enable both extensions in your database (note: pgvector isn't necessarily bundled with the apache/age image — install it separately if the second command fails):

-- In your connected DB:
CREATE EXTENSION IF NOT EXISTS age;
CREATE EXTENSION IF NOT EXISTS vector;

LOAD 'age';
SET search_path = ag_catalog, "$user", public;

From there, you can store vector embeddings in a standard vector(1536) column, and model relationships using AGE's Cypher query interface — all within the same Postgres connection.

Where to Start

If you're building a new agent from scratch, start with Cognee — it removes all the architectural decisions and gets you a working three-layer memory in under an hour. If you're adding memory to an existing Postgres-backed system, reach for pgvector + Neo4j AuraDB. If you're self-hosted and want full control, Apache AGE + pgvector via Docker is the cleanest single-service path.

The key insight is this: vector similarity is a starting point, not a destination. The moment your agent needs to reason about how things relate — ownership, dependency, causality, chronology — you need a graph layer. And the moment your agent needs to reason about whether to trust a fact — who added it, when, from what source — you need a relational layer.

Memory in production agents is an architectural decision, not a configuration option. Build it with all three layers, and your agent won't just remember things. It'll understand them.



Koushik Nagarajan
Engineering leader and AI architect. Writing about the patterns, pitfalls, and principles behind building intelligent systems that actually work in production.
