AI Agent Memory Patterns: How to Give Agents Long-Term Memory (2026)
Every AI agent starts each conversation with amnesia. It doesn't remember what you told it last week, what worked last time it ran this task, or anything about you beyond what fits in the current context window. This guide covers the four memory types that fix this — and how to implement each one in production.
- Why agents forget (and why that's a problem)
- The 4 memory types
- In-context memory patterns
- External memory with vector stores
- Episodic memory: storing past agent runs
- Semantic memory: user facts and learned knowledge
- Implementation patterns (LangGraph, LlamaIndex, Supabase)
- Memory hygiene: what to store, what to forget, privacy
- FAQ
Why agents forget (and why that's a problem)
LLMs are stateless by design. Every API call is independent — the model receives a context window, generates tokens, and returns a response. It has no persistent state, no memory of prior calls, and no awareness that this user has been talking to it for six months. The session ends; everything vanishes.
For a simple one-shot Q&A chatbot, this is fine. For anything more ambitious, it's a critical limitation. Consider what breaks without memory:
- A coding assistant that forgets your tech stack, naming conventions, and past architectural decisions every session
- A customer support agent that asks the same onboarding questions to a user it's spoken to 40 times
- A research agent that re-runs the same web searches it ran last week, unaware the results were cached and reviewed
- A personal assistant that can't build on prior conversations to develop a real picture of your preferences and working style
Memory is the difference between an agent that feels like a capable colleague and one that feels like a stateless API. Getting it right is one of the highest-leverage improvements you can make to an agent's perceived quality.
The challenge is that memory is hard to implement correctly. Context windows are finite. Retrieval is imperfect. Storing everything creates noise. Forgetting the wrong thing destroys trust. The four-type framework below gives you the vocabulary and patterns to make deliberate, appropriate choices.
The 4 memory types
| Type | What it stores | Persistence | Retrieval | Latency |
|---|---|---|---|---|
| In-context | Current conversation, injected facts | Session only | Always present | Zero |
| External / retrieval | Documents, knowledge base, past messages | Indefinite | Similarity search on each turn | 50–200ms |
| Episodic | Past agent runs, task outcomes, tool call history | Indefinite | Query by recency, task type, or similarity | 50–200ms |
| Semantic | User facts, preferences, learned knowledge about the world | Indefinite | Structured lookup or vector search | 10–100ms |
Most production agents need at least two of these four types working together. A personal assistant might use all four: in-context for the current conversation, external retrieval for your documents, episodic for "what did I help you with last Tuesday," and semantic for durable facts like your name, role, and communication preferences.
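To make the division of labour concrete, here is a minimal sketch of how the four types might be combined into a single model call. The `MemoryBundle` container, the section titles, and `build_prompt` are hypothetical names introduced for illustration; how each list gets populated is left out.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBundle:
    """Hypothetical container with one slot per memory type."""
    recent_turns: list[dict] = field(default_factory=list)    # in-context
    retrieved_docs: list[str] = field(default_factory=list)   # external / retrieval
    past_runs: list[str] = field(default_factory=list)        # episodic
    user_facts: list[str] = field(default_factory=list)       # semantic


def _section(title: str, items: list[str]) -> str:
    """Render one labelled bullet section, or an empty string if there are no items."""
    if not items:
        return ""
    bullets = "\n".join(f"- {item}" for item in items)
    return f"{title}:\n{bullets}"


def build_prompt(base_prompt: str, bundle: MemoryBundle) -> tuple[str, list[dict]]:
    """Fold the three persistent memory types into the system prompt.

    The in-context turns are returned unchanged as the message list.
    """
    sections = [
        _section("Known facts about the user", bundle.user_facts),
        _section("Relevant past runs", bundle.past_runs),
        _section("Retrieved documents", bundle.retrieved_docs),
    ]
    system = "\n\n".join([base_prompt] + [s for s in sections if s])
    return system, bundle.recent_turns
```

Empty sections are skipped, so an agent that only uses two of the four types produces a correspondingly shorter system prompt.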
In-context memory patterns
In-context memory is the simplest form: everything in the prompt is the memory. The model "remembers" whatever is in its context window right now. The challenge is managing that window intelligently as conversations grow.
Naive conversation history
The default approach: keep all messages and append each new turn. Works until the context limit is hit, then fails hard.
from anthropic import Anthropic

client = Anthropic()
messages = []  # grows without bound; dangerous at scale

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=messages,
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply
This works for short conversations. For anything longer than ~20 turns with substantive content, you need a windowing or summarization strategy.
Sliding window
Keep only the N most recent messages. Simple but loses context that may still be relevant:
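WINDOW_SIZE = 20  # keep last 20 messages

def get_windowed_messages(all_messages: list, window: int = WINDOW_SIZE) -> list:
    windowed = all_messages[-window:]
    # The Anthropic Messages API requires the first message to have the "user" role,
    # so drop any leading assistant message left over from slicing.
    while windowed and windowed[0]["role"] != "user":
        windowed = windowed[1:]
    return windowed

# Use in chat:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=get_windowed_messages(messages),
)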
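A fixed message count ignores how long each message is. As a variation on the sliding window, the sketch below trims by an estimated token budget instead. The `estimate_tokens` heuristic (about four characters per token) is a rough assumption for English text, not a real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: assume about 4 characters per token."""
    return max(1, len(text) // 4)


def window_by_tokens(all_messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the newest messages whose estimated token total fits within the budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(all_messages):
        cost = estimate_tokens(message["content"])
        # Always keep the newest message, even if it alone exceeds the budget.
        if kept and used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    # Drop leading non-user messages so the window starts with a user turn.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

For accurate budgeting, swap the heuristic for the provider's own token counting; the windowing logic stays the same.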
Summarization compression
A better approach: when the conversation exceeds a threshold, summarize older messages and keep recent ones verbatim. The summary preserves key facts while dramatically reducing token count:
def compress_history(messages: list, keep_recent: int = 8) -> list:
    if len(messages) <= keep_recent + 4:
        return messages
    to_summarize = messages[:-keep_recent]
    recent = messages[-keep_recent:]
    # Send the old turns as one transcript so the model summarizes them
    # instead of continuing the conversation.
    transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in to_summarize)
    summary_response = client.messages.create(
        model="claude-3-5-haiku-latest",  # cheap model for compression
        max_tokens=512,
        system="Summarize the following conversation into a compact paragraph preserving all key facts, decisions, and user preferences. Be concise.",
        messages=[{"role": "user", "content": transcript}],
    )
    summary_text = summary_response.content[0].text
    summary_message = {
        "role": "user",
        "content": f"[Conversation summary: {summary_text}]"
    }
    # Inject summary as a synthetic user message at position 0
    return [summary_message] + recent

# Compress once the history grows past 30 messages.
# (A check like len(messages) % 30 == 0 can be skipped over after the first compression.)
if len(messages) > 30:
    messages = compress_history(messages)
Use a cheap, fast model (Claude Haiku, GPT-4o mini) for the compression pass — the quality requirement is lower and the cost savings are significant at scale.
Injecting external memories into context
Rather than storing everything in the conversation, inject retrieved memories at the start of each turn as a synthetic system context block:
def build_context_with_memories(user_input: str, user_id: str) -> tuple[str, list]:
    # Retrieve relevant memories (covered in next sections)
    memories = retrieve_memories(user_id, user_input, top_k=5)
    memory_block = "\n".join(f"- {m}" for m in memories)
    system = f"""You are a helpful assistant.

Relevant memories about this user:
{memory_block}

Use these memories to personalize your response. Do not mention that you retrieved them.
"""
    return system, [{"role": "user", "content": user_input}]
External memory with vector stores
External memory stores information outside the model — in a vector database — and retrieves relevant pieces on demand. This is the same mechanism as RAG, applied to conversational history and user-specific knowledge rather than a document corpus.
The pattern: embed every turn or fact as a vector, store it with metadata (user ID, timestamp, session ID), and at the start of each new turn, retrieve the top-k most semantically similar past memories and inject them into context.
Storing conversation turns as memory
import os
from openai import OpenAI
from supabase import create_client

openai_client = OpenAI()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def embed(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def store_memory(user_id: str, content: str, memory_type: str = "conversation"):
    vector = embed(content)
    # created_at is omitted: the column default (now()) fills it in
    supabase.table("agent_memories").insert({
        "user_id": user_id,
        "content": content,
        "embedding": vector,
        "memory_type": memory_type,
    }).execute()

def retrieve_memories(user_id: str, query: str, top_k: int = 5) -> list[str]:
    query_vector = embed(query)
    result = supabase.rpc("match_memories", {
        "query_embedding": query_vector,
        "match_user_id": user_id,
        "match_count": top_k,
    }).execute()
    return [row["content"] for row in result.data]
Supabase pgvector setup
-- Enable pgvector extension
create extension if not exists vector;

-- Memory table
create table agent_memories (
  id uuid primary key default gen_random_uuid(),
  user_id text not null,
  content text not null,
  embedding vector(1536), -- text-embedding-3-small dimension
  memory_type text default 'conversation',
  created_at timestamptz default now()
);

-- Vector similarity search function
create or replace function match_memories(
  query_embedding vector(1536),
  match_user_id text,
  match_count int default 5
)
returns table (id uuid, content text, similarity float)
language sql stable
as $$
  select id, content, 1 - (embedding <=> query_embedding) as similarity
  from agent_memories
  where user_id = match_user_id
  order by embedding <=> query_embedding
  limit match_count;
$$;

-- Index for fast search
create index on agent_memories
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
This pattern gives you a working long-term memory store with a single Postgres/Supabase instance. No separate vector database service required. Scales comfortably to millions of memories per user before you need to think about sharding or dedicated vector infrastructure.
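To show what the similarity search computes, here is a minimal pure-Python sketch of the same ranking: filter by user, score by cosine similarity, return the top k. The in-memory `rows` list and the function names are stand-ins for illustration; in production the database does this work through the index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_memories(
    query_vec: list[float], rows: list[dict], user_id: str, k: int = 5
) -> list[str]:
    """Return the content of the k rows for this user most similar to the query."""
    candidates = [row for row in rows if row["user_id"] == user_id]
    scored = [
        (cosine_similarity(query_vec, row["embedding"]), row["content"])
        for row in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [content for _, content in scored[:k]]
```

This brute-force version scans every row for the user, which is fine for a few thousand memories and useful in unit tests. An approximate index trades a little recall for much faster lookups at larger sizes.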
When to use Pinecone or Qdrant instead
Migrate from pgvector to a dedicated vector database when: you exceed ~50M vectors, you need sub-10ms retrieval at p99, you need advanced filtering across many metadata fields, or you need multi-region replication. For most agent memory use cases, pgvector is sufficient and simpler to operate.
Episodic memory: storing and replaying past agent runs
Episodic memory records what the agent did, not just what was said. It captures the full trace of an agent run: the goal, the tools called, the outputs produced, the errors encountered, and the final outcome. An agent with episodic memory can recall "last time I tried to scrape that site, it required authentication" or "the previous report generation run took 4 minutes and timed out at step 3."
This is most valuable for:
- Recurring tasks (weekly reports, daily data syncs) — the agent learns from each run
- Debugging — humans and agents can both inspect why a previous run failed
- Few-shot planning — the agent can use past successful runs as examples for similar new tasks
- Audit trails — compliance, accountability, cost tracking
Storing a run episode with LangGraph
import json
from datetime import datetime

def save_episode(
    run_id: str,
    user_id: str,
    goal: str,
    steps: list[dict],  # list of {node, input, output, timestamp}
    outcome: str,  # "success" | "failure" | "partial"
    error: str | None = None,
    run_start: datetime | None = None,  # when the run began; pass it to record duration
):
    now = datetime.now()
    episode = {
        "run_id": run_id,
        "user_id": user_id,
        "goal": goal,
        "steps": steps,
        "outcome": outcome,
        "error": error,
        "duration_seconds": (now - run_start).total_seconds() if run_start else None,
        "created_at": now.isoformat(),
    }
    # Store full episode in Postgres
    supabase.table("agent_episodes").insert(episode).execute()
    # Store a summary embedding for semantic search
    summary = f"Goal: {goal}. Outcome: {outcome}. Steps: {len(steps)}. Error: {error or 'none'}"
    store_memory(user_id, summary, memory_type="episodic")
Retrieving relevant episodes for planning
def get_relevant_episodes(user_id: str, current_goal: str, top_k: int = 3) -> list[str]:
    # Semantic search over episode summaries
    similar_summaries = retrieve_memories(
        user_id,
        current_goal,
        top_k=top_k,
    )
    # Could also filter by memory_type="episodic" in the SQL query
    # Inject as few-shot context for the agent
    return similar_summaries

def build_agent_prompt_with_episodes(goal: str, user_id: str) -> str:
    episodes = get_relevant_episodes(user_id, goal)
    if not episodes:
        return f"Goal: {goal}"
    episode_context = "\n".join(f"- Past run: {ep}" for ep in episodes)
    return f"""Goal: {goal}

Relevant past experience:
{episode_context}

Use past experience to avoid known failure modes and build on what worked.
"""
LangGraph trace capture
LangGraph's checkpointer already stores the full step-by-step state history. You can replay it as an episode log:
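config = {"configurable": {"thread_id": run_id}}

# After run completes:
# get_state_history yields snapshots newest first, so reverse for chronological order
history = list(app.get_state_history(config))
steps = [
    {
        "node": list(h.next),  # node(s) scheduled to run from this snapshot
        "step": (h.metadata or {}).get("step"),
        "messages": len(h.values.get("messages", [])),
    }
    for h in reversed(history)
]
save_episode(run_id, user_id, goal, steps, outcome)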
Semantic memory: user facts, preferences, learned knowledge
Semantic memory stores durable facts — things that are true about the user or the world and don't change turn-to-turn. Examples: user's name, job title, tech stack, preferred communication style, time zone, project names, key relationships. Unlike episodic memory, semantic memory is not tied to a specific event — it's background knowledge that should inform every interaction.
The key design decision: when do you write to semantic memory? Three approaches:
Explicit extraction after each session
At the end of a session (or every N turns), run an extraction pass over the conversation to identify new facts worth storing:
# Literal braces in the JSON example are doubled so str.format() leaves them alone.
EXTRACTION_PROMPT = """
Review this conversation and extract any durable facts about the user.
Focus on: preferences, professional context, explicit statements of fact, stated goals.
Ignore: transient requests, single-session context, emotional reactions.

For each fact, output:
{{"fact": "...", "confidence": "high|medium", "category": "preference|professional|goal|relationship"}}

Output a JSON array. If no facts worth storing, output [].

Conversation:
{conversation}
"""

def extract_semantic_memories(conversation: list[dict], user_id: str):
    conv_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in conversation
    )
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        system=EXTRACTION_PROMPT.format(conversation=conv_text),
        messages=[{"role": "user", "content": "Extract facts."}],
    )
    try:
        facts = json.loads(response.content[0].text)
        for fact in facts:
            # Only store high-confidence facts
            if fact.get("confidence") == "high":
                store_memory(user_id, fact["fact"], memory_type=f"semantic_{fact['category']}")
    except (json.JSONDecodeError, KeyError, AttributeError):
        pass  # extraction failed or was malformed; log and continue
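Model output is not guaranteed to be clean JSON: it may arrive wrapped in a Markdown code fence or with missing fields. The following is a sketch of a stricter parser that could sit between the model call and the store. `parse_extracted_facts` and `ALLOWED_CATEGORIES` are hypothetical names; the accepted fields mirror the extraction prompt.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker
ALLOWED_CATEGORIES = {"preference", "professional", "goal", "relationship"}


def parse_extracted_facts(raw: str) -> list[dict]:
    """Parse model output into validated high-confidence facts; return [] on any failure."""
    text = raw.strip()
    # Strip a surrounding code fence (and optional "json" tag) if the model added one.
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    facts = []
    for item in data:
        if not isinstance(item, dict):
            continue
        fact = item.get("fact")
        if not isinstance(fact, str) or not fact.strip():
            continue
        if item.get("confidence") != "high":
            continue
        if item.get("category") not in ALLOWED_CATEGORIES:
            continue
        facts.append({"fact": fact.strip(), "category": item["category"]})
    return facts
```

Returning an empty list on failure keeps the write path simple: the caller stores whatever comes back and never has to handle a parsing exception.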
Structured key-value store for well-known fields
For facts you know you'll always want (name, role, timezone, language preference), use a structured table rather than embeddings. Embeddings are for open-ended retrieval; key-value is for known fields you'll look up directly:
-- Structured user profile
create table user_profiles (
  user_id text primary key,
  display_name text,
  role text,
  timezone text,
  language text default 'en',
  tech_stack jsonb, -- {"languages": ["python", "typescript"], "infra": "aws"}
  preferences jsonb, -- {"response_length": "concise", "code_style": "typed"}
  updated_at timestamptz default now()
);

# Inject into system prompt at session start (Python):
def get_user_profile_context(user_id: str) -> str:
    # maybe_single() tolerates a missing row; single() raises when no profile exists
    result = supabase.table("user_profiles").select("*").eq("user_id", user_id).maybe_single().execute()
    if result is None or not result.data:
        return ""
    p = result.data
    # tech_stack may be NULL in the database, so fall back to an empty object
    return f"User: {p['display_name']} | Role: {p['role']} | TZ: {p['timezone']} | Stack: {json.dumps(p.get('tech_stack') or {})}"
LlamaIndex memory modules
LlamaIndex provides a ChatMemoryBuffer and higher-level memory abstractions out of the box:
import os
from llama_index.core.memory import ChatMemoryBuffer, VectorMemory
from llama_index.core.agent import ReActAgent
from llama_index.vector_stores.supabase import SupabaseVectorStore

# embed_model, llm, and tools are assumed to be configured elsewhere

# Vector-backed long-term memory
vector_store = SupabaseVectorStore(
    postgres_connection_string=os.environ["DATABASE_URL"],
    collection_name="agent_memory",
)
long_term_memory = VectorMemory.from_defaults(
    vector_store=vector_store,
    embed_model=embed_model,
    retriever_kwargs={"similarity_top_k": 5},
)

# Short-term in-context buffer
short_term_memory = ChatMemoryBuffer.from_defaults(token_limit=4096)

# Compose both into an agent
agent = ReActAgent.from_tools(
    tools=tools,
    llm=llm,
    memory=short_term_memory,  # LlamaIndex uses this for context injection
    verbose=True,
)
# Separately retrieve from long_term_memory and inject into the system prompt
# Separately retrieve from long_term_memory and inject into the system prompt
Implementation patterns: LangGraph checkpointers, LlamaIndex memory modules, raw Supabase
Pattern 1: Full LangGraph stack
Use LangGraph's PostgresSaver for episodic/session memory (automatic) and a custom memory node that reads/writes semantic facts to a separate table:
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    user_id: str
    injected_memories: list[str]

def memory_retrieval_node(state: AgentState) -> dict:
    """Run at the start of each turn to inject relevant memories."""
    latest_input = state["messages"][-1].content
    memories = retrieve_memories(state["user_id"], latest_input, top_k=5)
    return {"injected_memories": memories}

def memory_write_node(state: AgentState) -> dict:
    """Run at the end of each turn to extract and store new facts."""
    # Only run every 10 messages (about 5 user/assistant turns) to save cost
    if len(state["messages"]) % 10 == 0:
        # extract_semantic_memories expects role/content dicts, not message objects
        recent = [{"role": m.type, "content": m.content} for m in state["messages"][-10:]]
        extract_semantic_memories(recent, state["user_id"])
    return {}

# agent_node, tool_node, should_continue, and conn are assumed to be defined elsewhere
graph = StateGraph(AgentState)
graph.add_node("memory_in", memory_retrieval_node)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.add_node("memory_out", memory_write_node)
graph.set_entry_point("memory_in")
graph.add_edge("memory_in", "agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: "memory_out"})
graph.add_edge("tools", "agent")
graph.add_edge("memory_out", END)

app = graph.compile(checkpointer=PostgresSaver(conn))
Pattern 2: Minimal raw API + Supabase
No frameworks — just the Anthropic SDK, pgvector, and a clean memory abstraction layer. Best when you don't want LangGraph overhead for simpler agents:
class AgentMemory:
    def __init__(self, user_id: str):
        self.user_id = user_id

    def remember(self, content: str, memory_type: str = "conversation"):
        store_memory(self.user_id, content, memory_type)

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        return retrieve_memories(self.user_id, query, top_k)

    def build_system_prompt(self, base_prompt: str, current_query: str) -> str:
        memories = self.recall(current_query)
        if not memories:
            return base_prompt
        memory_block = "\n".join(f"- {m}" for m in memories)
        return f"{base_prompt}\n\nRelevant memory:\n{memory_block}"

# Usage
memory = AgentMemory(user_id="user-456")
system = memory.build_system_prompt(
    "You are a helpful coding assistant.",
    "Help me write a database migration script."
)
response = client.messages.create(
    model="claude-sonnet-4-5",
    system=system,
    messages=[{"role": "user", "content": "Help me write a database migration script."}],
    max_tokens=2048,
)
# Store the exchange
memory.remember("User asked about database migrations. Provided Alembic migration script.")
Memory hygiene: what to store, what to forget, privacy
The instinct is to store everything and retrieve selectively. This is wrong. Storing everything creates retrieval noise, inflates storage costs, and creates privacy and compliance liabilities. Good memory hygiene is as important as good memory architecture.
What to store
- Durable user preferences — communication style, level of detail preferred, domains of expertise
- Explicit user statements of fact — "I work at Acme Corp," "My deadline is June 30th," "I prefer TypeScript over JavaScript"
- Task outcomes that inform future behavior — "The weekly report generation succeeded using template v3," "User rejected approach A, preferred approach B"
- Corrections the user makes — if a user says "that's wrong, actually X," store X with high priority
What not to store
- Transient conversational filler — "Thanks!", "Got it.", pleasantries with no informational content
- Sensitive PII beyond what's needed — credit card numbers, health details, passwords, SSNs
- Emotional state — "user seemed frustrated today" — context-specific and likely to produce stereotyping
- Low-confidence inferences — if you're guessing at a user preference, don't store it as fact
- Information the user didn't intend to share — incidentally mentioned details that weren't offered as facts about themselves
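Part of this list can be enforced mechanically before anything reaches the store. The sketch below gates on two of the rules: conversational filler and low-confidence inferences. The `FILLER_PHRASES` set and the function names are hypothetical, the phrase list is deliberately tiny, and the gate does not attempt PII detection.

```python
FILLER_PHRASES = {
    "thanks", "thank you", "got it", "ok", "okay", "cool", "great", "sounds good",
}


def is_filler(content: str) -> bool:
    """True for empty strings and short pleasantries with no informational content."""
    normalized = content.strip().lower().rstrip("!.? ")
    return not normalized or normalized in FILLER_PHRASES


def should_store(content: str, confidence: str = "high") -> bool:
    """Gate a candidate memory: reject filler and anything below high confidence."""
    if is_filler(content):
        return False
    return confidence == "high"
```

A gate like this is cheap to run on every candidate memory, so it can sit directly in front of the write path and keep obvious noise out of retrieval results.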
TTL and expiry
Implement time-to-live on memory records. Not all memories should live forever:
alter table agent_memories add column expires_at timestamptz;
-- Short-lived: task context, project-specific details (90 days)
-- Medium-lived: professional context, current goals (1 year)
-- Long-lived: durable preferences, communication style (no expiry)
-- Cleanup job (run daily)
delete from agent_memories where expires_at < now();
User-facing memory controls
For any production application storing user-specific memories, you must provide:
- Memory inspection — users should be able to see what the agent remembers about them
- Memory deletion — users must be able to delete specific memories or all memories (GDPR/CCPA right to erasure)
- Memory correction — users should be able to update incorrect facts
- Opt-out — users should be able to disable memory entirely
# Memory management API endpoints (FastAPI example)
# `app` is a FastAPI() instance. In production, derive user_id from the
# authenticated session rather than trusting a path or query parameter,
# otherwise any caller can read or delete another user's memories.
@app.get("/api/memories/{user_id}")
async def list_memories(user_id: str):
    result = supabase.table("agent_memories").select("id, content, memory_type, created_at").eq("user_id", user_id).execute()
    return result.data

@app.delete("/api/memories/{memory_id}")
async def delete_memory(memory_id: str, user_id: str):
    # Match on both id and user_id so users can only delete their own rows
    supabase.table("agent_memories").delete().eq("id", memory_id).eq("user_id", user_id).execute()
    return {"status": "deleted"}

@app.delete("/api/memories/user/{user_id}/all")
async def delete_all_memories(user_id: str):
    supabase.table("agent_memories").delete().eq("user_id", user_id).execute()
    supabase.table("user_profiles").delete().eq("user_id", user_id).execute()
    return {"status": "all memories deleted"}
Avoid storing PII in embeddings
Embeddings are hard to reverse, although inversion attacks can recover a good deal of the original text, and in any case the raw content is stored alongside them. If you store "User's SSN is 123-45-6789" as a memory, that plaintext exists in your database. Apply the same data handling standards to memory content as you do to any other PII. When in doubt, store a reference ("user has provided identity verification — see user_profiles.verified") rather than the sensitive value itself.
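One way to apply this is to redact recognizable sensitive values before the content is embedded or stored. The sketch below uses a few illustrative regular expressions. The patterns, placeholders, and `redact` helper are assumptions made for this example and will miss many real-world formats, so treat it as a last line of defence and not a complete PII detector.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),            # US SSN shape
    (re.compile(r"\b(?:\d[ -]?){12,15}\d\b"), "[REDACTED_CARD]"),        # 13 to 16 digit runs
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),   # email addresses
]


def redact(content: str) -> tuple[str, bool]:
    """Replace recognizable sensitive values; return (redacted_text, was_changed)."""
    redacted = content
    for pattern, placeholder in REDACTIONS:
        redacted = pattern.sub(placeholder, redacted)
    return redacted, redacted != content
```

The boolean flag lets the caller log or count redactions without ever writing the original value anywhere.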
FAQ
What is the difference between in-context memory and external memory for AI agents?
In-context memory is everything stored directly in the prompt — conversation history, injected facts, tool results. It's fast and zero-latency but strictly limited by the context window (typically 128K–200K tokens) and wiped at the end of each session. External memory lives outside the model in a database or vector store, persists indefinitely, and is retrieved on demand, but introduces retrieval latency (50–200ms) and the possibility of retrieval errors — relevant memories might not rank highly enough to be retrieved.
How do I give a chatbot long-term memory across sessions?
The standard production pattern: at the end of each session, run an LLM extraction pass over the conversation to identify durable facts and preferences, store them in a vector database keyed by user ID, and at the start of each new session retrieve the top-k most relevant memories based on the current query and inject them into the system prompt. This gives you effective long-term memory without blowing up your context window on every turn.
What vector database should I use for agent memory?
For most teams, pgvector via Supabase is the pragmatic default: one Postgres stack, minimal ops, and enough headroom for tens of millions of memories. Choose Pinecone if you need managed scale beyond that (roughly 50M+ vectors) with zero infra management, Qdrant for self-hosting with strong metadata filtering, or Weaviate if you need hybrid BM25 + vector search out of the box. All four are production-proven; choose based on your existing infrastructure and ops capacity.
What is episodic memory in AI agents?
Episodic memory is a record of past agent runs — what the agent did, what tools it called, what the outcomes were, and what errors it encountered. It's distinct from semantic memory (durable facts about the user or world) because it captures procedural, temporal experience tied to specific events. Agents with episodic memory can recall "last time I ran this weekly report, the data API returned a 429 rate limit error at step 3" and proactively add retry logic to the current run.
How do I handle memory privacy and data retention for AI agents?
Treat agent memory like any other user personal data: collect only what you need, store with explicit user consent, implement TTL-based expiry for time-sensitive facts, provide user-facing memory inspection and deletion (required under GDPR/CCPA right to erasure), keep raw PII out of memory content (the plaintext is stored next to its embedding), and separate memory tiers by sensitivity level. Preferences can persist longer; transactional details should expire within 90 days.