TechStackopoly
Agent Memory · Patterns · 2026

AI Agent Memory Patterns: How to Give Agents Long-Term Memory (2026)

Every AI agent starts each conversation with amnesia. It doesn't remember what you told it last week, what worked last time it ran this task, or anything about you beyond what fits in the current context window. This guide covers the four memory types that fix this — and how to implement each one in production.

Why agents forget (and why that's a problem)

LLMs are stateless by design. Every API call is independent — the model receives a context window, generates tokens, and returns a response. It has no persistent state, no memory of prior calls, and no awareness that this user has been talking to it for six months. The session ends; everything vanishes.

For a simple one-shot Q&A chatbot, this is fine. For anything more ambitious, it's a critical limitation.

Memory is the difference between an agent that feels like a capable colleague and one that feels like a stateless API. Getting it right is one of the highest-leverage improvements you can make to an agent's perceived quality.

The challenge is that memory is hard to implement correctly. Context windows are finite. Retrieval is imperfect. Storing everything creates noise. Forgetting the wrong thing destroys trust. The four-type framework below gives you the vocabulary and patterns to make deliberate, appropriate choices.

The 4 memory types

| Type | What it stores | Persistence | Retrieval | Latency |
| --- | --- | --- | --- | --- |
| In-context | Current conversation, injected facts | Session only | Always present | Zero |
| External / retrieval | Documents, knowledge base, past messages | Indefinite | Similarity search on each turn | 50–200ms |
| Episodic | Past agent runs, task outcomes, tool call history | Indefinite | Query by recency, task type, or similarity | 50–200ms |
| Semantic | User facts, preferences, learned knowledge about the world | Indefinite | Structured lookup or vector search | 10–100ms |

Most production agents need at least two of these four types working together. A personal assistant might use all four: in-context for the current conversation, external retrieval for your documents, episodic for "what did I help you with last Tuesday," and semantic for durable facts like your name, role, and communication preferences.

In-context memory patterns

In-context memory is the simplest form: everything in the prompt is the memory. The model "remembers" whatever is in its context window right now. The challenge is managing that window intelligently as conversations grow.

Naive conversation history

The default approach: keep all messages and append each new turn. Works until the context limit is hit, then fails hard.

from anthropic import Anthropic

client = Anthropic()
messages = []   # grows unboundedly — dangerous at scale

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=messages,
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply

This works for short conversations. For anything longer than ~20 turns with substantive content, you need a windowing or summarization strategy.

Sliding window

Keep only the N most recent messages. Simple but loses context that may still be relevant:

WINDOW_SIZE = 20   # keep last 20 messages

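def get_windowed_messages(all_messages: list, window: int = WINDOW_SIZE) -> list:
    windowed = all_messages[-window:]
    # The Messages API expects the first message to come from the user,
    # so drop any leading assistant turn left behind by the cut.
    while windowed and windowed[0]["role"] != "user":
        windowed = windowed[1:]
    return windowed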

# Use in chat:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=get_windowed_messages(messages),
)
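A message-count window ignores how long each message is, so twenty short turns and twenty pasted log files are treated the same. The sketch below trims by an estimated token budget instead. It is a minimal illustration under stated assumptions: `estimate_tokens` and `window_by_tokens` are hypothetical helpers, the 4-characters-per-token ratio is a rough guess (use your provider's token counter for real limits), and every message is assumed to carry plain string `content`.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (an assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def window_by_tokens(all_messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the most recent messages whose estimated size fits the budget."""
    kept: list[dict] = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(all_messages):
        cost = estimate_tokens(message["content"])
        if kept and used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    # Drop leading assistant turns so the window starts on a user message.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

The newest message is always kept, even if it alone exceeds the budget, so the current user turn is never silently dropped.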

Summarization compression

A better approach: when the conversation exceeds a threshold, summarize older messages and keep recent ones verbatim. The summary preserves key facts while dramatically reducing token count:

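def compress_history(messages: list, keep_recent: int = 8) -> list:
    if len(messages) <= keep_recent + 4:
        return messages

    to_summarize = messages[:-keep_recent]
    recent = messages[-keep_recent:]

    # Send the old turns as one transcript so the request always ends on a
    # user message, whatever role the last summarized turn had.
    transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in to_summarize)

    summary_response = client.messages.create(
        model="claude-3-5-haiku-latest",   # cheap model for compression
        max_tokens=512,
        system="Summarize the following conversation into a compact paragraph preserving all key facts, decisions, and user preferences. Be concise.",
        messages=[{"role": "user", "content": transcript}],
    )
    summary_text = summary_response.content[0].text

    summary_message = {
        "role": "user",
        "content": f"[Conversation summary: {summary_text}]"
    }
    # Inject summary as a synthetic user message at position 0
    return [summary_message] + recent

# Compress once the history passes a threshold. A modulo check such as
# len(messages) % 30 == 0 can stop firing after the first compression,
# because the shortened list may never land on a multiple of 30 again.
COMPRESS_THRESHOLD = 30
if len(messages) > COMPRESS_THRESHOLD:
    messages = compress_history(messages)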

Use a cheap, fast model (Claude Haiku, GPT-4o mini) for the compression pass — the quality requirement is lower and the cost savings are significant at scale.

Injecting external memories into context

Rather than storing everything in the conversation, inject retrieved memories at the start of each turn as a synthetic system context block:

def build_context_with_memories(user_input: str, user_id: str) -> tuple[str, list]:
    # Retrieve relevant memories (covered in next sections)
    memories = retrieve_memories(user_id, user_input, top_k=5)
    memory_block = "\n".join(f"- {m}" for m in memories)

    system = f"""You are a helpful assistant.

Relevant memories about this user:
{memory_block}

Use these memories to personalize your response. Do not mention that you retrieved them.
"""
    return system, [{"role": "user", "content": user_input}]

External memory with vector stores

External memory stores information outside the model — in a vector database — and retrieves relevant pieces on demand. This is the same mechanism as RAG, applied to conversational history and user-specific knowledge rather than a document corpus.

The pattern: embed every turn or fact as a vector, store it with metadata (user ID, timestamp, session ID), and at the start of each new turn, retrieve the top-k most semantically similar past memories and inject them into context.
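To see the mechanics of that pattern without a database or an embedding API, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: `toy_embed`, `MemoryRecord`, and `InMemoryStore` are hypothetical names, and the word-count "embedding" is only a stand-in for a real embedding model. What carries over to the real implementation is the shape: embed on write, filter by user, rank by cosine similarity, return the top-k.

```python
import math
import time
from collections import Counter
from dataclasses import dataclass, field


def toy_embed(text: str) -> Counter:
    """Sparse bag-of-words vector. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class MemoryRecord:
    user_id: str
    content: str
    embedding: Counter
    created_at: float = field(default_factory=time.time)  # metadata stored with the vector


class InMemoryStore:
    def __init__(self) -> None:
        self.records: list[MemoryRecord] = []

    def add(self, user_id: str, content: str) -> None:
        # Embed on write so retrieval only has to embed the query.
        self.records.append(MemoryRecord(user_id, content, toy_embed(content)))

    def search(self, user_id: str, query: str, top_k: int = 5) -> list[str]:
        query_vec = toy_embed(query)
        # Filter by user first, then rank that user's memories by similarity.
        scored = [
            (cosine(query_vec, record.embedding), record.content)
            for record in self.records
            if record.user_id == user_id
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [content for _, content in scored[:top_k]]
```

The user filter runs before ranking, so one user's memories can never surface in another user's results regardless of how similar they are.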

Storing conversation turns as memory

import os
from openai import OpenAI
from supabase import create_client

openai_client = OpenAI()
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def embed(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def store_memory(user_id: str, content: str, memory_type: str = "conversation"):
    vector = embed(content)
    supabase.table("agent_memories").insert({
        "user_id": user_id,
        "content": content,
        "embedding": vector,
        "memory_type": memory_type,
        # created_at is left to the column default (now()) in the table definition
    }).execute()

def retrieve_memories(user_id: str, query: str, top_k: int = 5) -> list[str]:
    query_vector = embed(query)
    result = supabase.rpc("match_memories", {
        "query_embedding": query_vector,
        "match_user_id": user_id,
        "match_count": top_k,
    }).execute()
    return [row["content"] for row in result.data]

Supabase pgvector setup

-- Enable pgvector extension
create extension if not exists vector;

-- Memory table
create table agent_memories (
  id          uuid primary key default gen_random_uuid(),
  user_id     text not null,
  content     text not null,
  embedding   vector(1536),   -- text-embedding-3-small dimension
  memory_type text default 'conversation',
  created_at  timestamptz default now()
);

-- Vector similarity search function
create or replace function match_memories(
  query_embedding vector(1536),
  match_user_id   text,
  match_count     int default 5
)
returns table (id uuid, content text, similarity float)
language sql stable
as $$
  select id, content, 1 - (embedding <=> query_embedding) as similarity
  from agent_memories
  where user_id = match_user_id
  order by embedding <=> query_embedding
  limit match_count;
$$;

-- Index for fast search
create index on agent_memories
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);

This pattern gives you a working long-term memory store with a single Postgres/Supabase instance, with no separate vector database service required. It should scale comfortably to millions of memories in total before you need to think about sharding or dedicated vector infrastructure.

When to use Pinecone or Qdrant instead

Migrate from pgvector to a dedicated vector database when: you exceed ~50M vectors, you need sub-10ms retrieval at p99, you need advanced filtering across many metadata fields, or you need multi-region replication. For most agent memory use cases, pgvector is sufficient and simpler to operate.

Episodic memory: storing and replaying past agent runs

Episodic memory records what the agent did, not just what was said. It captures the full trace of an agent run: the goal, the tools called, the outputs produced, the errors encountered, and the final outcome. An agent with episodic memory can recall "last time I tried to scrape that site, it required authentication" or "the previous report generation run took 4 minutes and timed out at step 3."

This is most valuable for recurring tasks, where the outcome of an earlier run can change how the agent plans the next one.

Storing a run episode with LangGraph

import json
from datetime import datetime

def save_episode(
    run_id: str,
    user_id: str,
    goal: str,
    steps: list[dict],   # list of {node, input, output, timestamp}
    outcome: str,        # "success" | "failure" | "partial"
    error: str | None = None,
    started_at: datetime | None = None,   # when the run began, if the caller tracked it
):
    now = datetime.now()
    episode = {
        "run_id": run_id,
        "user_id": user_id,
        "goal": goal,
        "steps": steps,
        "outcome": outcome,
        "error": error,
        # Duration is only known when the caller passes the start time
        "duration_seconds": (now - started_at).total_seconds() if started_at else None,
        "created_at": now.isoformat(),
    }
    # Store full episode in Postgres
    supabase.table("agent_episodes").insert(episode).execute()

    # Store a summary embedding for semantic search
    summary = f"Goal: {goal}. Outcome: {outcome}. Steps: {len(steps)}. Error: {error or 'none'}"
    store_memory(user_id, summary, memory_type="episodic")

Retrieving relevant episodes for planning

def get_relevant_episodes(user_id: str, current_goal: str, top_k: int = 3) -> list[str]:
    # Semantic search over episode summaries
    similar_summaries = retrieve_memories(
        user_id,
        current_goal,
        top_k=top_k,
    )
    # Could also filter by memory_type="episodic" in the SQL query

    # Inject as few-shot context for the agent
    return similar_summaries

def build_agent_prompt_with_episodes(goal: str, user_id: str) -> str:
    episodes = get_relevant_episodes(user_id, goal)
    if not episodes:
        return f"Goal: {goal}"

    episode_context = "\n".join(f"- Past run: {ep}" for ep in episodes)
    return f"""Goal: {goal}

Relevant past experience:
{episode_context}

Use past experience to avoid known failure modes and build on what worked.
"""

LangGraph trace capture

LangGraph's checkpointer already stores the full step-by-step state history. You can replay it as an episode log:

config = {"configurable": {"thread_id": run_id}}
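# After run completes. get_state_history yields the newest snapshot first,
# so reverse it to list the steps in execution order:
history = list(reversed(list(app.get_state_history(config))))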
steps = [
    {
        "node": h.next,
        "step": h.metadata.get("step"),
        "messages": len(h.values.get("messages", [])),
    }
    for h in history
]
save_episode(run_id, user_id, goal, steps, outcome)
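When retrieving episodes for planning, similarity alone ignores how stale a run is: a failure from last year may describe a site or API that has since changed. One option is to blend similarity with a recency weight. The sketch below assumes each episode dict carries a `similarity` score and a `created_at` datetime, which means extending the search to return both; `recency_weight`, `rank_episodes`, the 30-day half-life, and the 0.3 recency share are all illustrative assumptions, not tuned values.

```python
from datetime import datetime, timedelta, timezone


def recency_weight(created_at: datetime, now: datetime, half_life_days: float = 30.0) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    age_days = max(0.0, (now - created_at).total_seconds() / 86400)
    return 0.5 ** (age_days / half_life_days)


def rank_episodes(
    episodes: list[dict],
    now: datetime,
    top_k: int = 3,
    recency_share: float = 0.3,
) -> list[dict]:
    """Order episodes by a blend of similarity and recency, best first."""

    def score(episode: dict) -> float:
        recency = recency_weight(episode["created_at"], now)
        return (1 - recency_share) * episode["similarity"] + recency_share * recency

    return sorted(episodes, key=score, reverse=True)[:top_k]


if __name__ == "__main__":
    now = datetime(2026, 1, 31, tzinfo=timezone.utc)
    episodes = [
        {"id": "old-but-similar", "similarity": 0.80, "created_at": now - timedelta(days=300)},
        {"id": "fresh", "similarity": 0.70, "created_at": now},
    ]
    for episode in rank_episodes(episodes, now):
        print(episode["id"])
```

With these weights, a fresh episode at 0.70 similarity outranks a 300-day-old one at 0.80. Lower `recency_share` if old experience should count for more.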

Semantic memory: user facts, preferences, learned knowledge

Semantic memory stores durable facts — things that are true about the user or the world and don't change turn-to-turn. Examples: user's name, job title, tech stack, preferred communication style, time zone, project names, key relationships. Unlike episodic memory, semantic memory is not tied to a specific event — it's background knowledge that should inform every interaction.

The key design decision: when do you write to semantic memory? Three approaches:

Explicit extraction after each session

At the end of a session (or every N turns), run an extraction pass over the conversation to identify new facts worth storing:

EXTRACTION_PROMPT = """
Review this conversation and extract any durable facts about the user.
Focus on: preferences, professional context, explicit statements of fact, stated goals.
Ignore: transient requests, single-session context, emotional reactions.

For each fact, output:
{{"fact": "...", "confidence": "high|medium", "category": "preference|professional|goal|relationship"}}

Output a JSON array. If no facts worth storing, output [].

Conversation:
{conversation}
"""

def extract_semantic_memories(conversation: list[dict], user_id: str):
    conv_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in conversation
    )
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        system=EXTRACTION_PROMPT.format(conversation=conv_text),
        messages=[{"role": "user", "content": "Extract facts."}],
    )
    try:
        facts = json.loads(response.content[0].text)
        for fact in facts:
            if fact["confidence"] == "high":
                store_memory(user_id, fact["fact"], memory_type=f"semantic_{fact['category']}")
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Extraction failed or returned an unexpected shape: log and continue
        print(f"semantic extraction skipped: {exc}")
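Model output is not guaranteed to match the requested schema, so it can help to validate each extracted item before anything is written to memory. The sketch below is one way to do that with the standard library only. `parse_extracted_facts` is a hypothetical helper; the allowed confidence and category values are taken from the extraction prompt above.

```python
import json

# Allowed values copied from the extraction prompt. Tuples are used so that
# membership checks also work when the model returns an unhashable value.
ALLOWED_CONFIDENCE = ("high", "medium")
ALLOWED_CATEGORIES = ("preference", "professional", "goal", "relationship")


def parse_extracted_facts(raw: str) -> list[dict]:
    """Return only well-formed fact objects and ignore everything else."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    facts = []
    for item in data:
        if not isinstance(item, dict):
            continue
        fact = item.get("fact")
        if not isinstance(fact, str) or not fact.strip():
            continue
        if item.get("confidence") not in ALLOWED_CONFIDENCE:
            continue
        if item.get("category") not in ALLOWED_CATEGORIES:
            continue
        facts.append(
            {
                "fact": fact.strip(),
                "confidence": item["confidence"],
                "category": item["category"],
            }
        )
    return facts
```

Filtering on `confidence == "high"` can then happen on the validated list, where a missing key can no longer raise.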

Structured key-value store for well-known fields

For facts you know you'll always want (name, role, timezone, language preference), use a structured table rather than embeddings. Embeddings are for open-ended retrieval; key-value is for known fields you'll look up directly:

-- Structured user profile
create table user_profiles (
  user_id       text primary key,
  display_name  text,
  role          text,
  timezone      text,
  language      text default 'en',
  tech_stack    jsonb,   -- {"languages": ["python", "typescript"], "infra": "aws"}
  preferences   jsonb,   -- {"response_length": "concise", "code_style": "typed"}
  updated_at    timestamptz default now()
);

# Python: inject into system prompt at session start
def get_user_profile_context(user_id: str) -> str:
    # limit(1) returns an empty list for unknown users, whereas single() raises when no row matches
    result = supabase.table("user_profiles").select("*").eq("user_id", user_id).limit(1).execute()
    if not result.data:
        return ""
    p = result.data[0]
    return f"User: {p['display_name']} | Role: {p['role']} | TZ: {p['timezone']} | Stack: {json.dumps(p.get('tech_stack') or {})}"

LlamaIndex memory modules

LlamaIndex provides a ChatMemoryBuffer and higher-level memory abstractions out of the box:

import os

from llama_index.core.memory import ChatMemoryBuffer, VectorMemory
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.supabase import SupabaseVectorStore

# embed_model, llm, and tools are assumed to be defined elsewhere in your app

# Vector-backed long-term memory
vector_store = SupabaseVectorStore(
    postgres_connection_string=os.environ["DATABASE_URL"],
    collection_name="agent_memory",
)
long_term_memory = VectorMemory.from_defaults(
    vector_store=vector_store,
    embed_model=embed_model,
    retriever_kwargs={"similarity_top_k": 5},
)

# Short-term in-context buffer
short_term_memory = ChatMemoryBuffer.from_defaults(token_limit=4096)

# Compose both into an agent
from llama_index.core.agent import ReActAgent
agent = ReActAgent.from_tools(
    tools=tools,
    llm=llm,
    memory=short_term_memory,   # LlamaIndex uses this for context injection
    verbose=True,
)
# Separately retrieve from long_term_memory and inject into the system prompt

Implementation patterns: LangGraph checkpointers, LlamaIndex memory modules, raw Supabase

Pattern 1: Full LangGraph stack

Use LangGraph's PostgresSaver for episodic/session memory (automatic) and a custom memory node that reads/writes semantic facts to a separate table:

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    user_id: str
    injected_memories: list[str]

def memory_retrieval_node(state: AgentState) -> dict:
    """Run at the start of each turn to inject relevant memories."""
    latest_input = state["messages"][-1].content
    memories = retrieve_memories(state["user_id"], latest_input, top_k=5)
    return {"injected_memories": memories}

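def memory_write_node(state: AgentState) -> dict:
    """Run at the end of each turn to extract and store new facts."""
    # Only run every 10 messages (roughly 5 user/assistant turns) to save cost
    if len(state["messages"]) % 10 == 0:
        # extract_semantic_memories expects role/content dicts, not message objects
        recent = [{"role": m.type, "content": m.content} for m in state["messages"][-10:]]
        extract_semantic_memories(recent, state["user_id"])
    return {}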

graph = StateGraph(AgentState)
graph.add_node("memory_in", memory_retrieval_node)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.add_node("memory_out", memory_write_node)

graph.set_entry_point("memory_in")
graph.add_edge("memory_in", "agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: "memory_out"})
graph.add_edge("tools", "agent")
graph.add_edge("memory_out", END)

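# conn is assumed to be an open psycopg connection; agent_node, tool_node, and
# should_continue are your own functions. Run checkpointer.setup() once to
# create the checkpoint tables.
checkpointer = PostgresSaver(conn)
app = graph.compile(checkpointer=checkpointer)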

Pattern 2: Minimal raw API + Supabase

No frameworks — just the Anthropic SDK, pgvector, and a clean memory abstraction layer. Best when you don't want LangGraph overhead for simpler agents:

class AgentMemory:
    def __init__(self, user_id: str):
        self.user_id = user_id

    def remember(self, content: str, memory_type: str = "conversation"):
        store_memory(self.user_id, content, memory_type)

    def recall(self, query: str, top_k: int = 5) -> list[str]:
        return retrieve_memories(self.user_id, query, top_k)

    def build_system_prompt(self, base_prompt: str, current_query: str) -> str:
        memories = self.recall(current_query)
        if not memories:
            return base_prompt
        memory_block = "\n".join(f"- {m}" for m in memories)
        return f"{base_prompt}\n\nRelevant memory:\n{memory_block}"

# Usage
memory = AgentMemory(user_id="user-456")
system = memory.build_system_prompt(
    "You are a helpful coding assistant.",
    "Help me write a database migration script."
)
response = client.messages.create(
    model="claude-sonnet-4-5",
    system=system,
    messages=[{"role": "user", "content": "Help me write a database migration script."}],
    max_tokens=2048,
)
# Store the exchange
memory.remember("User asked about database migrations. Provided Alembic migration script.")

Memory hygiene: what to store, what to forget, privacy

The instinct is to store everything and retrieve selectively. This is wrong. Storing everything creates retrieval noise, inflates storage costs, and creates privacy and compliance liabilities. Good memory hygiene is as important as good memory architecture.
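One cheap way to cut retrieval noise is to refuse writes that nearly repeat something already stored. The sketch below uses the standard library's `difflib` for the comparison. `normalize` and `is_near_duplicate` are hypothetical helpers, and the 0.9 threshold is an assumption to tune against your own data; in a real system you would more likely compare embeddings, but the write-gate idea is the same.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and strip punctuation before comparing."""
    collapsed = re.sub(r"\s+", " ", text.lower())
    return re.sub(r"[^a-z0-9 ]+", "", collapsed).strip()


def is_near_duplicate(candidate: str, existing: list[str], threshold: float = 0.9) -> bool:
    """True if the candidate is almost identical to any stored memory."""
    cand = normalize(candidate)
    return any(
        difflib.SequenceMatcher(None, cand, normalize(item)).ratio() >= threshold
        for item in existing
    )


if __name__ == "__main__":
    stored = ["User prefers concise answers."]
    for new_memory in ["user  prefers concise answers", "User lives in Berlin"]:
        if is_near_duplicate(new_memory, stored):
            print(f"skip: {new_memory!r}")
        else:
            stored.append(new_memory)
            print(f"store: {new_memory!r}")
```

Comparing against every stored memory is linear in the store size, so in practice you would check only the top few retrieval hits for the candidate.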

What to store

What not to store

TTL and expiry

Implement time-to-live on memory records. Not all memories should live forever:

alter table agent_memories add column expires_at timestamptz;

-- Short-lived: task context, project-specific details (90 days)
-- Medium-lived: professional context, current goals (1 year)
-- Long-lived: durable preferences, communication style (no expiry)

-- Cleanup job (run daily)
delete from agent_memories where expires_at < now();

User-facing memory controls

For any production application storing user-specific memories, you must give users a way to inspect what is stored about them, delete individual memories, and delete everything:

# Memory management API endpoints (FastAPI example)
# In production, take user_id from the authenticated session, not from the request.
@app.get("/api/memories/{user_id}")
async def list_memories(user_id: str):
    result = supabase.table("agent_memories").select("id, content, memory_type, created_at").eq("user_id", user_id).execute()
    return result.data

@app.delete("/api/memories/{memory_id}")
async def delete_memory(memory_id: str, user_id: str):
    supabase.table("agent_memories").delete().eq("id", memory_id).eq("user_id", user_id).execute()
    return {"status": "deleted"}

@app.delete("/api/memories/user/{user_id}/all")
async def delete_all_memories(user_id: str):
    supabase.table("agent_memories").delete().eq("user_id", user_id).execute()
    # Episodes also carry user data, so erase them too
    supabase.table("agent_episodes").delete().eq("user_id", user_id).execute()
    supabase.table("user_profiles").delete().eq("user_id", user_id).execute()
    return {"status": "all memories deleted"}

Avoid storing PII in embeddings

Embeddings are hard to read directly, but inversion attacks can recover a good part of the original text from them, so do not treat them as anonymized. The raw content is also stored alongside them: if you store "User's SSN is 123-45-6789" as a memory, that plaintext exists in your database. Apply the same data handling standards to memory content as you do to any other PII. When in doubt, store a reference ("user has provided identity verification — see user_profiles.verified") rather than the sensitive value itself.
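A simple guard is to scrub obvious identifiers before a memory is embedded or stored. The sketch below is illustrative only: `PII_PATTERNS` and `redact_pii` are hypothetical names, and the two regular expressions (US-style SSNs and email addresses) are assumptions that cover a small fraction of real PII. Treat it as a last line of defense, not as a detection system.

```python
import re

# Illustrative patterns only. Real PII detection needs far broader coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace matched values with placeholders and report which kinds were found."""
    found: list[str] = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label} redacted]", text)
        if count:
            found.append(label)
    return text, found


if __name__ == "__main__":
    cleaned, kinds = redact_pii("User's SSN is 123-45-6789")
    print(cleaned)  # User's SSN is [ssn redacted]
    print(kinds)    # ['ssn']
```

Running the redaction before the embedding call keeps the sensitive value out of both the stored content and the vector.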


FAQ

What is the difference between in-context memory and external memory for AI agents?

In-context memory is everything stored directly in the prompt — conversation history, injected facts, tool results. It's fast and zero-latency but strictly limited by the context window (typically 128K–200K tokens) and wiped at the end of each session. External memory lives outside the model in a database or vector store, persists indefinitely, and is retrieved on demand, but introduces retrieval latency (50–200ms) and the possibility of retrieval errors — relevant memories might not rank highly enough to be retrieved.

How do I give a chatbot long-term memory across sessions?

The standard production pattern: at the end of each session, run an LLM extraction pass over the conversation to identify durable facts and preferences, store them in a vector database keyed by user ID, and at the start of each new session retrieve the top-k most relevant memories based on the current query and inject them into the system prompt. This gives you effective long-term memory without blowing up your context window on every turn.

What vector database should I use for agent memory?

For most teams: pgvector via Supabase is the pragmatic default, with one Postgres stack, minimal ops, and enough headroom for tens of millions of memories. Pinecone if you need managed scale beyond that with zero infra management. Qdrant for self-hosted with strong metadata filtering. Weaviate if you need hybrid BM25 + vector search out of the box. All four are production-proven; choose based on your existing infrastructure and ops capacity.

What is episodic memory in AI agents?

Episodic memory is a record of past agent runs — what the agent did, what tools it called, what the outcomes were, and what errors it encountered. It's distinct from semantic memory (durable facts about the user or world) because it captures procedural, temporal experience tied to specific events. Agents with episodic memory can recall "last time I ran this weekly report, the data API returned a 429 rate limit error at step 3" and proactively add retry logic to the current run.

How do I handle memory privacy and data retention for AI agents?

Treat agent memory like any other user personal data: collect only what you need, store with explicit user consent, implement TTL-based expiry for time-sensitive facts, provide user-facing memory inspection and deletion (needed to honor erasure requests under GDPR and CCPA), keep raw PII out of memory content (the plaintext is stored next to its embedding), and separate memory tiers by sensitivity level. Preferences can persist longer; transactional details should expire within 90 days.