What Makes Memory Work? Evaluating Long-Term Memory for Large Language Models

Today’s large language models (LLMs) can dazzle us with their ability to write essays, explain complex topics, and answer a remarkable range of questions. Yet a critical limitation remains: they lack true memory. Ask an LLM about a document it processed yesterday, or expect it to learn from earlier mistakes in a conversation, and its brilliance often falters.

A new wave of research is exploring how AI agents can remember, reflect, and improve over time. While previous studies have focused on the development of more complex memory systems, it remains unclear which memory architectures are most effective for long-context conversational tasks. In a new paper (currently under review), we compare these approaches to understand what makes LLM memory effective and where current methods fall short. 

Why AI Needs Memory 

If you’ve ever spoken to an LLM for more than a few messages, you may have noticed it starts to lose track of earlier parts of the conversation. That’s because most models only see a limited window of text at a time. 

This raises a critical question. How can LLMs effectively support:

  • Long multi-turn conversations
  • Documents spanning hundreds of pages
  • Tasks extending across days or weeks?

The answer: they need true memory. Not just longer context windows, but structured ways of storing, retrieving, and learning from past interactions. Our research explores exactly that. We tested a variety of memory systems for LLMs, not by fine-tuning their weights, but by giving them tools to remember and reflect.

Different Types of Memory

The study explored several distinct types of working and long-term memory inspired by how humans store and use information. Each plays a different role in helping language models handle complex, long-term interactions. 

Semantic Memory 

Semantic memory refers to a model’s store of general knowledge: facts, concepts, and accumulated information from past interactions. In language models, it can be implemented in several ways:

  • Full-Context Prompting includes the entire conversation history in the prompt. It is simple, but inefficient for long inputs.
  • Retrieval-Augmented Generation (RAG) improves efficiency by retrieving only the most relevant past information. These retrieved snippets are added to the prompt at runtime, helping the model stay grounded in context while keeping inputs compact.
  • Agentic Memory (A-Mem) adds dynamic control, allowing the model to update and reorganise its memory over time, creating a more structured and evolving knowledge base.

Think of semantic memory like your mental encyclopedia or dictionary, your stored knowledge of facts and concepts. For example, you know that Paris is the capital of France or that a bicycle has two wheels. You don’t need to remember when or where you learned it; it’s just general knowledge you carry with you.
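To make the retrieve-then-prompt pattern behind RAG concrete, here is a minimal, self-contained sketch. It uses a toy bag-of-words similarity in place of the learned embedding retrievers used in practice, and all names and stored snippets are illustrative, not from the paper:

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, memory, k=2):
    """Return the k stored snippets most similar to the query."""
    return sorted(memory, key=lambda s: cosine_sim(query, s), reverse=True)[:k]


def build_prompt(query, memory, k=2):
    """Assemble a compact prompt from retrieved context plus the query."""
    context = "\n".join(retrieve(query, memory, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


memory = [
    "The user's dog is named Biscuit.",
    "The user works as a nurse in Leeds.",
    "The user is allergic to peanuts.",
]
print(build_prompt("What is the name of the user's dog?", memory, k=1))
```

The key property is the one described above: only the most relevant snippet reaches the prompt, so the input stays compact no matter how long the stored history grows.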

Episodic Memory 

Episodic memory focuses on remembering specific past experiences. For language models, this means storing, recalling, and reflecting on previous interactions, decisions, or errors. When similar questions arise later, the model can draw on these examples to make more informed or consistent responses. 

Episodic memory is like your personal diary of past experiences. For instance, you remember your last birthday party and details such as the people who attended, what you did, and how you felt. When someone asks about your birthday, you recall those specific events, not just the general idea of what a birthday is.
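A minimal sketch of this idea, assuming a simple episode store with word-overlap recall (the class names, the toy similarity, and the example episodes are illustrative, not the paper's actual implementation):

```python
import re
from dataclasses import dataclass


@dataclass
class Episode:
    question: str
    answer: str
    correct: bool


class EpisodicMemory:
    """Stores past Q/A episodes and recalls the most relevant ones."""

    def __init__(self):
        self.episodes = []

    def record(self, question, answer, correct):
        self.episodes.append(Episode(question, answer, correct))

    def recall(self, query, k=2):
        # Rank episodes by word overlap with the new query (toy similarity).
        q = set(re.findall(r"[a-z']+", query.lower()))
        def overlap(ep):
            return len(q & set(re.findall(r"[a-z']+", ep.question.lower())))
        return sorted(self.episodes, key=overlap, reverse=True)[:k]

    def as_examples(self, query, k=2):
        # Format recalled episodes as in-context examples, flagging past
        # mistakes so the model can avoid repeating them.
        lines = []
        for ep in self.recall(query, k):
            tag = "correct" if ep.correct else "INCORRECT, avoid repeating"
            lines.append(f"Q: {ep.question}\nA: {ep.answer}  [{tag}]")
        return "\n\n".join(lines)


memory = EpisodicMemory()
memory.record("What is the capital of Australia?", "Sydney", correct=False)
memory.record("What is 2 + 2?", "4", correct=True)
print(memory.as_examples("What is the capital city of Australia?", k=1))
```

Note that failed episodes are recalled alongside successful ones; surfacing a labelled mistake is what lets the model learn from past failures rather than repeat them.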

Procedural Memory 

This type of memory is about learning how to perform tasks or follow procedures. In the study, models were able to refine their own prompts over time based on prior mistakes or successes. This allowed them to adjust their reasoning strategies and improve how they approach certain types of problems. 

Procedural memory is your “how-to” memory, such as the skills and processes you’ve learned. For example, once you learn how to ride a bike, you don’t have to think through every step each time; your body just knows what to do. It’s also like typing on a keyboard or tying your shoelaces. These are automatic skills developed through practice.
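One simple way to sketch this self-refinement loop is a system prompt that accumulates rules distilled from feedback. This is an illustrative sketch of the general idea, not the study's actual mechanism, and all names and rules here are invented:

```python
class ProceduralMemory:
    """Accumulates task rules learned from feedback into the system prompt."""

    def __init__(self, base_prompt):
        self.base_prompt = base_prompt
        self.rules = []

    def learn(self, rule):
        # Keep the prompt compact: store each learned rule only once.
        if rule not in self.rules:
            self.rules.append(rule)

    def prompt(self):
        # Render the base prompt plus any lessons learned so far.
        if not self.rules:
            return self.base_prompt
        learned = "\n".join(f"- {r}" for r in self.rules)
        return f"{self.base_prompt}\n\nLessons from past attempts:\n{learned}"


pm = ProceduralMemory("You are a careful assistant.")
pm.learn("Give dates in ISO 8601 format.")           # learned after a formatting error
pm.learn("Give dates in ISO 8601 format.")           # duplicates are ignored
pm.learn("Show intermediate steps for arithmetic.")  # learned after a wrong sum
print(pm.prompt())
```

Unlike the episodic store, this memory holds no specific experiences, only the distilled "how-to" that survives them, which is exactly the encyclopedia-versus-diary distinction drawn above.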

So, Which Memory System Works Best? 

We ran all these memory methods through a challenging benchmark: LoCoMo, a synthetic dataset simulating very long conversations with over 9,000 tokens per chat. We tested how well the models answered different types of questions about the conversations, from simple factual ones to tricky adversarial and temporal queries. 

Here’s what we found: 

  • For semantic memory, RAG performed best overall; it is efficient, accurate, and lightweight. 
  • Agentic Memory (A-Mem) was powerful but expensive, requiring complex formatting and significant model overhead. Its main advantage lies in the way it dynamically clusters similar memories together. 
  • Episodic memory added a strong boost, especially when it came to learning from past failures. 
  • Procedural memory had mixed results but might shine in tasks where complex rules and workflows matter. 
  • Full-context prompting, while often accurate, was too slow and too resource-intensive to scale, and its outputs were hard to interpret. 

Why This Matters 

This research extends beyond benchmarks, suggesting a fundamental shift: toward models that don’t merely generate text, but actively remember and evolve.

In practical terms, memory-augmented LLMs can: 

  • Better adapt to users over time 
  • Reduce redundancy in long conversations 
  • Make fewer mistakes by learning from past ones 
  • Support real-world agents in dynamic, long-term settings 

What’s Next? 

This research is part of a growing trend toward non-parametric learning: teaching AI to improve without retraining the model itself. Instead, we give it memory in the form of summaries, reflections, and past examples. 

Future work could explore: 

  • Smarter memory selection (what to store, forget, compress) 
  • Agentic memory systems that manage their own knowledge dynamically 
  • An evaluation of textual memory as an alternative to classic reinforcement learning, especially in low-resource settings 

AI systems need to remember not just what they read, but what they did, what they got wrong, and how they got better. This research shows that it’s not only possible, but increasingly practical. Memory might be key to making AI feel less like a chatbot and more like a real assistant. 
