When AI Forgets Your Name Three Messages Later
Have you ever had that frustrating conversation with ChatGPT or Claude? You mention an important detail at the beginning, then after a few exchanges with tangential questions, the model seems to have completely forgotten what you said. Or worse: it changes its mind based on your tone, telling you what you want to hear rather than the truth.
This isn’t a bug. It’s a fundamental limitation of the current large language model architecture.
Today’s LLMs are statistical prodigies capable of generating impressive text, but they have three major flaws:
- No coherent working memory: They treat each response as a fresh start, without persistent internal state.
- No internal reflection mechanism: They generate responses in a single pass, without an “inner monologue” to verify coherence.
- Radical inefficiency for static knowledge: They "recalculate" facts every time, when they should simply "remember" them.
Imagine a developer who forgets your variable names every three lines of code, or who has to reread all of React's documentation every time they write a useState(). That's exactly what current LLMs do.
But two revolutionary architectures are changing the game: MIRROR and Engram.
They don’t just improve performance. They redefine what it means to “think” and “remember” for an AI.
MIRROR: Giving AI an Inner Monologue
The Problem: An AI Without Mental State
Humans don’t think in a single pass. Before answering a complex question, you:
- Reflect (you mentally explore different paths)
- Synthesize (you consolidate your ideas into a coherent mental model)
- Respond (you formulate a clear answer)
Classic LLMs skip straight to step 3. They generate a response without this internal reflection process. Result:
- Sycophancy: They prioritize agreeing with you over truth or safety.
- Attention deficits: They forget critical information mentioned earlier in the conversation.
- Inconsistency: They struggle to prioritize contradictory constraints (e.g., your safety vs. your stated preferences).
This is exactly what the MIRROR (Modular Internal Reasoning, Reflection, Orchestration, and Response) architecture solves.
The Architecture: Separating Thought from Speech
MIRROR functions as a two-layer system:
1. The Thinker: The Internal Consciousness
The Thinker maintains a persistent internal narrative — a kind of “mental model” that evolves throughout the conversation. It consists of two modules:
a) The Inner Monologue Manager. This module orchestrates three parallel reasoning threads:
- Goals: What is the user really looking for? What are their intentions?
- Reasoning: What are the logical implications? What thought patterns are emerging?
- Memory: What key facts have been mentioned? What preferences are stable?
b) The Cognitive Controller. It synthesizes these three threads into a unified narrative that serves as working memory. This narrative is updated at each conversation turn and serves as the basis for response generation.
2. The Talker: The External Voice
The Talker uses the internal narrative to generate coherent and contextually appropriate responses. It reflects the system’s current “state of consciousness”.
Temporal decoupling: In production, the Thinker can continue reflecting asynchronously while the Talker responds immediately. This ensures low latency while allowing deep background reflection.
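To make the separation concrete, here is a minimal sketch of the Thinker/Talker loop. The prompts, the class layout, and the generic `llm(prompt)` call are illustrative assumptions for this article, not the paper's reference implementation.

```python
# Minimal sketch of a MIRROR-style Thinker/Talker loop. The prompts, class
# layout, and `llm(prompt)` call are illustrative assumptions, not the
# paper's reference implementation.

THREADS = ("goals", "reasoning", "memory")

class MirrorAgent:
    def __init__(self, llm):
        self.llm = llm
        self.narrative = ""  # persistent internal narrative (working memory)

    def think(self, history: list[str]) -> None:
        """Thinker: update the internal narrative from the conversation so far."""
        convo = "\n".join(history)
        # 1) Inner Monologue Manager: three parallel reasoning threads.
        notes = {
            t: self.llm(f"Focus only on the user's {t}.\nConversation:\n{convo}\nNotes:")
            for t in THREADS
        }
        # 2) Cognitive Controller: condense the threads into one coherent narrative.
        self.narrative = self.llm(
            "Merge these notes into a short, consistent mental model of the user "
            "and the task:\n" + "\n".join(notes.values())
        )

    def talk(self, history: list[str], user_msg: str) -> str:
        """Talker: answer using the current narrative as working memory."""
        convo = "\n".join(history)
        return self.llm(
            f"Internal narrative:\n{self.narrative}\n\n"
            f"Conversation:\n{convo}\nUser: {user_msg}\nAssistant:"
        )

# In production, talk() can respond immediately while think() re-runs
# asynchronously on the updated history (temporal decoupling).
```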
Performance: +156% on Critical Scenarios
MIRROR was evaluated on the CuRaTe benchmark, designed to test multi-turn dialogues with critical safety constraints and contradictory preferences.
| Metric | Baseline | With MIRROR | Improvement |
|---|---|---|---|
| Average success rate | 69% | 84% | +21% |
| Maximum performance (Llama 4 Scout) | - | 91% | - |
| Critical scenario (3 people) | - | - | +156% |
The benefits are model-agnostic: MIRROR improves GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro, Llama 4, and Mistral 3.
Why this spectacular improvement? Because MIRROR transforms a potentially infinite conversation history into actionable understanding via a three-step pipeline:
- Multi-dimensional exploration (thought threads)
- Condensation into coherent mental model (internal narrative)
- Contextual application (response)
This is exactly what a senior developer does when analyzing a complex bug: they don’t respond immediately, they think first.
Engram: When Memory Replaces Computation
The Problem: Recalculating What Should Be Remembered
Imagine a developer who has to reread all of Python's documentation every time they write print(). Absurd, right?
Yet that’s exactly what current Transformers do. To identify an entity like “Diana, Princess of Wales”, an LLM must:
- Pass tokens through several attention layers
- Progressively aggregate contextual features
- “Recalculate” every time what should be a simple knowledge lookup
It’s as if your brain had to recalculate that 2+2=4 every time rather than simply knowing it.
The Engram architecture solves this problem by introducing conditional memory — a constant-time O(1) lookup system for static knowledge.
The Architecture: O(1) Lookup via Hashed N-grams
Engram modernizes the classic N-gram embedding approach to create a scalable memory module.
1. Sparse Retrieval
a) Tokenizer Compression. Raw token IDs are projected onto canonical IDs via textual normalization (NFKC, lowercase). This reduces the effective vocabulary size by approximately 23% for a 128k tokenizer, increasing semantic density.
b) Multi-Head Hashing. For each N-gram (sequence of N tokens), the system uses K distinct hash functions. Each hash head maps the local context to an index in an embedding table. This mitigates collisions and allows retrieval of a memory vector.
The result? A system that can “look up” knowledge in constant time, instead of “recalculating” it through multiple Transformer layers.
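A toy version of this lookup path could look like the sketch below. The table size, N, the number of hash heads, and the normalization choices are placeholder assumptions for illustration, not Engram's published configuration.

```python
import hashlib
import unicodedata

import torch

# Toy sketch of Engram-style sparse retrieval: hashed N-grams -> O(1) embedding
# lookup. Table size, N, number of hash heads, and normalization are placeholder
# assumptions for illustration, not Engram's published configuration.

NUM_HEADS = 4          # K distinct hash heads to mitigate collisions
TABLE_SIZE = 1 << 20   # rows per embedding table
DIM = 256
N = 2                  # bigrams

tables = torch.nn.ModuleList(
    torch.nn.Embedding(TABLE_SIZE, DIM) for _ in range(NUM_HEADS)
)

def canonical(token: str) -> str:
    """Tokenizer compression: NFKC + lowercase folds surface variants together."""
    return unicodedata.normalize("NFKC", token).lower()

def ngram_index(ngram: tuple[str, ...], head: int) -> int:
    """Deterministic hash of the local N-gram for one head."""
    digest = hashlib.blake2b(
        "|".join(ngram).encode(), key=head.to_bytes(2, "big"), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE

def retrieve(tokens: list[str]) -> torch.Tensor:
    """One memory vector e_t per position, summed over the K hash heads."""
    canon = [canonical(t) for t in tokens]
    vectors = []
    for t in range(len(canon)):
        ngram = tuple(canon[max(0, t - N + 1): t + 1])
        rows = [ngram_index(ngram, h) for h in range(NUM_HEADS)]
        vectors.append(sum(tables[h](torch.tensor(rows[h])) for h in range(NUM_HEADS)))
    return torch.stack(vectors)   # shape: (seq_len, DIM)
```

Each position costs a handful of hashes and table reads, regardless of model depth: that is the constant-time lookup replacing layers of recomputation.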
2. Context-Aware Gating
The retrieved memory vector (e_t) is a static prior that may contain noise. To integrate it intelligently, Engram uses an attention-inspired gating mechanism:
- The current Transformer hidden state (h_t) acts as a Query
- The retrieved memory (e_t) serves as the source for Key and Value
- A gate scalar (α_t) is computed to modulate the memory’s contribution
If memory contradicts the dynamic context, the gate closes (α_t → 0), suppressing the noise.
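In pseudo-PyTorch, that gating could be written as follows. The dimensions and the exact parameterization of the gate are assumptions based on the description above, not Engram's actual code.

```python
import torch

# Sketch of Engram-style context-aware gating. Dimensions and the exact gate
# parameterization are assumptions based on the description in the article.

class MemoryGate(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)   # query from hidden state h_t
        self.k_proj = torch.nn.Linear(dim, dim)   # key from memory vector e_t
        self.v_proj = torch.nn.Linear(dim, dim)   # value from memory vector e_t
        self.scale = dim ** -0.5

    def forward(self, h_t: torch.Tensor, e_t: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(h_t), self.k_proj(e_t), self.v_proj(e_t)
        # Gate scalar alpha_t in (0, 1): how much the static prior should
        # contribute given the dynamic context.
        alpha = torch.sigmoid((q * k).sum(-1, keepdim=True) * self.scale)
        # If the memory contradicts the context, alpha -> 0 and the prior is suppressed.
        return h_t + alpha * v
```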
The U-Shaped Scaling Law: The Compute-Memory Alliance
Engram isn’t just a module. It’s a new sparsity axis complementary to Mixture-of-Experts (MoE).
An analysis revealed a U-shaped relationship between allocating sparsity parameters to computation (MoE experts) and memory (Engram):
- Too much compute, not enough memory → Inefficiency (constant recalculation)
- Too much memory, not enough compute → Performance plateaus
- Optimal point (20-25% memory) → Strictly outperforms pure MoE models
This is a major finding: the future isn’t in bigger models, but in smarter hybrid models.
Performance: Better at Reasoning, Not Just Memorization
Engram-27B and Engram-40B models were evaluated by reallocating parameters from a baseline MoE model.
| Benchmark | Category | Gain (Engram vs MoE) |
|---|---|---|
| BBH | Complex Reasoning | +5.0 |
| CMMLU | Cultural Knowledge | +4.0 |
| ARC-Challenge | Scientific Reasoning | +3.7 |
| MMLU | General Knowledge | +3.4 |
| HumanEval | Code Generation | +3.0 |
| MATH | Mathematical Reasoning | +2.4 |
Surprising: the largest gains aren’t in pure memorization, but in complex reasoning, code, and math.
Why? Because Engram frees the model’s early layers from the task of reconstructing static patterns. This increases the network’s “effective depth” available for abstract reasoning.
It’s like delegating system memory management to an optimized OS, freeing your CPU for more complex computations.
System Efficiency: Memory Offloading from RAM or NVMe
Engram’s retrieval index is deterministic: it depends solely on the input token sequence, not on runtime hidden state (unlike MoE routing).
This property allows asynchronous prefetching of necessary embeddings from:
- CPU RAM
- NVMe disks via PCIe bus
This masks communication latency and allows extending the model’s memory to hundreds of billions of parameters with negligible performance overhead (< 3%), bypassing GPU VRAM limitations.
Imagine being able to extend your LLM’s memory like you add RAM to your PC, without having to buy additional GPUs. That’s exactly what Engram makes possible.
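Because the rows to fetch are known from the token IDs alone, they can be requested from host memory before the forward pass reaches the Engram layer. A rough sketch of that idea, where `offloaded_table.fetch(rows)` is a hypothetical helper standing in for a CPU-RAM or NVMe-backed embedding store:

```python
from concurrent.futures import ThreadPoolExecutor

# Why determinism enables prefetching: the embedding rows to load are known
# from the token IDs alone, before any forward pass runs.
# `offloaded_table.fetch(rows)` is a hypothetical helper standing in for a
# CPU-RAM / NVMe-backed embedding store; it is not a real Engram API.

def prefetch_memory(token_ids: list[int], ngram_indexer, offloaded_table,
                    pool: ThreadPoolExecutor):
    # 1) Compute every row index from the tokens only (no hidden state needed).
    rows = [ngram_indexer(token_ids, t) for t in range(len(token_ids))]
    # 2) Start the host-to-GPU transfer asynchronously while attention layers run.
    return pool.submit(offloaded_table.fetch, rows)

# Later, just before the Engram layer needs e_t:
#   e = future.result()   # usually already resolved, so the latency is hidden
```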
ENGRAM-R: Optimizing Reasoning with “Fact Cards”
Beyond architectural integration, modular memory principles are applied at the system level to manage long conversations and optimize large reasoning models (LRM).
The ENGRAM System: Typed Memory Inspired by Cognitive Science
Inspired by cognitive theories, this system organizes conversational memory into three distinct stores:
- Episodic Memory: Events and interactions with temporal context (e.g., “user moved to Seattle last year”)
- Semantic Memory: Facts, observations, and stable preferences (e.g., “user’s favorite color is green”)
- Procedural Memory: Instructions and processes (e.g., “tax filing deadline is April 15”)
At each conversation turn, the system routes information to the relevant store(s). During a query, dense similarity search is performed to retrieve the most relevant context.
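A minimal mock-up of that routing and retrieval is shown below. The store layout, the router prompt, and the `embed()` function are assumptions for illustration, not the system's actual implementation.

```python
import numpy as np

# Illustrative sketch of an ENGRAM-style typed memory. The router prompt,
# record format, and embed() function are assumptions, not the paper's code.

STORES = ("episodic", "semantic", "procedural")

class TypedMemory:
    def __init__(self, llm, embed):
        self.llm, self.embed = llm, embed
        self.stores = {s: [] for s in STORES}   # (text, vector, turn) records

    def write(self, turn: int, utterance: str) -> None:
        """Route new information to the relevant store."""
        label = self.llm(
            f"Classify as episodic, semantic or procedural: {utterance}"
        ).strip().lower()
        store = label if label in STORES else "semantic"
        self.stores[store].append((utterance, self.embed(utterance), turn))

    def read(self, query: str, k: int = 3) -> list[tuple[str, int]]:
        """Dense similarity search across all stores, returning the top-k records."""
        q = self.embed(query)
        scored = [
            (float(np.dot(q, vec)), text, turn)
            for records in self.stores.values()
            for text, vec, turn in records
        ]
        return [(text, turn) for _, text, turn in sorted(scored, reverse=True)[:k]]
```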
ENGRAM-R: “Fact Cards” to Reduce Redundant Thinking
ENGRAM-R introduces two mechanisms to drastically reduce reasoning computational cost:
1. Fact Cards Rendering. Rather than injecting verbose conversation excerpts into the context, retrieved records are transformed into compact, auditable cards:
[E1, A moved to Seattle, Turn 1]
[S2, Favorite color: green, Turn 5]
[P3, Tax deadline: April 15, Turn 12]
2. Direct Citation Mechanism. The LRM is explicitly instructed to use these cards as the source of truth and cite them directly in its chain of thought:
“To answer Q1, E1 shows that A lives in Seattle. Answer: Seattle. Cite [E1].”
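Rendering the retrieved records into cards and instructing the reasoner to cite them could look like this. The card format follows the examples above; the prompt wording is an assumption.

```python
# Sketch of Fact Cards rendering and the citation instruction. The card format
# follows the examples above; the prompt wording is an assumption.

def render_fact_cards(records: list[tuple[str, str, int]]) -> str:
    """records: (card id, fact text, turn) -> compact, auditable cards."""
    return "\n".join(f"[{rid}, {fact}, Turn {turn}]" for rid, fact, turn in records)

def build_reasoning_prompt(question: str, records) -> str:
    cards = render_fact_cards(records)
    return (
        "Use ONLY the fact cards below as the source of truth and cite their IDs "
        "(e.g. [E1]) in your chain of thought.\n"
        f"{cards}\n\nQuestion: {question}\nReasoning:"
    )

# Example:
# build_reasoning_prompt("Where does A live?",
#     [("E1", "A moved to Seattle", 1), ("S2", "Favorite color: green", 5)])
```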
Efficiency Gains: -89% Tokens, +2.5% Accuracy
Evaluation on long conversation benchmarks (LoCoMo: 16k tokens, LongMemEval: 115k tokens):
| Metric | Full-Context | ENGRAM-R | Change |
|---|---|---|---|
| Input Tokens (LoCoMo) | 28,371,703 | 3,293,478 | ≈ −89% |
| Reasoning Tokens | 1,335,988 | 378,424 | ≈ −72% |
| Accuracy (Multi-hop) | 72.0% | 74.5% | +2.5% |
| Accuracy (Temporal) | 67.3% | 69.2% | +1.9% |
Transforming history into a compact, citable evidence base allows:
- Massively reducing computational costs
- Maintaining or even improving accuracy
- Making reasoning traceable and auditable
It’s the AI equivalent of what a senior developer does: they don’t reread all the code every time, they maintain a compact mental model of critical parts.
The Future: AI That Thinks and Remembers Like Us
The Mutation of Architectures
MIRROR and Engram aren’t incremental optimizations. They signal a paradigm shift:
From: Monolithic models that recalculate everything each pass
To: Hybrid compute-memory systems that think, remember, and reason
This mutation is directly inspired by cognitive science:
- Working memory (MIRROR’s Cognitive Controller)
- Typed long-term memory (episodic, semantic, procedural)
- Information compression (Fact Cards)
- Inner monologue (parallel reasoning threads)
Architectures like XMem and Memoria already reproduce human psychological effects such as primacy, recency, and temporal contiguity.
The RAG vs Full-Context Debate
The Convomem benchmark revealed an important nuance: for the first 150 conversations, a full-context approach (providing all history) outperforms sophisticated RAG systems (70-82% accuracy vs 30-45%).
This suggests that conversational memory benefits from a “small corpus advantage” where exhaustive search is possible and preferable. Direct application of generalist RAG solutions isn’t always optimal.
The future will likely be hybrid:
- Full context for short conversations
- Typed memory + Fact Cards for long conversations
- O(1) retrieval for static knowledge
Impact on Developers and Creators
For us developers and creators, these architectures redefine what we can expect from an LLM:
Today: “ChatGPT is an assistant that sometimes forgets and contradicts itself”
Tomorrow: “My AI maintains a coherent mental model of my project over weeks”
Imagine:
- A development agent that remembers your code conventions and architectural preferences over months
- An e-commerce assistant that maintains a nuanced understanding of your business constraints and customers
- A support system that never asks you the same information twice
These architectures aren’t just performance gains. They make AI truly usable for complex long-term tasks.
Conclusion: The Dawn of Truly Cognitive AI
For years, we’ve improved LLMs by making them bigger: more parameters, more data, more compute.
MIRROR and Engram show us another path: making AI smarter, not just bigger.
By giving them an inner monologue, working memory, and efficient lookup capability, we’re not just improving performance. We’re creating systems that can truly think and remember.
The question is no longer “What model size is necessary?” but “What cognitive architecture is optimal?”.
And you? How do you envision exploiting these architectures in your projects? An assistant that maintains coherent memory of your codebase? A support system that truly understands your users long-term? An agent that reasons before acting?
The future of AI is no longer measured in billions of parameters, but in depth of reflection.