
When AI Forgets Your Name Three Messages Later

Have you ever had that frustrating conversation with ChatGPT or Claude? You mention an important detail at the beginning, then after a few exchanges with tangential questions, the model seems to have completely forgotten what you said. Or worse: it changes its mind based on your tone, telling you what you want to hear rather than the truth.

This isn’t a bug. It’s a fundamental limitation of the current large language model architecture.

Today’s LLMs are statistical prodigies capable of generating impressive text, but they have three major flaws:

  1. No coherent working memory: They treat each response as a fresh start, without persistent internal state.
  2. No internal reflection mechanism: They generate responses in a single pass, without an “inner monologue” to verify coherence.
  3. Radical inefficiency for static knowledge: They “recalculate” on every pass facts that they should simply “remember”.

Imagine a developer who forgets your variable names every three lines of code, or who has to reread all of React’s documentation every time they write a useState(). That’s exactly what current LLMs do.

But two revolutionary architectures are changing the game: MIRROR and Engram.

They don’t just improve performance. They redefine what it means to “think” and “remember” for an AI.


MIRROR: Giving AI an Inner Monologue

The Problem: An AI Without Mental State

Humans don’t think in a single pass. Before answering a complex question, you:

  • Reflect (you mentally explore different paths)
  • Synthesize (you consolidate your ideas into a coherent mental model)
  • Respond (you formulate a clear answer)

Classic LLMs skip straight to step 3. They generate a response without this internal reflection process. Result:

  • Sycophancy: They prioritize agreeing with you over truth or safety.
  • Attention deficits: They forget critical information mentioned earlier in the conversation.
  • Inconsistency: They struggle to prioritize contradictory constraints (e.g., your safety vs. your stated preferences).

This is exactly what the MIRROR (Modular Internal Reasoning, Reflection, Orchestration, and Response) architecture solves.

The Architecture: Separating Thought from Speech

MIRROR functions as a two-layer system:

1. The Thinker: The Internal Consciousness

The Thinker maintains a persistent internal narrative — a kind of “mental model” that evolves throughout the conversation. It consists of two modules:

a) The Inner Monologue Manager
This module orchestrates three parallel reasoning threads:

  • Goals: What is the user really looking for? What are their intentions?
  • Reasoning: What logical implications? What thought patterns are emerging?
  • Memory: What key facts have been mentioned? What preferences are stable?

b) The Cognitive Controller
It synthesizes these three threads into a unified narrative that serves as working memory. This narrative is updated at each conversation turn and forms the basis for response generation.

2. The Talker: The External Voice

The Talker uses the internal narrative to generate coherent and contextually appropriate responses. It reflects the system’s current “state of consciousness”.

Temporal decoupling: In production, the Thinker can continue reflecting asynchronously while the Talker responds immediately. This ensures low latency while allowing deep background reflection.
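
To make this concrete, here is a minimal sketch of what a MIRROR-style Thinker/Talker loop could look like. Everything in it is an illustrative assumption rather than the paper’s actual implementation: the llm() helper stands in for any chat model, and the prompts are deliberately simplified.

```python
# Minimal sketch of a MIRROR-style Thinker/Talker loop (illustrative only).
# `llm(prompt)` is a hypothetical helper wrapping any chat-completion API.
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Placeholder for a call to any chat model (GPT-4o, Claude, Llama...)."""
    raise NotImplementedError

@dataclass
class MirrorAgent:
    narrative: str = ""                           # persistent internal narrative
    history: list = field(default_factory=list)

    def think(self, user_msg: str) -> None:
        """Thinker: Inner Monologue Manager + Cognitive Controller."""
        # Three parallel reasoning threads: goals, reasoning, memory.
        threads = {
            focus: llm(
                f"Focus only on {focus}.\n"
                f"Conversation so far: {self.history}\n"
                f"Latest user message: {user_msg}\nNotes:"
            )
            for focus in ("goals", "reasoning", "memory")
        }
        # Cognitive Controller: condense the threads into one updated narrative.
        self.narrative = llm(
            f"Previous narrative: {self.narrative}\n"
            f"New thoughts: {threads}\n"
            "Synthesize a single coherent internal narrative:"
        )

    def talk(self, user_msg: str) -> str:
        """Talker: answer using the internal narrative as working memory."""
        reply = llm(
            f"Internal narrative: {self.narrative}\n"
            f"User: {user_msg}\nAssistant:"
        )
        self.history.append((user_msg, reply))
        return reply

    def turn(self, user_msg: str) -> str:
        self.think(user_msg)   # in production, this step runs asynchronously
        return self.talk(user_msg)
```

In a production setup, think() would keep running in the background while talk() answers right away, which is exactly the temporal decoupling described above.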

Performance: +156% on Critical Scenarios

MIRROR was evaluated on the CuRaTe benchmark, designed to test multi-turn dialogues with critical safety constraints and contradictory preferences.

Metric | Baseline | With MIRROR | Improvement (relative)
Average success rate | 69% | 84% | +21%
Maximum performance (Llama 4 Scout) | - | 91% | -
Critical scenario (3 people) | - | - | +156%

The benefits are model-agnostic: MIRROR improves GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro, Llama 4, and Mistral 3.

Why this spectacular improvement? Because MIRROR transforms a potentially infinite conversation history into actionable understanding via a three-step pipeline:

  1. Multi-dimensional exploration (thought threads)
  2. Condensation into coherent mental model (internal narrative)
  3. Contextual application (response)

This is exactly what a senior developer does when analyzing a complex bug: they don’t respond immediately, they think first.


Engram: When Memory Replaces Computation

The Problem: Recalculating What Should Be Remembered

Imagine a developer who has to reread all of Python’s documentation every time they write print(). Absurd, right?

Yet that’s exactly what current Transformers do. To identify an entity like “Diana, Princess of Wales”, an LLM must:

  1. Pass tokens through several attention layers
  2. Progressively aggregate contextual features
  3. “Recalculate” every time what should be a simple knowledge lookup

It’s as if your brain had to recalculate that 2+2=4 every time rather than simply knowing it.

The Engram architecture solves this problem by introducing conditional memory — a constant-time O(1) lookup system for static knowledge.

The Architecture: O(1) Lookup via Hashed N-grams

Engram modernizes the classic N-gram embedding approach to create a scalable memory module.

1. Sparse Retrieval

a) Tokenizer Compression
Raw token IDs are projected onto canonical IDs via textual normalization (NFKC, lowercase). This reduces the effective vocabulary size by approximately 23% for a 128k tokenizer, increasing semantic density.

b) Multi-Head Hashing
For each N-gram (sequence of N tokens), the system uses K distinct hash functions. Each hash head maps the local context to an index in an embedding table. This mitigates collisions and allows retrieval of a memory vector.

The result? A system that can “look up” knowledge in constant time, instead of “recalculating” it through multiple Transformer layers.
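
As a rough sketch of that lookup path (not the paper’s actual code), here is how hashed N-gram retrieval can work in a few lines of Python. The table size, embedding dimension, the use of Python’s built-in hash, and the mean-pooling across heads are all simplifying assumptions; a real system would use fixed hash functions and trained embedding tables.

```python
# Sketch of an Engram-style O(1) N-gram lookup with multi-head hashing.
# Table size, embedding dimension and the hash function are illustrative;
# real systems use fixed hashes and learned (trained) embedding tables.
import unicodedata
import numpy as np

NUM_HEADS = 4          # K distinct hash heads
TABLE_SIZE = 1 << 16   # slots per head
EMB_DIM = 64

# One embedding table per hash head (randomly initialized here).
tables = [np.random.randn(TABLE_SIZE, EMB_DIM).astype(np.float32)
          for _ in range(NUM_HEADS)]

def canonical_id(token: str) -> str:
    """Tokenizer compression: NFKC normalization + lowercasing."""
    return unicodedata.normalize("NFKC", token).lower()

def ngram_lookup(tokens: list[str], n: int = 3) -> np.ndarray:
    """Retrieve a memory vector for the trailing N-gram in constant time."""
    ngram = tuple(canonical_id(t) for t in tokens[-n:])
    vecs = []
    for head in range(NUM_HEADS):
        # Salting the hash with the head index gives K independent mappings,
        # so a collision in one head is diluted by the others.
        idx = hash((head,) + ngram) % TABLE_SIZE
        vecs.append(tables[head][idx])
    return np.mean(vecs, axis=0)

memory_vec = ngram_lookup(["Diana", ",", "Princess", "of", "Wales"])
```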

2. Context-Aware Gating

The retrieved memory vector (e_t) is a static prior that may contain noise. To integrate it intelligently, Engram uses an attention-inspired gating mechanism:

  • The current Transformer hidden state (h_t) acts as a Query
  • The retrieved memory (e_t) serves as the source for Key and Value
  • A gate scalar (α_t) is computed to modulate the memory’s contribution

If memory contradicts the dynamic context, the gate closes (α_t → 0), suppressing the noise.
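
Here is a minimal sketch of such a gate, assuming NumPy arrays for h_t and e_t and random matrices in place of the learned projections; the score function and the way the gated memory is added back are simplifications.

```python
# Sketch of Engram's context-aware gating; the learned projections are
# replaced by random matrices and the score function is simplified.
import numpy as np

D = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((D, D)) for _ in range(3))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory(h_t: np.ndarray, e_t: np.ndarray) -> np.ndarray:
    """Blend the retrieved memory e_t into the hidden state h_t."""
    q = h_t @ W_q                          # hidden state acts as the Query
    k = e_t @ W_k                          # memory provides the Key...
    v = e_t @ W_v                          # ...and the Value
    alpha = sigmoid(q @ k / np.sqrt(D))    # scalar gate α_t in (0, 1)
    # In a trained model, if the memory conflicts with the dynamic context,
    # α_t → 0 and the static prior is suppressed.
    return h_t + alpha * v

h_t = rng.standard_normal(D)   # current Transformer hidden state
e_t = rng.standard_normal(D)   # retrieved memory vector
out = gated_memory(h_t, e_t)
```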

The U-Shaped Scaling Law: The Compute-Memory Alliance

Engram isn’t just a module. It’s a new sparsity axis complementary to Mixture-of-Experts (MoE).

An analysis revealed a U-shaped relationship between allocating sparsity parameters to computation (MoE experts) and memory (Engram):

  • Too much compute, not enough memory → Inefficiency (constant recalculation)
  • Too much memory, not enough compute → Performance plateaus
  • Optimal point (20-25% memory) → Strictly outperforms pure MoE models

This is a major finding: the future isn’t in bigger models, but in smarter hybrid models.

Performance: Better at Reasoning, Not Just Memorization

Engram-27B and Engram-40B models were evaluated by reallocating parameters from a baseline MoE model.

Benchmark | Category | Gain (Engram vs MoE)
BBH | Complex Reasoning | +5.0
CMMLU | Cultural Knowledge | +4.0
ARC-Challenge | Scientific Reasoning | +3.7
MMLU | General Knowledge | +3.4
HumanEval | Code Generation | +3.0
MATH | Mathematical Reasoning | +2.4

Surprising: the largest gains aren’t in pure memorization, but in complex reasoning, code, and math.

Why? Because Engram frees the model’s early layers from the task of reconstructing static patterns. This increases the network’s “effective depth” available for abstract reasoning.

It’s like delegating system memory management to an optimized OS, freeing your CPU for more complex computations.

System Efficiency: Memory Offloading from RAM or NVMe

Engram’s retrieval index is deterministic: it depends solely on the input token sequence, not on runtime hidden state (unlike MoE routing).

This property allows asynchronous prefetching of necessary embeddings from:

  • CPU RAM
  • NVMe disks via PCIe bus

This masks communication latency and allows extending the model’s memory to hundreds of billions of parameters with negligible performance overhead (< 3%), bypassing GPU VRAM limitations.

Imagine being able to extend your LLM’s memory like you add RAM to your PC, without having to buy additional GPUs. That’s exactly what Engram makes possible.
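
A hedged sketch of why determinism matters, assuming PyTorch on a CUDA machine: because the indices can be computed from the token IDs alone, the copy from pinned CPU memory can start before the layer that needs the embeddings is reached. The modulo indexing below stands in for the real hashed N-gram index.

```python
# Sketch of asynchronous prefetching enabled by a deterministic index
# (assumes PyTorch and a CUDA GPU; the modulo index stands in for the
# real hashed N-gram index).
import torch

TABLE_SIZE, EMB_DIM = 1 << 18, 64
# The large memory table lives in pinned CPU RAM; it could equally be
# memory-mapped from an NVMe drive.
cpu_table = torch.randn(TABLE_SIZE, EMB_DIM).pin_memory()

def indices_for(token_ids: list[int]) -> torch.Tensor:
    """Deterministic: computable from token IDs alone, before the forward pass."""
    return torch.tensor([tid % TABLE_SIZE for tid in token_ids])

def prefetch(token_ids: list[int], device: str = "cuda") -> torch.Tensor:
    """Start the CPU→GPU copy early so it overlaps with ongoing compute."""
    idx = indices_for(token_ids)
    return cpu_table[idx].to(device, non_blocking=True)

# While the current chunk is still being processed on the GPU, the
# embeddings needed for the next chunk are already in flight over PCIe.
next_chunk_embeddings = prefetch([101, 2023, 318, 257, 1332])
```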


ENGRAM-R: Optimizing Reasoning with “Fact Cards”

Beyond architectural integration, modular memory principles are applied at the system level to manage long conversations and optimize large reasoning models (LRM).

The ENGRAM System: Typed Memory Inspired by Cognitive Science

Inspired by cognitive theories, this system organizes conversational memory into three distinct stores:

  • Episodic Memory: Events and interactions with temporal context (e.g., “user moved to Seattle last year”)
  • Semantic Memory: Facts, observations, and stable preferences (e.g., “user’s favorite color is green”)
  • Procedural Memory: Instructions and processes (e.g., “tax filing deadline is April 15”)

At each conversation turn, the system routes information to the relevant store(s). During a query, dense similarity search is performed to retrieve the most relevant context.
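
Here is a minimal sketch of such a typed store, with a toy embed() function and a naive keyword router standing in for a real sentence-embedding model and the system’s model-driven routing; none of this is the actual ENGRAM code.

```python
# Minimal sketch of ENGRAM-like typed memory with dense retrieval.
# The embed() function and the keyword router are toy stand-ins for a real
# sentence-embedding model and model-driven routing.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy hash-based bag-of-words embedding (placeholder for a real model)."""
    vec = np.zeros(128)
    for word in text.lower().split():
        vec[hash(word) % 128] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

class TypedMemory:
    def __init__(self):
        self.stores = {"episodic": [], "semantic": [], "procedural": []}

    def route(self, record: str) -> str:
        """Naive keyword routing; the real system routes with the LLM itself."""
        text = record.lower()
        if any(w in text for w in ("deadline", "must", "step", "process")):
            return "procedural"
        if any(w in text for w in ("yesterday", "last year", "moved", "went")):
            return "episodic"
        return "semantic"

    def write(self, record: str, turn: int) -> None:
        self.stores[self.route(record)].append((record, turn, embed(record)))

    def search(self, query: str, k: int = 3):
        """Dense similarity search across all three stores."""
        q = embed(query)
        hits = [(rec, turn, float(q @ vec))
                for store in self.stores.values() for rec, turn, vec in store]
        return sorted(hits, key=lambda h: -h[2])[:k]

mem = TypedMemory()
mem.write("User moved to Seattle last year", turn=1)
mem.write("User's favorite color is green", turn=5)
mem.write("Tax filing deadline is April 15", turn=12)
print(mem.search("Where does the user live?"))
```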

ENGRAM-R: “Fact Cards” to Reduce Redundant Thinking

ENGRAM-R introduces two mechanisms to drastically reduce reasoning computational cost:

1. Fact Cards Rendering
Rather than injecting verbose conversation excerpts into the context, retrieved records are transformed into compact, auditable cards:

[E1, A moved to Seattle, Turn 1]
[S2, Favorite color: green, Turn 5]
[P3, Tax deadline: April 15, Turn 12]

2. Direct Citation Mechanism
The LRM is explicitly instructed to use these cards as the source of truth and cite them directly in its chain of thought:

“To answer Q1, E1 shows that A lives in Seattle. Answer: Seattle. Cite [E1].”
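
A small sketch of how this could be wired, using a card format that mirrors the examples above and illustrative prompt wording (not the system’s actual prompts):

```python
# Sketch of Fact Card rendering and a citation-style prompt (illustrative).
# Each retrieved record becomes one compact, citable line instead of a
# verbose conversation excerpt, shrinking the reasoning model's input.

def render_fact_cards(records):
    """records: list of (store_type, fact, turn) tuples from the memory search."""
    prefix = {"episodic": "E", "semantic": "S", "procedural": "P"}
    return [f"[{prefix[store]}{i}, {fact}, Turn {turn}]"
            for i, (store, fact, turn) in enumerate(records, start=1)]

def build_reasoning_prompt(question: str, records) -> str:
    cards = "\n".join(render_fact_cards(records))
    return ("Use the fact cards below as the only source of truth.\n"
            "Cite the card IDs you rely on in your reasoning.\n\n"
            f"{cards}\n\nQuestion: {question}\nAnswer:")

prompt = build_reasoning_prompt(
    "Where does A live?",
    [("episodic", "A moved to Seattle", 1),
     ("semantic", "Favorite color: green", 5),
     ("procedural", "Tax deadline: April 15", 12)],
)
print(prompt)
```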

Efficiency Gains: -89% Tokens, +2.5% Accuracy

Evaluation on long conversation benchmarks (LoCoMo: 16k tokens, LongMemEval: 115k tokens):

Metric | Full-Context | ENGRAM-R | Change
Input Tokens (LoCoMo) | 28,371,703 | 3,293,478 | ≈ -89%
Reasoning Tokens | 1,335,988 | 378,424 | ≈ -72%
Accuracy (Multi-hop) | 72.0% | 74.5% | +2.5 pts
Accuracy (Temporal) | 67.3% | 69.2% | +1.9 pts

Transforming history into a compact, citable evidence base allows:

  • Massively reducing computational costs
  • Maintaining or even improving accuracy
  • Making reasoning traceable and auditable

It’s the AI equivalent of what a senior developer does: they don’t reread all the code every time, they maintain a compact mental model of critical parts.


The Future: AI That Thinks and Remembers Like Us

The Mutation of Architectures

MIRROR and Engram aren’t incremental optimizations. They signal a paradigm shift:

From: Monolithic models that recalculate everything on each pass
To: Hybrid compute-memory systems that think, remember, and reason

This mutation is directly inspired by cognitive science:

  • Working memory (MIRROR’s Cognitive Controller)
  • Typed long-term memory (episodic, semantic, procedural)
  • Information compression (Fact Cards)
  • Inner monologue (parallel reasoning threads)

Architectures like XMem and Memoria already reproduce human memory phenomena: primacy, recency, and temporal contiguity effects.

The RAG vs Full-Context Debate

The Convomem benchmark revealed an important nuance: for the first 150 conversations, a full-context approach (providing all history) outperforms sophisticated RAG systems (70-82% accuracy vs 30-45%).

This suggests that conversational memory benefits from a “small corpus advantage” where exhaustive search is possible and preferable. Direct application of generalist RAG solutions isn’t always optimal.

The future will likely be hybrid:

  • Full context for short conversations
  • Typed memory + Fact Cards for long conversations
  • O(1) retrieval for static knowledge

Impact on Developers and Creators

For us developers and creators, these architectures redefine what we can expect from an LLM:

Today: “ChatGPT is an assistant that sometimes forgets and contradicts itself”
Tomorrow: “My AI maintains a coherent mental model of my project over weeks”

Imagine:

  • A development agent that remembers your code conventions and architectural preferences over months
  • An e-commerce assistant that maintains a nuanced understanding of your business constraints and customers
  • A support system that never asks you the same information twice

These architectures aren’t just performance gains. They make AI truly usable for complex long-term tasks.


Conclusion: The Dawn of Truly Cognitive AI

For years, we’ve improved LLMs by making them bigger: more parameters, more data, more compute.

MIRROR and Engram show us another path: making AI smarter, not just bigger.

By giving them an inner monologue, working memory, and efficient lookup capability, we’re not just improving performance. We’re creating systems that can truly think and remember.

The question is no longer “What model size is necessary?” but “What cognitive architecture is optimal?”.

And you? How do you envision exploiting these architectures in your projects? An assistant that maintains coherent memory of your codebase? A support system that truly understands your users long-term? An agent that reasons before acting?

The future of AI is no longer measured in billions of parameters, but in depth of reflection.