DeepSeek’s conditional memory fixes silent LLM waste: GPU cycles lost to static lookups

13.01.2026 19:00

VentureBeat.com

When an enterprise LLM retrieves a product name, technical specification, or standard contract clause, it's using expensive GPU computation designed for complex reasoning — just to access static information. This happens millions of times per day. Each lookup wastes cycles and inflates infrastructure costs.

DeepSeek's newly released research on "conditional memory" addresses this architectural limitation directly. The work introduces Engram, a module that separates static pattern retrieval from dynamic reasoning. It delivers results that challenge assumptions about what memory is actually for in neural networks. The paper was co-authored by DeepSeek founder Liang Wenfeng.

Through systematic experiments DeepSeek found the optimal balance between computation and memory with 75% of sparse model capacity allocated to dynamic reasoning and 25% to static lookups. This memory system improved reasoning more than knowledge retrieval.

Complex reasoning benchmarks jumped from 70% to 74% accuracy, while knowledge-focused tests improved from 57% to 61%. These improvements came from tests including Big-Bench Hard, ARC-Challenge, and MMLU.

The research arrives as enterprises face mounting pressure to deploy more capable AI systems while navigating GPU memory constraints and infrastructure costs. DeepSeek's approach offers a potential path forward by fundamentally rethinking how models should be structured.

How conditional memory solves a different issue than agentic memory and RAG

Agentic memory systems, sometimes referred to as contextual memory — like Hindsight, MemOS, or Memp — focus on episodic memory. They store records of past conversations, user preferences, and interaction history. These systems help agents maintain context across sessions and learn from experience. But they're external to the model's forward pass and don't optimize how the model internally processes static linguistic patterns.

For Chris Latimer, founder and CEO of Vectorize, which developed Hindsight, the conditional memory approach used in Engram solves a different problem than agentic AI memory.

"It's not solving the problem of connecting agents to external memory like conversation histories and knowledge stores," Latimer told VentureBeat. "It's more geared towards squeezing performance out of smaller models and getting more mileage out of scarce GPU resources."

Conditional memory tackles a fundamental issue: Transformers lack a native knowledge lookup primitive. When processing text, they must simulate retrieval of static patterns through expensive neural computation across multiple layers. These patterns include named entities, technical terminology, and common phrases.

The DeepSeek paper illustrates this with a concrete example. Recognizing "Diana, Princess of Wales" requires consuming multiple layers of attention and feed-forward networks to progressively compose features. The model essentially uses deep, dynamic logic circuits to perform what should be a simple hash table lookup. It's like using a calculator to remember your phone number rather than just looking it up.

"The problem is that Transformer lacks a 'native knowledge lookup' ability," the researchers write. "Many tasks that should be solved in O(1) time like retrieval have to be 'simulated for retrieval' through a large amount of computation, which is very inefficient."

How conditional memory works

Engram introduces "conditional memory" to work alongside MoE's conditional computation.

The mechanism is straightforward. The module takes sequences of two to three tokens and uses hash functions to look them up in a massive embedding table. Retrieval happens in constant time, regardless of table size.

But retrieved patterns need filtering. A hash lookup for "Apple" might collide with unrelated content, or the word might mean the fruit rather than the company. Engram solves this with a gating mechanism. The model's current understanding of context (accumulated through earlier attention layers) acts as a filter. If retrieved memory contradicts the current context, the gate suppresses it. If it fits, the gate lets it through.

The module isn't applied at every layer. Strategic placement balances performance gains against system latency.

This dual-system design raises a critical question: How much capacity should each get? DeepSeek's key finding: the optimal split is 75-80% for computation and 20-25% for memory. Testing found pure MoE (100% computation) proved suboptimal. Too much computation wastes depth reconstructing static patterns; too much memory loses reasoning capacity.

Infrastructure efficiency: the GPU memory bypass

Perhaps Engram's most pragmatic contribution is its infrastructure-aware design. Unlike MoE's dynamic routing, which depends on runtime hidden states, Engram's retrieval indices depend solely on input token sequences. This deterministic nature enables a prefetch-and-overlap strategy.

"The challenge is that GPU memory is limited and expensive, so using bigger models gets costly and harder to deploy," Latimer said. "The clever idea behind Engram is to keep the main model on the GPU, but offload a big chunk of the model's stored information into a separate memory on regular RAM, which the model can use on a just-in-time basis."

During inference, the system can asynchronously retrieve embeddings from host CPU memory via PCIe. This happens while GPU computes preceding transformer blocks. Strategic layer placement leverages computation of early layers as a buffer to mask communication latency.

The researchers demonstrated this with a 100B-parameter embedding table entirely offloaded to host DRAM. They achieved throughput penalties below 3%. This decoupling of storage from compute addresses a critical enterprise constraint as GPU high-bandwidth memory remains expensive and scarce.

What this means for enterprise AI deployment

For enterprises evaluating AI infrastructure strategies, DeepSeek's findings suggest several actionable insights:

1. Hybrid architectures outperform pure approaches. The 75/25 allocation law indicates that optimal models should split sparse capacity between computation and memory.

2. Infrastructure costs may shift from GPU to memory. If Engram-style architectures prove viable in production, infrastructure investment patterns could change. The ability to store 100B+ parameters in CPU memory with minimal overhead suggests that memory-rich, compute-moderate configurations may offer better performance-per-dollar than pure GPU scaling.

3. Reasoning improvements exceed knowledge gains. The surprising finding that reasoning benefits more than knowledge retrieval suggests that memory's value extends beyond obvious use cases.

For enterprises leading AI adoption, Engram demonstrates that the next frontier may not be simply bigger models. It's smarter architectural choices that respect the fundamental distinction between static knowledge and dynamic reasoning. The research suggests that optimal AI systems will increasingly resemble hybrid architectures.

Organizations waiting to adopt AI later in the cycle should monitor whether major model providers incorporate conditional memory principles into their architectures. If the 75/25 allocation law holds across scales and domains, the next generation of foundation models may deliver substantially better reasoning performance at lower infrastructure costs.

Читайте на сайте

Smi24.net — ежеминутные новости с ежедневным архивом. Только у нас — все главные новости дня без политической цензуры. Абсолютно все точки зрения, трезвая аналитика, цивилизованные споры и обсуждения без взаимных обвинений и оскорблений. Помните, что не у всех точка зрения совпадает с Вашей. Уважайте мнение других, даже если Вы отстаиваете свой взгляд и свою позицию. Мы не навязываем Вам своё видение, мы даём Вам срез событий дня без цензуры и без купюр. Новости, какие они есть —онлайн с поминутным архивом по всем городам и регионам России, Украины, Белоруссии и Абхазии. Smi24.net — живые новости в живом эфире! Быстрый поиск от Smi24.net — это не только возможность первым узнать, но и преимущество сообщить срочные новости мгновенно на любом языке мира и быть услышанным тут же. В любую минуту Вы можете добавить свою новость - здесь.

Новости от наших партнёров в Вашем городе

Ria.city

Музыкальные новости

Новости России

Экология в России и мире

Спорт в России и мире

Moscow.media

103news.com — быстрее, чем Я..., самые свежие и актуальные новости Вашего города — каждый день, каждый час с ежеминутным обновлением! Мгновенная публикация на языке оригинала, без модерации и без купюр в разделе Пользователи сайта 103news.com.

Как добавить свои новости в наши трансляции? Очень просто. Достаточно отправить заявку на наш электронный адрес mail@29ru.net с указанием адреса Вашей ленты новостей в формате RSS или подать заявку на включение Вашего сайта в наш каталог через форму. После модерации заявки в течении 24 часов Ваша лента новостей начнёт транслироваться в разделе Вашего города. Все новости в нашей ленте новостей отсортированы поминутно по времени публикации, которое указано напротив каждой новости справа также как и прямая ссылка на источник информации. Если у Вас есть интересные фото Вашего города или других населённых пунктов Вашего региона мы также готовы опубликовать их в разделе Вашего города в нашем каталоге региональных сайтов, который на сегодняшний день является самым большим региональным ресурсом, охватывающим все города не только России и Украины, но ещё и Белоруссии и Абхазии. Прислать фото можно здесь. Оперативно разместить свою новость в Вашем городе можно самостоятельно через форму.

Другие популярные новости дня сегодня

Новости 24/7 Все города России

Топ 10 новостей последнего часа

Moscow.media

103news.com — международная интерактивная информационная сеть (ежеминутные новости с ежедневным интелектуальным архивом). Только у нас — все главные новости дня без политической цензуры. "103 Новости" — абсолютно все точки зрения, трезвая аналитика, цивилизованные споры и обсуждения без взаимных обвинений и оскорблений. Помните, что не у всех точка зрения совпадает с Вашей. Уважайте мнение других, даже если Вы отстаиваете свой взгляд и свою позицию.

Мы не навязываем Вам своё видение, мы даём Вам объективный срез событий дня без цензуры и без купюр. Новости, какие они есть — онлайн (с поминутным архивом по всем городам и регионам России, Украины, Белоруссии и Абхазии).

103news.com — живые новости в прямом эфире!

В любую минуту Вы можете добавить свою новость мгновенно — здесь.

Музыкальные новости

Спорт в России и мире