Why memory is AI's most underestimated bottleneck

Storage used to be something you bought once and forgot about. That era ended the moment trillion‑parameter models and AI agent swarms arrived. Memory — how models access, cache, and work with context — has become the defining challenge in AI infrastructure, the layer that determines whether your GPU investments actually deliver or stall out.

In this episode of MindMakers, WEKA Chief AI Officer Val Bercovici and Sendbird CEO John Kim discuss:

why memory architecture matters more than most people realize
how the industry is shifting from prompt engineering to context engineering
what happens when you remove the memory bottleneck entirely

Val brings two decades of infrastructure experience across NetApp, SolidFire, and early Kubernetes development. But it's his current role that frames the conversation. As WEKA's Chief AI Officer, he's building a market category around the very problem this episode unpacks. The conversation starts there.

The Chief AI Officer role is temporary (and that's the point)

The title "Chief AI Officer" is everywhere right now. Companies are hiring them. VCs are asking for them. But Val sees the role as fundamentally transitional — more like "internet person" from the dot-com era than a permanent C-suite fixture.

"I do believe it's transitional. Like most cliches, you're really successful in your role if you work yourself out of a job."
— Val Bercovici, Chief AI Officer, WEKA

The real work, Val explains, is making sure organizations take full advantage of what AI has to offer. Right now, that means myth-busting, internal education, and helping teams sort through what to act on, what to ignore, and where the real risks are. Strategy is fundamental to the role, along with technical leadership both inside and outside the company.

The day-to-day work is changing fast because it tracks the rapid evolution of the science itself, Val notes. In two years, the role will barely resemble what it is today. But the category Val is building at WEKA, context memory storage, points to the problem he believes will outlast the title.

The memory wall is the real constraint on AI scaling

Memory is the technical bottleneck for the industry right now, and most infrastructure conversations aren't treating it that way.

Val breaks it down: the working memory that transformers use — technically called the key-value cache — is the real memory hog, not the model weights themselves.

By Val's estimate, a 100,000-token context translates to about a megabyte of raw data, but once you embed and vectorize it across 10,000 to 20,000 dimensions, that balloons to roughly 50 gigabytes. And that working set has to sit in expensive, scarce high-bandwidth memory (HBM) co-packaged with the GPUs themselves.
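
For a rough sense of where a number of that size comes from, the footprint of a key-value cache can be estimated from a model's shape. The sketch below is an illustration, not Val's own calculation: the layer count, number of KV heads, head dimension, and 16-bit precision are assumed values for a hypothetical large model, and real deployments vary widely.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per token, per layer, per KV head."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token


# Hypothetical model shape (assumed for illustration, not taken from the episode).
size = kv_cache_bytes(tokens=100_000, layers=80, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # 32.8 GB for this assumed configuration
```

Under these assumptions, a single 100,000-token context already needs tens of gigabytes, the same order of magnitude as the figure Val cites, and that is before a second concurrent user shows up.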

"This solves a first-principles GPU scientific computing problem, because you're always trading GPU floating-point operations for memory. AI is the most successful scientific application on GPUs ever."
— Val Bercovici, Chief AI Officer, WEKA

You're out of that high-bandwidth memory before you even begin, Val says. Then the spillover starts: from HBM to CPU DRAM shared across the motherboard, and from there to storage tiers — with what Val calls a "Grand Canyon" of performance between each level.

WEKA's approach, as Val describes it, repackages those storage tiers as memory — maintaining throughput and latency across a thousand times more capacity.
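
The general idea of spilling from a fast tier to a slower one without losing data can be sketched in a few lines. This is a toy model, not WEKA's implementation: the `TieredCache` class and its methods are hypothetical names, the "fast" tier stands in for HBM and the "slow" tier for DRAM or storage, and nothing here models the latency gap between them.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier store: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for HBM; ordered oldest-first
        self.slow = {}             # stands in for DRAM or storage

    def put(self, key, value):
        self.slow.pop(key, None)   # a key lives in exactly one tier
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value  # demote instead of discarding

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        if key in self.slow:
            value = self.slow[key]
            self.put(key, value)   # promote back into the fast tier
            return value, "slow"
        return None, "miss"


cache = TieredCache(fast_capacity=2)
for session in ("a", "b", "c"):
    cache.put(session, f"kv-for-{session}")
print(cache.get("a"))  # ('kv-for-a', 'slow')
```

The point of the sketch is the demotion step: the oldest entry moves down a tier instead of being thrown away, so a later read is a slower hit, not a full recomputation. How slow that hit is depends entirely on the tier underneath.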

"The net effect is you go from a scarcity mindset of infrastructure resources — memory in particular — to an abundance mindset. Because now you have, at least economically, a thousand times the capacity of memory without any sacrifice in latency or throughput."
— Val Bercovici, Chief AI Officer, WEKA

Once you solve the memory constraint, everything downstream opens up. Context gets deeper. Concurrency scales. The applications you can build shift entirely.

Context engineering is the new frontier

Val points to a major shift that happened over the past year: the move from prompt engineering to context engineering. It's directly tied to memory scarcity: there will always be some limit on what the GPU can process in real time, even with other tiers available.

The science and engineering are both evolving rapidly, Val explains. Models still don't understand memory tiers or memory hierarchies — they understand very scarce, limited memory. So engineers optimize around that constraint: KV cache compression, context compaction, summarization after a certain number of turns. These techniques help, but they're workarounds. They manage scarcity instead of solving it.
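
A minimal sketch of the summarize-after-N-turns workaround shows what gets lost. The function names are hypothetical, and `naive_summary` is a placeholder for what would be a model call in a real system.

```python
def compact_history(turns, keep_last, summarize):
    """Replace all but the most recent turns with a single summary turn."""
    if len(turns) <= keep_last:
        return list(turns)
    split = len(turns) - keep_last
    older, recent = turns[:split], turns[split:]
    return [summarize(older)] + list(recent)


def naive_summary(turns):
    # Placeholder: a real system would call a model to summarize here.
    return f"[summary of {len(turns)} earlier turns]"


history = [f"turn {i}" for i in range(1, 7)]
print(compact_history(history, keep_last=2, summarize=naive_summary))
# ['[summary of 4 earlier turns]', 'turn 5', 'turn 6']
```

The context now fits, but the four earlier turns survive only as whatever the summary kept. That trade is what makes compaction a way of managing scarcity, not removing it.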

Val's alternative is what WEKA calls the "token warehouse." The idea: instead of compacting and discarding context to stay within limits, you prepare it once — embedded, vectorized, ready — and keep it permanently available for low-cost cache reads. No redundant reprocessing every session. No eviction. Nvidia's engineers describe a similar approach as local pre-fill global decode.
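
The prepare-once, read-many idea can be illustrated with a toy prefix cache. This is a sketch under stated assumptions, not WEKA's token warehouse: `PrefixCache` and `prepare` are hypothetical names, tokens are plain strings, the block size is arbitrary, and the string "prepared" stands in for real key-value data.

```python
import hashlib


class PrefixCache:
    """Toy cache of prepared context, keyed by chained hashes of token blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.store = {}  # block hash -> stand-in for that block's prepared state

    def _block_hashes(self, tokens):
        hashes, previous = [], ""
        full = len(tokens) - len(tokens) % self.block_size
        for start in range(0, full, self.block_size):
            block = tokens[start:start + self.block_size]
            payload = previous + "|" + " ".join(block)
            previous = hashlib.sha256(payload.encode()).hexdigest()
            hashes.append(previous)  # each hash covers the whole prefix so far
        return hashes

    def prepare(self, tokens):
        """Return (tokens served from cache, tokens that had to be processed)."""
        hashes = self._block_hashes(tokens)
        hits = 0
        for h in hashes:
            if h not in self.store:
                break
            hits += 1
        for h in hashes[hits:]:
            self.store[h] = "prepared"  # placeholder for real KV data
        reused = hits * self.block_size
        return reused, len(tokens) - reused


cache = PrefixCache(block_size=4)
doc = "the quick brown fox jumps over the lazy dog again".split()  # 10 tokens
print(cache.prepare(doc))              # (0, 10)
print(cache.prepare(doc + ["today"]))  # (8, 3)
```

On the second call, the two full blocks that were already prepared are read back and only the tail is processed. Nothing in this sketch is ever evicted, which is the property the token warehouse idea depends on and the one that scarce GPU memory cannot offer by itself.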

"We've given this a lot of thought, and we actually coined a term around it called a token warehouse. One side of the memory discussion with AI engineers focuses on compaction — preserving short-term memory in markdown files, SQLite databases, Postgres, context graphs. But they're basically pre-embedded, pre-vectorized structures."
— Val Bercovici, Chief AI Officer, WEKA

The token warehouse concept is still early, but the research momentum is real. Val points to Tensormesh's CacheBlend paper, which works on partial prefix matching that moves beyond brute-force full-prefix approaches, as a sign the field is shifting from proof of concept to engineering discipline.

Agent swarms need concurrent memory, not just capacity

The conversation around agents today focuses on what a single agent can do. Val's conviction is sharper: no one will ever run just one agent. It'll always be swarms working in parallel.

Instead of summarizing and compacting context to fit within limits, Val describes a future where you give full context to a new sub-agent for each sub-task, letting it fill up its own context window. That parallelization — concurrent agents, agent swarms — changes the memory math entirely.

"The biggest value of memory today is without any additional expense, without more CAPEX or energy OPEX, you can have a high level of concurrency. In a best-case scenario, we estimate about 10x. Even at modest scale, we're seeing 6.5x — another way of saying it is 550% more concurrent tokens without latency sacrifice, without any more GPU spend or any more energy spend."
— Val Bercovici, Chief AI Officer, WEKA

Those numbers reframe the economics of agent deployment. If you can run 6.5x as many concurrent agents on the same hardware, you're not only saving money but also making architectures viable that weren't before. A customer service system that dispatches 10 specialized sub-agents per ticket, or a code review pipeline that checks security, performance, and style in parallel: these patterns only work if the memory layer can keep up.
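
The code review pattern can be sketched as a simple fan-out. Everything here is hypothetical: `run_sub_agent` is a stand-in that sleeps where a real sub-agent would call a model, and the agent names are just the three checks mentioned above.

```python
import asyncio


async def run_sub_agent(name, context):
    """Hypothetical sub-agent: a real one would call a model with its own context window."""
    await asyncio.sleep(0.01)  # stands in for inference latency
    return name, f"{name} review of {len(context)} characters"


async def review(context, agent_names):
    # Each sub-agent receives the full context and runs concurrently.
    results = await asyncio.gather(*(run_sub_agent(n, context) for n in agent_names))
    return dict(results)


if __name__ == "__main__":
    report = asyncio.run(review("def f(): pass", ["security", "performance", "style"]))
    for name, finding in report.items():
        print(name, "->", finding)
```

Note what the fan-out implies for memory: every sub-agent holds its own copy of the full context at the same time, so the working set grows with the number of agents even though the wall-clock time does not.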

Val calls the discipline behind this "context platform engineering" — architecting memory so swarms of agents can operate at scale without hitting the constraints that shut down single-agent systems.

"I've always believed that as agents really take off and become valuable, no one will be running just an agent. It'll always be a swarm of agents working in parallel concurrently. So this concurrent agent swarm engineering is something we've termed context platform engineering."
— Val Bercovici, Chief AI Officer, WEKA

Organizations already thinking in terms of agent swarms will find that solving memory concurrency first gives them a structural advantage — and we're only at the beginning of understanding how large that gap can be.

Memory awareness is moving to the model level

So far, every solution Val has described — bridging the memory wall, engineering context persistence, scaling agent concurrency — treats memory as an infrastructure problem solved below the model. The model consumes context; engineers figure out how to feed it efficiently. But Val points to a shift that changes that dynamic entirely: models themselves are starting to understand memory.

He cites DeepSeek 4's preview paper introducing a concept called Engram — the first public indication that foundation models are becoming architecturally aware of memory hierarchies, not just consuming whatever context gets handed to them.

"It is the first public indication — and we're sure the closed commercial frontier labs are doing this — of awareness of memory tiers at the model level, not just at the inference server level."
— Val Bercovici, Chief AI Officer, WEKA

This matters because it closes the loop. Models need a clear sense of what memory they have access to. When models know how to use both short‑term and long‑term memory, they can decide what to keep close and what to fetch later, much like a developer choosing between cache and disk. Attention techniques such as RadixAttention and Ring Attention are pushing this even further, expanding how much long context models can practically work with.

For Val, this reinforces the central point: the teams that truly solve memory will be the ones capable of building AI systems that operate on deep context, high concurrency, and real‑time memory management across the entire stack.

And because each layer strengthens the next, the advantage compounds. Better memory infrastructure unlocks more concurrency, and models that understand memory tiers make smarter use of it. As a result, companies that move early gain faster infrastructure and access to application architectures their competitors can’t run at all.

Listen to the full conversation with Val Bercovici, Chief AI Officer of WEKA, on MindMakers to hear more about how memory changes the AI competitive landscape, why the SaaS business model faces new pressure from AI agents, and what infrastructure leaders need to think about right now.
