Research Notes

From Bits to Atoms: The AI Infrastructure Shift From LLMs to VLAs

Published on December 2, 2025
By Jordi Visser

I listened to a long-form interview this weekend with Elon Musk. I’ve watched most of his interviews over the years, but this one stood out. It wasn’t about quarterly results, Tesla margins, or rocket design. It was Musk speaking broadly about consciousness, physics, society, money, and the future of intelligence. But in the middle of that conversation he said something that immediately caught my attention, because it hit a key point from my weekly video.

“AI is going to move from thinking to acting. Real-world AI robots, autonomous vehicles, machines moving through space, this is the true frontier.”

He said it casually, the way someone might comment on the weather. But it’s one of the most important statements for investors thinking about AI alpha over the next three years. Because what he’s really saying is: the next phase of AI is not cognitive. It’s kinetic.

The entire artificial intelligence investment narrative to date, in the three years since the launch of ChatGPT, has been about manipulating bits: processing text, generating language, and reasoning through problems that exist purely in digital space. GPUs multiply matrices. Memory stores model weights. Networks shuffle data between processors. This architecture of cognitive AI has created one of the greatest capital expenditure cycles in history, concentrating wealth in a handful of semiconductor and cloud infrastructure companies.

To borrow a line from Good Will Hunting, this phase was AI “reading about the Sistine Chapel,” not standing beneath it. In the famous park scene, Robin Williams reminds Matt Damon that knowing something intellectually is not the same as experiencing it: “You don’t know what it smells like in the Sistine Chapel.” That is exactly where we are with AI today.
The complexity of experiencing the physical world, rather than reading about it, is why I say PMIs are going to move higher. The buildout required to make it happen is immense, and you need to think about where we will be three years from now rather than where we are today, just as with ChatGPT in November 2022. All of this will ultimately lead to factor shifts within the equity market as the winners and losers change in this next phase. This shift is happening now despite all the AI bubble talk and the worries about investment levels and ROIC.

The next phase of AI is not about thinking faster anymore; it’s about moving atoms. LLMs taught AI to read. VLMs taught AI to see. VLAs will teach AI to act. Vision-Language Models (VLMs) closed the gap between text and perception, but Vision-Language-Action (VLA) models extend this into physical agency. This is not an incremental upgrade; it is an architectural fracture. Where large language models required us to build bigger datacenters, VLAs require us to rebuild the interface between computation and physical reality. The companies that move data, perceive the environment, and execute precision motion in three-dimensional space will capture value that today’s market has almost entirely overlooked. The trillion-dollar compute infrastructure was the table stakes. The multi-trillion-dollar kinetic infrastructure is the game we’re about to play.

Why VLAs Break Everything

The Large Language Model era optimized for a specific problem: train transformers on massive amounts of text, then deploy them to answer questions or generate content. The workload was cloud-based and latency-tolerant; a user could wait half a second for a response. The constraint was pure compute power.

VLAs invert nearly every assumption. Instead of discrete text measured in kilobytes, VLAs ingest continuous video streams measured in terabytes.
A single minute of high-resolution, multi-camera footage contains more information than millions of text tokens. Instead of generating buffered text responses, VLAs output real-time motor commands to physical actuators (robot joints, vehicle steering systems, drone rotors) where a 10-millisecond delay can mean the difference between success and catastrophic failure.

This creates what we call the physics constraint. In the LLM world, if your model hallucinates, you get a factual error: embarrassing but not catastrophic. In the VLA world, if your model hallucinates while controlling a robot, you get a “kinetic hallucination”: the AI commands a physically impossible or dangerous action. The stakes are categorically different.

VLAs introduce three new bottlenecks that GPUs alone cannot solve: the Bandwidth Wall (video ingestion overwhelms traditional interconnects), the Memory Capacity Gap (context windows explode to include environmental maps and sensor histories), and the Actuation Precision Requirement (commands must execute with sub-millisecond latency). These are not software problems. They are materials science, photonics, and mechanical engineering problems. And they create entirely new categories of infrastructure winners.

The Data Center Transformation: From Compute to Bandwidth

Training LLMs on text required massive parallelism, connecting thousands of GPUs to process enormous datasets. But the data itself was relatively compact. The bottleneck was compute: how many trillions of operations per second could you sustain?

Training VLAs on video flips this dynamic. The bottleneck becomes data movement. A robotics company training a manipulation model needs to ingest continuous streams from hundreds of robots, each equipped with multiple cameras and sensors. This creates server-to-server communication within the datacenter that scales exponentially. In LLM clusters, this traffic was manageable. In VLA clusters, it becomes the primary constraint.
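The data-volume claim above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the camera count, resolution, frame rate, and bytes-per-token figures are assumptions I am supplying, not numbers from the article.

```python
# Rough comparison: one minute of raw multi-camera video vs. text tokens.
# All sensor specs below are illustrative assumptions.

def video_bytes_per_minute(cameras=6, width=1920, height=1080,
                           fps=30, bytes_per_pixel=1.5):
    """Raw (uncompressed, 4:2:0-subsampled) video volume for one minute."""
    frame_bytes = width * height * bytes_per_pixel
    return cameras * frame_bytes * fps * 60

def text_bytes(tokens=1_000_000, bytes_per_token=4):
    """Rough size of a million text tokens (~4 bytes/token in English)."""
    return tokens * bytes_per_token

video = video_bytes_per_minute()
text = text_bytes()
print(f"video per minute: {video / 1e9:.1f} GB")  # ~33.6 GB raw
print(f"1M text tokens:   {text / 1e6:.1f} MB")   # ~4.0 MB
print(f"ratio: {video / text:,.0f}x")
```

Even before compression is considered, a single minute of this assumed six-camera rig produces thousands of times more bytes than a million-token text corpus, which is the intuition behind the Bandwidth Wall.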
This is why optical interconnects are becoming the circulatory system of AI infrastructure. At the speeds required for next-generation systems, electrical signals degrade within inches due to basic physics. This forces a wholesale migration to photonics: lasers, optical fibers, and eventually co-packaged optics, where the light source sits directly on the chip package.

The manufacturing complexity creates natural monopolies. Unlike electronic chips, where billions of transistors are created in parallel, optical components require precise physical alignment of powered lasers to optical fibers: a serial, time-intensive process that can’t be easily automated. Companies that have mastered this operate with multi-year visibility and pricing power, because the switching cost for customers is catastrophic.

But the shift goes beyond raw bandwidth. VLAs also change the memory problem. LLMs were primarily bandwidth-constrained, loading model weights as fast as possible. VLAs face a capacity problem. A robot fleet must maintain massive shared context: spatial maps, object databases, sensor histories. This context must be accessible to multiple agents simultaneously. When ten warehouse robots encounter the same obstacle, they need to query and update a shared environmental model in real time.

This is where technologies like Compute Express Link (CXL) become transformative, enabling memory pooling at rack scale. For VLA workloads with highly variable memory demand, this eliminates the need to overprovision expensive memory on every server. The companies building these memory controllers are effectively collecting a tax on the entire AI infrastructure build-out, yet most trade at modest multiples because the market still thinks of them as niche components rather than foundational infrastructure.

The Edge Revolution: Inference Moves to the Robot

LLMs could run in centralized cloud datacenters because the workload was latency-tolerant; 200 milliseconds didn’t matter. VLAs cannot afford this luxury.
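The memory-pooling argument can be sketched numerically. The server counts and memory sizes below are hypothetical assumptions chosen only to show why sizing one shared pool for the aggregate peak beats provisioning every server for its individual peak.

```python
# Illustrative sizing: per-server overprovisioning vs. rack-level
# memory pooling (the CXL idea). All quantities are assumptions.

def per_server_total(servers=16, peak_gb=512):
    """Without pooling, every server must carry its own peak memory."""
    return servers * peak_gb

def pooled_total(servers=16, base_gb=128, shared_pool_gb=2048):
    """With pooling, servers keep a base allotment and borrow from one
    shared pool sized for the aggregate (not per-node) peak demand."""
    return servers * base_gb + shared_pool_gb

print(per_server_total())  # 8192 GB provisioned without pooling
print(pooled_total())      # 4096 GB provisioned with a shared pool
```

Because bursty demand across nodes rarely peaks simultaneously, the shared pool can be far smaller than the sum of per-node peaks, which is where the capacity savings come from.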
A humanoid robot walking across uneven terrain, a drone navigating obstacles, an autonomous vehicle reacting to a pedestrian: these require sub-10-millisecond decision loops. Network latency to a distant datacenter already exceeds 50-100