An engineer who worked at Twitter during the dramatic transition from Agrawal to Musk has publicly reminisced about discovering a cluster of 700 Nvidia V100 GPUs. Tim Zaman, now a software engineer at Google DeepMind, found this sizable pool of GPU power powered on but sitting idle in a data center belonging to X's predecessor.
A few weeks after the Twitter acquisition in 2022, we discovered 700 V100 GPUs (PCIe, lol) in a data center. Powered on and idle, and had been for a long time. Forgotten remnants of a good-faith attempt to create a cluster within Twitter 1.0. Times have changed. 100,000 GPUs… https://t.co/zSChG0BvVZ (July 22, 2024)
The lump of Nvidia silicon and PCBs warmly humming in a Twitter data center was poetically described by Zaman in a Twitter/X post on Monday as the "forgotten remnants of a good-faith attempt to create a cluster within Twitter 1.0." The engineer was prompted to write about his unexpected discovery of this silicon treasure trove after reading that xAI's Memphis Supercluster had begun training Grok 3, powered by 100,000 liquid-cooled Nvidia H100 accelerators on a single RDMA fabric.
Zaman highlighted what many of you are probably thinking: at Twitter, 700 of what were once the world's most powerful data center GPUs had been sitting idle for years. "Times have changed!" he exclaimed. Indeed, the first Nvidia Volta architecture V100 GPUs for data centers started appearing on the market during the first major GPU shortage in 2017, yet in mid-2022 Zaman found a cluster of 700 of these cards doing nothing at all. That's a lot of wasted compute time and resources.
Zaman also noted, with some wry amusement, that the 700 Nvidia V100s were PCIe GPUs rather than the SXM2 form factor variety, which offers a much higher-bandwidth NVLink interface. Of course, we don't know, and probably never will, why Twitter opted for PCIe rather than SXM2 V100 GPUs for this large installation back in 2017.
Zaman's tweet also contained an interesting observation about Musk's new "gigafactory of computing." "Running 100,000 GPUs on the same fabric must be an epic challenge," the engineer commented. "At that scale, the only guarantee is failure, and it's all about proper failure management." With this in mind, Zaman mused about splitting resources into separate failure domains so that a single fault can't take down the entire system.
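To make that idea concrete, here is a minimal, hypothetical Python sketch of a failure-domain approach, assuming a fixed domain size and a checkpoint-and-restart policy. It is purely illustrative and not a description of xAI's actual cluster management; all names and numbers are invented for the example.

```python
# Hypothetical sketch: group GPUs into independent failure domains so that a
# failed node only forces a restart of its own domain, not the whole job.

from dataclasses import dataclass, field


@dataclass
class FailureDomain:
    domain_id: int
    gpu_ids: list[int]
    healthy: bool = True


def build_domains(total_gpus: int, domain_size: int) -> list[FailureDomain]:
    """Split the full GPU pool into fixed-size failure domains."""
    return [
        FailureDomain(domain_id=i, gpu_ids=list(range(start, start + domain_size)))
        for i, start in enumerate(range(0, total_gpus, domain_size))
    ]


def handle_gpu_failure(domains: list[FailureDomain], failed_gpu: int) -> None:
    """Quarantine only the affected domain; all other domains keep training."""
    for domain in domains:
        if failed_gpu in domain.gpu_ids:
            domain.healthy = False
            print(f"Domain {domain.domain_id} quarantined; "
                  f"restart its {len(domain.gpu_ids)} GPUs from the last checkpoint.")
            return


if __name__ == "__main__":
    # Illustrative scale: 100,000 GPUs split into 100 domains of 1,000 GPUs each.
    domains = build_domains(total_gpus=100_000, domain_size=1_000)
    handle_gpu_failure(domains, failed_gpu=42_517)  # simulate a single GPU fault
    print(sum(d.healthy for d in domains), "of", len(domains), "domains still healthy")
```

The design choice the sketch illustrates is simply that the blast radius of any single failure is bounded by the domain size, which is the kind of trade-off Zaman was alluding to.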
Engineers were also intrigued by how many GPUs can reside on a single fabric. As tech giants race to scale up their AI training clusters, that race is sure to uncover both predictable and unexpected limits on the number of GPUs that can share one fabric.