
AMD’s AI Future is Rack Scale ‘Helios’

Only have a minute? Here are our key takeaways.

🚀 New MI355X GPU: 2x AI FLOPs, more HBM, 40% better tokens/$ than NVIDIA.
🧠 Software Wins: ROCm 7 with big performance boosts and day-0 support.
🖧 Rack-Scale Wins: New turnkey solutions using AMD CPU + GPU + Network.
📈 Roadmap Wins: Next-Gen in 2026 with 4x performance, HBM4 and scale.
🌱 Efficiency Wins: Roadmap to 20× rack-scale energy efficiency by 2030.


With demand for AI hardware continuing to go through the roof, the industry's chip designers and other hardware providers remain in full-on growth mode. The name of the game is not just developing new products to woo deep-pocketed customers building large-scale training and inference clusters, but also keeping pace with the market's rapid expansion, since a vendor can lose market share simply by growing more slowly than the market itself. The gold rush is still in full effect, and everyone wants their piece.

As the second-largest of the industry’s major GPU providers, AMD has been able to capture a slice of the AI market relatively quickly over the last couple of years, building on the back of its Instinct MI300X series of GPU-based accelerators. The MI300X is far from AMD’s first foray into server GPUs, as it uses the third iteration of the company’s CDNA server GPU architecture, but it was AMD’s first design to truly go all-in on the kind of features required for large-scale AI systems, and that has translated into significant and high-margin sales for the company.

And now with that momentum behind them, AMD is aiming to continue to ramp up their presence in the AI market with their next generation of AI server hardware.

During the company’s Advancing AI 2025 presentation earlier this morning, AMD launched the next generation of what has become a full-on AI ecosystem for the company. Built around the company’s fourth-generation CDNA architecture, CDNA4, and the new Instinct MI350 series of accelerators based on it, AMD is looking to dive even deeper into the AI market with a faster accelerator even more optimized for AI workloads and the unique math behind them. And running on that new hardware is the latest iteration of AMD’s ROCm platform, now up to version 7, which sees AMD touting numerous performance improvements coupled with a fresh focus on day-0 software support.

Meanwhile, with the wider industry’s focus shifting from buying individual server systems to buying whole racks of systems, AMD is also assembling its own ecosystem for rack-scale AI compute, which it has brought to the forefront today, enabled through its acquisitions of companies like ZT Systems.

By combining Instinct MI350 accelerators with AMD’s recently-launched Pollara 400 AI NIC and Turin EPYC CPUs, AMD is now offering customers all the major components they need to build whole racks based on AMD hardware. And like AMD’s plans for CPUs and GPUs, AMD is planning for multiple generations of rack-scale hardware, with this year’s equipment being the tip of the iceberg to a much bigger push in 2026 with the Helios AI rack, which will be packed full of next-generation AMD technologies.

But before we get too far ahead of ourselves, let’s take a look at AMD’s major announcements today.

Doubling Down on AI, Damn Near Literally

First and foremost, we have the star of today’s announcements: AMD’s Instinct MI350 family of GPU-based accelerators. Based on AMD’s new CDNA4 architecture, these accelerators push even harder into the AI field, with AMD effectively doubling their matrix (tensor) operation performance per clock versus the MI300X accelerator. And with AMD supporting the lower-precision FP6 and FP4 formats, the peak throughput of these new accelerators is even higher. Under ideal conditions, the MI350 series stands to be up to four times as fast as MI300X – a significant growth spurt that comes barely 18 months after AMD first launched the MI300X.

Under the hood, the Instinct MI350 series is based on AMD’s CDNA4 architecture. A further evolution of the CDNA3 architecture that was used in the MI300 series, CDNA4 is an incremental design that keeps around most of CDNA3’s underpinnings, while making some important upgrades elsewhere with a specific focus on improving matrix performance for AI workloads.

The biggest change here is that AMD has doubled the throughput of the matrix engines that are responsible for providing matrix operation support. So clock-for-clock, for FP16 and below data types, a CDNA4 compute unit (CU) can process twice as many matrix operations as a CDNA3 compute unit. Meanwhile, there has been no equivalent scale-up of traditional vector performance, underscoring how MI350’s transistor budget is far more shifted to AI hardware than MI300 before it.

Meanwhile, CDNA4 also brings native support for FP6 and FP4 data types to AMD’s accelerators for the first time. One of the marquee features of rival NVIDIA’s Blackwell architecture, FP6 and FP4 have become a new target for AI inference, as developers look to wring every TOP/FLOP of performance from these expensive and power-hungry GPUs. And, aiming to one-up NVIDIA at their own game here, AMD has even beefed up FP6 performance on their architecture so that it processes at twice the rate of FP8, unlike NVIDIA’s architecture where it processes at the same rate as FP8. AMD in essence built a better FP4 unit to support FP6, rather than reusing an FP8 unit to support FP6. This carries a die area penalty, but the upshot is double the performance.
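To make the low-precision discussion a bit more concrete, here is a minimal Python sketch of round-to-nearest FP4 (E2M1) quantization. The value grid follows the OCP microscaling (MX) FP4 definition; the shared block scale and rounding behavior are illustrative assumptions, not a description of how CDNA4 hardware actually implements the format.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
# The grid follows the OCP MX FP4 definition; the shared power-of-two block
# scale and round-to-nearest behavior below are illustrative assumptions.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Quantize a block of values to FP4 with one shared power-of-two scale."""
    scale = 2.0 ** np.ceil(np.log2(np.abs(x).max() / FP4_GRID[-1] + 1e-12))
    scaled = np.abs(x) / scale
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)  # nearest grid point
    return np.sign(x) * FP4_GRID[idx] * scale

weights = np.random.randn(16).astype(np.float32)
print(np.round(quantize_fp4(weights), 3))  # coarse, but 4x denser than FP16 to store and compute
```

The appeal for inference is exactly that density: every halving of the data type doubles how many values fit in memory, caches, and the matrix engines per cycle, provided model accuracy can be preserved.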

Feeding all of this, the MI350 series is being paired with 288GB of HBM3E memory across eight stacks. This is the same number of stacks as on the MI300, with AMD benefitting from both HBM3E’s greater per-stack capacity and its higher bandwidth. All told, the MI350 accelerators offer a cumulative 8 TB/sec of memory bandwidth, up from the 192GB of memory and 5.3 TB/sec of bandwidth available on the MI300X.

And while we’ll get into the nitty-gritty of the architecture in our separate CDNA4 architecture piece, the MI350 series also sees a meaningful change in how the chips are constructed. AMD is still using a 3D hybrid bonded die stacking technique, stacking the compute ‘XCD’ dies on top of I/O and memory IOD dies. The MI350 series continues to use 8 XCDs – this time with 32 CUs each – but these are now being stacked on top of just two large IODs, versus the four smaller IODs used on MI300X. In terms of process technologies, the CDNA4 XCDs are using the latest and greatest node from TSMC, N3P. This is a notable advantage over NVIDIA’s N4-built hardware. AMD tells us that there is a total of 185B transistors in a completed chip, up from 153B for MI300.

This time around, the Instinct MI350 series is being split up into a pair of accelerator SKUs, based on power consumption and clockspeeds.

The lead SKU is the MI355X, which is specifically intended for liquid cooled systems. This is the highest performing part, with a peak FP16 performance of 5 PFLOPS, and a clockspeed of around 2.4 GHz.

The air-cooled counterpart will be the MI350X, which is hardware-identical but with a lower peak clockspeed of around 2.2 GHz (and a lower power consumption to match). In terms of AI performance, this would give the MI350X around 4.6 PFLOPS of peak FP16 performance.
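As a quick sanity check, the two SKUs are hardware-identical, so peak throughput should scale roughly linearly with clockspeed, and AMD's quoted figures line up with that. A back-of-the-envelope using only the numbers above:

```python
# Peak matrix throughput scales with clockspeed for otherwise identical silicon.
mi355x_fp16_pflops = 5.0   # AMD's quoted peak, liquid-cooled SKU
mi355x_clock_ghz   = 2.4
mi350x_clock_ghz   = 2.2   # air-cooled SKU

mi350x_fp16_pflops = mi355x_fp16_pflops * (mi350x_clock_ghz / mi355x_clock_ghz)
print(f"MI350X peak FP16: ~{mi350x_fp16_pflops:.2f} PFLOPS")  # ~4.58, i.e. the ~4.6 quoted above
```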

Otherwise, as hinted at by AMD’s decision to split SKUs based on power consumption, the power consumption of this generation of accelerators is going to be very high. Coming from the 750 Watt MI300X, the similarly air-cooled MI350X is designed for 1000 Watt systems, and the liquid-cooled MI355X will be able to chug 1400 Watts. AMD is pushing these parts hard to maximize performance, but according to the company, it’s what their customers want – extreme scale-up by pushing each individual processor as hard as possible. Compared to the cost of buying the hardware and maintaining these systems, the extra performance is worth the additional power consumption.

Also worth noting here is the uptick in ‘rack’ level power consumption. With up to 128 MI355X accelerators fitting into a single rack thanks to the density afforded by liquid cooling, this means a single rack can potentially draw upwards of 180 kW just from the GPUs alone.
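The arithmetic behind that figure is straightforward (GPU TDPs only; CPUs, NICs, switches, and cooling overhead all come on top):

```python
# GPU-only power for a liquid-cooled MI355X rack; CPUs, NICs, switches,
# and cooling overhead would all add to this.
gpu_tdp_watts = 1400   # MI355X, liquid-cooled
gpus_per_rack = 128    # density AMD quotes for liquid-cooled racks

rack_gpu_power_kw = gpu_tdp_watts * gpus_per_rack / 1000
print(f"{rack_gpu_power_kw:.1f} kW of GPUs per rack")  # 179.2 kW -> "upwards of 180 kW"
```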

Ultimately, just about everything about the MI350 is rooted in economics in some form or another. As AMD is looking to grow their AI market share – and particularly at the expense of NVIDIA – the company is looking to not just meet or beat its rival on performance, but to undercut them on pricing as well. Specific pricing is not being published, but AMD reckons they can offer upwards of 40% more tokens per dollar – or around 30% lower cost per token – than NVIDIA’s GB200 platform. Though as with all performance and cost claims, this remains to be seen, and is going to be highly workload dependent.
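It's worth noting that those two numbers are the same claim expressed two different ways; the conversion is just a reciprocal of AMD's headline figure:

```python
# "40% more tokens per dollar" and "~30% lower cost per token" are the same claim.
tokens_per_dollar_gain = 1.40              # AMD's headline claim vs. GB200
cost_per_token_ratio   = 1 / tokens_per_dollar_gain
print(f"~{(1 - cost_per_token_ratio) * 100:.0f}% lower cost per token")  # ~29%
```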

According to AMD, Instinct MI350 accelerators are already shipping to their partners. The company expects those partners to start offering MI350 solutions – cloud and hardware – in Q3 of this year. The company has not outlined the ramp schedule for the product, however, so it’s unclear just how quickly AMD will be able to get all the systems their customers want into their waiting hands.

The software counterpart to AMD’s hardware ambitions is the ROCm software stack. At its inception nearly a decade ago, ROCm, like its competitor, was focused mostly on being a high performance computing stack for scientific research. But the recent pivot to AI means that AMD has been gearing towards wide support, and through that, building an open software ecosystem, and that continues to be their direction to this day. With a relatively late start on assembling a software stack, ROCm has traditionally been the more fraught piece of the puzzle for AMD, but the software side of the business has been making good progress as of late, and is reaching some important milestones in terms of software support and reaction times.

The big announcement on the software front today is the upcoming release of ROCm 7, the latest iteration of the software stack, up from 6.4 earlier this year. The marquee addition is support for the CDNA4 architecture and MI350 series accelerators. But AMD’s software developers have also been focusing on everything from improving performance and better supporting cluster-scale operations to laying down the framework for enterprise-grade management and AI lifecycle features. Most noticeable, at least for those following the key people at AMD on social media, has been out-of-the-box support for popular models and integration methods.

At a high level, AMD is indicating that the ROCm 7 release marks a pivot point for the company, moving from focusing on catching up to NVIDIA to focusing on what comes next – being there to support new frameworks and services as early as possible. In other words, bringing day-0 support where possible, especially for the ever-popular PyTorch.

This also includes AMD’s previously announced efforts to bring ROCm to first-class citizen status under Windows – thereby helping them bootstrap the next generation of ROCm developers. While bits and pieces have been available under Windows for a while now, such as HIP and various tools under WSL, in Q3 of this year AMD will finally begin previewing native PyTorch support under Windows, along with ONNX runtime support. And perhaps most importantly of all, this next Windows release will bring support for ROCm software development on AMD’s latest generation of discrete (RDNA 4) and integrated (RDNA 3) GPU hardware.
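For developers wanting to kick the tires, a ROCm build of PyTorch exposes AMD GPUs through the familiar torch.cuda namespace, so checking that the stack actually sees a GPU looks much like it does on NVIDIA hardware. A minimal sketch, assuming a ROCm-enabled PyTorch build is installed:

```python
import torch

# ROCm builds of PyTorch report HIP devices through the torch.cuda API,
# so most CUDA-targeted code paths work unchanged.
print("ROCm/HIP build:", torch.version.hip is not None)
print("GPU available: ", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda")
    print("Matmul result shape:", (x @ x).shape)
```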

Elsewhere, there is an ever-continuing push for improved performance from software, which over the years is where most of the total performance gains within a given hardware generation have come from. Case in point: AMD is touting that both inference and training performance on the existing MI300X are upwards of 3.8x faster under ROCm 7 than ROCm 6.0, all thanks to software improvements.

Notably, these figures reflect the cumulative effect of all the software improvements AMD has made since the start of ROCm 6, so the gains are not solely due to ROCm 7; rather, they're meant to illustrate how far AMD has improved software performance since the MI300X launched in late 2023. AMD’s software team has stated that they aren’t holding back software performance improvements – they will deploy them as soon as possible – so ROCm 7 is more of a rolling release than a major milestone on the performance front.

What is new are a suite of features across the stack. Everything from distributed inference to GEMM tuning to GPU direct access has been touched in ROCm 7, all adding new tools and features for developers to further improve their ROCm programs.

And, as briefly mentioned earlier, AMD is finally paying more attention to enterprise management needs with ROCm Enterprise AI. This operational stack is intended to provide the tools that large enterprises need to actually manage and orchestrate AMD-based AI clusters, encompassing everything from provisioning to model fine-tuning.

ROCm 7 is being released in a preview format today. The final release will come at some point in the second-half of the year.

While not strictly a new product being announced or launched today, AMD also used part of today’s keynote to talk about their networking technologies, as well as talking up the recently-released Pollara 400 AI NIC. A product of AMD’s Pensando division, a previous acquisition, the Pollara 400 AI NIC is AMD’s first post-acquisition NIC, and an important cornerstone in developing a complete AMD hardware ecosystem. By offering AMD CPUs, AMD accelerators, and now suitable high-performance NICs, AMD’s partners can build whole racks of AMD-based systems.

As alluded to by the name, the Pollara 400 AI NIC is a 400G Ethernet card. With the central controller built on TSMC’s N4 process, the Pollara 400 is meant to be used as a highly programmable P4 NIC for scaling out AMD compute clusters, taking AMD from an 8-way system (their current scale-up limit) to as large as customers need.

Notably, this is also the first Ultra Ethernet Consortium-ready AI NIC, with AMD being one of the steering members of the group pushing forward the next generation of Ethernet. We’re still very early in the days of UEC, and AMD will be but one of hopefully many players, but the company is firmly hitching its wagon to this next generation of Ethernet to provide the scale-out features and performance that current Ethernet technologies cannot.

Last, but certainly not least, is the matter of AMD’s broader pivot to rack-scale solutions. With large customers increasingly buying systems by the whole rack – networking hardware and all – instead of by the individual server, AMD is looking to meet customers where they are (and meet rival NVIDIA head-on) by offering rack-scale systems of its own.

Today that comes in the form of MI350 + Turin 5th Gen EPYC + Pollara 400, the first generation of such systems where AMD has an Ethernet offering to pair with its CPUs and GPUs. However, the big prize is rack-scale systems that can truly scale up (rather than scale out) to the whole rack, and this is where AMD is laying out a series of fresh roadmaps for both their GPUs and their rack-scale systems. As we’ve seen with other AI vendors, the long hardware cycle times and significant investments required mean that hardware vendors are starting to offer an open look at their plans for the next few years, and AMD is now among this crowd.

First and foremost, let’s talk about the GPU side of matters. AMD has previously published GPU roadmaps leading up to the MI400 in 2026, calling it a true next-generation GPU architecture with little in the way of hard details. Now that the MI400 is only around a year off, AMD is outlining for the first time some of the features and the performance targets for their future accelerator.

With a performance target of 20 PFLOPS of FP8, MI400 is slated to double MI355X’s AI performance at low precision. Feeding the beast will be 432GB of next-generation HBM4 memory, with AMD touting a true generational jump in memory bandwidth of 19.6 TB/second – more than double MI355X’s. Based on what we know about HBM4 thus far, these figures allude to 12 stacks of memory on a single accelerator, but this remains to be confirmed. AMD has not previously named the underlying architecture for the chip, but in previous roadmaps it has been described as “CDNA Next.”
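The division behind that 12-stack inference is simple enough; the per-stack figures it implies are noted below as assumptions about HBM4, not AMD-confirmed specifications:

```python
# AMD's quoted MI400 memory totals, divided across an assumed 12 HBM4 stacks.
total_capacity_gb   = 432
total_bandwidth_tbs = 19.6
stacks = 12   # inferred from the totals, not confirmed by AMD

print(f"{total_capacity_gb / stacks:.0f} GB per stack")     # 36 GB, a plausible 12-high HBM4 stack
print(f"{total_bandwidth_tbs / stacks:.2f} TB/s per stack")  # ~1.63 TB/s, in line with early HBM4 targets
```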

Compute performance aside, the other notable revelation today is that MI400 will support Ultra Accelerator Link, the open industry standard for developing scale-up fabrics. If all goes according to plan, UAL will give AMD the scale-up bandwidth and flexibility that they lack today with the MI350 series, allowing AMD to go from 8 GPUs to a whopping 1024 GPUs in a scale-up configuration.

MI400, in turn, will be the heart of AMD’s next-generation rack-scale system, Helios. Combining MI400, 6th gen EPYC “Venice,” and the “Vulcano” NIC, Helios is meant to be AMD’s proper answer to today’s (and tomorrow’s) rack-scale systems from NVIDIA and others. With NVIDIA having already shown a bit of their hand with their own roadmaps, AMD thinks that a 72 GPU Helios rack will be able to match NVIDIA’s next-gen Vera Rubin racks on AI performance while beating it in memory capacity, memory bandwidth, and even scale-out network bandwidth. In short, AMD wants to capture performance leadership at the GPU level and at the rack level.

Venice is the name for Zen 6 based EPYC CPUs built on TSMC N2. AMD showed off a wafer of this only a few weeks ago with TSMC to showcase the first N2 silicon being developed. We expect Venice to be launched in 1H 2026.

Vulcano is an 800G networking solution that incorporates PCIe 6.0 support as well as Ultra Accelerator Link support, built on TSMC N3.

Edit from Ian: This rack looks to be double width. I’ve been saying for a while that a single rack width isn’t sufficient in today’s scale-up, and I’m fully expecting team red or green to redefine the base rack specification in the next year or two to compensate. We already see dual-width rack designs from some EDA companies and their emulation tools (eg Siemens), but if this image is a true-ish render, it could be coming to AI next year for sure.

And AMD’s rack plans don’t stop there. The company is publishing a whole rack roadmap that takes its AI offerings through 2027. Two years from now, the as-yet-unnamed next-gen AI rack will combine an even newer MI500 accelerator with EPYC “Verano” and the “Vulcano” NIC.

AMD’s goal is to iterate on their core architectures on a yearly basis, and so long as they’re able to hold to their roadmap published here, they’ll be able to do just that, with multiple new CPU and GPU families to build ever-faster racks.

Ultimately, AMD is setting a new 5 year goal to reach a 20-fold increase in rack-scale energy efficiency versus MI300X by 2030. And coupled with software optimizations to help drive down the amount of computational work required to actually train a model, AMD is ambitiously eyeing a possible 100x improvement in overall energy efficiency by that time.
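Taken together, the two targets imply a decomposition along these lines. A rough sketch using only AMD's stated goals; the per-year rate and the software factor are derived here, not AMD figures:

```python
# Decomposing AMD's stated 2030 efficiency targets.
rack_efficiency_goal    = 20    # rack-scale energy efficiency vs. MI300X (hardware goal)
overall_efficiency_goal = 100   # ambition once software/algorithmic gains are included

software_factor = overall_efficiency_goal / rack_efficiency_goal   # implied ~5x from software
annual_hw_rate  = rack_efficiency_goal ** (1 / 5)                   # per AMD's five-year framing
print(f"Implied software contribution: ~{software_factor:.0f}x")
print(f"Implied yearly hardware gain:  ~{annual_hw_rate:.2f}x per year")  # ~1.82x
```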

Suffice it to say, if all goes according to plan, then AMD has some very big plans for their AI offerings over the next several years. But for the immediate future, AMD’s AI offerings are going to be rooted in the new Instinct MI350 series, the CDNA4 architecture, and all the performance that kilowatts of power and low-precision math can unlock.

We’re planning deeper dives into today’s announcements, such as the architecture and the benchmarks, so subscribe to ensure you do not miss out!

Want to read more right now? How about this article from earlier in 2025 where AMD CEO Dr. Lisa Su laid out her plans for AMD this year.

Seven Highlights from AMD’s CEO Dr Lisa Su

More Than Moore, as with other research and analyst firms, provides or has provided paid research, analysis, advising, or consulting to many high-tech companies in the industry, which may include advertising on the More Than Moore newsletter or TechTechPotato YouTube channel and related social media. The companies that fall under this banner include AMD, Applied Materials, Armari, ASM, Ayar Labs, Baidu, Dialectica, Facebook, GLG, Guidepoint, IBM, Impala, Infineon, Intel, Kuehne+Nagel, Lattice Semi, Linode, MediaTek, NextSilicon, NordPass, NVIDIA, ProteanTecs, Qualcomm, SiFive, SIG, SiTime, Supermicro, Synopsys, Tenstorrent, Third Bridge, TSMC, Untether AI, Ventana Micro.
