AMD Seeks to Outpace NVIDIA's CUDA with Performance-Enhancing ROCm 7 Software
AMD Unveils ROCm 7.0: A Leap Forward in AI Performance
AMD has officially released its ROCm 7.0 software platform. First announced at the company's "Advancing AI 2025" event in mid-June, the final release followed in mid-September 2025. The update promises major improvements in inference and training performance for AMD's MI300-series parts and the MI355X.
The MI355X, launched this spring, is AMD's latest GPU offering, designed to close the performance gap with Nvidia's Blackwell accelerators. In inference workloads, AMD claims the MI355X holds a 1.3x edge over Nvidia's B200 when running DeepSeek R1 in SGLang.
One of the key features of ROCm 7.0 is AMD's AI Tensor Engine (AITER), which is aimed at maximising GenAI performance. Applied to models like DeepSeek R1, AITER can boost throughput by more than 2x. For inference, it also speeds up multi-head latent attention (MLA) decode operations by 17x and multi-head attention (MHA) prefill operations by 14x.
Compared to ROCm 6, ROCm 7.0 offers a roughly 3.5x uplift in inference performance and a 3x boost in effective floating point performance in model training. Support for OCP's microscaling datatypes in ROCm 7.0 cuts memory requirements by a factor of two to four.
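The two-to-four-times figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes the layout described in the OCP microscaling (MX) specification, where each block of 32 elements shares one 8-bit scale, and it assumes a 16-bit baseline such as BF16. The function names and the 70-billion-parameter model size are hypothetical, chosen only for illustration, and the numbers cover weights alone, not activations or KV cache.

```python
# Rough memory arithmetic for OCP microscaling (MX) datatypes.
# Assumption: blocks of 32 elements share one 8-bit scale factor.

def mx_bits_per_element(element_bits: int, block_size: int = 32,
                        scale_bits: int = 8) -> float:
    """Average storage cost per element, including the shared block scale."""
    return element_bits + scale_bits / block_size


def weight_gigabytes(num_params: float, bits_per_element: float) -> float:
    """Weight storage in GB (10^9 bytes) for a given parameter count."""
    return num_params * bits_per_element / 8 / 1e9


def reduction_vs_baseline(element_bits: int, baseline_bits: int = 16) -> float:
    """How many times smaller an MX format is than a dense baseline format."""
    return baseline_bits / mx_bits_per_element(element_bits)


if __name__ == "__main__":
    params = 70e9  # hypothetical 70-billion-parameter model
    print(f"16-bit baseline: {weight_gigabytes(params, 16):.1f} GB")
    for name, bits in [("MXFP8", 8), ("MXFP6", 6), ("MXFP4", 4)]:
        size = weight_gigabytes(params, mx_bits_per_element(bits))
        ratio = reduction_vs_baseline(bits)
        print(f"{name}: {size:.1f} GB, about {ratio:.2f}x smaller")
```

Under these assumptions the shared scale adds a quarter of a bit per element, so an 8-bit MX format lands slightly under 2x and a 4-bit MX format slightly under 4x relative to a 16-bit baseline.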
The MI355X carries 288 GB of HBM3e, 108 GB more than Nvidia's B200. This increased memory capacity is a significant advantage for handling larger and more complex AI models.
To make these improvements accessible, AMD is rolling out a pair of new dashboards. The Resource Manager is designed for managing large clusters of GPUs, while the AI Workbench streamlines training or fine-tuning popular foundation models.
ROCm 7.0 adds native support for PyTorch 2.7 and 2.9, TensorFlow 2.19.1, and JAX 0.6. AITER and the MXFP4 datatype have also been merged into popular inference serving engines like vLLM and SGLang, where enabling them is as simple as installing dependencies and setting environment variables.
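As a rough illustration of the "set an environment variable" step, the sketch below prepares an environment and a launch command for vLLM. It is an assumption-laden example, not AMD's documented procedure: the flag name `VLLM_ROCM_USE_AITER` and the model identifier should be checked against the vLLM and ROCm documentation for the installed versions, and the helper names `aiter_env` and `serve_command` are hypothetical.

```python
import os


def aiter_env(base=None, flag="VLLM_ROCM_USE_AITER"):
    """Return a copy of the environment with the AITER flag switched on.

    The flag name is an assumption; confirm it in the vLLM docs.
    """
    env = dict(os.environ if base is None else base)
    env[flag] = "1"
    return env


def serve_command(model):
    """Build the command line for vLLM's `serve` subcommand."""
    return ["vllm", "serve", model]


if __name__ == "__main__":
    env = aiter_env()
    cmd = serve_command("deepseek-ai/DeepSeek-R1")  # example model id
    print("Would run:", " ".join(cmd))
    print("With VLLM_ROCM_USE_AITER =", env["VLLM_ROCM_USE_AITER"])
    # To actually launch the server on a ROCm machine with vLLM installed:
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```

The launch itself is left commented out so the sketch runs on any machine; on a ROCm system the same variable could equally be exported in the shell before starting the server.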
ROCm 7.0 also broadens support for these low-precision datatypes, and AMD's Quark quantization framework is now production-ready. The platform is available for download from AMD's support site and in pre-baked container images on Docker Hub.
The MI350 series is AMD's first generation of GPUs to offer hardware acceleration for OCP's microscaling datatypes. The MI355X's main competitor is Nvidia's B300, which packs 288 GB of HBM3e.
In conclusion, AMD's ROCm 7.0 software platform offers significant improvements in AI performance, making it an attractive option for developers and researchers working on AI projects.