Amazon has built a substantial AI supercomputer, Project Rainier, for Anthropic; here is what has been revealed so far.
Amazon Web Services (AWS) is set to launch a groundbreaking AI supercomputing cluster, known as Project Rainier, later this year. This colossal system, also referred to as the Ultracluster, is designed to support advanced AI training and inference for companies like Anthropic and other customers [1][2][5].
At the heart of Project Rainier lie Amazon's custom AI chips, the Trainium2 accelerators, which distinguish it from traditional GPU-based clusters. The Ultracluster is poised to be one of the largest AI supercomputer clusters globally, featuring hundreds of thousands of these interconnected accelerators [1][5].
The emphasis on efficiency over raw speed caters to customer demand for cost-effective AI model training and inference. Expected to be operational in late 2025, Project Rainier aligns with AWS's ongoing investments in global data center infrastructure to bolster AI capabilities, including a significant AU$20 billion investment in Australia for data center expansions supporting AI [3][5].
Compared to other large-scale AI systems, Project Rainier stands out for its use of proprietary chips, AWS cloud integration, and the Anthropic partnership. While systems like OpenAI's Stargate and xAI's Colossus primarily rely on GPU-based hardware, Project Rainier's use of Trainium2 chips is a strategic bet on custom silicon to improve efficiency and reduce costs at scale [2].
Each Trn2 instance consists of 16 Trainium2 chips connected in a 4x4 2D torus using Amazon's high-speed NeuronLink v3 interconnect. Each Trn2 UltraServer will boast 12.8Tbps of connectivity, courtesy of Annapurna's custom Nitro data processing units. Interestingly, Amazon's third-gen accelerator, Trainium3, is expected to deliver about 4x the performance of its Trn2-based systems [4].
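To make the 4x4 2D torus concrete, here is a minimal sketch of how neighbor addressing works in such a topology: every chip links to four neighbors, and the edges wrap around. This is purely illustrative; the actual NeuronLink addressing scheme is not public.

```python
# Sketch: neighbor lookup in a 4x4 2D torus, the topology the article
# attributes to NeuronLink-connected Trainium2 chips within a Trn2 instance.
# Illustrative only; actual NeuronLink routing/addressing is not public.

def torus_neighbors(x: int, y: int, n: int = 4) -> list[tuple[int, int]]:
    """Return the four wrap-around neighbors of chip (x, y) in an n x n torus."""
    return [
        ((x - 1) % n, y),  # left  (wraps to column n-1 at the edge)
        ((x + 1) % n, y),  # right (wraps to column 0 at the edge)
        (x, (y - 1) % n),  # down  (wraps to row n-1 at the edge)
        (x, (y + 1) % n),  # up    (wraps to row 0 at the edge)
    ]

# Even a corner chip has four neighbors, thanks to the wrap-around links:
print(torus_neighbors(0, 0))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

The wrap-around links are what distinguish a torus from a plain mesh: no chip sits on an "edge", so worst-case hop counts between chips stay low and uniform.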
Project Rainier will represent the largest deployment of Amazon's Annapurna AI silicon to date. NeuronLink provides 256GB/s of inter-instance bandwidth per chip, and the Trn2 UltraServer meshes four Trn2 instances into a single 64-chip compute domain [4].
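A quick back-of-envelope calculation ties these two figures together. Assuming the quoted 256GB/s per-chip figure applies uniformly across all 64 chips (an assumption; the article states only the per-chip number), the aggregate NeuronLink bandwidth of one UltraServer works out as follows:

```python
# Back-of-envelope aggregate NeuronLink bandwidth for one 64-chip Trn2
# UltraServer. Assumes the quoted 256 GB/s per-chip figure applies
# uniformly to all chips (an assumption; the source gives only per-chip).

chips_per_ultraserver = 4 * 16  # four Trn2 instances of 16 chips each
per_chip_gb_s = 256             # GB/s of NeuronLink bandwidth per chip

aggregate_tb_s = chips_per_ultraserver * per_chip_gb_s / 1000
print(f"{aggregate_tb_s:.1f} TB/s aggregate")  # 16.4 TB/s aggregate
```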
Details remain rather light on Amazon's Trainium3 accelerator, but it is speculated that Project Rainier may eventually adopt this next-generation technology. For training workloads, Amazon's Trn2 instances do have a bit of an advantage, offering higher sparse floating-point (FP8) performance [4].
Amazon is also developing a custom fabric that delivers tens of petabits of bandwidth with under 10 microseconds of latency across the network. Unlike Nvidia's B200 systems, each Trn2 instance spreads its Trainium2 accelerators across eight compute blades (two Trainium2 chips each), managed by a pair of Intel x86 CPUs [4].
In summary, Amazon’s Project Rainier represents a transformative AI supercomputing architecture focused on proprietary chip technology and cloud-scale efficiency, distinguishing it from other leading AI supercomputers that leverage GPU-centric architectures. It aims to be operational later in 2025, marking a major milestone in the evolution of AI infrastructure [1][2][5].
- The heart of Amazon's Project Rainier, the Ultracluster, consists of Amazon's custom AI chips, the Trainium2 accelerators, setting it apart from traditional GPU-based clusters.
- Project Rainier's emphasis on efficiency over raw speed aligns with AWS’s ongoing investments in technology, including a significant investment of AU$20 billion in Australia for data center expansions supporting AI.
- Compared to other large-scale AI systems like OpenAI's Stargate and xAI's Colossus, Project Rainier stands out due to its use of proprietary chips, AWS cloud integration, and its focus on cloud-scale efficiency.