Exclusive: AWS is developing a new high-power AI chip

  • AWS VP of Infrastructure revealed that the hyperscaler wants to push its next-generation Trainium chip above 1,000 watts
  • Trainium3 silicon will be a 1,000+ watt chip like Nvidia’s Blackwell GPU
  • AWS is also making other moves to prepare its data centers for a wave of GenAI demand

When it comes to high-power artificial intelligence (AI) computing, Nvidia is the one to beat. But Amazon Web Services (AWS) is jockeying for a spot on the AI chip leaderboard with its forthcoming Trainium3 chip, revealed AWS VP of Infrastructure Services Prasad Kalyanaraman in a conversation with Fierce Network this week.

AWS has been building its own chips for a while, but Trainium3 will cross a key power threshold.

Kalyanaraman didn’t specify wattages for Trainium3 or its predecessor Trainium2, which was unveiled in November 2023 and is set to become available later this year. But he did say that liquid cooling is required for chips that use 1,000+ watts.

While Trainium2 doesn’t require liquid cooling, Kalyanaraman noted Trainium3 will.

"The current generation of chips don't require liquid cooling, but the next generation will require liquid cooling. When a chip goes above 1,000 watts, that's when they require liquid cooling," he stated, adding that the company’s other AI chip, Inferentia, requires much lower power.

Trainium3 power

“The Trainium3 chip has the makings to be a very powerful chip, but it’s all about timing,” given Nvidia is already planning its next-generation Rubin chip and Intel is rumored to be working on a 1,500-watt chip as well, Dell’Oro Group Research Director Lucas Beran told Fierce when asked for his take on what AWS revealed.

“To me, this is a clear signal, they’re saying they can’t compete with the likes of chips from Nvidia without pushing power density to levels that require liquid cooling,” he added.

Kalyanaraman didn’t say when Trainium3 will be available, nor did he indicate when liquid cooling will roll out in AWS’ data centers.

However, Beran said it makes sense that AWS would want to prepare for future chips well in advance given the lead times for coolant distribution units (which are the beating heart of liquid cooling systems) can be upwards of a year.

AWS buys and offers Nvidia chips, and there’s no indication that the rollout of Trainium3 would change that, Beran said.

Keeping it cool

When it comes to data center infrastructure, Beran said that, like Nvidia’s announcement earlier this year that Blackwell would be liquid cooled, AWS’ expected adoption of liquid cooling is a “big step” for the industry.

While Nvidia’s move is expected to help liquid cooling proliferate with a much wider set of customers, AWS’ play will still meaningfully move the market in terms of sheer revenue, he explained.

Of course, deploying such a high-power chip has big implications for AWS’ data centers.

Kalyanaraman said that today, pretty much all of AWS’ data centers use traditional air cooling. That’s fine for the current generation of chips, but the company is preparing for a future in which liquid cooling is required.

So, which of the many forms of liquid cooling does it plan to use? Kalyanaraman said immersion cooling is currently off the table; instead, AWS plans to adopt single-phase cold plate technology. He added that research into microfluidics, which would allow AWS to pipe liquid directly to high-heat areas of its chips, remains ongoing.

Data center makeover

Kalyanaraman said that in addition to designing its data centers to support liquid cooling, AWS is making several other optimizations, including strategic rack positioning and networking setups.

On the network front, Kalyanaraman said AWS has long built its own commodity switches and in 2019 rolled out its own Elastic Fabric Adapter (EFA) network interface, which uses the Scalable Reliable Datagram (SRD) low-latency transport protocol. The key there, he said, is that AWS isn’t locked into proprietary protocols (ahem, InfiniBand).
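
For a sense of what that looks like from the customer side, here’s a minimal sketch of launching an EFA-enabled EC2 instance with boto3; the AMI, subnet, security group and instance type below are hypothetical placeholders, not AWS recommendations.

```python
# Minimal sketch (hypothetical IDs): launching an EC2 instance with an
# Elastic Fabric Adapter attached via boto3. Applications typically reach
# the EFA, and the SRD transport underneath it, through libfabric or NCCL
# rather than raw sockets.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",    # placeholder: an EFA-capable AMI
    InstanceType="p5.48xlarge",         # an EFA-capable instance type
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",         # request an EFA rather than a standard ENI
        "SubnetId": "subnet-0123456789abcdef0",  # placeholder
        "Groups": ["sg-0123456789abcdef0"],      # placeholder
    }],
)
print(response["Instances"][0]["InstanceId"])
```

Part of SRD’s low-latency trick, per AWS’ published work on the protocol, is spraying packets across many network paths and tolerating out-of-order delivery rather than insisting on a single ordered flow.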

The vast majority of its switches today support 12.8 Tbps. Kalyanaraman said its next-gen switches will ramp that up to 51.2 Tbps.

Beyond the switches, it has also built commodity optics to avoid relying on OEM parts and has worked with EML (electro-absorption modulated laser) providers as well as laser and transponder providers to ensure it can mix and match optical components and “not be beholden to a single provider.” And in case you were wondering, it runs 400G optics that provide 100G of bandwidth per lane.
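
The arithmetic behind those figures is easy to check. The sketch below assumes the switch capacity is carved into 400G ports of four 100G lanes each; the breakdowns are illustrative (a 51.2 Tbps ASIC could just as well be carved into 800G ports), not AWS-confirmed configurations.

```python
# Back-of-envelope lane math for the switch and optics figures above.
LANE_GBPS = 100                 # article: 100G of bandwidth per lane
PORT_GBPS = 400                 # article: 400G optics (4 x 100G lanes)

for asic_tbps in (12.8, 51.2):  # current and next-gen switch capacity
    total_gbps = asic_tbps * 1000
    print(f"{asic_tbps} Tbps: {total_gbps / PORT_GBPS:.0f} x {PORT_GBPS}G ports "
          f"({total_gbps / LANE_GBPS:.0f} x {LANE_GBPS}G optical lanes)")

# Output:
# 12.8 Tbps: 32 x 400G ports (128 x 100G optical lanes)
# 51.2 Tbps: 128 x 400G ports (512 x 100G optical lanes)
```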

In terms of rack positioning, Kalyanaraman said AWS is carefully planning layouts to avoid stranding a precious resource: power.

What does this mean? Well, a data center doesn’t just have racks and servers for AI. It also has racks and servers for memory, storage and general-purpose compute, each of which draws a different amount of power. When a fixed amount of power is allotted to each aisle, you want to maximize how much of that power actually gets drawn.

If you put all your AI servers and racks in one aisle, that would mean other aisles with storage servers, for instance, might not use all the power available to them. And boom, then you’re stuck with stranded power. So, it ends up being a bit like Tetris in that it’s a giant packing game.

“You have to think through what’s the forecast, how many racks do we expect to land over the future weeks and months, and then you have to precisely pack,” he said.
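
The article doesn’t describe AWS’ actual placement algorithm, but the problem Kalyanaraman sketches is essentially bin packing. Here’s a toy first-fit-decreasing illustration, with made-up rack draws and aisle budgets, of why mixing rack types keeps power from going stranded.

```python
# Toy illustration of the "stranded power" packing problem. First-fit
# decreasing: place the hungriest racks first, then backfill each aisle's
# remaining headroom with smaller racks. All numbers are hypothetical.
AISLE_BUDGET_KW = 100

racks = [  # (name, power draw in kW)
    ("ai-1", 40), ("ai-2", 40), ("ai-3", 40),
    ("compute-1", 20), ("compute-2", 20),
    ("storage-1", 12), ("storage-2", 12), ("memory-1", 8),
]

aisles: list[list[tuple[str, int]]] = []

for rack in sorted(racks, key=lambda r: r[1], reverse=True):
    for aisle in aisles:
        if sum(kw for _, kw in aisle) + rack[1] <= AISLE_BUDGET_KW:
            aisle.append(rack)   # fits in an existing aisle's headroom
            break
    else:
        aisles.append([rack])    # no room anywhere: open a new aisle

for i, aisle in enumerate(aisles):
    used = sum(kw for _, kw in aisle)
    print(f"aisle {i}: {used}/{AISLE_BUDGET_KW} kW used, "
          f"{AISLE_BUDGET_KW - used} kW stranded")
```

In this made-up example, two 40 kW AI racks alone would leave 20 kW of an aisle stranded; backfilling with a 20 kW compute rack uses the aisle fully, which is the interleaving Kalyanaraman is describing.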

Though they may seem unrelated, cooling, power utilization and networking are all pieces of a bigger puzzle AWS is putting together.

According to Kalyanaraman, they’re all part of the company’s plan to improve efficiency and achieve carbon neutrality by 2040. Who’d have thought?