AI: Amazon AI chips traction with key customers & partners. AI-RTZ #1094

Nvidia continues to ride high in this AI Tech Wave with its AI GPUs and related systems globally, despite the ‘Frenemies’ dynamic of the industry, where its top customers are also developing their own chips for AI training and inference.

I’ve talked about Google’s progress to date with its TPUs (tensor processing units) with external customers like OpenAI and others.

Now Amazon also seems to be making headway with its Trainium and other AI chips, particularly with its equity partner Anthropic amongst others. And the chips are becoming a key part of its AWS cloud offerings.

The Information explains in “Amazon’s Nvidia Alternative Starts Winning Over AI Developers”:

  • “Scarcity of Nvidia GPUs has made Amazon’s pitch more attractive”

  • “Amazon worked closely with Anthropic on improving Trainium software”

  • “Developers say Trainium documentation and support have improved recently”

“Amazon’s years long effort to build a serious alternative to Nvidia’s dominant AI chips is starting to gain traction.”

And it starts with both of the top LLM AI companies working closely with Amazon on its custom chips. In some ways, think of this as an extension of each of these companies’ own efforts to build their own chips with Broadcom and others.

“Anthropic and OpenAI, which have struck multibillion-dollar investment and infrastructure deals with Amazon, have already committed to renting large amounts of current and future Trainium capacity. Now, recent software improvements are prompting smaller developers to consider moving more workloads to Trainium, half a dozen people who use or work with the chips said.”

“That includes Daniel Svonava, CEO of Superlinked, an infrastructure firm that helps companies run AI models on rented infrastructure. He said Amazon’s pitch on Trainium, including potential cost savings by switching to the chip, only recently started becoming more compelling.”

“Our response has always been the lack of software support being a barrier,” Svonava said. “That’s the thing that changed in the last couple months. That barrier has been removed.”

He’s referring to Nvidia’s software moat with its proprietary CUDA platform, which customers rely on to scale their work on Nvidia chips.

“The scarcity of Nvidia chips has also made Amazon’s pitch more attractive, with sales reps telling the startup they have limited availability on the latest graphics processing units. At the same time, Amazon has indicated it has more Trainium capacity available and is willing to be flexible on price, he said. Amazon has given Superlinked $200,000 worth of AWS credits, which it is using to test Trainium.”

Amazon is of course offering Nvidia chips as part of its AWS services:

“The new interest comes as Amazon is betting that Trainium can improve the economics of its AI cloud business. In a January interview with The Information, CEO Andy Jassy said that while Amazon plans to continue buying Nvidia chips, “if you’re building a big inference business” that charges less and has sustainable margins, “you’re strategically disadvantaged if you don’t have your own custom silicon.”

But its own chips are building traction:

“Last month, Jassy said Amazon’s custom silicon business, including Trainium and Graviton, has reached a more than $20 billion annualized run rate, or roughly $50 billion if measured as a stand-alone chip seller. That $20 billion reflects revenue from customers using Trainium and Graviton directly through Amazon’s EC2 service, an Amazon spokesperson said. It excludes offerings such as Amazon’s Bedrock, which lets customers access AI models, and internal Amazon workloads.”

Here Amazon, working with Anthropic, is borrowing a page from Nvidia’s successful playbook to date:

“Getting to this point took years of software work and close collaboration with Anthropic. Nvidia’s advantage was as much in software as in hardware—developers had spent years building around Cuda, while Amazon had to make its Trainium software, called Neuron, easy enough to justify switching.”

It’s been a long time in the making, especially vs Google TPUs and others:

“Amazon announced Trainium in 2020 through its Annapurna Labs unit, initially pitching it as a cheaper way to train machine-learning models on AWS. When the first-generation chips launched, early internal users included Amazon’s search teams, which helped shape the chip’s development, according to someone with knowledge of the matter.”

“But when Amazon staff began ramping up generative AI products in late 2022, some teams did not use Trainium broadly, and Amazon’s Nova large language models were first trained on Nvidia GPUs, according to a former employee.”

“Amazon announced in 2023 that Anthropic would use Trainium and Inferentia to train and run future models, and by the following year had committed $8 billion to Anthropic. The two companies also teamed up to make Trainium faster and more efficient.”

“Anthropic and Amazon engineers worked closely to optimize Trainium for Anthropic’s models, talking frequently, according to Carlos Escapa, a former AWS executive who worked on selling Anthropic models. Anthropic and Amazon made software improvements that could also benefit other customers.”

“The collaboration between Anthropic and AWS on the NKI [Neuron Kernel Interface] has been very, very deep,” Escapa said, referring to Amazon software that lets developers fine-tune how models run on Trainium chips. “And some of these features that have been developed for Anthropic have also become very useful for other companies.”

“Some of the work involved software changes that helped Trainium perform more processes simultaneously, Escapa said. Anthropic co-founder Tom Brown has publicly described the broader effort as “a game of Tetris,” where a tight chip architecture makes models cheaper and faster.”

“By the end of 2024, Amazon had launched its second-generation Trainium chip broadly and announced Project Rainier, a large Trainium cluster for Anthropic. Inside Amazon, Trainium use began picking up in some areas, with Nova starting to use Trainium in 2024 and ramping up since then with pretraining in particular, a former employee said.”

And the company also ‘dogfooded’ its own chips along the way:

“Bedrock, which offers access to Anthropic and other models, initially relied on GPUs, according to two people with knowledge of the product. One of the people said some Bedrock workloads in 2024 required roughly twice as many Trainium chips as Nvidia chips to handle the same workload.”

Then it expanded to partners:

“Amazon said it prioritized limited Trainium capacity for external customers such as Anthropic as demand accelerated. The company also said Bedrock used Trainium for models and tasks where the chips offered better cost and performance, while relying on GPUs for other models to keep Bedrock’s selection broad.”

“As the software matured, Amazon said, more Bedrock workloads moved to Trainium, which now runs the majority of Bedrock inference across more than 125,000 customers. Amazon also said it is planning to train its largest internal models on Trainium going forward.”

Progress was slower in supporting external AI models on Amazon’s chips:

“Outside Amazon, developers had their own frustrations with Trainium. Julien Simon, an AI operating partner at a private equity firm, first ran into issues while working at Hugging Face, where he spent three years. Hugging Face had worked with Trainium chips, and Amazon was sometimes slow to support newer models on the startup’s open-source platform, Simon said.”

“In recent months, however, several customers said Amazon has made Trainium much easier to use by improving documentation and support and making the chips work better with popular open-source tools.”

Open source support, especially for models from China, is important because more customers are turning to those models, as I’ve outlined.

“That included a native PyTorch integration Amazon unveiled in December, an important step because PyTorch is many developers’ default programming platform and has long worked best with Nvidia. Before the integration, developers often had to write code in PyTorch and adapt it to Amazon’s Neuron software.”

The efforts led to some high profile external wins:

“In February, Amazon said OpenAI would take around 2 gigawatts of Trainium capacity, including Trainium3 and the upcoming Trainium4, alongside an initial $15 billion Amazon investment.”

And Amazon is pairing Trainium with Cerebras chips, recently in the news.

“Amazon has also paired Trainium with Cerebras, another OpenAI chip partner. OpenAI announced in January that it would use Cerebras systems for 750 megawatts of high-speed inference compute. Two months later, Amazon said it would deploy Cerebras systems in AWS data centers alongside Trainium to deliver faster inference through Bedrock.”

And the traction with Anthropic in particular accelerated.

“Amazon later expanded its Anthropic partnership, with Anthropic committing to spend more than $100 billion on AWS over the next decade, including on Trainium capacity. Amazon also said it would invest another $5 billion in Anthropic, with up to $20 billion more tied to future milestones.”

All leading to the current state of affairs on chip availability:

“Late last month, Amazon said Trainium2 is largely sold out, Trainium3 is nearly fully subscribed and much of Trainium4, which is about 18 months from broad availability, is reserved. The Amazon spokesperson said Trainium’s customer base “extends well beyond OpenAI and Anthropic,” citing examples including Uber and Decart.”

So solid progress starting with a core array of partners and customers.

“Still, Amazon has not detailed how much of that demand comes from a small number of large customers, and it’s unclear how much of a market exists beyond those AI giants and Amazon’s own services. Many AI-heavy companies buy model access through application programming interfaces or Amazon’s Bedrock instead of renting chips directly, Svonava said, and most companies don’t want to actively evaluate the underlying chips.”

But still a long way to go vs Nvidia and others.

“Trainium also hasn’t fully displaced Nvidia inside Amazon. Some of the models underpinning Amazon’s shopping AI still use Nvidia chips exclusively, someone with direct knowledge of the product said. Anthropic is still securing Nvidia capacity too, recently announcing a deal with SpaceX to access more than 220,000 Nvidia chips within the coming month.”

The whole piece is worth a full read for additional details.

But it’s useful to track Amazon’s progress with its AI chips to date in this AI Tech Wave. Another strong horse in the race besides Google’s TPUs. Stay tuned.

(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)




