AI Inference Chip Showdown

Introduction to AI Inference Processing

Everyone is not just talking about AI inference; they are doing it. Analyst firm Gartner released a new report this week forecasting that global generative AI spending will hit $644 billion in 2025, growing 76.4% year-over-year. Meanwhile, MarketsandMarkets projects that the AI inference market will grow from $106.15 billion in 2025 to $254.98 billion by 2030. However, buyers still need to know which AI processor to buy, especially as inference has evolved from a simple one-shot pass through a model to agentic and reasoning workloads that can increase computational requirements by some 100-fold.
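
To see where a figure like 100-fold can come from, here is a minimal back-of-the-envelope sketch in Python. All token counts are illustrative assumptions, not measurements; it simply treats inference cost as roughly proportional to tokens generated and processed, and compares a one-shot answer against a multi-step reasoning agent.

```python
# Back-of-the-envelope: why agentic/reasoning inference costs so much more.
# All numbers below are illustrative assumptions, not benchmark results.

ONE_SHOT_OUTPUT_TOKENS = 300        # a single direct answer

REASONING_TOKENS_PER_STEP = 1_500   # chain-of-thought tokens per agent step
TOOL_RESULT_TOKENS = 1_500          # tool output re-processed each step
AGENT_STEPS = 10                    # plan / tool-call / reflect iterations

def one_shot_cost() -> int:
    """Token cost of a single pass through the model."""
    return ONE_SHOT_OUTPUT_TOKENS

def agentic_cost() -> int:
    """Token cost of a multi-step agent that reasons and calls tools."""
    return AGENT_STEPS * (REASONING_TOKENS_PER_STEP + TOOL_RESULT_TOKENS)

ratio = agentic_cost() / one_shot_cost()
print(f"one-shot: {one_shot_cost():,} tokens")
print(f"agentic:  {agentic_cost():,} tokens (~{ratio:.0f}x)")
```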

Performance Continues to Skyrocket

For seven years, the not-for-profit group MLCommons has helped AI buyers and vendors by publishing peer-reviewed AI benchmark results each quarter. It has just released its MLPerf Inference v5.0 suite of results, with new chips, servers, and models. Let’s take a look.

The New Benchmarks

Three new data center benchmarks were added: the larger Llama 3.1 405B, Llama 2 70B with latency constraints for interactive work, and a new “R-GAT” benchmark for graph models. Only Nvidia ran all the models. A new edge-inference benchmark was also added: the Automotive PointPainting test for 3D object detection. MLCommons now manages 11 AI benchmarks in total.
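
As a rough illustration of what a latency-bounded “interactive” benchmark measures, here is a minimal Python sketch. The constraint values and request data are assumptions for illustration, not the official MLPerf limits; the point is simply that throughput only counts when every request also meets its time-to-first-token (TTFT) and time-per-output-token (TPOT) targets.

```python
# Minimal sketch of a latency-bounded serving metric.
# Constraint values are illustrative assumptions, not official MLPerf limits.
from dataclasses import dataclass

TTFT_LIMIT_S = 0.450   # assumed time-to-first-token limit (seconds)
TPOT_LIMIT_S = 0.040   # assumed time-per-output-token limit (seconds)

@dataclass
class Request:
    ttft_s: float        # measured time to first token
    tpot_s: float        # measured average time per output token
    output_tokens: int
    duration_s: float    # wall-clock time for the whole request

def interactive_throughput(requests: list[Request]) -> float:
    """Tokens/sec across requests, valid only if ALL requests meet the limits."""
    for r in requests:
        if r.ttft_s > TTFT_LIMIT_S or r.tpot_s > TPOT_LIMIT_S:
            raise ValueError("latency constraint violated; run does not qualify")
    total_tokens = sum(r.output_tokens for r in requests)
    total_time = max(r.duration_s for r in requests)  # requests run concurrently
    return total_tokens / total_time

# Example with made-up measurements:
reqs = [Request(0.30, 0.025, 256, 7.0), Request(0.40, 0.030, 512, 16.0)]
print(f"{interactive_throughput(reqs):,.0f} tokens/sec")
```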

The New Chips

AI is built on silicon, and MLCommons received submissions for six new chips this round:

  • AMD Instinct MI325X (launched last fall)
  • Intel Xeon 6980P “Granite Rapids” CPU
  • Google TPU Trillium (TPU v6e), now generally available
  • Nvidia B200 (Blackwell)
  • Nvidia Jetson AGX Thor 128 for AI at the edge
  • And, perhaps most importantly, the Nvidia GB200, the beast that powers the NVL72 rack that has data centers scrambling for power and cooling

The New Results: Nvidia

As usual, Nvidia won all benchmarks; this time, it won by a lot. First, the B200 tripled the performance of the H200 platform, delivering over 59,000 tokens per second on the latency-bounded Llama 2 70B Interactive benchmark. The new Llama 3.1 405B model runs 3.4 times faster on Blackwell. Now for the real test: is the NVL72 as fast as Nvidia promised at launch? Yes: it is thirty times faster than the 8-GPU H200 system running the new Llama 405B, though it has 9 times more GPUs.
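
A quick sanity check on that claim, normalizing the system-level speedup by GPU count (a rough comparison that ignores interconnect and scale effects):

```python
# Normalize the NVL72-vs-H200 result to a per-GPU figure.
# System speedup and GPU counts are taken from the reported results above.
system_speedup = 30   # NVL72 vs 8-GPU H200 on Llama 3.1 405B
nvl72_gpus = 72
h200_gpus = 8

per_gpu_speedup = system_speedup / (nvl72_gpus / h200_gpus)
print(f"per-GPU speedup: ~{per_gpu_speedup:.1f}x")   # ~3.3x
```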

Nvidia Performance

The new Llama 3.1 405B benchmark supports input and output lengths up to 128,000 tokens (compared to only 4,096 tokens for Llama 2 70B). The benchmark tests three distinct tasks: general question-answering, math, and code generation. And when you add Nvidia’s new open-source Dynamo “AI Factory OS,” which optimizes AI at the data center level, AI factory throughput can double again running Llama, and run some thirty times faster on DeepSeek.
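
Taken at face value, those vendor-reported multipliers compound. A trivial sketch, under the optimistic assumption that the hardware and software gains are independent:

```python
# Compound the reported speedups (vendor figures; treating them as
# independent is an optimistic simplification).
blackwell_vs_hopper = 3.4   # Llama 3.1 405B, B200 vs H200 (reported above)
dynamo_gain = 2.0           # Dynamo throughput doubling on Llama (reported)

print(f"combined: ~{blackwell_vs_hopper * dynamo_gain:.1f}x over Hopper")  # ~6.8x
```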

And, Surprise, AMD Has Rejoined the MLPerf Party!

Welcome back, AMD! The new AMD Instinct MI325X did quite well on the select benchmarks AMD ran, competing admirably with Nvidia’s previous-generation Hopper GPUs. So, for AI practitioners who know what they are doing and don’t need the value of Nvidia’s software stack, the MI325X can save a lot of money. AMD also did quite well on the Llama 3.1 405B serving benchmark (distinct from the interactive 405B benchmark mentioned previously), and proudly noted that Meta now uses the (older) MI300X as the exclusive inference server for its 405B model.

Conclusion

Nvidia retains the crown of AI King across all AI applications. Although competition is on the horizon, AMD delivers competitive performance only against the previous Nvidia GPU generation. AMD expects the MI350, due later this year, to close the gap; however, by then the GB300 should keep Nvidia in the lead at the GPU level. But the real issue is that while everyone else is trying to compete at the GPU level, Nvidia keeps raising the bar at the data center level, with massive investments in software, solutions, and products that ease AI deployment and lower TCO.

FAQs

  • Q: What is the forecast for global generative AI spending in 2025?
    A: Global generative AI spending is expected to hit $644 billion in 2025, growing 76.4% year-over-year.
  • Q: What is the projected growth of the AI inference market from 2025 to 2030?
    A: The AI inference market is expected to grow from $106.15 billion in 2025 to $254.98 billion by 2030.
  • Q: What is the performance of Nvidia’s B200 chip compared to the H200 platform?
    A: The B200 chip tripled the performance of the H200 platform, delivering over 59,000 tokens per second on the latency-bounded Llama 2 70B Interactive model.
  • Q: How does AMD’s MI325X chip perform compared to Nvidia’s Hopper GPU?
    A: The AMD MI325X competes admirably with the previous-generation Hopper GPU, and can save AI practitioners a lot of money if they don’t need Nvidia’s software stack.
  • Q: What is the significance of Nvidia’s Dynamo “AI Factory OS”?
    A: Nvidia’s Dynamo “AI Factory OS” optimizes AI at the data center level, allowing for doubled throughput and lower TCO.