On June 30th, it was reported that Ian Buck, Vice President and General Manager of Nvidia's hyperscale and HPC computing business, recently said at the Bank of America Securities 2024 Global Technology Conference that customers are investing billions of dollars in new Nvidia hardware to keep up with demand for newer AI models, and to boost their own revenue and productivity.
Buck said that the companies racing to build large data centers will benefit most, seeing substantial returns over a data center's four-to-five-year lifespan. "For every dollar a cloud provider spends on GPUs, it can recover $5 over four years by selling GPU compute as a service (GaaS)."
"If used for reasoning, it would be more profitable, generating $7 in revenue for every $1 spent during the same time period, and this number is still growing," Buck said.
Nvidia founder, President, and CEO Jensen Huang and Executive Vice President and CFO Colette Kress have expressed similar views before.
They previously said that through innovations in CUDA algorithms, Nvidia has tripled the LLM inference speed of the H100, cutting the cost of serving models like Llama 3 to one-third. The H200 nearly doubles inference performance over the H100, bringing enormous value to production deployments. For example, running the 70-billion-parameter Llama 3, a single HGX H200 server can output 24,000 tokens per second and support more than 2,400 concurrent users. At current pricing, that means an API provider hosting Llama 3 can earn $7 in token-billing revenue over the next four years for every $1 spent on Nvidia HGX H200 servers.
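That $7-per-$1 claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below does so in Python; apart from the 24,000 tokens/second quoted above, every input (server cost, token price, utilization) is an illustrative assumption, not a figure from Nvidia or Meta.

```python
# Back-of-envelope check on the "$7 per $1" claim for Llama 3 70B on HGX H200.
# All inputs are illustrative assumptions except the quoted throughput.

TOKENS_PER_SECOND = 24_000        # throughput quoted for one HGX H200 server
SERVER_COST_USD = 300_000         # assumed all-in server cost
PRICE_PER_M_TOKENS = 0.90         # assumed API price, USD per million tokens
UTILIZATION = 0.8                 # assumed average load over the period
YEARS = 4

seconds = YEARS * 365 * 24 * 3600
tokens_served = TOKENS_PER_SECOND * UTILIZATION * seconds
revenue = tokens_served / 1e6 * PRICE_PER_M_TOKENS

print(f"Tokens served over {YEARS} years: {tokens_served:.3e}")
print(f"Revenue: ${revenue:,.0f} -> ${revenue / SERVER_COST_USD:.2f} per $1 of server cost")
```

With these assumed inputs, the server earns roughly $7.3 per $1 of hardware cost over four years; the ratio is driven almost entirely by the assumed token price and utilization.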
Inference around open models such as Llama, Mistral, and Gemma is evolving quickly, with output billed by the token. Nvidia packages such open-source AI models into containers called Nvidia Inference Microservices (NIM).
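A NIM container is consumed as a network service. As a minimal sketch of what that looks like in practice, the snippet below assumes a NIM serving Llama 3 has been deployed locally behind an OpenAI-compatible chat endpoint on port 8000; the URL path, port, and model identifier are assumptions for illustration, not guaranteed defaults.

```python
# Hypothetical call to a locally deployed NIM container, assuming it exposes
# an OpenAI-compatible endpoint (path, port, and model name are assumptions).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta/llama3-70b-instruct",   # assumed model identifier
        "messages": [{"role": "user", "content": "Summarize what a GPU does."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```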
Nvidia's latest Blackwell GPU has been optimized for inference. It supports the FP4 and FP6 data types, which further improve energy efficiency on lower-precision AI workloads. According to official figures, Blackwell trains 4 times faster than the Hopper-generation H100 and runs inference up to 30 times faster, can serve trillion-parameter generative AI language models in real time, and cuts cost and energy consumption to as little as 1/25th of Hopper's. This echoes Jensen Huang's oft-repeated slogan, "the more you buy, the more you save." It cannot be ignored, however, that Nvidia GPU prices are also rising rapidly.
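To see why 4-bit data types save so much memory and bandwidth, here is a simplified sketch that quantizes FP32 weights down to 16 levels with a per-tensor scale. The integer-style rounding is a stand-in for illustration only; Blackwell's actual FP4 floating-point format is more involved.

```python
# Illustrative 4-bit quantization: map FP32 weights onto 16 levels (2^4)
# using one per-tensor scale factor. Not Blackwell's actual FP4 format.
import numpy as np

weights = np.random.randn(8).astype(np.float32)

scale = np.abs(weights).max() / 7                      # signed 4-bit range is [-8, 7]
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
dequant = q.astype(np.float32) * scale                 # values the GPU would compute with

print("original :", np.round(weights, 3))
print("4-bit    :", np.round(dequant, 3))
print("max error:", float(np.abs(weights - dequant).max()))
print("bytes: fp32 =", weights.nbytes, "-> 4-bit ~", len(q) // 2)
```

Storing a weight in 4 bits instead of 32 moves an eighth as much data per operation, which is where much of the energy saving comes from.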
Preparing for the Rubin GPU
Many cloud providers begin planning new data centers two years in advance and want to know what future AI GPU architectures will look like.
At Computex 2024 in early June, Nvidia announced that Blackwell chips are in production and will soon succeed Hopper. Blackwell Ultra GPUs will launch in 2025. Nvidia also announced Rubin, a next-generation AI platform integrating HBM4 memory, to be released in 2026 as the successor to Blackwell and Blackwell Ultra.
"For us, achieving this is really important - data centers are not built out of thin air, they are large-scale construction projects. They need to understand what Blackwell data centers will look like and how they differ from Hopper data centers?" Buck said.
Blackwell is an opportunity to move to denser forms of compute and adopt technologies such as liquid cooling, since air cooling is less efficient.
Nvidia has announced an annual cadence for new GPUs, which helps the company keep pace with AI development and helps customers plan their products and AI strategies.
Buck said, "Nvidia has been discussing Rubin GPUs with its largest clients for some time - they know our goals and timeline."
The speed and capability of AI are directly tied to hardware: the more money an AI company invests in GPUs, the larger the models it can train, and the more revenue it can generate.
Microsoft and Google are pinning their futures on AI and racing to build more powerful large language models. Microsoft leans heavily on new GPUs to run the backend for GPT-4, while Google runs its AI infrastructure on its own TPUs.
Blackwell is in short supply
Nvidia is currently producing Blackwell GPUs, and samples will ship soon. But customers should expect the first batch, shipping by the end of the year, to be in short supply.
"Every transformation of new technologies brings challenges in terms of supply and demand. We have experienced this situation on Hopper, and Blackwell's capacity improvement will also face similar supply and demand constraints... from the end of this year to next year," Buck said.
Buck also said that data center operators are phasing out CPU infrastructure to make room for more GPUs. Hopper GPUs may be retained, while older GPUs based on the Ampere and Volta architectures will be resold.
Nvidia will keep multiple tiers of GPUs in its lineup, and as Blackwell ramps, Hopper will become its mainstream AI GPU. Nvidia has made numerous hardware and software improvements to boost Hopper's performance.
In the future, all cloud providers will offer Blackwell GPUs and servers.
Mixture-of-experts models
Buck said that the GPT-4 model has approximately 1.8 trillion parameters, and because AI scaling has not yet hit its limit, parameter counts will keep growing.
"The size of the human brain is roughly equivalent to 100 billion to 15 trillion parameters, depending on the individual and the neurons and connections in the brain. Currently, the parameter size of artificial intelligence is about 2 trillion... we have not yet conducted inference," Buck said.
In the future there will be giant models with trillions of parameters, with smaller, more specialized models built on top of them. The more parameters, the better for Nvidia, because larger models help sell more GPUs.
Nvidia is tuning its GPU architecture for the shift from the original dense base-model approach to mixture-of-experts models, in which multiple neural networks cross-reference one another to validate answers.
Buck said, "The 1.8 trillion parameter GPT model has 16 different neural networks that attempt to answer some of the questions in their respective layers, then discuss, meet, and decide what the correct answer is."
The upcoming GB200 NVL72 rack-scale server pairs 72 Blackwell GPUs with 36 Grace CPUs and is designed specifically for mixture-of-experts models, with the GPUs and CPUs interconnected to support them.
"These guys can all communicate with each other without being blocked on I/O. This evolution constantly occurs in the model architecture," Buck said.
Locking in customers
Nvidia CEO Jensen Huang made a forceful pitch at HPE's Discover conference this month, calling on attendees to buy more of the company's hardware and software.
Nvidia and HPE have announced a series of new products with a simple and clear name, "Nvidia AI Computing by HPE".
"We have designed small, medium, large, and extra large sizes for you to choose from. And as you know, the more you buy, the more you save," Huang said on the Discovery stage.
Earlier this year, Huang made another controversial claim: that future programmers will not need to learn to write code. Yet loading AI models onto Nvidia GPUs still requires command-line and scripting knowledge to create and run AI environments.
Nvidia's proprietary approach and outright dominance of the AI market have made it a target of antitrust investigations.
Buck had to tread carefully when playing down concerns about CUDA lock-in, saying that "moat is a complex word."
Both Nvidia executives have said that CUDA is must-have software for the company's GPUs: to get maximum performance out of a GPU, you need CUDA. Open-source software can be used with Nvidia GPUs, but it cannot match the capabilities of the CUDA libraries and runtime.
Backward compatibility and continuity are unique Nvidia advantages: Nvidia's support for existing AI models and software carries over to next-generation GPUs. That is not the case for Intel's Gaudi and other ASICs, which must be re-tuned for each new model.
pg" alt="" width="600" height="289" />