Who are Nvidia's competitors?
The obvious first answers are AMD and Intel. The former already sells AI accelerator cards and combines CPU and GPU design capabilities; the latter, the originator of the x86 architecture, has also entered the AI accelerator field. Their products not only benchmark against Nvidia's on specifications, but also mount round after round of attacks on positioning and pricing.
Nvidia's own customers have joined the fray as well: working with Broadcom and Marvell, they keep rolling out self-developed custom chips to replace general-purpose AI accelerators, adding yet another source of pressure on Nvidia.
And in networking, Nvidia faces challengers of its own.
NVIDIA's Exclusive AI Network
Since the start of the 21st century, as cloud computing and big data took off, data centers have developed rapidly, and InfiniBand has played a significant part in that growth. Since 2023 in particular, large AI models exemplified by ChatGPT have relied on InfiniBand, drawing even more attention to this network technology.
As is well known, modern digital computers have followed the von Neumann architecture since their inception: a CPU (arithmetic logic unit and control unit), memory (RAM, disk), and I/O (input/output) devices. In the early 1990s, to support a growing number of peripherals, Intel was the first to bring the Peripheral Component Interconnect (PCI) bus into the standard PC architecture.
The Internet then entered a period of rapid growth, and ever-expanding online businesses and user bases put enormous strain on IT system capacity. Driven by Moore's Law, CPUs, memory, and hard drives advanced quickly, while the PCI bus was updated comparatively slowly, severely limiting I/O performance and becoming the bottleneck of the whole system.
To address this, Intel, Microsoft, and Sun led development of the Next Generation I/O (NGIO) standard, while IBM, Compaq, and HP, who had jointly developed the PCI-X standard in 1998, led the Future I/O (FIO) effort.
In 1999, the FIO Developers Forum and the NGIO Forum merged to form the InfiniBand Trade Association (IBTA), and in 2000 version 1.0 of the InfiniBand architecture specification was officially released. InfiniBand was conceived to replace the PCI bus: it introduced the RDMA protocol, which lets one machine read and write another's memory directly through the network adapter, bypassing the remote CPU and the kernel network stack, and thereby delivers lower latency, higher bandwidth, and higher reliability, which together mean far stronger I/O performance.
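As a minimal sketch of how software reaches this stack, the snippet below simply enumerates RDMA-capable adapters through libibverbs, the standard verbs API shipped with rdma-core (assuming the host has rdma-core installed); a real RDMA transfer would go on to register memory regions and create queue pairs:

```c
/* Minimal libibverbs sketch: list RDMA-capable devices on this host.
 * Build with: gcc list_devs.c -libverbs (requires rdma-core). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    printf("found %d RDMA device(s)\n", n);
    for (int i = 0; i < n; i++)
        printf("  %s\n", ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}
```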
In May 1999, a group of employees who had left Intel and Galileo Technologies founded a chip company called Mellanox in Israel. Mellanox joined the NGIO camp, which later merged with FIO, bringing Mellanox into the InfiniBand fold; it launched its first InfiniBand product in 2001.
As Intel turned to PCI Express (PCIe) and Microsoft withdrew from InfiniBand, the technology shifted toward the field of computer cluster interconnection, and the newly founded Mellanox took the stage, gradually becoming a backbone of InfiniBand's development.
Although Intel and Microsoft had both walked away from InfiniBand, it found room to grow in new areas. From 2012 onward, as demand for high-performance computing (HPC) kept rising, InfiniBand made significant progress and its market share climbed steadily. In 2015, InfiniBand passed 50% of the TOP500 list for the first time, reaching 51.4% (257 systems). This marked InfiniBand's first successful challenge to Ethernet, making it the preferred internal interconnect technology for supercomputers.
Mellanox kept growing too: in 2010 it merged with Voltaire, leaving Mellanox and QLogic as the main InfiniBand suppliers; in 2013 it pushed deeper into networking by acquiring the silicon photonics company Kotura and the parallel optical interconnect chip maker IPtronics, further consolidating its position in the industry; and by 2015 Mellanox held 80% of the global InfiniBand market. Its business had expanded from chips to network cards, switches/gateways, long-haul systems, cables, and modules, making it a world-class networking supplier.
As AI has kept developing, InfiniBand's value has become ever more apparent, and Mellanox's near-monopoly on the technology made it a coveted target in manufacturers' eyes.
Why is InfiniBand so important to AI? An AI supercomputer can be thought of as a cluster of many graphics processing units (GPUs) performing vast amounts of complex computation, plus some central processing units (CPUs) that direct the machine's operation, along with DRAM and NAND chips. The cost breaks down to roughly 50-60% for GPUs, 10-15% for CPUs and DRAM, and 5-10% for NAND.
But all of the chips above need to be connected to one another, over either InfiniBand or Ethernet cabling, in other words the "network". Networking accounts for 10-15% of hardware cost, and its job is to provide the highest possible bandwidth for fast data transfer. If that bandwidth cannot be delivered, then however much is spent on GPUs ultimately goes to waste.
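Taking the midpoints of the ranges above, a purely illustrative budget split for a hypothetical $100 million cluster looks like this (the dollar figure and the midpoint percentages are assumptions made only for the sake of the arithmetic):

```c
#include <stdio.h>

int main(void) {
    double budget = 100.0; /* hypothetical cluster budget, in $M */

    /* Midpoints of the cost ranges quoted above (illustrative only). */
    double gpu = 0.55, cpu_dram = 0.125, nand = 0.075, net = 0.125;

    printf("GPUs       : $%5.1fM\n", budget * gpu);
    printf("CPUs+DRAM  : $%5.1fM\n", budget * cpu_dram);
    printf("NAND       : $%5.1fM\n", budget * nand);
    printf("Networking : $%5.1fM\n", budget * net);
    printf("Other      : $%5.1fM\n",
           budget * (1 - gpu - cpu_dram - nand - net));
    return 0;
}
```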
As one of the earliest explorers of the AI field, Nvidia saw this clearly and decided to shift its focus from gaming to AI. In 2019 it acquired Mellanox for $6.9 billion, beating rival bids of $6 billion from Intel and $5.5 billion from Microsoft. This blockbuster acquisition paved Nvidia's way into the network technology market.
At the time, Nvidia CEO Jensen Huang explained the acquisition this way: "This is a union of two of the world's leading high-performance computing companies. We focus on accelerated computing, while Mellanox focuses on interconnect and storage."
Bundling GPUs with networking sounds a bit like a tied sale, but to many people's surprise, the model Huang created succeeded quickly. As of January this year, Nvidia's annual revenue had more than doubled to $60.9 billion, with sales of its compute and networking segment up 215%, accounting for 78% of Nvidia's business. Nvidia's GPUs attract most of the attention, but its networking business has been just as crucial to its success. On the company's latest earnings call, Huang said InfiniBand revenue had grown fivefold year-on-year, roughly twice the growth rate of the compute and networking segment as a whole.
By combining its GPU computing power with Mellanox's networking technology, Nvidia has built a powerful "computing engine", and in computing infrastructure it unquestionably holds the lead.
A Growing Threat to NVIDIA
Until now, the industry has deployed AI and machine learning on Nvidia's InfiniBand networking for a simple reason: it is currently the most mature network technology that supports large-scale deployment. But InfiniBand is not perfect. For one thing, the acquisition made it an Nvidia-exclusive product; for another, it is expensive, beyond what ordinary enterprises can comfortably afford.
Jensen Huang once quipped that InfiniBand accounts for only 20% of cluster cost while improving AI training performance by 20%, so it pays for itself and is effectively free. But that framing is clearly one-sided: customers must first commit the extra 20% of cluster cost before they can extract that performance, which amounts to buying 120% of the performance at 120% of the cost, leaving performance per dollar unchanged.
By contrast, Ethernet-based clusters typically add only about 10% or less to the cost. Although Ethernet often cannot match InfiniBand's performance, its lower price has won over a share of users. In practice, today's high-performance networking contest is InfiniBand versus high-speed Ethernet: manufacturers with ample resources tend to choose InfiniBand, while those who prize cost-effectiveness lean toward high-speed Ethernet.
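A quick sketch of the arithmetic behind that trade-off, using the figures above; the Ethernet performance baseline of 1.0x is an assumption for illustration, since actual performance depends heavily on the workload:

```c
#include <stdio.h>

int main(void) {
    /* Figures quoted above; the Ethernet performance number (1.0x
     * baseline) is an assumption, not a measurement. */
    double ib_cost  = 1.20, ib_perf  = 1.20; /* +20% cost, +20% training perf */
    double eth_cost = 1.10, eth_perf = 1.00; /* +10% cost, baseline perf      */

    /* InfiniBand's ratio is exactly 1.0: you pay 20% more and get 20%
     * more, so the spend is proportionate rather than "free". */
    printf("InfiniBand perf per dollar: %.3f\n", ib_perf / ib_cost);
    printf("Ethernet   perf per dollar: %.3f\n", eth_perf / eth_cost);
    return 0;
}
```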
But this situation is not set in stone. Even large, well-funded enterprises are looking for cheaper, better-suited network solutions, and Nvidia and InfiniBand are being challenged constantly.
In July 2023, the Linux Foundation announced it would host a newly established Ultra Ethernet Consortium, whose founding members include AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft. The consortium has stated its commitment to improving Ethernet to meet the low-latency and scalability requirements of high-performance computing and AI systems.
The consortium's first task is to define and develop what it calls the Ultra Ethernet Transport (UET) protocol, a new Ethernet transport-layer protocol better matched to the needs of AI and HPC workloads.
At a high level, the Ultra Ethernet Consortium wants to improve Ethernet surgically, changing only the parts that must change to reach its goals. From the outset, the consortium has focused on improving Ethernet's software and physical layers without altering its basic structure, so as to preserve cost-effectiveness and interoperability.
Its technical goals include developing specifications, application programming interfaces, and source code that define the protocols, interfaces, and data structures for Ultra Ethernet communication. The consortium also plans to update existing link and transport protocols and to create new telemetry, signaling, security, and congestion mechanisms better suited to large AI and HPC clusters. And because AI and HPC workloads differ in many ways, UET will provide separate profiles so that each can be deployed appropriately.
Thanks to the consortium, several issues that used to plague AI workloads on Ethernet are being resolved, which is also driving wider Ethernet adoption in traditional HPC workloads and giving Ethernet networking companies an opening to counterattack InfiniBand.
Arista Networks, an Ethernet networking company and Ultra Ethernet Consortium member, explained the difference between InfiniBand and Ethernet on an earnings call this February: "As you know, historically, when you look at InfiniBand and Ethernet in isolation, each has its own advantages. Traditionally, InfiniBand is considered lossless while Ethernet is considered lossy. However, when you actually look at a complete GPU cluster together with the optics and check the consistency of job completion times across all packet sizes, the data, including third-party data from Broadcom, shows that in real-world environments Ethernet job completion times are roughly 10% faster. So you can look at these technologies in isolation or in actual clusters; in actual clusters, we have seen Ethernet improve. And remember, this is only Ethernet as we know it today. Once we have the Ultra Ethernet Consortium and improvements such as packet spraying, dynamic load balancing, and congestion control, I believe these numbers will get even better."
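Of the improvements Arista names, packet spraying is the easiest to picture: instead of hashing an entire flow onto one path, as classic ECMP does, successive packets of the same flow are sprayed across all available paths. The toy sketch below illustrates the idea only; it is not UEC's actual specification, and real implementations must also handle the packet reordering that spraying introduces:

```c
#include <stdio.h>

#define NUM_PATHS 4

/* Classic ECMP: hash the flow ID once, so every packet of the flow
 * follows the same path; an unlucky hash can overload one link. */
static int ecmp_path(int flow_id) {
    return flow_id % NUM_PATHS;
}

/* Packet spraying: spread successive packets of the same flow across
 * all paths, evening out the load on every link. */
static int sprayed_path(int packet_seq) {
    return packet_seq % NUM_PATHS;
}

int main(void) {
    int flow_id = 7;
    printf("ECMP : all packets of flow %d -> path %d\n",
           flow_id, ecmp_path(flow_id));
    for (int seq = 0; seq < 8; seq++)
        printf("spray: packet %d of flow %d -> path %d\n",
               seq, flow_id, sprayed_path(seq));
    return 0;
}
```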
"Arista claims that its Ethernet is about 10% faster than InfiniBand in task completion speed, which surprised us, especially considering InfiniBand's deep penetration in the current GPU cluster," Jefferies analyst George Notter said after the meeting.
He pointed out that Nvidia's bundling of GPUs with InfiniBand was a key reason for the technology's success; in other words, part of InfiniBand's popularity came from being sold together with Nvidia's GPUs. Now that the backlog of GPU orders has shrunk, the number of systems using InfiniBand may shrink as well, which is good news for Arista and for another Ethernet networking company, Broadcom.
"We have made progress in four major AI Ethernet clusters, all of which are examples of our victory over InfiniBand. In all four cases, we are now transitioning from experimentation to piloting, connecting thousands of GPUs annually," explained Arista CEO Jayshree Ullal.
Arista's most recent quarter was also strong: as of March, revenue was up 16% year-on-year and earnings per share up 44%. Analysts expect this growth to accelerate as AI infrastructure spending rises. About 40% of Arista's business comes from Microsoft and Meta, both of which have announced further capital expenditure increases for next year. Jefferies analyst George Notter recently upgraded Arista from hold to buy, saying, "The wave of deploying GPU-based infrastructure, including Ethernet, will persist."
Arista is not the only networking company benefiting from the deployment boom. For the three months ended February 4th, Broadcom's revenue rose 34% year-on-year to $12 billion, with networking revenue up 46% to $3.3 billion. "This is driven mainly by strong demand for AI accelerators from our two hyperscale customers," Broadcom CEO Hock Tan explained on the earnings call.
Demand for network hardware is growing faster than Broadcom expected, driven by hyperscalers and large enterprises deploying AI data centers. Broadcom has therefore raised the annual growth forecast for its networking business from 30% to 35%. Overall, Broadcom expects about $50 billion of revenue this year, up 40% from last year.
The publication The Next Platform posed an interesting piece of arithmetic: for every $750 million Arista Networks earns in AI cluster interconnect sales, Nvidia may lose $1.5 billion to $2.25 billion, the implication being that an equivalent InfiniBand network costs roughly two to three times as much. Over the past 12 months, Nvidia's InfiniBand networking sales are roughly estimated at $6.47 billion, against $39.78 billion of data center GPU compute sales. On a four-to-one split and with stable market conditions, Nvidia would retain about $1.3 billion while Ultra Ethernet Consortium members would capture $1.7 billion to $2.6 billion; had nothing changed, InfiniBand sales would have been on course for $12 billion.
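A quick check of the implied ratios, using only the figures quoted above:

```c
#include <stdio.h>

int main(void) {
    double arista_win    = 0.75; /* $0.75B Arista AI interconnect sales    */
    double nvidia_loss_lo = 1.50; /* $1.5B..$2.25B Nvidia revenue displaced */
    double nvidia_loss_hi = 2.25;

    /* Each Ethernet dollar displaces 2x-3x InfiniBand dollars, i.e. an
     * equivalent InfiniBand network is estimated to cost 2-3x as much. */
    printf("displacement ratio: %.1fx to %.1fx\n",
           nvidia_loss_lo / arista_win, nvidia_loss_hi / arista_win);

    /* Nvidia's trailing-12-month mix from the same paragraph. */
    printf("GPU : InfiniBand revenue = %.1f : 1\n", 39.78 / 6.47);
    return 0;
}
```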
The publication noted that Ultra Ethernet Consortium members can seize a large share of the market, but they will do so by taking revenue out of the system, much as Linux did to Unix, rather than by shifting revenue from one technology to another, and the money saved will be reinvested in GPUs.
Challenging NVIDIA
Nvidia's challenges are not confined to networking. As mentioned earlier, its biggest pillar, the GPU, is under siege from the likes of AMD, Intel, and Broadcom. Even with a market value of $3 trillion, the pressure is immense.
In the networking market, Arista is admittedly still a small player; against Nvidia's billions of dollars of InfiniBand revenue, it will be hard-pressed to mount a challenge in the short term. But the giants' dissatisfaction with a monopoly over AI cluster networking has handed Arista a precious window for rapid growth. Given time, it may well become Nvidia's next major worry.