As the scale of supercomputers continues to grow, Cerebras, headquartered in Sunnyvale, California, has taken a different approach. Rather than connecting more and more GPUs together, the company has squeezed as many processors as possible onto one giant wafer. The main advantage lies in the interconnect: by keeping the processors together on a single chip, a wafer-scale chip avoids much of the computational speed lost to communication between many GPUs, as well as the losses from loading data out of memory.
Now, Cerebras has demonstrated the advantages of its wafer-scale chips in two separate but related results. First, the company showed that its second-generation Wafer Scale Engine, the WSE-2, is significantly faster than the world's fastest supercomputer, Frontier, at molecular dynamics calculations, the foundation of protein folding, modeling radiation damage in nuclear reactors, and other problems in materials science. Second, working with the machine-learning model optimization company Neural Magic, Cerebras demonstrated that a sparse large language model can run inference at one-third the energy cost of the full model without losing any accuracy.
Although the results address vastly different fields, both are made possible by the interconnect and fast memory access enabled by Cerebras hardware.
Rapidly Traversing the Molecular World
"Imagine a tailor who can make a suit within a week," said Andrew Feldman, CEO and co-founder of Cerebras. "He bought the tailor next door, and she could also make a suit within a week, but they couldn't work together. Now, they can make two suits a week. But they can't make a suit within three and a half days."
Feldman argues that GPUs are like tailors who cannot work together, at least for some problems in molecular dynamics. Connecting more and more GPUs lets them simulate more atoms at the same time, but it does not let them simulate the same number of atoms any faster.
Cerebras' wafer-scale engine, however, scales in a completely different way. Because its cores are not limited by interconnect bandwidth, they can communicate quickly, like two tailors collaborating perfectly to finish one suit in three and a half days.
To demonstrate this advantage, the team simulated 800,000 interacting atoms, advancing the calculation one femtosecond at a time. Each step took just microseconds to compute on their hardware. Although that is still nine orders of magnitude slower than the actual interactions, it was also 179 times as fast as the Frontier supercomputer. The achievement effectively reduced a year's worth of computation to just two days.
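For a rough sense of those figures, the arithmetic can be sketched in a few lines of Python (the one-microsecond-per-step value is an assumed round number for illustration):

    import math

    # Rough check of the figures quoted above. Assumed values: a 1-femtosecond
    # simulated timestep computed in roughly 1 microsecond of wall-clock time.
    sim_step_s = 1e-15    # simulated time advanced per step (1 fs)
    wall_step_s = 1e-6    # approximate wall-clock time to compute one step (~1 us)

    slowdown = wall_step_s / sim_step_s
    print(f"orders of magnitude slower than real time: {math.log10(slowdown):.0f}")  # -> 9

    # A 179x speedup over the reference machine turns about a year of runtime
    # into about two days.
    print(f"{365 / 179:.1f} days")  # -> 2.0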
This work was done in collaboration with Sandia National Laboratories, Lawrence Livermore National Laboratory, and Los Alamos National Laboratory. Thomas Oppelstrup, a researcher at Lawrence Livermore National Laboratory, said the advance makes it practical to simulate molecular interactions that were previously out of reach.
Oppelstrup says this will be particularly useful for understanding the long-term stability of materials under extreme conditions. "When you build advanced machines that operate at high temperatures, such as jet engines, nuclear reactors, or fusion reactors for power generation," he said, "you need materials that can withstand these high temperatures and very harsh environments. It is very difficult to manufacture materials that have the right properties, a long service life, and high strength without cracking." Being able to simulate the behavior of candidate materials over longer timescales, Oppelstrup said, will be crucial to the materials design and development process.
Ilya Sharapov, chief engineer at Cerebras, said the company looks forward to extending its wafer-scale engine to a wider class of problems, including molecular dynamics simulations of biological processes and simulations of airflow around cars or airplanes.
Shrinking Large Language Models
As large language models (LLMs) become increasingly popular, the energy cost of using them is starting to overtake the cost of training them, by some estimates as much as tenfold. "Inference is the main AI workload today, because everyone is using ChatGPT," said James Wang, director of product marketing at Cerebras. "And it is very costly to run, especially at scale."
One way to reduce the energy cost of inference (and speed it up) is sparsity, essentially exploiting the power of zero. An LLM is made up of a huge number of parameters; the open-source Llama model Cerebras used, for example, has 7 billion. During inference, every one of those parameters is used to process the input and produce the output. If a significant fraction of them are zero, however, they can be skipped during the computation, saving both time and energy.
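As a toy illustration of the idea (not Cerebras' actual kernels), a matrix-vector product that skips zero-valued weights might look like this minimal NumPy sketch:

    import numpy as np

    # Toy example: zero out ~70% of a weight matrix, then compute a
    # matrix-vector product that only touches the non-zero weights.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 512))
    W[rng.random(W.shape) < 0.7] = 0.0      # unstructured sparsity: zeros anywhere
    x = rng.standard_normal(512)

    def sparse_matvec(W, x):
        out = np.zeros(W.shape[0])
        for i, row in enumerate(W):
            nz = np.nonzero(row)[0]         # skip the zero-valued parameters
            out[i] = row[nz] @ x[nz]
        return out

    assert np.allclose(sparse_matvec(W, x), W @ x)  # same result, ~70% fewer multiply-adds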
The problem is that it is hard to skip individual parameters on a GPU. Reading from a GPU's memory is relatively slow because GPUs are designed to read memory in blocks, fetching a group of parameters at a time. That prevents a GPU from skipping zeros scattered randomly throughout the parameter set. Feldman, the Cerebras CEO, offered another analogy: "It's like a shipper who only wants to move things on pallets, because they don't want to inspect every box. Memory bandwidth is the ability to inspect every box to make sure it isn't empty. If it's empty, set it aside and don't move it."
Some GPUs support a specific pattern of sparsity called 2:4, in which exactly two of every four consecutively stored parameters are zero. State-of-the-art GPUs have memory bandwidth measured in terabytes per second. The memory bandwidth of Cerebras' WSE-2 is more than a thousand times larger, at 20 petabytes per second. That makes it possible to exploit unstructured sparsity: researchers can zero out parameters wherever they occur in the model, and the hardware can check each parameter on the fly during the computation. "Our hardware has supported unstructured sparsity from day one," Wang said.
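To make the distinction concrete, here is a small, hypothetical NumPy sketch contrasting the two pruning patterns (zeroing the smallest-magnitude weights is just one common choice, not necessarily what either company does):

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.standard_normal(16)

    # 2:4 structured sparsity: in every group of 4 consecutive weights,
    # zero the 2 with the smallest magnitude.
    w_24 = w.copy().reshape(-1, 4)
    for group in w_24:
        group[np.argsort(np.abs(group))[:2]] = 0.0
    w_24 = w_24.reshape(-1)

    # Unstructured sparsity: zero the smallest 50% of weights overall,
    # wherever they happen to sit.
    w_unstructured = w.copy()
    w_unstructured[np.argsort(np.abs(w))[: len(w) // 2]] = 0.0

    print(w_24.reshape(-1, 4))            # every row has exactly two zeros
    print(w_unstructured.reshape(-1, 4))  # zeros fall irregularly across rows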
Even with the right hardware, zeroing out many of a model's parameters normally produces a worse model. But the joint Neural Magic and Cerebras team found a way to recover the original model's full accuracy. After setting 70 percent of the parameters to zero, the team performed two further stages of training, giving the remaining non-zero parameters a chance to compensate for the new zeros.
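A highly simplified sketch of that prune-then-retrain recipe, written against a Hugging Face-style PyTorch model, might look like the following (the function names and the magnitude-pruning criterion are illustrative assumptions, not Neural Magic's actual pipeline):

    import torch

    def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.7) -> dict:
        """Zero the smallest-magnitude weights of each weight matrix; return the masks."""
        masks = {}
        for name, p in model.named_parameters():
            if p.dim() < 2:                      # skip biases, norm parameters, etc.
                continue
            k = int(sparsity * p.numel())
            threshold = p.abs().flatten().kthvalue(k).values
            mask = (p.abs() > threshold).float()
            p.data.mul_(mask)                    # ~70% of the weights become zero
            masks[name] = mask
        return masks

    def recovery_step(model, masks, batch, optimizer):
        """One step of recovery training: surviving weights adapt, pruned ones stay zero."""
        loss = model(**batch).loss               # assumes a model whose forward returns .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.data.mul_(masks[name])     # re-apply the mask after each update
        return loss.item()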
This extra training consumed roughly 7 percent of the energy of the original training run, and the companies found that it restored the full model's accuracy. During inference, the smaller model then takes one-third the time and energy of the original full model. "These novel applications are enabled on our hardware because we have a million very compact cores, which means very low latency and very high bandwidth between them," Sharapov said.