At the recent GTC, NVIDIA announced its latest Ampere architecture and the A100 GPU built on it. The A100 is manufactured on TSMC's 7 nm process and contains 54.2 billion transistors. According to NVIDIA, it can deliver 7 times higher performance than the previous-generation V100. Beyond the increase in computing power, NVIDIA has also added the Multi-Instance GPU (MIG) feature, which allows one physical GPU to be virtualized into seven independent GPU instances.
At the same time, NVIDIA announced the DGX A100 supercomputer, which contains eight A100 GPUs and has a peak computing power of up to 10 petaOPS.
At the conference, NVIDIA heavily promoted computing power. In our opinion, however, NVIDIA's expansion of features beyond computing power will become the more important barrier, and China's semiconductor industry also needs to consider these features if it wants to develop its own GPUs.
Computing architecture: improved and updated, with progress in line with expectations
Compared with the previous generation V100 GPU, NVIDIA A100 GPU's computing power improvement mainly comes from the following aspects:
Added support for sparse operations. This is probably the biggest innovation in the A100's computing architecture. Specifically, the A100 supports 2:4 structured sparsity: when sparse computing is used, at least two of every four consecutive elements in the matrix must be zero. With sparse operations, performance can be doubled.
The concept of sparse operations in deep learning was proposed almost five years ago. NVIDIA has now finally brought it into a product, in the form of 2:4 structured sparsity. Its twofold acceleration is relatively conservative (by comparison, Cambricon's AI accelerator IP supported fourfold sparse acceleration in 2018).
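To make the 2:4 pattern concrete, the following Python sketch keeps the two largest-magnitude weights in every group of four and stores only those values together with their positions, so that a dot product touches half as many weights. It is an illustration only: the function names are hypothetical, choosing weights by magnitude is our assumption, and the code does not reflect how the A100 hardware actually stores or processes sparse matrices.

```python
def compress_2_of_4(values):
    """Keep the two largest-magnitude entries of each group of four.

    Returns (kept_values, kept_positions); each position is the index (0-3)
    of a kept value inside its group. Hypothetical helper, not an NVIDIA API.
    """
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    kept_values, kept_positions = [], []
    for start in range(0, len(values), 4):
        group = values[start:start + 4]
        # Rank the four indices by magnitude, keep the top two in index order.
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])
        kept_values.extend(group[i] for i in keep)
        kept_positions.extend(keep)
    return kept_values, kept_positions


def sparse_dot(kept_values, kept_positions, dense):
    """Dot product that reads only the kept half of the weights."""
    total = 0.0
    for n, (weight, pos) in enumerate(zip(kept_values, kept_positions)):
        group_start = (n // 2) * 4  # two kept values per group of four
        total += weight * dense[group_start + pos]
    return total


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.0, -0.3, 0.01]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
values, positions = compress_2_of_4(weights)
print(values, positions, sparse_dot(values, positions, activations))
```

Because only two multiplications are performed per group of four, the ideal speedup of this scheme is the factor of two mentioned in the text.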
Introduction of the TF32 number format. This mainly targets training. Looking back at the history of AI training, the earliest commonly used format was 32-bit floating point (FP32). To speed up training, NVIDIA began supporting the 16-bit FP16 format several years ago. FP16 is faster, but its limited dynamic range causes problems in some applications.
In the A100, NVIDIA introduced the TF32 number format to address the problems of FP16. Despite its name, TF32 is not a 32-bit format but a 19-bit one: one sign bit, an 8-bit exponent and a 10-bit mantissa. Its dynamic range (exponent) is the same as FP32's, while its precision (mantissa) is the same as FP16's at 10 bits, making it effectively a hybrid of FP32 and FP16. Compared with FP32, TF32 can achieve an 8-fold throughput improvement.
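The TF32 layout can be emulated in a few lines of Python by keeping the FP32 sign and exponent and discarding the low mantissa bits. This sketch assumes simple truncation of the 13 lowest mantissa bits; the rounding actually used by the hardware may differ, and the name to_tf32 is ours.

```python
import struct

FP32_MANTISSA_BITS = 23
TF32_MANTISSA_BITS = 10  # same mantissa width as FP16
TF32_TOTAL_BITS = 1 + 8 + TF32_MANTISSA_BITS  # sign + exponent + mantissa = 19


def to_tf32(x):
    """Truncate an FP32 value to a 10-bit mantissa, keeping its 8-bit exponent.

    Assumption: truncation (round toward zero) stands in for whatever
    rounding real hardware performs.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    dropped = FP32_MANTISSA_BITS - TF32_MANTISSA_BITS  # 13 low mantissa bits
    bits &= ~((1 << dropped) - 1) & 0xFFFFFFFF
    (result,) = struct.unpack(">f", struct.pack(">I", bits))
    return result


print(TF32_TOTAL_BITS)
print(to_tf32(1.0 + 2 ** -10))  # smallest step above 1.0 that survives
print(to_tf32(1.0 + 2 ** -11))  # finer detail is lost
print(to_tf32(1e38))            # far beyond FP16's range, still finite here
```

The last line illustrates the point made in the text: the value keeps the FP32 exponent range, so magnitudes that would overflow FP16 remain representable, while precision drops to FP16 levels.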
More, and more powerful, streaming multiprocessors (SMs). In the A100, the tensor matrix computing power of each SM is twice that of the V100, while the number of SMs in the GPU has increased by about 30%.
Larger on-chip storage and faster memory interfaces. In the A100, the L1 cache capacity of each SM is increased from the V100's 128 KB to 192 KB, while the L2 cache capacity is increased to 40 MB, 6.7 times that of the previous generation. On the memory interface, the total HBM2 bandwidth of the A100 is 1555 GB/s, 1.7 times that of the previous generation.
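As a rough back-of-the-envelope check, the figures quoted above can be combined as follows. The sketch assumes ideal scaling, meaning that the speedup factors simply multiply and that clock speed and memory effects are ignored, which real workloads will not achieve. The constant and function names are hypothetical.

```python
PER_SM_TENSOR_SPEEDUP = 2.0  # from the text: each SM has twice the tensor throughput
SM_COUNT_GROWTH = 1.3        # from the text: about 30% more SMs
SPARSITY_SPEEDUP = 2.0       # from the text: 2:4 sparsity can double performance


def ideal_speedup(use_sparsity=False):
    """Upper-bound tensor speedup over V100 if every factor scaled perfectly."""
    speedup = PER_SM_TENSOR_SPEEDUP * SM_COUNT_GROWTH
    if use_sparsity:
        speedup *= SPARSITY_SPEEDUP
    return speedup


def previous_generation(a100_value, ratio):
    """Back out the V100 figure implied by an A100 figure and its ratio."""
    return a100_value / ratio


print(f"dense tensor speedup:  {ideal_speedup():.1f}x")
print(f"sparse tensor speedup: {ideal_speedup(use_sparsity=True):.1f}x")
print(f"implied V100 L2 cache: {previous_generation(40, 6.7):.1f} MB")
print(f"implied V100 HBM2 bandwidth: {previous_generation(1555, 1.7):.0f} GB/s")
```

Under these assumptions the dense tensor gain is a little under threefold, which fits the reading that the architectural gains are incremental.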
Overall, in terms of computing architecture, apart from sparse-computing support and the introduction of TF32, the improvements are predictable, conventional ones, and neither sparse computing nor TF32 is a new concept in AI computing. We therefore believe that the performance gain of this generation, the NVIDIA A100, is incremental rather than revolutionary.
GPU virtual instance and interconnection: further increasing competition barriers
We believe that, beyond computing power, the A100's more important competitive barrier comes from its support for GPU virtual instances and its interconnect scheme for the data center.
An important new feature of the Ampere architecture is the GPU virtual instance, MIG. As GPUs account for a growing share of cloud data center deployments, GPU virtualization has become an important problem; if it is not solved well, overall GPU utilization suffers.
In today's cloud services, the CPU and memory instances that users request are mostly virtualized. When you request n CPU cores, you do not reserve an entire CPU chip; different cores on the same chip are likely to be allocated to different users, and users do not need to care which chip their cores are on. They simply use them.
Roughly speaking, this is CPU virtualization. GPUs previously offered a form of virtualization too, in that different programs could use the same GPU at the same time, but the memory access model was not as complete as in CPU virtualization. As a result, in multi-user settings a GPU is usually not shared among several users at once; instead, a whole GPU is allocated to a single user.
This causes efficiency problems. For example, suppose user A needs only half of the computing resources of one GPU, while user B needs the computing resources of 1.5 GPUs. Under the traditional coarse-grained scheme, user A and user B each occupy one whole GPU, so user A wastes GPU resources while user B's demand is not fully met.
As GPUs are applied in more and more scenarios, the algorithms in each scenario differ in how heavily they use the GPU and in what they require from it. Under the previous coarse-grained scheme, this leads to poor GPU utilization across the data center.
MIG was created to solve this problem. MIG on the A100 can divide one GPU into seven independent instances whose memory accesses do not interfere with one another. This enables fine-grained allocation of GPU computing resources and therefore improves resource utilization in cloud scenarios with highly heterogeneous computing demands.
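The utilization argument can be illustrated with a small allocation model. The Python sketch below compares a coarse-grained policy, where every request receives whole GPUs, with a fine-grained policy that rounds each request up to sevenths of a GPU and packs the slices with a first-fit-decreasing heuristic. This is a simplified model under our own assumptions: the request sizes are made up, an instance is not allowed to span two physical GPUs, and the restrictions that real MIG places on instance sizes and placement are ignored.

```python
import math

SLICES_PER_GPU = 7  # from the text: one GPU is divided into seven instances


def gpus_needed_whole(requests):
    """Coarse-grained policy: every request is rounded up to whole GPUs."""
    return sum(math.ceil(r) for r in requests)


def gpus_needed_sliced(requests):
    """Fine-grained policy: round each request up to 1/7 slices, then pack
    the slices onto GPUs using first-fit decreasing."""
    if any(r <= 0 for r in requests):
        raise ValueError("requests must be positive")
    slice_counts = sorted(
        (math.ceil(r * SLICES_PER_GPU) for r in requests), reverse=True
    )
    free = []  # free slices left on each GPU that is already in use
    for need in slice_counts:
        if need > SLICES_PER_GPU:
            raise ValueError("this sketch does not split a request across GPUs")
        for i, slots in enumerate(free):
            if slots >= need:
                free[i] -= need
                break
        else:
            free.append(SLICES_PER_GPU - need)  # open a new GPU
    return len(free)


def utilization(requests, gpus):
    """Fraction of the allocated GPU capacity that was actually requested."""
    return sum(requests) / gpus


# Hypothetical demands, expressed as fractions of one GPU.
demands = [0.5, 0.25, 0.1, 0.3]
for name, policy in (("whole", gpus_needed_whole), ("sliced", gpus_needed_sliced)):
    gpus = policy(demands)
    print(f"{name}: {gpus} GPUs, utilization {utilization(demands, gpus):.1%}")
```

With these made-up demands the sliced policy serves the same users on half as many GPUs, which is the kind of gain that fine-grained allocation is meant to deliver.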
Admittedly, a partition into seven virtual instances is not a very fine granularity, but MIG can be regarded as an important milestone on the road to virtualization. In addition to MIG, the A100 also improves multi-chip interconnection.
First, the A100 includes third-generation NVLink, which is mainly used for communication between GPUs on the same host; its bandwidth has doubled compared with the V100, to 600 GB/s. For GPU-to-CPU communication, the A100 supports PCIe Gen4, which also doubles the bandwidth of the previous-generation PCIe Gen3. In addition, the A100's interconnect is deeply integrated with Mellanox's solutions and supports RDMA over both Ethernet and InfiniBand well.
The entry threshold for cloud AI chips has risen sharply
We believe that the release of the NVIDIA A100 has once again widened the gap with other chip competitors in cloud AI. In terms of computing power, the A100's performance on the BERT benchmark is 11 times that of the T4. By contrast, Habana, the most successful startup in this field (since acquired by Intel at a high price), launched its new Goya chip last year with only about twice the T4's performance on the same benchmark. The A100 has thus retaken the high ground in computing power. We believe that NVIDIA's main advantage in raising computing power lies in its strong system engineering capability.
As we analyzed above, the computing-unit innovations that NVIDIA uses in the A100 are not actually new. They have existed in AI hardware for many years, and many startups have attempted similar implementations. However, once a chip reaches this scale, its design is no longer only a matter of logic design; yield, heat dissipation and other factors must also be considered. These seemingly low-level factors have to be taken into account during top-level architecture design. In other words, although others can think of using these architectural innovations, they have not solved the many engineering problems that come with them. This, too, is a barrier that NVIDIA has built up over many years.
In fact, we believe that computing power is only a small part of the A100's hardware competitive barrier; its more important barriers come from interconnect, virtualization and other features. Interconnect and virtualization are important requirements in cloud data center scenarios, and meeting them requires solid, step-by-step design and accumulation.
If NVIDIA had not introduced virtualization, and cloud AI accelerators were still purely a contest of computing power, startups would still have had a chance to leapfrog NVIDIA. After the A100, we think that cloud AI accelerator startups targeting the same market as NVIDIA have lost that opportunity: they must implement virtualization, RDMA and other distributed-computing features on their own chips, step by step, before they are qualified to take on NVIDIA head-on.
For the cloud computing market, another possible strategy for other chip makers is to focus on areas that NVIDIA cannot attend to and that the GPU's SIMT architecture does not cover well, such as computing for fintech. We expect to see more such startups in the next few years.
Implications for domestic GPU development: computing power is not everything, and support for distributed computing and virtualization also matters
The A100 GPU released by NVIDIA also has important implications for domestic GPUs aimed at cloud data centers: computing power is not everything, and support for distributed computing and multi-user virtualization may matter more.
In today's high-performance cloud computing, most tasks use distributed computing. In distributed computing, the computing power of a single GPU card is only the foundation; IO becomes an important factor in determining performance. IO here includes communication among multiple cards in one machine, between GPU and CPU, and among multiple hosts.
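A simple cost model shows why IO can outweigh raw computing power. The Python sketch below estimates the time of one data-parallel training step as compute time plus the bandwidth term of a ring all-reduce, in which each GPU transfers 2(N-1)/N times the gradient payload. All numbers are hypothetical and chosen only for illustration, and the model assumes that computation and communication do not overlap and that latency is negligible.

```python
def ring_allreduce_seconds(num_gpus, payload_bytes, link_bytes_per_s):
    """Bandwidth term of a ring all-reduce: each GPU sends and receives
    2 * (N - 1) / N times the payload over its link."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / link_bytes_per_s


def step_seconds(compute_seconds, num_gpus, payload_bytes, link_bytes_per_s):
    """One training step, assuming compute and communication do not overlap."""
    return compute_seconds + ring_allreduce_seconds(
        num_gpus, payload_bytes, link_bytes_per_s
    )


# Hypothetical numbers, chosen only for illustration.
NUM_GPUS = 8
GRADIENT_BYTES = 1e9

# Chip X: twice the compute (0.10 s per step) but a slow 10 GB/s link.
fast_compute_slow_io = step_seconds(0.10, NUM_GPUS, GRADIENT_BYTES, 10e9)
# Chip Y: half the compute (0.20 s per step) but a fast 300 GB/s link.
slow_compute_fast_io = step_seconds(0.20, NUM_GPUS, GRADIENT_BYTES, 300e9)

print(f"fast compute, slow IO: {fast_compute_slow_io:.3f} s per step")
print(f"slow compute, fast IO: {slow_compute_fast_io:.3f} s per step")
```

In this made-up setting the chip with half the computing power finishes each step sooner, because communication dominates the step time of the faster chip.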
In NVIDIA's technology stack, NVLink handles multi-GPU communication within a single machine, while RDMA and SmartNIC technology from the recently acquired Mellanox handle communication between machines. NVIDIA can thus be said to lead the world in IO as well, which ensures that its cloud GPU solution is the best in the world. Virtualization support is closely related to distributed computing. As mentioned above, GPU virtualization greatly improves the utilization of GPU resources in cloud computing.
Beyond improving utilization, however, the virtualized access model also gives the distributed-computing software stack a clean interface. Engineers building distributed systems can use the concept of virtualization to construct flexible multi-user usage models and interfaces without caring about the low-level implementation details of the GPU, which provides strong support for building efficient distributed systems.
We believe that GPU virtualization is still at an early stage, and we expect NVIDIA and other European and American vendors to keep investing in this direction. For domestic GPUs, we have always stressed that a good ecosystem is needed to make them truly competitive. Such an ecosystem first requires a scalable architecture, which means support for data communication and interconnect such as IO. It also requires a friendly, easy-to-use development environment that lets developers build multi-user cloud applications on top of the hardware. Virtualization is the core component of multi-user support.
We believe that a GPU with strong computing power but limited support for distributed computing and virtualization is worse than a GPU with weaker computing power (say, only half or even a third of NVIDIA's) but sound and complete support for distributed and multi-user scenarios. Both of these capabilities, however, require solid, step-by-step accumulation; there is no shortcut for leapfrogging.