In-depth analysis of Tesla, Qualcomm, and Huawei AI processors

Many people will ask: why not NVIDIA? At present, the backends of all mainstream deep learning frameworks, including TensorFlow, Caffe, Caffe2, PyTorch, MXNet, and PaddlePaddle, are built on NVIDIA's CUDA. CUDA comprises a microarchitecture, an instruction set, and a parallel computing engine. CUDA has a monopoly on deep learning and artificial intelligence, similar to the position ARM holds with its microarchitecture and instruction set. CUDA's strong ecosystem has created NVIDIA's unbreakable dominance. The theoretical basis of deep learning was already in place in the 1950s; what kept it from being applied was the lack of devices like GPUs that can perform dense, simple computations. It was NVIDIA's GPUs that opened the era of deep learning, or the era of artificial intelligence, and CUDA reinforces NVIDIA's position. You can do without an NVIDIA GPU, but you then have to convert formats to accommodate CUDA.

CUDA opened the era of parallel, multi-core computing. All accelerators used in artificial intelligence today are multi-core or many-core processors, and almost all of them are inseparable from CUDA. A CUDA program is divided into two parts: Host and Device. Generally speaking, the Host is the CPU and the Device is the GPU or AI accelerator. The main program is still executed by the CPU; when it reaches a data-parallel section, CUDA compiles that section into a program the GPU can execute and sends it to the GPU. This program is called a kernel. CUDA extends the C language by letting programmers define C functions called kernels. When a kernel is called, it is executed N times in parallel by N different CUDA threads, unlike an ordinary C function, which is executed only once. Each thread executing a kernel is assigned a unique thread ID, which can be accessed inside the kernel through the built-in threadIdx variable. Before calling any GPU kernel, the main program must configure the kernel for execution, that is, determine the number of thread blocks, the number of threads in each thread block, and the shared memory size. You can avoid NVIDIA's GPUs, but in the end you cannot avoid CUDA: you need to convert to the CUDA format, which means a drop in efficiency. So NVIDIA is the reference point against which everything else is measured.
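The execution model described above can be imitated in plain Python. This is only an emulation under stated assumptions, not real CUDA code: the `launch` helper and the `Index` class are hypothetical names invented for this sketch, and the "threads" run one after another instead of in parallel. What it does show is the part of the model the text describes: the host chooses the number of blocks and threads per block, and each kernel invocation works out which element it owns from its block and thread indices.

```python
# Plain-Python emulation of the CUDA execution model (illustrative only).
from dataclasses import dataclass


@dataclass
class Index:
    """Stand-in for CUDA's built-in index variables (only the x dimension)."""
    x: int


def launch(kernel, num_blocks, threads_per_block, *args):
    """Hypothetical helper mimicking kernel<<<num_blocks, threads_per_block>>>(...).

    A GPU would run these invocations in parallel; here they run sequentially.
    """
    block_dim = Index(threads_per_block)
    for b in range(num_blocks):
        for t in range(threads_per_block):
            kernel(Index(b), block_dim, Index(t), *args)


def vec_add(block_idx, block_dim, thread_idx, a, b, out):
    # Each "thread" derives its global index and handles exactly one element.
    i = block_idx.x * block_dim.x + thread_idx.x
    if i < len(out):  # guard: the last block may contain spare threads
        out[i] = a[i] + b[i]


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
out = [0] * len(a)

# Host-side configuration: 2 blocks x 4 threads = 8 threads for 5 elements.
launch(vec_add, 2, 4, a, b, out)
print(out)
```

The bounds check inside the kernel matters because the launch configuration is chosen in whole blocks, so there are usually more threads than data elements.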

From the characteristics of CUDA, it is not difficult to see that an AI accelerator cannot be used on its own. Today we analyze three AI accelerators that can be used in the field of intelligent driving: Qualcomm's AI100, Huawei's Ascend, and Tesla's FSD. Of the three, the Qualcomm AI100 is the least commonly seen.

The Qualcomm AI100 first appeared at the Qualcomm AI Open Day in Shenzhen in April 2019, with mass production scheduled for September 2020. The AI100 is currently Qualcomm's only AI inference accelerator, and it targets four applications: first, edge computing in data centers; second, 5G mobile edge computing; third, intelligent driving and intelligent transportation; and fourth, 5G infrastructure.

The AI100 has two focuses. The first is 5G gaming. On the day of the AI100's release, VIVO and the development team of Tencent's Honor of Kings were invited to stage a gaming match using the AI100, in which some of the computation was moved to 5G edge servers to reduce the load on the mobile phone. The second is intelligent transportation and intelligent driving. The AI accelerator in Qualcomm's Ride autonomous driving platform is likely to be a copy of the AI100.

Qualcomm specifically demonstrated the AI100 in intelligent transportation and intelligent driving applications.

It supports simultaneous image recognition on 24 channels of 2-megapixel video at a frame rate of 25 Hz. Tesla's FSD handles only 8 channels of 1.3-megapixel video at 30 Hz, so the AI100's performance is at least 3 times that of Tesla's FSD. The AI100 can be deployed like a blade server, with up to 16 cascaded through PCIe switches.
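The "at least 3 times" figure can be checked with a rough pixel-throughput calculation. This is a minimal sketch that assumes performance scales with megapixels processed per second (channels × resolution × frame rate) and ignores any difference in the recognition models themselves; the function name `pixel_rate` is just an illustrative choice.

```python
def pixel_rate(channels, megapixels, fps):
    """Megapixels processed per second across all camera channels."""
    return channels * megapixels * fps


ai100 = pixel_rate(channels=24, megapixels=2.0, fps=25)
fsd = pixel_rate(channels=8, megapixels=1.3, fps=30)
ratio = ai100 / fsd

print(f"AI100: {ai100:.0f} MP/s, FSD: {fsd:.0f} MP/s, ratio: {ratio:.2f}")
```

On these numbers the AI100 processes about 1200 MP/s against about 312 MP/s for the FSD, a ratio of roughly 3.8, which is consistent with the claim above.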

The AI100's maximum computing power per watt is 12.37 TOPS. Tesla's FSD consumes 36 watts, of which the AI part is estimated at about 24 watts, so its computing power is only about 3 TOPS per watt. NVIDIA's Orin delivers roughly 5.2 TOPS per watt.
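The efficiency figures above are simply peak computing power divided by power draw, and the result depends heavily on which wattage goes in the denominator. The sketch below makes that explicit. One loud assumption: the text gives no peak TOPS figure for the FSD, so the 72 TOPS used here is only what the text's own estimate implies (3 TOPS/W × 24 W), not an independently sourced number.

```python
def tops_per_watt(peak_tops, watts):
    """Efficiency as peak computing power divided by power draw."""
    return peak_tops / watts


# Assumption: implied by the article's estimate of ~3 TOPS/W at ~24 W.
fsd_peak_tops = 3 * 24

fsd_ai_only = tops_per_watt(fsd_peak_tops, 24)     # AI part only
fsd_whole_chip = tops_per_watt(fsd_peak_tops, 36)  # whole-chip power

ai100_quoted = 12.37  # TOPS/W, as quoted in the text
orin_quoted = 5.2     # TOPS/W, as quoted in the text

print(f"FSD, AI part only: {fsd_ai_only:.1f} TOPS/W")
print(f"FSD, whole chip:   {fsd_whole_chip:.1f} TOPS/W")
print(f"AI100 vs FSD (AI part): {ai100_quoted / fsd_ai_only:.1f}x")
print(f"AI100 vs Orin:          {ai100_quoted / orin_quoted:.1f}x")
```

Using whole-chip power instead of the AI part alone drops the FSD figure from 3 to 2 TOPS per watt, which is why such comparisons should state what the wattage covers.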

The picture above shows the internal block diagram of the Qualcomm AI100. The design is very simple: 16 AI cores, connected to one another over fourth-generation PCIe with a bandwidth of 186 GB/s, an 8-lane PCIe interface, and several networks-on-chip (NoCs), including a memory NoC, a compute NoC, and a configuration NoC, all connected through the PCIe bus. The on-chip memory capacity is up to 144 MB and the bandwidth is 136 GB/s. The external memory is 256 Gb of LPDDR4. The chip supports ISO 26262, the automotive industry's functional safety standard, up to ASIL B.

NoC is one of the core technologies of multi-core AI processors. Tesla's FSD has only two NPUs, so it very likely does not use a NoC and relies on relatively dated bus technology instead, whereas both Qualcomm and Huawei use NoCs.

The detailed theory of NoC will not be covered here. A NoC can be understood as a communication network running between the processing elements (PEs) and memory. NoC technology has much in common with the OSI (Open Systems Interconnection) model used in network communication, and it was itself proposed by borrowing the layered design of parallel-computer interconnection networks and Ethernet. Both support packet switching, routing protocols, task scheduling, and extensibility. A NoC, however, pays more attention to the area occupied by switching circuits and buffers, which are the main design considerations. The basic components of a NoC are the IP core, the router, the network adapter, and the network link. The IP core and router sit at the system layer, and the network adapter sits at the network adaptation layer. Many research directions and optimization approaches have grown out of these four basic components.

The NoCs of common AI accelerators are shown in the table above. It should be pointed out that both Qualcomm and Huawei use Arteris. This company is actually a subsidiary of Qualcomm: Qualcomm acquired this small French company, with only 43 people, in November 2013. Today almost all large and medium-sized chip companies in China are its customers, including Rockchip, National Technology, Huawei, Allwinner, Actions, and Spreadtrum, so they can all be said to be working for Qualcomm. Intel acquired Netspeed in 2019, and Facebook acquired Sonics in 2019. Both of those NoCs are far less widely used than Qualcomm's Arteris.

The internal framework of each AI core is shown above. It is divided into four main parts: scalar processing, vector processing, storage processing, and tensor processing. Four kinds of quantities appear often in deep learning: scalars, vectors, matrices, and tensors. The most basic data structures of a neural network are vectors and matrices. The input of a neural network is a vector, which is linearly transformed by each matrix in turn and then passed through the nonlinear transformation of an activation function. Through this layer-by-layer calculation the loss function is finally minimized, which completes the training of the model.
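The layer-by-layer calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the weights are made-up values, ReLU is chosen as the activation function only for simplicity, bias terms are omitted, and only the forward pass is shown (minimizing a loss during training is not covered).

```python
def matvec(matrix, vector):
    """Linear transformation: multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Nonlinear activation applied to each element."""
    return [max(0.0, v) for v in vector]


def forward(layers, x):
    """Pass the input vector through each layer: matrix multiply, then activation."""
    for weights in layers:
        x = relu(matvec(weights, x))
    return x


# Illustrative weights: a 2 -> 3 layer followed by a 3 -> 1 layer.
layers = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]],
    [[1.0, 1.0, 1.0]],
]
x = [2.0, 1.0]
output = forward(layers, x)
print(output)
```

Each output element of `matvec` is a chain of multiply-accumulate operations, which is exactly the workload the tensor and vector units of an AI core are built to accelerate.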

Scalar: a scalar is a single number (integer or real), unlike most other objects studied in linear algebra, which are usually arrays of numbers. A scalar is usually written as an italic lowercase letter and is equivalent to the Python definition x = 1.

Vector: a vector is an ordered arrangement of numbers, and each individual number can be found through its index in that order. Vectors are usually written as bold lowercase letters. Each element of a vector is a scalar, and a vector is equivalent to a one-dimensional array in Python.

Matrix: a matrix is a two-dimensional array in which each element is identified by two indices. A matrix is usually written as a bold italic capital letter. We can think of a matrix as a two-dimensional data table in which each row represents an object and each column represents a feature.

Tensor: a tensor is an array of more than two dimensions. In general, an array whose elements are distributed on a regular grid with several coordinate axes is called a tensor. If a tensor is a three-dimensional array, three indices are needed to locate an element. A tensor is usually written as a bold uppercase letter.

To put it less rigorously, a scalar is a point in zero-dimensional space, a vector is a line in one-dimensional space, a matrix is a plane in two-dimensional space, and a three-dimensional tensor is a solid in three-dimensional space. That is, vectors are composed of scalars, matrices are composed of vectors, and tensors are composed of matrices.
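The four quantities can be written directly as nested Python lists, which makes the "composed of" relationship concrete. This is a minimal sketch using only built-in types; the `ndim` helper is a hypothetical name for this example and assumes regular, non-empty nesting.

```python
def ndim(obj):
    """Number of indices needed to reach a single number (nesting depth)."""
    depth = 0
    while isinstance(obj, list):
        depth += 1
        obj = obj[0]  # assumes regular, non-empty nesting
    return depth


scalar = 1                                      # 0 indices
vector = [1, 2, 3]                              # 1 index
matrix = [[1, 2, 3], [4, 5, 6]]                 # 2 indices: row, column
tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # 3 indices

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    print(name, ndim(value))

# Three indices locate one element of the three-dimensional tensor.
print(tensor[1][0][1])
```

Indexing a tensor once yields a matrix, indexing a matrix once yields a vector, and indexing a vector once yields a scalar.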

The scalar unit can be regarded as a small CPU that controls the operation of the entire AI Core. It can control loops in the program and implement branch decisions. It controls the execution flow of the other functional units in the AI Core by inserting synchronization symbols into the event synchronization module. It also computes data addresses and related parameters for the matrix unit and the vector unit, and can perform basic arithmetic operations. More complex scalar operations, such as data flow control, are carried out by a dedicated AI CPU through operators. AI processors cannot work alone and must be coordinated by an external CPU.

Huawei Ascend series core architecture. Image source: Internet

Huawei's Ascend 910 uses the DaVinci Max core and, like the Qualcomm AI100, has 8192 Int8 and 4096 FP16 units. However, the Ascend 910 is used for training while the Qualcomm AI100 is used for inference. The 910 uses HBM2 memory regardless of cost, and its performance far exceeds that of the AI100.

The above picture shows the internal signal flow of Tesla's FSD. Coherent traffic, that is, the data traffic of deep learning, requires CPU control, although it is of course not limited to deep learning. Convolution, the most computationally intensive part of deep learning for image recognition, is really just matrix multiply-accumulate. It can be decomposed into the multiplication and accumulation of an operator (i.e. the weights) with a two-dimensional array, namely the input image.
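To see why convolution reduces to multiply-accumulate, here is a minimal sketch of a two-dimensional convolution in plain Python that also counts the MAC operations it performs. It assumes no padding and a stride of 1, and it uses the sliding-window form common in deep learning (no kernel flip). The image and weight values are made up for illustration.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; every output value is a sum of products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = [[0] * out_w for _ in range(out_h)]
    macs = 0
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]  # one multiply-accumulate
                    macs += 1
            output[r][c] = acc
    return output, macs


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]

result, macs = conv2d(image, kernel)
print(result)
print("multiply-accumulate operations:", macs)
```

The MAC count is output height × output width × kernel height × kernel width (here 2 × 3 × 2 × 2 = 24), which is why accelerators for this workload mostly consist of large arrays of MAC units.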

The picture above shows the Tesla FSD neural network architecture. Tesla simply labels the matrix multiply-accumulate block MulAccArray. Tesla has only just started making chips. Apart from the NPU, which it designed itself, the rest of the FSD is IP purchased from outside. The NPU is mainly a stack of MAC (multiply-accumulate) units. In scalar computing, which calls for somewhat more technical sophistication, Tesla has not announced which instruction set it uses, so it is presumably nothing special. Both Huawei and Qualcomm have adopted VLIW.

Qualcomm's vector processor can simply be regarded as a DSP. As is well known, Qualcomm's AI technology comes from its DSP technology. Qualcomm is very fond of DSPs, and VLIW (very long instruction word), an approach that had otherwise lost its vitality, turns out to be very well suited to deep learning. DSPs are generally required to perform real-time control, and their programs have strict timing requirements. A cache, whose access time is unpredictable, is therefore unsuitable; a fixed-latency TCM (tightly coupled memory) is usually used in its place so that memory access time is fixed. With these features, the difficulties that static compilation faces in general-purpose settings do not arise, and the DSP's efficient parallel computing capability and simplified hardware structure can be fully exploited.

To cover a variety of applications, the AI100 has arrays at two precisions, FP16 and Int8. Int8, or 8-bit integer precision, is the most common in intelligent driving, while FP16 is commonly used in games and AR/VR. The AI100 has 8192 Int8 units and 4096 FP16 units, while Tesla has 9216 Int8 units. If the AI100 targeted only intelligent driving, its computing power could be improved considerably with the total die area (which is roughly equivalent to cost) unchanged.

The above picture shows part of the Tesla NPU pipeline and the die area breakdown. The theoretical peak computing power is calculated simply from the number of MACs. In practice memory is the bottleneck, and it can greatly reduce the achievable computing power. This is why AI chips used for training adopt HBM memory regardless of cost. Most of Tesla's die area is given to SRAM, which is also meant to relieve the memory bottleneck. Two units are commonly seen here, GiB and GB: GB is decimal and GiB is binary. 1 GiB = 1024 × 1024 × 1024 B = 1,073,741,824 B, and 1 GB = 1000 × 1000 × 1000 B = 1,000,000,000 B, so 1 GiB / 1 GB = 1.073741824. When high precision is not required, the two can be used interchangeably. The Qualcomm AI100 has 144 MB of on-chip storage while Tesla has only 32 MiB, so Qualcomm clearly crushes Tesla. Qualcomm also crushes Tesla on external LPDDR4 memory: Tesla's bandwidth is only 63.58 GiB/s, while the Qualcomm AI100's is 136 GB/s.
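Because the Tesla figures are quoted in binary units and the Qualcomm figures in decimal units, it helps to convert them to the same unit before comparing. A minimal sketch of that conversion, using only the numbers given in the text (the helper names are illustrative):

```python
GB = 1000 ** 3    # decimal gigabyte, in bytes
GIB = 1024 ** 3   # binary gibibyte, in bytes
MB = 1000 ** 2    # decimal megabyte, in bytes
MIB = 1024 ** 2   # binary mebibyte, in bytes


def gib_to_gb(value):
    """Convert a quantity in GiB (or GiB/s) to GB (or GB/s)."""
    return value * GIB / GB


def mib_to_mb(value):
    """Convert a quantity in MiB to MB."""
    return value * MIB / MB


tesla_bw_gb = gib_to_gb(63.58)   # Tesla LPDDR4 bandwidth in GB/s
tesla_sram_mb = mib_to_mb(32)    # Tesla on-chip SRAM in MB

print(f"Tesla bandwidth: {tesla_bw_gb:.2f} GB/s vs AI100 136 GB/s "
      f"({136 / tesla_bw_gb:.2f}x)")
print(f"Tesla SRAM: {tesla_sram_mb:.2f} MB vs AI100 144 MB "
      f"({144 / tesla_sram_mb:.2f}x)")
```

In like-for-like units the gap narrows slightly: Tesla's 63.58 GiB/s is about 68.3 GB/s, so the AI100's bandwidth advantage is a little under 2x, and its on-chip storage advantage is about 4.3x.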

Finally, computing power. Comparisons of AI processors seem inseparable from comparisons of computing power, but quoting computing power figures in isolation is in fact meaningless. The above picture shows the performance of the Qualcomm AI100 on five data sets, and there are huge differences in both performance and efficiency. The stronger an AI chip's computing power, the narrower its applicability and the more tightly it is bound to particular deep learning models. In other words, an AI chip reaches its maximum performance only on a matching deep learning model; on a poorly matched one it may deliver only 10% of the chip's performance. The computing power figures currently quoted for all AI chips are theoretical peaks, which cannot be reached in practical applications. In some cases actual throughput may be only 10% or even 2% of the peak, so 100 TOPS of computing power may shrink to 2 TOPS.
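The gap between peak and delivered computing power is easy to put into numbers. The sketch below derives a theoretical peak from a MAC count, counting each multiply-accumulate as two operations, and then applies the utilization figures mentioned above. One loud assumption: the 2 GHz clock is a hypothetical value chosen for illustration, since the text does not state a clock frequency; the 9216-unit MAC count is the Tesla figure quoted earlier.

```python
def peak_tops(mac_units, clock_hz, ops_per_mac=2):
    """Theoretical peak: every MAC unit busy on every cycle (multiply + add = 2 ops)."""
    return mac_units * ops_per_mac * clock_hz / 1e12


def effective_tops(peak, utilization):
    """Throughput actually delivered at a given fraction of the peak."""
    return peak * utilization


# Assumption: hypothetical 2 GHz clock, not a figure from the text.
array_peak = peak_tops(mac_units=9216, clock_hz=2.0e9)
print(f"peak for one 9216-MAC array: {array_peak:.2f} TOPS")

for utilization in (1.0, 0.10, 0.02):
    print(f"100 TOPS peak at {utilization:.0%} utilization: "
          f"{effective_tops(100, utilization):.0f} TOPS")
```

The peak figure says nothing about how well a given model keeps the MAC array fed, which is why the same chip can look very different across data sets.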

In the field of chips, Tesla can only be regarded as an elementary school student. If they set their minds to it, Qualcomm, Intel, NVIDIA, Huawei, AMD, MediaTek, and Samsung could all crush Tesla.