Amazon Cloud Technology (AWS) has been building its own chips for eight years.
Since launching the first Nitro chip in 2013, this cloud vendor's venture into self-developed silicon has grown into three product lines: network chips, server chips, and chips for artificial intelligence and machine learning.
The fourth-generation Nitro network chip, the third-generation Graviton server chip, the first-generation AI inference chip, and the first-generation AI training chip make up the self-developed chip portfolio that AWS has announced for its cloud computing foundation so far.
It is worth noting that AWS is by no means doing "PowerPoint chip-making": it quickly delivers the value of its chips to customers through cloud instances. This year, AWS released a series of new instances such as C7g and Trn1, and introduced Nitro SSDs.
"Self-developed chips require accumulated experience. It is not something you can buy with money, nor something you can rush." At a media briefing during the 2021 re:Invent conference, Gu Fan, General Manager of the Product Department of Amazon Cloud Technology Greater China, described AWS's biggest advantage in self-developed chips: a deep understanding of all customers' workloads on the cloud, working backward from those workloads to design the chips.
Zhou Ge, Director of Computing and Storage in the Greater China Product Department of Amazon Cloud Technology, gave a detailed interpretation of AWS's chip-making process and the logic behind it.
Self-developed CPU: starting from customer needs, two major technical approaches to improving performance
AWS's move into self-developed chips grew out of its cloud computing innovation. The Amazon EC2 journey began with the release of the first Amazon EC2 instance in 2006, built by a team of about ten people.
As customers placed more demands on Amazon EC2, AWS kept adding instance types, including instances with more than 24TB of memory for SAP, VT1 instances for transcoding services, and Arm-based instances using Apple's latest M1 chip.
This ever-deepening diversification of instances made the AWS team realize it had to focus on the chips themselves and innovate starting from silicon. So since 2013, AWS has continuously launched new instances based on self-developed chips to deliver better price-performance.
In 2019, AWS released the second-generation server chip Graviton. After its instances went on sale, customers deployed an ever wider range of applications to Graviton, from caching and web serving all the way to data analytics, and even machine learning and high-performance computing workloads.
To help customers make better use of the new instances, AWS deeply integrates more of its managed services with Graviton so that customers can adopt them quickly. As a result, many customers complete the migration from x86 to Arm in just one or two days.
At this year's re:Invent conference, AWS released four new Graviton2-based instances, including X2gd with up to 1TB of memory and G5g powered by Graviton2 and NVIDIA GPUs. The fourth-generation I-family instances Is4gen and Im4gn also use the Graviton2 processor.
Graviton2 has 30 billion transistors; the new-generation Graviton3 adds another 20 billion, for a total of 50 billion.
C7g is the first instance based on Graviton3. Its key breakthroughs include more than 25% higher performance than the previous generation and more than double the floating-point throughput. In terms of memory, C7g is also the first compute instance in the cloud to support DDR5.
Zhou Ge also shared Graviton's technical innovations. Using the same Arm microarchitecture does not mean producing the same CPU; chip design requires its own thinking. AWS's principle is to look at the workload and derive the design's starting point from how customers actually use the cloud.
Over the past 20 years, the two easiest ways to improve CPU performance have been raising the clock frequency and adding more cores.
Most of the time, raising the frequency is an easy performance win, but within the power and density limits of today's semiconductors, a higher frequency means much more power draw and heat, requiring larger power delivery and cooling configurations.
In hyperscale data centers like the cloud, this drives up power and energy consumption, reduces cloud efficiency, and raises the bar for cooling, ultimately increasing customers' costs. "So this is one of the main reasons we are very cautious about raising frequency," Zhou Ge said.
If the frequency is not increased, what other options are there?
AWS's answer is to widen the core, that is, to exploit instruction-level parallelism, so that each core can execute more instructions and complete more work in the same clock cycle.
For example, going from executing 5 instructions per cycle to 8 means more work gets done in the same time, which can translate into a large gain in final performance.
When the compiled application can take advantage of instruction-level parallelism, the effect is very obvious: Nginx and Groovy improved by almost 60%, and even Redshift improved by more than 25%.
In addition to executing more instructions per cycle, AWS also gave each instruction the ability to process more data, which speeds up video and image processing, transcoding, and some machine learning and high-performance computing workloads.
The results: x264 and x265 video encoding performance improved by about 50%, and AES-256 encryption performance by 61%. This was achieved through instruction-level parallelism and processing more data per instruction, not by raising the frequency.
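As a software-level analogy (not AWS's implementation), the difference between processing one element per operation and many elements per operation can be sketched with NumPy; the `brighten` functions here are hypothetical:

```python
import numpy as np

# Scalar loop: one element per operation, an analogy for a narrow core.
def brighten_scalar(pixels, gain):
    out = []
    for p in pixels:
        out.append(min(255, int(p * gain)))  # clamp to 8-bit range
    return np.array(out, dtype=np.uint8)

# Vectorized: one operation touches the whole array at once, the software
# analogy of a core that applies the same instruction to more data (SIMD).
def brighten_vector(pixels, gain):
    return np.minimum(255, pixels.astype(np.float64) * gain).astype(np.uint8)

pixels = np.array([10, 100, 200, 250], dtype=np.uint8)
a = brighten_scalar(pixels, 1.5)  # [15, 150, 255, 255]
b = brighten_vector(pixels, 1.5)  # same result, one array-wide operation
```

The two produce identical results; the vectorized form simply expresses the work as "same operation, more data," which is what wider vector units exploit in hardware.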
Increasing the number of cores is the other quick and effective way to improve performance.
From the first to the second generation of Graviton, AWS added many cores, with good results. For the third generation, the AWS team studied the customer workloads actually running on Graviton2 and found that many were big-data and microservice-architecture workloads, along with some high-performance computing services, all highly sensitive to memory bandwidth and latency.
Therefore, the AWS team made a judgment: investing in memory would pay off faster than adding more cores.
A new multiple-choice question appeared: use the remaining transistors to add more cores, or to increase the CPU's memory bandwidth and reduce latency?
In the end, the AWS team chose memory. Compared with Graviton2 and the comparable Intel platform, Graviton3's memory bandwidth is 50% higher, which directly benefits many applications.
Some early customers have shared what Graviton3 brought them. Twitter, which is very sensitive to memory latency, reported performance gains of 20% to 80% and a reduction of at least 35% in tail latency. Formula 1 uses Graviton2 for fluid-dynamics simulation, already much faster than the Intel platform, and gains another 40% with C7g. The performance experience of Epic's Fortnite has also improved greatly.
Another important metric is a 60% reduction in power consumption. This maintains a higher energy-efficiency ratio, so customers do not pay a high cost.
Heavy use of Graviton has spread across many workloads. For example, SAP HANA Cloud uses Graviton to help enterprise customers improve operational efficiency and performance.
At this year's re:Invent, AWS also announced that it had won the HPC Best Cloud Platform Award for the fourth consecutive year. At the end of November, Graviton also won the Product Innovation Award from the High Performance Computing Committee of the China Computer Federation.
The C7g instance supports BFloat16, making Graviton3's machine learning inference performance almost 4 times that of the previous generation.
Self-developed AI chips: optimizing memory and network to keep pace with machine learning needs
While improving the AI computing power of its server chips, AWS continues to optimize the performance of its dedicated AI chips and related instances.
For AI training and inference, AWS offers different instances. For inference, besides Inf1, general-purpose CPU instances such as C6i and C6g are increasingly used; for training, AWS has launched P4d with EC2 super-cluster capability and DL1 powered by Intel's Habana AI chip.
In 2019, AWS released the Amazon EC2 Inf1 instance based on its self-developed cloud AI inference chip, Inferentia. Many Chinese customers now use this instance for inference.
Trn1, just launched this year, is an instance based on AWS's self-developed cloud training chip, Trainium. Machine learning has advanced rapidly in recent years, and so have its dedicated accelerator chips: purpose-built machine learning chips more than double in speed every year, far faster than general-purpose CPUs.
The challenge, however, is that machine-learning model complexity is growing tenfold, and the acceleration of GPUs and dedicated chips alone cannot keep up with that pace.
Therefore, the key to dramatically improving machine learning training capability is parallel training. That means not only improving the performance of the dedicated chip itself, but also solving the two major bottlenecks of memory and network, and building a network environment, with all its supporting services, that lets these chips perform.
In terms of memory, AWS has kept increasing capacity since P3dn; this year's Trn1 already has 512GB.
In terms of network, AWS launched a 100G network a few years ago and reached 400G with P4d last year. The newly released Trn1 raises bandwidth to 800G for the first time, and Trn1n can even reach 1600G. With EFA, AWS can place groups of machine learning devices in one fabric, improving the throughput and scalability of distributed high-performance computing and machine learning workloads.
In terms of results: training a typical large model such as GPT-3 on a two-week schedule requires 600 P3dn instances; with P4d that drops to 128, and with Trn1n it drops further to 96. With far fewer instances, costs fall significantly.
It is worth noting that when P3dn runs training, 49% of the overhead is spent on communication between instances. P4d's improved network performance cuts this to 14%, and on Trn1n only 7% of the overhead goes to network communication.
The benefit is that AWS can train with larger clusters and more accelerators at the same time, genuinely cutting training time. P4d supports up to 4,000 accelerators training in parallel, and Trn1n raises that figure to 10,000, a huge improvement.
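A back-of-the-envelope calculation shows why shrinking the communication share matters: the fraction of time spent on useful compute bounds how well adding instances pays off. The percentages below are the ones quoted above; the simple "useful compute" model is an illustration, not AWS's methodology:

```python
# Fraction of training overhead spent on inter-instance communication,
# as quoted for each instance generation.
comm_fraction = {"P3dn": 0.49, "P4d": 0.14, "Trn1n": 0.07}

useful = {}
for gen, comm in comm_fraction.items():
    useful[gen] = 1.0 - comm  # share of time doing actual training math
    print(f"{gen}: {useful[gen]:.0%} of time is useful compute")

# Relative gain of Trn1n over P3dn from communication reduction alone:
gain = useful["Trn1n"] / useful["P3dn"]  # roughly 1.82x more useful work
```

Under this model, moving from 49% to 7% communication overhead alone yields about 1.8x more useful work per unit time, before counting the per-chip speedups.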
Today, more than 60 million new instances are launched on Amazon EC2 every day, double the 2019 figure, and all of this rests on innovation in the network itself.
Self-developed SSD: unified monitoring and operational metrics, reduced bug risk
The Nitro chip is the starting point of AWS's network innovation; the 100G, 400G, 800G, and 1600G figures mentioned above are all inseparable from Nitro.
Four generations of this network chip have helped AWS solve many problems, including providing a unified security platform: no matter which CPU platform is used, customers get consistent security, consistent VPC access capabilities, and consistent APIs.
Nitro can also help improve storage performance.
The first hard disk, built in 1956, was enormous; drives gradually shrank to the size of a phonograph record. Much of the big data in data centers is still stored on such disks. As applications have evolved, data centers demand ever higher I/O throughput, and SSDs, which excel at exactly this, are increasingly used in storage devices.
SSDs store data in semiconductors, and the Flash Translation Layer (FTL) controls the mapping of data from logical addresses to physical addresses. Two things happen frequently here: one is garbage collection, which reclaims space during erase-and-write cycles; the other is wear leveling, which keeps the usage frequency of each block roughly equal so that the blocks wear out evenly.
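A toy sketch of these two FTL duties, with hypothetical class and method names, illustrating out-of-place updates, wear-leveled block selection, and garbage collection:

```python
# A toy flash-translation-layer sketch (for illustration only; real FTLs
# operate on pages within erase blocks and are far more complex).

class ToyFTL:
    def __init__(self, num_blocks: int):
        self.erase_counts = [0] * num_blocks   # wear per physical block
        self.free = set(range(num_blocks))     # blocks available to write
        self.l2p = {}                          # logical -> physical mapping
        self.invalid = set()                   # stale blocks awaiting GC

    def write(self, logical: int) -> int:
        # Wear leveling: pick the free block with the fewest erases.
        phys = min(self.free, key=lambda b: self.erase_counts[b])
        self.free.remove(phys)
        if logical in self.l2p:                # flash cannot overwrite in
            self.invalid.add(self.l2p[logical])  # place, so mark old block
        self.l2p[logical] = phys
        return phys

    def garbage_collect(self):
        # Erase invalidated blocks and return them to the free pool.
        for b in self.invalid:
            self.erase_counts[b] += 1
            self.free.add(b)
        self.invalid.clear()

ftl = ToyFTL(num_blocks=4)
ftl.write(0)            # first write of logical block 0
ftl.write(0)            # rewrite: old physical block becomes invalid
ftl.garbage_collect()   # reclaim it, incrementing its erase count
```

The key point mirrored from the text: garbage collection and wear leveling run on the FTL's own schedule, which is why an opaque vendor FTL kicking in at an unexpected moment can stall a customer's I/O.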
In the past, the large number of SSDs AWS used came from many suppliers, each with a different FTL mechanism; even different disk models could have different FTLs. If an FTL bug needed fixing, AWS could only wait for the original vendor, on an uncontrollable timeline, making it harder for AWS to guarantee service to customers.
In addition, different FTLs implement some functions with inconsistent logic, and functions such as garbage collection, defragmentation, and wear leveling may kick in at different times. When one of these suddenly activates while a customer is using the disk, in-flight requests stall, seriously disrupting the customer's workload.
The same application can also perform differently on different disks. AWS therefore uses the Nitro SSD with a unified FTL to solve these problems, fixing bugs itself and avoiding disruption to customers.
AWS built its first-generation Nitro SSD in 2017, and more than 500,000 Nitro SSDs are now deployed across its cloud. The second-generation Nitro SSD was just released at this year's re:Invent conference; both the Intel-based I4i and the Graviton2-based Im4gn/Is4gen use the new Nitro SSD.
The Nitro SSD brings a visible performance improvement: compared with the previous-generation I3 instance, the new instances reduce I/O latency by 60% and latency variability by 75%.
The EBS storage service also uses Nitro SSDs to improve performance. For example, io2 Block Express, which became generally available this year, uses Nitro SSDs and achieves 256K IOPS with very stable millisecond-level latency.
As can be seen from the figure above, its performance increases by more than 2.4 times when running PostgreSQL, and even more on SQL Server. According to customers, an R5b instance plus io2 is the best choice for running SQL Server on the cloud, with a fivefold performance increase.
In 2006, AWS released its first cloud service, the Amazon S3 object store, which was also the starting point for AWS storage.
Nowadays, more and more workloads are moving to the cloud and need Amazon S3 support. This year, AWS introduced two new storage tiers. One makes Glacier, used for archival data, an instantly retrievable tier: data can be indexed as quickly as hot data while keeping the long-term low cost of archive storage.
Amazon S3 Intelligent-Tiering also covers the new tiers; S3 currently offers 8 storage tiers to meet various storage needs. S3 holds 10^14 objects, equivalent to about 13,000 objects for every person in the world; with some 2 trillion galaxies in the universe, each galaxy could claim 50 objects.
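These scale comparisons can be sanity-checked with quick arithmetic (a world population of roughly 7.8 billion is an assumption here):

```python
objects = 10**14      # roughly 100 trillion objects stored in S3
population = 7.8e9    # assumed world population
galaxies = 2e12       # commonly cited estimate for the observable universe

per_person = objects / population   # about 12,800, i.e. roughly 13,000 each
per_galaxy = objects / galaxies     # exactly 50
```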
At such a scale, how can Amazon S3 maintain high availability and consistent performance for customers? By distributing these objects across millions of hard drives: exactly the mechanism that has let S3 provide very high stability, reliability, and performance from day one.
Conclusion: Everything starts with chip innovation
From CPUs to accelerator chips to storage, the evolution of AWS's foundational technologies all started with chip innovation. On this foundation, AWS has grown to its current scale while still innovating efficiently, securely, and continuously.
While these computing, storage, and data services push the boundaries of innovation, new constraints urgently need to be broken. They lie in physical geography and regulation: local regulatory requirements on data, plus the latency, network bandwidth, and connection-stability problems that distance brings.
AWS is also addressing these problems through a series of products. AWS Wavelength serves ultra-low-latency 5G edge applications; Outposts extends AWS's capabilities into customers' own data centers; Cloud WAN lets customers connect complex terrestrial networks to the cloud through a central control panel; Snowball devices can migrate several terabytes of data within a week; and the Ground Station managed satellite service goes beyond Earth, covering satellite communication and data processing.
Today, AWS has 25 Regions across six continents, providing 81 Availability Zones, with 9 new Regions and 30 Local Zones to come. On the cornerstone of chip innovation, AWS is helping cloud users explore further boundaries through continuous innovation in cloud infrastructure.