
CPU and GPU profiles and differences

January 31, 2024

NVIDIA recently announced that the NVIDIA® Tesla® AI supercomputing platform powers the top 13 systems on the Green500 ranking of the world's most energy-efficient high performance computing (HPC) systems. All 13 computers use the NVIDIA Tesla P100 data center GPU accelerator, and four of them are based on the NVIDIA DGX-1™ AI supercomputer.

NVIDIA also released performance data showing that Tesla GPUs have improved HPC application performance more than threefold compared with the Kepler architecture released two years earlier. This far exceeds the performance gains predicted by Moore's Law, which itself has begun to slow in recent years.


Many people share the same doubt: in areas such as mining, password cracking, and even supercomputing, more and more applications rely on GPUs rather than CPUs. Is it time for the CPU to be replaced by the GPU?

CPU and GPU Introduction

What is a CPU

The Central Processing Unit (CPU) is the computing and control core of a computer; together with internal memory and the input/output devices, it is one of the computer's three core components. Its main job is to interpret computer instructions and process the data handled by software. The CPU consists of an arithmetic logic unit, a control unit, registers, and the buses that carry the data, control, and status signals connecting them. Almost all CPUs operate in four phases: fetch, decode, execute, and writeback. The CPU fetches an instruction from memory or cache, places it in the instruction register, decodes it, and then executes it. The programmability of a computer mainly refers to programming the CPU.


CPU functions

A computer solves problems by executing programs. A program is a sequence of instructions, and executing a program means executing those instructions one by one. Once a program has been loaded into main memory, the CPU can fetch and execute its instructions automatically.

The CPU has the following four basic functions:

1. Instruction sequence control

This means controlling the order in which the instructions of a program are executed. The instructions in a program are strictly ordered, and only by following the sequence the program specifies can the computer be guaranteed to work correctly.

2. Operation control

The function of an instruction is usually carried out by a sequence of operations performed by components inside the computer. Based on the function of each instruction, the CPU generates the corresponding control signals and sends them to the appropriate components, so that those components operate exactly as the instruction requires.

3. Timing control

Timing control means scheduling when each operation takes place. During the execution of an instruction, the moment at which each operation is performed must be strictly controlled; only then can the computer work in an orderly, automatic fashion.

4. Data processing

That is, performing arithmetic and logic operations, or other information processing, on data. The CPU fetches an instruction from memory or cache, places it in the instruction register, and decodes it. It breaks the instruction down into a series of micro-operations, then issues the control signals that drive this micro-operation sequence, completing the execution of one instruction. An instruction is a basic command that tells the computer which operation to perform and on which operands. An instruction consists of one or more bytes, including an opcode field, one or more operand address fields, and sometimes status words and flags that describe the state of the machine; some instructions contain the operand itself directly.
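To make this fetch-decode-execute cycle concrete, here is a minimal toy sketch in plain C++ (no GPU involved). The three-field instruction format and the tiny register file are invented purely for illustration and do not correspond to any real instruction set:

```cpp
// Toy fetch-decode-execute loop. The instruction encoding below (opcode plus two
// register indices) is hypothetical and exists only to illustrate the cycle.
#include <cstdint>
#include <cstdio>
#include <vector>

enum Opcode : uint8_t { HALT = 0, ADD = 1, MUL = 2 };

struct Instruction {
    Opcode op;    // operation code field
    uint8_t dst;  // destination register index
    uint8_t src;  // source register index
};

int main() {
    std::vector<Instruction> program = {
        {ADD, 0, 1},   // r0 = r0 + r1
        {MUL, 0, 2},   // r0 = r0 * r2
        {HALT, 0, 0},
    };
    int64_t regs[4] = {2, 3, 4, 0};  // toy register file
    size_t pc = 0;                   // program counter: instruction sequence control

    while (true) {
        Instruction ins = program[pc++];  // fetch
        switch (ins.op) {                 // decode...
            case ADD: regs[ins.dst] += regs[ins.src]; break;  // ...execute and write back
            case MUL: regs[ins.dst] *= regs[ins.src]; break;
            case HALT: std::printf("r0 = %lld\n", (long long)regs[0]); return 0;
        }
    }
}
```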

What is a GPU

The GPU (Graphics Processing Unit) is the graphics processing chip. It is the "heart" of the graphics card, playing a role equivalent to that of the CPU in a computer, and it determines the class and most of the performance of the card. It is also what separates 2D graphics cards from 3D graphics cards: a 2D display chip relies mainly on the CPU when processing 3D images and effects, which is called "software acceleration", whereas a 3D display chip builds the 3D image and effects processing into the chip itself, which is called "hardware acceleration". The display chip is usually the largest chip on the graphics card (and the one with the most pins). Most graphics cards on the market today use graphics processing chips from companies such as NVIDIA and ATI.


Today, GPUs are no longer limited to 3D graphics processing. The development of general-purpose GPU computing has attracted a great deal of attention in the industry, and in practice GPUs can deliver tens or even hundreds of times the performance of CPUs in floating-point and other parallel computations. Such a powerful rising star inevitably makes the dominant CPU vendor Intel nervous about the future, and NVIDIA and Intel have repeatedly traded words over whether the CPU or the GPU matters more.

The current standards for GPU general-purpose computing include OpenCL, CUDA, and ATI Stream. Among them, OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. It gives software developers a unified programming environment for writing efficient, lightweight code for high-performance computing servers, desktop systems, and handheld devices, targeting multi-core processors (CPUs), graphics processors (GPUs), Cell-type architectures, and digital signal processors (DSPs), with broad prospects in gaming, entertainment, research, medicine, and other fields. Current products from both AMD-ATI and NVIDIA support OpenCL.
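To give a flavor of what GPU general-purpose code looks like, here is a minimal CUDA sketch, not taken from any particular vendor sample: each of a very large number of lightweight threads adds one pair of array elements. The array size and the 256-thread block size are arbitrary choices for illustration.

```cpp
// Minimal CUDA vector addition: one thread per element.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);  // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same computation written in OpenCL or ATI Stream would differ in syntax but follow the same pattern: one small kernel applied to many data elements in parallel.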

ATI was founded on August 20, 1985, and in October of the same year it developed its first graphics chip and graphics card using ASIC technology. In April 1992, ATI released the Mach32 graphics card with integrated graphics acceleration. In April 1998, IDC named ATI the market leader in the graphics chip industry, but at that time its chips did not yet carry the title of GPU; for a long time ATI called its graphics processors VPUs, and only adopted the GPU name after AMD acquired ATI. NVIDIA first introduced the GPU concept when it released the GeForce 256 graphics processor in 1999, and since then the cores of NVIDIA graphics cards have carried this new name. The GPU reduces the graphics card's reliance on the CPU and takes over some of the work originally done by the CPU, especially in 3D graphics processing. The core technologies used by the GPU include hardware T&L (transform and lighting), cube environment texture mapping and vertex blending, texture compression and bump mapping, and a dual-texture four-pixel 256-bit rendering engine, with hardware T&L being the hallmark of the GPU.

The difference between CPU and GPU

CPUs and GPUs differ greatly because they were designed with different goals, aimed at two different application scenarios. The CPU must be highly general-purpose in order to handle a wide variety of data types, and its logic introduces many branch jumps and interrupts, all of which make its internal structure extremely complicated. The GPU, by contrast, faces highly uniform, mutually independent, large-scale data and a clean computing environment that does not need to be interrupted.

As a result, the CPU and GPU have very different architectures (shown schematically below):

[Figure: CPU vs. GPU architecture schematic]

The figure is taken from the NVIDIA CUDA documentation: the green blocks are compute units, the orange-red blocks are storage units, and the orange blocks are control units.

The GPU devotes most of its area to a large number of compute units and a very long pipeline, with only very simple control logic and little cache. The CPU, in contrast, is largely occupied by cache, complex control logic, and many optimization circuits; by comparison, raw compute capability takes up only a small part of the chip.

[Figure: comparison of register files and SIMD units in the GPU and CPU]

From the figure above we can see that the GPU has more registers than the CPU. A large register file is needed to support a large number of threads: every thread uses registers, so as the thread count grows, the register file must grow with it.

SIMD units (single instruction, multiple data: the same instruction executed on many data elements at the same time): the GPU also has more of these than the CPU.

The CPU is based on a low-latency design:

[Figure: CPU latency-oriented design]

The CPU has powerful ALUs (arithmetic logic units) that can complete an arithmetic operation in a few clock cycles.

Today's CPUs support 64-bit double precision: a double-precision floating-point addition or multiplication takes only 1 to 3 clock cycles.

CPU clock frequencies are very high, reaching roughly 1.5 to 3 gigahertz (GHz, 10^9 cycles per second). Large caches also reduce latency: by keeping plenty of recently used data in the cache, a later access to the same data can be served directly from the cache.

A complex logic control unit: when the program contains multiple branches, branch prediction reduces latency.

Data forwarding: when an instruction depends on the result of a previous instruction, the control logic tracks the positions of these instructions in the pipeline and forwards the result of one instruction to subsequent instructions as early as possible. This requires many comparison and forwarding circuits.
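To see why branch prediction matters, consider this small host-side sketch (plain C++, no GPU involved; the array size and the threshold of 128 are arbitrary). The same branchy loop typically runs noticeably faster over sorted data, where the branch outcome is easy to predict, than over random data, where it is not; depending on compiler optimizations, the gap may shrink.

```cpp
// Times a data-dependent branch over random vs. sorted data to illustrate
// branch prediction. Results vary by CPU and compiler settings.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static long long sumAbove(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v)
        if (x >= 128) s += x;  // branch taken for about half of the elements
    return s;
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    for (int& x : data) x = static_cast<int>(rng() % 256);

    auto run = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sumAbove(data);
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%lld, %lld ms\n", label, s, (long long)ms);
    };

    run("unsorted (branch hard to predict)");
    std::sort(data.begin(), data.end());
    run("sorted   (branch easy to predict)");
    return 0;
}
```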

The GPU is based on a high-throughput design:

[Figure: GPU throughput-oriented design]

The GPU has a great many ALUs and very little cache, and its cache serves a different purpose than the CPU's: rather than holding data for later reuse, it exists to serve the threads. When many threads need to access the same data, the cache merges those accesses and then goes to DRAM (because the data lives in DRAM, not in the cache); once the data arrives, the cache forwards it to the corresponding threads, which is where its data-forwarding role comes in. Because DRAM must be accessed, however, this naturally introduces latency.

The GPU's control unit (the yellow blocks on the left of the diagram) can merge multiple memory accesses into fewer accesses.

Although the GPU suffers from DRAM latency, it has a very large number of ALUs and a very large number of threads. To hide the memory latency, it exploits those many ALUs to achieve very high throughput, keeping as many threads in flight as possible; this is also why GPU ALUs typically sit behind deep, heavily loaded pipelines.
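This access-merging behavior is what CUDA programmers call memory coalescing. Below is a hedged sketch contrasting a coalesced copy, where neighbouring threads touch neighbouring addresses and the hardware can combine them into few DRAM transactions, with a strided copy, where the accesses are scattered and many more transactions are needed. The array size and the stride of 32 are arbitrary illustrative choices:

```cpp
// Rough timing of coalesced vs. strided global-memory access in CUDA.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // adjacent threads read adjacent addresses
}

__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(long long)i * stride % n];  // scattered addresses
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    const int threads = 256, blocks = (n + threads - 1) / threads;
    copyCoalesced<<<blocks, threads>>>(in, out, n);  // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    copyCoalesced<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float msCoalesced; cudaEventElapsedTime(&msCoalesced, t0, t1);

    cudaEventRecord(t0);
    copyStrided<<<blocks, threads>>>(in, out, n, 32);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float msStrided; cudaEventElapsedTime(&msStrided, t0, t1);

    std::printf("coalesced: %.2f ms, strided: %.2f ms\n", msCoalesced, msStrided);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFree(in); cudaFree(out);
    return 0;
}
```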

So the CPU is good at logic control and serial computation on general-purpose data, while the GPU is good at large-scale concurrent computation, which is exactly what password cracking, mining, and similar workloads require. That is why, beyond image processing, the GPU is increasingly used for general computation.

Most of the GPU's work looks like this: computationally heavy, but with no technical depth, and repeated over and over. It is as if you had a job that required hundreds of millions of additions, subtractions, multiplications, and divisions of numbers under one hundred: the best approach is to hire dozens of elementary school students and have them count together, each handling a portion, since these calculations have no technical content and are pure manual labor. The CPU is like an old professor who can work out integrals and derivatives but commands a high salary: one old professor costs as much as twenty elementary school students. If you were Foxconn, which would you hire? The GPU works in just this way, using many simple compute units to complete a huge number of computing tasks by sheer force of numbers. This strategy rests on one premise: the work given to pupil A and pupil B must be independent, with no dependence on each other. Many computation-heavy problems have this property, such as password cracking, mining, and much of graphics computation: they can be decomposed into many identical simple tasks, each assigned to one pupil. But some tasks involve a "flow" of dependent steps. Take a blind date: both parties have to meet and find each other agreeable before things can develop further; you can hardly have someone go and register the marriage before the two of you have even met. Problems with this kind of complexity are handled by the CPU.

All in all, the CPU and GPU differ greatly in design because of the tasks they were originally built to handle. Tasks that resemble what the GPU was originally built to solve are the ones worth offloading to the GPU. The GPU's speed depends on how many "elementary school students" it employs, while the CPU's speed depends on how brilliant the "professor" is. The professor can handle complex tasks far beyond what the pupils can, but for simpler tasks he still cannot beat sheer numbers. Of course, today's GPUs can also take on somewhat more complex work, as if the pupils had been promoted to junior high school, but they still need the CPU to feed them data before they can start working, so they still depend on the CPU for control.

Under today's computer architecture, the GPU is still only a niche player

The GPU is a latecomer: by the time it arrived, the computer architecture had already been settled, so it is unlikely to shake Intel's dominance, and Intel will use its entrenched advantages to suppress other competitors.

Why can the GPU only be counted as a niche? From a performance perspective, the programs that run on a computer can be roughly divided into three categories: 1. I/O-intensive; 2. memory-intensive; 3. compute-intensive.

1. The performance bottleneck of an I/O-intensive program is I/O: most of its running time is spent on disk reads and writes or network communication, and I/O sits at the bottom of the computer architecture pyramid; it is very slow. The much-hyped "big data" applications are of this type. Where do hundreds of terabytes or even petabytes of data go? They can only live on hard disks, and when one machine is too small, hundreds or even thousands of machines are connected over the network for distributed processing. This is all about I/O, which is why today's big Internet companies all run clusters with thousands of nodes.

2. A memory-intensive program's performance bottleneck is memory access: it performs a large number of random memory accesses but essentially no I/O. Such programs are already an order of magnitude faster than the first category, but still cannot compare with the speed of registers. Most applications today fall into this category; the various software installed on a personal computer is basically of this kind, with at most a little I/O.

These two categories are the most common and cover most useful software, but unfortunately the GPU is of no use for either of them; the GPU only helps with compute-intensive programs. For a program whose bottleneck is I/O, the time spent on computation is negligible, so GPU acceleration makes no difference. Programs with large amounts of random memory access are also unsuitable for the GPU; heavy random access can even degrade the GPU's behavior from parallel to effectively serial.

What type of program is suitable for running on the GPU

1. Compute-intensive programs

A compute-intensive program spends most of its running time on register operations. Registers run at a speed comparable to the processor itself, so reading and writing them involves almost no delay. For comparison, the latency of reading memory is a few hundred clock cycles, and reading from a hard disk is not even worth mentioning; even an SSD is simply too slow.
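As an illustration of what compute-intensive means in this register-centric sense, here is a hedged CUDA sketch in which each thread loads one value, performs thousands of fused multiply-adds entirely in registers, and then writes the result back; the iteration count and constants are arbitrary:

```cpp
// A compute-bound kernel: one load, many register-only operations, one store.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void computeBound(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];                 // one memory read...
    for (int k = 0; k < 4096; ++k)
        x = fmaf(x, 1.000001f, 0.5f);  // ...then thousands of register-only operations
    data[i] = x;                       // one memory write
}

int main() {
    const int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;
    computeBound<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();
    std::printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```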

2. Easily parallelized programs

The GPU is essentially a SIMD (Single Instruction, Multiple Data) architecture: it has hundreds of cores, and each core can do the same thing at the same time.
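A hedged sketch of the constraint this implies: on NVIDIA hardware, threads execute in 32-wide groups (warps) that issue the same instruction together, so when threads of one warp take different branches, the two paths run one after another instead of in parallel. The even/odd split below is an arbitrary illustration:

```cpp
// Branch divergence: within a warp, the if and else paths are serialized.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = sinf((float)i);  // half of each warp takes this path...
    else
        out[i] = cosf((float)i);  // ...while the other half waits, then runs this one
}

int main() {
    const int n = 1 << 20;
    float* out;
    cudaMallocManaged(&out, n * sizeof(float));
    divergent<<<(n + 255) / 256, 256>>>(out, n);
    cudaDeviceSynchronize();
    std::printf("out[0]=%f out[1]=%f\n", out[0], out[1]);
    cudaFree(out);
    return 0;
}
```

Kernels whose threads mostly follow the same path, like the earlier vector addition, keep all lanes of the SIMD units busy and get the full benefit of the architecture.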

Neither the GPU nor the CPU can replace the other

To put it figuratively, the GPU is like a colony of ants that all do the same thing, while the CPU is like a monkey that keeps doing different things.

CPUs and GPUs serve different purposes, have different focuses, and show different performance characteristics: for some jobs the CPU is faster, while for others the GPU may be better.

When you need to do the same thing to a lot of data, the GPU is the better fit; when you need to do a lot of different things to the same data, the CPU is the right choice.

It is safe to predict that, as the CPU further strengthens its ability to process blocks of data, we will see CPU and GPU architectures converge, and that as manufacturing processes advance and chips shrink, the GPU will be able to take on more complex instructions. Although the division of labor between the CPU and the GPU is still quite distinct, the overlap between the two will undoubtedly grow.
