The impact of software and accelerator architecture on performance
A few years ago, at my previous job, I had an interesting discussion with a colleague from the microelectronics department. His view was that NVIDIA's GPGPUs outperform our solution in neural-network inference because they use a more advanced process technology, higher clock frequencies, and a larger die area. As a programmer, I couldn't agree with this, but at the time nobody had the time or desire to test the hypothesis. Recently, in a conversation with my current colleagues, I remembered this discussion and decided to see it through. In this article we will compare the performance of the NM Card module from NTC Module and the GT730 graphics card from NVIDIA.
Disclaimer
All information is obtained from open sources, open documentation, and/or conference reports.
Comparison of chip characteristics
The NM Card module is based on the K1879VM8YA processor, also known as the NM6408, manufactured at a TSMC fab on a 28 nm process in 2017. The die area is 83 mm2, the chip contains 1.05 billion transistors, and its power consumption does not exceed 35 W [1]. The chip includes a PCIe 2.0 x4 controller and a DDR3 memory interface. The peak (theoretical) performance is 512 GFLOPS in the fp32 data format. The NMC4 processor cores run at a clock frequency of 1 GHz.
The closest analogs of this chip appear to be two NVIDIA graphics processors: the GK208B from the refreshed Kepler generation used in discrete graphics cards [2] and the GM108S from the Maxwell generation aimed at the laptop segment [3]. The two chips are nearly identical and were both released in March 2014. Since the GM108S is only available inside laptops, and their second-hand price is too high for simple curiosity, the first option was chosen. It is worth noting, however, that Maxwell is a newer architecture, supported by a more up-to-date software stack and potentially capable of higher performance.
To avoid duplicating the text description, the comparison of chip characteristics is summarized in the table:
| Characteristic | K1879VM8YA | GK208B |
|---|---|---|
| Factory and process technology | TSMC, 28 nm | TSMC, 28 nm |
| Die area, mm2 | 83 | 87 |
| Number of transistors, billion | 1.05 | 1.02 |
| Maximum power, W | 35 | 23 |
| PCIe controller | Gen 2, x4 | Gen 2, x8 |
| Memory interface | DDR3 | DDR3* |
| Peak performance, GFLOPS | 512 | 692.7 |
| Frequency, MHz | 1000 | 902 |

\* GDDR5 memory is also supported
The chips are almost identical in their characteristics. The GK208B, however, ships in several graphics card configurations, so it is important to pick the right card for a fair comparison.
Comparison of modules
As the NM6408-based module, the NM Card was chosen, as already mentioned [4]. All boards with a single NM6408 show identical performance, and the figures are given in the documentation for NMDL, the inference framework from Module [5], which I will refer to later.
After a couple of days of searching local flea markets, I found an MSI GT 730 (N730K-2GD3H/LPV1) graphics card for 14 euros [6]; in Russia, at the time of writing, it can be found on Avito for 500–1,000 ₽. The card was manufactured in 2016 and is quite close to the NM Card module. Both cards have passive cooling; however, the GT730 has only 2 GB of memory compared to the NM Card's 5 GB. This affected the study by limiting the batch size, which potentially capped the peak performance of several networks. The second difference is the stock frequency: 902 MHz versus 1000 MHz for the NM6408, but it can be raised in Afterburner with a single action. After that, the peak performance is 768 GFLOPS, which corresponds to a 1.5 times difference relative to the NM6408.
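For reference, these peak figures can be reproduced from the published core counts and clocks. A minimal back-of-the-envelope check (the per-core throughput of the NMC4 is inferred from the stated 512 GFLOPS rather than taken from a datasheet):

```python
# Rough check of the peak fp32 figures quoted above.

# GK208B (GT730): 384 CUDA cores, 2 FLOPs per core per cycle (FMA).
gt730_stock = 384 * 2 * 0.902   # ~692.7 GFLOPS at the stock 902 MHz
gt730_oc = 384 * 2 * 1.000      # 768.0 GFLOPS if the clock is raised to 1 GHz

# NM6408: 16 NMC4 cores at 1 GHz are rated at 512 GFLOPS,
# i.e. 32 fp32 FLOPs per core per cycle (inferred, not from a datasheet).
nm6408 = 16 * 32 * 1.0

print(gt730_stock, gt730_oc, nm6408)   # 692.736 768.0 512.0
print(gt730_oc / nm6408)               # 1.5, the ratio mentioned above
```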
Test bench and software
As a test bench, I got a computer with very modest specifications: an i3-7100 and 8 GB of DDR4 memory. Since it is the graphics card's performance that is being measured, however, this should not have a significant impact. Because of the card's rather old architecture, the latest CUDA version that works with it is 11.6. Modern onnxruntime releases depend on CUDA 12.x, so for CUDA 11.x you have to build onnxruntime yourself.
As a result, the following component versions were found to give a working onnxruntime build with cuDNN and TensorRT support (a minimal session-setup sketch follows the table):
| Component | Version |
|---|---|
| onnxruntime | 1.14 |
| CUDA | 11.6.2 |
| cuDNN | 8.7.0 |
| TensorRT | 8.5.3.1 |
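With these versions in place, creating an onnxruntime session that prefers TensorRT and falls back to CUDA looks roughly as follows. This is a sketch: the model path and provider options are illustrative, not the exact configuration used in the measurements.

```python
import onnxruntime as ort

# Try TensorRT first, then CUDA, then fall back to the CPU.
providers = [
    ("TensorrtExecutionProvider", {"device_id": 0}),
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",
]

# "resnet-50.onnx" is a placeholder path for one of the reference models.
session = ort.InferenceSession("resnet-50.onnx", providers=providers)
print(session.get_providers())  # shows which providers were actually enabled
```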
As a benchmark, a simple program using onnxruntime was written: it loads a model, runs inference with different batch sizes, and measures execution time over 10 seconds, after which the average execution time is taken (a sketch of it is given after the table below). To keep the experiment clean, I tried wherever possible to use the reference models that are distributed as supplementary data with the NMDL framework.
| Model Name | Reference Model Used | Comment |
|---|---|---|
| alexnet | + | |
| inception_229 | + | |
| resnet-18 | - | Incorrect model parameter |
| resnet-50 | + | |
| squeezenet | - | Incorrect model parameter |
| yolov2-tiny | - | Layer size mismatch within the model |
| yolov3-tiny | + | |
| yolov5s | + | |
| yolo3 | + | |
| inception_512 | + | |
| unet | + | |
| yolo5l | - | Model not provided |
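The timing loop itself can be sketched as follows. This is an approximation of the benchmark described above, not its exact source: the model path, input name, and input shape are placeholders, and models exported with a fixed batch dimension would need to be re-exported for batch sizes other than 1.

```python
import time

import numpy as np
import onnxruntime as ort


def measure_fps(session, input_name, shape, duration_s=10.0):
    """Run inference repeatedly for ~duration_s seconds and return average FPS."""
    batch = np.random.rand(*shape).astype(np.float32)
    session.run(None, {input_name: batch})  # warm-up, excluded from the timing
    runs, start = 0, time.perf_counter()
    while time.perf_counter() - start < duration_s:
        session.run(None, {input_name: batch})
        runs += 1
    elapsed = time.perf_counter() - start
    return runs * shape[0] / elapsed  # images per second


# Placeholder model; the session is created as in the earlier sketch.
session = ort.InferenceSession(
    "resnet-50.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
for batch_size in (1, 2, 4, 8, 16, 32, 64):
    fps = measure_fps(session, input_name, (batch_size, 3, 224, 224))
    print(batch_size, round(fps, 2))
```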
Data Processing Modes
The NM6408 can operate in two modes: multi-unit and single-unit. To understand them, a brief description of the chip architecture is needed. The NM6408 contains four independent computing clusters, each with four NMC4 cores and one ARM control core. Each cluster has its own DDR3 controller with 1 GB of memory attached. In addition to these four clusters, there is a separate (control) ARM core with its own DDR3 controller and another gigabyte of memory. All clusters have direct access to each other's memory, but this access is limited to the first 512 MB, a consequence of the 32-bit address space.
Single-unit mode means that each cluster operates independently. With the same model running on all clusters, this can be viewed as working with a batch size of 4 in data-parallel mode.
In multi-unit mode, a common input tensor is processed simultaneously on all four clusters. This can be viewed as working with a batch size of 1 in spatial-parallel mode.
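As a loose illustration of the two modes (plain NumPy, unrelated to the actual NMDL API): single-unit mode is essentially data parallelism over four independent inputs, while multi-unit mode is a spatial split of one input across the four clusters.

```python
import numpy as np

# A batch of 4 images in NCHW layout; the sizes are arbitrary.
batch = np.random.rand(4, 3, 224, 224).astype(np.float32)

# Single-unit mode: each of the 4 clusters independently processes
# one image of the batch (data parallelism).
per_cluster_single = [batch[i] for i in range(4)]

# Multi-unit mode: a single image is split spatially into 4 tiles,
# and all clusters work on the same input together (spatial parallelism).
image = batch[0]
top, bottom = np.split(image, 2, axis=1)  # split along height (CHW axis 1)
per_cluster_multi = np.split(top, 2, axis=2) + np.split(bottom, 2, axis=2)

print([t.shape for t in per_cluster_single])  # 4 x (3, 224, 224)
print([t.shape for t in per_cluster_multi])   # 4 x (3, 112, 112)
```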
The main difference between the NM6408 and a traditional GPGPU is that in the former case the entire model execution graph resides on the device and runs with minimal host involvement. To some extent this is similar to how modern NPUs work. GPGPUs, in contrast, work with so-called kernels, i.e., individual layers. This approach allows work to be distributed better across multiple parallel computing devices, but more on that another time.
In this study, the traditional GPGPU approach allows greater flexibility in the choice of batch size, thereby increasing device utilization.
Comparison results
An important difference in the measured quantities is that NMDL returns the execution time on the time scale of the ARM control core, without accounting for the transfer of the input and output tensors. Onnxruntime does not provide such a mechanism, so in this experiment timestamps are taken on the CPU clock before and after running inference through onnxruntime. Thus, all other things being equal, the performance values obtained from onnxruntime will be somewhat lower.
For batch size 1, the following results are observed:
| Model name | GT730 result, fps | NM Card result, fps | Acceleration, % |
|---|---|---|---|
| alexnet | 42.05 | 12.6 | 233.7 |
| inception_229 | 13.75 | 12.8 | 7.4 |
| resnet-18 | 63.74 | 25 | 154.9 |
| resnet-50 | 19.55 | 12.2 | 60.2 |
| squeezenet | 149.8 | 74.4 | 101.3 |
| yolov2-tiny | 25.96 | 21 | 23.6 |
| yolov3-tiny | 29.14 | 27.3 | 6.7 |
| yolov5s | 9.1 | 4.7 | 93.6 |
| yolo3 | 2.89 | 3.7 | -21.9 |
| inception_512 | 4.69 | 3.93 | 19.3 |
| unet | 1.68 | 2 | -16 |
| yolo5l | 1.66 | 1.39 | 19.4 |
Two of the 12 models ran slower on the GT730, two more were roughly on par, and the rest received a significant performance boost. The average improvement relative to the NM Card was 56.8%.
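The quoted average follows directly from the last column of the table; the acceleration values themselves appear to be (GT730 fps / NM Card fps − 1) · 100, which matches the table:

```python
# Acceleration values from the batch-size-1 table, in percent.
acceleration = [233.7, 7.4, 154.9, 60.2, 101.3, 23.6,
                6.7, 93.6, -21.9, 19.3, -16.0, 19.4]
print(round(sum(acceleration) / len(acceleration), 1))  # ~56.8
```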
The next experiment consisted of finding the "optimal" batch size among powers of two, i.e., batch sizes of 1, 2, 4, ..., 64.
| Model Name | GT730 Result, fps | NM Card Result, fps | Acceleration, % | Optimal Batch Size |
|---|---|---|---|---|
| alexnet | 259.06 | 13 | 1892.7 | 64 |
| inception_229 | 17.42 | 20.3 | -14.1 | 32 |
| resnet-18 | 98.66 | 47 | 109.9 | 64 |
| resnet-50 | 27.68 | 20.6 | 34.3 | 64 |
| squeezenet | 170.6 | 100 | 70.6 | 64 |
| yolov2-tiny | 35.18 | 30.4 | 15.7 | 16 |
| yolov3-tiny | 36.53 | 35.3 | 3.4 | 16 |
| yolov5s | 9.39 | 5.7 | 64.7 | 16 |
| yolo3 | 2.89 | 4.5 | -35.7 | 1 |
| inception_512 | 5.24 | 5.44 | -3.6 | 8 |
| unet | 1.7 | 2 | -14.5 | 2 |
| yolo5l | 1.9 | 1.43 | 32.8 | 8 |
In this experiment, the fact that the GT730 has 3 GB less memory starts to show. At one time, versions of this graphics card with 4 GB of GDDR5 were produced, but for the purity of the experiment I deliberately bought a card with DDR3. The results show that alexnet gets a colossal boost; if this model is excluded, the average improvement is about 24%.
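The same check as before, applied to the optimal-batch-size table with alexnet's outlier excluded:

```python
# Acceleration values from the optimal-batch-size table, in percent,
# with alexnet (1892.7) left out.
acceleration = [-14.1, 109.9, 34.3, 70.6, 15.7, 3.4,
                64.7, -35.7, -3.6, -14.5, 32.8]
print(round(sum(acceleration) / len(acceleration), 1))  # ~24.0
```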
Conclusion
Judging by the measurements, the GT730 generally shows better performance given an identical process technology for the system-on-chip and similar design decisions at the module level. A comparison with the GT940M based on the aforementioned GM108S would also be of some interest, but working laptops with this chip cost 100-150 euros, which somewhat cools my curiosity.