Graphics Memory Optimization Features

Refer to Tensor Parallel in the multi-card parallelism guide.

Background

In synchronous offload mode, the accelerator stalls after finishing each layer and waits while the next layer’s weights are copied from CPU memory to the accelerator. The device therefore sits idle for a large share of the run, which lowers utilization.

Optimization method

- Asynchronous offload is a classic technique that trades time for space, or more precisely, inference speed for graphics memory capacity.
- Its core idea is to overlap compute and weight transfer with an asynchronous pipeline. While the device computes one layer, the next layer’s weights are already being loaded. By the time the current layer finishes, the next layer’s weights are ready, so transfer time is hidden behind compute time and end-to-end latency is reduced.
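
The overlap described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by this project: the names `run_sync`, `run_async`, `load`, `compute`, `sync_latency`, and `async_latency` are hypothetical, and a background thread stands in for the host-to-device copy (real frameworks typically rely on device streams and pinned host memory). The latency helpers assume a single transfer channel and a one-layer lookahead.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sync(layers, load, compute, x):
    """Synchronous offload: load a layer, then compute it, strictly in turn."""
    for layer in layers:
        weights = load(layer)       # the device would sit idle during this copy
        x = compute(weights, x)
    return x


def run_async(layers, load, compute, x):
    """Asynchronous offload: prefetch layer i+1 while layer i is computing."""
    if not layers:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, layers[0])
        for i in range(len(layers)):
            weights = pending.result()                      # wait for prefetched weights
            if i + 1 < len(layers):
                pending = pool.submit(load, layers[i + 1])  # start the next transfer
            x = compute(weights, x)                         # overlaps with that transfer
    return x


def sync_latency(compute_t, transfer_t):
    """Every transfer and every compute step is paid for in full."""
    return sum(compute_t) + sum(transfer_t)


def async_latency(compute_t, transfer_t):
    """Only the first transfer is fully exposed; afterwards each step costs
    whichever is longer: computing layer i or transferring layer i+1."""
    n = len(compute_t)
    if n == 0:
        return 0
    total = transfer_t[0]
    for i in range(n - 1):
        total += max(compute_t[i], transfer_t[i + 1])
    return total + compute_t[-1]


if __name__ == "__main__":
    # Toy "layers": each weight is a number, compute is multiply-and-add-one.
    layers = [2, 3, 4]
    print(run_sync(layers, lambda w: w, lambda w, x: x * w + 1, 1))
    print(run_async(layers, lambda w: w, lambda w, x: x * w + 1, 1))
    # Made-up per-layer times (arbitrary units): compute 4, transfer 3.
    print(sync_latency([4, 4, 4], [3, 3, 3]), async_latency([4, 4, 4], [3, 3, 3]))
```

Under this model, transfer time is fully hidden only when each layer's compute takes at least as long as the next layer's transfer. Otherwise the device still waits for the difference.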
The figures below show the standard offload flow and the asynchronous offload flow:

Refer to Linear Quantization in the lightweight algorithm guide.