Authors
Ajay Potnis, Chief Research Officer - Beyond Labs
Discover how Beyond Network addresses challenges with distributed heterogeneous GPUs.
Asymmetric Parallel Support for Heterogeneous GPUs in Beyond Network
The rapid growth of artificial intelligence (AI) applications has led to an increasing demand for efficient and scalable inference systems. However, running large language models (LLMs) on a diverse set of gaming GPUs in a decentralized environment presents performance and resource-utilization challenges. Beyond Network, a decentralized AI inference network, addresses these challenges with a set of targeted techniques. In this installment of the series, we discuss asymmetric parallel support for heterogeneous GPUs.
Challenges in running LLMs on gaming GPUs
One of the primary challenges in running LLMs on gaming GPUs is their varying computational capacities. Gaming GPUs come in many configurations, with different memory limits, bandwidths, and amounts of computational power. This heterogeneity makes it difficult to distribute and execute inference tasks efficiently across multiple GPUs, leading to suboptimal performance and resource utilization.
Beyond Network's asymmetric parallel support
Beyond Network introduces asymmetric parallel support to tackle the challenges posed by heterogeneous GPUs. This approach extends the concept of pipeline parallelism [1] by allowing each pipeline parallel stage to be assigned a different number of layers and tensor model parallel degrees. By adapting to the specific capabilities of each GPU, asymmetric parallel support enables the system to optimize resource utilization and improve overall performance.
Extending Pipeline Parallelism
Beyond Network's asymmetric parallel support builds upon the foundations of pipeline parallelism and tensor model parallelism. Pipeline parallelism [1] is a technique that partitions a model across multiple devices, with each device processing a subset of the model's layers. On the other hand, tensor model parallelism [3] involves distributing the computation of individual layers across multiple devices.
Beyond Network extends these concepts by allowing for a flexible assignment of layers and tensor model parallel degrees to each pipeline stage. This flexibility enables the system to adapt to the specific capabilities of each GPU, ensuring optimal resource utilization and improved performance.
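To make this concrete, here is a minimal sketch of what an asymmetric parallel plan could look like for a hypothetical 32-layer model. The class, field, and device names are illustrative placeholders, not Beyond Network's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch only: these names are hypothetical, not Beyond Network's API.
@dataclass
class StagePlan:
    num_layers: int   # transformer layers assigned to this pipeline stage
    tp_degree: int    # tensor model parallel degree within the stage
    gpus: list        # device ids serving the stage

# An asymmetric plan across mixed gaming GPUs: a stage backed by more (and
# faster) GPUs takes more layers and a higher tensor parallel degree.
plan = [
    StagePlan(num_layers=16, tp_degree=4,
              gpus=["rtx4090-0", "rtx4090-1", "rtx4090-2", "rtx4090-3"]),
    StagePlan(num_layers=10, tp_degree=2, gpus=["rtx3080-0", "rtx3080-1"]),
    StagePlan(num_layers=6,  tp_degree=1, gpus=["rtx3060-0"]),
]

assert sum(s.num_layers for s in plan) == 32  # every layer covered exactly once
```

Note that neither the layer counts nor the tensor parallel degrees need to match across stages, which is precisely what distinguishes this from classical symmetric pipeline parallelism.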
Formal Definition and Notations
To formally define Beyond Network's asymmetric parallel support, consider a set of GPU devices G = {d_1, d_2, …, d_n}, where each device d_i is characterized by its memory limit M_i, memory bandwidth B_i, and computational power C_i. Communication between devices is represented by the matrices L and B, where L_ij and B_ij denote the communication latency and bandwidth between devices d_i and d_j, respectively.
Let G_ij be the subset of GPU devices serving the i-th stage of the j-th pipeline, and let L_ij be the number of transformer layers assigned to this stage.
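The notation above can be sketched in code. The field names and sample numbers below are illustrative placeholders (rough RTX 4090- and RTX 3060-class figures), not Beyond Network's internal representation:

```python
from dataclasses import dataclass

# Hypothetical encoding of the notation above; not Beyond Network's code.
@dataclass
class Device:
    memory_limit_gb: float   # M_i
    bandwidth_gbps: float    # B_i (device memory bandwidth)
    compute_tflops: float    # C_i

devices = [
    Device(24.0, 1008.0, 82.6),  # d_1, e.g. an RTX 4090-class card
    Device(12.0, 360.0, 12.7),   # d_2, e.g. an RTX 3060-class card
]

n = len(devices)
# L[i][j]: communication latency (ms) and B[i][j]: link bandwidth (GB/s)
# between devices d_i and d_j; the values here are placeholders.
L = [[0.0 if i == j else 0.5 for j in range(n)] for i in range(n)]
B = [[float("inf") if i == j else 12.5 for j in range(n)] for i in range(n)]
```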
Computation and Communication Cost Estimation
The computation cost comp_ij and communication cost comm_ij for the i-th stage in the j-th pipeline can be estimated using the formulas introduced in HexGen [4].
Formulas
[The HexGen cost-estimation formulas appear here as an image in the original post.]
where:
h is the hidden dimension size,
b is the batch size,
k is the number of GPUs in G_ij.
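The exact HexGen formulas do not survive in this post, but a rough cost model of the same general shape can be sketched: compute time proportional to FLOPs divided by aggregate compute power, and communication time as link latency plus bytes moved over bandwidth. The function and constants below are illustrative, not HexGen's precise estimator:

```python
def estimate_costs(num_layers, h, b, seq_len, k,
                   compute_tflops, latency_ms, bandwidth_gbps):
    """Rough per-stage cost estimate (ms). Illustrative model only,
    not HexGen's exact formulas."""
    # A transformer layer's forward pass costs roughly 24*b*seq_len*h^2 FLOPs;
    # tensor parallelism over k GPUs divides the per-GPU work.
    flops = 24 * b * seq_len * h**2 * num_layers
    comp_ms = flops / (k * compute_tflops * 1e12) * 1e3

    # Each tensor-parallel layer all-reduces ~b*seq_len*h activations twice;
    # assume fp16 (2 bytes per element). Time ~ latency + bytes / bandwidth.
    bytes_moved = 2 * num_layers * b * seq_len * h * 2
    comm_ms = latency_ms + bytes_moved / (bandwidth_gbps * 1e9) * 1e3
    return comp_ms, comm_ms

# Example: a 16-layer stage on 4 GPUs (placeholder parameter values).
comp_ms, comm_ms = estimate_costs(num_layers=16, h=4096, b=8, seq_len=512,
                                  k=4, compute_tflops=82.6,
                                  latency_ms=0.5, bandwidth_gbps=12.5)
```

As expected of such a model, adding GPUs to a stage lowers its compute cost, while slow interconnects inflate its communication cost.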
Scaling Inference on Distributed Heterogeneous Cloud
Beyond Network scales inference by leveraging asymmetric parallel support and tensor model parallelism. This allows for the flexible assignment of layers and tensor model parallel degrees to each pipeline stage based on the specific capabilities of each GPU. Here's how the approach improves efficiency and reduces latency:
Resource Optimization:
By assigning different layers and tensor model parallel degrees to each pipeline stage, the network can optimize the use of each GPU's memory, computational power, and bandwidth.
Efficient utilization of resources leads to lower computation costs (comp_ij) and reduced communication overhead (comm_ij).
Adaptability to GPU Capabilities:
GPUs with higher computational power C_i and memory bandwidth B_i can handle more layers, reducing the overall computation time.
GPUs connected with lower latency L_ij facilitate faster communication between devices, minimizing delays in data transfer.
Conditions for Efficiency and Low Latency:
High Computational Power: More powerful GPUs (higher C_i) reduce the computation cost comp_ij by processing layers faster.
High Memory Bandwidth: GPUs with higher memory bandwidth B_i enhance data transfer speeds, lowering the communication cost comm_ij.
Low Communication Latency: Low latency L_ij between GPUs enables quicker data exchange, which is crucial for synchronized parallel processing.
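These conditions suggest a simple heuristic: assign layers to pipeline stages roughly in proportion to each stage's compute power, so that no stage becomes a bottleneck. The toy sketch below illustrates the idea; it is not Beyond Network's actual scheduler:

```python
def assign_layers(total_layers, stage_compute):
    """Split total_layers across stages proportionally to compute power
    (toy heuristic, not Beyond Network's scheduler)."""
    total_c = sum(stage_compute)
    shares = [total_layers * c / total_c for c in stage_compute]
    layers = [int(s) for s in shares]  # take the integer part first
    # Hand leftover layers to the stages with the largest remainders.
    leftover = total_layers - sum(layers)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - layers[i], reverse=True)
    for i in order[:leftover]:
        layers[i] += 1
    return layers

# Three stages with placeholder TFLOPS figures: the strongest stage
# receives the most layers.
print(assign_layers(32, [82.6, 29.8, 12.7]))
```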
By distributing tasks based on the unique capabilities of each GPU, Beyond Network's approach ensures optimal resource utilization, leading to significant improvements in efficiency and latency. This addresses the inherent challenges in running large language models on a heterogeneous set of GPUs, paving the way for scalable and efficient AI inference on a distributed compute cloud.
Conclusion
Beyond Network's asymmetric parallel support is a powerful technique for addressing the challenges of running large language models on heterogeneous gaming GPUs in a decentralized environment. Beyond Network enables optimal resource utilization and improved performance by extending pipeline parallelism and allowing for a flexible assignment of layers and tensor model parallel degrees to each pipeline stage.
As the demand for efficient and scalable AI inference grows, Beyond Network's approach to handling heterogeneous GPUs sets the stage for the future of decentralized AI applications. By harnessing the power of gaming GPUs worldwide, Beyond Network democratizes end-user accessibility to AI and empowers open-source innovation.
References:
[1] Y. Huang et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism," arXiv:1811.06965, 2019.
[2] J. Park et al., "HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism," USENIX ATC, 2020.
[3] M. Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism," arXiv:1909.08053, 2019.
[4] Y. Jiang et al., "HexGen: Generative Inference of Large Language Model over Heterogeneous Environment," arXiv:2311.11514, 2023.