Authors
Ajay Potnis, Chief Research Officer - Beyond Labs
Discover how Beyond Network addresses challenges with distributed heterogeneous GPUs.
Asymmetric Parallel Support for Heterogeneous GPUs in Beyond Network
The rapid growth of artificial intelligence (AI) applications has led to an increasing demand for efficient and scalable inference systems. However, running large language models (LLMs) on a diverse set of gaming GPUs in a decentralized environment presents performance and resource-utilization challenges. Beyond Network, a decentralized AI inference network, addresses these challenges with a set of targeted techniques. In this installment of the series, we discuss asymmetric parallel support for heterogeneous GPUs.
Challenges in running LLMs on gaming GPUs
One of the primary challenges in running LLMs on gaming GPUs is their varying computational capacities. Gaming GPUs come in many configurations, with different memory limits, bandwidths, and amounts of computational power. This heterogeneity makes it difficult to distribute and execute inference tasks efficiently across multiple GPUs, leading to suboptimal performance and resource utilization.
Beyond Network's asymmetric parallel support
Beyond Network introduces asymmetric parallel support to tackle the challenges posed by heterogeneous GPUs. This approach extends the concept of pipeline parallelism [1] by allowing each pipeline parallel stage to be assigned a different number of layers and tensor model parallel degrees. By adapting to the specific capabilities of each GPU, asymmetric parallel support enables the system to optimize resource utilization and improve overall performance.
Extending Pipeline Parallelism
Beyond Network's asymmetric parallel support builds upon the foundations of pipeline parallelism and tensor model parallelism. Pipeline parallelism [1] is a technique that partitions a model across multiple devices, with each device processing a subset of the model's layers. On the other hand, tensor model parallelism [3] involves distributing the computation of individual layers across multiple devices.
Beyond Network extends these concepts by allowing for a flexible assignment of layers and tensor model parallel degrees to each pipeline stage. This flexibility enables the system to adapt to the specific capabilities of each GPU, ensuring optimal resource utilization and improved performance.
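To make this concrete, here is a minimal sketch of what an asymmetric parallel plan could look like for a hypothetical 32-layer model. The class, field, and device names are illustrative placeholders, not Beyond Network's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch only: these names are hypothetical, not Beyond Network's API.
@dataclass
class StagePlan:
    num_layers: int   # transformer layers assigned to this pipeline stage
    tp_degree: int    # tensor model parallel degree within the stage
    gpus: list        # device ids serving the stage

# An asymmetric plan across mixed gaming GPUs: a stage backed by more (and
# faster) GPUs takes more layers and a higher tensor parallel degree.
plan = [
    StagePlan(num_layers=16, tp_degree=4,
              gpus=["rtx4090-0", "rtx4090-1", "rtx4090-2", "rtx4090-3"]),
    StagePlan(num_layers=10, tp_degree=2, gpus=["rtx3080-0", "rtx3080-1"]),
    StagePlan(num_layers=6,  tp_degree=1, gpus=["rtx3060-0"]),
]

assert sum(s.num_layers for s in plan) == 32  # every layer covered exactly once
```

Note that neither the layer counts nor the tensor parallel degrees need to match across stages, which is precisely what distinguishes this from classical symmetric pipeline parallelism.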
Formal Definition and Notations
To formally define Beyond Network's asymmetric parallel support, consider a set of GPU devices G = {d_1, d_2, …, d_n}, where each device d_i is characterized by its memory limit M_i, memory bandwidth B_i, and computational power C_i. Communication between devices is represented by the matrices L and B, where L_ij and B_ij denote the communication latency and bandwidth between devices d_i and d_j, respectively.
Let G_ij be the subset of GPU devices serving the i-th stage of the j-th pipeline, and let L_ij be the number of transformer layers assigned to this stage.
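The notation above can be sketched in code. The field names and sample numbers below are illustrative placeholders (rough RTX 4090- and RTX 3060-class figures), not Beyond Network's internal representation:

```python
from dataclasses import dataclass

# Hypothetical encoding of the notation above; not Beyond Network's code.
@dataclass
class Device:
    memory_limit_gb: float   # M_i
    bandwidth_gbps: float    # B_i (device memory bandwidth)
    compute_tflops: float    # C_i

devices = [
    Device(24.0, 1008.0, 82.6),  # d_1, e.g. an RTX 4090-class card
    Device(12.0, 360.0, 12.7),   # d_2, e.g. an RTX 3060-class card
]

n = len(devices)
# L[i][j]: communication latency (ms) and B[i][j]: link bandwidth (GB/s)
# between devices d_i and d_j; the values here are placeholders.
L = [[0.0 if i == j else 0.5 for j in range(n)] for i in range(n)]
B = [[float("inf") if i == j else 12.5 for j in range(n)] for i in range(n)]
```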
Computation and Communication Cost Estimation
The computation cost comp_ij and communication cost comm_ij for the i-th stage in the j-th pipeline can be estimated using the formulas introduced in HexGen [4].
Formulas
[The HexGen cost-estimation formulas appear here as an image in the original post.]
where:
h is the hidden dimension size,
b is the batch size,
k is the number of GPUs in G_ij.
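The exact HexGen formulas do not survive in this post, but a rough cost model of the same general shape can be sketched: compute time proportional to FLOPs divided by aggregate compute power, and communication time as link latency plus bytes moved over bandwidth. The function and constants below are illustrative, not HexGen's precise estimator:

```python
def estimate_costs(num_layers, h, b, seq_len, k,
                   compute_tflops, latency_ms, bandwidth_gbps):
    """Rough per-stage cost estimate (ms). Illustrative model only,
    not HexGen's exact formulas."""
    # A transformer layer's forward pass costs roughly 24*b*seq_len*h^2 FLOPs;
    # tensor parallelism over k GPUs divides the per-GPU work.
    flops = 24 * b * seq_len * h**2 * num_layers
    comp_ms = flops / (k * compute_tflops * 1e12) * 1e3

    # Each tensor-parallel layer all-reduces ~b*seq_len*h activations twice;
    # assume fp16 (2 bytes per element). Time ~ latency + bytes / bandwidth.
    bytes_moved = 2 * num_layers * b * seq_len * h * 2
    comm_ms = latency_ms + bytes_moved / (bandwidth_gbps * 1e9) * 1e3
    return comp_ms, comm_ms

# Example: a 16-layer stage on 4 GPUs (placeholder parameter values).
comp_ms, comm_ms = estimate_costs(num_layers=16, h=4096, b=8, seq_len=512,
                                  k=4, compute_tflops=82.6,
                                  latency_ms=0.5, bandwidth_gbps=12.5)
```

As expected of such a model, adding GPUs to a stage lowers its compute cost, while slow interconnects inflate its communication cost.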
Scaling Inference on Distributed Heterogeneous Cloud
Beyond Network scales inference by leveraging asymmetric parallel support and tensor model parallelism. This allows for the flexible assignment of layers and tensor model parallel degrees to each pipeline stage based on the specific capabilities of each GPU. Here's how the approach improves efficiency and reduces latency:
Resource Optimization:
By assigning different layers and tensor model parallel degrees to each pipeline stage, the network can optimize the use of each GPU's memory, computational power, and bandwidth.
Efficient utilization of resources leads to lower computation costs (comp_ij) and reduced communication overhead (comm_ij).
Adaptability to GPU Capabilities:
GPUs with higher computational power C_i and memory bandwidth B_i can handle more layers, reducing the overall computation time.
GPUs connected with lower latency L_ij facilitate faster communication between devices, minimizing delays in data transfer.
Conditions for Efficiency and Low Latency:
High Computational Power: More powerful GPUs (higher C_i) reduce the computation cost comp_ij by processing layers faster.
High Memory Bandwidth: GPUs with higher memory bandwidth B_i enhance data transfer speeds, lowering the communication cost comm_ij.
Low Communication Latency: Low latency L_ij between GPUs enables quicker data exchange, which is crucial for synchronized parallel processing.
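These conditions suggest a simple heuristic: assign layers to pipeline stages roughly in proportion to each stage's compute power, so that no stage becomes a bottleneck. The toy sketch below illustrates the idea; it is not Beyond Network's actual scheduler:

```python
def assign_layers(total_layers, stage_compute):
    """Split total_layers across stages proportionally to compute power
    (toy heuristic, not Beyond Network's scheduler)."""
    total_c = sum(stage_compute)
    shares = [total_layers * c / total_c for c in stage_compute]
    layers = [int(s) for s in shares]  # take the integer part first
    # Hand leftover layers to the stages with the largest remainders.
    leftover = total_layers - sum(layers)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - layers[i], reverse=True)
    for i in order[:leftover]:
        layers[i] += 1
    return layers

# Three stages with placeholder TFLOPS figures: the strongest stage
# receives the most layers.
print(assign_layers(32, [82.6, 29.8, 12.7]))
```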
By distributing tasks based on the unique capabilities of each GPU, Beyond Network's approach ensures optimal resource utilization, leading to significant improvements in efficiency and latency. This addresses the inherent challenges in running large language models on a heterogeneous set of GPUs, paving the way for scalable and efficient AI inference on a distributed compute cloud.
Conclusion
Beyond Network's asymmetric parallel support is a powerful technique for addressing the challenges of running large language models on heterogeneous gaming GPUs in a decentralized environment. Beyond Network enables optimal resource utilization and improved performance by extending pipeline parallelism and allowing for a flexible assignment of layers and tensor model parallel degrees to each pipeline stage.
As the demand for efficient and scalable AI inference grows, Beyond Network's approach to handling heterogeneous GPUs sets the stage for the future of decentralized AI applications. By harnessing the power of gaming GPUs worldwide, Beyond Network democratizes end-user accessibility to AI and empowers open-source innovation.
References:
[1] Y. Huang et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism," arXiv:1811.06965, 2019.
[2] J. Park et al., "HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism," USENIX ATC, 2020.
[3] M. Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism," arXiv:1909.08053, 2019.
[4] Y. Jiang et al., "HexGen: Generative Inference of Large Language Model over Heterogeneous Environment," arXiv:2311.11514, 2023.