TY - GEN
T1 - Efficient GPU Resource Management under Latency and Power Constraints for Deep Learning Inference
AU - Liu, Di
AU - Ma, Zimo
AU - Zhang, Aolin
AU - Zheng, Kuangyu
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Recent rapid development in deep learning (DL) applications generates harsh requirements for DL inference services provided by GPU servers. On one hand, a high volume of diverse DL workloads demands ever better processing throughput. On the other hand, GPU servers must meet both latency and power constraints: each inference request must be responded to in real time under strict latency requirements, and GPU servers must operate within a fixed power cap to prevent system failures from power overloading or overheating. Therefore, how to efficiently manage GPU resources to achieve better throughput under both latency and power constraints has become a key challenge. To address this issue, we first perform comprehensive measurements of inference tasks and study the impact of several critical knobs, including batch size, frequency, and GPU spatial sharing, on system throughput, latency, and power. We then propose Morak, a multi-knob resource management framework for DL inference under latency and power constraints. A key mechanism of Morak is GPU resource partitioning with efficient spatial multiplexing for DL models. To further improve throughput, Morak efficiently explores the search space of GPU frequency and batch size under these constraints. Experimental results on a hardware testbed show that Morak achieves up to 67.7% throughput improvement over several state-of-the-art baselines under tight latency and power constraints.
AB - Recent rapid development in deep learning (DL) applications generates harsh requirements for DL inference services provided by GPU servers. On one hand, a high volume of diverse DL workloads demands ever better processing throughput. On the other hand, GPU servers must meet both latency and power constraints: each inference request must be responded to in real time under strict latency requirements, and GPU servers must operate within a fixed power cap to prevent system failures from power overloading or overheating. Therefore, how to efficiently manage GPU resources to achieve better throughput under both latency and power constraints has become a key challenge. To address this issue, we first perform comprehensive measurements of inference tasks and study the impact of several critical knobs, including batch size, frequency, and GPU spatial sharing, on system throughput, latency, and power. We then propose Morak, a multi-knob resource management framework for DL inference under latency and power constraints. A key mechanism of Morak is GPU resource partitioning with efficient spatial multiplexing for DL models. To further improve throughput, Morak efficiently explores the search space of GPU frequency and batch size under these constraints. Experimental results on a hardware testbed show that Morak achieves up to 67.7% throughput improvement over several state-of-the-art baselines under tight latency and power constraints.
KW - Deep learning
KW - efficient computing
KW - latency
KW - power cap
KW - resource management
KW - spatial sharing
KW - throughput
UR - https://www.scopus.com/pages/publications/85178515802
U2 - 10.1109/MASS58611.2023.00074
DO - 10.1109/MASS58611.2023.00074
M3 - Conference contribution
AN - SCOPUS:85178515802
T3 - Proceedings - 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023
SP - 548
EP - 556
BT - Proceedings - 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 20th IEEE International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023
Y2 - 25 September 2023 through 27 September 2023
ER -