Efficient GPU Resource Management under Latency and Power Constraints for Deep Learning Inference

  • Di Liu
  • Zimo Ma
  • Aolin Zhang
  • Kuangyu Zheng*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The rapid development of deep learning (DL) applications places demanding requirements on the DL inference services provided by GPU servers. On one hand, a high volume of diverse DL workloads always calls for higher processing throughput. On the other hand, GPU servers must satisfy both latency and power constraints: each inference request must be answered in real time under strict latency requirements, and GPU servers must operate within a fixed power cap to prevent system failures from power overloading or overheating. How to efficiently manage GPU resources to achieve higher throughput under both latency and power constraints has therefore become a key challenge. To address this issue, we first perform comprehensive measurements of inference tasks and study the impact of several critical knobs, including batch size, GPU frequency, and GPU spatial sharing, on throughput, latency, and power. We then propose Morak, a multi-knob resource management framework for DL inference under latency and power constraints. A key mechanism of Morak is GPU resource partitioning with efficient spatial multiplexing for DL models. To further improve throughput, Morak efficiently explores the search space of GPU frequency and batch size under these constraints. Experiments on a hardware testbed show that Morak achieves up to 67.7% higher throughput than several state-of-the-art baselines under tight latency and power constraints.
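To illustrate the kind of constrained multi-knob selection the abstract describes, below is a minimal, hypothetical sketch in Python: given offline-profiled operating points over (GPU frequency, batch size), it picks the point with the highest throughput that satisfies both a latency SLO and a power cap. The class and function names, the profiling interface, and the numbers are illustrative assumptions, not Morak's actual algorithm or API.

```python
# Hypothetical sketch: choose the (GPU frequency, batch size) pair that maximizes
# measured throughput while respecting a latency SLO and a power cap.
# All names and numbers are illustrative; this is not Morak's implementation.

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ProfilePoint:
    freq_mhz: int          # GPU core frequency setting
    batch_size: int        # inference batch size
    throughput_rps: float  # measured requests per second
    p99_latency_ms: float  # measured tail latency
    power_w: float         # measured board power draw


def best_config(points: Iterable[ProfilePoint],
                latency_slo_ms: float,
                power_cap_w: float) -> Optional[ProfilePoint]:
    """Return the profiled configuration with the highest throughput
    among those that satisfy both the latency and power constraints."""
    feasible = [p for p in points
                if p.p99_latency_ms <= latency_slo_ms and p.power_w <= power_cap_w]
    return max(feasible, key=lambda p: p.throughput_rps, default=None)


# Example: three made-up operating points profiled offline for one model.
profile = [
    ProfilePoint(1410, 32, 950.0, 18.5, 240.0),
    ProfilePoint(1410, 64, 1100.0, 31.0, 265.0),  # violates a 25 ms SLO
    ProfilePoint(1110, 32, 780.0, 22.0, 190.0),
]
choice = best_config(profile, latency_slo_ms=25.0, power_cap_w=250.0)
print(choice)  # -> ProfilePoint(freq_mhz=1410, batch_size=32, ...)
```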

Original language: English
Title of host publication: Proceedings - 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 548-556
Number of pages: 9
ISBN (Electronic): 9798350324334
DOIs
State: Published - 2023
Event: 20th IEEE International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023 - Toronto, Canada
Duration: 25 Sep 2023 - 27 Sep 2023

Publication series

Name: Proceedings - 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023

Conference

Conference: 20th IEEE International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023
Country/Territory: Canada
City: Toronto
Period: 25/09/23 - 27/09/23

Keywords

  • Deep learning
  • efficient computing
  • latency
  • power cap
  • resource management
  • spatial sharing
  • throughput
