TY - JOUR
T1 - HAOTuner
T2 - A Hardware Adaptive Operator Auto-Tuner for Dynamic Shape Tensor Compilers
AU - Mu, Pengyu
AU - Liu, Yi
AU - Wang, Rui
AU - Liu, Guoxiang
AU - Sun, Zhonghao
AU - Yang, Hailong
AU - Luan, Zhongzhi
AU - Qian, Depei
N1 - Publisher Copyright:
© 1968-2012 IEEE.
PY - 2023/11/1
Y1 - 2023/11/1
AB - Deep learning compilers with auto-tuners have the ability to generate high-performance programs, particularly tensor programs on accelerators. However, the performance of these tensor programs is shape-sensitive and hardware resource-sensitive. When the tensor shape is only known at runtime instead of compile time, auto-tuners must tune the tensor programs for every possible shape, leading to significant time and cost overhead. Additionally, if a tensor program tuned for one device is deployed on a different device, the performance may not be as optimal as before. To address these challenges, we propose HAOTuner, a hardware-adaptive deep learning operator auto-tuner specifically designed for dynamic shape tensors. We leverage the concept of micro-kernels as the unit of task allocation and have observed that the size of the micro-kernel greatly impacts performance. In HAOTuner, we determine the size of micro-kernels based not only on the tensor shapes but also on the available hardware resources. Specifically, we present an algorithm to select hardware-friendly micro-kernels as candidates, reducing the tuning time. We also design a cost model that is sensitive to hardware resources to support various hardware architectures. Furthermore, we provide a model transfer solution to enable fast deployment of the cost model on different hardware platforms. We evaluate HAOTuner on six different types of GPUs. The experiments demonstrate that HAOTuner surpasses the state-of-the-art dynamic shape tensor auto-tuner in terms of running time by an average of 26% and tuning time by 25%. Moreover, HAOTuner outperforms the state-of-the-art compiler with padding in terms of running time by an average of 39% and tuning time by 6×.
KW - Deep learning compilation
KW - auto-tuning
KW - dynamic shape tensor
KW - tensor program
UR - https://www.scopus.com/pages/publications/85163488037
U2 - 10.1109/TC.2023.3288758
DO - 10.1109/TC.2023.3288758
M3 - Article
AN - SCOPUS:85163488037
SN - 0018-9340
VL - 72
SP - 3178
EP - 3190
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 11
ER -