TY - GEN
T1 - JOVS
T2 - 37th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2025
AU - Han, Yaochen
AU - Jiang, Hongxu
AU - Zhang, Runhua
AU - She, Rui
N1 - Publisher Copyright: © 2025 ACM.
PY - 2025/7/16
Y1 - 2025/7/16
N2 - Recent embedded devices have integrated digital signal processors (DSPs) to balance performance and power when executing complex Deep Neural Network (DNN) workloads. Because modern AI DSPs provide specialized tensor-computation vector instructions but only limited on-chip memory, fully exploiting the potential of these DSPs remains a significant challenge. The performance of AI DSPs relies heavily on vendor-provided libraries and compilers. In practice, vendor-provided libraries are inflexible and prevent further optimization. State-of-the-art compilers usually focus on a single optimization (vectorization or scheduling), which is insufficient to address this challenge. In this paper, we propose JOVS, a Joint Optimization of Vectorization and Scheduling to accelerate DNN inference on AI DSPs. The key is to prioritize the selection of novel specialized instructions to implement operators, which reduces the joint optimization space. For vectorization, we design a mapping and layout scheme for the specialized instructions and perform global instruction selection. For scheduling, we propose a two-step scheduling strategy for fine-grained optimization. Finally, we evaluate JOVS on five popular DNNs. The experimental results show that JOVS achieves a 1.44× speedup over the vendor-provided library. Compared to state-of-the-art compilers, JOVS achieves a 1.83× speedup.
KW - Compiler Optimization
KW - DSPs
KW - Scheduling
KW - Vectorization
UR - https://www.scopus.com/pages/publications/105012714618
U2 - 10.1145/3694906.3743309
DO - 10.1145/3694906.3743309
M3 - Conference contribution
AN - SCOPUS:105012714618
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 487
EP - 498
BT - SPAA 2025 - Proceedings of the 2025 37th ACM Symposium on Parallelism in Algorithms and Architectures
PB - Association for Computing Machinery
Y2 - 28 July 2025 through 1 August 2025
ER -