TY - GEN
T1 - Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores
AU - Zhang, Kaige
AU - Yang, Hailong
AU - You, Xin
AU - Feng, Tianyu
AU - Xu, Yufan
AU - Luan, Zhongzhi
AU - Liu, Yi
AU - Qian, Depei
N1 - Publisher Copyright:
© 2026 Owner/Author.
PY - 2026/1/28
Y1 - 2026/1/28
AB - Sparse matrix-vector multiplication (SpMV) is a fundamental operation in scientific computing, machine learning, and graph analytics, demanding efficient execution on modern hardware. Recent advances in hardware accelerators, such as Tensor Cores, have significantly improved the performance of many compute-intensive workloads. However, effectively utilizing Tensor Cores for SpMV remains challenging due to irregular sparsity patterns and the mismatch between SpMV's computational characteristics and the constrained architecture design of Tensor Cores, leading to suboptimal performance and underutilization of the hardware. In this paper, we systematically analyze state-of-the-art SpMV optimizations on Tensor Cores, identify key performance bottlenecks, and propose Drawloom, a Tensor-Core-aware framework for SpMV with efficient Tensor Core mapping and optimized pipeline execution. Drawloom leverages a redesigned Tensor Core mapping strategy with a zig-zag chained sparse storage format, as well as a multi-stage register pipeline, to better exploit hardware parallelism. Our evaluation on the SuiteSparse dataset demonstrates that Drawloom outperforms cuSPARSE by 2.71×/1.90× (in FP16), 2.95×/2.39× (in FP32), and 2.47×/1.54× (in FP64) on A100 and H100 GPUs, respectively. Compared to state-of-the-art SpMV implementations, Drawloom achieves a performance speedup of 1.26×/1.18× (in FP16) and 1.49×/1.56× (in FP64) on A100 and H100 GPUs, respectively.
KW - GPU
KW - Hardware-aware optimization
KW - Sparse matrix-vector multiplication (SpMV)
KW - Sparse storage format
KW - Tensor Cores
UR - https://www.scopus.com/pages/publications/105029764497
DO - 10.1145/3774934.3786441
M3 - Conference contribution
AN - SCOPUS:105029764497
T3 - Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2026
SP - 245
EP - 258
BT - Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2026
A2 - Hosking, Tony
A2 - Musuvathi, Madan
A2 - Taura, Kenjiro
PB - Association for Computing Machinery, Inc
T2 - 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2026
Y2 - 31 January 2026 through 4 February 2026
ER -