Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Sparse matrix-vector multiplication (SpMV) is a fundamental operation in scientific computing, machine learning, and graph analytics, demanding efficient execution on modern hardware. Recent advances in hardware accelerators, such as Tensor Cores, have significantly improved the performance of many compute-intensive workloads. However, effectively utilizing Tensor Cores for SpMV remains challenging due to irregular sparsity patterns and the mismatch between SpMV's computational characteristics and the constrained design of Tensor Cores, leading to suboptimal performance and hardware underutilization. In this paper, we systematically analyze state-of-the-art SpMV optimizations on Tensor Cores, identify key performance bottlenecks, and propose Drawloom, a Tensor-Core-aware SpMV framework with efficient Tensor Core mapping and optimized pipelined execution. Drawloom leverages a redesigned Tensor Core mapping strategy with a zig-zag chained sparse storage format, as well as a multi-stage register pipeline, to better exploit hardware parallelism. Our evaluation on the SuiteSparse dataset demonstrates that Drawloom outperforms cuSPARSE by 2.71×/1.90× (in FP16), 2.95×/2.39× (in FP32), and 2.47×/1.54× (in FP64) on A100 and H100 GPUs, respectively. Compared to state-of-the-art SpMV implementations, Drawloom achieves speedups of 1.26×/1.18× (in FP16) and 1.49×/1.56× (in FP64) on A100 and H100 GPUs, respectively.
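For readers unfamiliar with the kernel being accelerated, the following is a minimal reference sketch of SpMV in the widely used Compressed Sparse Row (CSR) format. It illustrates only the baseline operation and its irregular, row-dependent work; it is not Drawloom's zig-zag chained format or Tensor Core kernel, which are described in the paper itself.

```python
# Reference SpMV: y = A @ x, with A stored in CSR form.
# row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i in vals/col_idx.
# The variable number of nonzeros per row is the irregularity that makes
# mapping SpMV onto fixed-shape Tensor Core tiles difficult.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Return y where y[i] = sum(vals[k] * x[col_idx[k]]) over row i's nonzeros."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# Example: 3x3 sparse matrix
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, vals, x))  # [5.0, 6.0, 19.0]
```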

Original language: English
Title of host publication: Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2026
Editors: Tony Hosking, Madan Musuvathi, Kenjiro Taura
Publisher: Association for Computing Machinery, Inc
Pages: 245-258
Number of pages: 14
ISBN (Electronic): 9798400723100
State: Published - 28 Jan 2026
Event: 31st Annual ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2026 - Sydney, Australia
Duration: 31 Jan 2026 - 4 Feb 2026

Publication series

Name: Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2026

Conference

Conference: 31st Annual ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2026
Country/Territory: Australia
City: Sydney
Period: 31/01/26 - 4/02/26

Keywords

  • GPU
  • Hardware-aware optimization
  • Sparse matrix-vector multiplication (SpMV)
  • Sparse storage format
  • Tensor Cores
