TY - GEN
T1 - Low-Bit FlashAttention Accelerated Operator Design Based on Triton
AU - Du, Jinyang
AU - Guo, Jinyang
AU - Ding, Yifu
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Transformer architecture has been widely adopted across numerous models. As the core component of the Transformer, the attention mechanism has a computational complexity of O(n²) compared to linear transformations, making it a major computational bottleneck when handling long sequences. Although quantization has proven to be an effective method for accelerating model inference, existing quantization approaches struggle to support ultra-low-bit formats for attention. To address this, we propose a Triton-based low-bit FlashAttention accelerator design. Specifically, this work leverages operator fusion techniques to merge the dequantization process with matrix multiplication, reducing memory access overhead. In addition, a mixed-precision quantization strategy is employed to mitigate quantization error and preserve model accuracy. Compared to FlashAttention2, the kernel has a 2.4× speedup ratio and an end-to-end inference 1.2× speedup. Experimental results demonstrate that the proposed kernel improves inference speed and memory efficiency while maintaining accuracy, offering a novel approach for deploying large models with low latency and resource consumption on edge devices. Our code is available at https://github.com/Charles2530/lowbit-quant-fa2.
AB - Transformer architecture has been widely adopted across numerous models. As the core component of the Transformer, the attention mechanism has a computational complexity of O(n²) compared to linear transformations, making it a major computational bottleneck when handling long sequences. Although quantization has proven to be an effective method for accelerating model inference, existing quantization approaches struggle to support ultra-low-bit formats for attention. To address this, we propose a Triton-based low-bit FlashAttention accelerator design. Specifically, this work leverages operator fusion techniques to merge the dequantization process with matrix multiplication, reducing memory access overhead. In addition, a mixed-precision quantization strategy is employed to mitigate quantization error and preserve model accuracy. Compared to FlashAttention2, the kernel has a 2.4× speedup ratio and an end-to-end inference 1.2× speedup. Experimental results demonstrate that the proposed kernel improves inference speed and memory efficiency while maintaining accuracy, offering a novel approach for deploying large models with low latency and resource consumption on edge devices. Our code is available at https://github.com/Charles2530/lowbit-quant-fa2.
KW - FlashAttention
KW - Hardware acceleration
KW - Low-bit quantization
KW - Mixed-precision quantization
KW - Triton
UR - https://www.scopus.com/pages/publications/105035184775
U2 - 10.1109/ICCVW69036.2025.00315
DO - 10.1109/ICCVW69036.2025.00315
M3 - Conference contribution
AN - SCOPUS:105035184775
T3 - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
SP - 3017
EP - 3026
BT - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Y2 - 19 October 2025 through 20 October 2025
ER -