
Low-Bit FlashAttention Accelerated Operator Design Based on Triton

  • Jinyang Du
  • Jinyang Guo
  • Yifu Ding*
  • *Corresponding author for this work
  • Beihang University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The Transformer architecture has been widely adopted across numerous models. As the core component of the Transformer, the attention mechanism has a computational complexity of O(n²) in the sequence length n, in contrast to linear transformations, making it a major computational bottleneck when handling long sequences. Although quantization has proven to be an effective method for accelerating model inference, existing quantization approaches struggle to support ultra-low-bit formats for attention. To address this, we propose a Triton-based low-bit FlashAttention accelerator design. Specifically, this work leverages operator fusion techniques to merge the dequantization process with matrix multiplication, reducing memory access overhead. In addition, a mixed-precision quantization strategy is employed to mitigate quantization error and preserve model accuracy. Compared to FlashAttention2, the kernel achieves a 2.4× speedup, and end-to-end inference is 1.2× faster. Experimental results demonstrate that the proposed kernel improves inference speed and memory efficiency while maintaining accuracy, offering a novel approach for deploying large models with low latency and resource consumption on edge devices. Our code is available at https://github.com/Charles2530/lowbit-quant-fa2.
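To make the fusion idea in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' Triton kernel. It assumes symmetric per-row quantization of the key matrix and applies the scale inside the score computation, so a full-precision copy of K is never built. The function names (`quantize_rows`, `attention_scores_fused`, `attention_scores_reference`) and the per-row scaling scheme are illustrative assumptions, not taken from the paper.

```python
def quantize_rows(mat, bits=4):
    """Symmetric per-row quantization (an assumed scheme, for illustration).

    Returns integer codes in [-qmax, qmax] and one float scale per row.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    codes, scales = [], []
    for row in mat:
        amax = max(abs(x) for x in row)
        scale = amax / qmax if amax > 0 else 1.0
        codes.append([max(-qmax, min(qmax, round(x / scale))) for x in row])
        scales.append(scale)
    return codes, scales


def attention_scores_fused(q, k_codes, k_scales):
    """Compute q @ K^T directly from the low-bit codes of K.

    Dequantization is folded into the dot product: the accumulation runs
    over the integer codes and the row scale is applied once at the end,
    so no dequantized copy of K is stored.
    """
    scores = []
    for q_row in q:
        out_row = []
        for codes, scale in zip(k_codes, k_scales):
            acc = sum(a * c for a, c in zip(q_row, codes))
            out_row.append(acc * scale)
        scores.append(out_row)
    return scores


def attention_scores_reference(q, k):
    """Full-precision q @ K^T, used only to measure quantization error."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
            for q_row in q]


if __name__ == "__main__":
    q = [[0.5, -1.0, 0.25, 2.0]]
    k = [[0.7, -0.3, 0.1, 0.0], [1.4, 0.7, -0.7, 0.2]]
    k_codes, k_scales = quantize_rows(k, bits=4)
    fused = attention_scores_fused(q, k_codes, k_scales)
    ref = attention_scores_reference(q, k)
    print("fused:", fused)
    print("reference:", ref)
```

The `bits` argument hints at how a mixed-precision strategy could be explored, for example by quantizing different tensors at different bit widths; which tensors the paper keeps at higher precision is not stated in the abstract.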

Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3017-3026
Number of pages: 10
ISBN (Electronic): 9798331589882
Publication status: Published - 2025
Event: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025 - Honolulu, United States
Duration: 19 Oct 2025 → 20 Oct 2025

Publication series

Name: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025

Conference

Conference: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Country/Territory: United States
City: Honolulu
Period: 19/10/25 → 20/10/25
