Abstract
Deep neural network (DNN)-based transformer models have demonstrated remarkable performance in natural language processing (NLP) applications. Unfortunately, the unique scaled dot-product attention mechanism and intensive memory access pose a significant challenge during inference on power-constrained edge devices. One emerging solution to this challenge is computing-in-memory (CIM), which uses memory cells for logic computation to reduce data movement and overcome the memory wall. However, existing CIM designs do not support high-precision computations, such as floating-point operations, which are essential for NLP applications. Furthermore, CIM architectures require complex control modules and costly peripheral circuits to harness the full potential of in-memory computation. Hence, this article proposes a scalable RRAM-based in-memory floating-point computation architecture (RIME) that uses single-cycle NOR, NAND, and minority logic to implement in-memory floating-point operations. RIME features efficient parallel and pipeline capabilities with a centralized control module and a simplified peripheral circuit to eliminate data movement during computation. Furthermore, the article proposes pipelined implementations of matrix-matrix multiplication (MatMul) and the softmax function, enabling the construction of a transformer accelerator based on RIME. Extensive experimental results show that, compared with a GPU-based implementation, the RIME-based transformer accelerator improves timing efficiency by 2.3× and energy efficiency by 1.7× without compromising inference accuracy.
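The MatMul and softmax operations that the abstract targets are the core of scaled dot-product attention. A minimal NumPy sketch of that computation is shown below for reference; it illustrates the two MatMuls and the row-wise softmax the accelerator pipelines, not the paper's RRAM implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # MatMul 1: query-key similarity
    weights = softmax(scores)         # row-wise softmax over keys
    return weights @ V                # MatMul 2: weighted sum of values

# Toy example: 4 query tokens, 6 key/value tokens, head dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)
```

Both stages are floating-point dense linear algebra, which is why the in-memory floating-point support described in the abstract matters for transformer inference.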
| Original language | English |
|---|---|
| Pages (from-to) | 485-496 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Very Large Scale Integration (VLSI) Systems |
| Volume | 32 |
| Issue number | 3 |
| DOIs | |
| State | Published - 1 Mar 2024 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs):
- SDG 7: Affordable and Clean Energy
Keywords
- Accelerator
- computing-in-memory (CIM)
- energy efficiency
- resistive random access memory (RRAM)
- scalability
- transformer
Fingerprint
Dive into the research topics of 'An RRAM-Based Computing-in-Memory Architecture and Its Application in Accelerating Transformer Inference'.