Abstract
Remote memory accesses in multi-chip GPUs pose a major performance bottleneck due to high latency and inter-chip bandwidth contention. Exploiting intra-chip locality alleviates this bottleneck by serving memory accesses locally and reducing cross-chip traffic. Yet, conventional coarse-grained approaches to exploiting locality in multi-chip GPUs often incur excessive overhead, limiting their potential performance benefits. To this end, we propose TLS-Cache, a two-level shared L1 cache that efficiently exploits intra-chip locality without additional cache capacity. It mitigates high-latency remote memory accesses by enabling fine-grained data reuse through cluster-shared and remote-shared L1 caches, which capture locality within and across streaming multiprocessor clusters, respectively. These two caches work cooperatively to maximize the exploitation of intra-chip locality and deliver measurable performance gains. Experimental results show that TLS-Cache improves instructions per cycle by 30.2% on average, compared with the baseline 4-chip GPU with private L1 caches.
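The two-level lookup described in the abstract can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not the paper's design: it assumes each streaming multiprocessor cluster has one L1 shared by its SMs (the "cluster-shared" level), that a miss there probes the L1s of the other clusters on the same chip (standing in for the "remote-shared" level), and that anything else goes to a remote chip. The function name `serve`, the fill policy, and the absence of capacity limits or eviction are all hypothetical simplifications.

```python
def serve(l1_by_cluster, cluster, line):
    """Return which level serves `line` for a request issued by `cluster`.

    l1_by_cluster: list of sets, one per SM cluster on a single chip
                   (hypothetical model; no capacity limit or eviction).
    """
    # Level 1: the requesting cluster's own shared L1.
    if line in l1_by_cluster[cluster]:
        return "cluster-shared L1"
    # Level 2: another cluster's L1 on the same chip (assumed probe).
    if any(line in l1 for i, l1 in enumerate(l1_by_cluster) if i != cluster):
        return "remote-shared L1"
    # Miss at both levels: fetch over the inter-chip link and fill locally.
    l1_by_cluster[cluster].add(line)
    return "remote chip"


# Two clusters on one chip; each request is (cluster id, cache-line address).
l1 = [set(), set()]
trace = [(0, 0x40), (0, 0x40), (1, 0x40)]
results = [serve(l1, c, a) for c, a in trace]
print(results)
```

In this toy trace only the first access crosses chips; the second is reused within the cluster and the third across clusters, which is the kind of intra-chip reuse the abstract attributes to the two cache levels.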
| Original language | English |
|---|---|
| Article number | 103500 |
| Journal | Journal of Systems Architecture |
| Volume | 167 |
| DOIs | |
| State | Published - Oct 2025 |
Keywords
- Cache
- Locality
- Memory access
- Multi-chip GPUs
Title
Exploiting intra-chip locality for multi-chip GPUs via two-level shared L1 cache