Exploiting intra-chip locality for multi-chip GPUs via two-level shared L1 cache

Research output: Contribution to journal › Article › peer-review

Abstract

Remote memory accesses in multi-chip GPUs pose a major performance bottleneck due to high latency and inter-chip bandwidth contention. Exploiting intra-chip locality alleviates this bottleneck by serving memory accesses locally and reducing cross-chip traffic. Yet, conventional coarse-grained approaches to exploiting locality in multi-chip GPUs often incur excessive overhead, limiting their potential performance benefits. To address this, we propose TLS-Cache, a two-level shared L1 cache that efficiently exploits intra-chip locality without additional cache capacity. It mitigates high-latency remote memory accesses by enabling fine-grained data reuse through cluster-shared and remote-shared L1 caches, which capture locality within and across streaming multiprocessor clusters, respectively. These two caches work cooperatively to maximize the exploitation of intra-chip locality and deliver measurable performance gains. Experimental results show that TLS-Cache improves instructions per cycle by 30.2% on average, compared with the baseline 4-chip GPU with private L1 caches.
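The abstract's two-level lookup order can be sketched in a toy model. This is a hypothetical illustration, not the paper's implementation: all class and method names are assumptions, and real hardware would track sets, ways, and coherence state. The sketch only shows the access path TLS-Cache describes — probe the L1 shared within the requesting SM cluster first, then the remote-shared L1s of sibling clusters on the same chip, and fall back to an inter-chip access only on a full intra-chip miss.

```python
# Hypothetical sketch of the TLS-Cache lookup order (names are illustrative).
class SharedL1:
    """Toy cache: a set of resident addresses standing in for one cluster's L1."""
    def __init__(self):
        self.lines = set()

    def lookup(self, addr):
        return addr in self.lines

    def fill(self, addr):
        self.lines.add(addr)


class Chip:
    """One GPU chip holding the shared L1 of each SM cluster."""
    def __init__(self, num_clusters):
        self.l1 = [SharedL1() for _ in range(num_clusters)]

    def access(self, cluster_id, addr):
        # Level 1: cluster-shared L1 (locality within the SM cluster).
        if self.l1[cluster_id].lookup(addr):
            return "cluster-shared hit"
        # Level 2: remote-shared L1s of sibling clusters on the same chip
        # (locality across SM clusters, still avoiding inter-chip traffic).
        for cid, cache in enumerate(self.l1):
            if cid != cluster_id and cache.lookup(addr):
                return "remote-shared hit"
        # Intra-chip miss: incur the inter-chip access and fill the local L1.
        self.l1[cluster_id].fill(addr)
        return "remote chip access"


chip = Chip(num_clusters=4)
print(chip.access(0, 0x100))  # cold miss: remote chip access
print(chip.access(0, 0x100))  # reuse within the cluster: cluster-shared hit
print(chip.access(1, 0x100))  # reuse from a sibling cluster: remote-shared hit
```

The ordering matters: the cluster-shared probe is the cheap, low-latency path, while the remote-shared probe trades some intra-chip traffic to avoid the far more expensive inter-chip access the abstract identifies as the bottleneck.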

Original language: English
Article number: 103500
Journal: Journal of Systems Architecture
Volume: 167
DOIs
State: Published - Oct 2025

Keywords

  • Cache
  • Locality
  • Memory access
  • Multi-chip GPUs

