Inter-thread Communication in Multithreaded, Reconfigurable Coarse-grain
Arrays
by
Dani Voitsechov, Yoav Etsion
2018
Abstract
Traditional von Neumann GPGPUs only allow threads to communicate through
memory on a group-to-group basis. In this model, a group of producer threads
writes intermediate values to memory, which are read by a group of consumer
threads after a barrier synchronization. To alleviate the memory bandwidth
pressure imposed by this method of communication, GPGPUs provide a small
scratchpad memory that prevents intermediate values from overloading DRAM
bandwidth. In this paper we introduce direct inter-thread communication for
massively multithreaded CGRAs, where intermediate values are communicated directly
through the compute fabric on a point-to-point basis. This method avoids the
need to write values to memory, eliminates the need for a dedicated scratchpad,
and avoids workgroup-global barriers. The paper introduces the programming
model (CUDA) and execution model extensions, as well as the hardware primitives
that facilitate the communication. Our simulations of Rodinia benchmarks
running on the new system show that direct inter-thread communication provides
an average speedup of 4.5x (13.5x max) and reduces system power by an average
of 7x (33x max), when compared to an equivalent Nvidia GPGPU.
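To make the baseline concrete, the group-to-group pattern the abstract describes can be sketched in CUDA as follows. This kernel is illustrative only (it is not taken from the paper): producer threads stage intermediate values in on-chip shared memory (the scratchpad), and consumer threads read them only after a workgroup-wide barrier; the kernel name, block size, and the doubling computation are arbitrary assumptions.

```cuda
// Baseline group-to-group communication via scratchpad + barrier.
// This is the pattern that direct point-to-point communication through
// the compute fabric would eliminate.
__global__ void group_to_group(const float *in, float *out, int n)
{
    __shared__ float scratch[256];            // per-block scratchpad memory
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Producer phase: each thread writes its intermediate value to the
    // scratchpad instead of DRAM.
    scratch[tid] = (gid < n) ? in[gid] * 2.0f : 0.0f;

    // Workgroup-global barrier: no consumer may read until every producer
    // in the block has written.
    __syncthreads();

    // Consumer phase: each thread reads a value produced by a different
    // thread (here, its neighbour within the block).
    if (gid < n)
        out[gid] = scratch[(tid + 1) % blockDim.x];
}
```

With direct inter-thread communication, the value would instead travel producer-to-consumer through the fabric, with no scratchpad store, no load, and no barrier.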
Archived Files and Locations
application/pdf, 2.3 MB
arxiv.org (repository); web.archive.org (webarchive)
arXiv:1801.05178v1