Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication
by Grzegorz Kwasniewski, Joost VandeVondele
Department of Computer Science, ETH Zurich, Swiss National Supercomputing Centre
2019
Abstract
We propose COSMA: a parallel matrix-matrix multiplication (MMM) algorithm that
is near communication-optimal for all combinations of matrix dimensions,
processor counts, and memory sizes. The key idea behind COSMA is to derive an
optimal (up to a factor of 0.03% for 10MB of fast memory) sequential schedule
and then parallelize it, preserving I/O optimality. To achieve this, we use the
red-blue pebble game to precisely model MMM dependencies and derive constructive
and tight sequential and parallel I/O lower bound proofs. Compared to 2D or 3D
algorithms, which fix the processor decomposition upfront and then map it to the
matrix dimensions, COSMA reduces communication volume by up to √3 times.
COSMA outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all
scenarios, by up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's
peak performance. Our work does not require any hand-tuning and is maintained
as an open-source implementation.
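The idea of an I/O-efficient sequential schedule can be illustrated with a short Python sketch. It assumes the sequential lower bound takes the form Q ≥ 2mnk/√S + mn for multiplying an m×k matrix by a k×n matrix with S words of fast memory, and it counts the loads and stores of a simple outer-product schedule that keeps an a×a tile of C resident while streaming A and B. The function names (tile_side, outer_product_io, sequential_lower_bound) and the memory budget a² + a + 1 ≤ S are hypothetical choices for illustration, not COSMA's actual code.

```python
import math


def tile_side(S: int) -> int:
    """Largest square tile side a such that an a x a block of C, one
    a-element column segment of A, and a single element of B fit in S
    words of fast memory (a^2 + a + 1 <= S). This budget is an assumption."""
    a = math.isqrt(S)
    while a > 0 and a * a + a + 1 > S:
        a -= 1
    if a == 0:
        raise ValueError("fast memory too small")
    return a


def outer_product_io(m: int, n: int, k: int, S: int) -> int:
    """Loads + stores of a hypothetical outer-product schedule: keep an
    a x a tile of C resident; for each of the k steps, load a column
    segment of A (a words) and stream a row segment of B (a words)."""
    a = tile_side(S)
    if m % a or n % a:
        raise ValueError("this sketch assumes a divides m and n")
    tiles = (m // a) * (n // a)
    loads = tiles * k * 2 * a  # 2a words per rank-1 update of the tile
    stores = m * n             # every element of C is written back once
    return loads + stores


def sequential_lower_bound(m: int, n: int, k: int, S: int) -> float:
    """Assumed form of the sequential MMM I/O lower bound."""
    return 2 * m * n * k / math.sqrt(S) + m * n


if __name__ == "__main__":
    m = n = k = 1024
    S = 1057  # 32*32 + 32 + 1, so the tile side is exactly 32
    q = outer_product_io(m, n, k, S)
    lb = sequential_lower_bound(m, n, k, S)
    print(q, round(q / lb, 4))
```

With S = 1057 the tile side is 32 while √S ≈ 32.5, so this schedule needs 68,157,440 I/O operations, about 1.6% above the assumed bound; the relative gap shrinks as S grows because a/√S approaches 1.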
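The contrast between fixing a processor grid upfront and adapting it to the matrix dimensions can be sketched with a simplified model: p processors split the m×n×k iteration space into a p1×p2×p3 grid, and each processor's cost is approximated by the A, B, and C faces of its local block. This surface-area proxy and the helper names (factor_triples, per_proc_volume, best_grid) are assumptions for illustration only; the proxy is not COSMA's communication model, and its ratios are not the paper's √3 figure.

```python
def factor_triples(p: int):
    """Yield every processor grid (p1, p2, p3) with p1 * p2 * p3 == p."""
    for p1 in range(1, p + 1):
        if p % p1:
            continue
        rest = p // p1
        for p2 in range(1, rest + 1):
            if rest % p2 == 0:
                yield p1, p2, rest // p2


def per_proc_volume(m: int, n: int, k: int, grid) -> float:
    """Surface-area proxy (an assumption of this sketch): the local block is
    (m/p1) x (n/p2) x (k/p3), and the processor touches its A face
    (m/p1 * k/p3), B face (k/p3 * n/p2), and C face (m/p1 * n/p2)."""
    p1, p2, p3 = grid
    a, b, c = m / p1, n / p2, k / p3
    return a * c + c * b + a * b


def best_grid(m: int, n: int, k: int, p: int):
    """Pick the factorization of p that minimizes the proxy volume."""
    return min(factor_triples(p), key=lambda g: per_proc_volume(m, n, k, g))


if __name__ == "__main__":
    m, n, k, p = 4096, 4096, 65536, 64  # large inner dimension k
    fixed = (4, 4, 4)                    # cube grid chosen upfront
    adapted = best_grid(m, n, k, p)
    print(fixed, per_proc_volume(m, n, k, fixed))
    print(adapted, per_proc_volume(m, n, k, adapted))
```

For m = n = 4096, k = 65536, and p = 64, the cube grid (4, 4, 4) has a proxy volume of 34,603,008 words per processor, while the best factorization reaches 20,971,520 (several grids tie, for example (2, 2, 16)). For square matrices with m = n = k the cube grid is already the unique minimum, which is why fixing the grid upfront only hurts for non-square shapes in this model.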
arXiv: 1908.09606v2