DYRE: a DYnamic REconfigurable solution to increase GPGPU's reliability

by Josie Esteban Rodriguez Condia, Pierpaolo Narducci, Matteo Sonza Reorda, Luca Sterpone

Published in Journal of Supercomputing by Springer Science and Business Media LLC.

2021  

Abstract

General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, the reliability of these devices may be limited by faults arising at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units of these parallel devices. The solution adds spare modules that perform two in-field operations: fault detection and fault mitigation. It exploits the regularity of the execution units in the device to avoid significant design changes and reduce the overhead. The solution was evaluated in terms of reliability improvement and of area, performance, and power overhead, using an open-source micro-architectural GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend the reliability by up to 57%, with overhead costs lower than 2% in area and 8% in power.

Archived Files and Locations

application/pdf  1.4 MB
file_y32lpzd5urgedcc3tsxtwh4h74
link.springer.com (publisher)
web.archive.org (webarchive)
Type: article-journal
Stage: published
Date: 2021-03-29
Language: en
Journal Metadata
Not in DOAJ
In Keepers Registry
ISSN-L: 0920-8542
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: 00000655-85c8-45e2-8641-38493dcdfcb5