Checkpointing to minimize completion time for Inter-dependent Parallel
Processes on Volunteer Grids
by
Mohammad Tanvir Rahman, Hien Nguyen, Jaspal Subhlok, Gopal Pandurangan
2016
Abstract
Volunteer computing is being used successfully for large-scale scientific
computations. This research is in the context of Volpex, a programming
framework that supports communicating parallel processes in a volunteer
environment. Redundancy and checkpointing are combined to ensure consistent
forward progress with Volpex in this unique execution environment characterized
by heterogeneous, failure-prone nodes and interdependent replicated processes.
An important parameter for optimizing performance with Volpex is the frequency
of checkpointing. The paper presents a mathematical model to minimize the
completion time for inter-dependent parallel processes running in a volunteer
environment by finding a suitable checkpoint interval. Validation is performed
with a sample real-world application running on a pool of distributed volunteer
nodes. The results indicate that the performance with our predicted checkpoint
interval is fairly close to the best performance obtained empirically by
varying the checkpoint interval.
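
The abstract does not state the paper's model, so the following is only a minimal sketch of the general trade-off behind choosing a checkpoint interval, not the authors' method. It uses Young's classic first-order approximation for a single process, which ignores the replication and inter-process dependencies that Volpex adds. The function names and the parameters `checkpoint_cost` and `mtbf` (mean time between failures) are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the best checkpoint interval.

    Assumption: a single process, failures arriving at a constant rate,
    and checkpoint_cost much smaller than mtbf. This is NOT the model
    from the paper, only a well-known baseline.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost to checkpointing plus rework.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected work redone after a failure
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 2.0, 100.0  # hypothetical values, in minutes
    best = young_interval(cost, mtbf)
    for tau in (best / 2, best, best * 2):
        print(f"interval={tau:6.1f}  overhead={overhead_fraction(tau, cost, mtbf):.3f}")
```

Checkpointing too often wastes time writing checkpoints, while checkpointing too rarely wastes time redoing lost work; the sketch shows the overhead rising on either side of the estimated interval.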
arXiv: 1603.03502v1