SI2-SSI: FAMII: High Performance and Scalable Fabric Analysis, Monitoring and Introspection Infrastructure for HPC and Big Data

by Hari Subramoni, Dhabaleswar K. Panda, Karen Tomko

Published by figshare.

2020  

Abstract

As heterogeneous computing (CPUs, GPUs, etc.) and networking (NVLink, X-Bus, etc.) hardware continue to advance, it becomes increasingly essential and challenging to understand the interactions between High-Performance Computing (HPC) and Deep Learning applications and frameworks, the communication middleware they rely on, the underlying communication fabric that this middleware depends on, and the schedulers that manage HPC clusters. Such understanding will enable application developers and users, system administrators, and middleware developers to maximize the efficiency and performance of the individual components that make up a modern HPC system and to solve different grand challenge problems. Moreover, determining the root cause of performance degradation is complex for the domain scientist, and the scale of emerging HPC clusters further exacerbates the problem. These issues lead to the following broad challenge: How can we design a tool that enables in-depth understanding of the communication traffic on the interconnect and GPU through tight integration with the MPI runtime at scale?

Archived Files and Locations

application/pdf  950.4 kB
file_mdpcfhms4jgfzp2ucdmc35sqji
s3-eu-west-1.amazonaws.com (publisher)
web.archive.org (webarchive)
Type  graphic
Stage   published
Date   2020-02-05