LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference

by Yujeong Choi, Yunseong Kim, Minsoo Rhu

Released as an article.

2020  

Abstract

In cloud ML inference systems, batching is an essential technique for increasing throughput, which in turn helps optimize total cost of ownership. Prior work on graph batching combines individual DNN graphs into a single graph, allowing multiple inputs to be executed concurrently. We observe that this coarse-grained graph batching becomes suboptimal at handling dynamic inference request traffic, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and batching at the granularity of individual graph nodes, rather than the entire graph, enabling flexible batching. We show that LazyBatching can intelligently determine the set of nodes that can be efficiently batched together, achieving an average 15x, 1.5x, and 5.5x improvement over graph batching in terms of average response time, throughput, and SLA satisfaction, respectively.
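To make the node-granularity idea concrete, below is a minimal Python sketch of SLA-aware, per-node batching. It is an illustration under stated assumptions, not the scheduler implemented in the paper: the three-node graph, the per-node latency profile, the earliest-deadline urgency policy, and all identifiers (Request, pick_node_batch, simulate) are hypothetical. The point it demonstrates is that requests parked at the same graph node can batch into one kernel launch, so a request partway through its graph can share work with newcomers instead of waiting for a whole-graph batch to form.

    from dataclasses import dataclass

    # Hypothetical per-node latency profile (ms) for a toy 3-node DNN graph.
    # A real system would profile batch-size-dependent latencies per node.
    NODE_LATENCY_MS = {"conv": 4.0, "attn": 6.0, "fc": 2.0}
    GRAPH = ["conv", "attn", "fc"]

    @dataclass
    class Request:
        rid: int
        deadline_ms: float   # absolute SLA deadline for this request
        node_idx: int = 0    # next graph node this request must execute

    def pick_node_batch(pending, max_batch=8):
        """Group requests by the node they are parked at, then run the node
        hosting the most urgent request (earliest deadline). This is a simple
        EDF-style proxy for SLA slack, not the paper's exact policy."""
        by_node = {}
        for r in pending:
            by_node.setdefault(r.node_idx, []).append(r)
        node_idx = min(by_node, key=lambda n: min(r.deadline_ms for r in by_node[n]))
        batch = sorted(by_node[node_idx], key=lambda r: r.deadline_ms)[:max_batch]
        return node_idx, batch

    def simulate(requests):
        now = 0.0
        pending = list(requests)
        while pending:
            node_idx, batch = pick_node_batch(pending)
            now += NODE_LATENCY_MS[GRAPH[node_idx]]  # one kernel serves the batch
            for r in batch:
                r.node_idx += 1
                if r.node_idx == len(GRAPH):         # finished the whole graph
                    status = "met" if now <= r.deadline_ms else "MISSED"
                    print(f"req {r.rid}: finished at {now:.1f} ms, SLA {status}")
                    pending.remove(r)

    if __name__ == "__main__":
        # Request 1 is mid-flight at the "attn" node with a tight deadline, so
        # it runs first alone; requests 0 and 2 then batch together at "conv".
        simulate([Request(0, 30.0), Request(1, 12.0, node_idx=1), Request(2, 40.0)])

Running this prints each request's completion time and whether its SLA was met. The fixed latency constants are a simplifying assumption; the trade-off a real node-level scheduler must weigh is that waiting at a node can grow the batch (throughput) while eating into the SLA slack of the requests already parked there (latency).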

Archived Files and Locations

application/pdf  1.9 MB
file_p5qjeoka4nevljmkj56fdefg3e
arxiv.org (repository)
web.archive.org (webarchive)
Type: article
Stage: submitted
Date: 2020-10-25
Version: v1
Language: en
arXiv: 2010.13103v1
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints).
Catalog Record
Revision: cfff13c2-5a9f-401c-a2b5-b1575ccfa806