More Effective Synchronization Scheme in ML Using Stale Parameters

  • Beihang University

Research output: Chapter in book/report/conference proceeding; conference contribution; peer-reviewed

Abstract

In machine learning (ML), the model is increasingly important, and its parameters, the core of any ML application, are adjusted by iteratively processing a training dataset until convergence. Although data-parallel ML systems often exhibit strong error tolerance when synchronizing model parameters to maximize parallelism, the synchronization itself may be slow to complete, a problem that generally gets worse at large scale. This paper presents a Bounded Asynchronous Parallel (BAP) model of computation that allows computations to use stale model parameters in order to reduce synchronization overhead. At the same time, the BAP model provides theoretical convergence guarantees for large-scale data-parallel ML applications. The model permits distributed workers to use the stale parameters stored in their local cache instead of waiting until the Parameter Server (PS) produces a new version, which significantly reduces the time workers spend waiting. Furthermore, the BAP model guarantees the convergence of the ML algorithm by bounding the maximum staleness of the parameters. Experiments conducted on 4 cluster nodes with up to 32 GPUs showed that our model significantly improved the proportion of computing time relative to waiting time and led to a 1.2-2× speedup. We also elaborate on how to choose the staleness threshold when considering the tradeoff between efficiency and speed.
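The bounded-staleness idea described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation: the class names `ParameterServer` and `Worker`, the `staleness_bound` parameter, and the choice to measure staleness as the number of local iterations since the cache was last refreshed are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ParameterServer:
    """Toy in-process stand-in for a parameter server (hypothetical)."""
    params: List[float]
    version: int = 0

    def push(self, grads: List[float], lr: float = 0.1) -> None:
        # Apply a gradient step and publish a new parameter version.
        self.params = [p - lr * g for p, g in zip(self.params, grads)]
        self.version += 1

    def pull(self) -> List[float]:
        # Return a copy so the worker's cache is not updated in place.
        return list(self.params)


@dataclass
class Worker:
    """Worker that reuses cached parameters while they are fresh enough."""
    server: ParameterServer
    staleness_bound: int            # assumed: max iterations a cache may lag
    clock: int = 0                  # this worker's iteration counter
    cached_params: List[float] = field(default_factory=list)
    cached_clock: int = -1          # clock value at the last refresh
    refreshes: int = 0              # how often the worker had to pull

    def get_params(self) -> List[float]:
        # Pull only when there is no cache yet or the cache is too stale.
        too_stale = self.clock - self.cached_clock > self.staleness_bound
        if self.cached_clock < 0 or too_stale:
            self.cached_params = self.server.pull()
            self.cached_clock = self.clock
            self.refreshes += 1
        return self.cached_params

    def step(self, grad_fn: Callable[[List[float]], List[float]]) -> None:
        # Compute a gradient on possibly stale parameters, then push it.
        params = self.get_params()
        self.server.push(grad_fn(params))
        self.clock += 1


if __name__ == "__main__":
    ps = ParameterServer(params=[1.0])
    worker = Worker(server=ps, staleness_bound=2)
    for _ in range(10):
        worker.step(lambda p: [2 * x for x in p])  # gradient of x**2
    print(worker.refreshes, ps.version)
```

In this sketch a larger `staleness_bound` means fewer pulls from the server (less waiting in a real distributed setting) at the cost of computing gradients on older parameters, which is the tradeoff the abstract refers to when it discusses choosing the staleness threshold.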

Original language: English
Title of host publication: Proceedings - 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016
Editors: Laurence T. Yang, Jinjun Chen
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 757-764
Number of pages: 8
ISBN (electronic): 9781509042968
DOI
Publication status: Published - 20 Jan 2017
Event: 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016 - Sydney, Australia
Duration: 12 Dec 2016 to 14 Dec 2016

Publication series

Name: Proceedings - 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016

Conference

Conference: 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016
Country/Territory: Australia
City: Sydney
Period: 12/12/16 to 14/12/16
