ML-NA: A machine learning based node performance analyzer utilizing straggler statistics

  • Xue Ouyang
  • Changjian Wang
  • Renyu Yang
  • Guogui Yang
  • Paul Townend
  • Jie Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks run abnormally slower than their sibling tasks. The straggler problem extends job response time and degrades system throughput. Poorly performing nodes are more likely to engender stragglers and can undermine straggler mitigation effectiveness. For example, speculative execution, the dominant mechanism for straggler alleviation, creates redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned to poorly performing nodes, they are less likely to catch up with the stragglers than replicas run on fast nodes. Because performance heterogeneity is caused not only by static attribute variations such as physical capacity but also by dynamic characteristic fluctuations such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical parallel task execution log data, ML-NA classifies cluster nodes into different categories and predicts their performance in the near future as a scheduling guide to improve speculation effectiveness and minimize task straggler generation. We consider MapReduce as a representative framework for our analysis and use the published OpenCloud trace as a case study to train and evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy of up to 92.86%.
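
To make the idea of "straggler statistics" concrete, the following is a minimal sketch of how per-node straggler rates might be derived from a task execution log and turned into coarse node categories. It is not ML-NA's actual model: the abstract does not specify the features, the classifier, or the category definitions. The record layout `(job_id, node_id, duration)`, the 1.5x-median straggler threshold, the category cutoffs, and the function names `node_straggler_rates` and `categorize` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def node_straggler_rates(tasks, threshold=1.5):
    """Return the fraction of each node's tasks that were stragglers.

    tasks: iterable of (job_id, node_id, duration_seconds) records.
    A task counts as a straggler when its duration exceeds
    threshold * median duration of its job (an assumed definition).
    """
    by_job = defaultdict(list)
    for job_id, node_id, duration in tasks:
        by_job[job_id].append((node_id, duration))

    total = defaultdict(int)
    slow = defaultdict(int)
    for records in by_job.values():
        job_median = median(d for _, d in records)
        for node_id, duration in records:
            total[node_id] += 1
            if duration > threshold * job_median:
                slow[node_id] += 1
    return {node: slow[node] / total[node] for node in total}


def categorize(rate, cutoffs=(0.1, 0.3)):
    """Map a straggler rate to a label; cutoffs are illustrative only."""
    if rate < cutoffs[0]:
        return "fast"
    if rate < cutoffs[1]:
        return "medium"
    return "slow"


# Toy log: node n3 is consistently much slower than its siblings.
tasks = [
    ("job1", "n1", 10.0), ("job1", "n2", 11.0), ("job1", "n3", 30.0),
    ("job2", "n1", 20.0), ("job2", "n2", 21.0), ("job2", "n3", 50.0),
]
rates = node_straggler_rates(tasks)
labels = {node: categorize(rate) for node, rate in rates.items()}
print(labels)
```

In a learning-based setting, labels of this kind computed over past time windows could serve as training targets for predicting a node's category in the next window; a scheduler could then avoid placing speculative copies on nodes predicted to be slow.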

Original language: English
Title of host publication: Proceedings - 2017 IEEE 23rd International Conference on Parallel and Distributed Systems, ICPADS 2017
Publisher: IEEE Computer Society
Pages: 73-80
Number of pages: 8
ISBN (Electronic): 9781538621295
DOIs
State: Published - 2 Jul 2017
Event: 23rd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2017 - Shenzhen, China
Duration: 15 Dec 2017 → 17 Dec 2017

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Volume: 2017-December
ISSN (Print): 1521-9097

Conference

Conference: 23rd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2017
Country/Territory: China
City: Shenzhen
Period: 15/12/17 → 17/12/17

Keywords

  • Machine Learning
  • Node Performance
  • Prediction
  • Straggler Problem
