TY - GEN
T1 - ML-NA
T2 - 23rd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2017
AU - Ouyang, Xue
AU - Wang, Changjian
AU - Yang, Renyu
AU - Yang, Guogui
AU - Townend, Paul
AU - Xu, Jie
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/2
Y1 - 2017/7/2
N2 - Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks running abnormally slower than the other sibling ones. The straggler problem leads to extended job response and deteriorates system throughput. Poor performance nodes are more likely to engender stragglers, and can undermine straggler mitigation effectiveness. For example, as the dominant mechanism for straggler alleviation, speculative execution functions by creating redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned onto the poor performance nodes, it is hard for them to catch up with the stragglers compared to replicas run on fast nodes. And due to the fact that the performance heterogeneity is caused not only by static attribute variations such as physical capacity, but also dynamic characteristic uctuations such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical parallel tasks execution log data, ML-NA classies cluster nodes into different categories and predicts their performance in the near future as a scheduling guide to improve speculation effectiveness and minimize task straggler generation. We consider MapReduce as a representative framework to perform our analysis, and use the published OpenCloud trace as a case study to train and to evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy up to 92.86%.
AB - Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks running abnormally slower than the other sibling ones. The straggler problem leads to extended job response and deteriorates system throughput. Poor performance nodes are more likely to engender stragglers, and can undermine straggler mitigation effectiveness. For example, as the dominant mechanism for straggler alleviation, speculative execution functions by creating redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned onto the poor performance nodes, it is hard for them to catch up with the stragglers compared to replicas run on fast nodes. And due to the fact that the performance heterogeneity is caused not only by static attribute variations such as physical capacity, but also dynamic characteristic uctuations such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical parallel tasks execution log data, ML-NA classies cluster nodes into different categories and predicts their performance in the near future as a scheduling guide to improve speculation effectiveness and minimize task straggler generation. We consider MapReduce as a representative framework to perform our analysis, and use the published OpenCloud trace as a case study to train and to evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy up to 92.86%.
KW - Machine Learning
KW - Node Performance
KW - Prediction
KW - Straggler Problem
UR - https://www.scopus.com/pages/publications/85048375312
U2 - 10.1109/ICPADS.2017.00021
DO - 10.1109/ICPADS.2017.00021
M3 - 会议稿件
AN - SCOPUS:85048375312
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 73
EP - 80
BT - Proceedings - 2017 IEEE 23rd International Conference on Parallel and Distributed Systems, ICPADS 2017
PB - IEEE Computer Society
Y2 - 15 December 2017 through 17 December 2017
ER -