TY - JOUR
T1 - KPAMA
T2 - A Kubernetes based tool for Mitigating ML system Aging
AU - Ding, Wenjie
AU - Liu, Zhihao
AU - Lu, Xuhui
AU - Du, Xiaoting
AU - Zheng, Zheng
N1 - Publisher Copyright:
© 2025 Elsevier Inc.
PY - 2025/8
Y1 - 2025/8
AB - As machine learning (ML) systems continue to evolve and find wider application, their user base and system size expand as well. This expansion is particularly evident in the widespread adoption of large language models. Infrastructure such as cloud services and computing hardware has become foundational to the ML system environment and is increasingly used to support continuous training and inference services. Nevertheless, it has been shown that increased data volume, computational complexity, and extended run times challenge the stability, efficiency, and availability of ML systems, precipitating system aging. To address this issue, we develop a novel solution, KPAMA, which leverages Kubernetes, the leading container orchestration platform, to enhance the autoscaling of computing workflows and resources and thereby mitigate system aging. KPAMA employs a hybrid model to predict key aging metrics and uses decision and anti-oscillation algorithms to autoscale system resources. Our experiments indicate that KPAMA markedly mitigates system aging and enhances task reliability compared to the standard Horizontal Pod Autoscaler and to systems without scaling capabilities.
KW - Autoscaling
KW - Data prediction
KW - Kubernetes-based machine learning system
KW - Software aging
UR - https://www.scopus.com/pages/publications/85219075919
U2 - 10.1016/j.jss.2025.112389
DO - 10.1016/j.jss.2025.112389
M3 - Article
AN - SCOPUS:85219075919
SN - 0164-1212
VL - 226
JO - Journal of Systems and Software
JF - Journal of Systems and Software
M1 - 112389
ER -