TY - GEN
T1 - NENYA
T2 - 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2022
AU - Wang, Lu
AU - Zhao, Pu
AU - Du, Chao
AU - Luo, Chuan
AU - Su, Mengna
AU - Yang, Fangkai
AU - Liu, Yudong
AU - Lin, Qingwei
AU - Wang, Min
AU - Dang, Yingnong
AU - Zhang, Hongyu
AU - Rajmohan, Saravan
AU - Zhang, Dongmei
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/8/14
Y1 - 2022/8/14
N2 - Large-scale distributed systems, such as Microsoft 365's database system, require timely mitigation solutions to address failures and improve service availability and reliability. Still, mitigation actions can be costly, as they may cause temporary performance degradation and even incur monetary expenses. Mitigation actions can be administered either in a reactive fashion to contain detected failures or in a proactive fashion to reduce potential failures. The proactive mitigation approach typically relies on a two-stage strategy: a prediction model first identifies instances (such as databases or disks) with high failure risk, and then appropriate mitigation actions, chosen by engineers or by an automatic bandit learning model, are applied. Because information is not fully shared across these two stages, important factors such as mitigation costs and the states of instances are often ignored in one of them. To address these issues, we propose NENYA, an end-to-end mitigation solution for a large-scale database system, powered by a novel cascade reinforcement learning model. Taking the states of databases as input, NENYA directly outputs mitigation actions and is optimized on joint cumulative feedback on mitigation costs and failure rates. Because the overwhelming majority of databases do not require mitigation actions, NENYA uses a novel cascade decision structure that first reliably filters out such databases and then focuses on choosing appropriate mitigation actions for the rest. Extensive offline and online experiments show that our method outperforms existing practices in reducing both the failure rates of databases and mitigation costs. NENYA has been integrated into Microsoft 365, a productivity platform, with resounding success.
KW - cascade learning
KW - failure mitigation
KW - reinforcement learning
UR - https://www.scopus.com/pages/publications/85137140857
U2 - 10.1145/3534678.3539127
DO - 10.1145/3534678.3539127
M3 - Conference contribution
AN - SCOPUS:85137140857
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 4032
EP - 4040
BT - KDD 2022 - Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
Y2 - 14 August 2022 through 18 August 2022
ER -