
NTAM: Neighborhood-temporal attention model for disk failure prediction in cloud platforms

  • Chuan Luo
  • Pu Zhao
  • Bo Qiao
  • Youjiang Wu
  • Hongyu Zhang
  • Wei Wu
  • Weihai Lu
  • Yingnong Dang
  • Saravanakumar Rajmohan
  • Qingwei Lin
  • Dongmei Zhang
  • Microsoft USA
  • University of Newcastle
  • Leibniz University Hannover

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

With the rapid deployment of cloud platforms, high service reliability is of critical importance. An industrial cloud platform contains a huge number of disks, and disk failure is a common cause of service unreliability. In recent years, many machine-learning-based disk failure prediction approaches have been proposed; they predict disk failures from disk status data before the failures actually happen, so that proactive actions can be taken in advance to improve service reliability. However, existing approaches treat each disk individually and do not explore the influence of neighboring disks. In this paper, we propose the Neighborhood-Temporal Attention Model (NTAM), a novel deep-learning-based approach to disk failure prediction. When predicting whether a disk will fail in the near future, NTAM uses not only the disk's own status data but also its neighbors' status data. Moreover, NTAM includes a novel attention-based temporal component to capture the temporal nature of the disk status data. In addition, we propose a data enhancement method, called Temporal Progressive Sampling (TPS), to handle the extreme data imbalance issue. We evaluate NTAM on a public dataset as well as two industrial datasets collected from millions of disks in Microsoft Azure. Our experimental results show that NTAM significantly outperforms state-of-the-art competitors. Our empirical evaluations also indicate the effectiveness of the neighborhood-aware component and the temporal component underlying NTAM, as well as the effectiveness of TPS. More encouragingly, we have successfully applied NTAM and TPS to Microsoft cloud platforms (including Microsoft Azure and Microsoft 365) and obtained benefits in industrial practice.
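The abstract gives no architectural details, so the following is only an illustrative sketch of the general idea of combining attention over time steps with attention over neighboring disks. It is not the authors' model. Every name here (`attention_pool`, `neighborhood_temporal_features`, `time_query`) and the dot-product scoring are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(vectors, query):
    """Average `vectors` with weights softmax(dot(vector, query))."""
    weights = softmax([dot(v, query) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def neighborhood_temporal_features(own_series, neighbor_series, time_query):
    """Hypothetical feature builder (an assumption, not the paper's method).

    1. Pool each disk's time series into one vector with temporal attention.
    2. Pool the neighbors' vectors with attention scored against the
       target disk's own pooled vector.
    3. Concatenate the disk's own vector with the neighborhood vector.
    """
    own_vec, _ = attention_pool(own_series, time_query)
    neighbor_vecs = [attention_pool(s, time_query)[0] for s in neighbor_series]
    neigh_vec, neigh_weights = attention_pool(neighbor_vecs, own_vec)
    return own_vec + neigh_vec, neigh_weights


# Toy data: two time steps, two status features per step (made-up values).
own = [[0.0, 1.0], [0.0, 3.0]]
neighbors = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 2.0], [0.0, 2.0]],
]
features, weights = neighborhood_temporal_features(own, neighbors, [0.0, 1.0])
```

In this toy setup the second neighbor's status pattern resembles the target disk's, so it receives the larger attention weight. In a trained model the query vectors would be learned rather than fixed by hand.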

Original language: English
Title of host publication: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
Publisher: Association for Computing Machinery, Inc
Pages: 1181-1191
Number of pages: 11
ISBN (electronic): 9781450383127
DOI
Publication status: Published - 3 Jun 2021
Externally published
Event: 30th World Wide Web Conference, WWW 2021 - Ljubljana, Slovenia
Duration: 19 Apr 2021 to 23 Apr 2021

Publication series

Name: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021

Conference

Conference: 30th World Wide Web Conference, WWW 2021
Country/Territory: Slovenia
City: Ljubljana
Period: 19/04/21 to 23/04/21
