Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

  • Peter Garraghan
  • Xue Ouyang
  • Renyu Yang*
  • David McKee
  • Jie Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena that substantially affect overall system performance. One such phenomenon is known as the 'Long Tail', whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5 percent of task stragglers impact 50 percent of total jobs for batch processes, and that 53 percent of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11 percent into their execution lifecycle with 95 percent accuracy for short-duration jobs.

Original language: English
Article number: 7572191
Pages (from-to): 91-104
Number of pages: 14
Journal: IEEE Transactions on Services Computing
Volume: 12
Issue number: 1
DOIs
State: Published - 1 Jan 2019

Keywords

  • Straggler
  • cloud
  • datacenter
  • distributed systems
  • root-cause analysis
