Title: Failure Diagnosis for Datacenter Applications
Advisors: Tom Anderson and Arvind Krishnamurthy
Supervisory Committee: Tom Anderson (Co-Chair), Arvind Krishnamurthy (Co-Chair), Sreeram Kannan (GSR, EE), and Xi Wang
Fast and accurate failure diagnosis remains a major challenge for datacenter operators. Datacenter applications are increasingly architected around loosely coupled, modular components: each component can scale and evolve independently and achieve higher performance and reliability in the common case. When failures occur, however, operators face significant challenges in diagnosing and locating them. First, component dependencies can be complex and are sometimes not well accounted for in failure diagnosis. Second, components can exhibit gray failures, that is, partial or intermittent degradation rather than a clean fail-stop, which confound failure diagnosis. Third, opaque routing in the network can further hinder failure diagnosis.
My thesis is that fast and accurate failure diagnosis for datacenter applications is possible using three key ideas:
(1) a global view of component interactions and dependencies,
(2) a penalized-regression-based algorithm that can localize both fail-stop and gray failures, and
(3) a predictable network architecture that allows for failure localization despite evolving in-network features.
I present preliminary results for two complementary systems that demonstrate this thesis. The first, Deepview, diagnoses virtual hard disk (VHD) failures in Infrastructure-as-a-Service (IaaS) clouds. Deepview composes a global view of the IaaS stack and uses an algorithm that integrates Lasso regression and hypothesis testing to localize component failures. Deepview is already deployed at one of the largest IaaS providers, where it has been shown to localize VHD failures to compute, storage, and network components accurately and in a timely fashion. The second, Volur, is a prototype predictable network architecture. It complements Deepview by proposing a design principle for datacenter networks that makes in-network routing predictable to end hosts, for both failure diagnosis and load balancing. In simulation and testbed experiments, Volur accurately localizes non-fail-stop link and switch failures and recovers from them quickly.
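As a rough illustration of the penalized-regression idea behind Deepview's localization step, the sketch below fits a non-negative Lasso by coordinate descent on a toy dependency matrix. It is not Deepview's actual model: the component names, the observed failure rates, the penalty value, and the non-negativity constraint are all assumptions made for illustration, and the hypothesis-testing step is omitted. Each row is one VM-to-storage path, each column marks whether the path depends on a component, and the response is the path's observed VHD failure rate.

```python
# Toy sketch (assumed setup, not Deepview's real model): blame components
# for path-level failure rates using a non-negative Lasso.

def lasso_nonneg(X, y, lam, iters=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*sum(b) subject to b >= 0,
    using cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            # One-sided soft threshold enforces sparsity and non-negativity.
            new = max(rho - lam, 0.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Hypothetical components: two compute clusters and two storage clusters.
components = ["compute-0", "compute-1", "storage-0", "storage-1"]

# Each row is one (compute, storage) path; 1 means the path uses the component.
X = [
    [1, 0, 1, 0],  # compute-0 -> storage-0
    [1, 0, 0, 1],  # compute-0 -> storage-1
    [0, 1, 1, 0],  # compute-1 -> storage-0
    [0, 1, 0, 1],  # compute-1 -> storage-1
]
# Assumed observations: only paths through storage-1 see VHD failures.
y = [0.0, 0.5, 0.0, 0.5]

beta = lasso_nonneg(X, y, lam=0.01)
blamed = components[max(range(len(beta)), key=lambda j: beta[j])]
print(blamed, [round(b, 3) for b in beta])
```

In this toy case the penalty drives the coefficients of the healthy components to zero and leaves a single non-zero coefficient on the storage cluster shared by the failing paths. A gray failure would appear as a smaller but still non-zero coefficient, which is where a follow-up significance test would be needed to separate it from noise.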