Title: Improving the Reliability and Efficiency of Data Center Networks

Advisors: Tom Anderson and Arvind Krishnamurthy

Supervisory Committee: Tom Anderson (Chair), Raadha Poovendran (GSR, EE), Arvind Krishnamurthy, and Shyam Gollakota

Abstract: Cloud computing owes much of its recent rise to the infrastructure on which it is built. Data center networks in particular have helped to enable the efficient and robust utilization of tens to hundreds of thousands of co-located servers. As applications and data sets continue to grow rapidly, the challenge for data center networks is to keep pace—not just by providing enough bandwidth, but also by lowering costs, increasing flexibility, and maintaining reliability.

This dissertation proposes that a key component in providing all of these properties simultaneously lies in the structure of the network. My thesis is that small, practical changes there can have large, far-reaching consequences: they not only provide bandwidth and efficiency, but also influence higher-level protocol design and grant properties like near-instantaneous failure recovery and incremental upgrades of server capacity.

I present two complementary and composable systems that explore the efficacy of this approach.

The first system, F10, is a co-design of the network topology and failover protocols that is the first to provide efficient, near-instantaneous, fine-grained, and localized recovery and rebalancing for common-case network failures. Our results show that following network link and switch failures, F10 has less than 1/7th the packet loss of current schemes.

The second system, Subways, explores an alternative method to add network capacity to servers. In particular, we investigate the various ways an operator can augment the network using multiple network links per server. Using a simulation-based methodology, we show that Subways offers substantial performance benefits for popular application workloads: up to a 3.1× speedup in MapReduce and a 2.5× throughput improvement in memcache for a fixed average request latency, relative to an equivalent-bandwidth network that differs only in its wiring.

Place: CSE 503

When: Wednesday, August 10, 2016 - 14:30 to 16:30