New datacenter network architectures

This year’s HotNets workshop was held over the past two days in the faculty club at NYU; it was nice being back on old turf.  HotNets has authors write 6-page “position” or “work-in-progress” papers on current “hot topics in networking” (surprise!).  Tucked into a cosy downstairs room, the workshop was nicely intimate and saw lots of interesting questions and discussion.

One topic of particular interest to me was datacenter networking; HotNets included two papers in each of two different research areas.

The first thematic area was addressing the problem of bisection bandwidth within the datacenter. The problem is that each rack in a datacenter may have 40 machines, each potentially generating several Gbps of traffic.  Yet all this traffic is typically aggregated at a single top-of-rack (ToR) switch, which is often the bottleneck in communicating with other racks.  (This is especially relevant for data-intensive workloads, such as MapReduce-style computations.)  Even if the ToR switch is configured with 4-8 10Gbps links, this alone can be insufficient.  For example, an all-L2 network, while providing easy auto-configuration and supporting VM mobility, cannot take advantage of this capacity:  even if multiple physical paths exist between racks, the Ethernet spanning tree will only use one.  In addition, the traditional method of performing L2 address resolution (ARP) is broadcast-based and thus not scalable.
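
To put rough numbers on the oversubscription problem, here is a back-of-the-envelope sketch in Python; the 2.5 Gbps per-server rate is my own illustrative assumption, while the rack size and uplink counts come from the paragraph above.

```python
# Quick back-of-the-envelope for the numbers above: 40 servers per rack, each
# offering a few Gbps, against 4-8 10Gbps uplinks on the ToR switch.
# The per-server rate (2.5 Gbps) is just an illustrative assumption.
servers_per_rack = 40
gbps_per_server = 2.5
for uplinks in (4, 8):
    offered = servers_per_rack * gbps_per_server   # traffic the rack can generate
    capacity = uplinks * 10                        # ToR uplink capacity, Gbps
    print(f"{uplinks} uplinks: {offered:.0f} Gbps offered vs "
          f"{capacity} Gbps out -> {offered / capacity:.1f}:1 oversubscription")
```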

Multiple solutions to this problem are currently being explored.  At SIGCOMM ’09 in late August, we saw three papers that proposed new L2/L3 networks to address it.  The first two leveraged the locator/ID split, so that virtual resources are named by some identifier independent of their location.

  • PortLand from UCSD effectively proposed a NAT layer for MAC addresses.  In PortLand, virtual machines running on physical hosts can keep a permanent MAC address (the identifier) even under VM migration, but their immediate upstream switch provides a location-specific “pseudo” MAC, or PMAC (the locator). Because these PMACs are allocated hierarchically, they can be efficiently aggregated in upstream switches.  PortLand assumes that the datacenter network has a classic fat-tree-like hierarchy between racks of servers, which is the typical architecture today.  Instead of routing a packet to a VM’s actual MAC address, PortLand forwards based on the PMAC; the VM’s upstream switch NATs between the PMAC and its permanent MAC. No end-host modifications are needed (although one can certainly envision the host’s hypervisor performing such NATting on the end-host itself before dispatching the packet to the unmodified VM). The sender is also unaware of this process, as its immediate upstream switch (resp. hypervisor) first translates the destination from MAC to PMAC before sending the packet out over the datacenter network. The remaining question is how to discover the MAC-to-PMAC binding; PortLand assumes a centralized lookup controller that stores these bindings and updates them under VM migration, much like the controller in Ethane. (In fact, PortLand was prototyped using OpenFlow, which came out of Ethane.)  A minimal sketch of this PMAC scheme appears after this list.
  • VL2 from MSR includes both L2 and L3 elements, unlike PortLand’s L2-only solution. Changhoon Kim, a recent PhD graduate from Princeton, presented this work (in fact, I served on the committee for his thesis, which included VL2). VL2 particularly focused on achieving both high bisection bandwidth and good fairness between competing flows.  Using data collected from Microsoft’s online services, they found a high degree of variance between flows.  The implication is that a static allocation of bandwidth wouldn’t be all that promising, unless one managed to fully provision for high bisection bandwidth everywhere, which would be quite expensive. One of the particularly nice things about their evaluation (and implementation)—certainly aided by the support of an industrial research lab!—is that it ran on a cluster consisting of 3 racks, 80 servers, and 10 switches, so it provided at least some limited scaling experience.
    On to specifics.  While PortLand uses “virtualized” Ethernet identifiers, VL2 assigns “virtualized” application-level IP addresses (AAs) to all nodes.  And rather than performing the equivalent of L2 NATting on the first-hop switches, VL2 uses unmodified switches but modifies end-hosts—in virtualized datacenter environments, it’s not terribly difficult to forklift-upgrade your servers’ OS—to perform location-specific address (LA) resolution.  This is the identifier/locator split.  Applications use location-independent AA addresses, but a module on each end-host resolves an AA to the location address (LA) of the destination’s top-of-rack switch, then encapsulates the AA packet within an LA-addressed IP header.  This module also intercepts ARP requests (which would otherwise be broadcast) and converts them into unicast lookups to a centralized directory service (much like in PortLand and Ethane).  While this addressing mechanism replaces broadcasts and serves as an indirection layer to support address migration, it doesn’t by itself provide good bisection bandwidth.  For that, VL2 uses both Valiant Load Balancing (VLB) and Equal-Cost Multi-Path (ECMP) routing to spread traffic uniformly across network paths.  VLB routes traffic between two endpoints via a random intermediate waypoint, while ECMP selects amongst multiple equal-cost paths.  With multiple, random intermediate waypoints, communication between two racks can follow multiple paths, and ECMP provides load balancing between those paths.  So, taken together, a packet from some source AA to a destination AA is first encapsulated with the LA address of the destination’s ToR switch, which in turn is encapsulated with the address of an intermediate switch for VLB.  This is shown in the figure to the left (taken from the VL2 paper), and a minimal sketch of the encapsulation follows this list. Interestingly, all of VL2’s mechanisms are built on backward-compatible network protocols (packet encapsulation, ECMP, OSPF, etc.).
  • BCube from MSR Asia specifically focused on scaling bisection bandwidth between nodes in shipping-container-based modular datacenters.  This is follow-on work to their DCell architecture from the previous year’s SIGCOMM.  BCube (and DCell) are, at their core, more complex interconnection networks (as compared to today’s fat trees) for wiring together and forwarding packets between servers.  They propose a connection topology that looks very much like a generalized hypercube.  So we are in fact seeing the rebirth of complex interconnection networks originally proposed for parallel machines like Thinking Machines’ CM-5; now it’s just for datacenters rather than supercomputers.  Actually wiring together such complex networks might be challenging, however, compared to today’s fat-tree architecture.
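
To make the PortLand description above more concrete, here is a minimal Python sketch of PMAC assignment and the ingress/egress MAC rewriting. The 48-bit PMAC layout (pod:16, position:8, port:8, vmid:16) follows the paper, but the FabricManager and EdgeSwitch classes, the dict-based “packets,” and the example addresses are hypothetical stand-ins, not PortLand’s actual implementation.

```python
# Minimal sketch of PortLand-style PMAC assignment and rewriting (illustrative only).
def make_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    """Pack the hierarchical fields (pod:16, position:8, port:8, vmid:16) into a 48-bit pseudo-MAC."""
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

class FabricManager:
    """Centralized directory (like the Ethane/OpenFlow controller): actual-MAC <-> PMAC bindings."""
    def __init__(self):
        self.amac_to_pmac = {}
    def register(self, amac, pmac):
        self.amac_to_pmac[amac] = pmac
    def resolve(self, amac):
        return self.amac_to_pmac.get(amac)

class EdgeSwitch:
    """Edge switch that assigns PMACs to attached VMs and rewrites packet headers."""
    def __init__(self, pod, position, fabric):
        self.pod, self.position, self.fabric = pod, position, fabric
        self.next_vmid = {}          # port -> next free vmid
        self.pmac_to_amac = {}       # for egress rewriting back to the real MAC

    def vm_attached(self, amac, port):
        vmid = self.next_vmid.get(port, 0)
        self.next_vmid[port] = vmid + 1
        pmac = make_pmac(self.pod, self.position, port, vmid)
        self.pmac_to_amac[pmac] = amac
        self.fabric.register(amac, pmac)
        return pmac

    def ingress(self, pkt):
        """Sender side: replace the destination's actual MAC with its PMAC."""
        pkt["dst"] = self.fabric.resolve(pkt["dst"]) or pkt["dst"]
        return pkt

    def egress(self, pkt):
        """Receiver side: NAT the PMAC back to the VM's permanent MAC."""
        pkt["dst"] = self.pmac_to_amac.get(pkt["dst"], pkt["dst"])
        return pkt

# Tiny walkthrough: a VM in pod 2 gets a PMAC; a packet addressed to its permanent
# MAC is rewritten on ingress and restored on egress.
fabric = FabricManager()
sw = EdgeSwitch(pod=2, position=5, fabric=fabric)
pmac = sw.vm_attached("de:ad:be:ef:00:01", port=3)
pkt = sw.ingress({"dst": "de:ad:be:ef:00:01", "payload": b"hello"})
assert sw.egress(pkt)["dst"] == "de:ad:be:ef:00:01"
print("PMAC for the VM:", pmac)
```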
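And here is a similarly minimal sketch of what VL2’s end-host module does: resolve the destination AA to its ToR’s LA via a unicast directory lookup, pick a random VLB waypoint, and doubly encapsulate the packet. The directory contents, switch addresses, and dict-based packet representation are made up for this example; the ECMP function is only an intuition for what switches along the path would compute.

```python
# Minimal sketch of VL2's end-host agent behavior (illustrative only).
import random
import zlib

DIRECTORY = {"10.1.0.7": "192.168.100.4"}     # AA -> LA of the destination's ToR switch
INTERMEDIATE_SWITCHES = ["192.168.200.1", "192.168.200.2", "192.168.200.3"]

def encapsulate(src_aa, dst_aa, payload):
    """Wrap an AA packet inside an LA header for the destination's ToR switch,
    and wrap that inside a header addressed to a random VLB intermediate switch."""
    dst_la = DIRECTORY[dst_aa]                        # unicast directory lookup replaces ARP broadcast
    waypoint = random.choice(INTERMEDIATE_SWITCHES)   # Valiant Load Balancing waypoint
    return {
        "outer_dst": waypoint,                        # first: bounce off the intermediate switch
        "middle_dst": dst_la,                         # then: deliver to the destination's ToR
        "inner": {"src": src_aa, "dst": dst_aa, "data": payload},
    }

def ecmp_path(src, dst, sport, dport, n_paths=4):
    """Illustrates how switches hash a flow's identifying fields to pick one of
    several equal-cost paths (the host doesn't compute this; shown for intuition)."""
    return zlib.crc32(f"{src}{dst}{sport}{dport}".encode()) % n_paths

pkt = encapsulate("10.1.0.3", "10.1.0.7", b"GET /")
print(pkt["outer_dst"], "->", pkt["middle_dst"], "->", pkt["inner"]["dst"])
print("ECMP would place this flow on path", ecmp_path("10.1.0.3", "10.1.0.7", 51000, 80))
```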

So these proposals all introduced new L2/L3 network architectures for the datacenter.  At HotNets, we saw more proposals for using different technologies, as opposed to new topologies (to borrow a phrase from here):

  • A group from Rice, CMU, and Intel Pittsburgh argued that the traditional electrically-switched network (i.e., Ethernet) in the datacenter should be complemented by an optically-switched network for bulk data distribution.  This proposal takes advantage of reconfigurable optical switches to quickly set up light-paths between particular hosts for large transfers.  The resulting system is a hybrid:  it uses the electrical, packet-switched network for small flows or those requiring low latency, but the optical, circuit-switched network for large flows that can tolerate the few milliseconds of delay needed to reconfigure the optical switches.  And these types of larger flows are especially prevalent in data-intensive workloads (e.g., scientific computing and MapReduce).  A rough sketch of this scheduling decision appears after this list.
  • A group from Microsoft Research focused on a similar problem—the paper’s presenter even joked that he could just skip his motivation slides.  But instead of proposing a hybrid electrical/optical network, they argued for a hybrid wired/wireless network, where wireless links provide “fly-ways” between servers.  Instead of using these additional links for large transfers (as in the above proposal), however, this work uses them to handle transient congestion on the existing wired network.  Because the links are wireless, one doesn’t need to physically wire them up in advance; the paper suggests that wireless connections in the 60GHz band might be especially useful, given prototypes that achieve 1-15 Gbps at distances of 4-10m.  The paper also discusses wired fly-ways that use additional switches to interconnect random subsets of ToR switches, but the talk seemed to focus on the wireless case.
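
To illustrate the intuition behind the hybrid electrical/optical proposal, here is a back-of-the-envelope sketch of the fabric-selection decision. The 10 ms circuit-setup cost, the link rates, and the simple break-even rule are illustrative assumptions of mine, not numbers or logic from the paper.

```python
# Rough sketch of the hybrid-fabric intuition: big, latency-tolerant transfers go
# over the reconfigurable optical circuits; everything else stays on packet-switched
# Ethernet.  All constants below are illustrative assumptions.

OPTICAL_RECONFIG_MS = 10          # assumed cost of setting up a light-path
ELECTRICAL_GBPS = 10
OPTICAL_GBPS = 100

def pick_fabric(flow_bytes: int, latency_sensitive: bool) -> str:
    if latency_sensitive:
        return "electrical"        # can't afford the circuit-setup delay
    # Gbps = 1e6 bits per millisecond, so transfer time in ms:
    electrical_ms = flow_bytes * 8 / (ELECTRICAL_GBPS * 1e6)
    optical_ms = OPTICAL_RECONFIG_MS + flow_bytes * 8 / (OPTICAL_GBPS * 1e6)
    return "optical" if optical_ms < electrical_ms else "electrical"

for size, sensitive in [(64 * 1024, True), (64 * 1024, False), (1_000_000_000, False)]:
    print(size, "bytes, latency-sensitive =", sensitive, "->", pick_fabric(size, sensitive))
```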

Either way, it’s interesting to see competing ideas for using different technologies to handle bisection-bandwidth problems (whether transient or persistent).

HotNets’ second thematic area of datacenter network papers considered managing all this new complexity.  There were two papers on NOX, a network controller for managing OpenFlow networks.

  • The first paper asked whether we need special networking technologies to support new datacenter architectures (the talk focused specifically on the problem of building VL2), or whether we could construct similar functionality with NOX and OpenFlow switches.  They found (perhaps not surprisingly) that NOX could be sufficient.
  • The second NOX paper focused on better support for virtualized end-hosts.  Open vSwitch is meant to work with various end-host virtualization technologies (Xen, KVM, etc.) and provide functionality for managing their virtual network interfaces (instead of, e.g., Xen’s simple Ethernet bridge).  Open vSwitch can be used, for example, to set up particular routes between VM instances or to provide encapsulation and tunneling between VMs.  The latter could enable L3 migration of VMs, with a VM’s old physical location forwarding packets to the VM’s new location (akin to Mobile IP).  Traditional VM migration, on the other hand, relies on ARP spoofing and is thus limited to migration between hosts on the same L2 network.  (A brief sketch of this kind of tunnel setup appears below.)
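
As a flavor of the per-VM plumbing Open vSwitch makes scriptable, here is a small hedged sketch that attaches a GRE tunnel port to a local bridge, e.g., to forward a migrated VM’s traffic from its old host to its new one. It assumes Open vSwitch’s ovs-vsctl tool is installed on the hypervisor; the bridge name, port name, and remote IP are hypothetical.

```python
# Hedged sketch: attach a GRE tunnel port to an existing Open vSwitch bridge so
# that traffic can be forwarded to a VM's new physical host.  Names and addresses
# below are hypothetical; assumes ovs-vsctl is available on the hypervisor.
import subprocess

def add_gre_tunnel(bridge: str, port_name: str, remote_ip: str) -> None:
    """Create a GRE tunnel port on the given Open vSwitch bridge."""
    subprocess.run(
        ["ovs-vsctl", "add-port", bridge, port_name,
         "--", "set", "interface", port_name, "type=gre",
         f"options:remote_ip={remote_ip}"],
        check=True,
    )

# e.g., forward a migrated VM's traffic from its old host toward its new one:
# add_gre_tunnel("br0", "gre-vm42", "10.0.5.17")
```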

This ability to perform more fine-grained network resource management is very interesting.  While most of the above papers (except the last) focus on supporting L2/L3 addressing and connectivity, our own SCAFFOLD project looks at the higher-level problem of supporting wide-area, distributed services.  Distributed services typically scale out by replicating functionality across many machines; for customer-facing services, some combination of additional mechanisms is used for server selection and failover:  DNS, BGP tricks and IP anycast, VRRP, VIP/DIP load balancers, ARP spoofing, etc.  All this complexity arises because, while clients are really trying to access a replicated service, the Internet only provides unicast communication between individual hosts.  Thus, SCAFFOLD explores building a network architecture that supports communication between services, instead of between network interfaces, and that explicitly supports churn (whether planned or unplanned) amongst the set of end-hosts composing a service.  I’ll expand on SCAFFOLD’s motivation and design in a future post.