CoralCDN Lesson: Interacting with virtualized and shared hosting services

In the previous post, I discussed how CoralCDN implemented bandwidth restrictions that were fair-shared between “customer” domains. There was another major twist to this problem, however, that I didn’t talk about: the challenge of performing such a technique on a virtualized and shared platform such as PlanetLab.  While my discussion is certainly PlanetLab-centric, its questions are also applicable to other P2P deployments where users run peers within resource containers, or to commercial hosting environments using billing models such as 95th percentile usage.

Interacting with hosting platforms

CoralCDN’s self-regulation works well in trusted environments, and this approach is used similarly in other peer-to-peer systems (e.g., BitTorrent and Tor) and server-side settings (e.g., Apache mod_bandwidth).  But when the resource provider (such as PlanetLab, a commercial hosting service, or peer-to-peer end-users) wants to enforce resource restrictions itself, rather than assume the software functions correctly, the situation becomes more challenging.

Many services run on top of PlanetLab; CoralCDN is only one of them.  Each such service is allocated a resource container (a “slice”) across all PlanetLab nodes.  A slice doesn’t provide the same level of isolation as a virtual machine instance, but it’s much more scalable in the number of slices per node (each per-node instantiation of a slice is called a sliver in PlanetLab).

PlanetLab began enforcing average daily bandwidth limits per sliver in 2006; prior to that, bandwidth usage was entirely self-enforced.  Since then, once a sliver hits 80% of its daily limit, the PlanetLab kernel enforces bandwidth caps (using Linux’s Hierarchical Token Bucket scheduler), calculated over five-minute epochs.  CoralCDN’s limit is 17.2 GB/day per sliver to the public Internet.
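
For illustration only, here is a rough sketch of how such a daily limit and epoch-based capping might interact.  The policy of spreading the remaining budget over the rest of the day is an assumption for exposition, not PlanetLab’s actual algorithm:

```python
# Illustrative sketch only: how a daily byte limit plus five-minute accounting
# epochs might translate into a rate cap once a sliver crosses 80% of its
# budget. The "spread the remainder over the rest of the day" policy is an
# assumption, not PlanetLab's actual implementation.

EPOCH_SECS = 5 * 60            # five-minute accounting epoch
DAILY_LIMIT_BYTES = 17.2e9     # CoralCDN's per-sliver daily limit
THROTTLE_AT = 0.8              # capping starts at 80% of the daily limit

def epoch_cap_bps(bytes_used_today, secs_left_in_day):
    """Return a bits-per-second cap for the next epoch, or None if uncapped."""
    if bytes_used_today < THROTTLE_AT * DAILY_LIMIT_BYTES:
        return None                                    # under the threshold
    remaining = max(DAILY_LIMIT_BYTES - bytes_used_today, 0.0)
    # Spread the remaining budget over the rest of the day; a kernel shaper
    # such as HTB would then enforce this rate for the coming epoch.
    return 8.0 * remaining / max(secs_left_in_day, EPOCH_SECS)
```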

So, we see here two levels of bandwidth control: admission control by CoralCDN proxies and rate limiting by the underlying hosting service. Even though CoralCDN uses a relatively conservative limit for itself (10 GB/day), it still surpasses the 80% mark (13.8 GB) of its hosting platform on 5–10 servers per day. And once this happens, these servers begin throttling CoralCDN traffic, leading to degraded performance.  The main cause of this overage is that, while CoralCDN counts successful HTTP responses, its hosting platform accounts for all traffic—HTTP, DNS, DHT RPCs, system log transfers, and other management traffic—generated by CoralCDN’s sliver.

Unfortunately, there do not appear to be sufficiently lightweight or simple user-space mechanisms for proper aggregate resource accounting.  Capturing all traffic via libpcap, for example, would be too heavy-weight for our purposes.  Furthermore, a service would often like to make its own policy-based queuing decisions based on application knowledge.  CoralCDN, for example, would give DNS traffic the highest priority, DHT RPCs next, then HTTP traffic, with log collection lowest.  This is difficult to achieve through application-level control alone, while using a virtualized network interface that pushes traffic back through a user-space network stack would be expensive.
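
To make the application-level side concrete, here is a hypothetical sketch of the kind of priority-ordered send queue CoralCDN would want (DNS ahead of DHT RPCs, then HTTP, then logs).  This is not CoralCDN’s code, and it also illustrates the limitation just mentioned: whatever order the application sends in, the kernel’s shaper still sees one aggregate flow.

```python
import heapq

# Hypothetical sketch of application-level send prioritization, not CoralCDN's
# actual code: queue outbound messages by traffic class and drain the most
# important ones first. The limitation noted above remains: once bytes reach
# the kernel, the platform's shaper treats them all as one aggregate flow.

PRIORITY = {"dns": 0, "dht_rpc": 1, "http": 2, "log_xfer": 3}   # lower = sooner

class PrioritizedSender:
    def __init__(self):
        self._heap = []      # entries: (priority, seq, payload, sock)
        self._seq = 0        # tie-breaker keeps FIFO order within a class

    def enqueue(self, traffic_class, payload, sock):
        heapq.heappush(self._heap,
                       (PRIORITY[traffic_class], self._seq, payload, sock))
        self._seq += 1

    def drain(self, byte_budget):
        """Send queued messages in priority order until the byte budget is spent."""
        while self._heap and byte_budget > 0:
            _, _, payload, sock = heapq.heappop(self._heap)
            sock.sendall(payload)
            byte_budget -= len(payload)
```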

CoralCDN’s experience suggests two desirable properties from hosting platforms that enforce resource containers.  First, these platforms should provide slivers with their current measured resource consumption in a machine-readable format and in real time.  Second, they should allow slices to express policies that affect how the underlying platform enforces resource containment.  While this pushes higher-level preferences into lower layers, such control cannot easily be achieved at the higher layers themselves (and is thus compatible with the end-to-end argument).  It might be as simple as exposing multiple resource abstractions for slices to use, e.g., multiple virtual network connections with different priorities.
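
To sketch what such an interface could look like, here is a purely hypothetical example of the two properties; none of the file paths or field names correspond to a real PlanetLab API:

```python
import json

# Purely hypothetical sketch of the two properties argued for above: a
# machine-readable, real-time view of a sliver's measured consumption, and a
# way to declare relative traffic priorities that the platform's shaper could
# honor. None of these paths or field names correspond to a real PlanetLab API.

def read_sliver_usage(path="/var/run/sliver_usage.json"):
    """Hypothetical platform-published counters, e.g., bytes sent so far today."""
    with open(path) as f:
        return json.load(f)    # e.g., {"tx_bytes_today": ..., "daily_limit": ...}

def declare_traffic_classes(path="/var/run/sliver_policy.json"):
    """Hypothetical policy file telling the platform how to order our traffic."""
    policy = {"classes": [
        {"name": "dns",     "priority": 0},
        {"name": "dht_rpc", "priority": 1},
        {"name": "http",    "priority": 2},
        {"name": "logs",    "priority": 3},
    ]}
    with open(path, "w") as f:
        json.dump(policy, f)
```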

Somewhat amusingly, one of CoralCDN’s major outages came from a PlanetLab misconfiguration that changed its bandwidth caps from GBs to MBs.  With all packets delayed for 30 seconds within the PlanetLab kernel, virtually all higher-layer protocols (e.g., DNS resolution for nyud.net) were timing out.  Such occasional misconfigurations are par for the course, and PlanetLab Central has been an amazing partner over the years.  Rather than criticize, my purpose here is simply to point out how increased information sharing can be useful.  In this instance, for example, exposed usage information would have told CoralCDN to shut down many unnecessary services, while policy-based QoS could have at least preserved DNS responsiveness.

Over-subscription and latency sensitivity

While CoralCDN faced bandwidth tensions, over-subscribed resources had latency implications as well.  With PlanetLab services facing high disk, memory, and CPU contention, and even additional traffic shaping in the kernel, applications experience both performance jitter and prolonged delays.  For example, application-level trace analysis of CoralCDN (in Chapter 6 of Rodrigo Fonseca’s PhD thesis) showed numerous performance faults that led to a highly variable client experience, while making normal debugging (“Why was this so slow?”) difficult.

These performance variations are certainly not restricted to PlanetLab, and they have been well documented in the literature across a variety of settings.  More recent examples have shown this in cloud computing settings.  For example, Google’s MapReduce found performance variations even among homogeneous components, which led to their speculative re-execution of work.  Recently, a Berkeley study of Hadoop on Amazon’s EC2 underscored how shared and virtualized deployment platforms provide new performance challenges.

[Figure: cluster-timings — CDFs of RTTs for RPCs between Coral nodes, by cluster level]

CoralCDN saw the implications of performance variations most strikingly with its latency-sensitive self-organization.  Coral’s DHT hierarchy, for example, was based on nodes clustering by network RTTs.  A node would join a cluster provided some minimum fraction (85%) of its members were below the specified threshold (30 ms for level 2, 80 ms for level 1).  The figure shows the measured RTTs for RPCs between Coral nodes, broken down by level (with vertical lines added at 30 ms, 80 ms, and 1 s).  While these graphs show the clustering algorithms meeting their targets and local clusters having lower RTTs, the heavy tail in all the CDFs is rather striking: fully 1% of RPCs took longer than 1 second, even within local clusters.
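
As a concrete restatement, here is a minimal sketch of that admission test, assuming the 85% fraction and per-level thresholds quoted above (this is not Coral’s actual source, just the rule expressed in Python):

```python
# Sketch of the clustering admission test as described above (assumed form,
# not Coral's actual source): join a cluster only if at least 85% of its
# members respond within the level's RTT threshold.

JOIN_FRACTION = 0.85
LEVEL_THRESHOLD_MS = {2: 30, 1: 80}    # per-level thresholds quoted above

def should_join(member_rtts_ms, level):
    """member_rtts_ms: measured RTTs (in ms) to the cluster's current members."""
    if not member_rtts_ms:
        return False
    near = sum(1 for rtt in member_rtts_ms if rtt <= LEVEL_THRESHOLD_MS[level])
    return near / len(member_rtts_ms) >= JOIN_FRACTION
```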

Another lesson from CoralCDN’s deployment was the need for stability in the face of performance variations, which are only worse in heterogeneous deployments.  This translated to the following rule in Coral: a node would switch to a smaller cluster only if fewer than 70% of the cluster’s members still satisfied its threshold, and would form a singleton only if fewer than 50% of its neighbors were satisfactory.  Before Coral employed this form of hysteresis, cluster oscillations were much more common (leading to many stale DHT references).  A related focus on stability helped improve virtual network coordinate systems for both PlanetLab and Azureus’s peer-to-peer deployment, so it’s an important property to consider when performing latency-sensitive self-organization.
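
Putting the join and leave thresholds side by side makes the hysteresis band explicit; again, this is a sketch of the rule as described above, not Coral’s implementation:

```python
# Sketch of the hysteresis described above (assumed form, not Coral's source):
# a node joins at 85% satisfaction, switches to a smaller cluster only below
# 70%, and forms a singleton only below 50%, so nodes whose RTTs hover around
# the threshold do not oscillate between clusters.

JOIN_FRACTION = 0.85
SWITCH_DOWN_FRACTION = 0.70
SINGLETON_FRACTION = 0.50

def next_action(satisfied_fraction, in_cluster):
    if not in_cluster:
        return "join" if satisfied_fraction >= JOIN_FRACTION else "stay out"
    if satisfied_fraction < SINGLETON_FRACTION:
        return "form singleton"
    if satisfied_fraction < SWITCH_DOWN_FRACTION:
        return "switch to smaller cluster"
    return "stay"   # between 70% and 85%: the hysteresis band, keep the cluster
```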

Next up…

In the next post, I’ll talk a bit about some of the major availability problems we faced, particularly because our deployment has machines with static IP addresses directly connected to the Internet (i.e., not behind NATs or load balancers).  In other words, the very model the Internet and traditional network layering were actually designed around…