CoralCDN Lesson: The design was mostly wrong

Most of my posts about CoralCDN to date have discussed techniques for making the system more robust; now I turn to what it got wrong. Nice as they are, many of these optimizations were in fact moot: CoralCDN’s design is ill-suited for its current deployment and usage.

[Figure: coral-uniq-reqs: distribution of requests per unique URL]

Let us frame this argument by first considering some usage statistics from CoralCDN’s deployment. The available aggregate data from 167 of the ~250 operating CoralCDN nodes during one recent, randomly chosen day (January 20, 2009) shows that these nodes received a total of 9.74M requests from 983K unique client IPs. These requests were for 596K unique URLs at 9,895 domains. The figure above plots the entire distribution of requests per URL: the most popular URL alone received 448K requests, while 420K URLs (a full 70%) received only a single request. These requests appear to follow the Zipf-like distribution common among web caching and proxy networks.
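For the curious, such a distribution is straightforward to tabulate from access logs. Here is a minimal sketch, assuming a simplified log with one requested URL per line; the log format and file name are illustrative, not CoralCDN’s actual logging:

    # Tabulate the requests-per-URL distribution from a simplified access log.
    from collections import Counter

    def rank_frequency(log_path):
        counts = Counter()
        with open(log_path) as f:
            for line in f:
                url = line.strip()
                if url:
                    counts[url] += 1
        # Most popular URL first; under a Zipf-like law, log(frequency)
        # falls roughly linearly in log(rank).
        freqs = sorted(counts.values(), reverse=True)
        return list(enumerate(freqs, start=1))  # [(rank, request_count), ...]

Plotting the output on log-log axes makes the Zipf-like shape easy to eyeball.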

[Table: coral-fetch-stats: working-set size versus fraction of total requests]

So, if a small number of pages are so popular, we can measure their working-set size to determine how much storage is actually required to handle a large fraction of the total requests. The table above provides such an analysis. The most popular 0.01% of URLs account for more than 38% of the total requests to CoralCDN, yet require only 52MB of storage; the top 1% of URLs account for almost 80% of requests, yet still require only 1.8GB of storage. Recall that each CoralCDN proxy on PlanetLab has a 3GB disk cache.
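The table’s numbers fall out of a simple cumulative computation. A minimal sketch, assuming per-URL records of the form (url, size_bytes, request_count); the record layout is an assumption for illustration:

    def working_set(records, top_fraction):
        # records: iterable of (url, size_bytes, request_count) tuples.
        # Returns the fraction of all requests served, and the cache bytes
        # required, if only the most popular top_fraction of URLs are cached.
        records = list(records)
        ranked = sorted(records, key=lambda r: r[2], reverse=True)
        total_requests = sum(r[2] for r in records)
        k = max(1, int(len(ranked) * top_fraction))
        hit_fraction = sum(r[2] for r in ranked[:k]) / total_requests
        storage_bytes = sum(r[1] for r in ranked[:k])
        return hit_fraction, storage_bytes

Run with top_fraction values of 0.0001 and 0.01 over the day’s data, this is the computation behind the 38%/52MB and 80%/1.8GB figures above.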

These workload distributions support one aspect of CoralCDN’s design: content should be cached locally by the “forward” CoralCDN proxy directly serving end-clients, given that small-to-moderate-sized caches in these proxies can serve a very large fraction of requests. This differs from the traditional DHT approach of storing data on only a small number of globally selected proxies, so-called “server surrogates”. (While not analyzed here, per-node local caching also accommodates geographic differences in request distributions, provided that clients interact with nearby proxies.)
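To make that concrete, here is a minimal sketch of the kind of byte-budgeted LRU cache each forward proxy might run; it is illustrative only, not CoralCDN’s actual 3GB disk-cache implementation:

    from collections import OrderedDict

    class ByteBudgetLRU:
        # In-memory stand-in for a proxy's disk cache: evicts the least
        # recently used entries once a byte budget is exceeded.
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.entries = OrderedDict()  # url -> response body

        def get(self, url):
            body = self.entries.get(url)
            if body is not None:
                self.entries.move_to_end(url)  # mark most recently used
            return body

        def put(self, url, body):
            if url in self.entries:
                self.used -= len(self.entries.pop(url))
            self.entries[url] = body
            self.used += len(body)
            while self.used > self.capacity and self.entries:
                _, evicted = self.entries.popitem(last=False)  # oldest first
                self.used -= len(evicted)

With a Zipf-like workload, even a budget far smaller than 3GB captures most hits, which is exactly what the table above shows.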

On the other hand, such a workload exposes unnecessary complexity in CoralCDN’s design: global cooperation through a highly scalable DHT indexing layer provides only marginal benefit to hit rates. We see this result in the following breakdown:

76.7% hit in local cache
9.8% returned 4xx or 5xx error code
7.0% fetched from origin site
6.3% fetched from other CoralCDN proxy
—  2.3% from level-2 cluster (local)
—  1.8% from level-1 cluster
—  2.2% from level-0 cluster (global)

In short, almost 77% of requests to proxies are satisfied locally, while only a little more than 6% result in cooperative transfers.  (The high rate of error messages is due to the bandwidth management of our underlying hosting service.)  A related result about the limits of cooperative caching had been observed earlier by Wolman et al., although from the perspective of client hit rates.  During the original design of CoralCDN, we had thought that this result might not be directly applicable to CoralCDN’s main goal, which was to decrease origin server load, not to increase client hit rate.  The concern was that non-cooperating groups of caches would each individually request content from the origin, potentially overloading it.

To assess the benefits of cooperative caching for reducing server load, consider the three main usage scenarios observed in CoralCDN:

  1. Random surfing: Recall that fully 70% of unique URLs through CoralCDN saw a single request.  These may come from posted server links that simply are unpopular, or from clients explicitly Coralizing links themselves (e.g., using client-side plugins) for a variety of reasons, such as (presumed) anonymity, censorship or filtering circumvention, or automated crawling.
  2. Resurrect this page: Users attempt to use CoralCDN to retrieve content that is otherwise unavailable due to an overloaded or unavailable origin server.  This commonly occurs after Coralized links are posted as backup links or in comments on portals like Slashdot:  “Site down? Try the Coral Cache.”
  3. Free bandwidth and flash crowds: The majority of requests to CoralCDN are for popular content, already widely cached, from a stable set of “customer” domains. Even its flash crowds occur on the order of minutes, not seconds.

These various use cases require different solutions, but CoralCDN’s design is not ideal for any of these cases:  It’s a jack-of-all-trades, master of none.

For unpopular content (use case #1), clients do not meaningfully change server-side load by using cooperative caching. Furthermore, the goals of anonymity and censorship or filtering circumvention are better served simply through open proxies or specialized tools such as Tor (which recently saw a significant uptick from its use by Iranian protesters).

For content that is unavailable long-term (use case #2), CoralCDN is not designed for archival storage and durability.  Over shorter time horizons, on the other hand, if clients directly overload the server before CoralCDN can retrieve a valid copy of the content, its cooperation is likewise to no avail.  In fact, even if CoralCDN has already retrieved a valid copy of the file, many origin servers (especially those deployed via third-party hosting) cause it to replace good content with bad (as I discussed earlier).

Finally, for popular content (use case #3), one could imagine alternative cooperative strategies that only rely on regional cooperation.  Coral’s clustering algorithms would still be used for self-organizing the network into local groups, but a simple one-hop DHT could be used for content discovery (via consistent hashing).  After all, the above data shows that cooperation between proxies on a global scale (level-0) is only used to satisfy 2.2% of requests.  Such a strategy would easily scale to a CoralCDN network that is at least one order of magnitude larger than its current deployment.  Alternatively, to simply maintain its current 200–400 node deployment on PlanetLab, having each node maintain connectivity and liveness information about all others would certainly result in improved performance compared to its current “scalable” design.
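To sketch what that one-hop discovery might look like: each proxy in a cluster maintains the full membership list and maps a URL onto the proxy responsible for it via consistent hashing. The proxy names and parameters below are illustrative, not part of CoralCDN:

    import bisect
    import hashlib

    def _point(key):
        # Map an arbitrary string onto the hash ring.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, proxies, vnodes=64):
            # Several virtual points per proxy smooth out load imbalance.
            self._ring = sorted(
                (_point("%s#%d" % (p, i)), p)
                for p in proxies for i in range(vnodes))
            self._points = [h for h, _ in self._ring]

        def lookup(self, url):
            # First ring point at or past the URL's hash, wrapping around:
            # a cache miss costs one direct query to the returned proxy,
            # rather than a multi-hop route through a Kademlia-style overlay.
            i = bisect.bisect(self._points, _point(url)) % len(self._ring)
            return self._ring[i][1]

    ring = ConsistentHashRing(["proxy-a.example", "proxy-b.example", "proxy-c.example"])
    print(ring.lookup("http://example.com/page"))

Adding or removing a proxy remaps only the keys adjacent to its points on the ring, so membership churn in a few-hundred-node deployment stays cheap.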

One might ask, however, whether CoralCDN remains the correct design for a truly Internet-scale cooperative CDN, where either the active working set or the number of participating proxies is several orders of magnitude larger.  Especially in the latter case, a single request from each distinct group could perhaps still overwhelm servers, which was our initial concern.  Unfortunately, a more “peer-to-peer” deployment (which is what CoralCDN’s algorithms are designed for) introduces a different problem if we wish to preserve CoralCDN’s compatibility with unmodified web clients and servers:  security.

There is nothing stopping malicious proxies from returning spam, malware, advertisements, or other modified content to unsuspecting web clients. I’ll return to this question of security and naming in a later post, but the overall implication seems to be a trade-off: either backwards compatibility with today’s Web and a smaller deployment on trusted infrastructure (such as PlanetLab), or a more peer-to-peer deployment that requires end-host adoption.  In fact, this last point is one of the strong motivations behind our ongoing work on Firecoral.
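To see why end-host adoption changes the security story, consider a client that can verify what it fetches. One possible approach (purely a sketch here, and not necessarily Firecoral’s mechanism) is self-certifying names: embed a content digest in the name and check the response against it:

    import hashlib

    def verify_response(body, expected_sha256_hex):
        # Accept the body only if it matches the digest embedded in its
        # self-certifying name. An unmodified browser cannot perform this
        # check, which is why this path requires end-host adoption.
        if hashlib.sha256(body).hexdigest() != expected_sha256_hex:
            raise ValueError("content does not match its self-certifying name")
        return body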

So that’s the downside of CoralCDN’s design.  But it did catch on beyond what we expected; mostly, I believe, because of its simple user interface, which requires no registration and little expertise, yet can be used in a variety of flexible, powerful ways.  In fact, our “customers’” interactions with CoralCDN portended the elastic use of cloud computing resources for (what some have called) surge computing.  I’ll discuss this in my next post.