Category Archives: Experiences

Erroneous DMCA notices and copyright enforcement, part deux

[Given my continued use of Ed’s Freedom-To-Tinker blog, I’m reposting this article from there.]

A few weeks ago, I wrote about a deluge of DMCA notices and pre-settlement letters that CoralCDN experienced in late August. This article actually received a bit of press, including MediaPost, ArsTechnica, TechDirt, and, very recently, Slashdot. I’m glad that my own experience was able to shed some light on the more insidious practices that are still going on under the umbrella of copyright enforcement. More transparency is especially important at this time, given the current debate over the Anti-Counterfeiting Trade Agreement.

Given this discussion, I wanted to write a short follow-on to my previous post.

The VPA drops Nexicon

First and foremost, I was contacted by the founder of the Video Protection Alliance not long after this story broke. I was informed that the VPA has not actually developed its own technology to discover users who are actively uploading or downloading copyrighted material, but rather contracts out this role to Nexicon. (You can find a comment from Nexicon’s CTO to my previous article here.) As I was told, the VPA was contracted by certain content publishers to help reduce copyright infringement of (largely adult) content. The VPA in turn contracted Nexicon to find IP addresses that are participating in BitTorrent swarms of those specified movies. Using the IP addresses given them by Nexicon, the VPA subsequently would send pre-settlement letters to the network providers of those addresses.

The VPA’s founder also assured me that their main goal was to reduce infringement, as opposed to collecting pre-settlement money. (And that users had been let off with only a warning, or, in the cases where infringement might have been due to an open wireless network, informed how to secure their wireless network.) He also expressed surprise that there were false positives in the addresses given to them (beyond said open wireless), especially to the extent that appropriate verification was lacking. Given this new knowledge, he stated that the VPA dropped their use of Nexicon’s technology.

BitTorrent and Proxies

Second, I should clarify my claims about BitTorrent’s usefulness with an on-path proxy. While it is true that the address registered with the BitTorrent tracker is not usable, peers connecting from behind a proxy can still download content from other addresses learned from the tracker. If their requests to those addresses are optimistically unchoked, they have the opportunity to even engage in incentivized bilateral exchange. Furthermore, the use of DHT- and gossip-based discovery with other peers—the latter is termed PEX, for Peer EXchange, in BitTorrent—allows their real address to be learned by others. Thus, through these more modern discovery means, other peers may initiate connections to them, further increasing the opportunity for tit-for-tat exchanges.

Some readers also pointed out that there is good reason why BitTorrent trackers do not just accept any IP address communicated to it via an HTTP query string, but rather use the end-point IP address of the TCP connection. Namely, any HTTP query parameter can be spoofed, leading to anybody being able to add another’s IP address to the tracker list. That would make them susceptible to receiving DMCA complaints, just we experienced with CoralCDN. From a more technical perspective, their machine would also start receiving unsolicited TCP connection requests from other BitTorrent peers, an easy DoS amplification attack.

That said, there are some additional checks that BitTorrent trackers could do. For example, if the IP query string or X-Forwarded-For HTTP headers are present, only add the network IP address if it matches the query string or X-Forwarded-For headers. Additionally, some BitTorrent tracker operators have mentioned that they have certain IP addresses whitelisted as trusted proxies; in those cases, the X-Forwarded-For address is used already. Otherwise, I don’t see a good reason (plausible deniability aside) for recording an IP address that is known to be likely incorrect.

Best Practices for Online Technical Copyright Enforcement

Finally, my article pointed out a strategy that I clearly thought was insufficient for copyright enforcement: simply crawling a BitTorrent tracker for a list of registered IP addresses, and issuing a infringement notice to each IP address. I’ll add to that two other approaches that I think are either insufficient, unethical, or illegal—or all three—yet have been bandied about as possible solutions.

  • Wiretapping: It has been suggested that network providers can perform deep-packet inspection (DPI) on their customer’s traffic in order to detect copyrighted content. This approach probably breaks a number of laws (either in the U.S. or elsewhere), creates a dangerous precedent and existing infrastructure for far-flung Internet surveillance, and yet is of dubious benefit given the move to encrypted communication by file-sharing software.
  • Spyware: By surreptitiously installing spyware/malware on end-hosts, one could scan a user’s local disk in order to detect the existence of potentially copyrighted material. This practice has even worse legal and ethical implications than network-level wiretapping, and yet politicians such as Senator Orrin Hatch (Utah) have gone as far as declaring that infringers’ computers should be destroyed. And it opens users up to the real danger that their computers or information could be misused by others; witness, for example, the security weaknesses of China’s Green Dam software.

So, if one starts from the position that copyrights are valid and should be enforceable—some dispute this—what would you like to see as best practices for copyright enforcement?

The approach taken by DRM is to try to build a technical framework that restricts users’ ability to share content or to consume it in a proscribed manner. But DRM has been largely disliked by end-users, mostly in the way it creates a poor user experience and interferes with expected rights (under fair-use doctrine). But DRM is a misleading argument, as copyright infringement notices are needed precisely after “unprotected” content has already flown the coop.

So I’ll start with two properties that I would want all enforcement agencies to take when issuing DMCA take-down notices. Let’s restrict this consideration to complaints about “whole” content (e.g., entire movies), as opposed to those DMCA challenges over sampled or remixed content, which is a legal debate.

  • For any end client suspected of file-sharing, one MUST verify that the client was actually uploading or downloading content, AND that the content corresponded to a valid portion of a copyrighted file. In BitTorrent, this might be that the client sends or receives a complete file block, and that the file block hashes to the correct value specified in the .torrent file.
  • When issuing a DMCA take-down notice, the request MUST be accompanied by logged information that shows (a) the client’s IP:port network address engaged in content transfer (e.g., a record of a TCP flow); (b) the actual application request/response that was acted upon (e.g., BitTorrent-level logs); and (c) that the transferred content corresponds to a valid file block (e.g., a BitTorrent hash).

So my question to the readers: What would you add to or remove from this list? With what other approaches do you think copyright enforcement should be performed or incentivized?

Update (12/21/2009): Discussion about post on Slashdot.

Erroneous DMCA notices and copyright enforcement: the VPA, BitTorrent, and me

I recently posted an article on Ed Felten’s group blog, Freedom to Tinker, which covers both technical and policy-related I.T. issues. It describes some technical issues and my recent experience with DMCA enforcement agencies, BitTorrent, and CoralCDN. I think our readers would be interested in it as well, as well as join in the lively conversation among commenters.

In the past few weeks, Ed has been writing about targeted and inaccurate copyright enforcement. While it may be difficult to quantify the actual extent of inaccurate claims, we can at least try to understand whether copyright enforcement companies are making a “good faith” best effort to minimize any false positives. My short answer: not really.

Let’s start with a typical abuse letter that gets sent to a network provider (in this case, a university) from a “copyright enforcement” company such as the Video Protection Alliance.

The rest of the article can be found here.

CoralCDN Lesson: The interface was right -or- Programming elastic CDN services

While my previous post argued that CoralCDN’s architecture design might not be ideal given its deployment, it has proven successful from the simple perspective of real-world use. Rather than any technical argument, we believe that the central reason for its adoption has been its simple user interface: Any URL can be requested through CoralCDN by appending nyud.net to its hostname.

Interface design

While superficially obvious, this interface design achieves several important deployment goals:

  • Transparency: Work with unmodified, unconfigured, and unaware web clients and web servers.
  • Deep caching: Support the automatic retrieval of embedded images or links also through CoralCDN when appropriate.
  • Server control: Not interfere with sites’ ability to perform usage logging or otherwise control how their content is served (e.g., via CoralCDN or directly).
  • Ad-friendly: Not interfere with third-party advertising, analytics, or other tools incorporated into a site.
  • Forward compatible: Be amenable to future end-to-end security mechanisms for content integrity or other end-host deployed mechanisms.

Consider an alternative, even simpler, interface design. RedSwoosh, Dijjer, FreeCache, and CoBlitz, among others, all embedded origin URLs within the URL’s relative path, e.g., http://nyud.net/example.com/file. Not only is HTTP parsing simpler, but their nameservers do not need to synthesize DNS records on the fly (unlike CoralCDN’s DNS servers for *.nyud.net) and can take better advantage of client-side DNS caching. Unfortunately, while such an interface supports the distribution of specifically named files, it fails to transparently load an HTML webpage: Any relative embedded links would lack the example.com prefix, and a proxy would thus be unable to identify to which origin domain it refers. (One alternative might be to try to rewrite pages to add such links, although active content such as javascript makes this notoriously difficult, even ignoring the goal of not modifying server content.)

CoralCDN’s approach, however, interprets relative links with respect to a page’s Coralized hostname, and thus transparently requests these objects through it as well. But because CoralCDN does not modify body content, all absolute URLs continue to point to their origin sites. Thus, third-party advertisements are largely unaffected, and origin servers can use simple web beacons to log clients. Origin sites retain control about how their content is displayed and, down the line, content may be amenable to verification through end-to-end content signatures (as in RFC2660) or web tripwire tricks.

An API for dynamic adoption

CoralCDN was envisioned with manual URL manipulation in mind, whether by publishers editing HTML, users typing Coralized URLs, or third-party posters to Usenet groups or web portals submitting Coralized URLs. After deployment, however, users soon began treating CoralCDN’s interface as an API for accessing CDN services.

On the client side, these techniques included simple browser extensions that offer “right-click” options to Coralize links or that provide a CoralCDN link when a page appears unavailable. They also ranged to more complex integration into frameworks like Firefox’s Greasemonkey. Greasemonkey allows third-party developers to write site-specific javascript code that, once installed by users, manipulates a site’s HTML content (usually through the DOM interface) whenever the user accesses it. CoralCDN scripts for Greasemonkey include ones that automatically rewrite links, or that add Coralized links (in text or via tooltips) to posted articles on Slashdot, digg, or other portals. CoralCDN was also integrated directly into a number of client-side software for podcasting, such as Plone’s Plodcasting, Juice Receiver, and Easypodcast. Given the view that podcasting served to “democratize” Internet radio broadcasting, this seemed to fit quite well with CoralCDN’s stated aims of “democratizing content publication”.

But perhaps the more interesting cases of CoralCDN integration are those on the server-side. In flash-crowd scenarios, smaller websites might become overloaded for a variety of reasons: bandwidth-limited to serve larger files (especially due to hosting contracts), CPU-limited given expensive scripts (e.g., PHP), or disk-limited given expensive database queries. At the same time, their webserver(s) can often still handle the network interrupt and processing overhead for simple HTTP requests. And further, websites often still want to get complete logs for all page accesses, especially given Referer headers. Given such scenarios, a common use of CoralCDN is for origin servers to directly receive an HTTP request, but respond with an HTTP redirect (302) to a Coralized URL that will serve the actual content.

This is as simple as installing a server plugin and writing a few lines of code. For example, the complete dynamic redirection rule using Apache’s mod_rewrite plugin is the following:

   RewriteEngine on
   RewriteCond %{HTTP_USER_AGENT !^CoralWebPrx
   RewriteCond %{QUERY_STRING !(^|&)coral-no-serve$
   RewriteRule ^(.*)$ http://%{HTTP_HOST.nyud.net %{REQUEST_URI [R,L]

while similar plugins and scripts exist for other web platforms (e.g., the WordPress blogging suite).

Redirection rules must be crafted somewhat carefully, still. In the above example, the second line checks whether the client is a CoralCDN proxy and thus should be served directly. Otherwise, a redirection loop could be formed. Numerous server misconfigurations have omitted such checks; thus, CoralCDN proxies check for potential loops and return errors if present. Amusingly, some early users during CoralCDN’s deployment caused recursion in a different way. By submitting URLs with many copies of nyud.net appended to the hostname suffix:

   http://example.com.nyud.net.nyud.net....nyud.net/

they created a form of amplification attack against CoralCDN. This single request caused a proxy to issue a number of requests, stripping the last instance of nyud.net off in each iteration. Such requests are now rejected.

While the above dynamic rewriting rules apply for all content, other sites incorporate URL Coralization in more inventive ways:

   RewriteCond %{HTTP_REFERER slashdot\.org [NC]
   RewriteCond %{HTTP_REFERER digg\.com [NC,OR]
   RewriteCond %{HTTP_REFERER blogspot\.com [NC,OR]

Combined with the above, these rules redirect clients to CoralCDN if and only if the requester originates from particular high-traffic portals. In Apache, such rules can be specified in .htaccess files and thus do not require administrative privileges. Other sites have even combined such tools with server plugins that monitor server load and bandwidth use, so that their servers only start rewriting requests under high load conditions.

These examples have shown users innovate with CoralCDN’s simple interface, which can be accessed like any other URL resource. We have even recently seen Coralized URLs being dynamically constructed within client-side Flash ActionScript. Indeed, CoralCDN’s most popular domain as of January 2009 was a Tamil imitation of YouTube that loads Coralized URLs from Flash animations of “recently watched” videos.

An Elastic Computing Resource

One of the most interesting aspects of these developments has been the adoption of CoralCDN as an elastic resource for content distribution, long before the term “cloud computing” was popularized and Amazon began offering CDN and other “surge” services.  Through completely automated means, work can get dynamically expanded out to use CoralCDN when websites require additional bandwidth resources, and contracted back when flash crowds abate. Still without prior registration, sites can even specify between several options on how they would like CoralCDN to handle their requests. X-Coral-Control headers returned by webservers provide in-band signaling and are saved as cache meta-data, such as whether to “redirect home” when domains exceed their bandwidth limits (per our previous post). But again, this type of control illustrates CoralCDN’s interface as a programmable API.

Admittedly, CoralCDN can provide free service (and avoid registration) because it operates on a deployment platform, PlanetLab, comprised of volunteer research sites.  On the flip side, CoralCDN’s popularity led it to quickly overwhelm the bandwidth resources allocated to PlanetLab by affiliated sites, leading to the fair-sharing mechanisms we described earlier.  My next (and final) post about our experiences with CoralCDN asks whether we should just move off a trusted platform like PlanetLab and accept untrusted operators.

CoralCDN Lesson: The design was mostly wrong

Most of my posts about CoralCDN to date have discussed techniques to make the system more robust; now I discuss what it got wrong.  While nice, many of these optimizations were in fact moot: CoralCDN’s design is ill-suited for its current deployment and usage.

coral-uniq-reqsLet us frame this argument by first considering some usage statistics from CoralCDN’s deployment.  The available aggregate data from 167 of the ~250 operating CoralCDN nodes during one recent, randomly-chosen day (January 20, 2009) shows that these nodes received a total of 9.74M requests from 983K unique client IPs.  These requests were for 596K unique URLs at 9,895 domains.  The figure on the left plots the entire distribution of requests per URL: The most popular URL received 448K requests itself, while 420K URLs (a full 70%) received a single request.  These requests appear to follow the Zipf-like distribution common among web caching and proxy networks.

coral-fetch-statsSo, if a small number of pages are so popular, we might measure their working set size, to determine how much storage is actually required to handle a large fraction of the total requests.  This table provides such an analysis.  The most popular 0.01% of URLs account for more than 38% of the total requests to CoralCDN, yet require only 52MB of storage. 1% of URLs account for almost 80% of requests, yet still require only 1.8GB of storage.  Recall that each CoralCDN proxy on PlanetLab has a 3GB disk cache.

These workload distributions support one aspect of CoralCDN’s design: Content should be locally cached by the “forward” CoralCDN proxy directly serving end-clients, given that small to moderate size caches in these proxies can serve a very large fraction of requests.  This differs from the traditional DHT approach of just storing data on a small number of globally-selected proxies, so-called “server surrogates”.  (While not analyzed here, per-node local caching also supports geographic differences in request distributions, provided that clients interact with nearby proxies.)

On the other hand, such a workload points out the unnecessary complexity in CoralCDN’s design: Global cooperation through a highly-scalable DHT indexing layer has marginal benefit to hit rates. We see this result in the following breakdown:

76.7% hit in local cache
9.8% returned 4xx or 5xx error code
7.0% fetched from origin site
6.3% fetched from other CoralCDN proxy
—  2.3% from level-2 cluster (local)
—  1.8% from level-1 cluster
—  2.2% from level-0 cluster (global)

In short, almost 77% of requests to proxies are satisfied locally, while only a little more than 6% result in cooperative transfers.  (The high rate of error messages is due to the bandwidth management of our underlying hosting service.)  A related result about the limits of cooperative caching had been observed earlier by Wolman et al, but instead from the perspective of client hit rates.  During the original design of CoralCDN, we had thought that this result may not be directly applicable to CoralCDN’s main goal, which was to decrease origin server load, not increase client hit rate.  The concern was that non-cooperating groups of caches would each individually request content from the origin, potentially overloading it.

To assess the benefits of cooperative caching to reduce server load, consider the three main usage scenarios observed in CoralCDN:

  1. Random surfing: Recall that fully 70% of unique URLs through CoralCDN saw a single request. These may be from posted server links that are unpopular.  Or they may be from clients explicitly Coralizing links themselves (e.g., using client-side plugins) for a variety of reasons, such as (presumed) anonymity, censorship or filtering circumvention, or automated crawling.
  2. Resurrect this page: Users attempt to use CoralCDN to retrieve content that is otherwise unavailable due to a overloaded or unavailable origin server.  This commonly occurs after Coralized links are posted as backup links or in comments on portals like Slashdot:  “Site downTry the Coral Cache
  3. Free bandwidth and flash crowds: The majority of requests to CoralCDN are for popular content, already widely cached, from a stable set of “customer” domains. Even its flash crowds occur on the order of minutes, not seconds.

These various use cases require different solutions, but CoralCDN’s design is not ideal for any of these cases:  It’s a jack-of-all-trades, master of none.

For unpopular content (use case #1), clients do not meaningfully change server-side load by using cooperative caching. Furthermore, the goal of censorship or anonymity circumvention can be better served simply through open proxies or specialized tools such as Tor (which recently saw a significant uptick from its use by Iranian protesters).

For long-term unavailable content (use case #2), CoralCDN is not designed for archival storage and durability.  For short-time horizons, on the other hand, if clients directly overload the server before CoralCDN can retrieve a valid copy of the content, its cooperation is also to no avail.  In fact, even if CoralCDN has already retrieved a valid copy of the file, many origin servers—especially those deployed via third-party hosting—cause it to replace good content with bad (as I discussed earlier).

Finally, for popular content (use case #3), one could imagine alternative cooperative strategies that only rely on regional cooperation.  Coral’s clustering algorithms would still be used for self-organizing the network into local groups, but a simple one-hop DHT could be used for content discovery (via consistent hashing).  After all, the above data shows that cooperation between proxies on a global scale (level-0) is only used to satisfy 2.2% of requests.  Such a strategy would easily scale to a CoralCDN network that is at least one order of magnitude larger than its current deployment.  Alternatively, to simply maintain its current 200–400 node deployment on PlanetLab, having each node maintain connectivity and liveness information about all others would certainly result in improved performance compared to its current “scalable” design.

One might ask, however, if CoralCDN remains the correct design for a truly Internet-scale cooperative CDN, where either the active working set or the number of participating proxies are several orders of magnitude larger.  Especially in the case of the latter, a single request from each distinct group could perhaps still overwhelm servers, which was our initial concern.  Unfortunately, a more “peer-to-peer” deployment—which is what CoralCDN’s algorithms are designed for—introduces a different problem if we wish to preserve CoralCDN’s compatibility with unmodified web clients and servers:  security.

There is nothing stopping malicious proxies from returning spam, malware, advertisements, or other modified content to unsuspecting web clients. I’ll return to this question of security and naming in a later post, but the overall implication seems to be a trade-off: either backwards compatible with today’s Web and a smaller deployment on trusted infrastructure (such as PlanetLab), or a more peer-to-peer deployment that requires end-host adoption.  In fact, this last point is one of the strong motivations behind our ongoing work on Firecoral.

So that’s the downside of CoralCDN’s design.  But it did catch on beyond what we expected; mostly, I believe, because of its simple user interface that requires no registration and little knowledge, yet was eventually used in a variety of flexible, powerful ways.  In fact, our “customers'” interactions with CoralCDN portended the elastic use of cloud computing resources for (what some have called) surge computing.  I’ll discuss this in my next post.

CoralCDN Lesson: Interacting with virtualized and shared hosting services

In the previous post, I discussed how CoralCDN implemented bandwidth restrictions that were fair-shared between “customer” domains. There was another major twist to this problem, however, that I didn’t talk about: the challenge of performing such a technique on a virtualized and shared platform such as PlanetLab.  While my discussion is certainly PlanetLab-centric, its questions are also applicable to other P2P deployments where users run peers within resource containers, or to commercial hosting environments using billing models such as 95th percentile usage.

Interacting with hosting platforms

CoralCDN’s self-regulation works well in trusted environments, and this approach is used similarly in other peer-to-peer (e.g., BitTorrent and tor) and server-side (e.g., Apache mod_bandwidth) environments.  But when the resource provider (such as PlanetLab, a commercial hosting service, or peer-to-peer end-users) wants to enforce resource restrictions, rather than assume the software functions correctly, the situation becomes more challenging.

Many services run on top of PlanetLab; CoralCDN only being one of them.  Each of these instances is allocated a resource container (a “slice”) across all PlanetLab nodes.  This doesn’t have the same level of isolation as a virtual machine instance, but its much more scalable (in number of slices per node, each per-node slice called in sliver in PlanetLab).

PlanetLab began enforcing average daily bandwidth limits per sliver in 2006; prior to that, sliver usage was all self-enforced. Thereafter, however, when a sliver hit 80% of its daily limit, the PlanetLab kernel began enforcing bandwidth caps (using Linux’s Hierarchical Token Bucket scheduler) as calculated over five-minute epochs.  CoralCDN’s daily limit is 17.2 GB/day per sliver to the public Internet.

So, we see here two levels of bandwidth control: admission control by CoralCDN proxies and rate limiting by the underlying hosting service. Even though CoralCDN uses a relatively conservative limit for itself (10 GB/day), it still surpasses the 80% mark (13.8 GB) of its hosting platform on 5–10 servers per day. And once this happens, these servers begin throttling CoralCDN traffic, leading to degraded performance.  The main cause of this overage is that, while CoralCDN counts successful HTTP responses, its hosting platform accounts for all traffic—HTTP, DNS, DHT RPCs, system log transfers, and other management traffic—generated by CoralCDN’s sliver.

Unfortunately, there does not appear to be sufficiently lightweight or simple user-space mechanisms for proper aggregate resource accounting.  Capturing all traffic via libpcap, for example, would be too heavy-weight for our purposes.  Furthermore, a service would often like to make its own policy-based queuing decisions based on application knowledge.  For example, CoralCDN would prioritize DNS traffic before DHT RPCs, HTTP traffic next, and log collection of lowest priority. This is difficult through application-level control alone, while using a virtualized network interface that pushes traffic back through a user-space network stack would be expensive.

CoralCDN’s experience suggests two desirable properties from hosting platforms that enforce resource containers.  First, these platforms should provide slivers with their current measured resource consumption in a machine-readable format and in real time.  Second, these platforms should allow slices to express policies that affect how the underlying platform enforces resource containment. While this pushes higher-level preferences into lower layers, such behavior is not easily performed at these higher layers (and thus compatible with the end-to-end argument).  And it might be as simple as exposing multiple resource abstractions for slices to use, e.g., multiple virtual network connections with different priorities.

Somewhat amusingly, one of CoralCDN’s major outages came from a PlanetLab misconfiguration that changed its bandwidth caps from GBs to MBs.  As all packets were delayed for 30 seconds within the PlanetLab kernel, virtually all higher-layer protocols (e.g., DNS resolution for nyud.net) were timing out.  Such occasional misconfigurations are par for the course, and PlanetLab Central has been an amazing partner over the years.  Rather than criticize, however, my purpose is to simply point out how increased information sharing can be useful.  In this instance, for example, exposing information would have told CoralCDN to shut down many unnecessary services, while policy-based QoS could have at least preserved DNS responsiveness.

Over-subscription and latency sensitivity

While CoralCDN faced bandwidth tensions, there were latency implications with over-subscribed resources as well.  With PlanetLab services facing high disk, memory, and CPU contention, and even additional traffic shaping in the kernel, applications face both performance jitter and prolonged delays.  For example, application-level trace analysis performed on CoralCDN (in Chapter 6 of Rodrigo Fonseca’s PhD thesis) showed numerous performance faults that led to a highly variable client experience, while making normal debugging (“Why was this so slow?”) difficult.

These performance variations are certainly not restricted to PlanetLab, and they have been well documented in the literature across a variety of settings.  More recent examples have shown this in cloud computing settings.  For example, Google’s MapReduce found performance variations even among homogeneous components, which led to their speculative re-execution of work.  Recently, a Berkeley study of Hadoop on Amazon’s EC2 underscored how shared and virtualized deployment platforms provide new performance challenges.

cluster-timingsCoralCDN saw the implications of performance variations most strikingly with its latency-sensitive self-organization.  Coral’s DHT hierarchy, for example, was based on nodes clustering by network RTTs. A node would join a cluster provided some minimum fraction (85%) of its members were below the specified threshold (30 ms for level 2, 80 ms for level 1).  This figure shows the measured RTTs for RPC between Coral nodes, broken down by levels (with vertical lines added at 30 ms, 80 ms, and 1 s). While these graphs show the clustering algorithms meeting their targets and local clusters having lower RTTs, the heavy tail in all CDFs is rather striking.  Fully 1% of RPCs took longer than 1 second, even within local clusters.

Another lesson from CoralCDN’s deployment was the need for stability in the face of performance variations, which are only worse in heterogeneous deployments.  This translated to the following rule in Coral.  A node would switch to a smaller cluster if fewer than 70% of a cluster now satisfy its threshold, and form a singleton only if fewer than 50% of neighbors are satisfactory.  Before leveraging this form of hysteresis, cluster oscillations were much more common (leading to many stale DHT references).  A related focus on stability helped improve virtual network coordinate systems for both PlanetLab and Azureus’s peer-to-peer deployment, so it’s an important property to consider when performing latency-sensitive self-organization.

Next up…

In the next post, I’ll talk a bit about some of the major availability problems we faced, particularly because our deployment has machines with static IP addresses directly connected to the Internet (i.e., not behind NATs or load balancers).  In other words, the model that the Internet and traditional network layering was actually designed with in mind…

CoralCDN Lesson: Fair-sharing bandwidth via admission control

For commercial CDNs and other computing services, the typical answer to resource limits is simply to acquire more capacity.  As CoralCDN’s deployment on PlanetLab does not have that luxury, we instead apply admission control to manage its bandwidth resources.  This post describes some of these mechanisms, while we’ll take a step back in the next post to describe some of the challenges in doing resource accounting and management on a virtualized and shared platform such as PlanetLab.

asiantsunamivideos

Following the Asian tsunami of December 2004, and with YouTube yet to be created, CoralCDN distributed large quantities of amateur videos of the natural disaster.  With no bandwidth restrictions on PlanetLab at the time, CoralCDN’s network traffic to the public Internet quickly spiked.  PlanetLab sites threatened to pull their servers off the network if such use could not be curtailed.  It was agreed that the service should restrict its usage to approximately 10GB per node per day.

One could limit a proxy’s bandwidth use by rate-throttling all traffic (as in BitTorrent and tor), or by simply shutting a proxy down after exceeding its configured daily capacity (also supported by tor).  But if CoralCDN primarily provides a service for websites, as opposed to for clients, then perhaps it should provide some notion of fairness across origin domains. While this problem may seem difficult given that CoralCDN can interact with 10,000s of domains over the course of the day, our usage patterns greatly simplify the accounting.

forbiddenThe figure on the left shows the total number of requests per domain that CoralCDN received over one recent day in January, 2009 (the solid top line).  The distribution clearly has some very popular domains: the most popular one (Tamil YouTube) received 2.6M requests, the second most popular (a Brazilian forum on Japanese manga) received 1.2M requests, the third (screensaver videos of user-contributed “fractal flames”) received 880K requests, while the remaining distribution fell off in a Zipf-like manner. Given that CoralCDN’s traffic is dominated by a limited number of domains, its mechanisms can serve mainly to reject requests for (i.e., perform admission control on) the bandwidth hogs.  Still, CoralCDN should differentiate between peak limits and steady-state behavior to allow for flash crowds or changing traffic patterns. To achieve these aims, each CoralCDN proxy implements an algorithm that attempts to simultaneously (1) provide a hard-upper limit on peak traffic per hour (configured to 1000 MB per hour), (2) bound the expected total traffic per epoch in steady state (400 MB per hour), and (3) bound the steady-state limit per domain.  As setting this last limit statically—such as 1/k-th of the total traffic if there are k popular domains—would lead to good fairness but poor utilization (given the non-uniform distribution across domains), we dynamically adjust this last traffic limit to balance this trade-off.

During each hour-long epoch, a proxy records the total number of bytes transmitted for each domain.  It also calculates domains’ average bandwidth as an exponentially-weighted moving average (attenuated over one week), as well as the total average consumption across all domains.   Across epochs, bandwidth usage is only tracked, and durably stored, for the top-100 domains (although more traditional HTTP logs do store information about all requests).  If a domain is not currently one of the top-100 bandwidth users, its historical average bandwidth is set to zero (providing leeway to sites experiencing flash crowds).

The per-domain dynamic bandwidth limit is calculated at the beginning of each epoch for each of the top-100 domains as follows.  The proxy computes the sum of all top domains’ average bandwidths. If this sum exceeds the total steady-state limit, it reduces the per-domain limit until the point when, if that limit had been enforced in the prior time period, the bandwidth sum would have been <= the permitted total limit.  (This limit thus first affects the top-ranked domain, then affects the top-two domains, etc.)

When a particular requested domain is over its hourly budget (case 3 above), CoralCDN proxies respond with 403 Forbidden messages. If instead the proxy is over its peak or steady-state limit calculated over all domains (cases 1 or 2 above), then the proxy redirects the client back to the origin site, and the proxy temporarily making itself unavailable for new client requests, which would be rejected anyway.  (If clients are to be redirected back to the origin, a proxy appends the query-string coral-no-serve on the location URL returned to the client.  Origins that use redirection scripts with CoralCDN check for such a string prevent loops.)  Although not the default, operators of some sites preferred this redirection home even if their particular domain was to blame. Domains can specify this type of control in a X-Coral-Control response header (which we’ll discuss a bit later).

bwusageBy applying  these mechanisms, CoralCDN reduces its bandwidth consumption to manageable levels.  While its demand sometimes exceeds 10TBs per day, its actual HTTP traffic remains steady at about 2TB per day after rejecting a significant number of requests.  The scatter plot in the above figure shows the number of requests resulting in 403 responses per domain, most due to these admission control mechanisms.  We see how variances in domains’ object sizes yield different rejection rates.  The second-most popular domain mostly serves gifs smaller than 10KB and experiences a rejection rate of 3.3%.  Yet the fractal videos of the third-most popular domain are typically 5MB in size, and it sees an 89% rejection rate. Ranking these domains instead by bandwidth served, the figure on the left plots the average hourly traffic that four proxies permitted each of the top-100 bandwidth consumers.

Note that this accounting mechanism is applied independently on each node.  This brings up two interesting questions worth further consideration:

  1. Would the system behave differently if its accounting shared information between nodes?  One potential result from not sharing information is penalizing content with local interest (that is, files that may be of interest only to more regional Internet populations).  For if a system’s server-selection mechanisms actually redirect clients to more localized proxies, then most users from a particular region might only see a smaller subset of the total number of servers.  On the other hand, if our goal is more to prevent the total bandwidth hogs, rather than provide any strict notion of fairness, does this problem arise significantly in practice?
  2. From a more research perspective, how can one efficiently perform (approximate) accounting across the entire network, yet do so without centralization and in a bandwidth-efficient manner.  Work  in SIGCOMM ’07 called Cloud Control with Distributed Rate Limiting considered a similar problem, but effectively shared O(kN) information across the system via gossiping (where k is the number of domains we’re monitoring and N the number of nodes).  Can we provide accurate/precise rate limiting with o(kN) communication complexity?  If anybody has any pointers to effective algorithms here, please leave a comment.  It feels like the sensor network community probably has addressed this problem at some point when aggregating and collecting information.

In the next post, we’ll ask whether a hosting platform need bother provide any special interfaces or information to support such accounting.

CoralCDN Lesson: Accepting conservatively and serving liberally

At its heart, CoralCDN provides a caching serving, not a persistent data store.  Thus, it ultimately requires that a URL’s origin server is initially available, so that it can pull in content to some CoralCDN proxy and make it available across the network.   While traditional web proxies normally interact with sufficiently-provisioned or otherwise well-behaved origin webservers, CoralCDN experiences a different norm.  Given its very design goals, its proxies typically interact with overloaded or poorly-behaving servers; it therefore needs to react to (non-crash) failures as the rule, not the exception.  Thus, one design philosophy that has come to govern CoralCDN proxies’ behavior—proxies should accept content conservatively and serve results liberally—is the exact opposite of Postel’s Law.

Consider the following situation, fairly normal for CoralCDN.  A portal like slashdot.org or digg.com first runs a story that links to a third-party website, driving a sudden influx of readers to this previously unpopular site.  Then a user posts a Coralized link to the third-party site as a “comment” to the portal’s story, providing an alternate means to fetch the content. Several situations are possible in such scenarios, all demonstrative of different ways which CoralCDN must handle origin failures.

  1. The website’s origin server becomes unavailable before any proxy downloads its content.
  2. CoralCDN already has a copy of the content, but requests arrive to it after the content’s expiry time has passed.  Unfortunately, subsequent HTTP requests to the origin webserver result in failures or errors.
  3. CoralCDN’s content is again expired, but subsequent requests to the origin yield only partial transfers.

We next consider how CoralCDN’s mechanisms handle these different situations.

Tackling #1:  Negative result caching

CoralCDN may be hit with a flood of requests for an inaccessible URL (e.g., DNS resolution fails, TCP connections timeout, etc.).  For these situations, proxies maintain a local negative result cache about repeated failures.  Otherwise, we have seen resource exhaustion on both proxies and their local DNS resolvers, given flash crowds to apparently dead sites. While more a usability issue than a resource-exhaustion concern, CoralCDN even receives requests for some Coralized URLs several years after their origins became unavailable:  The “dead links” problem of the web, but one that would otherwise cause our resources to get unnecessarily tied up.

Tackling #2:  Serving stale content

CoralCDN proxies mostly obey content expiry times, as specified by Cache-Control or Expires headers, with a default expiry of 12 hours.  If cached content expires, proxies perform a conditional request (If-Modified-Since) to revalidate or update their content. What happens, however, if an origin server fails to respond, or simply returns some temporary error condition?  Rather than return an error, proxies return stale content. Specifically, if the origin responds with many 400-level (Forbidden, Not Found, Timeout) or 500-level (Internal Server Error, Service Unavailable, Gateway Timeout) errors, a proxy will serve stale data for up to 24 hours after it expires.

This trade-off will not satisfy every situation.  Is a Forbidden message due to the website publisher seeking to make the content unavailable, or it is caused by the website going over its daily bandwidth quota and its hosting service returning an error?  Does a “File Not Found” indicate whether the condition is temporary (from a PHP or database error) or permanent (from a third-party issuing a DMCA take-down notice to the website)?  Indeed, such ambiguity led to the introduction of 410 (Gone) messages in HTTP/1.1, denoting permanence, which does result in the eviction of content from our caches. CoralCDN has experienced all these situations, and the difficulty is that many status codes are inherently ambiguous.

Unfortunately, we have also seen many situations caused by semantically-incorrect server responses.  These are often generated by poorly-written PHP or other server-side scripts.  Too often do our servers receive a 200 (OK) message with humanly-readable body content to the tune of “an error occurred.”  Or, common for virtually-hosted websites, a redirect (302) will lead to a generic error page (of type 200) reporting that “the website has exceeded their bandwidth allotment,” Both situations, unfortunately, result in CoralCDN replacing valid content with less useful information.

Tackling #3:  Whole-file overwrites

Finally, consider when the CoralCDN proxy is already caching an expired file, but a subsequent re-request yields a partial or excessively-slow response from the origin site (as it is being overloaded).  Rather than having the proxy lose a valid copy of a stale file, proxies perform whole-file overwrites in the spirit of AFS.  Namely, new versions of content are written to temporary files; only after the file completes downloading and appears valid (e.g., based on Content-Length) will a proxy replace its existing copy with this new version of the data.

Meta Lesson:  Preserve the status quo

These examples all point to a lesson that seems to govern CoralCDN’s proxy design: Maintain the status quo unless improvements are possible.  A similar theme has helped govern our system management. CoralCDN servers query a centralized management point for a number of tasks: to update their overall run status, to start or stop individual service components (HTTP, DNS, DHT), to reinstall or update to a new software version, or to update shared secrets that provide admission control to Coral’s decentralized DHT.  Although designed to handle intermittent connectivity to its management servers, one of CoralCDN’s significant outages came when the management server began misbehaving and returning unexpected information.  This led to management scripts on servers killing CoralCDN’s local processes.  In response, CoralCDN now implements what we might call fail-same behavior that accepts updates conservatively.  Management information is stored durably on servers, maintaining their status-quo operation (even across local crashes) until correct new instructions are received.

In the next post, I’ll start discussing some of the issues we’ve needed to deal with while operating CoralCDN on PlanetLab, a deployment platform that is heterogeneous, shared, virtualized, loosely managed, and itself oversubscribed.

CoralCDN Lesson: Fixing overlooked assumptions in DHTs

So let’s start with the first of seven lessons from CoralCDN’s deployment:

  • How all published distributed hash table (DHT) algorithms are susceptible to race conditions and routing errors for non-transitive network connectivity, and what can be done to mitigate these problems.

Some challenges with deploying DHTs

slashdot-data

CoralCDN’s primary goal was to enable websites to survive spikes in traffic.  We can see examples of such so-called flash crowds through CoralCDN: The figure on the left shows a spike to Coralized slashdot.org URLs that occurred in mid-2005.  Requests grew from nothing to 50/s within the first five minutes, peaking to about 250/s within the first hour.  We also see what occurs after a story leaves Slashdot’s front page:  traffic plummeted to zero within a few minutes.

Now, if traffic arrives rather unexpectedly and suddenly, we should ask whether Coral’s DHT indexing architecture provides the property that if content is cached (or is in the process of being downloaded) anywhere in the network, other CoralCDN proxies can effectively find this content.

coral-lookupThis next figure demonstrates the results of a more controlled experiment we performed during an initial deployment of the CoralCDN network (i.e., before it started handling public traffic).  Twelve files were hosted on an Apache server behind a home DSL line, while 166 geographically-distributed proxies on PlanetLab began mostly simultaneously requesting the files.  The graph shows the number of files in the experiment that were served by inter-proxy transfers between each level of Coral’s hierarchy: most requests (the solid red top line) were handled via localized transfers, and decreasingly fewer by more distant transfers. A total of 14 requests were served by the origin server (the dashed purple line).  Thus, while demonstrating the efficacy of Coral’s lookup algorithms, origin servers can—and do, in real deployments—see multiple requests for the same file.

One obvious source of multiple requests can arise when information is “lost” from the DHT; i.e., when nodes storing DHT entries fail.   After all, Coral is designed to store soft-state pointers, as opposed to the hard-state of some DHTs that use active replication (e.g., CFS, OpenDHT, etc.).  Or transient failures may arise when packet loss or delays between nodes is sufficiently high.

This post describes two other sources of “failed lookups” to which most DHTs are susceptible: test-and-set-like race conditions and non-transitive connectivity.  I don’t suggest that this is a comprehensive list, but they are important details for DHT designers and implementers to consider.

DHT Race Conditions

CoralCDN’s particular use of its indexing layer may result in redundant fetches to an origin server because a race condition exists in the protocol.

Consider that a key k is not yet inserted into the system, and two nodes are interested in k.  If they concurrently execute their get(), both operations serialize and will fail at the key’s root node.  Both nodes will subsequently optimistically put() their identity after contacting the origin server.  Simply inverting the order of these operations is even worse.  If multiple nodes optimistically perform a put before their gets, they can discover one another and effectively form cycles just waiting for one another (like distributed deadlock), with nobody actually retrieving data.

To minimize this condition, Coral extended insert operations to provide return status information, like test-and-set in shared-memory systems.  Specifically, it introduced a single put+get protocol which performs both operations: This algorithm behaves similar to a Coral put (described in a previous post), but also returns the first values discovered in either direction to satisfy the “get” portion of the request.  (Values returned during the forward put direction help performance, as a put may need to traverse past the first stored values, while a get should return as soon as the first values are found. Values returning during the reverse put phase prevent this race condition.)  While we see that Coral’s load-balancing properties add a little complexity, in that get and put would return after encountering different sets of nodes, this is handled by a put+get exposing two separate callbacks to higher-layer applications.

Interesting, there’s another aspect to this story.  I initially developed this test-and-set behavior for a cooperative file system called Shark (published in NSDI ’05).  Shark focused on solving the “I want to distribute my build directory to all nodes on PlanetLab,” and it used Coral for discovery.  Because we imagined that the flash crowds in Shark could be so sudden (press “run”), this put+get primitive seemed necessary.  In practice, however, nodes could see some head-of-line blocking effects, as the optimistic put could register a peer for a file chunk much before it downloaded the chunk, leading to deep multicast trees.  (Shark supported lookup and transfers on file chunks, as opposed to whole files as in CoralCDN.)  Unfortunately, if certain parent nodes were particularly slow, this could delay the entire subtree below it.  Nodes would eventually give up on a particular transfer, and switch to either another peer or fall back to the origin, but this would lead to greater transfer times.  This delay was less of an issue for large files that were broken into many chunks fetched in parallel, but would be a bigger issue for CoralCDN’s whole file, and often small file, transfers.  On top of that, we see from the above figure that CoralCDN’s flash crowds are on the order of tens of seconds or more, not a few seconds.  Thus, given both these reasons, while CoralCDN supports the put+get primitive, its configuration file has this option turned off in our live deployment.

Non-Transitive Connectivity and DHT Routing

This isn’t the only explanation for multiple requests, however. Other redundant requests would occur over time-scales that were not explainable by race conditions and even when the indexing layer would not experience any node failures.  Yet Coral’s practice of retrying failed RPCs and using adaptive timeouts suggests that such failures are not due to transient network congestion either.

An implicit assumption in DHT protocols is that all nodes are able to communicate with each other.  Yet we know this assumption is unfounded in practice.  We say that three hosts, A, B, and C, exhibit non-transitivity if two pairs of these nodes can communicate with one another, but one pair (say, A and B) cannot. These transient periods of non-transitivity occur for many reasons, including link failures, BGP routing updates, and ISP peering disputes.

While this DHT connectivity assumption seems obvious in hindsight, I don’t think it was explicitly identified before our paper at WORLDS ’05 (with Karthik Lakshminarayanan, Sean Rhea, and Ion Stoica). The paper includes some brief measurement data of connectivity between PlanetLab nodes; e.g., during the observed time window, we saw non-transitivity between 2.3% of node triples.   But the high-level bit is that non-transitivity has great impact on DHT algorithms.

ntrThe problem with non-transitive connectivity and DHTs is due to DHT’s use of  greedy routing.  Consider the segment of a Chord ring in the figure to the left, where the dashed lines represent predecessor links.  Identifiers increase from the left, so node B is the root (successor) to key k.  If nodes A and B are unable to communicate, A will believe that C is its successor. Upon receiving a lookup for k, A will respond with C.  If the requester then tries to insert a value under k at node C, C would refuse, since according to its view it is not responsible for key k.

The basic way to handle non-transitive connectivity is to route around the problem.  To do so, one can (1) modify DHT routing tables to include indirect entries and (2) add support for a basic forward primitive.  Using the above example, node A‘s routing table would include entries:

B → <C, B>

C → <C>

where the former implies an indirect path to B through C.  The forward primitive takes a destination and a message m,  and it replaces the ultimate destination with its overlay route.  Each hop forwards the packet as expected, although most “paths” will actually be of length one (when direct connectivity exists). Using such a forwarding primitive, DHTs that implement recursive routing—where intermediate nodes actually forward DHT queries, as opposed to return next-hop information to senders—can circumvent the problem of non-transitivity.  Specifically, to route among ones’ immediate neighbors, which is required for correctly  routing to a key’s root, DHT nodes exchange local routing tables and build link-state information about their local peers.  Such information is less necessary for long-distance hops (in terms of identifier-space distance), as DHT routing structures offer flexibility in the choice of peers (see Gummadi et al. in SIGCOMM ’03).

Coral’s iterative routing, however, presents additional problems: Nodes may not have direct connectivity to the next-hop peers returned by intermediates, even though those intermediates do.  In practice, this problem is mitigated in two ways.  First, Coral’s implementation caches additional information beyond O(log N) routing state as a performance optimization: Caching live nodes aids in accurate RTT information to set good timeouts, while remembering unreachable nodes prevents unnecessary delays when waiting for a request to timeout. This cache also is used to remember indirect connectivity, the very information returned by iterative routing.  Second, Coral’s implementation keeps a window of multiple outstanding RPCs (4 in our current deployment), and thus avoids the performance hit of the single node’s failure or delayed response.  As a request approaches the key’s root, indirect requests can be sent through the root’s immediate neighbors, and thus avoid inconsistent roots.  More detailed information can be found in our WORLDS paper.


In the next set of posts, I’ll pop up a layer and turn Postel’s Law upside down:

  • How a CDN (or system in general) designed to interact with erratic services should be designed to accept conservatively and serve liberally.

Security mechanisms in CoralCDN (and some attacks)

Before finally getting to some experiences, I wanted to touch on some of the security mechanisms that CoralCDN proxies incorporate to curtail misuse, especially important given their deployment at PlanetLab-affiliated universities.

Limited functionality

CoralCDN proxies only support GET and HEAD requests.  Many of the attacks for which “open” proxies are infamous are simply not feasible.  For example, clients cannot use CoralCDN to POST passwords for brute-force cracking.  It does not support SSL and thus risk carry more confidential data.  CoralCDN proxies do not support CONNECT requests, and thus they cannot be used to send spam as SMTP relays or forge From: addresses in web mail. And it deletes cookies from HTTP headers (in both direction), again reducing its use for carrying personalized or authenticated traffic.  Furthermore, because CoralCDN only handles Coralized URLs (i.e., example.com.nyud.net, rather than example.com), it cannot be used by simply configuring a vanilla browser’s proxy settings.

Maximum file size

Given both its storage and bandwidth limitations, CoralCDN enforces a maximum file size of 50MB.  This has generally prevented clients from using CoralCDN for video distribution.  Some sites have attempted to circumvent these limits by omitting Content-Length headers (on connections marked as persistent and without chunked encoding).  So, to ensure compliance, proxies monitor ongoing transfers and halt any ones that exceed their limits.  The 50MB limit was instigated after we saw early use of CoralCDN to serve 700MB-or-so videos through it, which open us up both to increased DMCA take-down notices and, perhaps more from a fairness perspective, uses up a significant fraction of local disk cache (only 4 GB per server) and our daily bandwidth limits (15 GB per server per day).

Excessive resource use

While I’ll later discuss CoralCDN’s techniques for limiting bandwidth overuse by servers—as we only have about 15 GB of bandwidth per server per day, we need to somewhat fairly split it between origin sites if oversubscribed—there’s also resource abuse by clients.  So, in addition to monitoring server-side consumption, proxies keep a sliding window of client-side usage patterns.  Not only do we seek to prevent excessive aggregate bandwidth usage by clients, but also an excessive number of (even small) requests.  These are usually caused either by server misconfigurations (which I’ll discuss in a future post) that result in HTTP redirection loops or by bot misuse.  While CoralCDN does not support POSTs or https, even some top-10 web sites support cleartext passwords over GET requests.  Thus, bots occasionally attempt to use CoralCDN to hide their identity while launching brute-force login attacks on websites through it.

Domain blacklisting

Finally, I maintain a global domain-name blacklist that each proxy regularly fetches and reloads during run-time.  This blacklist is used to satisfy origin servers that do not wish their content to be accessed via CoralCDN. Such restrictions were especially necessary to keep the proxies running on university campuses, as many universities have their IP address space “whitelisted” by some websites as a form of IP-based authentication, a practice especially common among online academic journals.  Prior to implementing this, we—or rather, the NetOps folks at various deployment sites—received abuse complaints from journal operators about the existence of “open proxies”, leading the ops people to threaten that the CoralCDN proxies (or even PlanetLab servers) would get pulled from their networks. Part of sites’ contractual obligations with journals include prevention of such unauthorized access.

But interestingly, these site operators actually do have a way to prevent such accesses themselves. After all, CoralCDN does not attempt to hide its presence nor its clients’ identities. Proxies use unique User-Agent strings (“CoralWebPrx“) when fetching content from webservers, and client IP addresses are reported in X-Forwarded-For headers (a semi-standardized header introduced by Squid proxies in the ’90s).  In practice, however, many site operators are either averse or unable to perform such security checks themselves.  As abuse complaints to PlanetLab sites can risk their continued operation, such blacklists were necessary to enable CoralCDN to continue to run.

We did encounter some interesting circumventions of our domain-based blacklists.  On at least one occasion, a user created dynamic DNS records for a random prefix that pointed to the IP address of a target domain: i.e., x12345.foo.com pointed to the IP address for www.journal.org. (I forget the specific online academic journal that was targeted, although I vaguely remember it being a Russian-language chemistry one.) Then, because we weren’t blacklisting foo.com, the domain passed our check. CoralCDN proxies resolved it to the journal’s IP address and successfully downloaded some articles from the origin site.  Now, it’s rather surprising that this attack even worked, because the HTTP request from our proxy would have had a Host field of x12345.foo.com, rather than that of the journal. But the site operators were not checking the Host field, and, when informed of the problem, were not initially sure how to do so (I vaguely remember them running some non-Apache webserver on Solaris). This site, and some others, have requested that we perform IP-based blacklisting instead of domain-based, although I’ve avoided such restrictions given the difficulty in keeping these up-to-date.

Still, there’s an issue here related to our initial goal of a fully decentralized CDN. We’ve been using a central administratively-supplied blacklist, as otherwise deployment sites get uncomfortable with running a (semi) “open” proxy. If different proxies were to employ different blacklists, however, there’s the configuration challenge of directing users to proxies that can satisfy their requests and avoiding those proxies that would reject them.  The tor anonymizing network allows different exit proxies to specify different configuration policies, but these are fairly coarse-grained (i.e., which ports to block or allow) as opposed to more finer-grained configuration that we want (e.g., blacklisting specific HTTP domains).  I remember a similar problem about academic journals be discussed by someone looking to operate a tor exist node at Berkeley a few years back (although I don’t remember how the problem was actually resolved…anybody?).  In general, this seems like a hard configuration and organization problem to solve, perhaps a good research problem related to decentralized management and access.

Security through Obscurity (and Unintentional Tarpitting)

Overall, however, CoralCDN sees surprisingly less abuse than I expected.  (Let’s hope this post isn’t thumbing my nose at fate!)  I see two reasons for this.

First, as described above, CoralCDN doesn’t work with vanilla proxy settings because it only handled Coralized URLs.  Thus, it also doesn’t show success for crawlers that “search” for open proxies.  In fact, to the best of my knowledge, CoralCDN does not appear in any published open-proxy lists.  Now, it would be trivial for some user to change whatever script they are using to auto-Coralized domain names (indeed, there are a host of browser plugins and Greasemonkey scripts that do exactly that).  But this one-line change is probably beyond the ability of some script-kiddies.  And, perhaps even more important, there are just so many other crunchy targets to be had that a special rule for CoralCDN proxies is just not worth the trouble.

Second, CoralCDN can be quite slow at times, especially when it doesn’t have content cached locally and must perform global DHT lookup before touching an origin site.  So delays for a URL found neither locally nor anywhere in the DHT—as happens when one is performing a brute-force login attack against a website, where the URL query string changes each attempt—can be on the order of seconds.  This sounds very much like an (albeit unintentional) tarpit, which is a security trick of “delaying” response to a client request, especially used in SMTP.  Good users don’t notice the additional delay (although this isn’t quite so true in interactive protocols like HTTP), while attackers see much lower throughput, given any limits on the number of outstanding connections they make.  Put simpler, it’s going to be a long day when each attempt to guess a password takes 2 seconds.

Another Princeton-based CDN, CoDeeN, has introduced a number of other interesting security mechanisms for automated robot detection.   Perhaps if our usage patterns change we’ll have to look into using some of them.


So that’s my background description of CoralCDN and its security mechanisms.  In the next post of this series, I’ll discuss how all published DHT algorithms are susceptible to both race conditions and routing problems—the latter given non-transitive network connectivity—and what can be done to mitigate these problems.

The Design of CoralCDN

In this post, I describe the architecture and mechanisms of CoralCDN at a high-level. This is meant to provide some of the background necessary for some of our experiences and lessons with operating the system.

System Overview

CoralCDN is composed of three main parts: (1) a network of cooperative HTTP proxies that handle users’ requests, (2) a network of DNS nameservers for .nyud.net that map clients to nearby CoralCDN HTTP proxies, and (3) the underlying Coral indexing infrastructure and clustering machinery on which the first two applications are built.  You’ll find that I refer to the entire system as “CoralCDN”, but the network of indexing nodes as “Coral” (although this clarification is not always shared by others, who refer to the entire system as either “Coral” or “Coral Cache”).

At a high-level, the following steps occur when a client issues a request to CoralCDN.

  1. A client resolves a “Coralized” domain name (e.g., of the form sns.cs.princeton.edu.nyud.net) using CoralCDN nameservers, possibly starting at one of the 10–12 primary nameservers registered for .nyud.net under the .net domain.
  2. Upon receiving a query, a CoralCDN nameserver probes the client to determines its round-trip-time.  The nameserver uses this information to determine which nameservers (NS records) and CoralCDN proxies (authoritative A records) to return.
  3. The client sends an HTTP request for a Coralized URL to one of the returned proxy.  If the proxy is caching the file locally, it returns the file and stops.  Otherwise, this process continues.
  4. The proxy looks up the web object’s URL in Coral.
  5. If Coral returns the address of a node caching the object, the proxy fetches the object from this node.  Otherwise, the proxy downloads the object from the origin server (e.g., sns.cs.princeton.edu).
  6. The proxy stores the web object and returns it to the client browser.
  7. The proxy stores a reference to itself in Coral, recording the fact that is now caching the URL.

A review of DHTs

The Coral indexing layer is closely related to the structure and organization of distributed hash tables (DHTs) like Chord and Kademlia.   Let’s briefly review the design of DHTs; skip to the next section if you are already familiar with DHTs.

DHTs were first concurrently proposed around 2001 by several different research groups (Chord from MIT, Pastry from Rice/Microsoft, CAN from ICSI, and Tapestry from Berkeley), which basically all had a similar flavor.  They provide a distributed data structure that exposes two functions:   put(key, value) and get(key).  These proposals took load-balancing properties first considered by the consistent hashing work of Karger et al ’97, differing because consistent hashing had each node have a global view of the system (where all system participants (nodes) know all other nodes in the system) where DHTs focused on more scalable designs (where nodes only know a fraction of total system participants).  For most of these algorithms, this “fraction” is O(log n), although you now see DHT proposals called “one-hop DHTs” which effectively go back to the O(n) state of consistent hashing.   DHTs have had some real impact:  BitTorrent uses DHTs to select trackers for their “trackerless” operation, while memory-caching systems like memcached (used by Facebook, LiveJournal, Wikipedia, etc.) use it to map objects only cache nodes.

In these DHTs, all nodes are assigned one or more random identifiers, called nodeids, typically from some space “keyspace” (e.g., all 160-bit numbers).  “Keys” are also mapped into this same identifier/keyspace by computing some function F(k) that maps any input key k ({0,1}*) into the keyspace ({0,1}^160).  One often uses a cryptographic hash function like SHA-1 for this F, as it has the property of collision resistance; namely, that it is very difficult to find two inputs k and k’ for which F(k) == F(k’) yet k != k’.  (As a slight digression, if SHA-1 indeed behaves like a random function, then finding any such pair k,k’ that “collide” will require searching roughly 2^80 numbers for 160-bit values, i.e., the “Birthday bound”.  There’s been some very clever attacks against SHA-1 in the past few years to bring this down to around 2^63, but that’s perhaps not that relevant to our discussion, and one could always switch instead to a longer function, such as the 256-bit aptly-named SHA-256, and NIST currently has another competition going on to determine the next-generation cryptographic hash function.)

Anyway, I’ll describe the basic approach of DHTs using Chord as an example, because I think it’s the easiest to explain.  We first start out with a identifier space that’s comprised of all 160-bit numbers, i.e., [0, 2^160-1].  Imagine placing these numbers on a ring, so that you have both “0” and “2^160-1” in the 12-o’clock position.  Then, we map both keys and nodeids onto this space.  In Coral, keys are the SHA-1 hash of URLs, and nodeids are the SHA-1 hash of nodes’ IP addresses.  We say that a node is a successor (or root) of a key if it is the first node that is clockwise from the key, as you go around the ring.  You can also view a node as “being responsible for” the entire keyspace between its own identifier and that of its nearest node predecessor (in the counter-clockwise direction).

So that’s how we assign keys to nodes, which is (roughly) how Akamai supposedly assigns responsibility for caching files to its servers, as well as how memcached works (albeit memcache doesn’t support the dynamic addition or removal of cache nodes, unfortunately). This approach of consistent hashing is superior to other simple “mod-based” hashing techniques, which number all nodes from 1..n, then assign a key to the ith node by simply computing key mod n.  In mod-based hashing, one gets great load balancing, but bad smoothness.  This smoothness property addresses how many items will move between nodes if new nodes join or leave the system.  In the hash-based approach, if n changes by only one, then all items will need to be reassigned (or somehow a complex renumbering trick played); in consistent hashing, when a node is added or removed, only its successor will typically be affected.

But both of these hashing approaches assume that every node knows approximately all other nodes in the system.  DHTs ask how we can design a system where each node only knows some small fraction (typically O(log n)) of the system’s nodes.  In Chord, as in most other DHTs, these known nodes are drawn from a special distribution, one that allows any node to find a key’s node successor in a short number of routing hops (typically O(log n)) when we use the system’s nodes as a routing overlays.  Thus nodes in a multi-hop DHT play the role of both routing lookups and storing keys/value pairs.

Chord maintains two different types of “links” in its routing table:  (1) successor links which map to its immediate k nodes in a clockwise direction around the ID ring (i.e., nodes with next-highest ids), where k successors are known instead of just a single one for fault-tolerance; (2) fingers which provide long-distance shortcuts for efficient routing.  These finger links are drawn from an exponential distribution, so a node with id me knows the node successors for the ids {me+2^1, me+2^2, me+2^3, … me+2^log n}.  In other words, it has a link to some node 1/2 around the ring from it, 1/4 around the ring, 1/8, 1/16, and so on.  (In fact, some performance optimizations don’t require that fingers are chosen in quite so strict a manner–see “proximity neighbor selection“–but this basic distribution is still maintained for efficient routing.)

Then, in order to route to another node responsible for any key, a node uses these links in a greedy fashion:  in each hop, it chooses a known node who’s nodeid is closest to the key being looked up.  Given that each node has a routing table of neighbors with the above distribution, that means that each hop is guaranteed to get at least 1/2 closer to the destination, hence the O(log n) bound for routing.  As a concrete example, let’s consider we just have 5-bit identifiers, and let’s assume for simplicity that there are actually 32 nodes in the system.  Then, the node with id 0 will have links to nodes with ids 1, 2, 4, 8, and 16.  Further, if node 0 wants to lookup key 29, its lookup traverses the path through nodes 16, 24, 28, and 29, respectively.

The Coral indexing layer

Compared to “traditional” DHTs, Coral introduced a few novelties given its application requirements: Its key-value indexing layer offered weaker consistency and included a locality-organized hierarchical design.  After all, a client need not discover all proxies caching a particular file, it only needs to find several such proxies, preferably ones nearby.  So, keys in Coral are computed SHA-1(URL), values are the IP addresses of proxying a web object, and each key/value pair is stored as soft-state in the system so includes a TTL for expiry.

We start by introducing the notion of sloppiness, which basically refers to the fact that the DHT layer is explicitly designed to not be strongly (or even eventually) consistent. The basic idea with sloppiness is to prevent a key’s nominal root from getting overloaded if there’s significant interest in a single key (i.e., the URL that’s currently getting Slashdotted).  So while unpopular keys will be stored on their root, inserts to popular keys will halt during their overlay routing before reaching the actual root and instead store the key/value pair being inserted on some more distant node.  Then, a subsequent lookup for this key can similarly stop upon finding any set of values stored under the key, i.e., never actually progressing to the root.  As keys get more popular, their storage will get pushed out to more nodes in the overlay, particularly because the number of nodes that are one hop away from the root is log n, the number that are two hops away is log^2 n, etc.

So the real question is the following:  (1) how do we actually do this “push-back” to store items further away from the root, and (2) how do we ensure that lookups still succeed, no matter which routing path requests come in from, especially as keys are now getting pushed back and entries in Coral are just soft-state.

To insert a key/value pair, Coral performs a two-phase operation.  In the “forward” phase, Coral routes to nodes successively closer to the key k, as in typical DHT routing, but stops when happening upon either (1) the key’s root or (2) a node that is both full and loaded.  We say a node is full if it has reached the maximum number of values for the key (with sufficiently long-lived TTLs), and loaded when there is heavy write traffic for a particular key.  During the “reverse” phase, the proxy attempts to insert the value at the closest node seen.  These insert-rate-limiting processes prevent tree saturation—i.e., the root becoming overloaded with a high rate of insertions—and makes get operations satisfiable by more distance nodes.

That gives us sloppiness, but what about locality?  To improve locality, these routing operations are not initially performed across the entire global overlay.  Instead, each Coral node belongs to several distinct routing structures called clusters. Each cluster is characterized by a maximum desired network round-trip-time (RTT).  The system is parameterized by a fixed hierarchy of clusters with different expected RTT thresholds. Coral’s deployment uses a three-level hierarchy, with level-0 denoting the global cluster and level-2 the most local one.  Coral employs distributed algorithms to form localized, stable clusters.

coral-clusters

Every node belongs to one cluster at each level, as shown in the picture.  Coral queries nodes in its local cluster (shown by the nodes in yellow for lookups numbered 1 and 2) before going on to lookups across larger or the global cluster. This both reduces lookup latency and increases the chances of returning values stored at nearby nodes, which would in fact be pointers to nearby proxies.  Thus, if a local lookup succeeds, the returned IP addresses will also correspond to proxies that have a low RTT with respect to the requesting node.  This doesn’t guarantee that downloading a web object from that proxy will be fast, but its a good approximation, especially as we are normally dealing with smaller web objects and hence TCP-based transfers are often in slow-start (making low RTT all the more important).

Coral’s hierarchical structure has another interesting property. Because Coral nodes maintain the same nodeids in all clusters, once a lookup has progressed as far as it can in one level, it is guaranteed to have made similar forward progress within its larger cluster. This is shown by the vertical dotted lines in the above picture.  For a more in-depth description of how Coral self-organizes itself into these clusters, see our original NSDI paper.  I will also talk a bit more about the clustering design when discussing “latency-sensitive” system architectures in a later post.

This clustering mechanism arose around the same time locality-preferences started getting introduced into DHTs (see mention of “proximity neighbor selection (PNS)” above).   I’m still unclear as to the relative benefit of both.  The clusters certainly introduce complexity to the systems, which is a point against them.  On the flip side, they provide a more static structure for routing, which perhaps allows us to make stronger claims as to their load-balancing properties and also make them better suited for “constrained routing” for security-conscious DHT designs. Perhaps most important, however, is that the hierarchy seems less susceptible to variations arising from measuring RTTs between nodes (especially important in virtualized environments like PlanetLab which can cause significant jitter, as I will discuss in a later post).  Thus, I’m concerned that the PNS-based approaches will “miss” locally stored content more often than the hierarchy-based approach — the routing path in PNS has to line up perfectly, after all — although I’ve not simulated or otherwise tested this theory to say so with any real confidence.

The CoralCDN HTTP Proxy

CoralCDN seeks to aggressively minimize load on origin servers.  We now summarize how its HTTP proxies were designed to use the Coral indexing layer for inter-proxy cooperation and rapid adaptation to flash crowds.

Locality-optimized inter-proxy transfers.  Each CoralCDN proxy keeps a local cache from which it can immediately fulfill client requests.  When a client requests a non-resident URL, CoralCDN proxies attempt to fetch web content from each other, using the Coral indexing layer for discovery.  A proxy will only contact a URL’s origin server after the Coral indexing layer provides no referrals or none of its referrals return the data.

If a CoralCDN proxy discovers that one or more neighbors have the object, it attempts to establish TCP connections in parallel to multiple proxies (2 in our live deployment).  It issues an HTTP request on the first established connection.  If this request fails—e.g., if the remote proxy has evicted cached content, returns an unexpected error, or is unresponsive—these pre-established connections provide fast failover.  Otherwise, they are closed once the remote proxy begins returning valid content.

CoralCDN’s inter-proxy transfers are locality-optimized, both from their use of parallel connection requests and, more importantly, by the order which neighboring proxies are contacted. The properties of Coral’s hierarchical indexing layer implies that the list of proxy IP addresses returned by get(SHA-1(URL)) will be sorted based on their cluster distance to the request initiator.  Thus, proxies will attempt to contact level-2 neighbors before level-1 and level-0 proxies, respectively.

Rapid adaptation to flash crowds. Unlike many web proxies, CoralCDN is explicitly designed for flash-crowd scenarios.  If a flash crowd suddenly arrives for a web object, proxies self-organize into a form of multicast tree for retrieving the object.  Data streams from the proxies that initially fetch the object from the origin server to those arriving later. CoralCDN provides such behavior by combining optimistic references and cut-through routing.

First, as soon as a CoralCDN proxy begins receiving the first bytes of a web object—either from the origin or from another proxy—it inserts a reference to itself in the Coral indexing layer with a short time-to-live (30 seconds).  It continually renews this short-lived reference until either it completes the download or the download fails.

Second, CoralCDN’s cut-through routing at each proxy helps reduce transmission time for larger files.  That is, a proxy will upload portions of a object as soon as they are downloaded, not waiting until it receives the entire object.  (The proxy implements such behavior through its event-based structure communicating through the file system. Multiple client connections can register read callbacks on a single file, and whenever this file is written to as content is received from a remote node, these connection callbacks are dispatched and the newly-written content is sent to other remote nodes downstream in the tree.)  Once any CoralCDN proxy successfully finishes downloading the content, it inserts a much longer-lived reference to itself into Coral. Because the Coral insertion algorithm accounts for expiry times—the deployed system uses 2-hour TTLs for “successful” transmissions (e.g., status codes of 200 (OK), 301 or 302 (Moved), etc.), while also uses 15-minute TTLs for non-transient “unsuccessful” transmissions (e.g., 403 (Forbidden), 404 (File Not Found), etc.) from the origin—these longer-lived references will overwrite shorter-lived ones, and they can be stored on well-selected nodes even under high insertion load. Proxies renew these pointers in Coral  after their TTL expires, but only if the content has been subsequently requested by other parties.

A CoralCDN proxy manages its disk using a least-recently-used eviction policy (each proxy in our PlanetLab deployment has 3GB of disk cache).  The LRU eviction policy can cause a proxy to evict an item, even while a reference in Coral still exists.  (Offering only best-effort guarantees, Coral does not provide any mechanism to delete references.) Recovery from such failures is handled by HTTP failover between proxies, as described above.  That said, in a later post I’ll show that this condition is relatively rare given the system’s working set size.

So that’s the basic design of CoralCDN, with a little pedagogical discussion of DHTs thrown in for fun.  Since deployment, however, I’ve needed to introduce additional functionality to keep in running.  In the next post, I’ll give a brief overview of some of our security mechanisms, while I’ll later discuss some of the per-origin bandwidth fair-sharing mechanisms that we introduced to address the fact that, well, PlanetLab gives us approximately 2 TB of upstream bandwidth per day system-wide, while we’ve long seen demand on the order of >10TB per day.  So if CoralCDN gives you the dreaded “over quota” message, I plead impotent!