CoralCDN Lesson: Accepting conservatively and serving liberally

At its heart, CoralCDN provides a caching service, not a persistent data store.  Thus, it ultimately requires that a URL’s origin server is initially available, so that it can pull in content to some CoralCDN proxy and make it available across the network.  While traditional web proxies normally interact with sufficiently-provisioned or otherwise well-behaved origin webservers, CoralCDN experiences a different norm.  Given its very design goals, its proxies typically interact with overloaded or poorly-behaving servers; it therefore needs to treat (non-crash) failures as the rule, not the exception.  Thus, one design philosophy that has come to govern CoralCDN proxies’ behavior is the exact opposite of Postel’s Law: proxies should accept content conservatively and serve results liberally.

Consider the following situation, fairly normal for CoralCDN.  A portal like slashdot.org or digg.com first runs a story that links to a third-party website, driving a sudden influx of readers to this previously unpopular site.  Then a user posts a Coralized link to the third-party site as a “comment” to the portal’s story, providing an alternate means to fetch the content.  Several outcomes are possible in such scenarios, each demonstrating a different way in which CoralCDN must handle origin failures.

  1. The website’s origin server becomes unavailable before any proxy downloads its content.
  2. CoralCDN already has a copy of the content, but requests arrive to it after the content’s expiry time has passed.  Unfortunately, subsequent HTTP requests to the origin webserver result in failures or errors.
  3. CoralCDN’s content is again expired, but subsequent requests to the origin yield only partial transfers.

We next consider how CoralCDN’s mechanisms handle these different situations.

Tackling #1:  Negative result caching

CoralCDN may be hit with a flood of requests for an inaccessible URL (e.g., DNS resolution fails, TCP connections time out, etc.).  For these situations, proxies maintain a local negative result cache about repeated failures.  Otherwise, we have seen resource exhaustion on both proxies and their local DNS resolvers, given flash crowds to apparently dead sites.  While more a usability issue than a resource-exhaustion concern, CoralCDN even receives requests for some Coralized URLs several years after their origins became unavailable: the “dead links” problem of the web, but one that would otherwise cause our resources to get unnecessarily tied up.
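To make the idea concrete, here is a minimal sketch of a negative result cache. It is not CoralCDN's actual code: the class name, the `fetch` wrapper, the 60-second TTL, and the choice to block after a single failure (rather than after repeated failures) are all assumptions made for illustration.

```python
import time


class NegativeCache:
    """Remembers recently failed URLs so repeated requests fail fast.

    Hypothetical sketch: the TTL and the block-after-one-failure policy
    are illustrative assumptions, not CoralCDN's real parameters.
    """

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._failed_at = {}  # url -> time of the last recorded failure

    def record_failure(self, url):
        self._failed_at[url] = self.clock()

    def record_success(self, url):
        self._failed_at.pop(url, None)

    def is_blocked(self, url):
        failed = self._failed_at.get(url)
        if failed is None:
            return False
        if self.clock() - failed >= self.ttl:
            # The entry has aged out; allow one new attempt at the origin.
            del self._failed_at[url]
            return False
        return True


def fetch(url, cache, origin_fetch):
    """Fetch via origin_fetch unless the URL failed recently."""
    if cache.is_blocked(url):
        # Fail immediately without touching DNS or opening a connection.
        raise OSError("negative cache hit: " + url)
    try:
        body = origin_fetch(url)
    except OSError:
        cache.record_failure(url)
        raise
    cache.record_success(url)
    return body
```

With a wrapper like this, a flash crowd aimed at a dead site costs one origin attempt per TTL window instead of one per request.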

Tackling #2:  Serving stale content

CoralCDN proxies mostly obey content expiry times, as specified by Cache-Control or Expires headers, with a default expiry of 12 hours.  If cached content expires, proxies perform a conditional request (If-Modified-Since) to revalidate or update their content. What happens, however, if an origin server fails to respond, or simply returns some temporary error condition?  Rather than return an error, proxies return stale content. Specifically, if the origin responds with many 400-level (Forbidden, Not Found, Timeout) or 500-level (Internal Server Error, Service Unavailable, Gateway Timeout) errors, a proxy will serve stale data for up to 24 hours after it expires.

This trade-off will not satisfy every situation.  Is a Forbidden message due to the website publisher seeking to make the content unavailable, or is it caused by the website going over its daily bandwidth quota and its hosting service returning an error?  Does a “File Not Found” indicate a condition that is temporary (from a PHP or database error) or permanent (from a third party issuing a DMCA take-down notice to the website)?  Indeed, such ambiguity led to the introduction of 410 (Gone) messages in HTTP/1.1, denoting permanence, which does result in the eviction of content from our caches.  CoralCDN has experienced all these situations, and the difficulty is that many status codes are inherently ambiguous.

Unfortunately, we have also seen many situations caused by semantically-incorrect server responses.  These are often generated by poorly-written PHP or other server-side scripts.  Too often do our servers receive a 200 (OK) message with human-readable body content to the tune of “an error occurred.”  Or, common for virtually-hosted websites, a redirect (302) will lead to a generic error page (of type 200) reporting that “the website has exceeded their bandwidth allotment.”  Both situations, unfortunately, result in CoralCDN replacing valid content with less useful information.
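The revalidation policy described in this section can be sketched roughly as follows. This is a simplified model under stated assumptions, not the proxy's real logic: `Entry`, `revalidate`, and the constant names are hypothetical, redirects are not modeled, and any 4xx or 5xx status other than 410 is treated as a possibly temporary failure.

```python
from dataclasses import dataclass

DEFAULT_TTL = 12 * 3600   # default expiry from the text, in seconds
STALE_GRACE = 24 * 3600   # how long stale content may be served after expiry


@dataclass
class Entry:
    body: bytes
    expires_at: float


def revalidate(entry, now, status, new_body=None):
    """Return (entry_to_keep, body_to_serve) after a revalidation attempt.

    status is the origin's HTTP status code, or None if the origin did
    not respond at all. Hypothetical helper, for illustration only.
    """
    if status == 304:
        # Not modified: the cached copy is good for another period.
        return Entry(entry.body, now + DEFAULT_TTL), entry.body
    if status == 200:
        # Accepted as new content, even if the body is really an error
        # page: the proxy cannot tell the two apart from the status code.
        return Entry(new_body, now + DEFAULT_TTL), new_body
    if status == 410:
        # Gone denotes permanence, so the content is evicted.
        return None, None
    if status is None or 400 <= status < 600:
        # Ambiguous failure: keep serving the stale copy within the grace window.
        if now - entry.expires_at <= STALE_GRACE:
            return entry, entry.body
        return None, None
    raise ValueError("status not modeled in this sketch: %r" % status)
```

Note how the 200 branch captures the weakness described above: a semantically incorrect 200 response replaces a valid cached copy.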

Tackling #3:  Whole-file overwrites

Finally, consider when the CoralCDN proxy is already caching an expired file, but a subsequent re-request yields a partial or excessively-slow response from the origin site (as it is being overloaded).  Rather than having the proxy lose a valid copy of a stale file, proxies perform whole-file overwrites in the spirit of AFS.  Namely, new versions of content are written to temporary files; only after the file completes downloading and appears valid (e.g., based on Content-Length) will a proxy replace its existing copy with this new version of the data.
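A whole-file overwrite of this kind might look like the sketch below. The function name and its parameters are hypothetical, and the only validity check is a byte count against Content-Length; a real proxy would likely check more.

```python
import os
import tempfile


def overwrite_if_complete(path, chunks, content_length):
    """Write chunks to a temp file; replace path only if the download is whole.

    Returns True if the cached file was replaced, False if it was kept.
    Hypothetical helper, for illustration only.
    """
    # Create the temp file next to the target so the final rename stays
    # on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        written = 0
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
                written += len(chunk)
        if written != content_length:
            return False  # partial transfer: keep the existing copy
        os.replace(tmp_path, path)  # swap in the new version in one step
        tmp_path = None
        return True
    finally:
        # Clean up the temp file on a partial transfer or an exception.
        if tmp_path is not None and os.path.exists(tmp_path):
            os.remove(tmp_path)
```

If the origin connection drops mid-transfer and the `chunks` iterator raises, the exception propagates after cleanup and the stale copy is still on disk.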

Meta Lesson:  Preserve the status quo

These examples all point to a lesson that seems to govern CoralCDN’s proxy design: Maintain the status quo unless improvements are possible.  A similar theme has helped govern our system management. CoralCDN servers query a centralized management point for a number of tasks: to update their overall run status, to start or stop individual service components (HTTP, DNS, DHT), to reinstall or update to a new software version, or to update shared secrets that provide admission control to Coral’s decentralized DHT.  Although designed to handle intermittent connectivity to its management servers, one of CoralCDN’s significant outages came when the management server began misbehaving and returning unexpected information.  This led to management scripts on servers killing CoralCDN’s local processes.  In response, CoralCDN now implements what we might call fail-same behavior that accepts updates conservatively.  Management information is stored durably on servers, maintaining their status-quo operation (even across local crashes) until correct new instructions are received.
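As a rough illustration of fail-same behavior, the sketch below only replaces durably stored management state when a new update passes validation. The state format, the service and state names, and the function names are all assumptions for illustration; CoralCDN's real management protocol is not described here. The sketch also assumes a valid initial state has already been stored.

```python
import json
import os
import tempfile

SERVICES = {"http", "dns", "dht"}   # assumed set of managed components
VALID_STATES = ("run", "stop")      # assumed per-service instructions


def is_valid(update):
    """Accept only a dict that names every service with a known state."""
    return (
        isinstance(update, dict)
        and set(update) == SERVICES
        and all(v in VALID_STATES for v in update.values())
    )


def apply_update(state_path, update):
    """Persist update only if it validates; return the state now in effect."""
    if is_valid(update):
        directory = os.path.dirname(os.path.abspath(state_path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(update, f)
            f.flush()
            os.fsync(f.fileno())  # make the new state durable before the swap
        os.replace(tmp, state_path)
    # Unexpected or malformed updates fall through: the stored state stands.
    with open(state_path) as f:
        return json.load(f)
```

Because the state lives on disk, a server that crashes and restarts while the management server is misbehaving resumes with its last known-good instructions.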

In the next post, I’ll start discussing some of the issues we’ve needed to deal with while operating CoralCDN on PlanetLab, a deployment platform that is heterogeneous, shared, virtualized, loosely managed, and itself oversubscribed.

  • Omprakash Gnawali

    If a site has gone down in response to a DMCA take-down notice, can “serving liberally” result in proxies also being subject to DMCA notices? The technical argument here says that the original server did not return the “GONE” response, but sending a “GONE” response is probably not legally mandated.

  • Hi Om,

    Well, anytime you appear to be a public-facing proxy cache, you are subject to DMCA notices. And sometimes even when you aren’t, like when you are actually a printer. The short response is that yes, we’ve had to deal with DMCA notices. The typical answer I give appears in our FAQ:

    CoralCDN does not provide archival storage of content, like google.com’s cache or archive.org. Much like a web cache or “content accelerator” at ISPs, CoralCDN only keeps data temporarily in its file caches, either until the data expires or it is evicted (as may occur for unpopular data). As described above, CoralCDN will serve data for some maximum fixed period (24 hours) before checking back with the origin website. If the content at that site has changed, CoralCDN will fetch the new content afresh, replacing the old. If the origin site is no longer online or the particular content returns some HTTP error message, CoralCDN will only serve the old data for a short time (24 hours).

    Thus, if you believe that a website is making infringing content available, please direct any notices to that particular website. Recall that CoralCDN’s naming method makes the identity of the origin website obvious: A Coralized URL of the form http://www.example.com.nyud.net/ corresponds to an origin site http://www.example.com/ .

    If/When that origin site complies with the notice, the content in question will naturally be removed from CoralCDN’s caches through purely automated technical means in at most 24 hours.

    My understanding is that the DMCA requires that content be taken down within some reasonable time period, which CoralCDN’s expiry times satisfy. In the past, this explanation and the resulting system behavior have satisfied content owners.
