CoralCDN Lesson: Accepting conservatively and serving liberally
At its heart, CoralCDN provides a caching service, not a persistent data store. Thus, it ultimately requires that a URL’s origin server is initially available, so that it can pull in content to some CoralCDN proxy and make it available across the network. While traditional web proxies normally interact with sufficiently-provisioned or otherwise well-behaved origin webservers, CoralCDN experiences a different norm. Given its very design goals, its proxies typically interact with overloaded or poorly-behaving servers; it therefore needs to react to (non-crash) failures as the rule, not the exception. Thus, one design philosophy that has come to govern CoralCDN proxies’ behavior is the exact opposite of Postel’s Law: proxies should accept content conservatively and serve results liberally.
Consider the following situation, fairly normal for CoralCDN. A portal like slashdot.org or digg.com first runs a story that links to a third-party website, driving a sudden influx of readers to this previously unpopular site. Then a user posts a Coralized link to the third-party site as a “comment” to the portal’s story, providing an alternate means to fetch the content. Several outcomes are possible in such scenarios, each demonstrating a different way in which CoralCDN must handle origin failures.
1. The website’s origin server becomes unavailable before any proxy downloads its content.
2. CoralCDN already has a copy of the content, but requests arrive to it after the content’s expiry time has passed. Unfortunately, subsequent HTTP requests to the origin webserver result in failures or errors.
3. CoralCDN’s content is again expired, but subsequent requests to the origin yield only partial transfers.
We next consider how CoralCDN’s mechanisms handle these different situations.
Tackling #1: Negative result caching
CoralCDN may be hit with a flood of requests for an inaccessible URL (e.g., DNS resolution fails or TCP connections time out). For these situations, proxies maintain a local negative result cache about repeated failures. Without it, we have seen resource exhaustion on both proxies and their local DNS resolvers, given flash crowds to apparently dead sites. While more a usability issue than a resource-exhaustion concern, CoralCDN even receives requests for some Coralized URLs several years after their origins became unavailable: the “dead links” problem of the web, but one that would otherwise cause our resources to get unnecessarily tied up.
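To make the idea concrete, here is a minimal Python sketch of a negative result cache. It is not CoralCDN’s actual code: the class and function names, the 60-second lifetime, and the `fetch_origin` callback are all assumptions made for illustration.

```python
import time


class NegativeCache:
    """Remembers recent fetch failures so repeated requests fail fast."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds   # hypothetical lifetime; the post gives no value
        self.clock = clock       # injectable so the cache can be tested
        self._failed = {}        # URL -> time at which the failure entry expires

    def record_failure(self, url):
        self._failed[url] = self.clock() + self.ttl

    def is_known_bad(self, url):
        expires = self._failed.get(url)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._failed[url]    # entry aged out; allow a fresh attempt
            return False
        return True


def fetch(url, cache, fetch_origin):
    """Contact the origin only if the URL is not in the negative cache."""
    if cache.is_known_bad(url):
        raise ConnectionError(f"recent failure cached for {url}")
    try:
        return fetch_origin(url)     # hypothetical callable that does the I/O
    except OSError:
        cache.record_failure(url)    # DNS failures and timeouts are OSErrors
        raise
```

With a cache like this, a flash crowd to a dead site costs one DNS lookup and one connection attempt per lifetime window rather than one per request.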
Tackling #2: Serving stale content
CoralCDN proxies mostly obey content expiry times, as specified by Cache-Control or Expires headers, with a default expiry of 12 hours. If cached content expires, proxies perform a conditional request (If-Modified-Since) to revalidate or update their content. What happens, however, if an origin server fails to respond, or simply returns some temporary error condition? Rather than return an error, proxies return stale content. Specifically, if the origin responds with many 400-level (Forbidden, Not Found, Timeout) or 500-level (Internal Server Error, Service Unavailable, Gateway Timeout) errors, a proxy will serve stale data for up to 24 hours after it expires.
This trade-off will not satisfy every situation. Is a Forbidden message due to the website publisher seeking to make the content unavailable, or is it caused by the website going over its daily bandwidth quota and its hosting service returning an error? Does a “File Not Found” indicate a condition that is temporary (from a PHP or database error) or permanent (from a third party issuing a DMCA take-down notice to the website)? Indeed, such ambiguity led to the introduction of 410 (Gone) messages in HTTP/1.1, denoting permanence, which does result in the eviction of content from our caches. CoralCDN has experienced all these situations, and the difficulty is that many status codes are inherently ambiguous.
Unfortunately, we have also seen many situations caused by semantically-incorrect server responses. These are often generated by poorly-written PHP or other server-side scripts. Too often, our servers receive a 200 (OK) message with human-readable body content to the tune of “an error occurred.” Or, common for virtually-hosted websites, a redirect (302) will lead to a generic error page (of type 200) reporting that “the website has exceeded their bandwidth allotment.” Both situations, unfortunately, result in CoralCDN replacing valid content with less useful information.
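The stale-serving policy in this section can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the proxy’s real logic: the `Entry` type and `revalidate` function are hypothetical, the numeric codes are my reading of the error names listed above, the lifetime of new content is simplified to the 12-hour default, and the final fallback (keep the entry but report the error) is a guess.

```python
from dataclasses import dataclass

DEFAULT_TTL = 12 * 3600     # default expiry of 12 hours
STALE_GRACE = 24 * 3600     # serve stale data for up to 24 hours after expiry
# Assumed mapping of the named errors to status codes.
STALE_OK = (403, 404, 408, 500, 503, 504)


@dataclass
class Entry:
    body: bytes
    expires_at: float       # absolute time in seconds


def revalidate(entry, now, status, body=b""):
    """Decide what to keep and what to serve for an expired entry.

    status is the origin's HTTP status, or None if it did not respond.
    Returns (entry_or_None, body_or_None); a None body means "report an error".
    """
    if status == 200:                   # new content replaces the cached copy
        fresh = Entry(body, now + DEFAULT_TTL)
        return fresh, fresh.body
    if status == 304:                   # not modified: extend the lifetime
        entry.expires_at = now + DEFAULT_TTL
        return entry, entry.body
    if status == 410:                   # Gone denotes permanence: evict
        return None, None
    if status is None or status in STALE_OK:
        if now - entry.expires_at <= STALE_GRACE:
            return entry, entry.body    # serve stale rather than an error
    return entry, None                  # assumed fallback: report the error
```

Note that the 200 branch trusts the status code alone, which is exactly how an “an error occurred” page served with 200 (OK) ends up overwriting valid content.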
Tackling #3: Whole-file overwrites
Finally, consider when the CoralCDN proxy is already caching an expired file, but a subsequent re-request yields a partial or excessively-slow response from the origin site (as it is being overloaded). Rather than having the proxy lose a valid copy of a stale file, proxies perform whole-file overwrites in the spirit of AFS. Namely, new versions of content are written to temporary files; only after the file completes downloading and appears valid (e.g., based on Content-Length) will a proxy replace its existing copy with this new version of the data.
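A whole-file overwrite of this kind might look like the following Python sketch. The function name and arguments are hypothetical, and the only validity check shown is the Content-Length comparison mentioned above; a real proxy would likely check more.

```python
import os
import tempfile


def replace_if_complete(cache_path, chunks, content_length):
    """Write chunks to a temporary file; swap it in only if the size matches.

    Returns True if the cached copy was replaced, False if it was left alone.
    """
    directory = os.path.dirname(os.path.abspath(cache_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        written = 0
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
                written += len(chunk)
        if written != content_length:
            return False                     # partial transfer: keep the old copy
        os.replace(tmp_path, cache_path)     # swap in the complete new version
        return True
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)              # discard an incomplete download
```

If the origin stalls or resets the connection partway through, iterating over `chunks` either ends early or raises; in both cases the temporary file is discarded and the existing cached copy is untouched.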
Meta Lesson: Preserve the status quo
These examples all point to a lesson that seems to govern CoralCDN’s proxy design: Maintain the status quo unless improvements are possible. A similar theme has helped govern our system management. CoralCDN servers query a centralized management point for a number of tasks: to update their overall run status, to start or stop individual service components (HTTP, DNS, DHT), to reinstall or update to a new software version, or to update shared secrets that provide admission control to Coral’s decentralized DHT. Although CoralCDN was designed to handle intermittent connectivity to its management servers, one of its significant outages came when the management server began misbehaving and returning unexpected information. This led to management scripts on servers killing CoralCDN’s local processes. In response, CoralCDN now implements what we might call fail-same behavior that accepts updates conservatively. Management information is stored durably on servers, maintaining their status-quo operation (even across local crashes) until correct new instructions are received.
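One way to picture fail-same behavior is the Python sketch below. Everything in it is an assumption for illustration: the JSON message format, the `status` and `services` fields, and the function names are invented here and are not CoralCDN’s management protocol. It also assumes a valid configuration has been stored at least once.

```python
import json
import os

# Hypothetical schema for a management update.
VALID_STATUS = ("run", "stop")
KNOWN_SERVICES = ("http", "dns", "dht")


def is_valid(update):
    """Accept only updates that match the expected shape exactly."""
    return (
        isinstance(update, dict)
        and update.get("status") in VALID_STATUS
        and isinstance(update.get("services"), list)
        and all(s in KNOWN_SERVICES for s in update["services"])
    )


def apply_update(state_path, raw):
    """Persist a management update only if it parses and validates.

    Returns the configuration the node should run with: the new one if it
    was accepted, otherwise the last good one stored on disk.
    """
    try:
        update = json.loads(raw)
    except (TypeError, ValueError):
        update = None                    # unparseable response: ignore it
    if is_valid(update):
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(update, f)
            f.flush()
            os.fsync(f.fileno())         # make the new state durable first
        os.replace(tmp_path, state_path)
    with open(state_path) as f:          # status quo survives bad updates
        return json.load(f)
```

Because the node always runs from the stored file, a misbehaving management server or a local crash leaves it running with its last accepted instructions.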
In the next post, I’ll start discussing some of the issues we’ve needed to deal with while operating CoralCDN on PlanetLab, a deployment platform that is heterogeneous, shared, virtualized, loosely managed, and itself oversubscribed.