CoralCDN Lesson: Fixing overlooked assumptions in DHTs

So let’s start with the first of seven lessons from CoralCDN’s deployment:

  • How all published distributed hash table (DHT) algorithms are susceptible to race conditions and to routing errors under non-transitive network connectivity, and what can be done to mitigate these problems.

Some challenges with deploying DHTs

[Figure: request rate to Coralized slashdot.org URLs during a mid-2005 Slashdot flash crowd]

CoralCDN’s primary goal was to enable websites to survive spikes in traffic. We can see examples of such so-called flash crowds through CoralCDN: the figure above shows a spike in requests for Coralized slashdot.org URLs that occurred in mid-2005. Requests grew from nothing to 50/s within the first five minutes, peaking at about 250/s within the first hour. We also see what occurs after a story leaves Slashdot’s front page: traffic plummeted to zero within a few minutes.

Now, if traffic arrives rather unexpectedly and suddenly, we should ask whether Coral’s DHT indexing architecture provides the property that if content is cached (or is in the process of being downloaded) anywhere in the network, other CoralCDN proxies can effectively find this content.

[Figure: number of file transfers served at each level of Coral’s hierarchy during the controlled experiment]

This next figure shows the results of a more controlled experiment we performed during an initial deployment of the CoralCDN network (i.e., before it started handling public traffic). Twelve files were hosted on an Apache server behind a home DSL line, while 166 geographically-distributed proxies on PlanetLab began requesting the files nearly simultaneously. The graph shows the number of files in the experiment that were served by inter-proxy transfers at each level of Coral’s hierarchy: most requests (the solid red top line) were handled via localized transfers, and decreasingly fewer by more distant transfers. A total of 14 requests were served by the origin server (the dashed purple line). So while the experiment demonstrates the efficacy of Coral’s lookup algorithms, origin servers can, and in real deployments do, see multiple requests for the same file.

One obvious source of multiple requests arises when information is “lost” from the DHT, i.e., when nodes storing DHT entries fail. After all, Coral is designed to store soft-state pointers, as opposed to the hard state of some DHTs that use active replication (e.g., CFS, OpenDHT). Transient failures may also arise when packet loss or delays between nodes are sufficiently high.
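To make the soft-state distinction concrete, here is a minimal sketch (in Python, with invented names; this is not Coral’s actual code) of an index node that holds pointers only as expiring soft state, so a crash or a missed refresh simply loses the entry rather than triggering any repair protocol:

    import time

    class SoftStateIndex:
        # Toy index node: maps a key to the set of proxies caching that content.
        # Entries are soft state: they expire after a TTL and are never replicated
        # or repaired. Proxies re-insert their pointers periodically; if this node
        # fails, the pointers are simply gone until re-announced.

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.entries = {}                      # key -> {proxy_addr: expiry_time}

        def put(self, key, proxy_addr):
            self.entries.setdefault(key, {})[proxy_addr] = time.time() + self.ttl

        def get(self, key):
            now = time.time()
            live = {p: t for p, t in self.entries.get(key, {}).items() if t > now}
            self.entries[key] = live               # lazily drop expired pointers
            return list(live)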

This post describes two other sources of “failed lookups” to which most DHTs are susceptible: test-and-set-like race conditions and non-transitive connectivity.  I don’t suggest that this is a comprehensive list, but they are important details for DHT designers and implementers to consider.

DHT Race Conditions

CoralCDN’s particular use of its indexing layer may result in redundant fetches to an origin server because a race condition exists in the protocol.

Consider that a key k is not yet inserted into the system, and two nodes are interested in k.  If they concurrently execute their get(), both operations serialize and will fail at the key’s root node.  Both nodes will subsequently optimistically put() their identity after contacting the origin server.  Simply inverting the order of these operations is even worse.  If multiple nodes optimistically perform a put before their gets, they can discover one another and effectively form cycles just waiting for one another (like distributed deadlock), with nobody actually retrieving data.

To minimize this condition, Coral extended insert operations to provide return status information, like test-and-set in shared-memory systems. Specifically, it introduced a single put+get protocol which performs both operations: this algorithm behaves similarly to a Coral put (described in a previous post), but also returns the first values discovered in either direction to satisfy the “get” portion of the request. (Values returned during the forward put direction help performance, as a put may need to traverse past the first stored values, while a get should return as soon as the first values are found. Values returned during the reverse put phase prevent this race condition.) Coral’s load-balancing properties add a little complexity, in that the get and put portions may return after encountering different sets of nodes; this is handled by having put+get expose two separate callbacks to higher-layer applications.
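To see why folding the two operations into one atomic step closes the race, here is a deliberately simplified sketch (Python; the names RootNode and fetch are invented, and the whole DHT is collapsed into a single root node rather than a route of nodes as in the real protocol):

    import threading

    class RootNode:
        # Toy model of the node responsible for key k.
        def __init__(self):
            self.lock = threading.Lock()
            self.pointers = {}                 # key -> proxies that have (or are fetching) it

        def get(self, key):                    # naive interface: racy when used as get-then-put
            with self.lock:
                return list(self.pointers.get(key, []))

        def put(self, key, proxy):
            with self.lock:
                self.pointers.setdefault(key, []).append(proxy)

        def put_get(self, key, proxy):
            # Like test-and-set: register the caller and return whatever was
            # already stored, all in one atomic operation.
            with self.lock:
                existing = list(self.pointers.get(key, []))
                self.pointers.setdefault(key, []).append(proxy)
                return existing

    def fetch(root, key, proxy):
        peers = root.put_get(key, proxy)
        if peers:
            return "%s: fetch %s from peer %s" % (proxy, key, peers[0])
        return "%s: fetch %s from the origin server" % (proxy, key)

    root = RootNode()
    # However these two calls interleave, exactly one caller sees an empty peer
    # list, so only one contacts the origin; the other is pointed at its peer.
    print(fetch(root, "k", "proxyA"))
    print(fetch(root, "k", "proxyB"))

(In the real protocol the same effect is achieved across the put’s forward and reverse phases, with the two callbacks mentioned above.)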

Interestingly, there’s another aspect to this story. I initially developed this test-and-set behavior for a cooperative file system called Shark (published in NSDI ’05). Shark focused on solving the “I want to distribute my build directory to all nodes on PlanetLab” problem, and it used Coral for discovery. Because we imagined that the flash crowds in Shark could be so sudden (press “run”), this put+get primitive seemed necessary. In practice, however, nodes could see some head-of-line blocking effects, as the optimistic put could register a peer for a file chunk well before it had downloaded the chunk, leading to deep multicast trees. (Shark supported lookups and transfers on file chunks, as opposed to whole files as in CoralCDN.) Unfortunately, if certain parent nodes were particularly slow, this could delay the entire subtree below them. Nodes would eventually give up on a particular transfer and switch to either another peer or fall back to the origin, but this led to greater transfer times. This delay was less of an issue for large files that were broken into many chunks fetched in parallel, but it was a bigger issue for CoralCDN’s whole-file, and often small-file, transfers. On top of that, we see from the figure above that CoralCDN’s flash crowds build over tens of seconds or more, not a few seconds. For both these reasons, while CoralCDN supports the put+get primitive, its configuration file has this option turned off in our live deployment.

Non-Transitive Connectivity and DHT Routing

This isn’t the only explanation for multiple requests, however. Other redundant requests occurred over time-scales that race conditions cannot explain, and even when the indexing layer experienced no node failures. Yet Coral’s practice of retrying failed RPCs and using adaptive timeouts suggests that such failures are not due to transient network congestion either.

An implicit assumption in DHT protocols is that all nodes are able to communicate with each other.  Yet we know this assumption is unfounded in practice.  We say that three hosts, A, B, and C, exhibit non-transitivity if two pairs of these nodes can communicate with one another, but one pair (say, A and B) cannot. These transient periods of non-transitivity occur for many reasons, including link failures, BGP routing updates, and ISP peering disputes.

While this DHT connectivity assumption seems obvious in hindsight, I don’t think it was explicitly identified before our paper at WORLDS ’05 (with Karthik Lakshminarayanan, Sean Rhea, and Ion Stoica). The paper includes some brief measurement data of connectivity between PlanetLab nodes; e.g., during the observed time window, we saw non-transitivity among 2.3% of node triples. But the high-level bit is that non-transitivity has a significant impact on DHT algorithms.
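For concreteness, here is one way such a tally could be computed from all-pairs reachability data (a sketch with made-up inputs, not the actual measurement code from the paper):

    from itertools import combinations

    def non_transitive_fraction(can_reach):
        # can_reach[a][b] is True if host a can talk to host b (assumed symmetric).
        # A triple is non-transitive if exactly two of its three pairs can
        # communicate while the third pair cannot.
        hosts = list(can_reach)
        bad = total = 0
        for a, b, c in combinations(hosts, 3):
            pairs = [can_reach[a][b], can_reach[b][c], can_reach[a][c]]
            total += 1
            if sum(pairs) == 2:                # exactly one broken pair
                bad += 1
        return bad / total if total else 0.0

    # Tiny example: A and B cannot talk to each other, but both can reach C.
    reach = {
        "A": {"A": True, "B": False, "C": True},
        "B": {"A": False, "B": True, "C": True},
        "C": {"A": True, "B": True, "C": True},
    }
    print(non_transitive_fraction(reach))      # 1.0 -- the single triple is non-transitive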

[Figure: segment of a Chord ring in which nodes A and B cannot communicate]

The problem with non-transitive connectivity and DHTs stems from DHTs’ use of greedy routing. Consider the segment of a Chord ring in the figure, where the dashed lines represent predecessor links. Identifiers increase from the left, so node B is the root (successor) of key k. If nodes A and B are unable to communicate, A will believe that C is its successor. Upon receiving a lookup for k, A will respond with C. If the requester then tries to insert a value under k at node C, C will refuse, since according to its view it is not responsible for key k.
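A tiny sketch of this inconsistency (with invented identifiers; real Chord keys are large hashes) shows how two nodes’ views of the same key’s root can disagree:

    def successor(key, reachable_nodes):
        # Greedy Chord-style root selection: the first node at or after the key,
        # considering only the nodes this caller can actually reach.
        nodes = sorted(reachable_nodes)
        for n in nodes:
            if n >= key:
                return n
        return nodes[0]                        # wrap around the identifier ring

    # A=10, B=20, C=30 on a toy ring; key k=15 falls between A and B, so B is its true root.
    key = 15

    # A cannot reach B, so from A's view the root of k is C...
    print(successor(key, {10, 30}))            # -> 30

    # ...but C can reach everyone and believes B is responsible for k,
    # so C refuses an insert for k that the requester sends its way.
    print(successor(key, {10, 20, 30}))        # -> 20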

The basic way to handle non-transitive connectivity is to route around the problem. To do so, one can (1) modify DHT routing tables to include indirect entries and (2) add support for a basic forward primitive. Using the above example, node A’s routing table would include entries:

B → <C, B>

C → <C>

where the former implies an indirect path to B through C. The forward primitive takes a destination and a message m, and it replaces the ultimate destination with its overlay route. Each hop forwards the packet as expected, although most “paths” will actually be of length one (when direct connectivity exists). Using such a forwarding primitive, DHTs that implement recursive routing—where intermediate nodes actually forward DHT queries, as opposed to returning next-hop information to senders—can circumvent the problem of non-transitivity. Specifically, to route among one’s immediate neighbors, which is required for correctly routing to a key’s root, DHT nodes exchange local routing tables and build link-state information about their local peers. Such information is less necessary for long-distance hops (in terms of identifier-space distance), as DHT routing structures offer flexibility in the choice of peers (see Gummadi et al. in SIGCOMM ’03).
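Here is a minimal sketch of what such indirect entries and a forward primitive might look like (Python, with invented names; a real implementation operates on RPCs over the network rather than dictionaries):

    # Node A's toy routing table from the example above: each known peer maps to an
    # overlay path. Most paths have length one; B is reached indirectly through C.
    ROUTES = {
        "B": ["C", "B"],
        "C": ["C"],
    }

    def forward(routes, dest, msg):
        # Sender side: look up the overlay path to dest and hand the packet to the
        # first hop on that path (usually dest itself).
        path = routes.get(dest, [dest])
        packet = {"remaining": list(path), "payload": msg}
        return path[0], packet                 # (peer to transmit to, packet to send)

    def on_receive(node, packet, send, deliver):
        # Run at each hop; send() and deliver() are assumed transport callbacks.
        remaining = packet["remaining"]
        assert remaining and remaining[0] == node
        remaining.pop(0)
        if remaining:
            send(remaining[0], packet)         # relay one hop closer to the destination
        else:
            deliver(packet["payload"])         # this node is the ultimate destination

So forward(ROUTES, "B", m) hands the packet to C, which relays it on to B, while forward(ROUTES, "C", m) is just a direct send.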

Coral’s iterative routing, however, presents additional problems: nodes may not have direct connectivity to the next-hop peers returned by intermediates, even though those intermediates do. In practice, this problem is mitigated in two ways. First, Coral’s implementation caches additional information beyond its O(log N) routing state as a performance optimization: caching live nodes provides accurate RTT information for setting good timeouts, while remembering unreachable nodes prevents unnecessary delays spent waiting for a request to time out. This cache is also used to remember indirect connectivity, the very information returned by iterative routing. Second, Coral’s implementation keeps a window of multiple outstanding RPCs (4 in our current deployment), and thus avoids the performance hit of a single node’s failure or delayed response. As a request approaches the key’s root, indirect requests can be sent through the root’s immediate neighbors, and thus avoid inconsistent roots. More detailed information can be found in our WORLDS paper.
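As a rough illustration of the second point, here is a sketch of an iterative lookup that keeps several RPCs outstanding and remembers failed peers (the function names and the rpc/distance callbacks are invented, and the real implementation overlaps its RPCs in time rather than issuing them in batches):

    import heapq

    WINDOW = 4                                 # outstanding RPCs, matching the deployment setting

    def iterative_lookup(start_peers, key, rpc, distance):
        # rpc(peer, key) is assumed to return (ok, next_hops); distance(peer, key)
        # orders peers by identifier-space closeness to the key. Querying WINDOW
        # peers at a time means one dead or slow peer cannot stall the lookup.
        candidates = [(distance(p, key), p) for p in start_peers]
        heapq.heapify(candidates)
        queried, failed = set(), set()

        while candidates:
            batch = []
            while candidates and len(batch) < WINDOW:
                _, peer = heapq.heappop(candidates)
                if peer not in queried:
                    batch.append(peer)
                    queried.add(peer)
            for peer in batch:                 # conceptually issued in parallel
                ok, next_hops = rpc(peer, key)
                if not ok:
                    failed.add(peer)           # cache unreachability for later lookups
                    continue
                if not next_hops:              # peer believes it is the key's root
                    return peer
                for nxt in next_hops:
                    if nxt not in queried and nxt not in failed:
                        heapq.heappush(candidates, (distance(nxt, key), nxt))
        return None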


In the next set of posts, I’ll pop up a layer and turn Postel’s Law upside down:

  • How a CDN (or any system) that interacts with erratic services should be designed to accept conservatively and serve liberally.