Category Archives: Peer-to-peer

Erroneous DMCA notices and copyright enforcement, part deux

[Given my continued use of Ed’s Freedom-To-Tinker blog, I’m reposting this article from there.]

A few weeks ago, I wrote about a deluge of DMCA notices and pre-settlement letters that CoralCDN experienced in late August. This article actually received a bit of press, including MediaPost, ArsTechnica, TechDirt, and, very recently, Slashdot. I’m glad that my own experience was able to shed some light on the more insidious practices that are still going on under the umbrella of copyright enforcement. More transparency is especially important at this time, given the current debate over the Anti-Counterfeiting Trade Agreement.

Given this discussion, I wanted to write a short follow-on to my previous post.

The VPA drops Nexicon

First and foremost, I was contacted by the founder of the Video Protection Alliance not long after this story broke. I was informed that the VPA has not actually developed its own technology to discover users who are actively uploading or downloading copyrighted material, but rather contracts this role out to Nexicon. (You can find a comment from Nexicon’s CTO on my previous article here.) As I was told, the VPA was contracted by certain content publishers to help reduce copyright infringement of (largely adult) content. The VPA in turn contracted Nexicon to find IP addresses participating in BitTorrent swarms of the specified movies. Using the IP addresses that Nexicon supplied, the VPA would then send pre-settlement letters to the network providers of those addresses.

The VPA’s founder also assured me that their main goal was to reduce infringement, rather than to collect pre-settlement money. (And that users had been let off with only a warning or, in cases where infringement might have been due to an open wireless network, been told how to secure their wireless network.) He also expressed surprise that the addresses given to them contained false positives (beyond said open wireless networks), and especially that appropriate verification had been lacking. Given this new knowledge, he stated that the VPA has dropped its use of Nexicon’s technology.

BitTorrent and Proxies

Second, I should clarify my claims about BitTorrent’s usefulness with an on-path proxy. While it is true that the address registered with the BitTorrent tracker is not usable, peers connecting from behind a proxy can still download content from other addresses learned from the tracker. If their requests to those addresses are optimistically unchoked, they even have the opportunity to engage in incentivized bilateral exchange. Furthermore, the use of DHT- and gossip-based peer discovery—the latter is termed PEX, for Peer EXchange, in BitTorrent—allows their real address to be learned by others. Thus, through these more modern discovery mechanisms, other peers may initiate connections to them, further increasing the opportunity for tit-for-tat exchanges.

Some readers also pointed out that there is good reason why BitTorrent trackers do not just accept any IP address communicated to them via an HTTP query string, but rather use the end-point IP address of the TCP connection. Namely, any HTTP query parameter can be spoofed, so anybody could add another’s IP address to the tracker list. That would make the victim susceptible to receiving DMCA complaints, just as we experienced with CoralCDN. From a more technical perspective, their machine would also start receiving unsolicited TCP connection requests from other BitTorrent peers, an easy DoS amplification attack.

That said, there are some additional checks that BitTorrent trackers could perform. For example, if an IP query-string parameter or an X-Forwarded-For HTTP header is present, the tracker could register the connection’s end-point address only if it matches the address claimed in those fields. Additionally, some BitTorrent tracker operators have mentioned that they whitelist certain IP addresses as trusted proxies; in those cases, the X-Forwarded-For address is already used. Otherwise, I don’t see a good reason (plausible deniability aside) for recording an IP address that is known to be likely incorrect.
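
To make that check concrete, here is a minimal sketch in Python of the kind of announce-time validation described above. It is not drawn from any real tracker’s code; the function name, parameters, and the trusted-proxy whitelist are illustrative assumptions.

    def select_peer_address(conn_ip, query_ip=None, x_forwarded_for=None,
                            trusted_proxies=frozenset()):
        """Decide which address (if any) to register for a BitTorrent announce.

        conn_ip:          the TCP end-point address of the announce request
        query_ip:         the optional 'ip=' parameter from the query string
        x_forwarded_for:  the optional X-Forwarded-For header value
        trusted_proxies:  proxies whose forwarded headers we choose to believe
        """
        forwarded = (x_forwarded_for.split(",")[0].strip()
                     if x_forwarded_for else None)
        # Trusted proxy: believe the forwarded address, as some trackers already do.
        if conn_ip in trusted_proxies and forwarded:
            return forwarded
        # If the client or an intermediary claims a different address than the one
        # actually connecting, at least one of them is wrong; registering nothing
        # beats registering a likely-wrong address.
        claimed = query_ip or forwarded
        if claimed and claimed != conn_ip:
            return None
        # Default: the only address we can actually verify, the TCP end-point.
        return conn_ip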

Best Practices for Online Technical Copyright Enforcement

Finally, my article pointed out a strategy that I clearly thought was insufficient for copyright enforcement: simply crawling a BitTorrent tracker for a list of registered IP addresses and issuing an infringement notice to each one. I’ll add to that two other approaches that I think are insufficient, unethical, or illegal—or all three—yet have been bandied about as possible solutions.

  • Wiretapping: It has been suggested that network providers could perform deep-packet inspection (DPI) on their customers’ traffic in order to detect copyrighted content. This approach probably breaks a number of laws (either in the U.S. or elsewhere), establishes both a dangerous precedent and a standing infrastructure for far-flung Internet surveillance, and yet is of dubious benefit given the move to encrypted communication by file-sharing software.
  • Spyware: By surreptitiously installing spyware/malware on end-hosts, one could scan a user’s local disk in order to detect the existence of potentially copyrighted material. This practice has even worse legal and ethical implications than network-level wiretapping, and yet politicians such as Senator Orrin Hatch (Utah) have gone as far as declaring that infringers’ computers should be destroyed. And it opens users up to the real danger that their computers or information could be misused by others; witness, for example, the security weaknesses of China’s Green Dam software.

So, if one starts from the position that copyrights are valid and should be enforceable—some dispute this—what would you like to see as best practices for copyright enforcement?

The approach taken by DRM is to build a technical framework that restricts users’ ability to share content or to consume it in a proscribed manner. DRM, however, has been largely disliked by end-users, mostly because it creates a poor user experience and interferes with expected rights (under fair-use doctrine). DRM is also somewhat beside the point here, as copyright infringement notices are needed precisely after “unprotected” content has already flown the coop.

So I’ll start with two properties that I would want all enforcement agencies to adopt when issuing DMCA take-down notices. Let’s restrict this consideration to complaints about “whole” content (e.g., entire movies), as opposed to DMCA challenges over sampled or remixed content, which raise a separate legal debate.

  • For any end client suspected of file-sharing, one MUST verify that the client was actually uploading or downloading content, AND that the content corresponded to a valid portion of a copyrighted file. In BitTorrent, this might mean that the client sends or receives a complete file block, and that the block hashes to the correct value specified in the .torrent file (a minimal sketch of such a check follows this list).
  • When issuing a DMCA take-down notice, the request MUST be accompanied by logged information that shows (a) the client’s IP:port network address engaged in content transfer (e.g., a record of a TCP flow); (b) the actual application request/response that was acted upon (e.g., BitTorrent-level logs); and (c) that the transferred content corresponds to a valid file block (e.g., a BitTorrent hash).
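
As an illustration of the first requirement, here is a minimal Python sketch of verifying transferred content against the SHA-1 digests listed in the .torrent metainfo (the metainfo hashes fixed-size pieces, so the sketch verifies at piece granularity). The logging helper and its record format are hypothetical, not any enforcement tool’s actual output.

    import hashlib

    def piece_is_valid(piece_index, piece_data, torrent_piece_hashes):
        """Check a fully received piece against the .torrent metainfo, whose
        'pieces' field is a concatenation of 20-byte SHA-1 digests."""
        expected = torrent_piece_hashes[20 * piece_index : 20 * (piece_index + 1)]
        return hashlib.sha1(piece_data).digest() == expected

    def log_verified_transfer(tcp_flow, bt_messages, piece_index, piece_data,
                              torrent_piece_hashes):
        """Emit an enforcement log record only for a complete, hash-verified piece.
        'tcp_flow' and 'bt_messages' stand in for whatever records a monitoring
        client keeps of the connection and the BitTorrent-level exchange; their
        structure here is hypothetical."""
        if not piece_is_valid(piece_index, piece_data, torrent_piece_hashes):
            return None
        return {
            "tcp_flow": tcp_flow,          # (a) end-point addresses and timestamps
            "bt_messages": bt_messages,    # (b) the observed request/piece messages
            "piece_index": piece_index,    # (c) which verified piece was transferred
            "piece_sha1": hashlib.sha1(piece_data).hexdigest(),
        }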

So my question to the readers: What would you add to or remove from this list? With what other approaches do you think copyright enforcement should be performed or incentivized?

Update (12/21/2009): Discussion about this post on Slashdot.

Erroneous DMCA notices and copyright enforcement: the VPA, BitTorrent, and me

I recently posted an article on Ed Felten’s group blog, Freedom to Tinker, which covers both technical and policy-related I.T. issues. It describes some technical issues and my recent experience with DMCA enforcement agencies, BitTorrent, and CoralCDN. I think readers here would be interested in it as well, and may want to join the lively conversation among its commenters.

In the past few weeks, Ed has been writing about targeted and inaccurate copyright enforcement. While it may be difficult to quantify the actual extent of inaccurate claims, we can at least try to understand whether copyright enforcement companies are making a “good faith” best effort to minimize any false positives. My short answer: not really.

Let’s start with a typical abuse letter that gets sent to a network provider (in this case, a university) from a “copyright enforcement” company such as the Video Protection Alliance.

The rest of the article can be found here.

IPTPS ’10 call for papers

Together with Arvind Krishnamurthy, I’ll be chairing this year’s International Workshop on Peer-to-Peer Systems (IPTPS).  The workshop was started in 2002, which coincided both with the popularization of P2P file sharing (Napster, KaZaA) and the introduction of distributed hash tables (DHTs) from several different research groups.

Eight years later, P2P file sharing is still going strong (now through BitTorrent), while the previously-academic DHTs have found their way into real use.  DHTs now form the decentralized lookup structures for file sharing services—in the form of so-called “trackerless” BitTorrent—with the DHT in the Vuze service comprising more than a million concurrent users.  (As an aside, I’m proud to note that Vuze’s DHT is based on Kademlia, which was proposed by one of my officemates in grad school, Petar Maymounkov.)

These self-organizing systems have also found their way into the datacenter.  One notable example is the storage system Dynamo, which forms the basis for Amazon’s shopping cart and other back-end applications.  Or Facebook’s Cassandra, used for its Inbox search.  Or the rest of the key-value stores that do automated partitioning.  And we are starting to see these techniques being proposed for scaling enterprise networks as well.  With that in mind, we wanted to broaden the scope of this year’s IPTPS to include topics relating to self-organizing and self-managing distributed systems, even those running in single administrative domains.

We also plan to have a demo session at this year’s IPTPS to highlight developed and deployed systems.  The workshop will be co-located with NSDI in San Jose, so it will be especially convenient for those in the Bay Area.  We welcome submissions (both papers and demos) from researchers, developers, and hackers.  If you don’t want to write a paper, come show off your running P2P system.

Paper submissions are due Friday, December 18, 2009.  More information can be found at http://www.usenix.org/event/iptps10/cfp/.


CoralCDN Lesson: Fixing overlooked assumptions in DHTs

So let’s start with the first of seven lessons from CoralCDN’s deployment:

  • How all published distributed hash table (DHT) algorithms are susceptible to race conditions and routing errors under non-transitive network connectivity, and what can be done to mitigate these problems.

Some challenges with deploying DHTs

[Figure: request rate to Coralized slashdot.org URLs during a mid-2005 flash crowd]

CoralCDN’s primary goal was to enable websites to survive spikes in traffic.  We can see examples of such so-called flash crowds through CoralCDN: The figure above shows a spike to Coralized slashdot.org URLs that occurred in mid-2005.  Requests grew from nothing to 50/s within the first five minutes, peaking at about 250/s within the first hour.  We also see what occurs after a story leaves Slashdot’s front page: traffic plummeted to zero within a few minutes.

Now, if traffic arrives rather unexpectedly and suddenly, we should ask whether Coral’s DHT indexing architecture provides the property that if content is cached (or is in the process of being downloaded) anywhere in the network, other CoralCDN proxies can effectively find this content.

[Figure: number of files served by inter-proxy transfers at each level of Coral’s hierarchy, versus by the origin server]
This next figure shows the results of a more controlled experiment we performed during an initial deployment of the CoralCDN network (i.e., before it started handling public traffic).  Twelve files were hosted on an Apache server behind a home DSL line, while 166 geographically-distributed proxies on PlanetLab began mostly simultaneously requesting the files.  The graph shows the number of files in the experiment that were served by inter-proxy transfers at each level of Coral’s hierarchy: most requests (the solid red top line) were handled via localized transfers, and decreasingly fewer by more distant transfers. A total of 14 requests were served by the origin server (the dashed purple line).  Thus, while the experiment demonstrates the efficacy of Coral’s lookup algorithms, it also shows that origin servers can—and do, in real deployments—see multiple requests for the same file.

One obvious source of multiple requests is information being “lost” from the DHT, i.e., when nodes storing DHT entries fail.   After all, Coral is designed to store soft-state pointers, as opposed to the hard state of some DHTs that use active replication (e.g., CFS, OpenDHT, etc.).  Transient failures may also arise when packet loss or delays between nodes are sufficiently high.

This post describes two other sources of “failed lookups” to which most DHTs are susceptible: test-and-set-like race conditions and non-transitive connectivity.  I don’t suggest that this is a comprehensive list, but they are important details for DHT designers and implementers to consider.

DHT Race Conditions

CoralCDN’s particular use of its indexing layer may result in redundant fetches to an origin server because a race condition exists in the protocol.

Consider that a key k is not yet inserted into the system, and two nodes are interested in k.  If they concurrently execute their get(), both operations serialize and will fail at the key’s root node.  Both nodes will subsequently optimistically put() their identity after contacting the origin server.  Simply inverting the order of these operations is even worse.  If multiple nodes optimistically perform a put before their gets, they can discover one another and effectively form cycles just waiting for one another (like distributed deadlock), with nobody actually retrieving data.

To minimize this condition, Coral extended insert operations to provide return status information, like test-and-set in shared-memory systems.  Specifically, it introduced a single put+get protocol which performs both operations: This algorithm behaves similarly to a Coral put (described in a previous post), but it also returns the first values discovered in either direction to satisfy the “get” portion of the request.  (Values returned during the forward put direction help performance, as a put may need to traverse past the first stored values, while a get should return as soon as the first values are found.  Values returned during the reverse put phase prevent this race condition.)  Coral’s load-balancing properties add a little complexity here—the get and put portions may return after encountering different sets of nodes—so put+get exposes two separate callbacks to higher-layer applications.
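
Here is a minimal sketch, in Python, of the shape of such a combined put+get. The routing helpers (closest_known_node, next_hop, lookup, store) are hypothetical stand-ins, not Coral’s actual RPC interface.

    def put_plus_get(key, my_value, on_forward_values, on_reverse_values):
        """Sketch of a combined put+get with two callbacks: one for values seen
        while routing toward the root (performance), and one for values found
        while storing on the way back (what breaks the race)."""
        path = []
        node = closest_known_node(key)          # hypothetical routing helper
        while node is not None:                 # forward phase: toward the root
            path.append(node)
            values = node.lookup(key)
            if values:
                on_forward_values(values)       # satisfy the 'get' early
            node = node.next_hop(key)           # None once the forward phase stops
        if not path:
            return
        # Reverse phase: store our value at the closest node seen.  If that node
        # already holds values for the key (a concurrent inserter beat us there),
        # hand those back instead of letting the caller fetch from the origin.
        existing = path[-1].store(key, my_value)
        if existing:
            on_reverse_values(existing)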

Interestingly, there’s another aspect to this story.  I initially developed this test-and-set behavior for a cooperative file system called Shark (published in NSDI ’05).  Shark focused on solving the “I want to distribute my build directory to all nodes on PlanetLab” problem, and it used Coral for discovery.  Because we imagined that the flash crowds in Shark could be so sudden (press “run”), this put+get primitive seemed necessary.  In practice, however, nodes could see some head-of-line blocking effects, as the optimistic put could register a peer for a file chunk well before it had downloaded that chunk, leading to deep multicast trees.  (Shark supported lookups and transfers on file chunks, as opposed to whole files as in CoralCDN.)  Unfortunately, if certain parent nodes were particularly slow, this could delay the entire subtree below them.  Nodes would eventually give up on a particular transfer and switch to either another peer or the origin, but this would lead to greater transfer times.  This delay was less of an issue for large files that were broken into many chunks fetched in parallel, but it would be a bigger issue for CoralCDN’s whole-file, and often small-file, transfers.  On top of that, we see from the figure above that CoralCDN’s flash crowds build over tens of seconds or more, not a few seconds.  For both these reasons, while CoralCDN supports the put+get primitive, its configuration file has this option turned off in our live deployment.

Non-Transitive Connectivity and DHT Routing

This isn’t the only explanation for multiple requests, however.  Other redundant requests occurred over time-scales not explainable by race conditions, and even when the indexing layer experienced no node failures.  Yet Coral’s practice of retrying failed RPCs and using adaptive timeouts suggests that such failures were not due to transient network congestion either.

An implicit assumption in DHT protocols is that all nodes are able to communicate with each other.  Yet we know this assumption is unfounded in practice.  We say that three hosts, A, B, and C, exhibit non-transitivity if two pairs of these nodes can communicate with one another, but one pair (say, A and B) cannot. These transient periods of non-transitivity occur for many reasons, including link failures, BGP routing updates, and ISP peering disputes.

While this DHT connectivity assumption seems obvious in hindsight, I don’t think it was explicitly identified before our paper at WORLDS ’05 (with Karthik Lakshminarayanan, Sean Rhea, and Ion Stoica). The paper includes some brief measurement data of connectivity between PlanetLab nodes; e.g., during the observed time window, we saw non-transitivity between 2.3% of node triples.   But the high-level bit is that non-transitivity has great impact on DHT algorithms.

[Figure: a segment of a Chord ring illustrating non-transitive connectivity]
The problem with non-transitive connectivity in DHTs stems from their use of greedy routing.  Consider the segment of a Chord ring in the figure, where the dashed lines represent predecessor links.  Identifiers increase from the left, so node B is the root (successor) of key k.  If nodes A and B are unable to communicate, A will believe that C is its successor. Upon receiving a lookup for k, A will respond with C.  If the requester then tries to insert a value under k at node C, C would refuse, since according to its view it is not responsible for key k.

The basic way to handle non-transitive connectivity is to route around the problem.  To do so, one can (1) modify DHT routing tables to include indirect entries and (2) add support for a basic forward primitive.  Using the above example, node A‘s routing table would include entries:

B → <C, B>

C → <C>

where the former implies an indirect path to B through C.  The forward primitive takes a destination and a message m, and it replaces the ultimate destination with its overlay route.  Each hop forwards the packet as expected, although most “paths” will actually be of length one (when direct connectivity exists).  Using such a forwarding primitive, DHTs that implement recursive routing—where intermediate nodes actually forward DHT queries, as opposed to returning next-hop information to senders—can circumvent the problem of non-transitivity.  Specifically, to route among one’s immediate neighbors, which is required for correctly routing to a key’s root, DHT nodes exchange local routing tables and build link-state information about their local peers.  Such information is less necessary for long-distance hops (in terms of identifier-space distance), as DHT routing structures offer flexibility in the choice of peers (see Gummadi et al. in SIGCOMM ’03).
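
A minimal sketch of the indirect routing entries and the forward primitive described above; the data structures and the send callback are hypothetical, not Coral’s implementation.

    class RoutingTable:
        """Entries map a destination to an overlay path ending at it.  Most paths
        have length one (direct connectivity); an entry like B -> [C, B] records
        that B is reachable only by relaying through C."""
        def __init__(self):
            self.paths = {}                      # dest -> list of hops

        def add_direct(self, dest):
            self.paths[dest] = [dest]

        def add_indirect(self, dest, via):
            if via in self.paths:                # only useful if the relay is reachable
                self.paths[dest] = self.paths[via] + [dest]

    def forward(table, dest, message, send):
        """Replace the ultimate destination with its overlay route and hand the
        message to the first hop; each hop repeats this until dest is reached."""
        path = table.paths.get(dest)
        if path is None:
            raise ConnectionError("no direct or indirect route to %r" % (dest,))
        send(path[0], {"final_dest": dest, "payload": message})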

Coral’s iterative routing, however, presents additional problems: Nodes may not have direct connectivity to the next-hop peers returned by intermediates, even though those intermediates do.  In practice, this problem is mitigated in two ways.  First, Coral’s implementation caches additional information beyond its O(log N) routing state as a performance optimization: Caching live nodes provides accurate RTT information for setting good timeouts, while remembering unreachable nodes prevents unnecessary delays spent waiting for a request to time out.  This cache is also used to remember indirect connectivity, the very information returned by iterative routing.  Second, Coral’s implementation keeps a window of multiple outstanding RPCs (4 in our current deployment), and thus avoids the performance hit of a single node’s failure or delayed response.  As a request approaches the key’s root, indirect requests can be sent through the root’s immediate neighbors, thus avoiding inconsistent roots.  More detailed information can be found in our WORLDS paper.


In the next set of posts, I’ll pop up a layer and turn Postel’s Law upside down:

  • How a CDN (or system in general) designed to interact with erratic services should be designed to accept conservatively and serve liberally.

The Design of CoralCDN

In this post, I describe the architecture and mechanisms of CoralCDN at a high level. This is meant to provide some of the background necessary for some of our experiences and lessons with operating the system.

System Overview

CoralCDN is composed of three main parts: (1) a network of cooperative HTTP proxies that handle users’ requests, (2) a network of DNS nameservers for .nyud.net that map clients to nearby CoralCDN HTTP proxies, and (3) the underlying Coral indexing infrastructure and clustering machinery on which the first two applications are built.  You’ll find that I refer to the entire system as “CoralCDN”, but the network of indexing nodes as “Coral” (although this clarification is not always shared by others, who refer to the entire system as either “Coral” or “Coral Cache”).

At a high level, the following steps occur when a client issues a request to CoralCDN.

  1. A client resolves a “Coralized” domain name (e.g., of the form sns.cs.princeton.edu.nyud.net) using CoralCDN nameservers, possibly starting at one of the 10–12 primary nameservers registered for .nyud.net under the .net domain.
  2. Upon receiving a query, a CoralCDN nameserver probes the client to determine its round-trip time.  The nameserver uses this information to determine which nameservers (NS records) and CoralCDN proxies (authoritative A records) to return.
  3. The client sends an HTTP request for a Coralized URL to one of the returned proxies.  If the proxy is caching the file locally, it returns the file and stops.  Otherwise, this process continues.
  4. The proxy looks up the web object’s URL in Coral.
  5. If Coral returns the address of a node caching the object, the proxy fetches the object from this node.  Otherwise, the proxy downloads the object from the origin server (e.g., sns.cs.princeton.edu).
  6. The proxy stores the web object and returns it to the client browser.
  7. The proxy stores a reference to itself in Coral, recording the fact that it is now caching the URL.  (A short code sketch of steps 3–7 follows.)
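
The sketch below condenses the proxy-side logic in Python. The coral.get/coral.put calls and the fetch helpers are illustrative names, not CoralCDN’s actual API.

    import hashlib

    def handle_request(url, local_cache, coral, origin_fetch, peer_fetch, my_addr):
        """Serve a Coralized URL: local cache, then Coral-discovered peers,
        then the origin server (a simplification of steps 3-7)."""
        if url in local_cache:                      # step 3: local hit
            return local_cache[url]

        key = hashlib.sha1(url.encode()).digest()   # step 4: look up the URL in Coral
        body = None
        for peer in coral.get(key):                 # step 5: try caching peers first
            body = peer_fetch(peer, url)
            if body is not None:
                break
        if body is None:
            body = origin_fetch(url)                # ... or fall back to the origin

        local_cache[url] = body                     # step 6: cache and return it
        coral.put(key, my_addr)                     # step 7: advertise ourselves
        return body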

A review of DHTs

The Coral indexing layer is closely related to the structure and organization of distributed hash tables (DHTs) like Chord and Kademlia.   Let’s briefly review the design of DHTs; skip to the next section if you are already familiar with DHTs.

DHTs were first proposed concurrently around 2001 by several different research groups (Chord from MIT, Pastry from Rice/Microsoft, CAN from ICSI, and Tapestry from Berkeley), and the proposals all had a similar flavor.  They provide a distributed data structure that exposes two functions: put(key, value) and get(key).  These proposals built on the load-balancing properties first considered in the consistent hashing work of Karger et al. ’97, but differed in that consistent hashing gave each node a global view of the system (all participating nodes know all other nodes), whereas DHTs focused on more scalable designs (each node knows only a fraction of the total participants).  For most of these algorithms, this “fraction” is O(log n), although you now see DHT proposals called “one-hop DHTs” which effectively go back to the O(n) state of consistent hashing.   DHTs have had some real impact:  BitTorrent uses a DHT to select trackers for its “trackerless” operation, while memory-caching systems like memcached (used by Facebook, LiveJournal, Wikipedia, etc.) use similar techniques to map objects onto cache nodes.

In these DHTs, all nodes are assigned one or more random identifiers, called nodeids, typically drawn from some “keyspace” (e.g., all 160-bit numbers).  “Keys” are also mapped into this same identifier space by computing some function F(k) that maps any input key k ({0,1}*) into the keyspace ({0,1}^160).  One often uses a cryptographic hash function like SHA-1 for this F, as it has the property of collision resistance; namely, it is very difficult to find two inputs k and k’ for which F(k) == F(k’) yet k != k’.  (As a slight digression, if SHA-1 indeed behaves like a random function, then finding any such pair k, k’ that “collide” will require searching roughly 2^80 numbers for 160-bit values, i.e., the “birthday bound”.  There have been some very clever attacks against SHA-1 in the past few years that bring this down to around 2^63, but that’s perhaps not so relevant to our discussion, and one could always switch to a longer function, such as the aptly-named 256-bit SHA-256; NIST also currently has a competition under way to determine the next-generation cryptographic hash function.)

Anyway, I’ll describe the basic approach of DHTs using Chord as an example, because I think it’s the easiest to explain.  We start with an identifier space comprised of all 160-bit numbers, i.e., [0, 2^160-1].  Imagine placing these numbers on a ring, so that “0” and “2^160-1” sit next to each other at the 12-o’clock position.  Then, we map both keys and nodeids onto this space.  In Coral, keys are the SHA-1 hash of URLs, and nodeids are the SHA-1 hash of nodes’ IP addresses.  We say that a node is a successor (or root) of a key if it is the first node clockwise from the key as you go around the ring.  You can also view a node as “being responsible for” the entire keyspace between its own identifier and that of its nearest predecessor (in the counter-clockwise direction).

So that’s how we assign keys to nodes, which is (roughly) how Akamai supposedly assigns responsibility for caching files to its servers, as well as how memcached works (although memcached doesn’t support the dynamic addition or removal of cache nodes, unfortunately).  This approach of consistent hashing is superior to simpler “mod-based” hashing techniques, which number all nodes from 1..n and assign a key to node (key mod n).  With mod-based hashing, one gets great load balancing but bad smoothness.  The smoothness property captures how many items must move between nodes when nodes join or leave the system.  In the mod-based approach, if n changes by only one, then nearly all items need to be reassigned (or some complex renumbering trick must be played); in consistent hashing, when a node is added or removed, only its successor is typically affected.
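
A quick way to see the smoothness difference is the following small experiment, which adds one node to a 20-node system and counts how many of 10,000 keys change owners under each scheme (the node and key sets are arbitrary choices for illustration).

    import hashlib

    def h(x):
        return int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")

    def mod_assign(key, nodes):
        return nodes[h(key) % len(nodes)]           # "mod-based" hashing

    def ring_assign(key, nodes):
        ids = sorted((h(n), n) for n in nodes)      # consistent hashing: take the
        k = h(key)                                  # first node clockwise from the key
        for node_id, node in ids:
            if node_id >= k:
                return node
        return ids[0][1]                            # wrap around the ring

    keys = range(10_000)
    before = list(range(20))                        # 20 nodes...
    after = before + [20]                           # ...then add one more

    moved_mod = sum(mod_assign(k, before) != mod_assign(k, after) for k in keys)
    moved_ring = sum(ring_assign(k, before) != ring_assign(k, after) for k in keys)
    print(moved_mod, moved_ring)   # typically the vast majority vs. a small fraction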

But both of these hashing approaches assume that every node knows approximately all other nodes in the system.  DHTs ask how we can design a system where each node only knows some small fraction (typically O(log n)) of the system’s nodes.  In Chord, as in most other DHTs, these known nodes are drawn from a special distribution, one that allows any node to find a key’s successor within a small number of routing hops (typically O(log n)) when we use the system’s nodes as a routing overlay.  Thus nodes in a multi-hop DHT play both roles: routing lookups and storing key/value pairs.

Chord maintains two different types of “links” in its routing table:  (1) successor links, which point to its k immediate successor nodes in a clockwise direction around the ID ring (i.e., the nodes with the next-highest ids), where k successors are maintained instead of just a single one for fault-tolerance; and (2) fingers, which provide long-distance shortcuts for efficient routing.  These finger links are drawn from an exponential distribution, so a node with id me knows the node successors for the ids {me+2^1, me+2^2, me+2^3, … me+2^log n}.  In other words, it has a link to some node 1/2 of the way around the ring from it, 1/4 of the way around, 1/8, 1/16, and so on.  (In fact, some performance optimizations don’t require that fingers be chosen in quite so strict a manner–see “proximity neighbor selection“–but this basic distribution is still maintained for efficient routing.)

Then, in order to route to the node responsible for any key, a node uses these links in a greedy fashion: at each hop, it chooses the known node whose nodeid is closest to the key being looked up.  Given that each node has a routing table of neighbors with the above distribution, each hop is guaranteed to cut the remaining distance to the destination at least in half, hence the O(log n) bound for routing.  As a concrete example, suppose we have just 5-bit identifiers and, for simplicity, that there are actually 32 nodes in the system.  Then, the node with id 0 will have links to the nodes with ids 1, 2, 4, 8, and 16.  Further, if node 0 wants to look up key 29, its lookup traverses the path through nodes 16, 24, 28, and 29, respectively.
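
Here is a small Python sketch of that 5-bit example: it builds each node’s finger table and performs the greedy lookup, reproducing the 0 → 16 → 24 → 28 path that terminates at node 29. The full ring of 32 nodes is the simplifying assumption from the text, not how a real deployment looks.

    BITS = 5
    RING = 2 ** BITS                    # identifier space [0, 31]

    def successor(ident):
        """First node clockwise from 'ident'; trivial here because we assume
        a node exists at every one of the 32 identifiers."""
        return ident % RING

    def fingers(n):
        """Node n's finger table: the successors of n+1, n+2, n+4, n+8, n+16."""
        return [successor(n + 2 ** i) for i in range(BITS)]

    def between(x, lo, hi):
        """True if x lies strictly inside the clockwise interval (lo, hi)."""
        lo, hi, x = lo % RING, hi % RING, x % RING
        return (lo < x < hi) if lo < hi else (x > lo or x < hi)

    def find_successor(n, key, path):
        """Greedy Chord-style lookup: stop when the key falls on our immediate
        successor, else jump to the largest known finger preceding the key."""
        path.append(n)
        if key == successor(n + 1) or between(key, n, successor(n + 1)):
            return successor(n + 1)
        nxt = max(f for f in fingers(n) if between(f, n, key))
        return find_successor(nxt, key, path)

    path = []
    print(find_successor(0, 29, path), path)   # 29 [0, 16, 24, 28]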

The Coral indexing layer

Compared to “traditional” DHTs, Coral introduced a few novelties driven by its application requirements: Its key-value indexing layer offered weaker consistency and included a locality-organized hierarchical design.  After all, a client need not discover all proxies caching a particular file; it only needs to find several such proxies, preferably nearby ones.  So, keys in Coral are computed as SHA-1(URL), values are the IP addresses of proxies caching a web object, and each key/value pair is stored as soft state in the system and so includes a TTL for expiry.

We start by introducing the notion of sloppiness, which basically refers to the fact that the DHT layer is explicitly designed not to be strongly (or even eventually) consistent.  The basic idea behind sloppiness is to prevent a key’s nominal root from getting overloaded if there’s significant interest in a single key (e.g., the URL that’s currently getting Slashdotted).  So while unpopular keys will be stored at their root, inserts for popular keys will halt during overlay routing before reaching the actual root, and instead store the key/value pair on some more distant node.  A subsequent lookup for the key can similarly stop upon finding any set of values stored under it, i.e., without ever actually progressing to the root.  As keys get more popular, their storage gets pushed out to more nodes in the overlay, particularly because the number of nodes one hop away from the root is log n, the number two hops away is log^2 n, and so on.

So the real questions are the following:  (1) how do we actually do this “push-back” to store items further from the root, and (2) how do we ensure that lookups still succeed, no matter which routing path requests come in from, especially as keys are now getting pushed back and entries in Coral are just soft state?

To insert a key/value pair, Coral performs a two-phase operation.  In the “forward” phase, Coral routes to nodes successively closer to the key k, as in typical DHT routing, but stops upon reaching either (1) the key’s root or (2) a node that is both full and loaded.  We say a node is full if it has reached the maximum number of values for the key (with sufficiently long-lived TTLs), and loaded if there is heavy write traffic for that particular key.  During the “reverse” phase, the proxy attempts to insert the value at the closest node seen.  These insert-rate-limiting mechanisms prevent tree saturation—i.e., the root becoming overloaded with a high rate of insertions—and make get operations satisfiable by more distant nodes.
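
A minimal sketch of this two-phase sloppy insert, assuming hypothetical per-node helpers for the “full” and “loaded” tests; the thresholds shown are illustrative, not Coral’s deployed values.

    MAX_VALUES_PER_KEY = 4      # illustrative "full" threshold, not Coral's setting
    MAX_PUT_RATE = 12           # illustrative "loaded" threshold

    def sloppy_put(key, value, route):
        """Two-phase sloppy insert.  'route' is the overlay path toward the key's
        root, nearest node first; the per-node helpers (is_root_for,
        long_lived_values, recent_put_rate, store) are hypothetical."""
        seen = []
        # Forward phase: walk toward the root, stopping early at a node that is
        # both full and loaded -- this is what pushes popular keys outward.
        for node in route:
            seen.append(node)
            full = node.long_lived_values(key) >= MAX_VALUES_PER_KEY
            loaded = node.recent_put_rate(key) >= MAX_PUT_RATE
            if node.is_root_for(key) or (full and loaded):
                break
        # Reverse phase: try to store at the closest node seen, backing off toward
        # the requester if a store is refused (a simplification).
        for node in reversed(seen):
            if node.store(key, value):
                return node
        return None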

That gives us sloppiness, but what about locality?  To improve locality, these routing operations are not initially performed across the entire global overlay.  Instead, each Coral node belongs to several distinct routing structures called clusters. Each cluster is characterized by a maximum desired network round-trip-time (RTT).  The system is parameterized by a fixed hierarchy of clusters with different expected RTT thresholds. Coral’s deployment uses a three-level hierarchy, with level-0 denoting the global cluster and level-2 the most local one.  Coral employs distributed algorithms to form localized, stable clusters.

[Figure: Coral’s three-level cluster hierarchy, showing a node’s level-2, level-1, and level-0 (global) clusters]

Every node belongs to one cluster at each level, as shown in the picture.  Coral queries nodes in its most local cluster (shown by the nodes in yellow for lookups numbered 1 and 2) before moving on to lookups across the larger and then the global cluster.  This both reduces lookup latency and increases the chances of returning values stored at nearby nodes, which are in fact pointers to nearby proxies.  Thus, if a local lookup succeeds, the returned IP addresses will also correspond to proxies that have a low RTT with respect to the requesting node.  This doesn’t guarantee that downloading a web object from that proxy will be fast, but it’s a good approximation, especially as we are normally dealing with smaller web objects and hence TCP-based transfers are often still in slow-start (making low RTT all the more important).

Coral’s hierarchical structure has another interesting property. Because Coral nodes maintain the same nodeids in all clusters, once a lookup has progressed as far as it can at one level, it is guaranteed to have made similar forward progress within its larger cluster. This is shown by the vertical dotted lines in the above picture.  For a more in-depth description of how Coral self-organizes into these clusters, see our original NSDI paper.  I will also talk a bit more about the clustering design when discussing “latency-sensitive” system architectures in a later post.
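
In sketch form, a hierarchical lookup might look like the following; the per-cluster lookup interface and the way partial progress is carried between levels are simplifying assumptions rather than Coral’s actual code.

    def hierarchical_get(key, clusters):
        """Query the most local cluster first, then progressively larger ones.
        'clusters' is ordered [level2, level1, level0]; each cluster's lookup
        starts from the position reached at the previous level, which works
        because a node keeps the same nodeid in every cluster (the interface
        here is a simplifying assumption)."""
        position = None
        for cluster in clusters:
            values, position = cluster.lookup(key, start_at=position)
            if values:
                return values       # pointers to (preferably nearby) proxies
        return []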

This clustering mechanism arose around the same time that locality preferences started being introduced into DHTs (see the mention of proximity neighbor selection (PNS) above).   I’m still unclear as to the relative benefits of the two.  The clusters certainly introduce complexity to the system, which is a point against them.  On the flip side, they provide a more static structure for routing, which perhaps allows us to make stronger claims about their load-balancing properties and may also make them better suited for “constrained routing” in security-conscious DHT designs. Perhaps most important, however, the hierarchy seems less susceptible to variations arising from measuring RTTs between nodes (especially important in virtualized environments like PlanetLab, which can introduce significant jitter, as I will discuss in a later post).  Thus, I’m concerned that PNS-based approaches will “miss” locally stored content more often than the hierarchy-based approach — the routing path in PNS has to line up perfectly, after all — although I’ve not simulated or otherwise tested this theory, so I can’t say so with any real confidence.

The CoralCDN HTTP Proxy

CoralCDN seeks to aggressively minimize load on origin servers.  We now summarize how its HTTP proxies were designed to use the Coral indexing layer for inter-proxy cooperation and rapid adaptation to flash crowds.

Locality-optimized inter-proxy transfers.  Each CoralCDN proxy keeps a local cache from which it can immediately fulfill client requests.  When a client requests a non-resident URL, CoralCDN proxies attempt to fetch web content from each other, using the Coral indexing layer for discovery.  A proxy will contact a URL’s origin server only if the Coral indexing layer provides no referrals or none of its referrals returns the data.

If a CoralCDN proxy discovers that one or more neighbors have the object, it attempts to establish TCP connections in parallel to multiple proxies (2 in our live deployment).  It issues an HTTP request on the first established connection.  If this request fails—e.g., if the remote proxy has evicted cached content, returns an unexpected error, or is unresponsive—these pre-established connections provide fast failover.  Otherwise, they are closed once the remote proxy begins returning valid content.

CoralCDN’s inter-proxy transfers are locality-optimized, both by their use of parallel connection requests and, more importantly, by the order in which neighboring proxies are contacted. The properties of Coral’s hierarchical indexing layer imply that the list of proxy IP addresses returned by get(SHA-1(URL)) will be sorted by cluster distance to the request initiator.  Thus, proxies will attempt to contact level-2 neighbors before level-1 and level-0 proxies, respectively.
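
Putting the two ideas together, a neighbor-fetch routine might look roughly like this sketch; connect and http_get are hypothetical helpers, and the batching is a simplification of the real failover logic.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    PARALLEL_CONNECTIONS = 2        # matches the deployed setting mentioned above

    def fetch_from_neighbors(url, peers, connect, http_get):
        """'peers' comes back from Coral already sorted most-local first.
        'connect' opens a TCP connection to a neighbor proxy (raising OSError on
        failure) and 'http_get' issues a proxied GET over it; both helpers are
        hypothetical."""
        for i in range(0, len(peers), PARALLEL_CONNECTIONS):
            batch = peers[i:i + PARALLEL_CONNECTIONS]
            with ThreadPoolExecutor(max_workers=len(batch)) as pool:
                attempts = [pool.submit(connect, p) for p in batch]
                # Issue the request on whichever connection is established first;
                # other pre-established connections serve only as fast failover
                # if that request fails.
                for attempt in as_completed(attempts):
                    try:
                        conn = attempt.result()
                    except OSError:
                        continue
                    body = http_get(conn, url)
                    if body is not None:
                        return body
        return None                 # caller falls back to the origin server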

Rapid adaptation to flash crowds. Unlike many web proxies, CoralCDN is explicitly designed for flash-crowd scenarios.  If a flash crowd suddenly arrives for a web object, proxies self-organize into a form of multicast tree for retrieving the object.  Data streams from the proxies that initially fetch the object from the origin server to those arriving later. CoralCDN provides such behavior by combining optimistic references and cut-through routing.

First, as soon as a CoralCDN proxy begins receiving the first bytes of a web object—either from the origin or from another proxy—it inserts a reference to itself in the Coral indexing layer with a short time-to-live (30 seconds).  It continually renews this short-lived reference until either it completes the download or the download fails.

Second, CoralCDN’s cut-through routing at each proxy helps reduce transmission time for larger files.  That is, a proxy will upload portions of an object as soon as they are downloaded, not waiting until it receives the entire object.  (The proxy implements such behavior through its event-based structure, communicating through the file system. Multiple client connections can register read callbacks on a single file, and whenever this file is written to as content is received from a remote node, these connection callbacks are dispatched and the newly-written content is sent to other remote nodes downstream in the tree.)  Once any CoralCDN proxy successfully finishes downloading the content, it inserts a much longer-lived reference to itself into Coral. Because the Coral insertion algorithm accounts for expiry times—the deployed system uses 2-hour TTLs for “successful” transmissions (e.g., status codes of 200 (OK), 301 or 302 (Moved), etc.) and 15-minute TTLs for non-transient “unsuccessful” responses (e.g., 403 (Forbidden), 404 (File Not Found), etc.) from the origin—these longer-lived references will overwrite shorter-lived ones, and they can be stored on well-selected nodes even under high insertion load. Proxies renew these pointers in Coral after their TTL expires, but only if the content has been subsequently requested by other parties.
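
The TTL choices above can be summarized in a small sketch; the status-code groupings follow the examples in the text and are not an exhaustive list of what the deployed proxy does.

    IN_PROGRESS_TTL = 30            # seconds: renewed while a download is under way
    SUCCESS_TTL = 2 * 60 * 60       # 2 hours for successful responses
    NEGATIVE_TTL = 15 * 60          # 15 minutes for non-transient errors

    def reference_ttl(download_complete, status=None):
        """TTL for the self-reference a proxy inserts into Coral for a URL."""
        if not download_complete:
            return IN_PROGRESS_TTL
        if status in (200, 301, 302):           # e.g., OK and Moved responses
            return SUCCESS_TTL
        return NEGATIVE_TTL                     # e.g., 403 Forbidden, 404 Not Found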

A CoralCDN proxy manages its disk using a least-recently-used eviction policy (each proxy in our PlanetLab deployment has 3GB of disk cache).  The LRU eviction policy can cause a proxy to evict an item, even while a reference in Coral still exists.  (Offering only best-effort guarantees, Coral does not provide any mechanism to delete references.) Recovery from such failures is handled by HTTP failover between proxies, as described above.  That said, in a later post I’ll show that this condition is relatively rare given the system’s working set size.

So that’s the basic design of CoralCDN, with a little pedagogical discussion of DHTs thrown in for fun.  Since deployment, however, I’ve needed to introduce additional functionality to keep it running.  In the next post, I’ll give a brief overview of some of our security mechanisms, and I’ll later discuss some of the per-origin bandwidth fair-sharing mechanisms that we introduced to address the fact that, well, PlanetLab gives us approximately 2 TB of upstream bandwidth per day system-wide, while we’ve long seen demand on the order of >10 TB per day.  So if CoralCDN gives you the dreaded “over quota” message, I plead impotent!

Firecoral @ IPTPS

We’ve recently been working hard on Firecoral – a browser-based, peer-to-peer content distribution network for web caching. I’ll be presenting a short talk on Firecoral at the 8th International Workshop on Peer-to-Peer Systems (IPTPS) on April 21st in Boston, MA.

Peer-to-peer content distribution has been inarguably successful for large-file distribution (e.g., BitTorrent), but P2P services have been restricted to stand-alone applications, not transparently incorporated into Web browsing and seamlessly running over HTTP. CoralCDN has served as a web content distribution network for the past five years, but its deployment has been limited to PlanetLab, and demand quickly outgrew capacity.

Firecoral’s goal is to scale web content distribution to end users by allowing mutually distrustful users to share their web browser caches, yet ensure the authenticity of content and enable users to preserve privacy by expressing flexible content sharing policies.

To achieve these goals, we have built a Firefox extension that uses subscription-based XPath queries to extract URLs from web sites and redirect them to a proxy server running inside the browser. For example, all external links from Slashdot are redirected to Firecoral. Once a URL is received by the proxy, a tracker is queried for other users caching the same URL, and the content is fetched from peers instead of the origin web server. We ensure data integrity with cryptographic signatures from a trusted service.
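
As a rough sketch of that client-side flow (all helper functions are hypothetical, and the exact signature scheme is an assumption rather than Firecoral’s documented design):

    import hashlib

    def fetch_via_peers(url, query_tracker, fetch_from_peer, fetch_from_origin,
                        signed_digest_for, verify_signature):
        """Ask the tracker for peers caching the URL, fetch from a peer, and
        accept the bytes only if their digest matches one signed by the trusted
        service.  All five helper callables are hypothetical."""
        digest, signature = signed_digest_for(url)
        if not verify_signature(url, digest, signature):
            return fetch_from_origin(url)       # no trustworthy digest: go direct
        for peer in query_tracker(url):
            body = fetch_from_peer(peer, url)
            if body is not None and hashlib.sha256(body).hexdigest() == digest:
                return body                     # authentic content from a peer
        return fetch_from_origin(url)           # fall back to the origin server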

We will be releasing a beta version of Firecoral on our recently launched Firecoral website soon. For more details about Firecoral, please see our paper, Bringing P2P to the Web: Security and Privacy in the Firecoral Network.