The Design of CoralCDN
Tags: CoralCDN
In this post, I describe the architecture and mechanisms of CoralCDN at a high-level. This is meant to provide some of the background necessary for some of our experiences and lessons with operating the system.
System Overview
CoralCDN is composed of three main parts: (1) a network of cooperative HTTP proxies that handle users’ requests, (2) a network of DNS nameservers for .nyud.net that map clients to nearby CoralCDN HTTP proxies, and (3) the underlying Coral indexing infrastructure and clustering machinery on which the first two applications are built. You’ll find that I refer to the entire system as “CoralCDN”, but the network of indexing nodes as “Coral” (although this clarification is not always shared by others, who refer to the entire system as either “Coral” or “Coral Cache”).
At a high-level, the following steps occur when a client issues a request to CoralCDN.
- A client resolves a “Coralized” domain name (e.g., of the form sns.cs.princeton.edu.nyud.net) using CoralCDN nameservers, possibly starting at one of the 10–12 primary nameservers registered for .nyud.net under the .net domain.
- Upon receiving a query, a CoralCDN nameserver probes the client to determines its round-trip-time. The nameserver uses this information to determine which nameservers (NS records) and CoralCDN proxies (authoritative A records) to return.
- The client sends an HTTP request for a Coralized URL to one of the returned proxy. If the proxy is caching the file locally, it returns the file and stops. Otherwise, this process continues.
- The proxy looks up the web object’s URL in Coral.
- If Coral returns the address of a node caching the object, the proxy fetches the object from this node. Otherwise, the proxy downloads the object from the origin server (e.g., sns.cs.princeton.edu).
- The proxy stores the web object and returns it to the client browser.
- The proxy stores a reference to itself in Coral, recording the fact that is now caching the URL.
A review of DHTs
The Coral indexing layer is closely related to the structure and organization of distributed hash tables (DHTs) like Chord and Kademlia. Let’s briefly review the design of DHTs; skip to the next section if you are already familiar with DHTs.
DHTs were first concurrently proposed around 2001 by several different research groups (Chord from MIT, Pastry from Rice/Microsoft, CAN from ICSI, and Tapestry from Berkeley), which basically all had a similar flavor. They provide a distributed data structure that exposes two functions: put(key, value) and get(key). These proposals took load-balancing properties first considered by the consistent hashing work of Karger et al ’97, differing because consistent hashing had each node have a global view of the system (where all system participants (nodes) know all other nodes in the system) where DHTs focused on more scalable designs (where nodes only know a fraction of total system participants). For most of these algorithms, this “fraction” is O(log n), although you now see DHT proposals called “one-hop DHTs” which effectively go back to the O(n) state of consistent hashing. DHTs have had some real impact: BitTorrent uses DHTs to select trackers for their “trackerless” operation, while memory-caching systems like memcached (used by Facebook, LiveJournal, Wikipedia, etc.) use it to map objects only cache nodes.
In these DHTs, all nodes are assigned one or more random identifiers, called nodeids, typically from some space “keyspace” (e.g., all 160-bit numbers). “Keys” are also mapped into this same identifier/keyspace by computing some function F(k) that maps any input key k ({0,1}*) into the keyspace ({0,1}^160). One often uses a cryptographic hash function like SHA-1 for this F, as it has the property of collision resistance; namely, that it is very difficult to find two inputs k and k’ for which F(k) == F(k’) yet k != k’. (As a slight digression, if SHA-1 indeed behaves like a random function, then finding any such pair k,k’ that “collide” will require searching roughly 2^80 numbers for 160-bit values, i.e., the “Birthday bound”. There’s been some very clever attacks against SHA-1 in the past few years to bring this down to around 2^63, but that’s perhaps not that relevant to our discussion, and one could always switch instead to a longer function, such as the 256-bit aptly-named SHA-256, and NIST currently has another competition going on to determine the next-generation cryptographic hash function.)
Anyway, I’ll describe the basic approach of DHTs using Chord as an example, because I think it’s the easiest to explain. We first start out with a identifier space that’s comprised of all 160-bit numbers, i.e., [0, 2^160-1]. Imagine placing these numbers on a ring, so that you have both “0″ and “2^160-1″ in the 12-o’clock position. Then, we map both keys and nodeids onto this space. In Coral, keys are the SHA-1 hash of URLs, and nodeids are the SHA-1 hash of nodes’ IP addresses. We say that a node is a successor (or root) of a key if it is the first node that is clockwise from the key, as you go around the ring. You can also view a node as “being responsible for” the entire keyspace between its own identifier and that of its nearest node predecessor (in the counter-clockwise direction).
So that’s how we assign keys to nodes, which is (roughly) how Akamai supposedly assigns responsibility for caching files to its servers, as well as how memcached works (albeit memcache doesn’t support the dynamic addition or removal of cache nodes, unfortunately). This approach of consistent hashing is superior to other simple “mod-based” hashing techniques, which number all nodes from 1..n, then assign a key to the ith node by simply computing key mod n. In mod-based hashing, one gets great load balancing, but bad smoothness. This smoothness property addresses how many items will move between nodes if new nodes join or leave the system. In the hash-based approach, if n changes by only one, then all items will need to be reassigned (or somehow a complex renumbering trick played); in consistent hashing, when a node is added or removed, only its successor will typically be affected.
But both of these hashing approaches assume that every node knows approximately all other nodes in the system. DHTs ask how we can design a system where each node only knows some small fraction (typically O(log n)) of the system’s nodes. In Chord, as in most other DHTs, these known nodes are drawn from a special distribution, one that allows any node to find a key’s node successor in a short number of routing hops (typically O(log n)) when we use the system’s nodes as a routing overlays. Thus nodes in a multi-hop DHT play the role of both routing lookups and storing keys/value pairs.
Chord maintains two different types of “links” in its routing table: (1) successor links which map to its immediate k nodes in a clockwise direction around the ID ring (i.e., nodes with next-highest ids), where k successors are known instead of just a single one for fault-tolerance; (2) fingers which provide long-distance shortcuts for efficient routing. These finger links are drawn from an exponential distribution, so a node with id me knows the node successors for the ids {me+2^1, me+2^2, me+2^3, … me+2^log n}. In other words, it has a link to some node 1/2 around the ring from it, 1/4 around the ring, 1/8, 1/16, and so on. (In fact, some performance optimizations don’t require that fingers are chosen in quite so strict a manner–see “proximity neighbor selection“–but this basic distribution is still maintained for efficient routing.)
Then, in order to route to another node responsible for any key, a node uses these links in a greedy fashion: in each hop, it chooses a known node who’s nodeid is closest to the key being looked up. Given that each node has a routing table of neighbors with the above distribution, that means that each hop is guaranteed to get at least 1/2 closer to the destination, hence the O(log n) bound for routing. As a concrete example, let’s consider we just have 5-bit identifiers, and let’s assume for simplicity that there are actually 32 nodes in the system. Then, the node with id 0 will have links to nodes with ids 1, 2, 4, 8, and 16. Further, if node 0 wants to lookup key 29, its lookup traverses the path through nodes 16, 24, 28, and 29, respectively.
The Coral indexing layer
Compared to “traditional” DHTs, Coral introduced a few novelties given its application requirements: Its key-value indexing layer offered weaker consistency and included a locality-organized hierarchical design. After all, a client need not discover all proxies caching a particular file, it only needs to find several such proxies, preferably ones nearby. So, keys in Coral are computed SHA-1(URL), values are the IP addresses of proxying a web object, and each key/value pair is stored as soft-state in the system so includes a TTL for expiry.
We start by introducing the notion of sloppiness, which basically refers to the fact that the DHT layer is explicitly designed to not be strongly (or even eventually) consistent. The basic idea with sloppiness is to prevent a key’s nominal root from getting overloaded if there’s significant interest in a single key (i.e., the URL that’s currently getting Slashdotted). So while unpopular keys will be stored on their root, inserts to popular keys will halt during their overlay routing before reaching the actual root and instead store the key/value pair being inserted on some more distant node. Then, a subsequent lookup for this key can similarly stop upon finding any set of values stored under the key, i.e., never actually progressing to the root. As keys get more popular, their storage will get pushed out to more nodes in the overlay, particularly because the number of nodes that are one hop away from the root is log n, the number that are two hops away is log^2 n, etc.
So the real question is the following: (1) how do we actually do this “push-back” to store items further away from the root, and (2) how do we ensure that lookups still succeed, no matter which routing path requests come in from, especially as keys are now getting pushed back and entries in Coral are just soft-state.
To insert a key/value pair, Coral performs a two-phase operation. In the “forward” phase, Coral routes to nodes successively closer to the key k, as in typical DHT routing, but stops when happening upon either (1) the key’s root or (2) a node that is both full and loaded. We say a node is full if it has reached the maximum number of values for the key (with sufficiently long-lived TTLs), and loaded when there is heavy write traffic for a particular key. During the “reverse” phase, the proxy attempts to insert the value at the closest node seen. These insert-rate-limiting processes prevent tree saturation—i.e., the root becoming overloaded with a high rate of insertions—and makes get operations satisfiable by more distance nodes.
That gives us sloppiness, but what about locality? To improve locality, these routing operations are not initially performed across the entire global overlay. Instead, each Coral node belongs to several distinct routing structures called clusters. Each cluster is characterized by a maximum desired network round-trip-time (RTT). The system is parameterized by a fixed hierarchy of clusters with different expected RTT thresholds. Coral’s deployment uses a three-level hierarchy, with level-0 denoting the global cluster and level-2 the most local one. Coral employs distributed algorithms to form localized, stable clusters.
Every node belongs to one cluster at each level, as shown in the picture. Coral queries nodes in its local cluster (shown by the nodes in yellow for lookups numbered 1 and 2) before going on to lookups across larger or the global cluster. This both reduces lookup latency and increases the chances of returning values stored at nearby nodes, which would in fact be pointers to nearby proxies. Thus, if a local lookup succeeds, the returned IP addresses will also correspond to proxies that have a low RTT with respect to the requesting node. This doesn’t guarantee that downloading a web object from that proxy will be fast, but its a good approximation, especially as we are normally dealing with smaller web objects and hence TCP-based transfers are often in slow-start (making low RTT all the more important).
Coral’s hierarchical structure has another interesting property. Because Coral nodes maintain the same nodeids in all clusters, once a lookup has progressed as far as it can in one level, it is guaranteed to have made similar forward progress within its larger cluster. This is shown by the vertical dotted lines in the above picture. For a more in-depth description of how Coral self-organizes itself into these clusters, see our original NSDI paper. I will also talk a bit more about the clustering design when discussing “latency-sensitive” system architectures in a later post.
This clustering mechanism arose around the same time locality-preferences started getting introduced into DHTs (see mention of “proximity neighbor selection (PNS)” above). I’m still unclear as to the relative benefit of both. The clusters certainly introduce complexity to the systems, which is a point against them. On the flip side, they provide a more static structure for routing, which perhaps allows us to make stronger claims as to their load-balancing properties and also make them better suited for “constrained routing” for security-conscious DHT designs. Perhaps most important, however, is that the hierarchy seems less susceptible to variations arising from measuring RTTs between nodes (especially important in virtualized environments like PlanetLab which can cause significant jitter, as I will discuss in a later post). Thus, I’m concerned that the PNS-based approaches will “miss” locally stored content more often than the hierarchy-based approach — the routing path in PNS has to line up perfectly, after all — although I’ve not simulated or otherwise tested this theory to say so with any real confidence.
The CoralCDN HTTP Proxy
CoralCDN seeks to aggressively minimize load on origin servers. We now summarize how its HTTP proxies were designed to use the Coral indexing layer for inter-proxy cooperation and rapid adaptation to flash crowds.
Locality-optimized inter-proxy transfers. Each CoralCDN proxy keeps a local cache from which it can immediately fulfill client requests. When a client requests a non-resident URL, CoralCDN proxies attempt to fetch web content from each other, using the Coral indexing layer for discovery. A proxy will only contact a URL’s origin server after the Coral indexing layer provides no referrals or none of its referrals return the data.
If a CoralCDN proxy discovers that one or more neighbors have the object, it attempts to establish TCP connections in parallel to multiple proxies (2 in our live deployment). It issues an HTTP request on the first established connection. If this request fails—e.g., if the remote proxy has evicted cached content, returns an unexpected error, or is unresponsive—these pre-established connections provide fast failover. Otherwise, they are closed once the remote proxy begins returning valid content.
CoralCDN’s inter-proxy transfers are locality-optimized, both from their use of parallel connection requests and, more importantly, by the order which neighboring proxies are contacted. The properties of Coral’s hierarchical indexing layer implies that the list of proxy IP addresses returned by get(SHA-1(URL)) will be sorted based on their cluster distance to the request initiator. Thus, proxies will attempt to contact level-2 neighbors before level-1 and level-0 proxies, respectively.
Rapid adaptation to flash crowds. Unlike many web proxies, CoralCDN is explicitly designed for flash-crowd scenarios. If a flash crowd suddenly arrives for a web object, proxies self-organize into a form of multicast tree for retrieving the object. Data streams from the proxies that initially fetch the object from the origin server to those arriving later. CoralCDN provides such behavior by combining optimistic references and cut-through routing.
First, as soon as a CoralCDN proxy begins receiving the first bytes of a web object—either from the origin or from another proxy—it inserts a reference to itself in the Coral indexing layer with a short time-to-live (30 seconds). It continually renews this short-lived reference until either it completes the download or the download fails.
Second, CoralCDN’s cut-through routing at each proxy helps reduce transmission time for larger files. That is, a proxy will upload portions of a object as soon as they are downloaded, not waiting until it receives the entire object. (The proxy implements such behavior through its event-based structure communicating through the file system. Multiple client connections can register read callbacks on a single file, and whenever this file is written to as content is received from a remote node, these connection callbacks are dispatched and the newly-written content is sent to other remote nodes downstream in the tree.) Once any CoralCDN proxy successfully finishes downloading the content, it inserts a much longer-lived reference to itself into Coral. Because the Coral insertion algorithm accounts for expiry times—the deployed system uses 2-hour TTLs for “successful” transmissions (e.g., status codes of 200 (OK), 301 or 302 (Moved), etc.), while also uses 15-minute TTLs for non-transient “unsuccessful” transmissions (e.g., 403 (Forbidden), 404 (File Not Found), etc.) from the origin—these longer-lived references will overwrite shorter-lived ones, and they can be stored on well-selected nodes even under high insertion load. Proxies renew these pointers in Coral after their TTL expires, but only if the content has been subsequently requested by other parties.
A CoralCDN proxy manages its disk using a least-recently-used eviction policy (each proxy in our PlanetLab deployment has 3GB of disk cache). The LRU eviction policy can cause a proxy to evict an item, even while a reference in Coral still exists. (Offering only best-effort guarantees, Coral does not provide any mechanism to delete references.) Recovery from such failures is handled by HTTP failover between proxies, as described above. That said, in a later post I’ll show that this condition is relatively rare given the system’s working set size.
So that’s the basic design of CoralCDN, with a little pedagogical discussion of DHTs thrown in for fun. Since deployment, however, I’ve needed to introduce additional functionality to keep in running. In the next post, I’ll give a brief overview of some of our security mechanisms, while I’ll later discuss some of the per-origin bandwidth fair-sharing mechanisms that we introduced to address the fact that, well, PlanetLab gives us approximately 2 TB of upstream bandwidth per day system-wide, while we’ve long seen demand on the order of >10TB per day. So if CoralCDN gives you the dreaded “over quota” message, I plead impotent!
