All posts by Mike Freedman

Consistency, Availability, and Geo-Replicated Storage

For the past few years, we’ve been working on problems related to geo-replicated storage. We want to store data “in the cloud,” but that data should reside within multiple datacenters, not just in a single one.  When data is geographically replicated in such a fashion:

  • Users can experience lower latency by accessing a datacenter near them, rather than one halfway around the world.
  • Network or system failures at a single datacenter don’t make the service unavailable (even for data stored at that site).

This is common practice today.  Google runs multiple datacenters around the world, and Amazon Web Services offers multiple “Availability Zones” that are supposed to fail independently.

When data is replicated between locations, an important question arises about the consistency model such a system exposes.  Wyatt Lloyd has been tackling this question in his recent COPS and Eiger systems.  The  problem space this work explores — between giving up on any consistency guarantees one can reason about and just going with “eventual” consistency on one extreme, and giving up on availability guarantees to gain strong consistency and real transactions on the other — is going to be an increasingly important one.

Normally, folks think that the CAP Theorem tells us these two choices are fundamental.  But the key point is that CAP doesn’t tell us that eventual consistency is required, just that (as Partitions can happen) one can’t have both Availability and Strong Consistency (or more formally, linearizability).  It doesn’t tell us anything about consistency models that are weaker than linearizability yet stronger than “eventual.”  And that’s where COPS and Eiger come in.

One of our collaborators at CMU, Dave Andersen, recently wrote up a more accessible discussion of these systems and the causally-consistent data model they expose.  With the explosion of new data storage systems, particularly of the NoSQL variety, it’s important for folks to realize that there’s a (powerful and practical) choice between these two extremes.

Caring about Causality – now in Cassandra

Over the past few years, we’ve spent a bunch of time thinking about and designing scalable systems that provide causally-consistent wide-area replication.  (Here, “we” means the team of Wyatt Lloyd, Michael Freedman, Michael Kaminsky, and myself;  but if you know academia, you wouldn’t be surprised that about 90% of the project was accomplished by Wyatt, who’s a graduating Ph.D. student at the time of this writing.)  I’m posting this because we’ve finally entered the realm of the practical, with the release of both the paper (to appear at NSDI’13) and code for our new implementation of causally-consistent replication (we call it Eiger) within the popular Cassandra key-value store.

Read Dave’s full post here.

Princeton CS hiring again this year

This year saw some great faculty additions to the CS department:

  • Mark Braverman in theoretical computer science,
  • Zeev Dvir in TCS and Math,
  • Rebecca Fiebrink in computational music and HCI (hired in 2010), and
  • David Wentzlaff in the EE department, although his work includes the very CS-like topics of computer architecture and operating systems.

This coming spring, we’re happy to be interviewing again at the Assistant Professor level. Rather than limit the search to any particular areas of focus, we’re looking for top applicants across all areas of computer science.

Tenure-Track Positions.

The Department of Computer Science at Princeton University invites applications for faculty positions at the Assistant Professor level. We are accepting applications in all areas of Computer Science.

Applicants must demonstrate superior research and scholarship potential as well as teaching ability. A PhD in Computer Science or a related area is required. Successful candidates are expected to pursue an active research program and to contribute significantly to the teaching programs of the department. Applicants should include a CV and contact information for at least three people who can comment on the applicant’s professional qualifications.

There is no deadline, but review of applications will start in December 2011. Princeton University is an equal opportunity employer and complies with applicable EEO and affirmative action regulations. You may apply online at: http://jobs.cs.princeton.edu/.

Requisition Number: 0110422

Like at many schools, the application cycle has shifted a bit earlier compared to previous years, so applicants are encouraged to apply earlier rather than later. The website is already open.

Erroneous DMCA notices and copyright enforcement, part deux

[Given my continued use of Ed’s Freedom-To-Tinker blog, I’m reposting this article from there.]

A few weeks ago, I wrote about a deluge of DMCA notices and pre-settlement letters that CoralCDN experienced in late August. This article actually received a bit of press, including MediaPost, ArsTechnica, TechDirt, and, very recently, Slashdot. I’m glad that my own experience was able to shed some light on the more insidious practices that are still going on under the umbrella of copyright enforcement. More transparency is especially important at this time, given the current debate over the Anti-Counterfeiting Trade Agreement.

Given this discussion, I wanted to write a short follow-on to my previous post.

The VPA drops Nexicon

First and foremost, I was contacted by the founder of the Video Protection Alliance not long after this story broke. I was informed that the VPA has not actually developed its own technology to discover users who are actively uploading or downloading copyrighted material, but rather contracts out this role to Nexicon. (You can find a comment from Nexicon’s CTO to my previous article here.) As I was told, the VPA was contracted by certain content publishers to help reduce copyright infringement of (largely adult) content. The VPA in turn contracted Nexicon to find IP addresses that are participating in BitTorrent swarms of those specified movies. Using the IP addresses given them by Nexicon, the VPA subsequently would send pre-settlement letters to the network providers of those addresses.

The VPA’s founder also assured me that their main goal was to reduce infringement, as opposed to collecting pre-settlement money. (And that users had been let off with only a warning, or, in the cases where infringement might have been due to an open wireless network, informed how to secure their wireless network.) He also expressed surprise that there were false positives in the addresses given to them (beyond said open wireless), especially to the extent that appropriate verification was lacking. Given this new knowledge, he stated that the VPA dropped their use of Nexicon’s technology.

BitTorrent and Proxies

Second, I should clarify my claims about BitTorrent’s usefulness with an on-path proxy. While it is true that the address registered with the BitTorrent tracker is not usable, peers connecting from behind a proxy can still download content from other addresses learned from the tracker. If their requests to those addresses are optimistically unchoked, they have the opportunity to even engage in incentivized bilateral exchange. Furthermore, the use of DHT- and gossip-based discovery with other peers—the latter is termed PEX, for Peer EXchange, in BitTorrent—allows their real address to be learned by others. Thus, through these more modern discovery means, other peers may initiate connections to them, further increasing the opportunity for tit-for-tat exchanges.

Some readers also pointed out that there is good reason why BitTorrent trackers do not just accept any IP address communicated to them via an HTTP query string, but rather use the end-point IP address of the TCP connection. Namely, any HTTP query parameter can be spoofed, allowing anybody to add another’s IP address to the tracker list. That would make the owner of that address susceptible to receiving DMCA complaints, just as we experienced with CoralCDN. From a more technical perspective, their machine would also start receiving unsolicited TCP connection requests from other BitTorrent peers—an easy DoS amplification attack.

That said, there are some additional checks that BitTorrent trackers could perform. For example, if an IP query-string parameter or an X-Forwarded-For HTTP header is present, only register the connection’s endpoint address if it matches the address given there. Additionally, some BitTorrent tracker operators have mentioned that they whitelist certain IP addresses as trusted proxies; in those cases, the X-Forwarded-For address is already used. Otherwise, I don’t see a good reason (plausible deniability aside) for recording an IP address that is known to be likely incorrect.
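
To make that suggestion concrete, here is a minimal sketch of such a tracker-side check. It is only illustrative—the function and parameter names are mine, not any real tracker’s—and it simply registers the TCP endpoint address unless a client-supplied address disagrees with it, falling back to the forwarded address only for whitelisted proxies.

   # Illustrative sketch of the tracker-side check suggested above.
   # tcp_addr: source IP of the announce's TCP connection
   # claimed:  address from an 'ip' query parameter or X-Forwarded-For header, if any
   # trusted_proxies: operator-maintained whitelist of proxy addresses
   def address_to_register(tcp_addr, claimed=None, trusted_proxies=frozenset()):
       if claimed is None or claimed == tcp_addr:
           return tcp_addr      # no proxy headers, or they agree with the endpoint
       if tcp_addr in trusted_proxies:
           return claimed       # whitelisted proxy: trust the forwarded address
       # Mismatch: the endpoint is likely a proxy, so do not record an
       # address that is known to be likely incorrect.
       return None

   address_to_register("192.0.2.7")                         # -> "192.0.2.7"
   address_to_register("192.0.2.7", claimed="203.0.113.9")  # -> None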

Best Practices for Online Technical Copyright Enforcement

Finally, my article pointed out a strategy that I clearly thought was insufficient for copyright enforcement: simply crawling a BitTorrent tracker for a list of registered IP addresses and issuing an infringement notice to each IP address. I’ll add to that two other approaches that I think are either insufficient, unethical, or illegal—or all three—yet have been bandied about as possible solutions.

  • Wiretapping: It has been suggested that network providers can perform deep-packet inspection (DPI) on their customer’s traffic in order to detect copyrighted content. This approach probably breaks a number of laws (either in the U.S. or elsewhere), creates a dangerous precedent and existing infrastructure for far-flung Internet surveillance, and yet is of dubious benefit given the move to encrypted communication by file-sharing software.
  • Spyware: By surreptitiously installing spyware/malware on end-hosts, one could scan a user’s local disk in order to detect the existence of potentially copyrighted material. This practice has even worse legal and ethical implications than network-level wiretapping, and yet politicians such as Senator Orrin Hatch (Utah) have gone as far as declaring that infringers’ computers should be destroyed. And it opens users up to the real danger that their computers or information could be misused by others; witness, for example, the security weaknesses of China’s Green Dam software.

So, if one starts from the position that copyrights are valid and should be enforceable—some dispute this—what would you like to see as best practices for copyright enforcement?

The approach taken by DRM is to build a technical framework that restricts users’ ability to share content or to consume it in a proscribed manner. But DRM has been largely disliked by end-users, mostly because it creates a poor user experience and interferes with rights users expect under fair-use doctrine. DRM is also something of a red herring here, as copyright infringement notices are needed precisely after “unprotected” content has already flown the coop.

So I’ll start with two practices that I would want all enforcement agencies to adopt when issuing DMCA take-down notices. Let’s restrict this consideration to complaints about “whole” content (e.g., entire movies), as opposed to DMCA challenges over sampled or remixed content, which raise a separate legal debate.

  • For any end client suspected of file-sharing, one MUST verify that the client was actually uploading or downloading content, AND that the content corresponded to a valid portion of a copyrighted file. In BitTorrent, this might mean verifying that the client sends or receives a complete file block, and that the block hashes to the correct value specified in the .torrent file (a minimal sketch of such a check follows this list).
  • When issuing a DMCA take-down notice, the request MUST be accompanied by logged information that shows (a) the client’s IP:port network address engaged in content transfer (e.g., a record of a TCP flow); (b) the actual application request/response that was acted upon (e.g., BitTorrent-level logs); and (c) that the transferred content corresponds to a valid file block (e.g., a BitTorrent hash).
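
As a concrete illustration of the hash check in the first requirement, here is a minimal sketch in Python. It assumes the “pieces” value from the .torrent metainfo—a concatenation of 20-byte SHA-1 digests, one per piece—has already been extracted, and it glosses over the block-versus-piece distinction that a real client would handle.

   import hashlib

   def piece_is_valid(pieces_field: bytes, index: int, piece_data: bytes) -> bool:
       # 'pieces_field' is the concatenation of 20-byte SHA-1 digests from the
       # .torrent metainfo; compare the received piece against its stored digest.
       expected = pieces_field[20 * index : 20 * (index + 1)]
       return hashlib.sha1(piece_data).digest() == expected

Only a transfer that passes a check like this (and that is logged, per the second requirement) would seem to warrant a notice.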

So my question to the readers: What would you add to or remove from this list? With what other approaches do you think copyright enforcement should be performed or incentivized?

Update (12/21/2009): Discussion about this post on Slashdot.

Faculty hiring at Princeton Computer Science

I’m happy to be able to advertise that, after a two-year hiatus from hiring, the Princeton Computer Science Department will be interviewing for tenure-track faculty positions this year.  Because of Perry Cook‘s recently announced retirement—and he will be sorely missed!—there will be a focus on music, sound, and HCI.  Top applicants from all areas of computer science will be considered, however.

Computer Science Department
Princeton University

Tenure-Track Position

The Department of Computer Science at Princeton University invites applications for Assistant Professor. We are accepting applications in all areas of Computer Science, with particular emphasis on music, sound, and HCI.

Applicants must demonstrate superior research and scholarship potential as well as teaching ability. A PhD or equivalent in Computer Science or a related area is required. Successful candidates are expected to pursue an active research program and to contribute significantly to the teaching programs of the department. Applicants should include a resume and the names of at least three people who can comment on the applicant’s professional qualifications.

Princeton University is an equal opportunity employer and complies with applicable EEO and affirmative action regulations. You may apply online and self-identify at https://jobs.cs.princeton.edu/.

See the official announcement here and the ACM posting here.  The application website is now available.

Erroneous DMCA notices and copyright enforcement: the VPA, BitTorrent, and me

I recently posted an article on Ed Felten’s group blog, Freedom to Tinker, which covers both technical and policy-related I.T. issues. It describes my recent experience with DMCA enforcement agencies, BitTorrent, and CoralCDN, along with some of the technical issues involved. I think readers here would be interested in it as well, and may want to join the lively conversation among the commenters.

In the past few weeks, Ed has been writing about targeted and inaccurate copyright enforcement. While it may be difficult to quantify the actual extent of inaccurate claims, we can at least try to understand whether copyright enforcement companies are making a “good faith” best effort to minimize any false positives. My short answer: not really.

Let’s start with a typical abuse letter that gets sent to a network provider (in this case, a university) from a “copyright enforcement” company such as the Video Protection Alliance.

The rest of the article can be found here.

IPTPS ’10 call for papers

Together with Arvind Krishnamurthy, I’ll be chairing this year’s International Workshop on Peer-to-Peer Systems (IPTPS).  The workshop was started in 2002, which coincided both with the popularization of P2P file sharing (Napster, KaZaA) and the introduction of distributed hash tables (DHTs) from several different research groups.

Eight years later, P2P file sharing is still going strong (now through BitTorrent), while the previously-academic DHTs have found their way into real use.  DHTs now form the decentralized lookup structures for file sharing services—in the form of so-called “trackerless” BitTorrent—with the DHT in the Vuze service comprising more than a million concurrent users.  (As an aside, I’m proud to note that Vuze’s DHT is based on Kademlia, which was proposed by one of my officemates in grad school, Petar Maymounkov.)

These self-organizing systems have also found their way into the datacenter.  One notable example is the storage system Dynamo, which forms the basis for Amazon’s shopping cart and other back-end applications.  Or Facebook’s Cassandra, used for its Inbox search.  Or the rest of the key-value stores that do automated partitioning.  And we are starting to see these techniques being proposed for scaling enterprise networks as well.  With that in mind, we wanted to broaden the scope of this year’s IPTPS to include topics relating to self-organizing and self-managing distributed systems, even those running in single administrative domains.

We also plan to have a demo session at this year’s IPTPS to highlight developed and deployed systems.  The workshop will be co-located with NSDI in San Jose, so it will be especially convenient for those in the Bay Area.  We welcome submissions (both papers and demos) from researchers, developers, and hackers.  If you don’t want to write a paper, come show off your running P2P system.

Paper submissions are due Friday, December 18, 2009.  More information can be found at http://www.usenix.org/event/iptps10/cfp/.


Post-doc opportunity

Jen Rexford and I are jointly seeking to hire a post-doc to join us at Princeton.   We are looking for somebody to start sometime between February and June 2010 (ideally, as soon as possible) and stay for a duration of 18-24 months.

This research opportunity is part of the SCAFFOLD project.  SCAFFOLD is a new network architecture that focuses on better supporting distributed and wide-area services.  These services, run over multiple hosts distributed across many locations, need to respond quickly to server and network churn: both unexpected changes (due to equipment failures and physical mobility) and intentional changes (during planned maintenance, load balancing, and workload migration). SCAFFOLD is exploring network support for treating service-level objects (rather than hosts) as first-class citizens and for providing a tighter coupling between object-based naming and routing.  While the design can support a clean-slate Internet architecture, we are immediately focusing on incremental deployment within a single datacenter or enterprise, as well as across multiple, geo-diverse datacenters.

The postdoctoral associate will provide a leadership role in designing, prototyping, and deploying a SCAFFOLD network.  System components include network-layer devices (OpenFlow-based routers/switches, proxies for backwards compatibility, and NOX-based network controllers), integrated end-hosts, and application-level services to be deployed on top of SCAFFOLD.

Applicants should have a Ph.D. in computer science or a related field at the time of starting the position. They should have a strong research record, as well as first-hand experience building complex networked systems.  Applicants are requested to apply via email (to Jen and myself) with a curriculum vitae and a description of their background and interests. The position will provide a competitive salary and support for further professional development.  Non-U.S. citizens or residents are welcome to apply.  The project is funded through the National Science Foundation, the GENI Project Office, the Office of Naval Research, and Cisco Systems.

Update (March 1):  This post-doc position has since been filled.  Welcome, Erik Nordstrom.

New datacenter network architectures

This year’s HotNets workshop was held over the past two days in the faculty club at NYU; it was nice being on old turf.   The HotNets workshop has authors write 6-page “position” or “work-in-progress” papers on current “hot topics in networking” (surprise!).  Tucked into a cosy downstairs room, the workshop was nicely intimate and it saw lots of interesting questions and discussion.

One topic of particular interest to me was datacenter networking: HotNets included two papers in each of two different research areas.

The first thematic area addressed the problem of bisectional bandwidth within the datacenter. The problem is that each rack in a datacenter may have 40 machines, each potentially generating several Gbps of traffic.  Yet all this traffic is typically aggregated at a single top-of-rack (ToR) switch, which often is the bottleneck in communicating with other racks.  (This is especially relevant for data-intensive workloads, such as MapReduce-style computations.)  Even if the ToR switch is configured with 4-8 10Gbps links, this alone can be insufficient.  For example, an all-L2 network, while providing easy auto-configuration and supporting VM mobility, cannot take advantage of this capacity:  even if multiple physical paths exist between racks, the Ethernet spanning tree will only use one.  In addition, the traditional method of performing L2 address resolution (ARP) is broadcast-based and thus not scalable.

Multiple solutions to this problem are currently being explored.  In SIGCOMM ’09 in late August, we saw three papers that proposed new L2/L3 networks to address this problem.  The first two leveraged the locator/ID split, so that virtual resources are named by some identifier independent of their location.

  • PortLand from UCSD effectively proposed a NAT layer for MAC addresses.  In PortLand, virtual machines running on physical hosts can keep a permanent MAC address (identifier) even under VM migration, but their immediate upstream switch provides a location-specific “pseudo” MAC (the locator). Because these pseudo MACs are allocated hierarchically, they can be efficiently aggregated in upstream switches.  PortLand assumes that datacenter networks have a classic fat tree-like hierarchy between racks of servers, which is the typical network architecture in datacenters.  Instead of routing a packet to a VM’s MAC address, PortLand performs forwarding based on this pseudo-MAC; the VM’s upstream switch NATs between this PMAC and its permanent MAC. No end-host modifications need to be made (although one can certainly envision the host’s hypervisor performing such NATting on the end-host itself before dispatching the packet to the unmodified VM). The sender is also unaware of this process, as its immediate upstream switch (resp. hypervisor) first translates the destination from MAC to PMAC before sending it out over the datacenter network. The question remains how to discover the MAC-to-PMAC binding, and PortLand assumes a centralized lookup controller that stores these bindings and updates them under VM migration, much like the controller in Ethane. (In fact, PortLand was prototyped using OpenFlow, which came out of Ethane.)  A toy sketch of hierarchical PMAC encoding follows this list.
  • VL2 from MSR includes both L2 and L3 elements, unlike PortLand’s L2-only solution. Changhoon Kim presented this work; he was a recent PhD graduate from Princeton (and in fact I served on the committee for his thesis, which included VL2). VL2 particularly focused on achieving both high bisectional bandwidth and good fairness between competing flows.  Using data collected from Microsoft’s online services, they found a high degree of variance between flows.  The implication is that a static allocation of bandwidth wouldn’t be all that promising, unless one managed to fully provision for high bisectional bandwidth everywhere, which would be quite expensive. One of the particularly nice things about their evaluation (and implementation)—certainly aided by the support of an industrial research lab!—is that it ran on a cluster consisting of 3 racks, 80 servers, and 10 switches, so it provided at least some limited scaling experience.
    On to specifics.  While PortLand used “virtualized” Ethernet identifiers, VL2 assigns “virtualized” application-level IP addresses (AAs) to all nodes.  And rather than performing the equivalent of L2 NATing on the first-hop switches, VL2 uses unmodified switches but modifies end-hosts—in virtualized datacenter environments, it’s not terribly difficult to forklift upgrade your servers’ OS—to perform location-specific address (LA) resolution.  This is the identifier/locator split.  So applications use location-independent IP addresses, but a module on end-hosts resolves an AA IP address to the specific location address (LA) of the destination’s top-of-rack switch, then encapsulates the AA packet within an LA-addressed IP header.  This module also serves to intercept ARP requests for AA addresses and convert these ARP broadcasts into unicast lookups to a centralized directory service (much like in PortLand and Ethane).  While this addressing mechanism replaces broadcasts and serves as an indirection layer to support address migration, it doesn’t itself provide good bisectional bandwidth.  For this, VL2 uses both Valiant Load Balancing (VLB) and Equal-Cost Multi-Path (ECMP) routing to spread traffic uniformly across network paths.  VLB makes use of a random intermediate waypoint to route traffic between two endpoints, while ECMP is used to select amongst multiple equal-cost paths.  With multiple, random intermediate waypoints, communication between two racks can follow multiple paths, and ECMP provides load balancing between those paths.  So, taken together, a packet from some source AA to a destination AA will first be encapsulated with the LA address of the destination’s ToR switch, which in turn is encapsulated with the address of an intermediate switch for VLB.  (This double encapsulation is illustrated in a figure in the VL2 paper.)  Interestingly, all of VL2’s mechanisms are built on backward-compatible network protocols (packet encapsulation, ECMP, OSPF, etc.).
  • BCube from MSR Asia specifically focused on scaling bisectional bandwidth between nodes in shipping-container-based modular datacenters.  This is a follow-on work to their DCell architecture from the previous year’s SIGCOMM.  BCube (and DCell) are, at their core, more complex interconnection networks (as compared to today’s fat trees) for wiring together and forwarding packets between servers.  They propose a connection topology that looks very much like a generalized hypercube.  So we are in fact seeing the rebirth of complex interconnection networks that were originally proposed for parallel machines like Thinking Machines’ CM-5; now it’s just for datacenters rather than supercomputers.  Actually wiring together such complex networks might be challenging, however, compared to today’s fat-tree architecture.
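
To make the PortLand scheme mentioned above a bit more concrete, here is a toy sketch of hierarchical pseudo-MAC assignment in Python. The pod.position.port.vmid layout follows the 48-bit PMAC format described in the PortLand paper, but the helper and the directory mapping below are purely illustrative, not PortLand’s actual code.

   def pmac(pod: int, position: int, port: int, vmid: int) -> str:
       # Toy PortLand-style PMAC: pod (16 bits) . position (8) . port (8) . vmid (16).
       value = (pod << 32) | (position << 24) | (port << 16) | vmid
       return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

   # Hypothetical directory service: permanent MAC (identifier) -> PMAC (locator).
   # Edge switches rewrite between the two, so a VM keeps its MAC under migration.
   directory = {"02:19:b9:fa:54:d2": pmac(pod=2, position=1, port=7, vmid=1)}
   print(directory)  # {'02:19:b9:fa:54:d2': '00:02:01:07:00:01'}

Because the pod and position fields occupy the high-order bits, upstream switches can forward on PMAC prefixes rather than keeping per-host forwarding entries.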

So these proposals all introduced new L2/L3 network architectures for the datacenter.  In HotNets, we saw more proposals for using different technologies, as opposed to new topologies (to borrow a phrase from here):

  • A group from Rice, CMU, and Intel Pittsburgh argued that traditional electrically-switched networks (i.e., Ethernet) in the datacenter should be accompanied by an optically-switched network for bulk data distribution.  This proposal takes advantage of reconfigurable optical switches to quickly set up light-paths between particular hosts for large transfers.  So their resulting system is a hybrid:  using electrical, packet-switched networks for small flows or for those requiring low-latency, but using optical, circuit-switched networks for large flows that can withstand the few millisecond delay necessary to reconfigure the optical switches.   And these types of larger flows are especially prevalent in data-intensive workloads (e.g., scientific computing and MapReduce).
  • A group from Microsoft Research focused on a similar problem—the paper’s presenter even joked that he could just skip his motivation slides.  But instead of proposing a hybrid electric/optical network, they argued for a hybrid wired/wireless network, where wireless links are used to provide a “fly-way”  between servers.  Instead of using these additional links for large transfers (as in the above proposal), however, this work uses these additional links to handle transient overages on the existing wired network.  Because it’s a wireless network, one doesn’t need to physically wire them up in place; the paper suggests that wireless connections in the 60GHz band might be especially useful given some prototypes that achieve 1-15 Gbps at distances of 4-10m.  The paper also discusses wired fly-ways by using additional switches to inter-connect random subsets of ToR switches, but the talk seemed to focus on the wireless case.

Either way, it’s interesting to see competing ideas for using different technologies to handle bisectional capacity problems (whether transient or persistent).

HotNets’ second thematic area of datacenter network papers considered managing all this new complexity.  There were two papers on NOX, which is a network controller for managing OpenFlow networks.

  • The first paper asked whether we need special networking technologies to support new datacenter architectures (the talk focused specifically on the problem of building VL2), or whether we could construct similar functionality via NOX and OpenFlow switches.  They found (perhaps not surprisingly) that NOX could be sufficient.
  • The second NOX paper focused on greater support for virtualized end-hosts.  Open vSwitch is meant to work with various end-host virtualization technologies (Xen, KVM, etc.) and provide functionality for managing their virtual network interfaces (instead of, e.g., Xen’s simple Ethernet bridge).  Open vSwitch can be used, for example, to set up particular routes between VM instances or to provide encapsulation and tunneling between VMs.  The latter could enable L3 migration of VMs, with a VM’s old physical location forwarding packets to the VM’s new location (akin to Mobile IP).  Traditional VM migration, on the other hand, uses ARP spoofing and is thus limited to migration between hosts on the same L2 network.

This ability to perform more fine-grained network resource management is very interesting.  While most of the papers above (except the last one) focus on supporting L2/L3 addressing and connectivity, our own SCAFFOLD project looks at the higher-level problem of supporting wide-area, distributed services.  Distributed services typically scale out by replicating functionality across many machines; for customer-facing services, some combination of additional mechanisms is used for server selection and failover:  DNS, BGP tricks and IP anycast, VRRP, VIP/DIP load balancers, ARP spoofing, etc.  All this complexity arises because, while clients are trying to access replicated services, the Internet provides unicast communication between individual hosts. Thus, SCAFFOLD explores building a network architecture that supports communication between services, instead of network interfaces, and that explicitly supports churn (whether planned or unplanned) amongst the set of end-hosts composing that service. I’ll expand on SCAFFOLD’s motivation and design more in a future post.

CoralCDN Lesson: The great naming conflation of the Web

The last post argued how CoralCDN’s API through domain manipulation provided a simple yet surprisingly powerful content delivery mechanism.  Unfortunately, its technique flies in the face of the web’s use of domain names.

Conflating naming, location, and authorization, browsers use domains for three purposes:

  1. Domains provide a human-readable name for what administrative entity a client is interacting with (e.g., the “common name” identified in SSL server certificates).
  2. Domains specify where to retrieve content after they are resolved to IP addresses (through DNS).
  3. Domains specify what security policies to enforce on web objects and their interactions, especially as it relates to browser Same Origin Policy (SOP).

CoralCDN’s domain manipulation clearly focuses on the location/addressing aspect of web objects (#2).  And while it has generated abuse complaints given its naming (#1)—either from sites complaining about “illegal mirroring,” third-parties mistakenly issuing DMCA take-down notices, or from those fearing phishing attacks—its most serious implications apply to browser security (#3).

The Same Origin Policy in browsers specifies how scripts and instructions from an origin domain can access and modify browser state.  This policy most significantly applies to manipulating cookies, browser windows, frames, and documents (through the DOM), as well as to requesting URLs via an XmlHttpRequest. At its simplest level, all of these behaviors are only allowed between resources that belong to the identical origin domain.  This provides security against sites accessing each other’s private information kept in cookies, for example.  It also prevents websites that run advertisements (such as Google’s AdSense) from easily performing click fraud and paying themselves advertising dollars by programmatically “clicking” on the advertisements shown on their site.  (This is enforced because advertisements like AdSense are loaded in an iframe that the parent “document”—the third-party website that stands to gain revenue—cannot access, as the frame belongs to a different domain.)

One caveat to the strict definition of an identical origin (per RFC 2965) is that it provides an exception for domains that share the same domain.tld suffix, in that www.example.com can read and set cookies for example.com.  Consider, however, how CoralCDN’s domain manipulation affects this.  When example.com is accessed via CoralCDN, it can manipulate all nyud.net cookies, not just those restricted to example.com.nyud.net.  Concerned with the potential privacy violations from this, CoralCDN does not “support” cookies, in that its proxies delete any Cookie or Set-Cookie HTTP headers.

Many websites now manage cookies via javascript, however, so cookie information still “leaks” between Coralized domains on the browser. This often happens without a site’s knowledge, as sites commonly use the URL’s domain suffix without verifying its name. Thus, if the Coralized example.com writes nyud.net cookies, these will be sent to evil.com.nyud.net if the client visits that webpage. Honest CoralCDN proxies will delete these cookies in transit, but attackers can still circumvent this safeguard.  For example, when a client visits evil.com.nyud.net, javascript from that page can access nyud.net cookies, then issue an XmlHttpRequest back to evil.com.nyud.net with the cookie information embedded in the URL.  These problems are mitigated by other security decisions: as CoralCDN does not support https or POST, it is unlikely that sites will establish authenticated sessions over it.  Given these attack vectors, however, simply opening up CoralCDN to a peer-to-peer deployment as is would introduce significant risk.  Similar attacks would be possible against other uses of the Same Origin Policy in the browser, especially as it relates to the ability to access and manipulate the DOM.
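
The leak comes down to suffix-based domain matching. Below is a deliberately simplified sketch of that rule—real browser behavior, and the public-suffix handling that later mitigated such problems, is more involved—just to show why a cookie scoped to nyud.net reaches every Coralized site.

   def cookie_visible_to(cookie_domain: str, host: str) -> bool:
       # Simplified RFC 2965-style domain match: a cookie scoped to
       # '.nyud.net' is sent to any host ending in that suffix.
       cookie_domain = cookie_domain.lstrip(".")
       return host == cookie_domain or host.endswith("." + cookie_domain)

   # A cookie that example.com.nyud.net sets for the shared suffix...
   cookie_visible_to(".nyud.net", "example.com.nyud.net")   # True
   # ...is also sent to an attacker-controlled Coralized domain.
   cookie_visible_to(".nyud.net", "evil.com.nyud.net")      # True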

These issues demonstrate other challenges with deploying a secure, cooperative CDN, beyond the problem of finding the right “tradeoff” I talked about previously. It may be attractive to consider using end-hosts in a peer-to-peer fashion, perhaps even embedding proxy software in resource containers or VMs to satisfy those users’ concerns.  If clients and servers can be slightly modified, end-to-end signatures (as in RFC 2660 and Firecoral) can help ensure the integrity of content distributed through an untrusted proxy network.  Similar care would still need to be taken, however, to ensure the appropriate confidentiality of user-specific information.

In fact, these are some of the very challenges and approaches we are tackling with Firecoral, which seeks to build a P2P-CDN by running “cooperative proxies” as a browser extension of participating peers. We’re actively working towards a release; hopefully any day now!

CoralCDN Lesson: The interface was right -or- Programming elastic CDN services

While my previous post argued that CoralCDN’s architecture design might not be ideal given its deployment, it has proven successful from the simple perspective of real-world use. Rather than any technical argument, we believe that the central reason for its adoption has been its simple user interface: Any URL can be requested through CoralCDN by appending nyud.net to its hostname.

Interface design

While superficially obvious, this interface design achieves several important deployment goals:

  • Transparency: Work with unmodified, unconfigured, and unaware web clients and web servers.
  • Deep caching: Support the automatic retrieval of embedded images or links also through CoralCDN when appropriate.
  • Server control: Not interfere with sites’ ability to perform usage logging or otherwise control how their content is served (e.g., via CoralCDN or directly).
  • Ad-friendly: Not interfere with third-party advertising, analytics, or other tools incorporated into a site.
  • Forward compatible: Be amenable to future end-to-end security mechanisms for content integrity or other end-host deployed mechanisms.

Consider an alternative, even simpler, interface design. RedSwoosh, Dijjer, FreeCache, and CoBlitz, among others, all embedded origin URLs within the URL’s relative path, e.g., http://nyud.net/example.com/file. Not only is HTTP parsing simpler, but their nameservers do not need to synthesize DNS records on the fly (unlike CoralCDN’s DNS servers for *.nyud.net) and can take better advantage of client-side DNS caching. Unfortunately, while such an interface supports the distribution of specifically named files, it fails to transparently load an HTML webpage: any relative embedded links would lack the example.com prefix, and a proxy would thus be unable to identify to which origin domain they refer. (One alternative might be to try to rewrite pages to add such links, although active content such as javascript makes this notoriously difficult, even ignoring the goal of not modifying server content.)

CoralCDN’s approach, however, interprets relative links with respect to a page’s Coralized hostname, and thus transparently requests these objects through it as well. But because CoralCDN does not modify body content, all absolute URLs continue to point to their origin sites. Thus, third-party advertisements are largely unaffected, and origin servers can use simple web beacons to log clients. Origin sites retain control over how their content is displayed and, down the line, content may be amenable to verification through end-to-end content signatures (as in RFC 2660) or web tripwire tricks.
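
The distinction is easy to see with standard URL resolution. The snippet below is only illustrative, using Python’s urljoin and a root-relative link to contrast the two interface styles.

   from urllib.parse import urljoin

   # Hostname-based interface (CoralCDN): a relative link stays on the
   # Coralized hostname, so the origin (example.com) remains recoverable.
   urljoin("http://example.com.nyud.net/story.html", "/images/photo.jpg")
   # -> 'http://example.com.nyud.net/images/photo.jpg'

   # Path-based interface (origin embedded in the path): the same link
   # loses the example.com prefix, so the proxy can no longer tell which
   # origin the request refers to.
   urljoin("http://nyud.net/example.com/story.html", "/images/photo.jpg")
   # -> 'http://nyud.net/images/photo.jpg'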

An API for dynamic adoption

CoralCDN was envisioned with manual URL manipulation in mind, whether by publishers editing HTML, users typing Coralized URLs, or third-party posters to Usenet groups or web portals submitting Coralized URLs. After deployment, however, users soon began treating CoralCDN’s interface as an API for accessing CDN services.

On the client side, these techniques included simple browser extensions that offer “right-click” options to Coralize links or that provide a CoralCDN link when a page appears unavailable. They also extended to more complex integration with frameworks like Firefox’s Greasemonkey. Greasemonkey allows third-party developers to write site-specific javascript code that, once installed by users, manipulates a site’s HTML content (usually through the DOM interface) whenever the user accesses it. CoralCDN scripts for Greasemonkey include ones that automatically rewrite links, or that add Coralized links (in text or via tooltips) to posted articles on Slashdot, digg, or other portals. CoralCDN was also integrated directly into a number of client-side podcasting tools, such as Plone’s Plodcasting, Juice Receiver, and Easypodcast. Given the view that podcasting served to “democratize” Internet radio broadcasting, this seemed to fit quite well with CoralCDN’s stated aim of “democratizing content publication”.

But perhaps the more interesting cases of CoralCDN integration are those on the server-side. In flash-crowd scenarios, smaller websites might become overloaded for a variety of reasons: bandwidth-limited to serve larger files (especially due to hosting contracts), CPU-limited given expensive scripts (e.g., PHP), or disk-limited given expensive database queries. At the same time, their webserver(s) can often still handle the network interrupt and processing overhead for simple HTTP requests. And further, websites often still want to get complete logs for all page accesses, especially given Referer headers. Given such scenarios, a common use of CoralCDN is for origin servers to directly receive an HTTP request, but respond with an HTTP redirect (302) to a Coralized URL that will serve the actual content.

This is as simple as installing a server plugin and writing a few lines of code. For example, the complete dynamic redirection rule using Apache’s mod_rewrite plugin is the following:

   RewriteEngine on
   RewriteCond %{HTTP_USER_AGENT} !^CoralWebPrx
   RewriteCond %{QUERY_STRING} !(^|&)coral-no-serve$
   RewriteRule ^(.*)$ http://%{HTTP_HOST}.nyud.net%{REQUEST_URI} [R,L]

while similar plugins and scripts exist for other web platforms (e.g., the WordPress blogging suite).

Still, redirection rules must be crafted somewhat carefully. In the above example, the second line checks whether the client is a CoralCDN proxy and thus should be served directly. Otherwise, a redirection loop could be formed. Numerous server misconfigurations have omitted such checks; thus, CoralCDN proxies check for potential loops and return errors if present. Amusingly, some early users during CoralCDN’s deployment caused recursion in a different way. By submitting URLs with many copies of nyud.net appended to the hostname suffix:

   http://example.com.nyud.net.nyud.net....nyud.net/

they created a form of amplification attack against CoralCDN. This single request caused a proxy to issue a number of requests, stripping the last instance of nyud.net off in each iteration. Such requests are now rejected.
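
A minimal sketch of this kind of rejection check follows; the function name and error handling are illustrative rather than CoralCDN’s actual code.

   def coralized_origin(host: str) -> str:
       # Strip a single trailing '.nyud.net' to recover the origin hostname,
       # rejecting stacked suffixes that would trigger recursive fetches.
       suffix = ".nyud.net"
       if not host.endswith(suffix):
           raise ValueError("not a Coralized hostname")
       origin = host[:-len(suffix)]
       if not origin or origin.endswith(suffix):
           raise ValueError("nested Coralized hostname rejected")
       return origin

   coralized_origin("example.com.nyud.net")            # -> 'example.com'
   coralized_origin("example.com.nyud.net.nyud.net")   # raises ValueError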

While the above dynamic rewriting rules apply for all content, other sites incorporate URL Coralization in more inventive ways:

   RewriteCond %{HTTP_REFERER} slashdot\.org [NC,OR]
   RewriteCond %{HTTP_REFERER} digg\.com [NC,OR]
   RewriteCond %{HTTP_REFERER} blogspot\.com [NC]

Combined with the above, these rules redirect clients to CoralCDN if and only if the requester originates from particular high-traffic portals. In Apache, such rules can be specified in .htaccess files and thus do not require administrative privileges. Other sites have even combined such tools with server plugins that monitor server load and bandwidth use, so that their servers only start rewriting requests under high load conditions.

These examples show users innovating with CoralCDN’s simple interface, which can be accessed like any other URL resource. We have even recently seen Coralized URLs being dynamically constructed within client-side Flash ActionScript. Indeed, CoralCDN’s most popular domain as of January 2009 was a Tamil imitation of YouTube that loads Coralized URLs from Flash animations of “recently watched” videos.

An Elastic Computing Resource

One of the most interesting aspects of these developments has been the adoption of CoralCDN as an elastic resource for content distribution, long before the term “cloud computing” was popularized and Amazon began offering CDN and other “surge” services.  Through completely automated means, work can be dynamically expanded out to CoralCDN when websites require additional bandwidth, and contracted back when flash crowds abate. Still without prior registration, sites can even choose among several options for how they would like CoralCDN to handle their requests. X-Coral-Control headers returned by webservers provide in-band signaling and are saved as cache meta-data; one option is whether to “redirect home” when a domain exceeds its bandwidth limit (per our previous post). But again, this type of control illustrates CoralCDN’s interface as a programmable API.
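
For a sense of how little is involved on the origin side, here is a minimal sketch of a server attaching such a control header. The directive spelling below is an assumption based on the “redirect home” behavior described above, not confirmed header syntax.

   # Minimal WSGI app attaching CoralCDN's in-band control header.
   # The 'redirect-home' value is an assumption, not confirmed syntax.
   def app(environ, start_response):
       headers = [("Content-Type", "text/html"),
                  ("X-Coral-Control", "redirect-home")]
       start_response("200 OK", headers)
       return [b"<html><body>served via CoralCDN</body></html>"]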

Admittedly, CoralCDN can provide free service (and avoid registration) because it operates on a deployment platform, PlanetLab, comprised of volunteer research sites.  On the flip side, CoralCDN’s popularity led it to quickly overwhelm the bandwidth resources allocated to PlanetLab by affiliated sites, leading to the fair-sharing mechanisms we described earlier.  My next (and final) post about our experiences with CoralCDN asks whether we should just move off a trusted platform like PlanetLab and accept untrusted operators.