Category Archives: Cloud Computing

Consistency, Availability, and Geo-Replicated Storage

For the past few years, we’ve been working on problems related to geo-replicated storage. We want to store data “in the cloud,” but that data should reside within multiple datacenters, not just in a single one.  When data is geographically replicated in such a fashion:

  • Users can experience lower latency by accessing a datacenter near to them, rather than one halfway around the world.
  • Network or system failures at a single datacenter doesn’t make the service unavailable  (even for data stored at that site).

This is common practice today.  Google runs multiple datacenters around the world, and Amazon Web Services offers multiple “Availability Zones” that are supposed to fail independently.

When data is replicated between locations, an important question arises about the consistency model such a system exposes.  Wyatt Lloyd has been tackling this question in his recent COPS and Eiger systems.  The  problem space this work explores — between giving up on any consistency guarantees one can reason about and just going with “eventual” consistency on one extreme, and giving up on availability guarantees to gain strong consistency and real transactions on the other — is going to be an increasingly important one.

Normally, folks think that the CAP Theorem tells us these two choices are fundamental.  But the key point is that CAP doesn’t tell us that eventual consistency is required, just that (as Partitions can happen) one can’t have both Availability and Strong Consistency (or more formally, linearizability).  It doesn’t tell us anything about consistency models that are weaker than linearizability yet stronger than “eventual.”  And that’s where COPS and Eiger come in.

One of our collaborators at CMU, Dave Andersen, recently wrote-up a more accessible discussion of these systems, and the causally-consistent data model they expose.   With the explosion of new data storage systems, particularly of the NoSQL variety, it’s important for folks to realize that there’s a (powerful and practical) choice between these two extremes.

Caring about Causality – now in Cassandra

Over the past few years, we’ve spent a bunch of time thinking about and designing scalable systems that provide causally-consistent wide-area replication.  (Here, “we” means the team of Wyatt Lloyd, Michael Freedman, Michael Kaminsky, and myself;  but if you know academia, you wouldn’t be surprised that about 90% of the project was accomplished by Wyatt, who’s a graduating Ph.D. student at the time of this writing.)  I’m posting this because we’ve finally entered the realm of the practical, with the release of both the paper (to appear at NSDI’13) and code for our new implementation of causally-consistent replication (we call it Eiger) within the popular Cassandra key-value store.

Read Dave’s full post here.

ACM Symposium on Cloud Computing (SOCC 2010) Day 2

I’m back in Princeton after spending the week in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk from Day 2 of the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.

[See my previous post for Day 1]

Keynote 2

Building Facebook: Performance at Massive Scale
Jason Sobel (Facebook)

Jason gave a great presentation about the architecture at Facebook and some of the lessons their engineering team has learned in the last several years. In the last three years, Facebook has seen a factor of 20 growth in the number of users they support – from 1 million users in 2007 to 400 million users in 2010. With this high rate of growth, the infrastructure team has had to keep the site running in addition to innovating.

The traditional website scaling technique is to have all of a user’s data stored in a single server so that data can be partitioned easily and therefore scale. The social network that Facebook has can’t use this method because the data is interconnected and there is no way to partition the user graph. Instead, Facebook uses a number of techniques – aggressive caching, a multi-tiered infrastructure, and innovative APIs for developers that allow for applications to easily scale.

The three tiered structure that Facebook uses consists of:

  1. a web server front end running PHP that processes user requests. Not much to say here except that Facebook now compiles PHP to C++ to speed up execution
  2. a caching layer that uses memcached which is simple, fast, and reliable. They have 10’s of TBs of memory for caching!
  3. a database layer that uses MySQL. They have 8,000 server-years of run time experience without data loss or corruption. Since most requests are cached, the traffic to MySQL stays low, but they require a 3:1 ratio of the innodb table size to RAM for each server in order to keep up with request rate.

One interesting caveat here is that they are double caching most data. It’s hard to provide enough caching space to only cache in the memcache layer or the database layer, so the double-caching approach is the simplest option right now for Facebook, but looking into reducing this in the future.

Jason also talked about their Hadoop/Hive/Scribe system used to run MapReduce queries as well as a system called TAO that they use to provide a simple API to their developers for writing applications. Rather than developers having to write database and memcache queries, TAO is an API-aware layer built on top of memcache that provides easy to use abstractions.

Overall, a great talk with nice insight into Facebook’s infrastructure.


Hermes: Clustering Users in Large-Scale E-mail Services
Thomas Karagiannis (Microsoft Research), Christos Gkantsidis (Microsoft Research), Dushyanth Narayanan (Microsoft Research) , Antony Rowstron (Microsoft Research)

The Hermes project comes out of MSR where they are looking into optimizing the storage of e-mail messages in Microsoft Exchange. The normal implementation for Exchange is to take a new user account and locate their messages on any server in the Exchange cluster that has space available. By doing this, it splits load across the Exchange servers well.

The problem is that when two users that are on different physical servers receive the same message in their mailbox, the message has to be stored twice – once on each server. Hermes looks into trying to cluster users to minimize the amount of storage redundancy and network traffic between servers.

An example system was given that has 128K users on 68 exchange servers (1800 users/server). The sample size was 22 weeks of e-mail which contained 337M messages. They calculated that there is 3TB of extra storage required per week to store messages. By using a clustering algorithm to cluster together users that frequently e-mail each other, Hermes was able to save 55TB of storage over the 22 weeks of e-mail. Since moving user mailboxes is expensive, they suggest partitioning once and then re-partitioning once a week.

This was nice work, and seems like it solved a big problem with Exchange, although I question the architecture of Exchange to begin with. Services like Gmail do a much better job of storing e-mail messages and don’t run into problems like this.

Defining Future Platform Requirements for e-Science Clouds
Lavanya Ramakrishnan (Lawrence Berkeley National Lab), Keith Jackson(Lawrence Berkeley National Lab) , Shane Canon (Lawrence Berkeley National Lab) , Shreyas Cholia (Laurence Berkeley National Lab) , John Shalf (Lawrence Berkeley National Lab)
[position paper]

Keith talked about the current state of how (non-computer) scientists use cloud services to do their work. Many scientists use things like Hadoop to process large datasets (e.g. Large Hadron Collider). There are three main types:

  1. Large scale synchronous workloads – generally run today on supercomputers
  2. Medium scale synchronous workloads – can use local clusters
  3. Large scale asynchronous workloads – these can be parallelized well so can be run on cloud services

He talked about how the interfaces to cloud services are not standardized, so once chosen, scientists are locked in to a single provider. The suggestion was that cloud services should be more standardized and make it easier to share and collaborate among scientists.

Programming Models and Optimization

Fluxo: A System for Internet Service Programming by Non-expert Developers
Emre Kiciman (Microsoft Research), Benjamin Livshits (Microsoft Research), Madanlal Musuvathi (Microsoft Research), Kevin Webb (University of California San Diego)

This work focused on trying to provide a way for non-experts to write Internet services that scale well. Non-experts now have access to large scale cloud platforms like Amazon, Google, and Microsoft, so hardware is no longer the bottleneck when needing to scale an application. Instead, it’s difficult to architect an application.

The current techniques for scaling Internet services are things like tiering, partitioning, replication, data duplication and de-normalization, pre-computation, and others. These techniques are notoriously difficult to build and deploy correctly. It would be nice for non-experts to have a way to write applications using these techniques easily.

In comes Fluxo – a compiler that takes input in the form of a restricted programming language and translates it to run on Azure. The restricted programming language simplifies program analysis and allows for the use of the common techniques listed above. Fluxo is profile driven, so it collects metrics, analyzes the program, transforms it, and repeats.

In an evaluation, pre-computation can get benefits of up to 60% in performance, and caching can decrease latency by up to 50%. Seems like a nice idea – I can see small to medium size companies finding a platform like this useful.

Nephele/PACs: A Programming Model and Execution Framework for Web-Scale Analytical Processing
Stephan Ewen (TU Berlin) , Fabian Hueske (TU Berlin) , Daniel Warneke (TU Berlin) , Dominic Battré (TU Berlin) , Volker Markl (TU Berlin) , Odej Kao (TU Berlin)

Data-centric parallel programming uses schema-free systems like MapReduce and Hadoop to run efficiently. Relational databases, on the other hand, are schema bound, have well-defined properties, are flexible and optimizable, but don’t scale well. This work looks at extending the capabilities of MapReduce. The notion of first-order and second-order function is introduced. First-order functions are user code, while second order functions are things like the Map and Reduce steps today.

PACTS extends the MapReduce paradigm with some additional functions. For example:

  • Cross (two inputs) – each combination of records from the two inputs is built and is independently processable
  • Match (two inputs) – each combination of records with equal keys from the two inputs is built
  • CoGroup (many inputs) – pairs with identical keys are grouped for each input, group of all inputs with identical keys are processed together

PACT programs are compiled to parallel data-flows to be executed in Nephele. Nephele is an execution engine for parallel data-flow programs. It provides the process scheduling, communication, and fault-tolerance mechanisms needed to execute the workload. Example given of PACT workloads are TCP-H aggregation and K-Means Clustering. I’m not too familiar with running MapReduce workloads, so not sure of the impact this work will have.

The Case For PIQL: A Performance Insightful Query Language
Michael Armbrust (UC Berkeley), Nick Lanham (UC Berkeley), Stephen Tu (UC Berkeley) , Armando Fox (UC Berkeley) , Michael Franklin (UC Berkeley) , David Patterson (UC Berkeley)
[position paper]

This work looked at the tradeoff between RDBMS’s and NoSQL-type databases. RDBMS’s have powerful declarative query languages, have a query optimizer that decides the best execution plan, has automatic parallelism, and is performance-opaque. Contrastingly, NoSQL has a simple query interface (get/set), but complex queries have to be implemented at the application layer and are non-trivial to architect.

The argument is made to use a system called PIQL which is designed to be a middle ground between the two extremes. PIQL is a scale-independent declarative language that only allows developers to write queries with a data-size independent upper bound. PIQL is able to say no to certain queries, where as an RDBMS can’t.

Most applications have natural cardinality constraints. Example given was that a Facebook user has an upper limit of 5000 friends. PIQL lets developers specify these constraints in their data model so the optimizer can use them to decide on an optimized execution plan. If a query would be too slow, PIQL rejects it. PIQL is designed to run on top of existing key/value data-stores and it scales well. I think this is the right direction for providing easier ways to write NoSQL-type applications. I’m interested in looking more at what their API provides.

Towards Automatic Optimization of MapReduce Programs
Shivnath Babu (Duke University)
[position paper]

Shivnath described some of the difficulties with using MapReduce today. The lifecycle of a MapReduce job is that the programer first provides the Map and Reduce functions, kicks off the job, which gets split into many Map waves, followed by many Reduce waves, until it finishes.

The complexity here is that the user has to specify things like the number of maps, the number of reduces, and memory usage. There are over 190 parameters that a user can specify in Hadoop, and for many, it isn’t clear how they impact performance. As an experiment, a 50GB terasort was run on EC2. By changing parameters, the running time drastically changed, and certain parameters depend on each other, so individual parameters can’t be tuned separately.

The argument was that a better approach should be used for default parameters in Hadoop. There has been a lot of work in traditional RDBMS’s and these techniques should be leveraged to try and tune MapReduce workloads.

Benchmarking and Testing

Benchmarking Cloud Serving Systems with YCSB
Brian Cooper (Yahoo! Research) , Adam Silberstein (Yahoo! Research) , Erwin Tam (Yahoo! Research) , Raghu Ramakrishnan (Yahoo!) , Russell Sears (Yahoo! Research)

This work looks at how to properly evaluate a NoSQL cloud database. There are many systems out there – PNUTS, BigTable, Megastore, Azure, Cassandra, CouchDB, and many more, but the tradeoffs between systems are not clear. It would be nice if there was a standard benchmark that could be used to analyze the performance of each system so that developers can choose the appropriate solution to meet their needs.

In comes YSCB – an open source, standard benchmark tool for evaluating NoSQL systems. YCSB focuses on performance (latency and throughput) and scalability (adding servers, latency as system scales, and elasticity). The YCSB tool is written in Java, which provides easily extendable workloads.

A test setup was run on six server-class machines, and systems evaluated are Cassandra, HBase, sharded MySQL, and PNUTS. Some of the example workloads given were (more in paper):

  • A – Update heavy workload that had 50:50 ratio of reads to updates
  • B – Read heavy workload with 95:5 ratio of reads to updates
  • E – Short range scans of 1 to 100 records of size 1KB
  • Elasticity – Run a read-heavy workload while adding servers from 2 to 6.

The graphs showed a clear distinction between different systems and provided insight to which systems would be best in different situations. It was also noted that these tests can be used to improve the database system, as happened with Cassandra which improved dramatically in subsequent version releases.

This looks like a great tool for evaluating the plethora of NoSQL systems out there. It would be nice to have a standard setup (e.g. on Emulab) so that the comparison between systems is fair and repeatable.

Automated Software Testing as a Service
George Candea (EPFL) , Stefan Bucur (EPFL) , Cristian Zamfir (EPFL)
[position paper]

Stefan pointed out that when we install privileged software today (like drivers) we have to take their reliability on faith. Third-party drivers run with the highest privilege in the operating system, so if they have bugs or malicious code, serious problems can occur. The analogy made was to cars – we wouldn’t give our car keys to a random driver without some guarantee of the car’s safety.

It would be nice to be able to test a driver by exploring all possible code paths and look for common bugs and problems. Testing as a Service (TaaS) is a system for doing just that. TaaS can be used by home users by uploading a driver to a website and evaluating it, by developers who can integrate testing into their IDE’s for convenience, and as a commercial certification service for software.

This seems like a nice idea, although I question how much traction it could get. A lot of Windows drivers today don’t get certified because companies like to cut corners and produce a product for the least amount of money possible. It seems like even if this service was provided, companies would still cut corners and just skip using it to save money. Being able to test as a home user is not that useful, as a failed test would just give you the option of not installing the software, the same option you have today. Perhaps if a big player like Microsoft started requiring certification for driver installation, it could get the traction it needed.

Keynote 3

The Internal Design of’s Multi-Tenant Architecture
Rob Woollen (

This was an interesting keynote. I wasn’t very familiar with exactly what Salesforce provided to its customers before this talk. I knew that a lot of companies used their services, but I didn’t realize the scale they were working at. Salesforce provides cloud services for 75,000 customers running 150,000 custom applications and serving 17,500 websites.

They provide many services, including a query language, API’s, OLAP, batch processing, search, mobile, content management, packaging, and many others. What’s interesting about Salesforce’s architecture, though, is that each customer (or “tenant” as they call them) is not given its own architecture. Instead, all tenants run on a single platform, supported by a single code base.

Each username is mapped to a tenant, which is mapped to an “instance”. An instance is a cluster of hardware including a SAN, application servers, load balancers, object storage, etc. To provide isolation, each tenant runs as a separate Java thread. They have a three-tier caching layer, first caching locally within each Java thread, then caching with remote requests, and finally providing an application-layer service to users running on memcache.

Underneath the cache is a persistence layer that uses Oracle. Every customer’s data is stored together in a table, with static (compile time) and dynamic (run time) query analysis that enforces that a customer can only read their own data. All data is stored as a string in the database, so to provide indexes, a separate table has a column for each possible type which can then be indexed.

Salesforce only maintains a single code base for their infrastructure. To provide compatibility with previous releases, functions are annotated with API version numbers, such that if a customer is still using an old version, they don’t get access to changes in the API. When customers package their own code, they are required to submit unit tests that cover most of their code paths, so when new versions are released, Salesforce uses customer tests to verify compatibility with old API versions. An analytics component is provided that does real-time calculations instead of offline processing like most other analytics systems.

This was a nice talk. The presentation was a little dry (Rob didn’t seem very enthusiastic about their services!) but it was still interesting.

Data Services

G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud
Sudipto Das (UC Santa Barbara) , Divyakant Agrawal (UC Santa Barbara) , Amr El Abbadi (UC Santa Barbara)

G-Store addresses the weakness of most NoSQL databases that they don’t provide transactions over multiple keys. Many NoSQL databases provide test-and-set type semantics for a single key, but can’t provide any guarantees about multiple keys. To do this, users have to implement any transactional logic in the application layer.

G-Store is a layer that runs on top of a key/value store and provides additional functions. Users can submit a list of keys to add to a group. When a group is created, a single node is assigned as the controller for this set of keys. Since a single node controls the group, it can execute transactions over the group. Once a user is finished, the group can be deleted.

This work is similar to our own CRAQ work, which can provide multi-key transactions over a chain. CRAQ only provides static group membership, in that it doesn’t support explicitly moving keys around after their initial creation, but by leveraging Sinfonia-type mini-transactions, it provides the same service as G-Store. I was disappointed that we weren’t cited.

Google Fusion Tables: Data Management, Integration and Collaboration in the Cloud
Alon Halevy (Google) , Hector Gonzalez (Google) , Jayant Madhavan (Google) , Christian Jensen (Aalborg University) , Jonathan Goldberg-Kidon (MIT) , Warren Shen (Google) , Rebecca Shapley (Google) , Anno Langen (Google)
[industrial presentation]

Google Fusion Tables is a service that allows data management for the web era. Google wants to provide data storage that can integrate seamlessly with the web, is searchable, embeddable, extensible, easy to use, provides incentives for sharing data, and facilitates collaboration.

The example given was biking trails in Europe. There is a website that allows users to upload GPS data of bike trails. They want to allow data like this to be integrated into the Google Maps API for visualization and allow users to query it easily, e.g. find trail lengths < 25 miles. Google Fusion Tables allows users to extend data tables in order to collaborate. Tables can be merged. For example, a user might merge the bike trails table with a sponsors table and a results table. Users can have discussions about the data, set permissions, and perform other types of operations. The architecture to support these operations has a front end dispatcher that receives requests, a query processing component, a backend that does plan execution, and a replicated storage layer running on BigTable. This seems like a nice, useful API. Looking forward to checking it out.

High Availability and Reliability

Making Cloud Intermediate Data Fault-Tolerant
Steve Ko (Princeton University) , Imranul Hoque (University of Illinois at Urbana-Champaign) , Brian Cho (University of Illinois at Urbana-Champaign) , Indranil Gupta (University of Illinois at Urbana-Champaign)

This work tries to address some shortcomings in Hadoop. In MapReduce, there is a two stage computation: the Map step and the Reduce step. Between processing steps, there is a large amount of intermediate data that has to get shipped from server to server. For example, Google’s indexing application runs a chain of 24 MapReduce jobs. In many cases, this intermediate data is larger than either the input data or output data.

If a MapReduce node fails, the computation it was supposed to provide has to be rerun on another node. The problem is that to rerun the computation, the intermediate data that was sent to the failed node has to be sent to the new node as input. If it’s not the first step in the chain, this intermediate data is no longer available – it was the output of the previous step. To recover it, the previous step has to be rerun as well. This is a problem when chaining multiple steps together, because the failure will cascade and cause every previous step to have to be rerun.

This work extends Hadoop by replicating this intermediate data. Unfortunately, this isn’t a simple thing to do because most MapReduce jobs are network-bound (verified experimentally), meaning that adding additional replication slows down the job. An observation is made, though, that often the bottleneck is at the inter-rack switches. The network capacity within a rack is not saturated. To take advantage of this, replication can be done at the rack level.

Failures really do occur often in large MapReduce clusters (verified from industry reports), and an experimental evaluation shows that when using rack-level replication, jobs can be sped up significantly when failures occur. When no failures occur, there is only a small overhead associated with the replication. This seems like a nice addition to Hadoop, and I hope to see it added to the main branch.

Characterizing Cloud Computing Hardware Reliability
Kashi Vishwanath (Microsoft Research) , Nachi Nagappan (Microsoft Research)

This work looks at hardware failures in large scale datacenters. Microsoft has 100K+ servers that are geographically distributed around the world running heterogeneous hardware. They observe lots of failures and have to deal with them with a support staff. Since a person has to diagnose hardware faults and perform the necessary fix, failures come at a high cost.

The dataset used here was from Microsoft’s online services. This runs things like Messenger, Hotmail, etc. The dataset was over a 14 month period, looking at 100K+ servers, and contained a work log from their datacenter. The work log records which server fails along with what component had to be replaced (e.g. disk, dimm, raid controller, etc).

The analysis they performed looked at characteristics in aggregate, failure rates across individual servers and datacenters, and tried to see if failures could be predicted. The method used was a decision tree. Some interesting takeaways:

  • For every 100 machines, 9.5 machines recorded a failure annually
  • For every 100 machines, 20 failures were recorded annually (so 2 repairs per machine)
  • 70.7% of failures were disks, 6.5% raid controller, 5.1% memory, 17.6% other
  • When a failure occurs, there is a 20% chance another failure will occur within 1 day!
  • Failure probability is correlated with number of disks in a server, slope is greater than 1!
  • Many times, a faulty raid controller would cause failures to be reported as a disk failure

A Self-Organized, Fault-Tolerant and Scalable Replication scheme for Cloud Storage
Nicolas Bonvin (EPFL), Thanasis Papaioannou (EPFL), Karl Aberer (EPFL)

I really have nothing to say about this talk. I couldn’t understand anything from the presentation at all. I verified it wasn’t just me – others that I asked didn’t understand it, and there were no audience questions either. This is a great example of how not to give a talk.

Storage and System Modeling

Robust and Flexible Power-Proportional Storage
Hrishikesh Amur (Georgia Institute of Technology) , James Cipar (Carnegie Mellon University) , Varun Gupta (Carnegie Mellon University) , Michael Kozuch (Intel Corporation) , Gregory Ganger (Carnegie Mellon University) , Karsten Schwan (Georgia Institute of Technology)

This work looked at trying to provide a distributed storage service that is power-proportional. This means that as the load on the system increases, the power consumption should also increase proportionally. This is not the case today – idling servers still use a large percentage of the required power for peak load.

One way to reduce power consumption is just to turn off servers during times of low demand, but traditional distributed storage services use methods like consistent hashing to get good load balancing, so if you shut off a server, the data on that server is no longer available.

Instead, what Rabbit does is store a primary replica of every key in a small number of nodes. A secondary replica goes to a larger number of nodes, a third replica to an even larger set of nodes, and a fourth replica to the largest number of nodes. By doing this, you can turn off 95% of your servers without losing availability. If you have 5 servers on, each serves 20% of requests. If you have 100 servers on, each serves 1% of requests.

This allows for fast, fine-grained scaling by simply powering off machines. An example benchmark of a 100GB terasort shows that read throughput scales linearly with percentage of maximum power used. However, write performance decreases because all data has to be written to a small number of nodes. A method is used from previous work to offload writes to get better write performance.

RACS: A Case for Cloud Storage Diversity
Lonnie Princehouse (Cornell University) , Hussam Abu-Libdeh (Cornell University) , Hakim Weatherspoon (Cornell University)

RACS is a method for combining multiple cloud platforms together for data storage, such that a single vendor doesn’t have to be relied on. Today, when choosing a cloud datastore, customers are often locked in because of the high cost of switching providers. To transfer a large amount of data from vendor A to vendor B, the customer must pay for the bandwidth cost of the transfer twice – once for each provider.

What RACS does is use erasure coding across many cloud platforms to reduce the costs associated with switching vendors. The notation RACS(k,n) is used, where it stripes data over n providers such that any k-subset contains a complete copy of the data. This can tolerate up to (n-k) failures before data is lost.

An evaluation used an FTP trace from the Internet Archive over 1.5 years. The cost of three providers over time were very similar. Using RACS is more expensive: RACS(1,2) is much more expensive, while RACS(8,9) is only slightly more expensive. However, the cost of switching providers is very expensive when not using RACS, on the order of two-months of operating costs, compared to much cheaper cost when using RACS. This is interesting work – I hope they implement more cloud platforms in addition to their current prototype.

Characterizing, Modeling, and Generating Workload Spikes for Stateful Services
Peter Bodik (UC Berkeley) , Armando Fox (UC Berkeley) , Michael Franklin (UC Berkeley) , Michael Jordan (UC Berkeley) , David Patterson (UC Berkeley)

Web applications often experience periods of spikes in traffic that are unpredictable. For example, Wikpedia, after the death of Michael Jackson, saw 13% of total requests to the single page of Michael Jackson. The first contribution of the paper was to characterize these spikes. There are two main types – workload volume spikes consisting of the time to peak, duration, magnitude, and maximum slope; and data hotspots consisting of the number of hotspots, spacial locality, and entropy.

The conclusion was that there is no typical type of spike because they vary widely. The next contribution was to try and generate workload spikes using the 7 characteristics listed above. They were able to create a generative spike model using two parameters N, the number of hotspots, and L, a clustering parameter.

After using the generative model, they were able to match real world examples they collected with their counterpart from the output of the generative model, validating that the model produces valid workload spikes.


SOCC turned out to be a great conference, and was definitely a success overall. The next SOCC conference will be co-hosted with SOSP in 2011 in Portugal, so I’m looking forward to that if I’ll be able to attend!

ACM Symposium on Cloud Computing (SOCC 2010) Day 1

I’m currently in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk at the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.

[See my next post for Day 2

Keynote 1

[Note: My SIGMOD paper was being presented opposite the first keynote, so Steve Ko who is also at the conference wrote this first summary.]

Evolution and Future Directions of Large-Scale Storage and Computation Systems at Google
Jeffrey Dean (Google)

Jeff Dean gave a keynote about different services running at Google and general principles on how to build large-scale services. The talk was roughly divided into three parts. The first part was about Google’s data centers that house a few hundreds of clusters. Each cluster has thousands of machines with one or a handful of configurations. Each machine (at least) runs GFS and Colossus (next-gen GFS), and a cluster scheduling daemon.

The second part was about Google’s back-end services including MapReduce, BigTable, and Spanner. A few interesting notes about these systems are:

  1. BigTable now has a dedicated team that manages BigTable service clusters. Because of this, there has been a lot of work on fair-share scheduling and performance isolation.
  2. BigTable has something called coprocessors, which are basically “arbitrary code that runs next to each tablet in table”.
  3. Spanner is a storage & computation system that runs across data centers. It supports the mix of strong & weak consistency models, fine-grained replication, “zone”-based hierarchy (1 master per zone), etc.

The third part was about experiences and design patterns for building large-scale services. There were too many design patterns to post here (Jeff said he would post slides on his website), but here are a few which I find interesting:

  1. Use back-of-the-envelope calculation whenever possible before choosing a design
  2. Design for growth, but don’t overdo it (e.g., 5x – 10x OK, but not 100x growth), and
  3. Canary requests: some requests crash every single process in the same service, so try a few machines first.

Overall, it was an interesting talk about quite a broad set of topics. There are only a few places where you can accumulate wisdom about building truly large-scale systems, and it is always interesting to see what they are doing to cope with the scale.

Operating Systems

An Operating System for Multicore and Clouds: Mechanisms and Implementation
David Wentzlaff (MIT) , Charles Gruenwald III (MIT CSAIL) , Nathan Beckmann (MIT CSAIL) , Kevin Modzelewski (MIT CSAIL) , Adam Belay (MIT CSAIL) , Lamia Youseff (MIT CSAIL) , Jason Miller (MIT CSAIL) , Anant Agarwal (MIT CSAIL)

This work addresses the issue of how to create applications that run in the cloud. Machines today have many cores (16 today, up to 1000 in five years), and parallelizing across many cores is very difficult. They created a new system called Fos (Factored Operating System) which runs on top of Xen and splits the OS into separate services. The kernel is a microkernel and all services run on top (e.g. File System). The communication between components in the system is done with message passing. There are no shared locks.

Because of this abstraction, each component can be adjusted elastically to handle demand. For example, if an application starts accessing the File System component at too high of a rate, the underlying system can spawn another File System component on another core or another machine and start transparently redirecting requests to the new component.

The implementation of Fos is fairly complete – some applications include a web server, slide viewer, video transcoder (ffmpeg), and busy box. In fact, it was revealed at the end of the talk that the presentation was running on Fos, and it didn’t crash! The system’s website is running on the Fos web server.

I’m not sure how this work will play out. I could see this becoming a standard approach as servers start having thousands of cores, but I’m not sure how applications here would be able to cope with network latency involved in inter-machine message passing.


Lithium: Virtual Machine Storage for the Cloud (can’t find this paper online yet)
Jacob Hansen (VMware) , Eric Jul (Bell Labs, Dublin)

This work looks at trying to increase the performance of shared file systems that are used by virtual machine clusters. In the traditional setup, a datacenter has a SAN server that hosts files for VMs in a cluster to access. The SAN is set up such that it is highly reliable and provides high throughput.

The problem with a SAN is that it’s expensive and it doesn’t scale very well to hundreds or thousands of servers accessing it simultaneously. It’s also limited by network bandwidth. VMware’s Lithium technology does away with the SAN and instead arranges the VM hosts into a peer-to-peer network and uses local storage to store files, replicating data for redundancy and performance.

The system still preserves the features required of a VM file server, e.g. cloning, snapshots, etc. It uses a branching method similar to some source control systems like Mercurial for quick copying of large files.

When compared to a SAN, Lithium doesn’t perform as well with a small number of hosts, but as the number of VM hosts increases, Lithium scales linearly while the SAN maxes out at a constant throughput. This approach seems like a great idea, and hope to see it pushed to production in future VMware releases.

Differential Virtual Time (DVT): Rethinking I/O Service Differentiation for Virtual Machines
Mukil Kesavan (Georgia Institute of Technology), Ada Gavrilovska (Georgia Institute of Technology), Karsten Schwan (Georgia Institute of Technology)

The presentation of this work was quite confusing, but the basic idea is that when doing fair scheduling for resources in a VM, ill effects are often encountered. For example, as is well known, TCP doesn’t react very well to congestion in the network (e.g. why TCP doesn’t perform very well over wireless networks), so when a VM host artificially limits a guest’s access to the network, the TCP congestion window will drop dramatically and negatively impact performance.

To try and solve this problem, DVT takes an approach of never sharply decreasing a guest’s share of the network. For example, if host X is receiving 100% of network access, but hosts W, Y and Z suddenly request access, rather than the traditional approach of reducing X immediately to 25%, DVT instead slowly reduces it over time. To keep access fair, DVT will eventually reduce X to lower than 25% to make up for the increased share it got previously.

By doing this, DVT increases the performance of VM guests by up to 25% since the TCP congestion window doesn’t drop sharply. It was mentioned that future work will look at applying the same method to disk access, although it isn’t clear to me how slowly reducing disk access instead of sharply reducing it would increase performance in an application.

Virtual Machine Power Metering and Provisioning
Aman Kansal (Microsoft Research) , Feng Zhao (Microsoft Research) , Jie Liu (Microsoft Research) , Nupur Kothari (USC) , Arka Bhattacharya (IIT Kharagpur)

This work asked the question “Can we tell how much power a VM guest is consuming?”. I was wondering what the motivation for measuring this was throughput the talk until it was finally mentioned at the end. I’ll start with the motivation first instead – the reasoning given was mainly to use knowledge of a VM guest’s consumption to provision your datacenter power accordingly. Other usages are to charge your cloud users according to power consumption (although I don’t buy this, as I don’t see how it would differ from billing with current methods – cpu, memory, storage, bandwidth), and to track which VMs are consuming the most power so you can target power reduction for “green” initiatives.

To answer this question is a two step process. First, they measure the power consumption of each component in the system when a context switch between VM guests happens by using OS-level events. Next, they have to figure out which VM guest is using each component. Rather than doing this live, they monitor a VM guest for a period of time to determine its power consumption and then use that value for future calculations. The reason for this is that different loads may use different internal power for the same externally visible power state, so learning a VM guest’s power profile over time is a more accurate measurement of power consumption.

There were a lot of technical details glossed over in this talk, and there were many formulas on the slides that weren’t explained or accompanied by variable descriptions, so I found the presentation somewhat confusing. I’m sure reading the paper would make this more clear.

Distributed and Parallel Processing

Stateful Bulk Processing for Incremental Algorithms
Dionysios Logothetis (UC San Diego), Christopher Olston (Yahoo! Research) , Benjamin Reed (Yahoo! Research) , Kevin Webb (UC San Diego) , Kenneth Yocum (UC San Diego)

This work targets large data applications. These are things like web analytics, graph mining, log analysis, and PageRank, which use massive amounts of data. An insight here is that these applications have to continually process on the order of TBs of new data per day and they are stateful, but the running time is proportional to the total amount of state, not proportional to the amount of new data.

Continuous Bulk Processing (CBP) provides users with an additional API of a Translate() function and a RouteBy() function similar to the map and reduce stages of MapReduce. Current systems only do “outer” grouping, while CBP allows for “inner” grouping so that only the state that needs to be accessed is shipped around. In an evaluation, this inner grouping method reduced running time by up to 53%.

Perhaps I’m not familiar enough with MapReduce, but the presentation went too fast for me to follow the details of the Translate and RouteBy API, so see the paper for details.

Comet: Batched Stream Processing for Data Intensive Distributed Computing
Bingsheng He (Microsoft Research), Mao Yang (Microsoft Research) , Zhenyu Guo (Microsoft Research) , Rishan Chen (Beijing University) , Wei Lin (Microsoft Research) , Bing Su (Microsoft Research) , lidong Zhou (Microsoft Research)

Comet is a system that tries to reduce the running time of large data-intensive applications that have work in common. The example given was a set of four jobs: one that computes the top ten hottest Chinese pages daily, another that computes the top ten hottest English pages daily, and corresponding jobs that compute the top ten hottest Chinese and English pages weekly. The first two jobs have the same input data, but have different filters in place, so the first step of each of those jobs is the same.

The flow of the system is: query series, normalization, logical optimization, physical optimization, execution plan, and then execution. By doing these optimization steps, Comet is able to reduce the amount of work done by 52% for jobs that have commonalities with previous jobs.

The evaluation showed that computing the top ten pages weekly was able to take advantage of the top ten daily calculation, but the top ten pages for the week don’t necessarily overlap with the top ten pages of each day, so it’s not clear how this works. This question was asked in the Q&A, but the author wasn’t able to answer. The presentation of this work was very confusing, and it was clear that the rest of the audience didn’t understand either. I’m sure reading the paper would make more sense.

Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon (University of Washington) , Balazinska Magdalena (University of Washington) , Bill Howe (University of Washington) , Jerome Rolia (HP)

This work addresses how to reduce the running time of large scientific experiments. The example given here was an application that takes astronomical images, does feature extraction to identify the celestial objects in the images, and then runs a Friends of Friends algorithm, which is used in astronomy to cluster celestial objects. MapReduce-type systems like Hadoop are a great fit for workloads like this, but it is hard to express complex algorithms and get good performance. As an example, this algorithm when first implemented took 14 hours to run, and after a week of working, they were able to reduce the running time to 70 minutes.

The reason these types of algorithms can take a long time to run is that the same amount of input data doesn’t always have the same running time (e.g. if the cluster of celestial objects is more dense, takes longer), so a static partitioning scheme doesn’t get good performance. An alternative is to use micro partitions, which reduces the impact of skew, but there is additional framework overhead and to find the sweet spot, the algorithm must be run many times, which is undesirable.

The SkewReduce algorithm takes a sampling approach to figure out the best partitioning scheme. In evaluation, this SkewReduce algorithm was able to reduce the running time of algorithms by 2-8 times. This seems like a nice scheduling algorithm, and I hope this finds its way into the main branch of Hadoop. A person who works at Google shared a similar optimization that Google uses, but they do their optimizations in the Reduce stage rather than the Map stage.

I will attempt to get my writeup for Day 2 posted late tomorrow.
[Update: Day 2 posted.]

New datacenter network architectures

This year’s HotNets workshop was held over the past two days in the faculty club at NYU; it was nice being on old turf.   The HotNets workshop has authors write 6-page “position” or “work-in-progress” papers on current “hot topics in networking” (surprise!).  Tucked into a cosy downstairs room, the workshop was nicely intimate and it saw lots of interesting questions and discussion.

One topic that was of particular interest to me were new ideas about datacenter networking; HotNets included two papers in each of two different research areas.

The first thematic area was addressing the problem of bisectional bandwidth within the datacenter. The problem is that each rack in a datacenter may have 40 machines, each potentially generating several Gbps of traffic.  Yet, all this traffic in typically aggregated at a single top-of-rack (ToR) switch, which often is the bottleneck in communicating with other racks.  (This is especially relevant for data-intensive workloads, such as Map Reduce-style computations.)  Even if the ToR switch is configured with 4-8 10Gbps links, this alone can be insufficient.  For example, an all-L2 network, while providing easy auto-configuration and supporting VM mobility, cannot take advantage of this capacity:  Even if multiple physical paths exist between racks, the Ethernet spanning tree will only use one.  In addition, the traditional method of performing L2 address resolution (ARP) is broadcast and thus not scalable.

Multiple solutions to this problem are currently being explored.  In SIGCOMM ’09 in late August, we saw three papers that proposed new L2/L3 networks to address this problem.  The first two leveraged the locator/ID split, so that virtual resources are named by some identifier independent of their location.

  • PortLand from UCSD effectively proposed a NAT layer for MAC addresses.  In Portland, virtual machines running on physical hosts can keep a permanent MAC address (identifier) even under VM migration, but their immediate upstream switch provides a location-specific “pseudo” MAC (the locator). Because these pseudo MACs are allocated hierarchically, they can be efficiently aggregated in upstream switches.  PortLand assumes that datacenter networks have a classic fat tree-like hierarchy between racks of servers, which is the typical network architecture in datacenters.  Instead of routing a packet to a VM’s MAC address, PortLand performs forwarding based on this pseudo-MAC; the VM’s upstream switch NATs between this PMAC and its permanent MAC. No end-host modifications need to be made (although one can certainly envision the host’s hypervisor to perform such NATting on the end-host itself before dispatching the packet to the unmodified VM). The sender is also unaware of this process, as its immediate upstream switch (resp. hypervisor) first translates the destination from MAC to PMAC before sending it out over the datacenter network. The question remains how to discover the MAC-to-PMAC binding, and Portland assumes a centralized lookup controller that stores these bindings and updates them under VM migration, much like the controller in Ethane. (In fact, Portland was prototyped using OpenFlow, which came out of Ethane.)
  • VL2 from MSR includes both L2 and L3 elements, unlike Portland’s L2-only solution. Changhoon Kim presented this work; he was a recent PhD graduate from Princeton (and in fact I served on the committee for his thesis, which included VL2). VL2 particularly focused on achieving both high bisectional bandwidth and good fairness between competing flows.    Using data collected from Microsoft’s online services, they found a high degree of variance between flows.  The implications of this is that a static allocation of bandwidth wouldn’t be all that promising, unless one managed to fully provision for high bisectional-bandwidth everywhere, which would be quite expensive. One of the particularly nice things about their evaluation (and implementation)—certainly aided by the support of an industrial research lab!—is that it ran on a  cluster consisting of 3 racks, 80 servers, and 10 switches, so provided at least some limited scaling experience.
    vl2On to specifics.  While Portland used “virtualized” Ethernet identifiers, VL2 assigns “virtualized” application-level IP addresses (AAs) to all nodes.  And rather than performing the equivalent of L2 NATing on the first-hop switches, VL2 uses unmodified switches but modifies end-hosts—in virtualized datacenter environments, it’s not terrible difficult to forklift upgrade your servers’ OS—to perform location-specific address (LA) resolution.  This is the identifier/locator split.  So applications use location-independent IP addresses, but a module on end-hosts resolves an AA IP address to a specific location address (LA) of the destination’s top-of-rack switch, then encapsulates this AA packet within an LA-addressed IP header.  This module also serves to intercept ARP requests for LA addresses and convert ARP broadcasts into unicast lookups to a centralized directory service (much like in PortLand and Ethane).   While this addressing mechanism replaces broadcasts and serves as an indirection layer to support address migration, it doesn’t itself provide good bisectional bandwidth.  For this, VL2 uses both Valiant Load Balancing (VLB) and Equal-Cost Multi-Path (ECMP) routing to spread traffic uniformly across network paths.  VLB makes use of a random intermediate waypoint to route traffic between two endpoints, while ECMP is used to select amongst multiple equal-cost paths.   With multiple, random intermediate waypoints, communication between two racks can follow multiple paths, and ECMP provides load balancing between those paths.  So, taken together, a packet from some source AA to a destination AA will first be encapsulated with the LA address of the destination’s ToR switch, which in turn is encapsulated with the the address of an intermediate switch for VLB.  This is shown in the figure to the left (taken from the VL2 paper). Interestingly, all of VL2’s mechanisms are built on backward-compatible network protocols (packet encapsulation, ECMP, OSPF, etc.).
  • BCube from MSR Asia specifically focused on scaling bisectional bandwidth between nodes in shipping-container-based modular datacenters.  This is a follow-on work to their DCell architecture from the previous year’s SIGCOMM.  BCube (and DCell) are, at their core, more complex interconnection networks (as compared to today’s fat trees) for wiring together and forwarding packets between servers.  They propose a connection topology that looks very much like a generalized hypercube.  So we are in fact seeing the rebirth of complex interconnection networks that were originally proposed for parallel machines like Thinking Machine’s CM-5; now it’s just for datacenters rather than supercomputers.  Actually wiring together such complex networks might be challenging, however, compared to today’s fat-tree architecture.

So these proposals all introduced new L2/L3 network architectures for the datacenter.  In HotNets, we saw more proposals for using different technologies, as opposed to new topologies (to borrow a phrase from here):

  • A group from Rice, CMU, and Intel Pittsburgh argued that traditional electrically-switched networks (i.e., Ethernet) in the datacenter should be accompanied by an optically-switched network for bulk data distribution.  This proposal takes advantage of reconfigurable optical switches to quickly set up light-paths between particular hosts for large transfers.  So their resulting system is a hybrid:  using electrical, packet-switched networks for small flows or for those requiring low-latency, but using optical, circuit-switched networks for large flows that can withstand the few millisecond delay necessary to reconfigure the optical switches.   And these types of larger flows are especially prevalent in data-intensive workloads (e.g., scientific computing and MapReduce).
  • A group from Microsoft Research focused on a similar problem—the paper’s presenter even joked that he could just skip his motivation slides.  But instead of proposing a hybrid electric/optical network, they argued for a hybrid wired/wireless network, where wireless links are used to provide a “fly-way”  between servers.  Instead of using these additional links for large transfers (as in the above proposal), however, this work uses these additional links to handle transient overages on the existing wired network.  Because it’s a wireless network, one doesn’t need to physically wire them up in place; the paper suggests that wireless connections in the 60GHz band might be especially useful given some prototypes that achieve 1-15 Gbps at distances of 4-10m.  The paper also discusses wired fly-ways by using additional switches to inter-connect random subsets of ToR switches, but the talk seemed to focus on the wireless case.

Either way, it’s interesting to see competing ideas for using different technologies to handle bisectional capacity problems (whether transient or persistent).

HotNet’s second thematic area of datacenter network papers considers managing all this new complexity.  There were two papers on NOX, which is a network controller for managing OpenFlow networks.

  • The first paper asked the question whether we needed special networking technologies to support new datacenter architectures (the talk focused specifically on the problem of building VL2), or whether we could construct similar functionality via NOX and OpenFlow switches.  They found (perhaps not surprisingly) that NOX could be sufficient.
  • The second NOX paper focused on greater support for virtualized end-hosts.  OpenVSwitch is meant to work with various end-host virtualization technologies (Xen, KVM, etc.) and provide functionality for managing their virtual network interfaces (instead of, e.g., Xen’s simple Ethernet bridge).  Openvswitch can be used, for example, to setup particular routes between VM instances or to provide encapsulation and tunneling between VMs.  The latter could enable L3 migration of VMs, with a VM’s old physical location forwarding packets to the VM’s new location (akin to Mobile IP).  Traditional VM migration, on the other hand, uses ARP spoofing and is thus limited to migration between hosts on the same L2 network.

This ability to perform more fine-grain network resource management is very interesting.  While most of these above papers (except the latter one) focus on supporting L2/L3 addressing and connectivity, our own SCAFFOLD project looks at the higher-level problem of supporting wide-area, distributed services.   Distributed services typically scale-out by replicating functionality across many machines; for customer-facing services, some combination of additional mechanisms are used for server selection and failover:  DNS, BGP tricks and IP anycast, VRRP, VIP/DIP load balancers, ARP spoofing, etc.  All this complexity is because, while clients are trying to access replicated services, the Internet provides communication between unicasted hosts. Thus, SCAFFOLD explores building a network architecture that supports communication between services, instead of network interfaces, and that explicitly supports churn (whether planned or unplanned) amongst the set of end-hosts composing that service. I’ll expand on SCAFFOLD’s motivation and design more in a future post.

CoralCDN Lesson: Interacting with virtualized and shared hosting services

In the previous post, I discussed how CoralCDN implemented bandwidth restrictions that were fair-shared between “customer” domains. There was another major twist to this problem, however, that I didn’t talk about: the challenge of performing such a technique on a virtualized and shared platform such as PlanetLab.  While my discussion is certainly PlanetLab-centric, its questions are also applicable to other P2P deployments where users run peers within resource containers, or to commercial hosting environments using billing models such as 95th percentile usage.

Interacting with hosting platforms

CoralCDN’s self-regulation works well in trusted environments, and this approach is used similarly in other peer-to-peer (e.g., BitTorrent and tor) and server-side (e.g., Apache mod_bandwidth) environments.  But when the resource provider (such as PlanetLab, a commercial hosting service, or peer-to-peer end-users) wants to enforce resource restrictions, rather than assume the software functions correctly, the situation becomes more challenging.

Many services run on top of PlanetLab; CoralCDN only being one of them.  Each of these instances is allocated a resource container (a “slice”) across all PlanetLab nodes.  This doesn’t have the same level of isolation as a virtual machine instance, but its much more scalable (in number of slices per node, each per-node slice called in sliver in PlanetLab).

PlanetLab began enforcing average daily bandwidth limits per sliver in 2006; prior to that, sliver usage was all self-enforced. Thereafter, however, when a sliver hit 80% of its daily limit, the PlanetLab kernel began enforcing bandwidth caps (using Linux’s Hierarchical Token Bucket scheduler) as calculated over five-minute epochs.  CoralCDN’s daily limit is 17.2 GB/day per sliver to the public Internet.

So, we see here two levels of bandwidth control: admission control by CoralCDN proxies and rate limiting by the underlying hosting service. Even though CoralCDN uses a relatively conservative limit for itself (10 GB/day), it still surpasses the 80% mark (13.8 GB) of its hosting platform on 5–10 servers per day. And once this happens, these servers begin throttling CoralCDN traffic, leading to degraded performance.  The main cause of this overage is that, while CoralCDN counts successful HTTP responses, its hosting platform accounts for all traffic—HTTP, DNS, DHT RPCs, system log transfers, and other management traffic—generated by CoralCDN’s sliver.

Unfortunately, there does not appear to be sufficiently lightweight or simple user-space mechanisms for proper aggregate resource accounting.  Capturing all traffic via libpcap, for example, would be too heavy-weight for our purposes.  Furthermore, a service would often like to make its own policy-based queuing decisions based on application knowledge.  For example, CoralCDN would prioritize DNS traffic before DHT RPCs, HTTP traffic next, and log collection of lowest priority. This is difficult through application-level control alone, while using a virtualized network interface that pushes traffic back through a user-space network stack would be expensive.

CoralCDN’s experience suggests two desirable properties from hosting platforms that enforce resource containers.  First, these platforms should provide slivers with their current measured resource consumption in a machine-readable format and in real time.  Second, these platforms should allow slices to express policies that affect how the underlying platform enforces resource containment. While this pushes higher-level preferences into lower layers, such behavior is not easily performed at these higher layers (and thus compatible with the end-to-end argument).  And it might be as simple as exposing multiple resource abstractions for slices to use, e.g., multiple virtual network connections with different priorities.

Somewhat amusingly, one of CoralCDN’s major outages came from a PlanetLab misconfiguration that changed its bandwidth caps from GBs to MBs.  As all packets were delayed for 30 seconds within the PlanetLab kernel, virtually all higher-layer protocols (e.g., DNS resolution for were timing out.  Such occasional misconfigurations are par for the course, and PlanetLab Central has been an amazing partner over the years.  Rather than criticize, however, my purpose is to simply point out how increased information sharing can be useful.  In this instance, for example, exposing information would have told CoralCDN to shut down many unnecessary services, while policy-based QoS could have at least preserved DNS responsiveness.

Over-subscription and latency sensitivity

While CoralCDN faced bandwidth tensions, there were latency implications with over-subscribed resources as well.  With PlanetLab services facing high disk, memory, and CPU contention, and even additional traffic shaping in the kernel, applications face both performance jitter and prolonged delays.  For example, application-level trace analysis performed on CoralCDN (in Chapter 6 of Rodrigo Fonseca’s PhD thesis) showed numerous performance faults that led to a highly variable client experience, while making normal debugging (“Why was this so slow?”) difficult.

These performance variations are certainly not restricted to PlanetLab, and they have been well documented in the literature across a variety of settings.  More recent examples have shown this in cloud computing settings.  For example, Google’s MapReduce found performance variations even among homogeneous components, which led to their speculative re-execution of work.  Recently, a Berkeley study of Hadoop on Amazon’s EC2 underscored how shared and virtualized deployment platforms provide new performance challenges.

cluster-timingsCoralCDN saw the implications of performance variations most strikingly with its latency-sensitive self-organization.  Coral’s DHT hierarchy, for example, was based on nodes clustering by network RTTs. A node would join a cluster provided some minimum fraction (85%) of its members were below the specified threshold (30 ms for level 2, 80 ms for level 1).  This figure shows the measured RTTs for RPC between Coral nodes, broken down by levels (with vertical lines added at 30 ms, 80 ms, and 1 s). While these graphs show the clustering algorithms meeting their targets and local clusters having lower RTTs, the heavy tail in all CDFs is rather striking.  Fully 1% of RPCs took longer than 1 second, even within local clusters.

Another lesson from CoralCDN’s deployment was the need for stability in the face of performance variations, which are only worse in heterogeneous deployments.  This translated to the following rule in Coral.  A node would switch to a smaller cluster if fewer than 70% of a cluster now satisfy its threshold, and form a singleton only if fewer than 50% of neighbors are satisfactory.  Before leveraging this form of hysteresis, cluster oscillations were much more common (leading to many stale DHT references).  A related focus on stability helped improve virtual network coordinate systems for both PlanetLab and Azureus’s peer-to-peer deployment, so it’s an important property to consider when performing latency-sensitive self-organization.

Next up…

In the next post, I’ll talk a bit about some of the major availability problems we faced, particularly because our deployment has machines with static IP addresses directly connected to the Internet (i.e., not behind NATs or load balancers).  In other words, the model that the Internet and traditional network layering was actually designed with in mind…