I’m back in Princeton after spending the week in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk from Day 2 of the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.
[See my previous post for Day 1]
Building Facebook: Performance at Massive Scale
Jason Sobel (Facebook)
Jason gave a great presentation about the architecture at Facebook and some of the lessons their engineering team has learned in the last several years. In the last three years, Facebook has seen a factor of 20 growth in the number of users they support – from 1 million users in 2007 to 400 million users in 2010. With this high rate of growth, the infrastructure team has had to keep the site running in addition to innovating.
The traditional website scaling technique is to have all of a user’s data stored in a single server so that data can be partitioned easily and therefore scale. The social network that Facebook has can’t use this method because the data is interconnected and there is no way to partition the user graph. Instead, Facebook uses a number of techniques – aggressive caching, a multi-tiered infrastructure, and innovative APIs for developers that allow for applications to easily scale.
The three tiered structure that Facebook uses consists of:
- a web server front end running PHP that processes user requests. Not much to say here except that Facebook now compiles PHP to C++ to speed up execution
- a caching layer that uses memcached which is simple, fast, and reliable. They have 10’s of TBs of memory for caching!
- a database layer that uses MySQL. They have 8,000 server-years of run time experience without data loss or corruption. Since most requests are cached, the traffic to MySQL stays low, but they require a 3:1 ratio of the innodb table size to RAM for each server in order to keep up with request rate.
One interesting caveat here is that they are double caching most data. It’s hard to provide enough caching space to only cache in the memcache layer or the database layer, so the double-caching approach is the simplest option right now for Facebook, but looking into reducing this in the future.
Jason also talked about their Hadoop/Hive/Scribe system used to run MapReduce queries as well as a system called TAO that they use to provide a simple API to their developers for writing applications. Rather than developers having to write database and memcache queries, TAO is an API-aware layer built on top of memcache that provides easy to use abstractions.
Overall, a great talk with nice insight into Facebook’s infrastructure.
Hermes: Clustering Users in Large-Scale E-mail Services
Thomas Karagiannis (Microsoft Research), Christos Gkantsidis (Microsoft Research), Dushyanth Narayanan (Microsoft Research) , Antony Rowstron (Microsoft Research)
The Hermes project comes out of MSR where they are looking into optimizing the storage of e-mail messages in Microsoft Exchange. The normal implementation for Exchange is to take a new user account and locate their messages on any server in the Exchange cluster that has space available. By doing this, it splits load across the Exchange servers well.
The problem is that when two users that are on different physical servers receive the same message in their mailbox, the message has to be stored twice – once on each server. Hermes looks into trying to cluster users to minimize the amount of storage redundancy and network traffic between servers.
An example system was given that has 128K users on 68 exchange servers (1800 users/server). The sample size was 22 weeks of e-mail which contained 337M messages. They calculated that there is 3TB of extra storage required per week to store messages. By using a clustering algorithm to cluster together users that frequently e-mail each other, Hermes was able to save 55TB of storage over the 22 weeks of e-mail. Since moving user mailboxes is expensive, they suggest partitioning once and then re-partitioning once a week.
This was nice work, and seems like it solved a big problem with Exchange, although I question the architecture of Exchange to begin with. Services like Gmail do a much better job of storing e-mail messages and don’t run into problems like this.
Defining Future Platform Requirements for e-Science Clouds
Lavanya Ramakrishnan (Lawrence Berkeley National Lab), Keith Jackson(Lawrence Berkeley National Lab) , Shane Canon (Lawrence Berkeley National Lab) , Shreyas Cholia (Laurence Berkeley National Lab) , John Shalf (Lawrence Berkeley National Lab)
Keith talked about the current state of how (non-computer) scientists use cloud services to do their work. Many scientists use things like Hadoop to process large datasets (e.g. Large Hadron Collider). There are three main types:
- Large scale synchronous workloads – generally run today on supercomputers
- Medium scale synchronous workloads – can use local clusters
- Large scale asynchronous workloads – these can be parallelized well so can be run on cloud services
He talked about how the interfaces to cloud services are not standardized, so once chosen, scientists are locked in to a single provider. The suggestion was that cloud services should be more standardized and make it easier to share and collaborate among scientists.
Programming Models and Optimization
Fluxo: A System for Internet Service Programming by Non-expert Developers
Emre Kiciman (Microsoft Research), Benjamin Livshits (Microsoft Research), Madanlal Musuvathi (Microsoft Research), Kevin Webb (University of California San Diego)
This work focused on trying to provide a way for non-experts to write Internet services that scale well. Non-experts now have access to large scale cloud platforms like Amazon, Google, and Microsoft, so hardware is no longer the bottleneck when needing to scale an application. Instead, it’s difficult to architect an application.
The current techniques for scaling Internet services are things like tiering, partitioning, replication, data duplication and de-normalization, pre-computation, and others. These techniques are notoriously difficult to build and deploy correctly. It would be nice for non-experts to have a way to write applications using these techniques easily.
In comes Fluxo – a compiler that takes input in the form of a restricted programming language and translates it to run on Azure. The restricted programming language simplifies program analysis and allows for the use of the common techniques listed above. Fluxo is profile driven, so it collects metrics, analyzes the program, transforms it, and repeats.
In an evaluation, pre-computation can get benefits of up to 60% in performance, and caching can decrease latency by up to 50%. Seems like a nice idea – I can see small to medium size companies finding a platform like this useful.
Nephele/PACs: A Programming Model and Execution Framework for Web-Scale Analytical Processing
Stephan Ewen (TU Berlin) , Fabian Hueske (TU Berlin) , Daniel Warneke (TU Berlin) , Dominic Battré (TU Berlin) , Volker Markl (TU Berlin) , Odej Kao (TU Berlin)
Data-centric parallel programming uses schema-free systems like MapReduce and Hadoop to run efficiently. Relational databases, on the other hand, are schema bound, have well-defined properties, are flexible and optimizable, but don’t scale well. This work looks at extending the capabilities of MapReduce. The notion of first-order and second-order function is introduced. First-order functions are user code, while second order functions are things like the Map and Reduce steps today.
PACTS extends the MapReduce paradigm with some additional functions. For example:
- Cross (two inputs) – each combination of records from the two inputs is built and is independently processable
- Match (two inputs) – each combination of records with equal keys from the two inputs is built
- CoGroup (many inputs) – pairs with identical keys are grouped for each input, group of all inputs with identical keys are processed together
PACT programs are compiled to parallel data-flows to be executed in Nephele. Nephele is an execution engine for parallel data-flow programs. It provides the process scheduling, communication, and fault-tolerance mechanisms needed to execute the workload. Example given of PACT workloads are TCP-H aggregation and K-Means Clustering. I’m not too familiar with running MapReduce workloads, so not sure of the impact this work will have.
The Case For PIQL: A Performance Insightful Query Language
Michael Armbrust (UC Berkeley), Nick Lanham (UC Berkeley), Stephen Tu (UC Berkeley) , Armando Fox (UC Berkeley) , Michael Franklin (UC Berkeley) , David Patterson (UC Berkeley)
This work looked at the tradeoff between RDBMS’s and NoSQL-type databases. RDBMS’s have powerful declarative query languages, have a query optimizer that decides the best execution plan, has automatic parallelism, and is performance-opaque. Contrastingly, NoSQL has a simple query interface (get/set), but complex queries have to be implemented at the application layer and are non-trivial to architect.
The argument is made to use a system called PIQL which is designed to be a middle ground between the two extremes. PIQL is a scale-independent declarative language that only allows developers to write queries with a data-size independent upper bound. PIQL is able to say no to certain queries, where as an RDBMS can’t.
Most applications have natural cardinality constraints. Example given was that a Facebook user has an upper limit of 5000 friends. PIQL lets developers specify these constraints in their data model so the optimizer can use them to decide on an optimized execution plan. If a query would be too slow, PIQL rejects it. PIQL is designed to run on top of existing key/value data-stores and it scales well. I think this is the right direction for providing easier ways to write NoSQL-type applications. I’m interested in looking more at what their API provides.
Towards Automatic Optimization of MapReduce Programs
Shivnath Babu (Duke University)
Shivnath described some of the difficulties with using MapReduce today. The lifecycle of a MapReduce job is that the programer first provides the Map and Reduce functions, kicks off the job, which gets split into many Map waves, followed by many Reduce waves, until it finishes.
The complexity here is that the user has to specify things like the number of maps, the number of reduces, and memory usage. There are over 190 parameters that a user can specify in Hadoop, and for many, it isn’t clear how they impact performance. As an experiment, a 50GB terasort was run on EC2. By changing parameters, the running time drastically changed, and certain parameters depend on each other, so individual parameters can’t be tuned separately.
The argument was that a better approach should be used for default parameters in Hadoop. There has been a lot of work in traditional RDBMS’s and these techniques should be leveraged to try and tune MapReduce workloads.
Benchmarking and Testing
Benchmarking Cloud Serving Systems with YCSB
Brian Cooper (Yahoo! Research) , Adam Silberstein (Yahoo! Research) , Erwin Tam (Yahoo! Research) , Raghu Ramakrishnan (Yahoo!) , Russell Sears (Yahoo! Research)
This work looks at how to properly evaluate a NoSQL cloud database. There are many systems out there – PNUTS, BigTable, Megastore, Azure, Cassandra, CouchDB, and many more, but the tradeoffs between systems are not clear. It would be nice if there was a standard benchmark that could be used to analyze the performance of each system so that developers can choose the appropriate solution to meet their needs.
In comes YSCB – an open source, standard benchmark tool for evaluating NoSQL systems. YCSB focuses on performance (latency and throughput) and scalability (adding servers, latency as system scales, and elasticity). The YCSB tool is written in Java, which provides easily extendable workloads.
A test setup was run on six server-class machines, and systems evaluated are Cassandra, HBase, sharded MySQL, and PNUTS. Some of the example workloads given were (more in paper):
- A – Update heavy workload that had 50:50 ratio of reads to updates
- B – Read heavy workload with 95:5 ratio of reads to updates
- E – Short range scans of 1 to 100 records of size 1KB
- Elasticity – Run a read-heavy workload while adding servers from 2 to 6.
The graphs showed a clear distinction between different systems and provided insight to which systems would be best in different situations. It was also noted that these tests can be used to improve the database system, as happened with Cassandra which improved dramatically in subsequent version releases.
This looks like a great tool for evaluating the plethora of NoSQL systems out there. It would be nice to have a standard setup (e.g. on Emulab) so that the comparison between systems is fair and repeatable.
Automated Software Testing as a Service
George Candea (EPFL) , Stefan Bucur (EPFL) , Cristian Zamfir (EPFL)
Stefan pointed out that when we install privileged software today (like drivers) we have to take their reliability on faith. Third-party drivers run with the highest privilege in the operating system, so if they have bugs or malicious code, serious problems can occur. The analogy made was to cars – we wouldn’t give our car keys to a random driver without some guarantee of the car’s safety.
It would be nice to be able to test a driver by exploring all possible code paths and look for common bugs and problems. Testing as a Service (TaaS) is a system for doing just that. TaaS can be used by home users by uploading a driver to a website and evaluating it, by developers who can integrate testing into their IDE’s for convenience, and as a commercial certification service for software.
This seems like a nice idea, although I question how much traction it could get. A lot of Windows drivers today don’t get certified because companies like to cut corners and produce a product for the least amount of money possible. It seems like even if this service was provided, companies would still cut corners and just skip using it to save money. Being able to test as a home user is not that useful, as a failed test would just give you the option of not installing the software, the same option you have today. Perhaps if a big player like Microsoft started requiring certification for driver installation, it could get the traction it needed.
The Internal Design of Salesforce.com’s Multi-Tenant Architecture
Rob Woollen (Salesforce.com)
This was an interesting keynote. I wasn’t very familiar with exactly what Salesforce provided to its customers before this talk. I knew that a lot of companies used their services, but I didn’t realize the scale they were working at. Salesforce provides cloud services for 75,000 customers running 150,000 custom applications and serving 17,500 websites.
They provide many services, including a query language, API’s, OLAP, batch processing, search, mobile, content management, packaging, and many others. What’s interesting about Salesforce’s architecture, though, is that each customer (or “tenant” as they call them) is not given its own architecture. Instead, all tenants run on a single platform, supported by a single code base.
Each username is mapped to a tenant, which is mapped to an “instance”. An instance is a cluster of hardware including a SAN, application servers, load balancers, object storage, etc. To provide isolation, each tenant runs as a separate Java thread. They have a three-tier caching layer, first caching locally within each Java thread, then caching with remote requests, and finally providing an application-layer service to users running on memcache.
Underneath the cache is a persistence layer that uses Oracle. Every customer’s data is stored together in a table, with static (compile time) and dynamic (run time) query analysis that enforces that a customer can only read their own data. All data is stored as a string in the database, so to provide indexes, a separate table has a column for each possible type which can then be indexed.
Salesforce only maintains a single code base for their infrastructure. To provide compatibility with previous releases, functions are annotated with API version numbers, such that if a customer is still using an old version, they don’t get access to changes in the API. When customers package their own code, they are required to submit unit tests that cover most of their code paths, so when new versions are released, Salesforce uses customer tests to verify compatibility with old API versions. An analytics component is provided that does real-time calculations instead of offline processing like most other analytics systems.
This was a nice talk. The presentation was a little dry (Rob didn’t seem very enthusiastic about their services!) but it was still interesting.
G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud
Sudipto Das (UC Santa Barbara) , Divyakant Agrawal (UC Santa Barbara) , Amr El Abbadi (UC Santa Barbara)
G-Store addresses the weakness of most NoSQL databases that they don’t provide transactions over multiple keys. Many NoSQL databases provide test-and-set type semantics for a single key, but can’t provide any guarantees about multiple keys. To do this, users have to implement any transactional logic in the application layer.
G-Store is a layer that runs on top of a key/value store and provides additional functions. Users can submit a list of keys to add to a group. When a group is created, a single node is assigned as the controller for this set of keys. Since a single node controls the group, it can execute transactions over the group. Once a user is finished, the group can be deleted.
This work is similar to our own CRAQ work, which can provide multi-key transactions over a chain. CRAQ only provides static group membership, in that it doesn’t support explicitly moving keys around after their initial creation, but by leveraging Sinfonia-type mini-transactions, it provides the same service as G-Store. I was disappointed that we weren’t cited.
Google Fusion Tables: Data Management, Integration and Collaboration in the Cloud
Alon Halevy (Google) , Hector Gonzalez (Google) , Jayant Madhavan (Google) , Christian Jensen (Aalborg University) , Jonathan Goldberg-Kidon (MIT) , Warren Shen (Google) , Rebecca Shapley (Google) , Anno Langen (Google)
Google Fusion Tables is a service that allows data management for the web era. Google wants to provide data storage that can integrate seamlessly with the web, is searchable, embeddable, extensible, easy to use, provides incentives for sharing data, and facilitates collaboration.
The example given was biking trails in Europe. There is a website that allows users to upload GPS data of bike trails. They want to allow data like this to be integrated into the Google Maps API for visualization and allow users to query it easily, e.g. find trail lengths < 25 miles.
Google Fusion Tables allows users to extend data tables in order to collaborate. Tables can be merged. For example, a user might merge the bike trails table with a sponsors table and a results table. Users can have discussions about the data, set permissions, and perform other types of operations.
The architecture to support these operations has a front end dispatcher that receives requests, a query processing component, a backend that does plan execution, and a replicated storage layer running on BigTable. This seems like a nice, useful API. Looking forward to checking it out.
High Availability and Reliability
Making Cloud Intermediate Data Fault-Tolerant
Steve Ko (Princeton University) , Imranul Hoque (University of Illinois at Urbana-Champaign) , Brian Cho (University of Illinois at Urbana-Champaign) , Indranil Gupta (University of Illinois at Urbana-Champaign)
This work tries to address some shortcomings in Hadoop. In MapReduce, there is a two stage computation: the Map step and the Reduce step. Between processing steps, there is a large amount of intermediate data that has to get shipped from server to server. For example, Google’s indexing application runs a chain of 24 MapReduce jobs. In many cases, this intermediate data is larger than either the input data or output data.
If a MapReduce node fails, the computation it was supposed to provide has to be rerun on another node. The problem is that to rerun the computation, the intermediate data that was sent to the failed node has to be sent to the new node as input. If it’s not the first step in the chain, this intermediate data is no longer available – it was the output of the previous step. To recover it, the previous step has to be rerun as well. This is a problem when chaining multiple steps together, because the failure will cascade and cause every previous step to have to be rerun.
This work extends Hadoop by replicating this intermediate data. Unfortunately, this isn’t a simple thing to do because most MapReduce jobs are network-bound (verified experimentally), meaning that adding additional replication slows down the job. An observation is made, though, that often the bottleneck is at the inter-rack switches. The network capacity within a rack is not saturated. To take advantage of this, replication can be done at the rack level.
Failures really do occur often in large MapReduce clusters (verified from industry reports), and an experimental evaluation shows that when using rack-level replication, jobs can be sped up significantly when failures occur. When no failures occur, there is only a small overhead associated with the replication. This seems like a nice addition to Hadoop, and I hope to see it added to the main branch.
Characterizing Cloud Computing Hardware Reliability
Kashi Vishwanath (Microsoft Research) , Nachi Nagappan (Microsoft Research)
This work looks at hardware failures in large scale datacenters. Microsoft has 100K+ servers that are geographically distributed around the world running heterogeneous hardware. They observe lots of failures and have to deal with them with a support staff. Since a person has to diagnose hardware faults and perform the necessary fix, failures come at a high cost.
The dataset used here was from Microsoft’s online services. This runs things like Messenger, Hotmail, etc. The dataset was over a 14 month period, looking at 100K+ servers, and contained a work log from their datacenter. The work log records which server fails along with what component had to be replaced (e.g. disk, dimm, raid controller, etc).
The analysis they performed looked at characteristics in aggregate, failure rates across individual servers and datacenters, and tried to see if failures could be predicted. The method used was a decision tree. Some interesting takeaways:
- For every 100 machines, 9.5 machines recorded a failure annually
- For every 100 machines, 20 failures were recorded annually (so 2 repairs per machine)
- 70.7% of failures were disks, 6.5% raid controller, 5.1% memory, 17.6% other
- When a failure occurs, there is a 20% chance another failure will occur within 1 day!
- Failure probability is correlated with number of disks in a server, slope is greater than 1!
- Many times, a faulty raid controller would cause failures to be reported as a disk failure
A Self-Organized, Fault-Tolerant and Scalable Replication scheme for Cloud Storage
Nicolas Bonvin (EPFL), Thanasis Papaioannou (EPFL), Karl Aberer (EPFL)
I really have nothing to say about this talk. I couldn’t understand anything from the presentation at all. I verified it wasn’t just me – others that I asked didn’t understand it, and there were no audience questions either. This is a great example of how not to give a talk.
Storage and System Modeling
Robust and Flexible Power-Proportional Storage
Hrishikesh Amur (Georgia Institute of Technology) , James Cipar (Carnegie Mellon University) , Varun Gupta (Carnegie Mellon University) , Michael Kozuch (Intel Corporation) , Gregory Ganger (Carnegie Mellon University) , Karsten Schwan (Georgia Institute of Technology)
This work looked at trying to provide a distributed storage service that is power-proportional. This means that as the load on the system increases, the power consumption should also increase proportionally. This is not the case today – idling servers still use a large percentage of the required power for peak load.
One way to reduce power consumption is just to turn off servers during times of low demand, but traditional distributed storage services use methods like consistent hashing to get good load balancing, so if you shut off a server, the data on that server is no longer available.
Instead, what Rabbit does is store a primary replica of every key in a small number of nodes. A secondary replica goes to a larger number of nodes, a third replica to an even larger set of nodes, and a fourth replica to the largest number of nodes. By doing this, you can turn off 95% of your servers without losing availability. If you have 5 servers on, each serves 20% of requests. If you have 100 servers on, each serves 1% of requests.
This allows for fast, fine-grained scaling by simply powering off machines. An example benchmark of a 100GB terasort shows that read throughput scales linearly with percentage of maximum power used. However, write performance decreases because all data has to be written to a small number of nodes. A method is used from previous work to offload writes to get better write performance.
RACS: A Case for Cloud Storage Diversity
Lonnie Princehouse (Cornell University) , Hussam Abu-Libdeh (Cornell University) , Hakim Weatherspoon (Cornell University)
RACS is a method for combining multiple cloud platforms together for data storage, such that a single vendor doesn’t have to be relied on. Today, when choosing a cloud datastore, customers are often locked in because of the high cost of switching providers. To transfer a large amount of data from vendor A to vendor B, the customer must pay for the bandwidth cost of the transfer twice – once for each provider.
What RACS does is use erasure coding across many cloud platforms to reduce the costs associated with switching vendors. The notation RACS(k,n) is used, where it stripes data over n providers such that any k-subset contains a complete copy of the data. This can tolerate up to (n-k) failures before data is lost.
An evaluation used an FTP trace from the Internet Archive over 1.5 years. The cost of three providers over time were very similar. Using RACS is more expensive: RACS(1,2) is much more expensive, while RACS(8,9) is only slightly more expensive. However, the cost of switching providers is very expensive when not using RACS, on the order of two-months of operating costs, compared to much cheaper cost when using RACS. This is interesting work – I hope they implement more cloud platforms in addition to their current prototype.
Characterizing, Modeling, and Generating Workload Spikes for Stateful Services
Peter Bodik (UC Berkeley) , Armando Fox (UC Berkeley) , Michael Franklin (UC Berkeley) , Michael Jordan (UC Berkeley) , David Patterson (UC Berkeley)
Web applications often experience periods of spikes in traffic that are unpredictable. For example, Wikpedia, after the death of Michael Jackson, saw 13% of total requests to the single page of Michael Jackson. The first contribution of the paper was to characterize these spikes. There are two main types – workload volume spikes consisting of the time to peak, duration, magnitude, and maximum slope; and data hotspots consisting of the number of hotspots, spacial locality, and entropy.
The conclusion was that there is no typical type of spike because they vary widely. The next contribution was to try and generate workload spikes using the 7 characteristics listed above. They were able to create a generative spike model using two parameters N, the number of hotspots, and L, a clustering parameter.
After using the generative model, they were able to match real world examples they collected with their counterpart from the output of the generative model, validating that the model produces valid workload spikes.
SOCC turned out to be a great conference, and was definitely a success overall. The next SOCC conference will be co-hosted with SOSP in 2011 in Portugal, so I’m looking forward to that if I’ll be able to attend!