I’m currently in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk at the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.
[See my next post for Day 2
Evolution and Future Directions of Large-Scale Storage and Computation Systems at Google
Jeffrey Dean (Google)
Jeff Dean gave a keynote about different services running at Google and general principles on how to build large-scale services. The talk was roughly divided into three parts. The first part was about Google’s data centers that house a few hundreds of clusters. Each cluster has thousands of machines with one or a handful of configurations. Each machine (at least) runs GFS and Colossus (next-gen GFS), and a cluster scheduling daemon.
The second part was about Google’s back-end services including MapReduce, BigTable, and Spanner. A few interesting notes about these systems are:
- BigTable now has a dedicated team that manages BigTable service clusters. Because of this, there has been a lot of work on fair-share scheduling and performance isolation.
- BigTable has something called coprocessors, which are basically “arbitrary code that runs next to each tablet in table”.
- Spanner is a storage & computation system that runs across data centers. It supports the mix of strong & weak consistency models, fine-grained replication, “zone”-based hierarchy (1 master per zone), etc.
The third part was about experiences and design patterns for building large-scale services. There were too many design patterns to post here (Jeff said he would post slides on his website), but here are a few which I find interesting:
- Use back-of-the-envelope calculation whenever possible before choosing a design
- Design for growth, but don’t overdo it (e.g., 5x – 10x OK, but not 100x growth), and
- Canary requests: some requests crash every single process in the same service, so try a few machines first.
Overall, it was an interesting talk about quite a broad set of topics. There are only a few places where you can accumulate wisdom about building truly large-scale systems, and it is always interesting to see what they are doing to cope with the scale.
An Operating System for Multicore and Clouds: Mechanisms and Implementation
David Wentzlaff (MIT) , Charles Gruenwald III (MIT CSAIL) , Nathan Beckmann (MIT CSAIL) , Kevin Modzelewski (MIT CSAIL) , Adam Belay (MIT CSAIL) , Lamia Youseff (MIT CSAIL) , Jason Miller (MIT CSAIL) , Anant Agarwal (MIT CSAIL)
This work addresses the issue of how to create applications that run in the cloud. Machines today have many cores (16 today, up to 1000 in five years), and parallelizing across many cores is very difficult. They created a new system called Fos (Factored Operating System) which runs on top of Xen and splits the OS into separate services. The kernel is a microkernel and all services run on top (e.g. File System). The communication between components in the system is done with message passing. There are no shared locks.
Because of this abstraction, each component can be adjusted elastically to handle demand. For example, if an application starts accessing the File System component at too high of a rate, the underlying system can spawn another File System component on another core or another machine and start transparently redirecting requests to the new component.
The implementation of Fos is fairly complete – some applications include a web server, slide viewer, video transcoder (ffmpeg), and busy box. In fact, it was revealed at the end of the talk that the presentation was running on Fos, and it didn’t crash! The system’s website is running on the Fos web server.
I’m not sure how this work will play out. I could see this becoming a standard approach as servers start having thousands of cores, but I’m not sure how applications here would be able to cope with network latency involved in inter-machine message passing.
Lithium: Virtual Machine Storage for the Cloud (can’t find this paper online yet)
Jacob Hansen (VMware) , Eric Jul (Bell Labs, Dublin)
This work looks at trying to increase the performance of shared file systems that are used by virtual machine clusters. In the traditional setup, a datacenter has a SAN server that hosts files for VMs in a cluster to access. The SAN is set up such that it is highly reliable and provides high throughput.
The problem with a SAN is that it’s expensive and it doesn’t scale very well to hundreds or thousands of servers accessing it simultaneously. It’s also limited by network bandwidth. VMware’s Lithium technology does away with the SAN and instead arranges the VM hosts into a peer-to-peer network and uses local storage to store files, replicating data for redundancy and performance.
The system still preserves the features required of a VM file server, e.g. cloning, snapshots, etc. It uses a branching method similar to some source control systems like Mercurial for quick copying of large files.
When compared to a SAN, Lithium doesn’t perform as well with a small number of hosts, but as the number of VM hosts increases, Lithium scales linearly while the SAN maxes out at a constant throughput. This approach seems like a great idea, and hope to see it pushed to production in future VMware releases.
Differential Virtual Time (DVT): Rethinking I/O Service Differentiation for Virtual Machines
Mukil Kesavan (Georgia Institute of Technology), Ada Gavrilovska (Georgia Institute of Technology), Karsten Schwan (Georgia Institute of Technology)
The presentation of this work was quite confusing, but the basic idea is that when doing fair scheduling for resources in a VM, ill effects are often encountered. For example, as is well known, TCP doesn’t react very well to congestion in the network (e.g. why TCP doesn’t perform very well over wireless networks), so when a VM host artificially limits a guest’s access to the network, the TCP congestion window will drop dramatically and negatively impact performance.
To try and solve this problem, DVT takes an approach of never sharply decreasing a guest’s share of the network. For example, if host X is receiving 100% of network access, but hosts W, Y and Z suddenly request access, rather than the traditional approach of reducing X immediately to 25%, DVT instead slowly reduces it over time. To keep access fair, DVT will eventually reduce X to lower than 25% to make up for the increased share it got previously.
By doing this, DVT increases the performance of VM guests by up to 25% since the TCP congestion window doesn’t drop sharply. It was mentioned that future work will look at applying the same method to disk access, although it isn’t clear to me how slowly reducing disk access instead of sharply reducing it would increase performance in an application.
Virtual Machine Power Metering and Provisioning
Aman Kansal (Microsoft Research) , Feng Zhao (Microsoft Research) , Jie Liu (Microsoft Research) , Nupur Kothari (USC) , Arka Bhattacharya (IIT Kharagpur)
This work asked the question “Can we tell how much power a VM guest is consuming?”. I was wondering what the motivation for measuring this was throughput the talk until it was finally mentioned at the end. I’ll start with the motivation first instead – the reasoning given was mainly to use knowledge of a VM guest’s consumption to provision your datacenter power accordingly. Other usages are to charge your cloud users according to power consumption (although I don’t buy this, as I don’t see how it would differ from billing with current methods – cpu, memory, storage, bandwidth), and to track which VMs are consuming the most power so you can target power reduction for “green” initiatives.
To answer this question is a two step process. First, they measure the power consumption of each component in the system when a context switch between VM guests happens by using OS-level events. Next, they have to figure out which VM guest is using each component. Rather than doing this live, they monitor a VM guest for a period of time to determine its power consumption and then use that value for future calculations. The reason for this is that different loads may use different internal power for the same externally visible power state, so learning a VM guest’s power profile over time is a more accurate measurement of power consumption.
There were a lot of technical details glossed over in this talk, and there were many formulas on the slides that weren’t explained or accompanied by variable descriptions, so I found the presentation somewhat confusing. I’m sure reading the paper would make this more clear.
Distributed and Parallel Processing
Stateful Bulk Processing for Incremental Algorithms
Dionysios Logothetis (UC San Diego), Christopher Olston (Yahoo! Research) , Benjamin Reed (Yahoo! Research) , Kevin Webb (UC San Diego) , Kenneth Yocum (UC San Diego)
This work targets large data applications. These are things like web analytics, graph mining, log analysis, and PageRank, which use massive amounts of data. An insight here is that these applications have to continually process on the order of TBs of new data per day and they are stateful, but the running time is proportional to the total amount of state, not proportional to the amount of new data.
Continuous Bulk Processing (CBP) provides users with an additional API of a Translate() function and a RouteBy() function similar to the map and reduce stages of MapReduce. Current systems only do “outer” grouping, while CBP allows for “inner” grouping so that only the state that needs to be accessed is shipped around. In an evaluation, this inner grouping method reduced running time by up to 53%.
Perhaps I’m not familiar enough with MapReduce, but the presentation went too fast for me to follow the details of the Translate and RouteBy API, so see the paper for details.
Comet: Batched Stream Processing for Data Intensive Distributed Computing
Bingsheng He (Microsoft Research), Mao Yang (Microsoft Research) , Zhenyu Guo (Microsoft Research) , Rishan Chen (Beijing University) , Wei Lin (Microsoft Research) , Bing Su (Microsoft Research) , lidong Zhou (Microsoft Research)
Comet is a system that tries to reduce the running time of large data-intensive applications that have work in common. The example given was a set of four jobs: one that computes the top ten hottest Chinese pages daily, another that computes the top ten hottest English pages daily, and corresponding jobs that compute the top ten hottest Chinese and English pages weekly. The first two jobs have the same input data, but have different filters in place, so the first step of each of those jobs is the same.
The flow of the system is: query series, normalization, logical optimization, physical optimization, execution plan, and then execution. By doing these optimization steps, Comet is able to reduce the amount of work done by 52% for jobs that have commonalities with previous jobs.
The evaluation showed that computing the top ten pages weekly was able to take advantage of the top ten daily calculation, but the top ten pages for the week don’t necessarily overlap with the top ten pages of each day, so it’s not clear how this works. This question was asked in the Q&A, but the author wasn’t able to answer. The presentation of this work was very confusing, and it was clear that the rest of the audience didn’t understand either. I’m sure reading the paper would make more sense.
Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon (University of Washington) , Balazinska Magdalena (University of Washington) , Bill Howe (University of Washington) , Jerome Rolia (HP)
This work addresses how to reduce the running time of large scientific experiments. The example given here was an application that takes astronomical images, does feature extraction to identify the celestial objects in the images, and then runs a Friends of Friends algorithm, which is used in astronomy to cluster celestial objects. MapReduce-type systems like Hadoop are a great fit for workloads like this, but it is hard to express complex algorithms and get good performance. As an example, this algorithm when first implemented took 14 hours to run, and after a week of working, they were able to reduce the running time to 70 minutes.
The reason these types of algorithms can take a long time to run is that the same amount of input data doesn’t always have the same running time (e.g. if the cluster of celestial objects is more dense, takes longer), so a static partitioning scheme doesn’t get good performance. An alternative is to use micro partitions, which reduces the impact of skew, but there is additional framework overhead and to find the sweet spot, the algorithm must be run many times, which is undesirable.
The SkewReduce algorithm takes a sampling approach to figure out the best partitioning scheme. In evaluation, this SkewReduce algorithm was able to reduce the running time of algorithms by 2-8 times. This seems like a nice scheduling algorithm, and I hope this finds its way into the main branch of Hadoop. A person who works at Google shared a similar optimization that Google uses, but they do their optimizations in the Reduce stage rather than the Map stage.
I will attempt to get my writeup for Day 2 posted late tomorrow.
[Update: Day 2 posted.]