Category Archives: Conferences

Should We Extend Conference Q&A With Written Responses?

The CS community recently discussed extending the Q&A session that follows each conference talk into a more formal written Q&A.  Specifically, this was raised during the business meeting at SOSP, and the proposal was to publish the results in SIGOPS OSR.  The idea was that this written extension to Q&A could really get to the bottom of the issues raised, and it wouldn’t let speakers avoid questions by saying, “Let’s take that offline.”  There was some pushback against this, with arguments like “most questions are just misunderstandings” and “that will add a lot of pointless work for speakers/authors.”

In this post I’ll examine the questions asked at the end of my SOSP talk on COPS.  We’ll look at a summary of each question and my written response, and then hopefully we’ll be able to conclude whether a written Q&A is a good idea.  The full transcript of each question, with comments and clarifications added in square brackets, can be toggled with the transcript links.


Question 1

Question from Hussam Abu-Libdeh (Cornell University)

Hussam: So I’m actually a bit confused by how you achieve partition tolerance.  If my operations are going to block until a server that has a dependency that I depend on responds back to me, I can talk to a datacenter perform an operation, that datacenter gets partitioned away, I talk to another datacenter but that datacenter didn’t see any of the operations that I depend on so I’m going to block.

Wyatt: Sure, so I think the question, if I can paraphrase the question, is that you have multiple datacenters, it’s possible you depend on something that’s being replicated from one datacenter, that datacenter gets partitioned away, and then your updates aren’t going to show up to a third datacenter until these updates are propagated from the now partitioned datacenter. Is that correct?

Hussam: Yeah, and I’m blocking meanwhile.

Wyatt: You’re not blocking anywhere.  These operations won’t show up right away. Causal consistency doesn’t say, “I see things right away.” It’s not strong consistency like that.  You’ll still get to see consistent values, they just won’t be super up-to-date.

Hussam: I’ll take it offline.

Question Summary:  The question could be interpreted two ways, so we’ll look at both.

Interpretation A: “What happens if a client is partitioned from the datacenter they are accessing?” (Note: Much of the feedback and questions after the talk were questions like A, so I think this is what Hussam meant.)

Written Answer: The clients of our system are the web servers co-located in the datacenter with the storage cluster, so they won’t be partitioned.  What you are really asking about are not the direct clients of the storage system, but the human who is a client of a web browser, which is a client of a web server, which is a client of the storage system.  Our system doesn’t provide consistency directly for those clients three levels away, but we think it’s an important and interesting problem, and we’re actively thinking about it.

Interpretation B: “What happens if a datacenter that is replicating data you depend on is partitioned?” (This is what I interpreted Hussam to mean at the time.)

Written Answer: No operations will ever block, but your new put operations won’t show up in other datacenters until their dependencies have shown up in that datacenter.  So there is no blocking, but this comes at the cost of not guaranteeing your updates show up everywhere immediately.
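As a rough illustration of this answer, here is a minimal sketch (in Python, with invented names; this is not COPS code) of a datacenter that buffers replicated puts until their dependencies are locally visible, so reads never block:

```python
# Hypothetical sketch: a datacenter holds a replicated put in a pending
# buffer until every version it depends on is already visible locally.
# Reads never block -- the new version simply appears later.

class Datacenter:
    def __init__(self):
        self.store = {}      # key -> newest visible version number
        self.pending = []    # replicated writes whose dependencies aren't met

    def deps_satisfied(self, deps):
        # A dependency (key, version) is met once the local store has
        # that version (or a newer one) of the key.
        return all(self.store.get(k, -1) >= v for k, v in deps)

    def replicate_put(self, key, version, deps):
        self.pending.append((key, version, deps))
        self._drain()

    def _drain(self):
        # Apply any buffered writes whose dependencies have now arrived.
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                key, version, deps = w
                if self.deps_satisfied(deps):
                    self.store[key] = max(self.store.get(key, -1), version)
                    self.pending.remove(w)
                    progress = True

    def get(self, key):
        # Never blocks: returns whatever version is currently visible.
        return self.store.get(key)
```

A put that depends on a missing version simply stays invisible until that version arrives (e.g. a comment on a photo is buffered until the photo itself has replicated); gets always return immediately.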


Question 2

Question from Maysam Yabandeh (Yahoo! Research Spain)

Maysam: Let’s put details, implementations, and your wide-area setting aside.  From an abstract point of view I see lots of similarities between your model [causal+ consistency] and snapshot isolation.  First, both of you might maintain multiple versions of data.  Second, both of you talk about snapshots.  And third, both of you try to detect and avoid write-write conflicts.  I wonder about the differences.

[Note: COPS does not avoid write-write conflicts.  We only allow single key put operations, so we can only have write-write conflicts between two put operations. These can happen and are then either resolved by the last-writer-wins rule or the convergent conflict handler function.]

Wyatt: Are you asking me about the difference between this [causal+ consistency] and what Jinyang just talked about, PSI, or just Snapshot Isolation in general?

Maysam: In general.

Wyatt: In general, snapshot isolation is sort of a database property, so it’s stronger consistency than what you get [with causal+].  With snapshot isolation you can do these transactions that have reads and writes and things like that. We don’t have that in our system.  What we have in our system is we’re guaranteeing you low latency.  Things will always complete right away, very quickly, no matter what.

Maysam: But look at this from an abstract point of view. I want to compare causal+ from an abstract point of view to snapshot isolation.

Wyatt: This is sort of tricky.  In the last talk Jinyang had this spectrum of consistency models. What she was showing you was more from the database side, where you have these transactions that involve multiple keys at the same time, and multiple updates, and multiple operations and things like that.  And we’re more from the shared memory side or something like that, where all of these things involve one operation at a time.  So how exactly they interact, it’s a very complex graph of how these consistency models interact.  I would say Snapshot Isolation is definitely a stronger property than what we provide, but we do so with better performance characteristics.

Maysam: But you talk about write-write conflicts.  Write-write conflicts make sense if you have write conflicts between two transactions.  You didn’t call it transactions, but I guess in the paper you call it context or something like that.  You give it a different name.  But still it is kind of context, but it is kind of transactions.

[Maysam is confused here, the context we describe in the paper is part of the client API for identifying different clients, it has nothing to do with transactions.]

Wyatt: So we only have read transactions.  You can only read multiple values in a transaction.

Maysam: So when you talk about write-write conflicts, is it between [trailed off]

Wyatt: Write-write conflicts?  We can have write-write conflicts in our system, but we have to use the last writer wins rule, or we have to use some sort of application specific function that is going to resolve these conflicts for us.

[Again, write-write conflicts are only for two puts to the same key.  There are no general transactions in COPS.]

Maysam: But, to have write-write conflicts, you first need to [cut off]

Ant Rowstron (Session Chair): I think we need to take this offline and head onto the next question.

Question Summary: What are the differences between snapshot isolation and causal+ consistency?

Written Answer: Causal+ consistency deals with single-key put operations and single- or multi-key get operations.  Snapshot isolation is stronger than causal+ because it deals with general transactions that can include many different put and get operations.  In addition, snapshot isolation ensures there are never conflicting transactions in the system (it avoids write-write conflicts), while causal+ doesn’t have the notion of a transaction but does allow and then resolve conflicting writes to the same key (it embraces single-key write-write conflicts).
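To illustrate the single-key conflict handling mentioned here, a minimal sketch of the last-writer-wins rule (hypothetical code, not from COPS; it assumes each put carries a Lamport timestamp plus a node id for tie-breaking):

```python
# Hypothetical sketch of last-writer-wins for single-key write-write
# conflicts: each put carries a Lamport timestamp, ties are broken by a
# unique node id, so every datacenter converges on the same winner.

def lww_merge(a, b):
    """Return the winning (lamport_ts, node_id, value) tuple."""
    # Tuple comparison orders by timestamp first, then node id.
    return a if (a[0], a[1]) > (b[0], b[1]) else b

# Two datacenters concurrently put to the same key:
v1 = (7, "dc-east", "blue")
v2 = (7, "dc-west", "green")

# Both sides apply the same deterministic rule, so they converge:
assert lww_merge(v1, v2) == lww_merge(v2, v1) == (7, "dc-west", "green")
```

The application-specific convergent conflict handler mentioned in the talk would replace `lww_merge` with its own deterministic, commutative merge function.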


Question 3

Question from Marcos Aguilera (Microsoft Research Silicon Valley)

Marcos: You made a case that gets are not enough therefore you need get transactions.  [Wyatt says “yes”].  The previous person was asking about other types of transactions.  You could also make the argument that puts are not enough and you need put transactions.  In fact, you need more general transactions.  And you mentioned that you have more the perspective of a shared-memory system, but there we have transactional memory as well.  And so, I’m wondering without general transactions isn’t that the same thing as trying to go to war with rocks and stones when you have machine guns available, which is what general transactions are.

Wyatt: I would agree with the first half of what you said and strongly disagree with the second half.  So I think put transactions are important and it’s something that I’m thinking about.  What else did you say? General transactions.  My view of your work in the previous paper and this work is that they’re sort of complementary approaches.  Like, we really want to have low latency, we say operations must be really really fast.  In your work, you say, “We have to have these transactions.  We have to avoid write-write conflicts.”  I think there’s places for both of these and I think ultimately you’d have some sort of system that would join the two.  And I don’t think this is like using rocks, this is like using something that you know is going to be really fast.  I’m never going to have to do that slow 2PC across the wide area [unlike in Walter].

Question Summary:  Can you compare COPS and Walter? (Walter was the system described in the previous talk, one of whose authors asked this question.)

Written Answer: The two systems take complementary approaches.  COPS guarantees successful low-latency operations at the cost of not providing general transactions.  Walter guarantees conflict-free general transactions at the cost of allowing transactions to abort and (sometimes) having to do wide-area locking via two-phase commit, which is directly incompatible with low latency.

Question 4

Question from Marc Shapiro (INRIA)

Ant Rowstron (Session Chair):  Can we keep the last two questions very short.  Marc, is it a question?

Marc: A comment and a question.

Ant: Can we have the question?

Marc: The comment is, I think your causal+ property is much too strong, you can get exactly the same results with something a lot simpler.  But we can take that offline. The question is, you said explicit dependency tracking is novel. It’s been around for a long time. It’s been beaten to death.

Wyatt: No, no, no.  I didn’t mean to say explicit dependency tracking itself is a novel technique.  Doing this in conjunction with decentralized replication is a new technique.

[Note: The slide read, “Novel Techniques: Explicit dependency tracking and verification with decentralized replication. …”]

Marc: The question is, vector clocks were invented because explicit dependency tracking is complicated and slow.  So I’m really puzzled why didn’t you just use vector clocks.

[Note: I misunderstood Marc’s question here; see the response to what he was actually asking in the “written answer” below.  I thought he wanted to know why we use Lamport timestamps (small, fixed size) to establish a causal order instead of (much larger) vector clocks that give a more precise order.]

Wyatt: So we don’t use vector clocks because we’re talking about really big systems.  And when we have this really big system, like let’s say I have a thousand nodes, then I’m going to have a vector clock with a thousand entries in it.  [Marc (while Wyatt is still speaking): Yeah but there are compressed versions of that.]  So it’s going to be huge compared to the small amount of metadata we’re propagating around normally.

Marc: That’s been beaten to death. 

Question Summary: Why not use vector clocks instead of explicit dependencies to capture causality?

Written Answer: We use explicit dependencies because they are compatible with distributed verification, whereas vector clocks are not. They would need a centralized serialization point in each datacenter to ensure that updates from other datacenters are applied in the correct causal order.


Question 5

Ant Rowstron (Session Chair): Okay, let’s go for the last one.  Is it quick?

Question from an unidentified audience member (“Un” for short)

Un: Where is the metadata stored physically?

Wyatt: Metadata? It’s physically stored both on servers and in the client library.

Un: On the servers where?  Like in memory, or … My question is actually, “How do you deal with corruption of metadata or failures on the side?”

Wyatt: So failures inside a datacenter.  We looked at this like, this is not what our main contribution is.  And we took existing techniques like chain replication, that give you this strong consistency, that give you this fault tolerance inside these datacenters.  We said we’ll just build on top of that, that’s not where our contribution is.  And in terms of dealing with bit flips, you’d probably want checksums in your system. I think Amazon came out with that, “we really want that, it screwed things up awhile ago.”

Un: Thanks. 

Question Summary: How do you deal with different types of failures?

Written Answer: That’s not where our innovation is, so we just used existing techniques to deal with failures (currently, chain replication).



In reviewing the questions, it seems pretty clear that almost all of them stem from confusion about parts of the system that were covered quickly or skipped in the talk.  These are good questions to ask immediately after a talk, since other people in the audience are probably confused about the same things.  However, the questions only make sense with the context from either the talk or the paper, and almost all of them would be cleared up by reading the paper.

So let’s break down the potential audience for the extended answers:

1) OSR readers who didn’t see or don’t remember the talk and didn’t read the paper.  The questions and answers wouldn’t make any sense to these people.

2) OSR readers who saw the talk, didn’t read the paper, and remember the talk over a month later.  Based on how much I remember from talks I saw a month ago, I don’t think this will be a very populous group.

3) OSR readers who read the paper. The paper should cover everything that was asked about, so the extra written answers should be unnecessary.  (E.g., Section 2/Fig. 1 answers question 1, and Related Work answers question 3.)

4) People who watched the talk on YouTube.  This audience is relatively large; the video of the talk has 224 views after being up for about a week.  They have exactly the same context as the in-person audience, and I know they have some of the same questions.  For instance, Todd Hoff, who wrote a post about COPS on his High Scalability blog, also thought of question 4: why not use vector clocks?  Given that I misinterpreted the question at the time, it’s good to have a correct answer here!

So while the audience for written answers in OSR would be tiny, I think there is an audience for more detailed answers to questions: YouTube viewers!  I’m now all for written answers to questions, but I think that a blog, like this one, is the appropriate venue for publishing them, not OSR!


Conference Presentation Faux Pas

I recently attended OSDI 2010 where I sat through about 30 presentations on systems-related topics. I was surprised that there were so many occurrences of things that I think should be avoided when giving a presentation.

In this post, I’m going to outline several things that, in my opinion, you should avoid when giving a talk at a conference. Keep in mind that this is just my opinion and you’re welcome to disagree, but I think you’re wrong and I’ll explain why.

Note that the examples below are all taken from the OSDI slides that were posted after the conference. If you’re an author of one of these presentations, please don’t take it as criticism of your work or you. The criticism is only aimed at your stylistic choices when creating your presentation.

Dark Backgrounds


This is probably the worst decision you can make. Now it’s definitely true that dark backgrounds can look nice. In fact, they can look particularly good on the screen where you created the presentation, which is why I think it’s an attractive choice for a lot of people.

The problem with this is two-fold. The first, more minor reason is that it’s difficult to keep things consistent when using a dark background. You often include images, tables, diagrams, and graphs, and many of these will have white backgrounds, so getting them to fit with your presentation is extra work.

Second, the major reason not to use a dark background is that it makes your slides much harder to read when displayed on a projector. The way that the human eye perceives blackness is relative to what’s around it. When you project black text on a white background, the black color of the text looks very black because of the large, bright, patch of white around it. When you project white text on a black background, the background looks washed out because there isn’t much light around it. This effect is also amplified because projectors can’t actually display a real black color. Black is the absence of light, so a projector simply doesn’t project any light at the pixels that are supposed to be black. Since conferences are always in brightly lit rooms, the black background ends up looking like a pale gray.

Ugly Fonts

This presentation uses the Chalkboard font. It turns what should be a technical diagram into what looks like a 4-year-old’s drawing. It’s very distracting. Comic Sans also made an appearance at OSDI. Please use reasonable fonts. You can see the poor choice of a dark background here as well.

Outline Slides

I’ll admit that outline slides have their place. For example, if you’re giving an hour-long talk, it can be nice to give your audience an overview of what you’re going to cover. When you only have 22 minutes, however, wasting time telling us what you’re going to talk about instead of just talking about it is silly. In this particular talk, the speakers went back to the outline slide each time they switched sections. Even if it only takes a few seconds each time, that adds up to a full minute spent telling me what you were going to talk about. Especially when you’re throwing darts at your bullet points. No clue what they were thinking.

Underlined Text

This is not a deal breaker, but I still advise against it. Underlined text usually looks pretty ugly, as it does here. Avoid it.

Header/Footer on Every Slide

Sometimes these can be tastefully done, but many people abuse them. In the first example, the presentation had a giant bar that says “Facebook” on every single slide. Perhaps it’s company policy; I don’t know, but it looks obnoxious, and I don’t need to be reminded on every single slide that this presentation is from Facebook. In the second example, the footer even includes the name of the conference. I often forget which conference I’m at in the middle of a talk, so I’m glad it was there. But really, you don’t need a header and/or footer on every slide. It’s distracting and unnecessary.

Walls of Text

I think people are wising up to this, but there were still a few cases. If you put too much text on a slide, then one of two things will happen: either people only pay attention to what you say – making the slide ineffective, or they spend time reading the slide instead of listening to you. Instead, use the text on your slide to support and enhance what you’re saying. The first example here just has too much text. The second example has a lot of code. Both make it difficult to simultaneously listen to the speaker and read the slides.


Again, this was just my opinion, but if you can avoid these things, please do. It will enhance your presentation and make it so people don’t get distracted.

ACM Symposium on Cloud Computing (SOCC 2010) Day 2

I’m back in Princeton after spending the week in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk from Day 2 of the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.

[See my previous post for Day 1]

Keynote 2

Building Facebook: Performance at Massive Scale
Jason Sobel (Facebook)

Jason gave a great presentation about the architecture at Facebook and some of the lessons their engineering team has learned in the last several years. Facebook has seen enormous growth – from 1 million users in 2007 to 400 million users in 2010. With this high rate of growth, the infrastructure team has had to keep the site running in addition to innovating.

The traditional technique for scaling a website is to store all of a user’s data on a single server so that the data can be partitioned easily and the site can scale. Facebook can’t use this method because its social-network data is interconnected and there is no good way to partition the user graph. Instead, Facebook uses a number of techniques – aggressive caching, a multi-tiered infrastructure, and innovative APIs that let developers easily write applications that scale.

The three-tiered structure that Facebook uses consists of:

  1. a web-server front end running PHP that processes user requests. Not much to say here, except that Facebook now compiles PHP to C++ to speed up execution
  2. a caching layer that uses memcached, which is simple, fast, and reliable. They have tens of TBs of memory for caching!
  3. a database layer that uses MySQL. They have 8,000 server-years of runtime experience without data loss or corruption. Since most requests are cached, the traffic to MySQL stays low, but they require a 3:1 ratio of InnoDB table size to RAM on each server in order to keep up with the request rate.
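A minimal sketch of the look-aside caching pattern implied by tiers 2 and 3 (hypothetical Python; plain dicts stand in for memcached and MySQL):

```python
# Hypothetical sketch of the look-aside cache pattern: the web tier
# checks the cache first and falls back to the database on a miss,
# populating the cache for future reads. Writes go to the database and
# invalidate the cache entry. Dicts stand in for memcached and MySQL.

memcache = {}                      # cache tier (tens of TBs at Facebook)
mysql = {"user:42": "Ada"}         # database tier (source of truth)

def read(key):
    if key in memcache:            # cache hit: no database traffic
        return memcache[key]
    value = mysql.get(key)         # cache miss: go to the database
    if value is not None:
        memcache[key] = value      # populate the cache for future reads
    return value

def write(key, value):
    mysql[key] = value             # write to the database...
    memcache.pop(key, None)        # ...and invalidate the stale cache entry
```

Invalidating on write (rather than updating the cache in place) is the simple, race-tolerant choice; the next read repopulates the cache from the database.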

One interesting caveat is that they are double-caching most data. It’s hard to provide enough space to cache data only in the memcache layer or only in the database layer, so double caching is the simplest option for Facebook right now, but they are looking into reducing it in the future.

Jason also talked about the Hadoop/Hive/Scribe system they use to run MapReduce queries, as well as a system called TAO that provides a simple API to their developers for writing applications. Rather than having developers write database and memcache queries directly, TAO is an API-aware layer built on top of memcache that provides easy-to-use abstractions.

Overall, a great talk with nice insight into Facebook’s infrastructure.


Hermes: Clustering Users in Large-Scale E-mail Services
Thomas Karagiannis (Microsoft Research), Christos Gkantsidis (Microsoft Research), Dushyanth Narayanan (Microsoft Research), Antony Rowstron (Microsoft Research)

The Hermes project comes out of MSR where they are looking into optimizing the storage of e-mail messages in Microsoft Exchange. The normal implementation for Exchange is to take a new user account and locate their messages on any server in the Exchange cluster that has space available. By doing this, it splits load across the Exchange servers well.

The problem is that when two users that are on different physical servers receive the same message in their mailbox, the message has to be stored twice – once on each server. Hermes looks into trying to cluster users to minimize the amount of storage redundancy and network traffic between servers.

An example system was given that has 128K users on 68 exchange servers (1800 users/server). The sample size was 22 weeks of e-mail which contained 337M messages. They calculated that there is 3TB of extra storage required per week to store messages. By using a clustering algorithm to cluster together users that frequently e-mail each other, Hermes was able to save 55TB of storage over the 22 weeks of e-mail. Since moving user mailboxes is expensive, they suggest partitioning once and then re-partitioning once a week.
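The storage arithmetic behind the savings can be sketched with a toy model (this is an illustration, not Hermes itself): a message is stored once per distinct server that hosts one of its recipients, so co-locating frequent correspondents reduces the number of copies.

```python
# Toy model of Hermes-style savings: each message is stored once per
# distinct server among its recipients, so clustering users who email
# each other onto the same server cuts total copies stored.

def copies_stored(messages, placement):
    # messages: list of recipient lists; placement: user -> server
    return sum(len({placement[u] for u in rcpts}) for rcpts in messages)

msgs = [["alice", "bob"], ["alice", "bob", "carol"], ["alice", "bob"]]

scattered = {"alice": "s1", "bob": "s2", "carol": "s3"}
clustered = {"alice": "s1", "bob": "s1", "carol": "s2"}

assert copies_stored(msgs, scattered) == 7   # 2 + 3 + 2 copies
assert copies_stored(msgs, clustered) == 4   # 1 + 2 + 1 copies
```

The clustering algorithm's job is to choose the `placement` that minimizes this count, subject to per-server capacity limits.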

This was nice work, and seems like it solved a big problem with Exchange, although I question the architecture of Exchange to begin with. Services like Gmail do a much better job of storing e-mail messages and don’t run into problems like this.

Defining Future Platform Requirements for e-Science Clouds
Lavanya Ramakrishnan (Lawrence Berkeley National Lab), Keith Jackson (Lawrence Berkeley National Lab), Shane Canon (Lawrence Berkeley National Lab), Shreyas Cholia (Lawrence Berkeley National Lab), John Shalf (Lawrence Berkeley National Lab)
[position paper]

Keith talked about the current state of how (non-computer) scientists use cloud services to do their work. Many scientists use things like Hadoop to process large datasets (e.g., data from the Large Hadron Collider). There are three main types of workloads:

  1. Large scale synchronous workloads – generally run today on supercomputers
  2. Medium scale synchronous workloads – can use local clusters
  3. Large scale asynchronous workloads – these can be parallelized well so can be run on cloud services

He talked about how the interfaces to cloud services are not standardized, so once scientists choose one, they are locked in to a single provider. The suggestion was that cloud services should be more standardized, making it easier for scientists to share and collaborate.

Programming Models and Optimization

Fluxo: A System for Internet Service Programming by Non-expert Developers
Emre Kiciman (Microsoft Research), Benjamin Livshits (Microsoft Research), Madanlal Musuvathi (Microsoft Research), Kevin Webb (University of California San Diego)

This work focused on providing a way for non-experts to write Internet services that scale well. Non-experts now have access to large-scale cloud platforms like Amazon, Google, and Microsoft, so hardware is no longer the bottleneck when an application needs to scale. Instead, the difficulty is architecting the application.

The current techniques for scaling Internet services are things like tiering, partitioning, replication, data duplication and de-normalization, pre-computation, and others. These techniques are notoriously difficult to build and deploy correctly. It would be nice for non-experts to have a way to write applications using these techniques easily.

In comes Fluxo – a compiler that takes input in the form of a restricted programming language and translates it to run on Azure. The restricted programming language simplifies program analysis and allows the compiler to apply the common techniques listed above. Fluxo is profile-driven: it collects metrics, analyzes the program, transforms it, and repeats.

In the evaluation, pre-computation improved performance by up to 60%, and caching decreased latency by up to 50%. It seems like a nice idea – I can see small to medium-sized companies finding a platform like this useful.

Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing
Stephan Ewen (TU Berlin), Fabian Hueske (TU Berlin), Daniel Warneke (TU Berlin), Dominic Battré (TU Berlin), Volker Markl (TU Berlin), Odej Kao (TU Berlin)

Data-centric parallel programming uses schema-free systems like MapReduce and Hadoop to run efficiently. Relational databases, on the other hand, are schema-bound, have well-defined properties, and are flexible and optimizable, but they don’t scale well. This work looks at extending the capabilities of MapReduce. The notion of first-order and second-order functions is introduced: first-order functions are user code, while second-order functions are things like the Map and Reduce steps today.

PACT extends the MapReduce paradigm with some additional second-order functions. For example:

  • Cross (two inputs) – each combination of records from the two inputs is built and is independently processable
  • Match (two inputs) – each combination of records with equal keys from the two inputs is built
  • CoGroup (many inputs) – pairs with identical keys are grouped for each input, group of all inputs with identical keys are processed together
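As a rough illustration, the three operators above could be expressed over lists of (key, value) records like this (plain Python for clarity; real PACTs execute these second-order functions in parallel and pass the results to user code):

```python
# Illustrative sequential versions of the Cross, Match, and CoGroup
# second-order functions over lists of (key, value) records.

from itertools import product
from collections import defaultdict

def cross(a, b):
    # Every combination of one record from each input.
    return list(product(a, b))

def match(a, b):
    # Every combination of records whose keys are equal.
    return [(x, y) for x, y in product(a, b) if x[0] == y[0]]

def cogroup(a, b):
    # For each key, the group of matching values from every input.
    groups = defaultdict(lambda: ([], []))
    for k, v in a:
        groups[k][0].append(v)
    for k, v in b:
        groups[k][1].append(v)
    return dict(groups)

a = [(1, "x"), (2, "y")]
b = [(1, "p"), (1, "q")]
assert len(cross(a, b)) == 4
assert match(a, b) == [((1, "x"), (1, "p")), ((1, "x"), (1, "q"))]
assert cogroup(a, b)[1] == (["x"], ["p", "q"])
```

Each pairing or group produced this way is independently processable, which is what makes the operators parallelizable.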

PACT programs are compiled into parallel data flows that are executed by Nephele, an execution engine for parallel data-flow programs. Nephele provides the process scheduling, communication, and fault-tolerance mechanisms needed to execute the workload. Examples given of PACT workloads are TPC-H aggregation and k-means clustering. I’m not too familiar with running MapReduce workloads, so I’m not sure of the impact this work will have.

The Case For PIQL: A Performance Insightful Query Language
Michael Armbrust (UC Berkeley), Nick Lanham (UC Berkeley), Stephen Tu (UC Berkeley), Armando Fox (UC Berkeley), Michael Franklin (UC Berkeley), David Patterson (UC Berkeley)
[position paper]

This work looked at the tradeoff between RDBMSs and NoSQL-type databases. RDBMSs have powerful declarative query languages, query optimizers that decide the best execution plan, and automatic parallelism, but their performance is opaque. Contrastingly, NoSQL systems have a simple query interface (get/set), but complex queries have to be implemented at the application layer and are non-trivial to architect.

The argument is made for a system called PIQL, which is designed to be a middle ground between the two extremes. PIQL is a scale-independent declarative language that only allows developers to write queries with a data-size-independent upper bound. PIQL is able to say no to certain queries, whereas an RDBMS can’t.

Most applications have natural cardinality constraints. The example given was that a Facebook user has an upper limit of 5,000 friends. PIQL lets developers specify these constraints in their data model so the optimizer can use them to choose an execution plan. If a query would be too slow, PIQL rejects it. PIQL is designed to run on top of existing key/value data stores, and it scales well. I think this is the right direction for providing easier ways to write NoSQL-type applications, and I’m interested in looking more closely at what their API provides.
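A minimal sketch of this "say no to unbounded queries" idea (the names here are illustrative, not the real PIQL API): the data model declares cardinality bounds, and the optimizer rejects any query whose worst-case result size has no data-size-independent bound.

```python
# Hypothetical sketch of rejecting unbounded queries using declared
# cardinality constraints (illustrative names, not the real PIQL API).

constraints = {("User", "friends"): 5000}   # a user has at most 5000 friends

def check_query(relation, attribute, limit=None):
    """Return the worst-case result size, or raise if it is unbounded."""
    bound = constraints.get((relation, attribute))
    if bound is not None:
        return bound if limit is None else min(bound, limit)
    if limit is not None:          # an explicit LIMIT also bounds the result
        return limit
    raise ValueError("unbounded query rejected")

assert check_query("User", "friends") == 5000
assert check_query("Post", "comments", limit=10) == 10
```

A query over an attribute with no declared constraint and no LIMIT raises an error, which is the behavior an RDBMS can't offer: it would instead accept the query and run it, however slowly.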

Towards Automatic Optimization of MapReduce Programs
Shivnath Babu (Duke University)
[position paper]

Shivnath described some of the difficulties with using MapReduce today. The lifecycle of a MapReduce job is that the programmer first provides the Map and Reduce functions and kicks off the job, which gets split into many Map waves, followed by many Reduce waves, until it finishes.

The complexity here is that the user has to specify things like the number of maps, the number of reduces, and memory usage. There are over 190 parameters that a user can specify in Hadoop, and for many of them it isn’t clear how they impact performance. As an experiment, a 50GB terasort was run on EC2. Changing parameters drastically changed the running time, and certain parameters depend on each other, so individual parameters can’t be tuned separately.

The argument was that a better approach is needed for choosing default parameters in Hadoop. There has been a lot of work on automatic tuning in traditional RDBMSs, and those techniques should be leveraged to tune MapReduce workloads.

Benchmarking and Testing

Benchmarking Cloud Serving Systems with YCSB
Brian Cooper (Yahoo! Research), Adam Silberstein (Yahoo! Research), Erwin Tam (Yahoo! Research), Raghu Ramakrishnan (Yahoo!), Russell Sears (Yahoo! Research)

This work looks at how to properly evaluate a NoSQL cloud database. There are many systems out there – PNUTS, BigTable, Megastore, Azure, Cassandra, CouchDB, and many more, but the tradeoffs between systems are not clear. It would be nice if there was a standard benchmark that could be used to analyze the performance of each system so that developers can choose the appropriate solution to meet their needs.

In comes YCSB – an open-source, standard benchmark tool for evaluating NoSQL systems. YCSB focuses on performance (latency and throughput) and scalability (adding servers, latency as the system scales, and elasticity). The YCSB tool is written in Java, and its workloads are easily extendable.

The test setup used six server-class machines, and the systems evaluated were Cassandra, HBase, sharded MySQL, and PNUTS. Some of the example workloads given were (more in the paper):

  • A – Update heavy workload that had 50:50 ratio of reads to updates
  • B – Read heavy workload with 95:5 ratio of reads to updates
  • E – Short range scans of 1 to 100 records of size 1KB
  • Elasticity – Run a read-heavy workload while adding servers from 2 to 6.
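The read/update mixes above are easy to picture as a tiny workload generator. This is a hedged sketch, not the actual YCSB client: the function name is invented, and keys are drawn uniformly here, whereas real YCSB defaults to a zipfian key distribution.

```python
import random

def make_workload(read_pct, update_pct, num_keys=1000, seed=42):
    """Yield (operation, key) pairs mimicking a YCSB-style mix.

    Toy sketch only: op types follow the given ratio; keys are drawn
    uniformly (real YCSB defaults to a zipfian key distribution)."""
    rng = random.Random(seed)
    while True:
        op = rng.choices(["read", "update"],
                         weights=[read_pct, update_pct])[0]
        yield op, f"user{rng.randrange(num_keys)}"

# Workload A would be make_workload(50, 50); Workload B (read heavy):
workload_b = make_workload(95, 5)
ops = [next(workload_b)[0] for _ in range(10_000)]
print(ops.count("read") / len(ops))  # close to 0.95
```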

The graphs showed a clear distinction between different systems and provided insight to which systems would be best in different situations. It was also noted that these tests can be used to improve the database system, as happened with Cassandra which improved dramatically in subsequent version releases.

This looks like a great tool for evaluating the plethora of NoSQL systems out there. It would be nice to have a standard setup (e.g. on Emulab) so that the comparison between systems is fair and repeatable.

Automated Software Testing as a Service
George Candea (EPFL) , Stefan Bucur (EPFL) , Cristian Zamfir (EPFL)
[position paper]

Stefan pointed out that when we install privileged software today (like drivers) we have to take their reliability on faith. Third-party drivers run with the highest privilege in the operating system, so if they have bugs or malicious code, serious problems can occur. The analogy made was to cars – we wouldn’t give our car keys to a random driver without some guarantee of the car’s safety.

It would be nice to be able to test a driver by exploring all possible code paths, looking for common bugs and problems. Testing as a Service (TaaS) is a system for doing just that. TaaS can be used by home users, who upload a driver to a website for evaluation; by developers, who can integrate testing into their IDEs; and as a commercial certification service for software.

This seems like a nice idea, although I question how much traction it could get. A lot of Windows drivers today don’t get certified because companies like to cut corners and produce a product for the least amount of money possible. It seems like even if this service was provided, companies would still cut corners and just skip using it to save money. Being able to test as a home user is not that useful, as a failed test would just give you the option of not installing the software, the same option you have today. Perhaps if a big player like Microsoft started requiring certification for driver installation, it could get the traction it needed.

Keynote 3

The Internal Design of's Multi-Tenant Architecture
Rob Woollen (

This was an interesting keynote. I wasn’t very familiar with exactly what Salesforce provided to its customers before this talk. I knew that a lot of companies used their services, but I didn’t realize the scale they were working at. Salesforce provides cloud services for 75,000 customers running 150,000 custom applications and serving 17,500 websites.

They provide many services, including a query language, APIs, OLAP, batch processing, search, mobile, content management, packaging, and many others. What’s interesting about Salesforce’s architecture, though, is that each customer (or “tenant” as they call them) is not given its own architecture. Instead, all tenants run on a single platform, supported by a single code base.

Each username is mapped to a tenant, which is mapped to an “instance”. An instance is a cluster of hardware including a SAN, application servers, load balancers, object storage, etc. To provide isolation, each tenant runs as a separate Java thread. They have a three-tier caching layer: first caching locally within each Java thread, then a shared cache reached via remote requests, and finally an application-layer caching service for users running on memcache.

Underneath the cache is a persistence layer that uses Oracle. Every customer’s data is stored together in a table, with static (compile time) and dynamic (run time) query analysis that enforces that a customer can only read their own data. All data is stored as a string in the database, so to provide indexes, a separate table has a column for each possible type which can then be indexed.

Salesforce only maintains a single code base for their infrastructure. To provide compatibility with previous releases, functions are annotated with API version numbers, such that if a customer is still using an old version, they don’t get access to changes in the API. When customers package their own code, they are required to submit unit tests that cover most of their code paths, so when new versions are released, Salesforce uses customer tests to verify compatibility with old API versions. An analytics component is provided that does real-time calculations instead of offline processing like most other analytics systems.

This was a nice talk. The presentation was a little dry (Rob didn’t seem very enthusiastic about their services!) but it was still interesting.

Data Services

G-Store: A Scalable Data Store for Transactional Multi key Access in the Cloud
Sudipto Das (UC Santa Barbara) , Divyakant Agrawal (UC Santa Barbara) , Amr El Abbadi (UC Santa Barbara)

G-Store addresses a weakness of most NoSQL databases: they don’t provide transactions over multiple keys. Many NoSQL databases provide test-and-set semantics for a single key but can’t provide any guarantees about multiple keys, so users have to implement transactional logic in the application layer.

G-Store is a layer that runs on top of a key/value store and provides additional functions. Users can submit a list of keys to add to a group. When a group is created, a single node is assigned as the controller for this set of keys. Since a single node controls the group, it can execute transactions over the group. Once a user is finished, the group can be deleted.
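My reading of the mechanism above, as a hedged sketch: once a group is formed, a single controller owns all of its keys, so a multi-key transaction can run locally with no cross-node coordination. The class and method names below are invented for illustration, not G-Store's API.

```python
class KeyGroup:
    """Toy sketch of G-Store-style key grouping: one controller owns
    the grouped keys and can apply multi-key updates atomically."""
    def __init__(self, store, keys):
        self.store = store          # the underlying key/value store
        self.keys = set(keys)
    def transact(self, fn):
        # Work on a private snapshot; commit all-or-nothing.
        snapshot = {k: self.store[k] for k in self.keys}
        try:
            fn(snapshot)            # fn mutates the snapshot
        except Exception:
            return False            # abort: nothing is written back
        self.store.update(snapshot)
        return True

store = {"alice": 100, "bob": 50}
group = KeyGroup(store, ["alice", "bob"])

def pay(s):                         # transfer 30 from alice to bob
    assert s["alice"] >= 30
    s["alice"] -= 30
    s["bob"] += 30

group.transact(pay)
print(store)  # {'alice': 70, 'bob': 80}
```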

This work is similar to our own CRAQ work, which can provide multi-key transactions over a chain. CRAQ only provides static group membership, in that it doesn’t support explicitly moving keys around after their initial creation, but by leveraging Sinfonia-type mini-transactions, it provides the same service as G-Store. I was disappointed that we weren’t cited.

Google Fusion Tables: Data Management, Integration and Collaboration in the Cloud
Alon Halevy (Google) , Hector Gonzalez (Google) , Jayant Madhavan (Google) , Christian Jensen (Aalborg University) , Jonathan Goldberg-Kidon (MIT) , Warren Shen (Google) , Rebecca Shapley (Google) , Anno Langen (Google)
[industrial presentation]

Google Fusion Tables is a service that allows data management for the web era. Google wants to provide data storage that can integrate seamlessly with the web, is searchable, embeddable, extensible, easy to use, provides incentives for sharing data, and facilitates collaboration.

The example given was biking trails in Europe. There is a website that allows users to upload GPS data of bike trails. They want to allow data like this to be integrated into the Google Maps API for visualization and allow users to query it easily, e.g. find trail lengths < 25 miles.

Google Fusion Tables allows users to extend data tables in order to collaborate. Tables can be merged. For example, a user might merge the bike trails table with a sponsors table and a results table. Users can have discussions about the data, set permissions, and perform other types of operations.

The architecture to support these operations has a front end dispatcher that receives requests, a query processing component, a backend that does plan execution, and a replicated storage layer running on BigTable. This seems like a nice, useful API. Looking forward to checking it out.

High Availability and Reliability

Making Cloud Intermediate Data Fault-Tolerant
Steve Ko (Princeton University) , Imranul Hoque (University of Illinois at Urbana-Champaign) , Brian Cho (University of Illinois at Urbana-Champaign) , Indranil Gupta (University of Illinois at Urbana-Champaign)

This work tries to address some shortcomings in Hadoop. In MapReduce, there is a two-stage computation: the Map step and the Reduce step. Between processing steps, there is a large amount of intermediate data that has to get shipped from server to server. For example, Google’s indexing application runs a chain of 24 MapReduce jobs. In many cases, this intermediate data is larger than either the input data or the output data.

If a MapReduce node fails, the computation it was supposed to provide has to be rerun on another node. The problem is that to rerun the computation, the intermediate data that was sent to the failed node has to be sent to the new node as input. If it’s not the first step in the chain, this intermediate data is no longer available – it was the output of the previous step. To recover it, the previous step has to be rerun as well. This is a problem when chaining multiple steps together, because the failure will cascade and cause every previous step to have to be rerun.

This work extends Hadoop by replicating this intermediate data. Unfortunately, this isn’t a simple thing to do because most MapReduce jobs are network-bound (verified experimentally), meaning that adding additional replication slows down the job. An observation is made, though, that often the bottleneck is at the inter-rack switches. The network capacity within a rack is not saturated. To take advantage of this, replication can be done at the rack level.
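The rack-level observation above can be sketched in a few lines. This is a hedged illustration, not the paper's actual placement policy: I assume hosts are numbered so that integer division by the rack size gives the rack id, and I simply replicate to the next host within the same rack so the copy never crosses the saturated inter-rack switch.

```python
def rack_local_replica(host, hosts_per_rack=20):
    """Pick a replica target in the producer's own rack, so the
    intermediate-data copy stays off the inter-rack links.
    (Numbering scheme and next-host choice are illustrative.)"""
    rack = host // hosts_per_rack
    first = rack * hosts_per_rack
    # Replicate to the next host, wrapping around within the rack.
    return first + (host - first + 1) % hosts_per_rack

print(rack_local_replica(0))   # 1  (stays in rack 0)
print(rack_local_replica(19))  # 0  (wraps within rack 0)
print(rack_local_replica(20))  # 21 (rack 1)
```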

Failures really do occur often in large MapReduce clusters (verified from industry reports), and an experimental evaluation shows that when using rack-level replication, jobs can be sped up significantly when failures occur. When no failures occur, there is only a small overhead associated with the replication. This seems like a nice addition to Hadoop, and I hope to see it added to the main branch.

Characterizing Cloud Computing Hardware Reliability
Kashi Vishwanath (Microsoft Research) , Nachi Nagappan (Microsoft Research)

This work looks at hardware failures in large-scale datacenters. Microsoft has 100K+ servers, geographically distributed around the world and running heterogeneous hardware. They observe lots of failures, and since a person has to diagnose each hardware fault and perform the necessary fix, failures come at a high cost.

The dataset used here was from Microsoft’s online services, which run things like Messenger, Hotmail, etc. The dataset covered a 14 month period, looking at 100K+ servers, and contained a work log from their datacenter. The work log records which server fails along with what component had to be replaced (e.g. disk, DIMM, RAID controller, etc.).

The analysis they performed looked at characteristics in aggregate, failure rates across individual servers and datacenters, and tried to see if failures could be predicted. The method used was a decision tree. Some interesting takeaways:

  • For every 100 machines, 9.5 machines recorded a failure annually
  • For every 100 machines, 20 failures were recorded annually (so about 2 repairs per failing machine)
  • 70.7% of failures were disks, 6.5% RAID controller, 5.1% memory, 17.6% other
  • When a failure occurs, there is a 20% chance another failure will occur within 1 day!
  • Failure probability is correlated with number of disks in a server, slope is greater than 1!
  • Many times, a faulty RAID controller would cause failures to be reported as disk failures

A Self-Organized, Fault-Tolerant and Scalable Replication scheme for Cloud Storage
Nicolas Bonvin (EPFL), Thanasis Papaioannou (EPFL), Karl Aberer (EPFL)

I really have nothing to say about this talk. I couldn’t understand anything from the presentation at all. I verified it wasn’t just me – others that I asked didn’t understand it, and there were no audience questions either. This is a great example of how not to give a talk.

Storage and System Modeling

Robust and Flexible Power-Proportional Storage
Hrishikesh Amur (Georgia Institute of Technology) , James Cipar (Carnegie Mellon University) , Varun Gupta (Carnegie Mellon University) , Michael Kozuch (Intel Corporation) , Gregory Ganger (Carnegie Mellon University) , Karsten Schwan (Georgia Institute of Technology)

This work looked at trying to provide a distributed storage service that is power-proportional, meaning that as the load on the system increases, the power consumption should increase proportionally. This is not the case today – idle servers still draw a large percentage of peak-load power.

One way to reduce power consumption is just to turn off servers during times of low demand, but traditional distributed storage services use methods like consistent hashing to get good load balancing, so if you shut off a server, the data on that server is no longer available.

Instead, what Rabbit does is store a primary replica of every key in a small number of nodes. A secondary replica goes to a larger number of nodes, a third replica to an even larger set of nodes, and a fourth replica to the largest number of nodes. By doing this, you can turn off 95% of your servers without losing availability. If you have 5 servers on, each serves 20% of requests. If you have 100 servers on, each serves 1% of requests.
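The nested replica sets above can be sketched with a toy placement function. This is a hedged illustration of the idea, not Rabbit's actual layout: the set sizes and the hash-based placement are invented, but they show why every key stays readable when only the smallest set of servers is powered on.

```python
import hashlib

def place_replicas(key, set_sizes=(5, 25, 50, 100)):
    """Toy sketch of Rabbit-style nested replica sets: replica i of a
    key is placed on one of the first set_sizes[i] servers, so every
    key's primary lives on servers 0-4. (Set sizes and hash placement
    are invented for illustration.)"""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return [h % n for n in set_sizes]

placements = [place_replicas(f"key{i}") for i in range(1000)]
# Power off all but the first 5 servers: every key is still readable
# from its primary replica, each survivor now serving ~20% of requests.
assert all(p[0] < 5 for p in placements)
```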

This allows for fast, fine-grained scaling by simply powering off machines. An example benchmark of a 100GB terasort shows that read throughput scales linearly with percentage of maximum power used. However, write performance decreases because all data has to be written to a small number of nodes. A method is used from previous work to offload writes to get better write performance.

RACS: A Case for Cloud Storage Diversity
Lonnie Princehouse (Cornell University) , Hussam Abu-Libdeh (Cornell University) , Hakim Weatherspoon (Cornell University)

RACS is a method for combining multiple cloud platforms together for data storage, such that a single vendor doesn’t have to be relied on. Today, when choosing a cloud datastore, customers are often locked in because of the high cost of switching providers. To transfer a large amount of data from vendor A to vendor B, the customer must pay for the bandwidth cost of the transfer twice – once for each provider.

What RACS does is use erasure coding across many cloud platforms to reduce the costs associated with switching vendors. The notation RACS(k,n) is used, where it stripes data over n providers such that any k-subset contains a complete copy of the data. This can tolerate up to (n-k) failures before data is lost.
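The cost/fault-tolerance tradeoff of the RACS(k,n) notation falls out of two small formulas: storage overhead is n/k times the raw data size, and up to n−k providers can fail before data is lost. A quick sketch (the function name is mine):

```python
from fractions import Fraction

def racs_overhead(k, n):
    """Storage overhead and fault tolerance of RACS(k, n): data is
    striped so any k of the n providers can reconstruct it, costing
    n/k times the raw size and tolerating n - k provider failures."""
    return Fraction(n, k), n - k

print(racs_overhead(1, 2))  # (Fraction(2, 1), 1) -> plain mirroring, 2x cost
print(racs_overhead(8, 9))  # (Fraction(9, 8), 1) -> only 12.5% extra cost
```

This matches the evaluation's finding that RACS(1,2) is much more expensive than a single provider while RACS(8,9) is only slightly more expensive, with the same single-provider fault tolerance.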

An evaluation used an FTP trace from the Internet Archive spanning 1.5 years. The costs of three providers over time were very similar. Using RACS is more expensive: RACS(1,2) is much more expensive, while RACS(8,9) is only slightly more expensive. However, the cost of switching providers is very high when not using RACS, on the order of two months of operating costs, compared to a much lower cost when using RACS. This is interesting work – I hope they implement more cloud platforms in addition to their current prototype.

Characterizing, Modeling, and Generating Workload Spikes for Stateful Services
Peter Bodik (UC Berkeley) , Armando Fox (UC Berkeley) , Michael Franklin (UC Berkeley) , Michael Jordan (UC Berkeley) , David Patterson (UC Berkeley)

Web applications often experience unpredictable spikes in traffic. For example, after the death of Michael Jackson, Wikipedia saw 13% of its total requests go to his single page. The first contribution of the paper was to characterize these spikes. There are two main types – workload volume spikes, characterized by time to peak, duration, magnitude, and maximum slope; and data hotspots, characterized by the number of hotspots, spatial locality, and entropy.

The conclusion was that there is no typical type of spike because they vary widely. The next contribution was to try and generate workload spikes using the 7 characteristics listed above. They were able to create a generative spike model using two parameters N, the number of hotspots, and L, a clustering parameter.

After using the generative model, they were able to match real world examples they collected with their counterpart from the output of the generative model, validating that the model produces valid workload spikes.


SOCC turned out to be a great conference, and was definitely a success overall. The next SOCC will be co-hosted with SOSP in 2011 in Portugal, and I’m looking forward to it if I’m able to attend!

ACM Symposium on Cloud Computing (SOCC 2010) Day 1

I’m currently in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk at the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.

[See my next post for Day 2.]

Keynote 1

[Note: My SIGMOD paper was being presented opposite the first keynote, so Steve Ko who is also at the conference wrote this first summary.]

Evolution and Future Directions of Large-Scale Storage and Computation Systems at Google
Jeffrey Dean (Google)

Jeff Dean gave a keynote about different services running at Google and general principles on how to build large-scale services. The talk was roughly divided into three parts. The first part was about Google’s data centers, which house a few hundred clusters. Each cluster has thousands of machines with one or a handful of hardware configurations. Each machine runs (at least) GFS and Colossus (the next-gen GFS), plus a cluster scheduling daemon.

The second part was about Google’s back-end services including MapReduce, BigTable, and Spanner. A few interesting notes about these systems are:

  1. BigTable now has a dedicated team that manages BigTable service clusters. Because of this, there has been a lot of work on fair-share scheduling and performance isolation.
  2. BigTable has something called coprocessors, which are basically “arbitrary code that runs next to each tablet in table”.
  3. Spanner is a storage & computation system that runs across data centers. It supports a mix of strong & weak consistency models, fine-grained replication, a “zone”-based hierarchy (1 master per zone), etc.

The third part was about experiences and design patterns for building large-scale services. There were too many design patterns to post here (Jeff said he would post slides on his website), but here are a few which I find interesting:

  1. Use back-of-the-envelope calculation whenever possible before choosing a design
  2. Design for growth, but don’t overdo it (e.g., 5x – 10x OK, but not 100x growth), and
  3. Canary requests: some requests crash every single process in the same service, so try a few machines first.
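Back-of-the-envelope calculation (pattern 1 above) is worth a concrete sketch. The latency constants below are the rough 2010-era numbers Jeff Dean often quotes, reproduced from memory, so treat them as order-of-magnitude assumptions rather than measurements; the scenario (reading 30 thumbnails for a results page) is a classic example of the technique.

```python
# Approximate 2010-era latency numbers (assumptions, order of magnitude):
DISK_SEEK     = 10e-3    # s per seek
DISK_READ_1MB = 20e-3    # s to read 1 MB sequentially from disk
MEM_READ_1MB  = 0.25e-3  # s to read 1 MB from memory

def thumbnail_page(n_images=30, image_mb=0.25, from_disk=True):
    """Time to fetch n images of image_mb MB each, serially."""
    if from_disk:
        return n_images * (DISK_SEEK + image_mb * DISK_READ_1MB)
    return n_images * image_mb * MEM_READ_1MB

print(f"disk:   {thumbnail_page():.3f} s")                 # ~0.45 s, too slow
print(f"memory: {thumbnail_page(from_disk=False):.4f} s")  # ~0.002 s, fine
```

Five minutes with numbers like these rules out the disk-serial design before anything is built, which is exactly the point of the pattern.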

Overall, it was an interesting talk about quite a broad set of topics. There are only a few places where you can accumulate wisdom about building truly large-scale systems, and it is always interesting to see what they are doing to cope with the scale.

Operating Systems

An Operating System for Multicore and Clouds: Mechanisms and Implementation
David Wentzlaff (MIT) , Charles Gruenwald III (MIT CSAIL) , Nathan Beckmann (MIT CSAIL) , Kevin Modzelewski (MIT CSAIL) , Adam Belay (MIT CSAIL) , Lamia Youseff (MIT CSAIL) , Jason Miller (MIT CSAIL) , Anant Agarwal (MIT CSAIL)

This work addresses the issue of how to create applications that run in the cloud. Machines today have many cores (16 today, up to 1000 in five years), and parallelizing across many cores is very difficult. They created a new system called Fos (Factored Operating System) which runs on top of Xen and splits the OS into separate services. The kernel is a microkernel and all services run on top (e.g. File System). The communication between components in the system is done with message passing. There are no shared locks.

Because of this abstraction, each component can be adjusted elastically to handle demand. For example, if an application starts accessing the File System component at too high of a rate, the underlying system can spawn another File System component on another core or another machine and start transparently redirecting requests to the new component.
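A toy model of that elastic spawning, with heavy hedging: the saturation threshold, least-loaded dispatch, and class shape below are my illustration, not fos's actual policy or message-passing machinery.

```python
class ElasticService:
    """Toy sketch of fos-style elasticity: spawn another service
    instance (conceptually on another core or machine) when every
    existing instance is saturated, and transparently send each
    request to the least loaded instance."""
    def __init__(self, max_load=100):
        self.max_load = max_load
        self.instances = [0]        # outstanding requests per instance
    def request(self):
        if min(self.instances) >= self.max_load:
            self.instances.append(0)          # spawn a new instance
        i = self.instances.index(min(self.instances))
        self.instances[i] += 1                # dispatch to least loaded
        return i

fs = ElasticService(max_load=2)
served_by = [fs.request() for _ in range(6)]
print(served_by)          # [0, 0, 1, 1, 2, 2] -- instances appear as load grows
print(len(fs.instances))  # 3
```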

The implementation of Fos is fairly complete – applications include a web server, slide viewer, video transcoder (ffmpeg), and BusyBox. In fact, it was revealed at the end of the talk that the presentation itself was running on Fos, and it didn’t crash! The system’s website runs on the Fos web server.

I’m not sure how this work will play out. I could see this becoming a standard approach as servers start having thousands of cores, but I’m not sure how applications here would be able to cope with network latency involved in inter-machine message passing.


Lithium: Virtual Machine Storage for the Cloud (can’t find this paper online yet)
Jacob Hansen (VMware) , Eric Jul (Bell Labs, Dublin)

This work looks at trying to increase the performance of shared file systems that are used by virtual machine clusters. In the traditional setup, a datacenter has a SAN server that hosts files for VMs in a cluster to access. The SAN is set up such that it is highly reliable and provides high throughput.

The problem with a SAN is that it’s expensive and it doesn’t scale very well to hundreds or thousands of servers accessing it simultaneously. It’s also limited by network bandwidth. VMware’s Lithium technology does away with the SAN and instead arranges the VM hosts into a peer-to-peer network and uses local storage to store files, replicating data for redundancy and performance.

The system still preserves the features required of a VM file server, e.g. cloning, snapshots, etc. It uses a branching method similar to some source control systems like Mercurial for quick copying of large files.

When compared to a SAN, Lithium doesn’t perform as well with a small number of hosts, but as the number of VM hosts increases, Lithium scales linearly while the SAN maxes out at a constant throughput. This approach seems like a great idea, and I hope to see it pushed to production in future VMware releases.

Differential Virtual Time (DVT): Rethinking I/O Service Differentiation for Virtual Machines
Mukil Kesavan (Georgia Institute of Technology), Ada Gavrilovska (Georgia Institute of Technology), Karsten Schwan (Georgia Institute of Technology)

The presentation of this work was quite confusing, but the basic idea is that fair scheduling of resources in a VM often has ill effects. For example, TCP doesn’t react well to sudden changes in available bandwidth (one reason TCP performs poorly over wireless networks), so when a VM host artificially limits a guest’s access to the network, the TCP congestion window drops dramatically and performance suffers.

To try to solve this problem, DVT takes the approach of never sharply decreasing a guest’s share of the network. For example, if guest X is receiving 100% of network access but guests W, Y, and Z suddenly request access, rather than immediately reducing X to 25%, DVT slowly reduces its share over time. To keep access fair, DVT eventually reduces X to below 25% to make up for the increased share it received earlier.
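The ramp-down can be pictured with simple exponential smoothing. This is my illustration, not DVT's actual controller: I model only the gradual decrease toward the fair share; the real scheduler additionally dips the guest below its fair share afterwards to repay the surplus service it received while ramping down.

```python
def dvt_schedule(start=1.0, fair=0.25, alpha=0.7, steps=12):
    """Toy model of DVT-style smoothing: instead of cutting guest X's
    share from 100% straight to its 25% fair share, move a fraction
    alpha of the remaining distance each scheduling interval.
    (Exponential smoothing is my stand-in for the real policy.)"""
    share, out = start, []
    for _ in range(steps):
        out.append(share)
        share += alpha * (fair - share)
    return out

shares = dvt_schedule()
print([round(s, 3) for s in shares])  # decays smoothly from 1.0 toward 0.25
```

The smooth curve is what keeps the guest's TCP congestion window from collapsing the way an instant 100% to 25% cut would.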

By doing this, DVT increases the performance of VM guests by up to 25% since the TCP congestion window doesn’t drop sharply. It was mentioned that future work will look at applying the same method to disk access, although it isn’t clear to me how slowly reducing disk access instead of sharply reducing it would increase performance in an application.

Virtual Machine Power Metering and Provisioning
Aman Kansal (Microsoft Research) , Feng Zhao (Microsoft Research) , Jie Liu (Microsoft Research) , Nupur Kothari (USC) , Arka Bhattacharya (IIT Kharagpur)

This work asked the question “Can we tell how much power a VM guest is consuming?”. I was wondering throughout the talk what the motivation for measuring this was, until it was finally mentioned at the end, so I’ll start with the motivation instead. The main reasoning given was to use knowledge of a VM guest’s consumption to provision datacenter power accordingly. Other uses are to charge cloud users according to power consumption (although I don’t buy this, as I don’t see how it would differ from billing with current metrics – CPU, memory, storage, bandwidth), and to track which VMs consume the most power so power reduction can be targeted for “green” initiatives.

Answering this question is a two-step process. First, they measure the power consumption of each component in the system when a context switch between VM guests happens, using OS-level events. Next, they have to figure out which VM guest is using each component. Rather than doing this live, they monitor a VM guest for a period of time to determine its power consumption and then use that value for future calculations. The reason is that different loads may use different internal power for the same externally visible power state, so learning a VM guest’s power profile over time gives a more accurate measurement.
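The attribution step reduces to something like a linear model: power charged to a VM is a weighted sum of its per-component utilization, with the weights learned during the offline profiling phase. A hedged sketch, with invented coefficients and function names, in the spirit of the talk rather than its exact formulas:

```python
def vm_power(utilization, profile):
    """Attribute power to a VM as a weighted sum of its per-component
    utilization. `profile` holds per-component watts at full
    utilization, learned by watching the VM run for a while (the
    offline step from the talk); all numbers here are invented."""
    return sum(profile[c] * u for c, u in utilization.items())

web_vm_profile = {"cpu": 80.0, "disk": 25.0}   # learned, hypothetical
print(vm_power({"cpu": 0.6, "disk": 0.1}, web_vm_profile))  # ~50.5 W
```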

There were a lot of technical details glossed over in this talk, and there were many formulas on the slides that weren’t explained or accompanied by variable descriptions, so I found the presentation somewhat confusing. I’m sure reading the paper would make this more clear.

Distributed and Parallel Processing

Stateful Bulk Processing for Incremental Algorithms
Dionysios Logothetis (UC San Diego), Christopher Olston (Yahoo! Research) , Benjamin Reed (Yahoo! Research) , Kevin Webb (UC San Diego) , Kenneth Yocum (UC San Diego)

This work targets large data applications. These are things like web analytics, graph mining, log analysis, and PageRank, which use massive amounts of data. An insight here is that these applications have to continually process on the order of TBs of new data per day and they are stateful, but the running time is proportional to the total amount of state, not proportional to the amount of new data.

Continuous Bulk Processing (CBP) provides users with an additional API of a Translate() function and a RouteBy() function similar to the map and reduce stages of MapReduce. Current systems only do “outer” grouping, while CBP allows for “inner” grouping so that only the state that needs to be accessed is shipped around. In an evaluation, this inner grouping method reduced running time by up to 53%.

Perhaps I’m not familiar enough with MapReduce, but the presentation went too fast for me to follow the details of the Translate and RouteBy API, so see the paper for details.

Comet: Batched Stream Processing for Data Intensive Distributed Computing
Bingsheng He (Microsoft Research), Mao Yang (Microsoft Research) , Zhenyu Guo (Microsoft Research) , Rishan Chen (Beijing University) , Wei Lin (Microsoft Research) , Bing Su (Microsoft Research) , Lidong Zhou (Microsoft Research)

Comet is a system that tries to reduce the running time of large data-intensive applications that have work in common. The example given was a set of four jobs: one that computes the top ten hottest Chinese pages daily, another that computes the top ten hottest English pages daily, and corresponding jobs that compute the top ten hottest Chinese and English pages weekly. The first two jobs have the same input data, but have different filters in place, so the first step of each of those jobs is the same.

The flow of the system is: query series, normalization, logical optimization, physical optimization, execution plan, and then execution. By doing these optimization steps, Comet is able to reduce the amount of work done by 52% for jobs that have commonalities with previous jobs.

The evaluation showed that computing the top ten pages weekly was able to take advantage of the top ten daily calculation, but the top ten pages for the week don’t necessarily overlap with the top ten pages of each day, so it’s not clear how this works. This question was asked in the Q&A, but the author wasn’t able to answer. The presentation of this work was very confusing, and it was clear that the rest of the audience didn’t understand either. I’m sure reading the paper would make more sense.

Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon (University of Washington) , Magdalena Balazinska (University of Washington) , Bill Howe (University of Washington) , Jerome Rolia (HP)

This work addresses how to reduce the running time of large scientific experiments. The example given here was an application that takes astronomical images, does feature extraction to identify the celestial objects in the images, and then runs a Friends of Friends algorithm, which is used in astronomy to cluster celestial objects. MapReduce-type systems like Hadoop are a great fit for workloads like this, but it is hard to express complex algorithms and get good performance. As an example, this algorithm took 14 hours to run when first implemented, and after a week of work they were able to reduce the running time to 70 minutes.

The reason these types of algorithms can take a long time to run is that the same amount of input data doesn’t always have the same running time (e.g. denser clusters of celestial objects take longer to process), so a static partitioning scheme doesn’t perform well. An alternative is to use micro-partitions, which reduce the impact of skew, but they add framework overhead, and finding the sweet spot requires running the algorithm many times, which is undesirable.

The SkewReduce algorithm takes a sampling approach to figure out the best partitioning scheme. In evaluation, this SkewReduce algorithm was able to reduce the running time of algorithms by 2-8 times. This seems like a nice scheduling algorithm, and I hope this finds its way into the main branch of Hadoop. A person who works at Google shared a similar optimization that Google uses, but they do their optimizations in the Reduce stage rather than the Map stage.
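The sampling idea can be sketched as follows. This is a hedged illustration of the approach, not the paper's optimizer: I sample the input, estimate per-key cost from the sample with an invented cost model, and place partition boundaries so ranges get roughly equal *estimated cost* rather than equal record counts.

```python
import random

def skew_aware_splits(keys, cost, n_parts=4, sample_frac=0.1, seed=0):
    """Toy sketch of SkewReduce's approach: estimate processing cost
    from a small sample, then cut the key space into ranges of
    roughly equal estimated cost. (Greedy boundary placement and the
    cost model are my illustration.)"""
    rng = random.Random(seed)
    k = max(1, int(len(keys) * sample_frac))
    sample = sorted(rng.sample(keys, k))
    costs = [cost(x) for x in sample]
    target = sum(costs) / n_parts
    splits, acc = [], 0.0
    for x, c in zip(sample, costs):
        acc += c
        if acc >= target and len(splits) < n_parts - 1:
            splits.append(x)     # partition boundary in key space
            acc = 0.0
    return splits

# Skewed workload: keys near 0 are cheap, keys near 999 are expensive.
keys = list(range(1000))
splits = skew_aware_splits(keys, cost=lambda x: 1 + x)
print(splits)  # boundaries bunch toward the expensive end of the range
```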

I will attempt to get my writeup for Day 2 posted late tomorrow.
[Update: Day 2 posted.]

Computing for Global Development

The Oct 2009 issue of the SIGCOMM CCR has an editorial by Kentaro Toyama and me where we ask whether technologies for developing regions should be considered a core area of computer science research. It is relatively easy to argue that technology can help improve the lives of the poorest billion people on the planet. But is it research? More specifically, is it computer science research? This editorial stems from our discussions at the CCC Workshop on Global Development. Keshav asked us to merge our somewhat opposing views into an editorial. You can read it here. In this post, I will give a summary of the Workshop on Global Development:

CCC Workshop on Global Development:

The Workshop on Global Development was held at the Claremont, Berkeley. Thanks to CCC funding, we were able to fly in 40+ participants and cover all their expenses. Although I was one of the organizers, the views presented here are mine and not those of the participants or sponsors.

Digital technologies have done wonders for mankind. However, benefits of digital technologies (e.g., the Internet) are often limited to the “first world”, leading to the so-called “digital divide”. People in developing regions get access to a decreasing share of digital resources, which are critical for socio-economic development in the 21st century.

The last decade has seen a resurgence of interest in the application of digital technologies to address problems in global development. Eric Brewer’s TIER group at Berkeley is one example. However, this area has not (yet) gained acceptance as core computer science research. There are many reasons for this and the CCR editorial talks about them in detail.

The purpose of the CCC workshop was to bring together lead researchers in the area, along with experts in more established areas, and have a frank discussion about the future of developing regions research. This was a true workshop in the sense that unlike a “mini-conference” there were no PowerPoint presentations. The participants came with a vague idea of what their thoughts were and, after some intense discussion, left with a deeper understanding of the problem and possible solutions. We had to feel the pulse of the audience and adapt the workshop agenda on-the-fly. This was truly exciting.

Highlights of Day 1:
Panel on Technical vs. Non-Technical: Eric Brewer (Berkeley), Kentaro Toyama (Microsoft Research), Tapan Parikh (Berkeley), and Bhaskar Raman (IIT-Bombay) debated the question “Is ICTD a technical or a non-technical field?” The area combines some hard technical problems with applications of existing technologies in new ways. Computer scientists won’t get interested unless the problems are technically hard. The challenge is to either find problems that are both technically hard and can have a real impact, or to find innovative uses of existing technologies that solve problems at large scale, i.e., for millions of poor people.

Panel on Starting New Research Areas:
Before more researchers get interested in the area and a plethora of research material is published, we need to step back and learn from history. Deborah Estrin (UCLA), Randy Katz (Berkeley), Gaetano Borriello (Univ. of Washington), and Rakesh Agrawal (Microsoft Research) gave their insights on starting a new field. Deborah talked about her early experiences with sensor networks, Gaetano commented on the early days of pervasive computing, Rakesh on data mining, and Randy talked about the opportunities and problems facing new fields in general. Food for thought: what is the use of publishing X papers that all solve a problem Y which will never actually occur in real life? Ever. I have purposely left out specific views of the panelists from this post. Some information is available from the workshop summary and proceedings, but we decided not to make the workshop minutes public. Feel free to contact me in private if you want more information.

Keynote – Anil Gupta (Honey Bee Network):
Most of the afternoon was spent discussing issues like branding, publication venues, funding, and education. And then Anil Gupta gave a very interesting keynote. He showed specific examples of innovation in developing regions, where poor people have engineered technology in various ways to solve their everyday problems. This shows that a) technology, specifically designed for their needs, can help their daily lives and b) people in developing regions should not only be consumers of such research/products, but can actively participate as innovators themselves. Watch this Discovery Channel video about Honey Bee Network to get an idea of what Anil was talking about:

I must admit that at the end of the first day, everything was unclear. We raised more questions than answers, and it was not clear what direction this research community would take. Or whether this research community even exists in the first place.

Highlights of Day 2:
We started the day by presenting specific problems in networks/systems, HCI, AI, applications, and software engineering that the participants thought were important in the area. This helped set some “grand challenges” for each sub-area. Frankly, I think most of them were examples of good individual projects rather than “grand challenges”. Nevertheless, it helped the participants get a feel for what research direction each sub-area might take.


Panel on Technology Transfer:
We had a panel on technology transfer with Nathan Eagle (MIT), Umar Saif (LUMS), Jonathan Jackson (Dimagi), and Vijay Chandru (Strandls). Vijay was behind the famous Indian Simputer project and talked about how they went about commercializing the Simputer and what lessons they learned (the image on the left shows Vijay holding a Simputer). Umar talked about a startup that provides citizen journalism and another that offers an SMS-based, Twitter-like service for Pakistan. He described how these services have helped in disaster situations and how they provide a means of disseminating information in a place where media freedom is often restricted. Jonathan talked about the ups and downs of providing healthcare technology solutions in Africa.

New SIG/Conference:
A turning point in the workshop was when Bill Thies presented a proposal for a new ACM SIG. There was unanimous support for forming such a SIG, and everyone agreed with its goals and purpose. Suddenly, there was a sense of community building. The area now had a name (sort of) and a SIG to associate with. If developing regions research becomes part of mainstream computer science in the coming decades, this moment was, in a way, its birth. We went on to discuss the specifics of publication venues (conferences and journals). I don’t want to disclose information about what exactly is happening, but it will be public soon.

Closing & Acknowledgments:
In the closing comments, someone emphasized: “Let’s not forget that we are all here because we want to touch human lives in some way. What we are doing in the end boils down to the direct impact on the underprivileged. This is the single most important and unique aspect of this area of research.” This thought stuck with me after the event.

I’d like to thank Tapan Parikh (Berkeley), Lakshmi Subramanian (NYU), and Bill Thies (Microsoft Research) for doing all the work. I merely helped in a few tasks here and there.

IPTPS ’10 call for papers

Together with Arvind Krishnamurthy, I’ll be chairing this year’s International Workshop on Peer-to-Peer Systems (IPTPS).  The workshop was started in 2002, which coincided both with the popularization of P2P file sharing (Napster, KaZaA) and the introduction of distributed hash tables (DHTs) from several different research groups.

Eight years later, P2P file sharing is still going strong (now through BitTorrent), while the previously-academic DHTs have found their way into real use.  DHTs now form the decentralized lookup structures for file sharing services—in the form of so-called “trackerless” BitTorrent—with the DHT in the Vuze service comprising more than a million concurrent users.  (As an aside, I’m proud to note that Vuze’s DHT is based on Kademlia, which was proposed by one of my officemates in grad school, Petar Maymounkov.)
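To make the DHT idea concrete, here is a minimal sketch of Kademlia's XOR distance metric, the idea underlying lookups in DHTs like the one Vuze deploys. The tiny integer IDs and the `closest_nodes` helper are illustrative; real implementations use 160-bit hashes and iterative network queries.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia defines the distance between two IDs as their bitwise XOR."""
    return a ^ b

def closest_nodes(key: int, node_ids: list[int], k: int = 3) -> list[int]:
    """Return the k node IDs closest to the key under the XOR metric.
    A real lookup iteratively asks these nodes for still-closer ones."""
    return sorted(node_ids, key=lambda n: xor_distance(key, n))[:k]

nodes = [0b0001, 0b0100, 0b0111, 0b1010, 0b1100]
print(closest_nodes(0b0110, nodes))  # → [7, 4, 1]
```

Because the metric is just XOR, each node can route toward any key using only a small set of contacts per bit prefix, which is what makes million-node deployments like Vuze's feasible.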

These self-organizing systems have also found their way into the datacenter.  One notable example is the storage system, Dynamo, that forms the basis for Amazon’s shopping cart and other back-end applications.  Or Facebook’s Cassandra, used for its Inbox search.  Or the rest of the key-value stores that do automated partitioning.  And we are starting to see these techniques being proposed for scaling enterprise networks as well.  With that in mind, we wanted to broaden the scope of this year’s IPTPS to include topics relating to self-organizing and self-managing distributed systems, even those running in single administrative domains.
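The automated partitioning in systems like Dynamo and Cassandra rests on consistent hashing: each node owns an arc of a hash ring, so adding or removing a node remaps only neighboring keys. A toy sketch (node names and key format are made up for illustration):

```python
import hashlib
from bisect import bisect

def ring_position(name: str) -> int:
    """Hash a node or key name to a point on the ring."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:8], "big")

def owner(key: str, nodes: list[str]) -> str:
    """Walk clockwise from the key's ring position to the first node."""
    ring = sorted((ring_position(n), n) for n in nodes)
    positions = [pos for pos, _ in ring]
    idx = bisect(positions, ring_position(key)) % len(ring)
    return ring[idx][1]

nodes = ["node-a", "node-b", "node-c"]
print(owner("shopping-cart:42", nodes))
```

If the owning node fails, the key simply falls to the next node clockwise; the rest of the ring is untouched, which is why these stores can rebalance without global coordination.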

We also plan to have a demo session at this year’s IPTPS to highlight developed and deployed systems.  The workshop will be co-located with NSDI in San Jose, so it will be especially convenient for those in the Bay Area.  We welcome submissions (both papers and demos) from researchers, developers, and hackers.  If you don’t want to write a paper, come show off your running P2P system.

Paper submissions are due Friday, December 18, 2009.  More information can be found at

Continue reading IPTPS ’10 call for papers

Firecoral @ IPTPS

We’ve recently been working hard on Firecoral – a browser-based, peer-to-peer content distribution network for web caching. I’ll be presenting a short talk on Firecoral at the 8th International Workshop on Peer-to-Peer Systems (IPTPS) on April 21st in Boston, MA.

Peer-to-peer content distribution has been inarguably successful for large file distribution (e.g. BitTorrent), but P2P services have been restricted to stand-alone applications, not transparently incorporated into Web browsing and seamlessly running over HTTP. CoralCDN has served as a web content distribution network for the past five years, but its deployment has been limited to PlanetLab and demand quickly outgrew capacity.

Firecoral’s goal is to scale web content distribution to end users by allowing mutually distrustful users to share their web browser caches, yet ensure the authenticity of content and enable users to preserve privacy by expressing flexible content sharing policies.

To achieve these goals, we have built a Firefox extension that uses subscription-based XPath queries to extract URLs from web sites and redirect them to a proxy server running inside the browser. For example, all external links from Slashdot are redirected to Firecoral. Once a URL is received by the proxy, a tracker is queried for other users caching the same URL, and the content is fetched from peers instead of the origin web server. We ensure data integrity with cryptographic signatures from a trusted service.
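The integrity check above can be sketched as follows: a trusted service publishes an authenticated digest for each URL, and the client rejects peer-fetched bytes that do not match. Firecoral uses public-key signatures; HMAC over a shared key stands in here so the example is self-contained, and all names are illustrative rather than Firecoral's actual API.

```python
import hashlib
import hmac

SERVICE_KEY = b"demo-shared-key"  # stand-in for the trusted service's signing key

def sign_digest(content: bytes) -> bytes:
    """What the trusted service would publish for a cached object."""
    digest = hashlib.sha256(content).digest()
    return hmac.new(SERVICE_KEY, digest, hashlib.sha256).digest()

def verify(content: bytes, signature: bytes) -> bool:
    """Client-side check before accepting bytes fetched from a peer."""
    expected = hmac.new(SERVICE_KEY, hashlib.sha256(content).digest(),
                        hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

page = b"<html>cached object</html>"
sig = sign_digest(page)
print(verify(page, sig))               # True: accept the peer's copy
print(verify(b"tampered bytes", sig))  # False: reject, fall back to origin
```

The point of the design is that peers never need to be trusted: any tampering is caught by the digest check, and the client can always fall back to fetching from the origin server.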

We will be releasing a beta version of Firecoral on our recently launched Firecoral website soon. For more details about Firecoral, please see our paper, Bringing P2P to the Web: Security and Privacy in the Firecoral Network.

History of NSDR

The call for papers for the 3rd Workshop on Networked Systems for Developing Regions (NSDR) was announced today. NSDR 2009 will be held with ACM SOSP this year at Big Sky, Montana. Direct all whining about the location to the SOSP organizers please!

I thought I’d share a little history of NSDR on this blog. Research in technologies for developing regions has been going on for a while. For example, the TIER group at Berkeley started in 2003. However, this area (often dubbed as ICTD) lacked a sense of community with no specialized workshops/conferences.

In 2006, I was attending SenSys in Boulder, Colorado when I came across the call for workshops for SIGCOMM 2007. This was, apparently, the first time that SIGCOMM was going to Asia. I thought that a workshop focusing on networking challenges in developing regions could be a nice fit for the first SIGCOMM in Asia. I wrote a proposal in the hours before (and during) my flight from Denver to NYC, and submitted it just before the deadline during my transit in NYC, with help from Umar Saif. We withdrew the proposal during my transit at Heathrow (mainly because we couldn’t get in touch with potential PC members). And after encouragement from Paul Francis (the SIGCOMM ’07 workshops chair), we re-submitted the proposal after landing in Amsterdam.

One crazy flight. And look where we are now. NSDR is, arguably, the leading event in technical ICTD research right now. So while we look forward to a successful third workshop, let’s not forget our humble (and slightly amusing) beginnings.