Bridging the Gap with HashCache
[Today we'll be having a guest post by Anirudh Badam, a PhD student in the larger Network Systems Group at Princeton, related to systems research for developing regions. The work he'll be talking about was recently named one of Technology Review's Top 10 Emerging Technologies for 2009. -- Mike]
To provide Internet connectivity in the developing world is a daunting task, with problems pertaining to a high cost of bandwidth, ill-provisioned equipment and power, scarcity of on-site expertise, and adverse environmental conditions. Most common way to offset bandwidth cost/consumption is to deploy high-performance web proxy caches at the edge. Such high-performance proxies, primarily designed for the developed world, run as a dedicated service on a well-provisioned server with a large amount of RAM. The problems with “porting” this solution as-is are manifold and would only exacerbate the problems pertaining to power consumption and adverse environmental conditions. The initial expenditure involved for acquiring such powerful servers would make inventory control harder and also make deployments expensive because of having to supply, manage and maintain not only the end host laptops and but also their servers.
A high-performance, efficient cache storage engine which can index large disks is essential not only for static web caching but also for enabling other bandwidth/latency reduction techniques like WAN acceleration and web pre-fetching. Disk price is plummeting by the month and is currently at a dime/GB, but the price of a server which can index a terabyte disk could be anywhere between $1000 to $2000. Such servers will have high power consumption due to large amounts of RAM and the requirement of an operational temperature that further complicates the problem. Further, most of the RAM has to be dedicated to provide only a web cache, which is undesirable considering the low bandwidths in these environments. Cheap laptops and netbooks, on the other hand, are ruggedized for developing world environments. They are made for low power consumption and are designed for use in adverse environmental conditions where power might be irregular. If we can design a cache storage engine which can run efficiently on cheap laptops and index terabytes of disk, then we will be able to address an important problem that arises in providing good web services for developing-regions. Laptops solve the problems pertaining to power, environment, and maintenance, whereas large disks solve the problems pertaining to low bandwidth.
Our work, HashCache, is a configurable cache storage engine which can index terabytes of disk by using 6 to 20 times less memory compared to traditional web proxy caches.
The high memory requirement of traditional proxies stems from two reasons. First, regular hashtable data structures spend a lot of memory on pointers and keys. Second, they alienate the disk, which is the performance bottleneck, by using it only as a storage device. HashCache is an entirely new way of building a disk cache engine with indexing schemes devoid of any pointers and the disk itself being used as an application-level data structure. What this does is enable the minimal usage of memory for indexing and allow the application to directly interact with the disk, which is the performance bottleneck. HashCache is a range of disk indexing policies ranging from ones using no main memory at all for indexing to ones which consume 6-10 times less memory than traditional high-end proxies yet provide similar or better performance. The figure on the left illustrates the comparison. It shows cost vs. performance comparison of various HashCache policies and existing proxies.
While the intricate details of the seven different policies can be found in our paper, the following is a reasonable attempt at a summary. Traditional proxies build fully associative caches and hence their index hashtables need to contain pointers to be able to reference content that can possibly go at any location. The keys are the names (hashes of names) and the values are pointers in to the filesystem. The simplest HashCache policy uses that disk directly as a large set-associative hashtable with fixed sized bins, it stores the metadata of on-disk hashtable in main memory for speeding up lookup operations. The value in this hashtable is the final object itself. Restricting the number of locations that an object can be placed by using set-associativity and having those locations contiguous on the disk eliminates the requirement of pointers. It also reduces the number of bits needed to represent the hash value of the object name in the hashtable since the lower order bits can be decoded from the bin number in the hashtable. Objects larger than a bin need additional place for storage. The remaining part of the object is stored in a circular log. This bare-bones index needs 11 bits of RAM to index an object on the disk. The rest of the policies revolve around memory requirement and disk performance optimizations over this simple setting.
We have built a web proxy cache using the HashCache indexing policies and have been actively deploying it freely for educational and academic purposes. A commercial offering of HashCache is also planned. Just a few days back, we helped a group at Cornell College of Agriculture to install HashCache and a WAN Accelerator on top of that called Waprox at a node in Uganda. They are part of the Durable Rust Resistance in Wheat project and wish to use the node to download a large number of agricultural documentation for the education of local farmers. On the left is a picture of Stefan Einarson handing over the HashCache box to head of Crop Sciences at Makerere University in Uganda. We wish to continue doing such deployments for free for educational and academic purposes in the developing world. You can contact us directly if you need faster connectivity for an educational project in a developing region. We will be glad to help.
The future direction in this line of work is a larger project aimed at developing a networking stack for poorly provisioned edge networks. The idea is to develop on top of HashCache and provide a full software stack of bandwidth and latency reduction techniques including WAN Acceleration, web pre-fetching, local disk web-search engines and ultimately a generic offline cache of any popular website. Our proposal paper at the CCC Developing Regions Workshop has more details. We are also working towards incorporating new and efficient memory technologies like flash memory to offset problems relating to scale and power. We are currently developing a hybrid memory management technique that takes into account the density and functionality differences of different memory types in a hybrid architecture and comes up with clever strategies to provide applications with large amounts of efficient random access memory. Stay tuned and we will keep you posted with the progress we make towards this developing regions network stack.