Category Archives: System Management

Easing the Hybrid RAM/SSD Memory Management with SSDAlloc

Today, I am going to talk about a system called SSDAlloc, a hybrid RAM/SSD memory manager that I have been working on for the past year and a half or so, jointly with my adviser Vivek Pai. The post reads more like a research report than a typical blog entry. I'll first talk about what motivated us to explore this area and then delve into the technical details.

I will start by describing what motivated us to use the SSD as program memory, and then explain why using the SSD as swap space is a bad idea. After that, I will describe an SSD-based, non-transparent object storage engine we built that provides high performance but requires many application modifications. I then discuss the broader research goal, relevant to settings from developing regions to datacenters: to provide a nearly-transparent path, requiring as few modifications as possible, for applications to use flash memory, with ease of migration, high performance, and, when desired, simple ACID transactions (since many applications expect transactions from non-volatile memories). After describing our solution in detail, I present some initial performance numbers.

SSD as memory: When we first started running HashCache on desktop towers, we ran into problems related to power, dust, and slow restarts — described more formally by Surana et al in this paper. We contemplated running HashCache on netbooks instead, since they are sturdy, ruggedized, come with a battery, are fairly immune to dust, and are a lot less power hungry than desktop towers or laptops. The proposition was not without challenges; the small amount of RAM (1-2GB) was not enough to index large hard drives, even with HashCache! We wondered if we could use the internal SSD (netbooks come with a 4-16GB SSD) as a cheap RAM substitute and an external drive as the cache. This makes sense because, with HashCache, a proxy cache's performance is driven by the disk rather than by the number of index lookups. Most netbook SSDs supported 500 to 2,000 random reads/sec, an order of magnitude more than disk-seek performance. Also, with the growing gap between CPU and disk speed, flash has attracted system designers in recent times. We thought it was a good time to explore an SSD-based solution.

SSD as Swap?: The next thing we did was to move the operating system to an external drive and use the internal SSD as swap. Another external hard drive, ~512GB, was used as the cache, which needed an index of about 8GB. As a proxy cache, the setup did just fine. On that small CPU (single core, 1.2 GHz, 256KB cache), it was able to swap in and out enough times to hit the disk limit (about 200 requests per second). Later, when we tried to use the same setup as a WAN accelerator running Wanax, we realized that using the SSD as swap was a bad idea. Wanax is a WAN accelerator that makes multiple index lookups to populate a single HTTP object; the highest-performance version of Wanax needs tens of index inserts/lookups per HTTP object insert/lookup. Swap was exhausting the random write throughput of the SSD. Also, RAM turned into a page cache in which each 4KB page contained only about 100 bytes of useful data. That was when we realized we needed special techniques to reduce random writes and make the best use of the limited RAM in the system.

SSD as object store: Towards that end, we built an SSD-based object management tool with non-transparent read and write calls, backed by an SSD-based log-structured object store and using RAM as a compact object cache, described more formally in our workshop paper. Additionally, in that paper we proposed a minor improvement to HashCache (called HashCache-Striping), which is better suited to the hybrid SSD/RAM setting. HashCache-Striping is an efficient hashtable representation for hybrid RAM/SSD settings where the RAM:SSD ratio is high (typically 1:16 or higher). It requires only 1 byte of RAM per index entry and provides as many index operations per second as the SSD provides reads per second. More recently, other researchers have explored such indexes for hybrid settings.
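To make the non-transparent flavor of this design concrete, here is a minimal sketch of what such an interface could look like. It is my own illustration under simplifying assumptions (a single append-only log file stands in for the SSD, and there is no garbage collection, recovery, or cache eviction), not the code from the workshop paper.

```cpp
// Hypothetical sketch of a non-transparent hybrid RAM/SSD object store:
// writes are appended sequentially to an SSD-backed log, an in-RAM index maps
// keys to log offsets, and a compact RAM object cache absorbs hot reads.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <string>
#include <unordered_map>

class HybridObjectStore {
public:
    explicit HybridObjectStore(const char* log_path)
        : fd_(open(log_path, O_RDWR | O_CREAT | O_APPEND, 0644)) {}

    // Non-transparent write: append the object to the SSD log (sequential
    // I/O only) and remember its new location.
    void put(uint64_t key, const std::string& value) {
        off_t off = lseek(fd_, 0, SEEK_END);
        uint32_t len = static_cast<uint32_t>(value.size());
        write(fd_, &len, sizeof(len));
        write(fd_, value.data(), len);
        index_[key] = off;
        cache_[key] = value;            // keep the hot object compactly in RAM
    }

    // Non-transparent read: serve from the RAM object cache if possible,
    // otherwise perform one random read against the SSD log.
    bool get(uint64_t key, std::string* out) {
        auto c = cache_.find(key);
        if (c != cache_.end()) { *out = c->second; return true; }
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        uint32_t len = 0;
        pread(fd_, &len, sizeof(len), it->second);
        out->resize(len);
        pread(fd_, &(*out)[0], len, it->second + sizeof(len));
        return true;
    }

private:
    int fd_;                                            // SSD-backed log file
    std::unordered_map<uint64_t, off_t> index_;         // key -> log offset
    std::unordered_map<uint64_t, std::string> cache_;   // RAM object cache
};
```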

Broader research goal: While the non-transparent technique did wonders for performance, it took a lot of development effort. We wondered if one could obtain the same performance benefits without the painful route of redesigning and tweaking one's data structures and algorithms to make them SSD-aware. Essentially, our new goal was to provide a nearly-transparent path, requiring as few modifications as possible, for applications to use flash memory, with ease of migration, high performance, and, when desired, simple ACID transactions. I am glad to announce that we were successful in our endeavor. The solution is beneficial in many situations, including datacenter servers, which could use a 4TB SSD with 64-128GB RAM to scale performance, and gateway servers for developing regions running on netbooks with 1-2GB RAM and a 4-16GB SSD. Before I delve into the details of the solution, I would first like to summarize the design space and describe the full benefits of our solution, called SSDAlloc. The following table provides a nice summary; it shows analytical performance numbers for a high-end enterprise SSD.

Random writes are only half the devil: One could claim that the only vice of swap was the random writes, and hence that a write-logged swap is the panacea, but that is only half-true. Using the SSD as swap forces a page-level access granularity for every access. Each random write leads to a full page-write, which significantly inflates the write traffic when the application object is very small (~100 bytes in the case of HashCache-Striping): writing a 4KB page to update a ~100-byte object is roughly a 40x write amplification. This situation (one with a RAM:SSD ratio of 1:16 or higher) motivates the need to treat the SSD as something other than swap space. When most objects reside on the SSD, each object access transfers a full page from SSD to RAM. If the objects on a page do not have temporal locality, then the rest of the page, while still being populated, presents little benefit to the application. Instead, we propose a separate memory allocation and management tool that explicitly addresses this situation. If only a few lines of memory allocation code need to be modified to migrate an existing application to an SSD-enabled one, this one-time development cost is low compared to the cost of high-density memory (and a high-end server that can house that much RAM).

Introduction to SSDAlloc: I now describe our solution in more detail. SSDAlloc is a memory management tool that eases hybrid RAM/SSD development. It is a slab-based memory allocator whose interface looks similar to calloc. The only additional information SSDAlloc needs to provide its performance benefits is the size of the application object. Each object allocated by SSDAlloc resides primarily on the SSD, is cached in RAM (the RAM Object Cache, as shown above) when needed, and also has a permanent virtual memory address to which it gets mapped on demand. The SSD is managed as a log-structured object storage engine, where the location of an object can change over time. Additionally, SSDAlloc supports several styles of managing the virtual memory space; we discuss two of them in this article. The first, called Memory Pages (MP), looks like malloced memory: a contiguous memory space of pages, each of which can contain multiple application objects. When objects are allocated using MP, the pages containing them are managed on the SSD in a log-structured manner (whole pages are treated as objects).
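To show how little the application-facing code might need to change, here is a hedged sketch. The names ssd_oalloc and ssd_free are hypothetical placeholders (the real SSDAlloc API may differ), and the stub bodies simply delegate to calloc/free so the example compiles; the actual runtime would instead create an ObjectTable entry and back the objects with the SSD log and RAM Object Cache.

```cpp
// Sketch only: hypothetical SSDAlloc-style allocation calls. The stubs below
// stand in for the real runtime, which would reserve virtual pages, record
// the object size in a slab, and materialize pages on demand from the
// SSD/RAM object cache.
#include <cstddef>
#include <cstdlib>

void* ssd_oalloc(size_t nobjects, size_t object_size) {
    return calloc(nobjects, object_size);   // placeholder for the real runtime
}
void ssd_free(void* p) { free(p); }

struct CacheEntry {                 // a ~100-byte application object
    unsigned long long hash;
    char               meta[92];
};

int main() {
    // Before: CacheEntry* e = static_cast<CacheEntry*>(calloc(1, sizeof(CacheEntry)));
    // After: only the allocation call changes; the rest of the pointer code
    // is untouched, since the object keeps a permanent virtual address.
    CacheEntry* e = static_cast<CacheEntry*>(ssd_oalloc(1, sizeof(CacheEntry)));
    e->hash = 42;
    ssd_free(e);
    return 0;
}
```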

Object per page model: The other virtual memory management style we use is called Object Per Page (OPP). In this model, each object resides on its own page; if the object is smaller than a page, the rest of the page is kept empty. To avoid wasting too much RAM, we map only those objects that are currently in use to their pages. Frequently accessed objects are cached in the RAM Object Cache, a compact object store in RAM, while unused objects are pushed to the SSD. There are two methods by which we minimize the RAM wasted by OPP. In the first, applications explicitly declare session boundaries, which the SSDAlloc runtime uses as markers for unmapping objects from their virtual memory locations. In the second, the runtime maintains a page buffer, a small buffer of pages holding the currently mapped objects. When the buffer fills up, the runtime unmaps all the objects mapped so far, pushes them to the RAM Object Cache, and starts filling the buffer again by mapping objects to their virtual memory pages on demand. In the case of MP, SSDAlloc deals in full pages and treats each page as an object in itself. In both cases, objects spanning multiple pages have their individual pages managed as separate objects.
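Below is a rough sketch of the page-buffer idea (the second method above). It is my reconstruction under stated assumptions (the buffer size is invented and eviction to the SSD log is elided), but it shows the key point: when the buffer fills, only the object bytes are kept in the compact RAM cache and the sparsely used pages are released.

```cpp
// Sketch of an OPP-style page buffer: a bounded set of objects currently
// mapped at their virtual pages. On overflow, each object's bytes (not the
// whole 4 KB page) are copied into the RAM object cache and the page is
// dropped, so sparsely filled pages never accumulate in RAM.
#include <sys/mman.h>
#include <cstddef>
#include <cstring>
#include <deque>
#include <unordered_map>
#include <vector>

constexpr size_t kPageSize   = 4096;
constexpr size_t kBufferSize = 1024;   // pages kept mapped at once (assumed)

struct PageBuffer {
    std::deque<void*> mapped_pages;                         // FIFO of live pages
    std::unordered_map<void*, std::vector<char>> ram_cache; // compact copies

    // Called after the fault handler has materialized 'page' for one object
    // of 'obj_size' bytes.
    void add(void* page, size_t obj_size) {
        if (mapped_pages.size() >= kBufferSize) flush(obj_size);
        mapped_pages.push_back(page);
    }

    // Unmap everything: keep only the object bytes in the RAM object cache
    // (keyed by the object's permanent virtual address), then drop the pages.
    void flush(size_t obj_size) {
        for (void* page : mapped_pages) {
            std::vector<char> copy(obj_size);
            std::memcpy(copy.data(), page, obj_size);
            ram_cache[page] = std::move(copy);
            munmap(page, kPageSize);        // a later access faults and remaps
        }
        mapped_pages.clear();
    }
};
```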

Virtual memory management: As mentioned earlier, the SSDAlloc runtime manages the SSD as a log-structured object/page store where the location of an object can change over time. To keep track of an object's location, we maintain a table called the ObjectTable (shown in the figure above), which is similar to a page table. ObjectTables indicate the location of objects regardless of where they currently reside (on the SSD, in the RAM Object Cache, or in the page buffer). A new ObjectTable is created for every set of objects obtained via SSDAlloc. When the application faults on a page while accessing an object, the runtime populates the required page on the fly by looking up the ObjectTable and inserts the new page into the page buffer. For this process to be quick, an efficient mechanism is needed for translating an object's virtual memory address to its ObjectTable. In practice, we found a binary search tree was good enough for this translation. Obviously, lookups are faster if the number of ObjectTables is minimized (reducing the size of the binary search tree). Towards that end, SSDAlloc works as a slab allocator, so that controlling the number of slabs leads to an optimal balance between SSD space usage and performance.
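Here is a sketch of that translation path, again my own reconstruction rather than SSDAlloc's actual code: a std::map keyed by the start address of each allocated range plays the role of the binary search tree, and a SIGSEGV handler resolves the faulting address to an ObjectTable and an object index.

```cpp
// Sketch: translate a faulting virtual address to its ObjectTable and object
// index. The real runtime would then fetch the object from the RAM Object
// Cache or the SSD log and map its page in place before returning.
#include <signal.h>
#include <unistd.h>
#include <cstdint>
#include <iterator>
#include <map>

struct ObjectTable {
    uintptr_t base;         // start of the reserved virtual range
    size_t    length;       // size of the range in bytes
    size_t    object_size;  // size of each object in this slab
    // per-object state (SSD log offset, cache slot, page-buffer entry) omitted
};

static std::map<uintptr_t, ObjectTable*> g_tables;   // keyed by range start

static ObjectTable* find_table(uintptr_t addr) {
    auto it = g_tables.upper_bound(addr);    // first range starting beyond addr
    if (it == g_tables.begin()) return nullptr;
    ObjectTable* t = std::prev(it)->second;  // last range starting at/below addr
    return (addr < t->base + t->length) ? t : nullptr;
}

static void fault_handler(int, siginfo_t* info, void*) {
    uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
    ObjectTable* t = find_table(addr);
    if (t == nullptr) _exit(1);              // a genuine segmentation fault
    size_t index = (addr - t->base) / t->object_size;
    (void)index;
    // Real runtime: materialize object 'index' at the faulting page and add
    // the page to the page buffer, then retry the faulting instruction.
}

void install_fault_handler() {
    struct sigaction sa = {};
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
}
```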

ACID transactions: While programmed session information (referred to as soft transactions) and buffered pages (referred to as agnostic transactions) are enough to reduce space wastage, they do not provide any guarantees, and some applications need ACID transactions from non-volatile memory. SSDs readily provide durability; atomicity can be ensured by evicting all or none of the objects belonging to a session (specified by the programmer) to the SSD. Isolation can be provided in the runtime library by allowing multiple threads to map the same set of objects at different virtual memory addresses; trapping accesses to the different addresses of the same object allows a user-defined synchronous data access model to take action when two threads try to modify the same object. Consistency has to remain an application-level semantic. With ACID transactions, SSDAlloc provides a feature improvement over using the SSD as swap, alongside the performance improvement.
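As a small illustration of the atomicity piece (all-or-none eviction of a session's objects), here is a hedged sketch; the interface and the buffered-write approach are my own guess at one way to realize it, not SSDAlloc's documented API.

```cpp
// Sketch: a programmer-declared session buffers dirty objects and either
// appends them all to the SSD log (followed by one flush) or discards them
// all, so a crash never leaves a partially applied session on the SSD.
#include <cstdint>
#include <string>
#include <unordered_map>

class Session {
public:
    void write(uint64_t object_id, const std::string& value) {
        dirty_[object_id] = value;        // buffered, not yet durable
    }
    bool commit() {
        // All-or-none: if the append or flush fails, none of the new
        // versions become current, preserving atomicity and durability.
        return append_and_fsync(dirty_);
    }
    void abort() { dirty_.clear(); }      // nothing reaches the SSD

private:
    bool append_and_fsync(const std::unordered_map<uint64_t, std::string>&) {
        return true;                      // placeholder for the SSD log writer
    }
    std::unordered_map<uint64_t, std::string> dirty_;
};
```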

Performance: Using SSDAlloc, we modified several applications, including memcached (a key-value storage system), a Boost-based B+Tree (transactional inserts, lookups, and deletes), and HashCache (an efficient hashtable representation for in-memory indexes). We also developed a transactional filesystem from scratch. We benchmarked these applications with and without SSDAlloc; when not using SSDAlloc, these systems simply used the SSD as swap space. The table above summarizes the results. In the case of memcached, only the values are stored on the SSD using SSDAlloc. In the case of the transactional filesystem, the modified lines of code represent the number of lines needed to make the operations transactional. With only 15–36 lines of code modified, we obtain up to an order-of-magnitude speedup over using the SSD as swap. Throughput gains over using a hard disk as swap are also indicated.

Additional information on the project can be obtained from the SSDAlloc project page. Stay tuned for the first code release which is around the corner.

CoralCDN Lesson: Interacting with virtualized and shared hosting services

In the previous post, I discussed how CoralCDN implemented bandwidth restrictions that were fair-shared between “customer” domains. There was another major twist to this problem, however, that I didn’t talk about: the challenge of performing such a technique on a virtualized and shared platform such as PlanetLab.  While my discussion is certainly PlanetLab-centric, its questions are also applicable to other P2P deployments where users run peers within resource containers, or to commercial hosting environments using billing models such as 95th percentile usage.

Interacting with hosting platforms

CoralCDN’s self-regulation works well in trusted environments, and this approach is used similarly in other peer-to-peer (e.g., BitTorrent and tor) and server-side (e.g., Apache mod_bandwidth) environments.  But when the resource provider (such as PlanetLab, a commercial hosting service, or peer-to-peer end-users) wants to enforce resource restrictions, rather than assume the software functions correctly, the situation becomes more challenging.

Many services run on top of PlanetLab, CoralCDN being only one of them.  Each of these services is allocated a resource container (a “slice”) across all PlanetLab nodes.  This doesn’t provide the same level of isolation as a virtual machine instance, but it’s much more scalable in the number of slices per node (each per-node slice is called a sliver in PlanetLab).

PlanetLab began enforcing average daily bandwidth limits per sliver in 2006; prior to that, sliver usage was entirely self-enforced. Since then, once a sliver hits 80% of its daily limit, the PlanetLab kernel enforces bandwidth caps (using Linux’s Hierarchical Token Bucket scheduler) calculated over five-minute epochs.  CoralCDN’s daily limit is 17.2 GB/day per sliver to the public Internet.

So, we see here two levels of bandwidth control: admission control by CoralCDN proxies and rate limiting by the underlying hosting service. Even though CoralCDN uses a relatively conservative limit for itself (10 GB/day), it still surpasses the 80% mark (13.8 GB) of its hosting platform on 5–10 servers per day. And once this happens, these servers begin throttling CoralCDN traffic, leading to degraded performance.  The main cause of this overage is that, while CoralCDN counts successful HTTP responses, its hosting platform accounts for all traffic—HTTP, DNS, DHT RPCs, system log transfers, and other management traffic—generated by CoralCDN’s sliver.

Unfortunately, there do not appear to be sufficiently lightweight or simple user-space mechanisms for proper aggregate resource accounting.  Capturing all traffic via libpcap, for example, would be too heavyweight for our purposes.  Furthermore, a service would often like to make its own policy-based queuing decisions based on application knowledge.  For example, CoralCDN would give DNS traffic the highest priority, followed by DHT RPCs, then HTTP traffic, with log collection lowest. This is difficult through application-level control alone, while using a virtualized network interface that pushes traffic back through a user-space network stack would be expensive.

CoralCDN’s experience suggests two desirable properties of hosting platforms that enforce resource containers.  First, these platforms should provide slivers with their current measured resource consumption in a machine-readable format and in real time.  Second, these platforms should allow slices to express policies that affect how the underlying platform enforces resource containment. While this pushes higher-level preferences into lower layers, such behavior is not easily performed at the higher layers (and is thus compatible with the end-to-end argument).  It might be as simple as exposing multiple resource abstractions for slices to use, e.g., multiple virtual network connections with different priorities.

Somewhat amusingly, one of CoralCDN’s major outages came from a PlanetLab misconfiguration that changed its bandwidth caps from GBs to MBs.  As all packets were delayed for 30 seconds within the PlanetLab kernel, virtually all higher-layer protocols (e.g., DNS resolution for nyud.net) were timing out.  Such occasional misconfigurations are par for the course, and PlanetLab Central has been an amazing partner over the years.  Rather than criticize, however, my purpose is to simply point out how increased information sharing can be useful.  In this instance, for example, exposing information would have told CoralCDN to shut down many unnecessary services, while policy-based QoS could have at least preserved DNS responsiveness.

Over-subscription and latency sensitivity

While CoralCDN faced bandwidth tensions, there were latency implications with over-subscribed resources as well.  With PlanetLab services facing high disk, memory, and CPU contention, and even additional traffic shaping in the kernel, applications face both performance jitter and prolonged delays.  For example, application-level trace analysis performed on CoralCDN (in Chapter 6 of Rodrigo Fonseca’s PhD thesis) showed numerous performance faults that led to a highly variable client experience, while making normal debugging (“Why was this so slow?”) difficult.

These performance variations are certainly not restricted to PlanetLab, and they have been well documented in the literature across a variety of settings.  More recent examples have shown this in cloud computing settings.  For example, Google’s MapReduce found performance variations even among homogeneous components, which led to their speculative re-execution of work.  Recently, a Berkeley study of Hadoop on Amazon’s EC2 underscored how shared and virtualized deployment platforms provide new performance challenges.

CoralCDN saw the implications of performance variations most strikingly with its latency-sensitive self-organization.  Coral’s DHT hierarchy, for example, was based on nodes clustering by network RTTs. A node would join a cluster provided some minimum fraction (85%) of its members were below the specified RTT threshold (30 ms for level 2, 80 ms for level 1).  The figure shows the measured RTTs for RPCs between Coral nodes, broken down by level (with vertical lines added at 30 ms, 80 ms, and 1 s). While these graphs show the clustering algorithms meeting their targets and local clusters having lower RTTs, the heavy tail in all the CDFs is rather striking.  Fully 1% of RPCs took longer than 1 second, even within local clusters.

Another lesson from CoralCDN’s deployment was the need for stability in the face of performance variations, which are only worse in heterogeneous deployments.  This translated into the following rule in Coral: a node switches to a smaller cluster if fewer than 70% of its cluster’s members satisfy the RTT threshold, and forms a singleton only if fewer than 50% of its neighbors are satisfactory.  Before we added this form of hysteresis, cluster oscillations were much more common (leading to many stale DHT references).  A related focus on stability helped improve virtual network coordinate systems for both PlanetLab and Azureus’s peer-to-peer deployment, so it’s an important property to consider when performing latency-sensitive self-organization.
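The thresholds below are the ones quoted above (85% to join, 70% to stay, 50% before falling back to a singleton); the surrounding code is just an illustrative sketch of how the checks might be wired up, not Coral's implementation.

```cpp
// Sketch of Coral-style cluster membership checks with hysteresis.
#include <vector>

// Fraction of cluster members whose measured RTT is under the level's
// threshold (30 ms for level 2, 80 ms for level 1).
double fraction_under(const std::vector<double>& rtts_ms, double threshold_ms) {
    if (rtts_ms.empty()) return 0.0;
    int ok = 0;
    for (double r : rtts_ms) if (r < threshold_ms) ++ok;
    return static_cast<double>(ok) / rtts_ms.size();
}

bool should_join(const std::vector<double>& rtts_ms, double threshold_ms) {
    return fraction_under(rtts_ms, threshold_ms) >= 0.85;
}

// Hysteresis: require the cluster to degrade well below the join threshold
// before leaving, so transient RTT spikes do not cause oscillation.
bool should_leave_for_smaller(const std::vector<double>& rtts_ms, double t_ms) {
    return fraction_under(rtts_ms, t_ms) < 0.70;
}
bool should_become_singleton(const std::vector<double>& rtts_ms, double t_ms) {
    return fraction_under(rtts_ms, t_ms) < 0.50;
}
```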

Next up…

In the next post, I’ll talk a bit about some of the major availability problems we faced, particularly because our deployment has machines with static IP addresses directly connected to the Internet (i.e., not behind NATs or load balancers).  In other words, the model that the Internet and traditional network layering were actually designed with in mind…

CoralCDN Lesson: Fair-sharing bandwidth via admission control

For commercial CDNs and other computing services, the typical answer to resource limits is simply to acquire more capacity.  As CoralCDN’s deployment on PlanetLab does not have that luxury, we instead apply admission control to manage its bandwidth resources.  This post describes some of these mechanisms, while we’ll take a step back in the next post to describe some of the challenges in doing resource accounting and management on a virtualized and shared platform such as PlanetLab.


Following the Asian tsunami of December 2004, and with YouTube yet to be created, CoralCDN distributed large quantities of amateur videos of the natural disaster.  With no bandwidth restrictions on PlanetLab at the time, CoralCDN’s network traffic to the public Internet quickly spiked.  PlanetLab sites threatened to pull their servers off the network if such use could not be curtailed.  It was agreed that the service should restrict its usage to approximately 10GB per node per day.

One could limit a proxy’s bandwidth use by rate-throttling all traffic (as in BitTorrent and tor), or by simply shutting a proxy down after exceeding its configured daily capacity (also supported by tor).  But if CoralCDN primarily provides a service for websites, as opposed to for clients, then perhaps it should provide some notion of fairness across origin domains. While this problem may seem difficult given that CoralCDN can interact with 10,000s of domains over the course of the day, our usage patterns greatly simplify the accounting.

The figure on the left shows the total number of requests per domain that CoralCDN received over one recent day in January 2009 (the solid top line).  The distribution clearly has some very popular domains: the most popular (Tamil YouTube) received 2.6M requests, the second most popular (a Brazilian forum on Japanese manga) received 1.2M requests, the third (screensaver videos of user-contributed “fractal flames”) received 880K requests, and the remaining distribution fell off in a Zipf-like manner. Given that CoralCDN’s traffic is dominated by a limited number of domains, its mechanisms can serve mainly to reject requests for (i.e., perform admission control on) the bandwidth hogs.  Still, CoralCDN should differentiate between peak limits and steady-state behavior to allow for flash crowds or changing traffic patterns.

To achieve these aims, each CoralCDN proxy implements an algorithm that attempts to simultaneously (1) provide a hard upper limit on peak traffic per hour (configured to 1000 MB per hour), (2) bound the expected total traffic per epoch in steady state (400 MB per hour), and (3) bound the steady-state limit per domain.  As setting this last limit statically—such as 1/k-th of the total traffic if there are k popular domains—would lead to good fairness but poor utilization (given the non-uniform distribution across domains), we dynamically adjust this last traffic limit to balance the trade-off.
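A sketch of how these three checks could fit together is below. The limits come from the text; the data structures and names are illustrative assumptions rather than CoralCDN's actual code, and the per-domain limit used in case 3 is the dynamically adjusted one described below.

```cpp
// Sketch of per-proxy admission control: (1) hard hourly peak, (2) hourly
// steady-state average, (3) per-domain cap. Cases 1 and 2 redirect to the
// origin; case 3 answers 403 Forbidden.
#include <string>
#include <unordered_map>

constexpr double kPeakLimitMB   = 1000.0;  // (1) hard peak per hour, all domains
constexpr double kSteadyLimitMB = 400.0;   // (2) steady-state hourly total

struct EpochState {
    double total_mb_this_hour = 0;       // traffic so far in the current epoch
    double avg_total_mb_per_hour = 0;    // moving average across epochs
    std::unordered_map<std::string, double> domain_mb_this_hour;
    std::unordered_map<std::string, double> domain_limit_mb;  // set each epoch
};

enum class Decision { Serve, RejectDomain /*403*/, RedirectToOrigin };

Decision admit(const EpochState& s, const std::string& domain) {
    auto lim  = s.domain_limit_mb.find(domain);
    auto used = s.domain_mb_this_hour.find(domain);
    if (lim != s.domain_limit_mb.end() && used != s.domain_mb_this_hour.end() &&
        used->second > lim->second)
        return Decision::RejectDomain;                // case 3: 403 Forbidden
    if (s.total_mb_this_hour > kPeakLimitMB ||        // case 1: hard peak
        s.avg_total_mb_per_hour > kSteadyLimitMB)     // case 2: steady state
        return Decision::RedirectToOrigin;
    return Decision::Serve;
}
```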

During each hour-long epoch, a proxy records the total number of bytes transmitted for each domain.  It also calculates domains’ average bandwidth as an exponentially-weighted moving average (attenuated over one week), as well as the total average consumption across all domains.   Across epochs, bandwidth usage is only tracked, and durably stored, for the top-100 domains (although more traditional HTTP logs do store information about all requests).  If a domain is not currently one of the top-100 bandwidth users, its historical average bandwidth is set to zero (providing leeway to sites experiencing flash crowds).

The per-domain dynamic bandwidth limit is calculated at the beginning of each epoch for each of the top-100 domains as follows.  The proxy computes the sum of all top domains’ average bandwidths. If this sum exceeds the total steady-state limit, it reduces the per-domain limit until the point when, if that limit had been enforced in the prior time period, the bandwidth sum would have been <= the permitted total limit.  (This limit thus first affects the top-ranked domain, then affects the top-two domains, etc.)
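Here is a sketch of that computation together with the moving-average bookkeeping from the previous paragraph. The steady-state budget and the cap-reduction loop mirror the description above; the EWMA constant and the 1 MB step size are assumptions on my part, and the real implementation likely differs.

```cpp
// Sketch: per-epoch update of domain averages and of the shared per-domain
// bandwidth cap for the next epoch.
#include <algorithm>
#include <vector>

constexpr double kSteadyTotalMB = 400.0;       // total steady-state hourly budget
constexpr double kAlpha = 1.0 / (24.0 * 7.0);  // assumed: ~one-week EWMA, hourly epochs

// Exponentially weighted moving average of a domain's hourly usage.
void update_average(double& avg_mb, double used_this_epoch_mb) {
    avg_mb = kAlpha * used_this_epoch_mb + (1.0 - kAlpha) * avg_mb;
}

// avgs holds the average hourly bandwidth (MB) of each top-100 domain;
// returns the shared per-domain cap for the next epoch.
double compute_domain_limit(const std::vector<double>& avgs) {
    double total = 0;
    for (double a : avgs) total += a;
    if (avgs.empty() || total <= kSteadyTotalMB)
        return kSteadyTotalMB;                 // no per-domain cap needed

    // Lower the cap until capping last epoch's usage at that value would have
    // kept the sum within budget; this trims the top-ranked domain first,
    // then the top two, and so on.
    double cap = *std::max_element(avgs.begin(), avgs.end());
    while (cap > 0) {
        double capped_sum = 0;
        for (double a : avgs) capped_sum += std::min(a, cap);
        if (capped_sum <= kSteadyTotalMB) break;
        cap -= 1.0;                            // 1 MB granularity (assumed)
    }
    return cap;
}
```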

When a particular requested domain is over its hourly budget (case 3 above), CoralCDN proxies respond with 403 Forbidden messages. If instead the proxy is over its peak or steady-state limit calculated over all domains (cases 1 or 2 above), the proxy redirects the client back to the origin site and temporarily makes itself unavailable for new client requests, which would be rejected anyway.  (If clients are to be redirected back to the origin, a proxy appends the query string coral-no-serve to the Location URL returned to the client.  Origins that use redirection scripts with CoralCDN check for this string to prevent loops.)  Although not the default, operators of some sites preferred this redirection home even when their particular domain was to blame. Domains can specify this type of control in an X-Coral-Control response header (which we’ll discuss a bit later).

By applying these mechanisms, CoralCDN reduces its bandwidth consumption to manageable levels.  While its demand sometimes exceeds 10TB per day, its actual HTTP traffic remains steady at about 2TB per day after rejecting a significant number of requests.  The scatter plot in the above figure shows the number of requests resulting in 403 responses per domain, most due to these admission control mechanisms.  We see how variance in domains’ object sizes yields different rejection rates.  The second-most popular domain mostly serves gifs smaller than 10KB and experiences a rejection rate of 3.3%, yet the fractal videos of the third-most popular domain are typically 5MB in size, and it sees an 89% rejection rate. Ranking these domains instead by bandwidth served, the figure on the left plots the average hourly traffic that four proxies permitted each of the top-100 bandwidth consumers.

Note that this accounting mechanism is applied independently on each node.  This brings up two interesting questions worth further consideration:

  1. Would the system behave differently if its accounting shared information between nodes?  One potential result of not sharing information is penalizing content with local interest (that is, files that may be of interest only to more regional Internet populations).  If a system’s server-selection mechanisms redirect clients to more localized proxies, then most users from a particular region might only see a small subset of the total set of servers.  On the other hand, if our goal is more to prevent the total bandwidth hogs, rather than provide any strict notion of fairness, does this problem arise significantly in practice?
  2. From a more research perspective, how can one efficiently perform (approximate) accounting across the entire network without centralization and in a bandwidth-efficient manner?  Work in SIGCOMM ’07 called Cloud Control with Distributed Rate Limiting considered a similar problem, but effectively shared O(kN) information across the system via gossiping (where k is the number of domains we’re monitoring and N the number of nodes).  Can we provide accurate/precise rate limiting with o(kN) communication complexity?  If anybody has any pointers to effective algorithms here, please leave a comment.  It feels like the sensor network community has probably addressed this problem at some point when aggregating and collecting information.

In the next post, we’ll ask whether a hosting platform needs to provide any special interfaces or information to support such accounting.

Coordination in Distributed Systems (ZooKeeper)

Architecting distributed systems can be very difficult. Arguably the hardest part of programming a distributed application is getting node coordination correct. I’ll define a node in this context as a service running on a single server that communicates with other nodes and, together with them, makes up your distributed application.

What I mean by coordination here is some act that multiple nodes must perform together. Some examples of coordination:

  • Group membership
  • Locking
  • Publisher/Subscriber
  • Ownership
  • Synchronization

One or more of these primitives show up in all distributed systems, so implementing them correctly is extremely important. While developing CRAQ, I originally implemented a very simple group membership service, but it didn’t provide reliability, replication, or scalability and was therefore a single point of failure. I went looking for an out-of-the-box coordination service and came across ZooKeeper.

ZooKeeper is an open-source, reliable, scalable, high-performance coordination service for distributed applications – just what I was looking for. It provides a metadata store that is organized like a file system, and its primitive operations allow programmers to implement any of the common coordination examples I mentioned above. It has client bindings for Java and C, which made it relatively easy to integrate into CRAQ’s source code, which is written in Tame.
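To give a feel for how these primitives compose, here is a rough sketch of the standard group-membership recipe using ZooKeeper's C binding. The calls are from the 3.x client as I recall them, so check your version's zookeeper.h; the /craq/members path is made up for illustration, and its parent nodes are assumed to already exist.

```cpp
// Sketch: group membership with ephemeral, sequential znodes. A member's
// znode disappears automatically when its session ends, and the survivors
// learn about it by re-listing the children.
#include <zookeeper/zookeeper.h>
#include <cstdio>

static void watcher(zhandle_t*, int /*type*/, int /*state*/,
                    const char* /*path*/, void* /*ctx*/) {
    // Connection and child-change events arrive here; a real service would
    // re-list the group and update its view of the membership.
}

int main() {
    zhandle_t* zh = zookeeper_init("localhost:2181", watcher,
                                   /*recv_timeout_ms=*/30000,
                                   nullptr, nullptr, 0);
    if (zh == nullptr) return 1;

    // Join the group: an ephemeral node vanishes if this client's session
    // ends, which is exactly the liveness signal membership needs.
    char created_path[256];
    int rc = zoo_create(zh, "/craq/members/node-", "host:port", 9,
                        &ZOO_OPEN_ACL_UNSAFE,
                        ZOO_EPHEMERAL | ZOO_SEQUENCE,
                        created_path, sizeof(created_path));
    if (rc != ZOK) { fprintf(stderr, "join failed: %d\n", rc); return 1; }

    // Read the current membership (watch=1 re-arms the watcher above).
    struct String_vector members;
    if (zoo_get_children(zh, "/craq/members", 1, &members) == ZOK) {
        for (int i = 0; i < members.count; ++i)
            printf("member: %s\n", members.data[i]);
        deallocate_String_vector(&members);
    }
    zookeeper_close(zh);
    return 0;
}
```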

The only thing I wish ZooKeeper had was support for wide-area deployments. I’m hopeful that this might eventually get added (maybe even by me!) some day.

I highly recommend ZooKeeper for anyone who is considering building a distributed system. It simplifies one of the hardest parts of implementing a distributed application.