Today, I am going to talk about a system called SSDAlloc, a hybrid RAM/SSD memory manager, that I have been working on over the past year and a half or so, jointly with my adviser Vivek Pai. The post is more of a research report briefing. I’ll first talk about what motivated us to explore this area and then delve into the technical jargon.
I will start with describing what motivated us to use SSD as program memory and then I will describe why using SSD as a swap space is a bad idea. After that, I will describe a SSD based non-transparent object storage engine we built that provides high performance but requires many application modifications. Later, I discuss the broader research goal, relevant to various situations (from developing regions to datacenters) which is to use as few modifications as possible to provide a nearly-transparent path to enable applications to use flash memory, providing ease of migration, high performance, and when desired, simple ACID transactions (because many applications expect transactions from non-volatile memories). After describing our solution in detail we provide some initial performance numbers.
SSD as memory: When we first started running HashCache on desktop towers, we ran into problems related to power, dust and slow restarts — described more formally by Surana et al in this paper. We contemplated running HashCache on netbooks, since they are sturdy, ruggedized, come with a battery, and are fairly immune to dust and a lot less power hungry compared to desktop towers/laptops. The proposition was not without challenges; the small amount of RAM (1-2GB) was not amenable to index large hard drives even with HashCache! We wondered if we could use the internal SSD (netbooks come with a 4-16GB SSD) as a cheap RAM substitute and an external drive as cache. This is a good idea since proxy caches’ performance is disk driven and not limited by the number of index lookups when using HashCache. Most netbooks SSD supported 500 to 2000 random reads/sec, which is an order of magnitude more than disk-seek performance. Also, with the growing gap between CPU and disk speed, flash has attracted system designers in recent times. We thought it was a good time to explore an SSD based solution.
SSD as Swap?: The next thing we did was to move the operating system to an external drive and use the internal SSD as swap. Another external hard drive, ~512GB, was used as the cache (pic) which needed an index of 8GB size. As a proxy cache the setting did just fine. On that small CPU (single core 1.2 Ghz 256KB cache), it was able to swap in and out enough number of times to hit the disk limit (about 200 requests per second). Later, when we tried to use the setting as a WAN Accelerator using Wanax, we realized that using SSD as a swap was a bad idea. Wanax is a WAN Accelerator, which makes multiple index lookups for populating a single HTTP object. Highest performance version of Wanax needs tens of index inserts/lookups per HTTP object insert/lookup. Swap was exhausting the random write throughput on the SSD. Also, RAM turned into a page cache with each 4KB page containing only about 100 bytes of useful data. That was when we realized we needed special techniques to reduce random writes and make the best of use of the limited RAM in the system.
SSD as object store: Towards that end we built an SSD based object management tool that had non-transparent read and write calls, with a back-end SSD based log-structured object store and used RAM as a compact object cache, described more formally in our workshop paper. Additionally, in that paper we proposed a minor improvement to HashCache (called HashCache-Striping), which is better suited for the hybrid SSD/RAM setting. HashCache-Striping is an efficient hashtable representation for hybrid RAM/SSD settings where the RAM:SSD ratio is high (typically 1:16 or higher). It requires only 1 byte of RAM for each index entry and provides as many index operations per second as reads/sec of the SSD used. More recently, other researchers have explored such indexes for hybrid settings.
Broader research goal: While the non-transparent technique did wonders with respect to performance it did take a lot of development effort to do so. We wondered if one could obtain the same performance benefits without having to take the painful route of having to redesign and tweak ones data structures and algorithms to make them SSD-aware. Essentially, our new goal was to use as few modifications as possible to provide a nearly-transparent path to enable applications to use flash memory, providing ease of migration, high performance, and when desired, simple ACID transactions. I am glad to announce that we were successful in our endeavor. The solution is beneficial for many situations including datacenter servers, which could use a 4TB SSD with 64-128GB RAM to scale performance or gateway servers for developing regions on netbooks with 1-2GB RAM and 4-16GB SSD. Before I delve into to the details of the solution I would first like to summarize the design space and describe the full benefits of our solution called SSDAlloc. The following table provides a nice summary. The table shows analytical performance numbers for a high-end enterprise SSD.
Random writes are only half the devil: One could claim that the only vice of swap was the random writes and hence having a write-logged swap is the panacea, but that is only half-true. Using SSD as swap forces a page level access granularity for every access. Each random write would lead to a full page-write, which would significantly increase the write traffic when the application object is very small (~100 bytes in case of HashCache-Striping). The situation (one with a RAM:SSD ratio of 1:16 or higher) motivates the need to treat SSD as something other than swap space. When most objects reside on the SSD, each object access would transfer a full page from SSD to RAM. If the objects on a page do not have temporal locality, then the rest of the page, while still being populated, really presents little benefit to the application. Instead we propose using a separate memory allocation and management tool that explicitly addresses this situation. If only a few lines of memory allocation code need to be modified to migrate an existing application to an SSD-enabled one, this one-time development cost is low compared to the cost of high-density memory (and a high-end server which can house that much RAM).
Introduction to SSDAlloc: I now describe our solution in more detail. SSDAlloc is a memory management tool that eases the hybrid RAM/SSD development. SSDAlloc is a slab based memory allocator whose interface looks similar to calloc. The only additional information needed by SSDAlloc for providing the performance benefits is the size of the application object. Each object allocated by SSDAlloc resides primarily on the SSD, is cached in RAM (RAM Object Cache, as shown above) when needed and also has a permanent virtual memory address where the object gets mapped to on demand. The SSD is managed as a log structured object storage engine, where the location of an object can change over time. Additionally, SSDAlloc is accompanied by various styles in which the virtual memory space is managed. We discuss two of them in this article. The first one, called Memory Pages (MP), is similar to how malloced memory looks. A contiguous memory space containing pages, each of which could contain multiple application objects. When object are allocated using MP the pages containing these objects are managed on the SSD in a log-structured manner (whole pages treated as objects).
Object per page model: Another virtual memory space management that we use is called Object Per Page (OPP). In this model, each object resides on its own page; if the size of the object is lesser than a page then the rest of the page is kept empty. To avoid wasting too much RAM we map only those objects to their pages, which are currently in use. Frequently accessed objects are cached in RAM Object Cache, which is a compact object store in RAM. Unused objects are pushed to SSD. There are two methods by which we minimize the wastage of RAM from using OPP. In the first method, applications explicitly declare session boundaries, which are used as markers by the SSDAlloc runtime to map objects out of their virtual memory locations. In the second method, the runtime maintains a page buffer, which is a small buffer of pages with currently mapped objects on them. When the buffer fills up, the runtime unmaps all the objects mapped to their pages so far, pushes them to RAM object cache and starts filling the buffer again by mapped objects to their virtual memory page on demand. In case of MP, SSDAlloc deals in full pages and treats each page as an object in itself. For objects spanning multiple pages in both the cases, the individual pages are still managed as separate objects.
Virtual memory management: As mentioned earlier, SSDAlloc runtime manages the SSD as a log structured object/page storage where the location of an object can change over time. To keep track of the location of an object we maintain a table called ObjectTable (shown in the figure above) that is similar to page tables. ObjectTables indicate the location of object regardless of where they currently reside (on the SSD or in the RAM Object cache or in the page buffer). A new ObjectTable is created for every set of objects obtained via SSDAlloc. When the application faults on a page for accessing an object, the runtime populates the required page “on-the-fly” by looking up the ObjectTable and inserts the new page in the page buffer. For this process to be quick, an efficient mechanism is needed for translating a virtual memory address of an object to its ObjectTable. In practice, we found a binary search tree was good enough for such a translation mechanism. Obviously, the lookups will be faster if the number of ObjectTables is minimized (to reduce the size of the binary search tree). Towards that end, SSDAlloc works as a slab allocator so that controlling the number of slabs leads to an optimal balance between SSD space usage and performance.
ACID transactions: While programmed session information (referred to as soft transactions) and buffered pages (referred to as agnostic transactions) are enough for reducing space wastage they do not provide any kind of guarantees. Some applications need ACID transactions from non-volatile memory. SSDs readily provide durability; atomicity can be ensured by evicting all or none of the objects belonging to a session (specified by the programmer) to the SSD. Isolation can be provided in the runtime library by allowing multiple threads to map the same set of objects at a different virtual memory address. Trapping accesses to different addresses of the same object allow a user defined synchronous data access model to take action when two threads try to modify the same object. Consistency has to be an application level semantic. With ACID transactions, SSDAlloc provides feature improvement over using SSD as swap alongside the performance improvement.
Performance: Using SSDAlloc we modified some applications including memcached (a key-value storage system), a boost library based B+Tree (transactional inserts, lookups and deletes) and HashCache (an efficient hashtable representation for in memory indexes). We also developed a transactional filesystem from scratch. We benchmarked these applications with and without SSDAlloc. When not using SSDAlloc these systems simply used the SSD as a swap space. The table above summarizes the results. In case of memcached, only the values are stored on the SSD using SSDAlloc. In case of the transactional filesystem the modified lines of code represent the number of lines needed to make the operations transactional. With only 15–36 lines of code modifications we obtain a speedup of upto an order of magnitude improvement over using SSD as a swap. Throughput gains obtained when using hard disk as swap are also indicated.
Additional information on the project can be obtained from the SSDAlloc project page. Stay tuned for the first code release which is around the corner.