
Software Challenges for the Changing Storage Landscape

Conventional storage software stacks are unable to meet the needs of high-performance Storage-Class Memory technology. It is time to rethink 50-year-old architectures.

As we embark on a new era of storage performance, the limitations of monolithic OS designs are beginning to show. New memory technologies (for example, 3D XPoint) are driving multi-GB/s throughput and sub-microsecond access latencies. As the performance of these devices approaches that of DRAM, the overhead incurred by legacy IO stacks increasingly dominates.


Key Insights

  • NVMe memory-based storage technologies are experiencing an exponential growth in performance with aggressive parallelism and fast new media. Traditional IO software architectures are unable to sustain these new levels of performance.
  • IOMMU hardware is a key enabler for realizing safe, maximally performing user space device drivers and storage IO stacks.
  • Kernel-bypass strategies rely on “asynchronous polling” whereby threads actively check device completion queues. Naive designs can lead to excessive busy-waiting and inefficient CPU utilization.

To address this concern, momentum is gathering around new ecosystems that enable effective construction of tailored and domain-specific IO architectures. These ecosystems rely on bringing both device control and data planes into user space, so that they can be readily modified and intensely optimized without jeopardizing system stability.

This article begins by giving a quantitative exploration of the need to shift away from kernel-centric generalized storage IO architectures. We then discuss the fundamentals of user space (kernel-bypass) operation and the potential gains that result. Following this, we outline key considerations necessary for their adoption. Finally, we briefly discuss software support for NVDIMM-based hardware and how this is positioned to integrate with a user space philosophy.


Evolution of Storage IO

Since the arrival of the Intel 8237 DMA controller in the IBM PC platform (circa 1981), network and storage device IO has centered on the use of Direct Memory Access (DMA). DMA enables data to be transferred between a device and main memory without involving the CPU. Because DMA transfers can target any part of main memory, and because drivers need to execute privileged machine instructions (for example, masking interrupts), device drivers of this era were well suited to residing in the kernel. While executing device drivers in user space was in theory possible, it was unsafe because any misbehaving driver could easily jeopardize the integrity of the whole system.

As virtualization technologies evolved, device drivers' broad access to memory became a prominent issue for system stability and for protection between hosted virtual machines. One approach to this problem is device emulation. However, emulation incurs a significant performance penalty because each access to the device requires a transition to the Virtual Machine Monitor (VMM) and back.

An alternative approach is para-virtualization, which modifies the guest OS device drivers so that they interact with the hypervisor through an optimized interface rather than trapping on every device access, improving performance over emulation. The downside is that guest OS code must be modified (including interrupt handling), and the additional IO layering adds latency.

Virtualization and direct device assignment. To minimize the impact of virtualization and indirection, certain use cases aim at providing a virtualized guest with direct ownership and access to specific hardware devices in the system. Such a scheme improves IO performance at the expense of the ability to transparently share devices across multiple guests using the hypervisor. The key enabling hardware technology for device virtualization is the IO Memory Management Unit (IOMMU). This provides the ability to remap device (DMA) addresses to physical addresses in the system, in much the same way that the MMU performs a translation from virtual to physical addresses (see Figure 1). Around 2006, IOMMU capabilities were made available in the Intel platform through its Virtualization Technology for Directed IO (VT-d),1 followed by AMD-Vi on the AMD x86 platform.

Figure 1. MMU and IOMMU duality.

Although IOMMU technology was driven primarily by the need to provide virtual machines with direct device access, it has more recently become a pivotal enabler for rethinking device driver and IO architectures in non-virtualized environments.

Another significant advance in device virtualization was made by the PCI Express SR-IOV (Single Root IO Virtualization) extension.11 SR-IOV enables multiple system images (SI) or virtual machines (VMs) in a virtualized environment to share PCI hardware resources. It reduces VMM overhead by giving VM guests direct access to the device.

New frontiers of IO performance. Over the last three decades, compute, network, and storage performance have grown exponentially in line with Moore's Law (see Figure 2). Over the last decade, however, growth in compute performance has slowed relative to storage performance, mainly because the CPU frequency ceiling has forced a shift in microprocessor design strategy. We expect this trend to continue as new persistent memory technologies drive an aggressive upswing in storage performance.

Figure 2. Relative IO performance growth. (Data collected by IBM Research, 2017)

The accompanying table shows some performance characteristics of select state-of-the-art IO devices. Lower latency, increased throughput and density, and improved predictability continue to be key differentiators in the networking and storage markets. As latency and throughput improve, the CPU cycles available to service IO operations are reduced. Thus, it is evident that the latency overhead imposed by traditional kernel-based IO paths has begun to exceed the latency introduced by the hardware itself.


Reconsidering the IO Stack

The prominent operating systems of today, such as Microsoft Windows and Linux, were developed in the early 1990s with design roots stemming from two decades earlier. Their architecture is monolithic: core OS functionality executes in kernel space and cannot be readily modified or adapted. Threads within the kernel alone are given access to privileged processor instructions (for example, x86 Ring-0). Kernel functionality includes interrupt handling, file systems, scheduling, memory management, security, IPC, and device drivers. This separation of user and kernel space came about to prevent "untrusted" applications from accessing resources in ways that could interfere with other applications or with OS functionality directly. For example, disallowing applications from terminating other applications or writing to memory outside of their protected memory space is fundamental to system stability.

In the early stages of modern OS development, hardware parallelism was limited. It was not until more than a decade later (2006) that the multicore microprocessor became mainstream. Following in the footsteps of multicore, network devices began to support multiple hardware queues so that parallel cores could service high-performance networking traffic. The same multi-queue trend appeared in storage devices, particularly with the advent of the NVMe SSD (Solid State Drive). Today, hardware-level parallelism in both CPU and IO is prominent. Intel's latest Xeon Platinum processors provide up to 28 cores, each with hyper-threading. AMD's latest Naples server processor, based on the Zen microarchitecture, provides 32 cores (64 threads) in a single socket. Many state-of-the-art NVMe drives and network interface cards support 64 or more hardware queues. The trend toward parallel hardware is clear and is not expected to diminish anytime soon. The consequence of this shift from single-core, single-queue designs to multi-core, multi-queue designs is that the IO subsystem has had to evolve to support concurrency.

Figure. State-of-the-art IO device performance reference points.

Strained legacy stacks. In terms of IO request rates, storage devices are an order of magnitude slower than network devices. For example, the fastest SSD devices operate at around 1M IOPS (IO operations per second) per device, whereas a state-of-the-art NIC device is capable of handling more than 70M packets per second. This slower rate means that legacy OS improvement efforts in the storage space are still considered worthwhile.

With the advent of multicore, enhancing concurrency is a clear approach to improving performance. Many legacy OS storage subsystems realize concurrency and asynchrony through kernel-based queues serviced by worker threads, typically allocated one per processor core. Software queues can be used to manage the mapping between application threads running on specific cores and the underlying hardware queues available on the IO device. This flexibility was introduced into the Linux kernel with the multi-queue block layer (blk-mq) in version 3.13,5 providing greatly improved IO scaling for multicore and multi-queue systems. The Linux kernel block IO architecture aims to provide good performance in the "general" case. As new IO devices (both network and storage) reach the realms of tens of millions of IOPS, the generalized architecture and layering of the software stack begin to strain. Even state-of-the-art work on improving kernel IO performance has seen limited success.15 Furthermore, even though the block IO layer may scale well, the layering of protocol stacks and file systems typically increases serialization and locking, and thus impacts performance.

To help understand the relationship between storage IO throughput and CPU demand, Figure 3 shows IOPS scaling for the Linux Ext4 file system. This data is captured with the fio micro-benchmarking tool configured to perform random writes of 4K blocks (random-read performance is similar). No file sharing is performed (the workloads are independent). The experimental system is an Intel E5-2699 v4 two-socket server platform with 512GB of DRAM. Each processor has 22 cores (44 hardware threads), and the system contains 24 NVMe Samsung 172Xa SSD 1.5TB PCIe devices. Total IO throughput capacity is ~6.5M IOPS (25GB/s). Each device attaches via PCIe Gen 3 x8 (7.8GB/s), a single QPI (memory bus) link is ~19.2GB/s, and each processor provides 40 PCIe Gen 3.0 lanes (39.5GB/s).

Figure 3. Ext4 file system scaling on software RAID-0.

The maximum throughput achieved is 3.2M IOPS (12.21GB/s), realized at a load of ~26 threads (one per device) and 30% of total CPU capacity. Adding threads from 17 to 26 gives negligible scaling. Beyond 26 worker threads, performance begins to degrade and becomes unpredictable, although CPU utilization continues to grow linearly for some time.

File systems and kernel IO processing also add latency. Figure 4 shows latency data for direct device access (using Micron’s kernel-bypass UNVMe framework) and the stock Ext4 file system. This data is from a single Intel Optane P4800X SSD. The filesystem and kernel latency (mean 13.92μsec) is approximately double that of the raw latency of the device (mean 6.25μsec). For applications where synchronous performance is paramount and latency is difficult to hide through pipelining, this performance gap can be significant.

Figure 4. Ext4 vs. raw latency comparison.

Application-specific IO subsystems. An emerging paradigm is to enable customization and tailoring of the IO stack by "lifting" IO functions into user space. This approach improves system stability where custom IO processing is introduced (that is, custom stacks can crash without jeopardizing system stability) and allows developers to protect intellectual property where open source kernel licenses imply source release.

Although not originally designed for this purpose, the IOMMU is a key enabler for user-level IO. Specifically, the IOMMU provides the same protection between user space processes and the kernel as it does between guest and host in virtualized systems (see Figure 5). This effectively means that user space device drivers (unprivileged processes) can be compartmentalized: the memory regions valid for device DMA operations are limited by the IOMMU, and device drivers are therefore prevented from accessing arbitrary memory regions (via a device's DMA engine).

Figure 5. MMU and IOMMU duality.

Configuration of the IOMMU remains restricted to kernel functions operating at a higher privilege level (that is, Ring 0). For example, in Linux, the Virtual Function IO (VFIO) kernel module can be used to register memory with the IOMMU and ensure that it is "pinned."

New architectures also allow interrupt handling to be localized to a subset of processor resources (that is, mapping MSIs to specific local APICs) associated with a specific device driver's execution. Coupled with device interrupt coalescing and atomic masking, this means that user-level interrupt handling is also viable. However, the interrupt vector must still reside in the kernel and execute at a privileged level, at least for the Intel and IBM Power architectures.


Foundational Kernel-bypass Ecosystems

In this section, we introduce the basic enablers for kernel-bypass in the Linux operating system. This is followed by a discussion of the Data Plane Development Kit (DPDK) and Storage Performance Development Kit (SPDK), two foundational open source projects started by Intel Corporation. DPDK has been widely adopted for building kernel-bypass applications, with over 30 companies and almost 400 individuals contributing patches to the open source DPDK project as of release 17.05. While DPDK is network-centric, it provides the basis for the SPDK storage-centric ecosystem. Other projects, such as FD.IO (http://fd.io) and Seastar (http://seastar-project.org), also build on DPDK; their domain specifics are not discussed in this article.

Linux user space device enablers. Linux kernel version 2.6 introduced the User Space IO (UIO) loadable module. UIO is the older of the two kernel-bypass mechanisms in Linux (VFIO being the other). It provides an API that enables user space handling of legacy INTx interrupts, but not message-signaled interrupts (MSI or MSI-X). UIO also does not support DMA isolation through the IOMMU. Even with these limitations, UIO is well suited for use in virtual machines, where direct IOMMU access is not available. In these situations, a guest VM user space process is not isolated from other processes in the same guest VM, but the hypervisor itself can isolate the guest VM from other VMs or host processes using the IOMMU.

For bare-metal environments, VFIO is the preferred framework for Linux kernel-bypass. It operates with the Linux kernel's IOMMU subsystem to place devices into IOMMU groups. User space processes can open these IOMMU groups and register memory with the IOMMU for DMA access using VFIO ioctls. VFIO also provides the ability to allocate and manage message-signaled interrupt vectors.
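As a rough illustration of this flow, the following minimal C sketch attaches an IOMMU group to a VFIO container, selects the Type-1 IOMMU model, and registers a buffer for DMA, which causes the kernel to pin the underlying pages. Error handling, device-descriptor acquisition, and unmapping are omitted, and the IOMMU group number (26) and buffer size are hypothetical.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/vfio.h>

    int main(void)
    {
        /* the container represents an IOMMU context; the group number is
           hypothetical and depends on the device's IOMMU group */
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/26", O_RDWR);

        /* attach the group to the container, then select the Type-1 IOMMU model */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* allocate a 1MiB buffer and register it for DMA; the kernel pins it */
        void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct vfio_iommu_type1_dma_map dma_map = {
            .argsz = sizeof(dma_map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (unsigned long)buf,
            .iova  = 0,            /* IO virtual address the device will use */
            .size  = 1 << 20,
        };
        ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
        return 0;
    }

Frameworks such as DPDK and SPDK perform equivalent steps internally when a device is bound to the vfio-pci driver.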

Data plane development kit. DPDK (http://dpdk.org) was originally aimed at accelerating network packet processing applications. The project was initiated by Intel Corporation but is now under the purview of the Linux Foundation. At the core of DPDK is a set of polled-mode Ethernet drivers (PMDs). These PMDs bypass the kernel and, by doing so, can process hundreds of millions of network packets per second on standard server hardware.

DPDK also provides libraries to aid kernel-bypass application development. These libraries enable probing for PCI devices (attached via UIO or VFIO), allocation of huge-page memory, and data structures geared toward polled-mode message-passing applications such as lockless rings and memory buffer pools with per-core caches. Figure 6 shows key components of the DPDK framework.

Figure 6. DPDK architecture.
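To make the polled-mode model concrete, the following minimal sketch shows the canonical DPDK receive loop. Device configuration, receive-queue setup, and mempool creation (rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start) are omitted, and the port and queue identifiers are placeholders.

    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        /* initialize the EAL: probes UIO/VFIO-attached devices, maps huge pages */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        const uint16_t port_id = 0, queue_id = 0;   /* placeholder identifiers */
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* poll the hardware receive queue; returns immediately if empty */
            uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < n; i++) {
                /* ... process packet bufs[i] here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }

No interrupt fires on this data path; the thread owns both the queue and the CPU core it runs on.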

Storage performance development kit. SPDK builds on the foundations of DPDK. It was introduced by Intel Corporation in 2015 with a focus on enabling kernel-bypass storage and storage-networking applications using NVMe SSDs. While SPDK is primarily driven by Intel, an increasing number of companies are using and contributing to the effort. The project seeks broader collaboration, which may require adopting a governance structure similar to DPDK's. SPDK shows good promise for filling the same role for storage and storage networking that DPDK has for packet processing.

SPDK’s NVMe polled-mode drivers provides an API to kernel-bypass applications for both direct-attached NVMe storage as well as remote storage using the NVMe over Fabrics protocol. Figure 7 shows the SPDK framework’s core elements as of press time. Using SPDK, Walker22 shows reduction in IO submission/completion overhead by a factor of 10 as measured with the SPDK software overhead measurement tool.

Figure 7. SPDK architecture.
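As a minimal sketch of the polled-mode NVMe driver API, the following fragment submits a one-block read and then reaps completions from the queue pair. It assumes the controller, namespace, queue pair, and DMA-capable buffer have already been set up elsewhere (via spdk_nvme_probe, spdk_nvme_ctrlr_alloc_io_qpair, and spdk_dma_zmalloc), and it omits error handling.

    #include <stdbool.h>
    #include "spdk/nvme.h"

    static volatile bool done;

    /* completion callback; invoked from spdk_nvme_qpair_process_completions() */
    static void read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
    {
        done = true;
    }

    void read_one_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                        void *buf)
    {
        /* submit an asynchronous read of one block at LBA 0; returns immediately */
        spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* block count */,
                              read_complete, NULL, 0 /* io_flags */);

        /* asynchronous polling: reap completions until our callback fires */
        while (!done)
            spdk_nvme_qpair_process_completions(qpair, 0 /* no limit */);
    }

In a real application, the polling call would be folded into a loop that services many outstanding requests and other event sources rather than spinning on a single completion.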

To give the reader a better understanding of the impact of legacy IO, we present data from the fio benchmarking tool (https://github.com/axboe/fio). Figure 8 shows performance data for kernel-based IO (with Ext4 and raw block access) and SPDK. The data compares throughput against the number of client threads, with a queue depth of 32 and an IO size of 4KiB. Sequential read, sequential write, random read, random write, and 50:50 read-write workloads are examined.

Figure 8. Comparison of fio performance for Linux kernel vs. SPDK.

The key takeaway is that SPDK requires only one thread to achieve over 90% of the device's maximum performance. Note also that the SPDK data represents a 1:1 mapping of threads to hardware queues; the number of threads is therefore limited to the number of queues available (16 in this case). The kernel-based data represents user threads multiplexed (via two layers of software queues) onto the underlying device queues.5

From the data, we can see that generality and the associated functionality impact performance. Reducing software overhead by tailoring and optimizing the stack (according to specific application requirements) improves storage applications in two ways. First, with fewer CPU cycles spent on processing IO, more CPU cycles are available for storage services such as compression, encryption, or storage networking. Second, with the advent of ultra-low latency media, such as Intel Optane, higher performance can be achieved for low queue depth workloads since the software overhead is much smaller compared to the media latency.

Klimovic et al.14 have applied DPDK and SPDK in the context of distributed SSD access. Their results show performance improvements for the FlashX graph-processing framework of up to 40% versus iSCSI. They also make a comparison with RocksDB and show a delta of ~28% between iSCSI and their solution. This work is based on the IX dataplane operating system,3 which is fundamentally built on kernel-bypass approaches.


Kernel-Bypass Design Considerations

Here, we present some design aspects and insights that adopters of kernel-bypass technology, such as DPDK and SPDK, should consider.

Cost of context switching. Raising IO operations into user space requires careful consideration of software architecture. Traditional OS designs rely on interrupts and context switching to multiplex access to the CPU. In a default Linux configuration, for example, the NVMe device driver uses a per-core submission queue serviced by the same core, and therefore context switching cannot be avoided.

Context switches are costly (more so than system calls) and should be avoided at high IO rates. They cause cache pollution, both through eviction of cache lines by the task contexts themselves and through the working set of the newly scheduled task displacing that of its predecessor. The typical cost of a context switch is on the order of 2,000–5,000 clock cycles. Figure 9 presents data from lmbench (http://www.bitmover.com/lmbench/) running on a dual-socket Intel E5-2650 v4 @ 2.2GHz with 32K L1, 256K L2, and 30MB L3 caches.

Figure 9. Context switch latencies on Intel E5-2650 based server.

Polling-based designs minimize IO latency by eliminating the need to execute interrupt handlers for inbound IO, and removing system calls/context switches for outbound IO. However, polling threads must be kept busy performing useful work as opposed to spending time polling empty or full queues (busy-work).

Asynchronous polling. A key design pattern that can improve the utility of polling threads is asynchronous polling. Here, polling threads round-robin (or apply some other scheduling policy) across multiple asynchronous tasks. For example, a single thread might service both hardware and software queues at the same time (see Figure 10). Hardware queues reside in memory on the device and are controlled by the device itself, while software queues reside in main memory and are controlled by the CPU; IO requests typically flow through both. Polling is asynchronous in that the thread does not synchronously wait for the completion of a specific request, but retrieves the completion at a later point in time.

Figure 10. Asynchronous polling pattern.
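A minimal sketch of the pattern is shown below: a single thread round-robins across several event sources, each represented by a poll function that does a bounded amount of work and reports whether anything was found. The poller functions here are hypothetical stand-ins for, say, a device completion queue and a software request queue.

    #include <stdbool.h>
    #include <stddef.h>

    /* hypothetical pollers: each checks its queue, performs a bounded amount
       of work, and returns true if it found anything to do */
    static bool poll_hw_completions(void *ctx) { /* check device CQ ... */ return false; }
    static bool poll_sw_requests(void *ctx)    { /* check software queue ... */ return false; }

    struct poller {
        bool (*fn)(void *);
        void *ctx;
    };

    static void polling_loop(struct poller *pollers, int n)
    {
        for (;;) {
            bool busy = false;
            for (int i = 0; i < n; i++)          /* round-robin across sources */
                busy |= pollers[i].fn(pollers[i].ctx);
            if (!busy) {
                /* nothing ready anywhere: optionally pause, yield, or re-arm
                   interrupts here to avoid pure busy-waiting */
            }
        }
    }

    int main(void)
    {
        struct poller pollers[] = {
            { poll_hw_completions, NULL },
            { poll_sw_requests,    NULL },
        };
        polling_loop(pollers, 2);
        return 0;
    }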

Asynchronous polling can be coupled with lightweight thread scheduling (coroutines) found, for example, in Intel Cilk.16 Such technologies allow program-level logical concurrency without the cost of context switching: each kernel thread services a task queue, using stack swapping to redirect execution. Lightweight scheduling schemes typically execute tasks to completion, that is, they are non-preemptive, which is well suited to asynchronous IO tasks.

Lock-free inter-thread communications. Because polling threads cannot perform extensive work without risking device queue overflow (just as conventional interrupt service routines must be tightly bound and therefore typically defer work), they must offload work to, or receive work from, other application threads.

This requires that threads coordinate execution. A practical design pattern for this is message passing across lock-free FIFO queues. Different lock-free queue implementations can be used for different ratios of producers and consumers (for example, single-producer/single-consumer, single-producer/multi-consumer). Lock-free queues are well suited to high-performance user-level IO because they do not require kernel-level locking, but rely on machine-level atomic instructions. This means that exchanges can be performed without forcing a context switch (although one may still occur if the thread is preempted).
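For illustration, a minimal single-producer/single-consumer ring using C11 atomics might look as follows; production-grade variants of this and the other producer/consumer ratios are provided, for example, by DPDK's rte_ring library.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                    /* must be a power of two */

    struct spsc_ring {
        _Atomic size_t head;                  /* advanced by the consumer */
        _Atomic size_t tail;                  /* advanced by the producer */
        void *slots[RING_SIZE];
    };

    /* producer side: returns false if the ring is full */
    static bool ring_push(struct spsc_ring *r, void *item)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;                     /* full: caller decides to spin or sleep */
        r->slots[tail & (RING_SIZE - 1)] = item;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* consumer side: returns false if the ring is empty */
    static bool ring_pop(struct spsc_ring *r, void **item)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return false;                     /* empty: caller decides to spin or sleep */
        *item = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }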

A basic implementation of lock-free queues will busy-wait when the queue is empty or full. This means the thread continuously reads memory state and consumes 100% of the CPU it is running on, resulting in high energy utilization. Alternatively, it is possible to implement lock-free queues that allow threads to sleep on empty or full conditions. This avoids busy-waiting by allowing the OS to schedule other threads in their place. Sleeping can be supported on either or both sides of the queue. To avoid race conditions, implementations typically use an additional "waker" thread. This pattern is well known in the field of user-level IPC (Inter-Process Communication).20

Lock-free queues are established in shared memory and can be used for both inter-process and inter-thread message exchange. To optimize inter-thread message-passing performance, the processor cores (on which the threads execute) and the queue memory should belong to the same NUMA zone; accessing memory in a remote NUMA zone incurs approximately twice the access latency.
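As a hedged sketch of this placement, the fragment below uses libnuma (link with -lnuma) to pin the calling thread to a chosen CPU and to allocate the queue memory on the same NUMA node; the node and CPU identifiers are assumed to be supplied by the caller.

    #define _GNU_SOURCE
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>

    /* allocate queue memory on a given NUMA node and pin the calling thread to
       a CPU on that node, so producer, consumer, and queue share locality */
    void *alloc_local_queue(size_t size, int node, int cpu_on_node)
    {
        if (numa_available() < 0)
            return malloc(size);               /* no NUMA support: fall back */

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu_on_node, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        return numa_alloc_onnode(size, node);  /* memory backed from 'node' */
    }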

Combining polling and interrupt modes. Another strategy to avoid busy-waiting on queues is to combine polling and interrupt modes. To support this, VFIO provides the capability to attach a signal (based on a file descriptor) to an interrupt so that a blocked user-level thread can be alerted when an interrupt event occurs. The following excerpt illustrates connecting an MSI interrupt to a file handle using the Linux eventfd and VFIO ioctl interfaces:

efd = eventfd(0, 0);

ioctl(vfio_fd, VFIO_EVENT_FD_MSI, &efd);

In this case, IO threads wait on file descriptor events through read or poll system calls. Of course, this mechanism is costly in terms of performance, since waking up and signaling a user-level thread from the kernel is expensive. However, because the interrupt is masked when generated, the user thread controls unmasking and can thus decide arbitrarily when to revert to interrupt mode (for example, after an extended period of "quiet" time has passed).
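Continuing the excerpt above, a minimal sketch of the fall-back path might block in poll() on the eventfd, drain its counter on wake-up, and then return to polled mode. Re-enabling and unmasking the device interrupt is device specific and is omitted here.

    #include <poll.h>
    #include <unistd.h>
    #include <stdint.h>

    /* after an extended quiet period, block until the kernel signals the eventfd */
    static void wait_for_interrupt(int efd)
    {
        struct pollfd pfd = { .fd = efd, .events = POLLIN };

        poll(&pfd, 1, -1);                    /* sleep until the MSI fires */

        uint64_t count;
        read(efd, &count, sizeof(count));     /* drain the eventfd counter */

        /* the interrupt arrives masked; leave it masked and resume polling */
    }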

Memory paging and swapping. An important role of the kernel is to handle page faults and "swap" memory to backing store when insufficient physical memory is available. Most operating systems use a lazy (demand-paged) mapping strategy, so virtual pages are not mapped to physical pages until they are touched. Swapping provides an extended memory model; that is, the system presents to the application the appearance of more memory than physically exists. The mechanisms behind swapping are also used for memory-mapped files, where a region of memory shadows a copy of file data. Traditionally, swapping is not heavily used because the cost of transferring pages to storage devices that are considerably slower than memory is significant.

For monolithic OS designs, page swapping is implemented in the kernel. When there is no physical page mapping for a virtual address (that is, there is no page table entry), the CPU generates a page-fault. In the Intel x86 architecture, this is realized as a machine exception. The exception handler is run at a high privilege level (CPL 0) and thus remains in the kernel. When a page-fault occurs, the kernel allocates a page of memory from a pool (typically known as the page cache) and maps the page to the virtual address by updating the page table. When the physical memory pool is exhausted, the kernel must evict an existing page by writing out the content to backing store and invalidating the page table entry (effectively un-mapping the page). In Linux, the eviction policy is based on a variation of the Least Recently Used (LRU) scheme.10 This is a generalized policy aimed at working well for most workloads.

Because page-fault handling and page swapping rely on privileged instructions and exception handling, implementing them in user space alone is inherently difficult. One approach is to use the POSIX mprotect and mmap/munmap system calls to explicitly control the page-mapping process. In this case, the PROT_NONE page protection can be used to force the kernel to raise a signal on the user-level process when an unmapped page is accessed. In our own work, we have been able to achieve a paging overhead of around 20μsec per 4K page (with SPDK-based IO), which is comparable to that of the kernel (tested against memory-mapped files).
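A minimal sketch of this mechanism follows: the region is mapped PROT_NONE, the resulting SIGSEGV is caught, and the handler "pages in" by upgrading the protection. The actual transfer from storage (for example, a user-level NVMe read into the page) is elided, and production code would also verify that the faulting address lies within the managed region.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    static void fault_handler(int sig, siginfo_t *si, void *uctx)
    {
        /* round the faulting address down to its page boundary */
        void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));

        /* ... fetch the page contents from storage here (for example, via a
           user-level NVMe read), then make the page accessible ... */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* reserve a region with no access rights: every first touch faults */
        char *region = mmap(NULL, 64 * PAGE_SIZE, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        region[0] = 1;   /* triggers fault_handler, which maps the page in */
        return 0;
    }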

Memory flushing. To optimize write-through to storage, it is also necessary to track dirty pages so that only those that have been modified are flushed out to storage. If a page has only been read during its active mapping, there is no need to write it back to storage. From the kernel's perspective, this function is easily achieved by checking the page's dirty bit in its corresponding page table entry. However, as noted earlier, accessing the page table from user space is problematic. In our own work, we have used two different approaches to address this problem.


The first is to use a CRC checksum over the memory to identify dirty pages. Both Intel x86 and IBM Power architectures have CRC32 accelerator instructions that can compute a 4K checksum in less than ~1,000 cycles. Note that optimizations such as performing the CRC32 on 1024-byte blocks and "short-circuiting" dirty-page identification can further reduce the cost of CRC in this context.
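The following sketch illustrates the approach using the x86 SSE4.2 CRC32 instruction via compiler intrinsics (compile with -msse4.2); a checksum recorded when the page is mapped is compared with one computed at flush time to decide whether the page is dirty.

    #include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* checksum a 4KiB page, eight bytes at a time */
    static uint32_t page_crc32(const void *page)
    {
        const uint64_t *p = page;
        uint64_t crc = ~0ULL;
        for (size_t i = 0; i < 4096 / sizeof(uint64_t); i++)
            crc = _mm_crc32_u64(crc, p[i]);
        return (uint32_t)~crc;
    }

    /* dirty if the checksum no longer matches the one recorded at map time */
    static bool page_is_dirty(const void *page, uint32_t crc_at_map)
    {
        return page_crc32(page) != crc_at_map;
    }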

An alternative approach is to use a kernel module to collect dirty page information on request from an application. This, of course, incurs an additional system call and page table walk. Consequently, this approach performs well with small page tables, but is less performant than CRC when traversal across many page table entries is needed.

Legacy integration. Designing around a kernel bypass architecture is a significant paradigm shift for application development. Consequently, there are some practical limitations to their adoption in legacy systems. These include:

  • Integration with existing applications based on a blocking threading model requires either considerable rewriting to adhere to an asynchronous/polling model, or shims to bridge the two; the latter reduces the potential performance benefits.
  • Sharing storage devices between multiple processes is difficult. Network devices handle this well via SR-IOV, but SR-IOV support has only recently been added to the NVMe specification. Hence, sharing NVMe devices across multiple processes must currently be done in software.
  • Integration with existing file system structures is difficult. While Filesystem in Userspace (FUSE) technology could conceptually be used to integrate with the kernel-based file system hierarchy, the performance advantages would be lost because control must still pass into the kernel. Evolution of the POSIX API is needed to support hybrid kernel and user IO, and "pure" user space file systems are still not broadly available.
  • Legacy file systems and protocol stacks incorporate complex software that has taken years of development and debugging. In some cases, this software can be integrated through "wrappers." In general, however, this is challenging, and redeveloping the software from the ground up is often more economical.


Integration of NVDIMMs

Non-Volatile Dual Inline Memory Modules (NVDIMMs) attach non-volatile memory directly to the memory bus, opening the possibility of application programs accessing persistent storage via load/store instructions. This requires additional libraries and/or programming language extensions5,9 to support the coexistence of both volatile and non-volatile memory. The fundamental building blocks needed are persistent memory management (for example, pool and heap allocators), cache management, transactions, garbage collection, and data structures that can operate with persistent memory (for example, support recovery and reinstantiation).

Today, two prominent open source projects are pushing forward the development of software support for persistent memory. These are pmem.io (http://pmem.io/), driven primarily by Intel Corporation in conjunction with SNIA, and The Machine project (https://www.labs.hpe.com/the-machine) from HP Labs. These projects are working to build tools and libraries that support access and management of NVDIMM. Key challenges that are being explored by these projects and others,3,8,17,20 include:

  • Cross-heap pollution: Pointers to volatile data structures should not “leak” into the non-volatile heap. New programming language semantics are needed to explicitly avoid programming errors that lead to dangling and invalid references.
  • Transactions: Support for ACID (atomicity, consistency, isolation, durability) transactions offering well-defined guarantees about modifications to data structures that reside in persistent memory and are accessible by multiple threads (see the sketch after this list).
  • Memory leaks and permanent corruption: Persistence makes memory leaks, and errors that would normally be recoverable through program restart or reset, more pernicious. Strong safety guarantees are needed to avoid permanent corruption.
  • Performance: Providing tailored capabilities and leveraging the advantages of low latency and high throughput enabled by NVDIMM technology.
  • Scalability: Scaling data structures to multi-terabytes also requires scaling of metadata and region management structures.
  • Pointer swizzling: Modifying embedded (virtual address) pointer references for object/data structure relocation.21
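
As a concrete illustration of the transaction item above, the following minimal sketch uses pmem.io's libpmemobj (part of PMDK) to update a persistent counter atomically; the pool path and layout name are placeholders, and pool creation and error handling are omitted.

    #include <stdint.h>
    #include <libpmemobj.h>

    struct root {
        uint64_t counter;
    };

    int main(void)
    {
        /* open an existing persistent memory pool (path and layout are placeholders) */
        PMEMobjpool *pop = pmemobj_open("/mnt/pmem/pool", "example_layout");
        if (pop == NULL)
            return 1;

        PMEMoid root_oid = pmemobj_root(pop, sizeof(struct root));
        struct root *r = pmemobj_direct(root_oid);

        /* updates inside the transaction become durable atomically; the range
           added below is snapshotted to an undo log before modification */
        TX_BEGIN(pop) {
            pmemobj_tx_add_range(root_oid, 0, sizeof(struct root));
            r->counter += 1;
        } TX_END

        pmemobj_close(pop);
        return 0;
    }

If the process crashes inside the transaction, the undo log restores the previous counter value the next time the pool is opened.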

The real impact of NVDIMMs remains to be seen. However, work by Coburn et al.6 on NV-Heaps has shown that for certain applications the move from a transactional database to persistent memory can bring significant performance gains.

NVDIMM-based persistent memory lends itself to integration with user space approaches because it inherently provides access directly to the user space application (although mapping and allocation may remain under the kernel's control). This enables efficient, zero-copy, DMA-centric movement of data through the memory hierarchy and into the storage device. A longer-term vision is a converged memory-storage paradigm whereby traditional storage services (for example, durability, encryption) can be layered into the memory paradigm. To date, however, this topic remains largely unaddressed by the community.


Outlook

Mainstream operating systems are based on IO architectures with a 50-year heritage. New devices, bringing unprecedented levels of concurrency and performance, now challenge these traditional designs. The result is that we are entering an era of CPU-IO performance inversion, in which CPU resources become the bottleneck. Careful consideration of execution paths is now paramount to effective system design.

User space kernel-bypass strategies provide a vehicle to explore and quickly develop new IO stacks. These stacks can exploit the alignment of requirements and function, being readily tailored and optimized to meet the specific needs of an application. The flexibility of implementing software in user space (as opposed to kernel space) eases development and debugging, and makes it possible to leverage existing application libraries (for example, for machine learning).

For the next decade, microprocessor design trends are expected to continue to increase on-die transistor counts. As gains from instruction-level parallelism and clock frequency have reached a plateau, increased core counts and on-chip accelerators are the most likely differentiators for future processor generations. There is also the possibility of "big" and "little" cores, whereby heterogeneous cores with different capabilities (for example, pipelining, floating point units, and clock frequency) exist in the same processor package; this is already evident in ARM-based mobile processors. Such an approach could help drive a shift away from interrupt-based IO toward polled IO, whereby "special" cores are dedicated to IO processing (possibly at a lower clock frequency). This would eliminate context switches and cache pollution, and would also enable improved energy management and determinism in the system.

Large capacity, NVDIMM-based persistent memory is on the horizon. The availability of potentially up to terabytes of persistent memory, with sub-microsecond access latencies and cache-line addressability, will accelerate the need to make changes in the IO software architecture. User space IO strategies are well positioned to meet the demands of high-performance storage devices and to provide an ecosystem that can effectively adopt load/store addressable persistence.

References

    1. Abramson, D. et al. Intel virtualization technology for directed IO. Intel Technology J. 10, 3 (2006), 179–192.

    2. Atkinson, M. and Morrison, R. Orthogonally Persistent Object Systems. The VLDB J. 4, 3 (July 1995), 319–402.

    3. Belay, A., Prekas, G., Klimovic, A., Grossman, S., Kozyrakis, C. and Bugnion, E. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of USENIX Operating Systems Design and Implementation, Oct. 2014, 49–65.

    4. Bhattacharya, S.P. A Measurement Study of the Linux TCP/IP Stack Performance and Scalability on SMP systems, Communication System Software and Middleware, 2006.

    5. Bjørling, M., Axboe, J., Nellans, D. and Bonnet, P. Linux block IO: Introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conf., 2013, 22:1–22:10. ACM, New York, NY, USA.

    6. Coburn, J. et al. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. SIGPLAN Notices 46, 3 (Mar. 2011), 105–118.

    7. Dearle, A., Kirby, G.N.C. and Morrison, R. Orthogonal persistence revisited. In Proceedings of the 2nd International Conference on Object Databases, 2010, Springer Berlin, Heidelberg.

    8. Gorman, M. Understanding the Linux Virtual Memory Manager. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004.

    9. Grundler, G. Porting drivers to HP ZX1. Ottawa Linux Symposium, 2002.

    10. Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Manual. No. 248966-033, June 2016.

    11. Intel Corporation. PCI-SIG Single Root IO Virtualization Support in Intel® Virtualization Technology for Connectivity; https://www.intel.com/content/dam/doc/white-paper/pci-sig-single-root-io-virtualization-support-in-virtualization-technology-for-connectivity-paper.pdf

    12. Kannan, S., Gavrilovska, A. and Schwan, K. PVM: Persistent virtual memory for efficient capacity scaling and object storage. In Proceedings of the 11th European Conference on Computer Systems, 2016, 13:1–13:16. ACM, New York, NY, USA.

    13. Kemper, A. and Kossmann, D. Adaptable pointer swizzling strategies in object bases: Design, realization, and quantitative analysis. International J. Very Large Data Bases 4, 3 (July 1995), 519–567.

    14. Klimovic, A., Litz, H. and Kozyrakis, C. ReFlex: Remote Flash Local Flash. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, 345–359. ACM, New York, NY.

    15. Kumar, P. and Huang, H. Falcon: Scaling IO performance in multi-SSD volumes. In Proceedings of USENIX Annual Technical Conference (Santa Clara, CA, July 2017).

    16. Lewin-Berlin, S. Exploiting multicore systems with Cilk. In Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, 2010, 18–19. ACM, New York, NY, USA. ACM.

    17. Lin, F.X. and Liu, X. Memif: Towards programming heterogeneous memory asynchronously. SIGARCH Computing Architecture News 44, 2 (Mar. 2016), 369–383.

    18. Siemon, D. Queueing in the Linux network stack. Linux J. 231 (July 2013).

    19. Tuning throughput performance for Intel Ethernet adapters (2017); http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000005811.html

    20. Unrau, R. and Krieger, O. Efficient sleep/wake-up protocols for user-level IPC. In Proceedings of the 1998 International Conference on Parallel Processing.

    21. Volos, H., Tack, A.J. and Swift, M.M. Mnemosyne: Lightweight persistent memory. SIGPLAN Notices 47, 4 (Mar. 2011), 91–104.

    22. Walker, B. SPDK: Building blocks for scalable high-performance storage applications. SNIA Storage Developer Conference, 2016, Santa Clara, CA, USA; https://www.snia.org/sites/default/files/SDC/2016/presentations/performance/BenjaminWalker_SPDK_Building_Blocks_SDC_2016.pdf
