泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!
网站地址:https://tinylab.org

泰晓Linux系统盘,不用安装,即插即跑
请稍侯

LWN 334649: Compcache,一种基于内存实现压缩交换(compressed swapping)的技术

Wang Chen 创作于 2019/03/10

请点击 LWN 中文翻译计划,了解更多详情。

原文:Compcache: in-memory compressed swapping 原创:By Nitin Gupta @ May 26, 2009 翻译:By unicornx 校对:By Benny Zhao

The idea of memory compression—compress relatively unused pages and store them in memory itself—is simple and has been around for a long time. Compression, through the elimination of expensive disk I/O, is far faster than swapping those pages to secondary storage. When a page is needed again, it is decompressed and given back, which is, again, much faster than going to swap.

内存压缩(memory compression,即将那些相对不怎么使用的内存页上的数据压缩后仍旧保存在内存中)的想法本质上并不复杂,而且存在了已经有一段时间了。这么做减少了对代价高昂的二级存储(磁盘)的读写,好处是带来了速度上的提升。当再次需要一个内存页时,可以通过解压缩的方式进行恢复,这比运行交换(swap,译者注,下文直接使用不再翻译)要快得多。

An implementation of this idea on Linux is currently under development as the compcache project. It creates a virtual block device (called ramzswap) which acts as a swap disk. Pages swapped to this disk are compressed and stored in memory itself. The project home contains use cases, performance numbers, and other related bits. The whole aim of the project is not just performance — on swapless setups, it allows running applications that would otherwise simply fail due to lack of memory. For example, Edubuntu included compcache to lower the RAM requirements of its installer.

compcache 项目是基于该想法在 Linux 上的一个具体实现。该方案目前仍处于开发过程中(译者注,这里是指相对于本文发表时)。它创建了一个虚拟块设备(称之为 ramzswap)充当用于 swap 的磁盘。换出到此 “磁盘” 的内存页将被压缩并仍旧保存在内存中。项目的主页上包含有用例,性能数据和其他相关的介绍。该项目的完整目标不仅仅是为了提高性能,对于那些没有提供交换设备的系统,一旦启用该机制后,还可以使得原本由于缺乏内存而无法运行的应用程序能够继续运行。例如,Edubuntu 就 包含了 compcache 用于降低安装程序对内存(RAM)的需求。

The performance page on the project wiki shows numbers for configurations that closely match netbooks, thin clients, and embedded devices. These initial results look promising. For example, in the benchmark for thin clients, ramzswap gives nearly the same effect as doubling the memory. Another benchmark shows that average time required to complete swap requests is reduced drastically with ramzswap. With a swap partition located on a 10000 RPM disk, average time required for swap read and write requests was found to be 168ms and 355ms, respectively. While with ramzswap, corresponding numbers were mere 12µs and 7µs, respectively — this includes time for checking zero-filled pages and compressing/decompressing all non-zero pages.

项目的 性能展示页面 上列出了针对上网本,瘦客户机和嵌入式设备等各种系统配置的性能测试数据。这些初步的测试结果看上去令人鼓舞。譬如,以瘦客户机的基准测试为例,启用 ramzswap 后系统运行表现堪比内存加倍条件下的效果。另一个 基准测试 显示,使用 ramzswap 可以大大减少完成 swap 请求所需的平均时间。假设作为 swap 分区的磁盘转速为 10000 RPM,则 swap 时执行读取和写入请求所需的平均时间分别为 168ms 和 355ms。换成 ramzswap 后,相应的数字分别仅为 12μs 和 7μs,这还包括了检查页框数据填充值是否为零以及压缩和解压缩所有非零页的时间。

The approach of using a virtual block device is a major simplification over earlier attempts. The previous implementation required changes to the swap write path, page fault handler, and page cache lookup functions (find_get_page() and friends). Those patches did not gain widespread acceptance due to their intrusive nature. The new approach is far less intrusive, but at a cost: compcache has lost the ability to compress page cache (filesystem backed) pages. It can now compress swap cache (anonymous) pages only. At the same time, this simplicity and non-intrusiveness got it included in Ubuntu, ALT Linux, LTSP (Linux Terminal Server Project) and maybe other places as well.

在早期版本的基础上所做的主要简化是采用了新的虚拟块设备技术。先前的方案需要涉及对 swap 的写路径,缺页处理逻辑和页缓存查找函数(包括 find_get_page() 和其他辅助函数)进行修改。由于改动太大,这些补丁未能获得社区的广泛接受。采用新方法后改动要小得多,但代价是:目前的 compcache 不再支持针对用于页缓存(服务于文件系统)的内存页的压缩能力。它现在只能压缩可交换的匿名页。正是由于这种简单的设计,该补丁受到多方青睐,并被应用于 Ubuntu,ALT Linux,LTSP (Linux Terminal Server Project) 等项目。

It should be noted that, when used at the hypervisor level, compcache can compress any part of the guest memory and for any kind of guest OS (Linux, Windows etc) — this should allow running more virtual machines for a given amount of total host memory. For example, in KVM the guest physical memory is simply anonymous memory for the host (Linux kernel in this case). Also, with the recent MMU notifier support included in the Linux kernel, nearly the entire guest physical memory is now swappable [PDF].

应该注意的是,当在虚拟机系统管理层(hypervisor level)使用时,compcache 可以针对客户机(guest)中各种操作系统(Linux,Windiws 等)所使用的内存进行压缩,这么做的好处是在总内存量一定的前提下我们可以在主机(host)中运行更多的虚拟机客户机。例如,在 KVM 中,客户机所看到的 “物理” 内存相对于主机来说都是匿名内存(以 Linux 内核为例)。此外,基于 Linux 内核中最新支持的 MMU notifier 功能,几乎整个客户机的 “物理” 内存现在都可以被交换,具体可以参考相关的 PDF 介绍

具体实现(Implementation)

All of the individual components are separate kernel modules:

  • LZO compressor: lzo_compress.ko, lzo_decompress.ko (already in mainline)
  • xvMalloc memory allocator: xvmalloc.ko
  • compcache block device driver: ramzswap.ko

compcache 项目(补丁)的所有组件都实现为独立的内核模块(译者注,compcache 补丁随 2.6.33 合入内核主线,但一直呆在 drivers/staging 目录下,直到 3.14 时才正式改名为 zram 并作为稳定的模块被内核所接纳,本文主要是以合入 2.6.33 之前的补丁为例介绍。):

  • LZO 压缩器:lzo_compress.ko,lzo_decompress.ko(已在主线中
  • xvMalloc 内存分配器:xvmalloc.ko (译者注:xvmalloc 和 ramzswap 两个模块的代码位置在 staging 目录 下,)
  • compcache 块设备驱动程序:ramzswap.ko

Once these modules are loaded, one can just enable the ramzswap swap device:

加载这些模块后,可以通过以下方式启用 ramzswap 交换设备:

swapon /dev/ramzswap0

Note that ramzswap cannot be used as a generic block device. It can only handle page-aligned I/O, which is sufficient for use as a swap device. No use case has yet come to light that would justify the effort to make it a generic compressed read-write block device. Also, to minimize block layer overhead, ramzswap uses the “no queue” mode of operation. Thus, it accepts requests directly from the block layer and avoids all overhead due to request queue logic.

请注意,ramzswap 并不能以通用块设备的方式使用。它只能处理满足以页框为单位进行对齐的输入输出操作,但这已足够支持我们把它当作交换设备使用了。至少到目前为止还没有收到任何形式的需求,希望将其实现为一种支持通用可压缩读写的块设备。此外,为了最小化块层(block layer)的开销,ramzswap 使用 “无队列(no queue)” 操作模式。也就是说,它直接接受来自块层的请求,避免了队列请求相关处理带来的开销。

The ramzswap module accepts parameters for “disk” size, memory limit, and backing swap partition. The optional backing swap partition parameter is the physical disk swap partition where ramzswap will forward read/write requests for pages that compress to a size larger than PAGE_SIZE/2 — so we keep only highly compressible pages in memory. Additionally, purely zero filled pages are checked and no memory is allocated for such pages. For “generic” desktop workloads (Firefox, email client, editor, media player etc.), we typically see 4000-5000 zero filled pages.

ramzswap 驱动模块可以接受对 “磁盘” 大小(”disk” size),内存限制(memory limit)和后备交换分区(backing swap partition)等参数值进行设置。所谓后备交换分区,即物理的磁盘交换分区,指定它们的目的是为了使得 ramzswap 可以将那些压缩后大小大于 PAGE_SIZE/2 的内容转储到指定的物理磁盘上,换句话说这么做的结果是 ramzswap 只会在内存中保存压缩率比较高(译者注,压缩后大小不超过原大小 50%)的页。此外,ramzswap 会避免给数据填充值完全为零(zero filled)的页分配实际内存。对于那些 “通用的” 桌面应用(譬如 Firefox,电子邮件客户端,编辑器,媒体播放器等),我们通常会看到这些应用的进程会映射 4000 到 5000 个这样的(zero filled)页。

内存管理(Memory management)

One of the biggest challenges in this project is to manage variable sized compressed chunks. For this, ramzswap uses memory allocator called xvmalloc developed specifically for this project. It has O(1) malloc/free, very low fragmentation (within 10% of ideal in all tests), and can use highmem (useful on 32-bit systems with >1G memory). It exports a non-standard allocator interface:

该项目面临的最大挑战之一是管理可变大小的压缩块。为此,ramzswap 使用了专门为此项目开发的名为 xvmalloc 的内存分配器(译者注,在 compcache 的继任者 zram 中,xvmalloc 也被新的内存分配器 zsmalloc 所替代 )。在内存分配(malloc)和释放(free)上它具有 O(1)的时间复杂度,非常低的内存碎片率(在所有测试中都可以达到理想的 10% 以内),并且可以使用高端内存(highmem,这对于具有大于 1G 内存的 32 位系统很有用)。它提供了一组 非标准 的内存分配接口:

struct xv_pool *xv_create_pool(void);
void xv_destroy_pool(struct xv_pool *pool);

int xv_malloc(struct xv_pool *pool, u32 size, u32 *pagenum, u32 *offset, gfp_t flags);
void xv_free(struct xv_pool *pool, u32 pagenum, u32 offset);

xv_malloc() returns a <pagenum, offset> pair. It is then up to the caller to map this page (with kmap()) to get a valid kernel-space pointer.

xv_malloc() 返回一对 <pagenum,offset> 值。然后由调用者自己(使用 kmap())对该页完成映射以获取有效的内核空间的内存地址。

The justification for the use of a custom memory allocator was provided when the compcache patches were posted to linux-kernel. Both the SLOB and SLUB allocators were found to be unsuitable for use in this project. SLOB targets embedded devices and claims to have good space efficiency. However, it was found to have some major problems: It has O(n) alloc/free behavior and can lead to large amounts of wasted space as detailed in this LKML post.

当 compcache 补丁 被提交到 linux-kernel 邮件列表中时,一并提供了为何要使用自定义内存分配器的理由。原因是发现 SLOB 和 SLUB 分配器都不适合在此项目中使用。SLOB 面向嵌入式设备并 声称 空间利用率较高。然而,它被发现存在一些主要问题:它的内存分配和释放接口的时间复杂度为 O(n),并且正如 这篇 LKML 帖子中详述的那样,SLOB 还会导致大量空间被浪费。

On the other hand, SLUB has different set of problems. The first is the usual fragmentation issue. The data presented here shows that kmalloc uses ~43% more memory than xvmalloc. Another problem is that it depends on allocating higher order pages to reduce fragmentation. This is not acceptable for ramzswap as it is used in tight-memory situations, so higher order allocations are almost guaranteed to fail. The xvmalloc allocator, on the other hand, always allocates zero-order pages when it needs to expand a memory pool.

而 SLUB 则存在另一些不同的问题。第一个是常见的内存碎片问题。此处 数据显示 kmalloc 比 xvmalloc 多使用大约 43% 的内存。另一个问题是它依赖于分配更高阶(higher order,即多页)的页框来减少碎片。这对于 ramzswap 是不可接受的,因为它主要应用于内存紧张的场景,因此几乎可以确定的是对于高阶的内存分配一定会失败。而 xvmalloc 分配器在需要扩展内存池时总是分配零阶(zero-order,即单页)页框。

Also, both SLUB and SLOB are limited to allocating from low memory. This particular limitation is applicable only for 32-bit system with more than 1G of memory. On such systems, neither allocator is able to allocate from the high memory zone. This restriction is not acceptable for the compcache project. Users with such configurations reported memory allocation failures from ramzswap (before xvmalloc was developed) even when plenty of high-memory was available. The xvmalloc allocator, on the other hand, is able to allocate from the high memory region.

此外,SLUB 和 SLOB 都限于从 低端内存(low memory) 分配。此特定限制仅针对具有 1G 以上内存的 32 位系统。在这样的系统上,以上两种分配器都不能从“高端内存区(high memory zone)” 分配。而这些限制对于 compcache 项目是不可接受的。(在开发 xvmalloc 之前)具有此类配置的用户报告说即使系统具有足够的高端内存可用,ramzswap 仍然会内存分配失败。而 xvmalloc 分配器则支持从高端内存区分配。

Considering above points, xvmalloc could potentially replace the SLOB allocator. However, this would involve lot of additional work as xvmalloc provides a non-standard malloc/free interface. Also, xvmalloc is not scalable in its current state (neither is SLOB) and hence cannot be considered as a replacement for SLUB.

考虑到以上几点,xvmalloc 有可能会取代 SLOB 分配器。但是,这将涉及许多额外的工作,因为 xvmalloc 提供了非标准的 malloc/free 接口。此外,当前的 xvmalloc 不支持可扩展性(scalable)(当然 SLOB 也同样不具备),因此还不能完全替代 SLUB。

The memory needed for compressed pages is not pre-allocated; it grows and shrinks on demand. On initialization, ramzswap creates an xvmalloc memory pool. When the pool does not have enough memory to satisfy an allocation request, it grows by allocating single (0-order) pages from kernel page allocator. When an object is freed, xvmalloc merges it with adjacent free blocks in the same page. If the resulting free block size is equal to PAGE_SIZE, i.e. the page no longer contains any object; we release the page back to the kernel.

用于存放压缩页的内存是无法预先分配的;因为它会随着需求变化而增长或者缩小。在初始化时,ramzswap 会创建一个 xvmalloc 内存池。当池中没有足够的内存来满足分配请求时,它会通过向内核的页框分配器(buddy 系统)申请分配单个(零阶)页框来扩展该内存池的大小。当数据被释放时,xvmalloc 会将其所占用的内存区域与同一个页框中的相邻空闲区域进行合并。如果合并后得到的空闲块大小等于 PAGE_SIZE,即得到一个不再包含任何数据的完整页框;则 xvmalloc 会将该页框返回给内核(buddy 系统)。

This allocation and freeing of objects can lead to fragmentation of the ramzswap memory. Consider the case where a lot of objects are freed in a short period of time and, subsequently, there are very few swap write requests. In that case, the xvmalloc pool can end up with a lot of partially filled pages, each containing only a small number of live objects. To handle this case, some sort of xvmalloc memory defragmentation scheme would need to be implemented; this could be done by relocating objects from almost-empty pages to other pages in the xvmalloc pool. However, it should be noted that, practically, after months of use on several desktop machines, waste due to xvmalloc memory fragmentation never exceeded 7%.

这种分配和释放数据对象的方式可能导致 ramzswap 中内存的碎片化。考虑一种场景,系统可能在短时间内释放了大量的数据对象,但此后,swap 的写请求一直不发生。在这种情况下,xvmalloc 的内存池最终可能会充满了很多只有部分填充了数据的页框,每个页框只包含少量的使用中的数据对象。要处理这种情况,需要实现某种针对 xvmalloc 的内存碎片整理方案;这可以通过把那些大部分区域都是空闲的页框中的数据对象重定位到 xvmalloc 池中的其他页框来完成。但是,应该注意的是,实际运行过程中我们发现,在几台桌面计算机上使用数月之后,由于 xvmalloc 产生的内存碎片所造成的浪费从未超过 7%。

Swap 功能中的一些限制和相关工具介绍(Swap limitations and and tools)

Being a block device, ramzswap can never know when a compressed page is no longer required — say, when the owning process has exited. Such stale (compressed) pages simply waste memory. But with recent “swap discard” support, this is no longer as much of a problem. Swap discard sends BIO_RW_DISCARD bio request when it finds a free swap cluster during swap allocation. Although compcache does not get the callback immediately after a page becomes stale, it is still better than just keeping those pages in memory until they are overwritten by another page. Support for the swap discard mechanism was added in compcache-0.5.

作为一种块设备,ramzswap 自身并不会知道何时不再需要维护一个压缩页,比如,当拥有该压缩页框的进程退出时。这些不被使用的(压缩)页框对于内存来说只能是一种浪费。但是随着最近对 “swap discard” 技术的支持,这不再是一个问题。“swap discard” 在 swap 分配内存过程中一旦遇到一个空闲的 swap 块时会发送类型为 BIO_RW_DISCARD 的 bio 请求(译者注,使得那些不被使用的压缩页框会在分配过程中被回收)。虽然这么做并不能保证 compcache 中的页框一旦被废弃后就会触发回调(从而被回收),但这至少比将这些页框留在内存中直到它们被新的分配覆盖要更好。在 compcache-0.5 中添加了对 swap discard 机制的支持。

In general, the discard request comes a long time after a page has become stale. Consider a case where a memory-intensive workload terminates and there is no further swapping activity. In those cases, ramzswap will end up having lots of stale pages. No discard requests will come to ramzswap since no further swap allocations are being done. Once swapping activity starts again, it is expected that discard requests will be received for some of these stale pages. So, to make ramzswap more effective, changes are required in the kernel (not yet done) to scan the swap bitmap more aggressively to find any freed swap clusters — at least in the case of RAM backed swap devices. Also, an adaptive compressed cache resizing policy would be useful — monitor accesses to the compressed cache and move relatively unused pages to a physical swap device. Currently, ramzswap can simply forward uncompressible pages to a backing swap disk, but it cannot swap out memory allocated by xvmalloc.

通常,BIO_RW_DISCARD 请求只会在页框不再被使用后很长一段时间后才会被触发。对于那种内存密集型应用的场景,有可能在应用进程终止后一直不发生 swap 活动。在这种情况下,ramzswap 最终会被大量陈旧的页框所填满。这是因为没有进一步的 swap 内存分配请求,导致 ramzswap 无法触发 BIO_RW_DISCARD 请求。换句话说,只有当 swap 活动再次开始,才会产生 BIO_RW_DISCARD 请求使得这些无用的页框被回收。因此,为了使得 ramzswap 工作更加有效,至少在支持 RAM 方式的 swap 设备时,内核行为需要进行适当的改变(该项改动尚未完成),更加积极地扫描 swap bitmap 从而查找任何可以释放的 swap 页框,此外,采用自适应(adaptive)的方式对压缩缓存大小进行调整也将会非常有用,所谓自适应即通过跟踪对压缩缓存的访问,将相对使用很少的压缩数据移动到物理交换设备上去。目前,ramzswap 可以简单地将不可压缩的页面转储到后备的交换磁盘,但它目前还不能对通过 xvmalloc 分配的内存执行此类转储操作。

Another interesting sub-project is the SwapReplay infrastructure. This tool is meant to easily test memory allocator behavior under actual swapping conditions. It is a kernel module and a set of userspace tools to replay swap events in userspace. The kernel module stacks a pseudo block device (/dev/sr_relay) over a physical swap device. When kernel swaps over this pseudo device, it dumps a <sector number, R/W bit, compress length> tuple to userspace and then forwards the I/O request to the backing swap device (provided as a swap_replay module parameter). This data can then be parsed using a parser library which provides a callback interface for swap events. Clients using this library can provide any action for these events — show compressed length histograms, simulate ramzswap behavior etc. No kernel patching is required for this functionality.

另一个有趣的子项目是 SwapReplay(译者注,下文直译为 swap replay)。开发此工具的目的是为了方便测试实际交换过程中的内存分配器行为。它由一个内核模块和一组用户空间的工具组成,可以在用户空间中回放(replay)交换过程中所发生的事件。该内核模块在一个物理交换设备之上叠加了一个虚拟的块设备(/dev/sr_relay)。当内核通过该虚拟设备执行交换操作时,它会将形如 <扇区号,读/写标志位,压缩长度> 这样的三元组记录转储到用户空间,然后将读写请求转发给实际的物理交换设备(该物理交换设备在加载 swap_replay 模块时作为参数值提供)。转储的数据可以使用解码库来解析,解码函数以回调函数的方式在交换事件中被调用。使用该解码库的客户端程序可以基于这些事件自定义任何操作,譬如显示压缩长度的直方图,模拟 ramzswap 行为等。启用该功能不需要给内核打额外的补丁。

The swap replay infrastructure has been very useful throughout ramzswap development. The ability to replay swap traces allows for easy and consistent simulation of any workload without the need to set it up and run it again and again. So, if a user is suffering from high memory fragmentation under some workloads, he could simply send me swap trace for his workload and I have all the data needed to reproduce the condition on my side — without the need to set up the same workload.

在整个 ramzswap 的开发过程中,该 swap replay 工具非常有用。swap replay 通过回放交换跟踪事件,使得我们可以方便,一致地模拟任何工作场景,而无需搭建复杂的环境一次又一次地尝试复现。打个比方来说,如果一个用户在某些工作负载下发现很高的内存碎片现象,他可以简单地将他工作环境下的交换跟踪信息发送给我,而我则拥有了重现该条件的所有数据,再也无需费劲去搭建相同的复现环境了。

Clients for the parser library were written to simulate ramzswap behavior over traces from a variety of workloads leading to easier evaluation of different memory allocators and, ultimately, development and enhancement of the xvmalloc allocator. In the future, it will also help testing variety of eviction policies to support adaptive compressed cache resizing.

通过编写使用解码库的客户端,基于跟踪的日志我们模拟出各种工作负载场景下 ramzswap 的行为,这使得我们更容易评估不同内存分配器的行为,帮助我们最终开发和优化了 xvmalloc 分配器。将来,它还将帮助我们测试各种回收策略以支持采用自适应技术对压缩缓存大小进行调整。

结论(Conclusion)

The compcache project is currently under active development; some of the additional features planned are: adaptive compression cache resizing, allow swapping of xvmalloc memory to physical swap disk, memory defragmentation by relocating compressed chunks within memory and compressed swapping to disk (4-5 pages swapped out with single disk I/O). Later, it might be extended to compress page-cache pages too (as earlier patches did) — for now, it just includes the ramzswap component to handle anonymous memory compression.

compcache 项目目前正处于积极开发中;计划新增的一些功能包括:自适应压缩缓存大小调整,支持将 xvmalloc 内存交换到物理交换磁盘,通过在内存中重新定位压缩块从而实现内存碎片整理,以及将数据压缩后再交换到磁盘(即在 swap 过程中实现一次向磁盘写出压缩前 4 到 5 个页的内容)。未来,该项目可能会扩展到支持对页缓存(page cache)进行压缩(就像之前的补丁一样),虽然现在它只包含 ramzswap 组件来处理对匿名内存(anonymous memory)的压缩。

Last time the ramzswap patches were submitted for review, only LTSP performance data was provided as a justification for this feature. Andrew Morton was not satisfied with this data. However, now there is a lot more data uploaded to the performance page on the project wiki that shows performance improvements with ramzswap. Andrew also pointed out lack of data for cases where ramzswap can cause performance loss:

We would also be interested in seeing the performance _loss_ from these patches. There must be some cost somewhere. Find a worstish-case test case and run it and include its results in the changelog too, so we better understand the tradeoffs involved here.

上次提交 ramzswap 补丁进行审核时,作为对该功能性能的展示,仅提供了 LTSP 的相关数据。Andrew Morton 对这些数据 并不满意。但是,现在已有更多的数据被上传到项目维基的性能展示网页上,显示出 ramzswap 的性能有所改进。但 Andrew 也指出,有关 ramzswap 是否可能导致性能下降,还缺乏可以证明的数据:

我们对该补丁是否会导致性能损失也有兴趣。加入该补丁后必定会引入一些开销。建议找出一个最坏情况下的测试用例并运行之,同时将其结果包含在修改日志中,这样我们就能更好地理解并衡量其对性能的影响。

The project still lacks data for such cases. However, it should be available by the 2.6.32 time frame, when these patches will be posted again for possible inclusion in mainline.

该项目仍缺乏此类案例的数据。但是,只要该补丁能够在 2.6.32 合并之前完成再次提交并提供以上数据,就有可能被合入主线。(译者注,compcache 补丁随 2.6.33 合入内核主线。)

请点击 LWN 中文翻译计划,了解更多详情。



Read Album:

Read Related:

Read Latest: