LWN 717707: 页交换（swap）的改进计划

Wang Chen 创作于 2019/03/24

请点击 LWN 中文翻译计划，了解更多详情。

原文：The next steps for swap 原创：By corbet @ Mar. 22, 2017 翻译：By unicornx 校对：By Xiaojie Yuan

Swapping has long been an unloved corner of the kernel’s memory-management subsystem. As a general rule, the thinking went, if a system starts swapping the performance battle has already been lost, so there is little reason to try to optimize the performance of swapping itself. The growth of fast solid-state storage devices is changing that calculation, though, making swapping interesting again. At the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Tim Chen led a session in the memory-management track that looked at the ways that swapping performance can be improved.

页交换（Swapping）一直是内核内存管理子系统中不太受人待见的那个模块。大家总是认为，一旦系统开始运行页交换，说明整体性能已经下降到了很糟糕的程度，所以再花太多的精力去优化页交换似乎也就没有什么太大的意义了。可是，随着快速固态存储（solid-state storage）设备的快速普及，这种想法正在发生改变，大家再次对页交换变得感兴趣起来。在 2017 年的 Linux 存储，文件系统和内存管理峰会（Linux Storage, Filesystem, and Memory-Management Summit，简称 LSFMM）上，Tim Chen 在内存管理分会场主持了一次会议，讨论了可以改进页交换性能的各种方法。

Chen has been working on swapping performance for a while; the first set of swap scalability patches has already been merged. His next priority is improving swap readahead performance. This mechanism, which tries to read pages from swap ahead of an anticipated need for them, currently reads pages back in the order in which they were swapped out. This, he noted, is not necessarily the best order and, with mixed access patterns, performance can be poor.

Chen 多年来一直致力于改进页交换的性能；他所提交的第一组提高页交换可扩展性（scalability）的补丁已经被合入内核主线。下一个他将优先处理的事项是提高页交换的预读（readahead）性能（这里所谓的预读，是指通过预测，在某些备份在交换设备上的内容可能被使用之前，将它们提前读取出来）。当前的预读机制按照页框当初被换出（swap out）的顺序将其重新读回内存。Chen 指出，这种处理顺序并不是最好的选择，特别地当换出访问呈现出随机性时，性能反而会变差。

The recently submitted VMA-based swap readahead patches try to improve readahead performance by watching the swap-in behavior of each virtual memory area (VMA). If it appears that memory is being accessed in a serial fashion, the readahead window is increased in the hope of bringing in more pages before they are needed. For random patterns, instead, readahead has little value, so the window is reduced.

最近提交的基于 VMA 的页交换预读补丁尝试通过持续跟踪每个虚拟内存区域（virtual memory area，简称 VMA）的换入（swap-in）行为来提高预读性能。如果看起来内存正在被连续地（在虚拟地址上）读入，则增加预读窗口，以便在实际需要之前提前准备好更多内存页。相反，如果发现读入呈现随机模式，则说明预读并不会带来什么好处，因此就减小预读窗口。

Rik van Riel noted that the current readahead algorithm was designed for rotational media and asked how well the VMA-based mechanism works on such devices. Chen, with visible embarrassment, said that this hasn’t been tried. Van Riel added that, with rotational devices, a group of adjacent blocks can be read as quickly as a single block can, so it makes sense to speculatively read extra data. The same is not true for solid-state storage. So, he suggested, it might make sense for the readahead code to see which type of device is hosting the swap space and change its behavior accordingly.

Tim Chen

Rik van Riel 指出，目前的预读算法是专门针对马达驱动的机械硬盘设计的，为此他特意询问基于 VMA 的预读机制在这类设备上的工作情况如何。略显尴尬的 Chen 回答说，他们还没有在机械硬盘上测试过（译者注，从他提交的测试报告上可以看出，其测试使用的 Swap 设备是 NVMe disk）。Van Riel 补充说，在传统的机械硬盘上，可以像读取单个块一样快速地读取一组相邻的块，因此（按照原先的处理方式）预读额外数据是有意义的。但这对于固态存储设备情况却不一样（译者注，固态存储设备本身读取和存放数据就是随机的）。因此，他建议，对于预读代码来说，最好通过判断交换空间位于哪种类型的设备上并相应地改变其行为。（译者注，基于 VMA 的页交换预读补丁随 4.14 合入内核主线，由于该方法对于传统的硬盘会引入额外的开销，所以内核提供了一个运行期配置 /sys/kernel/mm/swap/vma_ra_enabled 来手动使能该特性。具体参考修改提交 “mm, swap: VMA based swap readahead” 和 “mm, swap: add sysfs interface for VMA based swap readahead”。）

Matthew Wilcox, instead, said that the real problem might be at swap-out time. Pages are swapped based on their position on the least-recently-used (LRU) lists, which may not reflect the order in which they will be needed again. He said that, perhaps, writes to swap could be buffered; swapped pages would go into a “victim cache” and sorted before being written to storage. The value of this approach wasn’t clear to everybody in the room, though, given that access patterns can change over time.

相对于预读专注于提高换入（swap-in）的效率，Matthew Wilcox 认为，真正影响效率的问题应该是出在换出（swap-out）上。内核按照页框在最近最少使用（least-recently-used，简称 LRU）链表上的顺序被依次换出到交换设备上，但这个顺序和它们被换入时的顺序却可能并不一致。他说，也许一种可行的方法是对页框上的数据在换出时执行缓冲处理；具体做法是在实际写入交换设备之前将需要交换的页数据先写入一个缓存区（cache）并在其中进行排序。然而，考虑到访问模式随时在变化，在会议上大家对这种方法是否会带来好处并不是很确定。

The next subject was the swapping of transparent huge pages. Currently, the first step is to split those pages into their component single pages, then to write those to swap individually — not the most efficient way to go about things. Chen and company would like to improve this behavior in a few steps, the first of which is to delay the splitting of the page until space has been allocated in the swap area. That should result in the allocation of a single cluster of pages for the entire huge page, at which point the whole thing can be written in a single operation. Patches implementing this change have been submitted; they result in a 14% swap-out performance increase.

会议的下一个讨论主题是有关透明巨页（transparent huge pages）的交换处理。目前，该操作的第一步是首先将这些巨页拆分为单个的页框，然后再将这些页框分别换出，显然这么做并不是最有效的方式。Chen 和相关开发人员计划分以下几个步骤来逐步改进当前的行为，第一步是推迟对巨页的拆分，直到在交换设备中分配了相应空间后再执行拆分。这么做的好处是可以为整个巨页分配一个连续的交换备份空间，而且可以一次性将巨页上的所有数据都换出。基于该设计而实现的补丁已提交；利用该补丁换出（swap-out）性能可以提升 14%。（译者注，该补丁随 4.13 合入内核主线。具体参考修改提交 “mm, THP, swap: delay splitting THP during swap out”。）

The next step is to delay the splitting of huge pages further, until the swap-out operation is finished. Those patches are in development; benchmarking shows that they result in a 37% improvement in swap-out performance.

改进的第二步是进一步推迟对巨页的拆分，推迟到直到换出操作完成。相关补丁还在开发中；基准测试表明，应用该补丁后换出性能提高了 37%。（译者注，第二步的修改随 4.14 合入内核主线。具体参考修改提交 “mm, THP, swap: delay splitting THP after swapped out”。）

Finally, it would be nice to be able to swap huge pages directly back in. This idea needs more thought, he said. It is not always a performance win; if the application only needs a couple of small pages of data, there is no point in bringing in the whole huge page. One possible heuristic could be to only swap in huge pages for memory regions marked with MADV_HUGEPAGE or which have a large readahead window.

改进的最终目标是能够直接交换巨页（即不拆分）。对这个想法还需要更多考量，他说。这么做并不总是能够在性能上获得收益；如果一个应用程序的数据只占用几个小的页框，那么换入整个巨页就不划算了。根据经验来看，仅当一个内存区域被标记为 MADV_HUGEPAGE 或具有大的预读窗口时，适合对其启用巨页交换。

There was a bit of discussion on how to justify the inclusion of these patches once they are ready. The best motivator is good benchmark results. It was suggested that Linus Torvalds is less likely to block the patches if they do not slow down kernel builds. Michal Hocko said that the patches were interesting, but that they were optimizing a rare event; the current code assumes that we don’t ever want to swap. But Johannes Weiner said that the swap-out changes, at least, make a lot of sense; batching operations by keeping huge pages together will speed things up.

会上还就在这些补丁完成后如何可以顺利地被内核主线所接受展开了一些讨论。大家认为最好的方法就是提供表现良好的基准测试结果。总之，只要补丁不会导致内核性能降低，Linus Torvalds 一般是不会拒绝该补丁的。Michal Hocko 表示对这些补丁很感兴趣，但同时认为存在的一个问题是，它们所试图优化的只是一个不那么常见的场景；因为当前的代码假定我们并不期望页交换会发生。然而 Johannes Weiner 表示，至少针对换出（swap-out）的改进在很大程度上还是有意义的；通过将巨页保持在一起并进行批量操作将提高执行的速度。

The next topic was the use of the DAX direct-access mechanism with swapped data. If swapping is done to a persistent memory array, the data can still be accessed directly without the need to read it back into RAM. There is “an almost-working prototype” that does this, Chen said. The hard part is deciding when it makes sense to bring pages back into RAM; memory that will be frequently accessed, especially if the accesses are writes, is better read back in.

接下来的议题是有关使用 DAX 直接访问（direct-access）机制来处理交换数据。如果交换中使用的是持久性存储（persistent memory）设备，则可以直接访问设备上的数据，而无需将其读回内存。Chen 说，对此已经开发了一个 “几乎可以工作的原型”。但其中最困难的部分是如何决定什么时候将数据读回内存；内存中应该存放经常被访问的内容，特别地如果该部分内容会被更新，则最好要将其读进内存进行处理。

Wilcox said that the decision really depends on the performance difference between dynamic RAM and persistent memory on the system in question; in some cases, the right answer might be “never”. Sometimes, for example, the “persistent-memory array” is actually dynamic RAM hosted in a hypervisor. There was some talk of using the system’s performance-monitoring unit (PMU) to track page accesses, but that idea didn’t get far. Developers prefer that the kernel not take over the PMU, the runtime cost is high, and the results are not always all that useful.

Wilcox 表示，这完全取决于实际系统上主存和持久存储设备之间的性能差异；在某些情况下，正确的答案可能是 “不需要（将数据读回内存）”。譬如，在一些虚拟机环境下，所谓 “持久存储器阵列（persistent-memory array）” 实际上就是主机的动态内存而已。有人提出使用系统的性能监控单元（performance-monitoring unit，简称 PMU）来跟踪对内存页的访问，但这个想法并没有获得大家的赞同。开发人员更倾向于内核不操作 PMU，因为这么做的话运行时的开销太大，而效果也不总是那么好。

After some discussion, the conclusion reached was that the kernel should just bring a random set of pages back into RAM occasionally. With luck, the frequently used pages will stay there, while the rest will age back out to swap.

经过一番讨论后得出的结论是，内核仅需要偶尔将一组随机的页数据读回内存。幸运的话，经常使用的页数据将被保留在内存中，其余的数据将因为不经常被访问而最终被换出。

Finally, there was a brief discussion of further optimizing the swap-device locking, which still sees significant contention even after the recent scalability improvements. So there is some interest in using lock elision toward this end.

最后，简要讨论了进一步优化针对交换设备的锁（locking）的问题，即使在最近针对扩展性进行改进之后，锁的竞争问题依旧严峻。为此，大家觉得可以使用 lock elision 机制来对该问题加以改进。

请点击 LWN 中文翻译计划，了解更多详情。