LWN 158211: 避免内存碎片化（fragmentation avoidance）

Wang Chen 创作于 2018/11/14

原文：Fragmentation avoidance 原创：By corbet @ Nov. 2, 2005 翻译：By unicornx 校对：By simowce

Mel Gorman’s fragmentation avoidance patches were covered here last February. This patch set divides all memory allocations into three categories: “user reclaimable,” “kernel reclaimable,” and “kernel non-reclaimable.” The idea to support multi-page contiguous allocations by grouping reclaimable allocations together. If no contiguous memory ranges are available, one can be created by forcing out reclaimable pages. Since non-reclaimable pages have been segregated into their own area, the chances of such a page blocking the creation of a contiguous set of free pages is relatively small.

今年二月份，我们给大家介绍了 Mel Gorman 的 “避免碎片化”（fragmentation avoidance，译者注，这个补丁的名字下文直接使用，不再翻译为中文）补丁。该补丁将所有内存分为三类：“用户态可回收”（”user reclaimable”），“内核态可回收”（”kernel reclaimable”）和“内核态不可回收”（”kernel non-reclaimable”）。其想法是通过将可回收的物理页框组织在一起来支持多页的连续内存分配。当缺少连续的物理内存区域时，内核可以通过强制换出可回收的页框来拼凑出一个满足要求的内存块。由于不可回收的页框已被隔离到其他独立的区域中，所以那种因为其中夹杂着不可换出页框而导致无法形成连续多个空闲页框的内存块的概率也会减小。

Mel recently posted version 19 of the fragmentation avoidance patch and requested that it be included in the -mm kernel. That request started a lengthy discussion on whether this patch set is a good idea or not. There is, it seems, a fair amount of uncertainty over whether this code belongs in the kernel. There are a few reasons for wanting fragmentation avoidance, and the arguments differ for each of them.

Mel 最近提交了 fragmentation avoidance 补丁的的第 19 个版本，并请求将它合入 -mm 内核仓库中。该请求引发了社区中有关该补丁是否真有必要的激烈讨论。看起来，这个补丁是否最终会被内核主线所接纳，还存在相当大的不确定性。下面列举了一些为内核开发这个补丁的原因，以及针对这些需求的不同观点。

The first of these reasons is to increase the probability of high-order (multi-page) allocations in the kernel. Nobody denies that Mel’s patch achieves that goal, but there are developers who claim that a better approach is to simply eliminate any such allocations. In fact, most multi-page allocations were dealt with some time ago. A few remain, however, including the two-page kernel stacks still used by default on most systems. When the kernel stack allocation fails, it blocks the creation of a new process. The kernel may eventually move to single-page stacks in all situations, but a few higher-order allocations will remain. It is not always possible to break required memory into single-page chunks.

第一个原因自然是为了增加内核中 “高阶”（high-order，即多个连续页框）内存分配的成功概率。Mel 的补丁的确有助于实现这个目标，但有些开发者声称，其实最好的方法就是避免分配任何形式的 high-order 内存。事实上，大多数此类内存分配在不久前都被清除了。但仍存在一些地方，譬如大多数系统上目前仍然默认使用两页大小的内核栈。一旦内核栈分配失败，它会阻止创建新的进程。虽然最终内核会全面转变为使用单页栈，但总是会有其他一些地方会申请分配 higher-order 内存。并不是所有的应用都允许使用分离的单页内存块（译者注，即并不是所有情况都可以采用虚拟地址连续而物理地址离散的方式，譬如一些 DMA 传输等要求物理地址必须连续）。

The next reason, strongly related to the first, is huge pages. The huge page mechanism is used to improve performance for certain applications on large systems; there are few users currently, but that could change if huge pages were easier to work with. Huge pages cannot be allocated for applications in the absence of a large - and suitably aligned - region of contiguous memory. In practice, they are very difficult to create on systems which have been running for any period of time. Failure to allocate a huge page is relatively benign; the application simply has to get by with regular pages and take the performance hit. But, given that you have a huge page mechanism, making it work more reliably would be worthwhile.

下一个原因，和前一个原因有很强的关联，就是有关巨页（huge pages）。巨页机制用于提高大型机系统上某些应用程序的性能；当前利用该机制的用户还很少，但如果随着该特性越来越易于使用，情形可能会发生变化。在没有大块且适当对齐的连续内存区域的情况下，是无法为应用程序分配巨页的。实际上，在一个已经运行了一段时间的系统上它们也很难被创建。虽然，分配不了巨页并不会导致很严重的后果；应用程序仍然可以使用常规页框，无非在性能上受到一定的损失。但是，鉴于内核已经支持了巨页功能，如果能够使其更可靠地工作岂不是更好。

The fragmentation avoidance patches can help with both high-order allocations and huge pages. There is some debate over whether it is the right solution to the problem, however. The often-discussed alternative would be to create one or more new memory zones set aside for reclaimable memory. This approach would make use of the zone system already built into the kernel, thus avoiding the creation of a new layer. A zone-based system might also avoid the perceived (though somewhat unproven) performance impact of the fragmentation avoidance patches. Given that this impact is said to be felt in that most crucial of workloads - kernel compiles - this argument tends to resonate with the kernel developers.

fragmentation avoidance 补丁的确有助于改进 high-order 内存分配问题和巨页问题。然而，关于它是否就是正确的解决方案这个问题上仍然存在争议。经常讨论的替代方案是新建一个或多个专用于可回收（reclaimable）内存的域（zone）。该方案利用内核中内建的对域（zone）的支持，从而避免在原有伙伴系统中引入新的设计层次（译者注，指 fragmentation avoidance 补丁中进一步把空闲页按照是否可以回收分为三类）。基于域（zone）的方法或许还可以避免 fragmentation avoidance 补丁可能引入的性能影响（虽然还未经证实）。由于据称该影响体现在大家最看重的内核编译过程中，可想而知如果真的是这样很容易招致内核开发人员的抱怨。

The zone-based approach is not without problems, however. Memory zones, currently, are static; as a result, somebody would have to decide how to divide memory between the reclaimable and non-reclaimable zones. This adjustment looks like it would be hard to get right in any sort of reliable way. In the past, the zone system has also been the source of a number of performance problems, mostly related to balancing of allocations between the zones. Increasing the complexity of the zone system and adding more zones could well bring those problems back.

但基于域（zone）的方法也并非完美。目前，内存域（zone）是静态分配的；因此，依赖于人工来决定划分可回收域和不可回收域之间的比例。这种调整目前并没有什么可靠的依据。在过去，内核对域（zone）的支持带来了许多性能上的问题，主要和平衡各个域之间的比例有关。增加更多的域（zone），使得域的实现更复杂，也只会让以上问题变得更加突出。

There is another motivation for fragmentation avoidance which brings a different set of constraints: support for hot-pluggable memory. This feature is useful on high-availability systems, but it is also heavily used in association with virtualization. A host running a number of virtualized Linux instances can, by way of the hotplug mechanism, shift its memory resources between those instances in response to the demands of each.

内核引入 fragmentation avoidance 补丁还有另外一个好处，就是由于它可以把不同类型的内存限制在不同的区域中，而这有助于内核的另一个特性，即内存热插拔（hot-pluggable memory）的实现。内存热插拔功能对于高可靠性（high-availability）系统很有用，同时在虚拟化技术中也被大量使用。运行多个虚拟化 Linux 实例（instance）的主机（host）可以通过内存热插拔机制在这些实例之间迁移其内存以响应每个实例对资源的请求。

Before memory can be removed from a running system, its contents must be moved elsewhere - at least, if one wants to still have a running system afterward. The fragmentation avoidance patches can help by putting only reclaimable allocations in the parts of memory which might be removed. As long as all the pages in a region can be reclaimed, that region is removable.

当我们需要在一个当前正在运行的系统上移除物理内存的时候，如果希望移除后当前系统还可以正常工作，则至少需要确保将被移除的内存上存储的内容移动到其他地方保存起来。fragmentation avoidance 补丁的优势恰好在于它可以确保将要被移除的内存所对应的页框都是可回收的。换句话说，只要某个内存区域上的页框都可以被回收（译者注，即和上段所述的那样，这些页框上的内容都可以被移动到别的地方），则这个内存区域就可以被安全地移除。

A very different argument has surfaced here: Ingo Molnar is insisting that any mechanism claiming to support hot-pluggable memory be able to provide a 100% success rate. The current code need not live up to that metric, but there needs to be a clear path toward that goal. Otherwise, the kernel developers risk advertising a feature which they may not ever be able to support in a reliable way. The backers of fragmentation avoidance would like to merge the patches, solving 90% of the problem, and leave the other 90% for later. Ingo, instead, fears that second 90%, and wants to know how it will get done.

但也有人对此提出了质疑：Ingo Molnar 坚称任何一种方案（译者注，这里应该就是针对 fragmentation avoidance 补丁）对于内存热插拔的支持必须要确保 100% 的成功率。虽然当前的代码还无法满足这一点，但我们需要明确基于该方案最终是可以完全支持该目标的。否则的话，我们现在的所有努力都是在冒险（即很有可能到最后发现无法实现目标而失败）。Mel 希望将这个补丁合入内核，并宣称已经解决了 90% 的问题，这是可以理解的，但谁又能保证剩下的 10% 不会又发展成另一个 90% 呢？所以，Ingo 所担心的是，如果我们无法明确地保证该方案可以 100% 地支持内存热插拔，那补丁会不会越改越多，总之他希望得到一个明确的答复。

Why can’t the current patches offer 100% reliability if they only put reclaimable memory in hot-pluggable regions? There are ways to lock down pages which were once reclaimable; these include DMA operations and pages explicitly locked by user space. There is also the issue of what happens when the kernel runs out of non-reclaimable memory. Rather than fail a non-reclaimable allocation attempt, the kernel will allocate a page from the reclaimable region. This fallback is necessary to avoid inflicting reliability problems on the rest of the kernel. But the presence of a non-reclaimable page in a reclaimable region will prevent the system from vacating that region.

为什么当前的 fragmentation avoidance 补丁还不能 100% 地确保热插拔区域中的内存都是可回收的呢？首先，一些原本可以回收的页框可能会被锁定（lock down）变成不可回收的；譬如由于 DMA 操作或是一些来自用户空间的明确操作锁定了该页框。其次，当内核耗尽不可回收的内存时，也会出现问题，因为内核会尝试从可回收区域中分配内存页框，而不是简单地拒绝对不可回收类型的内存分配请求。这种退而求其次（fallback）的做法是必要的，避免了在一些关键应用中内存分配失败影响可靠性。但是很显然，在可回收区域中一旦出现不可回收的物理页将阻止系统回收整片区域。

This problem can be solved by getting rid of non-reclaimable allocations altogether. And that can be done by changing how the kernel’s address space works. Currently, the kernel runs in a single, contiguous virtual address space which is mapped directly onto physical memory - often using a single, large page table entry. (The vmalloc() region is a special exception, but it is not an issue here). If the kernel were, instead, to use normal-sized pages like the rest of the system, its memory would no longer need to be physically contiguous. Then, if a kernel page gets in the way, it can simply be moved to a more convenient location.

要解决这个问题很简单，就是所有的内存都必须是可回收的。但这需要通过改变内核态地址空间的工作方式来完成。目前，内核代码在单个连续的虚拟地址空间中运行，该空间直接映射到物理内存，通常这需要使用一个很大的页表项来实现映射。（vmalloc() 所涉及的区域是一个特殊的例外，但这不涉及这里讨论的问题）。如果内核像系统其他部分那样使用正常大小的页面（译者注，即像用户态空间那样，采用页表动态映射），那么就不需要为其固定保留物理上连续的页框（译者注，即不可回收）。那么，一旦内核使用的某个页面需要被移除，我们就可以将其上的内容简单地移动到其他更方便的地方去。

Beyond the fact that this approach fundamentally changes the kernel’s memory model, there are a couple of little issues with it. There would be a performance hit caused by the higher translation buffer use, and an increase in the amount of memory needed to store the kernel’s page tables. Certain kernel operations - DMA in particular - cannot tolerate physical addresses which might change at arbitrary times. So there would have to be a new API where drivers could request physically-nailed regions - and be told by the kernel to give them up. In other words, breaking up the kernel’s address space opens a substantial barrel of worms. It is not the sort of change which would be accepted in the absence of a fairly strong motivation, and it is not clear that hot-pluggable memory is a sufficiently compelling cause.

且不论这么做会从根本上改变内核的内存模型，光它带来的一些小问题就够我们头疼的了。首先，更频繁地对转换缓冲区（translation buffer）的使用会导致性能上的损失；其次，用于存储内核页表所需的内存量也会增加。某些内核操作，特别是有关 DMA 的操作，是不能容忍物理地址随时发生变化的。因此，还必须提供一个新的 API，便于驱动程序可以请求一片固定的物理区域，并由内核告知驱动何时释放它们。总而言之，一旦对内核地址空间的操作方式进行改变会带来很多我们意想不到的问题。在没有特别强烈的需求背景下，这种变化是不会被接受的，况且目前还尚不清楚内存热插拔本身是否是一个迫切的需求。

So no conclusions have been reached on the inclusion of the fragmentation avoidance patches. In the short term, Andrew Morton’s controversy avoidance mechanisms are likely to keep the patch out of the -mm tree, however. But there are legitimate reasons for wanting this capability in the kernel, and the issue is unlikely to go away. Unless somebody comes up with a better solution, it could be hard to keep Mel’s patch out forever.

总之，对于内核是否要合入这个 fragmentation avoidance 补丁，目前还没有任何结论。短期内，依据 Andrew Morton 的处理原则，只要还有人反对就不会轻易将该补丁合入 -mm 代码仓库。但是我们真的有那么多的理由（译者注，如前文列举的那些）为内核添加这种功能啊，所以这个问题不太可能就此结束。除非有人提出更好的解决方案，我相信 Mel 的补丁迟早还是会被合入内核主线的（译者注，百转千折，此文发表两年多后，Mel 的 fragmentation avoidance 补丁随 2.6.24 版本合入主线）。