LWN 684611: The contiguous memory allocator (CMA) and compaction
Original: CMA and compaction, by corbet, Apr. 23, 2016. Translated by unicornx of TinyLab.org; proofread by Fan Xin.
The nice thing about virtual-memory systems is that the physical placement of memory does not matter — most of the time. There are situations, though, where physically contiguous memory is needed; operating systems often struggle to satisfy that need. At the 2016 Linux Storage, Filesystem, and Memory-Management (LSFMM) Summit, two brief sessions discussed issues relating to a pair of techniques used to ensure access to physically contiguous memory: the contiguous memory allocator (CMA) and compaction.
CMA troubles
…
(Translator's note: only the compaction discussion is of interest here for now, so the CMA-related content has not been translated.)
Compaction
“Compaction” is the process of shifting pages of memory around to create contiguous areas of free memory. It helps the system’s ability to satisfy higher-order allocations, and is crucial for the proper functioning of the transparent huge pages (THP) mechanism. (Translator's note: “higher-order” is buddy-allocator terminology for a block of more than one contiguous page frame, the page count being a power of two.) Vlastimil Babka started off the session on compaction by noting that it is not invoked by default for THP allocations, making those allocations harder to satisfy. That led to some discussion of just where compaction should be done.
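To make “higher-order” concrete, here is a minimal sketch (an illustration added for this translation, not code from the article) of a kernel-side request for a physically contiguous block; with GFP_KERNEL the allocator is permitted to reclaim and compact memory in order to assemble it:

```c
/*
 * Minimal sketch of a higher-order allocation (illustration only, not code
 * from the article).  alloc_pages() asks the buddy allocator for 2^order
 * physically contiguous page frames; GFP_KERNEL allows reclaim and
 * compaction to run, while __GFP_NORETRY asks the allocator not to try
 * too hard.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

static void *grab_contiguous_buffer(unsigned int order)
{
	/* order 9 = 512 base pages = one "huge" page with 4KB base pages */
	struct page *pages = alloc_pages(GFP_KERNEL | __GFP_NORETRY, order);

	if (!pages)
		return NULL;		/* no contiguous run could be assembled */
	return page_address(pages);	/* free later with __free_pages(pages, order) */
}
```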
One option is the khugepaged thread, whose job is to collapse sets of small pages into huge pages. It might do some compaction on its own, but it can be disabled, which would disable compaction as well. Thus, khugepaged cannot guarantee that background compaction will be done. The kswapd thread is another possibility, but Rik van Riel pointed out that it tends to be slow for this purpose, and it can get stuck in a shrinker somewhere waiting for a lock. Another possibility, perhaps the best one, is a separate kcompactd thread dedicated to this particular task. (Translator's note: a shrinker is a kernel memory-reclaim mechanism invoked from the reclaim work driven by kswapd; kcompactd was merged into the mainline kernel in 4.6.)
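To picture the "dedicated thread" option, the following is a simplified sketch of how a per-node background compaction thread such as kcompactd can be structured. It is an illustration only, not the real mm/compaction.c code; do_node_compaction() is a hypothetical stand-in for the actual zone-by-zone work.

```c
/*
 * Simplified sketch of a dedicated background-compaction thread.  This is an
 * illustration only, not the real kcompactd implementation; the
 * do_node_compaction() stub stands in for the actual zone-by-zone work.
 */
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/freezer.h>

static DECLARE_WAIT_QUEUE_HEAD(kcompactd_wait);
static bool compaction_requested;

static void do_node_compaction(void *node_data)
{
	/* walk the node's zones, migrating movable pages toward one end */
}

static int kcompactd_thread(void *node_data)
{
	while (!kthread_should_stop()) {
		/* sleep until the allocator asks for background compaction */
		wait_event_freezable(kcompactd_wait,
				     compaction_requested ||
				     kthread_should_stop());
		compaction_requested = false;
		do_node_compaction(node_data);
	}
	return 0;
}

/*
 * Allocator side, after a failed high-order allocation:
 *     compaction_requested = true;
 *     wake_up(&kcompactd_wait);
 */
```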
Michal Hocko said that he ran into compaction problems while working on the out-of-memory detection problem. He found that the compaction code is hard to get useful feedback from; it “does random things and returns random information.” It has no notion of costly allocations, and makes decisions that are hard to understand.
Part of the problem, he said, is that compaction was implemented for the THP problem and is focused a little too strongly there. THP requires order-9 (i.e. “huge”) pages; if the compaction code cannot create such a page in a given area, it just gives up. (Translator's note: an order-9 block contains 2^9 = 512 contiguous page frames, which is why such pages are called “huge”.) The system needs contiguous allocations of smaller sizes, down to the order-2 (four-page) allocations needed for fork() to work, but the compaction code doesn’t care about creating contiguous chunks of that size. A similar problem comes from the “skip” bits used to mark blocks of memory that have proved resistant to compaction. They are an optimization meant to head off fruitless attempts at compaction, but they also prevent successful, smaller-scale compaction. Hacking the compaction code to ignore the skip bits leads to better results overall.
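For a sense of scale, the order-2 case is what fork() needs for a new kernel stack on common configurations; the sketch below assumes 4KB pages and a 16KB kernel stack and is not the actual fork() code:

```c
/*
 * Sketch of the order-2 case: with 4KB pages and a 16KB kernel stack
 * (a common configuration, assumed here), creating a new process needs
 * four physically contiguous page frames.  Illustration only; the real
 * fork() path uses its own stack-allocation helpers.
 */
#include <linux/gfp.h>

#define STACK_ALLOC_ORDER 2	/* 2^2 = 4 pages = 16KB */

static unsigned long alloc_kernel_stack_sketch(void)
{
	/* fails if no order-2 block exists and compaction cannot make one */
	return __get_free_pages(GFP_KERNEL, STACK_ALLOC_ORDER);
}

/* release with: free_pages(stack, STACK_ALLOC_ORDER); */
```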
Along the same lines, compaction doesn’t even try with page blocks that hold unmovable allocations. As Mel pointed out, that was the right decision for THP, since a huge page cannot be constructed from such a block, but it’s the wrong thing to do for smaller allocations. It might be better, he said, for the compaction code to just scan all of memory and do the best it can.
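Mel's point boils down to a check of a pageblock's migrate type before bothering to scan it; a rough sketch of that decision follows, with a hypothetical "thorough" knob standing in for the relaxed scan-everything behavior he describes:

```c
/*
 * Rough sketch of the "should compaction bother with this pageblock?"
 * decision.  Illustration only: the "thorough" knob is hypothetical,
 * standing in for the relaxed behavior discussed above.
 */
#include <linux/mm.h>
#include <linux/mmzone.h>

static bool should_scan_pageblock(struct page *page, bool thorough)
{
	int mt = get_pageblock_migratetype(page);

	if (thorough)
		return true;	/* smaller free runs may still be made here */

	/* THP-oriented behavior: skip blocks pinned by unmovable allocations */
	return mt != MIGRATE_UNMOVABLE;
}
```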
There was some talk of adding flexibility to the compaction code so that it will be better suited for more use cases. If the system is trying to obtain huge pages for THP, compaction should not try too hard or do anything too expensive. But if there is a need for order-2 blocks to keep things running, compaction should try a lot harder. One option here would be to have a set of flags describing what the compaction code is allowed to do, much like the “GFP flags” used for memory allocation requests. The alternative, which seemed to be more popular, is to have a single “priority” level controlling compaction behavior.
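The two interface styles being weighed can be pictured as follows; the names are illustrative only, not an actual kernel API:

```c
/*
 * Sketch of the two interface styles discussed above.  The names are
 * illustrative only, not an existing kernel API.
 */

/* Option 1: GFP-style flags describing what compaction is allowed to do */
#define COMPACT_MAY_WAIT_WRITEBACK	0x01	/* may wait for dirty pages       */
#define COMPACT_MAY_MIGRATE_MAPPED	0x02	/* may migrate mapped pages       */
#define COMPACT_MAY_IGNORE_SKIP		0x04	/* may ignore per-block skip bits */

/* Option 2: a single priority level; a higher priority tries harder */
enum compact_priority_sketch {
	COMPACT_PRIO_OPPORTUNISTIC,	/* THP faults: stay cheap, give up early */
	COMPACT_PRIO_BACKGROUND,	/* kcompactd-style housekeeping          */
	COMPACT_PRIO_CRITICAL,		/* e.g. order-2 needed for fork()        */
};
```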
The final topic of discussion was the process of finding target pages when compaction decides to migrate a page that is in the way. The current compaction code works from both ends of a range of memory toward the middle, trying to accumulate free pages at one end by migrating pages to the other end. But it seems that, in some settings, scanning for the target pages takes too long; it was suggested that, maybe, those pages should just come from the free list instead. Mel worried, though, that such a scheme could result in two threads doing compaction just moving the same pages back and forth; the two-scanner approach was designed to avoid that. There was some talk of marking specific blocks as migration targets, but it is not clear that work in this area will be pursued. (Translator's note: in the current code the two scanners work within each memory zone, and the suggested targets would come from the buddy allocator's free lists.)
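The two-scanner scheme the current code uses can be summarized roughly as follows; this is an illustrative sketch with hypothetical helpers, not the real isolation and migration loop:

```c
/*
 * Illustrative sketch of the current two-scanner design (hypothetical
 * helpers, not the real mm/compaction.c loop): a migration scanner walks up
 * from the low end of a range looking for used movable pages, a free scanner
 * walks down from the high end collecting free pages to use as targets, and
 * the pass ends where the two scanners meet.  Because targets only come from
 * the high end, two compactors cannot ping-pong the same pages back and forth.
 */
#include <linux/types.h>

/* hypothetical stand-ins for the real page-state and migration helpers */
static bool pfn_is_used_movable(unsigned long pfn) { return false; }
static bool pfn_is_free(unsigned long pfn) { return false; }
static void migrate_pfn_to(unsigned long src, unsigned long dst) { }

static void compact_range_sketch(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long scan_pfn = start_pfn;	/* migration scanner: upward   */
	unsigned long free_pfn = end_pfn - 1;	/* free scanner: downward      */

	while (scan_pfn < free_pfn) {
		if (!pfn_is_used_movable(scan_pfn)) {
			scan_pfn++;
			continue;
		}
		/* pull the free scanner down until it finds a target page */
		while (free_pfn > scan_pfn && !pfn_is_free(free_pfn))
			free_pfn--;
		if (free_pfn <= scan_pfn)
			break;		/* scanners met: this pass is done */
		migrate_pfn_to(scan_pfn, free_pfn);
		scan_pfn++;
		free_pfn--;
	}
}
```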