LWN 684611: The contiguous memory allocator (CMA) and compaction
Original: CMA and compaction, by corbet, Apr. 23, 2016. Translated by unicornx of TinyLab.org; proofread by Fan Xin.
The nice thing about virtual-memory systems is that the physical placement of memory does not matter — most of the time. There are situations, though, where physically contiguous memory is needed; operating systems often struggle to satisfy that need. At the 2016 Linux Storage, Filesystem, and Memory-Management (LSFMM) Summit, two brief sessions discussed issues relating to a pair of techniques used to ensure access to physically contiguous memory: the contiguous memory allocator (CMA) and compaction.
CMA troubles
…
(Translator's note: only the compaction discussion is of interest here for now, so the CMA-related content has not been translated.)
Compaction
“Compaction” is the process of shifting pages of memory around to create contiguous areas of free memory. It helps the system’s ability to satisfy higher-order allocations, and is crucial for the proper functioning of the transparent huge pages (THP) mechanism. (Translator's note: “higher-order” is buddy-allocator terminology for a block of more than one contiguous page frame, the page count being a power of two.) Vlastimil Babka started off the session on compaction by noting that it is not invoked by default for THP allocations, making those allocations harder to satisfy. That led to some discussion of just where compaction should be done.
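To make “higher-order” concrete, here is a minimal sketch (an illustration added for this translation, not code from the article) of a kernel-side request for a physically contiguous block; with GFP_KERNEL the allocator is permitted to reclaim and compact memory in order to assemble it:

```c
/*
 * Minimal sketch of a higher-order allocation (illustration only, not code
 * from the article).  alloc_pages() asks the buddy allocator for 2^order
 * physically contiguous page frames; GFP_KERNEL allows reclaim and
 * compaction to run, while __GFP_NORETRY asks the allocator not to try
 * too hard.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

static void *grab_contiguous_buffer(unsigned int order)
{
	/* order 9 = 512 base pages = one "huge" page with 4KB base pages */
	struct page *pages = alloc_pages(GFP_KERNEL | __GFP_NORETRY, order);

	if (!pages)
		return NULL;		/* no contiguous run could be assembled */
	return page_address(pages);	/* free later with __free_pages(pages, order) */
}
```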
One option is the khugepaged thread, whose job is to collapse sets of small pages into huge pages. It might do some compaction on its own, but it can be disabled, which would disable compaction as well. Thus, khugepaged cannot guarantee that background compaction will be done. The kswapd thread is another possibility, but Rik van Riel pointed out that it tends to be slow for this purpose, and it can get stuck in a shrinker somewhere waiting for a lock. Another possibility, perhaps the best one, is a separate kcompactd thread dedicated to this particular task. (Translator's note: a shrinker is a kernel memory-reclaim mechanism invoked from the reclaim work driven by kswapd; kcompactd was merged into the mainline kernel in 4.6.)
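To picture the "dedicated thread" option, the following is a simplified sketch of how a per-node background compaction thread such as kcompactd can be structured. It is an illustration only, not the real mm/compaction.c code; do_node_compaction() is a hypothetical stand-in for the actual zone-by-zone work.

```c
/*
 * Simplified sketch of a dedicated background-compaction thread.  This is an
 * illustration only, not the real kcompactd implementation; the
 * do_node_compaction() stub stands in for the actual zone-by-zone work.
 */
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/freezer.h>

static DECLARE_WAIT_QUEUE_HEAD(kcompactd_wait);
static bool compaction_requested;

static void do_node_compaction(void *node_data)
{
	/* walk the node's zones, migrating movable pages toward one end */
}

static int kcompactd_thread(void *node_data)
{
	while (!kthread_should_stop()) {
		/* sleep until the allocator asks for background compaction */
		wait_event_freezable(kcompactd_wait,
				     compaction_requested ||
				     kthread_should_stop());
		compaction_requested = false;
		do_node_compaction(node_data);
	}
	return 0;
}

/*
 * Allocator side, after a failed high-order allocation:
 *     compaction_requested = true;
 *     wake_up(&kcompactd_wait);
 */
```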
Michal Hocko said that he ran into compaction problems while working on the out-of-memory detection problem. He found that the compaction code is hard to get useful feedback from; it “does random things and returns random information.” It has no notion of costly allocations, and makes decisions that are hard to understand.
Part of the problem, he said, is that compaction was implemented for the THP problem and is focused a little too strongly there. THP requires order-9 (i.e. “huge”) pages; if the compaction code cannot create such a page in a given area, it just gives up. (Translator's note: an order-9 block contains 2^9 = 512 contiguous page frames, which is why such pages are called “huge”.) The system needs contiguous allocations of smaller sizes, down to the order-2 (four-page) allocations needed for fork() to work, but the compaction code doesn’t care about creating contiguous chunks of that size. A similar problem comes from the “skip” bits used to mark blocks of memory that have proved resistant to compaction. They are an optimization meant to head off fruitless attempts at compaction, but they also prevent successful, smaller-scale compaction. Hacking the compaction code to ignore the skip bits leads to better results overall.
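For a sense of scale, the order-2 case is what fork() needs for a new kernel stack on common configurations; the sketch below assumes 4KB pages and a 16KB kernel stack and is not the actual fork() code:

```c
/*
 * Sketch of the order-2 case: with 4KB pages and a 16KB kernel stack
 * (a common configuration, assumed here), creating a new process needs
 * four physically contiguous page frames.  Illustration only; the real
 * fork() path uses its own stack-allocation helpers.
 */
#include <linux/gfp.h>

#define STACK_ALLOC_ORDER 2	/* 2^2 = 4 pages = 16KB */

static unsigned long alloc_kernel_stack_sketch(void)
{
	/* fails if no order-2 block exists and compaction cannot make one */
	return __get_free_pages(GFP_KERNEL, STACK_ALLOC_ORDER);
}

/* release with: free_pages(stack, STACK_ALLOC_ORDER); */
```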
Along the same lines, compaction doesn’t even try with page blocks that hold unmovable allocations. As Mel pointed out, that was the right decision for THP, since a huge page cannot be constructed from such a block, but it’s the wrong thing to do for smaller allocations. It might be better, he said, for the compaction code to just scan all of memory and do the best it can.
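Mel's point boils down to a check of a pageblock's migrate type before bothering to scan it; a rough sketch of that decision follows, with a hypothetical "thorough" knob standing in for the relaxed scan-everything behavior he describes:

```c
/*
 * Rough sketch of the "should compaction bother with this pageblock?"
 * decision.  Illustration only: the "thorough" knob is hypothetical,
 * standing in for the relaxed behavior discussed above.
 */
#include <linux/mm.h>
#include <linux/mmzone.h>

static bool should_scan_pageblock(struct page *page, bool thorough)
{
	int mt = get_pageblock_migratetype(page);

	if (thorough)
		return true;	/* smaller free runs may still be made here */

	/* THP-oriented behavior: skip blocks pinned by unmovable allocations */
	return mt != MIGRATE_UNMOVABLE;
}
```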
There was some talk of adding flexibility to the compaction code so that it will be better suited for more use cases. If the system is trying to obtain huge pages for THP, compaction should not try too hard or do anything too expensive. But if there is a need for order-2 blocks to keep things running, compaction should try a lot harder. One option here would be to have a set of flags describing what the compaction code is allowed to do, much like the “GFP flags” used for memory allocation requests. The alternative, which seemed to be more popular, is to have a single “priority” level controlling compaction behavior.
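The two interface styles being weighed can be pictured as follows; the names are illustrative only, not an actual kernel API:

```c
/*
 * Sketch of the two interface styles discussed above.  The names are
 * illustrative only, not an existing kernel API.
 */

/* Option 1: GFP-style flags describing what compaction is allowed to do */
#define COMPACT_MAY_WAIT_WRITEBACK	0x01	/* may wait for dirty pages       */
#define COMPACT_MAY_MIGRATE_MAPPED	0x02	/* may migrate mapped pages       */
#define COMPACT_MAY_IGNORE_SKIP		0x04	/* may ignore per-block skip bits */

/* Option 2: a single priority level; a higher priority tries harder */
enum compact_priority_sketch {
	COMPACT_PRIO_OPPORTUNISTIC,	/* THP faults: stay cheap, give up early */
	COMPACT_PRIO_BACKGROUND,	/* kcompactd-style housekeeping          */
	COMPACT_PRIO_CRITICAL,		/* e.g. order-2 needed for fork()        */
};
```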
The final topic of discussion was the process of finding target pages when compaction decides to migrate a page that is in the way. The current compaction code works from both ends of a range of memory toward the middle, trying to accumulate free pages at one end by migrating pages to the other end. But it seems that, in some settings, scanning for the target pages takes too long; it was suggested that, maybe, those pages should just come from the free list instead. Mel worried, though, that such a scheme could result in two threads doing compaction just moving the same pages back and forth; the two-scanner approach was designed to avoid that. There was some talk of marking specific blocks as migration targets, but it is not clear that work in this area will be pursued. (Translator's note: in the current code the two scanners work within each memory zone, and the suggested targets would come from the buddy allocator's free lists.)
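The two-scanner scheme the current code uses can be summarized roughly as follows; this is an illustrative sketch with hypothetical helpers, not the real isolation and migration loop:

```c
/*
 * Illustrative sketch of the current two-scanner design (hypothetical
 * helpers, not the real mm/compaction.c loop): a migration scanner walks up
 * from the low end of a range looking for used movable pages, a free scanner
 * walks down from the high end collecting free pages to use as targets, and
 * the pass ends where the two scanners meet.  Because targets only come from
 * the high end, two compactors cannot ping-pong the same pages back and forth.
 */
#include <linux/types.h>

/* hypothetical stand-ins for the real page-state and migration helpers */
static bool pfn_is_used_movable(unsigned long pfn) { return false; }
static bool pfn_is_free(unsigned long pfn) { return false; }
static void migrate_pfn_to(unsigned long src, unsigned long dst) { }

static void compact_range_sketch(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long scan_pfn = start_pfn;	/* migration scanner: upward   */
	unsigned long free_pfn = end_pfn - 1;	/* free scanner: downward      */

	while (scan_pfn < free_pfn) {
		if (!pfn_is_used_movable(scan_pfn)) {
			scan_pfn++;
			continue;
		}
		/* pull the free scanner down until it finds a target page */
		while (free_pfn > scan_pfn && !pfn_is_free(free_pfn))
			free_pfn--;
		if (free_pfn <= scan_pfn)
			break;		/* scanners met: this pass is done */
		migrate_pfn_to(scan_pfn, free_pfn);
		scan_pfn++;
		free_pfn--;
	}
}
```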