
LWN 75198: 虚拟内存专题二:基于对象的反向映射(object-based reverse mapping,简称 objrmap)的回归
原文:Virtual Memory II: the return of objrmap 原创:By corbet @ Mar. 10, 2004 翻译:By unicornx of TinyLab.org 校对:By Wen Yang
Andrea Arcangeli not only wants to make the Linux kernel scale to and beyond 32GB of memory on 32-bit processors; he seems to be in a real hurry. There are, it would seem, customers waiting for a 2.6-based distribution which can run in such environments.
对于 Andrea Arcangeli 来说,他所要做的可不仅仅是使 Linux 能够在 32 位处理器上支持大于 32 GB 的内存;他还有更重要的使命(译者注,即本文要重点介绍的 objrmap)。大家都在等待他的工作完成,从而能够在基于 2.6 内核的发行版上享受到该功能。
For Andrea, the real culprit in the exhaustion of low memory is clear: it’s the reverse-mapping virtual memory (“rmap”) code. The rmap code was first described on this page in January, 2002; its purpose is to make it easier for the kernel to free memory when swapping is required. To that end, rmap maintains, for each physical page in the system, a chain of reverse pointers; each pointer indicates a page table which has a reference for that page. By following the rmap chains, the kernel can quickly find all mappings for a given page, unmap them, and swap the page out.
对于 Andrea 来说,真正导致低端内存被耗尽的罪魁祸首是虚拟内存中实现的反向映射代码(译者注,反向映射,英文是 reverse-mapping,简称 “rmap”,下文用 rmap 指代早期内核版本中的反向映射技术,和 objrmap 相对)。 在 2002 年 1 月 首次给大家介绍了 rmap;引入它的目的是为了方便内核在执行页交换(swap)时释放内存。rmap 为系统中的每个物理页维护一个链表用于保存反向映射指针;每个指针指向映射该物理页的一个页表。在给定一个物理页后,通过遍历其 rmap 链表,内核可以快速查找到映射该物理页的所有进程,逐个取消这些映射关系后就可以将该物理页交换出来(page out)。
The rmap code solved some real performance problems in the kernel’s virtual memory subsystem, but it, too has a cost. Every one of those reverse mapping entries consumes memory - low memory in particular. Much effort has gone into reducing the memory cost of the rmap chains, but the simple fact remains: as the amount of memory (and the number of processes using that memory) goes up, the rmap chains will consume larger amounts of low memory. Eliminating the rmap overhead would go a long way toward allowing the kernel to scale to larger systems. Of course, one wants to eliminate this overhead while not losing the benefits that rmap brings.
rmap 技术解决了内核虚拟内存子系统中的性能问题,但它也是有成本的。链表中用于保存反向映射指针的每个元素都要消耗内存,而且是低端内存(low memory)。内核社区为降低 rmap 的内存消耗付出了很多努力,但现实是:随着系统的内存量(以及使用内存的进程数目)的增加,rmap 将消耗更多的低端内存。只有降低 rmap 的内存开销才能推动内核支持更大规模的系统。当然,前提是不能失去 rmap 带给我们的在性能上的提升。
Andrea’s approach is to bring back and extend the object-based reverse mapping patches. The initial object-based patch was created by Dave McCracken; LWN covered this patch a year ago. Essentially, this patch eliminates the rmap chains for memory which maps a file by following pointers “the long way around” and searching candidate virtual memory areas (VMAs). Andrea has updated this patch and fixed some bugs, but the core of the patch remains the same; see last year’s description for the details.
Andrea 参考了原先的基于对象的反向映射(object-based reverse mapping)补丁并基于该补丁做了改进。这个补丁最初是由 Dave McCracken 提交的;LWN 一年前为大家介绍过。这个补丁最主要的优点,是针对文件映射使用的物理页,消除了 rmap 对内存的巨大需求,但代价是它需要通过 “更复杂” 的方式反向查找到映射该物理页的页表项,这其中还包括需要搜索关联的虚拟内存区域(virtual memory area,简称 VMA)。Andrea 对该补丁进行了修改并修复了一些错误,但补丁的核心思想仍然保持不变;有关其核心思想可以参阅去年的详细介绍。
Last week, we raised the possibility that the virtual memory subsystem could see fundamental changes in the course of the 2.6 “stable” series. This week, Linus confirmed that possibility in response to Andrea’s object-based reverse mapping patch:
I certainly prefer this to the 4:4 horrors. So it sounds worth it to put it into -mm if everybody else is ok with it.
上周,我们提出是否有可能在 2.6 的 “稳定”版本系列中看到这个重大改变。本周,Linus 确认了这种可能性并提到了 Andrea 的基于对象的反向映射补丁:
相对于 “4:4”(译者注,指 [4G/4G 补丁](http://lwn.net/Articles/39925/)),我更倾向于合入这个补丁(译者注,指 Andrea 的基于对象的反向映射补丁)。如果其他人都觉得没问题的话,我将把它合入 “-mm” 代码版本库。
Assuming this work goes forward, it has the usual implications for the stable kernel. Even assuming that it stays in the -mm tree for some time, its inclusion into 2.6 is likely to destabilize things for a few releases until all of the obscure bugs are shaken out.
这么做势必会影响到内核版本的稳定性。即使该补丁会在 “-mm” 代码库中保留一段时间,但一旦当它被合入 2.6 中后肯定还会出现这样或者那样的问题,可以预期的是这种不稳定性会延续多个版本,直到所有那些不易察觉的错误都被消除干净。
Dave McCracken’s original patch, in any case, only solves part of the problem. It gets rid of the rmap chains for file-backed memory, but it does nothing for anonymous memory (basic process data - stacks, memory obtained with
malloc()
, etc.), which has no “object” behind it. File-backed memory is a large portion of the total, especially on systems which are running large Oracle servers and use big, shared file mappings. But anonymous memory is also a large part of the mix; it would be nice to take care of the rmap overhead for that as well.
Dave McCracken 提交的补丁起初只解决了部分问题。它解决了那些文件映射内存(file-backed memory)的反向映射,但对匿名内存(anonymous memory),即那些没有映射实际文件的内存(那些最基本的进程运行时所需要的内存,譬如栈、以及使用 malloc()
所分配的堆等),并没有任何改进(译者注,即还是采用旧的 rmap 方式)。用于文件映射的内存占内存消耗总数的很大一部分,特别是在那些运行大型 Oracle 系统的服务器上,这些系统会对很大的共享文件进行内存映射;但这并不意味着我们可以忽略匿名内存所消耗的内存数量;它涉及的 rmap 开销也不少。
To that end, Andrea has posted another patch (in preliminary form) which provides object-based reverse mapping for anonymous memory as well. It works, essentially, by replacing the rmap chain with a pointer to a chain of virtual memory area (VMA) structures.
为此,Andrea 提交了另一个补丁(目前还处于原型状态),它为匿名内存也提供了基于对象的反向映射机制。它本质上是用虚拟内存区域(VMA)链表替换了 rmap 所使用的针对每个物理页所维护的反向映射链表。
Anonymous pages are always created in response to a request for memory from a single process; as a result, they are never shared at creation time. Given that, there is no need for a new anonymous page to have a chain of reverse mappings; we know that there can be only a single mapping. Andrea’s patch adds a union to
struct page
which includes the existingmapping
pointer (for non-anonymous memory) and adds a couple of new ones. One of those is simply calledvma
, and it points to the (single) VMA structure pointing to the page. So if a process has several non-shared, anonymous pages in the same virtual memory area, the structure looks somewhat like this:
我们知道,只有当一个进程发起内存申请请求时内核才会为其创建匿名页;因此,匿名页在创建之初不存在共享的情况。鉴于此,不需要对于一个新建的匿名页维护一个反向映射的链表;也就是说此时只会存在一个映射。Andrea 在补丁中为 struct page
添加了一个联合体(union)类型的成员,该联合体中除了包含现有的 mapping
指针(用于非匿名内存)外还添加了几个新的成员。其中一个简称为 vma
,它指向(单个)VMA 结构体,而通过该 VMA 结构体则可以找到其关联的页。因此,假设某个进程的一个连续的虚拟地址区间映射了多个非共享的匿名页,则它们之间的关系看起来有点像下图这样:
With this structure, the kernel can find the page table which maps a given page by following the pointers through the VMA structure.
利用以上结构,在给定一个物理页的条件下,内核可以通过其关联的 VMA 结构找到映射该物理页的页表。(译者注,这里的描述应该是针对还未合入内核的补丁实现的描述,在正式合入后,并没有上文的联合体和 vma
字段,对于匿名页,无论是否共享,都采用下文的 anon_vma
和 VMA 链表进行管理,对于未共享的场景,VMA 链表上只有一个节点)
Life gets a bit more complicated when the process forks, however. Once that happens, there will be multiple page tables pointing to the same anonymous pages and a single VMA pointer will no longer be adequate. To deal with this case, Andrea has created a new “anon_vma” structure which implements a linked list of VMAs. The third member of the new
struct page
union is a pointer to this structure which, in turn, points to all VMAs which might contain the page. The structure now looks like:
当进程派生子进程(fork)时,情况会变得复杂一些。此时,将会有多个页表指向相同的匿名页,显然单个 VMA 指针将不再适用。为了解决这个问题,Andrea 创建了一个新的 “anon_vma” 结构体类型,该结构体中包含了一个 VMA 的链表。struct page
的 union 成员中新增的第三个成员是指向此结构体类型的指针。数据结构现在看起来如下图所示:
If the kernel needs to unmap a page in this scenario, it must follow the linked list and examine every VMA it finds. Once the page is unmapped from every page table found, it can be freed.
在该场景下如果内核需要取消对该物理页的映射,可以遍历 anon_vma
所维护的 VMA 链表,对链表上的每个 VMA 找到其对应的页表(Page table),然后取消页表中对物理页(page)的映射。遍历完成后说明所有映射关系都被取消,此时就可以释放该物理页了。
There are some memory costs to this scheme: the VMA structure requires a new
list_head
structure, and theanon_vma
structure must be allocated whenever a chain must be formed. One VMA can refer to thousands of pages, however, so a per-VMA cost will be far less than the per-page costs incurred by the existing rmap code.
该方案存在一些内存成本:VMA 结构体中需要新增 list_head
结构,并且在创建链表时需要分配 anon_vma
结构体。但是,由于一个 VMA 可以引用数千个物理页,因此相对于现有的 rmap 的代码,采用 VMA 的方式所消耗的内存要少得多得多。
This approach does incur a greater computational cost. Freeing a page requires scanning multiple VMAs which may or may not contain references to the page under consideration. This cost will increase with the number of processes sharing a memory region. Ingo Molnar, who is fond of O(1) solutions, is nervous about object-based schemes for this reason. According to Ingo, losing the possibility of creating an O(1) page unmapping scheme is a heavy cost to pay for the prize of making large amounts of memory work on obsolete hardware.
这种方法确实会产生更大的计算成本。释放物理页需要扫描多个 VMA,这些 VMA 可能映射了该物理页,也可能没有。查找的成本将随共享内存区的进程的数量增加而增加。更倾向于 O(1)
解决方案的 Ingo Molnar 针对该场景下的 objrmap 方案表达了他的担忧。根据 Ingo 的说法,在取消页面映射处理过程中,仅仅是为了在过时的机器上支持大容量内存就放弃 O(1)
的算法实在是得不偿失。
The solution that Ingo would like to see, instead, is to reduce the per-page memory overhead by reducing the number of pages. The means to that end is page clustering - grouping adjacent hardware pages into larger virtual pages. Page clustering would reduce rmap overhead, and reduce the size of the main kernel memory map as well. The available page clustering patch is even more intrusive than object-based reverse mapping, however; it seems seriously unlikely to be considered for 2.6.
相反,Ingo 建议的解决方案是通过减少物理页的数量来减少每页的内存开销。解决的方案是对页面进行合并(page clustering),即将相邻的物理页面组合为更大的虚拟页面。合并物理页后会减少 rmap 的开销,同时也会减少内核中内存映射表的大小。然而,相对于 objrmap 补丁,“page clustering” 补丁的修改过于激进,看上去不太可能为 2.6 版本所接受。
支付宝打赏 | 微信打赏 | |
![]() | ![]() 请作者喝杯咖啡吧 | ![]() |
Read Album:
- LWN 520076: 软中断对实时性的影响
- LWN 452884: 实时 Linux 中的 Per-CPU 变量处理
- LWN 302043: 中断线程化
- LWN 296419: SCHED_FIFO 和实时任务抑制(throttling)
- LWN 271817: 实时自适应锁
- LWN 178253: 内核中的 “优先级继承(Priority Inheritance)”
- LWN 146861: 实时抢占补丁综述
- “LWN 中文翻译计划” 二周年工作小结
- LWN 106010: 实现 “实时(realtime)” Linux 的多种方法
- LWN 433904: 一个 “组调度(group scheduling)” 的运行实例
- LWN 418884: 针对 “组调度”(Group scheduling)的不同分组方案
- LWN 415740: 基于 TTY 的组调度
- LWN 428230: CFS 带宽控制
- LWN 230574: 内核调度器替换方案的激烈竞争
- LWN 668126: 更加可靠(reliable)和更可预期(predictable)的 OOM 处理机制
- LWN 562211: 更加可靠的 OOM 处理
- LWN 391222: 重写(rewrite)OOM Killer
- LWN 517465: 为 “巨页”(huge page)增加一个 “零页”(zero page)
- LWN 423584: 对 2.6.38 版本中新增的 “透明巨页(Transparent Huge Pages)” 特性的介绍
- LWN 359158: 透明巨页(Transparent Hugepages)
- LWN 758677: 优化巨页(huge page)交换(swapping)的终极之役
- LWN 717707: 页交换(swap)的改进计划
- LWN 240474: CFS 组调度
- LWN 704478: 让页交换(swapping)更具扩展性(scalable)
- LWN 439298: 可靠地通过网络执行页交换(swapping)
- LWN 334649: Compcache,一种基于内存实现压缩交换(compressed swapping)的技术
- LWN 83588: 内核 2.6 版本的交换(swapping)行为
- LWN 550463: 更好的 Shrinker 机制
- LWN 495543: 一种更好的平衡 active/inactive 链表长度的算法(Refault Distance 算法)
- LWN 333742: 降低存放可执行指令的页框被换出的可能性
- LWN 286472: 页框回收处理中着眼于可扩展性能(scalability)改进的最新介绍
- LWN 257541: 大容量内存系统的页框回收处理
- LWN 226756: 改进页框回收(page replacement)
- LWN 712467: 页缓存(page cache)的未来
- LWN 372384: 改善文件预读(readahead)
- LWN 235164: 按需预读(On-demand readahead)
- LWN 155510: 自适应(Adaptive)文件预读(readahead)算法
- LWN 685894: 后台回写(Background writeback)
- LWN 682582: 改进后台回写(writeback)引入的延迟
- LWN 648292: 回写(Writeback)和控制组(control groups)
- LWN 456904: 避免磁盘回写(writeback),抑制(throttling)缓存(page cache)写入
- LWN 405076: 动态回写抑制(Dynamic writeback throttling)
- LWN 396561: 解决 direct reclaim 中的 writeback 问题
- LWN 384093: 有关 “回写”(writeback)的问题讨论
- LWN 326552: 一种替代 pdflush 的新方案
- LWN 717656: 主动(proactive)内存规整(compaction)
- LWN 684611: 连续内存分配器(Contiguous Memory Allocator)和内存规整(compaction)
- LWN 591998: 内存规整(memory compaction)所存在的问题
- LWN 368869: 内存规整(compaction)
- LWN 211505: 避免和解决内存碎片化
- LWN 159110: 更多有关避免内存碎片化的报道(More on fragmentation avoidance)
- LWN 158211: 避免内存碎片化(fragmentation avoidance)
- LWN 121618: 另一种避免内存碎片化(memory fragmentation)的方法
- LWN 105021: 主动内存碎片整理
- LWN 101230: Kswapd 和 “高阶”(high-order)内存申请
- LWN 155344: 有关 `gfp_t`
- LWN 320556: 为页框分配器(page allocator)加速
- LWN 565097: 对 `struct page` 的进一步改进
- LWN 335768: 我们究竟可以为物理页定义多少个状态标志?
- LWN 121845: 内核 2.6 中地址空间的随机化
- LWN 91829: 重新组织地址空间(address space)的布局
- LWN 753267: 针对页表遍历方式进行改造的讨论
- LWN 717293: 五级页表
- LWN 117749: 合入四级页表功能
- LWN 116810: 对四级页表设计的再思考
- LWN 106177: 四级页表
- LWN 761215: 关于内核初始化早期阶段内存分配管理机制的发展回顾
- LWN 387083: 针对 x86 平台移植 LMB(Logical Memory Block)内存分配器
- LWN 382559: `NO_BOOTMEM` 补丁
- LWN 383162: 案例分析,复杂设计下的匿名页反向映射处理
- LWN 23732: 虚拟内存之基于对象的反向映射技术(object-based reverse-mapping)
- LWN 558284: 整个系统都空闲了吗?
- LWN 574962: 时钟广播框架(The tick broadcast framework)
- LWN 549580: 3.10 版本开始支持(接近)完全无周期时钟(full tickless)
- LWN 223185: 时钟事件(Clockevents)和动态时钟(dynamic tick)
- LWN 149877: 动态时钟补丁的最新状况
- LWN 145973: HZ 值应该多少合适
- LWN 138969: 动态时钟(dynamic tick)补丁
- LWN 70465: 引入 kgdb 到 2.6
- LWN 120850: 一个新的内核时间管理计时子系统
- LWN 167897: 高精度定时器编程接口
- LWN 156325: ktimers 补丁进展情况
- LWN 152436: 一种实现内核定时器的新方法
- LWN 184750: 一个新的通用中断(IRQ)框架
- LWN 532748: 名字空间实作,第四章:更多有关 PID 名字空间的介绍
- LWN 531114: 名字空间实作,第一章:名字空间(namespaces)概述
- LWN 531419: 名字空间实作,第三章:PID 名字空间
- LWN 531381: 名字空间实作,第二章:名字空间的 API
- LWN 718803: 文件系统的管理接口
- LWN 577961: Btrfs 同多设备协作
- LWN 616859: 设备树动态叠加技术
- LWN 465358: (部分)就绪的 IIO
- LWN 577218: Btrfs 入门
- LWN 357487: 内核峰会 2009: 通用设备树
- LWN 533632: 内核 GPIO 子系统的未来发展方向
- LWN 468759: 引脚控制子系统
- LWN 532714: 内核中的 GPIO 子系统介绍
- LWN 576276: Btrfs文件系统介绍
- LWN 222860: 资源管理编程接口
- LWN 215996: 设备资源管理
- LWN 448502: 平台设备和设备树
- LWN 448499: 平台设备 API