泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!


LWN 335768: 我们究竟可以为物理页定义多少个状态标志?

Wang Chen 创作于 2018/10/21

原文:How many page flags do we really have? 原创:By Jonathan Corbet @ June. 3, 2009 翻译:By unicornx of TinyLab.org 校对:By Benny Zhao

The recently-discussed kernel memory sanitization patch was criticized on a number of points, one of which was its use of a dedicated page flag. Andi Kleen’s HWPOISON patch (enabling upcoming Intel CPU features for dealing with memory errors) have run into trouble on similar grounds. The desperate shortage of page flags has been an article of faith among kernel developers for years. But, interestingly, not everybody agrees that a problem exists, and almost nobody can answer the simple question of how many flags are available in the first place. So a look at the Linux page flags issue seems in order.

最近讨论过的内核内存清理补丁在很多方面受到批评,其中一个问题就是它定义了一个新的页标志(page flag,译者注,page flag 的定义请见下一小节,为了行文流畅,下文中 page flag 不再翻译成中文)。Andi Kleen 的 HWPOISON 补丁(该补丁用于支持 Intel 即将推出的一项处理器的新功能,可以处理内存校验错误)也遇到了类似的麻烦。即将耗尽的 page flag 多年来一直困扰着内核开发人员。但有趣的是,并非所有人都认为这是个问题,而对于 “到底我们还可以再定义多少个 page flag” 这样简单的问题,却几乎从来没有人能够说清楚过。所以接下来我们就这个事情和大家一起讨论一下。

“Page flags” are simple bit flags describing the state of a page of physical memory. They are defined in <linux/page-flags.h>. Flags exist to mark “reserved” pages (kernel memory, I/O memory, or simply nonexistent), locked pages, those under writeback I/O, those which are part of a compound page, pages managed by the slab allocator, and more. Depending on the target architecture and kernel configuration options selected, there can be as many as 24 individual flags defined.

在逻辑概念上,“page flag” 是一些简单的布尔量(译者注,本小节用 “page flag” 时侧重于描述逻辑概念上的一种布尔值,而第三小节的 flags 和比特位的概念则侧重于 “page flag” 在编程中的具体实现),用于描述物理内存页(page)的一些状态是否存在(是或不是)。这些 page flag 定义在头文件 <linux/page-flags.h> 中,用于表示某个物理页当前是否存在如下情况,譬如:是否是 “保留” 的(“PG_reserved”,即该页是否被内核占用,或者该页所对应的物理地址范围是否用于映射输入输出(译者注,这里指的是过去 IBM 兼容 PC 上保留给 ISA 图形卡上从 640K 到 1M 的物理地址范围)或者是否根本就不存在),是否被锁定(“PG_locked”),是否正在被回写(“PG_writeback”),是否是复合页的一部分(“PG_compound”,译者注,所谓复合页即当系统启用扩展分页(Extended Paging)机制后的物理页处理方式),以及是否由 slab 分配器管理(“PG_slab”)等等。对于不同的系统体系架构和内核配置选项,最多可以有多达 24 个独立的 flag(译者注,按照原文发表的时间推测,本文的描述可能是基于当时最新的 2.6.29 内核)。

These flags live in the flags field of struct page. This field is declared to be an unsigned long, so one might think that figuring out how much space is left for new flags would be a straightforward task. To a casual observer, it would look like, on a 32-bit system, 24 flags have been used, leaving eight available:

这些 flag 保存在 struct page 结构体类型的 flags 字段中。flags 的类型被声明为无符号长整数(unsigned long,译者注,flags 的每个比特位用于标识一个 page flag,unsigned long 在 32 位系统上占 4 个字节即 32 个比特位),如此看来,大家可能会认为要确定还剩多少个 flag 可用是很简单的事情。对一个 32 位系统来说,已经定义了 24 个 flag,那么最多还能定义 8 个:

Page flags

In other words, the situation is starting to get tight, but it is not a crisis quite yet.

换句话说,剩余的用于保存 flag 的比特位已经不多了,但至少还可以维持一段时间。

But little is straightforward when it comes to struct page. One of these structures exists for every physical page in the system; on a 4GB system, there will be one million page structures. Given that every byte added to struct page is amplified a million times, it is not surprising that there is a strong motivation to avoid growing this structure at any cost. So struct page contains no less than three unions and is surrounded by complicated rules describing which fields are valid at which times. Changes to how this structure is accessed must be made with great care.

但如果你仔细了解一下 struct page 这个结构体类型就会发现事情不是那么简单。由于在系统中,每一个物理页都对应一个这样的结构体类型变量;以一个内存为 4GB 的系统为例,(按一个物理页为 4KB 计算)换算成物理页的数目就是一百万个。我们为 struct page 结构体每增加一个字节,对于整个系统来说所消耗的内存都将乘上一百万倍,这也是内核为啥要不惜一切代价避免增加该结构体大小的原因。仔细观察 struct page 这个结构体类型的定义,我们发现其中使用了不少于三个联合体(union),联合体中的不同成员基于复杂的规则用于不同场景。如果要改变对该结构体的访问方式必须十分地小心。

Unions are not the only technique used to shoehorn as much information as possible into this small structure. Non-uniform memory access (NUMA) systems need to track information on which node each page belongs to, and which zone within the node as well. Rather than add fields to struct page, the NUMA hackers grabbed the free bits at the top of the flags field, yielding something like this:

为了在有限的结构体中包含尽可能多的信息,解决方法有很多,使用联合体只是其中的一种,下面就是另外一种解决方法的例子。在一个 NUMA (Non-Uniform Memory Access)系统上我们还需要为每一个物理页维护其归属于哪个节点(node)中的哪个域(zone)这类信息(译者注,为行文流畅,node 和 zone 在下文不再翻译为中文)。从节省内存的角度出发,设计人员并没有在 struct page 中添加额外的字段,而是利用了其成员 flags 中未使用的那些高端比特位,如下图所示:

Page flags

So, on a 32-bit system with 24 page flags defined (a pessimistic scenario), there are eight bits available for the node and zone information, practically limiting 32-bit NUMA systems to 64 nodes, which is almost certainly adequate. But the addition of more page flags would come at the cost of supporting fewer NUMA nodes, and that would be unwelcome.

因此,在一个 32 位系统上假设使用了 24 个 page flag (最大情况下),那么还剩下 8 个比特位可用于存放 node 和 zone 信息,(译者注,参考上图,最高 6 位存放 node 信息,次高 2 位存放 zone 信息)实际上这限制了 32 位的 NUMA 系统最大可以支持的 node 数目不能超过 64 个,照理说也应该是足够了的。但是如果我们要使用更多的 page flag,势必得减少能够支持的 NUMA 的 node 数目,这就不是我们所希望看到的了。

Things get worse on systems with complicated physical memory layouts. On such systems, memory is not organized into a single, continuous range of physical addresses; instead, it is spread out with holes in the middle. Memory management on these “sparse memory” systems requires that each page have a “section” number associated with it. That section number is stored - you guessed it - in the spare bits at the top of the flags field. If space gets too tight, the kernel will move the node number into a separate array, slowing things down in the process. Either way, it seems clear that there is not a whole lot of spare room in the flags field on these systems.

当系统的物理内存布局变得复杂时情况会变得更糟。在某些系统上,存储单元的物理地址不连续;呈离散分布。针对这种所谓的 “稀疏内存”(”sparse memory”)方式的系统,其内存管理要求对应每个物理页都分配一个 “section” 编号。你肯定会想到,这些 section 的编号信息最佳的存放位置仍然是利用 flags 字段中那些空闲的高端比特位。如果实在放不下,内核会将 node 编号拿出来放到一个单独的数组中存放,但副作用是导致处理效率降低。无论哪种方式,很明显,对于这些系统来说,flags 字段中可用的空间都是十分紧张的。

So the real answer to “how many page flags are free?” is, for all practical purposes, “zero,” at least on 32-bit NUMA systems. Making room for more would require expanding struct page, which is a heavy cost to pay. Developers should, thus, not be surprised when proposals to use new page flags run into stiff opposition. It’s only one bit, but that bit is in the middle of some of the most sought-after real estate in the entire kernel.

那么到底 “我们还可以再定义多少个 page flag 呢?” 如果从实际的情况出发,至少对于一个 32 位 的 NUMA 系统来说,答案是 “一个也不能增加了”。再增加 page flag,势必要扩展 struct page 结构体类型的大小,这个代价是我们承受不起的。因此,当有人试图定义新的 page flag 的时候,遭到如此强烈的反对,一点也不奇怪。虽然对于 struct page 结构体类型来说,需要的只是扩展一个比特位,但对于整个内核来说却会占用其大量宝贵的低端内存(译者注,原文采用暗喻手法称会占用内核中最宝贵的黄金地段,应该指的就是低端内存,所以这里就直接意译了)。

In the case of Andi’s HWPOISON patch, this opposition has come in the form of a number of alternative suggestions. One was to simply use the “reserved” bit, but that could lead to difficulties in parts of the code where that usage is not expected. Then it was suggested that the combination of the “reserved” and “writeback” flags could indicate a poisoned page, but Andi claims that this approach cannot work. Andrew Morton has suggested that HWPOISON could be made into a 64-bit-only feature; Andi allows as to how that might be possible, but he clearly doesn’t like the idea.

就 Andi 的 HWPOISON 补丁而言,虽然同样招致反对,但也有人给出了一些可以替代的方案。一种建议 是简单地复用 “PG_reserved” 位,但这可能导致逻辑冲突。后来也有人建议 将 “PG_reserved” 和 “PG_writeback” 两个 flag 组合起来表示物理页已 “损坏” 的概念,但 Andi 认为这种方法存在问题。Andrew Morton 建议 HWPOISON 这个功能仅针对 64 位架构合入;Andi 觉得这办法可行,但他显然并不喜欢这个想法。

Instead, Andi takes the position that the page flag shortage does not really exist. It’s not a problem at all on 64-bit systems, where unsigned long is twice as wide. The number of 32-bit systems with a large number of NUMA nodes is small and shrinking; it’s not something that the developers need be concerned about. And, says Andi, if things get really bad, the sparse memory section number can be moved into a separate array like the NUMA node number. Given this view of the problem, worries about adding a useful new feature over concerns about a single page flag bit seem misplaced.

其实,Andi 认为 实际上我们并不缺乏存放 page flag 的空间。首先在 64 位系统上这根本就不是一个问题,因为在 64 位系统上,unsigned long 类型的长度是 32 位系统的两倍。其次,具有大量 NUMA 节点的 32 位系统其数量本身并不多且此类产品呈下降趋势;开发人员对此根本无需担心。Andi 说,如果情况真的变得非常糟糕的话,我们也可以将离散的 section 编号像 NUMA 节点编号一样集中到另一个单独的数组中存放。按照 Andi 的观点来看,如果仅仅是为了不增加一个 page flag 比特位而拒绝添加一个有用的新功能是完全没有必要的。

Nobody has challenged Andi’s view that the problem is not as severe as most people think, though Andrew Morton has hinted that Andi should go ahead and prove his ideas about moving the section number out of the page structure. That might not be a bad idea. Even if page flags are a little more abundant than most developers think, it still is not hard to foresee a time when they are exhausted, at least on 32-bit systems. Proposals involving new page flags are not particularly rare; unless we want to restrict features needing page flags to 64-bit systems, we’ll need to make some more flags available before too long.

目前还没有人站出来质疑 Andi 的观点(即上文所述的他认为这个问题可能并没有大多数人想象的那么严重),但 Andrew Morton 还是建议 Andi 按照他的想法尝试把 section 编号拿到 struct page 外面去存放,看看具体效果如何。这应该是一个好的建议。因为即使 page flags 的确比大多数开发人员担忧的要富裕那么一点,但可以预见的是,它们仍然很快就会被耗尽(至少在 32 位系统上)。那些试图增加新的 page flag 的设计提案还是有的;所以,除非我们以后规定,任何一个新功能,如果需要使用新的 page flag,则该功能只能限制在 64 位系统上,否则的话,在不久的将来我们仍然需要为提供更多的 page flag 而伤脑筋。(译者注,HWPOISON 补丁最后随 2.6.32 合入了内核主线,而且看上去最后社区就 page flag 的使用达成了一致,即新的 flag 可以继续增加,所以对 HWPOISON 补丁来说,最完美的结果就是获得了一个新的 flag - “PG_hwpoison”。)

Read Album:

Read Related:

Read Latest: