泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!

插上 Linux Lab 实验盘,“秒变” Linux 本本,可即刻开展 Linux 内核开发

LWN 106177: 四级页表

Wang Chen 创作于 2018/09/19

原文:Four-level page tables 原创:By corbet @ Oct. 12, 2004 翻译:By unicornx of TinyLab.org 校对:By simowce & Li lingjie

Most Linux users probably have a sufficiently interesting life that they spend little time imagining how page tables are represented in the kernel. Many of those who do ponder on that issue may think in terms of a linear array which maps virtual addresses onto their corresponding physical addresses. This view of page tables is enough to understand the basic function that they perform, but the real situation is more complicated than that.

绝大多数 Linux 的用户并不会真正去仔细考虑内核中有关处理页表(page tables)的问题。即使对于那些真的关心这个问题的人来说,其中的大部分人可能会认为,在设计上,使用一个简单的数组,数组的每一项用于记录虚拟地址到物理地址的对应关系,如此这般已足以支持地址翻译的基本功能。但实际上真正实现起来,远比这复杂得多。

A single array large enough to hold the page table entries for a single process would be huge. On a typical x86 system, a page table entry requires 32 bits, so 1024 of them (covering 4MB of virtual address space) can be stored in one page. If the virtual address space is 3GB (as it is on many x86 systems), 768 pages would be required to hold all of the page table entries. Allocating that much contiguous memory (for each process) would be impossible, even if that sort of memory overhead were tolerable.

对每一个进程来说,如果我们用一个数组来存放该进程的所有页表项,其耗费的内存量是非常巨大的。在一个典型的 x86 系统上,一个页表项占用 32 个比特位(即 4 个字节),因此一个物理页(page,在 32 位的架构中典型的大小是 4K 字节)中可以存放的页表项个数就是 1024 个(也就是可以映射总共 4MB 大小的虚拟地址空间)。如果要覆盖 3GB 大小的虚拟地址空间(常见的 x86 系统基本上都是这种情况),则需要 768 个物理页来保存所有的页表项条目。话再说回来,即便我们可以真的容忍这么大的内存开销,但要一次性分配如此容量的 “连续” 的内存(注意是对每个进程),也几乎是不可能的。

The fact is that most processes only use a small portion of the total virtual address space - but the parts they use are widely scattered over that space. Program text lives down near the bottom, heap memory and dynamic libraries are distributed throughout the middle, and the stack is put up at the very top. So the real page table structure must handle a sparse, widely distributed set of virtual addresses without wasting excessive amounts of memory or requiring large, physically-contiguous arrays.


To that end, modern processors which use page tables use a hierarchical, tree structure. This structure allows the table to be broken up into individual pages, and the subtrees corresponding to unused parts of the address space can be absent. The Linux kernel works with a three-level structure which looks like this:

基于以上考虑,支持分页机制的现代处理器在页表设计上使用的都是分层的树状结构。该结构允许将大的表(table)分解成单独的(不连续的)页(page),这样下一级子树中没有对地址进行映射的表项就可以不加载(即不占用内存)。当前(译者注,内核版本 2.6.11 之前)Linux 内核的页表模型使用如下图所示的三级结构:

Page table tree

On an x86 system running in the PAE mode (only needed when more than 4GB of memory is installed), all three levels of page tables are present. The page global directory (PGD) contains only four entries, each corresponding to 1GB of virtual address space; the PGD is indexed using the top two bits of the virtual address. Each PGD entry points to a page middle directory (PMD), which holds 512 entries indexed by bits 21-29 of the virtual address. The PMD entry (if it is not empty) points to an actual page table. Using bits 12-20 of the virtual address to index into that page table yields the actual physical address of the page, assuming that page is currently resident in RAM.

当 x86 系统运行在 PAE (Physical Address Extension)模式时(该模式仅在系统安装了超过 4GB 的物理内存时才需要使能),Linux 内核会全面启用这种三级的页表机制。页全局目录(Page Global Directory,以下简称 PGD)仅包含四项,每一项对应 1GB 的虚拟地址空间;PGD​​ 使用虚拟地址的前两位进行索引。每一项 PGD 条目指向一个页中间目录(Page Middle Directory,以下简称 PMD),每一个 PMD 表包含 512 项条目,每一项可以通过虚拟地址的 21-29 位进行索引。每一个 PMD 条目(除非其值为空)指向页表(Page Table,即上图中的 PTE。每个 PTE 表也是同样包含 512 个条目)。通过虚拟地址的 12-20 位作为索引可以在 PTE 表中定位表项并最终获得页(page)的物理地址(译者注,这里的页(page)指的是虚拟地址所映射的物理内存字节单元所在的物理页),除非该页的内容当前没有驻留在内存中。

The current 2.6 kernel implements a three-level page table for all architectures. As it turns out, the bulk of x86 systems will not be running in the PAE mode; on those systems, the hardware only supports two levels of page tables. The PGD holds 1024 entries (bits 22-31), each of which points to a 1024-entry page table (bits 12-21). For the benefit of the rest of the kernel, the page table access functions are set up to emulate the existence of a single-entry PMD, so these systems still appear to use a three-level page table.

当前的 2.6 版本的内核对所有的体系架构均采用通用的三级页表进行处理。但事实上,大部分 x86 系统并不会在 PAE 模式下运行;在有些系统上,硬件仅支持两级页表。相应地在这些处理器上运行的 Linux 内核中,每个 PGD​ 拥有 1024 个条目(使用虚拟地址的 22-31 位进行索引),每个 PGD 条目直接指向页表(Page Table,即 PTE),这里的每个 PTE 含有 1024 个条目(采用虚拟地址的 12-21 位进行索引)。因为中间少了一级 PMD,为了让内核的其他部分在访问页表时感觉不到体系架构的差异,相关页表 API 函数会模拟一个虚拟的 PMD 级别,每个 PMD 只含一项。这种抽象保持了代码结构的统一,让整个系统看上去仍然使用的是三级页表。

The three-level design is wired deeply into the kernel. Any code which must manually map a virtual address into its physical counterpart must do something like this (error handling and other details omitted):


pmd = pmd_offset(pgd, address);
pte = *pte_offset_map(pmd, address);
page = pte_page(pte);

Similarly, any kernel function which affects a range of virtual addresses must implement a depth-first traversal of the relevant portion of the three-level tree. Most of these traversals of the page table tree have been isolated behind functions, but it is still surprising how many places are coded around the three-level assumption. But it all works fine, since the architecture-specific code makes it looks like all systems have three-level page tables.


The only problem is that some hardware actually supports four-level tables. The example which is driving the current changes is x86-64. The current x86-64 port emulates a three-level architecture by using a single, shared, top-level directory (“PML4”) and fitting (most of) the virtual address space in a three-level tree pointed to by a single PML4 entry. It all works, but it limits Linux processes to a mere 512GB of virtual address space. Such limits are irksome to the kernel developers when the hardware can do more, and, besides, somebody is likely to release a web browser or office suite which runs into that limit in the near future.

目前唯一的问题是某些硬件实际上是按照四级页表方式工作的。譬如 x86-64,这也是目前推动内核进一步升级的原因。当前移植 x86-64 的代码中定义了一个共享的顶级表(“PML4”),其每一项都指向一个三层的页表结构,通过这种方式对 Linux 内核模拟三级页表模型,以支持(大部分的)虚拟地址空间。但该方式存在一个问题,就是会限制一个 Linux 进程的虚拟地址空间不能超过 512GB。对内核开发人员来说,这是一个令人讨厌的限制,因为这导致我们无法充分发挥硬件的潜能。而且,很有可能在不久的将来,就会出现一个 Web 浏览器或者办公软件在内存需求上会突破这个限制。

The solution is to shift the kernel over to using four-level page tables everywhere, with the fourth level emulated (and optimized out of existence) on architectures which do not support it. Andi Kleen has posted a four-level page tables patch which implements this change. With Andi’s patch, the x86-64 architecture implements a 512-entry PML4 directory, 512-entry PGD, 512-entry PMD, and 512-entry PTE. After various deductions, that is sufficient to implement a 128TB address space, which should last for a little while.

看起来目前有必要将内核的通用页表模型升级到四级,对物理上不支持四级的体系架构则在软件上对第四级进行模拟(编译优化使其实际并不存在)。基于以上思路,Andi Kleen 开发了一个四级页表补丁。在该补丁中,x86-64 架构上的页表实现方式为:第四级 PML4 表包含 512项,第三级 PGD 表 包含 512 项,第二级 PMD 表包含 512 项、第一级 PTE 表包含 512 项。按以上方式计算,至少可以支持 128TB 的地址空间,这应该够我们用一段时间的了。

The actual patch works as one might expect; code which currently handles three-level page tables is extended to deal with the fourth level. There is a default PML4 implementation which can be included by architectures which do not have four-level tables; that should make porting most architectures to the new scheme relatively easy. That work is likely to happen in the near future, after which Andi has stated his intention to get the four-level patch merged into the -mm tree. Andrew Morton has already said (at the kernel summit) that he would consider merging such a patch. Your Linux system may be running with four-level page tables in the near future.

补丁的运行效果看上去还不错; 当前处理三级页表的代码被扩展为支持处理第四级(即 PML4)。补丁中提供了一个缺省的 PML4 的实现,没有四级页表的架构可以直接使用(译者注,指用其模拟第四级);这使得大多数体系架构移植到新方案上时相对容易。移植工作很快就会展开,Andi 表示他打算在移植工作完成后将他的四级页表补丁合并到 “-mm” 代码仓库中。Andrew Morton 也已经(在内核峰会上)表示他会考虑合并该补丁。您的 Linux 系统很可能在不久的将来就会使用四级页表。

Read Album:

Read Related:

Read Latest: