LWN 423584: 对 2.6.38 版本中新增的 “透明巨页（Transparent Huge Pages）” 特性的介绍

Wang Chen 创作于 2019/07/01

请点击 LWN 中文翻译计划，了解更多详情。

原文：Transparent huge pages in 2.6.38 原创：By Jonathan Corbet @ Jan. 19, 2011 翻译：By unicornx 校对：By Libin Zhao

The memory management unit in almost any contemporary processor can handle multiple page sizes, but the Linux kernel almost always restricts itself to just the smallest of those sizes - 4096 bytes on most architectures. Pages which are larger than that minimum - collectively called “huge pages” - can offer better performance for some workloads, but that performance benefit has gone mostly unexploited on Linux. That may change in 2.6.38, though, with the merging of the transparent huge page feature.

几乎所有现代处理器中的内存管理单元（memory management unit，简称 MMU）都可以处理多种大小的内存页，但 Linux 内核长久以来却总是只支持其中最小的页类型（在大多数体系结构上，其大小是 4096 字节）。对大于这个最小值的页，我们统称为 “巨页（huge pages，译者注：也翻译为 “大页”，下文直接使用英文不再翻译）”，在某些应用场景下，使用 huge pages 可以获得更好的性能，可惜的是，这种性能优势在 Linux 上几乎一直未被利用。随着 “透明巨页（transparent huge page）” 功能的合入，从内核版本 2.6.38 开始，这种状况将会发生转变。

Huge pages can improve performance through reduced page faults (a single fault brings in a large chunk of memory at once) and by reducing the cost of virtual to physical address translation (fewer levels of page tables must be traversed to get to the physical address). But the real advantage comes from avoiding translations altogether. If the processor must translate a virtual address, it must go through as many as four levels of page tables, each of which has a good chance of being cache-cold, and, thus, slow. For this reason, processors maintain a “translation lookaside buffer” (TLB) to cache the results of translations. The TLB is often quite small; running cpuid on your editor’s aging desktop machine yields:

使用 huge page 可以减少缺页异常（page fault）的发生次数（每次缺页异常处理可以读入更多的数据到内存中），降低从虚拟地址到物理地址的转换成本（为了获得最终映射的物理地址遍历的页表层级数会更少），这些都改进了系统性能。但其真正的优势在于可以极大地避免地址翻译。处理器转换（translate）虚拟地址时，必须经过多达四级页表，每一级的表项都可能不在缓存中（cache-cold），导致处理速度很慢。出于这个原因，处理器会维护一个 “转换后援缓冲区（translation lookaside buffer，简称 TLB）” 来缓存转换的结果。TLB 通常很小; 在我的那台老旧的台式计算机上运行 cpuid 命令可以查看到如下输出：

   cache and TLB information (2):
      0xb1: instruction TLB: 2M/4M, 4-way, 4/8 entries
      0xb0: instruction TLB: 4K, 4-way, 128 entries
      0x05: data TLB: 4M pages, 4-way, 32 entries

So there is room for 128 instruction translations, and 32 data translations. Such a small cache is easily overrun, forcing the CPU to perform large numbers of address translations. A single 2MB huge page requires a single TLB entry; the same memory, in 4KB pages, would need 512 TLB entries. Given that, it’s not surprising that the use of huge pages can make programs run faster.

以上表明在我的机器中，处理器的指令（instruction） TLB 的容量是 128 项，数据（data） TLB 的容量是 32 项。缓冲区中的条目这么少自然很容易被耗尽，这迫使处理器在整个运行过程中要花费大量的时间执行繁重的地址转换工作。如果采用大小为 2MB 的 huge page，一个 huge page 只需要对应一项 TLB 条目；同样大小的内存，换成使用 4KB 大小的页，则需要 512 项 TLB 条目。这么一对比，为何使用 huge page 可以使得程序运行得更快也就不用过多地解释了。（译者注，还是再解释一下：假设应用程序需要 2MB 的内存，如果操作系统以 4KB 作为分页的单位，则需要 512 个页框，进而在 TLB 中需要 512 个表项，同时也需要 512 个页表项，操作系统需要经历至少 512 次 TLB Miss 和 512 次缺页异常才能将 2MB 应用程序空间全部映射到物理内存；然而，当操作系统采用 2MB 作为分页的基本单位时，只需要一次 TLB Miss 和一次缺页异常，就可以为 2MB 的应用程序空间建立虚地址到物理地址之间的映射。另外如果假设不发生 TLB 项替换和 Swap 的话，程序在运行过程中也无需再经历 TLB Miss 和缺页异常。）

The main kernel address space is mapped with huge pages, reducing TLB pressure from kernel code. The only way for user-space to take advantage of huge pages in current kernels, though, is through the hugetlbfs, which was extensively documented here in early 2010. Using hugetlbfs requires significant work from both application developers and system administrators; huge pages must be set aside at boot time, and applications must map them explicitly. The process is fiddly enough that use of hugetlbfs is restricted to those who really care and who have the time to mess with it. Hugetlbfs is often seen as a feature for large, proprietary database management systems and little else.

内核地址空间本身采用 huge page 方式映射物理内存，这减少了内核代码对 TLB 的压力。然而，在当前内核中，对于用户空间来说，要想利用 huge page，唯一的方法是通过 hugetlbfs，LWN 在 2010 年年初曾经通过一系列详细的文章给大家介绍过相关内容。为了使用 hugetlbfs，需要在应用程序开发上和系统管理上涉及大量额外的工作；譬如，必须在内核启动时为 huge pages 预留出内存空间，应用程序必须在代码中明确地为 huge pages 完成映射。这个过程非常繁琐，以至于 hugetlbfs 的使用仅限于那些真正需要 huge pages 并且愿意为此付出精力的用户。Hugetlbfs 通常被视为专门为大型数据库管理系统开发的一项功能特性，用于其他方面的例子很少。

There would be real value in a mechanism which would make the use of huge pages easy, preferably requiring no development or administrative attention at all. That is the goal of the transparent huge pages (THP) patch, which was written by Andrea Arcangeli and merged for 2.6.38. In short, THP tries to make huge pages “just happen” in situations where they would be useful.

最好的方法是使得对 huge page 的使用变得方便，最好不要引入开发和系统管理上的特殊要求。这就是 Andrea Arcangeli 开发 “透明巨页（transparent huge pages，简称 THP）” 补丁的初衷，该补丁计划合入 2.6.38 版本。简而言之，THP 试图在任何使用 huge page 会带来好处的情况下自动引入对它的支持。（译者注，该补丁集随 2.6.38 正式合入。）

Current Linux kernels assume that all pages found within a given virtual memory area (VMA) will be the same size. To make THP work, Andrea had to start by getting rid of that assumption; thus, much of the initial part of the patch series is dedicated to enabling mixed page sizes within a VMA. Then the patch modifies the page fault handler in a simple way: when a fault happens, the kernel will attempt to allocate a huge page to satisfy it. Should the allocation succeed, the huge page will be filled, any existing small pages in the new page’s address range will be released, and the huge page will be inserted into the VMA. If no huge pages are available, the kernel falls back to small pages and the application never knows the difference.

当前的 Linux 内核假设在给定虚拟内存区域（virtual memory area，简称 VMA）中映射的所有物理页框大小都是相同的。但为了支持 THP，Andrea 必须首先去除这个假设：因此，该补丁集首先花费了大量的篇幅实现在 VMA 上支持混合形式的页框大小。其次，补丁集以一种简单的方式修改了缺页异常处理程序（page fault handler）的逻辑：当缺页异常发生时，内核将尝试优先分配一个 huge page。如果分配成功，内核将对该 huge page 完成填充，同时该 huge page 所对应的虚拟内存地址范围中重复映射的小页框会被释放掉，最后将该 huge page 加入到 VMA 数据结构中。如果没有可用的 huge page，内核将退而求其次，继续使用小页，以上操作对于应用程序来说是无感的（译者注，这就是 “透明” 的含义）。

This scheme will increase the use of huge pages transparently, but it does not yet solve the whole problem. Huge pages must be swappable, lest the system run out of memory in a hurry. Rather than complicate the swapping code with an understanding of huge pages, Andrea simply splits a huge page back into its component small pages if that page needs to be reclaimed. Many other operations (mprotect(), mlock(), …) will also result in the splitting of a page.

这种方案使得我们可以 “透明地” 使用 huge page，但这还不足以解决全部的问题。我们还必须支持针对 huge page 的页交换（swap），否则系统很快就会耗尽内存。为了避免引入 huge page 概念后使得页交换部分的代码变得过于复杂，Andrea 采用的方式是：一旦该 huge page 需要被回收，则简单地将其拆分成多个小页后再换出。其他的一些操作（譬如 mprotect()，mlock() 等）也会导致 huge page 被拆分。（译者注，和以上两小节所介绍的内容相关的具体代码修改可以参考 thp: transparent hugepage core。）

The allocation of huge pages depends on the availability of large, physically-contiguous chunks of memory - something which Linux kernel programmers can never count on. It is to be expected that those pages will become available at inconvenient times - just after a process has faulted in a number of small pages, for example. The THP patch tries to improve this situation through the addition of a “khugepaged” kernel thread. That thread will occasionally attempt to allocate a huge page; if it succeeds, it will scan through memory looking for a place where that huge page can be substituted for a bunch of smaller pages. Thus, available huge pages should be quickly placed into service, maximizing the use of huge pages in the system as a whole.

huge page 是否可以分配成功取决于当前系统中是否存在物理上连续的大块内存，但这对于 Linux 内核开发人员来说却不是一件总是可以被满足的事情（译者注，可分配的内存总是倾向于碎片化）。有时候我们还需要在一些特殊的时间点上申请分配 huge page，例如，在通过缺页异常处理为一个进程换入了很多小页之后（译者注，如果在缺页异常处理过程中我们不能申请到 huge page 将小页合并掉，则为了避免内存中小页数目过多，内核有必要采取一定的手段在缺页异常处理之后的某个时间点上再次尝试申请 huge page 从而将这些小页合并起来。但这个时间点并不可预期，所以原文作者称其为 “inconvenient”。）。为了解决此类问题，THP 补丁专门创建了一个名为 “khugepaged” 的内核线程。该线程会以很低的频度尝试分配一个空闲的 huge page；如果可以成功，它将扫描内存，寻找可以被该 huge page 替换的那些（多个）小页。替换成功后的 huge page 会被尽快投入使用，从而达到尽可能多地采用 huge page 来运行整个系统的目的。（译者注，有关 “khugepaged” 的具体代码修改可以参考 thp: khugepaged。）

The current patch only works with anonymous pages; the work to integrate huge pages with the page cache has not yet been done. It also only handles one huge page size (2MB). Even so, some useful performance improvements can be seen. Mel Gorman ran some benchmarks showing improvements of up to 10% or so in some situations. In general, the results were not as good as could be obtained with hugetlbfs, but THP is much more likely to actually be used.

当前补丁仅支持匿名页（anonymous pages）；还不支持对页缓存（page cache）使用 huge page。它也只处理一种 huge page 大小（2MB）。即便如此，该补丁对性能的改进仍然不可小觑。Mel Gorman 运行了一些基准测试，某些情况下显示可以获得高达 10% 左右的改善。总体来说，改进效果还比不上基于 hugetlbfs ，但 THP 在实际使用上会更受欢迎。

No application changes need to be made to take advantage of THP, but interested application developers can try to optimize their use of it. A call to madvise() with the MADV_HUGEPAGE flag will mark a memory range as being especially suited to huge pages, while MADV_NOHUGEPAGE will suggest that huge pages are better used elsewhere. For applications that want to use huge pages, use of posix_memalign() can help to ensure that large allocations are aligned to huge page (2MB) boundaries.

应用程序并不需要为了使用 THP 而做特别的改动，感兴趣的话，应用程序的开发人员也可以尝试对其使用进行优化。优化的方式包括：通过调用 madvise() 并传入 MADV_HUGEPAGE 参数可以指定对某段内存范围专门启用 huge page，而 MADV_NOHUGEPAGE 则起到相反的作用，特别指明对某段内存范围不使用 huge page。对于想要使用 huge page 的应用程序，还可以使用 posix_memalign() 确保大块内存的分配与 2MB 的 huge page 边界对齐。

System administrators have a number of knobs that they can tweak, all found under /sys/kernel/mm/transparent_hugepage. The enabled value can be set to “always” (to always use THP), “madvise” (to use huge pages only in VMAs marked with MADV_HUGEPAGE), or “never” (to disable the feature). Another knob, defrag, takes the same values; it controls whether the kernel should make aggressive use of memory compaction to make more huge pages available. There’s also a whole set of parameters controlling the operation of the khugepaged thread; see Documentation/vm/transhuge.txt for all the details.

补丁为系统管理员提供了许多用于调节的参数，这些参数都可以在 /sys/kernel/mm/transparent_hugepage 下找到。譬如 enabled 值可以设置为 “always”（表示始终使用 THP），或者 “madvise”（表示仅对标记为 MADV_HUGEPAGE 的虚拟内存地址范围使用 huge page），或者 “never”（表示禁用此功能）。另一个调节参数 defrag 接受相同的值；它控制内核是否应该积极使用内存规整（memory compaction）来获取更多空闲的 huge pages。除此之外，还有一整套用于控制 khugepaged 线程运行的参数；更多详细信息请参阅 Documentation/vm/transhuge.txt。

The THP patch has had a bit of a rough ride since being merged into the mainline. This code never appeared in linux-next, so it surprised some architecture maintainers when it caused build failures in the mainline. Some bugs have also been found - unsurprising for a patch which is this large and which affects so much core code. Those problems are being ironed out, so, while 2.6.38-rc1 testers might want to be careful, THP should be in a usable state by the time the final 2.6.38 kernel is released.

自从集成到主线以来，THP 补丁集的测试经历并不顺利。由于该部分代码从未在 linux-next 代码仓库中出现过，因此当它被合入主线中并导致内核构建失败时，曾经一度给部分架构维护人员造成混乱。测试中还发现了一些错误，但这对于一个改动量巨大并且影响了如此多核心代码的补丁来说并不是一件令人惊讶的事情。问题正在被逐个解决，因此，虽然对于当前 2.6.38-rc1 的测试人员来说，可能需要小心一些，但当最终的 2.6.38 内核发布时，THP 这个特性应该会稳定下来。（译者注，如期所愿，该补丁集随 2.6.38 正式合入了内核主线。）

请点击 LWN 中文翻译计划，了解更多详情。