泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!


LWN 712467: 页缓存(page cache)的未来

Wang Chen 创作于 2019/01/26

请点击 LWN 中文翻译计划,了解更多详情。

原文:The future of the page cache 原创:By corbet @ Jan. 25, 2017 翻译:By unicornx of TinyLab.org 校对:By Benny Zhao

The promise of large-scale persistent memory has forced a number of changes in the kernel and has raised questions about whether the kernel’s page cache will be needed at all in the future. In his linux.conf.au 2017 talk, Matthew Wilcox asserted that not only do we still need the page cache, but that its role should be increased. First, though, there is the small matter of correcting a mistake made by a certain Mr. Wilcox a couple of years ago.

随着大容量持久化存储设备(persistent memory,译者注,这里指类似 NVDIMM 这样的存储设备,下文直接使用不再翻译)的逐渐普及和推广,迫使内核随之发生了许多变化,并引发了对未来内核中是否还需要页面缓存(page cache,译者注,下文直接使用不再翻译)的质疑。Matthew Wilcox 在 2017 年澳洲 Linux 峰会(linux.conf.au 2017)上发表的主题演讲中声称,我们不仅仍然需要 page cache,还应该继续提升其重要性。不过,首先需要先纠正几年前他本人犯下的一个错误。

This was, he started, his first talk ever as a Microsoft employee — something he thought he would never find himself saying. He then launched into his topic by saying that computing is all about caching. His new laptop can execute 10 billion instructions per second, but only as long as it doesn’t take a cache miss. Memory on that system can only deliver 530 million cache lines per second, so it doesn’t take many cache misses to severely impact its performance. Things get even worse if the data you want isn’t cached in main memory and has to be read from a storage device, even a fast solid-state device.

演讲中 Wilcox 首先开玩笑说,作为一名微软的员工,这是他第一次在 Linux 大会上发言,而从前他一直以为这是不可能的事情。Wilcox 在演讲中提出,计算机运行的快慢完全取决于缓存效率(computing is all about caching)。他举例说如果缓存全部命中,他的新笔记本电脑每秒可以执行一百亿条指令。由于该系统上的主存每秒只能传输五亿三千万个缓存行,因此只要有很少的缓存未命中就会严重影响其性能。如果你想要的数据没有缓存在主存中,以至于还必须从二级存储设备(即使是快速固态设备)中读取,那么情况只会变得更糟。

It has always been that way; a PDP-11 was also significantly slowed by cache misses. But the problem is getting worse. CPU speeds have increased more than memory speeds, which, in turn, have increased more than storage speeds. The cost of not caching your data properly is thus going up.

缓存问题一直都在困扰着我们;同样的缓存未命中问题当年也显著拖累了 PDP-11 的执行速度(译者注,PDP-11 是 DEC 于 1970 到 1980 年代所销售的一款 16位小型计算机,UNIX 的第一个版本就是在这款计算机上开发出来的)。几十年过去了,这个问题反而变得愈加恶化。因为处理器的发展速度比一级存储(译者注,即我们常说的主存 RAM)快,而同时一级存储的发展速度又比二级存储(译者注,譬如硬盘)快,所以缓存未命中所带来的性能损失也日益严重。

页缓存(The page cache)

Unix systems have had a buffer cache, which sits between the filesystem and the disk for the purpose of caching disk blocks in memory, for a long time. While preparing the talk, he went back to look at Sixth-edition Unix (released in 1975) and found a buffer cache there. Linux has had a buffer cache since the beginning. In the 1.3.50 release in 1995, Linus Torvalds added a significant innovation in the form of the page cache. This cache differs from the buffer cache in that it sits between the virtual filesystem (VFS) layer and the filesystem itself. With the page cache, there is no need to call into filesystem code at all if the desired page is present already. Initially, the page and buffer caches were entirely separate, but Ingo Molnar unified them in 1999. Now, the buffer cache still exists, but its entries point into the page cache.

缓冲区缓存(buffer cache,译者注,下文直接使用不再翻译)在 Unix 系统中已经存在有很长一段时间了,它位于文件系统和磁盘之间,用于长时间缓存磁盘的块数据。在 Wilcox 为他的演讲做准备的过程中,他甚至回头查阅了 Unix 第六版(1975年发布)的源码,并找到了 buffer cache 的雏形。Linux 一开始就有 buffer cache。在 1995 年发行的 1.3.50 版本中,Linus Torvald 引入了一个重大创新,即 page cache。page cache 与 buffer cache 的区别在于,它是位于虚拟文件系统层(VFS)与具体的文件系统之间。有了 page cache,如果所需的页已经存在,则根本无需调用文件系统的代码。最初,page cache 和 buffer cache 是完全独立的,但是 Ingo Molnar 在 1999 年统一了它们。现在,buffer cache 仍然存在,但是其内容实际是指向 page cache。

The page cache has a great deal of functionality built into it. There are some obvious functions, like finding a page at a given index; if the page doesn’t exist, it can be created and optionally filled from disk. Dirty pages can be pushed back to disk. Pages can be locked, unlocked, and removed from the cache. Threads can wait for changes in a page’s state, and there are interfaces to search for pages in a given state. The page cache is also able to keep track of errors associated with persistent storage.

page cache 支持很多功能,比较显著的包括:比如可以通过给定的索引查找页框。如果页框不存在,则创建之并适时从磁盘填充相应的内容。脏页可以被写回磁盘,页可以被锁定,解锁,以及从缓存中删除。线程可以等待一个页的状态变化,也可以通过接口搜索给定状态的页,page cache 还可以用于追踪持久存储(persistent storage)设备发生的故障等等。

Matthew Wilcox

Locking for the page cache is handled internally. There tends to be disagreement in the kernel community over the level at which locking should be handled; in this case it has been settled in favor of internal locking. There is a spinlock to control access when changes are being made to the page cache, but lookups are handled using the lockless read-copy-update (RCU) mechanism.

对 page cache 的锁定(locking)在该机制内部进行处理。对于到底应该在哪个级别上处理锁定,内核社区一直存在分歧;当前的解决方式还是在 page cache 内部进行处理。当对 page cache 进行更改时,通过一个自旋锁来控制访问,但对 page cache 的查找则是基于无锁方式的 RCU(read-copy-update)机制来进行。

Caching is the art of predicting the future, he said. When the cache grows too large, various heuristics come into play to decide which pages should be removed. Pages used only once are likely to not be used again, so those are kept in the “inactive” list and pushed out relatively quickly. A second use will promote a page from the inactive list to the active list. Unused pages eventually age off the active list and are put back onto the inactive list. Exceptional “shadow” entries are used to track pages that have fallen off the end of the inactive list and have been reclaimed; these entries have the effect of lengthening the kernel’s memory about pages that were used in the relatively distant past.

使用缓存(caching)就意味着会涉及各种预测技术。具体来说,为了避免缓存变得过大,需要引入各种经验(heuristics)算法来选择哪些缓存页应当被回收。仅被使用了一次的缓存页很可能不会被再使用,因此它们将被放在 “不活跃” 链表(inactive list,译者注,下文直接使用不再翻译)中从而会被相对较快地回收。如果某个缓存页又被再次访问了,则内核会将它从 inactive list 转移到 “活跃” 链表(active list,译者注,下文直接使用不再翻译),active list 上的页也会因为超时而被移到 inactive list。值得注意的是,内核维护了一个额外的 “影子” 条目(”shadow” entries) 用于跟踪已脱离 inactive list 并已被回收的页,这些条目使得内核可以继续跟踪那些曾经使用过(但目前已经被回收)的页框并进行相关的处理(译者注,以上描述和 Refault Distance 算法的实现有关)。

Huge pages have been a challenge for the page cache for a while. The kernel’s transparent huge page feature initially only worked with anonymous (non file-backed) memory. There are good reasons for using huge pages in the page cache, though. Initial work in this area simply adds a large set of single-page entries to the page cache to correspond to a single huge page. Wilcox concluded that this approach was “silly”; he enhanced the radix tree code, used to track pages in the page cache, to be able to handle huge-page entries directly. Pending patches will cause the page cache to use a single entry for huge pages.

曾经有一段时间,巨页(huge page,译者注,下文直接使用不再翻译)的引入对 page cache 造成了一定的压力。内核的 透明巨页(transparent huge page)功能 最初只能用于匿名(anonymous,即没有对应文件系统上的文件(non file-backed))内存。但是,使用 huge page 来实现 page cache 是大势所趋。一开始为了支持在 page cache 中使用 huge page, 采用的方法是将一个 huge page 所对应的多个单个的 page 分别存放在(基数树(radix tree)的)独立的节点上。Wilcox 评价这种做法是 “愚蠢的”;为此他 改进了用于查找 page cache 中页框的基数树(radix tree)代码,以便使用单个的节点就能够直接对应一个 huge page,相关的补丁还在开发过程中。

我们还需要页缓存吗?(Do we still need the page cache?)

Recently, Dave Chinner asserted that there was no longer a need for a page cache. He noted that the DAX subsystem, initially created by Wilcox to provide direct access to file data stored in persistent memory, bypasses the page cache entirely. “There is nothing like having your colleagues question your entire motivation”, Wilcox said. There are people who disagree with Chinner, though, including Torvalds, who popped up in a separate forum saying that the page cache is important because good things don’t come from having low-level filesystem code in the critical path for data access.

最近,Dave Chinner 宣称 page cache 将逐渐退出历史舞台。他指出,最初由 Wilcox 创建的 DAX 子系统用于直接访问存储在 persistent memory 中的文件数据,已经完全绕过了 page cache。“没有什么比让你的同事质疑你的整个想法更糟糕的事情了”,Wilcox 说。原因是一些人表达了和 Chinner 不同的意见,甚至包括 Torvalds,他在另一个论坛中 指出 page cache 真的很重要,因为在数据访问的关键路径上,性能问题的解决从来不会出自低层的文件系统代码(译者注,参考 Torvalds, 即 Linus 本人在论坛上的 文字原文,其意思大致是说 page cache 作为位于各个文件系统之上的公共层,可以更加高屋建瓴地对读写效率进行优化,而这是众多相对下层的文件系统模块所做不到的。而 DAX 子系统应该是位于文件系统的层次)。

With that last statement in mind, Wilcox delved into how an I/O request using DAX works now. He designed the original DAX code and, in so doing, concluded that there was no need to use the page cache. That decision, he said, was wrong.

Linus 先生的话触动了 Wilcox ,他再次深入研究了使用 DAX 执行读写的整个流程。正是他设计了最初的 DAX 代码,并在此过程中得出不需要使用 page cache 的结论。但现在他发现,当初的这个决定是错误的。

In current kernels, when an application makes a system call like read() to read some data from a file stored in persistent memory, DAX gets involved. Since the requested data is not present in the page cache, the VFS layer calls the filesystem-specific read_iter() function. That, in turn, calls into the DAX code, which will call back into the filesystem to turn the file offset into a block number. Then the block layer is queried to get the location of that block in persistent memory (mapping it into the kernel’s address space if need be) so that the block’s contents can be copied back to the application.

在当前的内核中,当一个应用调用类似于 read() 的系统调用从 persistent memory 上读取文件内容时,DAX 会介入。由于请求的数据不在 page cache 里,所以 VFS 层会调用文件系统特定的 read_iter() 函数。这会进入 DAX 代码,它将回调文件系统的函数将文件的偏移量(offset)转换为块索引(block number),然后查询块层(block layer)以获取 persistent memory 中该块的位置(必要的话,还会将其映射到内核地址空间),最终使得块的内容可以被拷贝回应用。

That is “not awful”, but it should work differently, he said. The initial steps would be the same, in that the read_iter() function would still be called, and it would call into the DAX code. But, rather than calling back into the filesystem, DAX should call into the page cache to get the physical address associated with the desired offset in the file. The data is then copied back to user space from that address. This all assumes that the information is already present in the page cache but, when that is the case, the low-level filesystem code need not get involved at all. The filesystem had already done the work, and the page cache had cached the result.

他说,这么做 “并没有错”,但应该换成另一种(更好的)方式工作。初始步骤是相同的​​,仍然会调用 read_iter() 函数,然后它将调用 DAX 的代码。但是,不同的是,DAX 此时无需回调回文件系统,而是应该利用 page cache 的逻辑来将文件偏移量转换为物理地址,然后直接从该地址开始将数据复制回用户空间。这一切都假定所需信息已经存在于 page cache 中,而且在这种情况下,根本无需涉及底层的文件系统代码。文件系统早已完成它该做的事情,page cache 中也已经缓存了我们需要的数据。

When Torvalds wrote the above-mentioned post about the page cache, he said:

It's also a major disaster from a locking standpoint: trust me, if you think your filesystem can do fine-grained locking right when it comes to things like concurrent lookup of pathnames, you're living in a dream world.

在上述 Torvalds 发表的有关 page cache 言论的帖子中还说到:


This, Wilcox said, was “so right”; the locking in DAX has indeed been disastrous. He originally thought it would be possible to get away with relatively simple locking, but complexity crept in with each new edge case that was discovered. DAX locking is now “really ugly” and he is sorry that he made the mistake of thinking that he could bypass the page cache. Now, he said, he has to fix it.

Wilcox 承认,Linux 先生的看法 “完全正确”;DAX 中锁的实现是灾难性的。他原本以为可以通过相对简单的锁来解决问题,但针对不断出现的边界条件的异常处理使得问题愈来愈复杂。DAX 中的锁机制现在 “真的非常丑陋”,对于他曾经错误地认为可以绕过 page cache 这件事,Wilcox 表示非常遗憾。他说,现在他必须设法解决这个问题。

未来的工作(Future work)

He concluded with a number of enhancements he would like to see made around DAX and the page cache. The improved huge-page support mentioned above is one of them; that is already sitting in the -mm tree and should be done soon. The use of page-frame numbers instead of page structures has been under discussion for a while since there is little desire to make the kernel store vast numbers of page structures for large persistent memory arrays.

Wilcox 总结了一些他所期望的,围绕 DAX 和 page cache 可以改进的功能。前文提到的对 huge page 支持的改进就是其中之一;相关改动已经合入 -mm 代码仓库,应该很快可以完成。关于使用页框索引(page-frame number, 简称 PFN)代替 page 结构体类型的 讨论 已经有一段时间了,因为让内核在这些大型 persistent memory 设备阵列中保存大量的 page 结构体是完全没有必要的。

He would like to revisit the idea of filesystems with a block size larger than the system’s page size. That is something that people have wanted for many years; now that the page cache can handle more than one page size, it should be possible. “A simple matter of coding”, he said. He is looking for other interested developers to work with on this project.

另外他想重新考虑一下如何在文件系统中支持块(block)的尺寸大于系统页(page)尺寸,这是人们多年来梦寐以求的东西。既然 page cache 已经可以处理巨页,那么这个想法应该是有希望的。按照他的理解,“剩下的工作只不过是简单的编码”。他正在寻找其他感兴趣的开发人员一起来做这个项目。

Huge swap entries are also an area of interest. We have huge anonymous pages in memory but, when it comes time to swap them out, they get broken up into regular pages. “That is probably the wrong answer”. There is work in improving swap performance, but it needs to be reoriented toward keeping huge pages together. That might help with the associated idea of swapping to persistent memory. Data in a persistent-memory swap space can still be accessed, so it may well make sense to just leave it there, especially if it is not being heavily modified.

巨页交换(Huge swap entries)也是一个令人感兴趣的方向。我们已经在内存中使用了匿名巨页,但当这些巨页被换出时,它们又被分解成普通尺寸的页。“这看上去不是正确的做法”,目前已经有一些提升交换性能的工作在进行,正确的处理应该是让交换出去的巨页保持不被分解。这同样也有助于我们将内存交换到 persistent memory。因为在 persistent memory 上的交换空间中的数据仍然可能会被访问,因此将其保留为巨页形式应该是有意义的,尤其是当数据没有被大量修改的时候。

The video of this talk, including a bonus section on page-cache locking, is available.

最后附上本次演讲的视频,包括附加的有关 page cache 加锁的介绍。

请点击 LWN 中文翻译计划,了解更多详情。

Read Album:

Read Related:

Read Latest: