泰晓科技 -- 聚焦嵌入式 Linux - 追本溯源,见微知著!


关注 @泰晓科技
招聘求职 | 实验云台 | 共享书籍 | 直播回放

LWN 550463: 更好的 Shrinker 机制

Wang Chen 创作于 2019/03/02

了解更多有关 “LWN 中文翻译计划”,请点击 这里

原文:Smarter shrinkers 原创:By Jonathan Corbet @ May. 14, 2013 翻译:By unicornx 校对:By Chumou Guo

One of the kernel’s core functions is the management of caching; by maintaining caches at various levels, the kernel is able to improve performance significantly. But caches cannot be allowed to grow without bound or they will eventually consume all of memory. The kernel’s answer to this problem is the “shrinker” interface, a mechanism by which the memory management subsystem can request that cached items be discarded and their memory freed for other uses. One of the recurring topics at the 2013 Linux Storage, Filesystem, and Memory Management Summit was the need to improve the shrinker interface. The proposed replacement is out for review, so it seems like time for a closer look.

内核的核心功能之一是缓存(cache)管理;通过维护各种级别的缓存,内核能够显著提高性能。但是要小心不能让缓存无限制地增长,否则它们最终将耗尽所有内存。为了应对该问题,内核提供了一套 “shrinker” 接口(译者注,shrink 原意有收缩、减小的意思,在这里表示通过回收内存减小缓存体积。从习惯出发不再翻译为中文),通过该机制,内存管理子系统可以(通过回调的方式)请求(相关缓存的管理者)丢弃部分缓存项,从而释放内存以供其他用途使用。在 2013 年 Linux 存储,文件系统和内存管理峰会(Linux Storage, Filesystem, and Memory Management Summit,简称 LSFMM) 上反复讨论的主题之一是有关改进 shrinker 接口的必要性。具体的 修改建议 已经提交审核,在此给大家详细介绍一下。

一套新的 shrinker 编程接口(A new shrinker API)

In current kernels, a cache would implement a shrinker function that adheres to this interface:

在当前的内核中,某种类型的缓存可以按照以下方式实现一个 shrinker (回调)函数:

#include <linux/shrinker.h>

struct shrink_control {
	gfp_t gfp_mask;
	unsigned long nr_to_scan;

int (*shrink)(struct shrinker *s, struct shrink_control *sc);

The shrink() function is packaged up inside a shrinker structure (along with some ancillary information); the whole thing is then registered with a call to register_shrinker().

内核定义了一个结构体类型 shrinker,其成员之一就是 shrink() 这个回调函数(连同一些辅助信息);我们可以通过调用 register_shrinker() 并传入该结构体来注册一个 shrinker 实例。

When memory gets tight, the shrink() function will be called from the memory management subsystem. The gfp_mask will reflect the type of allocation that was being attempted when the shrink() call was made; the shrinker should avoid any actions that contradict that mask. So, for example, if a GFP_NOFS allocation is in progress, a filesystem shrinker cannot initiate filesystem activity to free memory. The nr_to_scan field tells the shrinker how many objects it should examine and free if possible; if, however, nr_to_scan is zero, the call is really a request to know how many objects currently exist in the cache.

当内存变得紧张时,内存管理子系统会回调这个 shrink() 函数(译者注,具体的回调参考 shrink_slab 函数的逻辑,回调时会传入类型为 struct shrink_control 的参数)。(struct shrink_control 参数中的)gfp_mask 成员反映了 shrink() 函数被回调时内核被请求分配内存的方式;shrink() 函数的实现应避免任何与 gfp_mask 所包含的掩码值相矛盾的行为。举例来说,如果掩码值中指定了 GFP_NOFS,则文件系统所实现的 shrink() 函数就不应该调用文件系统的相关函数以释放内存。nr_to_scan 字段用于告诉 shrink() 函数如果可能的话,最多检查和释放缓存对象的个数;如果 nr_to_scan 的值为零,则回调函数对应的实现应该返回缓存中当前存在的对象的个数。

The use of a single callback function for two purposes (counting objects and freeing them) irks some developers; it also makes the interface harder to implement. So, one of the first steps in the new shrinker patch set is to redefine the shrinker API to look like this:

一个回调函数需要实现两个目的(统计缓存对象的个数或者释放缓存对象),这本身就不合理。因此,新的 shrinker 补丁集的第一步要做的就是重新定义 shrinker 的编程接口(API),如下所示(译者注,该部分改进随 3.12 版本合入内核,读者可以查看 相关代码):

long (*count_objects)(struct shrinker *s, struct shrink_control *sc);
long (*scan_objects)(struct shrinker *s, struct shrink_control *sc);

The roughly two-dozen shrinker implementations in the kernel have been updated to use this new API.

粗略统计可以看到内核中大约有二十几个 shrinker 需要同步更新为使用这个新的 API。

The current shrinker API is not NUMA-aware. In an effort to improve that situation, the shrink_control structure has been augmented with a new field:

当前的 shrinker API不支持 NUMA。为了改善这种情况,shrink_control 结构体中增加了一个新的字段如下:

nodemask_t nodes_to_scan;

On NUMA systems, memory pressure is often not a global phenomenon. Instead, some nodes will have plenty of free memory while others are running low. The current shrinker interface will indiscriminately free memory objects; it pays no attention to which NUMA node any given object is local to As a result, it can dump a lot of cached data without necessarily helping to address the real problem. In the new scheme, shrinkers should observe the nodes_to_scan field and only free memory from the indicated NUMA nodes.

在 NUMA 系统上,内存压力通常并不均衡。相反,一些节点(node)会拥有足够的可用内存,而其他节点则相对不足。当前的 shrinker 接口不加区分地释放内存对象;并不关心缓存对象是属于哪个 NUMA 节点的。因此,它可能释放了大量的缓存数据,却没有解决实际(内存不足的节点上)的问题。在新方案中,shrinkers 应根据 nodes_to_scan 字段仅释放指定 NUMA 节点的内存。

LRU 链表(LRU lists)

A maintainer of an existing shrinker implementation may well look at the new NUMA awareness requirement with dismay. Most shrinker implementations are buried deep within filesystems and certain drivers; these subsystems do not normally track their cached items by which NUMA node holds them. So it appears that shrinker implementations could get more complicated, but that turns out not to be the case.

由于新的实现要求考虑 NUMA,这可能会让现有 shrinker 的维护者感到些许沮丧。大多数的 shrinker 实现深入到文件系统内部和某些驱动程序中;而这些子系统通常不记录它们的缓存项属于哪个 NUMA 节点。因此看起来 shrinker 的实现可能会变得更复杂,但补丁集考虑到了该问题,为实现 shrinker 提供了便利。

While looking at the shrinker code, Dave Chinner realized that most implementations look very much the same: they maintain a least-recently-used (LRU) list of cached items. When the shrinker is called, a pass is made over the list in an attempt to satisfy the request. Much of that code looked well suited for a generic replacement; that replacement, in the form of a new type of linked list, is part of the larger shrinker patch set.

在查看 shrinker 的代码时,Dave Chinner 意识到大多数回调函数的实现看起来非常相似:它们都会采用一个 “最近最少使用(least-recently-used,简称 LRU)” 链表保存缓存项。当 shrinker 被回调时,它们都会遍历该链表以完成相应的请求。这些代码看起来非常适合提取出来实现为公共代码;而这个 shrinker 补丁集的另一个重要的改动就是以公共代码的形式提供了一个新的链表。(译者注,该部分改动随 3.12 版本合入内核。)

The resulting “LRU list” data structure encapsulates a lot of the details of object cache management; it goes well beyond a simple ordered list. Internally, it is represented by a set of regular list_head structures (one per node), a set of per-node object counts, and per-node spinlocks to control access. The inclusion of the spinlock puts the LRU list at odds with normal kernel conventions: low-level data structures do not usually include their own locking mechanism, since that locking is often more efficiently done at a higher level. In this case, putting the lock in the data structure allows it to provide per-node locking without the need for NUMA awareness in higher-level callers.

基于此目新增的 “LRU 链表” 数据结构(译者注,即 struct list_lru_node)封装了很多对象缓存管理的细节;其实现远远超出了一个简单的有序链表。该结构体由以下几部分组成,一个常规的 list_head 结构体成员(struct list_head list),一个对象计数(long nr_items)和一个用于访问控制的自旋锁(spinlock_t lock),这个结构体(struct list_lru_node)每个节点(node)一个。结构体中的自旋锁成员使得这个 LRU 链表的实现与正常的内核链表实现与众不同:这么做的考虑是底层的数据结构通常不需要包括它们自己的锁机制,在较高的层次级别实现往往更有效。基于这个考虑,将锁放在这个数据结构中允许它提供基于节点级别的锁定,而无需在更高级别的调用者中考虑对 NUMA 的保护。

The basic API for the management of LRU lists is pretty much as one might expect:

补丁提供了常见的一组基本 API 帮助人们管理 LRU 链表:

#include <linux/list_lru.h>

int list_lru_init(struct list_lru *lru);
int list_lru_add(struct list_lru *lru, struct list_head *item);
int list_lru_del(struct list_lru *lru, struct list_head *item);

A count of the number of items on a list can be had with list_lru_count(). There is also a mechanism for walking through an LRU list that is aimed at the needs of shrinker implementations:

可以使用 list_lru_count() 计算链表中缓存项的数量 。补丁还提供了一种遍历 LRU 链表的机制(如下),方便 shrinker 的实现:

unsigned long list_lru_walk(struct list_lru	*lru, 
			    list_lru_walk_cb isolate,
			    void *cb_arg,
			    unsigned long nr_to_walk);
unsigned long list_lru_walk_nodemask(struct list_lru *lru, 
			    list_lru_walk_cb isolate,
			    void *cb_arg,
			    unsigned long nr_to_walk,
			    nodemask_t *nodes_to_walk);

Either function will wander through the list, calling the isolate() callback and, possibly, modifying the list in response to the callback’s return value. As one would expect, list_lru_walk() will pass through the entire LRU list, while list_lru_walk_nodemask() limits itself to the specified nodes_to_walk. The callback’s prototype looks like this:

这两个函数都会遍历列表(译者注,在最终的提交中 list_lru_walk_nodemask() 函数更名为 list_lru_walk_node()),在遍历过程中会回调 isolate() 函数,并可能根据回调函数的返回结果修改链表本身。正如函数名所指出的那样,list_lru_walk() 会遍历整个 LRU 链表,而 list_lru_walk_nodemask() 函数仅会遍历 nodes_to_walk 指定的节点。回调函数的原型如下所示:

typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, 
					    spinlock_t *lock,
					    void *cb_arg);

Here, item is an item from the list to be examined, lock is the spinlock controlling access to the list, and cb_arg is specified by the original caller. The return value can be one of four possibilities, depending on how the callback deals with the given item:

这里,item 是要检查的链表中的项(译者注,即遍历过程中的每一项),lock 是控制对链表的访问的自旋锁,并且 cb_arg 是调用形如 list_lru_walk() 函数时传入的 cb_arg 参数。返回值是如下四种可能之一,具体取决于回调函数如何处理给定的项:

  • LRU_REMOVED indicates that the callback removed the item from the list; the number of items on the list will be decremented accordingly. In this case, the callback does the actual removal of the item.
  • LRU_ROTATE says that the given item should be moved to the (“most recently used”) end of the list. The LRU list code will perform the move operation.
  • LRU_RETRY indicates that the callback should be called again with the same item. A second LRU_RETRY return will cause the item to be skipped. A potential use for this return value is if the callback notices a potential deadlock situation.
  • LRU_SKIP causes the item to be passed over with no changes.
  • LRU_REMOVED 表示回调函数从链表中删除了该项;链表中的项目的总数需要相应减少。对于这种情况,项目的实际删除由回调函数完成。
  • LRU_ROTATE 表示应将给定项目移动到列表的 “最近使用”(“most recently used”)端。LRU 链表代码将执行实际的移动操作。
  • LRU_RETRY 表示应该使用相同的项再次调用回调函数。如果第二次调用后依然返回 LRU_RETRY 则忽略该返回值(译者注,避免死循环)。回调函数有可能在检查到 “死锁” 发生时会返回此值。
  • LRU_SKIP 表示当前项被略过,没有任何更改。

With this infrastructure in place, a lot of shrinker implementations come down to a call to list_lru_walk_nodemask() and a callback to process individual items.

以上框架就绪后,shrinker 要做的就是实现自己的回调函数并调用 list_lru_walk_nodemask() 函数(对缓存项进行遍历)。

LRU 链表对内存控制组的支持(Memcg-aware LRU lists)

While an improved shrinker interface is well worth the effort on its own, much of the work described above has been driven by an additional need: better support for memory control groups (memcgs). In particular, memcg developer Glauber Costa would like to be able to use the shrinker mechanism to free only memory that is associated with a given memcg. All that is needed to reach this goal is to expand the LRU list concept to include memcg awareness along with NUMA node awareness.

对 shrinker 接口的改进意义重大,但对于上面介绍的大部分工作来说,当初的出发点却是为了另一个目的:就是能够对内存控制组(memory control group,下文简称 memcg)实现更好的支持。特别地,memcg 的开发人员 Glauber Costa 希望能够使用 shrinker 机制来释放与给定 memcg 相关的内存。为了实现此目标,所需要做的就是在扩展 LRU 链表概念过程中除了记录 NUMA 节点的相关信息外还要记录 memcg 的相关信息。

The result is a significant reworking of the LRU list API. What started as a simple list with some helper functions has now become a two-dimensional array of lists, indexed by node and memcg ID. A call to list_lru_add() will now determine which memcg the item belongs to and put it onto the relevant sublist. There is a new function — list_lru_walk_nodemask_memcg() — that will walk through an LRU list, picking out only the elements found on the given node(s) and belonging to the given memcg. The more generic functions described above have been reimplemented as wrappers around the memcg-specific versions. At this point, the “LRU list” is no longer a generic data structure (though one could still use it that way); it is, instead, a core component of the memory management subsystem.

改造的结果是重新定义了 LRU 链表的 API(译者注,对 memcg 的支持并没有和前面介绍的修改一起合入 3.12 版本,而是直到 4.0 版本才合入内核主线,具体参考其代码提交 commit list_lru: introduce per-memcg lists)。从原先只有一个简单的链表加上一些辅助函数,现在变成一个二维的链表数组,数组下标索引为 node ID 和 memcg ID。调用 list_lru_add() 函数时现在需要确定添加项所属的 memcg 从而可以将其放入相关的子链表中。新增的函数,譬如 list_lru_walk_nodemask_memcg() ,调用它将遍历 LRU 链表,仅挑选不仅属于给定 node 还属于给定 memcg 的元素。上面介绍的更通用的函数已经重新实现为 memcg 特定版本的封装函数。改造完成后,“LRU 链表” 已不再是一个通用的数据结构(尽管仍然可以和以前一样使用它);相反,它已成为内存管理子系统的一个核心组件。

结束语(Closing notes)

A review of the current shrinker implementations in the kernel reveals that not all of them manage simple object caches. In many cases, what is happening is that the code in question wanted a way to know when the system is under memory pressure. In current kernels, the only way to get that information is to register a shrinker and see when it gets called. Such uses are frowned upon; they end up putting marginally related code into the memory reclaim path.

通过对内核中当前 shrinker 实现的代码进行审阅,我们发现其中一些 shrinker 对缓存对象的管理并不简单。在许多情况下,代码需要一种方法来了解系统何时内存压力比较大。在当前的内核中,获取该信息的唯一方法是注册一个 shrinker 并查看它何时被回调。这种方式是有问题的;因为这会导致内存的回收路径中被引入一些无关的逻辑(译者注,任何人都可能通过回调注册的方式在内核核心处理逻辑中 “注入” 自己的处理逻辑,无论是恶意的或是善意的)。

The shrinker patch set seeks to eliminate those users by providing a different mechanism for code that wants to learn about memory pressure. It essentially hooks into the vmpressure mechanism to set up an in-kernel notification mechanism, albeit one that does not use the kernel’s usual notifier infrastructure. Interested code can call:

shrinker 补丁集旨在通过为想要了解内存压力的用户提供一种不同的途径来避免上述问题。它本质上是通过 vmpressure 机制 所提供的 钩子(hook,译者注,本质上也是一种回调)函数方式来建立内核态的通知机制,尽管该机制并没有使用内核提供的通用通知器(notifier)架构。感兴趣的代码可以通过调用以下函数注册 hook:

int vmpressure_register_kernel_event(struct cgroup *cg, void (*fn)(void));

The given fn() will be called at the same time that pressure notifications are sent out to user space. The concept of “pressure levels” has not been implemented for the kernel-side interface, though.

fn() 函数(译者注,即上文提到的 hook)会在内存压力状态被通知到用户空间的同时被调用。但是,“压力级别”(“pressure levels”)的概念尚未针对内核侧接口实现。

Most of this code is relatively new, and it touches a fair amount of core memory management code. The latter stages of the patch set, where memcg awareness is added, could be controversial, but, then, it could be that developers have resigned themselves to memcg code being invasive and expensive. One way or another, most or all of this code will probably find its way into the mainline; the benefits of the shrinker API improvements will be nice to have. But the path to the mainline could be long, and this patch set has just begun, so it may be a while before it is merged.

补丁的大部分代码相对较新,触及了大量的核心内存管理代码。后期阶段补丁集添加的有关支持 memcg 的改动可能会引起争议,但是,开发人员自己也觉得对 memcg 的改动过大所以回退了这部分修改。总之,补丁修改的大部分或全部总会找到自己的方式进入内核主线;特别地,对内核来说,对 shrinker API 的改进还是很有好处的。合入主线的道路可能还很长,考虑到这个补丁集刚刚开始,让我们拭目以待。(译者注,该补丁集最终随 3.12 版本合入内核主线。)

了解更多有关 “LWN 中文翻译计划”,请点击 这里

Read Album:

Read Related:

Read Latest: