泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!


LWN 549580: 3.10 版本开始支持(接近)完全无周期时钟(full tickless)

Wang Chen 创作于 2018/07/16

原文:(Nearly) full tickless operation in 3.10 原创:By corbet @ May. 8, 2013 翻译:By unicornx 校对:By guojian-at-wowo

On a typical Linux system, each running CPU will be diverted between 100 and 1000 times each second by the periodic timer interrupt. That interrupt is the CPU’s cue to reconsider which process should be running, catch up with read-copy-update (RCU) callbacks, and generally handle any necessary housekeeping. This periodic “tick” can be reasonably compared to the infamous big kernel lock (BKL): it is convenient to have around, but it also has an effect on performance that makes developers wish to abolish it. The key difference might be that getting rid of the timer tick has taken rather longer than was required to eliminate the BKL. The 3.10 kernel will take an important step in that direction, though, with the addition of the “full NOHZ” mode — but a lot of limitations still apply.

在一个典型的 Linux 系统中,每个运行中的处理器在一秒钟的时间内会接收到最少 100 次,最多 1000 次周期性时钟中断。在每次中断中,内核要处理很多事情,包括任务调度、完成 read-copy-update(RCU)的回调函数、以及其他一些必要的日常事务。这就是我们所熟知的“周期性时钟中断处理”(下文简称 “tick”),和 “声名远扬” 的大内核锁(big kernel lock,下文简称 BKL)一样,它给内核的运行带来了便利,但由于该中断对系统的性能有影响,所以也有一些开发人员倾向于从内核中移除该机制。这样的想法(指废除周期性时钟中断机制的想法)甚至比移除 BKL 的提议还要早一些。时至今日,虽然还存在诸多限制,但内核终于朝着这个方向迈出了重要的一步,从 3.10 版本开始,开始支持新的 “full NOHZ” 模式(译者注:即下文将要提到的完全无周期时钟(Full tickless)模式)。

Linux has had a partial solution to the timer tick problem for years in the form of the CONFIG_NO_HZ configuration option. If that option is set, the timer tick will be turned off, but only when the CPU is idle. This mode improves the situation considerably; it allows idle CPUs to stay in deep sleep states, reducing power use. Systems with virtualized guests also benefit, since, otherwise, each guest would be servicing timer interrupts when it should be doing nothing. In short: disabling the timer tick when the processor is idle makes enough sense that most distributions do it by default.

针对 tick 问题,好几年前,Linux 已经提供了一个部分解决方案。实际使用时可以通过选择配置选项 CONFIG_NO_HZ 来启用该功能,启用后,当且仅当处理器进入空闲态时内核会试图关闭时钟中断。这样可以使空闲的处理器进入深度睡眠,极大地节省了能耗。运行虚拟客户机的系统会从中受益,否则每个客户机即使在系统空闲时也要处理定时器中断。总之:在处理器闲置时禁用时钟中断有害无益,所以大多数发行版都在缺省配置中使能该选项。

Indeed, given that letting sleeping CPUs lie is generally a good policy, one might wonder why this behavior is optional at all. The answer is that it increases the cost of moving into and out of the idle state, (very) slightly increasing the time it takes to get an idle CPU back to work. That cost may be considered excessive in highly latency-sensitive environments. For everybody else, disabling the timer tick for idle CPUs is almost certainly the right thing to do; for battery-powered systems that is doubly true.


The next step — disabling the tick for non-idle processors — is a lot more work with a smaller reward, so it is not surprising that it has taken a while to come about. Frederic Weisbecker finally took up the challenge in 2010; after a lot of changes and help by others (Paul McKenney made some significant RCU changes, for example), this work has been pulled into the 3.10 kernel.

在此基础之上(译者注:指 CONFIG_NO_HZ 所提供的针对空闲处理器禁用 tick)再进一步,即对非空闲的处理器也尝试禁用 tick,看起来似乎只是一个小小的改进,但实际实现起来的难度并不小,颇费一番周折。这个挑战最终由 Frederic Weisbecker 于 2010 年承接下来;在经历了多次修改以及在众多内核骇客的帮助之下(例如 Paul McKenney 针对 RCU 进行了一些重要的修改),这项改进终于被合入了 3.10 版本。

In 3.10, the CONFIG_NO_HZ option has been replaced by a three-way choice:

  • CONFIG_HZ_PERIODIC is the old-style mode wherein the timer tick runs at all times.
  • CONFIG_NO_HZ_IDLE (the default setting) will cause the tick to be disabled at idle, the way setting CONFIG_NO_HZ did in earlier kernels.
  • CONFIG_NO_HZ_FULL will enable the “full” tickless mode.

在 3.10 版本的内核中,CONFIG_NO_HZ 选项被分解为三个选项:

  • CONFIG_HZ_PERIODIC:传统模式,时钟中断始终运行。
  • CONFIG_NO_HZ_IDLE:(默认开启)处理器空闲时会尝试禁止时钟中断,对应早期版本中的 CONFIG_NO_HZ 选项。
  • CONFIG_NO_HZ_FULL:将启用 “完全” 的无时钟中断(”full” tickless)模式(译者注:从术语和方便出发,下文直接使用 Full tickless,对此不再翻译为中文)。

The build-system code has been set up so that “make oldconfig” on 3.10 should yield a configuration that matches the previous setting of CONFIG_NO_HZ with no intervention required. Full tickless mode defaults to off; selecting that mode will enable tasks to run without the timer tick, but there are a number of things to be aware of.

3.10 版本的内核编译构造系统支持运行 “make oldconfig” 后可以自动产生一个兼容老版本的 CONFIG_NO_HZ 选项的配置文件。Full tickless 模式缺省情况下是关闭的,选择该模式后支持任务在没有时钟中断情况下运行,但有一些问题需要注意。

Among those are the requirement that the CPUs available for running without a timer tick must be designated at boot time using the nohz_full= command-line parameter. The boot CPU cannot run in this mode — at least one CPU needs to continue to receive interrupts and do the necessary housekeeping. The CONFIG_NO_HZ_FULL_ALL configuration option causes all CPUs (other than the boot CPU) to run in the full tickless mode by default; it can still be overridden with nohz_full=, though. The set of full tickless CPUs cannot be changed after boot; the amount of work required to make that possible would be large, and there does not seem to be a pressing need for this ability.

其中一个需要注意的是:内核启动时可以在命令行中通过 nohz_full= 参数指定需要对哪些处理器尝试在运行时关闭时钟定时器中断。但不能对用于引导内核的处理器使用此模式,因为至少需要保留一个处理器继续接收中断并执行必要的日常管理事务。CONFIG_NO_HZ_FULL_ALL 选项默认情况下会针对所有的 CPU(引导内核的 CPU 除外)启用 Full tickless 模式;但如果在命令行中指定了 nohz_full= 参数则该选项会被覆盖(overridden)。目前不支持内核启动后重新指定启用 Full tickless 模式的处理器;一来实现这个功能所涉及的改动量太大,二来似乎也没有迫切的需求需要这种能力。

Even then, a running CPU will only disable the timer tick if there is a single runnable process on that CPU. As soon as a second process appears, the tick is needed so that the scheduler can make the necessary time-slice decisions. And even with a single runnable process, it is not technically tickless, since the timer tick still needs to happen at least once per second to keep the scheduler happy. But dropping from as high as 1000Hz to 1Hz is obviously a significant improvement. Response-time jitter due to timer interrupts will be nearly eliminated, and, according to Ingo Molnar, as much as 1% of the CPU’s time will be saved.

即便如此,当且仅当某个处理器上只有一个可运行的任务时,内核才会对该处理器禁用时钟中断。一旦出现第二个任务,时钟中断就得恢复,以便调度器可以利用中断安排时间片。实际上,即使只有一个可运行的任务,也不能完全停止时钟中断,为了保证调度器可以正常工作,时钟中断至少需要每秒发生一次。但是从每秒 1000 次的频率下降到每秒 1 次俨然是很大的进步了。时钟中断所导致的响应时间的抖动基本可以消除,而且,根据 Ingo Molnar 的说法,如此一来可以节省处理器至少“百分之一”的处理时间。

There are workloads out there that will benefit significantly from those improvements. High-performance computing (HPC) and realtime are obvious candidates; in both cases, dedicating a CPU to a single task is a fairly common tactic already. But, in an era where even phones have quad-core processors, having a single runnable process on a given CPU is not an uncommon situation.

有些工作将会从这个改进中获益匪浅。最典型的,譬如高性能计算(High-performance computing,简称 HPC)和实时性(realtime)应用;在这两种情况下,将某个处理器专门分配给单个任务使用是一种相当常见的策略。而且在移动应用上,即使手机已经进入四核处理器的时代,也经常会遇到在某个处理器上只有单个任务在运行的场景。

There are a lot more details to making full tickless operation work properly; setting up a system to use this feature requires a fair amount of fiddling at this time. At a minimum, the administrator should make extensive use of CPU affinities to keep unwanted processes (including kernel threads) off the relevant processors. Some RCU configuration is required as well; see Documentation/timers/NO_HZ.txt for lots of details on the various options.

如果要让 full tickless 更有效地工作还有更多的细节需要考虑;目前,要使用好这个功能需要对系统进行很多的调整和设置。至少,系统管理员需要深入地调配任务和处理器之间的关系,避免不需要的任务(包括内核线程)在相关处理器上运行。除此之外,还需要配置一些 RCU 的选项;更详细的信息,请参阅 Documentation/timers/NO_HZ.txt

Full tickless operation, as seen in 3.10, is clearly a significant step forward, but, equally clearly, this project is not yet complete. There is a fair amount of detail work to be done, including making the feature work on 32-bit processors (a patch exists), getting rid of that final once-per-second tick, mitigating some unfortunate side effects on the scheduler’s statistics and load balancing, and fixing the inevitable bugs. This is a large and invasive change to how the core kernel works; there will almost certainly be some surprising behaviors that emerge once the tickless mode starts to get wider testing.

很显然,3.10 内核中所引入的 full tickless 特性是内核进化的重要一步,但同样可以清楚地看到,该改进尚未全部完成。有相当多的细节有待完善,包括使该功能在 32 位处理器上也能工作(目前有一个对应的补丁在开发中);去除仅剩的每秒一次的时钟中断;尽量降低那些对调度器统计和负载均衡的副作用;以及修复那些不可避免的代码错误。这是一次对内核关键部分巨大而深入的改造;相信随着 tickless 这个特性逐渐获得更广泛的测试,更多的精彩会呈现在我们的面前。

The biggest item on the “to do” list, though, must surely be getting rid of the single-runnable-process requirement. Just in case the developers involved did not already feel that way, Linus made his opinion on the matter clear:

`So as long as the NOHZ is for HPC-style loads, then quite frankly, I don't feel it is worth it. The _only_ thing that makes it worth it is that "future plans" part where it would actually help real loads.`

至于在“有待完善事项”清单中谁是最急需解决的问题,毋庸置疑必然是要消除当且仅当处理器上有一个可运行任务时才可以启用 Full tickless 这个限制。虽然相关的开发人员可能并没有这么认为,但 Linus 已经就此事明确表达了他自己的看法

`如果 NOHZ (译者注,即 Full tickless)只是为了解决 HPC 所关注的负载问题,那么坦率地说,我觉得将该特性引入内核并不值得。就该特性对于内核来说,其真正的价值在于它的 “将来的计划” 部分,解决这些问题将真正有助于改善内核的实际负载问题。“

So, chances are, this limitation will be removed from the tickless implementation in some future development cycle, along with the other various rough edges. In the meantime, the 3.10 kernel will contain a significant step forward in the evolution of the core Linux kernel: the partial removal of a source of latency and overhead that has been there since the very first kernel release. Not even the big kernel lock endured anywhere near that long.

因此,很可能,这个限制(译者注,即上文所提到的最急需解决的问题)和其他遗留的问题将会在未来的某个开发周期中一起被从 tickless 的实现中移除。3.10 版本注定将成为 Linux 内核发展史中的一个里程碑:Full tickless 的引入部分消除了内核有史以来一直存在的延迟和开销问题。解决这个问题所延续的时间之长,甚至超过了当年移除 BKL 这件事。

Read Album:

Read Related:

Read Latest: