了解更多有关 “LWN 中文翻译计划”，请点击 这里
Imagine a system with two processes running, one at high priority and the other at a much lower priority. These processes share resources which are protected by locks. At some point, the low-priority process manages to run and obtains a lock for one of those resources. If the high-priority process then attempts to obtain the same lock, it will have to wait. Essentially, the low-priority process has trumped the high-priority process, at least for as long as it holds the contended lock.
Now imagine a third process, one which uses a lot of processor time, and which has a priority between the other two. If that process starts to crank, it will push the low-priority process out of the CPU indefinitely. As a result, the third process can keep the highest-priority process out of the CPU indefinitely.
This situation, called “priority inversion,” tends to be followed by system failure, upset users, and unemployed engineers. There are a number of approaches to avoiding priority inversion, including lockless designs, carefully thought-out locking scenarios, and a technique known as priority inheritance. The priority inheritance method is simple in concept: when a process holds a lock, it should run at (at least) the priority of the highest-priority process waiting for the lock. When a lock is taken by a low-priority process, the priority of that process might need to be boosted until the lock is released.
我们称上面的这种现象叫 “优先级反转（priority inversion）”，这往往会导致系统运行异常，用户可能会感到莫名其妙（译者注，因为表现出来的现象是优先级控制似乎失去了作用，优先级高的任务反而争抢不过优先级低的任务），可怜的工程师甚至可能会为此丢了饭碗。有许多种方法可以避免 “优先级反转”，包括无锁设计，充分考虑清楚上锁的所有情况，以及采用被称之为 “优先级继承（priority inheritance）” 的技术。“优先级继承” 这种方法的理念其实很简单：当一个进程持有锁时，它的优先级（至少）应该和等待该锁的进程中优先级最高的那个进程的优先级一样高（译者注，这样别的进程就很难抢占它（除非别的进程的优先级比等待该锁的所有进程的优先级都要高），从而使得该持有锁的进程可以不受打扰地快速处理完临界区并释放掉锁）。所以，当一个低优先级的进程获取了一把锁时，内核可能需要提升该进程的优先级，直到该锁被它释放掉。
There are a number of approaches to priority inheritance. In effect, the kernel performs a very simple form of it by not allowing kernel code to be preempted while holding a spinlock. In some systems, each lock has a priority associated with it; whenever a process takes a lock, its priority is raised to the lock’s priority. In others, a high-priority process will have its priority “inherited” by another process which holds a needed lock. Most priority inheritance schemes have shown a tendency to complicate and slow down the locking code, and they can be used to paper over poor application designs. So they are unpopular in many circles. Linus was reasonably clear about how he felt on the subject last December:
有许多种实现 “优先级继承” 的方法。当前（Linux）内核采用的是一种最简单的形式，就是在获取 “自旋锁（spinlock）” 的同时禁止了任务抢占。有些操作系统中，每个锁具有自己的优先级属性；只要某个进程获取了这把锁，其优先级就会被提升到和锁的优先级相同。在其他的一些操作系统中，采取的方式是当一个高优先级的进程尝试获取锁失败时其优先级将会被占有相同锁的其他进程所 “继承”。绝大多数 “优先级继承” 的实现方案都倾向于使得和锁有关的处理逻辑变得复杂而低效，这些方案或许可以拿来用于发表论文但在工程实践中却并不实用。所以在很多情况下（这些方案）并不受业界的欢迎。在这个问题上，去年 12 月 Linus 表达了 非常明确 的看法：
"Friends don't let friends use priority inheritance". Just don't do it. If you really need it, your system is broken anyway.
Faced with this sort of opposition, many developers would quietly shelve their priority inheritance designs and go back to working on accounting code. The kernel development community, however, happens to have a member who has a track record of getting code merged in spite of this sort of objection: Ingo Molnar. History may well repeat itself, as Ingo (working with Thomas Gleixner) has posted a priority-inheriting futex implementation with a request that it be merged into the mainline. This approach, says Ingo, provides a useful functionality to user space (it is not meant to provide priority-inheriting kernel mutual exclusion primitives) while avoiding the pitfalls which have hit other implementations.
面对这些反对意见，许多开发人员悄悄搁置了他们在 “优先级继承” 上的设计并转而去开始研究那些相对容易解决的问题。然而，内核开发社区还就是存在这样的一个人，这个人的履历告诉我们，他向来不会因为有人反对就停止向内核提交代码，这个人就是：Ingo Molnar。类似的历史可能会再次重演，因为 Ingo（与 Thomas Gleixner 合作）发布了 一个针对 futex 的优先级继承实现（译者注，Futex 是 Fast Userspace muTexes 的缩写，提供了一种用户态和内核态混合作用下的互斥锁机制，该补丁下文简称为 PI-futex），并提请将其合并入主线中。Ingo 说，这种方法为用户空间提供了一种有用的功能（并不是为内核态的互斥锁原语提供优先级继承机制），同时避免了其他实现中所遇到的问题（译者注，指上文提到的实现上的复杂和运行上的低效）。
The PI-futex patch adds a couple of new operations to the
FUTEX_UNLOCK_PI. In the uncontended case, a PI-futex can be taken without involving the kernel at all, just like an ordinary futex. When there is contention, instead, the
FUTEX_LOCK_PIoperation is requested from the kernel. The requesting process is put into a special queue, and, if necessary, that process lends its priority to the process actually holding the contended futex. The priority inheritance is chained, so that, if the holding process is blocked on a second futex, the boosted priority will propagate to the holder of that second futex. As soon as a futex is released, any associated priority boost is removed.
FUTEX_UNLOCK_PI。在不发生竞争的条件下，可以在不涉及切换到内核态的情况下（译者注，即在用户态直接）获取 PI-futex，就像普通的 futex 一样。相反，当存在争用时，会向内核发起
FUTEX_LOCK_PI 请求。发起请求的进程会被放入一个特殊队列，必要的话，内核会按照该进程的优先级提升实际持有 futex 的进程。“优先级继承” 的实现具备可传递性，也就是说，如果拥有锁的进程因为另一个 futex 而被阻塞，则其所被提升的优先级将被传递到第二个 futex 的持有者身上。一旦一个所有者释放了 futex，其本身被提升的优先级则会被恢复。
As with regular futexes, the kernel only needs to know about a PI-futex while it is being contended. So the number of futexes in the system can become quite large without serious overhead on the kernel side.
与以往常规的 futex 一样，内核只会在某个 PI-futex 被争用时才会介入。因此，即使系统中的 futex 数量变得很巨大，但对内核来说开销也不会很严重。
Within the kernel, the PI-futex type is implemented by way of a new primitive called an
rt_mutexis superficially similar to regular mutexes, with the addition of the priority inheritance capability. They are, however, an entirely different type, with no code shared with the mutex implementation. The API will be familiar to mutex users, however; in brief, it is:
rt_mutex 的新结构体类型实现。看上去，除了增加了 “优先级继承” 能力外，该
#include <linux/rtmutex.h> void rt_mutex_init(struct rt_mutex *lock); void rt_mutex_destroy(struct rt_mutex *lock); void rt_mutex_lock(struct rt_mutex *lock); int rt_mutex_lock_interruptible(struct rt_mutex *lock, int detect_deadlock); int rt_mutex_timed_lock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout, int detect_deadlock); int rt_mutex_trylock(struct rt_mutex *lock); void rt_mutex_unlock(struct rt_mutex *lock); int rt_mutex_is_locked(struct rt_mutex *lock);
The alert reader may have noticed that this looks much like the realtime mutex type found in the realtime preemption patch. Ingo once said that the realtime patches would slowly trickle into the mainline, and that is what appears to be happening here. With this patch set, the PI-futex code is the only user of the new
rt_mutextype, but that could certainly change over time.
rt_mutex 看上去很像 “实时抢占补丁（realtime preemption patch）” 中的实时互斥锁（realtime mutex）类型。Ingo 曾经说过，他会逐步地将实时补丁中的内容合入主线，这件事似乎正在发生。虽然在这个补丁集中，PI-futex 代码是新的
The PI-futex patch also includes a new, priority-sorted list type which could find users elsewhere in the kernel.
There has been relatively little discussion of this patch so far; it has been included in recent -mm trees. It is too late for 2.6.17, but, if no real opposition develops, the PI-futex code might just find its way into a subsequent kernel.
到目前为止，对该补丁的讨论还相对较少；它已被包含在最新的 -mm 代码仓库中。如果要想将这个补丁随 2.6.17 合入内核主线，看上去时间有点紧张，如果没有反对意见，PI-futex 代码很有可能会随着后继版本合入内核。（译者注，PI-futex 补丁最终 随 2.6.18 版本合入内核主线。）
了解更多有关 “LWN 中文翻译计划”，请点击 这里
- LWN 531148: Linux 内核文件中的非常规节
- Linux 内核的代码仓库管理与开发流程简介
- LWN 600644: 扩展内核栈
- LWN 563185: 优化抢占
- LWN 575497: 我们很快就可以有 Deadline 调度器了吗？