泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!
网站地址:https://tinylab.org

儿童Linux系统,可打字编程学数理化
请稍侯

LWN 532748: 名字空间实作,第四章:更多有关 PID 名字空间的介绍

Wang Chen 创作于 2018/04/19

原文:Namespaces in operation, part 4: more on PID namespaces 原创:By Michael Kerrisk @ Jan 23, 2013 翻译:By unicornx 校对:By w-simon

In this article, we continue last week’s discussion of PID namespaces (and extend our ongoing series on namespaces). One use of PID namespaces is to implement a package of processes (a container) that behaves like a self-contained Linux system. A key part of a traditional system—and likewise a PID namespace container—is the init process. Thus, we’ll look at the special role of the init process and note one or two areas where it differs from the traditional init process. In addition, we’ll look at some other details of the namespaces API as it applies to PID namespaces.

本文我们将继续上周有关 PID 名 字空间的讨论(作为当前名字空间系列文章 的一部分)。 PID 名字空间的用途之一是将一组进程包装起来(形成一个容器)使其工作于一个独立的虚拟 Linux 系统之中。 init 进程在传统意义上是一个系统的关键组成部分,这个概念在一个 PID 名字空间的容器中同样存在。本文我们将重点介绍 PID 名字空间中的 init 进程的特殊功能,以及与传统意义上系统中 init 进程的不同之处。除此以外,我们还将更详细地学习一下相关 API 在 PID 名字空间上的应用。

PID 名字空间中的 init 进程(The PID namespace init process)

The first process created inside a PID namespace gets a process ID of 1 within the namespace. This process has a similar role to the init process on traditional Linux systems. In particular, the init process can perform initializations required for the PID namespace as whole (e.g., perhaps starting other processes that should be a standard part of the namespace) and becomes the parent for processes in the namespace that become orphaned.

在一个 PID 名字空间内创建的第一个进程其 PID 的值为 1。 该进程与传统 Linux 系统上的 init 进程具有类似的作用。特别的,init 进程可以执行 PID 名字空间所需的初始化(例如,可用于启动名字空间所需要的其他进程),并在名字空间中的其他进程变为孤儿进程时成为它们的父进程。

In order to explain the operation of PID namespaces, we’ll make use of a few purpose-built example programs. The first of these programs, ns_child_exec.c, has the following command-line syntax:

为了解释 PID 名字空间的这些操作,我们将给出一些专门构建的示例程序。 这些程序中的第一个 ns_child_exec.c 具有以下命令行语法:

ns_child_exec [options] command [arguments]

The ns_child_exec program uses the clone() system call to create a child process; the child then executes the given command with the optional arguments. The main purpose of the options is to specify new namespaces that should be created as part of the clone() call. For example, the -p option causes the child to be created in a new PID namespace, as in the following example:

ns_child_exec 程序使用 clone() 系统调用来创建子进程;子进程根据传入的可选参数(arguments)执行给定的命令(command)。options 的主要用途是指定 clone() 创建的名字空间的类型。 例如,-p 选项在创建子进程后将其加入一个新的 PID 名字空间,如下例所示:

$ su                  # Need privilege to create a PID namespace
Password:
# ./ns_child_exec -p sh -c 'echo $$'
1

That command line creates a child in a new PID namespace to execute a shell echo command that displays the shell’s PID. With a PID of 1, the shell was the init process for the PID namespace that (briefly) existed while the shell was running.

该命令行会新建一个子进程,并将其放在一个新建的 PID 名字空间中,新建的子进程执行 sh 程序后运行 echo 命令以显示当前 shell 的 PID 的值。 我们可以看到打印的值为 1,说明执行 shell 程序的子进程正是新的 PID 名字空间中的 init 进程。

Our next example program, simple_init.c, is a program that we’ll execute as the init process of a PID namespace. This program is designed to allow us to demonstrate some features of PID namespaces and the init process.

我们的下一个示例程序 simple_init.c 将被作为 PID 名字空间的 init 进程所执行。该程序会演示 PID 名字空间和 init 进程的某些特性。

The simple_init program performs the two main functions of init. One of these functions is “system initialization”. Most init systems are more complex programs that take a table-driven approach to system initialization. Our (much simpler) simple_init program provides a simple shell facility that allows the user to manually execute any shell commands that might be needed to initialize the namespace; this approach also allows us to freely execute shell commands in order to conduct experiments in the namespace. The other function performed by simple_init is to reap the status of its terminated children using waitpid().

simple_init 程序演示 init 的两个主要功能。功能之一是有关“系统初始化”。实际的系统初始化过程往往比较复杂,需要通过表配置的方式来进行驱动。相比而言我们的初始化程序要简单得多,simple_init 程序提供了一个简单的命令行 shell,允许用户手动执行任意的 shell 命令来初始化名字空间; 这种方法还允许我们自行输入命令,方便在名字空间中进行实验。simple_init 的另一个功能是利用 waitpid() 回收终止的子进程。

Thus, for example, we can use the ns_child_exec program in conjunction with simple_init to fire up an init process that runs in a new PID namespace:

举个例子,我们可以将 ns_child_exec 程序与 simple_init 一起使用,在新的 PID 名字空间中执行初始化:

# ./ns_child_exec -p ./simple_init
init$

The init$ prompt indicates that the simple_init program is ready to read and execute a shell command.

出现 init$ 提示符说明 simple_init 程序已准备好读取和执行 shell 命令。

We’ll now use the two programs we’ve presented so far in conjunction with another small program, orphan.c, to demonstrate that processes that become orphaned inside a PID namespace are adopted by the PID namespace init process, rather than the system-wide init process.

现在我们可以使用已经介绍过的两个程序与另一个小程序 orphan.c,来演示在一个子 PID 名字空间中收留孤儿进程的是该名字空间中的 init 进程,而不是系统范围的 init 进程。

The orphan program performs a fork() to create a child process. The parent process then exits while the child continues to run; when the parent exits, the child becomes an orphan. The child executes a loop that continues until it becomes an orphan (i.e., getppid() returns 1); once the child becomes an orphan, it terminates. The parent and the child print messages so that we can see when the two processes terminate and when the child becomes an orphan.

orphan 程序通过执行 fork() 来创建一个子进程。随后父进程退出,而子进程继续运行;父进程退出时,子进程成为一个孤儿进程(orphan)。子进程循环检测,直到发现自己成为一个孤儿进程(检查的方法是通过调用 getppid() 并判断是否返回 1);一旦检测成功,子进程即可终止。通过父进程和子进程打印的消息我们可以看到两个进程何时终止以及子进程何时成为孤儿。

In order to see what that our simple_init program reaps the orphaned child process, we’ll employ that program’s -v option, which causes it to produce verbose messages about the children that it creates and the terminated children whose status it reaps:

为了查看我们的 simple_init 程序的详细执行过程,我们可以使用该程序的 -v 选项,该选项会导致程序打印出有关它创建子进程和回收子进程的详细信息:

# ./ns_child_exec -p ./simple_init -v
        init: my PID is 1
init$ ./orphan
        init: created child 2
Parent (PID=2) created child with PID 3
Parent (PID=2; PPID=1) terminating
        init: SIGCHLD handler: PID 2 terminated
init$                   # simple_init prompt interleaved with output from child
Child  (PID=3) now an orphan (parent PID=1)
Child  (PID=3) terminating
        init: SIGCHLD handler: PID 3 terminated

In the above output, the indented messages prefixed with init: are printed by the simple_init program’s verbose mode. All of the other messages (other than the init$ prompts) are produced by the orphan program. From the output, we can see that the child process (PID 3) becomes an orphan when its parent (PID 2) terminates. At that point, the child is adopted by the PID namespace init process (PID 1), which reaps the child when it terminates.

在上面的输出中,前缀为 init: 的缩进消息是 simple_init 程序打开详细输出开关后打印出来的内容。 其他所有消息(除了以 init$ 提示符开头的行之外)都由 orphan 程序生成。 从输出中,我们可以看到子进程(PID 值为 3)在其父进程(PID 值为 2)终止时变成孤儿。此时,子进程被 PID 名字空间的 init 进程(PID 值为 1)所收留,并进而在其终止时被 init 进程所回收。

信号和 init 进程(Signals and the init process)

The traditional Linux init process is treated specially with respect to signals. The only signals that can be delivered to init are those for which the process has established a signal handler; all other signals are ignored. This prevents the init process—whose presence is essential for the stable operation of the system—from being accidentally killed, even by the superuser.

传统的 Linux init 进程针对信号有特殊处理。只有那些注册了处理函数的信号才会被传递给 init 进程;否则都被忽略。init 进程的存在对于系统的稳定运行至关重要,这样做可以防止其被意外终止,特别是被那些具有超级用户权限的用户。

PID namespaces implement some analogous behavior for the namespace-specific init process. Other processes in the namespace (even privileged processes) can send only those signals for which the init process has established a handler. This prevents members of the namespace from inadvertently killing a process that has an essential role in the namespace. Note, however, that (as for the traditional init process) the kernel can still generate signals for the PID namespace init process in all of the usual circumstances (e.g., hardware exceptions, terminal-generated signals such as SIGTTOU, and expiration of a timer).

内核为 PID 名字空间中的 init 进程实现了一些类似的行为。名字空间中的其他进程(包括特权进程)只能向 init 进程发送已建立处理程序的信号。这可以防止其他成员无意中杀死这个名字空间中最重要的进程。同时需要注意的是(和传统的 init 进程一样),在某些常见情况下内核仍然可以向 PID 名字空间中的 init 进程发送信号(例如,硬件异常,终端生成的信号(如 SIGTTOU)和定时器到期通知)。

Signals can also (subject to the usual permission checks) be sent to the PID namespace init process by processes in ancestor PID namespaces. Again, only the signals for which the init process has established a handler can be sent, with two exceptions: SIGKILL and SIGSTOP. When a process in an ancestor PID namespace sends these two signals to the init process, they are forcibly delivered (and can’t be caught). The SIGSTOP signal stops the init process; SIGKILL terminates it. Since the init process is essential to the functioning of the PID namespace, if the init process is terminated by SIGKILL (or it terminates for any other reason), the kernel terminates all other processes in the namespace by sending them a SIGKILL signal.

一个进程可以向其派生的 PID 名字空间中的 init 进程发送信号(只要通常的权限检查 合法)。同样,这些信号要求 init 进程已注册了相应的处理函数,但有两个例外:SIGKILL 和 SIGSTOP。这两个信号被强制传递(delivered)(且不能被捕获(caught))。SIGSTOP 信号会导致 init 进程暂停; SIGKILL 信号则终止进程的执行。由于 init 进程对于 PID 名字空间的运行至关重要,因此如果 init 进程被 SIGKILL 终止(或者因任何其他原因而终止),内核将同时发送 SIGKILL 信号来终止名字空间中的所有其他进程。

Normally, a PID namespace will also be destroyed when its init process terminates. However, there is an unusual corner case: the namespace won’t be destroyed as long as a /proc/PID/ns/pid file for one of the processes in that namespaces is bind mounted or held open. However, it is not possible to create new processes in the namespace (via setns() plus fork()): the lack of an init process is detected during the fork() call, which fails with an ENOMEM error (the traditional error indicating that a PID cannot be allocated). In other words, the PID namespace continues to exist, but is no longer usable.

通常,一个 PID 名字空间在其 init 进程终止时也会被销毁。但是,有一个不常见的情况:只要该名字空间中某个进程的/proc/PID/ns/pid 文件被绑定挂载或仍然处于打开状态中,则该名字空间就不会被销毁。对于这种情形,也无法在该名字空间中创建新进程(譬如通过 setns() 加上 fork()):在 fork() 调用期间检测到缺少 init 进程会返回失败并报告 ENOMEM 错误(传统方式下该错误的含义是指无法分配新的 PID)。换句话说,在这种情况下, PID 名字空间仍然继续存在,但不再可用。

挂载一个 procfs 文件系统(再探)(Mounting a procfs filesystem (revisited))

In the previous article in this series, the /proc filesystems (procfs) for the PID namespaces were mounted at various locations other than the traditional /proc mount point. This allowed us to use shell commands to look at the contents of the /proc/PID directories that corresponded to each of the new PID namespace while at the same time using the ps command to look at the processes visible in the root PID namespace.

在本系列的前一篇文章中,我们将 PID 名字空间的 /proc 文件系统(procfs)安装在除传统的 /proc 挂载点以外的其他位置。这使得我们可以使用 shell 命令查看与每个新的 PID 名字空间相对应的 /proc/PID 目录下的内容,同时又可以使用 ps 命令查看根名字空间中的进程信息。

However, tools such as ps rely on the contents of the procfs mounted at /proc to obtain the information that they require. Therefore, if we want ps to operate correctly inside a PID namespace, we need to mount a procfs for that namespace. Since the simple_init program permits us to execute shell commands, we can perform this task from the command line, using the mount command:

诸如 ps 之类的工具依赖于安装在 /proc 处的 procfs 的内容来获取它们所需的信息。 因此,如果我们希望 ps 在 PID 名字空间内正确运行,就需要为该名字空间安装 procfs。simple_init 程序允许我们输入 shell 命令,因此我们可以从命令行执行 mount:

# ./ns_child_exec -p -m ./simple_init
init$ mount -t proc proc /proc
init$ ps a
  PID TTY      STAT   TIME COMMAND
    1 pts/8    S      0:00 ./simple_init
    3 pts/8    R+     0:00 ps a

The ps a command lists all processes accessible via /proc. In this case, we see only two processes, reflecting the fact that there are only two processes running in the namespace.

ps a 命令列出了 /proc 下所有可以访问的进程。在以上例子中,我们只看到两个进程,这反映了该名字空间中的确只有两个进程正在运行。

When running the ns_child_exec command above, we employed that program’s -m option, which places the child that it creates (i.e., the process running simple_init) inside a separate mount namespace. As a consequence, the mount command does not affect the /proc mount seen by processes outside the namespace.

上面运行 ns_child_exec 命令时,我们使用了该程序的 -m 选项,该选项将其创建的子进程(即运行 simple_init 的进程)放入单独的 mount 名字空间中。 因此,后面运行 mount 命令时不会影响该名字空间以外的进程对 /proc 的使用。

unshare()setns()unshare() and setns()

In the second article in this series, we described two system calls that are part of the namespaces API: unshare() and setns(). Since Linux 3.8, these system calls can be employed with PID namespaces, but they have some idiosyncrasies when used with those namespaces.

在本系列的 第二篇文章 中,我们描述了名字空间相关的两个系统调用:unshare()setns()。 从 Linux 3.8 开始,这些系统调用开始支持 PID 名字空间,但和操作其他名字空间相比,有一些特殊的地方需要注意。

Specifying the CLONE_NEWPID flag in a call to unshare() creates a new PID namespace, but does not place the caller in the new namespace. Rather, any children created by the caller will be placed in the new namespace; the first such child will become the init process for the namespace.

调用 unshare() 时如果指定 CLONE_NEWPID 标志会创建一个新的 PID 名字空间,但不会将调用者进程放入这个新的名字空间。相反,只有该调用者创建的子进程才会被放置到这个新的名字空间中;第一个子进程将成为该名字空间的 init 进程。

The setns() system call now supports PID namespaces:

setns() 系统调用现在也支持 PID 名字空间:

setns(fd, 0);   /* Second argument can be CLONE_NEWPID to force a
                   check that 'fd' refers to a PID namespace */

The fd argument is a file descriptor that identifies a PID namespace that is a descendant of the PID namespace of the caller; that file descriptor is obtained by opening the /proc/PID/ns/pid file for one of the processes in the target namespace. As with unshare(), setns() does not move the caller to the PID namespace; instead, children that are subsequently created by the caller will be placed in the namespace.

fd 参数是一个文件描述符,用于标识调用者进程所在的 PID 名字空间所派生的 PID 名字空间; 该文件描述符是通过打开目标名字空间中某个进程所对应的 /proc/PID/ns/pid 文件而获得的。 与 unshare() 一样,setns() 不会将调用者进程移动到指定的 PID 名字空间;只有该调用者进程所创建的子进程才会被加入指定的名字空间。

We can use an enhanced version of the ns_exec.c program that we presented in the second article in this series to demonstrate some aspects of using setns() with PID namespaces that appear surprising until we understand what is going on. The new program, ns_run.c, has the following syntax:

我们可以在本系列第二篇文章中介绍过的 ns_exec.c 程序基础上继续修改代码来演示如何针对 PID 名字空间使用 setns()。新程序ns_run.c 的命令行语法如下:

ns_run [-f] [-n /proc/PID/ns/FILE]... command [arguments]

The program uses setns() to join the namespaces specified by the /proc/PID/ns files contained within -n options. It then goes on to execute the given command with optional arguments. If the -f option is specified, it uses fork() to create a child process that is used to execute the command.

该程序通过 -n 选项所携带的 /proc/PID/ns 目录下的文件路径参数指定目标名字空间对象,然后调用 setns() 加入该名字空间。加入后该程序执行 command 程序,arguments 可用于指定执行 command 时的参数。如果指定了 -f 选项,该程序将调用 fork() 创建子进程并在子进程中执行 command

Suppose that, in one terminal window, we fire up our simple_init program in a new PID namespace in the usual manner, with verbose logging so that we are informed when it reaps child processes:

假定打开一个终端窗口,按前述方式(译者注:指运行 ns_child_exec 程序)执行如下命令,创建一个新的 PID 名字空间,然后启动 simple_init 程序,并开启详细的日志记录,以便能够及时查看到子进程被回收的详细过程:

# ./ns_child_exec -p ./simple_init -v
        init: my PID is 1
init$ 

Then we switch to a second terminal window where we use the ns_run program to execute our orphan program. This will have the effect of creating two processes in the PID namespace governed by simple_init:

然后我们切换到另一个终端窗口,使用 ns_run 程序来执行 orphan 程序。 这会在 simple_init 所在的 PID 名字空间中创建两个进程:

# ps -C sleep -C simple_init
  PID TTY          TIME CMD
 9147 pts/8    00:00:00 simple_init
# ./ns_run -f -n /proc/9147/ns/pid ./orphan
Parent (PID=2) created child with PID 3
Parent (PID=2; PPID=0) terminating
# 
Child  (PID=3) now an orphan (parent PID=1)
Child  (PID=3) terminating

Looking at the output from the “Parent” process (PID 2) created when the orphan program is executed, we see that its parent process ID is 0. This reflects the fact that the process that started the orphan process (ns_run) is in a different namespace—one whose members are invisible to the “Parent” process. As already noted in the previous article, getppid() returns 0 in this case.

查看执行 orphan 程序时创建的 “Parent” 进程(PID 值为 2)的输出,我们看到它的父进程的 PID 的值为 0。这是因为创建子进程并运行 orphan 程序的父进程(即执行 ns_run 的进程)和 orphan 程序所在进程处于不同的名字空间,这导致 “Parent” 进程无法看见 ns_run 的进程信息。 正如前一篇文章中已经指出的那样,在这种情况下 getppid() 会返回 0。

The following diagram shows the relationships of the various processes before the orphan “Parent” process terminates. The arrows indicate parent-child relationships between processes.

下图显示了 orphan 程序所创建的 “Parent” 进程终止之前各种进程之间的关系。图上的箭头表达了进程之间的父子关系。(译者注:箭头指向的是子进程)

Relationship of processes inside PID namespaces

Returning to the window running the simple_init program, we see the following output:

回到运行 simple_init 程序的窗口,我们看到以下输出:

init: SIGCHLD handler: PID 3 terminated

The “Child” process (PID 3) created by the orphan program was reaped by simple_init, but the “Parent” process (PID 2) was not. This is because the “Parent” process was reaped by its parent (ns_run) in a different namespace. The following diagram shows the processes and their relationships after the orphan “Parent” process has terminated and before the “Child” terminates.

orphan 程序创建的 “Child” 进程(PID 值为 3)由 simple_init 负责回收,但 “Parent” 进程(PID 值为 2)不是。这是因为 “Parent” 进程是由其父进程(ns_run)在另一个名字空间中回收。下图显示的是 orphan 的 “Parent” 进程已经终止但 “Child” 进程还未终止时的进程关系。

Relationship of processes inside PID namespaces

It’s worth emphasizing that setns() and unshare() treat PID namespaces specially. For other types of namespaces, these system calls do change the namespace of the caller. The reason that these system calls do not change the PID namespace of the calling process is because becoming a member of another PID namespace would cause the process’s idea of its own PID to change, since getpid() reports the process’s PID with respect to the PID namespace in which the process resides. Many user-space programs and libraries rely on the assumption that a process’s PID (as reported by getpid()) is constant (in fact, the GNU C library getpid() wrapper function caches the PID); those programs would break if a process’s PID changed. To put things another way: a process’s PID namespace membership is determined when the process is created, and (unlike other types of namespace membership) cannot be changed thereafter.

需要强调的是,setns()unshare() 处理 PID 名字空间方式比较特殊。 对于其他类型的名字空间,这些系统调用会改变调用者进程所在的名字空间。 而针对 PID 名字空间则不会,原因是因为一旦一个进程改变了其所属的 PID 名字空间则该进程在新的 PID 名字空间中的 PID 的值势必也有可能发生改变,getpid() 系统调用会根据进程所在的 PID 名字空间的实际分配值返回该进程的 PID 值。许多用户空间程序和库依赖于进程的 PID 值(通过 getpid() 获得)是恒定的(实际上,GNU C 库 的 getpid() 函数是对相应系统调用的封装,封装函数会缓存 PID 的值); 一旦一个进程的 PID 值改变了,这些程序就无法正常工作。换句话说:一个进程归属于哪一个 PID 名字空间是在该进程被创建时就确定了的,(和其他类型的名字空间不同)不能在创建后进行更改。

结论(Concluding remarks)

In this article we’ve looked at the special role of the PID namespace init process, shown how to mount a procfs for a PID namespace so that it can be used by tools such as ps, and looked at some of the peculiarities of unshare() and setns() when employed with PID namespaces. This completes our discussion of PID namespaces; in the next article, we’ll turn to look at user namespaces.

本文我们一起学习了 PID 名字空间中 init 进程的特殊作用,演示了为了使 ps 等工具可以正常工作如何为 PID 名字空间安装 procfs,以及介绍了针对 PID 名字空间 unshare()setns() 的特殊使用方式。至此完成了我们对 PID 名字空间的介绍; 在下一篇文章中,我们将转而学习 user 名字空间。



Read Album:

Read Related:

Read Latest: