[置顶] 泰晓 RISC-V 实验箱,配套 30+ 讲嵌入式 Linux 系统开发公开课
LWN 531419: 名字空间实作,第三章:PID 名字空间
原文:Namespaces in operation, part 3: PID namespaces 原创:By Michael Kerrisk @ Jan 16, 2013 翻译:By unicornx 校对:By w-simon
Following on from our two earlier namespaces articles (Part 1: namespaces overview and Part 2: the namespaces API), we now turn to look at PID namespaces. The global resource isolated by PID namespaces is the process ID number space. This means that processes in different PID namespaces can have the same process ID. PID namespaces are used to implement containers that can be migrated between host systems while keeping the same process IDs for the processes inside the container.
继前两篇文章(第 1 部分:名字空间概述 和第 2 部分:名字空间 API)之后,我们现在来看看 PID 名字空间(PID namespaces)。PID 名字空间负责隔离的全局资源是进程的标识符(译者注:Process ID,下文简称 PID)。这意味着在不同 PID 名字空间中的进程可以具有相同的进程标识符。PID 名字空间可用于实现容器在不同主机系统之间的无缝迁移,即迁移时保持容器内进程的 PID 不变。
As with processes on a traditional Linux (or UNIX) system, the process IDs within a PID namespace are unique, and are assigned sequentially starting with PID 1. Likewise, as on a traditional Linux system, PID 1—the init process—is special: it is the first process created within the namespace, and it performs certain management tasks within the namespace.
与传统 Linux(或 UNIX)系统上的进程概念一样, PID 名字空间中的 PID 是唯一的,并且其值从 1 开始按顺序分配。同样,与传统 Linux 系统一样,PID 值为 1 的进程(即 init 进程)是一个特殊的进程:它是在一个名字空间内创建的第一个进程,它在名字空间内执行某些管理任务(译者注:譬如回收孤儿进程等)。
初探(First investigations)
A new PID namespace is created by calling
clone()
with theCLONE_NEWPID
flag. We’ll show a simple example program that creates a new PID namespace usingclone()
and use that program to map out a few of the basic concepts of PID namespaces. The complete source of the program (pidns_init_sleep.c
) can be found here. As with the previous article in this series, in the interests of brevity, we omit the error-checking code that is present in the full versions of the example program when discussing it in the body of the article.
通过指定 CLONE_NEWPID
选项调用 clone()
可以创建新的 PID 名字空间。这里我们给出一个简单的示例程序,它使用 clone()
创建一个新的 PID 名字空间,通过该程序我们可以学习 PID 名字空间的一些基本概念。程序的完整源代码(pidns_init_sleep.c
)可以在这里 找到 。与本系列前一篇文章一样,为了简洁起见,在下文中讨论示例代码时,我们省略了完整版中有关错误检查部分的代码。
The main program creates a new PID namespace using
clone()
, and displays the PID of the resulting child:
main 函数调用 clone()
创建一个新的 PID 名字空间,并打印子进程的 PID:
child_pid = clone(childFunc,
child_stack + STACK_SIZE, /* Points to start of
downwardly growing stack */
CLONE_NEWPID | SIGCHLD, argv[1]);
printf("PID returned by clone(): %ld\n", (long) child_pid);
The new child process starts execution in
childFunc()
, which receives the last argument of theclone()
call (argv[1]
) as its argument. The purpose of this argument will become clear later.
新的子进程执行 childFunc()
函数,并使用 clone()
函数的最后一个参数(argv [1]
)作为其入参。下文稍后将详细介绍该参数的用途。
The
childFunc()
function displays the process ID and parent process ID of the child created byclone()
and concludes by executing the standard sleep program:
childFunc()
函数打印新建子进程的 PID 和 父进程的 PID,并在最后执行 sleep 程序:
printf("childFunc(): PID = %ld\n", (long) getpid());
printf("ChildFunc(): PPID = %ld\n", (long) getppid());
...
execlp("sleep", "sleep", "1000", (char *) NULL);
The main virtue of executing the sleep program is that it provides us with an easy way of distinguishing the child process from the parent in process listings.
执行 sleep 程序的主要用途是便于我们区分子进程和父进程。
When we run this program, the first lines of output are as follows:
运行这个程序,输出的前几行如下所示:
$ su # Need privilege to create a PID namespace
Password:
# ./pidns_init_sleep /proc2
PID returned by clone(): 27656
childFunc(): PID = 1
childFunc(): PPID = 0
Mounting procfs at /proc2
The first two lines line of output from
pidns_init_sleep
show the PID of the child process from the perspective of two different PID namespaces: the namespace of the caller ofclone()
and the namespace in which the child resides. In other words, the child process has two PIDs: 27656 in the parent namespace, and 1 in the new PID namespace created by theclone()
call.
pidns_init_sleep
程序输出的前两行打印了同一个子进程在两个名字空间中的 PID 值:这两个名字空间分别是调用 clone()
的父进程所在的名字空间和其创建的子进程所在的名字空间。换句话说,该子进程有两个 PID 的值:在父进程所在名字空间中的值为 27656,在 clone()
函数创建的新 PID 名字空间中的值为 1。
The next line of output shows the parent process ID of the child, within the context of the PID namespace in which the child resides (i.e., the value returned by
getppid()
). The parent PID is 0, demonstrating a small quirk in the operation of PID namespaces. As we detail below, PID namespaces form a hierarchy: a process can “see” only those processes contained in its own PID namespace and in the child namespaces nested below that PID namespace. Because the parent of the child created byclone()
is in a different namespace, the child cannot “see” the parent; therefore,getppid()
reports the parent PID as being zero.
接下来的一行显示在子进程所在的 PID 名字空间上下文中其父进程的 PID 值(由 getppid()
返回)。其值为 0,这正是 PID 名字空间操作中一个有趣的地方。如下文所述,系统中的多个 PID 名字空间形成一个层次结构:一个进程只能 “看到” 自己所在的 PID 名字空间中的其他进程以及从该 PID 名字空间派生出来的子名字空间中的那些进程。由于由 clone()
所创建的子进程的父进程和子进程位于不同的名字空间中,因此子进程 “看不见” 它的父进程;这就是示例代码中 getppid()
报告父进程 PID 的值为零的原因。
For an explanation of the last line of output from
pidns_init_sleep
, we need to return to a piece of code that we skipped when discussing the implementation of thechildFunc()
function.
为了解释 pidns_init_sleep
的最后一行输出,我们需要回头看一下在前面讨论 childFunc()
函数时跳过的一段代码。
/proc/PID
和 PID 名字空间(/proc/PID
and PID namespaces)
Each process on a Linux system has a
/proc/PID
directory that contains pseudo-files describing the process. This scheme translates directly into the PID namespaces model. Within a PID namespace, the/proc/PID
directories show information only about processes within that PID namespace or one of its descendant namespaces.
Linux 系统上的每个进程都对应有一个 /proc/PID
目录,该目录中包含一些描述该进程属性的虚拟文件。PID 名字空间模型依然沿用该机制。一个 PID 名字空间下的 /proc/PID
目录会显示该名字空间及其派生的名字空间中的进程信息。
However, in order to make the
/proc/PID
directories that correspond to a PID namespace visible, the proc filesystem (“procfs” for short) needs to be mounted from within that PID namespace. From a shell running inside the PID namespace (perhaps invoked via thesystem()
library function), we can do this using a mount command of the following form:
但需要注意的是,为了使该 PID 名字空间的 /proc/PID
目录可见,需要在该 PID 名字空间中挂载一次 proc 文件系统(简称 “procfs”)。如果在某个 PID 名字空间内启动了一个 shell(譬如通过 system()
这个库函数),我们可以执行如下形式的 mount 命令来执行挂载操作:
# mount -t proc proc /mount_point
Alternatively, a procfs can be mounted using the
mount()
system call, as is done inside our program’schildFunc()
function:
或者,可以使用 mount()
系统调用来挂载 procfs ,就像我们在 childFunc()
函数中做的那样:
mkdir(mount_point, 0555); /* Create directory for mount point */
mount("proc", mount_point, "proc", 0, NULL);
printf("Mounting procfs at %s\n", mount_point);
The
mount_point
variable is initialized from the string supplied as the command-line argument when invokingpidns_init_sleep
.
变量 mount_point
的值来自执行 pidns_init_sleep
程序时给出的命令行参数。
In our example shell session running
pidns_init_sleep
above, we mounted the new procfs at/proc2
. In real world usage, the procfs would (if it is required) usually be mounted at the usual location,/proc
, using either of the techniques that we describe in a moment. However, mounting the procfs at/proc2
during our demonstration provides an easy way to avoid creating problems for the rest of the processes on the system: since those processes are in the same mount namespace as our test program, changing the filesystem mounted at/proc
would confuse the rest of the system by making the/proc/PID
directories for the root PID namespace invisible.
在上文执行 pidns_init_sleep
的 shell 会话示例中,我们在 /proc2
上挂载了新的 procfs 。在实际应用中,利用上文介绍的两种方法之一,我们通常将 procfs(如果需要的话)安装在缺省的位置 /proc
。之所以在我们的演示中,选择在 /proc2
安装 procfs 仅仅是为了避免对系统上其他进程的运行造成影响:因为这些进程与我们的演示进程位于同一个 mount 名字空间中(译者注,pidns_init_sleep
中调用 clone()
时仅指定了 CLONE_NEWPID
标志,并没有创建新的 mount 名字空间),所以在 /proc
目录两次挂载 proc 文件系统会覆盖根(root)PID 名字空间中原来的 /proc/PID
目录,导致系统中其余进程的运行受到影响。
Thus, in our shell session the procfs mounted at
/proc
will show the PID subdirectories for the processes visible from the parent PID namespace, while the procfs mounted at/proc2
will show the PID subdirectories for processes that reside in the child PID namespace. In passing, it’s worth mentioning that although the processes in the child PID namespace will be able to see the PID directories exposed by the/proc
mount point, those PIDs will not be meaningful for the processes in the child PID namespace, since system calls made by those processes interpret PIDs in the context of the PID namespace in which they reside.
因此,从演示的 shell 会话中可以看到,安装在 /proc
的 procfs 显示的是父进程所在 PID 名字空间中可见的所有进程对应的 PID 子目录,而安装在 /proc2
的 procfs 显示的是子进程所在的 PID 名字空间中可以看到的所有进程的 PID 子目录。值得注意是,虽然子进程所在的 PID 名字空间中的进程能够看到 /proc
挂载下的 PID 目录,但对它们来说这些 PID 是没有意义的,因为一个进程所调用的系统调用函数只会按照进程所在的 PID 名字空间的上下文来理解这些 PID 的值。
Having a procfs mounted at the traditional
/proc
mount point is necessary if we want various tools such as ps to work correctly inside the child PID namespace, because those tools rely on information found at/proc
. There are two ways to achieve this without affecting the/proc
mount point used by parent PID namespace. First, if the child process is created using theCLONE_NEWNS
flag, then the child will be in a different mount namespace from the rest of the system. In this case, mounting the new procfs at/proc
would not cause any problems. Alternatively, instead of employing theCLONE_NEWNS
flag, the child could change its root directory withchroot()
and mount a procfs at/proc
.
当然在传统的 /proc
目录上安装 procfs 依然是有必要的,否则在演示程序所创建的子进程的 PID 名字空间中,很多原本需要依赖于 /proc
路径下的信息进行工作的程序工具都会失效,譬如 ps。有两种方法可以解决这个问题,而且也不会影响父进程的 PID 名字空间下对 /proc
的使用。第一种方法,是在创建子进程时指定 CLONE_NEWNS
选项,那么该子进程将拥有独立的 mount 名字空间。在这种情况下,在 /proc
中挂载新的 procfs 是不会导致任何问题的。或者,也可以不使用 CLONE_NEWNS
选项,而是采用先调用 chroot()
更改其根目录,然后再在 /proc
上挂载 procfs 的方法。
Let’s return to the shell session running
pidns_init_sleep
. We stop the program and use ps to examine some details of the parent and child processes within the context of the parent namespace:
让我们回到运行 pidns_init_sleep
的 shell 会话。在父进程和子进程所在的名字空间中,先暂停进程,然后使用 ps 检查父进程和子进程的一些细节:
^Z Stop the program, placing in background
[1]+ Stopped ./pidns_init_sleep /proc2
# ps -C sleep -C pidns_init_sleep -o "pid ppid stat cmd"
PID PPID STAT CMD
27655 27090 T ./pidns_init_sleep /proc2
27656 27655 S sleep 600
The “PPID” value (27655) in the last line of output above shows that the parent of the process executing
sleep
is the process executingpidns_init_sleep
.
上面输出的最后一行中的 “PPID” 值(27655)表明执行 sleep
的进程的父进程正是执行 pidns_init_sleep
的进程。
By using the readlink command to display the (differing) contents of the
/proc/PID/ns/pid
symbolic links (explained in last week’s article), we can see that the two processes are in separate PID namespaces:
通过使用 readlink 命令对比 /proc/PID/ns/pid
符号链接的内容(相关内容已在上周的文章中解释过),我们可以看到这两个进程位于不同的 PID 名字空间中:
# readlink /proc/27655/ns/pid
pid:[4026531836]
# readlink /proc/27656/ns/pid
pid:[4026532412]
At this point, we can also use our newly mounted procfs to obtain information about processes in the new PID namespace, from the perspective of that namespace. To begin with, we can obtain a list of PIDs in the namespace using the following command:
我们同样还可以使用新安装的 procfs 获取新的 PID 名字空间中进程的信息。首先,我们可以使用以下命令获取名字空间中的 PID 列表:
# ls -d /proc2/[1-9]*
/proc2/1
As can be seen, the PID namespace contains just one process, whose PID (inside the namespace) is 1. We can also use the
/proc/PID/status
file as a different method of obtaining some of the same information about that process that we already saw earlier in the shell session:
可以看出,该 PID 名字空间仅包含一个进程,其 PID(在该名字空间内)为 1。我们还可以通过访问 /proc/PID/status
文件得到以上 shell 交互中获取的类似信息:
# cat /proc2/1/status | egrep '^(Name|PP*id)'
Name: sleep
Pid: 1
PPid: 0
The PPid field in the file is 0, matching the fact that
getppid()
reports that the parent process ID for the child is 0.
该文件中的 PPid 字段的值为 0,这与通过 getppid()
获得父进程 PID 为 0 的事实相一致。
嵌套的 PID 名字空间(Nested PID namespaces)
As noted earlier, PID namespaces are hierarchically nested in parent-child relationships. Within a PID namespace, it is possible to see all other processes in the same namespace, as well as all processes that are members of descendant namespaces. Here, “see” means being able to make system calls that operate on specific PIDs (e.g., using
kill()
to send a signal to process). Processes in a child PID namespace cannot see processes that exist (only) in the parent PID namespace (or further removed ancestor namespaces).
如前所述,PID 名字空间是分层按照父子关系嵌套的。一个进程在其所在的 PID 名字空间里,除了可以看到同一名字空间中的所有其他进程外还可以“看到”逐层派生的名字空间中的所有进程。在这里,所谓“看到”指得是能够指定 PID 值进行系统调用(例如,使用 kill()
向某个进程发送信号)。派生的子 PID 名字空间中的进程无法看到父 PID 名字空间中的进程(以及更上层的名字空间中的进程,依此类推)。
A process will have one PID in each of the layers of the PID namespace hierarchy starting from the PID namespace in which it resides through to the root PID namespace. Calls to
getpid()
always report the PID associated with the namespace in which the process resides.
对一个进程来说,从其所在的 PID 名字空间向上层反推一直到根(root)PID 名字空间(译者注,即最顶级的 PID 名字空间),每一层都会对该进程分配一个 PID 值。调用 getpid()
返回的是调用者进程在其所在名字空间中的 PID 值。
We can use the program shown here (
multi_pidns.c
) to show that a process has different PIDs in each of the namespaces in which it is visible. In the interests of brevity, we will simply explain what the program does, rather than walking though its code.
我们可以使用一个例子程序 (multi_pidns.c
)来演示一个进程在每个可见的名字空间中都有不同的 PID 值。为了简洁起见,我们将着重解释程序实现了什么功能,而不是逐行解释其代码。
The program recursively creates a series of child process in nested PID namespaces. The command-line argument specified when invoking the program determines how many children and PID namespaces to create:
该程序递归地在嵌套的 PID 名字空间中创建一系列的子进程。调用程序时通过命令行参数指定要创建多少次子进程以及多少层 PID 名字空间(译者注:创建一次子进程即向下嵌套创建一层名字空间):
# ./multi_pidns 5
In addition to creating a new child process, each recursive step mounts a procfs filesystem at a uniquely named mount point. At the end of the recursion, the last child executes the sleep program. The above command line yields the following output:
除了创建一个新的子进程之外,每个递归步骤还在一个唯一命名的挂载点上挂载一次 procfs 文件系统。递归到最底层时,最后一个子进程执行睡眠程序。上述操作产生以下输出:
Mounting procfs at /proc4
Mounting procfs at /proc3
Mounting procfs at /proc2
Mounting procfs at /proc1
Mounting procfs at /proc0
Final child sleeping
Looking at the PIDs in each procfs, we see that each successive procfs “level” contains fewer PIDs, reflecting the fact that each PID namespace shows only the processes that are members of that PID namespace or its descendant namespaces:
通过检查每个名字空间中的 procfs 所包含的 PID 文件,我们会发现其个数逐层递减,这验证了前面所说的每个 PID 名字空间仅显示该 PID 名字空间以及其派生的子名字空间中的进程成员:
^Z Stop the program, placing in background
[1]+ Stopped ./multi_pidns 5
# ls -d /proc4/[1-9]* Topmost PID namespace created by program
/proc4/1 /proc4/2 /proc4/3 /proc4/4 /proc4/5
# ls -d /proc3/[1-9]*
/proc3/1 /proc3/2 /proc3/3 /proc3/4
# ls -d /proc2/[1-9]*
/proc2/1 /proc2/2 /proc2/3
# ls -d /proc1/[1-9]*
/proc1/1 /proc1/2
# ls -d /proc0/[1-9]* Bottommost PID namespace
/proc0/1
A suitable grep command allows us to see the PID of the process at the tail end of the recursion (i.e., the process executing sleep in the most deeply nested namespace) in all of the namespaces where it is visible:
运行一个精心设计的 grep 命令后我们可以发现在每个名字空间中都可以看到嵌套最深的那个名字空间中所执行的进程的 PID(正是执行 sleep 的那个进程):
# grep -H 'Name:.*sleep' /proc?/[1-9]*/status
/proc0/1/status:Name: sleep
/proc1/2/status:Name: sleep
/proc2/3/status:Name: sleep
/proc3/4/status:Name: sleep
/proc4/5/status:Name: sleep
In other words, in the most deeply nested PID namespace (
/proc0
), the process executing sleep has the PID 1, and in the topmost PID namespace created (/proc4
), that process has the PID 5.
换句话说,在嵌套最深层的 PID 名字空间( /proc0
)中,执行 sleep 的进程的 PID 的值为 1,而在最上面的 PID 名字空间(/proc4
)中,该进程的 PID 值为 5。
If you run the test programs shown in this article, it’s worth mentioning that they will leave behind mount points and mount directories. After terminating the programs, shell commands such as the following should suffice to clean things up:
运行本文中给出的测试程序时,值得注意的是程序结束后并不会自动卸载挂载点和删除挂载目录。所以在终止这些程序之后,需要运行下面这样的 shell 命令执行相关清理:
# umount /proc?
# rmdir /proc?
结束语 (Concluding remarks)
In this article, we’ve looked in quite some detail at the operation of PID namespaces. In the next article, we’ll fill out the description with a discussion of the PID namespace init process, as well as a few other details of the PID namespaces API.
在本文中,我们详细介绍了有关 PID 名字空间的操作。在下一篇文章中,我们将讨论 PID 名字空间中的 init 进程以及其他有关 PID 名字空间 API 的细节。
猜你喜欢:
- 我要投稿:发表原创技术文章,收获福利、挚友与行业影响力
- 知识星球:独家 Linux 实战经验与技巧,订阅「Linux知识星球」
- 视频频道:泰晓学院,B 站,发布各类 Linux 视频课
- 开源小店:欢迎光临泰晓科技自营店,购物支持泰晓原创
- 技术交流:Linux 用户技术交流微信群,联系微信号:tinylab
支付宝打赏 | 微信打赏 | |
请作者喝杯咖啡吧 |
Read Album:
- LWN 531148: Linux 内核文件中的非常规节
- Linux 内核的代码仓库管理与开发流程简介
- LWN 600644: 扩展内核栈
- LWN 563185: 优化抢占
- LWN 575497: 我们很快就可以有 Deadline 调度器了吗?