泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!
网站地址:https://tinylab.org

泰晓Linux实验盘,即刻上手内核与嵌入式开发
请稍侯

LWN 531381: 名字空间实作,第二章:名字空间的 API

Wang Chen 创作于 2018/03/05

原文:Namespaces in operation, part 2: the namespaces API 原创:By Michael Kerrisk @ Jan 8, 2013 翻译:By unicornx 校对:By w-simon

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource. Namespaces are used for a variety of purposes, with the most notable being the implementation of containers, a technique for lightweight virtualization. This is the second part in a series of articles that looks in some detail at namespaces and the namespaces API. The first article in this series provided an overview of namespaces. This article looks at the namespaces API in some detail and shows the API in action in a number of example programs.

内核中的名字空间(namespace)将全局的系统资源封装起来,面向进程提供了一种抽象的机制使得一个名字空间中的进程只能访问自己的私有资源。名字空间可用于多种用途,最值得关注的是可以基于其实现 容器(containers),从而实现一种轻量级的虚拟化。本文是名字空间实作系列文章中的第二篇,它将详细地介绍名字空间和相关的 API。该系列的第一篇提供了对名字空间的概述。本文继续详细介绍名字空间相关的 API,并通过多个示例程序演示这些 API 如何使用。

The namespace API consists of three system calls—clone(), unshare(), and setns()—and a number of /proc files. In this article, we’ll look at all of these system calls and some of the /proc files. In order to specify a namespace type on which to operate, the three system calls make use of the CLONE_NEW* constants listed in the previous article: CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS.

与名字空间相关的 API 由三个系统调用(clone()unshare(),和 setns())以及一些 /proc 文件系统中的文件组成。本文我们将一起来看看这些系统调用和 /proc 文件。我们在调用这些函数时需要使用上一篇文章中列出的 CLONE_NEW* 常量来指定要访问的名字空间类型,这些常量包括:CLONE_NEWIPCCLONE_NEWNSCLONE_NEWNETCLONE_NEWPIDCLONE_NEWUSER,和 CLONE_NEWUTS

在一个新的名字空间中创建一个子进程的方法:clone() (Creating a child in a new namespace: clone())

One way of creating a namespace is via the use of clone(), a system call that creates a new process. For our purposes, clone() has the following prototype:

创建名称空间的一种方法是使用 clone(),该系统调用会创建一个新进程。创建的方法,参考如下 clone() 的原型:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

Essentially, clone() is a more general version of the traditional UNIX fork() system call whose functionality can be controlled via the flags argument. In all, there are more than twenty different CLONE_* flags that control various aspects of the operation of clone(), including whether the parent and child process share resources such as virtual memory, open file descriptors, and signal dispositions. If one of the CLONE_NEW* bits is specified in the call, then a new namespace of the corresponding type is created, and the new process is made a member of that namespace; multiple CLONE_NEW* bits can be specified in flags.

本质上,clone() 是传统 UNIX 系统调用 fork() 的一个更通用的版本,其通用性体现在我们可以通过 flags 参数,指定超过 20 种 CLONE_* 选项来设置父进程在派生子进程过程中是否共享诸如虚拟内存、打开的文件描述符和分发的信号等资源。如果在调用中指定了一个 CLONE_NEW* 位,则会创建相应类型的新名字空间,并且新创建的进程将成为该名字空间的成员;具体操作时通过 flags 参数进行指定并且可以同时设定多个 CLONE_NEW* 位。

Our example program (demo_uts_namespace.c) uses clone() with the CLONE_NEWUTS flag to create a UTS namespace. As we saw last week, UTS namespaces isolate two system identifiers—the hostname and the NIS domain name—that are set using the sethostname() and setdomainname() system calls and returned by the uname() system call. You can find the full source of the program here. Below, we’ll focus on just some of the key pieces of the program (and for brevity, we’ll omit the error checking code that is present in the full version of the program).

我们的示例程序(demo_uts_namespace.c)调用 clone() 时传入 CLONE_NEWUTS 来创建 UTS 名字空间。正如我们上周看到的,UTS 名字空间隔离了两个系统的标识符(主机名 hostname 和 NIS 域名),系统调用 sethostname()setdomainname() 可以用于设置这两个标识符,如果要获取这两个值可以调用 uname()。你可以在这里 获得示例程序的完整源代码。下面,我们将重点介绍程序的一些关键部分(为简洁起见,我们将省略完整版本中的错误检查部分的代码)。

The example program takes one command-line argument. When run, it creates a child that executes in a new UTS namespace. Inside that namespace, the child changes the hostname to the string given as the program’s command-line argument.

示例程序运行时需要一个命令行参数。运行时,它会创建一个在新的 UTS 名字空间中执行的子进程。子进程会将该 UTS 名字空间的 hostname 更改为命令行参数传入的字符串。

The first significant piece of the main program is the clone() call that creates the child process:

main 函数的第一个重要部分是对 clone() 调用,该调用会创建一个子进程:

child_pid = clone(childFunc, 
                  child_stack + STACK_SIZE,   /* Points to start of 
                                                 downwardly growing stack */ 
                  CLONE_NEWUTS | SIGCHLD, argv[1]);

printf("PID of child created by clone() is %ld\n", (long) child_pid);

The new child will begin execution in the user-defined function childFunc(); that function will receive the final clone() argument (argv1) as its argument. Since CLONE_NEWUTS is specified as part of the flags argument, the child will execute in a newly created UTS namespace.

子进程执行用户定义的函数 childFunc();该函数将 clone() 函数的最后一个参数(argv[1])作为其输入的参数。clone() 函数的第三个参数指定了 CLONE_NEWUTS,所以子进程将在新创建的 UTS 名字空间中执行。

The main program then sleeps for a moment. This is a (crude) way of giving the child time to change the hostname in its UTS namespace. The program then uses uname() to retrieve the host name in the parent’s UTS namespace, and displays that hostname:

此后 main 函数睡眠了一会儿。如此简单的处理,仅仅是为了给子进程足够的时间改变其所在 UTS 名字空间中的 hostname。程序然后使用 uname() 来获取父进程所在 UTS 名字空间中的 hostname 并打印之:

sleep(1);           /* Give child time to change its hostname */

uname(&uts);
printf("uts.nodename in parent: %s\n", uts.nodename);

Meanwhile, the childFunc() function executed by the child created by clone() first changes the hostname to the value supplied in its argument, and then retrieves and displays the modified hostname:

同时,由 clone() 系统调用所创建的子进程执行 childFunc() 函数,首先将 hostname 更改为程序运行参数指定的值,然后获取并打印修改后的 hostname:

sethostname(arg, strlen(arg);

uname(&uts);
printf("uts.nodename in child:  %s\n", uts.nodename);

Before terminating, the child sleeps for a while. This has the effect of keeping the child’s UTS namespace open, and gives us a chance to conduct some of the experiments that we show later.

在子进程的执行函数结束之前,子进程会睡眠一会儿。这么做的主要目的是让子进程所在的 UTS 名字空间维持一段时间,以便我们有机会手动输入一些命令进行一些实验,见下文所示。

Running the program demonstrates that the parent and child processes have independent UTS namespaces:

运行该程序可以看到父进程和子进程具有独立的 UTS 名字空间:

$ su                   # Need privilege to create a UTS namespace
Password: 
# uname -n
antero
# ./demo_uts_namespaces bizarro
PID of child created by clone() is 27514
uts.nodename in child:  bizarro
uts.nodename in parent: antero

As with most other namespaces (user namespaces are the exception), creating a UTS namespace requires privilege (specifically, CAP_SYS_ADMIN). This is necessary to avoid scenarios where set-user-ID applications could be fooled into doing the wrong thing because the system has an unexpected hostname.

与大多数其他名字空间(user 名字空间是个例外)一样,创建 UTS 名字空间需要特殊权限(尤其是 CAP_SYS_ADMIN)。这么做是有必要的,否则在运行拥有 set-user-ID 的程序时可能会因为访问了一个不认识的 hostname 而导致其执行一些错误的操作。

Another possibility is that a set-user-ID application might be using the hostname as part of the name of a lock file. If an unprivileged user could run the application in a UTS namespace with an arbitrary hostname, this would open the application to various attacks. Most simply, this would nullify the effect of the lock file, triggering misbehavior in instances of the application that run in different UTS namespaces. Alternatively, a malicious user could run a set-user-ID application in a UTS namespace with a hostname that causes creation of the lock file to overwrite an important file. (Hostname strings can contain arbitrary characters, including slashes.)

另一种可能性是具备 set-user-ID 的应用程序可能使用 hostname 作为锁文件(lock file)的名称的一部分。如果一个非法用户可以在该 UTS 名字空间中运行该应用程序并指定任意的 hostname 值,那么他就可以运行该应用程序开展各种恶意攻击。最简单的后果是,这会使锁文件失效,导致在不同 的 UTS 名字空间中运行该应用程序会产生各种错误行为。更为严重的是,通过在 UTS 名字空间中运行具备 set-user-ID 的应用程序,一个恶意用户可以指定 一个和某个重要文件同名的 hostname,这样一旦该锁文件被创建,就会覆盖同名的重要文件。(注意 hostname 是一个字符串,可以包含任意字符,包括斜杠。)

/proc/PID/ns 文件(The /proc/PID/ns files)

Each process has a /proc/PID/ns directory that contains one file for each type of namespace. Starting in Linux 3.8, each of these files is a special symbolic link that provides a kind of handle for performing certain operations on the associated namespace for the process.

每个进程都对应一个 /proc/PID/ns 目录,该目录下每种类型的名字空间对应着一个文件。从 Linux 内核版本 3.8 开始,每个文件都是一个特殊的符号链接,该符号链接内容包含了该进程所属的某个名字空间的一些基本信息可供应用做进一步的读取分析。

$ ls -l /proc/$$/ns         # $$ is replaced by shell's PID
total 0
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

One use of these symbolic links is to discover whether two processes are in the same namespace. The kernel does some magic to ensure that if two processes are in the same namespace, then the inode numbers reported for the corresponding symbolic links in /proc/PID/ns will be the same. The inode numbers can be obtained using the stat() system call (in the st_ino field of the returned structure).

这些符号链接的用途之一是可以据此判断两个进程是否在同一个名字空间中。内核的实现通过一些小技巧,确保如果两个进程在同一个名字空间中,那么在 /proc/PID/ns 目录下创建的相应的符号链接中所包含的 inode 值将是相同的。如果你想获取 inode 的数值可以通过调用 stat() 得到(该值保存在返回的结构体的 st_ino 字段里)。

However, the kernel also constructs each of the /proc/PID/ns symbolic links so that it points to a name consisting of a string that identifies the namespace type, followed by the inode number. We can examine this name using either the ls -l or the readlink command.

内核构造的每个 /proc/PID/ns 的符号链接,其内容的格式是以代表它所指向的名称空间类型的字符串开头,后跟 inode 的数值编号。我们可以使用 ls -lreadlink 命令来检查这个符号链接的内容 。

Let’s return to the shell session above where we ran the demo_uts_namespaces program. Looking at the /proc/PID/ns symbolic links for the parent and child process provides an alternative method of checking whether the two processes are in the same or different UTS namespaces:

让我们回到上面运行 demo_uts_namespaces 程序的 shell 会话。通过查看父进程和子进程的 /proc/PID/ns 符号链接也可以检查这两个进程是否位于相同或不同的 UTS 名字空间中:

^Z                                # Stop parent and child
[1]+  Stopped          ./demo_uts_namespaces bizarro
# jobs -l                         # Show PID of parent process
[1]+ 27513 Stopped         ./demo_uts_namespaces bizarro
# readlink /proc/27513/ns/uts     # Show parent UTS namespace
uts:[4026531838]
# readlink /proc/27514/ns/uts     # Show child UTS namespace
uts:[4026532338]

As can be seen, the content of the /proc/PID/ns/uts symbolic links differs, indicating that the two processes are in different UTS namespaces.

可以看出来,两者的 /proc/PID/ns/uts 符号链接的内容不同,表明这两个进程位于不同的 UTS 名字空间中。

The /proc/PID/ns symbolic links also serve other purposes. If we open one of these files, then the namespace will continue to exist as long as the file descriptor remains open, even if all processes in the namespace terminate. The same effect can also be obtained by bind mounting one of the symbolic links to another location in the file system:

/proc/PID/ns 下的符号链接还有其他的用途。如果我们打开其中的一个文件,那么只要该文件描述符保持打开状态,名字空间将持续存在,即使该名字空间中的所有进程都已终止。通过将其中一个符号链接绑定挂载到文件系统中的另一个位置,也可以获得相同的效果:

# touch ~/uts                            # Create mount point
# mount --bind /proc/27514/ns/uts ~/uts

Before Linux 3.8, the files in /proc/PID/ns were hard links rather than special symbolic links of the form described above. In addition, only the ipc, net, and uts files were present.

在 Linux 3.8 之前,/proc/PID/ns 中的文件是硬链接,而不是上述特殊形式的符号链接。另外,当时只有 ipc,net 和 uts 文件存在。

加入一个已存在的名字空间:setns()(Joining an existing namespace: setns())

Keeping a namespace open when it contains no processes is of course only useful if we intend to later add processes to it. That is the task of the setns() system call, which allows the calling process to join an existing namespace:

对于一个名字空间来说,即使已经不包含任何进程,但仍然保持其为打开状态的原因是方便我们稍后向其中添加新的进程。添加进程的工作可以通过调用 setns() 这个系统调用完成,它会将调用该函数的进程加入一个已存在的名字空间:

int setns(int fd, int nstype);

More precisely, setns() disassociates the calling process from one instance of a particular namespace type and reassociates the process with another instance of the same namespace type.

更确切地说,setns() 会将调用进程与指定类型的名字空间分离,然后再将该进程与同类型的另一个名字空间重新关联。

The fd argument specifies the namespace to join; it is a file descriptor that refers to one of the symbolic links in a /proc/PID/ns directory. That file descriptor can be obtained either by opening one of those symbolic links directly or by opening a file that was bind mounted to one of the links.

fd 参数用于指定要加入的名字空间; 它是一个文件描述符,指向 /proc/PID/ns 目录中的一个符号链接 。该文件描述符可以通过直接打开其中一个符号链接或者打开一个绑定挂载到某个符号链接的文件来获得。

The nstype argument allows the caller to check the type of namespace that fd refers to. If this argument is specified as zero, no check is performed. This can be useful if the caller already knows the namespace type, or does not care about the type. The example program that we discuss in a moment (ns_exec.c) falls into the latter category: it is designed to work with any namespace type. Specifying nstype instead as one of the CLONE_NEW* constants causes the kernel to verify that fd is a file descriptor for the corresponding namespace type. This can be useful if, for example, the caller was passed the file descriptor via a UNIX domain socket and needs to verify what type of namespace it refers to.

nstype 参数可以帮助系统调用者检查 fd 所关联的名字空间的类型。如果该参数被指定为零,则不执行检查。这在调用者已经明确知道该名字空间的类型,或者不关心该类型时可能会有用。我们稍后讨论的示例程序(ns_exec.c)属于后一类:它被设计用于处理任何名字空间类型。将 nstype 指定为 CLONE_NEW* 常量之一会导致内核验证 fd 是否是相应名称空间类型的文件描述符。这在如下场景下会非常有用,例如,调用者得到了一个通过 UNIX 域套接字传递来的文件描述符,那么在使用该文件描述符之前会有必要验证一下它对应的是哪种类型的名称空间。

Using setns() and execve() (or one of the other exec() functions) allows us to construct a simple but useful tool: a program that joins a specified namespace and then executes a command in that namespace.

使用 setns()execve()(或其他类型的 exec() 函数)允许我们构建一个简单但有用的程序,它加入一个指定的名字空间,然后在该名字空间中执行一个命令。

Our program (ns_exec.c, whose full source can be found here) takes two or more command-line arguments. The first argument is the pathname of a /proc/PID/ns/* symbolic link (or a file that is bind mounted to one of those symbolic links). The remaining arguments are the name of a program to be executed inside the namespace that corresponds to that symbolic link and optional command-line arguments to be given to that program. The key steps in the program are the following:

我们的程序(ns_exec.c,其完整源代码可以在这里 找到)需要两个或更多的命令行参数。第一个参数是 /proc/PID/ns/* 符号链接(或绑定到这些符号链接之一的文件)的路径名 。其余的参数是要在符号链接所对应的名字空间内执行的程序的名称以及执行该程序时可选的命令行参数。程序的核心代码如下:

fd = open(argv[1], O_RDONLY);   /* Get descriptor for namespace */

setns(fd, 0);                   /* Join that namespace */

execvp(argv[2], &argv[2]);      /* Execute a command in namespace */

An interesting program to execute inside a namespace is, of course, a shell. We can use the bind mount for the UTS namespace that we created earlier in conjunction with the ns_exec program to execute a shell in the new UTS namespace created by our invocation of demo_uts_namespaces:

举例时为方便起见,可以在名字空间内执行一个 shell(译者注:/bin/bash)。执行 ns_exec 时指定第一个参数使用前面我们创建的绑定挂载文件(译者注:~/uts)就可以使 bash 在我们先前执行 demo_uts_namespaces 时所创建的 UTS 名字空间中运行:

# ./ns_exec ~/uts /bin/bash     # ~/uts is bound to /proc/27514/ns/uts
My PID is: 28788

We can then verify that the shell is in the same UTS namespace as the child process created by demo_uts_namespaces, both by inspecting the hostname and by comparing the inode numbers of the /proc/PID/ns/uts files:

我们可以通过检查主机名并比较 /proc/PID/ns/uts 文件的 inode 编号来 验证 bash 所在进程是否与 demo_uts_namespaces 创建的子进程位于同一个 UTS 名字空间中:

# hostname
bizarro
# readlink /proc/27514/ns/uts
uts:[4026532338]
# readlink /proc/$$/ns/uts      # $$ is replaced by shell's PID
uts:[4026532338]

In earlier kernel versions, it was not possible to use setns() to join mount, PID, and user namespaces, but, starting with Linux 3.8, setns() now supports joining all namespace types.

在较早的内核版本中,并不能使用 setns() 来加入 mount,PID 和 user 类型的名字空间,但是从 Linux 3.8 开始, setns() 可以支持加入所有类型的名字空间。

退出一个名字空间:unshare()(Leaving a namespace: unshare()

The final system call in the namespaces API is unshare():

本文要介绍的最后一个名字空间 API 是 unshare()

int unshare(int flags);

The unshare() system call provides functionality similar to clone(), but operates on the calling process: it creates the new namespaces specified by the CLONE_NEW* bits in its flags argument and makes the caller a member of the namespaces. (As with clone(), unshare() provides functionality beyond working with namespaces that we’ll ignore here.) The main purpose of unshare() is to isolate namespace (and other) side effects without having to create a new process or thread (as is done by clone()).

unshare() 系统调用提供类似 clone() 的功能 ,但和 clone() 不同的是它直接作用于调用该函数的进程:它根据 flags 中指定的 CLONE_NEW* 位创建新的名字空间,同时使调用该函数的进程成为新的名字空间中的一员。(与 clone() 一样,我们这里暂时不讨论 unshare() 所提供的其他与名字空间无关的功能)。unshare() 的主要目的是使调用进程拥有一个自己的名字空间(和其他方面),但它不会创建新进程或线程(这是和 clone() 所不同的地方)。

Leaving aside the other effects of the clone() system call, a call of the form:

不考虑 clone() 系统调用的其他影响,如下形式的调用:

clone(..., CLONE_NEWXXX, ....);

is roughly equivalent, in namespace terms, to the sequence:

仅考虑对名字空间的操作,大致等价于如下代码:

if (fork() == 0)
	unshare(CLONE_NEWXXX);      /* Executed in the child process */

One use of the unshare() system call is in the implementation of the unshare command, which allows the user to execute a command in a separate namespace from the shell. The general form of this command is:

unshare() 系统调用的一个用途是实现 unshare 命令,该命令允许用户在另一个(和 shell 所在进程不是同一个)名字空间中执行某个程序。unshare 命令的一般形式是:

unshare [options] program [arguments]

The options are command-line flags that specify the namespaces to unshare before executing program with the specified arguments.

命令行中的 options 选项用于指定新创建的名字空间类型,名字空间创建后即在其中利用 arguments 所指定的参数运行 program 所指定的程序。

The key steps in the implementation of the unshare command are straightforward:

用于实现 unshare 命令的关键代码步骤很简单:

/* Code to initialize 'flags' according to command-line options
   omitted */

unshare(flags);

/* Now execute 'program' with 'arguments'; 'optind' is the index
   of the next command-line argument after options */

execvp(argv[optind], &argv[optind]);

A simple implementation of the unshare command (unshare.c) can be found here.

这里 可以找到 unshare 命令(unshare.c)的一个简单实现。

In the following shell session, we use our unshare.c program to execute a shell in a separate mount namespace. As we noted in last week’s article, mount namespaces isolate the set of filesystem mount points seen by a group of processes, allowing processes in different mount namespaces to have different views of the filesystem hierarchy.

如下所示,我们使用 unshare.c 程序在另一个 mount 名字空间中执行一个 shell。正如我们在上周的文章中指出的,mount 名字空间可以隔离进程所看到的文件系统挂载点,允许不同 mount 名字空间中的进程拥有不同的文件系统层次结构。

# echo $$                             # Show PID of shell
8490
# cat /proc/8490/mounts | grep mq     # Show one of the mounts in namespace
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
# readlink /proc/8490/ns/mnt          # Show mount namespace ID 
mnt:[4026531840]
# ./unshare -m /bin/bash              # Start new shell in separate mount namespace
# readlink /proc/$$/ns/mnt            # Show mount namespace ID 
mnt:[4026532325]

Comparing the output of the two readlink commands shows that the two shells are in separate mount namespaces. Altering the set of mount points in one of the namespaces and checking whether that change is visible in the other namespace provides another way of demonstrating that the two programs are in separate namespaces:

比较两个 readlink 命令的输出显示这两个 shell 位于单独的 mount 名字空间中。在其中一个名字空间中更改一组挂载点并检查该更改是否在其他名字空间中可见,这提供了另一种演示两个进程位于不同名字空间中的方式:

# umount /dev/mqueue                  # Remove a mount point in this shell
# cat /proc/$$/mounts | grep mq       # Verify that mount point is gone
# cat /proc/8490/mounts | grep mq     # Is it still present in the other namespace?
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

As can be seen from the output of the last two commands, the /dev/mqueue mount point has disappeared in one mount namespace, but continues to exist in the other.

从最后两条命令的输出可以看到, /dev/mqueue 挂载点在一个 mount 名字空间中消失了,但在另一个 mount 名字空间中仍然存在。

结束语 (Concluding remarks)

In this article we’ve looked at the fundamental pieces of the namespace API and how they are employed together. In the follow-on articles, we’ll look in more depth at some other namespaces, in particular, the PID and user namespaces; user namespaces open up a range of new possibilities for applications to use kernel interfaces that were formerly restricted to privileged applications.

本文我们研究了名字空间相关的最基本的几个 API 以及它们的使用。在后续文章中,我们将更深入地研究其他一些名字空间,特别是 PID 和 user 名字空间; user 名字空间为应用程序执行以前仅限于特权应用程序使用的内核接口创造了一种新的可能性。

(2013-01-15: updated the concluding remarks to reflect the fact that there will be more than one following article.)

(2013-01-15:更新了结束语部分,补充说明后继将会有不止一篇文章。)



Read Album:

Read Related:

Read Latest: