泰晓科技 -- 聚焦 Linux - 追本溯源,见微知著!


LWN 531114: 名字空间实作,第一章:名字空间(namespaces)概述

Wang Chen 创作于 2018/03/15

原文:Namespaces in operation, part 1: namespaces overview 原创:By Michael Kerrisk @ Jan 4, 2013 翻译:By unicornx 校对:By w-simon

The Linux 3.8 merge window saw the acceptance of Eric Biederman’s sizeable series of user namespace and related patches. Although there remain some details to finish—for example, a number of Linux filesystems are not yet user-namespace aware—the implementation of user namespaces is now functionally complete.

在 Linux 3.8 版本中合入了大量由 Eric Biederman 提交的 user 名字空间(namespace)相关补丁修改。尽管还有一些细节需要补充(例如,许多 Linux 文件系统尚未支持 user 名字空间),但 user 名字空间的基本框架功能已经完成。

The completion of the user namespaces work is something of a milestone, for a number of reasons. First, this work represents the completion of one of the most complex namespace implementations to date, as evidenced by the fact that it has been around five years since the first steps in the implementation of user namespaces (in Linux 2.6.23). Second, the namespace work is currently at something of a “stable point”, with the implementation of most of the existing namespaces being more or less complete. This does not mean that work on namespaces has finished: other namespaces may be added in the future, and there will probably be further extensions to existing namespaces, such as the addition of namespace isolation for the kernel log. Finally, the recent changes in the implementation of user namespaces are something of a game changer in terms of how namespaces can be used: starting with Linux 3.8, unprivileged processes can create user namespaces in which they have full privileges, which in turn allows any other type of namespace to be created inside a user namespace.

user 名字空间功能的完成绝对是一个里程碑式的成果。首先,这代表了迄今为止名字空间中最复杂的部分已宣告完成,有记录表明,user 名字空间的开发工作从刚开始(那时还是 Linux 2.6.23)到现在已有大约五年的时间。其次,名字空间相关功能目前趋于“稳定”,因为现有大部分的名字空间的实现或多或少都接近完成。但这并不意味着名字空间上的工作已经结束:因为将来可能还会添加其他类型的名字空间,并且可能会对现有名字空间进行进一步改进,例如为内核日志添加名字空间隔离的支持。最后,就名字空间的使用而言,user 名字空间功能的加入将极大地改变人们操作名字空间的方式:从 Linux 3.8 开始,非特权的进程也将可以创建 user 名字空间,并且在这个名字空间中它们可以拥有完全的权限,进而导致其可以在该 user 名字空间内创建任何其他类型的名字空间。

Thus, the present moment seems a good point to take an overview of namespaces and a practical look at the namespace API. This is the first of a series of articles that does so: in this article, we provide an overview of the currently available namespaces; in the follow-on articles, we’ll show how the namespace APIs can be used in programs.

因此,目前是时候让大家一起来完整了解一下名字空间的概念并具体学习一下操作名字空间的 API。本文是系列文章中的第一篇:在本文中,我们概述了内核当前支持的名字空间;在后续文章中,我们将展示如何在编程中使用名字空间的 API。

名字空间概述(The namespaces)

Currently, Linux implements six different types of namespaces. The purpose of each namespace is to wrap a particular global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. One of the overall goals of namespaces is to support the implementation of containers, a tool for lightweight virtualization (as well as other purposes) that provides a group of processes with the illusion that they are the only processes on the system.

截至现在,Linux 实现了六种不同类型的名字空间。每种名字空间的目的都是将系统中的某种全局系统资源封装起来,使得某个名字空间内的进程只能看到自己拥有的资源实例。内核引入名字空间概念的总体目标之一是支持容器(containers) 的实现,容器是一种实现轻量级虚拟化(以及其他目的)的工具,它为一组进程提供了一个虚拟运行环境,让它们感觉自己独占了系统。

In the discussion below, we present the namespaces in the order that they were implemented (or at least, the order in which the implementations were completed). The CLONE_NEW* identifiers listed in parentheses are the names of the constants used to identify namespace types when employing the namespace-related APIs (clone(), unshare(), and setns()) that we will describe in our follow-on articles.

在下面的讨论中,我们按照它们实现的顺序(或者至少是实现完成的顺序)快速浏览一下每种名字空间。括号中列出的 CLONE_NEW* 标识符是在后续文章中所要介绍的名字空间相关 API(clone(), unshare(), 和 setns())里用于标识名字空间类型的常量名称。

Mount namespaces (CLONE_NEWNS, Linux 2.4.19) isolate the set of filesystem mount points seen by a group of processes. Thus, processes in different mount namespaces can have different views of the filesystem hierarchy. With the addition of mount namespaces, the mount() and umount() system calls ceased operating on a global set of mount points visible to all processes on the system and instead performed operations that affected just the mount namespace associated with the calling process.

mount 名字空间CLONE_NEWNS,Linux 2.4.19)用于隔离进程所访问的一组文件系统挂载点。基于该功能,不同 mount 名字空间中的进程可以具有不同的文件系统层次结构。也就是说在进程自己的 mount 名字空间中执行 mount()umount() 这样的系统调用不会影响系统全局的挂载点,而仅仅影响与调用进程关联的 mount 名字空间中的挂载点。

One use of mount namespaces is to create environments that are similar to chroot jails. However, by contrast with the use of the chroot() system call, mount namespaces are a more secure and flexible tool for this task. Other more sophisticated uses of mount namespaces are also possible. For example, separate mount namespaces can be set up in a master-slave relationship, so that the mount events are automatically propagated from one namespace to another; this allows, for example, an optical disk device that is mounted in one namespace to automatically appear in other namespaces.

mount 名字空间的用途之一是类似于为进程创建 chroot 环境(译者注:所谓 chroot 指的是在 Unix 以及 Unix-like 系统上改变当前进程的文件系统根目录的操作,一般是使用 chroot() 系统调用)。但是,相比于使用 chroot(),使用 mount 名字空间方式更安全也更灵活。除此之外还有其他一些更复杂的有关 mount 名字空间的用法。例如,两个独立的 mount 名字空间可以设置为主从关系,这么做会导致一些挂载事件自动从一个名字空间传播到另一个名字空间;其结果是挂载在一个名字空间中的光盘设备可以自动出现在其他名字空间中。

Mount namespaces were the first type of namespace to be implemented on Linux, appearing in 2002. This fact accounts for the rather generic “NEWNS” moniker (short for “new namespace”): at that time no one seems to have been thinking that other, different types of namespace might be needed in the future.

mount 名字空间是第一种在 Linux 上实现的名字空间类型,那是 2002 年的事情了。该名字空间类型对应的常量值叫 “NEWNS” (也就是 “new namespace” 的缩写):可见那时的人们似乎还没有想到将来会增加其他类型的名字空间。

UTS namespaces (CLONE_NEWUTS, Linux 2.6.19) isolate two system identifiers—nodename and domainname—returned by the uname() system call; the names are set using the sethostname() and setdomainname() system calls. In the context of containers, the UTS namespaces feature allows each container to have its own hostname and NIS domain name. This can be useful for initialization and configuration scripts that tailor their actions based on these names. The term “UTS” derives from the name of the structure passed to the uname() system call: struct utsname. The name of that structure in turn derives from “UNIX Time-sharing System”.

UTS 名字空间CLONE_NEWUTS,Linux 2.6.19)隔离了两个系统标识符:hostname 和 domainname,这两个标识符的值由 uname() 系统调用返回;并通过 sethostname() and setdomainname() 系统调用进行设置。UTS 名字空间功能允许每个容器拥有自己的 hostname 和 NIS domainname。这对初始化和配置脚本非常有用,脚本可以根据这两个标识符的值定制它们的操作。术语 “UTS” 来自传递给 uname() 系统调用的结构体类型的名称 :struct utsname。该结构体类型名称中 uts 是 “UNIX Time-sharing System” 的缩写。

IPC namespaces (CLONE_NEWIPC, Linux 2.6.19) isolate certain interprocess communication (IPC) resources, namely, System V IPC objects and (since Linux 2.6.30) POSIX message queues. The common characteristic of these IPC mechanisms is that IPC objects are identified by mechanisms other than filesystem pathnames. Each IPC namespace has its own set of System V IPC identifiers and its own POSIX message queue filesystem.

IPC 名字空间CLONE_NEWIPC,Linux 2.6.19)隔离了某些进程间通信(IPC)资源,包括 System V IPC 对象和 POSIX 消息队列(该特性从 Linux 2.6.30 开始加入内核)。这些 IPC 机制的共同特点是它们的 IPC 对象不是通过文件系统路径名来标识,而是通过其他的机制(译者注:而且这些 IPC 标识符对应创建的 IPC 对象都是全局性的资源)。启用 IPC 名字空间功能后,每个 IPC 名字空间都有自己的一组 System V IPC 标识符和它自己的 POSIX 消息队列文件系统。

PID namespaces (CLONE_NEWPID, Linux 2.6.24) isolate the process ID number space. In other words, processes in different PID namespaces can have the same PID. One of the main benefits of PID namespaces is that containers can be migrated between hosts while keeping the same process IDs for the processes inside the container. PID namespaces also allow each container to have its own init (PID 1), the “ancestor of all processes” that manages various system initialization tasks and reaps orphaned child processes when they terminate.

PID 名字空间CLONE_NEWPID,Linux 2.6.24)用于隔离进程标识符(PID)。换句话说,不同 PID 名字空间中的进程可以具有相同的 PID 值。引入 PID 名字空间的主要优点之一是支持在主机之间迁移容器时,可以使容器内进程的 PID 值保持不变。PID 名字空间还允许每个容器拥有自己的 init 进程(PID 的值为 1),即所有进程的“祖先”,该进程负责管理各种系统初始化任务,以及在孤儿进程终止时回收它们。

From the point of view of a particular PID namespace instance, a process has two PIDs: the PID inside the namespace, and the PID outside the namespace on the host system. PID namespaces can be nested: a process will have one PID for each of the layers of the hierarchy starting from the PID namespace in which it resides through to the root PID namespace. A process can see (e.g., view via /proc/PID and send signals with kill()) only processes contained in its own PID namespace and the namespaces nested below that PID namespace.

对主机上从属于某个 PID 名字空间的每个进程来说,它们都拥有两类 PID:名字空间内的 PID 和名字空间外的 PID。PID 名字空间可以嵌套:对于一个进程来说,从根(root)PID 名字空间(译者注:即主机上第一个缺省的 PID 名字空间)往下,层层派生,一直到该进程所在的 PID 名字空间,该进程在每一层 PID 名字空间中都有一个 PID 值与之对应。一个进程只能 “看到” 同一个 PID 空间中的其他进程以及往下派生的子 PID 名字空间中的进程(所谓“看到”,指的是,可以通过 /proc/PID 访问某个进程的信息或者在调用 kill() 时给定 PID 值来向某个进程发送信号)。

Network namespaces (CLONE_NEWNET, started in Linux 2.6.24 and largely completed by about Linux 2.6.29) provide isolation of the system resources associated with networking. Thus, each network namespace has its own network devices, IP addresses, IP routing tables, /proc/net directory, port numbers, and so on.

network 名字空间CLONE_NEWNET,从 Linux 内核 2.6.24 开始开发,直到 Linux 2.6.29 基本完成)实现了与网络相关的系统资源的隔离。基于该功能,每个 network 名字空间都有自己的网络设备,IP 地址,IP 路由表,/proc/net 目录,端口号等。

Network namespaces make containers useful from a networking perspective: each container can have its own (virtual) network device and its own applications that bind to the per-namespace port number space; suitable routing rules in the host system can direct network packets to the network device associated with a specific container. Thus, for example, it is possible to have multiple containerized web servers on the same host system, with each server bound to port 80 in its (per-container) network namespace.

network 名字空间使得容器对互联网的支持更好了:每个容器都可以有自己的(虚拟)网络设备并且容器中的应用程序可以在私有的名字空间范围内绑定端口号;在主机系统中通过恰当地配置路由规则可以将网络分组分发至与特定容器相关联的网络设备。举个例子来说:我们可以在同一个主机上安装多个 Web 服务器(每个服务器安装在一个容器中),每个服务器在其所在容器的 network 名字空间中绑定到私有的 80 端口。

User namespaces (CLONE_NEWUSER, started in Linux 2.6.23 and completed in Linux 3.8) isolate the user and group ID number spaces. In other words, a process’s user and group IDs can be different inside and outside a user namespace. The most interesting case here is that a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace. This means that the process has full root privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.

user 名字空间CLONE_NEWUSER,在 Linux 2.6.23 时开始开发,在 Linux 3.8 中完成)实现了用户和组 ID 号 的空间隔离。换句话说,一个进程的用户 ID 和 组 ID 可以在不同的 user 名字空间中取不同的值。这里最有趣的例子是,一个进程的用户 ID 可以在某个 user 名字空间中是普通的非特权用户,而在另一个名字空间内是具备超级权限的 root。

Starting in Linux 3.8, unprivileged processes can create user namespaces, which opens up a raft of interesting new possibilities for applications: since an otherwise unprivileged process can hold root privileges inside the user namespace, unprivileged applications now have access to functionality that was formerly limited to root. Eric Biederman has put a lot of effort into making the user namespaces implementation safe and correct. However, the changes wrought by this work are subtle and wide ranging. Thus, it may happen that user namespaces have some as-yet unknown security issues that remain to be found and fixed in the future.

从 Linux 3.8 开始,一个普通权限的进程通过创建 user 名字空间可以实现许多有趣的特性:譬如通过在新的 user 名字空间中获取了 root 权限,原本受限的应用程序现在可以执行以前仅限于 root 用户才可以执行的操作。Eric Biederman 为确保 user 名字空间运行机制的的安全性和正确性付出了很多努力。这项工作带来的变化是微妙且影响广泛的。所以,user 名字空间可能还存在一些尚未发现和解决的安全问题。

结束语(Concluding remarks)

It’s now around a decade since the implementation of the first Linux namespace. Since that time, the namespace concept has expanded into a more general framework for isolating a range of global resources whose scope was formerly system-wide. As a result, namespaces now provide the basis for a complete lightweight virtualization system, in the form of containers. As the namespace concept has expanded, the associated API has grown—from a single system call (clone()) and one or two /proc files—to include a number of other system calls and many more files under /proc. The details of that API will form the subject of the follow-ups to this article.

现在距离 Linux 支持第一种名字空间已经有大约十年了。从那时起,名字空间概念已经扩展为一个更通用的框架,可用于隔离一系列原本属于全局可见的系统资源,并为容器这种完整的轻量级虚拟化系统提供了基础支持。随着名字空间概念的扩展,相关的 API 已经从单一系统调用(clone())和一两个 /proc 文件发展为多个其他系统调用和更多的 /proc 文件。有关这些 API 的细节将在后续文章中陆续介绍。

本系列文章索引(Series index)

The following list shows later articles in this series, along with their example programs:


Read Album:

Read Related:

Read Latest: