Title: Porting Linux to a new processor architecture, part 3: To the finish line
Author: Joël Porquet, September 23, 2015
Translator: 通天塔 email@example.com
Date: 20220406
Revisor: lzufalcon firstname.lastname@example.org
Project: RISC-V Linux 内核剖析 (RISC-V Linux Kernel Analysis)
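Editor's note: This series of three translated articles describes how to port Linux to a new processor architecture; this is the third. Please bookmark it or recommend it to friends. Special thanks go to the author, the translator, and the reviewers, who worked through the night writing, translating, and revising; your likes, support, and encouragement are appreciated. Interested readers can also join the open-source project this translation belongs to. Its articles and videos are being serialized on the 泰晓科技 website and official account, and the project is still recruiting enthusiasts; see "The newly formed RISC-V Linux kernel interest group is recruiting enthusiasts".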
This series of articles provides an overview of the procedure one can follow when porting the Linux kernel to a new processor architecture. Part 1 and part 2 focused on the non-code-related groundwork and the early code, from the assembly boot code to the creation of the first kernel thread. Following on from those, the series concludes by looking at the last portion of the procedure. As will be seen, most of the remaining work for launching the init process deals with thread and process management.
Spawning kernel threads
When start_kernel() performs its last function call (to rest_init()), the memory-management subsystem is fully operational, the boot processor is running and able to process both exceptions and interrupts, and the system has a notion of time.
While the execution flow has so far been sequential and single-threaded, the main job handled by rest_init() before turning into the boot idle thread is to create two kernel threads: kernel_init, which will be discussed in the next section, and kthreadd. As one can imagine, creating these kernel threads (and any other kinds of threads for that matter, from user threads within the same process to actual processes) implies the existence of a complex process-management infrastructure. Most of the infrastructure to create a new thread is not architecture-specific: operations such as copying the task_struct structure or the credentials, setting up the scheduler, and so on do not usually need any architecture-specific code. However, the process-management code must define a few architecture-specific parts, mainly for setting up the stack for each new thread and for switching between threads.
Linux always avoids creating new resources from scratch, especially new threads. With the exception of the initial thread (the one that has so far been booting the system and that we have implicitly been discussing), the kernel always duplicates an existing thread and modifies the copy to make it into the desired new thread. The same principle applies after thread creation, when the new thread’s execution begins for the first time, as it is easier to resume the execution of a thread than to start it from scratch. This mainly means that the newly allocated stack must be initialized such that when switching to the new thread for the first time, the thread looks like it is resuming its execution—as if it had simply been stopped earlier.
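The duplicate-then-modify idea can be illustrated with a small user-space model. Everything in it (struct toy_task, toy_fork(), toy_copy_thread(), and the field names) is hypothetical and far simpler than the kernel's task_struct; it only sketches the split between the generic copy and the architecture-specific fix-up of the child's saved context.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for the kernel's task_struct. */
struct toy_task {
    int pid;
    char comm[16];            /* thread name */
    unsigned long saved_sp;   /* arch-specific: saved stack pointer */
    unsigned long saved_pc;   /* arch-specific: where execution resumes */
};

/* Arch-specific part (assumed): make the copy look like a stopped thread. */
void toy_copy_thread(struct toy_task *child, unsigned long stack_top,
                     unsigned long entry)
{
    child->saved_sp = stack_top;
    child->saved_pc = entry;
}

/* Generic part: duplicate the parent, then modify the copy. */
struct toy_task toy_fork(const struct toy_task *parent, int new_pid,
                         const char *name, unsigned long stack_top,
                         unsigned long entry)
{
    struct toy_task child;

    assert(parent != NULL);
    child = *parent;                      /* start from a full copy */
    child.pid = new_pid;
    strncpy(child.comm, name, sizeof(child.comm) - 1);
    child.comm[sizeof(child.comm) - 1] = '\0';
    toy_copy_thread(&child, stack_top, entry);
    return child;
}

/* Example: a boot thread (PID 0) spawning a child that gets PID 1. */
struct toy_task toy_spawn_from_boot(void)
{
    struct toy_task boot = { 0, "boot", 0x1000UL, 0x100UL };

    return toy_fork(&boot, 1, "kernel_init", 0x2000UL, 0x200UL);
}
```

Only toy_copy_thread() would differ from one architecture to another in this model; the rest is the kind of generic work that a port gets for free.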
To understand this mechanism, it is necessary to delve a bit into thread switching, and more specifically into the switch of execution flow implemented by the architecture-specific context-switching routine switch_to(). This routine, which is always written in assembly language, is always called by the current (soon to be previous) thread while returning as the next (future current) thread. Part of this trick is achieved by saving the current context in the stack of the current thread, switching stack pointers to use the stack of the next thread, and restoring the saved context from it. As with a typical function, switch_to() finally returns to the “calling” function using the instruction address that had been saved on the stack of the newly current thread.
In the case that the next thread had previously been running and was temporarily removed from the processor, returning to the calling function would be a normal event that would eventually lead the thread to resume the execution of its own code. However, for a brand new thread, there would not have been any function to call switch_to() in order to save the thread’s context. This is why the stack of a new thread must be initialized to pretend that there has been a previous function call, enabling switch_to() to return after restoring this new thread. Such a function is usually set up to be a few assembly lines acting as a trampoline to the thread’s code.
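The forged stack frame and the trampoline can be modeled in portable C. This is a minimal sketch under strong assumptions: a real switch_to() is written in assembly and saves and restores registers, whereas here the "return" is an ordinary call through a saved function pointer. All names (struct switch_frame, toy_init_stack(), toy_switch_to(), toy_trampoline()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct switch_frame;
typedef void (*ret_addr_t)(struct switch_frame *);

/* Hypothetical layout of what a context switch leaves on a thread's stack. */
struct switch_frame {
    ret_addr_t ret_addr;      /* where the switch routine "returns" to */
    int (*fn)(void *);        /* thread body, used only by the trampoline */
    void *arg;                /* argument for the thread body */
    int result;               /* filled in by the trampoline */
};

union stack_slot {            /* keeps the toy stack suitably aligned */
    struct switch_frame frame;
    max_align_t align;
};

struct toy_thread {
    struct switch_frame *sp;      /* saved stack pointer */
    union stack_slot stack[8];    /* stack grows down from the last slot */
};

/* First code a new thread runs: jump to the real thread body. */
void toy_trampoline(struct switch_frame *f)
{
    f->result = f->fn(f->arg);
}

/* Forge a frame so the thread looks as if it had been switched out before. */
void toy_init_stack(struct toy_thread *t, int (*fn)(void *), void *arg)
{
    struct switch_frame *f = &t->stack[7].frame;

    f->ret_addr = toy_trampoline;
    f->fn = fn;
    f->arg = arg;
    f->result = 0;
    t->sp = f;
}

/* Stand-in for the restore half of switch_to(): use the saved frame. */
int toy_switch_to(struct toy_thread *next)
{
    struct switch_frame *f = next->sp;

    assert(f != NULL && f->ret_addr != NULL);
    f->ret_addr(f);           /* "return" into the forged caller */
    return f->result;
}

/* Sample thread body and a helper that performs the very first switch. */
int toy_body(void *arg)
{
    return *(int *)arg + 1;
}

int toy_first_switch(int value)
{
    struct toy_thread t;

    toy_init_stack(&t, toy_body, &value);
    return toy_switch_to(&t);
}
```

The point of the model is that toy_switch_to() contains no special case for new threads: the forged frame makes the first switch indistinguishable from any later one.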
Note that switching to a kernel thread does not generally involve switching to another page table since the kernel address space, in which all kernel threads run, is defined in every page table structure. For user processes, the switch to their own page table is performed by the architecture-specific routine
The first kernel thread
As explained in the source code, the only reason the kernel thread kernel_init is created first is that it must obtain PID 1. This is the PID that the init process (i.e. the first user-space process born from kernel_init) traditionally inherits.
Interestingly, the first task of kernel_init is to wait for the second kernel thread, kthreadd, to be ready. kthreadd is the kernel thread daemon in charge of asynchronously spawning new kernel threads whenever requested. Once kthreadd is ready, kernel_init proceeds with the second phase of booting, which includes a few architecture-specific initializations.
In the case of a multiprocessor system, kernel_init begins by starting the other processors before initializing the various subsystems composing the driver model (e.g. devtmpfs, devices, buses, etc.) and, later, using the defined initialization calls to bring up the actual device drivers for the underlying hardware system. Before getting into the “fancy” device drivers (e.g. block device, framebuffer, etc.), it is probably a good idea to focus on having at least an operational terminal (by implementing the corresponding driver if necessary), especially since the early console set up by early_printk() is supposed to be replaced by a real, full-featured console shortly after.
[Translator's note: in newer kernels, the early console provided by early_printk() has been superseded by earlycon.]
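The ordering imposed by initialization calls can be pictured with a tiny table-driven model. In the kernel, initcalls are registered through macros and collected by the linker; the explicit table, the level numbers, and every toy_ name below are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_LOG   8
#define TOY_MAX_LEVEL 7

typedef int (*toy_initcall_fn)(void);

struct toy_initcall {
    int level;               /* lower levels run first */
    toy_initcall_fn fn;      /* returns 0 on success */
};

static const char *toy_log[TOY_MAX_LOG];
static size_t toy_log_len;

static void toy_log_step(const char *name)
{
    if (toy_log_len < TOY_MAX_LOG)
        toy_log[toy_log_len++] = name;
}

static int toy_bus_init(void)     { toy_log_step("bus");     return 0; }
static int toy_console_init(void) { toy_log_step("console"); return 0; }
static int toy_block_init(void)   { toy_log_step("block");   return 0; }

/* Registration order is deliberately different from execution order. */
static const struct toy_initcall toy_initcalls[] = {
    { 6, toy_block_init },
    { 1, toy_bus_init },
    { 4, toy_console_init },
};

/* Run every registered call, level by level; return the failure count. */
int toy_do_initcalls(void)
{
    size_t n = sizeof(toy_initcalls) / sizeof(toy_initcalls[0]);
    int failures = 0;

    toy_log_len = 0;
    for (int level = 0; level <= TOY_MAX_LEVEL; level++)
        for (size_t i = 0; i < n; i++)
            if (toy_initcalls[i].level == level &&
                toy_initcalls[i].fn() != 0)
                failures++;
    return failures;
}

/* Name of the i-th call that ran, or NULL if fewer calls were made. */
const char *toy_init_step(size_t i)
{
    return i < toy_log_len ? toy_log[i] : NULL;
}
```

In this model the console comes up before the block driver regardless of where each entry sits in the table, which mirrors the advice above to get a working terminal first.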
It is also through these initialization calls that the initramfs is unpacked and the initial root filesystem (rootfs) is mounted. There are a few options for mounting an initial rootfs but I have found initramfs to be the simplest when porting Linux. Basically this means that the rootfs is statically built at compilation time and integrated into the kernel binary image. After being mounted, the rootfs can give access to the mandatory
Finally, the init memory is freed (i.e. the memory containing code and data that were used only during the initialization phase and that are no longer needed) and the init process that has been found on the rootfs is launched.
Executing init
At this point, launching init will probably result in an immediate fault when trying to fetch the first instruction. This is because, as with creating threads, being able to execute the init process (and actually any user-space application) first involves a bit of groundwork.
The function that needs to be implemented in order to solve the instruction-fetching issue is the page fault handler. Linux is lazy, particularly when it comes to user applications and, by default, does not pre-load the text and data of applications into memory. Instead, it only sets up all of the kernel structures that are strictly required and lets applications fault at their first instruction because the pages containing their text segment have usually not been loaded yet.
This is actually perfectly intentional behavior since it is expected that such a memory fault will be caught and fixed by the page fault handler. This handler can be seen as an intricate switch statement that is able to treat every fault related to memory: from vmalloc() faults that necessitate a synchronization with the reference page table to stack expansions in user applications. In this case, the handler will determine that the page fault corresponds to a valid virtual memory area (VMA) of the application and will consequently load the missing page in memory before retrying to run the application.
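The decision logic for user-space faults can be sketched as follows. This is a user-space model, not kernel code: struct toy_vma, the flag names, the growth limit, and toy_handle_fault() are all assumptions, and the step where a real handler would allocate and map the missing page is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE   4096UL
#define TOY_READ        1u
#define TOY_WRITE       2u
#define TOY_GROWSDOWN   4u               /* VMA is a downward-growing stack */
#define TOY_MAX_GROWTH  (16 * TOY_PAGE_SIZE)

/* One mapped region of the application: addresses in [start, end). */
struct toy_vma {
    unsigned long start;
    unsigned long end;
    unsigned flags;
};

enum toy_fault_result { FAULT_HANDLED, FAULT_STACK_GROWN, FAULT_SIGSEGV };

/* First VMA ending above addr; the array is assumed sorted by address. */
struct toy_vma *toy_find_vma(struct toy_vma *vmas, size_t n,
                             unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

enum toy_fault_result toy_handle_fault(struct toy_vma *vmas, size_t n,
                                       unsigned long addr, int is_write)
{
    struct toy_vma *vma = toy_find_vma(vmas, n, addr);
    int below = 0;

    if (vma == NULL)
        return FAULT_SIGSEGV;            /* above every mapping */

    if (addr < vma->start) {
        /* In a hole: acceptable only just below a stack VMA. */
        if (!(vma->flags & TOY_GROWSDOWN) ||
            vma->start - addr > TOY_MAX_GROWTH)
            return FAULT_SIGSEGV;
        below = 1;
    }

    if (!(vma->flags & (is_write ? TOY_WRITE : TOY_READ)))
        return FAULT_SIGSEGV;            /* access not permitted */

    if (below) {
        vma->start = addr & ~(TOY_PAGE_SIZE - 1);   /* expand the stack */
        return FAULT_STACK_GROWN;
    }

    /* A real handler would now load or allocate the page and map it. */
    return FAULT_HANDLED;
}
```

Returning FAULT_HANDLED corresponds to the case described above: the address belongs to a valid VMA, so the page is brought in and the faulting instruction is retried.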
Once the page fault handler is able to catch memory faults, it is likely that an extremely simple init process can be executed. However, it will not be able to do much as it cannot yet request any service from the kernel through system calls, such as printing to the terminal. To this end, the system-call infrastructure must be completed with a few architecture-specific parts. System calls are treated as software interrupts since they are accessed by a user instruction that makes the processor automatically switch to kernel mode, like hardware interrupts do. Besides defining the list of system calls supported by the port, handling system calls involves enhancing the interrupt and exception handler with the additional ability to receive them.
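Once the exception handler recognizes a system call, dispatching it usually comes down to indexing a table with the number passed by user space. The following is a minimal sketch with hypothetical names (struct toy_regs, toy_syscall_table, toy_handle_syscall()) and a made-up list of three system calls; which registers carry the number and arguments is architecture-specific and not modeled here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*toy_syscall_fn)(long, long, long);

/* Hypothetical list of system calls supported by the port. */
enum {
    TOY_NR_GETPID,
    TOY_NR_ADD,
    TOY_NR_RESERVED,      /* numbered but not implemented */
    TOY_NR_SYSCALLS
};

static long toy_sys_getpid(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return 1;
}

static long toy_sys_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* Entries that are not listed stay NULL. */
static const toy_syscall_fn toy_syscall_table[TOY_NR_SYSCALLS] = {
    [TOY_NR_GETPID] = toy_sys_getpid,
    [TOY_NR_ADD]    = toy_sys_add,
};

/* User registers as saved by the exception entry code (assumed layout). */
struct toy_regs {
    long nr;          /* system-call number chosen by user space */
    long args[3];
    long ret;         /* value handed back to user space */
};

/* Called from the exception handler once a system call is recognized. */
void toy_handle_syscall(struct toy_regs *regs)
{
    toy_syscall_fn fn = NULL;

    /* The number comes from user space, so it must be range-checked. */
    if (regs->nr >= 0 && regs->nr < TOY_NR_SYSCALLS)
        fn = toy_syscall_table[regs->nr];

    regs->ret = fn ? fn(regs->args[0], regs->args[1], regs->args[2])
                   : -ENOSYS;
}
```

Unknown or unimplemented numbers yield -ENOSYS in this sketch, so a bad request from user space fails cleanly instead of jumping through an invalid pointer.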
Once there is support for system calls, it should now be possible to execute a “hello world” init that is able to open the main console and write a message. But there are still missing pieces in order to have a full-featured init that is able to start other applications and communicate with them as well as exchange data with the kernel.
The first step toward this goal concerns the management of signals and, more particularly, signal delivery (either from another process or from the kernel itself). If a process has defined a handler for a specific signal, then this handler must be called whenever the given signal is pending. Such an event occurs when the targeted process is about to get scheduled again. More specifically, this means that when resuming the process, right at the moment of the next transition back to user mode, the execution flow of the process must be altered in order to execute the handler instead. Some space must also be made on the application’s stack for the execution of the handler. Once the handler has finished its execution and has returned to the kernel (via a system call that had been previously injected into the handler’s context), the context of the process is restored so that it can resume its normal execution.
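The sequence described above (save the context on the user stack, redirect execution to the handler, restore everything on return) can be modeled in a few lines. This is only a sketch: the register set, the word-indexed toy stack, the TOY_SIGRETURN_STUB address, and the function names are all assumptions, and a real port must also deal with signal masks, alternate stacks, and restarting interrupted system calls.

```c
#include <assert.h>
#include <string.h>

#define TOY_STACK_WORDS    64
#define TOY_SIGRETURN_STUB 0xfff0UL   /* assumed address of a stub that
                                         issues the "sigreturn" call */

/* Hypothetical user context; sp is an index into the toy stack. */
struct toy_regs {
    unsigned long pc;
    unsigned long sp;
    unsigned long ra;     /* return address */
    unsigned long a0;     /* first argument register */
};

#define TOY_FRAME_WORDS (sizeof(struct toy_regs) / sizeof(unsigned long))

struct toy_proc {
    struct toy_regs regs;       /* user context saved at kernel entry */
    unsigned long handler;      /* user handler address, 0 if none */
    int pending;                /* pending signal number, 0 if none */
    unsigned long stack[TOY_STACK_WORDS];
};

/*
 * Called right before returning to user mode.
 * Returns 1 if a handler was set up, 0 if nothing was pending or no
 * handler is installed, -1 if the user stack has no room for the frame.
 */
int toy_deliver_signal(struct toy_proc *p)
{
    struct toy_regs saved = p->regs;

    if (p->pending == 0 || p->handler == 0)
        return 0;
    if (saved.sp < TOY_FRAME_WORDS || saved.sp > TOY_STACK_WORDS)
        return -1;

    /* Make room on the user stack and save the interrupted context. */
    p->regs.sp = saved.sp - TOY_FRAME_WORDS;
    memcpy(&p->stack[p->regs.sp], &saved, sizeof(saved));

    /* Redirect execution to the handler; it "returns" into the stub. */
    p->regs.pc = p->handler;
    p->regs.ra = TOY_SIGRETURN_STUB;
    p->regs.a0 = (unsigned long)p->pending;
    p->pending = 0;
    return 1;
}

/* System call issued by the stub: restore the interrupted context. */
void toy_sigreturn(struct toy_proc *p)
{
    struct toy_regs saved;

    assert(p->regs.sp + TOY_FRAME_WORDS <= TOY_STACK_WORDS);
    memcpy(&saved, &p->stack[p->regs.sp], sizeof(saved));
    p->regs = saved;
}
```

Setting the return address to the stub plays the role of the system call injected into the handler's context: when the handler returns, control comes back to the kernel, which can then restore the original registers.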
The second and last step for fully running user-space applications deals with user-space memory access: when the kernel wants to copy data from or to user-space pages. Such an operation can be quite dangerous if, for example, the application gives a bogus pointer, which would potentially result in kernel panics (or security vulnerabilities) if it is not checked properly. To circumvent this problem, it is necessary to write architecture-specific routines that use some assembly magic to register the addresses of all of the instructions performing the actual accesses to the user-space memory in an exception table. As explained in this LWN article from 2001, “if ever a fault happens in kernel mode, the fault handler scans through the exception table trying to match the address of the faulting instruction with a table entry. If a match is found, a special error exit is taken, the copy operation fails gracefully, and the system call returns a segmentation fault error.”
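The lookup performed by the fault handler can be sketched as a binary search over a sorted table of (faulting instruction, fixup) address pairs. In a real port the table is generated at build time from annotations in the user-access routines; here it is written by hand, and the addresses and toy_ names are purely illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* One entry per instruction that is allowed to fault on a user access. */
struct toy_extable_entry {
    unsigned long insn;     /* address of the instruction that may fault */
    unsigned long fixup;    /* address of the matching error exit */
};

/* Hand-written table, sorted by insn so that it can be binary-searched. */
static const struct toy_extable_entry toy_extable[] = {
    { 0x1000UL, 0x9000UL },
    { 0x1040UL, 0x9010UL },
    { 0x2000UL, 0x9020UL },
};

static int toy_cmp_insn(const void *key, const void *elem)
{
    unsigned long addr = *(const unsigned long *)key;
    const struct toy_extable_entry *e = elem;

    return (addr > e->insn) - (addr < e->insn);
}

/* Find the entry for a faulting instruction, or NULL if there is none. */
const struct toy_extable_entry *toy_search_extable(unsigned long addr)
{
    return bsearch(&addr, toy_extable,
                   sizeof(toy_extable) / sizeof(toy_extable[0]),
                   sizeof(toy_extable[0]), toy_cmp_insn);
}

/*
 * Called by the fault handler for a fault taken in kernel mode.
 * Returns 1 and redirects *pc to the error exit when the faulting
 * instruction is registered; returns 0 otherwise, in which case the
 * fault is a genuine kernel bug.
 */
int toy_fixup_exception(unsigned long *pc)
{
    const struct toy_extable_entry *e = toy_search_extable(*pc);

    if (e == NULL)
        return 0;
    *pc = e->fixup;
    return 1;
}
```

When the fixup is taken, execution resumes at the error exit of the copy routine, which reports the failure to its caller rather than bringing the kernel down.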
Once a full-featured init process is able to run and give access to a shell, it probably signals the end of the porting process. But it is most likely only the beginning of the adventure, as the port now needs to be maintained (as the internal APIs sometimes change quickly), and can also be enhanced in numerous ways: adding support for multiprocessor and NUMA systems, implementing more device drivers, etc.
By describing the long journey of porting Linux to a new processor architecture, I hope that this series of articles will contribute to remedying the lack of documentation in this area and will help the next brave programmer who one day embarks upon this challenging, but ultimately rewarding, experience.
[The author would like to thank Ena Lupine for her help in writing and publishing these articles.]