
RISC-V Linux Kernel and Ecosystem Technology Updates, Issue 79

Written by 呀呀呀 on 2024/02/25

Date: 20240221
Editor: 晓怡
Repository: RISC-V Linux kernel technology survey project
Sponsor: PLCT Lab, ISCAS

Kernel Updates

Process Scheduling

v1: sched/clock: Make local_clock() notrace

The “perf” clock in /sys/kernel/tracing/trace_clock enables local_clock(), which on machines that have CONFIG_HAVE_UNSTABLE_SCHED_CLOCK set is a normal function. This function can be traced.

v2: sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom

We are providing headroom for the utilization to grow until the next decision point to pick the next frequency. Give the function a better name and give it some documentation. It is not really mapping anything.

v1: sched/core: introduce CPUTIME_FORCEIDLE_TASK

As core sched uses rq_clock() as the clock source to account forceidle time, irq time will be accounted into forceidle time. However, in some scenarios, the forceidle sum will be much larger than exec runtime.

v1: net: sched: Annotate struct tc_pedit with __counted_by

Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions).
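As a hedged illustration of the annotation pattern (the struct and field names below are made up, not taken from the patch):

```c
#include <linux/types.h>

/* Hypothetical struct: entries[] is a flexible array whose element count
 * lives in nr_entries, so CONFIG_UBSAN_BOUNDS and fortified str/mem
 * functions can bounds-check accesses to it at run time.
 */
struct example_report {
	u16 type;
	u16 nr_entries;
	u32 entries[] __counted_by(nr_entries);
};
```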

Security Enhancements

v1: refcount: Annotate intentional signed integer wrap-around

Mark the various refcount_t functions with _signed_wrap, as we depend on the wrapping behavior to detect the overflow and perform saturation. Silences warnings seen with the LKDTM REFCOUNT* tests:

v1: arm64: syscall: Direct PRNG kstack randomization

The existing arm64 stack randomization uses the kernel rng to acquire 5 bits of address space randomization. This is problematic because it creates non-determinism in the syscall path when the rng needs to be generated or reseeded. This shows up as large tail latencies in some benchmarks and directly affects the minimum RT latencies as seen by cyclictest.

v4: bpf: Replace bpf_lpm_trie_key 0-length array with flexible array

Adjust the kernel code to use struct bpf_lpm_trie_key_u8 throughout, and the selftest to use struct bpf_lpm_trie_key_hdr. Add a comment to the UAPI header directing folks to the two new options.

v1: leaking_addresses: Provide mechanism to scan binary files

Introduce a --kallsyms argument for scanning binary files for known symbol addresses. This would have found the exposure in /sys/kernel/notes:

v1: PM: hibernate: Don’t ignore return from set_memory_ro()

set_memory_ro() and set_memory_rw() can fail, leaving memory unprotected.

Take the returned value into account and abort in case of failure.
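A minimal sketch of that pattern (the hibernation call sites differ; this caller is hypothetical):

```c
#include <linux/set_memory.h>

/* Abort instead of continuing with memory that may still be writable. */
static int protect_pages(unsigned long addr, int numpages)
{
	int ret;

	ret = set_memory_ro(addr, numpages);
	if (ret)
		return ret;

	return 0;
}
```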

v4: nbd: null check for nla_nest_start

nla_nest_start() may fail and return NULL. Insert a check and set errno based on other call sites within the same source code.
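A hedged sketch of the fix pattern, using an illustrative attribute type and the -EMSGSIZE errno typical of neighbouring netlink code:

```c
#include <net/netlink.h>

#define EXAMPLE_ATTR_NESTED 1	/* illustrative attribute type, not nbd's */

/* Hypothetical fill function: bail out if nesting fails instead of
 * dereferencing a NULL nest pointer later.
 */
static int fill_status(struct sk_buff *skb)
{
	struct nlattr *nest;

	nest = nla_nest_start(skb, EXAMPLE_ATTR_NESTED);
	if (!nest)
		return -EMSGSIZE;

	/* ... add nested attributes here ... */

	nla_nest_end(skb, nest);
	return 0;
}
```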

v1: bpf-next: bpf: Check return from set_memory_rox() and friends

arch_protect_bpf_trampoline() and alloc_new_pack() call set_memory_rox() which can fail, leading to unprotected memory.

Take into account return from set_memory_XX() functions and add __must_check flag to arch_protect_bpf_trampoline().

v1: Adjust brk randomness

It was recently pointed out[1] that x86_64 brk entropy was not great, and that on all architectures the brk can (when the random offset is 0) be immediately adjacent to .bss, leaving no gap that could stop linear overflows from the .bss. Address both issues.

v3: fortify: Add KUnit tests for runtime overflows

This series is the rest of the v2 series that was half landed last year, and finally introduces KUnit runtime testing of the CONFIG_FORTIFY_SOURCE APIs. Additionally FORTIFY failure messages are improved to give more context about read/write and sizes.

v1: enic: Avoid false positive under FORTIFY_SOURCE

FORTIFY_SOURCE has been ignoring 0-sized destinations while the kernel code base has been converted to flexible arrays. In order to enforce the 0-sized destinations (e.g. with __counted_by), the remaining 0-sized destinations need to be handled. Unfortunately, struct vic_provinfo resists full conversion, as it contains a flexible array of flexible arrays, which is only possible with the 0-sized fake flexible array.

v2: sock: Use unsafe_memcpy() for sock_copy()

While testing for places where zero-sized destinations were still showing up in the kernel, sock_copy() and inet_reqsk_clone() were found, which are using very specific memcpy() offsets for both avoiding a portion of struct sock, and copying beyond the end of it (since struct sock is really just a common header before the protocol-specific allocation). Instead of trying to unravel this historical lack of container_of(), just switch to unsafe_memcpy(), since that’s effectively what was happening already (memcpy() wasn’t checking 0-sized destinations while the code base was being converted away from fake flexible arrays).

v1: fortify: Include more details when reporting overflows

When a memcpy() would exceed the length of an entire structure, no detailed WARN would be emitted, making debugging a bit more challenging. Similarly, other buffer overflow reports would have no size information reported.

v2: module: Don’t ignore errors from set_memory_XX()

set_memory_ro(), set_memory_nx(), set_memory_x() and other helpers can fail and return an error. In that case the memory might not be protected as expected and the module loading has to be aborted to avoid security issues.

v2: cocci: Add rules to find str_plural() replacements

Add rules for finding places where str_plural() can be used. This currently finds:

v1: lib/string_choices: Add str_plural() helper

Add str_plural() helper to replace existing open implementations used by many drivers and help improve future user facing messages.
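For illustration (the caller below is hypothetical), the helper replaces open-coded singular/plural selections:

```c
#include <linux/printk.h>
#include <linux/string_choices.h>

static void report_removed(int n)
{
	/* before: pr_info("removed %d page%s\n", n, n == 1 ? "" : "s"); */
	pr_info("removed %d page%s\n", n, str_plural(n));
}
```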

v9: Introduce mseal

This is the v9 version, which removes MAP_SEALABLE and PROT_SEAL from mmap(), adds a performance benchmark, and adds a test to demo the sealing of the read-only memory segment of an ELF mapping.
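A hedged userspace sketch of the proposed interface (the syscall number and final semantics were still under review in this series; the flags argument is assumed to be reserved and passed as 0):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Seal an existing read-only mapping so that later mprotect()/munmap()
 * attempts on it fail; addr/len must describe an already-mapped range.
 */
static int seal_ro_region(void *addr, size_t len)
{
	return syscall(__NR_mseal, addr, len, 0UL);
}
```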

v6: pwm: Improve lifetime tracking for pwm_chips

This is v6 of the series introducing better lifetime tracking for pwmchips, which addresses (for now theoretical) lifetime issues of pwm chips. Addressing these is a necessary precondition to introduce chardev support for PWMs.

v4: Tegra30: add support for LG tegra based phones

Bring up Tegra 3 based LG phones Optimus 4X HD and Optimus Vu based on LG X3 board.

Async IO

v4: block atomic writes

This series introduces a proposal for implementing atomic writes in the kernel for torn-write protection.

This series takes the approach of adding a new “atomic” flag to each of pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively. When set, these indicate that we want the write issued “atomically”.

Only direct IO on block devices is supported here. For this, atomic write HW support is required, like SCSI ATOMIC WRITE (16).
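A hedged userspace sketch of the proposed flag (RWF_ATOMIC was not in released uapi headers when this series was posted; the fd is assumed to be opened with O_DIRECT on a block device with atomic-write support):

```c
#define _GNU_SOURCE
#include <sys/uio.h>

/* Issue one write that must complete fully or not at all (torn-write
 * protection), using the flag proposed by this series.
 */
static ssize_t write_atomically(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}
```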

v9: io_uring: Statistics of the true utilization of sq threads.

Count the running time and actual IO processing time of the sqpoll thread, and output the statistical data to fdinfo.

Variable description: “work_time” in the code represents the sum of jiffies during which the sq thread is actually processing IO, that is, how many milliseconds it actually takes to process IO. “total_time” represents the total time elapsed from the beginning of the sq thread’s loop to the current time point, that is, how many milliseconds it has spent in total.

v1: io_uring/napi: enable even with a timeout of 0

1 usec is not as short as it used to be, and it makes sense to allow 0 for a busy poll timeout - this means just do one loop to check if we have anything available. Add a separate ->napi_enabled to check if napi has been enabled or not.

Rust For Linux

v1: Arc methods for linked list

This patchset contains two useful methods for the Arc type. They will be used in my Rust linked list implementation, which Rust Binder uses. See the Rust Binder RFC [1] for more information. Both these commits and the linked list that uses them are present in the branch referenced by the RFC.

[net-next RFC PATCH 0/3] net: phy: detach PHY driver OPs from phy_driver struct

Posting as RFC due to the massive change to a fundamental struct.

While adding some PHY IDs for Aquantia, I noticed that there is a big problem with duplicating OPs for each PHY.

v1: rust: Add container_of and offset_of macros

Add Rust counterparts to these C macros. container_of is useful for C struct subtyping, to recover the original pointer to the container structure. offset_of is useful for struct-relative addressing.
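For reference, the C macro being mirrored works like this (a simplified form of the kernel's definition, without its type checks):

```c
#include <stddef.h>

/* Recover a pointer to the containing struct from a pointer to one of
 * its members, by subtracting the member's offset within the struct.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
```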

v1: rust: upgrade to Rust 1.77.0

This is the next upgrade to the Rust toolchain, from 1.76.0 to 1.77.0 (i.e. the latest) [1].

See the upgrade policy [2] and the comments on the first upgrade in commit 3ed03f4da06e (“rust: upgrade to Rust 1.68.2”).

v1: kbuild: rust: use -Zdebuginfo-compression

Rust 1.74.0 introduced (unstable) support for the -Zdebuginfo-compression flag, thus use it.

v1: kbuild: rust: use -Zdwarf-version to support DWARFv5

Rust 1.64.0 introduced (unstable) support for the -Zdwarf-version flag, which allows selecting DWARFv5, thus use it.

v1: rust: bindings: Order headers alphabetically

As the comment on top of the file suggests, sort the headers alphabetically.

No functional changes.

v2: rust: stop using ptr_metadata feature

The byte_sub method was stabilized in Rust 1.75.0. By using that method, we no longer need the unstable ptr_metadata feature for implementing Arc::from_raw.

v3: rust: str: add {make,to}_{upper,lower}case() to CString

Add functions to convert a CString to upper- / lowercase, either in-place or by creating a copy of the original CString.

Naming follows the one from the Rust stdlib, where functions starting with ‘to’ create a copy and functions starting with ‘make’ perform an in-place conversion.

BPF

v2: bpf-next: selftests/bpf: Move test_dev_cgroup to prog_tests

Move test_dev_cgroup.c to prog_tests/dev_cgroup.c to be able to run it with test_progs. Replace dev_cgroup.bpf.o with the skel header file, dev_cgroup.skel.h, and load the program from it accordingly.

v4: bpf-next: Check cfi_stubs before registering a struct_ops type.

Recently, cfi_stubs were introduced. However, existing struct_ops types that are not in the upstream may not be aware of this, resulting in kernel crashes. By rejecting struct_ops types that do not provide cfi_stubs properly during registration, these crashes can be avoided.

v1: bpf: libbpf: clarify batch lookup semantics

The batch lookup APIs copy key memory into out_batch, which is then supplied in later calls to in_batch. Thus both parameters need to point to memory large enough to hold a single key (other than an initial NULL in_batch). For many maps, keys are pointer sized or less, but for larger maps, it’s important to point to a larger block of memory to avoid memory corruption.
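A hedged libbpf sketch of that requirement (buffer sizes and the batch size are illustrative; the point is that in_batch/out_batch must each hold a full key):

```c
#include <bpf/bpf.h>

#define BATCH	128
#define KEY_SZ	64	/* illustrative: must be >= the map's key size   */
#define VAL_SZ	64	/* illustrative: must be >= the map's value size */

static void dump_map(int map_fd)
{
	char in_batch[KEY_SZ], out_batch[KEY_SZ];	/* full-key cursors */
	char keys[BATCH * KEY_SZ], values[BATCH * VAL_SZ];
	void *in = NULL;				/* NULL on the first call */
	__u32 count;
	int err;

	do {
		count = BATCH;
		err = bpf_map_lookup_batch(map_fd, in, out_batch,
					   keys, values, &count, NULL);
		/* ... consume `count` keys/values ... */
		in = out_batch;		/* resume from the returned cursor */
	} while (!err);			/* -ENOENT marks the end of the map */
}
```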

v3: bpf-next: Create shadow types for struct_ops maps in skeletons

This patchset allows skeleton users to change the values of the fields in struct_ops maps at runtime. It will create a shadow type pointer in a skeleton for each struct_ops map, allowing users to access the values of fields through these pointers. For instance, if there is an integer field named “FOO” in a struct_ops map called “testmap”, you can access the value of “FOO” in this way.
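A hedged sketch of that access pattern, assuming a hypothetical BPF object named example.bpf.c (so the generated skeleton type is struct example_bpf) that contains the cover letter's testmap/FOO example:

```c
#include "example.skel.h"	/* hypothetical generated skeleton header */

static int set_foo(void)
{
	struct example_bpf *skel;

	skel = example_bpf__open();
	if (!skel)
		return -1;

	/* Shadow pointer generated for the struct_ops map "testmap":
	 * tweak the field before the map is loaded into the kernel.
	 */
	skel->struct_ops.testmap->FOO = 13;

	return example_bpf__load(skel);
}
```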

v1: bpf-next: bpf: Shrink size of struct bpf_map/bpf_array.

Back in 2018, commit be95a845cc44 (“bpf: avoid false sharing of map refcount with max_entries”) made sure that refcnt doesn’t share a cache line with max_entries, which is used to bounds-check map access. That was done to make spectre-style attacks harder. The main mitigation is done via code similar to array_index_nospec(), of course; this was an additional precaution. It increased the size of “struct bpf_map” a little, but its effect on all other maps (like array) is significant, since “struct bpf_map” is typically the first member in other map types.

v2: net-next: Change BPF_TEST_RUN to use the system page pool for live XDP frames

Now that we have a system-wide page pool, we can use that for the live frame mode of BPF_TEST_RUN (used by the XDP traffic generator), and avoid the cost of creating a separate page pool instance for each syscall invocation. See the individual patches for more details.

v1: arm64: mm: support dynamic vmalloc/pmd configuration

Reworks ARM’s virtual memory allocation infrastructure to support dynamic enforcement of page middle directory PXNTable restrictions rather than only during the initial memory mapping. Runtime enforcement of this bit prevents write-then-execute attacks, where malicious code is staged in vmalloc’d data regions, and later the page table is changed to make this code executable.

v1: bpf-next: mm: Introduce vm_area_[un]map_pages().

vmap() API is used to map a set of pages into contiguous kernel virtual space.

BPF would like to extend the vmap API to implement a lazily-populated contiguous kernel virtual space whose size and start address are fixed early.

The vmap API has functions to request and release areas of kernel address space: get_vm_area() and free_vm_area().
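For context, a minimal sketch of the existing reservation API named above (the size is illustrative); the proposed vm_area_[un]map_pages() would then populate and depopulate such an area lazily:

```c
#include <linux/vmalloc.h>
#include <linux/sizes.h>

/* Reserve 4 MiB of kernel virtual address space without backing pages;
 * mapping would happen later, page by page, as needed. Release the
 * reservation with free_vm_area() when done.
 */
static struct vm_struct *reserve_area(void)
{
	return get_vm_area(SZ_4M, VM_MAP);
}
```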

v1: bpf-next: bpf: probe-read bpf_d_path() and add new acquire/release BPF kfuncs

On a number of occasions [0, 1, 2], usage of the pre-existing BPF helper bpf_d_path() under certain circumstances has led to memory corruption issues.

v1: bpf-next: bpf: make tracing program support multi-attach

For now, a BPF program of type BPF_PROG_TYPE_TRACING is not allowed to be attached to multiple hooks, and we have to create a BPF program for each kernel function that we want to trace, even though all the programs have the same (or similar) logic. This can consume extra memory and make program loading slow if we have plenty of kernel functions to trace.

v1: bpf-next: bpf: Add a generic bits iterator

Introducing three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, to support the newly added bits iter functionality. These functions enable seamless iteration of bits from a specified memory area.

v1: bpf-next: Allow struct_ops maps with a large number of programs

The BPF struct_ops previously only allowed for one page to be used for the trampolines of all links in a map. However, we have recently run out of space due to the large number of BPF program links. By allocating additional pages when we exhaust an existing page, we can accommodate more links in a single map.

v2: bpf-next: check bpf_func_state->callback_depth when pruning states

See discussion [0]. The details of the fix are in patch #2. A change to the test case test_tcp_custom_syncookie.c is necessary, otherwise the updated verifier won’t be able to process it due to the instruction complexity limit. This change is done in patch #1.

v1: selftests/bpf: Move test_dev_cgroup to prog_tests

Move test_dev_cgroup to prog_tests to be able to run it with test_progs. Replace dev_cgroup.bpf.o with the skel header file, dev_cgroup.skel.h, and load the program from it accordingly.

v5: net-next: Enable SGMII and 2500BASEX interface mode switching for Intel platforms

During an interface mode change, the ‘phylink_major_config’ function will be triggered in phylink. The following functions are modified to support switching between the SGMII and 2500BASEX interface modes on Intel platforms.

v1: bpf-next: bpf: Check cfi_stubs before registering a struct_ops type.

Recently, cfi_stubs were introduced. However, existing struct_ops types that are not in the upstream may not be aware of this, resulting in kernel crashes. By rejecting struct_ops types that do not provide cfi_stubs during registration, these crashes can be avoided.

v1: bpf-next: bpf: improve duplicate source code line detection

Verifier log avoids printing the same source code line multiple times when a consecutive block of BPF assembly instructions are covered by the same original (C) source code line. This greatly improves verifier log legibility.

[v5: Combine perf and bpf for fast eval of hw breakpoint conditions](http://lore.kernel.org/bpf/20240214173950.18570-1-khuey@kylehuey.com/)

Currently, rr uses software breakpoints that trap (via ptrace) to the supervisor, and evaluates the condition from the supervisor. If the asynchronous event is delivered in a tight loop (thus requiring the breakpoint condition to be repeatedly evaluated) the overhead can be immense. A patch to rr that uses hardware breakpoints via perf events with an attached BPF program to reject breakpoint hits where the condition is not satisfied reduces rr’s replay overhead by 94% on a pathological (but a real customer-provided, not contrived) rr trace.

v2: bpf-next: allow HID-BPF to do device IOs

[Still a RFC: there are a lot of FIXMEs in the code, and calling the sleepable timer cb actually crashes.] [Also using bpf-next as the base tree as there will be conflicting changes otherwise]

This is crashing, and I have a few questions in the code (look for all of the FIXMEs), so sending this now before I become insane :)

v3: net-next: dma: skip calling no-op sync ops when possible

The series grew from Eric’s idea and patch at [0]. The idea of using the shortcut for direct DMA as well belongs to Chris.

When an architecture doesn’t need DMA synchronization and the buffer is not an SWIOTLB buffer, most of the time the kernel and the drivers end up calling DMA sync operations for nothing. Even when DMA is direct, this involves a good non-inline call ladder and eats a bunch of CPU time. With IOMMU, this results in calling indirect calls on the hotpath just to check what is already known and return. XSk has been using a custom shortcut for that for quite some time. I recently wanted to introduce a similar one for Page Pool. Let’s combine all this into one generic shortcut, which would cover all DMA sync ops and all types of DMA (direct, IOMMU, …).

v3: bpf-next: libbpf: make remark about zero-initializing bpf_*_info structs

In some situations, if you fail to zero-initialize the bpf_{prog,map,btf,link}_info structs supplied to the set of LIBBPF helpers bpf_{prog,map,btf,link}_get_info_by_fd(), you can expect the helper to return an error. This can possibly leave people in a situation where they’re scratching their heads for an unnecessary amount of time. Make an explicit remark about the requirement of zero-initializing the supplied bpf_{prog,map,btf,link}_info structs for the respective LIBBPF helpers.

v2: bpf-next: libbpf: make remark about zero-initializing bpf_*_info structs

In some situations, if you fail to zero-initialize the bpf_{prog,map,btf,link}_info structs supplied to the set of LIBBPF helpers bpf_{prog,map,btf,link}_get_info_by_fd(), you can expect the helper to return an error. This can possibly leave people in a situation where they’re scratching their heads for an unnecessary amount of time. Make an explicit remark about the requirement of zero-initializing the supplied bpf_{prog,map,btf,link}_info structs for the respective LIBBPF helpers.

v2: bpf-next: Create shadow variables for struct_ops in skeletons

This RFC is for gathering feedback/opinions on the design. Based on the feedback received for v1, I made some modifications.

v1: bpf-next: bpf: use O(log(N)) binary search to find line info record

Real-world BPF applications keep growing in size. Medium-sized production application can easily have 50K+ verified instructions, and its line info section in .BTF.ext has more than 3K entries.

v1: Corrected GPL license name

The bpf_doc script refers to the GPL as the “GNU Privacy License”. I strongly suspect that the author wanted to refer to the GNU General Public License, under which the Linux kernel is released, as, to the best of my knowledge, there is no license named “GNU Privacy License”.

v2: bpf-next: libbpf: add support to GCC in CORE macro definitions

Due to internal differences between LLVM and GCC the current implementation for the CO-RE macros does not fit GCC parser, as it will optimize those expressions even before those would be accessible by the BPF backend.

v1: bpf-next: libbpf: make remark about zero-initializing bpf_*_info structs

In some situations, if you fail to zero-initialize the bpf_*_info buffer supplied to the set of LIBBPF helpers bpf_{prog,map,btf,link}_get_info_by_fd(), you can expect the helper to return an error. This can possibly leave people in a situation where they’re scratching their heads for an unnecessary amount of time. Make an explicit remark about the requirement of zero-initializing the supplied bpf_*_info buffers for the respective LIBBPF helpers to prevent exactly this situation.
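A minimal sketch of the documented requirement (prog_fd is assumed to be a valid BPF program fd, and bpf_prog_get_info_by_fd() is the newer libbpf name for this helper; the map/btf/link variants follow the same pattern):

```c
#include <string.h>
#include <bpf/bpf.h>

static int query_prog(int prog_fd)
{
	struct bpf_prog_info info;
	__u32 len = sizeof(info);

	memset(&info, 0, sizeof(info));	/* without this, the helper may return an error */

	return bpf_prog_get_info_by_fd(prog_fd, &info, &len);
}
```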

v1: net-next: Use per-task storage for XDP-redirects on PREEMPT_RT

In [0] I introduced explicit locking for resources which are otherwise implicitly locked by local_bh_disable(), and this protection goes away if the lock in local_bh_disable() is removed on PREEMPT_RT.

v1: bpf-next: ARC: Add eBPF JIT support

This will add eBPF JIT support to the 32-bit ARCv2 processors. The implementation is qualified by running the BPF tests on a Synopsys HSDK board with “ARC HS38 v2.1c at 500 MHz” as the 4-core CPU.

Ecosystem Technology Updates

Qemu

v5: riscv: set vstart_eq_zero on mark_vs_dirty

In this new version we removed the remaining brconds() from trans_rvbf16.c.inc like Richard suggested in patch 3. Richard, I kept your ack in that patch.

v1: target/riscv: Add missing include guard in pmu.h

Add missing include guard in pmu.h to avoid the problem of double inclusion.

v1: RISC-V: Modularize common match conditions for trigger

According to the RISC-V Debug specification, the enabled privilege levels of a trigger are common match conditions for all trigger types. This series modularizes the code for checking the privilege levels of type 2/3/6 triggers by implementing the functions trigger_common_match() and trigger_priv_match().

v1: RISC-V: Implement CSR tcontrol in debug spec

The RISC-V Debug specification defines the CSR “tcontrol” in the trigger module: https://github.com/riscv/riscv-debug-spec

v4: riscv: named features riscv,isa, ‘svade’ rework

This new version is rebased with alistair/riscv-to-apply.next and with more acks added.

support for having both 32 and 64 bit RISC-V CPUs in one QEMU machine

I have a situation where I need to use a third-party 32-bit RISC-V CPU while the rest are all 64-bit RISC-V CPUs. I have seen that some steps have already been made in the direction of enabling such a configuration (https://riscv.org/blog/2023/01/run-32-bit-applications-on-64-bit-linux-kernel-liu-zhiwei-guo-ren-t-head-division-of-alibaba-cloud/), and I am wondering if someone can shed more light on it.

v1: Improve the performance of RISC-V vector unit-stride ld/st instructions

When glibc is built with RVV support [1], the memcpy benchmark runs 2x to 60x slower than the scalar equivalent on QEMU, which hurts developer productivity.

From the performance analysis result, we can observe that the glibc memcpy spends most of the time in the vector unit-stride load/store helper functions.

v1: target: riscv: Add Svvptc extension support

The Svvptc extension describes a uarch that does not cache invalid TLB entries: that’s the case for qemu so there is nothing particular to implement other than the introduction of this extension, which is done here.

U-Boot

v1: riscv: mbv: Enhance MB-V support with also enabling SPL

Enhance MB-V support with an SPL configuration to support OpenSBI. All of these changes are outside of generic RISC-V support, which is why I'm happy to take them via my tree. Please let me know if you want this to go via the riscv subtree.


