RISC-V Linux Kernel and Related Technology News, Issue 95

Written by 呀呀呀 on 2024/06/11

Date: 20240609
Editor: 晓瑜
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support

v2: riscv: Improve exception and system call latency

Many CPUs implement return address branch prediction as a stack. The RISC-V architecture refers to this as a return address stack (RAS).

v3: vmalloc: Modify the alloc_vmap_area() error message for better diagnostics

This message is misleading because ‘vmalloc=’ is supported on arm32 and x86 platforms and is not a valid kernel parameter on a number of other platforms (in particular it’s not supported on arm64, alpha, loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc, parisc, m68k, powerpc, riscv, sh, um, xtensa, s390, sparc). With the update, the output is modified to include the function parameters along with the start and end of the virtual memory range allowed.

v16: riscv: sophgo: add clock support for sg2042

This series adds clock controller support for sophgo sg2042.

v1: riscv: Per-thread envcfg CSR support

This series (or equivalent) is a prerequisite for both user-mode pointer masking and CFI support, as those per-thread features are controlled by fields in the envcfg CSR.

v7: Linux RISC-V IOMMU Support

This patch series introduces support for RISC-V IOMMU architected hardware into the Linux kernel.

v5: Add Svade and Svadu Extensions Support

Svade and Svadu extensions represent two schemes for managing the PTE A/D bits. When the PTE A/D bits need to be set, the Svade extension indicates that a related page fault will be raised.

v4: riscv: Memory Hot(Un)Plug support

Memory Hot(Un)Plug support (and ZONE_DEVICE) for the RISC-V port

v0: RISCV: Report vector unaligned accesses hwprobe

Detect if a system traps into the kernel on a vector unaligned access. Add the result to a new key in hwprobe.

v2: riscv: sophgo: add thermal sensor support for cv180x/sg200x SoCs

This series implements a driver for the Sophgo cv180x/sg200x on-chip thermal sensor and adds thermal zones for CV1800B SoCs.

v6: Add support for a few Zc* extensions, Zcmop and Zimop

Add support for (yet again) more RVA23U64 missing extensions. Add support for Zimop, Zcmop, Zca, Zcf, Zcd and Zcb extensions ISA string parsing, hwprobe and kvm support. Zce, Zcmt and Zcmp extensions have been left out since they target microcontrollers/embedded CPUs and are not needed by RVA23U64.

v2: Add the core reset for UARTs of StarFive JH7110

The UART of StarFive JH7110 needs two reset signals (apb, core) to initialize. This patch series adds the missing core reset.

v2: riscv: stacktrace: Add USER_STACKTRACE support

Currently, userstacktrace is unsupported for riscv. So use the perf_callchain_user() code as a blueprint to implement arch_stack_walk_user(), which adds userstacktrace support on riscv. Meanwhile, we can use arch_stack_walk_user() to simplify the implementation of perf_callchain_user().

v6: RISC-V: ACPI: Add external interrupt controller support

This series adds support for the below ECR approved by ASWG. The series primarily enables irqchip drivers for RISC-V ACPI based platforms. The series can be broadly categorized like below.

LoongArch Architecture Support

v1: loongarch: Only select HAVE_OBJTOOL and allow ORC unwinder if the inline assembler supports R_LARCH_{32,64}_PCREL

GAS <= 2.41 does not support generating R_LARCH_{32,64}_PCREL for “label - .” and it generates R_LARCH_{ADD,SUB}{32,64} pairs instead. objtool cannot handle R_LARCH_{ADD,SUB}{32,64} pairs in __jump_table (static key implementation), etc.

v1: LoongArch: KVM: Discard dirty page tracking on readonly memslot

For a readonly memslot such as UEFI BIOS or UEFI var space, the guest cannot write to this memory space directly, so it is not necessary to track dirty pages for it. This patch adds that optimization to kvm_arch_commit_memory_region().

Process Scheduling

v1: sched: Initialize the vruntime of a new task when it is first enqueued

When creating a new task, we initialize its vruntime in sched_cgroup_fork(). However, this happens too early, so the value may not be accurate.

v1: sched/fair: Prevent cpu_busy_time from exceeding actual_cpu_capacity

Because effective_cpu_util() can return a util that may be bigger than the actual_cpu_capacity, this could cause errors in the pd_busy_time calculation.
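The capping idea in the title can be sketched in plain C. This is a minimal userspace illustration under stated assumptions, not the kernel change itself: `clamp_busy_time` and its parameter names are hypothetical, and the real patch works on the scheduler's own data structures.

```c
#include <assert.h>

/*
 * Hypothetical helper: cap a utilization value at the capacity the CPU
 * can actually deliver, so sums built from it never exceed that capacity.
 */
unsigned long clamp_busy_time(unsigned long util,
                              unsigned long actual_cpu_capacity)
{
    assert(actual_cpu_capacity > 0);
    return util < actual_cpu_capacity ? util : actual_cpu_capacity;
}
```

With a capacity of 1024, a reported util of 1200 would be counted as 1024, while a util of 300 passes through unchanged.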

Memory Management

v1: mm: sparse: clarify a variable name and its value

Setting ‘limit’ variable to 0 might seem like it means “no limit”. But in the memblock API, 0 actually means the ‘MEMBLOCK_ALLOC_ACCESSIBLE’ enum, which limits the physical address range based on ‘memblock.current_limit’. This can be confusing.

v2: mm: zswap: handle incorrect attempts to load of large folios

Zswap does not support storing or loading large folios. Until proper support is added, attempts to load large folios from zswap are a bug.

v2: mm: introduce pmd/pte_needs_soft_dirty_wp helpers and utilize them

This patchset introduces the pte_needs_soft_dirty_wp and pmd_needs_soft_dirty_wp helpers to determine if write protection is required for softdirty tracking.

v2: Introduce a store type enum for the Maple tree

This series implements two work items: “aligning mas_store_gfp() with mas_preallocate()” and “enum for store type”.

v7: enable bs > ps in XFS

This is the seventh version of the series that enables block size > page size (Large Block Size) in XFS, targeted for inclusion in 6.11.

v1: 6.6.y: mm: ratelimit stat flush from workingset shrinker

One of our workloads (Postgres 14 + sysbench OLTP) regressed on a newer upstream kernel. On further investigation, the cause seems to be the always-synchronous rstat flush in count_shadow_nodes() added by commit f82e6bf9bb9b (“mm: memcg: use rstat for non-hierarchical stats”).

v1: rust: alloc: add __GFP_HIGHMEM flag

Make it possible to allocate memory that doesn’t need to be mapped into the kernel’s address space. This flag is useful together with Page::alloc_page.

v1: mm: zswap: add VM_BUG_ON() if large folio swapin is attempted

With ongoing work to support large folio swapin, it is important to make sure we do not pass large folios to zswap_load() without implementing proper support.

v1: mm: zswap: limit number of zpools based on CPU and RAM

This patch limits the number of zpools used by zswap on smaller systems.

v2: mm/memblock: Add “reserve_mem” to reserved named memory at boot up

Reserve unspecified location of physical memory from kernel command line

v1: support large folio swap-out and swap-in for shmem

Shmem will support large folio allocation to get better performance. However, memory reclaim still splits the precious large folios when trying to swap out shmem, which may lead to memory fragmentation and cannot take advantage of large folios for shmem.

v1: mm: introduce pmd/pte_need_soft_dirty_wp helpers for softdirty write-protect

This patch introduces the pte_need_soft_dirty_wp and pmd_need_soft_dirty_wp helpers to determine if write protection is required for softdirty tracking. This can enhance code readability and improve its overall appearance.

v3: maple_tree: modified return type of mas_wr_store_entry()

Since the return value of mas_wr_store_entry() is not used, the return type can be changed to void.

v13: mm: report per-page metadata information

This patch adds 2 fields to /proc/vmstat that report per-page metadata information.

v1: mm/mm_init.c: don’t initialize page->lru again

After init_reserved_page(), we expect __init_single_page() to have done its work on the page, which already initializes page->lru properly.

v1: Enable P2PDMA in Userspace RDMA

This patch series enables P2PDMA memory to be used in userspace RDMA transfers.

v1: ML infrastructure in Linux kernel

Initiate a discussion about a unified infrastructure for ML workloads and user-space drivers.

v2: -next: mm/hugetlb_cgroup: rework on cftypes

This patchset provides an intuitive view of the control files through static templates of cftypes, improving the readability of the code.

**[v1: mm/mm_init.c: simplify logic of deferred_[init|free]_pages](http://lore.kernel.org/linux-mm/20240605010742.11667-1-richard.weiyang@gmail.com/)**

Function deferred_[init|free]_pages are only used in deferred_init_maxorder(), which makes sure the range to init/free is within MAX_ORDER_NR_PAGES size.

**[v3: ioctl()-based API to query VMAs from /proc/<pid>/maps](http://lore.kernel.org/linux-mm/20240605002459.4091285-1-andrii@kernel.org/)**

Implement a binary ioctl()-based interface to the /proc/<pid>/maps file to allow applications to query VMA information more efficiently than reading *all* VMAs nonselectively through the text-based interface of the /proc/<pid>/maps file.

File Systems

v1: fs: allow listmount() with reversed ordering

A few smaller cleanups included in this series.

[PATCHES]v1: rework of struct fd handling

Experimental series trying to sanitize the handling of struct fd. Lightly tested, in serious need of review.

v4: Improve readability of copy_tree

This involves renaming the opaque variables (e.g., p, q, r, s) to be more descriptive, aiming to make the code easier to understand.

v5: fs: Improve eventpoll logging to stop indicting timerfd

This change addresses this problem by changing the way eventpoll wakesources are named.

v1: vfs: add rcu-based find_inode variants for iget ops

Instantiating a new inode normally takes the global inode hash lock twice.

v2: Employ `copy mount tree from src to dst` concept in copy_tree

Variable names in copy_tree (e.g., p, q, r, s) are opaque; renaming them to be more descriptive would aim to make the code easier to understand.

v1: possible way to deal with dup2() vs. allocated but still not opened descriptors

It’s outside of POSIX scope and any userland code that might run into it is buggy. However, we need to make sure that nothing breaks kernel-side. We used to have interesting bugs in that area and so did *BSD kernels.

v1: fs_parse: add uid & gid option parsing helpers

Multiple filesystems take uid and gid as options, and the code to create the ID from an integer and validate it is standard boilerplate that can be moved into common parsing helper functions, so do that for consistency and less cut&paste.

[HACK PATCH] fs: dodge atomic in putname if ref == 1

The struct used to be refcounted with regular inc/dec ops, atomic usage showed up in commit 03adc61edad4 (“audit,io_uring: io_uring openat triggers audit reference count underflow”).

v1: NFSv4: set sb_flags to second superblock

Added sb_flags parameter to d_automount callback function and fs_context_for_submount(). NFSv4 uses this parameter to set the second superblock.

v2: printk: add threaded printing + the rest

This is v2 of a series to implement threaded console printing as well as some other minor pieces (such as proc and sysfs support). This series is only a subset of the original v1.

v1: iomap: keep on increasing i_size in iomap_write_end()

Commit 943bc0882ceb (“iomap: don’t increase i_size if it’s not a write operation”) breaks xfs with a realtime device on generic/561. The problem arises on an unaligned truncate down of an xfs realtime inode with rtextsize > 1 fs block: xfs only zeroes out the EOF block but doesn’t zero out the tail blocks aligned to rtextsize. So if we don’t increase i_size in iomap_write_end(), stale data could be exposed after an append write beyond the aligned EOF block.

v1: sys_ringbuffer

New syscall for mapping generic ringbuffers for arbitrary (supported) file descriptors.

v7: block atomic writes

This series introduces a proposal for implementing atomic writes in the kernel for torn-write protection.

v1: fs/ntfs3: dealing with situations where dir_search_u may return null

If hdr_find_e() fails to find an entry in the index buffer, dir_search_u() may return NULL.

v1: readdir: Add missing quote in macro comment

Add a missing double quote in the unsafe_copy_dirent_name() macro comment.

v1: blk: optimization for classic polling

This removes the dependency on interrupts to wake up the task. While polling for IO completion, set the task state to TASK_RUNNING if need_resched() returns true.

Networking

v1: can: treewide: decorate flexible array members with __counted_by()

A new __counted_by() attribute was introduced. It makes the compiler’s sanitizer aware of the actual size of a flexible array member, allowing for additional runtime checks.
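A minimal userspace sketch of the annotation, assuming a GCC or Clang recent enough to know the `counted_by` attribute and falling back to a no-op otherwise. The `COUNTED_BY` macro, `struct frame_list` and both helpers are hypothetical names for illustration and are not taken from the CAN drivers touched by the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fall back to a no-op when the compiler lacks the counted_by attribute. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

/* Hypothetical structure: 'len' records how many entries 'data' holds. */
struct frame_list {
    size_t len;
    unsigned int data[] COUNTED_BY(len);
};

struct frame_list *frame_list_alloc(size_t len)
{
    struct frame_list *fl;

    /* Guard the size computation against overflow before allocating. */
    if (len > (SIZE_MAX - sizeof(*fl)) / sizeof(fl->data[0]))
        return NULL;
    fl = calloc(1, sizeof(*fl) + len * sizeof(fl->data[0]));
    if (fl)
        fl->len = len; /* set the counter before touching data[] */
    return fl;
}

unsigned int frame_list_get(const struct frame_list *fl, size_t i)
{
    assert(i < fl->len); /* the explicit check the annotation lets tools automate */
    return fl->data[i];
}
```

The counter is assigned right after allocation and before any element access, so that bounds checks derived from the annotation see the correct size.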

v2: net-next: net: flow dissector: allow explicit passing of netns

Change since last version: fix kdoc comment warning reported by kbuild robot. No other changes, thus retaining RvB tags from Eric and Willem.

v4: bpf-next: bpf: Support dumping kfunc prototypes from BTF

This patchset enables both detecting as well as dumping compilable prototypes for kfuncs.

v2: net: bnxt_en: Cap the size of HWRM_PORT_PHY_QCFG forwarded response

Firmware interface 1.10.2.118 has increased the size of HWRM_PORT_PHY_QCFG response beyond the maximum size that can be forwarded. When the VF’s link state is not the default auto state, the PF will need to forward the response back to the VF to indicate the forced state. This regression may cause the VF to fail to initialize.

v6: af_packet: Handle outgoing VLAN packets without hardware offloading

The issue initially stems from libpcap. The ethertype will be overwritten as the VLAN TPID if the network interface lacks hardware VLAN offloading.

v1: net-next: net: dsa: generate port ifname if exists or invalid

In the case where a DSA port (via DTB label) had an interface name that collided with an existing netdev name, register_netdevice failed with -EEXIST, and the port was not usable.

v1: isdn: add missing MODULE_DESCRIPTION() macros

make allmodconfig && make W=1 C=1 reports missing MODULE_DESCRIPTION() macros. Add the missing invocations of the MODULE_DESCRIPTION() macro.

v1: net/sched: initialize noop_qdisc owner

When the noop_qdisc owner isn’t initialized, then it will be 0, so packets will erroneously be regarded as having been subject to recursion as long as only CPU 0 queues them.
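The failure mode can be illustrated with a small userspace sketch. Everything here is hypothetical (`struct toy_qdisc`, `looks_like_recursion`), and using -1 as the "no owner" value is an assumption of this sketch rather than something stated in the summary above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a queueing discipline with an owner CPU. */
struct toy_qdisc {
    int owner; /* CPU currently running the qdisc, or -1 for "nobody" */
};

/* A transmit on 'cpu' looks like recursion if that CPU already owns q. */
bool looks_like_recursion(const struct toy_qdisc *q, int cpu)
{
    assert(cpu >= 0);
    return q->owner == cpu;
}
```

A statically zero-initialized object has `owner == 0`, so every packet queued from CPU 0 matches the check. Initializing the owner to a value that is not a valid CPU number avoids the false positive.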

v3: net-next: Enable PTP timestamping/PPS for AM65x SR1.0 devices

This patch series enables support for PTP in AM65x SR1.0 devices.

v5: iwl-net: ice: Do not get coalesce settings while in reset

Getting coalesce settings while a reset is in progress can cause a NULL pointer dereference bug.

v4: can: m_can: don’t enable transceiver when probing

The m_can driver sets and clears the CCCR.INIT bit during probe (both when testing the NON-ISO bit, and when configuring the chip).

v4: iwl-next: ice: Add support for devlink local_forwarding param.

Add support for driver-specific devlink local_forwarding param. Supported values are “enabled”, “disabled” and “prioritized”. Default configuration is set to “enabled”.

v3: net-next: net: core: Unify dstats with tstats and lstats, implement generic dstats collection

The struct pcpu_dstats (“dstats”) has a few variations from the other two stats types (struct pcpu_sw_netstats and struct pcpu_lstats), and doesn’t have generic helpers for collecting the per-cpu stats into a struct rtnl_link_stats64.

v5: Series to deliver Ethernet for STM32MP13

Rework dwmac glue to simplify management for next stm32 (integrate RFC from Marek)

v20: net-next: Add Realtek automotive PCIe driver

This series includes adding realtek automotive ethernet driver and adding rtase ethernet driver entry in MAINTAINERS file.

v6: net-next: net: ethernet: mtk_eth_soc: ppe: add support for multiple PPEs

Add the missing pieces to allow multiple PPE units, one for each GMAC. mtk_gdm_config has been modified to work on the targeted MAC ID, and the inner loop has been moved outside of the function to allow unrelated operations like setting the MAC’s PPE index.

v1: CDC-NCM: add support for Apple’s private interface

This private interface lacks a status endpoint, presumably because there isn’t a physical cable that can be unplugged, nor any speed changes to be notified about.

v2: net-next: net: pse-pd: Add new PSE c33 features

This patch series adds new c33 features to the PSE API.

v13: net-next: Introduce PHY listing and link_topology tracking

This is V13 for the link topology addition, allowing to track all PHYs that are linked to netdevices.

v1: net: bnxt_en: Adjust logging of firmware messages in case of released token in __hwrm_send()

If the token is released because token->state == BNXT_HWRM_DEFERRED, the released token (set to NULL) is used in log messages. This issue is expected to be prevented by the HWRM_ERR_CODE_PF_UNAVAILABLE error code.

v5: net-next: locking: Introduce nested-BH locking.

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within a local_bh_disable() section remains preemptible. As a result, high-prio tasks (or threaded interrupts) will be blocked by long-running lower-prio tasks (or threaded interrupts), which includes softirq sections.

v2: net: gve: ignore nonrelevant GSO type bits when processing TSO headers

TSO currently fails when the skb’s gso_type field has more than one bit set.

v3: ipsec-next: Add IP-TFS mode to xfrm

This patchset adds a new xfrm mode implementing on-demand IP-TFS. IP-TFS (AggFrag encapsulation) has been standardized in RFC9347.

Security Hardening

v4: batman-adv: Add flex array to struct batadv_tvlv_tt_data

The “struct batadv_tvlv_tt_data” uses a dynamically sized set of trailing elements. Specifically, it uses an array of structures of type “batadv_tvlv_tt_vlan_data”. So, use the preferred way in the kernel of declaring a flexible array.

v1: mm/pstore: Reserve named unspecified memory across boots

Reserve unspecified location of physical memory from kernel command line

v4: Hardening perf subsystem

This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows.
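Why an open-coded multiplication in an allocation size is risky can be shown with a userspace analogue. This is a hedged sketch, not the perf code: `alloc_array_checked` is a hypothetical helper, and the kernel has its own helpers for this purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate an array of n elements of 'size' bytes,
 * returning NULL instead of wrapping around when n * size overflows.
 */
void *alloc_array_checked(size_t n, size_t size)
{
    assert(size > 0); /* element size must be non-zero */

    /* n * size would exceed SIZE_MAX: refuse rather than under-allocate. */
    if (n > SIZE_MAX / size)
        return NULL;
    return malloc(n * size);
}
```

Without the check, `malloc(n * size)` could silently wrap to a small value and return a buffer far shorter than the caller expects.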

Asynchronous IO

v1: Wait on cancelations at release time

The idea is to ensure that we’ve done any fputs that we need to when a task using a ring exits, so that we don’t leave references that will get put “shortly afterwards”.

v1: io_uring: check for non-NULL file pointer in io_file_can_poll()

In earlier kernels, it was possible to trigger a NULL pointer dereference off the forced async preparation path if no file had been assigned.

Rust For Linux

v2: Rust bindings for cpufreq and OPP core + sample driver

This RFC adds initial Rust bindings for two subsystems, cpufreq and operating performance points (OPP). The bindings are provided for most of the interface these subsystems expose.

v1: Tracepoints and static branch/call in Rust

An important part of a production ready Linux kernel driver is tracepoints. So to write production ready Linux kernel drivers in Rust, we must be able to call tracepoints from Rust code. This patch series adds support for calling tracepoints declared in C from Rust.

v1: arch: um: rust: Add i386 support for Rust

At present, Rust in the kernel only supports 64-bit x86, so UML has followed suit.

v5: Rust block device driver API and null block driver

This revision includes a check to validate the block size in the abstractions rather than in the driver. Also, the `GenDisk` type state was changed to a builder pattern.

v3: net::phy add unified API for C22 and C45

Add a unified API for C22 and C45, covering reading/writing registers and genphy_read_status().

BPF

v3: bpf: Using binary search to improve the performance of btf_find_by_name_kind

Currently, we are only using the linear search method to find the type id by the name, which has a time complexity of O(n). This change involves sorting the names of btf types in ascending order and using binary search, which has a time complexity of O(log(n)).
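The lookup strategy described above can be sketched in userspace C. This is a simplified illustration under stated assumptions: `find_by_name` is a hypothetical function that searches a plain array of names already sorted in ascending `strcmp()` order, and it ignores the kind matching and the real BTF data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the index of 'name' in 'sorted' (ascending strcmp() order),
 * or -1 if it is absent. Each step halves the remaining range.
 */
long find_by_name(const char *const *sorted, size_t n, const char *name)
{
    size_t lo = 0, hi = n; /* search window is [lo, hi) */

    assert(sorted != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(sorted[mid], name);

        if (cmp == 0)
            return (long)mid;
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

The one-time cost of sorting is repaid when many lookups are performed, since each lookup then needs only about log2(n) string comparisons instead of up to n.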

v1: bpf: don’t call mmap_read_trylock() from IRQ context

syzbot is reporting that the same local lock is held when trying to hold mmap sem from both IRQ enabled context and IRQ context.

v1: bpf-next: bpf: Track delta between “linked” registers.

The “undo” pass was introduced in LLVM https://reviews.llvm.org/D121937 to prevent this optimization, but it cannot cover all cases.

v1: ftrace: Skip __fentry__ location of overridden weak functions

The case is that, based on current compiler behavior

v3: bpftool: Query only cgroup-related attach types

From strace and kernel tracing, I found netkit returned ENXIO and this command failed. I think this AttachType(BPF_NETKIT_PRIMARY) is not relevant to cgroup.

v11: net-next: Device Memory TCP

GIT PULL: Networking for v6.10-rc3

Including fixes from BPF and big collection of fixes for WiFi core and drivers.

v2: bpf-next: Regular expression support for test output matching

This is v2 on the regular expression for test output matching patches.

v1: bpf-next: libbpf: auto-attach skeletons struct_ops

Similarly to `bpf_program`, support `bpf_map` automatic attachment in `bpf_object__attach_skeleton`. Currently only struct_ops maps could be attached.

v2: bpf-next: libbpf: BTF field iterator

Add BTF field (type and string fields, right now) iterator support instead of using existing callback-based approaches, which make it harder to understand and support BTF-processing code.

v1: bpf-next: uprobe, bpf: Add session support

This patchset adds support for session uprobe attachment and for using it through a bpf link for bpf programs.

v1: bpf: Support bpf shadow stack

This works for all bpf selftests, but it is expensive. To avoid runtime kmalloc, we could preallocate some space, e.g., percpu pages to be used for the stack. This should work for non-sleepable programs.

v1: bpf: Support shadow stack

Try to add 3rd argument to bpf program where the 3rd argument is the frame pointer to bpf program stack.

Related Technology News

QEMU

v1: target/riscv: support atomic instruction fetch (Ziccif)

Support 4-byte atomic instruction fetch when the instruction is naturally aligned.

v4: target/riscv: Support RISC-V privilege 1.13 spec

Based on the change log for the RISC-V privilege 1.13 spec, add the support for ss1p13.

v4: target/riscv/kvm: QEMU support for KVM Guest Debug on RISC-V

This series implements QEMU KVM Guest Debug on RISC-V, with which we could debug RISC-V KVM guest from the host side, using software breakpoints.

v4: target/riscv: raise an exception when CSRRS/CSRRC writes a read-only CSR

Both CSRRS and CSRRC always read the addressed CSR and cause any read side effects regardless of rs1 and rd fields. Note that if rs1 specifies a register holding a zero value other than x0, the instruction will still attempt to write the unmodified value back to the CSR and will cause any attendant side effects.
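The rule can be condensed into a small C sketch. It assumes the standard RISC-V convention that CSR addresses with bits 11:10 equal to 0b11 are read-only. The function names are hypothetical, privilege checks are left out, and this is not QEMU's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CSR addresses whose top two bits (11:10) are 0b11 are read-only. */
bool csr_is_read_only(uint16_t csr)
{
    assert(csr < 0x1000); /* CSR addresses are 12 bits wide */
    return ((csr >> 10) & 0x3) == 0x3;
}

/*
 * CSRRS/CSRRC attempt a write whenever rs1 is not register x0, no matter
 * what value that register holds. 'rs1' is the register index (0..31).
 */
bool csrrs_csrrc_raises_illegal(uint16_t csr, unsigned int rs1)
{
    bool attempts_write = (rs1 != 0);

    return attempts_write && csr_is_read_only(csr);
}
```

For example, reading the read-only `cycle` CSR (0xC00) with rs1 = x0 is fine, while the same instruction with rs1 = x5 counts as a write attempt even if x5 holds zero.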

v5: RISC-V: Modularize common match conditions for trigger

This series modularizes the code for checking the privilege levels of type 2/3/6 triggers by implementing the functions trigger_common_match() and trigger_priv_match().

v2: riscv-to-apply queue