泰晓科技 -- Focusing on Linux: trace things back to their source, see the whole from the small details!


RISC-V Linux Kernel and Related Technology News, Issue 42

Written by 呀呀呀 on 2023/04/17

Repository: RISC-V Linux Kernel Technology Research Activity


RISC-V Architecture Support

v18: -next: riscv: Add vector ISA support

This patchset is implemented based on the vector 1.0 spec to add vector support to the riscv Linux kernel. There are some assumptions for this implementation.

v1: riscv: mm: execute local TLB flush after populating vmemmap

sparse_init() calls memmap_populate() many times to create the VA-to-PA mapping for the VMEMMAP area, where every “struct page” is located once CONFIG_SPARSEMEM_VMEMMAP is defined. These “struct page” entries are later initialized in zone_sizes_init(). However, no sfence.vma instruction is executed for the VMEMMAP area during this process. This omission may cause the hart to fail its page table walk, because some data related to address translation is not yet visible to the hart. To solve this, local_flush_tlb_kernel_range() is called right after sparse_init() to execute an sfence.vma instruction for the VMEMMAP area, ensuring that all data related to address translation is visible to the hart.

v3: Add PLL clocks driver for StarFive JH7110 SoC

This patch series adds PLL clock drivers and providers that write and read syscon registers for the StarFive JH7110 RISC-V SoC. It also adds documentation describing the StarFive System Controller (syscon) registers.

v1: riscv: Allow userspace to directly access perf counters

riscv used to allow direct access to the cycle/time/instret counters, bypassing the perf framework. This patchset intends to allow the user to mmap any counter when accessed through perf. But we can’t break the existing behaviour, so we introduce a sysctl perf_user_access like arm64 does, which defaults to the legacy mode described above.

v8: Add non-coherent DMA support for AX45MP

On the Andes AX45MP core, cache coherency is a specification option, so it may not be supported. In this case DMA will fail. To work around this issue, this patch series does the following:

1] Andes alternative ports are implemented as errata, which check whether the IOCP is missing and only then apply the CMO errata. One vendor-specific SBI EXT (ANDES_SBI_EXT_IOCP_SW_WORKAROUND) is implemented as part of the errata.

v4: Add JH7110 MIPI DPHY RX support

This patchset adds a MIPI DPHY RX driver for the StarFive JH7110 SoC. It is used to transfer CSI camera data. The series has been tested on the VisionFive 2 board.

v4: Add new partial clock and reset drivers for StarFive JH7110

This patch series is based on the basic JH7110 SYSCRG/AONCRG drivers and adds new partial clock drivers and reset support for System-Top-Group (STG), Image-Signal-Process (ISP) and Video-Output (VOUT) on the StarFive JH7110 RISC-V SoC. These clocks and resets can be used by the DMA, VIN and Display modules.

v16: Microchip Soft IP corePWM driver

Uwe & I had a long back and forth about period calculations on v13, my ultimate conclusion being that, after some testing of the “corrected” calculation in hardware, the original calculation was correct. I think we had gotten sucked into discussing the calculation of the period itself, when we were in fact trying to calculate a bound on the period instead. That discussion is here: https://lore.kernel.org/linux-pwm/Y+ow8tfAHo1yv1XL@wendy/

v1: Add JH7110 cpufreq support

The StarFive JH7110 SoC has four RISC-V cores, and it supports up to 4 cpu frequency loads.

This patchset adds the compatible strings into the allowlist for supporting the generic cpufreq driver on JH7110 SoC. Also, it enables the axp15060 pmic for the cpu power source.

v1: Add JH7110 DPHY PMU support

This patchset adds a MIPI DPHY power domain driver for the StarFive JH7110 SoC. It is used to turn on the DPHY power switch. The series has been tested on the VisionFive 2 board.

v4: -next: support allocating crashkernel above 4G explicitly on riscv

On riscv, the current crash kernel allocation logic tries to allocate within the 32-bit addressable memory region by default and, if that fails, tries again without the 4G restriction.
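
The fallback order described above can be modelled in a few lines. This is a minimal Python sketch, not kernel code: the free-region list, the function names find_range and reserve_crashkernel, and the first-fit policy are all assumptions made for illustration.

```python
FOUR_GIB = 1 << 32

def find_range(free_regions, size, limit=None):
    """Return the start of the first free region that fits `size` below `limit`."""
    for start, end in free_regions:
        if limit is not None:
            end = min(end, limit)
        if end - start >= size:
            return start
    return None

def reserve_crashkernel(free_regions, size):
    """Try 32-bit addressable memory first, then retry without the 4G limit."""
    base = find_range(free_regions, size, limit=FOUR_GIB)
    if base is None:
        base = find_range(free_regions, size)
    return base

# Hypothetical memory map: 256 MiB below 4G and 4 GiB above it.
regions = [(0x8000_0000, 0x9000_0000), (0x1_0000_0000, 0x2_0000_0000)]
low = reserve_crashkernel(regions, 128 << 20)   # fits below 4G
high = reserve_crashkernel(regions, 512 << 20)  # only fits above 4G
```

The series in question is about requesting the high region explicitly instead of relying only on this implicit fallback.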

v1: RISC-V: Detect Ssqosid extension and handle sqoscfg CSR

This RFC series adds initial support for the Ssqosid extension and the sqoscfg CSR as specified in Chapter 2 of the RISC-V Capacity and Bandwidth Controller QoS Register Interface (CBQRI) specification [1].

v1: riscv: enable BUILDTIME_TABLE_SORT for !MMU

BUILDTIME_TABLE_SORT works for !MMU as well, so enable it.


v2: sched/topology: add for_each_numa_cpu() macro

for_each_cpu() is widely used in kernel, and it’s beneficial to create a NUMA-aware version of the macro.

Recently added for_each_numa_hop_mask() works, but switching existing codebase to it is not an easy process.

v6: sched/numa: add per-process numa_balancing



A large number of page faults causes performance loss while NUMA balancing is running. Processes that care about worst-case performance therefore need NUMA balancing disabled. Others, on the contrary, accept a temporary performance loss in exchange for higher average performance, so enabling NUMA balancing is better for them.

v1: sched/core: Make sched_dynamic_mutex static

The sched_dynamic_mutex is only used within the file. Make it static.

v1: sched: Rate limit migrations

This WIP patch rate-limits migrations to 32 migrations per 10ms window for each task.
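
As a rough illustration of what “32 migrations per 10ms window” means, here is a fixed-window limiter in Python. Only the budget (32) and the window length (10ms) come from the description above; the class name, the fixed-window scheme and the nanosecond timestamps are assumptions, and the actual patch may account for windows differently.

```python
class MigrationLimiter:
    """Fixed-window limiter: at most `budget` migrations per `window_ns`."""

    def __init__(self, budget=32, window_ns=10_000_000):
        self.budget = budget
        self.window_ns = window_ns
        self.window_start = 0
        self.count = 0

    def allow(self, now_ns):
        # Start a fresh window once the current one has expired.
        if now_ns - self.window_start >= self.window_ns:
            self.window_start = now_ns
            self.count = 0
        if self.count >= self.budget:
            return False
        self.count += 1
        return True

# One limiter per task; 40 migration attempts inside the same window.
limiter = MigrationLimiter()
allowed = sum(limiter.allow(now_ns=1_000) for _ in range(40))
```

Attempts beyond the budget are refused until the window rolls over, which is the behaviour the description implies for a task that bounces between CPUs.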


v8: mm: process/cgroup ksm support

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

v5: Replace invocations of prandom_u32() with get_random_u32()

The security improvements for prandom_u32 done in commits c51f8f88d705 from October 2020 and d4150779e60f from May 2022 didn’t handle the cases when prandom_bytes_state() and prandom_u32_state() are used.

v1: mm: rename reclaim_pages() to reclaim_folios()

Commit a83f0551f496 (“mm/vmscan: convert reclaim_pages() to use a folio”) changed the argument from page_list to folio_list, but not the definition, so let’s correct it and rename the function to reclaim_folios() too.

v1: [v2] mm: make arch_has_descending_max_zone_pfns() static

clang produces a build failure on x86 for some randconfig builds after a change that moves around code to mm/mm_init.c:

Cannot find symbol for section 2: .text. mm/mm_init.o: failed

v1: NFSD memory allocation optimizations

I’ve found a few ways to optimize the release of pages in NFSD. Please let me know if I’m abusing the release_pages() and pagevec APIs.

v1: mm/folio: Avoid special handling for order value 0 in folio_set_order

Calling folio_set_order(folio, 0) is an abuse of folio_set_order(), as a 0-order folio does not have any tail page in which to set the order. folio->_folio_nr_pages is set to 0 for order 0 in folio_set_order(). This is required because _folio_nr_pages overlaps with page->mapping, and leaving it non-zero caused a “bad page” error while freeing gigantic hugepages. This was fixed in commit ba9c1201beaa (“mm/hugetlb: clear compound_nr before freeing gigantic pages”). Also, commit a01f43901cfb (“hugetlb: be sure to free demoted CMA pages to CMA”) now explicitly clears page->mapping, and hence we won’t see the bad page error even if _folio_nr_pages remains unset. Order-0 folios are not supposed to call folio_set_order() either, so we can now get rid of folio_set_order(folio, 0) in the hugetlb code path to clear up the confusion.

v4: modules/kmod: replace implementation with a semaphore

Changes on this v4:

o Really add Matthew Wilcox’ preferred tribal knowledge docs
o Add all the pending tags

v1: lib/percpu_counter, cpu/hotplug: Cure the cpu_dying_mask woes

The cpu_dying_mask is not only undocumented but also to some extent a misnomer. Its purpose is to capture the last direction of a cpu_up() or cpu_down() operation, taking eventual rollback operations into account.

v5: Introduce Copy-On-Write to Page Table

This patch is primarily aimed at optimizing the memory usage of page tables in processes with a large address space, which can potentially improve fork system call latency under certain conditions. However, improving fork latency is planned for the future and is not part of this patch.

v1: mm: page_alloc: Skip regions with hugetlbfs pages when allocating 1G pages

A bug was reported by Yuanxi Liu where allocating 1G pages at runtime takes an excessive amount of time for large amounts of memory. Further testing of huge page allocation showed that the cost is linear, i.e. if 1G pages are allocated in batches of 10, then the time to allocate nr_hugepages from 10->20->30->etc increases linearly even though 10 pages are allocated at each step. Profiles indicated that much of the time is spent checking the validity within already existing huge pages and then attempting a migration that fails after isolating the range, draining pages and a whole lot of other useless work.

v1: mm: page_alloc: Assume huge tail pages are valid when allocating contiguous pages

A bug was reported by Yuanxi Liu where allocating 1G pages at runtime takes an excessive amount of time for large amounts of memory. Further testing of huge page allocation showed that the cost is linear, i.e. if 1G pages are allocated in batches of 10, then the time to allocate nr_hugepages from 10->20->30->etc increases linearly even though 10 pages are allocated at each step.

v3: module: avoid userspace pressure on unwanted allocations

This v3 series follows up on the second iteration of these patches [0]. This and other pending changes are available on the 20230413-module-alloc-opts branch [1], which is based on modules-next.

v2: mm: ksm: support hwpoison for ksm page

Currently, ksm does not support hwpoison. As ksm is being used more widely for deduplication at the system level, container level, and process level, supporting hwpoison for ksm has become increasingly important. However, ksm pages were not processed by hwpoison in 2009 [1].

v1: migrate_pages: Never block waiting for the page lock

Currently when we try to do page migration and we’re in “synchronous” mode (and not doing direct compaction) then we’ll wait an infinite amount of time for a page lock. This does not appear to be a great idea.

v1: Setting memory policy for restrictedmem file

This patchset builds upon the memfd_restricted() system call that was discussed in the ‘KVM: mm: fd-based approach for supporting KVM’ patch series [1].

v1: change ->index to PAGE_SIZE for hugetlb pages

This RFC patch series attempts to simplify the page cache code by removing special casing code for hugetlb pages. Normal pages in the page cache are indexed by PAGE_SIZE while hugetlb pages are indexed by their huge page size. This was previously tried but the xarray was not performant enough for the changes.

v2: -next: mm: hwpoison: support recovery from HugePage copy-on-write faults

copy-on-write of hugetlb user pages with uncorrectable errors will result in a kernel crash. This is because the copy is performed in kernel mode and in general we can not handle accessing memory with such errors while in kernel mode. Commit a873dfe1032a (“mm, hwpoison: try to recover from copy-on write faults”) introduced the routine copy_user_highpage_mc() to gracefully handle copying of user pages with uncorrectable errors. However, the separate hugetlb copy-on-write code paths were not modified as part of commit a873dfe1032a.

v6: Ignore non-LRU-based reclaim in memcg reclaim

Upon running some proactive reclaim tests using memory.reclaim, we noticed some tests flaking where writing to memory.reclaim would be successful even though we did not reclaim the requested amount fully. Looking further into it, I discovered that sometimes we overestimate the number of reclaimed pages in memcg reclaim.

v1: printk: Export console trace point for kcsan/kasan/kfence/kmsan

The console tracepoint is used by kcsan/kasan/kfence/kmsan test modules. Since this tracepoint is not exported, these modules iterate over all available tracepoints to find the console trace point. Export the trace point so that it can be directly used.

v7: ksm: support tracking KSM-placed zero-pages

The core idea of this patch set is to enable users to perceive the number of pages merged by KSM, regardless of whether the use_zero_pages switch has been turned on, so that users can know how much of the free memory increase is really due to their madvise(MERGEABLE) actions. The problem is that, when use_zero_pages is enabled, all empty pages are merged with the kernel zero page instead of with each other (as happens when use_zero_pages is disabled), and these zero pages are then no longer monitored by KSM.

v1: mm: hwpoison: coredump: support recovery from dump_user_range()

dump_user_range() is used to copy user pages to a coredump file, but if a hardware memory error occurs during the copy, which is called from __kernel_write_iter() in dump_user_range(), the kernel crashes.

v1: selftests/mm: Replace obsolete memalign() with posix_memalign()

memalign() is obsolete according to its manpage.

Replace memalign() with posix_memalign().

v1: mm: huge_memory: Replace obsolete memalign() with posix_memalign()

memalign() is obsolete according to its manpage.

Replace memalign() with posix_memalign().

v2: mm: hugetlb_vmemmap: provide stronger vmemmap allocation guarantees

HugeTLB pages have a struct page optimization where the struct pages for tail pages are freed. However, when HugeTLB pages are destroyed, the memory for the struct pages (vmemmap) needs to be allocated again.

v1: mm: hugetlb_vmemmap: provide stronger vmemmap allocation guarantees

HugeTLB pages have a struct page optimization where the struct pages for tail pages are freed. However, when HugeTLB pages are destroyed, the memory for the struct pages (vmemmap) needs to be allocated again.


v1: fanotify: support watching filesystems and mounts inside userns

An unprivileged user is allowed to create an fanotify group and add inode marks, but not filesystem and mount marks.

v2: fs/proc: add Kthread flag to /proc/$pid/status

The commands ps -ef and top -c mark kernel threads with ‘[’ and ‘]’, but sometimes the result is not correct. The task->flags value in /proc/$pid/stat is reliable, but we need to remember that the value of PF_KTHREAD is 0x00200000 and convert decimal to hex. If we have no binary program or shell script that reads /proc/$pid/stat, we can find out directly with cat /proc/$pid/status.
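
To show the manual decoding that the new status field would make unnecessary, here is a small Python sketch that tests PF_KTHREAD in a /proc/$pid/stat line. The PF_KTHREAD value is the one quoted above; the function name is_kthread and the two sample lines are made up for illustration.

```python
PF_KTHREAD = 0x00200000  # value quoted in the patch description

def is_kthread(stat_line):
    """Return True if the flags field of a /proc/<pid>/stat line has PF_KTHREAD set."""
    # The comm field is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # Fields after comm: state, ppid, pgrp, session, tty_nr, tpgid, flags.
    flags = int(rest[6])
    return bool(flags & PF_KTHREAD)

# Hypothetical, shortened stat lines (real ones have many more fields).
kthreadd = "2 (kthreadd) S 0 0 0 0 -1 2129984 0 0 0 0"
shell = "4321 (my shell) S 1 4321 4321 0 -1 4194304 0 0 0 0"
```

The decimal value 2129984 is 0x208040, which contains the PF_KTHREAD bit, while 4194304 (0x400000) does not.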

v1: Monitoring unmounted fs with fanotify

Followup on my quest to close the gap with inotify functionality, here is a proposal for FAN_UNMOUNT event.

v2: Alter fcntl to handle int arguments correctly

According to the documentation of fcntl, some commands take an int as argument. In practice not all of them enforce this behaviour, as they instead accept a more permissive long and in most cases not even a range check is performed.

An issue could possibly arise from a combination of the handling of the varargs in user space and the ABI rules of the target, which may result in the top bits of an int argument being non-zero.
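
The following Python sketch models the issue: a register-sized argument whose low 32 bits hold the intended int while the top bits hold leftovers. It is only an illustration of the truncation an int-typed handler would apply; the helper name as_c_int and the sample value are assumptions, and the sketch does not reproduce the actual kernel change.

```python
def as_c_int(arg):
    """Interpret the low 32 bits of a register-sized value as a signed C int."""
    low = arg & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low

# A 64-bit register holding the int value 3 with garbage in the top bits.
dirty = 0xDEADBEEF_00000003
```

Treating the value as a long would see a huge number, whereas as_c_int(dirty) recovers the 3 that the caller meant to pass.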

v1: mm/filemap: allocate folios according to the blocksize

If the blocksize is larger than the pagesize allocate folios with the correct order.
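
The relation between block size and folio order can be sketched as follows. This is a Python illustration under the assumption of 4096-byte pages; the function name folio_order is hypothetical and the patch itself may derive the order differently.

```python
def folio_order(block_size, page_size=4096):
    """Smallest order such that (page_size << order) covers one block."""
    if block_size <= page_size:
        return 0
    pages = -(-block_size // page_size)  # ceiling division
    return (pages - 1).bit_length()

order_16k = folio_order(16384)  # 4 pages per block, so order 2
```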

v1: convert create_page_buffers to create_folio_buffers

One of the first kernel panics we hit when we try to increase the block size beyond 4k is inside create_page_buffers()[1]. Even though the buffer.c functions do not support large folios (folios > PAGE_SIZE) at the moment, these changes are required when we want to remove that constraint.

v3: Introduce provisioning primitives for thinly provisioned storage

This patch series adds a mechanism to pass through provision requests on stacked thinly provisioned block devices.

v1: fs/ntfs3: disable page fault during ntfs_fiemap()

syzbot is reporting circular locking dependency between ntfs_file_mmap() (which has mm->mmap_lock => ni->ni_lock dependency) and ntfs_fiemap() (which has ni->ni_lock => mm->mmap_lock dependency).

v1: Backport several patches to 5.10.y

Antgroup is using 5.10.y in its production environment, and we found that several patches we need are missing from the 5.10.y tree, so we backported them to 5.10.y.

v6: net-next: splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1

Here’s the first tranche of patches towards providing a MSG_SPLICE_PAGES internal sendmsg flag that is intended to replace the ->sendpage() op with calls to sendmsg(). MSG_SPLICE_PAGES is a hint that tells the protocol that it should splice the pages supplied if it can and copy them if not.

v1: [RESEND] fs: opportunistic high-res file timestamps

(Apologies for the resend, but I didn’t send this with a wide enough distribution list originally).

v1: fs: opportunistic high-res file timestamps

While I don’t think we can practically optimize away ctime updates like we do with i_version, I do like the idea of using this scheme to indicate when we need to use a high-res timestamp.

v1: fanotify: Enable FAN_REPORT_FID on more filesystem types

If kernel supports FAN_REPORT_ANY_FID, use this flag to allow testing also filesystems that do not support fsid or NFS file handles (e.g. fuse).

v9: Implement copy offload support

The patch series covers the points discussed in November 2021 virtual call [LSF/MM/BFP TOPIC] Storage: Copy Offload [0]. We have covered the initial agreed requirements in this patchset and further additional features suggested by community. Patchset borrows Mikulas’s token based approach for 2 bdev implementation.

v4: Providing mount in memfd_restricted() syscall

This patchset builds upon the memfd_restricted() system call that was discussed in the ‘KVM: mm: fd-based approach for supporting KVM’ patch series, at https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/

v2: sysv: don’t call sb_bread() with pointers_lock held

syzbot is reporting sleep in atomic context in SysV filesystem [1], for sb_bread() is called with rw_spinlock held.

A “write_lock(&pointers_lock) => read_lock(&pointers_lock) deadlock” bug and a “sb_bread() with write_lock(&pointers_lock)” bug were introduced by “Replace BKL for chain locking with sysvfs-private rwlock” in Linux 2.5.12.

v1: blk: optimization for classic polling

This removes the dependency on interrupts to wake up the task. The task state is set to TASK_RUNNING if need_resched() returns true while polling for IO completion. Earlier, the polling task used to sleep, relying on an interrupt to wake it up. This made some IO take very long when interrupt coalescing is enabled in NVMe.


v1: brcmfmac: Demote some kernel errors to info

brcmfmac has some messages that are KERN_ERR even though they are harmless. This is spooking and confusing people, because they end up being the only kernel messages on their boot console with common error-only printk levels (at least on Apple Macs).

v1: net: virtio-net: reject small vring sizes

Check vring size and fail probe if a transmit/receive vring size is smaller than MAX_SKB_FRAGS + 2.

At the moment, any vring size is accepted. This is problematic because it may result in attempting to transmit a packet with more fragments than there are descriptors in the ring.
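
The check itself is a single comparison. The Python sketch below models it; the value 17 for MAX_SKB_FRAGS is an assumption (the real constant depends on the kernel configuration), and the function names are hypothetical.

```python
MAX_SKB_FRAGS = 17  # assumed value; the real constant depends on the kernel configuration

def vring_size_ok(ring_size, max_skb_frags=MAX_SKB_FRAGS):
    """A ring is usable only if it has at least MAX_SKB_FRAGS + 2 descriptors."""
    return ring_size >= max_skb_frags + 2

def probe(tx_size, rx_size):
    """Fail the (modelled) probe when either ring is too small."""
    if not (vring_size_ok(tx_size) and vring_size_ok(rx_size)):
        raise ValueError("vring too small")
    return True
```

With the assumed constant, a ring of 19 descriptors is the smallest one accepted, and probe(16, 256) raises instead of leaving a device that can run out of descriptors mid-packet.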

v1: net-next: ethtool mm API improvements

Currently the ethtool --set-mm API permits the existence of 2 configurations which don’t make sense:

  • pmac-enabled false tx-enabled true
  • tx-enabled false verify-enabled true
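
A validity check for the two combinations listed above could look like the following Python sketch. The function name and boolean parameters are hypothetical; how the series actually rejects or normalizes these settings is not stated here.

```python
def mm_config_valid(pmac_enabled, tx_enabled, verify_enabled):
    """Reject the two combinations listed above; accept everything else."""
    if tx_enabled and not pmac_enabled:
        return False
    if verify_enabled and not tx_enabled:
        return False
    return True
```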

v1: net-next: Ocelot/Felix driver support for preemptible traffic classes

The series “Add tc-mqprio and tc-taprio support for preemptible traffic classes” from: https://lore.kernel.org/netdev/20230220122343.1156614-1-vladimir.oltean@nxp.com/

was eventually submitted in a form without the support for the Ocelot/Felix switch driver. This patch set picks up that work again, and presents a fairly modified form compared to the original.

v2: net: net/sched: clear actions pointer in miss cookie init fail

Palash reports a UAF when using a modified version of syzkaller[1].

When ‘tcf_exts_miss_cookie_base_alloc()’ fails in ‘tcf_exts_init_ex()’, a call to ‘tcf_exts_destroy()’ is made to free up the tcf_exts resources. In flower, ‘__fl_put()’ is called when ‘tcf_exts_init_ex()’ fails; it then calls ‘tcf_exts_destroy()’, which triggers a UAF since the already freed tcf_exts action pointer is lingering in the struct.

v2: net-next: tsnep: XDP socket zero-copy support

Implement XDP socket zero-copy support for tsnep driver. I tried to follow existing drivers like igc as far as possible. But one main

v2: net-next: r8169: use new macros from netdev_queues.h

Add one missing subqueue version of the macros, and use the new macros in r8169 to simplify the code.

v6: net-next: XDP Rx HWTS metadata for stmmac driver

Implemented XDP receive hardware timestamp metadata for stmmac driver.

This patchset is tested with tools/testing/selftests/bpf/xdp_hw_metadata. Below are the test steps and results.

v1: net-next: sctp: add some missing peer_capables in sctp info dump

The 1st patch removes the unused and obsolete hostname_address from sctp_association peer and also the bit from sctp_info peer_capables, and then reuses its bit for reconf_capable and use the higher available bit for intl_capable in the 2nd patch.

v6: ip.7: Add “special and reserved addresses” section

Break out the discussion of special and reserved IPv4 addresses into a subsection, formatted as a pair of definition lists, and briefly describing three cases in which Linux no longer treats addresses specially, where other systems do or did.

v1: net-next: eth: mlx5: avoid iterator use outside of a loop

Fix the following warning about risky iterator use:

drivers/net/ethernet/mellanox/mlx5/core/eq.c:1010 mlx5_comp_irq_get_affinity_mask() warn: iterator used outside loop: ‘eq’

v1: ice: document RDMA devlink parameters

Commit e523af4ee560 (“net/ice: Add support for enable_iwarp and enable_roce devlink param”) added support for the enable_roce and enable_iwarp parameters in the ice driver. It didn’t document these parameters in the ice devlink documentation file. Add this documentation, including a note about the mutual exclusion between the two modes.

v1: net-next: net: skbuff: hide some bitfield members

There is a number of protocol or subsystem specific fields in struct sk_buff which are only accessed by one subsystem. We can wrap them in ifdefs with minimal code impact.

This gives us a better chance to save a 2B and a 4B holes resulting with the following savings (assuming a lucky kernel config):

v2: net-next: ax25: exit linked-list searches earlier

There’s no need to loop until the end of the list if we have a result.

Device callsigns are unique, so there can only be one dev returned from ax25_addr_ax25dev(). If not, there would be inconsistencies based on order of insertion, and refcount leaks.

v1: net-next: selftests: openvswitch: add support for testing upcall interface

The existing selftest suite for openvswitch will work for regression testing the datapath feature bits, but won’t test things like adding interfaces, or the upcall interface. Here, we add some additional test facilities.

v1: net: wwan: Expose secondary AT port on DATA1

Our use-case needs two AT ports available: one for running a ppp daemon, and another one for management.

This patch enables a second AT port on DATA1

Re: v1: net: Add check for csum_start in skb_partial_csum_set()

Conceivably this can be added, though it is a bit complex for devices with variable length link layer headers. And it would have to happen not only for packet sockets, but all users of virtio_net_hdr.

v1: net-next: net: phy: add driver for MediaTek SoC built-in GE PHYs

Some of MediaTek’s Filogic SoCs come with built-in Gigabit Ethernet PHYs which require calibration data from the SoC’s efuse. Add support for these PHYs to the mediatek-ge driver if built for MediaTek’s ARM64 SoCs.

v2: net-next: virtio/vsock: support datagrams

This series introduces support for datagrams to virtio/vsock.

It is a spin-off (and smaller version) of this series from the summer: https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/

v1: Enable multiple MCAN on AM62x

On AM62x there is one MCAN in MAIN domain and two in MCU domain. The MCANs in MCU domain were not enabled since there is no hardware interrupt routed to A53 GIC interrupt controller. Therefore A53 Linux cannot be interrupted by MCU MCANs.

v1: net: Revert “net/mlx5: Enable management PF initialization”

Paul reports that it causes a regression with IB on CX4 and FW 12.18.1000. In addition I think that the concept of “management PF” is not fully accepted and requires a discussion.

v1: net-next: net: page_pool: add pages and released_pages counters

Introduce pages and released_pages counters to page_pool ethtool stats in order to track the number of allocated and released pages from the pool.

GIT PULL: Networking for v6.3-rc7

Including fixes from bpf, and bluetooth.

Not all that quiet given spring celebrations, but “current” fixes are thinning out, which is encouraging. One outstanding regression in the mlx5 driver when using old FW, not blocking but we’re pushing for a fix.

v5: Add EMAC3 support for sa8540p-ride (devicetree/clk bits)

This is a forward port / upstream refactor of code delivered downstream by Qualcomm over at [0] to enable the DWMAC5 based implementation called EMAC3 on the sa8540p-ride dev board.

v9: Another crack at a handshake upcall mechanism

Here is v9 of a series to add generic support for transport layer security handshake on behalf of kernel socket consumers (user space consumers use a security library directly, of course).

v1: net-next: lib/win_minmax: export symbol of minmax_running_min

This commit exports the symbol of the function minmax_running_min to make it accessible to dynamically loaded modules. It makes this library more general, especially for congestion control algorithm modules that want to implement a windowed min filter.

v1: staging: octeon: Convert to use phylink

The purpose of these patches is to add SFP cage support to the Octeon ethernet driver.

v4: net-next: Add SCM_PIDFD and SO_PEERPIDFD

  1. Implement SCM_PIDFD, a new CMSG type analogous to SCM_CREDENTIALS, but it contains a pidfd instead of a plain pid, which allows programmers not to care about the PID reuse problem.

  2. Add SO_PEERPIDFD, which allows getting a pidfd of the peer socket holder. It is a direct analog of SO_PEERCRED, which allows getting a plain PID.

  3. Add SCM_PIDFD / SO_PEERPIDFD kselftest
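
For comparison, this is how the existing SO_PEERCRED option (the plain-PID analog mentioned above) is queried today. The sketch is Linux-only Python and deliberately does not use SO_PEERPIDFD, since that option is what the series proposes; the helper name peer_credentials is hypothetical.

```python
import os
import socket
import struct

def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer of a connected AF_UNIX socket (Linux only)."""
    size = struct.calcsize("3i")  # struct ucred: pid, uid, gid
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, size)
    return struct.unpack("3i", raw)

# Both ends of a socketpair belong to this process, so the peer is ourselves.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pid, uid, gid = peer_credentials(a)
a.close()
b.close()
```

The returned pid is only a number, so it can be recycled for an unrelated process after the peer exits. That is the PID reuse problem a pidfd avoids.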

v2: bpf-next: bpf: add netfilter program type

The new program type is ‘tracing style’, i.e. there is no context access rewrite done by verifier, the function argument (struct bpf_nf_ctx) isn’t stable. There is no support for direct packet access, dynptr api should be used instead.

v1: net-next: Support tunnel mode in mlx5 IPsec packet offload

This series extends mlx5 to support tunnel mode in its IPsec packet offload implementation.

v2: net: Finish up ->msg_control{,_user} split

Commit 1f466e1f15cf (“net: cleanly handle kernel vs user buffers for ->msg_control”) introduced the msg_control_user and msg_control_is_user fields in struct msghdr, to ensure that user pointers are represented as such. It also took care of converting most users of struct msghdr::msg_control where user pointers are involved. It did however miss a number of cases, and some code using msg_control inappropriately has also appeared in the meantime.

v8: net/packet: support mergeable feature of virtio

Packet sockets, like tap, can be used as the backend for kernel vhost. In packet sockets, virtio net header size is currently hardcoded to be the size of struct virtio_net_hdr, which is 10 bytes; however, it is not always the case: some virtio features, such as mrg_rxbuf, need virtio net header to be 12-byte long.
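
The 10-byte and 12-byte sizes follow directly from the header layouts. The Python sketch below reproduces them with the struct module; the field lists mirror struct virtio_net_hdr and struct virtio_net_hdr_mrg_rxbuf, while the helper name vnet_hdr_len is hypothetical.

```python
import struct

# struct virtio_net_hdr: flags and gso_type (8-bit each), then hdr_len,
# gso_size, csum_start and csum_offset (16-bit each).
VIRTIO_NET_HDR = struct.Struct("<BBHHHH")
# struct virtio_net_hdr_mrg_rxbuf appends a 16-bit num_buffers field.
VIRTIO_NET_HDR_MRG_RXBUF = struct.Struct("<BBHHHHH")

def vnet_hdr_len(mergeable_rx_bufs):
    """Header length a backend must use for the negotiated feature set."""
    hdr = VIRTIO_NET_HDR_MRG_RXBUF if mergeable_rx_bufs else VIRTIO_NET_HDR
    return hdr.size
```

A backend that always assumes the 10-byte layout would misparse every frame once mrg_rxbuf is negotiated, which is why the size has to be configurable.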

v5: net-next: Support MACsec VLAN

This patch series introduces support for hardware (HW) offload MACsec devices with VLAN configuration. The patches address both scenarios where the VLAN header is both the inner and outer header for MACsec.

v3: net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg

If the TCA_QFQ_LMAX value is not offered through nlattr, lmax is determined by the MTU value of the network device. The MTU of the loopback device can be set up to 2^31-1. As a result, it is possible to have an lmax value that exceeds QFQ_MIN_LMAX.

v1: net-next: bridge: Add per-{Port, VLAN} neighbor suppression

In order to minimize the flooding of ARP and ND messages in the VXLAN network, EVPN includes provisions [1] that allow participating VTEPs to suppress such messages in case they know the MAC-IP binding and can reply on behalf of the remote host. In Linux, the above is implemented in the bridge driver using a per-port option called “neigh_suppress” that was added in kernel version 4.15 [2].

Asynchronous IO

v1: liburing: io_uring sendto

There are two patches in this series. The first patch adds the io_uring_prep_sendto() function. The second patch adds the manpage and CHANGELOG.

v3: liburing: multishot timeout support

Changes on the liburing side to support multishot timeouts.

v1: io_uring: complete request via task work in case of DEFER_TASKRUN

So far io_req_complete_post() only covers DEFER_TASKRUN by completing request via task work when the request is completed from IOWQ.

However, a uring command could be completed from any context, and if io_uring is set up with DEFER_TASKRUN, the command is required to be completed from the current context; otherwise a wait on IORING_ENTER_GETEVENTS can’t be woken up and may hang forever.

v2: liburing: add multishot timeout support

Single change to sync the new IORING_TIMEOUT_MULTISHOT flag with kernel.

Mostly unit tests for multishot timeouts.

v1: io_uring/uring_cmd: take advantage of completion batching

We know now what the completion context is for the uring_cmd completion handling, so use that to have io_req_task_complete() decide what the best way to complete the request is. This allows batching of the posted completions if we have multiple pending, rather than always doing them one-by-one.

Rust For Linux

v1: rust: init: broaden the blanket impl of Init

This makes it possible to use T as an impl Init<T, E> for every error type E instead of just Infallible.

v1: MAINTAINERS: add Benno Lossin as Rust reviewer

Benno has been involved with the Rust for Linux project for the better part of a year now. He has been working on solving the safe pinned initialization problem [1], which resulted in the pin-init API patch series [2] that reduces the need for unsafe code in the kernel. He is also working on the field projection RFC for Rust [3] to bring pin-init as a language feature.

v1: v4.1: rust: lock: add Guard::do_unlocked

It releases the lock, executes some function provided by the caller, then reacquires the lock. This is preparation for the implementation of condvars, which will sleep between unlocking and relocking.
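
The release, run, reacquire pattern can be illustrated outside the kernel as well. The following is a Python analogue using threading.Lock, not the Rust API from the patch; the free function do_unlocked here is a stand-in for the Guard method.

```python
import threading

def do_unlocked(lock, fn):
    """Release `lock`, run `fn()`, then reacquire `lock` before returning."""
    lock.release()
    try:
        return fn()
    finally:
        # Reacquire even if fn() raises, so the caller still holds the lock.
        lock.acquire()

lock = threading.Lock()
with lock:
    held_inside = do_unlocked(lock, lock.locked)  # False: fn ran without the lock
    held_after = lock.locked()                    # True: the lock was reacquired
```

A condition variable wait fits this shape: the function passed in is the part that sleeps, and the caller holds the lock again when it returns.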

v5: scripts: make rust-analyzer for out-of-tree modules

Adds support for out-of-tree rust modules to use the rust-analyzer make target to generate the rust-project.json file.

The change involves adding an optional parameter external_src to the generate_rust_analyzer.py which expects the path to the out-of-tree module’s source directory. When this parameter is passed, I have chosen not to add the non-core modules (samples and drivers) into the result since these are not expected to be used in third party modules. Related changes are also made to the Makefile and rust/Makefile allowing the rust-analyzer target to be used for out-of-tree modules as well.


v1: A new bpf map type for fuzzy matching key

To support fuzzy matching in bpf maps as described in the original question [0], we have come up with a proposal on which we would like advice or comments from the bpf list. Thanks a lot for all the feedback :)

We plan to implement a new bpf map type, named BPF_FM_MAP, standing for fuzzy matching map. The basic idea is to implement a trie using a map-of-maps runtime structure.

v2: bpf-next: Shared ownership for local kptrs

The above program will fail verification due to current owning / non-owning ref logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be passed to bpf_rbtree_add. The only way to get an owning reference for the node that was added is to bpf_list_pop_{front,back} it.

v2: libbpf: correct the macro KERNEL_VERSION for old kernel

The header file linux/version.h included in libbpf_probes.c may define a wrong KERNEL_VERSION macro for calculating LINUX_VERSION_CODE on some old kernels (Debian 9, 10). Below is a version info example from Debian 10.
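
The sketch below is not that Debian output. It uses a hypothetical release number, 4.19.260, to show how the two forms of the macro found in older and newer linux/version.h headers diverge once the sublevel exceeds 255. That this overflow is the exact problem the patch addresses is an assumption.

```python
def kernel_version_unclamped(a, b, c):
    # Older header form: (((a) << 16) + ((b) << 8) + (c))
    return (a << 16) + (b << 8) + c


def kernel_version_clamped(a, b, c):
    # Newer header form: (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))
    return (a << 16) + (b << 8) + min(c, 255)


# Hypothetical release 4.19.260: the sublevel no longer fits in 8 bits.
old_code = kernel_version_unclamped(4, 19, 260)
new_code = kernel_version_clamped(4, 19, 260)

# Without clamping, the sublevel spills into the minor field, so the
# result is indistinguishable from 4.20.4.
collides = old_code == kernel_version_unclamped(4, 20, 4)
```

With the unclamped form, a version comparison against a code computed the clamped way would disagree for any sublevel above 255.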

v1: vmlinux.lds.h: Discard .note.gnu.property section

It looks like CONFIG_DEBUG_INFO_BTF is already (inadvertently) stripping it from vmlinux due to how GNU properties are merged by the linker (see “How GNU properties are merged” in the ld man page).


First of all, I personally love open source, Linux and virtio. I have also participated in community work such as virtio for a long time.

v1: net-next: bpf, net: Support redirecting to ifb with bpf

In our container environment, we are using EDT-bpf to limit the egress bandwidth. EDT-bpf can be used to limit egress only, but can’t be used to limit ingress. Some of our users also want to limit the ingress bandwidth.

v3: net: mana: Add support for jumbo frame

This set adds support for jumbo frames, with some optimizations for the RX path.

v10: bpf: XDP-hints: API change for RX-hash kfunc bpf_xdp_metadata_rx_hash

Current API for bpf_xdp_metadata_rx_hash() returns the raw RSS hash value, but doesn’t provide information on the RSS hash type (part of 6.3-rc).

This patchset proposes changing the function call signature by adding a pointer argument that provides the RSS hash type.

v1: bpf-next: bpf: Handle NULL in bpf_local_storage_free.

During OOM, bpf_local_storage_alloc() may fail to allocate ‘storage’, and the subsequent call to bpf_local_storage_free() with a NULL pointer will cause a crash like:

v6: bpf-next: xsk: Support UMEM chunk_size > PAGE_SIZE

The main purpose of this patchset is to add AF_XDP support for UMEM chunk sizes > PAGE_SIZE. This is enabled for UMEMs backed by HugeTLB pages.

v1: selftests/bpf: ignore pointer types check with clang

This is due to the fact that bpftool emits duplicate data types with

v1: bpf-next: samples/bpf: sampleip: Replace PAGE_OFFSET with _text address

The PAGE_OFFSET macro (0xffff880000000000) in sampleip_user.c is inaccurate. On the aarch64 architecture, for example, the value depends on the CONFIG_ARM64_VA_BITS build option, which defaults to 48 and gives a PAGE_OFFSET of 0xffff800000000000. If we use the value defined in sampleip_user.c, all KSYMs obtained by sampleip are reported as (user).
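
The effect of the hard-coded threshold can be illustrated with a small sketch. The classify() helper and both aarch64 addresses are hypothetical, chosen only so that they fall between the two PAGE_OFFSET values quoted above. The real tool resolves kernel addresses to symbol names rather than returning a label.

```python
HARDCODED_PAGE_OFFSET = 0xFFFF880000000000  # value quoted from sampleip_user.c


def classify(ip, kernel_start):
    # Anything below the threshold is treated as a user-space address.
    return "kernel" if ip >= kernel_start else "(user)"


# Hypothetical aarch64 values: start of kernel text and a sampled address in it.
text_addr = 0xFFFF800008000000
sampled_ip = 0xFFFF800008123456

with_page_offset = classify(sampled_ip, HARDCODED_PAGE_OFFSET)
with_text_addr = classify(sampled_ip, text_addr)
```

Comparing against the address of _text, as the patch title suggests, removes the dependency on an architecture-specific constant.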

v1: bpf-next: New BPF map and BTF security LSM hooks

Add new LSM hooks, bpf_map_create_security and bpf_btf_load_security, which are meant to allow highly granular LSM-based control over the usage of the BPF subsystem. Specifically, they control the creation of BPF maps and BTF data objects, which are fundamental building blocks of any modern BPF application.

v1: Smack modifications for: security: Allow all LSMs to provide xattrs for inode_init_security hook

Very very quick modification. Not tested.

v1: bpf: lirc program type should not require SYS_CAP_ADMIN

Make it possible to load lirc program type with just CAP_BPF.

v2: bpf-next: xsk: Elide base_addr comparison in xp_unaligned_validate_desc

Remove redundant (base_addr >= pool->addrs_cnt) comparison from the conditional.

v1: bpf-next: tools/resolve_btfids: Ignore libsubcmd

Since commit af03299d8536 (“tools/resolve_btfids: Install subcmd headers”) introduced the subcmd headers directory, we should ignore it.

v1: perf bperf: Avoid use after free via union

If bperf sets leader_skel or follower_skel then it appears bpf_skel is set and can trigger the following use-after-free

v1: bpf-next: xsk: Simplify xp_aligned_validate_desc implementation

Perform the chunk boundary check like the page boundary check in xp_desc_crosses_non_contig_pg(). This simplifies the implementation and reduces the number of branches.
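
One way to express such an offset-based boundary check is sketched below. It is not the kernel code: the function names are hypothetical, chunk_size is assumed to be a power of two, and the second function is just a comparison of first and last chunk index used here as a reference.

```python
def crosses_chunk(addr, length, chunk_size):
    # Offset of the descriptor inside its chunk (chunk_size is a power of two).
    offset = addr & (chunk_size - 1)
    # The buffer crosses the boundary if it runs past the end of the chunk.
    return offset + length > chunk_size


def crosses_chunk_by_index(addr, length, chunk_size):
    # Reference formulation: compare the chunk index of the first and last byte.
    return addr // chunk_size != (addr + length - 1) // chunk_size


fits = crosses_chunk(4096, 4096, 4096)    # exactly one full chunk
spills = crosses_chunk(4000, 200, 4096)   # runs 104 bytes into the next chunk
```

The offset form needs one mask, one addition and one comparison, which is the kind of branch reduction the description refers to.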

v1: bpf-next: Dynptr convenience helpers

This patchset is the 3rd in the dynptr series. The 1st (dynptr fundamentals) can be found here [0] and the second (skb + xdp dynptrs) can be found here [1].

v2: bpf-next: Introduce BPF_MA_REUSE_AFTER_RCU_GP

As discussed in v1, freed objects in the bpf memory allocator may currently be reused immediately by a new allocation. This introduces a use-after-bpf-ma-free problem for non-preallocated hash maps and makes the lookup procedure return incorrect results. The immediate reuse also makes it more difficult to introduce new use cases (e.g. qp-trie).



v3: riscv: Add support for the Zfa extension

This patch adds support for the RISC-V Zfa extension, which provides additional floating-point instructions:

  • fli (load-immediate) with pre-defined immediates
  • fminm/fmaxm (like fmin/fmax but with different NaN behaviour)
  • fround/froundnx (round to integer)
  • fcvtmod.w.d (Modular Convert-to-Integer)
  • fmv* to access the high bits of float registers wider than XLEN
  • Quiet comparison instructions (fleq/fltq)
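
The difference in NaN behaviour between fmin and fminm can be modelled roughly as follows. This is a simplified sketch on Python floats: signed zeros and signaling NaNs are not modelled, and the function names merely mirror the instruction mnemonics.

```python
import math


def fmin(a, b):
    # fmin-style: a single NaN operand is ignored and the other one returned.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def fminm(a, b):
    # fminm-style: any NaN operand makes the result NaN.
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)


plain = fmin(math.nan, 1.0)
strict = fminm(math.nan, 1.0)
```

For operands that are not NaN, both functions return the same value.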

v1: riscv: Raise an exception if pte reserved bits are not cleared

As per the specification, in 64-bit, if any of the pte reserved bits 60-54 is set, an exception should be triggered (see 4.4.1, “Addressing and Memory Protection”), so implement this behaviour in the address translation process.
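
The check itself amounts to a mask test on the PTE, sketched below. The exception class and function name are hypothetical, and the sketch only models the reserved-bit test, not the rest of the page table walk.

```python
# Bits 60..54 of a 64-bit PTE: seven bits starting at bit 54.
PTE_RESERVED_MASK = 0x7F << 54


class ReservedBitsFault(Exception):
    """Hypothetical exception standing in for the fault raised by the walk."""


def check_pte(pte):
    # Reject any PTE that has one of the reserved bits set.
    if pte & PTE_RESERVED_MASK:
        raise ReservedBitsFault(hex(pte))
    return pte


ok_pte = check_pte((1 << 10) | 1)  # a PPN bit and the valid bit: accepted
```

Bits 53 and 61 lie outside the 60-54 range, so they are not rejected by this mask.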

v1: target/riscv: Add support for BF16 extensions

Specification for BF16 extensions can be found in: https://github.com/riscv/riscv-bfloat16

The port is available here: https://github.com/plctlab/plct-qemu/tree/plct-bf16-upstream

v3: target/riscv: implement query-cpu-definitions

In this v3 I removed patches 3 and 4 of v2.

Patch 3 now implements a new type that the generic CPUs (any, rv32, rv64, x-rv128) were converted to. This type will be used by query-cpu-definitions to determine if a given cpu is static or not based on its type. This approach was suggested by Richard Henderson in the v2 review.

v1: target/riscv: Restore the predicate() NULL check behavior

When reading a non-existent CSR, QEMU should raise an illegal instruction exception, but currently it just exits due to the g_assert() check.

This actually reverts commit 0ee342256af9205e7388efdf193a6d8f1ba1a617. Some comments are also added to indicate that predicate() must be provided for an implemented CSR.
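
The intended behaviour can be modelled with a small lookup table, as in the sketch below. It is not QEMU code: the table layout, the exception class and the 0x7FF entry are hypothetical, and only the rule "a missing predicate means the CSR is not implemented" is taken from the description.

```python
class IllegalInstruction(Exception):
    """Hypothetical stand-in for the guest-visible exception."""


# Hypothetical CSR table: an entry without a predicate is not implemented.
CSR_OPS = {
    0x300: {"name": "mstatus", "predicate": lambda: True},
    0x7FF: {"name": "unimplemented", "predicate": None},
}


def csr_check(csrno):
    ops = CSR_OPS.get(csrno)
    # Report an illegal instruction to the guest instead of asserting.
    if ops is None or ops["predicate"] is None:
        raise IllegalInstruction(hex(csrno))
    if not ops["predicate"]():
        raise IllegalInstruction(hex(csrno))
    return ops["name"]


implemented = csr_check(0x300)
```

An access to 0x7FF, or to a number missing from the table, raises IllegalInstruction, so the guest sees a trap rather than the emulator aborting.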

v1: target/riscv: Separate implicitly-enabled and explicitly-enabled extensions

The patch tries to separate the multi-letter extensions that may be implicitly enabled by misa.EXT from the explicitly enabled cases, so that misa.EXT can truly be disabled by write_misa(). With this separation, the implicitly enabled zve64d/f and zve32f extensions will not work if we clear misa.V, and clearing misa.V will have no effect on the explicitly enabled zve64d/f and zve32f extensions.
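
The separation can be pictured as computing the effective extension set from two independent sources. The sketch below is a simplified model with hypothetical names, not the QEMU data structures.

```python
# Extensions that follow misa.V when they were not requested explicitly.
IMPLIED_BY_V = frozenset({"zve64d", "zve64f", "zve32f"})


def effective_extensions(misa_v, explicit):
    # Implicit extensions depend on the current misa.V bit; explicit ones
    # are kept regardless of it.
    implied = IMPLIED_BY_V if misa_v else frozenset()
    return frozenset(explicit) | implied


v_set = effective_extensions(True, set())
v_cleared = effective_extensions(False, set())
v_cleared_explicit = effective_extensions(False, {"zve32f"})
```

Clearing misa.V empties the implied set, while an extension passed in explicitly survives.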

v1: target/riscv: Add support for PC-relative translation

This patchset tries to add support for PC-relative translation.

The existence of CF_PCREL can improve performance with the guest kernel’s address space randomization. Each guest process maps libc.so (et al) at a different virtual address, and this allows those translations to be shared.

v1: target/riscv: Use check for relationship between Zdinx/Zhinx{min} and Zfinx

Zdinx/Zhinx{min} require Zfinx. Such a requirement is currently usually enforced by a check.


v4: Add StarFive JH7110 PCIe driver support

This patchset must be applied after the patchset in [1]. These PCIe series patches are based on the JH7110 RISC-V SoC and VisionFive V2 board.

[1] https://patchwork.ozlabs.org/project/uboot/cover/20230329034224.26545-1-yanhong.wang@starfivetech.com

v1: riscv: Support riscv64 image type

Allow U-Boot to distinguish between 32-bit and 64-bit RISC-V kernel images when loading them. This helps avoid mistakes such as running a 32-bit U-Boot to load a 64-bit kernel.
