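RISC-V Linux Kernel and Related Technology News, Issue 90

Written by 呀呀呀 on 2024/05/07

Date: 20240505
Editor: 晓瑜
Repository: RISC-V Linux 内核技术调研活动 (RISC-V Linux Kernel Technology Research Activity)
Sponsor: PLCT Lab, ISCAS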

Kernel News

RISC-V Architecture Support

v3: riscv: sophgo: add USB phy support for CV18XX series

Add USB PHY support for CV18XX/SG200X series

v5: riscv: sophgo: Add SG2042 external hardware monitor support

Add support for the onboard hardware monitor for SG2042.

v2: riscv: make image compression configurable

This series makes the compression method configurable and sets KBUILD_IMAGE to the chosen (possibly uncompressed) kernel image, which is then used by targets such as ‘make install’, ‘make bindeb-pkg’ and ‘make tar-pkg’.

GIT PULL: KVM/riscv changes for 6.10

We have the following KVM RISC-V changes for 6.10:
1) Support guest breakpoints using ebreak
2) Introduce per-VCPU mp_state_lock and reset_cntx_lock
3) Virtualize SBI PMU snapshot and counter overflow interrupts
4) New selftests for SBI PMU and Guest ebreak

v6: riscv: Support vendor extensions and xtheadvector

This patch series ended up much larger than expected, please bear with me! The goal here is to support vendor extensions, starting at probing the device tree and ending with reporting to userspace.
The main design objective was to allow vendors to operate independently of each other. This has been achieved by delegating vendor extensions to their own files and then accumulating the extensions in arch/riscv/kernel/vendor_extensions.c.
Each vendor will have their own list of extensions they support.

v3: KVM: Set vcpu->preempted/ready iff scheduled out while running

This series changes KVM to mark a vCPU as preempted/ready if-and-only-if it’s scheduled out while running, i.e. do not mark a vCPU preempted/ready if it’s scheduled out during a non-KVM_RUN ioctl() or when userspace is doing KVM_RUN with immediate_exit=true.

v4: Linux RISC-V IOMMU Support

This patch series introduces support for RISC-V IOMMU architected hardware into the Linux kernel.

This series introduces RISC-V IOMMU hardware initialization and complete single-stage translation with paging domain support.

GIT PULL: RISC-V SoC drivers for v6.10

A few different bits of SoC-related Kconfig work. The first part of this is shared with the DT updates - the modification of all SOC_CANAAN users to SOC_CANAAN_K210 to split the existing m-mode nommu k210 away from the k230 that is able to be used in a “common” kernel.

GIT PULL: RISC-V Devicetrees for v6.10

Canaan: Basic support for the k230 from Canaan and two boards based on it.
Microchip: A simple addition of a power-monitor on the Icicle dev board, as the binding for it is now in mainline.
StarFive: Support for the Milk-V Mars. This board is incredibly similar to the VisionFive v2 that is already supported, with only the Ethernet configuration being slightly different. Emil requested a common dtsi file, so my fixes branch is pulled into for-next to avoid an annoying conflict between moved content and some erroneously added nodes that were removed as fixes this cycle.

v3: Add Pinctrl driver for Starfive JH8100 SoC

The Starfive JH8100 SoC consists of 4 pinctrl domains: sys_east, sys_west, sys_gmac, and aon. This patch series adds pinctrl drivers for these 4 pinctrl domains and depends on the JH8100 base patch series in [1] and [2]. The relevant dt-binding documentation for each pinctrl domain has been updated accordingly.
[1] https://lore.kernel.org/lkml/20231201121410.95298-1-jeeheng.sia@starfivetech.com/
[2] https://lore.kernel.org/lkml/20231206115000.295825-1-jeeheng.sia@starfivetech.com/

v1: dt-bindings: mfd: Use full path to other schemas

When referencing other schemas, it is preferred to use an absolute path (/schemas/….), which also allows a seamless move of a particular schema out of the Linux kernel to dtschema.

v1: Add support for GPIO based CS

The Microchip PolarFire SoC SPI controller supports multiple chip selects. However, only one chip select is connected in the MSS. Therefore, use GPIO descriptors to configure additional chip select lines.

v2: Enable SPCR table for console output on RISC-V

The ACPI SPCR code has been used to enable console output for ARM64 and X86. The same code can be reused for RISC-V. Furthermore, the SPCR table is mandated for headless systems, as outlined in chapter 6 of the RISC-V BRS Specification.

v3: kprobe/ftrace: bail out if ftrace was killed

If an error happens in ftrace, ftrace_kill() will prevent disarming kprobes. Eventually, the ftrace_ops associated with the kprobes will be freed, yet the kprobes will still be active, and when triggered, they will use the freed memory, likely resulting in a page fault and panic.

v1: riscv: dts: microchip: add pac1934 power-monitor to icicle

The binding for this landed in v6.9, so add the description. On the off-chance that there were people carrying local patches for this based on the driver shipped on the Microchip website (or vendor kernel), note that both the binding and sysfs filenames changed during upstreaming.

v5: RISC-V: ACPI: Add external interrupt controller support

This series adds support for the below ECR approved by ASWG. The series primarily enables irqchip drivers for RISC-V ACPI based platforms.

GIT PULL: RISC-V Sophgo Devicetrees for v6.10

Please pull dt changes for RISC-V/Sophgo.

v1: KVM: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

While fiddling with an idea for optimizing state management on AMD CPUs, I wanted to skip re-saving certain host state when a vCPU is scheduled back in, as the state (theoretically) shouldn’t change for the task while it’s scheduled out. Actually doing that was annoying and unnecessarily brittle due to having a separate API for the kvm_sched_in() case (the state save needed to be in kvm_arch_vcpu_load() for the common path).
The only real downside I see is that arm64 and riscv end up having to pass “false” for their direct usage of kvm_arch_vcpu_load(), and passing boolean literals isn’t ideal. But that can be solved by adding an inner helper that omits the @sched_in param (I almost added a patch to do that, but I couldn’t convince myself it was necessary).
The other motivation for this is to avoid yet another arch hook, and more arbitrary ordering, if there’s a future need to hook kvm_sched_out() (we’ve come close on the x86 side several times).

v2: bpf-next: riscv, bpf: Support per-CPU insn and inline bpf_get_smp_processor_id()

v1: riscv: mm: Support > 1GB kernel image size when creating early page table

By default, only one PMD page table is used when creating the early page table. If the kernel image size exceeds 1GB, two PMD page tables are needed; otherwise create_kernel_page_table hits a BUG_ON.
In addition, if a trap occurs early, the trap vector has not yet been set up properly. Its current value may have been set by the previous firmware and is typically the kernel’s _start, which is confusing and difficult to debug, so set it earlier.
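
The 1GB threshold follows from page-table geometry. As a rough sketch, assuming 4 KiB base pages with 512 entries per table (as in Sv39) and an image that starts on a 1 GiB boundary, each PMD entry maps 2 MiB and one PMD table covers 1 GiB. The helper name below is hypothetical and this is arithmetic only, not the kernel’s code.

```python
# Back-of-the-envelope model, not kernel code.
# Assumptions: 4 KiB pages, 512 entries per page table (as in Sv39),
# and a kernel image that starts on a 1 GiB boundary.
PAGE_SIZE = 4096
ENTRIES_PER_TABLE = 512
PMD_ENTRY_SPAN = PAGE_SIZE * ENTRIES_PER_TABLE        # 2 MiB per PMD entry
PMD_TABLE_SPAN = PMD_ENTRY_SPAN * ENTRIES_PER_TABLE   # 1 GiB per PMD table


def pmd_tables_needed(image_size: int) -> int:
    """Hypothetical helper: PMD tables needed to map image_size bytes."""
    # Ceiling division without floating point.
    return (image_size + PMD_TABLE_SPAN - 1) // PMD_TABLE_SPAN


if __name__ == "__main__":
    for size_mib in (64, 1024, 1025):
        tables = pmd_tables_needed(size_mib << 20)
        print(size_mib, "MiB image ->", tables, "PMD table(s)")
```

Under these assumptions an image of exactly 1 GiB still fits in one PMD table, while anything larger needs a second one.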

v3: of: property: Add fw_devlink support for interrupt-map property

This creates fw_devlink between consumers (PCI host controller) and supplier (interrupt controller) based on “interrupt-map” DT property.

v1: kbuild: simplify generic vdso installation code

With commit 4b0bf9a01270 (“riscv: compat_vdso: install compat_vdso.so.dbg to /lib/modules/*/vdso/”) applied, all debug VDSO files are installed in $(MODLIB)/vdso/.
Simplify the installation rule.

v4: Add support for a few Zc* extensions as well as Zcmop

Add support for (yet again) more RVA23U64 missing extensions. Add ISA string parsing, hwprobe and KVM support for the Zcmop, Zca, Zcf, Zcd and Zcb extensions. The Zce, Zcmt and Zcmp extensions have been left out since they target microcontrollers/embedded CPUs and are not needed by RVA23U64.
Since the Zc* extensions state that C implies Zca, Zcf (if F and RV32), Zcd (if D), this series modifies the way the ISA string is parsed and now does it in two phases. The first one parses the string and the second one validates it for the final ISA description.
This series is based on the Zimop one [1]. An additional fix [2] should be applied to correctly test that series.
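
The two-phase idea can be illustrated with a small userspace model. This is only a sketch: the function names are hypothetical, the parser handles just an rv32/rv64 prefix, single-letter extensions and underscore-separated multi-letter extensions, and the only rules applied are the C implications quoted above.

```python
def parse_isa(isa: str):
    """Phase 1: split the string into (xlen, set of extension names)."""
    isa = isa.lower()
    if not isa.startswith(("rv32", "rv64")):
        raise ValueError("unsupported ISA string: " + isa)
    xlen = int(isa[2:4])
    base, *multi = isa[4:].split("_")
    exts = set(base)                      # single-letter extensions, e.g. "c"
    exts.update(m for m in multi if m)    # multi-letter ones, e.g. "zcb"
    return xlen, exts


def resolve_isa(xlen: int, exts: set) -> set:
    """Phase 2: derive the final set by adding what C implies."""
    final = set(exts)
    if "c" in final:
        final.add("zca")
        if "f" in final and xlen == 32:
            final.add("zcf")
        if "d" in final:
            final.add("zcd")
    return final


if __name__ == "__main__":
    for isa in ("rv64imafdc_zcb", "rv32imafc"):
        xlen, exts = parse_isa(isa)
        print(isa, "->", sorted(resolve_isa(xlen, exts)))
```

Keeping the implication rules in a second pass means they can look at the whole parsed set (for example both C and F) regardless of the order in which extensions appear in the string.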

v7: mm: jit/text allocator

The patches are also available in git: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7

Process Scheduling

v4: sched/fair: allow disabling sched_balance_newidle with sched_relax_domain_level

v1: net/sched: adjust device watchdog timer to detect stopped queue at right time

Applications are sensitive to long network latency, particularly heartbeat-monitoring ones. The longer the tx timeout recovery takes, the higher the risk for such applications on production machines. This patch remedies that while still honoring the device-set tx timeout.
Modify the next watchdog timeout to be shorter than the device-specified one. Compute the next timeout as the device watchdog timeout less the time elapsed since the queue was stopped. At the next watchdog timeout, the tx timeout handler is called if the queue is still in the stopped state. Whether or not the handler is called, restore the watchdog timeout to the device-specified value.
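
The timeout computation described above can be sketched as follows. This is a simplified model in abstract ticks, not the patch itself: the function name and parameters are hypothetical, and the floor of one tick is an assumption added so the model never returns a non-positive delay.

```python
def next_watchdog_timeout(dev_timeout, now, queue_stopped_at=None):
    """Hypothetical helper: ticks until the watchdog should fire next."""
    if queue_stopped_at is None:
        # No stopped queue: use the device-specified timeout unchanged.
        return dev_timeout
    elapsed = now - queue_stopped_at   # how long ago the queue was stopped
    # Shorten the wait by the time already spent stopped.
    # The floor of one tick is an assumption of this sketch.
    return max(1, dev_timeout - elapsed)


if __name__ == "__main__":
    print(next_watchdog_timeout(500, now=1000))
    print(next_watchdog_timeout(500, now=1000, queue_stopped_at=800))
```

With a device timeout of 500 ticks and a queue stopped 200 ticks ago, the model waits only 300 more ticks, so a stall is detected about one device timeout after the queue actually stopped.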

v1: sched/proc: Print user_cpus_ptr for task status

Commit 851a723e45d1c (“sched: Always clear user_cpus_ptr in do_set_cpus_allowed()”) clears user_cpus_ptr when do_set_cpus_allowed() is called.
To determine whether user_cpus_ptr is taking effect, it is better to print the task’s user_cpus_ptr.

v1: sched/core: Test online status in available_idle_cpu()

The current implementation of available_idle_cpu() doesn’t test whether a possible cpu is offline. On s390 this dereferences a NULL pointer in arch_vcpu_is_preempted() because lowcore is not allocated for offline cpus. On x86, tracing also shows calls to available_idle_cpu() after a cpu is disabled, but it looks like this isn’t causing any (obvious) issue. Nevertheless, add a check and return early if the cpu isn’t online.

Memory Management

v1: mm: workingset reporting

This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or the userspace. IMO, the kernel should provide a set of workingset interfaces that should be generic enough to accommodate the various use cases, and be extensible to potential future use cases. The current proposed interfaces are not sufficient in that regard, but I would like to start somewhere, solicit feedback, and iterate.

v1: selftests/exec: build with -fPIE instead of -pie, to make clang happy

clang doesn’t deal well with “-pie -static”: it warns that -pie is an unused option here. Changing to “-fPIE -static” solves this problem for clang, while keeping the gcc results identical. Also, the runtime results are the same for both clang and gcc builds.

**[v1: ioctl()-based API to query VMAs from /proc/&lt;pid&gt;/maps](http://lore.kernel.org/linux-mm/20240504003006.3303334-1-andrii@kernel.org/)**

Implement a binary ioctl()-based interface to the /proc/&lt;pid&gt;/maps file to allow applications to query VMA information more efficiently than through textual processing of /proc/&lt;pid&gt;/maps contents. See patch #2 for the context, justification, and nuances of the API design.
This patch set was based on top of next-20240503 tag in linux-next tree.

v1: Page counters optimizations

This patchset reorganizes page_counter structures which helps to make memory cgroup and hugetlb cgroup structures smaller (20%-35%) and more cache-effective. It also eliminates useless tracking of protected memory usage when it’s not needed.

v4: arm64: Permission Overlay Extension

This series implements the Permission Overlay Extension introduced in 2022 VMSA enhancements [1]. It is based on v6.9-rc5.
One possible issue with this version: I took the last bit of HWCAP2.

v5: enable bs > ps in XFS

This is the fifth version of the series that enables block size > page size (Large Block Size) in XFS. The context and motivation can be seen in cover letter of the RFC v1 [0]. We also recorded a talk about this effort at LPC [1], if someone would like more context on this effort.
The major change in v5 is that truncation to min order is now included and has been tested. The main issue that was observed was root-caused, and Matthew was able to identify a fix for it in xarray; that fix is now queued up on mm-hotfixes-unstable [2].
A lot of emphasis has been put on testing using kdevops, starting with an XFS baseline [3]. The testing has been split into regression and progression.

v2: mm/vmstat: sum up all possible CPUs instead of using vm_events_fold_cpu

When unplugging a CPU, the current code merges its vm_events with those of an online CPU. This is needed because summation only considers online CPUs, which makes the merge a crude workaround. By transitioning to summing up all possible CPUs, we can eliminate the need for vm_events_fold_cpu.
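
A toy model shows why summing over all possible CPUs makes the fold step unnecessary. The dictionaries and function names here are hypothetical stand-ins for per-CPU counters, not the kernel’s data structures.

```python
def sum_events(per_cpu_events, cpus):
    """Sum one event counter over the given set of CPU ids."""
    return sum(per_cpu_events.get(cpu, 0) for cpu in cpus)


def fold_cpu(per_cpu_events, dead_cpu, target_cpu):
    """Model of the old workaround: move a dead CPU's count to an online CPU."""
    per_cpu_events[target_cpu] += per_cpu_events[dead_cpu]
    per_cpu_events[dead_cpu] = 0


if __name__ == "__main__":
    possible = {0, 1, 2, 3}
    online = {0, 1, 3}                  # CPU 2 has been unplugged
    events = {0: 10, 1: 20, 2: 5, 3: 7}

    # Summing only online CPUs drops CPU 2's events unless they were folded.
    print("online only:", sum_events(events, online))
    # Summing all possible CPUs needs no fold step.
    print("all possible:", sum_events(events, possible))
```

In this model the online-only sum is correct only after fold_cpu() has run, whereas the sum over possible CPUs is correct either way.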

v1: Address hugetlbfs mmap behavior

This patch proposes to fix hugetlbfs mmap behavior so that the file size does not get updated in the mmap call.

This patch adds a ‘nommapfilesz’ mount option to hugetlbfs. The mount option name can be changed if a better name is suggested.
Submitting this patch as an RFC to get feedback on the approach and on whether there is any reason that requires the file size to be extended by mmap in the hugetlbfs case.

v3: large folios swap-in: handle refault cases first

This patch is extracted from the large folio swapin series[1], primarily addressing the handling of scenarios involving large folios in the swap cache.

v3: fs/coredump: Enable dynamic configuration of max file note size

Introduce the capability to dynamically configure the maximum file note size for ELF core dumps via sysctl. This enhancement removes the previous static limit of 4MB, allowing system administrators to adjust the size based on system-specific requirements or constraints.

v2: cgroup: add tests to verify the zswap writeback path

Initiate writeback with the steps described in the commit message and check using memory.stat.zswpwb if zswap writeback occurred.

v2: cgroup: Add documentation for missing zswap memory.stat

This includes zswpin, zswpout and zswpwb.

v1: selftests/damon: add DAMOS quota goal test

Extend DAMON selftest-purpose sysfs wrapper to support DAMOS quota goal, and implement a simple selftest for the feature using it.

v6: mm/rmap: do not add fully unmapped large folio to deferred split list

In __folio_remove_rmap(), a large folio is added to deferred split list if any page in a folio loses its final mapping. But it is possible that the folio is fully unmapped and adding it to deferred split list is unnecessary.

v1: selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL

The failing hugetlb vmsplice() COW tests keep confusing people, and having tests that have been failing for years and likely will keep failing for years to come because nobody cares enough is rather suboptimal.

v1: Enhance soft hwpoison handling and injection

This series aims at the following enhancements:
Let one hwpoison injector, madvise(MADV_HWPOISON), behave more as if a real UE had occurred. The other two injectors, hwpoison-inject and ‘einj’ on x86, can’t, and it seems to me that we need a better simulation of a real UE scenario.
For years, if the kernel has been unable to unmap a hwpoisoned page, it has sent a SIGKILL instead of SIGBUS to prevent the user process from potentially accessing the page again. But in doing so, the user process also loses information that is important for recovery: the vaddr. Fortunately, the kernel already has code to kill a process re-accessing a hwpoisoned page, so remove the ‘!unmap_success’ check.
Right now, if a THP under GUP longterm pin is hwpoisoned and the kernel cannot split it, memory-failure simply ignores the UE and returns. That’s not ideal; it could deliver a SIGBUS with useful information for userspace recovery.

v1: by_n compression and decompression with Intel IAA

With the introduction of the ‘canned’ compression algorithm [1], we see better latencies than the ‘dynamic’ Deflate, and a better compression ratio than ‘fixed’ Deflate.

v4: memcg: reduce memory consumption by memcg stats

Most of the memory overhead of a memcg object is due to memcg stats maintained by the kernel. Since stats updates happen in performance-critical codepaths, the stats are maintained per-cpu and NUMA-specific stats are maintained per-node * per-cpu. This drastically increases the overhead on large machines, i.e. those with a large number of CPUs and multiple NUMA nodes. This patch series tries to reduce the overhead by at least not allocating the memory for stats which are not memcg specific.

v1: XArray: Set the marks correctly when splitting an entry

If we created a new node to replace an entry which had search marks set, we were setting the search mark on every entry in that node. That works fine when we’re splitting to order 0, but when splitting to a larger order, we must not set the search marks on the sibling entries.

v1: mm/debug_vm_pgtable: Test pmd_leaf() behavior with pmd_mkinvalid()

An invalidated pmd should still cause pmd_leaf() to return true. Let’s test for that to ensure all arches remain consistent.

v1: cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints

This closely resembles helpers added for the global cgroup_rstat_lock in commit fc29e04ae1ad (“cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints”). This is for the per CPU lock cgroup_rstat_cpu_lock.

v1: mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->nr_pages

A memcg pointer in the per-cpu stock can be accessed by drain_all_stock() and consume_stock() in parallel, causing a potential race.
This happens because drain_all_stock() is reading stock->nr_pages, while consume_stock() might be updating the same address, causing a potential data-race.
Make the shared addresses bulletproof with regard to reads and writes, similarly to what is done for stock->cached_objcg and stock->cached. Annotate all accesses to stock->nr_pages with READ_ONCE()/WRITE_ONCE().

v15: Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

This patchset is also available at:
https://github.com/amdese/linux/commits/snp-host-v15
and is based on top of the series:
“Add SEV-ES hypervisor support for GHCB protocol version 2”
https://lore.kernel.org/kvm/20240501071048.2208265-1-michael.roth@amd.com/
https://github.com/amdese/linux/commits/sev-init2-ghcb-v1
which in turn is based on commit 20cc50a0410f (just before v14 SNP patches):
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=kvm-coco-queue

v1: Add SEV-ES hypervisor support for GHCB protocol version 2

This patchset is also available at:
https://github.com/amdese/linux/commits/sev-init2-ghcb-v1
and is based on commit 20cc50a0410f (just before the v13 SNP patches) from:
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=kvm-coco-queue

v4: Reclaim lazyfree THP without splitting

This series adds support for reclaiming PMD-mapped THP marked as lazyfree without needing to first split the large folio via split_huge_pmd_address().

v1: exec: x86: Ensure SIGBUS delivered on MCE

To ensure it is terminated with a SIGBUS we 1. let pending work run in the bprm_execve error case.

v1: RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches

This happens because a few years ago, pin_user_pages*() APIs were upgraded to automatically migrate pages away from ZONE_MOVABLE, but the callers were not upgraded to handle any migration failures. And in fact, they can’t easily do so anyway, because the migration return code was filtered out: -EAGAIN failures from migration are squashed, along with any other failure, into -ENOMEM, thus hiding details from the upper layer callers.
Although so far I have not pinpointed the cause of such transient refcount increases, these are sufficiently common (and expected by the entire design) that I think we have enough information to proceed directly to a fix. This patch shows my preferred solution.

v1: mm/memory: cleanly support zeropage in vm_insert_page(), vm_map_pages() and vmf_insert_mixed()

There is interest in mapping zeropages via vm_insert_pages() into MAP_SHARED mappings.

This series tries to take the careful approach of only allowing the zeropage where it is likely safe to use, preventing it from accidentally getting mapped writable during a write fault, mprotect() etc, and preventing issues with FOLL_LONGTERM in the future with other users.

v1: mm/hugetlb: align cma on allocation order, not demotion order

Align the CMA area for hugetlb gigantic pages to their size, not the size that they can be demoted to. Otherwise there might be misaligned sections at the start and end of the CMA area that will never be used for hugetlb page allocations.
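
The effect of the alignment choice can be checked with simple arithmetic. This sketch assumes a 1 GiB gigantic page that can be demoted to 2 MiB pages (sizes chosen for illustration only) and uses hypothetical helper names; it counts how many gigantic pages fit in an area depending on how its base is aligned.

```python
MIB = 1 << 20
GIB = 1 << 30


def align_up(value, alignment):
    return (value + alignment - 1) // alignment * alignment


def align_down(value, alignment):
    return value // alignment * alignment


def usable_gigantic_pages(base, size, gigantic):
    """Count naturally aligned gigantic pages inside [base, base + size)."""
    start = align_up(base, gigantic)
    end = align_down(base + size, gigantic)
    return max(0, end - start) // gigantic


if __name__ == "__main__":
    size = 4 * GIB
    # Base aligned only to the demotion size (2 MiB): head and tail are wasted.
    print(usable_gigantic_pages(4 * GIB + 2 * MIB, size, GIB))
    # Base aligned to the gigantic page size (1 GiB): the whole area is usable.
    print(usable_gigantic_pages(4 * GIB, size, GIB))
```

In the first case the pieces before the first and after the last 1 GiB boundary can never back a gigantic page, which is the waste the patch description refers to.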

v3: ptdump: add intermediate directory support

Add an optional note_non_leaf parameter to ptdump, causing note_page to be called on non-leaf descriptors. Implement this functionality on arm64 by printing table descriptors along with table-specific permission sets.

v5: Fast kernel headers: split linux/mm.h

This patch set aims to clean up the linux/mm.h header and reduce dependencies on it by moving parts out.
This patch set borrows the name “fast kernel headers” from Ingo Molnar’s effort a few years ago. While this kind of refactoring does indeed improve build times because the amount of code that has to be processed in each compilation unit is reduced, build speed is the least important advantage.

File Systems

v1: epoll: try to be a bit better about file lifetimes

epoll is a mess, and does various invalid things in the name of performance.

v1: Documentation: Add initial iomap document

This adds an initial first draft of iomap documentation. Hopefully this will be useful to those who are looking to convert their filesystems to iomap. Currently this is in text format since this is the first draft.

v2: set_blocksize() rework

Branch updated and force-pushed (same place). Individual patches in followups.

v1: fs: Do not allow get_file() to resurrect 0 f_count

Failures with f_count reference counting are better contained by an actual reference counting type, like refcount_t. The series adds a refcount_long_t API and then converts f_count to refcount_long_t.

v1: Change failover behavior for DIRECT writes in ext4/block fops

The iomap_dio_rw() return code -ENOTBLK means page invalidation failed before submitting the bio.

v1: proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation

Earlier commits loosened the permissions of the /proc/&lt;pid&gt;/fdinfo/ directory, as well as the files within it, from 0500 to 0555 while also introducing a PTRACE_MODE_READ check between the current task and &lt;pid&gt;’s task. This change in behavior broke .NET prior to v7; a v7 commit inadvertently/quietly (?) fixed .NET after those kernel changes. Return to the old behavior by moving the PTRACE_MODE_READ check out of the file .open operation and into the inode .permission operation.

v1: fs/xattr: unify *at syscalls

Use the same parameter ordering for all four newly added *xattrat syscalls:
dirfd, pathname, at_flags, ...
Also consistently use unsigned int as the type for at_flags.

v7: netfs, cifs: Delegate high-level I/O to netfslib

Here are patches to convert cifs to use the netfslib library. I’ve tested them with and without a cache. There appears to be a significant performance improvement in buffered writeback (around 50% throughput rate with fio tests).
The patches remove around 2000 lines from CIFS.

v2: netfs, afs, 9p, cifs: Rework netfs to use ->writepages() to copy to cache

The primary purpose of these patches is to rework the netfslib writeback implementation such that pages read from the cache are written to the cache through ->writepages(), thereby allowing the fscache page flag to be retired.

Networking Devices

v3: net-next: net: dsa: mv88e6xxx: control mdio bus-id truncation for long paths

Compare the return value of snprintf against maximum bus-id length to detect truncation.
Truncation at the beginning was considered as a workaround, however that is still subject to name collisions in sysfs where only the first characters differ.

v6: bpf-next: Replace mono_delivery_time with tstamp_type

Introduces a new enum in skbuff.h; again, there is no change in the functionality of the existing kernel code, this just makes the code scalable.
An additional bit was added to support the TAI timestamp type, to avoid tstamp drops in the forwarding path when testing TC-ETF. This also updates bpf filter.c and the bpf header files, introducing BPF_SKB_CLOCK_TAI, with documentation updates stating the deprecation of BPF_SKB_TSTAMP_UNSPEC and BPF_SKB_TSTAMP_DELIVERY_MONO.
Handles forwarding of UDP packets with the TAI clock id tstamp_type, with supporting changes to tc_redirect/tc_redirect_dtime.

v1: net: stmmac: Initialize the other members except the est->lock

Reinitializing the whole est structure would also reset the mutex lock which is embedded in the est structure, and then trigger a warning. To address this, define all the other members except the mutex lock as a struct group and use that for the reinitialization. We also need to acquire the mutex lock when doing this initialization.

v2: net-next: selftests: drv-net: add checksum tests

Run tools/testing/selftests/net/csum.c as part of drv-net. The test direction is reversed between receive and transmit tests, so that the NIC under test is always the local machine.
Missing are the PF_PACKET based send tests (‘-P’). These use virtio_net_hdr to program hardware checksum offload, which requires looking up the local MAC address and (harder) the MAC of the next hop.

v1: net: ethernet: ti: am65-cpsw-nuss: create platform device for port nodes

After this change, an ‘of_node’ link from ‘/sys/devices/platform’ to ‘/sys/firmware/devicetree’ will be created. The ‘ethernet-ports’ device allows multiple netdevs to have the exact same parent device, e.g. port@x netdevs are child nodes of ethernet-ports.

v1: net-next: rtnetlink: more rcu conversions for rtnl_fill_ifinfo()

We no longer want to rely on RTNL for the “ip link show” command.
This is a long road; this series takes care of some parts.

v2: net-next: locking: Introduce nested-BH locking.

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within a local_bh_disable() section remains preemptible. As a result, high-prio tasks (or threaded interrupts) will be blocked by long-running lower-prio tasks (or threaded interrupts), which includes softirq sections.
The proposed way out is to introduce explicit per-CPU locks for resources which are protected by local_bh_disable() and use those only on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
The series introduces the infrastructure and converts large parts of networking, which is the largest stakeholder here. Once this is done, the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
v1…v2 https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de/:
Jakub complained about touching networking drivers to make the additional locking work. Alexei complained about the additional locking within the XDP/eBPF case. This led to a change in how the per-CPU variables are accessed for the XDP/eBPF case. On PREEMPT_RT the variables are now stored on the stack and the pointer to the structure is saved in the task_struct, while keeping everything for !RT unchanged. This was proposed as an RFC in v1: https://lore.kernel.org/all/20240213145923.2552753-1-bigeasy@linutronix.de/
and then updated

v1: net: netlink: specs: Add missing bridge linkinfo attrs

Attributes for FDB learned entries were added to the if_link netlink api for bridge linkinfo but are missing from the rt_link.yaml spec. Add the missing attributes to the spec.

v7: net-next: add DCB and DSCP support for KSZ switches

This patch series is aimed at improving support for DCB (Data Center Bridging) and DSCP (Differentiated Services Code Point) on KSZ switches.
The main goal is to introduce global DSCP and PCP (Priority Code Point) mapping support, addressing the limitation of KSZ switches not having per-port DSCP priority mapping. This involves extending the DSA framework with new callbacks for managing trust settings for global DSCP and PCP maps. Additionally, we introduce IEEE 802.1q helpers for default configurations, benefiting other drivers too.
Change logs are in separate patches.
Compared to v6 this series includes some new patches for DSCP global mapping support and QoS selftest script for KSZ9477 switches.

v2: net-next: octeontx2-pf: Treat truncation of IRQ name as an error

According to GCC, the construction of irq_name in otx2_open() may, theoretically, be truncated.
This patch takes the approach of treating such a situation as an error which it detects by making use of the return value of snprintf, which is the total number of bytes, excluding the trailing ‘\0’, that would have been written.
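
The return-value check can be modelled outside C. The sketch below imitates the documented behaviour of C’s snprintf() (store at most size - 1 bytes plus a trailing NUL, return the length that would have been written without the NUL) and then applies the truncation test. The helper names and the name format are hypothetical and not taken from the driver.

```python
def snprintf_model(size, text):
    """Model C snprintf(): return (stored bytes, would-be length).

    At most size - 1 bytes of text are stored, followed by a NUL.
    The returned length excludes the NUL and ignores the buffer size.
    """
    data = text.encode()
    stored = data[: max(size - 1, 0)] + (b"\0" if size > 0 else b"")
    return stored, len(data)


def format_irq_name(size, dev, queue):
    """Hypothetical caller that treats truncation as an error."""
    buf, needed = snprintf_model(size, f"{dev}-rxtx-{queue}")
    if needed >= size:
        # The full string plus its NUL does not fit in the buffer.
        raise OverflowError("IRQ name truncated")
    return buf


if __name__ == "__main__":
    print(format_irq_name(32, "eth0", 3))
    try:
        format_irq_name(8, "eth0", 3)
    except OverflowError as err:
        print("error:", err)
```

The comparison uses >= rather than > because the returned length does not count the terminating NUL, so a result equal to the buffer size already means one byte was lost.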

v1: wireless-next: wil6210: Do not use embedded netdev in wil6210_priv

Embedding net_device into structures prohibits the usage of flexible arrays in the net_device structure.

v1: wireless-next: wifi: ath12k: allocate dummy net_device dynamically

Embedding net_device into structures prohibits the usage of flexible arrays in the net_device structure.

v2: net: phy: bcm5481x: add support for BroadR-Reach mode

Add the 1BR10 link mode and capability to switch to BroadR-Reach as a PHY tunable value
Add the definitions of LRE registers, necessary to use BroadR-Reach modes on the BCM5481x PHY
Implementation of the BroadR-Reach modes for the Broadcom PHYs

v1: pull request (net-next): ipsec-next 2024-05-03

1) Remove obsolete UDP_ENCAP_ESPINUDP_NON_IKE support. This was defined by an early version of an IETF draft that did not make it to a standard.
2) Introduce a direction attribute for xfrm states. xfrm states have a direction; a state can be used either for input or output packet processing. Add a direction to xfrm states to make it clear what a xfrm state is used for.
All patches from Antony Antony.
Please pull or let me know if there are problems.

v1: can: xilinx_can: Document driver description to list all supported IPs

Xilinx CAN driver supports AXI CAN, AXI CANFD, CANPS and CANFD PS IPs.
Modify the dt-bindings title to indicate that both controllers are supported.
Document all supported IPs in driver comment description.

v1: Introduce auxiliary bus IRQs sysfs

It adds an ‘irqs’ directory under the auxiliary device and includes a sysfs file within it. Sometimes, the PCI SF auxiliary devices share the IRQ with other SFs, a detail that is also not available to the users.

v1: net-next: mlx5: Add netdev-genl queue stats

This change adds support for the per queue netdev-genl API to mlx5.
Qstats are lower, fetched later
This appears to mean that the netdev-genl queue stats have lower numbers than the rtnl stats even though the rtnl stats are fetched first.

v1: net-next: lib: Allow for the DIM library to be modular

Allow the Dynamic Interrupt Moderation (DIM) library to be built as a module. This is particularly useful in an Android GKI (Google Kernel Image) configuration where everything is built as a module, including Ethernet controller drivers. Having to build DIMLIB into the kernel image with potentially no user is wasteful.

v1: l2tp: Support several sockets with same IP/port quadruple

This may mean opening several sockets, but then traffic will go to only one of them, losing the traffic for the tunnel of the other socket (or leaving it up to userland, consuming a lot of cpu%).
This can also happen when the l2tp provider uses a cluster, and load-balancing happens to migrate from one origin IP to another one, for which a socket was already established. Managing reassigning tunnels from one socket to another would be very hairy for userland. This fixes the three cases altogether.

v1: net-next: selftest: epoll_busy_poll: epoll busy poll tests

Add a simple test for the epoll busy poll ioctls.
This test ensures that the ioctls have the expected return codes and that the kernel properly gets and sets epoll busy poll parameters.
The test can be expanded in the future to do real busy polling (provided another machine to act as the client is available).

v3: net: ks8851: Queue RX packets in IRQ handler instead of disabling BHs

The local_bh_disable()/local_bh_enable() approach works only in case the IRQ handler is protected by a spinlock, but does not work if the IRQ handler is protected by mutex, i.e. this works for KS8851 with Parallel bus interface, but not for KS8851 with SPI bus interface.

v1: net-next: Revert “net: mirror skb frag ref/unref helpers”

The reverted patch interacts very badly with commit 2cc3aeb5eccc (“skbuff: Fix a potential race while recycling page_pool packets”). The reverted

v1: net-next: net: no longer acquire RTNL in threaded_show()

dev->threaded can be read locklessly, if we add corresponding READ_ONCE()/WRITE_ONCE() annotations.
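
The annotation pattern mentioned above can be sketched in userspace C. This is only an illustration: the macros below are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), and struct demo_dev and both functions are hypothetical names, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel macros: each forces a
 * single access through a volatile-qualified lvalue. They rely on the
 * GCC/Clang __typeof__ extension. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical device structure; only the flag matters here. */
struct demo_dev {
	bool threaded;
};

/* Writer side: in real code this would still run under a lock. */
void demo_set_threaded(struct demo_dev *dev, bool on)
{
	assert(dev != NULL);
	WRITE_ONCE(dev->threaded, on);
}

/* Reader side: no lock taken, just one marked load. */
bool demo_show_threaded(struct demo_dev *dev)
{
	assert(dev != NULL);
	return READ_ONCE(dev->threaded);
}
```

The point of pairing the two macros is that the lockless reader and the locked writer both use marked accesses, so the compiler cannot tear, fuse or repeat the load or store.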

v1: net-next: tools: ynl: add --list-ops and --list-msgs to CLI

Add support for listing the operations. Use double space after the name for slightly easier-to-read output.

v4: net-next: netdevsim: add NAPI support

Add NAPI support to netdevsim and register its Rx queues with NAPI instances. Then add a selftest using the new netdev Python selftest infra to exercise the existing Netdev Netlink API, specifically the queue-get API.
This expands test coverage and further fleshes out netdevsim as a test device. It’s still my goal to make it useful for testing things like flow steering and ZC Rx.

v1: net: rtnetlink: Correct nested IFLA_VF_VLAN_LIST attribute validation

Each attribute inside a nested IFLA_VF_VLAN_LIST is assumed to be a struct ifla_vf_vlan_info, so the size of such an attribute needs to be at least sizeof(struct ifla_vf_vlan_info), which is 14 bytes. The current size validation in do_setvfinfo is against NLA_HDRLEN (4 bytes), which is less than sizeof(struct ifla_vf_vlan_info). This validation is therefore insufficient: a too-small attribute might be cast to a struct ifla_vf_vlan_info, which might result in an out-of-bounds read when accessing the saved (cast) entry in ivvl.
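
A minimal sketch of the difference between the two checks, assuming hypothetical stand-in types (struct demo_nlattr, struct demo_vlan_info) rather than the real kernel UAPI definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the netlink attribute header and payload;
 * the real definitions live in the kernel UAPI headers. */
struct demo_nlattr {
	uint16_t nla_len;   /* header plus payload length, in bytes */
	uint16_t nla_type;
};

struct demo_vlan_info {
	uint32_t vf;
	uint32_t vlan;
	uint32_t qos;
	uint16_t vlan_proto;
};

#define DEMO_NLA_HDRLEN ((int)sizeof(struct demo_nlattr))

/* Too-weak check: only proves that the header itself is present. */
bool demo_attr_has_header(const struct demo_nlattr *nla)
{
	assert(nla != NULL);
	return nla->nla_len >= DEMO_NLA_HDRLEN;
}

/* Stricter check: the payload must be large enough to be read as a
 * struct demo_vlan_info before anyone casts the data pointer. */
bool demo_attr_fits_vlan_info(const struct demo_nlattr *nla)
{
	assert(nla != NULL);
	if (nla->nla_len < DEMO_NLA_HDRLEN)
		return false;
	return (size_t)(nla->nla_len - DEMO_NLA_HDRLEN) >=
	       sizeof(struct demo_vlan_info);
}
```

An attribute carrying only a header passes the first check but fails the second, which is the case the stricter validation is meant to reject.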

v1: net-next: rtnetlink: rtnl_stats_dump() changes

Getting rid of RTNL in rtnl_stats_dump() looks challenging.

GIT PULL: Networking for v6.9-rc7

The changes are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-6.9-rc7
for you to fetch changes up to 78cfe547607a83de60cd25304fa2422777634712:
Including fixes from bpf.
Relatively calm week, likely due to public holiday in most places. No known outstanding regressions.
Misc: a bunch of MAINTAINERS file updates

v17: net-next: Add Realtek automotive PCIe driver

This series adds the Realtek automotive ethernet driver and an rtase ethernet driver entry in the MAINTAINERS file.
This ethernet device driver is for the PCIe interface of the Realtek Automotive Ethernet Switch, applicable to RTL9054, RTL9068, RTL9072, RTL9075, RTL9068, RTL9071.

v1: pull request (net): ipsec 2024-05-02

1) Fix an error pointer dereference in xfrm_in_fwd_icmp.
2) Preserve vlan tags for ESP transport mode software GRO.
3) Fix a spelling mistake in an uapi xfrm.h comment.

v5: net-next: Add TCP fraglist GRO support

One case where that’s currently unavoidable is when routing packets over PPPoE. Performance improves significantly when using fraglist GRO implemented in the same way as for UDP.
Here’s a measurement of running 2 TCP streams through a MediaTek MT7622 device (2-core Cortex-A53), which runs NAT with flow offload enabled from one ethernet port to PPPoE on another ethernet port + cake qdisc set to 1Gbps.

Security Enhancements

v1: stackleak: don’t modify ctl_table argument

Sysctl handlers are not supposed to modify the ctl_table passed to them. Adapt the logic to work with a temporary variable, similar to how it is done in other parts of the kernel.
This is also a prerequisite to enforce the immutability of the argument through the callback's prototype.
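
The "temporary variable" approach can be sketched as follows. The types and helpers here (struct demo_table, demo_store_int, demo_handler) are hypothetical, heavily reduced stand-ins and not the kernel's ctl_table API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for the kernel's struct ctl_table. */
struct demo_table {
	const char *name;
	int *data;
};

/* Hypothetical generic helper that writes through table->data. */
static void demo_store_int(const struct demo_table *table, int value)
{
	*table->data = value;
}

/* The handler wants the generic helper to operate on a local variable.
 * Instead of redirecting table->data in place, it copies the table and
 * redirects the copy, so the caller's table is never modified. */
int demo_handler(const struct demo_table *table, int new_value)
{
	int scratch = 0;
	struct demo_table tmp;

	assert(table != NULL);
	tmp = *table;           /* temporary copy */
	tmp.data = &scratch;
	demo_store_int(&tmp, new_value);
	return scratch;
}
```

Because the handler only ever writes to its own copy, its parameter can be declared const, which is what the prototype change mentioned above would require.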

v2: string: Add additional __realloc_size() annotations for “dup” helpers

Several other “dup”-style interfaces could use the __realloc_size() attribute. Add KUnit test coverage where possible.

v3: hardening: Enable KCFI and some other options

Add some stuff that got missed along the way

v1: next: Bluetooth: hci_conn: Use struct_size() in hci_le_big_create_sync()

Use struct_size() instead of the open-coded version. Similarly to this other patch[1].
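
As a rough userspace illustration of what an overflow-checked size helper buys over open-coded arithmetic (struct demo_msg and demo_msg_size are hypothetical names, and this is not the kernel's struct_size() implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure ending in a flexible array member. */
struct demo_msg {
	uint8_t count;
	uint16_t items[];
};

/* Overflow-checked equivalent of
 *     sizeof(struct demo_msg) + n * sizeof(uint16_t)
 * It saturates to SIZE_MAX on overflow so that a later allocation fails
 * instead of returning a too-small buffer. Relies on the GCC/Clang
 * __builtin_*_overflow builtins. */
size_t demo_msg_size(size_t n)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, sizeof(uint16_t), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct demo_msg), &bytes))
		return SIZE_MAX;
	return bytes;
}
```

With plain multiplication, a huge element count would wrap around silently and yield a small size; the checked version turns that case into an allocation failure.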

v1: next: Bluetooth: hci_sync: Use cmd->num_cis instead of magic number

At the time of the check, cmd->num_cis holds 0x1f, the maximum number of elements in the cmd->cis[] array as declared.
So, avoid using 0x1f directly, and instead use cmd->num_cis. Similarly to this other patch[1].

v1: sctp: annotate struct sctp_assoc_ids with __counted_by()

Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE .
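
A hedged sketch of what such an annotation looks like in plain C. DEMO_COUNTED_BY, struct demo_ids and demo_ids_alloc are hypothetical names for illustration; on compilers without the counted_by attribute the macro expands to nothing:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Expand to the counted_by attribute only where the compiler knows it;
 * elsewhere the annotation silently disappears. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define DEMO_COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef DEMO_COUNTED_BY
# define DEMO_COUNTED_BY(member)
#endif

/* Hypothetical structure: a counter followed by a flexible array whose
 * length the counter describes. */
struct demo_ids {
	uint32_t number_of_ids;
	int32_t  ids[] DEMO_COUNTED_BY(number_of_ids);
};

struct demo_ids *demo_ids_alloc(uint32_t n)
{
	/* n is 32-bit, so this sum cannot wrap where size_t is 64-bit;
	 * real code would use an overflow-checked size helper. */
	struct demo_ids *p = calloc(1, sizeof(*p) + (size_t)n * sizeof(int32_t));

	if (p)
		p->number_of_ids = n;   /* set the counter before touching ids[] */
	return p;
}
```

The convention worth noting is that the counter is assigned before any element of the array is accessed, since bounds checking is derived from the counter's current value.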

v3: batman-adv: Add flex array to struct batadv_tvlv_tt_data

This code was detected with the help of Coccinelle, and audited and modified manually.

v1: x86/alternatives: Make FineIBT mode Kconfig selectable

Since FineIBT performs checking at the destination, it is weaker against attacks that can construct arbitrary executable memory contents. As such, some system builders want to run with FineIBT disabled by default. Allow the “cfi=kcfi” boot param mode to be selectable through Kconfig via the newly introduced CONFIG_CFI_AUTO_DEFAULT.

v1: objtool: Provide origin hint for elf_init_reloc_text_sym() errors

An error report from elf_init_reloc_text_sym() doesn’t say what list of symbols it is working on. Include this on the caller’s side so it can be reported when pathological conditions are encountered.

v1: lkdtm: Disable CFI checking for perms functions

The EXEC_RODATA test plays a lot of tricks to live in the .rodata section, and once again ran into objtool’s (completely reasonable) assumptions that executable code should live in an executable section. However, this manifested only under CONFIG_CFI_CLANG=y, as one of the .cfi_sites was pointing into the .rodata section.

v1: PM: hibernate: replace deprecated strncpy with strscpy

This kernel config option is simply assigned with the resume_file buffer. Use strscpy as it guarantees NUL-termination on the destination buffer.
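
To illustrate the NUL-termination guarantee, here is a userspace helper written in the spirit of strscpy(). The name demo_copy_terminated and its -1 return value are assumptions made for this sketch; the kernel function has its own return convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace helper in the spirit of the kernel's strscpy():
 * it always NUL-terminates dst (when size > 0) and reports truncation.
 * Returns the number of characters copied, or -1 if src did not fit. */
long demo_copy_terminated(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	if (size == 0)
		return -1;
	len = strlen(src);
	if (len >= size) {
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}
	memcpy(dst, src, len + 1);   /* includes the terminating NUL */
	return (long)len;
}
```

By contrast, strncpy() leaves the destination unterminated whenever the source is at least as long as the buffer, which is the hazard this kind of replacement removes.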

v6: checkpatch: add check for snprintf to scnprintf

There is a general misunderstanding amongst engineers that {v}snprintf() returns the length of the data actually encoded into the destination array. To help prevent new instances of snprintf() from popping up, let’s add a check to checkpatch.pl.
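
The misunderstanding is easy to reproduce in userspace: snprintf() returns the length the output would have had, not the number of characters stored. The wrapper below, demo_scnprintf, is a hypothetical analogue of the kernel's scnprintf() written for this sketch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace analogue of the kernel's scnprintf(): returns the
 * number of characters actually stored in buf, excluding the NUL. */
int demo_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	assert(buf != NULL && fmt != NULL);
	if (size == 0)
		return 0;
	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);
	if (n < 0)
		return 0;
	/* Clamp the "would have written" length to what actually fit. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

With a 4-byte buffer and the string "hello", snprintf() reports 5 while only 3 characters plus the NUL were stored; code that advances a pointer by the return value would then step past the end of the buffer.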

v1: tty: rfcomm: prefer struct_size over open coded arithmetic

This code was detected with the help of Coccinelle, and audited and modified manually.

v1: clocksource/drivers/rda: Add sched_clock_register for RDA8810PL SoC

Add sched_clock_register during init. Bootup log before this patch:

v1: sctp: prefer struct_size over open coded arithmetic

This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2].
This code was detected with the help of Coccinelle, and audited and modified manually.

v3: perf/x86/amd/uncore: Use kcalloc() instead of kzalloc()

This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows.

v1: Input: ff-core - prefer struct_size over open coded arithmetic

This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows.
This code was detected with the help of Coccinelle, and audited and modified manually.

Asynchronous IO

v1: io_uring/io-wq: Use set_bit() and test_bit() at worker->flags

The structure io_worker->flags may be accessed through parallel data paths, leading to concurrency issues. When KCSAN is enabled, it reveals data races occurring in io_worker_handle_work and io_wq_activate_free_worker functions.
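
The idea behind switching to atomic bit operations can be sketched with C11 atomics. The names below (struct demo_worker, demo_set_bit and friends) are hypothetical userspace stand-ins, not the kernel's set_bit()/test_bit() implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers for a worker. */
enum { DEMO_WORKER_UP = 0, DEMO_WORKER_RUNNING = 1 };

struct demo_worker {
	atomic_ulong flags;
};

/* Atomic read-modify-write, so concurrent updates of different bits
 * cannot overwrite each other the way a plain "flags |= bit" can. */
void demo_set_bit(int nr, struct demo_worker *w)
{
	assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * 8));
	atomic_fetch_or(&w->flags, 1UL << nr);
}

void demo_clear_bit(int nr, struct demo_worker *w)
{
	assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * 8));
	atomic_fetch_and(&w->flags, ~(1UL << nr));
}

bool demo_test_bit(int nr, struct demo_worker *w)
{
	assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * 8));
	return (atomic_load(&w->flags) >> nr) & 1UL;
}
```

A non-atomic update is a separate load, modify and store; if two paths do that at once on the same word, one of the updates can be lost, which is the kind of race a tool like KCSAN reports.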

v1: io_uring: Require zeroed sqe->len on provided-buffers send

Make the interface less tricky by forcing the length to only come from the buffer ring entry itself.

Rust For Linux

v2: rust: add ‘firmware’ field support to module! macro

This adds ‘firmware’ field support to the module! macro, which corresponds to the MODULE_FIRMWARE macro. You can specify the file names of binary firmware that the kernel module requires. The information is embedded in the modinfo section of the kernel module. For example, a tool to build an initramfs uses this information to put the firmware files into the initramfs image.

v1: WIP: Draft: Alternative allocator support

This patch series does not raise a claim for inclusion in the kernel (yet) and, instead, serves as a baseline for further discussion.
This patch series re-enables the allocator_api and implements an extension trait AllocatorWithFlags expanding the Allocator trait to allow for arbitrary allocator implementations.

v1: rust: hrtimer: introduce hrtimer support

This patch adds support for intrusive use of the hrtimer system.
This patch is very similar to the workqueue I implemented. Assuming that this mirrors the workqueue, this is not necessary: the timer owns a refcount to the element, so the destructor cannot run while the timer is scheduled.
The documentation says that this is implemented by pointers to structs, but that is not the case.

BPF

v1: bpf-next: use network helpers, part 4

This patchset adds a post_socket_cb pointer together with ‘struct post_socket_opts cb_opts’ into struct network_helper_opts to make the start_server_addr() helper more flexible. With these modifications, much duplicate code can be dropped.

v1: bpf-next: bpf: avoid clang-specific push/pop attribute pragmas in bpftool

This patch modifies bpftool in order to, instead of using the pragmas, define ATTR_PRESERVE_ACCESS_INDEX to conditionally expand to the CO-RE attribute:

v1: bpf-next: selftests/bpf: Use bpf_tracing.h instead of bpf_tcp_helpers.h

The bpf programs that this patch changes require the BPF_PROG macro. The idea is to retire bpf_tcp_helpers.h and consistently use vmlinux.h for the tests that require the kernel sockets. This patch tackles the obvious tests that can directly use bpf_tracing.h instead of bpf_tcp_helpers.h.

v2: bpf-next: bpf: avoid casts from pointers to enums in bpf_tracing.h

This patch fixes this by avoiding intermediate casts to void*, replacing them with casts to ‘unsigned long long’, which is an integer type capable of safely storing a BPF pointer, much like the standard uintptr_t.

v6: bpf-next: bpf: Inline helpers in arm64 and riscv JITs

v1: bpf-next: bpf: missing trailing slash in tools/testing/selftests/bpf/Makefile

This patch fixes the problem by adding the missing slash in the value for BPFTOOL_OUTPUT in the $(OUTPUT)/runqslower rule.

v3: net-next: Add new args into tcp_congestion_ops’ cong_control

This patchset attempts to add two new arguments into the hookpoint cong_control in tcp_congestion_ops. The new arguments are inherited from the caller tcp_cong_control and can be used by any bpf cc prog that implements its own logic inside this hookpoint.

v2: perf/core: Save raw sample data conditionally based on sample type

This patch checks sample type of an event before saving raw sample data in both BPF output and tracepoint event handling logic. Raw sample data will only be saved if explicitly requested, reducing overhead when it is not needed.

v3: bpf-next: Enable BPF programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head.

The patch set aims to enable the use of these specific types in arrays and struct fields, providing flexibility. It examines the types of global variables or the value types of maps, such as arrays and struct types, recursively to identify these special types and generate field information for them.
The btf_features list can be used for pahole v1.26 and later - it is useful because if a feature is not yet implemented it will not exit with a failure message. This will allow us to add feature requests to the pahole options without having to check pahole versions in future; if the version of pahole supports the feature it will be added.

v1: arm64: implement raw_smp_processor_id() using thread_info

ARM64 defines THREAD_INFO_IN_TASK which means the cpu id can be found from current_thread_info()->cpu.
This improvement is in this very specific microbenchmark but it proves the point.
The percpu variable cpu_number is left as it is because it is used in set_smp_ipi_range().

v5: bpf-next: bpf, arm64: Support per-cpu instruction

v2: bpf-next: ARC: Add eBPF JIT support

This will add eBPF JIT support to the 32-bit ARCv2 processors. The implementation is qualified by running the BPF tests on a Synopsys HSDK board with “ARC HS38 v2.1c at 500 MHz” as the 4-core CPU.

v1: bpf-next: bpf: Allow skb dynptr for tp_btf

This makes bpf_dynptr_from_skb usable for tp_btf, so that we can easily parse skb in tracepoints. This has been discussed in [0], and Martin suggested to use dynptr (instead of helpers like bpf_skb_load_bytes).

v3: bpf-next: bpf_wq followup series

Few patches that should have been there from day 1.
Anyway, they are coming now.

v2: bpf-next: libbpf: support “module:function” syntax for tracing programs

In some situations, it is useful to explicitly specify a kernel module to search for a tracing program target (e.g. when a function of the same name exists in multiple modules or in vmlinux).
This change enables that by allowing the “module:function” syntax for the find_kernel_btf_id function. Thanks to this, the syntax can be used both from a SEC macro (i.e. SEC("fentry/module:function")) and via the bpf_program__set_attach_target API call.
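
Purely as an illustration of the “module:function” convention, the following hypothetical helper splits such a target string. It is not libbpf code, and demo_split_target, "mymod" and "my_func" are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser: splits "module:function" into its two parts.
 * On return *mod_len is 0 when no module prefix is present, and the
 * returned pointer addresses the function name inside the input. */
const char *demo_split_target(const char *target, size_t *mod_len)
{
	const char *sep;

	assert(target != NULL && mod_len != NULL);
	sep = strchr(target, ':');
	if (!sep) {
		*mod_len = 0;
		return target;
	}
	*mod_len = (size_t)(sep - target);
	return sep + 1;
}
```

For "mymod:my_func" this yields a module prefix of length 5 and the function name "my_func"; a bare "my_func" yields no module prefix, which corresponds to searching everywhere as before.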

v2: bpf_wq followup series

Few patches that should have been there from day 1.

v9: dwarves: pahole: Inject kfunc decl tags into BTF

This patchset teaches pahole to parse symbols in .BTF_ids section in vmlinux and discover exported kfuncs. Pahole then takes the list of kfuncs and injects a BTF_KIND_DECL_TAG for each kfunc.
This enables downstream users and tools to dynamically discover which kfuncs are available on a system by parsing vmlinux or module BTF, both available in /sys/kernel/btf.
This feature is enabled with --btf_features=decl_tag,decl_tag_kfuncs.

v3: bpf-next: selftests/bpf: Add sockaddr tests for kernel networking

This patch series adds test coverage for BPF sockaddr hooks and their interactions with kernel socket functions (i.e. kernel_bind(), kernel_connect(), kernel_sendmsg(), sock_sendmsg(), kernel_getpeername(), and kernel_getsockname()) while also rounding out IPv4 and IPv6 sockaddr hook coverage in prog_tests/sock_addr.c.

v1: bpf-next: Notify user space when a struct_ops object is detached/unregistered

The subsystems consuming struct_ops objects may need to detach or unregister a struct_ops object due to errors or other reasons. It would be useful to notify user space programs so that error recovery or logging can be carried out.

v4: bpf-next: bpf/verifier: range computation improvements

New version. There is one extra patch which implements some code improvements suggested by Andrii.

v1: bpf-next: bpf: Fold LSH and ARSH pair to a single MOVSX for sign-extension

LLVM generates a left-shift (LSH) and arithmetic-right-shift (ARSH) instruction pair to implement sign-extension. For x86 and arm64, this instruction pair will be folded to a single instruction, but the LLVM BPF backend does not do such folding.

Since 32-bit to 64-bit sign-extension is a common case and we already have MOVSX instruction for sign-extension, this patch tries to fold the 32-bit to 64-bit LSH and ARSH pair to a single MOVSX instruction.
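
The equivalence that makes this folding valid can be checked in plain C. This is a sketch of the arithmetic only, not of the JIT or LLVM code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 32 bits of v with a shift pair: move bit 31 up to
 * bit 63, then shift back arithmetically. Converting an out-of-range
 * value to int64_t and right-shifting a negative value are both
 * implementation-defined in C; this sketch assumes the two's-complement,
 * arithmetic-shift behavior of mainstream compilers. */
int64_t demo_sext32_shifts(uint64_t v)
{
	return (int64_t)(v << 32) >> 32;
}

/* The same result expressed as one sign-extending conversion, which is
 * the operation a single MOVSX-style instruction performs. */
int64_t demo_sext32_cast(uint64_t v)
{
	return (int64_t)(int32_t)(uint32_t)v;
}
```

Both functions discard the upper 32 bits of the input and replicate bit 31 into them, so any left-shift-by-32 followed by arithmetic-right-shift-by-32 can be replaced with the single conversion.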

v1: bpf-next: Free strdup memory in selftests

Two fixes to free strdup memory in selftests to avoid memory leaks.

v1: samples: bpf: Add valid info for VMLINUX_BTF

Set the path in the error info, which seems more intuitive.

v1: bpf-next: bpf: add support to read cpu_entry in bpf program

Add new field “cpu_entry” to bpf_perf_event_data which could be read by bpf programs attached to perf events. The value contains the CPU value recorded by specifying sample_type with PERF_SAMPLE_CPU when calling perf_event_open().

Related Technology Updates

Qemu

v5: riscv: thead: Add th.sxstatus CSR emulation

The th.sxstatus CSR can be used to identify available custom extensions on T-Head CPUs. An important property of this patch is that the th.sxstatus MAEE field is not set (indicating that XTheadMae is not available). XTheadMae is a memory attribute extension (similar to Svpbmt) which is implemented in many T-Head CPUs (C906, C910, etc.) and utilizes bits in PTEs that are marked as reserved. QEMU maintainers prefer not to implement XTheadMae, so we need to give kernels a mechanism to identify whether XTheadMae is available in a system or not. This patch introduces this mechanism in QEMU in a way that’s compatible with real HW.

Buildroot

configs/pine64_star64: new defconfig

This patch adds a new defconfig for the Star64 board made by Pine64. This board is based on the Starfive JH7110 64-bit RISC-V SoC. This patch uses a custom kernel and U-Boot made for this board. The SPL has to be signed with the Starfive SPL-Tool, a piece of software provided by the vendor, to get the necessary headers on the SPL.