泰晓科技 -- Focus on Linux: trace things to their source, see the big picture in the small details!
Website: https://tinylab.org

RISC-V Linux Kernel and Related Technology News, Issue 58

Written by 呀呀呀 on 2023/09/06

Date: 20230903
Editor: 晓依
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support

v2: RISCV: Add kvm Sstc timer selftest

The RISC-V arch_timer selftest is used to validate Sstc timer functionality in a guest. It sets up periodic timer interrupts and checks the basic interrupt status upon their receipt.

GIT PULL: RISC-V Patches for the 6.6 Merge Window, Part 1

The following changes since commit 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5:

Linux 6.5-rc1 (2023-07-09 13:53:13 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git tags/riscv-for-linus-6.6-mw1

for you to fetch changes up to 89775a27ff6d0396b44de0d6f44dcbc25221fdda:

lib/Kconfig.debug: Restrict DEBUG_INFO_SPLIT for RISC-V (2023-08-31 00:18:37 -0700)

RISC-V Patches for the 6.6 Merge Window, Part 1

  • Support for the new “riscv,isa-extensions” and “riscv,isa-base” device tree interfaces for probing extensions.
  • Support for userspace access to the performance counters.
  • Support for more instructions in kprobes.
  • Crash kernels can be allocated above 4GiB.
  • Support for KCFI.
  • Support for ELFs in !MMU configurations.
  • ARCH_KMALLOC_MINALIGN has been reduced to 8.
  • mmap() defaults to sv48-sized addresses, with longer addresses hidden behind a hint (similar to Arm and Intel).
  • Also various fixes and cleanups.

v5: riscv: add userland instruction dump to RISC-V splats

Add userland instruction dump and rename dump_kernel_instr() to dump_instr().

v1: soc: renesas: Kconfig: For ARCH_R9A07G043 select the required configs if dependencies are met

To prevent randconfig build issues when enabling the RZ/Five SoC, consider selecting specific configurations only when their dependencies are satisfied.

v1: riscv: Kconfig.errata: Add dependency for RISCV_SBI in ERRATA_ANDES config

Andes errata uses sbi_ecall(), which is only available if RISCV_SBI is enabled. So add a dependency on RISCV_SBI to the ERRATA_ANDES config to avoid build failures.

v1: riscv: Kconfig: Select DMA_DIRECT_REMAP only if MMU is enabled

kernel/dma/mapping.c has its use of pgprot_dmacoherent() inside an #ifdef CONFIG_MMU block. kernel/dma/pool.c has its use of pgprot_dmacoherent() inside an #ifdef CONFIG_DMA_DIRECT_REMAP block. So select DMA_DIRECT_REMAP only if MMU is enabled for RISCV_DMA_NONCOHERENT config.

v3: Enable 4-bit tx support

This patch series aims to enable 4-bit tx support for RZ/{G2L,G2LC,G2UL,V2L} SMARC EVKs.

This patch series depends upon [1].

[1] https://lore.kernel.org/all/20230830145835.296690-1-biju.das.jz@bp.renesas.com/

v3: kbuild: Show marked Kconfig fragments in “help”

Currently the Kconfig fragments in kernel/configs and arch/*/configs that aren’t used internally aren’t discoverable through “make help”, which consists of hard-coded lists of config fragments. Instead, list all the fragment targets that have a “# Help: “ comment prefix so the targets can be generated dynamically.

v2: RISC-V: Enable cbo.zero in usermode

In order for usermode to issue cbo.zero, it needs privilege granted to issue the extension instruction (patch 2) and to know that the extension is available and its block size (patch 3). Patch 1 could be separate from this series (it just fixes up some error messages), patches 4-5 convert the hwprobe selftest to a statically-linked, TAP test and patch 6 adds a new hwprobe test for the new information as well as testing CBO instructions can or cannot be issued as appropriate.

v2: riscv: correct pt_level name via pgtable_l5/4_enabled

Sorry for just re-sending your patch, but I’d had this build fix floating around as a suggestion and figured it’d be easier to just send it so I can take this more quickly. That patch looks cleaner to me, but happy to hear if anyone has a better way to do it.

v1: Change tuning implementation

This series of patches changes the tuning implementation, from the previous way of reading and writing system controller registers to reading and writing UHS_REG_EXT register, thus optimizing the tuning of obtaining delay-chain.

v1: riscv: kprobes: allow writing to x0

Instructions can write to x0, so we should simulate these instructions normally.
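
As a rough illustration of why such instructions can be simulated normally, here is a minimal sketch in Python. It assumes a toy register file; `RegFile` and `simulate_addi` are hypothetical names for this example and are not the kernel's kprobes code. Since x0 is hard-wired to zero, a simulated write to it can simply be discarded.

```python
class RegFile:
    """Toy model of the 32 RISC-V integer registers (hypothetical helper)."""

    def __init__(self):
        self.regs = [0] * 32

    def write(self, rd, value):
        # x0 is hard-wired to zero: the write is legal but has no effect.
        if rd != 0:
            self.regs[rd] = value & 0xFFFFFFFFFFFFFFFF  # model RV64 width

    def read(self, rs):
        return self.regs[rs]


def simulate_addi(regs, rd, rs1, imm):
    """Simulate `addi rd, rs1, imm` without rejecting rd == x0."""
    regs.write(rd, regs.read(rs1) + imm)


regs = RegFile()
simulate_addi(regs, 5, 0, 42)   # addi x5, x0, 42
simulate_addi(regs, 0, 5, 1)    # addi x0, x5, 1: simulated, result discarded
print(regs.read(5), regs.read(0))
```

The point of the sketch is that the simulator needs no special rejection path for `rd == 0`; dropping the result is enough.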

v2: riscv: provide riscv-specific is_trap_insn()

uprobes expects is_trap_insn() to return true for any trap instruction, not just the one used for installing a uprobe. The current default implementation only returns true for the 16-bit c.ebreak if the C extension is enabled. This can confuse uprobes if a 32-bit ebreak generates a trap exception from userspace: uprobes asks is_trap_insn(), which says there is no trap, so uprobes assumes a probe was there before but has been removed, and returns to the trap instruction. This causes an infinite loop of entering and exiting the trap handler.
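
The idea can be sketched in Python as follows. This is an illustration rather than the patch itself: the two constants are the standard RISC-V encodings of `ebreak` and `c.ebreak`, while the function signature and the `has_c_extension` parameter are assumptions made for this example.

```python
EBREAK_32 = 0x00100073    # 32-bit ebreak encoding
C_EBREAK_16 = 0x9002      # 16-bit c.ebreak encoding (C extension)


def is_trap_insn(insn, has_c_extension=True):
    """Return True for any breakpoint trap encoding, not just the probe's own."""
    if (insn & 0x3) == 0x3:
        # Low two bits 0b11 mark a 32-bit instruction.
        return (insn & 0xFFFFFFFF) == EBREAK_32
    # Otherwise the low halfword is a 16-bit compressed instruction.
    return has_c_extension and (insn & 0xFFFF) == C_EBREAK_16
```

With a check like this, a 32-bit `ebreak` executed by userspace is reported as a trap, so the caller does not mistake it for a removed probe.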

v3: riscv: SCS support

This series adds Shadow Call Stack (SCS) support for RISC-V. SCS uses compiler instrumentation to store return addresses in a separate shadow stack to protect them against accidental or malicious overwrites. More information about SCS can be found here:

https://clang.llvm.org/docs/ShadowCallStack.html

Process Scheduling

v1: sched/rt: Disallow writing invalid values to sched_rt_period_us

The validation of the value written to sched_rt_period_us was broken because:

  • sysctl_sched_rt_period is declared as unsigned int
  • it is parsed by proc_dointvec()
  • the range is asserted only after the value has been parsed by proc_dointvec()
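
One way these three points can combine is sketched below in Python. This is a user-space model under stated assumptions, not kernel code: all function names are hypothetical, and the bounds used by `fixed_parse` are illustrative defaults rather than values taken from the patch.

```python
def parse_as_int_store_as_u32(text):
    """Model a signed-int parser whose result lands in an unsigned int."""
    value = int(text)
    if not -2**31 <= value <= 2**31 - 1:
        raise ValueError("out of range for int")
    return value & 0xFFFFFFFF   # reinterpret the bits as unsigned


def broken_check(period_u32):
    # On an unsigned value, a "<= 0" test only catches 0, never a
    # negative input, so this returns True for a wrapped-around value.
    return period_u32 > 0


def fixed_parse(text, lo=1, hi=2**31 - 1):
    """Reject out-of-range input while parsing, before storing anything."""
    value = int(text)
    if not lo <= value <= hi:
        raise ValueError("invalid period")
    return value


stored = parse_as_int_store_as_u32("-1")
print(hex(stored), broken_check(stored))
```

Validating the range as part of parsing, as in `fixed_parse`, means an invalid value is never stored in the first place.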

v1: powerpc/smp: Shared processor sched optimizations

PowerVM systems configured in shared processors mode have some unique challenges. Some device-tree properties will be missing on a shared processor. Hence some sched domains may not make sense for shared processor systems.

v2: freezer,sched: Use saved_state to reduce some spurious wakeups

After commit f5d39b020809 (“freezer,sched: Rewrite core freezer logic”), tasks that transition directly from TASK_FREEZABLE to TASK_FROZEN are always woken up on the thaw path. Prior to that commit, tasks could ask freezer to consider them “frozen enough” via freezer_do_not_count(). The commit replaced freezer_do_not_count() with a TASK_FREEZABLE state which allows freezer to immediately mark the task as TASK_FROZEN without waking up the task. This is efficient for the suspend path, but on the thaw path, the task is always woken up even if the task didn’t need to wake up and goes back to its TASK_(UN)INTERRUPTIBLE state. Although these tasks are capable of handling the wakeup, we can observe a power/perf impact from the extra wakeup.

v1: sched: add kernel-doc for set_cpus_allowed_ptr

This is an exported symbol, so it should have kernel-doc. Add a note to the very similar function do_set_cpus_allowed() to avoid confusion and misuse.

Memory Management

v1: mm, memcg: expose swapcache stat for memcg v1

Since commit b6038942480e (“mm: memcg: add swapcache stat for memcg v2”) adds swapcache stat for the cgroup v2, it seems there is no reason to hide it in memcg v1. Conversely, with swapcached it is more accurate to evaluate the available memory for memcg.

v1: mm/khugepaged: make reserved memory adaptively

In the 64k page configuration of ARM64, the size of a THP is 512MB, which usually reserves almost 5% of memory. However, the probability of THP usage is not high, especially in the madvise configuration. THP is not usually used, yet a large amount of memory is reserved for it, resulting in a lot of memory waste.

v1: zram: support for specific numa node for zram

This patch series adds a parameter “numa_id” to zram to support the use of memory in a specific node, and attempts to obtain the benefits of using kvzalloc_node to obtain huge page table mappings.

v1: net-next: sock: Be aware of memcg pressure on alloc

As a cloud service provider, we encountered a problem in our production environment during the transition from cgroup v1 to v2 (partly due to the heavy taxes of accounting socket memory in v1). Say one workload behaves fine in cgroupv1 with memcg limit configured to 10GB memory and another 1GB tcpmem, but will suck (or even be OOM-killed) in v2 with 11GB memory due to burst memory usage on socket, since there is no specific limit for socket memory in cgroupv2 and relies largely on workloads doing traffic control themselves.

v4: memcg: non-unified flushing for userspace stats

Most memcg flushing contexts use “unified” flushing, where only one flusher is allowed at a time (others skip), and all flushers need to flush the entire tree. This works well with high concurrency, which mostly comes from in-kernel flushers (e.g. reclaim, refault, ..).

v1: mm: make __GFP_SKIP_ZERO visible to skip zero operation

There is no explicit gfp flag to let an allocation skip the zeroing operation when CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y. I would like to make __GFP_SKIP_ZERO visible even if kasan is not configured.

GIT PULL: percpu changes for v6.6-rc1

There is 1 bigger change to percpu_counter’s api allowing for init and destroy of multiple counters via percpu_counter_init_many() and percpu_counter_destroy_many(). This is used to help begin remediating a performance regression with percpu rss stats.

v2: x86/clear_huge_page: multi-page clearing

This series adds a multi-page clearing primitive, clear_pages(), which enables more effective use of x86 string instructions by advertising the real region-size to be cleared.

v1: Use nth_page() in place of direct struct page manipulation

On SPARSEMEM without VMEMMAP, struct page is not guaranteed to be contiguous, since each memory section’s memmap might be allocated independently. hugetlb pages can go beyond a memory section size, thus direct struct page manipulation on hugetlb pages/subpages might give the wrong struct page. The kernel provides nth_page() to do the manipulation properly. Use that whenever code can see hugetlb pages.
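
The difference between direct pointer arithmetic and nth_page() can be modelled in a few lines of Python. This is a toy sketch: the section size, the struct page size and the memmap base addresses are all made-up values chosen for illustration, and struct pages are represented by plain integer addresses.

```python
PAGES_PER_SECTION = 4          # tiny value for illustration only
STRUCT_PAGE_SIZE = 64          # assumed size of one struct page

# Each section's memmap is allocated independently, so the base addresses
# of consecutive sections need not be adjacent (values are made up).
SECTION_MEMMAP_BASE = {0: 0x1000, 1: 0x9000}


def pfn_to_page(pfn):
    section, index = divmod(pfn, PAGES_PER_SECTION)
    return SECTION_MEMMAP_BASE[section] + index * STRUCT_PAGE_SIZE


def page_to_pfn(page):
    for section, base in SECTION_MEMMAP_BASE.items():
        offset = page - base
        if 0 <= offset < PAGES_PER_SECTION * STRUCT_PAGE_SIZE:
            return section * PAGES_PER_SECTION + offset // STRUCT_PAGE_SIZE
    raise ValueError("not a valid struct page address")


def nth_page(page, n):
    """Go through the pfn so that section boundaries are handled."""
    return pfn_to_page(page_to_pfn(page) + n)


head = pfn_to_page(2)
naive = head + 3 * STRUCT_PAGE_SIZE   # "page + 3" pointer arithmetic
print(hex(naive), hex(nth_page(head, 3)))
```

In this model the naive result falls outside every section's memmap, while the pfn-based lookup lands in the second section's memmap as intended.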

v2: Introduce __mt_dup() to improve the performance of fork()

In the process of duplicating mmap in fork(), VMAs are inserted into the new maple tree one by one. During these insertions, the maple tree is rebalanced multiple times, and rebalancing a maple tree is slower than rebalancing a red-black tree. Therefore, __mt_dup() is introduced to directly duplicate the structure of the old maple tree, and then modify each element of the new maple tree. This avoids rebalancing and some extra copying, so it is faster than the original method. More information can be found in [1].

v2: Optimize mmap_exit for large folios

This is v2 of a series to improve performance of process teardown, taking advantage of the fact that large folios are increasingly regularly pte-mapped in user space; supporting filesystems already use large folios for pagecache memory, and large folios for anonymous memory are (hopefully) on the horizon.

v5: mm: vmscan: try to reclaim swapcache pages if no swap space

When spaces of swap devices are exhausted, only file pages can be reclaimed. But there are still some swapcache pages in anon lru list. This can lead to a premature out-of-memory.

v1: hugetlb: set hugetlb page flag before optimizing vmemmap

Currently, vmemmap optimization of hugetlb pages is performed before the hugetlb flag (previously hugetlb destructor) is set identifying it as a hugetlb folio. This means there is a window of time where an ordinary folio does not have all associated vmemmap present. The core mm only expects vmemmap to be potentially optimized for hugetlb and device dax. This can cause problems in code such as memory error handling that may want to write to tail struct pages.

v1: stackdepot: allow evicting stack traces

Currently, the stack depot grows indefinitely until it reaches its capacity. Once that happens, the stack depot stops saving new stack traces.

This creates a problem for using the stack depot for in-field testing and in production.

v2: Mitigate a vmap lock contention v2

Hello, folk!

This is v2 of the series, which aims to minimize vmap lock contention. It is based on the tag v6.5-rc6. Documentation about it can be found here:

wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

Even though it is a bit outdated (it follows v1), it still gives a good overview of the problem and how it can be solved. I can update it on request.

v1: mm: vmscan: use per-zone watermark when determine file_is_tiny

When setting swappiness to 0, the anon pages should be reclaimed if and only if the value of file_is_tiny is true.

__zone_watermark_ok uses the per-zone watermark and lowmem_reserve to determine whether a page can be allocated from the zone. Meanwhile, file_is_tiny is calculated from the per-node watermark. There are inconsistencies between the two scenarios.

v4: MDWE without inheritance

Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags current with a flag that prevents pages that were previously not executable from becoming executable. This tag always gets inherited by children tasks. (it’s in MMF_INIT_MASK)

v1: Add printf attribute to kselftest functions

Kselftest.h declares many variadic functions that can print some formatted message while also executing selftest logic. These declarations don’t have any compiler mechanism to verify if passed arguments are valid in comparison with format specifiers used in printf() calls.

File Systems

v1: fcntl: add fcntl(F_CHECK_ORIGINAL_MEMFD)

This change introduces a new fcntl to check if an fd points to a memfd’s original open fd (the one created by memfd_create).

v2: efivarfs: Add Mount Option For Efivarfs

We want to support fwupd for updating system firmware on Reven. Capsule updates need to create UEFI variables. Our current approach to UEFI variables of just allowing access to a static list of them at boot time won’t work here.

v1: kernel: Add Mount Option For Efivarfs

Add uid and gid to efivarfs’s mount options, so that we can mount the file system with ownership. This approach is used by a number of other filesystems that don’t have native support for ownership.

v1: vfs: add inode lockdep assertions

Thread “Use exclusive lock for file_remove_privs” [1] reports an issue which should have been found by asserts – inode not write locked by the caller.

It did not happen because the attempt to do it in notify_change:

    WARN_ON_ONCE(!inode_is_locked(inode));

passes if the inode is only read-locked:

    static inline int rwsem_is_locked(struct rw_semaphore *sem)
    {
        return atomic_long_read(&sem->count) != 0;
    }
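
Why the quoted check cannot tell a reader from a writer can be seen in a small Python model of the count word. This is a simplified sketch: it models only a writer bit and a per-reader bias, and `rwsem_is_write_locked` is a hypothetical helper for this example, not necessarily the assertion the patch adds.

```python
RWSEM_WRITER_LOCKED = 1 << 0   # writer bit in the count word
RWSEM_READER_BIAS = 1 << 8     # each reader adds this to the count


def rwsem_is_locked(count):
    """Mirrors the check quoted above: any holder makes this true."""
    return count != 0


def rwsem_is_write_locked(count):
    """Stricter check (hypothetical helper): true only for a writer."""
    return bool(count & RWSEM_WRITER_LOCKED)


read_locked = RWSEM_READER_BIAS       # one reader holds the lock
write_locked = RWSEM_WRITER_LOCKED    # one writer holds the lock
print(rwsem_is_locked(read_locked), rwsem_is_write_locked(read_locked))
```

A read-locked semaphore satisfies the first check but not the second, which is the gap that dedicated write-lock assertions are meant to close.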

v2: security: Move IMA and EVM to the LSM infrastructure

IMA and EVM are not effectively LSMs, especially due to the fact that in the past they could not provide a security blob while there is another LSM active.

That changed in recent years: the LSM stacking feature now makes it possible to stack together multiple LSMs, and allows them to provide a security blob for most kernel objects. While the LSM stacking feature has some limitations being worked out, it is already suitable to make IMA and EVM LSMs.

In short, while this patch set is big, it does not make any functional change to IMA and EVM. IMA and EVM functions are called by the LSM infrastructure in the same places as before (except ima_post_path_mknod()), rather than being hardcoded calls, and the inode metadata pointer is directly stored in the inode security blob rather than in a separate rbtree.

More specifically, patches 1-11 make IMA and EVM functions suitable to be registered to the LSM infrastructure, by aligning function parameters.

v1: NFS: switch back to using kill_anon_super

NFS switched to open coding kill_anon_super in 7b14a213890a (“nfs: don’t call bdi_unregister”) to avoid the extra bdi_unregister call. At that point bdi_destroy was called in nfs_free_server and thus it required a later freeing of the anon dev_t. But since 0db10944a76b (“nfs: Convert to separately allocated bdi”) the bdi has been freed implicitly by the sb destruction, so this isn’t needed anymore.

v3: Supporting same fsid mounting through the single-dev compat_ro feature

The goal is to allow btrfs to mount filesystems with the same fsid at the same time; for more details, please take a look in the:

v1: fs: have setattr_copy handle multigrain timestamps appropriately

The setattr codepath is still using coarse-grained timestamps, even on multigrain filesystems. To fix this, we need to fetch the timestamp for ctime updates later, at the point where the assignment occurs in setattr_copy.

v1: Document impact of user namespaces and negative permissions

I’m sending out this patch series to document the current situation regarding negative permissions and user namespaces.

From what I understand, the general agreement is that negative permissions are not recommended and should be avoided. This is why the ability to somewhat bypass these permissions using user namespaces is tolerated, as it’s deemed not worth the complexity to address this without breaking existing programs such as podman.

GIT PULL: sysctl changes for v6.6-rc1

The following changes since commit 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5:

Linux 6.5-rc1 (2023-07-09 13:53:13 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/sysctl-6.6-rc1

for you to fetch changes up to 53f3811dfd5e39507ee3aaea1be09aabce8f9c98:

v3: 0/5: fuse direct write consolidation and parallel IO

This series consolidates DIO writes into a single code path via fuse_cache_write_iter/generic_file_direct_write. Before it was only used for O_DIRECT and when writeback cache was not enabled. For server/daemon dio enforcement (FOPEN_DIRECT_IO) another code path was used before, but I think that is not needed and just IOCB_DIRECT needs to be set/enforced. When writeback-cache was enabled another code path was used, with a fallback to write-through - for direct IO that should not be needed either.

v1: mtd: switch to keying by dev_t

For this cycle Jan, Christoph, and myself switched the generic super code to key superblocks for block devices by device number (sb->s_dev) instead of block device pointers (sb->s_bdev).

v2: xarray: Document necessary flag in alloc functions

Adds a new line to the docstrings of functions wrapping __xa_alloc() and __xa_alloc_cyclic(), informing about the necessity of flag XA_FLAGS_ALLOC being set previously.

The documentation so far says that functions wrapping __xa_alloc() and __xa_alloc_cyclic() are supposed to return either -ENOMEM or -EBUSY in case of an error. If the xarray has been initialized without the flag XA_FLAGS_ALLOC, however, they fail with a different, undocumented error code.

v1: vfs: use helpers for calling f_op->{read,write}_iter() in read_write.c

Use helpers for calling f_op->{read,write}_iter() in read_write.c.

v1: blk: optimization for classic polling

This removes the dependency on interrupts to wake up the task. While polling for IO completion, set the task state to TASK_RUNNING if need_resched() returns true. Earlier, the polling task used to sleep, relying on an interrupt to wake it up. This made some IO take very long when interrupt coalescing is enabled in NVMe.

Networking

v4: bpf-next: add BPF_F_PERMANENT flag for sockmap skmsg redirect

v3->v4: Change the description of the two helpers. Let BPF_F_PERMANENT take precedence over apply/cork_bytes.

v2: wpan-next: ieee802154: Associations between devices

[I know we are in the middle of the merge window, I don’t think it matters on the wpan side, so as the wpan subsystem did not evolve much since the previous merge window I figured I would not delay the sending of this series given the fact that I should have sent it at the beginning of the summer…]

Now that we can discover our peer coordinators or make ourselves dynamically discoverable, we may use the information about surrounding devices to create PANs dynamically. This involves of course:

  • Requesting an association to a coordinator, waiting for the response
  • Sending a disassociation notification to a coordinator
  • Receiving an association request when we are coordinator, answering the request (for now all devices are accepted up to a limit, to be refined)
  • Sending a disassociation notification to a child
  • Users may request the list of associated devices (the parent and the children).

v4: can: xilinx_can: Add ECC feature support

Add ECC feature support to the Tx and Rx FIFOs of the Xilinx CAN Controller. As part of this feature, configuration and counter registers are added in the Xilinx AXI CAN Controller for counting and resetting 1-bit/2-bit ECC errors. The driver also reports 1-bit/2-bit ECC errors for the FIFOs based on ECC error interrupts.

v1: can: etas_es58x: Add check for alloc_can_err_skb

Add check for the return value of alloc_can_err_skb in order to avoid NULL pointer dereference.

v3: net: ipv6/addrconf: avoid integer underflow in ipv6_create_tempaddr

The existing code incorrectly cast a negative value (the result of a subtraction) to an unsigned value without checking. For example, if /proc/sys/net/ipv6/conf/*/temp_prefered_lft was set to 1, the preferred lifetime would jump to 4 billion seconds. On my machine and network the shortest lifetime that avoided underflow was 3 seconds.
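
The arithmetic behind the "4 billion seconds" figure can be reproduced with a short Python sketch. It only models 32-bit unsigned wraparound; `deduction` is a placeholder for whatever the kernel subtracts from the configured lifetime, and the clamping in `checked_lifetime` is one possible guard, not necessarily the one the patch uses.

```python
U32_MASK = 0xFFFFFFFF


def to_u32(value):
    """Model a C cast of a (possibly negative) result to a 32-bit unsigned."""
    return value & U32_MASK


def unchecked_lifetime(configured_lft, deduction):
    # Subtract first, cast afterwards: a negative result wraps around.
    return to_u32(configured_lft - deduction)


def checked_lifetime(configured_lft, deduction):
    """Clamp at zero instead of wrapping around."""
    return max(configured_lft - deduction, 0)


print(unchecked_lifetime(1, 2), checked_lifetime(1, 2))
```

With a configured lifetime of 1 and a deduction of 2 (an assumed value), the unchecked result is 4294967295, which matches the order of magnitude described in the cover letter.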

v1: Prevent potential write out of bounds

The function flow_rule_alloc in net/core/flow_offload.c [2] gets an unsigned int num_actions (line 10) and later traverses the actions in the rule (line 24) setting hw.stats to FLOW_ACTION_HW_STATS_DONT_CARE.

v1: netfilter: nf_tables: ignore -EOPNOTSUPP on flowtable device offload setup

On many embedded devices, it is common to configure flowtable offloading for a mix of different devices, some of which have hardware offload support and some of which don’t. The current code limits the ability of user space to properly set up such a configuration by only allowing adding devices with hardware offload support to an offload-enabled flowtable. Given that offload-enabled flowtables also imply fallback to pure software offloading, this limitation makes little sense. Fix it by not bailing out when the offload setup returns -EOPNOTSUPP.

v1: net: deal with integer overflows in kmalloc_reserve()

This allowed various crashes as reported by syzbot [1] and Kyle Zeng.

Problem is that if @size is bigger than 0x80000001, kmalloc_size_roundup(size) returns 2^32.

kmalloc_reserve() uses a 32bit variable (obj_size), so 2^32 is truncated to 0.

kmalloc(0) returns ZERO_SIZE_PTR which is not handled by skb allocations.
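
The chain of events described above can be checked with a few lines of Python. This is a sketch under an explicit assumption: large requests are modelled as being rounded up to the next power of two, which stands in for the allocator's real rounding; both function names are made up for this example.

```python
def roundup_pow_of_two(size):
    """Smallest power of two >= size (stand-in for the allocator's rounding)."""
    return 1 << (size - 1).bit_length()


def obj_size_32bit(size):
    """Store the rounded size in a 32-bit variable, as the text describes."""
    return roundup_pow_of_two(size) & 0xFFFFFFFF


big = 0x80000002
print(hex(roundup_pow_of_two(big)), obj_size_32bit(big))
```

In this model the rounded size is 2^32, and keeping only the low 32 bits leaves 0, which is how a huge request turns into a zero-byte allocation.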

The following trace can be triggered if a netdev->mtu is set close to 0x7fffffff.

We might in the future limit netdev->mtu to a more sensible limit (like KMALLOC_MAX_SIZE).

This patch is based on a syzbot report, and also a report and tentative fix from Kyle Zeng.

[1]
BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline]
BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554

CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Call trace:
dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106
print_report+0xe4/0x4b4 mm/kasan/report.c:398
kasan_report+0x150/0x1ac mm/kasan/report.c:495
kasan_check_range+0x264/0x2a4 mm/kasan/generi

v1: net-next: net: add sysctl to disable rfc4862 5.5.3e lifetime handling

This change adds a sysctl to opt-out of RFC4862 section 5.5.3e’s valid lifetime derivation mechanism.

RFC4862 section 5.5.3e prescribes that the valid lifetime in a Router Advertisement PIO shall be ignored if it is less than 2 hours, and that the lifetime of the corresponding address be reset to 2 hours. An in-progress 6man draft (see draft-ietf-6man-slaac-renum-07 section 4.2) is currently looking to remove this mechanism. While this draft has not been moving particularly quickly for other reasons, there is widespread consensus on section 4.2 which updates RFC4862 section 5.5.3e.
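
The rule in question, and what opting out of it means, can be sketched in Python as below. The three-way decision follows RFC 4862 section 5.5.3 (e), leaving out the exception for authenticated Router Advertisements; the function name and the `apply_rfc4862_5_5_3e` flag are hypothetical and do not reflect the sysctl name used by the patch.

```python
TWO_HOURS = 2 * 60 * 60  # seconds


def new_valid_lifetime(advertised, remaining, apply_rfc4862_5_5_3e=True):
    """Return the valid lifetime to store, or None to leave it unchanged."""
    if not apply_rfc4862_5_5_3e:
        # Opt-out: take the advertised lifetime as-is.
        return advertised
    if advertised > TWO_HOURS or advertised > remaining:
        return advertised
    if remaining <= TWO_HOURS:
        return None          # ignore the advertised valid lifetime
    return TWO_HOURS         # otherwise reset to two hours


print(new_valid_lifetime(600, 86400))
print(new_valid_lifetime(600, 86400, apply_rfc4862_5_5_3e=False))
```

With the rule applied, a 10-minute advertisement against a one-day remaining lifetime is raised to two hours; with the opt-out, the 10-minute value is used directly.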

v2: bpf: selftests/bpf: Include build flavors for install target

When using the “install” target or targets depending on it, e.g. “gen_tar”, the BPF machine flavors weren’t included.

v1: net: another round of data-race annotations

Series inspired by some syzbot reports, taking care of 4 socket fields that can be read locklessly.

v1: iproute2-next: devlink: implement dump selector for devlink objects show commands

First 5 patches are preparations for the last one.

Motivation:

For SFs, one devlink instance per SF is created. There might be thousands of these on a single host. When a user needs to know the port handle for a specific SF, they need to dump all devlink ports on the host, which does not scale well.

v2: nf: netfilter/osf: avoid OOB read

The opt_num field is controlled by user mode and is not currently validated inside the kernel. An attacker can take advantage of this to trigger an OOB read and potentially leak information.
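
The class of bug can be illustrated with a toy Python model. Everything here is hypothetical: memory is a flat byte string in which a small option array is followed by unrelated data, and `MAX_OPTS` is an assumed capacity, not the real limit in the kernel.

```python
MAX_OPTS = 4  # hypothetical capacity of the fixed-size option array

# Flat "memory": the option array followed by unrelated adjacent data.
memory = bytes([1, 2, 3, 4]) + b"SECRET"


def read_opts_unchecked(opt_num):
    # Trusts the caller-supplied count, so it can run past the array.
    return memory[:opt_num]


def read_opts_checked(opt_num):
    # Validate the count against the array capacity before using it.
    if not 0 <= opt_num <= MAX_OPTS:
        raise ValueError("opt_num out of range")
    return memory[:opt_num]


print(read_opts_unchecked(6))
```

The unchecked version returns two bytes of the adjacent data along with the options, while the checked version rejects the request.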

v2: net: igb: disable virtualization features on 82580

Disable virtualization features on 82580 just as on i210/i211. This prevents virtualization functions from being accidentally called on the 82580.

v2: net: dsa: hsr: Enable HSR HW offloading for KSZ9477

This patch series provides support for HSR HW offloading in KSZ9477 switch IC.

To test this feature:

    ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
    ifconfig lan1 up;ifconfig lan2 up
    ifconfig hsr0 192.168.0.1 up

v1: net: phy: micrel: Correct bit assignment for MICREL_KSZ8_P1_ERRATA flag

The previous assignment of the phy_device quirk for the MICREL_KSZ8_P1_ERRATA flag was incorrect, working only due to coincidental conditions. Specifically:

  • The flag MICREL_KSZ8_P1_ERRATA, intended for KSZ88xx switches, was mistakenly overlapping with the MICREL_PHY_FXEN and MICREL_PHY_50MHZ_CLK flags.
  • MICREL_PHY_FXEN is used by the KSZ8041 PHY, and its related code path wasn’t executed for KSZ88xx PHYs, and the other way around.
  • Additionally, the code path associated with the MICREL_PHY_50MHZ_CLK flag wasn’t executed for KSZ88xx either.

v3: skbuff: skb_segment, Call zero copy functions before using skbuff frags

Commit bf5c25d60861 (“skbuff: in skb_segment, call zerocopy functions once per nskb”) added the call to zero copy functions in skb_segment(). The change introduced a bug in skb_segment() because skb_orphan_frags() may possibly change the number of fragments or allocate new fragments altogether leaving nrfrags and frag to point to the old values. This can cause a panic with stacktrace like the one below.

v1: igb: disable virtualization features on 82580

Disable virtualization features on 82580 just as on i210/i211. This prevents virtualization functions from being accidentally called on the 82580.

v5: net: Avoid TCP resets when using ECMP for load-balancing between multiple servers.

All packets in the same flow (L3/L4 depending on multipath hash policy) should be directed to the same target, but after [0]/[1] we see stray packets directed towards other targets. This, for instance, causes RST to be sent on TCP connections.
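
The property being relied on, that one flow always maps to one target, can be sketched in Python. This is only an illustration: SHA-256 is used as a convenient deterministic hash and is not what the kernel's multipath code uses, and the function names and the example flow tuple are assumptions.

```python
import hashlib


def flow_hash(src_ip, dst_ip, proto, src_port, dst_port):
    """Deterministic hash of the flow's L3/L4 fields (SHA-256 as a stand-in)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def pick_target(flow, targets):
    # Every packet of the same flow hashes to the same index.
    return targets[flow_hash(*flow) % len(targets)]


flow = ("10.0.0.1", "10.0.0.2", 6, 40000, 443)
targets = ["server-a", "server-b", "server-c"]
print(pick_target(flow, targets) == pick_target(flow, targets))
```

If any packet of a TCP flow were hashed differently, it would reach a server that has no state for the connection, which is the situation in which that server answers with RST.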

v3: net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)

The KSZ9477 errata points out (in ‘Module 4’) the link up/down problems when EEE (Energy Efficient Ethernet) is enabled in the device to which the KSZ9477 tries to auto negotiate.

The suggested workaround is to clear advertisement of EEE for PHYs in this chip driver.

v1: bpf-next: bpf, sockmap: Rename sock_map_get_from_fd to sock_map_prog_attach

What function sock_map_get_from_fd does is to attach a bpf prog to a sock map, so rename it to sock_map_prog_attach to make it more readable.

v1: ptp: Demultiplexed timestamp channels

Add the possibility to demultiplex the timestamp channels for external timestamp event channels.

In some applications it can be necessary to have different consumers for different timestamp channels. For example, synchronizing to an external pps source with linuxptp ts2phc while timestamping external events with another application.

v4: Move Loongson1 MAC arch-code to the driver dir

In order to convert Loongson1 MAC platform devices to the devicetree nodes, Loongson1 MAC arch-code should be moved to the driver dir. Add dt-binding document and update MAINTAINERS file accordingly.

In other words, this patchset is a preparation for converting Loongson1 platform devices to devicetree.

v1: net-next: Add support for ICSSG on AM64x EVM

This series adds support for ICSSG driver on AM64x EVM.

First patch of the series adds compatible for AM64x EVM in icssg-prueth dt binding. Second patch adds support for AM64x compatible in the ICSSG driver.

This series depends on [1] which is posted as RFC.

[1] https://lore.kernel.org/all/20230830110847.1219515-1-danishanwar@ti.com/

Thanks and Regards, Md Danish Anwar

v1: net-next: Add Half Duplex support for ICSSG Driver

This series adds support for half duplex operation for ICSSG driver.

In order to support half-duplex operation at 10M and 100M link speeds, the PHY collision detection signal (COL) should be routed to ICSSG GPIO pin (PRGx_PRU0/1_GPI10) so that firmware can detect collision signal and apply the CSMA/CD algorithm applicable for half duplex operation. A DT property, “ti,half-duplex-capable” is introduced for this purpose in the first patch of the series. If a board has the PHY COL pin connected to PRGx_PRU1_GPIO10, this DT property can be added to the eth node of the ICSSG MII port to support half duplex operation at that port.

v1: net-next: Introduce switch mode and TAPRIO offload support for ICSSG driver

This series adds support for switch-mode and TAPRIO offload for ICSSG driver. This series also introduces helper APIs to configure firmware maintained FDB (Forwarding Database) and VLAN tables. These APIs are later used by ICSSG driver in switch mode.

v2: net: read sk->sk_family once in sk_mc_loop()

syzbot is playing with IPV6_ADDRFORM quite a lot these days, and managed to hit the WARN_ON_ONCE(1) in sk_mc_loop()

We have many more similar issues to fix.

v1: net: sctp: annotate data-races around sk->sk_wmem_queued

sk->sk_wmem_queued can be read locklessly from sctp_poll()

Use sk_wmem_queued_add() when the field is changed, and add READ_ONCE() annotations in sctp_writeable() and sctp_assocs_seq_show()

v1: -next: ptp: ptp_ines: Use list_for_each_entry() helper

Convert list_for_each() to list_for_each_entry() so that the `this` list_head pointer and the list_entry() call are no longer needed, which saves a few lines of code. No functional change.

v1: -next: isdn: capi, Use list_for_each_entry() helper

Convert list_for_each() to list_for_each_entry() so that the `l` list_head pointer and the list_entry() call are no longer needed, which saves a few lines of code. No functional change.

v1: net-next: ipv6: do not merge different type and protocol routes

Unlike IPv4, IPv6 automatically merges routes with the same metric into multipath routes. But routes of different types and protocols are also merged, which loses the user’s configuration info. e.g.

GIT PULL: sysctl changes for v6.6-rc1

The following changes since commit 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5:

Linux 6.5-rc1 (2023-07-09 13:53:13 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/sysctl-6.6-rc1

for you to fetch changes up to 53f3811dfd5e39507ee3aaea1be09aabce8f9c98:

sysctl: Use ctl_table_size as stopping criteria for list macro (2023-08-15 15:26:18 -0700)

v1: RJ45 to SFP auto-sensing and switching in mux-ed single-mac devices (XOR RJ/SFP)

Some folks in CC and I are working to properly port all the functions of a Zyxel ex5601-t0 to OpenWrt.

The manufacturer decided to use a single SerDes connected to both an SFP cage and an RJ45 PHY. A simple GPIO is used to control a 2-channel 2:1 MUX that switches the two SGMII pairs between the RJ45 and the SFP.

v1: net-next: net: stmmac: failure to probe without MAC interface specified

Alexander Stein reports that commit a014c35556b9 (“net: stmmac: clarify

GIT PULL: Networking for 6.6

The following changes since commit b5cc3833f13ace75e26e3f7b51cd7b6da5e9cf17:

Merge tag ‘net-6.5-rc8’ of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2023-08-24 08:23:13 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git tags/net-next-6.6

for you to fetch changes up to c873512ef3a39cc1a605b7a5ff2ad0a33d619aa8:

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2023-08-29 07:44:56 +0200)

Security Enhancements

v2: pstore: Base compression input buffer size on estimated compressed size

Commit 1756ddea6916 (“pstore: Remove worst-case compression size logic”) removed some clunky per-algorithm worst case size estimation routines on the basis that we can always store pstore records uncompressed, and these worst case estimations are about how much the size might inadvertently increase due to encapsulation overhead when the input cannot be compressed at all. So if compression results in a size increase, we just store the original data instead.
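
The fallback mentioned in the last sentence can be sketched as follows. This is a minimal Python illustration that uses zlib as a stand-in compressor; the function names are hypothetical, and the buffer sizing that the patch itself changes is not modelled here.

```python
import zlib
from typing import Tuple


def store_record(raw: bytes) -> Tuple[bool, bytes]:
    """Return (is_compressed, payload); keep the raw data if compression does not help."""
    packed = zlib.compress(raw)
    if len(packed) >= len(raw):
        # Compression grew (or failed to shrink) the record: store it as-is.
        return False, raw
    return True, packed


def load_record(is_compressed: bool, payload: bytes) -> bytes:
    """Undo store_record()."""
    return zlib.decompress(payload) if is_compressed else payload
```

Repetitive log text shrinks and is stored compressed, while input with no redundancy (for example each byte value occurring exactly once) comes back from the compressor larger than it went in and is therefore stored uncompressed.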

v1: trace/events/task.h: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
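
The difference between the two functions can be modelled in a few lines of Python. This is only a behavioural sketch over byte strings, not the kernel implementation: the *_model names are hypothetical, and a missing NUL terminator is turned into an exception to stand in for the out-of-bounds read that would happen in C.

```python
import errno


def _c_strlen(buf: bytes) -> int:
    """Length up to the first NUL; a C strlen() would run off the end if there is none."""
    end = buf.find(b"\0")
    if end < 0:
        raise ValueError("source is not NUL-terminated")
    return end


def strlcpy_model(src: bytes, size: int):
    """Model of strlcpy(): returns (dest, strlen(src)) and always scans all of src."""
    src_len = _c_strlen(src)  # full scan, regardless of the destination size
    if size == 0:
        return b"", src_len
    n = min(src_len, size - 1)
    return src[:n] + b"\0", src_len


def strscpy_model(src: bytes, size: int):
    """Model of strscpy(): returns (dest, chars copied) or (dest, -E2BIG) on truncation."""
    if size == 0:
        return b"", -errno.E2BIG
    window = src[:size]  # never looks at more than `size` source bytes
    end = window.find(b"\0")
    if end >= 0:
        return window[:end] + b"\0", end
    return window[:size - 1] + b"\0", -errno.E2BIG
```

With strlcpy() a caller detects truncation by comparing the return value against the destination size; with strscpy() a negative return value signals it directly, which is why call sites that ignore the return value can be converted mechanically.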

v3: scsi: target: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

Direct replacement is safe here since return value of -errno is used to check for truncation instead of sizeof(dest).

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

v4: kobject: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

Direct replacement is safe here since return value of -errno is used to check for truncation instead of sizeof(dest).

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

v1: ocfs2: Replace strlcpy with strscpy

The main patch this series is targeting is patch 2, which replaces the strlcpy() call with strscpy(). However, while tinkering with the code I noticed that module_param_call is marked obsolete and module_param_cb is preferred instead, so I have included patch 1, which makes that change.

A crucial point in patch 2 that I would like to bring to reviewers’ attention is that it changes behavior for the case where sizeof(@buffer) < DLMFS_CAPABILITIES. Currently this is silently ignored, but with this change it returns -errno.

v1: m68k/atari: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

Direct replacement is safe here since return value of -errno is used to check for truncation instead of sizeof(dest).

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

v1: init/version.c: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

Direct replacement is safe here since return value of -errno is used to check for truncation instead of sizeof(dest).

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

v1: Introduce new wrappers to copy user-arrays

David Airlie suggested that we could implement new wrappers around (v)memdup_user() for duplicating user arrays.

This small patch series first implements the two new wrapper functions memdup_array_user() and vmemdup_array_user(). They calculate the array-sizes safely, i.e., they return an error in case of an overflow.
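
The overflow check at the heart of such wrappers can be modelled as below. This is a Python sketch under stated assumptions: a 64-bit size_t, hypothetical function names, and -EOVERFLOW as the error value (the summary above only says that an error is returned).

```python
import errno

SIZE_MAX = 2**64 - 1  # assumption: size_t is 64 bits wide


def array_bytes(n: int, elem_size: int) -> int:
    """Return n * elem_size, or -EOVERFLOW if the product does not fit in size_t."""
    total = n * elem_size  # Python integers do not wrap, so this is exact
    if total > SIZE_MAX:
        return -errno.EOVERFLOW
    return total


def wrapped_bytes(n: int, elem_size: int) -> int:
    """What an unchecked size_t multiplication would produce: the product modulo 2**64."""
    return (n * elem_size) & SIZE_MAX
```

For n = 2**61 elements of 16 bytes the unchecked product wraps to 0, so an unchecked copy would allocate far less than the caller expects; the checked variant reports an error instead.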

Asynchronous I/O

v1: io_uring: Don’t set affinity on a dying sqpoll thread

syzbot <syzbot+c74fea926a78b8a91042@syzkaller.appspotmail.com> writes:

Hello,

syzbot found the following issue on:

HEAD commit: 626932085009 Add linux-next specific files for 20230825
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12a97797a80000
kernel config: https://syzkaller.appspot.com/x/.config?x=8a8c992a790e5073
dashboard link: https://syzkaller.appspot.com/bug?extid=c74fea926a78b8a91042
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40

Unfortunately, I don’t have any reproducer for this issue yet.

Rust For Linux

v4: rust: workqueue: add bindings for the workqueue

This patchset contains bindings for the kernel workqueue.

One of the primary goals behind the design used in this patch is that we must support embedding the work_struct as a field in user-provided types, because this allows you to submit things to the workqueue without having to allocate, making the submission infallible. If we didn’t have to support this, then the patch would be much simpler. One of the main things that make it complicated is that we must ensure that the function pointer in the work_struct is compatible with the struct it is contained within.

BPF

v10: bpf-next: selftests/bpf: Optimize kallsyms cache

The kallsyms cache needs optimizing, including the limit on the number of symbols. In addition, some test cases add new kernel symbols (such as testmods), so kallsyms must be refreshed (reloaded) afterwards.

v2: bpf: Enable IRQ after irq_work_raise() completes

The patchset aims to fix the problem that bpf_mem_alloc() may return NULL unexpectedly when multiple bpf_mem_alloc() are invoked concurrently under process context and there is still free memory available. The problem was found when doing stress test for qp-trie but the same problem also exists for bpf_obj_new() as demonstrated in patch #3.

v1: bpf-next: Implement cpuv4 support for s390x

This series adds the cpuv4 support to the s390x eBPF JIT. Patches 1-4 are preliminary bugfixes. Patches 5-9 implement the new instructions. Patches 10-11 enable the tests.

v1: bpf: Annotate bpf_long_memcpy with data_race

syzbot reported a data race splat between two processes trying to update the same BPF map value via syscall on different CPUs:

BUG: KCSAN: data-race in bpf_percpu_array_update / bpf_percpu_array_update

v9: bpf-next: selftests/bpf: trace_helpers.c: optimize kallsyms cache

Static ksyms often cause problems because the number of symbols exceeds the MAX_SYMS limit. Raising MAX_SYMS from 300000 to 400000 in commit e76a014334a6 (“selftests/bpf: Bump and validate MAX_SYMS”) mitigates the problem, but it is not a complete fix.

This commit uses dynamic memory allocation, which completely solves the problem caused by the limitation of the number of kallsyms.
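
The idea of an uncapped, growable symbol table can be sketched in Python, where lists grow on demand. This is an analogy only, not the selftest's C code: the function names and the sample addresses are illustrative assumptions.

```python
import bisect

# Sample input in the "<hex address> <type> <name> [module]" layout of /proc/kallsyms.
SAMPLE = """\
ffffffff80000000 T _start
ffffffff80001000 T handle_exception
ffffffff80002000 t do_work [testmod]
"""


def load_kallsyms(text: str):
    """Parse kallsyms-style text into a list of (address, name) sorted by address."""
    syms = []  # grows as needed, so there is no fixed symbol-count limit
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines
        syms.append((int(parts[0], 16), parts[2]))
    syms.sort()
    return syms


def ksym_search(syms, addr: int):
    """Return the name of the last symbol whose address is <= addr, or None."""
    addrs = [a for a, _ in syms]  # rebuilt per call to keep the sketch short
    i = bisect.bisect_right(addrs, addr) - 1
    return syms[i][1] if i >= 0 else None
```

Refreshing after a test module is loaded then amounts to calling load_kallsyms() again on the current contents of the symbol file.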

GIT PULL: Networking for 6.6

The following changes since commit b5cc3833f13ace75e26e3f7b51cd7b6da5e9cf17:

Merge tag ‘net-6.5-rc8’ of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2023-08-24 08:23:13 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git tags/net-next-6.6

for you to fetch changes up to c873512ef3a39cc1a605b7a5ff2ad0a33d619aa8:

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2023-08-29 07:44:56 +0200)

v1: bpf-next: selftests/bpf: Include build flavors for install target

When using the “install” target, or targets that depend on it such as “gen_tar”, the BPF machine flavors weren’t included.

A command like:
| make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- O=/workspace/kbuild \
|   HOSTCC=gcc FORMAT= SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 sgx" \
|   -C tools/testing/selftests gen_tar
would not include bpf/no_alu32, bpf/cpuv4, or bpf/bpf-gcc.

Include the BPF machine flavors for “install” make target.

v1: bpf-next: bpftool: Support dumping BTF object by name

Like maps and progs, add support to dump BTF objects by name ([0]).

[0] Closes: https://github.com/libbpf/bpftool/issues/56

v1: bpf-next: bpf: Add missed stats for kprobes

It’s still technically possible to create kprobe without perf link (using SET_BPF perf ioctl) in which case we don’t have a way to retrieve the kprobe’s ‘missed’ count. However both libbpf and cilium/ebpf libraries use perf link if it’s available, and for old kernels without perf link support we can use BPF program to retrieve the kprobe missed count.

Related Technology Updates

Qemu

v9: riscv: ‘max’ CPU, detect user choice in TCG

This new version contains suggestions made by Andrew Jones in v8.

Most notable change is the removal of the opensbi.py test in patch 11, which was replaced by a TuxBoot test. It’s more suitable to test the integrity of all the extensions enabled by the ‘max’ CPU.

v2: target/riscv: Use accelerated helper for AES64KS1I

Use the accelerated SubBytes/ShiftRows/AddRoundKey AES helper to implement the first half of the key schedule derivation. This does not actually involve shifting rows, so clone the same value into all four columns of the AES vector to counter that operation.

v1: target/riscv/pmp.c: respect mseccfg.RLB for pmpaddrX changes

When the rule-lock bypass (RLB) bit is set in the mseccfg CSR, the PMP configuration lock bits must not apply. While this behavior is implemented for the pmpcfgX CSRs, this bit is not respected for changes to the pmpaddrX CSRs. This patch ensures that pmpaddrX CSR writes work even on locked regions when the global rule-lock bypass is enabled.
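
The rule can be expressed as a small predicate. This is a simplified Python model rather than QEMU's code: the function name is hypothetical, and it deliberately ignores the additional case where a locked TOR entry also protects the preceding pmpaddr register.

```python
PMP_LOCK = 1 << 7     # L bit of a pmpcfg entry
MSECCFG_RLB = 1 << 2  # rule-lock bypass bit of mseccfg


def pmpaddr_write_allowed(pmpcfg_entry: int, mseccfg: int) -> bool:
    """Return True if a write to the matching pmpaddr register should take effect."""
    if mseccfg & MSECCFG_RLB:
        return True  # bypass enabled: lock bits do not apply
    return not (pmpcfg_entry & PMP_LOCK)
```

A locked entry therefore rejects address writes only while RLB is clear, which matches the behaviour already described for the pmpcfgX registers.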

v1: linux-user/riscv: Add new extensions to hwprobe

This patch adds the new extensions in Linux 6.5 to the hwprobe syscall.

It also fixes the RVC check to OR with the correct value. The previous variable contained 0, so the check happened to work.

Buildroot

[branch/next] boot/edk2: bump to version edk2-stable202308

commit: https://git.buildroot.net/buildroot/commit/?id=5c9f31041a2c36e3b97cd2f9577f68d88ee91174
branch: https://git.buildroot.net/buildroot/commit/?id=refs/heads/next

For change log since version edk2-stable202305, see:

  • https://github.com/tianocore/edk2/releases/tag/edk2-stable202308

The main motivation for this bump is the improved RISC-V QEMU Virt support (not yet supported in Buildroot).

U-Boot

v1: Universal Payload initial series

Universal Payload (UPL) is an upcoming Industry Standard for firmware components. UPL is designed to improve interoperability within the firmware industry, allowing mixing and matching of projects with less friction and fewer project-specific implementations. UPL is cross-platform, supporting ARM, x86 and RISC-V initially.

This series provides some initial support for this, for comment only.

v2: spl: bootstage: move bootstage_stash before jumping to image

For IH_OS_OPENSBI, IH_OS_LINUX and IH_OS_TEE images, there is no chance to stash the bootstage record because they do not return to SPL after jumping to the image. Hence, this patch separates the final-stage bootstage code into spl_bootstage_finish() and calls that function before jumping to the image.

v2: bootstage support for risc-v

This adds bootstage support for RISC-V. The timer_get_boot_us() function is required to record each boot stage with a microsecond timestamp.


