Tinylab (泰晓科技) -- Focused on Linux: trace things to their source, see the whole from small details!
Website: https://tinylab.org


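RISC-V Linux Kernel and Related Technology News, Issue 52

Written by 呀呀呀 on 2023/07/06

Date: 20230705
Editor: 晓依
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support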

v7: -next: support allocating crashkernel above 4G explicitly on riscv

On riscv, the current crash kernel allocation logic tries to allocate within the 32-bit addressable memory region by default and, if that fails, tries again without the 4G restriction.

v1: riscv: Start of DRAM should at least be aligned on PMD size for the direct mapping

So that we do not end up mapping the whole linear mapping using 4K pages, which is slow at boot time, and also very likely at runtime.

So make sure we align the start of DRAM on a PMD boundary.
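To see why the alignment matters, the sketch below counts how many 2 MiB (PMD) and 4 KiB mappings a greedy linear mapping would need. It is an illustrative model, not kernel code: the function name, the virtual base address, and the 4 KiB / 2 MiB sizes (typical for 64-bit RISC-V with 4K pages) are assumptions.

```python
PAGE_SIZE = 4 << 10   # 4 KiB base page (assumed)
PMD_SIZE = 2 << 20    # 2 MiB PMD mapping (assumed)
VA_BASE = 0xFFFFFFD800000000  # hypothetical, PMD-aligned start of the linear mapping


def count_linear_mappings(va, pa, size):
    """Return (pmd_mappings, page_mappings) needed to map `size` bytes.

    A PMD mapping is only usable when the virtual and the physical address
    are both PMD-aligned and at least PMD_SIZE bytes remain.
    """
    large = small = 0
    end = pa + size
    while pa < end:
        if va % PMD_SIZE == 0 and pa % PMD_SIZE == 0 and end - pa >= PMD_SIZE:
            step = PMD_SIZE
            large += 1
        else:
            step = PAGE_SIZE
            small += 1
        va += step
        pa += step
    return large, small


# DRAM starting on a PMD boundary versus DRAM starting 4 KiB past one.
print(count_linear_mappings(VA_BASE, 0x80000000, 64 << 20))
print(count_linear_mappings(VA_BASE, 0x80001000, 64 << 20))
```

In this model a 4 KiB offset in the physical start means the virtual and physical addresses are never PMD-aligned at the same time, so the whole region falls back to 4 KiB pages.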

v4: Add initialization of clock for StarFive JH7110 SoC

This patchset adds initial rudimentary support for the StarFive Quad SPI controller driver, which will be used on StarFive’s VisionFive 2 board. In 6.4, the QSPI_AHB and QSPI_APB clocks changed from default ON to default OFF, so the driver needs to enable them. A dts patch is also added to this series.

v1: Add SPI module for StarFive JH7110 SoC

This patchset adds initial rudimentary support for the StarFive SPI controller, which will be used on StarFive’s VisionFive 2 board. The first patch constrains minItems of clocks for JH7110 SPI, and patch 2 adds support for StarFive JH7110 SPI.

v6: Add PLL clocks driver and syscon for StarFive JH7110 SoC

This patch series adds a PLL clock driver and providers that write and read syscon registers on the StarFive JH7110 RISC-V SoC. It also adds documentation and nodes describing the StarFive System Controller (syscon) registers. The series is based on Linux 6.4.

v4: riscv: Allow userspace to directly access perf counters

riscv used to allow direct access to the cycle/time/instret counters, bypassing the perf framework. This patchset intends to allow the user to mmap any counter when it is accessed through perf. Because the existing behaviour must not break, it introduces a sysctl, perf_user_access, like arm64 does, which defaults to the legacy mode described above.

v3: RISC-V: Probe DT extension support using riscv,isa-extensions & riscv,isa-base

Based on my latest iteration of deprecating riscv,isa [1], here’s an implementation of the new properties for Linux. The first few patches, up to “RISC-V: split riscv_fill_hwcap() in 3”, are all prep work that further tames some of the extension related code, on top of my already applied series that cleans up the ISA string parser. Perhaps “RISC-V: shunt isa_ext_arr to cpufeature.c” is a bit gratuitous, but I figured a bit of coalescing of extension related data structures would be a good idea. Note that riscv,isa will still be used in the absence of the new properties. Palmer suggested adding a Kconfig option to turn off the fallback for DT, which I have gone and done. It’s locked behind the NONPORTABLE option for good reason.

v1: riscv: optimize ELF relocation function in riscv

This patch reduces the run time of the insmod command by modifying the ELF relocation function. In 5.10 and in the latest kernel, loading a riscv ELF driver that contains many symbol table items to be relocated takes the kernel a long time. For example, loading a 3+MB driver takes 180+s. The riscv code that handles R_RISCV_HI20 and R_RISCV_LO20 relocations in arch/riscv/kernel/module.c contains two nested loops. Changing the start index of the second for-loop saves significant time: loading the same 3+MB driver then takes just 2s.
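The effect of changing the inner loop's start index can be sketched with a toy model. This is not the code from arch/riscv/kernel/module.c: the relocation list, the "hi"/"lo" tags, and both function names are hypothetical, and the model only assumes that each "lo" entry has to find the "hi" entry carrying the same label.

```python
def match_naive(relocs):
    """For every 'lo' entry, scan from index 0 for the matching 'hi' entry."""
    pairs, steps = [], 0
    for i, (kind, label) in enumerate(relocs):
        if kind != "lo":
            continue
        for j, (kind2, label2) in enumerate(relocs):
            steps += 1
            if kind2 == "hi" and label2 == label:
                pairs.append((i, j))
                break
    return pairs, steps


def match_resumed(relocs):
    """Same search, but the inner loop starts where the last match was found."""
    pairs, steps = [], 0
    start, n = 0, len(relocs)
    for i, (kind, label) in enumerate(relocs):
        if kind != "lo":
            continue
        for k in range(n):            # wrap around so no entry is skipped
            j = (start + k) % n
            steps += 1
            kind2, label2 = relocs[j]
            if kind2 == "hi" and label2 == label:
                pairs.append((i, j))
                start = j             # the next search resumes here
                break
    return pairs, steps


# Interleaved hi/lo entries, as a compiler would typically emit them.
relocs = [(kind, n) for n in range(1000) for kind in ("hi", "lo")]
print(match_naive(relocs)[1], match_resumed(relocs)[1])
```

Both functions return the same pairs. When matching entries sit close together, the resumed search does a roughly constant amount of work per entry, while the naive search grows quadratically with the number of entries.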

v10: Add non-coherent DMA support for AX45MP

On the Andes AX45MP core, cache coherency is a specification option, so it may not be supported. In this case DMA will fail. To work around this issue, this patch series does the following:

1] Andes alternative ports are implemented as errata which check whether the IOCP is missing and only then apply the CMO errata. One vendor specific SBI EXT (ANDES_SBI_EXT_IOCP_SW_WORKAROUND) is implemented as part of the errata.

v5: dt-bindings: riscv: deprecate riscv,isa

When the RISC-V dt-bindings were accepted upstream in Linux, the base ISA etc had yet to be ratified. By the ratification of the base ISA, incompatible changes had snuck into the specifications - for example the Zicsr and Zifencei extensions were spun out of the base ISA.

v5: RISCV: Add KVM_GET_REG_LIST API

KVM_GET_REG_LIST will dump all register IDs that are available to KVM_GET/SET_ONE_REG, and it is very useful for identifying platform regression issues during VM migration.

GIT PULL: RISC-V Patches for the 6.5 Merge Window, Part 1

The following changes since commit ac9a78681b921877518763ba0e89202254349d1b:

Linux 6.4-rc1 (2023-05-07 13:34:35 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git tags/riscv-for-linus-6.5-mw1

v1: Add missing pins for RZ/Five SoC

This patch series intends to incorporate the absent port pins P19 to P28, which are exclusively available on the RZ/Five SoC.

v2: riscv: Add BUG_ON() for no cpu nodes in devicetree

When only the ACPI tables are passed to the kernel, the tiny devicetree created by the EFI Stub doesn’t provide cpu nodes.

v1: riscv: KCFI support

This series adds KCFI support for RISC-V. KCFI is a fine-grained forward-edge control-flow integrity scheme supported in Clang >=16, which ensures indirect calls in instrumented code can only branch to functions whose type matches the function pointer type, thus making code reuse attacks more difficult.

v1: RISC-V: Provide a more helpful error message on invalid ISA strings

This adds a warning for the cases where the ISA string isn’t valid. It’s still above the BUG_ON cut, but hopefully it’s at least a bit easier for users.

v4: riscv: Discard vector state on syscalls

The RISC-V vector specification states: Executing a system call causes all caller-saved vector registers (v0-v31, vl, vtype) and vstart to become unspecified.

The vector registers are set to all 1s, vill is set (invalid), and the vector status is set to Dirty.

v1: arch,fbdev: Move screen_info into arch/

The variables screen_info and edid_info provide information about the system’s screen, and possibly EDID data of the connected display. Both are defined and set by architecture code, but both are declared in non-arch header files. Dependencies are at best loosely tracked. To resolve this, move the global state screen_info and its companion edid_info into arch/. Only declare them on architectures that define them. List dependencies on the variables in the Kconfig files. Also clean up the callers.

v1: riscv: BUG_ON() for no cpu nodes in setup_smp

When booting with ACPI tables, the tiny devicetree created by the EFI Stub doesn’t provide cpu nodes.

In setup_smp(), of_parse_and_init_cpus() will bug on !found_boot_cpu if acpi_disabled. That’s unclear, so bug for no cpu nodes before of_parse_and_init_cpus().

v8: Add JH7110 USB PHY driver support

This patchset adds USB and PCIe PHY for the StarFive JH7110 SoC. The patch has been tested on the VisionFive 2 board.

v1: RISC-V: Document the ISA string parsing rules for ACPI

We’ve had a ton of issues around the ISA string parsing rules elsewhere in RISC-V, so let’s at least be clear about what the rules are so we can try and avoid more issues.

v1: tools/nolibc: shrink arch support

This patchset further improves porting of nolibc to new architectures, it is based on our previous v5 sysret helper series [1].

It mainly shrinks the assembly _start by moving most of its operations to a C version, the _start_c() function. It also removes the old sys_stat() support by using sys_statx() instead and therefore removes all of the arch specific sys_stat_struct definitions.

v2: RISC-V: archrandom support

This patchset adds support for the archrandom API to the RISC-V architecture.

The ratified crypto scalar extensions provide entropy bits via the seed CSR, as exposed by the Zkr extension.

v5: tools/nolibc: add a new syscall helper

It mainly applies the core part of suggestions from Thomas (Many thanks) and cleans up the multiple whitespaces issues reported by scripts/checkpatch.pl.

v1: riscv: sigcontext: Correct the comment of sigreturn

The real-time signals enlarged the sigset_t type, and most architectures have switched to using rt_sigreturn as the only way to return from a signal handler. riscv is one of them, and it has no sys_sigreturn. Only some old architectures preserve sys_sigreturn as part of the historical burden.

GIT PULL: RISC-V: make ARCH_THEAD preclude XIP_KERNEL

Randy reported build errors in linux-next where XIP_KERNEL was enabled. ARCH_THEAD requires alternatives to support the non-standard ISA extensions used by the THEAD cores, which are mutually exclusive with XIP kernels. Clone the dependency list from the Allwinner entry, since Allwinner’s D1 uses T-Head cores with the same non-standard extensions.

v1: Make SV39 the default address space

Make sv39 the default address space for mmap, as some applications currently depend on this assumption. The RISC-V specification enforces that bits outside of the virtual address range are not used, so restricting the size of the default address space in this way should be temporary. A hint address passed to mmap will cause the largest address space that fits entirely into the hint to be used. If the hint is less than or equal to 1<<38, a 39-bit address will be used. After an address space is completely full, the next smallest address space will be used.
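The hint rule can be sketched as a small selection function. This is a reading of the paragraph above, not the patch's implementation: the function name is hypothetical, and the sizes in USER_SPACE_TOP (the user half of each paging mode's virtual address range) are assumptions made for illustration. The fallback to a smaller address space once one is full is not modelled.

```python
# Assumed top of the user address space for each paging mode.
USER_SPACE_TOP = {39: 1 << 38, 48: 1 << 47, 57: 1 << 56}


def address_bits_for_hint(hint, supported=(39, 48, 57)):
    """Pick the widest supported mode whose user space fits entirely below the hint.

    sv39 stays the default, including when no hint (0) is passed.
    """
    chosen = 39
    for bits in sorted(supported):
        if USER_SPACE_TOP[bits] <= hint:
            chosen = bits
    return chosen


print(address_bits_for_hint(0))         # no hint
print(address_bits_for_hint(1 << 38))   # hint at the sv39 boundary
print(address_bits_for_hint(1 << 47))   # hint large enough for sv48
```

Under these assumptions, an application that wants more than the sv39 range has to ask for it explicitly by passing a sufficiently high hint address.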

v3: Add support for Allwinner PWM on D1/T113s/R329 SoCs

This series adds support for the PWM controller on new Allwinner SoCs, such as D1, T113s and R329. The implemented driver provides basic functionality for controlling PWM channels.

Process Scheduling

v3: sched/core: introduce sched_core_idle_cpu()

With the introduction of core scheduling, a new idle state is defined as force idle: the CPU runs the idle task while nr_running is greater than zero.

v1: sched/core: Use empty mask to reset cpumasks in sched_setaffinity()

Since commit 8f9ea86fdf99 (“sched: Always preserve the user requested cpumask”), user provided CPU affinity via sched_setaffinity(2) is preserved even if the task is being moved to a different cpuset. However, that affinity is also inherited by any subsequently created child processes, which may not want or be aware of that affinity.

v3: Sched/fair: Block nohz tick_stop when cfs bandwidth in use

CFS bandwidth limits and NOHZ full don’t play well together. Tasks can easily run well past their quotas before a remote tick does accounting. This leads to long, multi-period stalls before such tasks can run again. Currently, when presented with these conflicting requirements, the scheduler favors nohz_full and lets the tick be stopped. However, nohz tick stopping is already best-effort and a number of conditions can prevent it, whereas cfs runtime bandwidth is expected to be enforced.

Memory Management

v3: MDWE without inheritance

Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags current with a flag that prevents pages that were previously not executable from becoming executable. This tag always gets inherited by children tasks. (it’s in MMF_INIT_MASK)

v2: mm/slub: refactor freelist to use custom type

Currently the SLUB code represents encoded freelist entries as “void*”. That’s misleading: those things are encoded under CONFIG_SLAB_FREELIST_HARDENED, so they’re not actually dereferenceable.

v1: block: Make blkdev_get_by_*() return handle

This patch series implements the idea of blkdev_get_by_*() calls returning bdev_handle, which is then passed to blkdev_put() [1]. This makes the get and put calls for bdevs more obviously matching and allows us to propagate context from get to put without having to modify all the users (again!). In particular I need to propagate the used open flags to blkdev_put() to be able to count writeable opens and add support for blocking writes to mounted block devices. I’ll send that series separately.

v1: mm: memory-failure: add missing set_mce_nospec() for memory_failure()

If memory_failure() succeeds in hwpoisoning a page, set_mce_nospec() is expected to be called to prevent speculative access to the page by marking it not-present. Add the missing call to set_mce_nospec() in the async memory failure handling path.

v1: mm: page_alloc: avoid false page outside zone error info

If pfn is outside the zone boundaries in the first round, ret will be set to 1. But if pfn turns out to be inside the zone boundaries in the zone span seqretry path, ret is still set to 1, leading to a false “page outside zone” error.
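The stale-flag problem can be shown with a small model of the retry loop. This is an illustration, not the kernel function: both function names are hypothetical, and each tuple in span_snapshots stands for the zone span observed in one pass of the seqretry loop.

```python
def outside_zone_buggy(pfn, span_snapshots):
    """Flag is only ever set, never cleared, across retries."""
    ret = 0
    for start, end in span_snapshots:   # one iteration per retry pass
        if not (start <= pfn < end):
            ret = 1                      # stays 1 even if a later pass succeeds
    return ret


def outside_zone_fixed(pfn, span_snapshots):
    """Flag is recomputed on every pass, so only the last pass counts."""
    ret = 0
    for start, end in span_snapshots:
        ret = 0 if start <= pfn < end else 1
    return ret


# The zone span grows between the first pass and the retry.
snapshots = [(0, 100), (0, 200)]
print(outside_zone_buggy(150, snapshots), outside_zone_fixed(150, snapshots))
```

In the buggy variant the result of a pass that was later invalidated leaks into the final answer, which is the false error the patch description refers to.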

v3: Documentation: admin-guide: correct “it’s” to possessive “its”

Correct 2 uses of “it’s” to the possessive “its” as needed.

v2: variable-order, large folios for anonymous memory

This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.

[PATCH v10 rebased on v6.4 00/25] DEPT(Dependency Tracker)

From now on, I can work on LKML again! I’m wondering if DEPT has been helping kernel debugging well even though it is still only available as a set of patches.

v1: mm: make MEMFD_CREATE into a selectable config option

The memfd_create() syscall, enabled by CONFIG_MEMFD_CREATE, is useful on its own even when not required by CONFIG_TMPFS or CONFIG_HUGETLBFS.

Split it into its own proper bool option that can be enabled by users.

v2: Documentation: mm/memfd: vm.memfd_noexec

Add documentation for sysctl vm.memfd_noexec

Link: https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/

v2: mm/slub: disable slab merging in the default configuration

Make CONFIG_SLAB_MERGE_DEFAULT default to n unless CONFIG_SLUB_TINY is enabled. The benefits of slab merging are limited on systems that are not memory constrained: the memory overhead is low and evidence of its effect on cache hotness is hard to come by.

v25: crash: Kernel handling of CPU and memory hot un/plug

This series is dependent upon “refactor Kconfig to consolidate KEXEC and CRASH options”: https://lore.kernel.org/lkml/20230626161332.183214-1-eric.devolder@oracle.com/

Once the kdump service is loaded, if changes to CPUs or memory occur, either by hot un/plug or off/onlining, the crash elfcorehdr must also be updated.

v1: mm: Always downgrade mmap_lock if requested

Now that stack growth must always hold the mmap_lock for write, we can always downgrade the mmap_lock to read and safely unmap pages from the page table, even if we’re next to a stack.

v1: writeback: Account the number of pages written back

nr_to_write is a count of pages, so we need to decrease it by the number of pages in the folio we just wrote, not by 1. Most callers specify either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes() might end up writing 512x as many pages as it asked for.
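A toy model makes the 512x figure concrete. It is a sketch, not the writeback code: the function name and the count_pages switch are hypothetical, and folios are represented only by their size in pages.

```python
def pages_written(folio_sizes, nr_to_write, count_pages=True):
    """Write folios until the nr_to_write budget is used up.

    count_pages=False models the old accounting (budget drops by 1 per folio);
    count_pages=True models the fix (budget drops by the folio's page count).
    """
    written = 0
    for nr_pages in folio_sizes:
        if nr_to_write <= 0:
            break
        written += nr_pages
        nr_to_write -= nr_pages if count_pages else 1
    return written


# 2048 dirty folios of 512 pages each, with a budget of 1024 pages.
folios = [512] * 2048
print(pages_written(folios, 1024, count_pages=False))
print(pages_written(folios, 1024, count_pages=True))
```

With single-page folios both variants behave identically, which is why the miscount only shows up once large folios are written back.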

v24: crash: Kernel handling of CPU and memory hot un/plug

This series is dependent upon “refactor Kconfig to consolidate KEXEC and CRASH options”: https://lore.kernel.org/lkml/20230626161332.183214-1-eric.devolder@oracle.com/

Once the kdump service is loaded, if changes to CPUs or memory occur, either by hot un/plug or off/onlining, the crash elfcorehdr must also be updated.

v1: fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.

When running UnixBench/Shell Scripts, we observed high false sharing for accessing i_mmap against i_mmap_rwsem.

UnixBench/Shell Scripts are typical load/execute command test scenarios, in which i_mmap is accessed frequently to insert into and remove from the vma_interval_tree. Meanwhile, i_mmap_rwsem is frequently loaded. Unfortunately, they are in the same cacheline.

v2: mm/slub: Optimize slub memory usage

In the previous version [1], we were able to reduce slub memory wastage, but the total memory was also increasing, so to solve this problem the patch has been modified as follows:

1) If min_objects * object_size > PAGE_ALLOC_COSTLY_ORDER, then it will return with PAGE_ALLOC_COSTLY_ORDER.
2) Similarly, if min_objects * object_size < PAGE_SIZE, then it will return with slub_min_order.
3) Additionally, I changed slub_max_order to 2. There is no specific reason for using the value 2, but it provided the best results in terms of performance without any noticeable impact.

File Systems

v2: 0/6: block: Add config option to not allow writing to mounted devices

This is the second version of the patches to add a config option to not allow writing to mounted block devices. For motivation why this is interesting see patch 1/6. I’ve been testing the patches more extensively this time and I’ve found a couple of things that get broken by disallowing writes to mounted block devices:

1) Bind mounts get broken because get_tree_bdev() / mount_bdev() first try to claim the bdev before searching whether it is already mounted. Patch 6 reworks the mount code to avoid this problem.
2) btrfs mounting is likely having the same problem as 1). It should be fixable AFAICS but for now I’ve left it alone until we settle on the rest of the series.
3) “mount -o loop” gets broken because util-linux keeps the loop device open read-write when attempting to mount it. Hopefully fixable within util-linux.
4) resize2fs online resizing gets broken because it tries to open the block device read-write only to call the resizing ioctl. Trivial to fix within e2fsprogs.

v1: block: Make blkdev_get_by_*() return handle

This patch series implements the idea of blkdev_get_by_*() calls returning bdev_handle, which is then passed to blkdev_put() [1]. This makes the get and put calls for bdevs more obviously matching and allows us to propagate context from get to put without having to modify all the users (again!). In particular I need to propagate the used open flags to blkdev_put() to be able to count writeable opens and add support for blocking writes to mounted block devices. I’ll send that series separately.

v5: fanotify accounting for fs/splice.c

Previously: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u

In short:

  • most read/write APIs generate ACCESS/MODIFY for the read/written file(s)
  • except the [vm]splice/tee family (actually, since 6.4, splice itself /does/ generate events but only for the non-pipes being spliced from/to; this commit is Fixes:ed)
  • userspace that registers (i|fa)notify on pipes usually relies on it actually working (coreutils tail -f is the primo example)
  • it’s sub-optimal when someone with a magic syscall can fill up a pipe simultaneously ensuring it will never get serviced

[PATCH v10 rebased on v6.4 00/25] DEPT(Dependency Tracker)

From now on, I can work on LKML again! I’m wondering if DEPT has been helping kernel debugging well even though it is still only available as a set of patches.

GIT PULL: iomap: new code for 6.5

Please pull this branch with changes for iomap for 6.5-rc1.

As usual, I did a test-merge with the main upstream branch as of a few minutes ago, and didn’t see any conflicts. Please let me know if you encounter any problems.

v1: proc: proc_setattr for /proc/$PID/net

Just applied your patchset on v6.4, and then:

  • revert the 1st patch: ‘selftests/nolibc: drop test chmod_net’ manually

  • do the ‘run’ test of nolibc on arm/vexpress-a9

v3: fuse: add a new fuse init flag to relax restrictions in no cache mode

Patch 1 is a fix for private mmap in FOPEN_DIRECT_IO mode. It is included here because the later two patches depend on it. Patch 2 is the main dish. Patch 3 maintains the direct write logic for shared mmap in FOPEN_DIRECT_IO mode.

v1: fs: Optimize unixbench’s file copy test

The iomap_set_range_uptodate function checks whether the file is a private mapping and, if it is, needs to do something about it. UnixBench’s file copy tests mostly use shared mappings, so this check reduces the file copy scores. We therefore added the unlikely macro as an optimization, and the file copy score improves with this branch optimization.

v1: fanotify: disallow mount/sb marks on kernel internal pseudo fs

Hopefully, nobody is trying to abuse mount/sb marks for watching all anonymous pipes/inodes.

I cannot think of a good reason to allow this - it looks like an oversight that dated back to the original fanotify API.

GIT PULL: sysctl changes for v6.5-rc1

The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6:

Linux 6.4-rc2 (2023-05-14 12:51:40 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/v6.5-rc1-sysctl-next

for you to fetch changes up to 2f2665c13af4895b26761107c2f637c2f112d8e9:

sysctl: replace child with an enumeration (2023-06-18 02:32:54 -0700)

Networking Devices

v3: net: nfp: clean mc addresses in application firmware when closing port

When moving devices from one namespace to another, mc addresses are cleaned in software but not removed from the application firmware. Thus the mc addresses remain and cause a resource leak.

v2: iwl-net: ice: prevent call trace during reload

Calling ethtool during reload can lead to call trace, because VSI isn’t configured for some time, but netdev is alive.

To fix it, add the rtnl lock for VSI deconfig and config. Set ::num_q_vectors to 0 after freeing and add a check for ::tx/rx_rings in ring related ethtool ops.

Add proper unroll of filters in ice_start_eth().

v1: net: octeontx2-af: Promisc enable/disable through mbox

In legacy silicon, promisc mode is only modified through CGX mbox messages. In CN10KB silicon, it is modified from both the CGX mbox and NIX. This breaks legacy application behaviour. Fix this by removing the call from NIX.

v2: vduse: add support for networking devices

This small series enables the virtio-net device type in VDUSE. With it, basic operations have been tested, both with virtio-vdpa and vhost-vdpa, using the DPDK Vhost library series that adds VDUSE support with the split rings layout (merged in DPDK v23.07-rc1).

v1: net: ftmac100: add multicast filtering possibility

If netdev_mc_count() is not zero and not IFF_ALLMULTI, filter incoming multicast packets. The chip has a Multicast Address Hash Table for allowed multicast addresses, so we fill it.

v1: net: sched: Undo tcf_bind_filter in case of errors in set callbacks

Five different classifiers (fw, bpf, u32, matchall, and flower) call tcf_bind_filter in their callbacks, but did not undo it by calling tcf_unbind_filter if there was an error after binding.

This patch set fixes all this by calling tcf_unbind_filter in such cases.

v5: bpf-next: Add SO_REUSEPORT support for TC bpf_sk_assign

We want to replace iptables TPROXY with a BPF program at TC ingress. To make this work in all cases we need to assign a SO_REUSEPORT socket to an skb, which is currently prohibited. This series adds support for such sockets to bpf_sk_assign.

v1: resubmit: net: fec: Refactor: rename adapter to fep

Rename the local struct fec_enet_private *adapter to fep in fec_ptp_gettime() to match the rest of the driver.

v1: igb: Add support for AF_XDP zero-copy

Disclaimer: My first patches to Intel drivers, implemented AF_XDP zero-copy feature which seemed to be missing for igb. Not sure if it was a conscious choice to not spend time implementing this for older devices, nevertheless I send them to the list for review.

v1: net: phy: at803x: support qca8081 1G version chip

This patch series adds support for the qca8081 1G version chip, which can be identified by bit0 of the register mmd7.0x901d.

v1: net-next: bnxt_en: use dev_consume_skb_any() in bnxt_tx_int

Replace dev_kfree_skb_any() with dev_consume_skb_any() in bnxt_tx_int() to clear the unnecessary noise of “kfree_skb” event.

v2: net: dsa: SERDES support for mv88e632x family

This patch series brings SERDES support for the mv88e632x family.

v1: can: j1939: prevent deadlock by changing j1939_socks_lock to rwlock

The following 3 locks would race against each other, causing the deadlock situation in the Syzbot bug report:

  • j1939_socks_lock
  • active_session_list_lock
  • sk_session_queue_lock

A reasonable fix is to change j1939_socks_lock to an rwlock, since in the rare situations where a write lock is required for the linked list that j1939_socks_lock is protecting, the code does not attempt to acquire any more locks. This would break the circular lock dependency, where, for example, the current thread already locks j1939_socks_lock and attempts to acquire sk_session_queue_lock, and at the same time, another thread attempts to acquire j1939_socks_lock while holding sk_session_queue_lock.

v2: bpf-next: XDP metadata via kfuncs for ice

This series introduces XDP hints via kfuncs [0] to the ice driver.

Series brings the following existing hints to the ice driver:

  • HW timestamp
  • RX hash with type

Series also introduces new hints and adds their implementation to ice and veth:

  • VLAN tag with protocol
  • Checksum level

v1: net: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe.
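The difference in how much of the source each function touches, and in what it returns, can be modelled outside the kernel. The sketch below is a behavioural model of the semantics described above, not the kernel implementations: both function names are hypothetical, each returns a (destination, return value, source bytes read) tuple, and the source is assumed to be NUL-terminated.

```python
E2BIG = 7  # errno value; strscpy() returns -E2BIG when the copy is truncated


def strlcpy_model(src: bytes, size: int):
    """Model of strlcpy(): always walks the whole source to compute its length."""
    length = src.index(b"\0")          # strlen(src), regardless of `size`
    bytes_read = length + 1
    if size == 0:
        return b"", length, bytes_read
    n = min(length, size - 1)
    return src[:n] + b"\0", length, bytes_read


def strscpy_model(src: bytes, size: int):
    """Model of strscpy(): never reads more than `size` bytes of the source."""
    if size == 0:
        return b"", -E2BIG, 0
    for i in range(size):
        if src[i] == 0:                # found the terminator within the limit
            return src[: i + 1], i, i + 1
    # Ran out of room: truncate, still NUL-terminate, and report the error.
    return src[: size - 1] + b"\0", -E2BIG, size


long_src = b"A" * 100 + b"\0"
print(strlcpy_model(long_src, 8))
print(strscpy_model(long_src, 8))
```

Both models produce the same truncated destination, but the strlcpy model reads all 101 source bytes while the strscpy model stops at the destination size and signals truncation through its return value.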

v1: bpf, net: Allow setting SO_TIMESTAMPING* from BPF

BPF applications, e.g., a TCP congestion control, might benefit from precise packet timestamps. These timestamps are already available in __sk_buff and bpf_sock_ops, but could not be requested: A BPF program was not allowed to set SO_TIMESTAMPING* on a socket. This change enables BPF programs to actively request the generation of timestamps from a stream socket.

v1: bpf-next: xsk: honor SO_BINDTODEVICE on bind

Initial creation of an AF_XDP socket requires CAP_NET_RAW capability. A privileged process might create the socket and pass it to a non-privileged process for later use. However, that process will be able to bind the socket to any network interface. Even though it will not be able to receive any traffic without modification of the BPF map, the situation is not ideal.

v3: octeontx2-pf: Add additional check for MCAM rules

Due to hardware limitation, MCAM drop rule with ether_type == 802.1Q and vlan_id == 0 is not supported. Hence rejecting such rules.

v1: netconsole: Append kernel version to message

Create a new netconsole Kconfig option that prepends the kernel version in the netconsole message. This is useful to map kernel messages to kernel version in a simple way, i.e., without checking somewhere which kernel version the host that sent the message is using.

v2: nf: netfilter: conntrack: Avoid nf_ct_helper_hash uses after free

If nf_conntrack_init_start() fails (for example due to a register_nf_conntrack_bpf() failure), the nf_conntrack_helper_fini() clean-up path frees the nf_ct_helper_hash map.

v1: vdpa: reject F_ENABLE_AFTER_DRIVER_OK if backend does not support it

With the current code it is accepted as long as userland send it.

Although userland should not set a feature flag that has not been offered to it with VHOST_GET_BACKEND_FEATURES, the current code will not complain for it.

v1: Add a driver for the Marvell 88Q2110 PHY

Add support for 1000BASE-T1 to the phy_device driver and add a first

[net PATCH] octeontx2-af: Install TC filter rules in hardware based on priority

As of today, hardware does not support installing tc filter rules based on priority. This patch fixes the issue and installs the hardware rules based on priority. The final hardware rules will not depend on the rule installation order; they will be strictly priority based, the same as in software.
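The idea of making the final rule order independent of installation order can be sketched with a table that keeps its entries sorted. This is an illustration, not the octeontx2-af driver logic: the RuleTable class and its methods are hypothetical, and the sketch assumes that a lower prio value means the rule is matched first.

```python
import bisect


class RuleTable:
    """Keeps rules sorted by priority, whatever order they are installed in."""

    def __init__(self):
        self._rules = []  # list of (prio, name), always kept sorted

    def install(self, prio, name):
        # Insert at the position dictated by prio, not at the end.
        bisect.insort(self._rules, (prio, name))

    def hardware_order(self):
        return [name for _, name in self._rules]


first = RuleTable()
for prio, name in [(30, "c"), (10, "a"), (20, "b")]:
    first.install(prio, name)

second = RuleTable()
for prio, name in [(10, "a"), (20, "b"), (30, "c")]:
    second.install(prio, name)

print(first.hardware_order(), second.hardware_order())
```

Both tables end up with the same order, which mirrors the stated goal that the hardware rules follow priority alone.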

v1: net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX

The attribute TCA_PEDIT_PARMS_EX is not included in pedit_policy, and a malicious user could fake a TCA_PEDIT_PARMS_EX whose length is smaller than the intended sizeof(struct tc_pedit). Hence, the dereference in tcf_pedit_init() could access dirty heap data.

[net PATCH V2] octeontx2-pf: Add additional check for MCAM rules.

Due to hardware limitation, MCAM drop rule with ether_type == 802.1Q and vlan_id == 0 is not supported. Hence rejecting such rules.

v1: I3C MCTP net driver

This series adds an I3C transport for the kernel’s MCTP network protocol. MCTP is a communication protocol between system components (BMCs, drives, NICs etc), with higher level protocols such as NVMe-MI or PLDM built on top of it (in userspace). It runs over various transports such as I2C, PCIe, or I3C.

v4: wifi:mac80211: Replace the ternary conditional operator with conditional-statements

Replacing ternary conditional operators with conditional statements ensures proper expression of meaning while making it easier for the compiler to generate code.

v5: vsock: MSG_ZEROCOPY flag support

The difference from the copy path is not significant. During packet allocation, a non-linear skb is created and filled with pinned user pages. There are also some updates for the vhost and guest parts of the transport: in both cases I’ve added handling of non-linear skbs for the virtio part. vhost copies data from such an skb to the guest’s rx virtio buffers. In the guest, the virtio transport fills the tx virtio queue with pages from the skb.

v5: vsock: enable setting SO_ZEROCOPY

For AF_VSOCK, zerocopy tx mode depends on transport, so this option must be set in AF_VSOCK implementation where transport is accessible (if transport is not set during setting SO_ZEROCOPY: for example socket is not connected, then SO_ZEROCOPY will be enabled, but once transport will be assigned, support of this type of transmission will be checked).

v1: selftests/net: Add xt_policy config for xfrm_policy test

This is because IPsec “policy” match support is not available to the kernel.

This patch adds CONFIG_NETFILTER_XT_MATCH_POLICY as a module to the selftests/net/config file, so that make kselftest-merge can take this into consideration.

v1: Add virtio_rtc module and related changes

This patch series adds the virtio_rtc module, and related bugfixes and small interface extensions. The virtio_rtc module implements a driver compatible with the proposed Virtio RTC device specification [1]. The Virtio RTC (Real Time Clock) device provides information about current time. The device can provide different clocks, e.g. for the UTC or TAI time standards, or for physical time elapsed since some past epoch. The driver can read the clocks with simple or more accurate methods.

Security Enhancements

v1: pstore: Replace crypto API compression with zlib calls

The pstore layer implements support for compression of kernel log output, using a variety of compression algorithms provided by the [deprecated] crypto API ‘comp’ interface.

This appears to have been somebody’s pet project rather than a solution to a real problem: the original deflate compression is reasonably fast, compresses well and is comparatively small in terms of code footprint, so the flexibility that the crypto API integration provides does little more than complicate the code for no reason.

v1: Revert “fortify: Allow KUnit test to build without FORTIFY”

The standard for KUnit is to not build tests at all when required functionality is missing, rather than doing test “skip”. Restore this for the fortify tests, so that architectures without CONFIG_ARCH_HAS_FORTIFY_SOURCE do not emit unsolvable warnings.

v1: wifi: mt76: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v1: kobject: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v1: kyber, blk-wbt: Replace strlcpy with strscpy

This patch series replaces strlcpy in the kyber and blk-wbt tracing subsystems wherever trivial replacement is possible, i.e. wherever the return value from strlcpy is unused. The patches themselves are independent of each other and are applied to different subsystems. They are included as a series for ease of review.

v1: perf: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe.

v1: next: media: venus: Use struct_size_t() helper in pkt_session_unset_buffers()

Prefer struct_size_t() over struct_size() when no pointer instance of the structure type is present.

v2: pid: Replace struct pid 1-element array with flex-array

For pid namespaces, struct pid uses a dynamically sized array member, “numbers”. This was implemented using the ancient 1-element fake flexible array, which has been deprecated for decades. Replace it with a C99 flexible array, refactor the array size calculations to use struct_size(), and address elements via indexes. Note that the static initializer (which defines a single element) works as-is, and requires no special handling.
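The pattern can be illustrated with a small user-space sketch. Everything here is an assumption for illustration: `struct pid_model`, `struct upid_model` and `pid_model_size()` are hypothetical stand-ins, not the kernel's definitions, and the overflow check relies on the GCC/Clang builtins `__builtin_mul_overflow()` and `__builtin_add_overflow()` to mimic the saturating behaviour of struct_size().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-namespace entry. */
struct upid_model {
    int nr;
};

/* Hypothetical stand-in for struct pid using a C99 flexible array member. */
struct pid_model {
    unsigned int level;
    struct upid_model numbers[];   /* previously a fake numbers[1] array */
};

/* Size of the header plus count trailing elements; saturates to SIZE_MAX
 * on overflow instead of silently wrapping. */
static size_t pid_model_size(size_t count)
{
    size_t bytes;

    if (__builtin_mul_overflow(count, sizeof(struct upid_model), &bytes) ||
        __builtin_add_overflow(bytes, sizeof(struct pid_model), &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Allocate an object with level + 1 entries, one per namespace level. */
static struct pid_model *pid_model_alloc(unsigned int level)
{
    size_t bytes = pid_model_size((size_t)level + 1);
    struct pid_model *pid;

    if (bytes == SIZE_MAX)
        return NULL;
    pid = calloc(1, bytes);
    if (pid)
        pid->level = level;
    return pid;
}

/* Self-check: elements are addressed by index within the allocation. */
static void demo(void)
{
    struct pid_model *pid = pid_model_alloc(2);   /* three entries */

    assert(pid != NULL);
    pid->numbers[2].nr = 42;
    assert(pid->numbers[2].nr == 42);
    free(pid);
}
```

With a flexible array the header size no longer includes a phantom first element, so the "count minus one" arithmetic that the 1-element idiom required disappears.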

[GIT PULL v2] flexible-array transformations for 6.5-rc1

The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6:

Linux 6.4-rc2 (2023-05-14 12:51:40 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux.git tags/flex-array-transformations-6.5-rc1

v3: Add documentation for sysctl vm.memfd_noexec

Add documentation for sysctl vm.memfd_noexec

Thanks to Dominique Martinet <asmadeus@codewreck.org>, who reported this. See [1] for context.

[1] https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5a8MaQY3Jw@mail.gmail.com/T/

v1: usb: ch9: Replace bmSublinkSpeedAttr 1-element array with flexible array

Since commit df8fc4e934c1 (“kbuild: Enable -fstrict-flex-arrays=3”), UBSAN_BOUNDS no longer pretends 1-element arrays are unbounded. Walking bmSublinkSpeedAttr will trigger a warning, so make it a proper flexible array. Add a union to keep the struct size identical for userspace in case anything was depending on the old size.

v1: next: scsi: aacraid: Replace one-element array with flexible-array member in struct user_sgmap

Replace one-element array with flexible-array member in struct user_sgmap and refactor the rest of the code, accordingly.

Issue found with the help of Coccinelle and audited and fixed, manually.

This results in no differences in binary output.

v1: next: scsi: aacraid: Use struct_size() helper in code related to struct sgmapraw

Prefer struct_size() over open-coded versions.

v1: next: scsi: aacraid: Use struct_size() helper in aac_get_safw_ciss_luns()

Prefer struct_size() over open-coded versions.

This results in no differences in binary output.

v1: next: scsi: aacraid: Replace one-element arrays with flexible-array members

This series aims to replace one-element arrays with flexible-array members in multiple structures in drivers/scsi/aacraid/aacraid.h.

This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy().

These issues were found with the help of Coccinelle and audited and fixed, manually.

GIT PULL: flexible-array transformations for 6.5-rc1

The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6:

Linux 6.4-rc2 (2023-05-14 12:51:40 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux.git tags/flex-array-transformations-6.5-rc1

v1: pstore: ramoops: support pmsg size larger than kmalloc limitation

The current pmsg implementation uses kmalloc for the pmsg record buffer, which has a maximum size limit based on page size. Currently, even if enough space is allocated with pmsg-size, pmsg still fails if the file size is larger than what kmalloc allows.

v4: Randomized slab caches for kmalloc()

When exploiting memory vulnerabilities, “heap spraying” is a common technique targeting those related to dynamic memory allocation (i.e. the “heap”), and it plays an important role in a successful exploitation. Basically, it overwrites the memory area of a vulnerable object by triggering allocations in other subsystems or modules, thereby obtaining a reference to the targeted memory location. It is usable on various types of vulnerability, including use after free (UAF) and heap out-of-bounds writes.

Asynchronous IO

v3: Add a sysctl to disable io_uring system-wide

Over the last few years we’ve seen many critical vulnerabilities in io_uring[1] which could be exploited by an unprivileged process to gain control over the kernel. This patch introduces a new sysctl which disables the creation of new io_uring instances system-wide.

v1: io_uring: Add {} to maintain consistency in code format

In io_issue_sqe, the if (ret == IOU_OK) branch uses {}, so to maintain code format consistency, it is better to add {} in the else branch.

v4: io_uring: Add io_uring command support for sockets

Enable io_uring commands on network sockets. Create two new SOCKET_URING_OP commands that will operate on sockets.

In order to call ioctl on sockets, use the file_operations->io_uring_cmd callbacks, and map it to a uring socket function, which handles the SOCKET_URING_OP accordingly, and calls socket ioctls.

Rust For Linux

v1: rust: types: make Opaque be !Unpin

Adds a PhantomPinned field to Opaque<T>. This removes the last Rust guarantee: the assumption that the type T can be freely moved. This is not the case for many types from the C side (e.g. if they contain a struct list_head). This change removes the need to add a PhantomPinned field manually to Rust structs that contain C structs which must not be moved.

v1: rust: macros: add paste! proc macro

This macro provides a flexible way to concatenate identifiers, and it allows the resulting identifier to be used to declare new items, which concat_idents! does not allow. It also allows identifiers to be transformed before they are concatenated.

v1: rust: build: Define MODULE macro iif the CONFIG_MODULES is enabled

LoongArch does not currently support modules when built with clang. Building modules triggers a preprocessor error, which is caused by:

#if defined(MODULE) && defined(CONFIG_AS_HAS_EXPLICIT_RELOCS)
# if __has_attribute(model)
#  define PER_CPU_ATTRIBUTES __attribute__((model("extreme")))
# else
#  error compiler support for the model attribute is necessary when a recent assembler is used
# endif
#endif

v2: rust: alloc: Add realloc and alloc_zeroed to the GlobalAlloc impl

While there are default impls for these methods, using the respective C APIs is faster. Currently neither the existing nor these new GlobalAlloc method implementations are actually called. Instead, the __rust_* functions defined below the GlobalAlloc impl are used. With rustc 1.71 these functions will be gone and all allocation calls will go through the GlobalAlloc implementation.

v1: Rust device mapper abstractions

This is a version of device mapper abstractions. Based on these, we also implement a linear target as a PoC. Any suggestions are welcomed, thanks!

BPF

v3: um: vector: Replace undo_user_init in old code with out_free_netdev

Thanks for your response and suggestions, I made some mistakes. This is a resubmitted patch. I got some errors with my local repository, so I lost the commit SHA-1 ID.

v9: bpf-next: selftests/bpf: Add benchmark for bpf memory allocator

The benchmark could be used to compare the performance of hash map operations and the memory usage between different flavors of bpf memory allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also could be used to check the performance improvement or the memory saving provided by optimization.

v1: bpf, net: Allow setting SO_TIMESTAMPING* from BPF

BPF applications, e.g., a TCP congestion control, might benefit from precise packet timestamps. These timestamps are already available in __sk_buff and bpf_sock_ops, but could not be requested: A BPF program was not allowed to set SO_TIMESTAMPING* on a socket. This change enables BPF programs to actively request the generation of timestamps from a stream socket.

v1: x86/BPF: Add new BPF helper call bpf_rdtsc

This patch series adds a new x86 arch-specific BPF helper, bpf_rdtsc(), which can be used for reading the hardware time stamp counter (TSC). Currently the same counter is directly accessible from userspace (using the RDTSC instruction) and from kernel space (using various rdtsc_*() APIs); however, eBPF lacks this support.

v1: fs: Add kfuncs to handle idmapped mounts

Since the introduction of idmapped mounts, file handling has become somewhat more complicated. If the inode has been found through an idmapped mount the idmap of the vfsmount must be used to get proper i_uid / i_gid. This is important, for example, to correctly take into account idmapped files when caching, LSM or for an audit.

[v3 PATCH bpf-next 0/6] bpf: add percpu stats for bpf_map

This series adds a mechanism for maps to populate per-cpu counters on insertions/deletions. The sum of these counters can be accessed by a new kfunc from map iterator and tracing programs.
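The counting scheme described above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `MODEL_NR_CPUS`, `struct map_stats_model` and the `map_stats_*()` helpers are hypothetical names, the CPU index is passed in explicitly, and real per-CPU storage and the kfunc interface are not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4   /* hypothetical fixed CPU count for this sketch */

/* One signed counter per CPU; each CPU only ever touches its own slot. */
struct map_stats_model {
    int64_t delta[MODEL_NR_CPUS];
};

/* Record an insertion performed on the given CPU. */
static void map_stats_inc(struct map_stats_model *s, unsigned int cpu)
{
    s->delta[cpu % MODEL_NR_CPUS]++;
}

/* Record a deletion performed on the given CPU. */
static void map_stats_dec(struct map_stats_model *s, unsigned int cpu)
{
    s->delta[cpu % MODEL_NR_CPUS]--;
}

/* The element count is the sum over all CPUs. */
static int64_t map_stats_sum(const struct map_stats_model *s)
{
    int64_t sum = 0;

    for (unsigned int i = 0; i < MODEL_NR_CPUS; i++)
        sum += s->delta[i];
    return sum;
}

/* Self-check: a slot may go negative, only the sum is meaningful. */
static void demo(void)
{
    struct map_stats_model s = { { 0 } };

    map_stats_inc(&s, 0);
    map_stats_dec(&s, 1);   /* element deleted on a different CPU */
    assert(s.delta[1] == -1);
    assert(map_stats_sum(&s) == 0);
}
```

An element inserted on one CPU can be deleted on another, so an individual slot says nothing on its own; readers have to sum all slots, which is what the summary above describes.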

v5: RFC: introduce page_pool_alloc() API

In [1] & [2] & [3], there are use cases for veth and virtio_net to use frag support in page pool to reduce memory usage, and they may request different frag sizes depending on the head/tail room space for xdp_frame/shinfo and mtu/packet size. When the requested frag size is large enough that a single page cannot be split into more than one frag, using frag support only incurs a performance penalty because of the extra frag count handling.

v1: bpf-next: bpf: Support new insns from cpu v4

This patch set added kernel support for insns proposed in [1] except BPF_ST which already has full kernel support. Beside the above proposed insns, LLVM will generate BPF_ST insn as well under -mcpu=v4 ([2]).

The patchset implements interpreter and jit support for these new insns. It has minimum verifier support in order to pass bpf selftests. More work will be required to cover verification and other aspects (e.g. blinding, etc.).

[PATCH RESEND v3 bpf-next 00/14] BPF token

This patch set introduces a new BPF object, the BPF token, which allows delegating a subset of BPF functionality from a privileged system-wide daemon (e.g., systemd or any other container manager) to a trusted unprivileged application. Trust is the key here. This functionality is not about allowing unconditional unprivileged BPF usage. Establishing trust, though, is completely up to the discretion of the respective privileged application that creates a BPF token, as different production setups can and do achieve it through a combination of different means (signing, LSM, code reviews, etc.), and it is undesirable and infeasible for the kernel to enforce any particular way of validating the trustworthiness of a particular process.

v1: fprobe: Ensure running fprobe_exit_handler() finished before calling rethook_free()

Ensure running fprobe_exit_handler() has finished before calling rethook_free() in the unregister_fprobe() so that caller can free the fprobe right after unregister_fprobe().

unregister_fprobe() ensured that all running fprobe_entry/exit_handler() have finished by calling unregister_ftrace_function() which synchronizes RCU. But commit 5f81018753df (“fprobe: Release rethook after the ftrace_ops is unregistered”) changed to call rethook_free() after unregister_ftrace_function(). So call rethook_stop() to make rethook disabled before unregister_ftrace_function() and ensure it again.

v8: bpf-next: bpf, x86: allow function arguments up to 12 for TRACING

Therefore, let’s enhance it by increasing the function arguments count allowed in arch_prepare_bpf_trampoline(), for now, only x86_64.

In the 1st patch, we save/restore regs with BPF_DW size to make the code in save_regs()/restore_regs() simpler.

In the 2nd patch, we make arch_prepare_bpf_trampoline() support copying function arguments on the stack for the x86 arch. Therefore, the maximum number of arguments can be up to MAX_BPF_FUNC_ARGS for FENTRY, FEXIT and MODIFY_RETURN. Meanwhile, we clear potential garbage values when copying the on-stack arguments.

v1: bpf-next: Support defragmenting IPv(46) packets in BPF

In the context of a middlebox, fragmented packets are tricky to handle. The full 5-tuple of a packet is often only available in the first fragment which makes enforcing consistent policy difficult. So stateful tracking is the only sane option. RFC 8900 [0] calls this out as well in section 6.3:

Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.

v1: Interest in additional endianness documentation

Thank you to everyone in the community for building/working on such a great tool! I am helping build a userspace implementation of eBPF and following Dave’s standardization process closely.

Related Technology Updates

U-Boot

u-boot compilation failure for Sifive unmatched board

This is Satish, compiling u-boot code based on the reference page: https://github.com/carlosedp/riscv-bringup/blob/master/unmatched/Readme.md#install-toolchain-to-build-kernel

The u-boot build fails at the following commit: d637294e264adfeb29f390dfc393106fd4d41b17 (HEAD, tag: v2022.01)

Pull request: u-boot-rockchip-20230629

Please pull the fixes for the rockchip platform:

  • rockchip inno phy fix;
  • pinctrl driver in SPL abort in specific case;
  • fix IO port voltage for rock5b-rk3588 board;

CI: https://source.denx.de/u-boot/custodians/u-boot-rockchip/-/pipelines/16732

Trying to boot JH7110 RISCV-V CPU from MMC

I am trying to use upstream u-boot + opensbi to boot my visionfive2 SBC from an external SD card.

v1: riscv: sifive: fu70: downclock CPU clock for stability

When building the package rustc for AOSC OS on HiFive Unmatched, random SIGSEGV prevents the package from getting correctly built. Downclocking the CPU PLL clock seems to allow rustc to be built, although taking much more time.


